Build a MetaReader subclass

class poriscope.utils.MetaReader.MetaReader(settings: dict | None = None)

MetaReader is the base class for all things related to reading raw nanopore timeseries datafiles. It handles mapping groups of files that belong in the same experiment, separating them by channel in the case of multichannel experimental operations, and time-ordering files within a channel when many data files are written as part of a single experiment. Subsequently, it provides a common API through which to interact with that data, effectively standardizing data reading operations regardless of the source. Given the number of different file formats commonly in use in the nanopore field, this plugin will likely always have the largest number of subclasses.

What you get by inheriting from MetaReader

Regardless of the details of how your data is actually stored, MetaReader will provide a common and intuitive API with which to interact with it, stitching together all the files in your dataset to work seamlessly together as a single dataset. Datasets are broken down by channel ID and time, allowing slicing into data that might be spread across multiple files as though it were a single contiguous memory structure. Data can be retrieved either on an ad-hoc basis, or as a continuous generator that allows you to iterate through on demand. Metadata like sampling rate, the length of data available in each channel, etc., can be retrieved through the API directly.

Required Public API Methods

MetaReader.get_empty_settings(globally_available_plugins: Dict[str, List[str]] | None = None, standalone=False) Dict[str, Dict[str, Any]]
Parameters:
  • globally_available_plugins (Optional[Mapping[str, List[str]]]) – a dict containing all data plugins that exist to date, keyed by metaclass

  • standalone – True if this is outside the context of a GUI, False otherwise, Default False.

Returns:

the dict that must be filled in to initialize the reader

Return type:

Dict[str, Dict[str, Any]]

Purpose: Provide a list of settings details to users to assist in instantiating an instance of your MetaReader subclass.

Get a dict populated with the keys needed to initialize the reader if they are not set yet. This dict must have the following structure, but Min, Max, and Options can be skipped or explicitly set to None if they are not used. Value and Type are required. All values provided must be consistent with Type.

settings = {'Parameter 1': {'Type': <int, float, str, bool>,
                                  'Value': <value> or None,
                                  'Options': [<option_1>, <option_2>, ... ] or None,
                                  'Min': <min_value> or None,
                                  'Max': <max_value> or None,
                                  'Units': <unit str> or None
                              },
                              ...
                              }

Several parameter keywords are reserved: these are

‘Input File’, ‘Output File’, ‘Folder’, and all MetaClass names

These must have Type str and will cause the GUI to generate appropriate widgets to allow selection of these elements when used.

This function must return a dictionary of the settings required to initialize the reader, in the specified format. Values in this dictionary can be accessed downstream through the self.settings class variable. This structure is a nested dictionary that supplies both values and a variety of information about those values, used by poriscope to perform sanity and consistency checking at instantiation.
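As a rough illustration of what "consistent with Type" means, the sketch below fills in a settings dict and checks it. The "Gain" key, the file name, and the check_settings helper are invented for this example; the helper is not poriscope's own validation code, only an approximation of the kind of check described above.

```python
from typing import Any, Dict

# Hypothetical settings for an imaginary reader; "Gain" is an invented key.
settings: Dict[str, Dict[str, Any]] = {
    "Input File": {"Type": str, "Value": "run_01.abf", "Options": ["ABF2 Files (*.abf)"]},
    "Gain": {"Type": float, "Value": 2.5, "Min": 0.0, "Max": 10.0, "Units": "pA"},
}


def check_settings(settings: Dict[str, Dict[str, Any]]) -> None:
    """Illustrative check only; this is NOT poriscope's own validation code."""
    for name, entry in settings.items():
        if "Type" not in entry or "Value" not in entry:
            raise KeyError(f"{name}: 'Type' and 'Value' are required")
        value = entry["Value"]
        if value is None:
            continue  # unset values are allowed in an empty settings dict
        if not isinstance(value, entry["Type"]):
            raise TypeError(f"{name}: {value!r} is not of type {entry['Type'].__name__}")
        low, high = entry.get("Min"), entry.get("Max")
        if low is not None and value < low:
            raise ValueError(f"{name}: {value} is below Min {low}")
        if high is not None and value > high:
            raise ValueError(f"{name}: {value} is above Max {high}")


check_settings(settings)  # passes silently for the dict above
```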

While this function is technically not abstract in MetaReader, which already provides an implementation that ensures the key required by MetaReader is available to users, in most cases you will need to override it to add any other settings required by your subclass. The implementation in MetaReader provides a single key, Input File, without specifying file options. If you need additional settings, or if you want to specify the file types that will show up in related file dialogs and be accepted as inputs (recommended), override this function to specify options, but you MUST call settings = super().get_empty_settings(globally_available_plugins, standalone) first, which will ensure the existence of the “Input File” key. For example:

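def get_empty_settings(self, globally_available_plugins=None, standalone=False):
    # Always start from the base-class dict so that "Input File" exists
    settings = super().get_empty_settings(globally_available_plugins, standalone)
    settings["Input File"]["Options"] = ["ABF2 Files (*.abf)"]
    settings["Your Key"] = {"Type": float,
                            "Value": None,
                            "Min": 0.0,
                            "Units": "pA"
                            }
    return settings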

This ensures that the returned dict contains the custom key specified above as well as the Input File key required by readers. You can learn more about formatting input file option strings in the PySide6 module documentation. In the case of multiple file types, supply the relevant strings as a comma-separated list in the “Options” key; poriscope will handle formatting it for PySide6.

abstractmethod MetaReader.close_resources(channel: int | None = None) None
Parameters:

channel (Optional[int]) – channel ID

Purpose: Clean up any open file handles or memory.

This is called during app exit or plugin deletion to ensure proper cleanup of resources that could otherwise leak. If channel is not None, handle only that channel, else close all of them. If no such operation is needed, it suffices to pass. Note that readers that operate based on memmaps need not explicitly close those memmaps, as they will be handled by the garbage collector, but it does no harm to do so. Any open file handles should be closed explicitly if not closed at the end of read operations.
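The channel handling described above could look roughly like the following. HandleOwner is a stand-in class so that the sketch runs without poriscope installed, and the _handles attribute is an assumption of this example, not part of the MetaReader API; in a real plugin the method would live on your MetaReader subclass.

```python
import io
from typing import Dict, IO, Optional


class HandleOwner:
    """Stand-in for a reader that keeps one open handle per channel.

    The attribute name `_handles` is an assumption made for this sketch.
    """

    def __init__(self, handles: Dict[int, IO]) -> None:
        self._handles = handles

    def close_resources(self, channel: Optional[int] = None) -> None:
        # One channel if an ID is given, otherwise every channel.
        targets = list(self._handles) if channel is None else [channel]
        for ch in targets:
            handle = self._handles.pop(ch, None)
            if handle is not None and not handle.closed:
                handle.close()


first, second = io.BytesIO(b"a"), io.BytesIO(b"b")
owner = HandleOwner({0: first, 1: second})
owner.close_resources(channel=0)  # closes channel 0 only
owner.close_resources()           # closes whatever is left
```

Popping each handle from the dict makes repeated calls harmless, which is convenient if cleanup may run both on plugin deletion and on app exit.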

abstractmethod MetaReader.reset_channel(channel: int | None = None) None

Perform any actions necessary to gracefully close resources before app exit.

Parameters:

channel (Optional[int]) – channel ID

Purpose: Reset the state of a specific channel for a new operation or run.

This is called any time an operation on a channel needs to be cleaned up or reset for a new run. If channel is not None, handle only that channel, else reset all of them. If reading through a channel does not create any persistent state changes in your plugin, the body of this function can simply be pass.
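For a reader that does keep per-channel state, a reset might be sketched as below. CursorOwner, its _cursor attribute, and the advance helper are hypothetical names used only to give the method something to reset.

```python
from typing import Dict, Iterable, Optional


class CursorOwner:
    """Stand-in for a reader that tracks a read position per channel.

    `_cursor` and `advance` are hypothetical and exist only for this sketch.
    """

    def __init__(self, channels: Iterable[int]) -> None:
        self._cursor: Dict[int, int] = {ch: 0 for ch in channels}

    def advance(self, channel: int, samples: int) -> None:
        self._cursor[channel] += samples

    def reset_channel(self, channel: Optional[int] = None) -> None:
        # Rewind one channel, or all of them when no ID is given.
        targets = list(self._cursor) if channel is None else [channel]
        for ch in targets:
            self._cursor[ch] = 0
```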

Required Private Methods

abstractmethod MetaReader._init() None

Purpose: Perform generic class construction operations.

All data plugins have this function and must provide an implementation. This is called immediately at the start of class creation and is used to do whatever is required to set up your reader. Note that no app settings are available when this is called, so this function should be used only for generic class construction operations. Most readers implement it as a bare pass.

abstractmethod MetaReader._set_file_extension() str
Returns:

the file extension

Return type:

str

Purpose: Set the file extension for the file type this reader plugin handles.

This is a simple function that allows you to set the file extension (including the leading dot) of the file type that this reader plugin will read. It is used by downstream functions while mapping your data to assist in identifying files. It should be a single line:

return ".ext"

If you need to refer to this value again, you can access it via the class variable self.file_extension.

abstractmethod MetaReader._set_raw_dtype(configs: List[dict]) dtype

Set the data type for the raw data in files of this type

Parameters:

configs (List[dict]) – List of configuration dictionaries corresponding to data files.

Returns:

the dtype of the raw data in your data files

Return type:

np.dtype

Purpose: Inform Poriscope what NumPy datatype to expect for raw data on disk.

This function is used to tell Poriscope what datatype to expect on disk for downstream use by _map_data(). You should return a NumPy dtype object. For example, if you are using a 16-bit ADC code, you might return np.uint16. This is also a single-line function:

return np.uint16

If you need to refer to this value again, you can access it via the class variable self.dtype. For more details on NumPy dtypes, refer to the NumPy documentation on dtypes.
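When the dtype depends on the file contents rather than being fixed, it can be derived from the configs argument. The sketch below is written as a free function so it runs on its own, and it assumes hypothetical 'byteorder' and 'bits' keys that your own _get_configs() would have to provide.

```python
from typing import List

import numpy as np


def set_raw_dtype(configs: List[dict]) -> np.dtype:
    """Sketch: choose the on-disk dtype from per-file configs.

    The 'byteorder' and 'bits' keys are assumptions of this example.
    """
    orders = {cfg.get("byteorder", "<") for cfg in configs}
    widths = {cfg.get("bits", 16) for cfg in configs}
    if len(orders) != 1 or len(widths) != 1:
        raise ValueError("all files in a dataset must share one raw dtype")
    # e.g. "<u2" is a little-endian unsigned 16-bit integer
    return np.dtype(f"{orders.pop()}u{widths.pop() // 8}")
```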

abstractmethod MetaReader._get_file_pattern(file_name: str) str
Parameters:

file_name (os.PathLike) – File name to get the base pattern for.

Returns:

Base pattern for matching other files.

Return type:

str

Purpose: Extract a glob pattern from an input filename to match all dataset files.

When you instantiate a reader plugin, you provide a single filename as input. However, in some cases, a dataset might comprise many files. This function requires you to extract a pattern from the given filename that can be used by glob to match all files belonging to your dataset.

If your dataset consists of only a single file, you can simply return the original filename.

Example:

Consider a scenario where your dataset files follow a pattern with channel numbers and serial numbers, such as:

  • experiment_1_channel_01_001.log

  • experiment_1_channel_01_002.log

  • experiment_1_channel_02_001.log

  • experiment_1_channel_02_002.log

In this case, you could return a glob pattern like:

experiment_1_channel_??_???.log

This pattern assumes the channel stamp will always be two digits and the serial number always three. If the lengths of these varying parts are uncertain, a more general pattern using wildcards would be:

experiment_1_channel_*_*.log

Poriscope will use this file pattern to search the folder of the input file for other files that match the pattern. It will not search outside of that folder.

For more information on glob patterns, refer to the glob module documentation.
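One way to build the wildcard pattern from the example above is a regular-expression substitution. This sketch is a standalone function and assumes the hypothetical naming scheme '<prefix>_channel_<digits>_<digits>.<ext>'; names that do not match are returned unchanged, which covers the single-file case.

```python
import fnmatch
import re


def get_file_pattern(file_name: str) -> str:
    """Sketch: turn one file name into a glob pattern for its whole dataset.

    Assumes the naming scheme '<prefix>_channel_<digits>_<digits>.<ext>'.
    """
    # Replace the channel and serial digits with wildcards, keep the extension.
    return re.sub(r"_channel_\d+_\d+(\.\w+)$", r"_channel_*_*\1", file_name)


pattern = get_file_pattern("experiment_1_channel_01_001.log")
same_dataset = fnmatch.fnmatch("experiment_1_channel_02_002.log", pattern)
```

fnmatch applies the same wildcard rules as glob, so it is a convenient way to test a pattern without touching the file system.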

abstractmethod MetaReader._get_configs(datafiles: List[PathLike]) List[dict]
Parameters:

datafiles (List[os.PathLike]) – List of data files for which to load configurations.

Returns:

List of configuration dictionaries.

Return type:

List[dict]

Purpose: Extract configuration metadata from dataset files.

Given a list of filenames corresponding to the data files, construct a list of dictionaries containing any required configurations for use downstream. Your config dictionaries must have at a minimum the key ‘samplerate’ in them, and the list of configs must correspond one-to-one to the provided list of data files. All files in a dataset must have the same samplerate. Your reader will use these configs to map the data on disk, so you could include information like endianness, raw data type, details of any columns within the data, etc. Aside from the required samplerate key, this can be anything.
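As an illustration, suppose each data file starts with an 8-byte header holding the sampling rate as a little-endian float64. That file layout is invented for this sketch, as is the 'header_bytes' key; only 'samplerate' is required by the text above.

```python
import os
import struct
import tempfile
from typing import List

HEADER = struct.Struct("<d")  # hypothetical header: one little-endian float64


def get_configs(datafiles: List[str]) -> List[dict]:
    """Sketch: one config dict per data file, in the same order."""
    configs = []
    for path in datafiles:
        with open(path, "rb") as handle:
            (samplerate,) = HEADER.unpack(handle.read(HEADER.size))
        configs.append({"samplerate": samplerate, "header_bytes": HEADER.size})
    if len({cfg["samplerate"] for cfg in configs}) > 1:
        raise ValueError("all files in a dataset must share one samplerate")
    return configs


# Write two tiny files in the invented format and read their configs back.
with tempfile.TemporaryDirectory() as folder:
    paths = []
    for index in range(2):
        path = os.path.join(folder, f"demo_{index}.bin")
        with open(path, "wb") as handle:
            handle.write(HEADER.pack(250000.0))
        paths.append(path)
    demo_configs = get_configs(paths)
```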

abstractmethod MetaReader._get_file_time_stamps(file_names: List[PathLike], configs: List[dict]) List[str | int | float | datetime | date | datetime64]
Parameters:
  • file_names (List[os.PathLike]) – List of file names to get time stamps for.

  • configs (List[dict]) – List of configuration dictionaries corresponding to data files.

Returns:

List of serialization keys for timestamps in almost any format.

Return type:

List[Union[str, int, float, datetime.datetime, datetime.date, np.datetime64]]

Purpose: Extract time stamps for sorting files chronologically within a channel.

Given a list of all the files in the experiment and the list of config dictionaries you defined above, extract a corresponding list of timestamps. These timestamps will be used to time-order the mapped data within each channel. The list must have the same length as both input lists and must be of a type that can be sorted into the desired time-ordering using the builtin sort() method.
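If the time ordering is encoded in the file name, as in the serial-numbered example files shown earlier, the stamps can be parsed directly from the names. The sketch assumes hypothetical names ending in '_<serial>.<ext>' and does not use configs.

```python
import re
from typing import List


def get_file_time_stamps(file_names: List[str], configs: List[dict]) -> List[int]:
    """Sketch: use the trailing serial number as a sortable time key.

    Assumes names ending in '_<serial>.<ext>'; configs is unused here.
    """
    stamps = []
    for name in file_names:
        match = re.search(r"_(\d+)\.\w+$", name)
        if match is None:
            raise ValueError(f"no serial number found in {name!r}")
        stamps.append(int(match.group(1)))
    return stamps
```

Converting to int matters when the serial numbers are not zero-padded: as strings, "10" sorts before "9".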

abstractmethod MetaReader._get_file_channel_stamps(file_names: List[PathLike], configs: List[dict]) List[int]
Parameters:
  • file_names (List[os.PathLike]) – List of file names to get channel stamps for.

  • configs (List[dict]) – List of configuration dictionaries corresponding to data files.

Returns:

List of serialization keys for channels

Return type:

List[int]

Purpose: Extract channel identifiers for grouping files by channel.

Given a list of all the files in the experiment and the list of config dictionaries you defined above, extract a corresponding list of channel identifiers as integers. These channel indices will be used to group the mapped data by channel. The list must have the same length as both input lists and must be a list of integers.
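Channel identifiers can be extracted in the same spirit. This sketch again assumes the hypothetical '_channel_<digits>_' naming scheme from the earlier example; a format that stores the channel in a header would read it from configs instead.

```python
import re
from typing import List


def get_file_channel_stamps(file_names: List[str], configs: List[dict]) -> List[int]:
    """Sketch: read the channel index out of each file name.

    Assumes names containing '_channel_<digits>_'; configs is unused here.
    """
    channels = []
    for name in file_names:
        match = re.search(r"_channel_(\d+)_", name)
        if match is None:
            raise ValueError(f"no channel number found in {name!r}")
        channels.append(int(match.group(1)))
    return channels
```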

abstractmethod MetaReader._map_data(datafiles: List[PathLike], configs: List[dict]) List[ndarray[tuple[int, ...], dtype[Any]]]
Parameters:
  • datafiles (List[os.PathLike]) – List of data files to map.

  • configs (List[dict]) – List of configuration dictionaries corresponding to data files.

Returns:

List of memmaps or numpy arrays mapped from data files.

Return type:

List[numpy.ndarray]

Purpose: Map the provided data files into an accessible format, preferably memory-mapped views.

Using all the information provided in the implementations so far, in this function, you are asked to map the list of files provided in datafiles, according to information given in configs. You can assume that the lists are of equal length and that the config dict at a given index corresponds to the data file at the same index. You must return a list of views into those files. We strongly encourage the use of memmap where possible, in which case you may return a list of such memmaps with length equal to the input list of filenames.

Warning

This function expects that the elements of the returned list can be indexed and sliced into like NumPy arrays, hence the suggestion to use memmaps, which avoid the need to actually load raw data into RAM before it is needed. In cases where memmap is not an option, you must still return a NumPy array for each file, which may involve significant memory consumption. If this is impractical, it is possible to override this function to return, for example, a list of file handles instead, with the caveat that this will in turn require that you completely override load_data() as well to properly handle your file access method manually.
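A minimal memmap-based mapping could look like the sketch below, written as a standalone function. The 'dtype' and 'header_bytes' config keys are assumptions of this example; in a real plugin they would come from whatever your _get_configs() stores.

```python
import os
import tempfile
from typing import List

import numpy as np


def map_data(datafiles: List[str], configs: List[dict]) -> List[np.ndarray]:
    """Sketch: one read-only memmap per file, skipping an optional header.

    The 'dtype' and 'header_bytes' config keys are assumptions.
    """
    return [
        np.memmap(path, dtype=cfg["dtype"], mode="r", offset=cfg.get("header_bytes", 0))
        for path, cfg in zip(datafiles, configs)
    ]


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "demo.bin")
    with open(path, "wb") as handle:
        handle.write(b"\x00" * 8)  # pretend 8-byte header
        handle.write(np.arange(5, dtype="<i2").tobytes())
    maps = map_data([path], [{"dtype": "<i2", "header_bytes": 8}])
    middle = maps[0][1:4].tolist()  # slicing reads only what is requested
    del maps  # release the mapping before the folder is removed
```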

abstractmethod MetaReader._convert_data(data: ndarray[tuple[int, ...], dtype[int16]], config: dict, raw_data: bool = False) ndarray[tuple[int, ...], dtype[float64]]
Parameters:
  • data (numpy.ndarray) – Data to convert.

  • config (dict) – Configuration dictionary for data conversion.

  • raw_data (bool) – Decide whether to rescale data or return raw ADC codes

Returns:

Converted data, and scale and offset if and only if raw_data is True

Return type:

Union[Tuple[np.ndarray, float, float], np.ndarray]

Purpose: Convert raw data from disk format to a usable numerical format.

Given a numpy array of raw data extracted from one of the memmap instances you defined in the previous function along with its associated config dict, provide a means to turn this raw data into a numpy array of numpy.float64 double precision floats. For this purpose, if convenient, you can use the _scale_data() function, which will apply bitmasks, multiply data by a scaling factor, and add an offset, like so:

def _scale_data(self, data: npt.NDArray[Any], copy:Optional[bool]=True, bitmask:Optional[np.uint64]=None, dtype:Optional[str]=None, scale:Optional[float]=None, offset:Optional[float]=None, raw_data:Optional[bool]=False) -> npt.NDArray[Any]:
    if bitmask == 0:
        bitmask = None
    if not raw_data:
        if (copy):
            data = np.copy(data)
        if (bitmask is not None):
            data = np.bitwise_and(data.astype(type(bitmask)), bitmask)
        if (dtype is not None):
            data = data.astype(dtype)
        if (scale is not None):
            data *= scale
        if (offset is not None):
            data += offset
        return data
    else:
        if not dtype:
            raise ValueError('Specify dtype to retrieve raw data')
    return data

If raw_data is True, your function must also return a scale and offset factor, like so:

if raw_data:
    return data, scale, offset
else:
    return data
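Putting the two return shapes together, a conversion routine might be sketched as follows. It is a standalone function that does the arithmetic directly instead of calling _scale_data(), and the 'scale' and 'offset' config keys are assumptions of this example.

```python
from typing import Tuple, Union

import numpy as np


def convert_data(
    data: np.ndarray, config: dict, raw_data: bool = False
) -> Union[np.ndarray, Tuple[np.ndarray, float, float]]:
    """Sketch: rescale ADC codes, or hand back codes plus the factors.

    The 'scale' and 'offset' config keys are assumptions.
    """
    scale = float(config["scale"])
    offset = float(config["offset"])
    if raw_data:
        # Caller wants untouched ADC codes and will rescale later.
        return data, scale, offset
    # astype makes a float64 copy, so the mapped file is never modified.
    return data.astype(np.float64) * scale + offset


codes = np.array([0, 1, 2], dtype=np.int16)
current = convert_data(codes, {"scale": 0.5, "offset": -1.0})
```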

abstractmethod MetaReader._validate_settings(settings: dict) None

Validate that the settings dict contains the correct information for use by the subclass.

Parameters:

settings (dict) – Settings dict to validate.

Raises:

ValueError – If the settings dict does not contain the correct information.
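What counts as "correct information" is up to the subclass. As one hedged possibility, a reader for a hypothetical '.log' format might check that an input file was supplied and has the expected extension:

```python
def validate_settings(settings: dict) -> None:
    """Sketch: reject settings without a usable 'Input File' entry.

    The '.log' extension is a hypothetical choice for this example.
    """
    entry = settings.get("Input File")
    if entry is None or not entry.get("Value"):
        raise ValueError("settings must provide an 'Input File' value")
    if not str(entry["Value"]).lower().endswith(".log"):
        raise ValueError("input file must have the '.log' extension")
```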

Optional Method Overrides

Methods in this section have an implementation in either BaseDataPlugin or MetaReader, but they can be overridden if necessary to tweak the behavior of your plugin.

MetaReader.force_serial_channel_operations() bool
Returns:

True if only one channel can run at a time, False otherwise

Return type:

bool

Purpose: Indicate whether operations on different channels must be serialized (not run in parallel).

By default this simply returns False, meaning that it is acceptable and thread-safe to run operations on different channels in different threads on this plugin. If such operation is not thread-safe, override this function to return True. Most readers are thread-safe, since reading from a file on disk usually is, and therefore need no override.

MetaReader._finalize_initialization() None

Purpose: Apply application-specific settings to the plugin, if needed.

If additional initialization operations are required beyond the defaults provided in BaseDataPlugin or MetaReader that must occur after settings have been applied to the reader instance, you can override this function to add those operations, subject to the caveat below.

Warning

This function implements core functionality required for broader plugin integration into Poriscope. If you do need to override it, you MUST call super()._finalize_initialization() before any additional code that you add, and take care to understand the implementation of both apply_settings() and _finalize_initialization() before doing so to ensure that you are not conflicting with those functions.
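The required call order can be illustrated with a stub. StubBase below stands in for MetaReader so the sketch runs without poriscope installed, and the steps list exists only to make the ordering visible; neither is part of the real API.

```python
class StubBase:
    """Stand-in for MetaReader; records which initialization steps ran."""

    def __init__(self) -> None:
        self.steps = []

    def _finalize_initialization(self) -> None:
        self.steps.append("base")


class MyReader(StubBase):
    def _finalize_initialization(self) -> None:
        # The base implementation must run before any additional code.
        super()._finalize_initialization()
        self.steps.append("subclass")


reader = MyReader()
reader._finalize_initialization()
```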