The fundamental components: Datasets, Algorithms, and Pipelines#

For a typical data analysis we want to apply an algorithm to a dataset. This usually requires you to write: (a) some code to load and format your data, (b) your actual algorithm, and (c) some sort of “gluing code” that brings both sides together. To ensure reusability, it is a good idea to keep (c) explicitly separate from (a) and (b); that is, you don’t want your data loading to be specific to your algorithm, or your algorithm interface to be specific to your dataset.

Algorithms#

To ensure this separation, the interface of an algorithm should only require the input data it really needs, and all inputs should use data structures that are as simple as possible. For example, this means that an algorithm should only receive the data of a single recording as input, not a data structure containing multiple recordings. Looping over multiple recordings and/or participants should be handled by the “gluing code”.

import numpy as np

# Bad idea: the algorithm is coupled to one specific dataset class.
def run_algorithm(dataset: CustomDatasetObject):
    ...

# Better: the algorithm only requires the simple data structures it actually needs.
def run_algorithm(imu_data: np.ndarray, sampling_rate_hz: float):
    ...
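
For illustration, the gluing code (not the algorithm) would then handle the iteration over recordings. The following is just a sketch; the recordings mapping is hypothetical:

# Hypothetical gluing code: the loop over recordings lives outside the algorithm.
# `recordings` is assumed to map a recording id to a (imu_data, sampling_rate_hz) tuple.
results = {}
for recording_id, (imu_data, sampling_rate_hz) in recordings.items():
    results[recording_id] = run_algorithm(imu_data, sampling_rate_hz)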

If multiple algorithms can be used equivalently (e.g., two algorithms to detect R-peaks in an ECG signal), you should ensure that their interfaces are identical, or at least as similar as possible, so that your gluing code requires minimal modification when switching algorithms. To make such shared interfaces easier to implement, we represent Algorithms in tpcp as classes that receive all their algorithm-specific configuration via the __init__ method.

Note

Algorithms are simple classes that get configuration parameters during initialization and that have an “action” method which can be used to apply the algorithm to some data. All algorithms should be subclasses of Algorithm. If two algorithms can perform the same functionality, their action methods should adhere to the same interface. Some algorithms might further define a self_optimize method that is able to “train” certain input parameters based on provided data.
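
As a rough sketch, such an Algorithm subclass could look as follows. The RPeakDetector name, its min_distance_s parameter, and the detect action are purely illustrative; the _action_methods attribute and the trailing-underscore convention for results follow tpcp’s Algorithm base class:

import numpy as np

from tpcp import Algorithm


class RPeakDetector(Algorithm):
    _action_methods = ("detect",)

    def __init__(self, min_distance_s: float = 0.3):
        # All algorithm-specific configuration is passed (and stored unmodified) via __init__.
        self.min_distance_s = min_distance_s

    def detect(self, ecg_data: np.ndarray, sampling_rate_hz: float):
        # The "action" method applies the algorithm to the data of a single recording
        # and stores its results on the instance (trailing underscore by convention).
        self.r_peaks_ = ...  # the actual peak detection would go here
        return self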

Datasets#

With your data loading code you usually want to abstract away the complexity of data loading and provide a simple-to-use interface to your data for your gluing code, independent of the actual format and structure of the data on disk. To make writing gluing code as simple as possible, it is a good idea to follow some form of standard for the loaded data. This could be a standard designed just for yourself, for your work group, or for your entire scientific field. The only important thing is that you are consistent whenever you write data loading code. For example, you should always provide data in the same units after loading and represent it with the same (ideally simple) data structure (e.g., 3-d acceleration is always a numpy array of shape 3 x n with axis order x, y, z and all values in m/s²). Using any form of standard means that you can reuse a lot of your gluing code across multiple datasets.
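
As a hedged example of such a standard, a small loading helper could always return 3-d acceleration with the same shape, axis order, and unit. The file layout and the native sensor unit assumed below are purely illustrative:

import numpy as np
import pandas as pd


def load_acceleration(path: str) -> np.ndarray:
    """Load 3-d acceleration as a 3 x n array (axis order x, y, z; values in m/s²)."""
    raw = pd.read_csv(path)
    # Assumed: the file stores columns x, y, z in units of g -> convert to m/s².
    return raw[["x", "y", "z"]].to_numpy().T * 9.81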

Going one step further, in tpcp each dataset is a custom class inheriting from Dataset. This ensures that, independent of the actual data you are working with (tabular, metadata, time series, some crazy combination of everything), a common “standardized” datatype exists that can be used by high-level utility functions like cross_validate.

Note

Datasets are custom classes that inherit from Dataset. At their core each Dataset class only provides an index of all the data that are available. This makes it possible for generic utility functions to iterate or split datasets. It is up to the user to add additional methods and properties to a dataset that represent the actual data that can be used by an algorithm.
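
A minimal sketch of such a Dataset subclass is shown below. It assumes tpcp’s Dataset base class and its create_index hook; the participant/recording index structure and the ecg_data and sampling_rate_hz accessors are purely illustrative:

from itertools import product

import pandas as pd

from tpcp import Dataset


class ECGDataset(Dataset):
    def create_index(self) -> pd.DataFrame:
        # The index lists every available data point (here: participant/recording pairs).
        return pd.DataFrame(
            list(product(("p01", "p02"), ("rec_1", "rec_2"))),
            columns=["participant", "recording"],
        )

    @property
    def sampling_rate_hz(self) -> float:
        return 256.0

    @property
    def ecg_data(self) -> pd.Series:
        # Custom accessor that would load the data of the currently selected recording from disk.
        ...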

Pipelines#

In the ideal case, this leads to a scenario where you can use the same gluing code to run multiple different algorithms on multiple different datasets, because they all share common interfaces. In tpcp, we call this gluing code a pipeline.

Note

Pipelines are custom classes with a strictly defined interface that subclass Pipeline. They have a single run method that takes an instance of a Dataset representing a single data point as input. Within the run method, the pipeline is expected to retrieve the required data from the dataset object, pass it to one or multiple algorithms, and provide the results in a format that makes sense for the given application. Some pipelines might additionally define a self_optimize method that is able to “train” certain input parameters based on the provided data.
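
A minimal sketch of such a Pipeline subclass is shown below, reusing the hypothetical RPeakDetector and ECGDataset from the sketches above. Cloning the algorithm before running it is a common pattern to keep the configured instance untouched; the r_peaks_ result attribute is purely illustrative:

from tpcp import Pipeline


class ECGAnalysisPipeline(Pipeline):
    def __init__(self, detector: RPeakDetector):
        self.detector = detector

    def run(self, datapoint: ECGDataset):
        # Retrieve the required data from the dataset object ...
        ecg = datapoint.ecg_data
        # ... pass it to the algorithm (on a clone, so the configured instance stays untouched) ...
        detector = self.detector.clone().detect(ecg, datapoint.sampling_rate_hz)
        # ... and store the results in a format that makes sense for the application.
        self.r_peaks_ = detector.r_peaks_
        return self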

../_images/algos_simple.svg

In a simple case, a single pipeline can interface between all available Datasets and all Algorithms, because they share a common interface.#

However, it is usually impossible to provide the exact same data interface for multiple different datasets, even within the same domain: datasets can differ in their measurement procedures and measurement modalities. In the same way, you might want to perform different types of analyses and, hence, need different algorithms. This means that you will often end up with multiple pipelines (even within a single project) that connect one data interface (which might be shared by multiple datasets) with multiple algorithm interfaces for different types of analysis.

../_images/algos_complicated.svg

Pipelines act as gluing code that connects one Dataset interface with one or multiple Algorithm interfaces to perform one specific analysis.#

Note that even though we consider these to be different pipelines, as they are designed for different analyses, they might still share code (e.g., use the same utility functions or have a common parent class), so writing a new Pipeline is often very easy.