.. DO NOT EDIT. .. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY. .. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE: .. "auto_examples/datasets/_01_datasets_basics.py" .. LINE NUMBERS ARE GIVEN BELOW. .. only:: html .. note:: :class: sphx-glr-download-link-note :ref:`Go to the end ` to download the full example code .. rst-class:: sphx-glr-example-title .. _sphx_glr_auto_examples_datasets__01_datasets_basics.py: .. _custom_dataset_basics: Custom Dataset - Basics ======================= Datasets represent a set of recordings that should all be processed in the same way. For example, the data of multiple participants in a study, multiple days of recording, or multiple tests. The goal of datasets is to provide a consistent interface to access the raw data, metadata, and potential reference information in an object-oriented way. It is up to you to define what is considered a single "data-point" for your dataset. Note that datasets can be arbitrarily nested (e.g. multiple participants with multiple recordings). Datasets work best in combination with `Pipelines` and are further compatible with concepts like `GridSearch` and `cross_validation`. .. GENERATED FROM PYTHON SOURCE LINES 19-46 Defining your own dataset ------------------------- Fundamentally, you only need to create a subclass of :class:`~tpcp.Dataset` and define the `create_index` method. This method should return a dataframe describing all the data-points that should be available in the dataset. .. warning:: Make absolutely sure that the dataframe you return is deterministic and does not change between runs! A non-deterministic index can lead to some nasty bugs! We try to catch them internally, but it is not always possible. As a tip: avoid relying on random numbers, and when creating an index by scanning a directory, make sure the order does not depend on things like the file-system order. Particularly nasty are cases involving non-sorted containers like `set`, which sometimes maintain their order and sometimes don't. 
At the very least, we recommend sorting the final dataframe you return in `create_index`. In the following, we will create an example dataset without any real-world data, but one that can be used to demonstrate most functionality. At the end, we will discuss how gait-specific data should be integrated. We will define an index that contains 5 participants, with 3 recordings each. Recording 3 has 2 trials, while the others have only one. Note that we implement this as a static index here, but most of the time you would create the index by e.g. scanning and listing the files in your data directory. It is important that you do not load the actual data (e.g. the IMU samples) into memory at this point, but only list the available data-points in the index. This way, you can filter the dataset first and load the data once you know which data-points you want to access. We will discuss this later in the example. .. GENERATED FROM PYTHON SOURCE LINES 46-57 .. code-block:: default from itertools import product from typing import Optional, Union import pandas as pd trials = list(product(("rec_1", "rec_2", "rec_3"), ("trial_1",))) trials.append(("rec_3", "trial_2")) index = [(p, *t) for p, t in product((f"p{i}" for i in range(1, 6)), trials)] index = pd.DataFrame(index, columns=["participant", "recording", "trial"]) index .. raw:: html
participant recording trial
0 p1 rec_1 trial_1
1 p1 rec_2 trial_1
2 p1 rec_3 trial_1
3 p1 rec_3 trial_2
4 p2 rec_1 trial_1
5 p2 rec_2 trial_1
6 p2 rec_3 trial_1
7 p2 rec_3 trial_2
8 p3 rec_1 trial_1
9 p3 rec_2 trial_1
10 p3 rec_3 trial_1
11 p3 rec_3 trial_2
12 p4 rec_1 trial_1
13 p4 rec_2 trial_1
14 p4 rec_3 trial_1
15 p4 rec_3 trial_2
16 p5 rec_1 trial_1
17 p5 rec_2 trial_1
18 p5 rec_3 trial_1
19 p5 rec_3 trial_2


.. GENERATED FROM PYTHON SOURCE LINES 58-61 Now we use this index as the index of our new dataset. To see the dataset in action, we need to create an instance of it. Its string representation will show us the most important information. .. GENERATED FROM PYTHON SOURCE LINES 61-72 .. code-block:: default from tpcp import Dataset class CustomDataset(Dataset): def create_index(self): return index dataset = CustomDataset() dataset .. raw:: html

CustomDataset [20 groups/rows]

participant recording trial
0 p1 rec_1 trial_1
1 p1 rec_2 trial_1
2 p1 rec_3 trial_1
3 p1 rec_3 trial_2
4 p2 rec_1 trial_1
5 p2 rec_2 trial_1
6 p2 rec_3 trial_1
7 p2 rec_3 trial_2
8 p3 rec_1 trial_1
9 p3 rec_2 trial_1
10 p3 rec_3 trial_1
11 p3 rec_3 trial_2
12 p4 rec_1 trial_1
13 p4 rec_2 trial_1
14 p4 rec_3 trial_1
15 p4 rec_3 trial_2
16 p5 rec_1 trial_1
17 p5 rec_2 trial_1
18 p5 rec_3 trial_1
19 p5 rec_3 trial_2


.. GENERATED FROM PYTHON SOURCE LINES 73-79 Subsets ------- When working with a dataset, the first thing is usually to select the data you want to use. For this, you can primarily use the method `get_subset`. Here we want to select only recording 2 and 3 from participant 1 to 4. Note that the returned subset is an instance of your dataset class as well. .. GENERATED FROM PYTHON SOURCE LINES 79-82 .. code-block:: default subset = dataset.get_subset(participant=["p1", "p2", "p3", "p4"], recording=["rec_2", "rec_3"]) subset .. raw:: html

CustomDataset [12 groups/rows]

participant recording trial
0 p1 rec_2 trial_1
1 p1 rec_3 trial_1
2 p1 rec_3 trial_2
3 p2 rec_2 trial_1
4 p2 rec_3 trial_1
5 p2 rec_3 trial_2
6 p3 rec_2 trial_1
7 p3 rec_3 trial_1
8 p3 rec_3 trial_2
9 p4 rec_2 trial_1
10 p4 rec_3 trial_1
11 p4 rec_3 trial_2


.. GENERATED FROM PYTHON SOURCE LINES 83-85 The subset can then be filtered further. For more advanced filter approaches you can also filter the index directly and use a bool-map to index the dataset .. GENERATED FROM PYTHON SOURCE LINES 85-89 .. code-block:: default example_bool_map = subset.index["participant"].isin(["p1", "p2"]) final_subset = subset.get_subset(bool_map=example_bool_map) final_subset .. raw:: html

CustomDataset [6 groups/rows]

participant recording trial
0 p1 rec_2 trial_1
1 p1 rec_3 trial_1
2 p1 rec_3 trial_2
3 p2 rec_2 trial_1
4 p2 rec_3 trial_1
5 p2 rec_3 trial_2


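Because the bool-map is just a regular pandas boolean Series over the index, arbitrarily complex conditions can be combined with the usual `&`/`|` operators before passing the result to `get_subset(bool_map=...)`. A small plain-pandas sketch (the hard-coded dataframe below stands in for `subset.index` from above; it is not part of the tpcp API):

```python
import pandas as pd

# Stand-in for `subset.index` from the example above.
index = pd.DataFrame(
    [
        ("p1", "rec_2", "trial_1"), ("p1", "rec_3", "trial_1"), ("p1", "rec_3", "trial_2"),
        ("p2", "rec_2", "trial_1"), ("p2", "rec_3", "trial_1"), ("p2", "rec_3", "trial_2"),
    ],
    columns=["participant", "recording", "trial"],
)

# All second trials of recording 3 -> could be passed to `get_subset(bool_map=...)`.
bool_map = (index["recording"] == "rec_3") & (index["trial"] == "trial_2")
```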
.. GENERATED FROM PYTHON SOURCE LINES 90-97 Iteration and Groups -------------------------------- After selecting the part of the data you want to use, you usually want/need to iterate over the data to apply your processing steps. By default, you can simply iterate over all rows. Note that each row is itself a dataset again, just with a single entry. .. GENERATED FROM PYTHON SOURCE LINES 97-101 .. code-block:: default for row in final_subset: print(row) print(f"This row contains {len(row)} data-point", end="\n\n") .. rst-class:: sphx-glr-script-out .. code-block:: none CustomDataset [1 groups/rows] participant recording trial 0 p1 rec_2 trial_1 This row contains 1 data-point CustomDataset [1 groups/rows] participant recording trial 0 p1 rec_3 trial_1 This row contains 1 data-point CustomDataset [1 groups/rows] participant recording trial 0 p1 rec_3 trial_2 This row contains 1 data-point CustomDataset [1 groups/rows] participant recording trial 0 p2 rec_2 trial_1 This row contains 1 data-point CustomDataset [1 groups/rows] participant recording trial 0 p2 rec_3 trial_1 This row contains 1 data-point CustomDataset [1 groups/rows] participant recording trial 0 p2 rec_3 trial_2 This row contains 1 data-point .. GENERATED FROM PYTHON SOURCE LINES 102-106 However, in many cases, we don't want to iterate over all rows, but rather iterate over groups of the dataset (e.g. all participants or all tests) individually. We can do that in two ways (depending on what is needed). For example, if we want to iterate over all recordings, we can do this: .. GENERATED FROM PYTHON SOURCE LINES 106-109 .. code-block:: default for trial in final_subset.iter_level("recording"): print(trial, end="\n\n") .. rst-class:: sphx-glr-script-out .. 
code-block:: none CustomDataset [2 groups/rows] participant recording trial 0 p1 rec_2 trial_1 1 p2 rec_2 trial_1 CustomDataset [4 groups/rows] participant recording trial 0 p1 rec_3 trial_1 1 p1 rec_3 trial_2 2 p2 rec_3 trial_1 3 p2 rec_3 trial_2 .. GENERATED FROM PYTHON SOURCE LINES 110-115 You can see that we get two subsets, one for each recording label. But what, if we want to iterate over the participants and the recordings together? In this case, we need to group our dataset first. Note that the grouped_subset shows the new groupby columns as the index in the representation and the length of the dataset is reported to be the number of groups. .. GENERATED FROM PYTHON SOURCE LINES 115-119 .. code-block:: default grouped_subset = final_subset.groupby(["participant", "recording"]) print(f"The dataset contains {len(grouped_subset)} groups.") grouped_subset .. rst-class:: sphx-glr-script-out .. code-block:: none The dataset contains 4 groups. .. raw:: html

CustomDataset [4 groups/rows]

participant recording trial
participant recording
p1 rec_2 p1 rec_2 trial_1
rec_3 p1 rec_3 trial_1
rec_3 p1 rec_3 trial_2
p2 rec_2 p2 rec_2 trial_1
rec_3 p2 rec_3 trial_1
rec_3 p2 rec_3 trial_2


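To build intuition for the reported length: the number of groups is simply the number of unique value combinations of the groupby columns in the index. A plain-pandas illustration of this count (using a hard-coded copy of the subset's index; this is not tpcp's actual implementation):

```python
import pandas as pd

# Hard-coded copy of the index of `final_subset` from above.
index = pd.DataFrame(
    [
        ("p1", "rec_2", "trial_1"), ("p1", "rec_3", "trial_1"), ("p1", "rec_3", "trial_2"),
        ("p2", "rec_2", "trial_1"), ("p2", "rec_3", "trial_1"), ("p2", "rec_3", "trial_2"),
    ],
    columns=["participant", "recording", "trial"],
)

# 6 rows, but only 4 unique (participant, recording) combinations -> 4 groups.
n_groups = index.groupby(["participant", "recording"]).ngroups
```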
.. GENERATED FROM PYTHON SOURCE LINES 120-124 If we now iterate the dataset, it will iterate over the unique groups. Grouping also changes the meaning of a "single datapoint". Each group reports a shape of `(1,)`, independent of the number of rows in each group. .. GENERATED FROM PYTHON SOURCE LINES 124-128 .. code-block:: default for group in grouped_subset: print(f"This group has the shape {group.shape}") print(group, end="\n\n") .. rst-class:: sphx-glr-script-out .. code-block:: none This group has the shape (1,) CustomDataset [1 groups/rows] participant recording trial participant recording p1 rec_2 p1 rec_2 trial_1 This group has the shape (1,) CustomDataset [1 groups/rows] participant recording trial participant recording p1 rec_3 p1 rec_3 trial_1 rec_3 p1 rec_3 trial_2 This group has the shape (1,) CustomDataset [1 groups/rows] participant recording trial participant recording p2 rec_2 p2 rec_2 trial_1 This group has the shape (1,) CustomDataset [1 groups/rows] participant recording trial participant recording p2 rec_3 p2 rec_3 trial_1 rec_3 p2 rec_3 trial_2 .. GENERATED FROM PYTHON SOURCE LINES 129-132 At any point, you can view all unique groups/rows in the dataset using the `group_labels` attribute. The order shown here is the same order used when iterating the dataset. When creating a new subset, the order might change! .. GENERATED FROM PYTHON SOURCE LINES 132-134 .. code-block:: default grouped_subset.group_labels .. rst-class:: sphx-glr-script-out .. code-block:: none [CustomDatasetGroupLabel(participant='p1', recording='rec_2'), CustomDatasetGroupLabel(participant='p1', recording='rec_3'), CustomDatasetGroupLabel(participant='p2', recording='rec_2'), CustomDatasetGroupLabel(participant='p2', recording='rec_3')] .. GENERATED FROM PYTHON SOURCE LINES 135-143 .. note:: The `group_labels` attribute consists of a list of `named tuples `_. The tuple elements are named after the groupby columns and are in the same order as the groupby columns. 
They can be accessed by name or index: For example, `grouped_subset.group_labels[0].participant` and `grouped_subset.group_labels[0][0]` are equivalent. Also, `grouped_subset.group_labels[0]` and `grouped_subset[0].group_label` are equivalent. .. GENERATED FROM PYTHON SOURCE LINES 146-147 Note that for an "un-grouped" dataset, this corresponds to all rows. .. GENERATED FROM PYTHON SOURCE LINES 147-149 .. code-block:: default final_subset.group_labels .. rst-class:: sphx-glr-script-out .. code-block:: none [CustomDatasetGroupLabel(participant='p1', recording='rec_2', trial='trial_1'), CustomDatasetGroupLabel(participant='p1', recording='rec_3', trial='trial_1'), CustomDatasetGroupLabel(participant='p1', recording='rec_3', trial='trial_2'), CustomDatasetGroupLabel(participant='p2', recording='rec_2', trial='trial_1'), CustomDatasetGroupLabel(participant='p2', recording='rec_3', trial='trial_1'), CustomDatasetGroupLabel(participant='p2', recording='rec_3', trial='trial_2')] .. GENERATED FROM PYTHON SOURCE LINES 150-152 If you want to view the full set of labels of a dataset regardless of the grouping, you can use the `index_as_tuples` method. .. GENERATED FROM PYTHON SOURCE LINES 152-154 .. code-block:: default grouped_subset.index_as_tuples() .. rst-class:: sphx-glr-script-out .. code-block:: none [CustomDatasetGroupLabel(participant='p1', recording='rec_2', trial='trial_1'), CustomDatasetGroupLabel(participant='p1', recording='rec_3', trial='trial_1'), CustomDatasetGroupLabel(participant='p1', recording='rec_3', trial='trial_2'), CustomDatasetGroupLabel(participant='p2', recording='rec_2', trial='trial_1'), CustomDatasetGroupLabel(participant='p2', recording='rec_3', trial='trial_1'), CustomDatasetGroupLabel(participant='p2', recording='rec_3', trial='trial_2')] .. GENERATED FROM PYTHON SOURCE LINES 155-156 Note that `index_as_tuples()` and `group_labels` return the same for an un-grouped dataset. .. GENERATED FROM PYTHON SOURCE LINES 156-158 .. 
code-block:: default final_subset.index_as_tuples() .. rst-class:: sphx-glr-script-out .. code-block:: none [CustomDatasetGroupLabel(participant='p1', recording='rec_2', trial='trial_1'), CustomDatasetGroupLabel(participant='p1', recording='rec_3', trial='trial_1'), CustomDatasetGroupLabel(participant='p1', recording='rec_3', trial='trial_2'), CustomDatasetGroupLabel(participant='p2', recording='rec_2', trial='trial_1'), CustomDatasetGroupLabel(participant='p2', recording='rec_3', trial='trial_1'), CustomDatasetGroupLabel(participant='p2', recording='rec_3', trial='trial_2')] .. GENERATED FROM PYTHON SOURCE LINES 159-161 We can use the group labels (or a subset of them) to index our dataset. This can be particularly helpful if you want to recreate specific train-test splits provided by `cross_validate`. .. GENERATED FROM PYTHON SOURCE LINES 161-163 .. code-block:: default final_subset.get_subset(group_labels=final_subset.group_labels[:3]) .. raw:: html

CustomDataset [3 groups/rows]

participant recording trial
0 p1 rec_2 trial_1
1 p1 rec_3 trial_1
2 p1 rec_3 trial_2


.. GENERATED FROM PYTHON SOURCE LINES 164-166 If you want, you can also ungroup a dataset again. This can be useful for a nested iteration: .. GENERATED FROM PYTHON SOURCE LINES 166-172 .. code-block:: default for outer, group in enumerate(grouped_subset): ungrouped = group.groupby(None) for inner, subgroup in enumerate(ungrouped): print(outer, inner) print(subgroup, end="\n\n") .. rst-class:: sphx-glr-script-out .. code-block:: none 0 0 CustomDataset [1 groups/rows] participant recording trial 0 p1 rec_2 trial_1 1 0 CustomDataset [1 groups/rows] participant recording trial 0 p1 rec_3 trial_1 1 1 CustomDataset [1 groups/rows] participant recording trial 0 p1 rec_3 trial_2 2 0 CustomDataset [1 groups/rows] participant recording trial 0 p2 rec_2 trial_1 3 0 CustomDataset [1 groups/rows] participant recording trial 0 p2 rec_3 trial_1 3 1 CustomDataset [1 groups/rows] participant recording trial 0 p2 rec_3 trial_2 .. GENERATED FROM PYTHON SOURCE LINES 173-180 Splitting --------- If you are evaluating algorithms, it is often important to split your data into a train and a test set, or multiple distinct sets for a cross validation. The `Dataset` objects directly support the `sklearn` helper functions for this. For example, to split our subset into training and testing we can do the following: .. GENERATED FROM PYTHON SOURCE LINES 180-186 .. code-block:: default from sklearn.model_selection import train_test_split train, test = train_test_split(final_subset, train_size=0.5) print("Train:\n", train, end="\n\n") print("Test:\n", test) .. rst-class:: sphx-glr-script-out .. code-block:: none Train: CustomDataset [3 groups/rows] participant recording trial 0 p1 rec_3 trial_2 1 p1 rec_2 trial_1 2 p2 rec_3 trial_1 Test: CustomDataset [3 groups/rows] participant recording trial 0 p2 rec_2 trial_1 1 p2 rec_3 trial_2 2 p1 rec_3 trial_1 .. GENERATED FROM PYTHON SOURCE LINES 187-189 Such splitting always occurs on a data-point level and can therefore be influenced by grouping. 
If we want to split our dataset into training and testing, but only based on the participants, we can do this: .. GENERATED FROM PYTHON SOURCE LINES 189-193 .. code-block:: default train, test = train_test_split(final_subset.groupby("participant"), train_size=0.5) print("Train:\n", train, end="\n\n") print("Test:\n", test) .. rst-class:: sphx-glr-script-out .. code-block:: none Train: CustomDataset [1 groups/rows] participant recording trial participant p1 p1 rec_2 trial_1 p1 p1 rec_3 trial_1 p1 p1 rec_3 trial_2 Test: CustomDataset [1 groups/rows] participant recording trial participant p2 p2 rec_2 trial_1 p2 p2 rec_3 trial_1 p2 p2 rec_3 trial_2 .. GENERATED FROM PYTHON SOURCE LINES 194-196 In the same way, you can use the dataset (grouped or not) with the cross-validation helper functions (KFold is just an example, all should work): .. GENERATED FROM PYTHON SOURCE LINES 196-204 .. code-block:: default from sklearn.model_selection import KFold cv = KFold(n_splits=2) grouped_subset = final_subset.groupby("participant") for train, test in cv.split(grouped_subset): # We only print the train set here print(grouped_subset[train], end="\n\n") .. rst-class:: sphx-glr-script-out .. code-block:: none CustomDataset [1 groups/rows] participant recording trial participant p2 p2 rec_2 trial_1 p2 p2 rec_3 trial_1 p2 p2 rec_3 trial_2 CustomDataset [1 groups/rows] participant recording trial participant p1 p1 rec_2 trial_1 p1 p1 rec_3 trial_1 p1 p1 rec_3 trial_2 .. GENERATED FROM PYTHON SOURCE LINES 205-211 While this works well, it is not always what we want. Sometimes, we still want to consider each row a single datapoint, but want to prevent the data of e.g. a single participant and recording from ending up partially in the train- and partially in the test-split. For this, we can use `GroupKFold` in combination with `dataset.create_string_group_labels`. `create_string_group_labels` generates a unique string identifier for each row/group: .. 
GENERATED FROM PYTHON SOURCE LINES 211-214 .. code-block:: default group_labels = final_subset.create_string_group_labels(["participant", "recording"]) group_labels .. rst-class:: sphx-glr-script-out .. code-block:: none ["('p1', 'rec_2')", "('p1', 'rec_3')", "('p1', 'rec_3')", "('p2', 'rec_2')", "('p2', 'rec_3')", "('p2', 'rec_3')"] .. GENERATED FROM PYTHON SOURCE LINES 215-217 They can then be used as the `group` parameter in `GroupKFold` (and similar methods). Now the data of the two participants is never split between train and test set. .. GENERATED FROM PYTHON SOURCE LINES 217-224 .. code-block:: default from sklearn.model_selection import GroupKFold cv = GroupKFold(n_splits=2) for train, test in cv.split(final_subset, groups=group_labels): # We only print the train set here print(final_subset[train], end="\n\n") .. rst-class:: sphx-glr-script-out .. code-block:: none CustomDataset [3 groups/rows] participant recording trial 0 p1 rec_2 trial_1 1 p1 rec_3 trial_1 2 p1 rec_3 trial_2 CustomDataset [3 groups/rows] participant recording trial 0 p2 rec_2 trial_1 1 p2 rec_3 trial_1 2 p2 rec_3 trial_2 .. GENERATED FROM PYTHON SOURCE LINES 225-228 Instead of doing this manually, we also provide a custom splitter that does this for you. It allows us to directly put the dataset into the `split` method of `cross_validate` and use higher level semantics to specify the grouping and stratification. .. GENERATED FROM PYTHON SOURCE LINES 228-237 .. code-block:: default from tpcp.validate import DatasetSplitter cv = DatasetSplitter(GroupKFold(n_splits=2), groupby=["participant", "recording"]) for train, test in cv.split(final_subset): # We only print the train set here print(final_subset[train], end="\n\n") .. rst-class:: sphx-glr-script-out .. 
code-block:: none CustomDataset [3 groups/rows] participant recording trial 0 p1 rec_2 trial_1 1 p1 rec_3 trial_1 2 p1 rec_3 trial_2 CustomDataset [3 groups/rows] participant recording trial 0 p2 rec_2 trial_1 1 p2 rec_3 trial_1 2 p2 rec_3 trial_2 .. GENERATED FROM PYTHON SOURCE LINES 238-242 Creating labels also works for datasets that are already grouped. However, the columns that should be contained in the label must then be a subset of the groupby columns. The number of group labels is 4 in this case, as there are only 4 groups after grouping the dataset. .. GENERATED FROM PYTHON SOURCE LINES 242-245 .. code-block:: default group_labels = final_subset.groupby(["participant", "recording"]).create_string_group_labels("participant") group_labels .. rst-class:: sphx-glr-script-out .. code-block:: none ['p1', 'p1', 'p2', 'p2'] .. GENERATED FROM PYTHON SOURCE LINES 246-265 Adding Data ----------- So far we only operated on the index of the dataset. But if we want to run algorithms, we need the actual data (i.e. IMU samples, clinical data, ...). Because the data and the structure of the data can vary widely from dataset to dataset, it is up to you to implement data access. It comes down to documentation to ensure that users access the data in the correct way. In general, we try to follow a couple of conventions to give datasets a consistent feel: - Data access should be provided via a `@property` decorator on the dataset objects, loading the data on demand. - The names of these properties should follow a common naming scheme (e.g. `data` for the core sensor data) and should return values using the established datatypes (e.g. `pd.DataFrames`). - The names of values that represent gold standard information (i.e. values you would only have in an evaluation dataset and should never use for training) should have a trailing `_`, which marks them as results, similar to how sklearn handles it. This should look something like this: .. GENERATED FROM PYTHON SOURCE LINES 265-287 .. 
code-block:: default class CustomDataset(Dataset): @property def data(self) -> pd.DataFrame: # Some logic to load data from disk raise NotImplementedError() @property def sampling_rate_hz(self) -> float: return 204.8 @property def reference_events_(self) -> pd.DataFrame: # Some custom logic to load the gold-standard events of this validation dataset. # Note the trailing `_` in the name. raise NotImplementedError() def create_index(self): return index .. GENERATED FROM PYTHON SOURCE LINES 288-303 For each of the data-values you need to decide on which "level" you provide data access. Meaning: can you return data while there are still multiple participants/recordings in the dataset, or only once a single trial of a single participant is left? Usually, we recommend returning the data on the lowest logical level (e.g. if you recorded separate IMU sessions per trial, you should provide access only if there is just a single trial of a single participant left in the dataset). Otherwise, you should throw an error. This pattern can be simplified using the `is_single` or `assert_is_single` helper methods. These helpers check, based on the provided `groupby_cols`, if there is really just a single group/row left with the given groupby settings. Let's say `data` can be accessed on either a `recording` or a `trial` level, and `segmented_stride_list` can only be accessed on a `trial` level. Then we could do something like this: .. GENERATED FROM PYTHON SOURCE LINES 303-334 .. code-block:: default class CustomDataset(Dataset): @property def data(self) -> str: # Note that we need to make our checks from the least restrictive to the most restrictive (if there is only a # single trial, there is also only a single recording). 
if self.is_single(["participant", "recording"]): return "This is the data for participant {} and rec {}".format(*self.group_label) # None -> single row if self.is_single(None): return "This is the data for participant {}, rec {} and trial {}".format(*self.group_label) raise ValueError( "Data can only be accessed when there is only a single recording of a single participant in the subset" ) @property def sampling_rate_hz(self) -> float: return 204.8 @property def segmented_stride_list_(self) -> str: # We use assert here, as we don't have multiple options. # (We could also use `None` for the `groupby_cols` here) self.assert_is_single(["participant", "recording", "trial"], "segmented_stride_list_") return "This is the segmented stride list for participant {}, rec {} and trial {}".format(*self.group_label) def create_index(self): return index .. GENERATED FROM PYTHON SOURCE LINES 335-336 If we select a single trial (row), we can get data and the stride list: .. GENERATED FROM PYTHON SOURCE LINES 336-341 .. code-block:: default test_dataset = CustomDataset() single_trial = test_dataset[0] print(single_trial.data) print(single_trial.segmented_stride_list_) .. rst-class:: sphx-glr-script-out .. code-block:: none This is the data for participant p1 and rec rec_1 This is the segmented stride list for participant p1, rec rec_1 and trial trial_1 .. GENERATED FROM PYTHON SOURCE LINES 342-343 If we only select a recording, we get an error for the stride list: .. GENERATED FROM PYTHON SOURCE LINES 343-352 .. code-block:: default # We select only recording 3 here, as it has 2 trials. single_recording = test_dataset.get_subset(recording="rec_3").groupby(["participant", "recording"])[0] print(single_recording.data) try: print(single_recording.segmented_stride_list_) except Exception as e: print("ValueError: ", e) .. rst-class:: sphx-glr-script-out .. 
code-block:: none This is the data for participant p1 and rec rec_3 ValueError: The attribute `segmented_stride_list_` of dataset CustomDataset can only be accessed if there is only a single combination of the columns ['participant', 'recording', 'trial'] left in a data subset, .. GENERATED FROM PYTHON SOURCE LINES 353-363 Custom parameter ---------------- Often it is required to pass some parameters/configuration to the dataset. This could be, for example, the place where the data is stored, whether a specific part of the dataset should be included, or whether some preprocessing should be applied to the data. Such additional configuration can be provided via a custom `__init__` and is then available to all methods. Note that you **must** assign the configuration values to attributes with the same name and **must not** forget to call `super().__init__`. .. GENERATED FROM PYTHON SOURCE LINES 363-384 .. code-block:: default class CustomDatasetWithConfig(Dataset): data_folder: str custom_config_para: bool def __init__( self, data_folder: str, custom_config_para: bool = False, *, groupby_cols: Optional[Union[list[str], str]] = None, subset_index: Optional[pd.DataFrame] = None, ): self.data_folder = data_folder self.custom_config_para = custom_config_para super().__init__(groupby_cols=groupby_cols, subset_index=subset_index) def create_index(self): # Use e.g. `self.data_folder` to load the data. return index .. rst-class:: sphx-glr-timing **Total running time of the script:** (0 minutes 3.936 seconds) **Estimated memory usage:** 13 MB .. _sphx_glr_download_auto_examples_datasets__01_datasets_basics.py: .. only:: html .. container:: sphx-glr-footer sphx-glr-footer-example .. container:: sphx-glr-download sphx-glr-download-python :download:`Download Python source code: _01_datasets_basics.py <_01_datasets_basics.py>` .. 
container:: sphx-glr-download sphx-glr-download-jupyter :download:`Download Jupyter notebook: _01_datasets_basics.ipynb <_01_datasets_basics.ipynb>` .. only:: html .. rst-class:: sphx-glr-signature `Gallery generated by Sphinx-Gallery `_