{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n\n# Custom Dataset - Basics\n\nDatasets represent a set of recordings that should all be processed in the same way.\nFor example, the data of multiple participants in a study, multiple days of recording, or multiple tests.\nThe goal of datasets is to provide a consistent interface to access the raw data, metadata, and potential reference\ninformation in an object-oriented way.\nIt is up to you to define what is considered a single \"data-point\" for your dataset.\nNote that datasets can be arbitrarily nested (e.g. multiple participants with multiple recordings).\n\nDatasets work best in combination with `Pipelines` and are further compatible with concepts like `GridSearch` and\n`cross_validation`.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Defining your own dataset\nFundamentally, you only need to create a subclass of `tpcp.Dataset` and define the\n`create_index` method.\nThis method should return a dataframe describing all the data-points that should be available in the dataset.\n\n
Make absolutely sure that the dataframe you return is deterministic and does not change between runs!\n Otherwise, this can lead to some nasty bugs!\n We try to catch them internally, but it is not always possible.\n As tips, avoid reliance on random numbers and make sure that the order does not depend on things\n like file system order when creating an index by scanning a directory.\n Particularly nasty are cases involving unordered containers like `set`, which sometimes maintain\n their order, but sometimes don't.\n At the very least, we recommend sorting the final dataframe you return in `create_index`.
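The sorting advice can be sketched with the standard library alone. This is only an illustration: the helper `build_index_rows` and the `<participant>_<recording>.csv` naming scheme are assumptions, not part of tpcp. In a real `create_index` you would turn the sorted rows into the dataframe you return.

```python
from pathlib import PurePath


def build_index_rows(file_names):
    """Return (participant, recording) rows in a deterministic order."""
    rows = []
    for name in file_names:
        # Hypothetical naming scheme: "<participant>_<recording>.csv"
        participant, recording = PurePath(name).stem.split("_")
        rows.append((participant, recording))
    # Sorting removes any dependence on the order of the input,
    # e.g. file system order or the iteration order of a set.
    return sorted(rows)


# A set has no guaranteed iteration order, so it is a typical source of trouble.
unordered = {"p2_rec1.csv", "p1_rec2.csv", "p1_rec1.csv"}
rows = build_index_rows(unordered)
```

Because the rows are sorted before they are returned, the same index results no matter in which order the files were discovered.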
The `group_labels` attribute consists of a list of [named tuples](https://docs.python.org/3/library/collections.html#namedtuple-factory-function-for-tuples-with-named-fields).\n The tuple elements are named after the groupby columns and are in the same order as the groupby columns.\n They can be accessed by name or index:\n For example, `grouped_subset.group_labels[0].participant` and `grouped_subset.group_labels[0][0]` are equivalent.\n\n Also, `grouped_subset.group_labels[0]` and `grouped_subset[0].group_label` are equivalent.
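The equivalence of name-based and index-based access follows from how named tuples work in Python. The sketch below uses a hand-built `GroupLabel` type with the assumed groupby columns `participant` and `recording`. It only mimics the shape of `group_labels` and does not use a real dataset.

```python
from collections import namedtuple

# Hypothetical label type: fields are named after the (assumed) groupby
# columns and appear in the same order.
GroupLabel = namedtuple("GroupLabel", ["participant", "recording"])

# Stand-in for `grouped_subset.group_labels`
group_labels = [GroupLabel("p1", "rec_1"), GroupLabel("p1", "rec_2")]

first = group_labels[0]
by_name = first.participant  # access by field name
by_index = first[0]          # access by position
```

Access by name is usually the more readable choice, because it keeps working if the order of the groupby columns changes.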