{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "%matplotlib inline"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n\n# Custom Dataset - Basics\n\nDatasets represent a set of recordings that should all be processed in the same way.\nFor example the data of multiple participants in a study, multiple days of recording, or multiple tests.\nThe goal of datasets is to provide a consistent interface to access the raw data, metadata, and potential reference\ninformation in an object-oriented way.\nIt is up to you to define, what is considered a single \"data-point\" for your dataset.\nNote, that datasets can be arbitrarily nested (e.g. multiple participants with multiple recordings).\n\nDatasets work best in combination with `Pipelines` and are further compatible with concepts like `GridSearch` and\n`cross_validation`.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Defining your own dataset\nFundamentally you only need to create a subclass of :func:`~tpcp.dataset.Dataset` and define the\n`create_index` method.\nThis method should return a dataframe describing all the data-points that should be available in the dataset.\n\nIn the following we will create an example dataset, without any real world data,\nbut it can be used to demonstrate most functionality.\nAt the end we will discuss, how gait specific data should be integrated.\n\nWe will define an index that contains 5 participants, with 3 recordings each.\nRecording 3 has 2 trials, while the others have only one.\nNote, that we implement this as a static index here, but most of the time, you would create the index by e.g. scanning\nand listing the files in your data directory.\nIt is important that you don't want to load the entire actual data (e.g. the imu samples) in memory, but just list\nthe available data-points in the index.\nThen you can filter the dataset first and load the data once you know which data-points you want to access.\nWe will discuss this later in the example.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from itertools import product\nfrom typing import List, Optional, Union\n\nimport pandas as pd\n\ntrials = list(product((\"rec_1\", \"rec_2\", \"rec_3\"), (\"trial_1\",)))\ntrials.append((\"rec_3\", \"trial_2\"))\nindex = [(p, *t) for p, t in product((\"p{}\".format(i) for i in range(1, 6)), trials)]\nindex = pd.DataFrame(index, columns=[\"participant\", \"recording\", \"trial\"])\nindex"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now we use this index as the index of our new dataset.\nTo see the dataset in action, we need to create an instance of it.\nIts string representation will show us the most important information.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from tpcp._dataset import Dataset\n\n\nclass CustomDataset(Dataset):\n    def create_index(self):\n        return index\n\n\ndataset = CustomDataset()\ndataset"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Subsets\nWhen working with a dataset, the first thing is usually to select the data you want to use.\nFor this, you can primarily use the method `get_subset`.\nHere we want to select only recording 2 and 3 from participant 1 to 4.\nNote that the returned subset is an instance of your dataset class as well.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "subset = dataset.get_subset(participant=[\"p1\", \"p2\", \"p3\", \"p4\"], recording=[\"rec_2\", \"rec_3\"])\nsubset"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The subset can then be filtered further.\nFor more advanced filter approaches you can also filter the index directly and use a bool-map to index the dataset\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "example_bool_map = subset.index[\"participant\"].isin([\"p1\", \"p2\"])\nfinal_subset = subset.get_subset(bool_map=example_bool_map)\nfinal_subset"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Iteration and Groups\nAfter selecting the part of the data you want to use, you usually want/need to iterate over the data to apply your\nprocessing steps.\n\nBy default, you can simply iterate over all rows.\nNote, that each row itself is a dataset again, but just with a single entry.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "for row in final_subset:\n    print(row)\n    print(\"This row contains {} data-point\".format(len(row)), end=\"\\n\\n\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "However, in many cases, we don't want to iterate over all rows, but rather iterate over groups of the datasets (\ne.g. all participants or all tests) individually.\nWe can do that in 2 ways (depending on what is needed).\nFor example, if we want to iterate over all recordings, we can do this:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "for trial in final_subset.iter_level(\"recording\"):\n    print(trial, end=\"\\n\\n\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "You can see that we get two subsets, one for each recording label.\nBut what, if we want to iterate over the participants and the recordings together?\nIn these cases, we need to group our dataset first.\nNote that the grouped_subset shows the new groupby columns as the index in the representation and the length of the\ndataset is reported to be the number of groups.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "grouped_subset = final_subset.groupby([\"participant\", \"recording\"])\nprint(\"The dataset contains {} groups.\".format(len(grouped_subset)))\ngrouped_subset"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "If we now iterate the dataset, it will iterate over the unique groups.\n\nGrouping also changes the meaning of a \"single datapoint\".\nEach group reports a shape of `(1,)` independent of the number of rows in each group.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "for group in grouped_subset:\n    print(\"This group has the shape {}\".format(group.shape))\n    print(group, end=\"\\n\\n\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "At any point, you can view all unique groups/rows in the dataset using the `groups` attribute.\nThe order shown here, is the same order used when iterating the dataset.\nWhen creating a new subset, the order might change!\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "grouped_subset.groups"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Note that for an \"un-grouped\" dataset, this corresponds to all rows.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "final_subset.groups"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "In both cases, we can use the group labels (or a subset of them) to index our dataset.\nThis can be in particular helpful, if you want to recreate specific train test splits provided by `cross_validate`.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "final_subset.get_subset(groups=final_subset.groups[:3])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "If you want, you can also ungroup a dataset again.\nThis can be useful for a nested iteration:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "for outer, group in enumerate(grouped_subset):\n    ungrouped = group.groupby(None)\n    for inner, subgroup in enumerate(ungrouped):\n        print(outer, inner)\n        print(subgroup, end=\"\\n\\n\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Splitting\nIf you are evaluating algorithms, it is often important to split your data into a train and a test set, or multiple\ndistinct sets for a cross validation.\n\nThe `Dataset` objects directly support the `sklearn` helper functions for this.\nFor example, to split our subset into training and testing we can do the following:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.model_selection import train_test_split\n\ntrain, test = train_test_split(final_subset, train_size=0.5)\nprint(\"Train:\\n\", train, end=\"\\n\\n\")\nprint(\"Test:\\n\", test)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Such splitting always occurs on a data-point level and can therefore be influenced by grouping.\nIf we want to split our datasets into training and testing, but only based on the participants, we can do this:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "train, test = train_test_split(final_subset.groupby(\"participant\"), train_size=0.5)\nprint(\"Train:\\n\", train, end=\"\\n\\n\")\nprint(\"Test:\\n\", test)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "In the same way you can use the dataset (grouped or not) with the cross validation helper functions\n(KFold is just an example, all should work):\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.model_selection import KFold\n\ncv = KFold(n_splits=2)\ngrouped_subset = final_subset.groupby(\"participant\")\nfor train, test in cv.split(grouped_subset):\n    # We only print the train set here\n    print(grouped_subset[train], end=\"\\n\\n\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "While this works well, it is not always what we want.\nSometimes, we still want to consider each row a single datapoint, but want to prevent that data of e.g. a single\nparticipant is partially put into train- and partially into the test-split.\nFor this, we can use `GroupKFold` in combination with `dataset.create_group_labels`.\n\n`create_group_labels` generates a unique identifier for each row/group:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "group_labels = final_subset.create_group_labels(\"participant\")\ngroup_labels"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "They can then be used as the `group` parameter in `GroupKFold` (and similar methods).\nNow the data of the two participants is never split between train and test set.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.model_selection import GroupKFold\n\ncv = GroupKFold(n_splits=2)\nfor train, test in cv.split(final_subset, groups=group_labels):\n    # We only print the train set here\n    print(final_subset[train], end=\"\\n\\n\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Creating labels also works for datasets that are already grouped.\nBut, the columns that should be contained in the label must be a subset of the groupby columns in this case.\n\nThe number of group labels is 4 in this case, as there are only 4 groups after grouping the datset.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "group_labels = final_subset.groupby([\"participant\", \"recording\"]).create_group_labels(\"participant\")\ngroup_labels"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Adding Data\nSo far we only operated on the index of the dataset.\nBut if we want to run algorithms, we need the actual data (i.e. IMU samples, clinical data, ...).\n\nBecause the data and the structure of the data can vary widely from dataset to dataset, it is up to you to implement\ndata access.\nIt comes down to documentation to ensure that users access the data in the correct way.\n\nIn general we try to follow a couple of conventions to give datasets a consistent feel:\n\n- Data access should be provided via `@property` decorator on the dataset objects, loading the data on demand.\n- The names of these properties should follow some the naming scheme (e.g. `data` for the core sensor data)\n  and should return values using the established datatypes (e.g. `pd.DataFrames`).\n- The names of values that represent gold standard information (i.e. values you would only have in an evaluation\n  dataset and should never use for training), should have a trailing `_`, which marks them as result similar how\n  sklearn handles it.\n\nThis should look something like this:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "class CustomDataset(Dataset):\n    @property\n    def data(self) -> pd.DataFrame:\n        # Some logic to load data from disc\n        raise NotImplementedError()\n\n    @property\n    def sampling_rate_hz(self) -> float:\n        return 204.8\n\n    @property\n    def reference_events_(self) -> pd.DataFrame:\n        # Some custom logic to load the gold-standard events of this validation dataset.\n        # Note the trailing `_` in the name.\n        raise NotImplementedError()\n\n    def create_index(self):\n        return index"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "For each of the data-values you need to decide, on which \"level\" you provide data access.\nMeaning, do you want/can return data, when there are still multiple participants/recordings in the dataset, or can you\nonly return the data, when there is only a single trial of a single participant left.\n\nUsually, we recommend to always return the data on the lowest logical level (e.g. if you recorded separate IMU\nsessions per trial, you should provide access only, if there is just a single trail by a single participant left in\nthe dataset).\nOtherwise, you should throw an error.\nThis pattern can be simplified using the `is_single` or `assert_is_single` helper method.\nThese helpers check based on the provided `groupby_cols` if there is really just a single group/row left with the\ngiven groupby settings.\n\nLet's say `data` can be accessed on either a `recording` or a `trail` level, and `segmented_stride_list` can only\nbe accessed on a `trail` level.\nThen we could do something like this:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "class CustomDataset(Dataset):\n    @property\n    def data(self) -> str:\n        # Note that we need to make our checks from the least restrictive to the most restrictive (if there is only a\n        # single trail, there is only just a single recording).\n        if self.is_single([\"participant\", \"recording\"]):\n            return \"This is the data for participant {} and rec {}\".format(*self.group)\n        # None -> single row\n        if self.is_single(None):\n            return \"This is the data for participant {}, rec {} and trial {}\".format(*self.group)\n        raise ValueError(\n            \"Data can only be accessed when their is only a single recording of a single participant in the subset\"\n        )\n\n    @property\n    def sampling_rate_hz(self) -> float:\n        return 204.8\n\n    @property\n    def segmented_stride_list_(self) -> str:\n        # We use assert here, as we don't have multiple options.\n        # (We could also used `None` for the `groupby_cols` here)\n        self.assert_is_single([\"participant\", \"recording\", \"trial\"], \"segmented_stride_list_\")\n        return \"This is the segmented stride list for participant {}, rec {} and trial {}\".format(*self.group)\n\n    def create_index(self):\n        return index"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "If we select a single trial (row), we can get data and the stride list:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "test_dataset = CustomDataset()\nsingle_trial = test_dataset[0]\nprint(single_trial.data)\nprint(single_trial.segmented_stride_list_)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "If we only select a recording, we get an error for the stride list:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# We select only recording 3 here, as it has 2 trials.\nsingle_recording = test_dataset.get_subset(recording=\"rec_3\").groupby([\"participant\", \"recording\"])[0]\nprint(single_recording.data)\ntry:\n    print(single_recording.segmented_stride_list_)\nexcept Exception as e:\n    print(\"ValueError: \", e)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Custom parameter\nOften it is required to pass some parameters/configuration to the dataset.\nThis could be for example the place where the data is stored or if a specific part of the dataset should be included,\nif some preprocessing should be applied to the data, ... .\n\nSuch additional configuration can be provided via a custom `__init__` and is then available for all methods to be\nused.\nNote that you **must** assign the configuration values to attributes with the same name and **must not** forget to\ncall `super().__init__`\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "class CustomDatasetWithConfig(Dataset):\n    data_folder: str\n    custom_config_para: bool\n\n    def __init__(\n        self,\n        data_folder: str,\n        custom_config_para: bool = False,\n        *,\n        groupby_cols: Optional[Union[List[str], str]] = None,\n        subset_index: Optional[pd.DataFrame] = None,\n    ):\n        self.data_folder = data_folder\n        self.custom_config_para = custom_config_para\n        super().__init__(groupby_cols=groupby_cols, subset_index=subset_index)\n\n    def create_index(self):\n        # Use e.g. `self.data_folder` to load the data.\n        return index"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.8.13"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}