tpcp.Dataset

class tpcp.Dataset(*, groupby_cols: Optional[Union[List[str], str]] = None, subset_index: Optional[pandas.core.frame.DataFrame] = None)

Base class for tpcp Dataset objects.

This class provides fundamental functionality like iteration, getting subsets, and compatibility with sklearn’s cross validation helpers.

For more information check out the examples and user guides on datasets.

Parameters
groupby_cols

A column name or a list of column names that should be used to group the index before iterating over it. For examples see below.

subset_index

For all classes that inherit from this class, subset_index must be None and the index must be created in the create_index method. If the base class is used directly, subset_index must be a DataFrame containing the index the dataset should represent. For examples see below.

Attributes
index

Get index.

grouped_index

Return the index with the groupby columns set as multiindex.

groups

Get all groups based on the set groupby level.

shape

Get the shape of the dataset.

Examples

This class is usually not meant to be used directly, but the following code snippets show some common operations that can be expected to work for all dataset subclasses.

>>> import pandas as pd
>>> from itertools import product
>>>
>>> from tpcp import Dataset
>>>
>>> test_index = pd.DataFrame(
...     list(product(("patient_1", "patient_2", "patient_3"), ("test_1", "test_2"), ("1", "2"))),
...     columns=["patient", "test", "extra"],
... )
>>> # We create a little dummy dataset by passing an index directly to `subset_index`
>>> dataset = Dataset(subset_index=test_index)
>>> dataset
Dataset [12 groups/rows]

         patient    test extra
   0   patient_1  test_1     1
   1   patient_1  test_1     2
   2   patient_1  test_2     1
   3   patient_1  test_2     2
   4   patient_2  test_1     1
   5   patient_2  test_1     2
   6   patient_2  test_2     1
   7   patient_2  test_2     2
   8   patient_3  test_1     1
   9   patient_3  test_1     2
   10  patient_3  test_2     1
   11  patient_3  test_2     2

We can loop over the dataset. By default, we will loop over each row.

>>> for r in dataset[:2]:
...     print(r)
Dataset [1 groups/rows]

        patient    test extra
   0  patient_1  test_1     1
Dataset [1 groups/rows]

        patient    test extra
   0  patient_1  test_1     2

We can also change the grouping (either via groupby_cols in the init or afterwards via groupby) to loop over other combinations. If we group by patient and test, we loop over all patient-test combinations.

>>> grouped_dataset = dataset.groupby(["patient", "test"])
>>> grouped_dataset  
Dataset [6 groups/rows]

                       patient    test extra
   patient   test
   patient_1 test_1  patient_1  test_1     1
             test_1  patient_1  test_1     2
             test_2  patient_1  test_2     1
             test_2  patient_1  test_2     2
   patient_2 test_1  patient_2  test_1     1
             test_1  patient_2  test_1     2
             test_2  patient_2  test_2     1
             test_2  patient_2  test_2     2
   patient_3 test_1  patient_3  test_1     1
             test_1  patient_3  test_1     2
             test_2  patient_3  test_2     1
             test_2  patient_3  test_2     2
>>> for r in grouped_dataset[:2]:
...     print(r)  
Dataset [1 groups/rows]

                       patient    test extra
   patient   test
   patient_1 test_1  patient_1  test_1     1
             test_1  patient_1  test_1     2
Dataset [1 groups/rows]

                       patient    test extra
   patient   test
   patient_1 test_2  patient_1  test_2     1
             test_2  patient_1  test_2     2

To iterate over the unique values of a specific level, use the iter_level method:

>>> for r in list(grouped_dataset.iter_level("patient"))[:2]:
...     print(r)  
Dataset [2 groups/rows]

                       patient    test extra
   patient   test
   patient_1 test_1  patient_1  test_1     1
             test_1  patient_1  test_1     2
             test_2  patient_1  test_2     1
             test_2  patient_1  test_2     2
Dataset [2 groups/rows]

                       patient    test extra
   patient   test
   patient_2 test_1  patient_2  test_1     1
             test_1  patient_2  test_1     2
             test_2  patient_2  test_2     1
             test_2  patient_2  test_2     2

We can also get arbitrary subsets from the dataset:

>>> subset = grouped_dataset.get_subset(patient=["patient_1", "patient_2"], extra="2")
>>> subset  
Dataset [4 groups/rows]

                       patient    test extra
   patient   test
   patient_1 test_1  patient_1  test_1     2
             test_2  patient_1  test_2     2
   patient_2 test_1  patient_2  test_1     2
             test_2  patient_2  test_2     2

If we want to use datasets in combination with GroupKFold, we can generate valid group labels as follows.

Note

You usually do not want to combine create_group_labels with self.groupby.

>>> # We are using the ungrouped dataset again!
>>> group_labels = dataset.create_group_labels(["patient", "test"])
>>> pd.concat([dataset.index, pd.Series(group_labels, name="group_labels")], axis=1)
      patient    test extra         group_labels
0   patient_1  test_1     1  (patient_1, test_1)
1   patient_1  test_1     2  (patient_1, test_1)
2   patient_1  test_2     1  (patient_1, test_2)
3   patient_1  test_2     2  (patient_1, test_2)
4   patient_2  test_1     1  (patient_2, test_1)
5   patient_2  test_1     2  (patient_2, test_1)
6   patient_2  test_2     1  (patient_2, test_2)
7   patient_2  test_2     2  (patient_2, test_2)
8   patient_3  test_1     1  (patient_3, test_1)
9   patient_3  test_1     2  (patient_3, test_1)
10  patient_3  test_2     1  (patient_3, test_2)
11  patient_3  test_2     2  (patient_3, test_2)
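To illustrate why a group-aware splitter needs one label per row, the sketch below hand-rolls a simple round-robin split in plain Python. It is a stand-in written for this example, not sklearn's GroupKFold, and the helper name group_folds is hypothetical. The labels are rebuilt by hand to match the table above.

```python
from itertools import product

# One (patient, test) label per row, matching the table above.
rows = list(product(("patient_1", "patient_2", "patient_3"), ("test_1", "test_2"), ("1", "2")))
group_labels = [(patient, test) for patient, test, _ in rows]


def group_folds(labels, n_splits):
    """Yield (train_idx, test_idx) so that no label is split across both sides."""
    unique = sorted(set(labels))
    for fold in range(n_splits):
        # Every n_splits-th unique label goes to the test side of this fold.
        test_groups = set(unique[fold::n_splits])
        test_idx = [i for i, label in enumerate(labels) if label in test_groups]
        train_idx = [i for i, label in enumerate(labels) if label not in test_groups]
        yield train_idx, test_idx


folds = list(group_folds(group_labels, n_splits=3))
for train_idx, test_idx in folds:
    train_labels = {group_labels[i] for i in train_idx}
    test_labels = {group_labels[i] for i in test_idx}
    # Rows sharing a label never end up on both sides of a split.
    assert not train_labels & test_labels
```

Because both rows of each patient-test combination share a label, they always move together between the train and test side.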

Methods

assert_is_single(groupby_cols, property_name)

Raise an error if the index contains more than one group/row with the given groupby settings.

clone()

Create a new instance of the class with all parameters copied over.

create_group_labels(label_cols)

Generate a list of labels for each group/row in the dataset.

create_index()

Create the full index for the dataset.

get_params([deep])

Get parameters for this algorithm.

get_subset(*[, groups, index, bool_map])

Get a subset of the dataset.

groupby(groupby_cols)

Return a copy of the dataset grouped by the specified columns.

is_single(groupby_cols)

Return True if index contains only one row/group with the given groupby settings.

iter_level(level)

Return generator object containing a subset for every category from the selected level.

set_params(**params)

Set the parameters of this Algorithm.

__init__(*, groupby_cols: Optional[Union[List[str], str]] = None, subset_index: Optional[pandas.core.frame.DataFrame] = None)
_repr_html_() -> str

Return html representation of the dataset object.

assert_is_single(groupby_cols: Optional[Union[List[str], str]], property_name) -> None

Raise an error if the index contains more than one group/row with the given groupby settings.

This should be used when implementing access to data values that can only be accessed when a single trial/participant/etc. exists in the dataset.

Parameters
groupby_cols

None (no grouping) or a valid subset of the columns available in the dataset index.

property_name

Name of the property this check is used in. Used to format the error message.

clone() -> typing_extensions.Self

Create a new instance of the class with all parameters copied over.

This will create a new instance of the class itself and all nested objects.

create_group_labels(label_cols: Union[str, List[str]])

Generate a list of labels for each group/row in the dataset.

Note

This has a different use case than the dataset-wide groupby. Using groupby reduces the effective size of the dataset to the number of groups. This method produces a group label for each group/row that is already in the dataset, without changing the dataset.

The output of this method can be used in combination with GroupKFold as the group label.

Parameters
label_cols

The columns that should be included in the label. If the dataset is already grouped, this must be a subset of self.groupby_cols.
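The difference between grouping and labelling can be sketched with plain pandas. This only mimics the idea on a bare index DataFrame and is not how tpcp implements either operation; the variable names are chosen for illustration.

```python
from itertools import product

import pandas as pd

index = pd.DataFrame(
    list(product(("patient_1", "patient_2", "patient_3"), ("test_1", "test_2"), ("1", "2"))),
    columns=["patient", "test", "extra"],
)
label_cols = ["patient", "test"]

# Grouping: each unique (patient, test) combination counts as one data point.
n_groups = index.groupby(label_cols).ngroups

# Labelling: every row keeps its place and receives the label of its group.
labels = list(index[label_cols].itertuples(index=False, name=None))

print(n_groups, len(labels))
```

Grouping shrinks the number of data points to the number of unique combinations, while labelling leaves the number of rows unchanged.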

create_index() -> pandas.core.frame.DataFrame

Create the full index for the dataset.

This needs to be implemented by the subclass.

get_params(deep: bool = True) -> Dict[str, Any]

Get parameters for this algorithm.

Parameters
deep

Only relevant if the object contains nested algorithm objects. If this is the case and deep is True, the params of these nested objects are included in the output using a prefix like nested_object_name__ (note the two “_” at the end).

Returns
params

Parameter names mapped to their values.

get_subset(*, groups: Optional[List[Union[str, Tuple[str, ...]]]] = None, index: Optional[pandas.core.frame.DataFrame] = None, bool_map: Optional[Sequence[bool]] = None, **kwargs: Union[List[str], str]) -> typing_extensions.Self

Get a subset of the dataset.

Note

All arguments are mutually exclusive!

Parameters
groups

A valid row locator or slice that can be passed to self.grouped_index.loc[locator, :]. This basically needs to be a subset of self.groups. Note that this is the only indexer that works on the grouped index. All other indexers work on the pure index.

index

pd.DataFrame that is a valid subset of the current dataset index.

bool_map

Boolean sequence that is used to index the current index DataFrame. It must have the same length as the number of rows in the index.

**kwargs

The key must be the name of an index column. The value is a list containing strings that correspond to the categories that should be kept. For examples see above.

Returns
subset

New dataset object filtered by specified parameters.

groupby(groupby_cols: Optional[Union[List[str], str]]) -> typing_extensions.Self

Return a copy of the dataset grouped by the specified columns.

Each unique group represents a single data point in the resulting dataset.

Parameters
groupby_cols

None (no grouping) or a valid subset of the columns available in the dataset index.

is_single(groupby_cols: Optional[Union[List[str], str]]) -> bool

Return True if index contains only one row/group with the given groupby settings.

If groupby_cols=None this checks if there is only a single row left.

Parameters
groupby_cols

None (no grouping) or a valid subset of the columns available in the dataset index.

iter_level(level: str) -> Iterator[typing_extensions.Self]

Return generator object containing a subset for every category from the selected level.

Parameters
level

The name of the level that is used for iterating. This must be one of the column names of the index.

Returns
subset

New dataset object containing only one category in the specified level.

set_params(**params: Any) -> typing_extensions.Self

Set the parameters of this Algorithm.

To set parameters of nested objects, use nested_object_name__para_name=value.

Examples using tpcp.Dataset