tpcp.Dataset#
- class tpcp.Dataset(*, groupby_cols: List[str] | str | None = None, subset_index: DataFrame | None = None)[source]#
Baseclass for tpcp Dataset objects.
This class provides fundamental functionality like iteration, getting subsets, and compatibility with sklearn's cross-validation helpers.
For more information, check out the examples and user guides on datasets.
- Parameters:
- groupby_cols
A column name or a list of column names that should be used to group the index before iterating over it. For examples see below.
- subset_index
For all classes that inherit from this class, subset_index must be None by default. Instead, subclasses must provide a create_index method that returns a DataFrame representing the index.
- Attributes:
- index
Get the index.
- grouped_index
Return the index with the groupby columns set as a multi-index.
- groups
Get all groups based on the set groupby level.
- shape
Get the shape of the dataset.
Examples
This class is usually not meant to be used directly, but the following code snippets show some common operations that can be expected to work for all dataset subclasses.
>>> import pandas as pd
>>> from itertools import product
>>>
>>> from tpcp import Dataset
>>>
>>> test_index = pd.DataFrame(
...     list(product(("patient_1", "patient_2", "patient_3"), ("test_1", "test_2"), ("1", "2"))),
...     columns=["patient", "test", "extra"],
... )
>>> # We create a little dummy dataset by passing an index directly to `test_index`
>>> # Usually we would create a subclass with a `create_index` method that returns a DataFrame representing the
>>> # index.
>>> dataset = Dataset(subset_index=test_index)
>>> dataset
Dataset [12 groups/rows]
      patient    test extra
0   patient_1  test_1     1
1   patient_1  test_1     2
2   patient_1  test_2     1
3   patient_1  test_2     2
4   patient_2  test_1     1
5   patient_2  test_1     2
6   patient_2  test_2     1
7   patient_2  test_2     2
8   patient_3  test_1     1
9   patient_3  test_1     2
10  patient_3  test_2     1
11  patient_3  test_2     2
We can loop over the dataset. By default, we will loop over each row.
>>> for r in dataset[:2]:
...     print(r)
Dataset [1 groups/rows]
     patient    test extra
0  patient_1  test_1     1
Dataset [1 groups/rows]
     patient    test extra
0  patient_1  test_1     2
We can also change groupby (either in the init or afterwards) to loop over other combinations. If we select the level test, we will loop over all patient-test combinations.
>>> grouped_dataset = dataset.groupby(["patient", "test"])
>>> grouped_dataset
Dataset [6 groups/rows]
                    patient    test extra
patient   test
patient_1 test_1  patient_1  test_1     1
          test_1  patient_1  test_1     2
          test_2  patient_1  test_2     1
          test_2  patient_1  test_2     2
patient_2 test_1  patient_2  test_1     1
          test_1  patient_2  test_1     2
          test_2  patient_2  test_2     1
          test_2  patient_2  test_2     2
patient_3 test_1  patient_3  test_1     1
          test_1  patient_3  test_1     2
          test_2  patient_3  test_2     1
          test_2  patient_3  test_2     2
>>> for r in grouped_dataset[:2]:
...     print(r)
Dataset [1 groups/rows]
                    patient    test extra
patient   test
patient_1 test_1  patient_1  test_1     1
          test_1  patient_1  test_1     2
Dataset [1 groups/rows]
                    patient    test extra
patient   test
patient_1 test_2  patient_1  test_2     1
          test_2  patient_1  test_2     2
To iterate over the unique values of a specific level use the “iter_level” function:
>>> for r in list(grouped_dataset.iter_level("patient"))[:2]:
...     print(r)
Dataset [2 groups/rows]
                    patient    test extra
patient   test
patient_1 test_1  patient_1  test_1     1
          test_1  patient_1  test_1     2
          test_2  patient_1  test_2     1
          test_2  patient_1  test_2     2
Dataset [2 groups/rows]
                    patient    test extra
patient   test
patient_2 test_1  patient_2  test_1     1
          test_1  patient_2  test_1     2
          test_2  patient_2  test_2     1
          test_2  patient_2  test_2     2
We can also get arbitrary subsets from the dataset:
>>> subset = grouped_dataset.get_subset(patient=["patient_1", "patient_2"], extra="2")
>>> subset
Dataset [4 groups/rows]
                    patient    test extra
patient   test
patient_1 test_1  patient_1  test_1     2
          test_2  patient_1  test_2     2
patient_2 test_1  patient_2  test_1     2
          test_2  patient_2  test_2     2
If we want to use datasets in combination with GroupKFold, we can generate valid group labels as follows. These group labels are strings representing the unique values of the index at the specified levels.
Note
You usually don't want to use that in combination with self.groupby.
>>> # We are using the ungrouped dataset again!
>>> group_labels = dataset.create_group_labels(["patient", "test"])
>>> pd.concat([dataset.index, pd.Series(group_labels, name="group_labels")], axis=1)
      patient    test extra             group_labels
0   patient_1  test_1     1  ('patient_1', 'test_1')
1   patient_1  test_1     2  ('patient_1', 'test_1')
2   patient_1  test_2     1  ('patient_1', 'test_2')
3   patient_1  test_2     2  ('patient_1', 'test_2')
4   patient_2  test_1     1  ('patient_2', 'test_1')
5   patient_2  test_1     2  ('patient_2', 'test_1')
6   patient_2  test_2     1  ('patient_2', 'test_2')
7   patient_2  test_2     2  ('patient_2', 'test_2')
8   patient_3  test_1     1  ('patient_3', 'test_1')
9   patient_3  test_1     2  ('patient_3', 'test_1')
10  patient_3  test_2     1  ('patient_3', 'test_2')
11  patient_3  test_2     2  ('patient_3', 'test_2')
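The labels produced above can be reproduced outside of tpcp to see how they line up with a grouped splitter. The following sketch (not part of tpcp; the helper logic is illustrative) builds one label per row from the unique values of the chosen index levels, exactly the shape a splitter like sklearn's GroupKFold expects as its `groups` argument:

```python
# Sketch (not tpcp's implementation): build GroupKFold-style labels with plain pandas.
from itertools import product

import pandas as pd

test_index = pd.DataFrame(
    list(product(("patient_1", "patient_2", "patient_3"), ("test_1", "test_2"), ("1", "2"))),
    columns=["patient", "test", "extra"],
)

# One label per row, built from the values of the selected index levels.
# Rows sharing a (patient, test) combination share a label, so a grouped
# splitter keeps them in the same fold.
group_labels = [
    str(t) for t in test_index[["patient", "test"]].itertuples(index=False, name=None)
]

print(len(group_labels), len(set(group_labels)))  # → 12 6
```

Twelve rows collapse into six distinct labels, matching the six patient-test groups shown in the grouped dataset above.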
Methods

as_attrs()
Return a version of the Dataset class that can be subclassed using attrs defined classes.

as_dataclass()
Return a version of the Dataset class that can be subclassed using dataclasses.

assert_is_single(groupby_cols, property_name)
Raise error if index does contain more than one group/row with the given groupby settings.

assert_is_single_group(property_name)
Raise error if index does contain more than one group/row.

clone()
Create a new instance of the class with all parameters copied over.

create_group_labels(label_cols)
Generate a list of labels for each group/row in the dataset.

create_index()
Create the full index for the dataset.

get_params([deep])
Get parameters for this algorithm.

get_subset(*[, groups, index, bool_map])
Get a subset of the dataset.

groupby(groupby_cols)
Return a copy of the dataset grouped by the specified columns.

is_single(groupby_cols)
Return True if index contains only one row/group with the given groupby settings.

is_single_group()
Return True if index contains only one group.

iter_level(level)
Return generator object containing a subset for every category from the selected level.

set_params(**params)
Set the parameters of this algorithm.
- __init__(*, groupby_cols: List[str] | str | None = None, subset_index: DataFrame | None = None)[source]#
- _create_check_index()[source]#
Check the index creation.
We create the index twice to check if the index creation is deterministic. If not we raise an error. This is fundamentally important for datasets to be deterministic. While we can not catch all related issues (i.e. determinism across different machines), this should catch the most obvious ones.
In case, creating the index twice is too expensive, users can overwrite this method. But better to catch errors early.
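The double-creation check described above can be sketched in a few lines. Note that the function name below is illustrative, not tpcp's internal API; it only shows the principle of building the index twice and failing loudly on a mismatch:

```python
# Sketch of a determinism check: create the index twice and compare.
import pandas as pd

def checked_index(create_index):
    """Call ``create_index`` twice and verify both runs return the same frame."""
    first = create_index()
    second = create_index()
    if not first.equals(second):
        raise RuntimeError("create_index is not deterministic!")
    return first

# A deterministic index builder passes the check unchanged.
index = checked_index(
    lambda: pd.DataFrame({"patient": ["p1", "p1", "p2"], "test": ["t1", "t2", "t1"]})
)
```

A builder that returns different rows on each call (e.g. one relying on random numbers) would raise on the second comparison instead of silently producing a shifting index.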
- static as_attrs()[source]#
Return a version of the Dataset class that can be subclassed using attrs defined classes.
Note, this requires attrs to be installed!
- static as_dataclass()[source]#
Return a version of the Dataset class that can be subclassed using dataclasses.
- assert_is_single(groupby_cols: List[str] | str | None, property_name) None [source]#
Raise error if index does contain more than one group/row with the given groupby settings.
This should be used when implementing access to data values, which can only be accessed when only a single trial/participant/etc. exists in the dataset.
- Parameters:
- groupby_cols
None (no grouping) or a valid subset of the columns available in the dataset index.
- property_name
Name of the property this check is used in. Used to format the error message.
- assert_is_single_group(property_name) None [source]#
Raise error if index does contain more than one group/row.
Note that this is different from assert_is_single as it is aware of the current grouping. Instead of checking that a certain combination of columns is left in the dataset, it checks that only a single group exists with the already selected grouping as defined by self.groupby_cols.
- Parameters:
- property_name
Name of the property this check is used in. Used to format the error message.
- clone() Self [source]#
Create a new instance of the class with all parameters copied over.
This will create a new instance of the class itself and of all nested objects.
- create_group_labels(label_cols: str | List[str]) List[str] [source]#
Generate a list of labels for each group/row in the dataset.
Note
This has a different use case than the dataset-wide groupby. Using groupby reduces the effective size of the dataset to the number of groups. This method produces a group label for each group/row that is already in the dataset, without changing the dataset.
The output of this method can be used in combination with GroupKFold as the group label.
- Parameters:
- label_cols
The columns that should be included in the label. If the dataset is already grouped, this must be a subset of self.groupby_cols.
- create_index() DataFrame [source]#
Create the full index for the dataset.
This needs to be implemented by the subclass.
Warning
Make absolutely sure that the dataframe you return is deterministic and does not change between runs! Otherwise, this can lead to some nasty bugs! We try to catch them internally, but it is not always possible. As tips: avoid relying on random numbers, and make sure the order does not depend on things like file-system order when creating an index by scanning a directory. Particularly nasty are cases using non-sorted containers like set, which sometimes maintain their order but sometimes don't. At the very least, we recommend sorting the final dataframe you return in create_index.
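A hypothetical example of the directory-scanning advice above (file names and the parsing helper are made up for illustration): sorting the listing is what keeps the resulting index stable across runs and machines, because os.listdir order is file-system dependent.

```python
# Illustrative sketch: build deterministic index rows from a scanned directory.
import os
import tempfile

def build_index_rows(data_dir):
    """Collect (patient, test) rows from file names like 'patient_1-test_2.csv'."""
    rows = []
    # sorted() is the crucial part: os.listdir gives no ordering guarantee.
    for name in sorted(os.listdir(data_dir)):
        stem = name.rsplit(".", 1)[0]
        patient, _, test = stem.partition("-")
        rows.append((patient, test))
    return rows

# Files are created in a scrambled order; the resulting index is still sorted.
with tempfile.TemporaryDirectory() as d:
    for name in ("patient_2-test_1.csv", "patient_1-test_2.csv", "patient_1-test_1.csv"):
        open(os.path.join(d, name), "w").close()
    rows = build_index_rows(d)
print(rows)
```

In a real create_index, the rows would then be wrapped in a DataFrame (and ideally sorted again via sort_values as a final safeguard).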
- get_params(deep: bool = True) Dict[str, Any] [source]#
Get parameters for this algorithm.
- Parameters:
- deep
Only relevant if the object contains nested algorithm objects. If this is the case and deep is True, the params of these nested objects are included in the output using a prefix like nested_object_name__ (note the two trailing underscores).
- Returns:
- params
Parameter names mapped to their values.
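The prefix convention can be sketched with a toy class (the class and its parameters are made up; tpcp derives the real parameter list from __init__ automatically):

```python
# Minimal sketch of the `nested_object_name__` prefix convention.
class Detector:
    def __init__(self, threshold=0.5, nested=None):
        self.threshold = threshold
        self.nested = nested

    def get_params(self, deep=True):
        params = {"threshold": self.threshold, "nested": self.nested}
        if deep and isinstance(self.nested, Detector):
            # Nested params are flattened with a "<attribute_name>__" prefix.
            for key, value in self.nested.get_params(deep=True).items():
                params[f"nested__{key}"] = value
        return params

outer = Detector(threshold=0.1, nested=Detector(threshold=0.9))
print(sorted(outer.get_params(deep=True)))
# → ['nested', 'nested__nested', 'nested__threshold', 'threshold']
```

This mirrors sklearn's convention, which is what makes the flattened keys usable in tools like grid search parameter dictionaries.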
- get_subset(*, groups: List[str | Tuple[str, ...]] | None = None, index: DataFrame | None = None, bool_map: Sequence[bool] | None = None, **kwargs: List[str] | str) Self [source]#
Get a subset of the dataset.
Note
All arguments are mutually exclusive!
- Parameters:
- groups
A valid row locator or slice that can be passed to self.grouped_index.loc[locator, :]. This basically needs to be a subset of self.groups. Note that this is the only indexer that works on the grouped index. All other indexers work on the pure index.
- index
pd.DataFrame that is a valid subset of the current dataset index.
- bool_map
bool-map that is used to index the current index-dataframe. The list must be of the same length as the number of rows in the index.
- **kwargs
The key must be the name of an index column. The value is a list containing strings that correspond to the categories that should be kept. For examples see above.
- Returns:
- subset
New dataset object filtered by specified parameters.
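The **kwargs filtering semantics can be sketched with plain pandas (the helper function is made up for illustration; tpcp applies equivalent logic to the dataset's index internally):

```python
# Sketch of keyword-based subsetting on an index DataFrame.
from itertools import product

import pandas as pd

index = pd.DataFrame(
    list(product(("patient_1", "patient_2", "patient_3"), ("test_1", "test_2"), ("1", "2"))),
    columns=["patient", "test", "extra"],
)

def subset_by_kwargs(index, **kwargs):
    """Keep only rows whose column values are in the given category lists."""
    mask = pd.Series(True, index=index.index)
    for column, values in kwargs.items():
        if isinstance(values, str):  # a single string means a single category
            values = [values]
        mask &= index[column].isin(values)
    return index[mask].reset_index(drop=True)

# Same filter as the `get_subset` example in the class docstring: 4 rows remain.
sub = subset_by_kwargs(index, patient=["patient_1", "patient_2"], extra="2")
```

Each keyword narrows the index further (a logical AND across columns), which is why combining `patient=["patient_1", "patient_2"]` with `extra="2"` leaves exactly the four matching rows.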
- groupby(groupby_cols: List[str] | str | None) Self [source]#
Return a copy of the dataset grouped by the specified columns.
Each unique group represents a single data point in the resulting dataset.
- Parameters:
- groupby_cols
None (no grouping) or a valid subset of the columns available in the dataset index.
- is_single(groupby_cols: List[str] | str | None) bool [source]#
Return True if index contains only one row/group with the given groupby settings.
If groupby_cols=None, this checks if there is only a single row left. If you want to check if there is only a single group within the current grouping, use is_single_group instead.
- Parameters:
- groupby_cols
None (no grouping) or a valid subset of the columns available in the dataset index.
- iter_level(level: str) Iterator[Self] [source]#
Return generator object containing a subset for every category from the selected level.
- Parameters:
- level
Optional str that sets the level which shall be used for iterating. This must be one of the column names of the index.
- Returns:
- subset
New dataset object containing only one category in the specified level.
Examples using tpcp.Dataset#
Custom Dataset - A real world example
Algorithms - A real world example: QRS-Detection
Grid Search optimal Algorithm Parameter