DatasetSplitter

Wrapper around sklearn cross-validation splitters to support grouping and stratification with tpcp-Datasets.

This wrapper can be used in place of a sklearn-style splitter with any method that supports a cv parameter. Whenever you need more complex cv logic (such as grouping or stratification), this wrapper is the way to go.

You can either select your own base splitter, or we will select from KFold, StratifiedKFold, GroupKFold, or StratifiedGroupKFold, depending on the provided groupby and stratify parameters.
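The automatic choice can be pictured as a two-flag lookup. The sketch below is an assumption about that mapping, since the text only names the four candidate splitters. `pick_default_splitter` is a hypothetical helper, not part of tpcp, and it returns class names as strings so that it runs without scikit-learn installed.

```python
def pick_default_splitter(groupby=None, stratify=None):
    """Return the name of the sklearn splitter matching the given options.

    Hypothetical helper: the mapping below is an assumption about the
    selection logic, not the library's actual implementation.
    """
    use_groups = groupby is not None
    use_strata = stratify is not None
    if use_groups and use_strata:
        return "StratifiedGroupKFold"
    if use_groups:
        return "GroupKFold"
    if use_strata:
        return "StratifiedKFold"
    # Neither grouping nor stratification requested.
    return "KFold"


choice = pick_default_splitter(groupby="participant", stratify=["cohort"])
```

Under this assumed mapping, passing both parameters is the only case that needs a splitter handling groups and mock target labels at the same time.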

Warning

If you use a custom splitter that does not support grouping or stratification, these parameters might be silently ignored.

Parameters:

base_splitter

The base splitter to use. Can be an integer (for KFold), an iterator, or any other valid sklearn-splitter. The default is None, which will use the sklearn default KFold splitter with 5 splits.

groupby

The column(s) to group by. If None, no grouping is done. Must be a subset of the columns in the dataset.

This will generate a set of unique string labels with the same shape as the dataset. These labels will be passed to the base splitter as the groups parameter. It is up to the base splitter to decide what to do with the generated labels.
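To illustrate what such group labels might look like, here is a minimal sketch in plain Python. `make_labels` and `index_rows` are hypothetical names: `index_rows` stands in for the dataset index with one entry per datapoint, and joining the column values with "_" is an assumed label format, not necessarily the one the library uses.

```python
def make_labels(index_rows, columns):
    """Build one string label per row from the selected column(s).

    Hypothetical helper: the "_" join format is an assumption.
    """
    if isinstance(columns, str):
        columns = [columns]
    return ["_".join(str(row[col]) for col in columns) for row in index_rows]


# Stand-in for a dataset index: one dict per datapoint (assumed layout).
index_rows = [
    {"participant": "p1", "condition": "rest"},
    {"participant": "p1", "condition": "walk"},
    {"participant": "p2", "condition": "rest"},
    {"participant": "p2", "condition": "walk"},
]

# Grouping by participant: both recordings of a participant share a label.
groups = make_labels(index_rows, "participant")

# Grouping by several columns: one label per value combination.
fine_groups = make_labels(index_rows, ["participant", "condition"])
```

A group-aware splitter that receives `groups` would keep all rows with the same label on the same side of each train/test split.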

stratify

The column(s) to stratify by. If None, no stratification is done. Must be a subset of the columns in the dataset.

This will generate a set of unique string labels with the same shape as the dataset. These labels will be passed to the base splitter as the y parameter, acting as “mock” target labels, as sklearn only supports stratification on classification outcome targets. It is up to the base splitter to decide what to do with the generated labels.
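The “mock” target labels can be thought of as one class name per datapoint. The following sketch counts how many datapoints fall into each class, which is the distribution a stratified splitter tries to preserve in every fold. The index layout, the column names, and the "_" join format are assumptions made for illustration only.

```python
from collections import Counter

# Stand-in for a dataset index: one dict per datapoint (assumed layout).
index_rows = [
    {"participant": "p1", "cohort": "patient", "sex": "f"},
    {"participant": "p2", "cohort": "patient", "sex": "f"},
    {"participant": "p3", "cohort": "patient", "sex": "m"},
    {"participant": "p4", "cohort": "control", "sex": "m"},
    {"participant": "p5", "cohort": "control", "sex": "m"},
    {"participant": "p6", "cohort": "control", "sex": "f"},
]

stratify = ["cohort", "sex"]

# One mock target label per datapoint; the join format is an assumption.
mock_y = ["_".join(str(row[col]) for col in stratify) for row in index_rows]

# Class sizes that a stratified splitter would try to balance across folds.
class_counts = Counter(mock_y)
```

Stratifying by several columns multiplies the number of classes, so small datasets can quickly end up with classes that have fewer members than there are folds.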

ignore_potentially_invalid_splitter_warning

We try to detect whether the provided splitter supports grouping and stratification. If they are not supported, but you provided groupby or stratify columns, we will warn you. Note that this check is not perfect, as it is not possible to detect all cases. If you know what you are doing and want to disable this warning, set this parameter to True.

Methods

`clone`()	Create a new instance of the class with all parameters copied over.
`get_n_splits`(dataset)	Get the number of splits.
`get_params`([deep])	Get parameters for this algorithm.
`set_params`(**params)	Set the parameters of this Algorithm.
`split`(dataset)	Split the dataset into train and test sets.

__init__(base_splitter: int | BaseCrossValidator | Iterator | None = None, *, groupby: list[str] | str | None = None, stratify: list[str] | str | None = None, ignore_potentially_invalid_splitter_warning: bool = False)

clone() → Self

Create a new instance of the class with all parameters copied over.

This will create a new instance of the class itself and of all nested objects.

get_n_splits(dataset: Dataset) → int

Get the number of splits.

get_params(deep: bool = True) → dict[str, Any]

Get parameters for this algorithm.

Parameters:

deep

Only relevant if the object contains nested algorithm objects. If this is the case and deep is True, the params of these nested objects are included in the output using a prefix like nested_object_name__ (note the two “_” at the end).
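The prefix convention can be illustrated with a small toy class. This is a sketch of the general sklearn-style pattern and not the library's implementation; `ToyParams` is a hypothetical stand-in that only mimics the naming of nested parameters.

```python
class ToyParams:
    """Hypothetical stand-in that stores keyword arguments as parameters."""

    def __init__(self, **params):
        self._param_names = list(params)
        for name, value in params.items():
            setattr(self, name, value)

    def get_params(self, deep=True):
        params = {}
        for name in self._param_names:
            value = getattr(self, name)
            params[name] = value
            # With deep=True, nested objects contribute their own params,
            # prefixed with "<name>__" (two underscores).
            if deep and hasattr(value, "get_params"):
                for sub_name, sub_value in value.get_params(deep=True).items():
                    params[f"{name}__{sub_name}"] = sub_value
        return params


inner = ToyParams(n_splits=3)
outer = ToyParams(base_splitter=inner, groupby="participant")

deep_params = outer.get_params(deep=True)
shallow_params = outer.get_params(deep=False)
```

Here `deep_params` contains the extra key `base_splitter__n_splits`, while `shallow_params` only lists the top-level parameters.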

Returns: