DatasetSplitter#

class tpcp.validate.DatasetSplitter(base_splitter: int | BaseCrossValidator | Iterator | None = None, *, groupby: list[str] | str | None = None, stratify: list[str] | str | None = None)[source]#

Wrapper around sklearn cross-validation splitters to support grouping and stratification with tpcp-Datasets.

This wrapper can be used instead of a sklearn-style splitter with all methods that support a cv parameter. Whenever you need more complex cv logic (like grouping or stratification), this wrapper is the way to go.

Warning

We don’t validate whether the selected base_splitter does anything useful with the provided groupby and stratify information. This wrapper just ensures that the information is correctly extracted from the dataset and passed to the split method of the base_splitter. So if you are using a normal KFold splitter, the groupby and stratify arguments will have no effect.

Parameters:
base_splitter

The base splitter to use. Can be an integer (for KFold), an iterator, or any other valid sklearn-splitter. The default is None, which will use the sklearn default KFold splitter with 5 splits.

groupby

The column(s) to group by. If None, no grouping is done. Must be a subset of the columns in the dataset.

This will generate a set of unique string labels with the same shape as the dataset. These labels are passed to the base splitter as the groups parameter. It is up to the base splitter to decide what to do with the generated labels.
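
The grouping mechanism can be illustrated with plain sklearn, without a tpcp Dataset: the generated labels behave exactly like sklearn's groups argument. A minimal sketch, assuming hypothetical group labels derived from a "participant" column:

```python
from sklearn.model_selection import GroupKFold

# Hypothetical group labels, as DatasetSplitter would derive them
# from e.g. a "participant" column of the dataset index.
groups = ["p1", "p1", "p2", "p2", "p3", "p3"]
X = list(range(len(groups)))

# A group-aware base splitter keeps all samples of one group in the same fold.
splitter = GroupKFold(n_splits=3)
for train_idx, test_idx in splitter.split(X, groups=groups):
    train_groups = {groups[i] for i in train_idx}
    test_groups = {groups[i] for i in test_idx}
    # No group appears in both train and test.
    assert train_groups.isdisjoint(test_groups)
```

With a non-group-aware base splitter (e.g. plain KFold), the same labels would simply be ignored, as the warning above notes.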

stratify

The column(s) to stratify by. If None, no stratification is done. Must be a subset of the columns in the dataset.

This will generate a set of unique string labels with the same shape as the dataset. These labels are passed to the base splitter as the y parameter, acting as “mock” target labels, as sklearn only supports stratification on classification outcome targets. It is up to the base splitter to decide what to do with the generated labels.

Methods

clone()

Create a new instance of the class with all parameters copied over.

get_n_splits(dataset)

Get the number of splits.

get_params([deep])

Get parameters for this algorithm.

set_params(**params)

Set the parameters of this Algorithm.

split(dataset)

Split the dataset into train and test sets.

__init__(base_splitter: int | BaseCrossValidator | Iterator | None = None, *, groupby: list[str] | str | None = None, stratify: list[str] | str | None = None)[source]#
clone() Self[source]#

Create a new instance of the class with all parameters copied over.

This will create a new instance of the class itself and of all nested objects.

get_n_splits(dataset: Dataset) int[source]#

Get the number of splits.

get_params(deep: bool = True) dict[str, Any][source]#

Get parameters for this algorithm.

Parameters:
deep

Only relevant if the object contains nested algorithm objects. If this is the case and deep is True, the params of these nested objects are included in the output using a prefix like nested_object_name__ (note the two “_” at the end).

Returns:
params

Parameter names mapped to their values.

set_params(**params: Any) Self[source]#

Set the parameters of this Algorithm.

To set parameters of nested objects use nested_object_name__para_name=.
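
This is the same double-underscore convention sklearn uses for nested estimators; a sketch using a sklearn Pipeline in place of a tpcp object:

```python
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# "clf" is the nested object's name; "C" is the nested parameter.
pipe = Pipeline([("clf", SVC(C=1.0))])
pipe.set_params(clf__C=10.0)
assert pipe.get_params()["clf__C"] == 10.0
```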

split(dataset: Dataset) Iterator[tuple[list[int], list[int]]][source]#

Split the dataset into train and test sets.
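
The shape of the yielded values can be illustrated with the default base splitter (a 5-split KFold). This sketch uses sklearn directly rather than a tpcp Dataset; split yields (train_indices, test_indices) pairs in the same way:

```python
from sklearn.model_selection import KFold

# Stand-in for DatasetSplitter.split(dataset) with the default
# base_splitter (KFold with 5 splits) on a 10-row dataset.
kf = KFold(n_splits=5)
splits = list(kf.split(list(range(10))))

assert len(splits) == 5
train_idx, test_idx = splits[0]
assert len(train_idx) == 8 and len(test_idx) == 2
```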

Examples using tpcp.validate.DatasetSplitter#

Custom Dataset - Basics

Advanced cross-validation