DatasetSplitter
- class tpcp.validate.DatasetSplitter(base_splitter: int | BaseCrossValidator | Iterator | None = None, *, groupby: list[str] | str | None = None, stratify: list[str] | str | None = None, ignore_potentially_invalid_splitter_warning: bool = False)
Wrapper around sklearn cross-validation splitters to support grouping and stratification with tpcp-Datasets.
This wrapper can be used instead of a sklearn-style splitter with all methods that support a cv parameter. Whenever you want to do complicated cv-logic (like grouping or stratification), this wrapper is the way to go.
You can either select your own base splitter, or we will select from KFold, StratifiedKFold, GroupKFold, or StratifiedGroupKFold, depending on the provided groupby and stratify parameters.
Warning
If you use a custom splitter that does not support grouping or stratification, these parameters might be silently ignored.
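The default selection described above can be pictured with a small sketch. The helper `default_splitter_name` is hypothetical and not part of tpcp. It returns the sklearn class names as strings so the example needs no third-party imports, and the mapping is an assumption based on which sklearn splitter supports which feature.

```python
from __future__ import annotations


def default_splitter_name(
    groupby: list[str] | str | None,
    stratify: list[str] | str | None,
) -> str:
    """Return the name of the sklearn splitter that matches the request.

    Hypothetical helper for illustration only; tpcp's internal logic may differ.
    """
    if groupby is not None and stratify is not None:
        # Both grouping and stratification requested.
        return "StratifiedGroupKFold"
    if groupby is not None:
        # Only grouping requested.
        return "GroupKFold"
    if stratify is not None:
        # Only stratification requested.
        return "StratifiedKFold"
    # Neither requested: plain k-fold.
    return "KFold"


plain = default_splitter_name(None, None)
grouped = default_splitter_name("participant", None)
both = default_splitter_name(["participant"], ["condition"])
```

The column names `participant` and `condition` are placeholders for whatever index columns your dataset defines.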
- Parameters:
- base_splitter
The base splitter to use. Can be an integer (for KFold), an iterator, or any other valid sklearn-splitter. The default is None, which will use the sklearn default KFold splitter with 5 splits.
- groupby
The column(s) to group by. If None, no grouping is done. Must be a subset of the columns in the dataset.
This will generate a set of unique string labels with the same shape as the dataset. These labels will be passed to the base splitter as the groups parameter. It is up to the base splitter to decide what to do with the generated labels.
- stratify
The column(s) to stratify by. If None, no stratification is done. Must be a subset of the columns in the dataset.
This will generate a set of unique string labels with the same shape as the dataset. These labels will be passed to the base splitter as the y parameter, acting as “mock” target labels, as sklearn only supports stratification on classification outcome targets. It is up to the base splitter to decide what to do with the generated labels.
- ignore_potentially_invalid_splitter_warning
We try to detect whether the provided splitter supports grouping and stratification. If it does not, but you provided groupby or stratify columns, we will warn you. Note that this warning is not a perfect check, as it is not possible to detect all cases. If you know what you are doing and want to disable this warning, set this parameter to True.
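To make the label generation for groupby and stratify more concrete, here is a minimal sketch. It assumes the dataset index can be represented as a list of row dictionaries. The helper `make_labels` and the example columns are hypothetical, and the exact label format tpcp produces may differ.

```python
from __future__ import annotations


def make_labels(index_rows: list[dict[str, str]], columns: list[str] | str) -> list[str]:
    """Build one string label per index row from the selected columns.

    Hypothetical helper for illustration; not tpcp's actual implementation.
    """
    if isinstance(columns, str):
        columns = [columns]
    # The selected columns must be a subset of the columns in the index.
    for row in index_rows:
        missing = [c for c in columns if c not in row]
        if missing:
            raise ValueError(f"Columns not in dataset index: {missing}")
    # Rows with the same value combination receive the same label.
    return [str(tuple(row[c] for c in columns)) for row in index_rows]


index = [
    {"participant": "p1", "condition": "rest"},
    {"participant": "p1", "condition": "walk"},
    {"participant": "p2", "condition": "rest"},
]

groups = make_labels(index, "participant")  # would be passed as `groups`
mock_y = make_labels(index, ["condition"])  # would be passed as `y`
```

Both label lists have one entry per row of the index, which is what a sklearn-style splitter expects for its groups and y arguments.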
Methods
clone(): Create a new instance of the class with all parameters copied over.
get_n_splits(dataset): Get the number of splits.
get_params([deep]): Get parameters for this algorithm.
set_params(**params): Set the parameters of this Algorithm.
split(dataset): Split the dataset into train and test sets.
- __init__(base_splitter: int | BaseCrossValidator | Iterator | None = None, *, groupby: list[str] | str | None = None, stratify: list[str] | str | None = None, ignore_potentially_invalid_splitter_warning: bool = False)
- clone() -> Self
Create a new instance of the class with all parameters copied over.
This will create a new instance of the class itself and all nested objects.
- get_params(deep: bool = True) -> dict[str, Any]
Get parameters for this algorithm.
- Parameters:
- deep
Only relevant if object contains nested algorithm objects. If this is the case and deep is True, the params of these nested objects are included in the output using a prefix like nested_object_name__ (note the two “_” at the end).
- Returns:
- params
Parameter names mapped to their values.
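The double-underscore prefix used for nested parameters can be illustrated with a toy class. `SimpleParams` and the parameter names below are hypothetical and only mimic the sklearn-style convention described above. This is not tpcp's implementation.

```python
from __future__ import annotations

from typing import Any


class SimpleParams:
    """Toy object with sklearn-style parameter access (illustration only)."""

    def __init__(self, **params: Any) -> None:
        self._names = list(params)
        for name, value in params.items():
            setattr(self, name, value)

    def get_params(self, deep: bool = True) -> dict[str, Any]:
        out: dict[str, Any] = {}
        for name in self._names:
            value = getattr(self, name)
            out[name] = value
            # With deep=True, parameters of nested objects are added using
            # the "<name>__<nested_name>" prefix convention.
            if deep and hasattr(value, "get_params"):
                for sub_name, sub_value in value.get_params(deep=True).items():
                    out[f"{name}__{sub_name}"] = sub_value
        return out


inner = SimpleParams(window=5)
outer = SimpleParams(nested_object_name=inner, threshold=0.5)

shallow = outer.get_params(deep=False)
nested = outer.get_params(deep=True)
```

In this sketch, `nested` contains the extra key `nested_object_name__window`, while `shallow` only lists the top-level parameters.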