DatasetSplitter
- class tpcp.validate.DatasetSplitter(base_splitter: int | BaseCrossValidator | Iterator | None = None, *, groupby: list[str] | str | None = None, stratify: list[str] | str | None = None)
Wrapper around sklearn cross-validation splitters to support grouping and stratification with tpcp-Datasets.
This wrapper can be used instead of a sklearn-style splitter with all methods that support a cv parameter. Whenever you want to do complicated cv-logic (like grouping or stratification), this wrapper is the way to go.
Warning
We don’t validate whether the selected base_splitter does anything useful with the provided groupby and stratify information. This wrapper just ensures that the information is correctly extracted from the dataset and passed to the split method of the base_splitter. So if you are using a normal KFold splitter, the groupby and stratify arguments will have no effect.
- Parameters:
- base_splitter
The base splitter to use. Can be an integer (for KFold), an iterator, or any other valid sklearn-splitter. The default is None, which will use the sklearn default KFold splitter with 5 splits.
- groupby
The column(s) to group by. If None, no grouping is done. Must be a subset of the columns in the dataset.
This will generate a set of unique string labels with the same shape as the dataset. These will be passed to the base splitter as the groups parameter. It is up to the base splitter to decide what to do with the generated labels.
- stratify
The column(s) to stratify by. If None, no stratification is done. Must be a subset of the columns in the dataset.
This will generate a set of unique string labels with the same shape as the dataset. These will be passed to the base splitter as the y parameter, acting as “mock” target labels, as sklearn only supports stratification on classification outcome targets. It is up to the base splitter to decide what to do with the generated labels.
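The label mechanism described for groupby and stratify can be illustrated with a minimal pure-Python sketch. Everything in it is an assumption made for illustration: the index rows, the helper names make_string_labels and leave_one_group_out, and the "_" separator are hypothetical and are not taken from tpcp. In real use, the labels would be handed to a base splitter that consumes them, such as sklearn's GroupKFold (which uses groups) or StratifiedKFold (which uses y).

```python
from collections import defaultdict

# Hypothetical stand-in for a dataset index: one row per recording.
index = [
    {"participant": "p1", "condition": "walk"},
    {"participant": "p1", "condition": "run"},
    {"participant": "p2", "condition": "walk"},
    {"participant": "p2", "condition": "run"},
    {"participant": "p3", "condition": "walk"},
    {"participant": "p3", "condition": "run"},
]


def make_string_labels(rows, columns):
    """Build one string label per row from the selected column(s).

    The "_" separator is an assumption of this sketch.
    """
    if isinstance(columns, str):
        columns = [columns]
    return ["_".join(str(row[col]) for col in columns) for row in rows]


def leave_one_group_out(groups):
    """Toy group-aware splitter: each group forms the test set exactly once."""
    by_group = defaultdict(list)
    for position, label in enumerate(groups):
        by_group[label].append(position)
    for test_idx in by_group.values():
        train_idx = [i for i in range(len(groups)) if i not in test_idx]
        yield train_idx, test_idx


# Labels that would play the role of the `groups` argument.
groups = make_string_labels(index, "participant")
# Labels that would play the role of the mock `y` argument.
y = make_string_labels(index, "condition")
# Several columns are combined into a single label per row.
combined = make_string_labels(index, ["participant", "condition"])

folds = list(leave_one_group_out(groups))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

A splitter that ignores its groups argument would simply never look at these labels, which is the situation the warning above describes.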
Methods
- clone(): Create a new instance of the class with all parameters copied over.
- get_n_splits(dataset): Get the number of splits.
- get_params([deep]): Get parameters for this algorithm.
- set_params(**params): Set the parameters of this Algorithm.
- split(dataset): Split the dataset into train and test sets.
- __init__(base_splitter: int | BaseCrossValidator | Iterator | None = None, *, groupby: list[str] | str | None = None, stratify: list[str] | str | None = None)
- clone() → Self
Create a new instance of the class with all parameters copied over.
This will create a new instance of the class itself and all nested objects.
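As a rough illustration of what copying "the class itself and all nested objects" means, the following sketch uses a hypothetical ToySplitter class. It is not tpcp's implementation; it only assumes that a clone can be built by deep-copying the constructor parameters and passing them to a fresh instance.

```python
import copy


class ToySplitter:
    """Hypothetical parameter holder, used only to illustrate cloning."""

    def __init__(self, base=None, groupby=None):
        self.base = base
        self.groupby = groupby

    def get_params(self):
        # Return the constructor parameters by name.
        return {"base": self.base, "groupby": self.groupby}

    def clone(self):
        # Deep-copy the parameters so nested objects are new instances too.
        return type(self)(**copy.deepcopy(self.get_params()))


original = ToySplitter(base={"n_splits": 3}, groupby=["participant"])
cloned = original.clone()

# Same parameter values, but no objects shared with the original.
print(cloned.get_params() == original.get_params())
print(cloned.groupby is original.groupby)
```

Because nothing is shared, changing a parameter on the clone leaves the original untouched.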
- get_params(deep: bool = True) → dict[str, Any]
Get parameters for this algorithm.
- Parameters:
- deep
Only relevant if the object contains nested algorithm objects. If this is the case and deep is True, the params of these nested objects are included in the output using a prefix like nested_object_name__ (note the two “_” at the end).
- Returns:
- params
Parameter names mapped to their values.
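The double-underscore prefix convention for deep=True can be sketched with two hypothetical classes, Inner and Outer. This is a simplified illustration under the assumption that any parameter exposing its own get_params counts as a nested object; it is not the library's actual code.

```python
class Inner:
    """Hypothetical nested object with a single parameter."""

    def __init__(self, n_splits=5):
        self.n_splits = n_splits

    def get_params(self, deep=True):
        return {"n_splits": self.n_splits}


class Outer:
    """Hypothetical wrapper holding a nested object as a parameter."""

    def __init__(self, inner, groupby=None):
        self.inner = inner
        self.groupby = groupby

    def get_params(self, deep=True):
        params = {"inner": self.inner, "groupby": self.groupby}
        if deep:
            # Iterate over a snapshot, since new keys are added below.
            for name, value in list(params.items()):
                if hasattr(value, "get_params"):
                    for sub_name, sub_value in value.get_params(deep=True).items():
                        # Prefix nested params with "<name>__" (two underscores).
                        params[f"{name}__{sub_name}"] = sub_value
        return params


outer = Outer(Inner(n_splits=3), groupby="participant")
deep_params = outer.get_params(deep=True)
shallow_params = outer.get_params(deep=False)

print(sorted(deep_params))
print(sorted(shallow_params))
```

With deep=False only the top-level names appear; with deep=True the nested parameter is additionally exposed under the prefixed name inner__n_splits.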