tpcp.validate.cross_validate
- tpcp.validate.cross_validate(optimizable: tpcp._optimize.BaseOptimize, dataset: tpcp._dataset.Dataset, *, groups: Optional[List[Union[str, Tuple[str, ...]]]] = None, scoring: Optional[Callable] = None, cv: Optional[Union[int, sklearn.model_selection._split.BaseCrossValidator, Iterator]] = None, n_jobs: Optional[int] = None, verbose: int = 0, optimize_params: Optional[Dict[str, Any]] = None, propagate_groups: bool = True, pre_dispatch: Union[str, int] = '2*n_jobs', return_train_score: bool = False, return_optimizer: bool = False, error_score: Union[Literal['raise'], float] = nan, progress_bar: bool = True)
Evaluate a pipeline on a dataset using cross validation.
This function follows the interface of sklearn's cross_validate as closely as possible. If the tpcp documentation is missing some information, the corresponding sklearn documentation might be helpful.
- Parameters
- optimizable
An optimizable class instance like GridSearch/GridSearchCV or a Pipeline wrapped in an Optimize object (OptimizablePipeline).
- dataset
A Dataset containing all information.
- groups
Group labels for samples used by the cross validation helper, in case a grouped CV is used (e.g. GroupKFold). Check the documentation of the Dataset class and the respective example for information on how to generate group labels for tpcp datasets.
- scoring
A callable that can score a single data point given a pipeline. This function should return either a single score or a dictionary of scores. If scoring is None, the default score method of the optimizable is used instead.
- cv
An integer specifying the number of folds in a K-Fold cross validation or a valid cross validation helper. The default (None) will result in a 5-fold cross validation. For further inputs check the sklearn documentation.
- n_jobs
Number of jobs to run in parallel. One job is created per CV fold. The default (None) means 1 job at a time, hence no parallel computing.
- verbose
The verbosity level (larger number -> higher verbosity). At the moment this only affects Parallel.
- optimize_params
Additional parameters that are forwarded to the optimize method.
- propagate_groups
In case your optimizable is a cross validation based optimizer (e.g. GridSearchCV) and you are using a grouped cross validation, you probably want to use the same grouped CV for the outer and the inner cross validation. If propagate_groups is True, the group labels belonging to the training set of each fold are passed to the optimize method of the optimizable. This only has an effect if groups are specified.
- pre_dispatch
The number of jobs that should be pre-dispatched. For an explanation see the documentation of Parallel.
- return_train_score
If True, the performance on the train set is returned in addition to the test set performance. Note that this increases the runtime. If True, the fields train_data_labels, train_score, and train_single_score are available in the results.
- return_optimizer
Whether the optimized instance of the input optimizable should be returned. If True, the field optimizer is available in the results.
- error_score
Value to assign to the score if an error occurs during scoring. If set to ‘raise’, the error is raised. If a numeric value is given, a Warning is raised.
- progress_bar
True/False to enable/disable a tqdm progress bar.
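The interplay of groups, scoring, and error_score described above can be illustrated with a small standard-library sketch. This is not tpcp's implementation: group_k_fold, toy_cross_validate, optimize, and score are hypothetical names introduced only for illustration, and the grouped split is a simplified stand-in for a helper such as GroupKFold.

```python
import math
import warnings
from statistics import mean


def group_k_fold(groups, n_splits):
    """Yield (train_idx, test_idx) so that no group is split across folds.

    Simplified stand-in for a grouped CV helper (hypothetical, not sklearn's code).
    """
    unique = sorted(set(groups))
    for fold in range(n_splits):
        test_groups = set(unique[fold::n_splits])
        test_idx = [i for i, g in enumerate(groups) if g in test_groups]
        train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
        yield train_idx, test_idx


def toy_cross_validate(optimize, score, data, groups, n_splits, error_score=math.nan):
    """Hypothetical per-fold loop: optimize on train, score each test datapoint."""
    results = {"test_score": [], "test_single_score": [], "test_data_labels": []}
    for train_idx, test_idx in group_k_fold(groups, n_splits):
        model = optimize([data[i] for i in train_idx])
        single = []
        for i in test_idx:
            try:
                single.append(score(model, data[i]))
            except Exception as exc:
                if error_score == "raise":
                    raise
                # A numeric error_score replaces the failed score and warns.
                warnings.warn(f"Scoring failed for datapoint {i}: {exc}")
                single.append(error_score)
        results["test_score"].append(mean(single))
        results["test_single_score"].append(single)
        results["test_data_labels"].append(test_idx)
    return results


# Toy data: the "model" is the train mean, the score is the absolute deviation.
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
groups = ["a", "a", "b", "b", "c", "c"]


def optimize(train):
    return mean(train)


def score(model, point):
    return abs(point - model)


results = toy_cross_validate(optimize, score, data, groups, n_splits=3)
```

Each fold holds out exactly one group, so datapoints that share a group label never appear in both the train and the test set of the same fold.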
- Returns
- result_dict
Dictionary with results. Each element is either a list or array of length n_folds. The dictionary can be directly passed into the pandas DataFrame constructor for a better representation.
The following fields are in the results:
- test_score / test_{scorer-name}
The aggregated value of a score over all data-points. If a single score is used for scoring, then the generic name “score” is used. Otherwise, multiple columns with the name of the respective scorer exist.
- test_single_score / test_single_{scorer-name}
The individual scores per datapoint per fold. This is a list of values with length len(test_set).
- test_data_labels
A list of data labels of the test set in the order the single score values are provided. These can be used to associate the single_score values with a certain data-point.
- train_score / train_{scorer-name}
Results for train set of each fold.
- train_single_score / train_single_{scorer-name}
Results for individual data points in the train set of each fold.
- train_data_labels
The data labels for the train set.
- optimize_time
Time required to optimize the pipeline in each fold.
- score_time
Cumulative score time to score all data points in the test set.
- optimizer
The optimized instances per fold. One instance per fold is returned. The optimized version of the pipeline can be obtained via the optimized_pipeline_ attribute on the instance.
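As a rough illustration of how the per-fold fields line up, the sketch below flattens a hand-written result dictionary into one row per datapoint. The dictionary values are made up, the label format (one tuple per datapoint) is an assumption, and per_datapoint_rows is a hypothetical helper that is not part of tpcp.

```python
# Hypothetical result_dict for a 2-fold run with a single scorer.
# All values are invented; only the key names follow the field list above.
result_dict = {
    "test_score": [0.8, 0.6],
    "test_single_score": [[0.9, 0.7], [0.5, 0.7]],
    "test_data_labels": [[("p1",), ("p2",)], [("p3",), ("p4",)]],
    "optimize_time": [0.12, 0.10],
    "score_time": [0.03, 0.04],
}


def per_datapoint_rows(results):
    """Pair each single score with its data label, one row per datapoint."""
    rows = []
    folds = zip(results["test_data_labels"], results["test_single_score"])
    for fold, (labels, scores) in enumerate(folds):
        for label, score in zip(labels, scores):
            rows.append({"fold": fold, "label": label, "score": score})
    return rows


rows = per_datapoint_rows(result_dict)
```

Because every top-level value has length n_folds, the same dictionary could also be handed to the pandas DataFrame constructor to get one row per fold.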