cross_validate
- tpcp.validate.cross_validate(optimizable: BaseOptimize, dataset: Dataset, *, groups: list[str | tuple[str, ...]] | None = None, mock_labels: list[str | tuple[str, ...]] | None = None, scoring: Callable | None = None, cv: int | BaseCrossValidator | Iterator | None = None, n_jobs: int | None = None, verbose: int = 0, optimize_params: dict[str, Any] | None = None, propagate_groups: bool = True, propagate_mock_labels: bool = True, pre_dispatch: str | int = '2*n_jobs', return_train_score: bool = False, return_optimizer: bool = False, progress_bar: bool = True)
Evaluate a pipeline on a dataset using cross validation.
This function follows the interface of sklearn's cross_validate as closely as possible. If the tpcp documentation is missing some information, the corresponding sklearn documentation might be helpful.
- Parameters:
- optimizable
An optimizable class instance like GridSearch/GridSearchCV or a Pipeline wrapped in an Optimize object (OptimizablePipeline).
- dataset
A Dataset containing all information.
- groups
Group labels for samples used by the cross validation helper, in case a grouped CV is used (e.g. GroupKFold). Check the documentation of the Dataset class and the respective example for information on how to generate group labels for tpcp datasets.
The groups will be passed to the optimizer’s optimize method under the same name if propagate_groups is True.
- mock_labels
The value of mock_labels is passed as the y parameter to the cross-validation helper’s split method. This can be helpful if you want to use stratified cross validation. Usually, the stratified CV classes use y (i.e. the label) to stratify the data. However, in tpcp, we don’t have a dedicated y, as data and labels are both stored in a single data structure. If you want to stratify the data (e.g. based on patient cohorts), you can create your own list of labels/groups that should be used for stratification and pass it to mock_labels instead.
The labels will be passed to the optimizer’s optimize method under the same name if propagate_mock_labels is True (similar to how groups are handled).
- scoring
A callable that can score a single data point given a pipeline. This function should return either a single score or a dictionary of scores. If scoring is None, the default score method of the optimizable is used instead.
- cv
An integer specifying the number of folds in a K-Fold cross validation or a valid cross validation helper. The default (None) will result in a 5-fold cross validation. For further inputs check the sklearn documentation.
- n_jobs
Number of jobs to run in parallel. One job is created per CV fold. The default (None) means 1 job at a time, hence no parallel computing.
- verbose
The verbosity level (larger number -> higher verbosity). At the moment this only affects Parallel.
- optimize_params
Additional parameters that are forwarded to the optimize method.
- propagate_groups
In case your optimizable is a cross validation based optimizer (e.g. GridSearchCV) and you are using a grouped cross validation, you probably want to use the same grouped CV for the outer and the inner cross validation. If propagate_groups is True, the group labels belonging to the training set of each fold are passed to the optimize method of the optimizable. This only has an effect if groups are specified.
- propagate_mock_labels
For the same reason as propagate_groups, you might also want to forward the value provided for mock_labels to the optimization workflow.
- pre_dispatch
The number of jobs that should be pre-dispatched. For an explanation see the documentation of Parallel.
- return_train_score
If True, the performance on the train set is returned in addition to the test set performance. Note that this increases the runtime. If True, the fields train_data_labels, train_score, and train_single_score are available in the results.
- return_optimizer
Whether the optimized instance of the input optimizable should be returned. If True, the field optimizer is available in the results.
- progress_bar
True/False to enable/disable a tqdm progress bar.
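To make the role of groups and propagate_groups more concrete, here is a minimal sketch using only the Python standard library. It does not call tpcp or sklearn. The helper grouped_folds and the participant ids are hypothetical, and the round-robin assignment is a simplification, not the algorithm used by GroupKFold. It only illustrates that a grouped split keeps each group on one side of a fold, and that only the training part of the group labels would be forwarded to the inner optimization.

```python
def grouped_folds(groups, n_folds):
    """Yield (train_idx, test_idx) so that no group appears on both sides.

    Simplified round-robin assignment of groups to folds; this is an
    illustration and not the algorithm used by sklearn's GroupKFold.
    """
    unique_groups = sorted(set(groups))
    fold_of_group = {g: i % n_folds for i, g in enumerate(unique_groups)}
    for fold in range(n_folds):
        test_idx = [i for i, g in enumerate(groups) if fold_of_group[g] == fold]
        train_idx = [i for i, g in enumerate(groups) if fold_of_group[g] != fold]
        yield train_idx, test_idx


# Hypothetical group labels, one per data point (e.g. a participant id).
groups = ["p1", "p1", "p2", "p2", "p3", "p3"]

folds = list(grouped_folds(groups, n_folds=3))
train_groups_per_fold = []
for train_idx, test_idx in folds:
    # Only the labels of the training data points would be handed on
    # to an inner, group-aware optimization.
    train_groups = [groups[i] for i in train_idx]
    test_groups = [groups[i] for i in test_idx]
    assert set(train_groups).isdisjoint(test_groups)
    train_groups_per_fold.append(train_groups)
```

The same idea applies to mock_labels: the list has one entry per data point, and the training subset of each fold is what an inner stratified split would receive.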
- Returns:
- result_dict
Dictionary with results. Each element is either a list or array of length n_folds. The dictionary can be directly passed into the pandas DataFrame constructor for a better representation.
The following fields are in the results:
- test_score / test_{scorer-name}
The aggregated value of a score over all data-points. If a single score is used for scoring, then the generic name “score” is used. Otherwise, multiple columns with the name of the respective scorer exist.
- test_single_score / test_single_{scorer-name}
The individual scores per data point per fold. This is a list of values with length len(test_set).
- test_data_labels
A list of data labels of the test set in the order the single score values are provided. These can be used to associate the single_score values with a certain data point.
- train_score / train_{scorer-name}
Results for train set of each fold.
- train_single_score / train_single_{scorer-name}
Results for individual data points in the train set of each fold.
- train_data_labels
The data labels for the train set.
- optimize_time
Time required to optimize the pipeline in each fold.
- score_time
Cumulative score time to score all data points in the test set.
- optimizer
The optimized instances per fold. One instance per fold is returned. The optimized version of the pipeline can be obtained via the optimized_pipeline_ attribute on the instance.
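The following sketch shows one way the per-fold fields could be combined after a run. The results dictionary is written by hand with made-up values for a hypothetical 2-fold run with a single scorer. It is not output produced by tpcp, and the exact form of the data labels is an assumption. Only the field names are taken from the list above.

```python
# Hand-written stand-in for the returned dictionary (hypothetical values).
results = {
    "test_score": [0.8, 0.6],
    "test_single_score": [[0.9, 0.7], [0.5, 0.7]],
    "test_data_labels": [[("p1",), ("p2",)], [("p3",), ("p4",)]],
}

# Associate every single score with its data label and the fold it came from.
per_datapoint = {}
for fold, (labels, scores) in enumerate(
    zip(results["test_data_labels"], results["test_single_score"])
):
    for label, score in zip(labels, scores):
        per_datapoint[label] = {"fold": fold, "score": score}
```

Because every field holds one entry per fold, the same dictionary can also be passed to the pandas DataFrame constructor to get one row per fold.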