GridSearchCV#

class tpcp.optimize.GridSearchCV(pipeline: OptimizablePipelineT, parameter_grid: ParameterGrid, *, scoring: Callable[[OptimizablePipelineT, DatasetT], T | Aggregator[Any] | dict[str, T | Aggregator[Any]] | dict[str, T | Aggregator[Any] | dict[str, T | Aggregator[Any]]]] | Scorer[OptimizablePipelineT, DatasetT, T | Aggregator[Any] | dict[str, T | Aggregator[Any]]] | None = None, return_optimized: bool | str = True, cv: int | BaseCrossValidator | Iterator | None = None, pure_parameters: bool | list[str] = False, return_train_score: bool = False, verbose: int = 0, n_jobs: int | None = None, pre_dispatch: int | str = 'n_jobs', progress_bar: bool = True, safe_optimize: bool = True, optimize_with_info: bool = True)[source]#

Exhaustive (hyper)parameter search using a cross validation based score to optimize pipeline parameters.

This class follows the interface of sklearn's GridSearchCV as closely as possible. If the tpcp documentation is missing some information, the respective sklearn documentation might be helpful.

Compared to the sklearn implementation, this class adds a couple of tpcp-specific optimizations and quality-of-life improvements.
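A minimal usage sketch (MyPipeline and MyDataset stand in for your own OptimizablePipeline and Dataset subclasses, and the result_/reference_ attributes used in the scoring function are hypothetical):

from sklearn.model_selection import ParameterGrid
from tpcp.optimize import GridSearchCV

def scoring(pipeline, datapoint):
    # Score a single data point: run the fold-optimized pipeline and compare
    # its output to the reference. Higher scores rank better by default,
    # so we negate the absolute error.
    pipeline = pipeline.safe_run(datapoint)
    return -abs(pipeline.result_ - datapoint.reference_)

parameter_grid = ParameterGrid({"threshold": [0.1, 0.5], "window_size": [50, 100]})
gs = GridSearchCV(MyPipeline(), parameter_grid, scoring=scoring, cv=5)
gs = gs.optimize(MyDataset())

print(gs.best_params_)
optimized_pipeline = gs.optimized_pipeline_  # re-optimized on all data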

Parameters:
pipeline

A tpcp pipeline implementing self_optimize.

parameter_grid

A sklearn parameter grid to define the search space for the grid search.

scoring

A callable that can score a single data point given a pipeline. This function should return either a single score or a dictionary of scores. If scoring is None the default score method of the pipeline is used instead.

Note

If scoring returns a dictionary, return_optimized must be set to the name of the score that should be used for ranking.

return_optimized

If True, a pipeline object with the overall best parameters is created and re-optimized using all provided data as input. The optimized pipeline object is stored as optimized_pipeline_. If scoring returns a dictionary of score values, this must be a str corresponding to the name of the score that should be used to rank the results. If False, the respective result attributes will not be populated. If multiple parameter combinations have the same mean score over all CV folds, the one tested first is used. By default, the combination with the best rank (i.e. the highest score) is selected. If you want to select the combination with the lowest score instead, set return_optimized to the name of the score prefixed with a minus sign, e.g. -rmse. In case of a single score, use -score to select the combination with the lowest score.
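For example, with a dictionary-returning scorer you can rank by the lowest mean RMSE over all folds (the score names and the compute_* helpers are hypothetical):

def scoring(pipeline, datapoint):
    pipeline = pipeline.safe_run(datapoint)
    return {
        "precision": compute_precision(pipeline, datapoint),  # hypothetical helper
        "rmse": compute_rmse(pipeline, datapoint),  # hypothetical helper
    }

# Rank by (and re-optimize for) the lowest mean "rmse" across folds.
gs = GridSearchCV(MyPipeline(), parameter_grid, scoring=scoring, return_optimized="-rmse")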

cv

An integer specifying the number of folds in a K-Fold cross validation or a valid cross validation helper. The default (None) will result in a 5-fold cross validation. For further inputs check the sklearn documentation.

pure_parameters

Warning

Do not use this option unless you fully understand it!

A list of parameter names (as named in the parameter_grid) that do not affect training, i.e. that are not hyperparameters. This information can be used for massive performance improvements, as the training does not need to be repeated if only one of these parameters changes. However, setting it incorrectly can lead to errors in the final results that are very hard to detect.

Instead of passing a list of names, you can also just set the value to True. In this case all parameters of the provided pipeline that are marked as PureParameter are used. Note that pure parameters of nested objects are not considered, but only top-level attributes. If you need to mark nested parameters as pure, use the first method and pass the names (with __) as part of the list of names.

For more information on this approach see the evaluation guide.
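As a sketch, assume MyPipeline has a hyperparameter model_order that influences self_optimize and a hypothetical threshold parameter that only affects run:

gs = GridSearchCV(
    MyPipeline(),
    ParameterGrid({"model_order": [2, 3], "threshold": [0.1, 0.5, 0.9]}),
    scoring=scoring,
    # Training is cached and only repeated when `model_order` changes; all
    # three `threshold` values reuse the same trained pipeline per fold.
    pure_parameters=["threshold"],
)

For nested parameters, pass the full double-underscore name in the list (e.g. a hypothetical detector__threshold).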

return_train_score

If True, the performance on the train set is returned in addition to the test set performance. Note that this increases the runtime. If True, the fields train_score and train_score_single are available in the results.

verbose

Control the verbosity of the information printed during the optimization (larger number -> higher verbosity). At the moment, this only affects the caching done when pure_parameters are provided.

n_jobs

The number of parallel jobs. The default (None) means 1 job at a time, i.e. no parallel computing. -1 means as many jobs as logical processing cores. One job is created per combination of CV fold and parameter candidate.

pre_dispatch

The number of jobs that should be pre-dispatched. For an explanation see the sklearn documentation of GridSearchCV.

progress_bar

True/False to enable/disable a tqdm progress bar.

safe_optimize

If True, we add additional checks to make sure the self_optimize method of the pipeline is correctly implemented. See make_optimize_safe for more info.

optimize_with_info

If True, Optimize will try to call self_optimize_with_info by default and will fall back to self_optimize. If you want to force the optimization to use self_optimize, even if an implementation of self_optimize_with_info exists, set this parameter to False.

Other Parameters:
dataset

The dataset instance passed to the optimize method.

groups

The groups passed to the optimize method.

mock_labels

The mock labels passed to the optimize method.

Attributes:
cv_results_

A dictionary summarizing all results of the grid search. The format of this dictionary is designed to be directly passed into the pandas DataFrame constructor (see the example after the entry list). Each row then represents the result for one set of parameters.

The dictionary contains the following entries:

param_{parameter_name}

The value of a respective parameter.

params

A dictionary representing all parameters.

mean_test_score / mean_test_{scorer_name}

The average test score over all folds. If a single score is used for scoring, then the generic name “score” is used. Otherwise, multiple columns with the name of the respective scorer exist.

std_test_score / std_test_{scorer_name}

The std of the test scores over all folds.

rank_test_score / rank_test_{scorer_name}

The rank of the mean test score assuming higher values are better.

split{n}_test_score / split{n}_test_{scorer_name}

The performance on the test set in fold n.

split{n}_test_single_score / split{n}_test_single_{scorer_name}

The performance in fold n on every single data point in the test set.

split{n}_test_data_labels

The ids of the data points used in the test set of fold n.

mean_train_score / mean_train_{scorer_name}

The average train score over all folds.

std_train_score / std_train_{scorer_name}

The std of the train scores over all folds.

split{n}_train_score / split{n}_train_{scorer_name}

The performance on the train set in fold n.

rank_train_score / rank_train_{scorer_name}

The rank of the mean train score assuming higher values are better.

split{n}_train_single_score / split{n}_train_single_{scorer_name}

The performance in fold n on every single data point in the train set.

split{n}_train_data_labels

The ids of the data points used in the train set of fold n.

mean_{optimize/score}_time

Average time over all folds spent for optimization and scoring, respectively.

std_{optimize/score}_time

Standard deviation of the optimize/score times over all folds.
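As the format is DataFrame-compatible, the results can be inspected directly with pandas (the column names below are the documented single-score defaults):

import pandas as pd

results = pd.DataFrame(gs.cv_results_)
# One row per parameter candidate; rank 1 marks the best mean test score.
print(results.sort_values("rank_test_score")[["params", "mean_test_score", "std_test_score"]])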

optimized_pipeline_

An instance of the input pipeline with the best parameter set. This is only available if return_optimized is not False.

best_params_

The parameter dict that resulted in the best result. This is only available if return_optimized is not False.

best_index_

The index of the result row in the output (cv_results_) that corresponds to the best parameter combination. This is only available if return_optimized is not False.

best_score_

The score of the best result. In a multimetric case, only the value of the scorer specified by return_optimized is provided. This is only available if return_optimized is not False.

multimetric_

Whether the scorer returned multiple scores.

final_optimize_time_

Time spent to perform the final optimization on all data. This is only available if return_optimized is not False.

Methods

clone()

Create a new instance of the class with all parameters copied over.

get_params([deep])

Get parameters for this algorithm.

optimize(dataset, *[, groups, mock_labels])

Run the GridSearchCV on the given dataset.

run(datapoint)

Run the optimized pipeline.

safe_run(datapoint)

Run the optimized pipeline.

score(datapoint)

Run the score method of the optimized pipeline.

set_params(**params)

Set the parameters of this Algorithm.

__init__(pipeline: OptimizablePipelineT, parameter_grid: ParameterGrid, *, scoring: Callable[[OptimizablePipelineT, DatasetT], T | Aggregator[Any] | dict[str, T | Aggregator[Any]] | dict[str, T | Aggregator[Any] | dict[str, T | Aggregator[Any]]]] | Scorer[OptimizablePipelineT, DatasetT, T | Aggregator[Any] | dict[str, T | Aggregator[Any]]] | None = None, return_optimized: bool | str = True, cv: int | BaseCrossValidator | Iterator | None = None, pure_parameters: bool | list[str] = False, return_train_score: bool = False, verbose: int = 0, n_jobs: int | None = None, pre_dispatch: int | str = 'n_jobs', progress_bar: bool = True, safe_optimize: bool = True, optimize_with_info: bool = True) → None[source]#
_format_results(candidate_params, n_splits, out, more_results=None)[source]#

Format the final result dict.

This function is adapted from sklearn’s BaseSearchCV.

clone() → Self[source]#

Create a new instance of the class with all parameters copied over.

This will create a new instance of the class itself and all nested objects.

get_params(deep: bool = True) → dict[str, Any][source]#

Get parameters for this algorithm.

Parameters:
deep

Only relevant if the object contains nested algorithm objects. If this is the case and deep is True, the params of these nested objects are included in the output using a prefix like nested_object_name__ (note the two underscores at the end).

Returns:
params

Parameter names mapped to their values.

optimize(dataset: DatasetT, *, groups=None, mock_labels=None, **optimize_params) → Self[source]#

Run the GridSearchCV on the given dataset.

Parameters:
dataset

The dataset to optimize on.

groups

An optional set of group labels that are passed to the cross-validation helper.

mock_labels

An optional set of mocked labels that are passed to the cross-validation helper as the y parameter. This can be helpful in combination with the Stratified*Fold cross-validation helpers, that use the y parameter to stratify the folds.
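A sketch of stratified splitting via mock_labels (how the labels are derived depends on your dataset; the metadata access shown here is hypothetical):

from sklearn.model_selection import StratifiedKFold

dataset = MyDataset()
# One label per data point; passed to the splitter as its `y` argument.
labels = [dp.metadata["severity"] for dp in dataset]  # hypothetical metadata

gs = GridSearchCV(MyPipeline(), parameter_grid, scoring=scoring, cv=StratifiedKFold(n_splits=3))
gs = gs.optimize(dataset, mock_labels=labels)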

run(datapoint: DatasetT) → PipelineT[source]#

Run the optimized pipeline.

This is a wrapper to maintain API compatibility with Pipeline.

safe_run(datapoint: DatasetT) → PipelineT[source]#

Run the optimized pipeline.

This is a wrapper to maintain API compatibility with Pipeline.

score(datapoint: DatasetT) → float | dict[str, float][source]#

Run the score method of the optimized pipeline.

This is a wrapper to maintain API compatibility with Pipeline.

set_params(**params: Any) → Self[source]#

Set the parameters of this Algorithm.

To set parameters of nested objects use nested_object_name__para_name=.
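For example (threshold is a hypothetical parameter of the wrapped pipeline):

gs = gs.set_params(n_jobs=-1, pipeline__threshold=0.2)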

Examples using tpcp.optimize.GridSearchCV#

Optimizable Pipelines

GridSearchCV

Custom Optuna Optimizer