tpcp.optimize.GridSearchCV#
- class tpcp.optimize.GridSearchCV(pipeline: OptimizablePipelineT, parameter_grid: ParameterGrid, *, scoring: Optional[Union[Callable[[OptimizablePipelineT, DatasetT], Union[T, Aggregator[Any], Dict[str, Union[T, Aggregator[Any]]], Dict[str, Union[T, Aggregator[Any], Dict[str, Union[T, Aggregator[Any]]]]]]], Scorer[OptimizablePipelineT, DatasetT, Union[T, Aggregator[Any], Dict[str, Union[T, Aggregator[Any]]]]]]] = None, return_optimized: Union[bool, str] = True, cv: Optional[Union[int, BaseCrossValidator, Iterator]] = None, pure_parameters: Union[bool, List[str]] = False, return_train_score: bool = False, verbose: int = 0, n_jobs: Optional[int] = None, pre_dispatch: Union[int, str] = 'n_jobs', progress_bar: bool = True, safe_optimize: bool = True)[source]#
Exhaustive (hyper)parameter search using a cross validation based score to optimize pipeline parameters.
This class follows the interface of sklearn's GridSearchCV as closely as possible. If the tpcp documentation is missing some information, the respective sklearn documentation might be helpful. Compared to the sklearn implementation, this method uses a couple of tpcp-specific optimizations and quality-of-life improvements.
- Parameters:
- pipeline
A tpcp pipeline implementing self_optimize.
- parameter_grid
A sklearn parameter grid to define the search space for the grid search.
- scoring
A callable that can score a single data point given a pipeline. This function should return either a single score or a dictionary of scores. If scoring is None, the default score method of the pipeline is used instead.
Note
If scoring returns a dictionary, return_optimized must be set to the name of the score that should be used for ranking.
- return_optimized
If True, a pipeline object with the overall best parameters is created and re-optimized using all provided data as input. The optimized pipeline object is stored as optimized_pipeline_. If scoring returns a dictionary of score values, this must be a str corresponding to the name of the score that should be used to rank the results. If False, the respective result attributes will not be populated. If multiple parameter combinations have the same mean score over all CV folds, the one tested first will be used. Otherwise, higher mean values are always considered better.
- cv
An integer specifying the number of folds in a K-Fold cross validation or a valid cross validation helper. The default (None) will result in a 5-fold cross validation. For further inputs check the sklearn documentation.
- pure_parameters
Warning
Do not use this option unless you fully understand it!
A list of parameter names (named in the parameter_grid) that do not affect training, i.e. that are not hyperparameters. This information can be used for massive performance improvements, as the training does not need to be repeated if one of these parameters changes. However, setting it incorrectly can lead to errors that are very hard to detect in the final results.
Instead of passing a list of names, you can also set the value to True. In this case, all parameters of the provided pipeline that are marked as pure_parameter are used. Note that only top-level attributes are considered, not pure parameters of nested objects. If you need to mark nested parameters as pure, use the first method and pass the names (with __) as part of the list of names.
For more information on this approach see the evaluation guide.
- return_train_score
If True, the performance on the train set is returned in addition to the test set performance. Note that this increases the runtime. If True, the fields train_score and train_score_single are available in the results.
- verbose
Control the verbosity of information printed during the optimization (larger number -> higher verbosity). At the moment, this only affects the caching performed when pure_parameter_names are provided.
- n_jobs
The number of parallel jobs. The default (None) means 1 job at a time, hence, no parallel computing. -1 means as many jobs as there are logical processing cores. One job is created per combination of CV fold and parameter set.
- pre_dispatch
The number of jobs that should be pre-dispatched. For an explanation see the documentation of sklearn's GridSearchCV.
- progress_bar
True/False to enable/disable a tqdm progress bar.
- safe_optimize
If True, we add additional checks to make sure the self_optimize method of the pipeline is correctly implemented. See make_optimize_safe for more info.
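Conceptually, the search performed by this class can be sketched in a few lines of plain Python. Everything below (the `optimize` and `score` callables, the naive `k_fold` helper) is a hypothetical stand-in, not the actual tpcp API; the sketch only illustrates the optimize-on-train / score-on-test loop over every parameter combination and fold, and the tie-breaking rule described above:

```python
from itertools import product
from statistics import mean

def expand_grid(grid: dict) -> list:
    """Expand {"a": [1, 2], "b": [3]} into a list of concrete parameter sets."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*grid.values())]

def k_fold(n_points: int, n_splits: int = 5):
    """Yield (train_indices, test_indices) pairs, mimicking a simple K-Fold."""
    indices = list(range(n_points))
    fold_size = n_points // n_splits
    for i in range(n_splits):
        test = indices[i * fold_size:(i + 1) * fold_size]
        train = [idx for idx in indices if idx not in test]
        yield train, test

def grid_search_cv(data, grid, optimize, score, n_splits=5):
    """Return the parameter set with the best mean test score (ties: first tested wins)."""
    best_params, best_mean = None, float("-inf")
    for params in expand_grid(grid):
        fold_scores = []
        for train, test in k_fold(len(data), n_splits):
            # "optimize" plays the role of the pipeline's self_optimize on the train split.
            model = optimize([data[i] for i in train], params)
            fold_scores.append(mean(score(model, data[i]) for i in test))
        mean_score = mean(fold_scores)
        if mean_score > best_mean:  # strictly better, so the first tested wins ties
            best_params, best_mean = params, mean_score
    return best_params, best_mean
```

The real class additionally parallelizes these inner loops (`n_jobs`), caches training results for pure parameters, and records all per-fold values in `cv_results_`.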
- Other Parameters:
- dataset
The dataset instance passed to the optimize method.
- groups
The groups passed to the optimize method.
- mock_labels
The mock labels passed to the optimize method.
- Attributes:
- cv_results_
A dictionary summarizing all results of the grid search. The format of this dictionary is designed to be directly passed into the pd.DataFrame constructor. Each column then represents the result for one set of parameters.
The dictionary contains the following entries:
- param_{parameter_name}
The value of a respective parameter.
- params
A dictionary representing all parameters.
- mean_test_score / mean_test_{scorer_name}
The average test score over all folds. If a single score is used for scoring, then the generic name “score” is used. Otherwise, multiple columns with the name of the respective scorer exist.
- std_test_score / std_test_{scorer_name}
The std of the test scores over all folds.
- rank_test_score / rank_{scorer_name}
The rank of the mean test score assuming higher values are better.
- split{n}_test_score / split{n}_test_{scorer_name}
The performance on the test set in fold n.
- split{n}_test_single_score / split{n}_test_single_{scorer_name}
The performance in fold n on every single data point in the test set.
- split{n}_test_data_labels
The ids of the data points used in the test set of fold n.
- mean_train_score / mean_train_{scorer_name}
The average train score over all folds.
- std_train_score / std_train_{scorer_name}
The std of the train scores over all folds.
- split{n}_train_score / split{n}_train_{scorer_name}
The performance on the train set in fold n.
- rank_train_score / rank_{scorer_name}
The rank of the mean train score assuming higher values are better.
- split{n}_train_single_score / split{n}_train_single_{scorer_name}
The performance in fold n on every single datapoint in the train set.
- split{n}_train_data_labels
The ids of the data points used in the train set of fold n.
- mean_{optimize/score}_time
Average time over all folds spent for optimization and scoring, respectively.
- std_{optimize/score}_time
Standard deviation of the optimize/score times over all folds.
- optimized_pipeline_
An instance of the input pipeline with the best parameter set. This is only available if return_optimized is not False.
- best_params_
The parameter dict that resulted in the best result. This is only available if return_optimized is not False.
- best_index_
The index of the result row in the output that corresponds to the best parameter set. This is only available if return_optimized is not False.
- best_score_
The score of the best result. In a multimetric case, only the value of the scorer specified by return_optimized is provided. This is only available if return_optimized is not False.
- multimetric_
Whether the scorer returned multiple scores.
- final_optimize_time_
Time spent to perform the final optimization on all data. This is only available if
return_optimized
is not False.
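As a concrete illustration of the aggregated columns in cv_results_, here is how the mean, std, and rank entries can be derived from per-fold test scores. The scores are made up and the snippet is plain Python, not tpcp's internal code (tpcp's ranking may resolve ties differently):

```python
from statistics import mean, pstdev

# Hypothetical per-fold test scores for three parameter candidates (5 folds each).
split_scores = [
    [0.70, 0.72, 0.68, 0.71, 0.69],  # candidate 0
    [0.80, 0.82, 0.78, 0.81, 0.79],  # candidate 1
    [0.75, 0.77, 0.73, 0.76, 0.74],  # candidate 2
]

# mean_test_score / std_test_score: aggregate over the splits of each candidate.
mean_test_score = [mean(scores) for scores in split_scores]
std_test_score = [pstdev(scores) for scores in split_scores]

# rank_test_score: 1 = best, assuming higher mean values are better.
order = sorted(range(len(mean_test_score)), key=lambda i: mean_test_score[i], reverse=True)
rank_test_score = [0] * len(mean_test_score)
for rank, candidate in enumerate(order, start=1):
    rank_test_score[candidate] = rank
```

Here candidate 1 has the highest mean score, so its rank is 1 and best_index_ would point at its row.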
Methods
- clone(): Create a new instance of the class with all parameters copied over.
- get_params([deep]): Get parameters for this algorithm.
- optimize(dataset, *[, groups, mock_labels]): Run the GridSearchCV on the given dataset.
- run(datapoint): Run the optimized pipeline.
- safe_run(datapoint): Run the optimized pipeline.
- score(datapoint): Run the score method of the optimized pipeline.
- set_params(**params): Set the parameters of this Algorithm.
- __init__(pipeline: OptimizablePipelineT, parameter_grid: ParameterGrid, *, scoring: Optional[Union[Callable[[OptimizablePipelineT, DatasetT], Union[T, Aggregator[Any], Dict[str, Union[T, Aggregator[Any]]], Dict[str, Union[T, Aggregator[Any], Dict[str, Union[T, Aggregator[Any]]]]]]], Scorer[OptimizablePipelineT, DatasetT, Union[T, Aggregator[Any], Dict[str, Union[T, Aggregator[Any]]]]]]] = None, return_optimized: Union[bool, str] = True, cv: Optional[Union[int, BaseCrossValidator, Iterator]] = None, pure_parameters: Union[bool, List[str]] = False, return_train_score: bool = False, verbose: int = 0, n_jobs: Optional[int] = None, pre_dispatch: Union[int, str] = 'n_jobs', progress_bar: bool = True, safe_optimize: bool = True) None [source]#
- _format_results(candidate_params, n_splits, out, more_results=None)[source]#
Format the final result dict.
This function is adapted from sklearn's BaseSearchCV.
- clone() Self [source]#
Create a new instance of the class with all parameters copied over.
This will create a new instance of the class itself and of all nested objects.
- get_params(deep: bool = True) Dict[str, Any] [source]#
Get parameters for this algorithm.
- Parameters:
- deep
Only relevant if the object contains nested algorithm objects. If this is the case and deep is True, the params of these nested objects are included in the output using a prefix like nested_object_name__ (note the two underscores at the end).
- Returns:
- params
Parameter names mapped to their values.
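The double-underscore prefixing can be illustrated with a toy class. This is not tpcp's actual implementation, just a minimal sketch of the naming convention the parameter describes:

```python
class Toy:
    """Minimal stand-in for an algorithm with one nested algorithm object."""

    def __init__(self, threshold, nested=None):
        self.threshold = threshold
        self.nested = nested

    def get_params(self, deep: bool = True) -> dict:
        params = {"threshold": self.threshold, "nested": self.nested}
        if deep and isinstance(self.nested, Toy):
            # Expose nested params under the "<attribute_name>__" prefix.
            for name, value in self.nested.get_params(deep=True).items():
                params[f"nested__{name}"] = value
        return params

inner = Toy(threshold=0.5)
outer = Toy(threshold=1.0, nested=inner)
# With deep=True, "nested__threshold" appears next to the top-level "threshold".
```

The same prefixed names are what you use as keys in the parameter_grid to target nested parameters.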
- optimize(dataset: DatasetT, *, groups=None, mock_labels=None, **optimize_params) Self [source]#
Run the GridSearchCV on the given dataset.
- Parameters:
- dataset
The dataset to optimize on.
- groups
An optional set of group labels that are passed to the cross-validation helper.
- mock_labels
An optional set of mocked labels that are passed to the cross-validation helper as the y parameter. This can be helpful in combination with the Stratified*Fold cross-validation helpers, which use the y parameter to stratify the folds.
- run(datapoint: DatasetT) PipelineT [source]#
Run the optimized pipeline.
This is a wrapper to maintain API compatibility with Pipeline.
- safe_run(datapoint: DatasetT) PipelineT [source]#
Run the optimized pipeline.
This is a wrapper to maintain API compatibility with Pipeline.