tpcp.optimize.GridSearch
- class tpcp.optimize.GridSearch(pipeline: PipelineT, parameter_grid: ParameterGrid, *, scoring: Optional[Union[Callable[[PipelineT, DatasetT], Union[T, Aggregator[Any], Dict[str, Union[T, Aggregator[Any]]], Dict[str, Union[T, Aggregator[Any], Dict[str, Union[T, Aggregator[Any]]]]]]], Scorer[PipelineT, DatasetT, Union[T, Aggregator[Any], Dict[str, Union[T, Aggregator[Any]]]]]]] = None, n_jobs: Optional[int] = None, return_optimized: Union[bool, str] = True, pre_dispatch: Union[int, str] = 'n_jobs', progress_bar: bool = True)
Perform a grid search over various parameters.
This scores the pipeline for every combination of data points in the provided dataset and parameter combinations in the parameter_grid. The scores over the entire dataset are then aggregated for each parameter combination. By default, this aggregation is a simple average.
Note
This differs from how grid search works in many other cases: usually, the performance metric would be calculated on all data points at once. Here, each data point represents an entire participant or recording (depending on the dataset). Therefore, the pipeline and the scoring method are expected to provide a result/score per data point in the dataset. It is still up to you what you consider a “data point” in the context of your analysis. The run method of the pipeline can still process multiple data points, e.g., gait tests, in a loop and generate a single output if you consider a single participant one data point.
- Parameters:
- pipeline
The pipeline object to optimize
- parameter_grid
A sklearn parameter grid to define the search space.
- scoring
A callable that can score a single data point given a pipeline. This function should return either a single score or a dictionary of scores. If scoring is None, the default score method of the pipeline is used instead.
Note that if scoring returns a dictionary, return_optimized must be set to the name of the score that should be used for ranking.
- n_jobs
The number of processes that should be used to parallelize the search. None means 1, while -1 means as many processes as there are logical processing cores.
- pre_dispatch
The number of jobs that should be pre-dispatched. For an explanation, see the documentation of sklearn's GridSearchCV.
- return_optimized
If True, a pipeline object with the overall best params is created and stored as optimized_pipeline_. If scoring returns a dictionary of score values, this must be a str corresponding to the name of the score that should be used to rank the results. If False, the respective result attributes will not be populated. If multiple parameter combinations have the same score, the one tested first will be used. Otherwise, higher values are always considered better.
- progress_bar
True/False to enable/disable a tqdm progress bar.
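The search procedure these parameters control can be sketched in plain Python. This is a minimal illustration that does not use tpcp itself: the toy dataset, the threshold and strict parameters, and the helpers score_datapoint and grid_search are all hypothetical, and the real class additionally handles parallelization, custom aggregation, and multi-metric scoring.

```python
from itertools import product
from statistics import mean

# Hypothetical toy "dataset": each entry is one data point (e.g. one participant).
dataset = {
    "participant_1": [0.2, 0.8, 0.9],
    "participant_2": [0.4, 0.6, 0.7],
}

# Hypothetical search space: every combination of these values is one candidate.
parameter_grid = {"threshold": [0.3, 0.5, 0.75], "strict": [True, False]}


def score_datapoint(params, values):
    """Return a single score for one data point: the fraction of values above the threshold."""
    if params["strict"]:
        above = [v > params["threshold"] for v in values]
    else:
        above = [v >= params["threshold"] for v in values]
    return sum(above) / len(above)


def grid_search(dataset, parameter_grid):
    names = list(parameter_grid)
    candidates = [dict(zip(names, combo)) for combo in product(*parameter_grid.values())]
    results = []
    for params in candidates:
        # One score per data point, then aggregated (here: a simple average).
        single_scores = [score_datapoint(params, values) for values in dataset.values()]
        results.append(
            {"params": params, "single_score": single_scores, "score": mean(single_scores)}
        )
    # Higher is better; max() returns the first maximum, so ties go to the
    # candidate that was tested first.
    best_index = max(range(len(results)), key=lambda i: results[i]["score"])
    return results, best_index


results, best_index = grid_search(dataset, parameter_grid)
best_params = results[best_index]["params"]
```

In this toy setup, strict=True and strict=False produce identical scores for each threshold, so the tie rule decides which of the two is reported as best.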
- Other Parameters:
- dataset
The dataset instance passed to the optimize method
- Attributes:
- gs_results_
A dictionary summarizing all results of the grid search. The format of this dictionary is designed to be directly passed into the pandas DataFrame constructor. Each row then represents the result for one set of parameters.
The dictionary contains the following entries:
- param_*
The value of a respective parameter
- params
A dictionary representing all parameters
- score / {scorer-name}
The aggregated value of a score over all data points. If a single score is used for scoring, then the generic name “score” is used. Otherwise, multiple columns with the name of the respective scorer exist.
- rank_score / rank_{scorer-name}
A sorting for each score from the highest to the lowest value
- single_score / single_{scorer-name}
The individual scores per data point for each parameter combination. This is a list of values with length len(dataset).
- data_labels
A list of data labels in the order the single score values are provided. These can be used to associate the single_score values with a certain data point.
- optimized_pipeline_
An instance of the input pipeline with the best parameter set. This is only available if return_optimized is not False.
- best_params_
The parameter dict that resulted in the best result. This is only available if return_optimized is not False.
- best_index_
The index of the result row in the output. This is only available if return_optimized is not False.
- best_score_
The score of the best result. In a multimetric case, only the value of the scorer specified by return_optimized is provided. This is only available if return_optimized is not False.
- multimetric_
Whether the scorer returned multiple scores.
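As a rough illustration of the result layout, the following sketch builds a hand-written dictionary with the documented keys and reads it back without pandas. All values are invented for illustration, and the assumption that data_labels holds one list of labels per parameter combination is ours, not confirmed by the text above.

```python
# Hand-written stand-in for gs_results_ (single-score case). All values are invented.
gs_results = {
    "param_threshold": [0.3, 0.5],
    "params": [{"threshold": 0.3}, {"threshold": 0.5}],
    "score": [0.75, 0.5],
    "rank_score": [1, 2],
    "single_score": [[1.0, 0.5], [0.5, 0.5]],
    # Assumption: one list of labels per parameter combination.
    "data_labels": [
        ["participant_1", "participant_2"],
        ["participant_1", "participant_2"],
    ],
}

# Each key is a column; entry i of every column belongs to parameter combination i.
n_candidates = len(gs_results["params"])
rows = [{key: column[i] for key, column in gs_results.items()} for i in range(n_candidates)]

# Rank 1 marks the highest aggregated score.
best_index = gs_results["rank_score"].index(1)
best_row = rows[best_index]

# Pair each individual score with the label of the data point it belongs to.
per_datapoint = dict(zip(best_row["data_labels"], best_row["single_score"]))
```

With pandas installed, the same dictionary could instead be passed to the DataFrame constructor to obtain one row per parameter combination.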
Methods
- clone()
Create a new instance of the class with all parameters copied over.
- get_params([deep])
Get parameters for this algorithm.
- optimize(dataset, **_)
Run the grid search over the dataset and find the best parameter combination.
- run(datapoint)
Run the optimized pipeline.
- safe_run(datapoint)
Run the optimized pipeline.
- score(datapoint)
Run the score method of the optimized pipeline.
- set_params(**params)
Set the parameters of this algorithm.
- __init__(pipeline: PipelineT, parameter_grid: ParameterGrid, *, scoring: Optional[Union[Callable[[PipelineT, DatasetT], Union[T, Aggregator[Any], Dict[str, Union[T, Aggregator[Any]]], Dict[str, Union[T, Aggregator[Any], Dict[str, Union[T, Aggregator[Any]]]]]]], Scorer[PipelineT, DatasetT, Union[T, Aggregator[Any], Dict[str, Union[T, Aggregator[Any]]]]]]] = None, n_jobs: Optional[int] = None, return_optimized: Union[bool, str] = True, pre_dispatch: Union[int, str] = 'n_jobs', progress_bar: bool = True) -> None
- _format_results(candidate_params, out)
Format the final result dict.
This function is adapted from sklearn’s BaseSearchCV.
- clone() -> Self
Create a new instance of the class with all parameters copied over.
This will create a new instance of the class itself and all nested objects.
- get_params(deep: bool = True) -> Dict[str, Any]
Get parameters for this algorithm.
- Parameters:
- deep
Only relevant if the object contains nested algorithm objects. If this is the case and deep is True, the params of these nested objects are included in the output using a prefix like nested_object_name__ (note the two “_” at the end).
- Returns:
- params
Parameter names mapped to their values.
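The prefix convention for nested parameters can be sketched as follows. This is a simplified stand-in written for illustration, not tpcp's implementation: the classes Filter and ToyPipeline and the free function get_params are hypothetical.

```python
class Filter:
    """Hypothetical nested algorithm with two parameters."""

    def __init__(self, cutoff=5.0, order=2):
        self.cutoff = cutoff
        self.order = order


class ToyPipeline:
    """Hypothetical pipeline that holds a nested algorithm object."""

    def __init__(self, filter_algo, window=10):
        self.filter_algo = filter_algo
        self.window = window


def get_params(obj, deep=True):
    """Collect instance attributes; with deep=True also those of nested objects."""
    params = dict(vars(obj))
    if deep:
        for name, value in vars(obj).items():
            if hasattr(value, "__dict__"):
                # Nested parameters get the prefix "<name>__" (two underscores).
                for nested_name, nested_value in get_params(value, deep=True).items():
                    params[f"{name}__{nested_name}"] = nested_value
    return params


pipeline = ToyPipeline(Filter(cutoff=3.0))
shallow_params = get_params(pipeline, deep=False)
deep_params = get_params(pipeline, deep=True)
```

Here shallow_params contains only filter_algo and window, while deep_params additionally exposes filter_algo__cutoff and filter_algo__order.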
- optimize(dataset: DatasetT, **_: Any) -> Self
Run the grid search over the dataset and find the best parameter combination.
- Parameters:
- dataset
The dataset used for optimization.
- run(datapoint: DatasetT) -> PipelineT
Run the optimized pipeline.
This is a wrapper to maintain API compatibility with Pipeline.
- safe_run(datapoint: DatasetT) -> PipelineT
Run the optimized pipeline.
This is a wrapper to maintain API compatibility with Pipeline.
Examples using tpcp.optimize.GridSearch
Grid Search optimal Algorithm Parameter