GridSearch#

class tpcp.optimize.GridSearch(pipeline: PipelineT, parameter_grid: ParameterGrid, *, scoring: Callable[[PipelineT, DatasetT], T | Aggregator[Any] | dict[str, T | Aggregator[Any]] | dict[str, T | Aggregator[Any] | dict[str, T | Aggregator[Any]]]] | Scorer[PipelineT, DatasetT, T | Aggregator[Any] | dict[str, T | Aggregator[Any]]] | None = None, n_jobs: int | None = None, return_optimized: bool | str = True, pre_dispatch: int | str = 'n_jobs', progress_bar: bool = True)[source]#

Perform a grid search over various parameters.

This scores the pipeline on every data point of the provided dataset for every parameter combination in the parameter_grid. The scores over the entire dataset are then aggregated for each parameter combination. By default, this aggregation is a simple average.

Note

This is different from how grid search works in many other frameworks: Usually, the performance metric would be calculated on all data points at once. Here, each data point represents an entire participant or recording (depending on the dataset). Therefore, the pipeline and the scoring method are expected to provide a result/score per data point in the dataset. Note that what you consider a “data point” in the context of your analysis is still open to your interpretation. The run method of the pipeline can still process multiple data points, e.g., gait tests, in a loop and generate a single output, if you consider a single participant one data point.
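
For orientation, here is a minimal, hypothetical usage sketch. ExampleDataset, ExamplePipeline, and the score function are stand-ins and not part of tpcp; only Dataset, Pipeline, ParameterGrid, and GridSearch are real classes:

import pandas as pd
from sklearn.model_selection import ParameterGrid

from tpcp import Dataset, Pipeline
from tpcp.optimize import GridSearch


class ExampleDataset(Dataset):
    """Hypothetical dataset with three recordings."""

    def create_index(self) -> pd.DataFrame:
        return pd.DataFrame({"recording": ["rec_1", "rec_2", "rec_3"]})


class ExamplePipeline(Pipeline[ExampleDataset]):
    """Hypothetical pipeline with a single tunable threshold parameter."""

    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold

    def run(self, datapoint: ExampleDataset):
        # A real pipeline would process `datapoint` here and store results
        # in attributes with a trailing underscore.
        self.result_ = self.threshold
        return self


def score(pipeline: ExamplePipeline, datapoint: ExampleDataset) -> float:
    """Hypothetical scoring function returning a single score per data point."""
    pipeline = pipeline.safe_run(datapoint)
    return -abs(pipeline.result_ - 0.4)


para_grid = ParameterGrid({"threshold": [0.1, 0.5, 1.0]})
gs = GridSearch(ExamplePipeline(), para_grid, scoring=score)
gs = gs.optimize(ExampleDataset())
print(gs.best_params_)  # {'threshold': 0.5}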

Parameters:
pipeline

The pipeline object to optimize

parameter_grid

A sklearn parameter grid to define the search space.

scoring

A callable that can score a single data point given a pipeline. This function should return either a single score or a dictionary of scores. If scoring is None, the default score method of the pipeline is used instead.

Note that if scoring returns a dictionary, return_optimized must be set to the name of the score that should be used for ranking (see the sketch after this parameter list).

n_jobs

The number of processes that should be used to parallelize the search. None means 1, while -1 means as many as there are logical processing cores.

pre_dispatch

The number of jobs that should be pre-dispatched. For an explanation, see the documentation of GridSearchCV.

return_optimized

If True, a pipeline object with the overall best parameters is created and stored as optimized_pipeline_. If scoring returns a dictionary of score values, this must be a str corresponding to the name of the score that should be used to rank the results. If False, the respective result attributes will not be populated. If multiple parameter combinations have the same score, the one tested first is used. By default, the parameter combination with the best rank (i.e. the highest score) is used. If you want to select the combination with the lowest score instead, set return_optimized to the name of the score prefixed with a minus sign, e.g. -rmse. In the case of a single score, use -score to select the value with the lowest score.

progress_bar

True/False to enable/disable a tqdm progress bar.
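
If the scoring function returns a dictionary of scores, return_optimized must name one of them; a minus-sign prefix selects the lowest instead of the highest value. A sketch, continuing the hypothetical example above:

def score(pipeline: ExamplePipeline, datapoint: ExampleDataset) -> dict[str, float]:
    pipeline = pipeline.safe_run(datapoint)
    # Hypothetical metrics derived from the pipeline result.
    return {"precision": pipeline.result_, "rmse": abs(pipeline.result_ - 0.4)}


# Rank by the highest "precision" ...
gs = GridSearch(ExamplePipeline(), para_grid, scoring=score, return_optimized="precision")
# ... or by the lowest "rmse".
gs = GridSearch(ExamplePipeline(), para_grid, scoring=score, return_optimized="-rmse")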

Other Parameters:
dataset

The dataset instance passed to the optimize method

Attributes:
gs_results_

A dictionary summarizing all results of the gridsearch. The format of this dictionary is designed to be directly passed into the DataFrame constructor. Each row then represents the result for one parameter combination (see the sketch after this attribute list).

The dictionary contains the following entries:

param_*

The value of a respective parameter

params

A dictionary representing all parameters

score / {scorer-name}

The aggregated value of a score over all data points. If a single score is used for scoring, the generic name “score” is used. Otherwise, multiple columns with the name of the respective scorer exist.

rank_score / rank_{scorer-name}

A ranking for each score from the highest to the lowest value. Whether lower or higher values are better depends on the scoring function and needs to be interpreted accordingly.

single_score / single_{scorer-name}

The individual scores per data point for each parameter combination. This is a list of values with length len(dataset).

data_labels

A list of data labels in the order the single score values are provided. These can be used to associate the single_score values with a certain data point.

optimized_pipeline_

An instance of the input pipeline with the best parameter set. This is only available if return_optimized is not False.

best_params_

The parameter dict that resulted in the best result. This is only available if return_optimized is not False.

best_index_

The index of the result row in the output. This is only available if return_optimized is not False.

best_score_

The score of the best result. In a multimetric case, only the value of the scorer specified by return_optimized is provided. This is only available if return_optimized is not False.

multimetric_

Whether the scorer returned multiple scores.
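
A sketch of how these attributes can be inspected, continuing the single-score example from the top of this page (gs_results_ is designed to be passed directly to the pandas DataFrame constructor):

import pandas as pd

# `gs` is the optimized GridSearch instance from the sketch above.
results = pd.DataFrame(gs.gs_results_)
print(results[["params", "score", "rank_score"]])

print(gs.best_params_)  # parameter dict of the best combination
print(gs.best_score_)   # aggregated score of that combination
print(gs.best_index_)   # row index of that combination in `results`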

Methods

clone()

Create a new instance of the class with all parameters copied over.

get_params([deep])

Get parameters for this algorithm.

optimize(dataset, **_)

Run the grid search over the dataset and find the best parameter combination.

run(datapoint)

Run the optimized pipeline.

safe_run(datapoint)

Run the optimized pipeline.

score(datapoint)

Run score of the optimized pipeline.

set_params(**params)

Set the parameters of this Algorithm.

__init__(pipeline: PipelineT, parameter_grid: ParameterGrid, *, scoring: Callable[[PipelineT, DatasetT], T | Aggregator[Any] | dict[str, T | Aggregator[Any]] | dict[str, T | Aggregator[Any] | dict[str, T | Aggregator[Any]]]] | Scorer[PipelineT, DatasetT, T | Aggregator[Any] | dict[str, T | Aggregator[Any]]] | None = None, n_jobs: int | None = None, return_optimized: bool | str = True, pre_dispatch: int | str = 'n_jobs', progress_bar: bool = True) None[source]#
_format_results(candidate_params, out)[source]#

Format the final result dict.

This function is adapted from sklearn’s BaseSearchCV.

clone() Self[source]#

Create a new instance of the class with all parameters copied over.

This will create a new instance of the class itself and of all nested objects.

get_params(deep: bool = True) dict[str, Any][source]#

Get parameters for this algorithm.

Parameters:
deep

Only relevant if the object contains nested algorithm objects. If this is the case and deep is True, the params of these nested objects are included in the output using a prefix like nested_object_name__ (note the double underscore at the end).

Returns:
params

Parameter names mapped to their values.
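
A small illustration, continuing the hypothetical example above (the nested pipeline parameters show up under the pipeline__ prefix):

gs = GridSearch(ExamplePipeline(threshold=0.5), para_grid)
params = gs.get_params(deep=True)
print(params["pipeline__threshold"])  # 0.5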

optimize(dataset: DatasetT, **_: Any) Self[source]#

Run the grid search over the dataset and find the best parameter combination.

Parameters:
dataset

The dataset used for optimization.

run(datapoint: DatasetT) PipelineT[source]#

Run the optimized pipeline.

This is a wrapper to maintain API compatibility with Pipeline.

safe_run(datapoint: DatasetT) PipelineT[source]#

Run the optimized pipeline.

This is a wrapper to maintain API compatibility with Pipeline.

score(datapoint: DatasetT) float | dict[str, float][source]#

Run score of the optimized pipeline.

This is a wrapper to maintain API compatibility with Pipeline.

set_params(**params: Any) Self[source]#

Set the parameters of this Algorithm.

To set parameters of nested objects, use the nested_object_name__para_name= syntax.
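
For example, continuing the hypothetical example above:

# Set a parameter of the GridSearch itself and one of the nested pipeline.
gs = gs.set_params(n_jobs=2, pipeline__threshold=0.7)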

Examples using tpcp.optimize.GridSearch#

Grid Search optimal Algorithm Parameter

Optimizable Pipelines

GridSearchCV

Custom Optuna Optimizer

Build-in Optuna Optimizers

Dataclass and Attrs support