.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/parameter_optimization/_03_gridsearch_cv.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_parameter_optimization__03_gridsearch_cv.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_parameter_optimization__03_gridsearch_cv.py:


.. _gridsearch_cv:

GridSearchCV
============

When optimizing parameters for algorithms that have trainable components, the parameter search must be performed
on a validation set that is separate from the test set used for the final evaluation.
Even better is to use cross-validation for this step.
In tpcp, this can be done using :class:`~tpcp.optimize.GridSearchCV`.

This example explains how to use this method.
To learn more about the concept, review the :ref:`evaluation guide <algorithm_evaluation>` and the `sklearn guide on
tuning hyperparameters <https://scikit-learn.org/stable/modules/grid_search.html#grid-search>`_.

.. GENERATED FROM PYTHON SOURCE LINES 17-24

.. code-block:: default

    import random

    import pandas as pd
    from typing_extensions import Self

    random.seed(1)  # We set the random seed for repeatable results


.. GENERATED FROM PYTHON SOURCE LINES 25-29

Dataset
-------
As always, we need a dataset, a pipeline, and a scoring method for a parameter search.
Here, we're just going to reuse the ECGExample dataset we created in :ref:`custom_dataset_ecg`.

.. GENERATED FROM PYTHON SOURCE LINES 29-42

.. code-block:: default

    from pathlib import Path

    from examples.datasets.datasets_final_ecg import ECGExampleData

    try:
        HERE = Path(__file__).parent
    except NameError:
        HERE = Path().resolve()
    data_path = HERE.parent.parent / "example_data/ecg_mit_bih_arrhythmia/data"
    example_data = ECGExampleData(data_path)

    from typing import Any


.. GENERATED FROM PYTHON SOURCE LINES 43-52

The Pipeline
------------
When using `GridSearchCV`, our pipeline must be "optimizable".
Otherwise, we would have no need for the CV part and could just use a simple grid search.
Here we are going to create an optimizable pipeline that wraps the optimizable version of the QRS detector we
developed in :ref:`custom_algorithms_qrs_detection`.

For more information about the pipeline below, check our examples on :ref:`optimize_pipelines`.
Todo: Full dedicated example for `PureParameter`

.. GENERATED FROM PYTHON SOURCE LINES 52-84

.. code-block:: default

    from examples.algorithms.algorithms_qrs_detection_final import OptimizableQrsDetector
    from tpcp import OptimizableParameter, OptimizablePipeline, Parameter, cf


    class MyPipeline(OptimizablePipeline[ECGExampleData]):
        algorithm: Parameter[OptimizableQrsDetector]
        algorithm__min_r_peak_height_over_baseline: OptimizableParameter[float]

        r_peak_positions_: pd.Series

        def __init__(self, algorithm: OptimizableQrsDetector = cf(OptimizableQrsDetector())):
            self.algorithm = algorithm

        def self_optimize(self, dataset: ECGExampleData, **kwargs: Any):
            ecg_data = [d.data["ecg"] for d in dataset]
            r_peaks = [d.r_peak_positions_["r_peak_position"] for d in dataset]
            # Note: We need to clone the algorithm instance, to make sure we don't leak any data between runs.
            algo = self.algorithm.clone()
            self.algorithm = algo.self_optimize(ecg_data, r_peaks, dataset.sampling_rate_hz)
            return self

        def run(self, datapoint: ECGExampleData) -> Self:
            # Note: We need to clone the algorithm instance, to make sure we don't leak any data between runs.
            algo = self.algorithm.clone()
            algo.detect(datapoint.data["ecg"], datapoint.sampling_rate_hz)

            self.r_peak_positions_ = algo.r_peak_positions_
            return self


    pipe = MyPipeline()


.. GENERATED FROM PYTHON SOURCE LINES 85-89

The Scorer
----------
The scorer is identical to the scoring function used in the other examples.
The F1-score is still the most important metric for our comparison.
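
To make the metric concrete, the following is a minimal sketch of how precision, recall, and F1-score can be derived
from detected and reference event positions.
It is not the implementation of `match_events_with_reference` or `precision_recall_f1_score` used in this example.
The helper names `count_matches` and `prf1`, the greedy matching strategy, the sample positions, and the sampling
rate of 360 Hz are all assumptions made for illustration.

.. code-block:: python

    def count_matches(detected, reference, tolerance):
        """Greedily match each detected event to one unused reference event within `tolerance` samples."""
        unused = sorted(reference)
        true_positives = 0
        for event in sorted(detected):
            # Collect all reference events that are still unmatched and close enough.
            candidates = [r for r in unused if abs(r - event) <= tolerance]
            if candidates:
                closest = min(candidates, key=lambda r: abs(r - event))
                unused.remove(closest)
                true_positives += 1
        false_positives = len(detected) - true_positives
        false_negatives = len(reference) - true_positives
        return true_positives, false_positives, false_negatives


    def prf1(tp, fp, fn):
        """Compute precision, recall, and F1-score from match counts."""
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1


    # Hypothetical positions in samples; 0.02 s at an assumed 360 Hz is 7.2 samples.
    tolerance_samples = 0.02 * 360
    detected = [77, 370, 663, 950, 1500]
    reference = [75, 372, 660, 947, 1231]

    tp, fp, fn = count_matches(detected, reference, tolerance_samples)
    precision, recall, f1 = prf1(tp, fp, fn)
    print(tp, fp, fn)
    print(precision, recall, f1)

Note that the tolerance is converted from seconds to samples before matching, as in the `score` function of this
example.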

.. GENERATED FROM PYTHON SOURCE LINES 89-108

.. code-block:: default


    from examples.algorithms.algorithms_qrs_detection_final import match_events_with_reference, precision_recall_f1_score


    def score(pipeline: MyPipeline, datapoint: ECGExampleData) -> dict[str, float]:
        # We use the `safe_run` wrapper instead of just run. This is always a good idea.
        # We don't need to clone the pipeline here, as GridSearch will already clone the pipeline internally and `run`
        # will clone it again.
        pipeline = pipeline.safe_run(datapoint)
        tolerance_s = 0.02  # We just use 20 ms for this example
        matches = match_events_with_reference(
            pipeline.r_peak_positions_.to_numpy(),
            datapoint.r_peak_positions_.to_numpy(),
            tolerance=tolerance_s * datapoint.sampling_rate_hz,
        )
        precision, recall, f1_score = precision_recall_f1_score(matches)
        return {"precision": precision, "recall": recall, "f1_score": f1_score}


.. GENERATED FROM PYTHON SOURCE LINES 109-116

Data Splitting
--------------
As with a regular cross-validation, we need to decide on the number of folds and the type of split.
`tpcp` supports all cross-validation iterators provided by
`sklearn <https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators>`__.

To keep the runtime low for this example, we are going to use a 2-fold CV.
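
As a rough illustration of what a 2-fold split means, the sketch below imitates an unshuffled K-fold split in pure
Python.
`kfold_indices` is a hypothetical helper written for this illustration, not part of sklearn or tpcp, and the dataset
size of 12 datapoints is an assumption.

.. code-block:: python

    def kfold_indices(n_samples, n_splits):
        """Yield (train, test) index lists for contiguous, unshuffled folds."""
        # The first `n_samples % n_splits` folds receive one extra sample.
        fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0) for i in range(n_splits)]
        start = 0
        for size in fold_sizes:
            stop = start + size
            test = list(range(start, stop))
            train = [i for i in range(n_samples) if i < start or i >= stop]
            yield train, test
            start = stop


    # Hypothetical dataset with 12 datapoints, split into 2 folds.
    splits = list(kfold_indices(n_samples=12, n_splits=2))
    for fold, (train, test) in enumerate(splits):
        print(f"fold {fold}: train={train} test={test}")

Every datapoint ends up in the test set of exactly one fold, and with two folds the train and test sets simply swap
roles between the folds.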

.. GENERATED FROM PYTHON SOURCE LINES 116-120

.. code-block:: default

    from sklearn.model_selection import KFold

    cv = KFold(n_splits=2)


.. GENERATED FROM PYTHON SOURCE LINES 121-132

The Parameters
--------------
The pipeline above exposes several (nested) parameters.
`min_r_peak_height_over_baseline` is the parameter we want to optimize.
All other parameters are effectively hyperparameters, as they change the outcome of the optimization.
We could differentiate further and say that only `r_peak_match_tolerance_s` is a true hyperparameter, as it only
affects the outcome of the optimization, while the `run` method is independent of it.
`max_heart_rate_bpm` and `high_pass_filter_cutoff_hz` affect both the optimization and `run`.

We could run the grid search over any combination of parameters.
However, to keep things simple, we will only test a few values for `high_pass_filter_cutoff_hz`.
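
To show how a grid over more than one parameter grows, here is a small pure-Python sketch that expands a grid
definition into the individual parameter combinations.
`expand_grid` is a hypothetical helper that only imitates the idea of a parameter grid, and the candidate values for
`algorithm__max_heart_rate_bpm` are made up for this illustration.

.. code-block:: python

    from itertools import product


    def expand_grid(grid):
        """Return one dict per combination of the candidate values in `grid`."""
        keys = sorted(grid)
        return [dict(zip(keys, values)) for values in product(*(grid[key] for key in keys))]


    grid = {
        "algorithm__high_pass_filter_cutoff_hz": [0.25, 0.5, 1],
        "algorithm__max_heart_rate_bpm": [180.0, 200.0],  # hypothetical candidate values
    }
    combinations = expand_grid(grid)
    print(len(combinations))
    for combination in combinations:
        print(combination)

Each additional parameter multiplies the number of combinations, and with cross-validation every combination is
optimized and scored once per fold.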

.. GENERATED FROM PYTHON SOURCE LINES 132-136

.. code-block:: default

    from sklearn.model_selection import ParameterGrid

    parameters = ParameterGrid({"algorithm__high_pass_filter_cutoff_hz": [0.25, 0.5, 1]})


.. GENERATED FROM PYTHON SOURCE LINES 137-142

GridSearchCV
------------
Setting up the `GridSearchCV` object is similar to setting up a normal `GridSearch`; we just need to add the
additional `cv` parameter.
Then we can run the search using the `optimize` method.
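
Conceptually, a grid search with cross-validation is a nested loop over parameter combinations and folds, followed
by a final re-optimization on all data.
The sketch below shows this idea with a toy model in pure Python.
It is not how tpcp implements `GridSearchCV`; the functions `toy_optimize`, `toy_score`, and `toy_grid_search_cv`,
the hyperparameter `scale`, and the data are all hypothetical.

.. code-block:: python

    from statistics import mean


    def toy_optimize(train, scale):
        """'Train' the toy model: the learned value depends on the hyperparameter `scale`."""
        return scale * mean(train)


    def toy_score(model, test):
        """Higher is better: negative mean absolute error of the constant prediction."""
        return -mean(abs(value - model) for value in test)


    def toy_grid_search_cv(data, scales, folds):
        results = []
        for scale in scales:
            fold_scores = []
            for train_idx, test_idx in folds:
                # Optimize only on the train part, then score on the unseen test part.
                model = toy_optimize([data[i] for i in train_idx], scale)
                fold_scores.append(toy_score(model, [data[i] for i in test_idx]))
            results.append({"scale": scale, "fold_scores": fold_scores, "mean_score": mean(fold_scores)})
        best = max(results, key=lambda result: result["mean_score"])
        # Final step: re-optimize with the best hyperparameter on all available data.
        final_model = toy_optimize(data, best["scale"])
        return results, best, final_model


    data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    folds = [([1, 3, 5], [0, 2, 4]), ([0, 2, 4], [1, 3, 5])]  # (train indices, test indices)
    results, best, final_model = toy_grid_search_cv(data, scales=[0.5, 1.0, 1.5], folds=folds)
    print(best["scale"], final_model)

The important point is that the trainable part is re-optimized for every fold and every parameter combination, so
the scores are always computed on data the model has not seen during optimization.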

.. GENERATED FROM PYTHON SOURCE LINES 142-147

.. code-block:: default

    from tpcp.optimize import GridSearchCV

    gs = GridSearchCV(pipeline=MyPipeline(), parameter_grid=parameters, scoring=score, cv=cv, return_optimized="f1_score")
    gs = gs.optimize(example_data)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Split-Para Combos: 100%|██████████| 6/6 [00:04<00:00,  1.32it/s]


.. GENERATED FROM PYTHON SOURCE LINES 148-154

Results
-------
The output is comparable to the output of :class:`~tpcp.optimize.GridSearch`.
The main results are stored in the `cv_results_` attribute.
However, instead of a single performance value per parameter combination, we get one value per fold, plus the mean
and standard deviation over all folds.

.. GENERATED FROM PYTHON SOURCE LINES 154-159

.. code-block:: default

    results = gs.cv_results_
    results_df = pd.DataFrame(results)

    results_df


.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>mean_optimize_time</th>
          <th>std_optimize_time</th>
          <th>mean_score_time</th>
          <th>std_score_time</th>
          <th>split0_test_data_labels</th>
          <th>split1_test_data_labels</th>
          <th>split0_train_data_labels</th>
          <th>split1_train_data_labels</th>
          <th>param_algorithm__high_pass_filter_cutoff_hz</th>
          <th>params</th>
          <th>split0_test_precision</th>
          <th>split1_test_precision</th>
          <th>mean_test_precision</th>
          <th>std_test_precision</th>
          <th>rank_test_precision</th>
          <th>split0_test_recall</th>
          <th>split1_test_recall</th>
          <th>mean_test_recall</th>
          <th>std_test_recall</th>
          <th>rank_test_recall</th>
          <th>split0_test_f1_score</th>
          <th>split1_test_f1_score</th>
          <th>mean_test_f1_score</th>
          <th>std_test_f1_score</th>
          <th>rank_test_f1_score</th>
          <th>split0_test_single_precision</th>
          <th>split1_test_single_precision</th>
          <th>split0_test_single_recall</th>
          <th>split1_test_single_recall</th>
          <th>split0_test_single_f1_score</th>
          <th>split1_test_single_f1_score</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>0</th>
          <td>0.335774</td>
          <td>0.024256</td>
          <td>0.427677</td>
          <td>0.015528</td>
          <td>[(group_1, 100), (group_2, 102), (group_3, 104...</td>
          <td>[(group_1, 114), (group_2, 116), (group_3, 119...</td>
          <td>[(group_1, 114), (group_2, 116), (group_3, 119...</td>
          <td>[(group_1, 100), (group_2, 102), (group_3, 104...</td>
          <td>0.25</td>
          <td>{'algorithm__high_pass_filter_cutoff_hz': 0.25}</td>
          <td>0.939253</td>
          <td>0.936646</td>
          <td>0.937949</td>
          <td>0.001304</td>
          <td>3</td>
          <td>0.886065</td>
          <td>0.800842</td>
          <td>0.843453</td>
          <td>0.042612</td>
          <td>1</td>
          <td>0.903974</td>
          <td>0.824031</td>
          <td>0.864003</td>
          <td>0.039972</td>
          <td>2</td>
          <td>[0.9995600527936648, 0.9724391364262747, 0.961...</td>
          <td>[0.8974358974358975, 0.9954147561483951, 0.998...</td>
          <td>[0.9995600527936648, 0.9679926840420667, 0.967...</td>
          <td>[0.1490154337413518, 0.9900497512437811, 0.999...</td>
          <td>[0.9995600527936648, 0.9702108157653528, 0.964...</td>
          <td>[0.25559105431309903, 0.9927250051964249, 0.99...</td>
        </tr>
        <tr>
          <th>1</th>
          <td>0.308887</td>
          <td>0.000973</td>
          <td>0.431293</td>
          <td>0.004875</td>
          <td>[(group_1, 100), (group_2, 102), (group_3, 104...</td>
          <td>[(group_1, 114), (group_2, 116), (group_3, 119...</td>
          <td>[(group_1, 114), (group_2, 116), (group_3, 119...</td>
          <td>[(group_1, 100), (group_2, 102), (group_3, 104...</td>
          <td>0.5</td>
          <td>{'algorithm__high_pass_filter_cutoff_hz': 0.5}</td>
          <td>0.951106</td>
          <td>0.946568</td>
          <td>0.948837</td>
          <td>0.002269</td>
          <td>2</td>
          <td>0.881955</td>
          <td>0.795663</td>
          <td>0.838809</td>
          <td>0.043146</td>
          <td>3</td>
          <td>0.904481</td>
          <td>0.818777</td>
          <td>0.861629</td>
          <td>0.042852</td>
          <td>3</td>
          <td>[0.9995600527936648, 0.9722735674676525, 0.962...</td>
          <td>[0.9486166007905138, 0.9974947807933194, 0.998...</td>
          <td>[0.9995600527936648, 0.9620484682213077, 0.967...</td>
          <td>[0.12772751463544438, 0.9904643449419569, 0.99...</td>
          <td>[0.9995600527936648, 0.9671339921857045, 0.964...</td>
          <td>[0.225140712945591, 0.9939671312669024, 0.9992...</td>
        </tr>
        <tr>
          <th>2</th>
          <td>0.309644</td>
          <td>0.001590</td>
          <td>0.427927</td>
          <td>0.002174</td>
          <td>[(group_1, 100), (group_2, 102), (group_3, 104...</td>
          <td>[(group_1, 114), (group_2, 116), (group_3, 119...</td>
          <td>[(group_1, 114), (group_2, 116), (group_3, 119...</td>
          <td>[(group_1, 100), (group_2, 102), (group_3, 104...</td>
          <td>1</td>
          <td>{'algorithm__high_pass_filter_cutoff_hz': 1}</td>
          <td>0.959503</td>
          <td>0.946947</td>
          <td>0.953225</td>
          <td>0.006278</td>
          <td>1</td>
          <td>0.882437</td>
          <td>0.796150</td>
          <td>0.839294</td>
          <td>0.043143</td>
          <td>2</td>
          <td>0.907905</td>
          <td>0.823164</td>
          <td>0.865534</td>
          <td>0.042370</td>
          <td>1</td>
          <td>[0.9995600527936648, 0.9723119520073835, 0.962...</td>
          <td>[0.9228070175438596, 0.9974947807933194, 0.998...</td>
          <td>[0.9995600527936648, 0.9634202103337905, 0.968...</td>
          <td>[0.13996806812134113, 0.9904643449419569, 0.99...</td>
          <td>[0.9995600527936648, 0.9678456591639871, 0.965...</td>
          <td>[0.24306839186691312, 0.9939671312669024, 0.99...</td>
        </tr>
      </tbody>
    </table>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 160-162

The mean score is the primary value used to select the best parameter combination (if `return_optimized` is True or
the name of a score).
All other performance values are only there to provide further insight.
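
The aggregated columns can be reproduced by hand from the per-fold values.
The sketch below uses the per-fold F1-scores from the table above and computes mean, standard deviation, and rank
with the standard library.
Using the population standard deviation (`pstdev`) is an assumption that happens to reproduce the values shown in
the table.

.. code-block:: python

    from statistics import mean, pstdev

    # Per-fold F1-scores copied from the `split0_test_f1_score` and `split1_test_f1_score` columns,
    # keyed by the tested value of `high_pass_filter_cutoff_hz`.
    fold_f1 = {
        0.25: [0.903974, 0.824031],
        0.5: [0.904481, 0.818777],
        1: [0.907905, 0.823164],
    }

    summary = {cutoff: {"mean": mean(scores), "std": pstdev(scores)} for cutoff, scores in fold_f1.items()}

    # Rank 1 is the combination with the highest mean score.
    ordered = sorted(summary, key=lambda cutoff: summary[cutoff]["mean"], reverse=True)
    for rank, cutoff in enumerate(ordered, start=1):
        summary[cutoff]["rank"] = rank

    for cutoff, values in summary.items():
        print(cutoff, round(values["mean"], 6), round(values["std"], 6), values["rank"])

The combination with rank 1 for the selected score is the one reported as `best_params_`.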

.. GENERATED FROM PYTHON SOURCE LINES 162-165

.. code-block:: default


    results_df[["mean_test_precision", "mean_test_recall", "mean_test_f1_score"]]


.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>mean_test_precision</th>
          <th>mean_test_recall</th>
          <th>mean_test_f1_score</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>0</th>
          <td>0.937949</td>
          <td>0.843453</td>
          <td>0.864003</td>
        </tr>
        <tr>
          <th>1</th>
          <td>0.948837</td>
          <td>0.838809</td>
          <td>0.861629</td>
        </tr>
        <tr>
          <th>2</th>
          <td>0.953225</td>
          <td>0.839294</td>
          <td>0.865534</td>
        </tr>
      </tbody>
    </table>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 166-167

For even more insight, you can inspect the scores per datapoint:

.. GENERATED FROM PYTHON SOURCE LINES 167-170

.. code-block:: default


    results_df.filter(like="test_single")


.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>split0_test_single_precision</th>
          <th>split1_test_single_precision</th>
          <th>split0_test_single_recall</th>
          <th>split1_test_single_recall</th>
          <th>split0_test_single_f1_score</th>
          <th>split1_test_single_f1_score</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>0</th>
          <td>[0.9995600527936648, 0.9724391364262747, 0.961...</td>
          <td>[0.8974358974358975, 0.9954147561483951, 0.998...</td>
          <td>[0.9995600527936648, 0.9679926840420667, 0.967...</td>
          <td>[0.1490154337413518, 0.9900497512437811, 0.999...</td>
          <td>[0.9995600527936648, 0.9702108157653528, 0.964...</td>
          <td>[0.25559105431309903, 0.9927250051964249, 0.99...</td>
        </tr>
        <tr>
          <th>1</th>
          <td>[0.9995600527936648, 0.9722735674676525, 0.962...</td>
          <td>[0.9486166007905138, 0.9974947807933194, 0.998...</td>
          <td>[0.9995600527936648, 0.9620484682213077, 0.967...</td>
          <td>[0.12772751463544438, 0.9904643449419569, 0.99...</td>
          <td>[0.9995600527936648, 0.9671339921857045, 0.964...</td>
          <td>[0.225140712945591, 0.9939671312669024, 0.9992...</td>
        </tr>
        <tr>
          <th>2</th>
          <td>[0.9995600527936648, 0.9723119520073835, 0.962...</td>
          <td>[0.9228070175438596, 0.9974947807933194, 0.998...</td>
          <td>[0.9995600527936648, 0.9634202103337905, 0.968...</td>
          <td>[0.13996806812134113, 0.9904643449419569, 0.99...</td>
          <td>[0.9995600527936648, 0.9678456591639871, 0.965...</td>
          <td>[0.24306839186691312, 0.9939671312669024, 0.99...</td>
        </tr>
      </tbody>
    </table>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 171-174

If `return_optimized` is set to True (or to the name of a score), a final optimization is performed using the best
set of parameters and **all** the available data.
The resulting pipeline is stored in `optimized_pipeline_`.

.. GENERATED FROM PYTHON SOURCE LINES 174-177

.. code-block:: default

    print("Best Para Combi:", gs.best_params_)
    print("Paras of optimized Pipeline:", gs.optimized_pipeline_.get_params())


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Best Para Combi: {'algorithm__high_pass_filter_cutoff_hz': 1}
    Paras of optimized Pipeline: {'algorithm__high_pass_filter_cutoff_hz': 1, 'algorithm__max_heart_rate_bpm': 200.0, 'algorithm__min_r_peak_height_over_baseline': 0.6322168257130579, 'algorithm__r_peak_match_tolerance_s': 0.01, 'algorithm': OptimizableQrsDetector(high_pass_filter_cutoff_hz=1, max_heart_rate_bpm=200.0, min_r_peak_height_over_baseline=0.6322168257130579, r_peak_match_tolerance_s=0.01)}


.. GENERATED FROM PYTHON SOURCE LINES 178-183

To run the optimized pipeline, we can directly use the `run`/`safe_run` method on the `GridSearchCV` object.
This makes it possible to use `GridSearchCV` as a replacement for your pipeline object with minimal code changes.

If you try to call `run`/`safe_run` (or `score`, for that matter) before the optimization, an error is raised.

.. GENERATED FROM PYTHON SOURCE LINES 183-185

.. code-block:: default

    r_peaks = gs.safe_run(example_data[0]).r_peak_positions_
    r_peaks


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    0           77
    1          370
    2          663
    3          947
    4         1231
             ...  
    2268    648978
    2269    649232
    2270    649485
    2271    649734
    2272    649992
    Length: 2273, dtype: int64


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 7.435 seconds)

**Estimated memory usage:**  38 MB


.. _sphx_glr_download_auto_examples_parameter_optimization__03_gridsearch_cv.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example


    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: _03_gridsearch_cv.py <_03_gridsearch_cv.py>`

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: _03_gridsearch_cv.ipynb <_03_gridsearch_cv.ipynb>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_