{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "%matplotlib inline"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n\n# GridSearchCV\n\nWhen trying to optimize parameters for algorithms that have trainable components, it is required to perform\nthe parameter search on a validation set (that is separate from the test set used for the final validation).\nEven better, is to use a cross validation for this step.\nIn tpcp this can be done by using :class:`~tpcp.optimize.GridSearchCV`.\n\nThis example explains how to use this method.\nTo learn more about the concept, review the `evaluation guide <algorithm_evaluation>` and the `sklearn guide on\ntuning hyperparameters <https://scikit-learn.org/stable/modules/grid_search.html#grid-search>`_.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import random\n\nimport pandas as pd\nfrom typing_extensions import Self\n\nrandom.seed(1)  # We set the random seed for repeatable results"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Dataset\nAs always, we need a dataset, a pipeline, and a scoring method for a parameter search.\nHere, we're just going to reuse the ECGExample dataset we created in `custom_dataset_ecg`.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from pathlib import Path\n\nfrom examples.datasets.datasets_final_ecg import ECGExampleData\n\ntry:\n    HERE = Path(__file__).parent\nexcept NameError:\n    HERE = Path(\".\").resolve()\ndata_path = HERE.parent.parent / \"example_data/ecg_mit_bih_arrhythmia/data\"\nexample_data = ECGExampleData(data_path)\n\nfrom typing import Any"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## The Pipeline\nWhen using `GridSearchCV` our pipeline must be \"optimizable\".\nOtherwise, we have no need for the CV part and could just use a simple gridsearch.\nHere we are going to create an optimizable pipeline that wraps the optimizable version of the QRS detector we\ndeveloped in `custom_algorithms_qrs_detection`.\n\nFor more information about the pipeline below check `optimize_pipelines`.\n   Todo: Full dedicated example for `PureParameter`\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from examples.algorithms.algorithms_qrs_detection_final import OptimizableQrsDetector\nfrom tpcp import Dataset, OptimizableParameter, OptimizablePipeline, Parameter, cf\n\n\nclass MyPipeline(OptimizablePipeline[ECGExampleData]):\n    algorithm: Parameter[OptimizableQrsDetector]\n    algorithm__min_r_peak_height_over_baseline: OptimizableParameter[float]\n\n    r_peak_positions_: pd.Series\n\n    def __init__(self, algorithm: OptimizableQrsDetector = cf(OptimizableQrsDetector())):\n        self.algorithm = algorithm\n\n    def self_optimize(self, dataset: ECGExampleData, **kwargs: Any):\n        ecg_data = [d.data[\"ecg\"] for d in dataset]\n        r_peaks = [d.r_peak_positions_[\"r_peak_position\"] for d in dataset]\n        # Note: We need to clone the algorithm instance, to make sure we don't leak any data between runs.\n        algo = self.algorithm.clone()\n        self.algorithm = algo.self_optimize(ecg_data, r_peaks, dataset.sampling_rate_hz)\n        return self\n\n    def run(self, datapoint: ECGExampleData) -> Self:\n        # Note: We need to clone the algorithm instance, to make sure we don't leak any data between runs.\n        algo = self.algorithm.clone()\n        algo.detect(datapoint.data[\"ecg\"], datapoint.sampling_rate_hz)\n\n        self.r_peak_positions_ = algo.r_peak_positions_\n        return self\n\n\npipe = MyPipeline()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## The Scorer\nThe scorer is identical to the scoring function used in the other examples.\nThe F1-score is still the most important parameter for our comparison.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from typing import Any, Dict\n\nfrom examples.algorithms.algorithms_qrs_detection_final import match_events_with_reference\n\n\ndef score(pipeline: MyPipeline, datapoint: ECGExampleData) -> Dict[str, float]:\n    # We use the `safe_run` wrapper instead of just run. This is always a good idea.\n    # We don't need to clone the pipeline here, as GridSearch will already clone the pipeline internally and `run`\n    # will clone it again.\n    pipeline = pipeline.safe_run(datapoint)\n    tolerance_s = 0.02  # We just use 20 ms for this example\n    matches_events, _ = match_events_with_reference(\n        pipeline.r_peak_positions_.to_numpy(),\n        datapoint.r_peak_positions_.to_numpy(),\n        tolerance=tolerance_s * datapoint.sampling_rate_hz,\n    )\n    n_tp = len(matches_events)\n    precision = n_tp / len(pipeline.r_peak_positions_)\n    recall = n_tp / len(datapoint.r_peak_positions_)\n    f1_score = (2 * n_tp) / (len(pipeline.r_peak_positions_) + len(datapoint.r_peak_positions_))\n    return {\"precision\": precision, \"recall\": recall, \"f1_score\": f1_score}"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Data Splitting\nLike with a normal cross validation, we need to decide on the number of folds and type of splits.\nIn `tpcp` we support all cross validation iterators provided in\n`sklearn <https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators>`__.\n\nTo keep the runtime low for this example, we are going to use a 2-fold CV.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.model_selection import KFold\n\ncv = KFold(n_splits=2)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## The Parameters\nThe pipeline above exposes a couple of (nested) parameters.\n`min_r_peak_height_over_baseline` is the parameter we want to optimize.\nAll other parameters are effectively hyper-parameters as they change the outcome of the optimization.\nWe could differentiate further and say that only `r_peak_match_tolerance_s` is a true hyper parameter, as it only\neffects the outcome of the optimization, but the `run` method is independent from it.\n`max_heart_rate_bpm` and `high_pass_filter_cutoff_hz` effect both the optimization and `run`.\n\nWe could run the gridsearch over any combination of parameters.\nHowever, to keep things simple, we will only test a couple of values for `high_pass_filter_cutoff_hz`.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.model_selection import ParameterGrid\n\nparameters = ParameterGrid({\"algorithm__high_pass_filter_cutoff_hz\": [0.25, 0.5, 1]})"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## GridSearchCV\nSetting up the GridSearchCV object is similar to the normal GridSearch, we just need to add the additional `cv`\nparameter.\nThen we can simply run the search using the `optimize` method.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from tpcp.optimize import GridSearchCV\n\ngs = GridSearchCV(pipeline=MyPipeline(), parameter_grid=parameters, scoring=score, cv=cv, return_optimized=\"f1_score\")\ngs = gs.optimize(example_data)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Results\nThe output is also comparable to the output of the GridSearch.\nThe main results are stored in the `cv_results_` parameter.\nBut instead of just a single performance value per parameter, we get one value per fold and the mean and std over\nall folds.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "results = gs.cv_results_\nresults_df = pd.DataFrame(results)\n\nresults_df"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The mean score is the primary parameter used to select the best parameter combi (if `return_optimized` is True).\nAll other values performance values are just there to provide further inside.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "results_df[[\"mean_test_precision\", \"mean_test_recall\", \"mean_test_f1_score\"]]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "For even more insight, you can inspect the scores per datapoint:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "results_df.filter(like=\"test_single\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "If `return_optimized` was set to True (or the name of a score), a final optimization is performed using the best\nset of parameters and **all** the available data.\nThe resulting pipeline will be stored in `optimizable_pipeline_`.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "print(\"Best Para Combi:\", gs.best_params_)\nprint(\"Paras of optimized Pipeline:\", gs.optimized_pipeline_.get_params())"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To run the optimized pipeline, we can directly use the `run`/`safe_run` method on the GridSearch object.\nThis makes it possible to use the `GridSearch` as a replacement for your pipeline object with minimal code changes.\n\nIf you tried to call `run`/`safe_run` (or `score` for that matter), before the optimization, an error is\nraised.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "r_peaks = gs.safe_run(example_data[0]).r_peak_positions_\nr_peaks"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.8.13"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}