{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "%matplotlib inline"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n\n# Optimizable Pipelines\n\nSome gait analysis algorithms can actively be \"trained\" to improve their performance or adapt them to a certain dataset.\nIn `tpcp` we use the term \"optimize\" instead of \"train\", as not all algorithms are based on \"machine learning\" in the\ntraditional sense.\nWe consider all algorithms/pipelines \"optimizable\" if they have parameters and models that can be adapted and optimized\nusing an algorithm-specific optimization method.\nAlgorithms that can **only** be optimized by brute force (e.g. via GridSearch) are explicitly excluded from this group.\nFor more information about the conceptual idea behind this, see the guide on\n`algorithm evaluation <algorithm_evaluation>`.\n\nIn this example we will implement an optimizable pipeline around the `OptimizableQrsDetector` we developed in\n`custom_algorithms_qrs_detection`.\nAs optimization might depend on the dataset and pre-processing, we need to write a wrapper around the `self_optimize`\nmethod of the `OptimizableQrsDetector` on the pipeline level.\nIn general, this should be straightforward, as most of the complexity is already implemented on the\nalgorithm level.\n\nThis example shows how such a pipeline should be implemented and how it can be optimized using\n:class:`~tpcp.optimize.Optimize`.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## The Pipeline\nOur pipeline will implement all the logic on how our algorithms are applied to the data and how algorithms should\nbe optimized based on training data.\n\nAn optimizable pipeline usually needs the following things:\n\n1. It needs to be a subclass of :class:`~tpcp.OptimizablePipeline`.\n2. It needs to have a `run` method that runs all the algorithmic steps and stores the results as attributes on the\n   pipeline instance (e.g. `r_peak_positions_`).\n   The `run` method should expect only a single data point (in our case a single recording of one sensor) as input.\n3. It needs to have a `self_optimize` method that performs a data-driven optimization of one or more input\n   parameters.\n   This method is expected to return `self` and is only allowed to modify parameters marked as `OptimizableParameter`\n   using the class-level type hints (more below).\n4. An `__init__` that defines all parameters that should be adjustable. Note that the names in the signature of\n   the `__init__` method **must** match the corresponding attribute names (e.g. `max_cost` -> `self.max_cost`).\n   If you want to adjust multiple parameters that all belong to the same algorithm, it might also be convenient to\n   just pass the algorithm as a parameter. However, keep potential issues with mutable defaults in mind (`more\n   info <mutable_defaults>`). As `OptimizableQrsDetector` is a tpcp-algorithm class, we can do that in our case.\n5. At least one of the input parameters must be marked as `OptimizableParameter` in the class-level type hints.\n   If parameters are nested tpcp objects, you can use the `__` syntax to mark nested values as optimizable.\n   Note that you always need to mark the parameters you want to optimize in the current pipeline.\n   Annotations in nested objects are not considered.\n   The more precise you are with these annotations, the more help the runtime checks in tpcp can provide.\n6. (Optionally) Mark parameters as `PureParameter` using the type annotations. This can be used by `GridSearchCV` to\n   apply some performance optimizations. However, be careful with that! In our case, there are no `PureParameter`s,\n   as all (nested) input parameters change the output of the `self_optimize` method.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import pandas as pd\n\nfrom examples.algorithms.algorithms_qrs_detection_final import OptimizableQrsDetector\nfrom examples.datasets.datasets_final_ecg import ECGExampleData\nfrom tpcp import OptimizableParameter, OptimizablePipeline, Parameter, cf\n\n\nclass MyPipeline(OptimizablePipeline[ECGExampleData]):\n    algorithm: Parameter[OptimizableQrsDetector]\n    algorithm__min_r_peak_height_over_baseline: OptimizableParameter[float]\n\n    r_peak_positions_: pd.Series\n\n    def __init__(self, algorithm: OptimizableQrsDetector = cf(OptimizableQrsDetector())):\n        self.algorithm = algorithm\n\n    def self_optimize(self, dataset: ECGExampleData, **kwargs):\n        ecg_data = [d.data[\"ecg\"] for d in dataset]\n        r_peaks = [d.r_peak_positions_[\"r_peak_position\"] for d in dataset]\n        # Note: We need to clone the algorithm instance, to make sure we don't leak any data between runs.\n        algo = self.algorithm.clone()\n        self.algorithm = algo.self_optimize(ecg_data, r_peaks, dataset.sampling_rate_hz)\n        return self\n\n    def run(self, datapoint: ECGExampleData):\n        # Note: We need to clone the algorithm instance, to make sure we don't leak any data between runs.\n        algo = self.algorithm.clone()\n        algo.detect(datapoint.data[\"ecg\"], datapoint.sampling_rate_hz)\n\n        self.r_peak_positions_ = algo.r_peak_positions_\n        return self\n\n\npipe = MyPipeline()"
      ]
    },
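    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As a quick illustration of the `__` naming convention described above, the following sketch lists the parameter\nnames the pipeline exposes and changes the nested threshold by name.\nIt assumes the sklearn-style `get_params`/`set_params`/`clone` interface of tpcp objects.\nThe value `0.5` is an arbitrary placeholder chosen for this sketch, not a recommended threshold.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Nested parameters are exposed with the same `__` names used in the class-level annotations.\nparam_names = sorted(pipe.get_params())\nprint(param_names)\n\n# `set_params` accepts these names as keyword arguments.\n# We work on a clone so that `pipe` itself stays untouched (0.5 is an arbitrary placeholder value).\nmodified_pipe = pipe.clone().set_params(algorithm__min_r_peak_height_over_baseline=0.5)\nprint(\"Modified copy:\", modified_pipe.algorithm.min_r_peak_height_over_baseline)\nprint(\"Original pipeline:\", pipe.algorithm.min_r_peak_height_over_baseline)"
      ]
    },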
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Comparison\nTo see the effect of the optimization, we will compare the output of the optimized pipeline with the output of the\ndefault pipeline.\nAs it is not the goal of this example to perform any form of actual evaluation of a model, we will just compare the\nnumber of identified R-peaks to show that the optimization had an impact on the output.\n\nFor a fair comparison, we must use some training data to optimize the pipeline and then compare the outputs only on a\nseparate test set.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from pathlib import Path\n\nfrom sklearn.model_selection import train_test_split\n\ntry:\n    HERE = Path(__file__).parent\nexcept NameError:\n    HERE = Path(\".\").resolve()\ndata_path = HERE.parent.parent / \"example_data/ecg_mit_bih_arrhythmia/data\"\nexample_data = ECGExampleData(data_path)\n\ntrain_set, test_set = train_test_split(example_data, train_size=0.7, random_state=0)\n# `run` expects a single datapoint, so we only keep the first recording of the test set\ntest_set = test_set[0]\n(train_set.groups, test_set.groups)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## The Baseline\nFor our baseline, we will use the pipeline without applying the optimization.\nThis means the pipeline will use the default threshold.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "pipeline = MyPipeline()\n\n# We use the `safe_run` wrapper instead of just run. This is always a good idea.\nresults = pipeline.safe_run(test_set)\nprint(\"The default `min_r_peak_height_over_baseline` is\", pipeline.algorithm.min_r_peak_height_over_baseline)\nprint(\"Number of R-Peaks:\", len(results.r_peak_positions_))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Optimization\nTo optimize the pipeline, we will **not** call `self_optimize` directly, but use the\n:class:`~tpcp.optimize.Optimize` wrapper.\nIt has the same interface as other optimization methods like :class:`~tpcp.optimize.GridSearch`.\nFurther, it performs some checks to catch potential implementation errors in our `self_optimize` method.\n\nNote that the `optimize` method performs all optimizations on a copy of the pipeline.\nThis means the pipeline object used as input will not be modified.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from tpcp.optimize import Optimize\n\n# Remember we only optimize on the `train_set`.\noptimized_pipe = Optimize(pipeline).optimize(train_set)\noptimized_results = optimized_pipe.safe_run(test_set)\nprint(\"The optimized `min_r_peak_height_over_baseline` is\", optimized_results.algorithm.min_r_peak_height_over_baseline)\nprint(\"Number of R-Peaks:\", len(optimized_results.r_peak_positions_))"
      ]
    },
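    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To make the copy semantics mentioned above tangible, the following sketch compares the threshold stored on the\ninput pipeline with the threshold on the optimized copy.\nIt assumes that the `Optimize` object exposes the optimized copy via its `optimized_pipeline_` attribute.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# The pipeline we passed to `Optimize` should still hold the default threshold ...\nprint(\"Input pipeline:\", pipeline.algorithm.min_r_peak_height_over_baseline)\n# ... while the optimized copy (assumed to be stored as `optimized_pipeline_`) holds the new value.\nprint(\"Optimized copy:\", optimized_pipe.optimized_pipeline_.algorithm.min_r_peak_height_over_baseline)"
      ]
    },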
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can see that training has drastically modified the threshold and increased the number of R-peaks we detected.\nTo figure out whether all the new R-peaks are actually correct, we would need a more extensive evaluation.\n\n\n## Final Notes\nIn this example we only modified a threshold of the algorithm.\nHowever, the concept of optimization can be extended to many other kinds of parameters (e.g. templates, ML-models,\nNN-models).\n\n\n"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.8.13"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}