{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "%matplotlib inline"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n\n# Cross Validation\n\nWhenever you use a trainable algorithm, it is important to clearly separate the training and the testing data to\nget an unbiased result.\nUsually this is achieved by a train-test split.\nHowever, if you don't have much data, there is always a risk that one random train-test split will provide\nbetter (or worse) results than another.\nIn these cases it is a good idea to use cross-validation.\nIn this procedure, you perform multiple train-test splits and average the results over all \"folds\".\nFor more information see our `evaluation guide <algorithm_evaluation>` and the [sklearn guide on cross\nvalidation](https://scikit-learn.org/stable/modules/cross_validation.html).\n\nIn this example, we will learn how to use the :func:`~tpcp.validate.cross_validate` function implemented in\ntpcp.\nFor this, we will redo the example on `optimizable pipelines <optimize_pipelines>`, but we will perform the final\nevaluation via cross-validation.\nIf you want more information on how the dataset and pipeline are built, head over to that example.\nHere we will just copy the code over.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Dataset\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from pathlib import Path\n\nimport numpy as np\n\nfrom examples.datasets.datasets_final_ecg import ECGExampleData\n\ntry:\n    HERE = Path(__file__).parent\nexcept NameError:\n    HERE = Path(\".\").resolve()\ndata_path = HERE.parent.parent / \"example_data/ecg_mit_bih_arrhythmia/data\"\nexample_data = ECGExampleData(data_path)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Pipeline\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import pandas as pd\n\nfrom examples.algorithms.algorithms_qrs_detection_final import OptimizableQrsDetector\nfrom tpcp import OptimizableParameter, OptimizablePipeline, Parameter, cf\n\n\nclass MyPipeline(OptimizablePipeline):\n    algorithm: Parameter[OptimizableQrsDetector]\n    algorithm__min_r_peak_height_over_baseline: OptimizableParameter[float]\n\n    r_peak_positions_: pd.Series\n\n    def __init__(self, algorithm: OptimizableQrsDetector = cf(OptimizableQrsDetector())):\n        self.algorithm = algorithm\n\n    def self_optimize(self, dataset: ECGExampleData, **kwargs):\n        ecg_data = [d.data[\"ecg\"] for d in dataset]\n        r_peaks = [d.r_peak_positions_[\"r_peak_position\"] for d in dataset]\n        # Note: We need to clone the algorithm instance, to make sure we don't leak any data between runs.\n        algo = self.algorithm.clone()\n        self.algorithm = algo.self_optimize(ecg_data, r_peaks, dataset.sampling_rate_hz)\n        return self\n\n    def run(self, datapoint: ECGExampleData):\n        # Note: We need to clone the algorithm instance, to make sure we don't leak any data between runs.\n        algo = self.algorithm.clone()\n        algo.detect(datapoint.data, datapoint.sampling_rate_hz)\n\n        self.r_peak_positions_ = algo.r_peak_positions_\n        return self"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## The Scorer\nThe scorer is identical to the scoring function used in the other examples.\nThe F1-score is still the most important metric for our comparison.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from examples.algorithms.algorithms_qrs_detection_final import match_events_with_reference, precision_recall_f1_score\n\n\ndef score(pipeline: MyPipeline, datapoint: ECGExampleData):\n    # We use the `safe_run` wrapper instead of just run. This is always a good idea.\n    # We don't need to clone the pipeline here, as the optimizer used during cross validation will already clone the\n    # pipeline internally and `run` will clone it again.\n    pipeline = pipeline.safe_run(datapoint)\n    tolerance_s = 0.02  # We just use 20 ms for this example\n    matches = match_events_with_reference(\n        pipeline.r_peak_positions_.to_numpy(),\n        datapoint.r_peak_positions_.to_numpy(),\n        tolerance=tolerance_s * datapoint.sampling_rate_hz,\n    )\n    precision, recall, f1_score = precision_recall_f1_score(matches)\n    return {\"precision\": precision, \"recall\": recall, \"f1_score\": f1_score}"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Data Splitting\nBefore performing a cross validation, we need to decide on the number of folds and the type of splits.\nIn `tpcp` we support all cross validation iterators provided in\n[sklearn](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators).\n\nTo keep the runtime low for this example, we are going to use a 3-fold CV.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.model_selection import KFold\n\ncv = KFold(n_splits=3)"
      ]
    },
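    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To make the splitting more concrete, the following cell is a minimal sketch of what a 3-fold `KFold` does with a dataset.\nIt does not use the ECG data: `toy_datapoints` and `toy_cv` are hypothetical names introduced only for this\nillustration, and we assume a dataset with 6 datapoints.\nEach datapoint ends up in the test set of exactly one fold and in the train set of the other two.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import numpy as np\nfrom sklearn.model_selection import KFold\n\n# Hypothetical stand-in for a dataset with 6 datapoints (not the ECG data used above).\ntoy_datapoints = np.arange(6)\n\n# Separate variable name, so that we don't overwrite the `cv` object used for the actual cross validation.\ntoy_cv = KFold(n_splits=3)\nfor fold, (train_idx, test_idx) in enumerate(toy_cv.split(toy_datapoints)):\n    print(f\"fold {fold}: train={train_idx.tolist()}, test={test_idx.tolist()}\")"
      ]
    },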
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Cross Validation\nNow we have all the pieces for the final cross validation.\nFirst we need to create instances of our data and pipeline.\nThen we need to wrap our pipeline instance into an :class:`~tpcp.optimize.Optimize` wrapper.\nFinally, we can call `tpcp.validate.cross_validate`.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from tpcp.optimize import Optimize\nfrom tpcp.validate import cross_validate\n\npipe = MyPipeline()\noptimizable_pipe = Optimize(pipe)\n\nresults = cross_validate(\n    optimizable_pipe, example_data, scoring=score, cv=cv, return_optimizer=True, return_train_score=True\n)\nresult_df = pd.DataFrame(results)\nresult_df"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Understanding the Results\nThe cross validation provides a lot of outputs (some of them can be disabled using the function parameters).\nTo simplify things a little, we will split the output into four parts:\n\nThe main output is the set of test set performance values.\nEach row corresponds to the performance in the respective fold.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "performance = result_df[[\"test_precision\", \"test_recall\", \"test_f1_score\"]]\nperformance"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The final generalization performance you would report is usually the average over all folds.\nThe standard deviation can also be interesting, as it tells you how stable your optimization is and whether your splits provide\ncomparable data distributions.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "generalization_performance = performance.agg([\"mean\", \"std\"])\ngeneralization_performance"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "If you need more insight into the results (e.g. when the std of your results is high), you can inspect the\nindividual scores for each datapoint.\nIn this example this is only a list with a single element per score, as we only had a single datapoint per fold.\nIn a real scenario, this will be a list with one value per datapoint in the test set.\nInspecting this list can help to identify potential issues with certain parts of your dataset.\nTo link the performance values to a specific datapoint, you can look at the `test_data_labels` field.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "single_performance = result_df[\n    [\"test_single_precision\", \"test_single_recall\", \"test_single_f1_score\", \"test_data_labels\"]\n]\nsingle_performance"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Even further insight is provided by the train results (if enabled via `return_train_score=True`).\nThese are the performance results on the train set. They indicate whether the training provided meaningful results and\ncan reveal over-fitting if the performance on the test set is much worse than the performance on the train\nset.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "train_performance = result_df[\n    [\n        \"train_precision\",\n        \"train_recall\",\n        \"train_f1_score\",\n        \"train_single_precision\",\n        \"train_single_recall\",\n        \"train_single_f1_score\",\n        \"train_data_labels\",\n    ]\n]\ntrain_performance"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The final level of debug information is provided via the timings (note the long runtime in fold 0 can be explained\nby the jit-compiler used in `BarthDtw`) ...\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "timings = result_df[[\"score_time\", \"optimize_time\"]]\ntimings"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "... and the optimized pipeline object.\nThis is the actual trained object generated in this fold.\nYou can apply it to other data for testing or inspect the actual object for further debug information that might be\nstored on it.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "optimized_pipeline = result_df[\"optimizer\"][0]\noptimized_pipeline"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "optimized_pipeline.optimized_pipeline_.get_params()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Further Notes\nWe also support grouped cross validation.\nCheck the `dataset guide <custom_dataset_basics>` on how you can group the data before cross-validation or\ngenerate data labels to be used with `GroupKFold`.\n\n`Optimize` is just one example of an optimizer that can be passed to cross validation.\nYou can pass any `tpcp` optimizer like `GridSearch` or `GridSearchCV`, or a custom optimizer that implements the\n`tpcp.optimize.BaseOptimize` interface.\n\n"
      ]
    }
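    ,
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As a rough illustration of what a grouped split guarantees, the following sketch uses sklearn's `GroupKFold` directly on toy\ndata.\nThe group labels (`toy_groups`) are made up for this example, and the cell deliberately does not show how groups are\npassed to `tpcp`; it only demonstrates that datapoints sharing a group label never end up in the train and the\ntest set of the same fold.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import numpy as np\nfrom sklearn.model_selection import GroupKFold\n\n# Hypothetical toy data: 6 datapoints recorded from 3 participants (2 recordings each).\ntoy_datapoints = np.arange(6)\ntoy_groups = np.array([\"p1\", \"p1\", \"p2\", \"p2\", \"p3\", \"p3\"])\n\ntoy_group_cv = GroupKFold(n_splits=3)\nfor fold, (train_idx, test_idx) in enumerate(toy_group_cv.split(toy_datapoints, groups=toy_groups)):\n    train_groups = set(toy_groups[train_idx])\n    test_groups = set(toy_groups[test_idx])\n    # No participant appears on both sides of the split.\n    assert train_groups.isdisjoint(test_groups)\n    print(f\"fold {fold}: train groups={sorted(train_groups)}, test groups={sorted(test_groups)}\")"
      ]
    }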
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.8.15"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}