.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/validation/_03_custom_scorer.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_validation__03_custom_scorer.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_validation__03_custom_scorer.py:


.. _custom_scorer:

Custom Scorer
=============

Scorers or scoring functions are used in tpcp whenever we need to rank any form of output.
For example, after a GridSearch, we want to know which pipeline is the best.
This is done by a function that takes a pipeline and a datapoint as input and returns one or multiple scores.
These scores are then averaged over all datapoints provided.
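
Conceptually, this default behaviour can be pictured with the plain-Python sketch below.
The score values are made up and the helper name `mean_over_datapoints` is hypothetical; this is only an illustration
of the idea and not tpcp's actual implementation.

.. code-block:: python

    from statistics import mean

    # Hypothetical per-datapoint results, as a score function might return them.
    per_datapoint_scores = [
        {"precision": 1.0, "recall": 0.5},
        {"precision": 0.5, "recall": 1.0},
        {"precision": 0.75, "recall": 0.75},
    ]


    def mean_over_datapoints(scores: list[dict[str, float]]) -> dict[str, float]:
        """Average each named score over all datapoints."""
        return {name: mean(s[name] for s in scores) for name in scores[0]}


    aggregated = mean_over_datapoints(per_datapoint_scores)
    print(aggregated)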

However, sometimes this is not exactly what we want.
In this case, you need to create a custom scorer or custom aggregator to also control how scores are averaged over all
datapoints.

Four general usecases arise for custom scorers:

1. You actually don't want to score anything, but just want to collect some metadata, or pass results out of the method
   unchanged for later analysis. This can be easily done using :func:`~tpcp.validate.no_agg` (see the first example
   below).
2. You can properly calculate a performance value on a single datapoint, but you don't want to take the mean over all
   datapoints, but rather use a different aggregation metric (e.g. median, ...).
   This can be done by using the existing :class:`~tpcp.validate.FloatAggregator` class with a new function (see the
   second and third example below).
3. Similar to 2, but you require additional information passed through the aggregation function. This could be the
   datapoints themselves (e.g. to calculate a macro average) or some other metadata required for the aggregation.
   This can be done by inheriting from the :class:`~tpcp.validate.Aggregator` class and implementing the `aggregate`
   method (see the fourth example below).
4. You want to calculate a score that cannot be first aggregated on a datapoint level.
   For example, you are detecting events in a dataset and you want to calculate the F1 score across all events of a
   dataset, without first aggregating the F1 score on a datapoint level.

.. GENERATED FROM PYTHON SOURCE LINES 33-37

.. code-block:: default


    from collections.abc import Sequence
    from pathlib import Path


.. GENERATED FROM PYTHON SOURCE LINES 38-42

Setup
-----
We will simply reuse the pipeline from the general QRS detection example.
For all of our custom scorers, we will use this pipeline and apply it to all datapoints of the ECG example dataset.

.. GENERATED FROM PYTHON SOURCE LINES 42-82

.. code-block:: default

    from examples.algorithms.algorithms_qrs_detection_final import (
        match_events_with_reference,
    )
    from examples.datasets.datasets_final_ecg import ECGExampleData

    try:
        HERE = Path(__file__).parent
    except NameError:
        HERE = Path().resolve()
    data_path = HERE.parent.parent / "example_data/ecg_mit_bih_arrhythmia/data"
    example_data = ECGExampleData(data_path)

    import pandas as pd
    from joblib.memory import Memory
    from tpcp import Parameter, Pipeline, cf

    from examples.algorithms.algorithms_qrs_detection_final import (
        QRSDetector,
        precision_recall_f1_score,
    )


    class MyPipeline(Pipeline[ECGExampleData]):
        algorithm: Parameter[QRSDetector]

        r_peak_positions_: pd.Series

        def __init__(self, algorithm: QRSDetector = cf(QRSDetector())):
            self.algorithm = algorithm

        def run(self, datapoint: ECGExampleData):
            # Note: We need to clone the algorithm instance, to make sure we don't leak any data between runs.
            algo = self.algorithm.clone()
            algo.detect(datapoint.data["ecg"], datapoint.sampling_rate_hz)

            self.r_peak_positions_ = algo.r_peak_positions_
            return self


.. GENERATED FROM PYTHON SOURCE LINES 83-84

We set up a global cache for our pipeline to speed up the repeated evaluation we do below.

.. GENERATED FROM PYTHON SOURCE LINES 84-94

.. code-block:: default

    from tpcp.caching import global_disk_cache

    global_disk_cache(
        memory=Memory("./.cache"),
        restore_in_parallel_process=True,
        action_method_name="run",
    )(MyPipeline)

    pipe = MyPipeline()


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    /home/docs/checkouts/readthedocs.org/user_builds/tpcp/checkouts/v2.2.0/examples/validation/_03_custom_scorer.py:86: UserWarning: Global caching is a little tricky to get right and our implementation is not yet battle-tested. Please double check that the results are correct and report any issues you find.
      global_disk_cache(


.. GENERATED FROM PYTHON SOURCE LINES 95-110

No Aggregation
--------------
Sometimes you might want to return data from a score function that should not be aggregated.
This could be arbitrary metadata or scores with values that cannot be averaged.
In this case you can simply use the :func:`~tpcp.validate.no_agg` aggregator.
Values wrapped this way are returned only in the single results and do not appear in the aggregated results.

In the example below, we will calculate the precision, recall and f1-score for each datapoint and in addition return
the number of labeled reference values as "metadata".
This metadata will not be aggregated, but still be available in the single results.
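
The idea can be pictured with a small plain-Python sketch.
The `NoAgg` wrapper and the `aggregate_scores` helper below are hypothetical stand-ins for illustration only; they are
not how tpcp implements `no_agg`.

.. code-block:: python

    from dataclasses import dataclass
    from statistics import mean
    from typing import Any


    @dataclass
    class NoAgg:
        """Hypothetical marker: the wrapped value is passed through, never aggregated."""

        value: Any


    def aggregate_scores(scores: list[dict[str, Any]]) -> tuple[dict[str, float], dict[str, list]]:
        """Return (aggregated, single) results, skipping aggregation for NoAgg values."""
        aggregated, single = {}, {}
        for name in scores[0]:
            values = [s[name] for s in scores]
            if isinstance(values[0], NoAgg):
                # Keep the unwrapped values only in the single results.
                single[name] = [v.value for v in values]
            else:
                single[name] = values
                aggregated[name] = mean(values)
        return aggregated, single


    agg, single = aggregate_scores(
        [
            {"f1_score": 1.0, "n_labels": NoAgg(10)},
            {"f1_score": 0.5, "n_labels": NoAgg(20)},
        ]
    )
    print(agg)  # only "f1_score" is aggregated
    print(single["n_labels"])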

.. note:: At the moment we don't support returning only non-aggregated values from a scorer.
          At least one value must be aggregated, so that it can be used to rank results.
          If you really need this (e.g. in combination with :func:`~tpcp.validate.validate`), you can return a dummy
          value that is aggregated, but otherwise ignored.

.. GENERATED FROM PYTHON SOURCE LINES 110-133

.. code-block:: default

    from tpcp.validate import no_agg


    def score(pipeline: MyPipeline, datapoint: ECGExampleData):
        # We use the `safe_run` wrapper instead of just run. This is always a good idea.
        # We don't need to clone the pipeline here, as GridSearch will already clone the pipeline internally and `run`
        # will clone it again.
        pipeline = pipeline.safe_run(datapoint)
        tolerance_s = 0.02  # We just use 20 ms for this example
        matches = match_events_with_reference(
            pipeline.r_peak_positions_.to_numpy(),
            datapoint.r_peak_positions_.to_numpy(),
            tolerance=tolerance_s * datapoint.sampling_rate_hz,
        )
        precision, recall, f1_score = precision_recall_f1_score(matches)
        return {
            "precision": precision,
            "recall": recall,
            "f1_score": f1_score,
            "n_labels": no_agg(len(datapoint.r_peak_positions_)),
        }


.. GENERATED FROM PYTHON SOURCE LINES 134-135

We can see that `n_labels` is not contained in the aggregated results.

.. GENERATED FROM PYTHON SOURCE LINES 135-140

.. code-block:: default

    from tpcp.validate import Scorer

    no_agg_agg, no_agg_single = Scorer(score)(pipe, example_data)
    no_agg_agg


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Datapoints:   0%|          | 0/12 [00:00<?, ?it/s]________________________________________________________________________________
    [Memory] Calling tpcp.caching.global_disk_cache.<locals>.inner.<locals>.cached_action_method...
    cached_action_method(MyPipeline(algorithm=QRSDetector(high_pass_filter_cutoff_hz=1, max_heart_rate_bpm=200.0, min_r_peak_height_over_baseline=1.0)), 
    None, 'run', ECGExampleData [1 groups/rows]

         patient_group participant
       0       group_1         100)
    _____________________________________________cached_action_method - 0.0s, 0.0min
    ________________________________________________________________________________
    [Memory] Calling tpcp.caching.global_disk_cache.<locals>.inner.<locals>.cached_action_method...
    cached_action_method(MyPipeline(algorithm=QRSDetector(high_pass_filter_cutoff_hz=1, max_heart_rate_bpm=200.0, min_r_peak_height_over_baseline=1.0)), 
    None, 'run', ECGExampleData [1 groups/rows]

         patient_group participant
       0       group_2         102)
    _____________________________________________cached_action_method - 0.0s, 0.0min
    Datapoints:  17%|█▋        | 2/12 [00:00<00:00, 16.43it/s]________________________________________________________________________________
    [Memory] Calling tpcp.caching.global_disk_cache.<locals>.inner.<locals>.cached_action_method...
    cached_action_method(MyPipeline(algorithm=QRSDetector(high_pass_filter_cutoff_hz=1, max_heart_rate_bpm=200.0, min_r_peak_height_over_baseline=1.0)), 
    None, 'run', ECGExampleData [1 groups/rows]

         patient_group participant
       0       group_3         104)
    _____________________________________________cached_action_method - 0.0s, 0.0min
    ________________________________________________________________________________
    [Memory] Calling tpcp.caching.global_disk_cache.<locals>.inner.<locals>.cached_action_method...
    cached_action_method(MyPipeline(algorithm=QRSDetector(high_pass_filter_cutoff_hz=1, max_heart_rate_bpm=200.0, min_r_peak_height_over_baseline=1.0)), 
    None, 'run', ECGExampleData [1 groups/rows]

         patient_group participant
       0       group_1         105)
    _____________________________________________cached_action_method - 0.0s, 0.0min
    Datapoints:  33%|███▎      | 4/12 [00:00<00:00, 17.57it/s]________________________________________________________________________________
    [Memory] Calling tpcp.caching.global_disk_cache.<locals>.inner.<locals>.cached_action_method...
    cached_action_method(MyPipeline(algorithm=QRSDetector(high_pass_filter_cutoff_hz=1, max_heart_rate_bpm=200.0, min_r_peak_height_over_baseline=1.0)), 
    None, 'run', ECGExampleData [1 groups/rows]

         patient_group participant
       0       group_2         106)
    _____________________________________________cached_action_method - 0.0s, 0.0min
    ________________________________________________________________________________
    [Memory] Calling tpcp.caching.global_disk_cache.<locals>.inner.<locals>.cached_action_method...
    cached_action_method(MyPipeline(algorithm=QRSDetector(high_pass_filter_cutoff_hz=1, max_heart_rate_bpm=200.0, min_r_peak_height_over_baseline=1.0)), 
    None, 'run', ECGExampleData [1 groups/rows]

         patient_group participant
       0       group_3         108)
    _____________________________________________cached_action_method - 0.0s, 0.0min
    Datapoints:  50%|█████     | 6/12 [00:00<00:00, 18.19it/s]________________________________________________________________________________
    [Memory] Calling tpcp.caching.global_disk_cache.<locals>.inner.<locals>.cached_action_method...
    cached_action_method(MyPipeline(algorithm=QRSDetector(high_pass_filter_cutoff_hz=1, max_heart_rate_bpm=200.0, min_r_peak_height_over_baseline=1.0)), 
    None, 'run', ECGExampleData [1 groups/rows]

         patient_group participant
       0       group_1         114)
    _____________________________________________cached_action_method - 0.0s, 0.0min
    ________________________________________________________________________________
    [Memory] Calling tpcp.caching.global_disk_cache.<locals>.inner.<locals>.cached_action_method...
    cached_action_method(MyPipeline(algorithm=QRSDetector(high_pass_filter_cutoff_hz=1, max_heart_rate_bpm=200.0, min_r_peak_height_over_baseline=1.0)), 
    None, 'run', ECGExampleData [1 groups/rows]

         patient_group participant
       0       group_2         116)
    _____________________________________________cached_action_method - 0.0s, 0.0min
    Datapoints:  67%|██████▋   | 8/12 [00:00<00:00, 18.39it/s]________________________________________________________________________________
    [Memory] Calling tpcp.caching.global_disk_cache.<locals>.inner.<locals>.cached_action_method...
    cached_action_method(MyPipeline(algorithm=QRSDetector(high_pass_filter_cutoff_hz=1, max_heart_rate_bpm=200.0, min_r_peak_height_over_baseline=1.0)), 
    None, 'run', ECGExampleData [1 groups/rows]

         patient_group participant
       0       group_3         119)
    _____________________________________________cached_action_method - 0.0s, 0.0min
    ________________________________________________________________________________
    [Memory] Calling tpcp.caching.global_disk_cache.<locals>.inner.<locals>.cached_action_method...
    cached_action_method(MyPipeline(algorithm=QRSDetector(high_pass_filter_cutoff_hz=1, max_heart_rate_bpm=200.0, min_r_peak_height_over_baseline=1.0)), 
    None, 'run', ECGExampleData [1 groups/rows]

         patient_group participant
       0       group_1         121)
    _____________________________________________cached_action_method - 0.0s, 0.0min
    Datapoints:  83%|████████▎ | 10/12 [00:00<00:00, 16.53it/s]________________________________________________________________________________
    [Memory] Calling tpcp.caching.global_disk_cache.<locals>.inner.<locals>.cached_action_method...
    cached_action_method(MyPipeline(algorithm=QRSDetector(high_pass_filter_cutoff_hz=1, max_heart_rate_bpm=200.0, min_r_peak_height_over_baseline=1.0)), 
    None, 'run', ECGExampleData [1 groups/rows]

         patient_group participant
       0       group_2         123)
    _____________________________________________cached_action_method - 0.0s, 0.0min
    ________________________________________________________________________________
    [Memory] Calling tpcp.caching.global_disk_cache.<locals>.inner.<locals>.cached_action_method...
    cached_action_method(MyPipeline(algorithm=QRSDetector(high_pass_filter_cutoff_hz=1, max_heart_rate_bpm=200.0, min_r_peak_height_over_baseline=1.0)), 
    None, 'run', ECGExampleData [1 groups/rows]

         patient_group participant
       0       group_3         200)
    _____________________________________________cached_action_method - 0.0s, 0.0min
    Datapoints: 100%|██████████| 12/12 [00:00<00:00, 15.49it/s]    Datapoints: 100%|██████████| 12/12 [00:00<00:00, 16.44it/s]

    {'precision': 0.9929358534618008, 'recall': 0.6737755326205007, 'f1_score': 0.7089727629059107}


.. GENERATED FROM PYTHON SOURCE LINES 141-142

But we can still access the value in the single results.

.. GENERATED FROM PYTHON SOURCE LINES 142-144

.. code-block:: default

    no_agg_single["n_labels"]


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    [2273, 2187, 2229, 2572, 2027, 1763, 1879, 2412, 1987, 1863, 1518, 2601]


.. GENERATED FROM PYTHON SOURCE LINES 145-156

Custom Median Scorer
--------------------
If we want to change the way the scores are aggregated, we can use a custom aggregator.
For simple cases, this does not require implementing a new class; we can use the
:class:`~tpcp.validate.FloatAggregator` directly.
It assumes that we have a function that takes a sequence of floats and returns a float.

Aggregators are simply instances of :class:`~tpcp.validate.Aggregator` classes.
So we can create a new instance of the :class:`~tpcp.validate.FloatAggregator` with a new function.

Below we simply use the median as an example.
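
To see why the choice of aggregation can matter, consider the following plain-Python sketch.
The values are the per-datapoint F1-scores of this example (shown in the single results further below), rounded to
four decimals.
Only the standard library is used here, not tpcp.

.. code-block:: python

    from statistics import mean, median

    # Per-datapoint F1-scores from this example, rounded to four decimals.
    f1_scores = [0.9993, 0.8673, 0.9336, 0.9788, 0.9011, 0.0847,
                 0.0314, 0.9938, 0.9987, 0.0064, 1.0, 0.7124]

    # A few very poor recordings pull the mean down, while the median is barely affected.
    print(f"mean:   {mean(f1_scores):.3f}")
    print(f"median: {median(f1_scores):.3f}")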

.. GENERATED FROM PYTHON SOURCE LINES 156-161

.. code-block:: default

    import numpy as np
    from tpcp.validate import FloatAggregator

    median_agg = FloatAggregator(np.median)


.. GENERATED FROM PYTHON SOURCE LINES 162-164

Then we reuse the score function from before and wrap the F1-score with the median aggregator.
For all other values, the default aggregator will be used (which is the mean).

.. GENERATED FROM PYTHON SOURCE LINES 164-188

.. code-block:: default


    # .. warning:: Note that your score function must return the same aggregator for a score across all datapoints.
    #              If not, we will raise an error!
    def score(pipeline: MyPipeline, datapoint: ECGExampleData):
        # We use the `safe_run` wrapper instead of just run. This is always a good idea.
        # We don't need to clone the pipeline here, as GridSearch will already clone the pipeline internally and `run`
        # will clone it again.
        pipeline = pipeline.safe_run(datapoint)
        tolerance_s = 0.02  # We just use 20 ms for this example
        matches = match_events_with_reference(
            pipeline.r_peak_positions_.to_numpy(),
            datapoint.r_peak_positions_.to_numpy(),
            tolerance=tolerance_s * datapoint.sampling_rate_hz,
        )
        precision, recall, f1_score = precision_recall_f1_score(matches)
        return {
            "precision": precision,
            "recall": recall,
            "f1_score": f1_score,
            "median_f1_score": median_agg(f1_score),
        }


.. GENERATED FROM PYTHON SOURCE LINES 189-192

.. code-block:: default

    median_results_agg, median_results_single = Scorer(score)(pipe, example_data)
    median_results_agg


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Datapoints:   0%|          | 0/12 [00:00<?, ?it/s]    Datapoints:  58%|█████▊    | 7/12 [00:00<00:00, 63.66it/s]    Datapoints: 100%|██████████| 12/12 [00:00<00:00, 64.91it/s]

    {'precision': 0.9929358534618008, 'recall': 0.6737755326205007, 'f1_score': 0.7089727629059107, 'median_f1_score': 0.9173713364633038}


.. GENERATED FROM PYTHON SOURCE LINES 193-200

.. code-block:: default

    assert median_results_agg["median_f1_score"] == np.median(
        median_results_single["f1_score"]
    )
    assert median_results_agg["f1_score"] == np.mean(
        median_results_single["f1_score"]
    )


.. GENERATED FROM PYTHON SOURCE LINES 201-204

.. note:: We could also change the default aggregator for all scores by using the `default_aggregator` parameter of
          the :class:`~tpcp.validate.Scorer` class.

Below, we use this parameter to apply the median aggregator to all scores.

.. GENERATED FROM PYTHON SOURCE LINES 204-208

.. code-block:: default

    all_median_results_agg, all_median_results_single = Scorer(
        score, default_aggregator=median_agg
    )(pipe, example_data)
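    # NOTE: This displays the result of the previous run (`median_results_agg`), which is why the output below still
    # shows mean-aggregated values. Inspect `all_median_results_agg` to see the median-aggregated scores.
    median_results_agg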


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Datapoints:   0%|          | 0/12 [00:00<?, ?it/s]    Datapoints:  58%|█████▊    | 7/12 [00:00<00:00, 61.90it/s]    Datapoints: 100%|██████████| 12/12 [00:00<00:00, 63.54it/s]

    {'precision': 0.9929358534618008, 'recall': 0.6737755326205007, 'f1_score': 0.7089727629059107, 'median_f1_score': 0.9173713364633038}


.. GENERATED FROM PYTHON SOURCE LINES 209-210

The assertions below confirm that all scores are now aggregated using the median.

.. GENERATED FROM PYTHON SOURCE LINES 210-217

.. code-block:: default

    assert all_median_results_agg["f1_score"] == np.median(
        all_median_results_single["f1_score"]
    )
    assert all_median_results_agg["precision"] == np.median(
        all_median_results_single["precision"]
    )


.. GENERATED FROM PYTHON SOURCE LINES 218-228

Multi-Return Aggregator
-----------------------
Sometimes an aggregator needs to return multiple values.
We can easily do that, by returning a dict from the `aggregate` method or in case of the `FloatAggregator` by passing
a function that returns a dict.

As an example, we will calculate the mean and standard deviation of the returned scores in one aggregation.
This could be applied individually to each score (as seen in the previous example) or to all scores at once using
the `default_aggregator` parameter.
We will demonstrate the latter here.

.. GENERATED FROM PYTHON SOURCE LINES 228-256

.. code-block:: default


    def mean_and_std(vals: Sequence[float]):
        return {"mean": float(np.mean(vals)), "std": float(np.std(vals))}


    mean_and_std_agg = FloatAggregator(mean_and_std)


    def score(pipeline: MyPipeline, datapoint: ECGExampleData):
        # We use the `safe_run` wrapper instead of just run. This is always a good idea.
        # We don't need to clone the pipeline here, as GridSearch will already clone the pipeline internally and `run`
        # will clone it again.
        pipeline = pipeline.safe_run(datapoint)
        tolerance_s = 0.02  # We just use 20 ms for this example
        matches = match_events_with_reference(
            pipeline.r_peak_positions_.to_numpy(),
            datapoint.r_peak_positions_.to_numpy(),
            tolerance=tolerance_s * datapoint.sampling_rate_hz,
        )
        precision, recall, f1_score = precision_recall_f1_score(matches)
        return {"precision": precision, "recall": recall, "f1_score": f1_score}


    multi_agg_agg, multi_agg_single = Scorer(
        score, default_aggregator=mean_and_std_agg
    )(pipe, example_data)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Datapoints:   0%|          | 0/12 [00:00<?, ?it/s]    Datapoints:  58%|█████▊    | 7/12 [00:00<00:00, 64.75it/s]    Datapoints: 100%|██████████| 12/12 [00:00<00:00, 66.09it/s]


.. GENERATED FROM PYTHON SOURCE LINES 257-258

When multiple values are returned, the names are concatenated with the names of the scores using `__`.
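
This naming scheme can be illustrated with a short plain-Python sketch.
The helper `flatten_aggregated` is hypothetical and only mimics the naming convention described above; it is not part
of tpcp.

.. code-block:: python

    from statistics import mean, pstdev


    def mean_and_std(vals: list[float]) -> dict[str, float]:
        # `pstdev` is the population standard deviation, matching the default of `np.std`.
        return {"mean": mean(vals), "std": pstdev(vals)}


    def flatten_aggregated(single: dict[str, list[float]]) -> dict[str, float]:
        """Apply `mean_and_std` to every score and join the names with `__`."""
        flat = {}
        for score_name, values in single.items():
            for agg_name, agg_value in mean_and_std(values).items():
                flat[f"{score_name}__{agg_name}"] = agg_value
        return flat


    result = flatten_aggregated({"precision": [1.0, 0.5], "recall": [0.25, 0.75]})
    print(sorted(result))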

.. GENERATED FROM PYTHON SOURCE LINES 258-260

.. code-block:: default

    multi_agg_agg


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    {'precision__mean': 0.9929358534618008, 'precision__std': 0.009342032241600755, 'recall__mean': 0.6737755326205007, 'recall__std': 0.39661575634936475, 'f1_score__mean': 0.7089727629059107, 'f1_score__std': 0.39387732846763174}


.. GENERATED FROM PYTHON SOURCE LINES 261-273

Macro Aggregation
-----------------
In some datasets (in particular, when we have multiple recordings per participant), we might want to calculate a
single performance value for each participant and then average these values.
Fundamentally, this is a little tricky with tpcp, as all of our processing happens per datapoint, and each datapoint
is usually one recording, to simplify the pipeline structures.

Hence, we need to shift some of the aggregation complexity into our scoring function.
As this is a little complicated and such a common usecase, tpcp provides a helper class for this:
:class:`~tpcp.validate.MacroFloatAggregator`.
It allows us to define an initial grouping based on the dataset index columns and define how values are aggregated
first per group and then across all groups.
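
The two-step aggregation can be sketched in plain Python as follows.
The data and the helper name `macro_average` are made up for illustration; this is not the implementation of
`MacroFloatAggregator`.

.. code-block:: python

    from collections import defaultdict
    from statistics import mean

    # Hypothetical (group, score) pairs: group "a" has three recordings, group "b" only one.
    scores = [("a", 1.0), ("a", 1.0), ("a", 1.0), ("b", 0.0)]


    def macro_average(pairs: list[tuple[str, float]]) -> tuple[dict[str, float], float]:
        """Average per group first, then average the group means."""
        grouped = defaultdict(list)
        for group, value in pairs:
            grouped[group].append(value)
        per_group = {group: mean(values) for group, values in grouped.items()}
        return per_group, mean(list(per_group.values()))


    per_group, macro = macro_average(scores)
    plain_mean = mean(value for _, value in scores)
    print(per_group, macro, plain_mean)

With unequal group sizes, as in this sketch, the macro average differs from the plain mean over all values.
With equally sized groups and the mean used in both steps, the two coincide.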

.. GENERATED FROM PYTHON SOURCE LINES 273-279

.. code-block:: default

    from tpcp.validate import MacroFloatAggregator

    macro_average_patient_group = MacroFloatAggregator(
        groupby="patient_group", group_agg=np.mean, final_agg=np.mean
    )


.. GENERATED FROM PYTHON SOURCE LINES 280-281

We will apply this aggregation to the F1-score:

.. GENERATED FROM PYTHON SOURCE LINES 281-302

.. code-block:: default


    def score(pipeline: MyPipeline, datapoint: ECGExampleData):
        # We use the `safe_run` wrapper instead of just run. This is always a good idea.
        # We don't need to clone the pipeline here, as GridSearch will already clone the pipeline internally and `run`
        # will clone it again.
        pipeline = pipeline.safe_run(datapoint)
        tolerance_s = 0.02  # We just use 20 ms for this example
        matches = match_events_with_reference(
            pipeline.r_peak_positions_.to_numpy(),
            datapoint.r_peak_positions_.to_numpy(),
            tolerance=tolerance_s * datapoint.sampling_rate_hz,
        )
        precision, recall, f1_score = precision_recall_f1_score(matches)
        return {
            "precision": precision,
            "recall": recall,
            "f1_score": macro_average_patient_group(f1_score),
        }


.. GENERATED FROM PYTHON SOURCE LINES 303-305

We can see that we now get the "single" aggregation values per group (`f1_score__{group}`) and the final aggregated
values (`f1_score__macro`).

.. GENERATED FROM PYTHON SOURCE LINES 305-308

.. code-block:: default

    macro_agg, macro_single = Scorer(score)(pipe, example_data)
    macro_agg


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Datapoints:   0%|          | 0/12 [00:00<?, ?it/s]    Datapoints:  58%|█████▊    | 7/12 [00:00<00:00, 64.01it/s]    Datapoints: 100%|██████████| 12/12 [00:00<00:00, 65.40it/s]
    /home/docs/checkouts/readthedocs.org/user_builds/tpcp/checkouts/v2.2.0/src/tpcp/validate/_scorer.py:216: FutureWarning: The provided callable <function mean at 0x74d83d134d30> is currently using SeriesGroupBy.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.
      per_group = data.groupby(self.groupby).agg(self.group_agg)

    {'precision': 0.9929358534618008, 'recall': 0.6737755326205007, 'f1_score__group_1': 0.5039949762609272, 'f1_score__group_2': 0.9405469879110494, 'f1_score__group_3': 0.6823763245457557, 'f1_score__macro': np.float64(0.7089727629059107)}


.. GENERATED FROM PYTHON SOURCE LINES 309-310

The raw values are still available in the single results.

.. GENERATED FROM PYTHON SOURCE LINES 310-312

.. code-block:: default

    macro_single["f1_score"]


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    [np.float64(0.9993396434074401), np.float64(0.8673338465486272), np.float64(0.9336437718277065), np.float64(0.9787896477913991), np.float64(0.9010989010989011), np.float64(0.08473655621944595), np.float64(0.03143006809848088), np.float64(0.9937552039966694), np.float64(0.99874213836478), np.float64(0.006420545746388443), np.float64(1.0), np.float64(0.7123828317710903)]


.. GENERATED FROM PYTHON SOURCE LINES 313-317

So far we did not need to implement a fully custom aggregation, as `tpcp` provides helpers for typical
usecases.
However, if you need to do more complicated things with your score values, or pass other things than floats to your
scores, you will need a custom aggregator as shown in the next example.

.. GENERATED FROM PYTHON SOURCE LINES 319-343

Fully Custom Aggregation
------------------------
In the next example, we want to aggregate on a "lower" level than a single datapoint.
In the previous example, we wanted to aggregate first on a "higher" level than a single datapoint.
In that case, tpcp could provide a helper, as the higher levels were defined by the used `Dataset`.
Hence, we could make some assumptions about what the passed data will look like.

However, if you want to go more granular than a single datapoint, we cannot know what data structures you are dealing
with.
Therefore, you need to create a completely custom aggregation by subclassing :class:`~tpcp.validate.Aggregator`.

Below we show an example where we calculate the precision, recall and f1-score without aggregating on a datapoint
level, but rather first combining all predictions and references across all datapoints before calculating the
precision, recall and f1-score.

There are no restrictions on the data you can pass from the scorer.
Only the aggregator needs to be able to handle the values and then return a float or a dict with float values.

For that we return the raw `matches` from the score function and wrap them into an aggregator that concatenates all
of them before passing them to the `precision_recall_f1_score` function.

Note that the actual aggregator is an instance of our custom class, NOT the class itself.

.. GENERATED FROM PYTHON SOURCE LINES 343-380

.. code-block:: default

    from tpcp.validate import Aggregator


    class SingleValuePrecisionRecallF1(Aggregator[np.ndarray]):
        def aggregate(
            self, /, values: Sequence[np.ndarray], **_
        ) -> dict[str, float]:
            print("SingleValuePrecisionRecallF1 Aggregator called")
            precision, recall, f1_score = precision_recall_f1_score(
                np.vstack(values)
            )
            return {"precision": precision, "recall": recall, "f1_score": f1_score}


    single_value_precision_recall_f1_agg = SingleValuePrecisionRecallF1()


    def score(pipeline: MyPipeline, datapoint: ECGExampleData):
        # We use the `safe_run` wrapper instead of just run. This is always a good idea.
        # We don't need to clone the pipeline here, as GridSearch will already clone the pipeline internally and `run`
        # will clone it again.
        pipeline = pipeline.safe_run(datapoint)
        tolerance_s = 0.02  # We just use 20 ms for this example
        matches = match_events_with_reference(
            pipeline.r_peak_positions_.to_numpy(),
            datapoint.r_peak_positions_.to_numpy(),
            tolerance=tolerance_s * datapoint.sampling_rate_hz,
        )
        precision, recall, f1_score = precision_recall_f1_score(matches)
        return {
            "precision": precision,
            "recall": recall,
            "f1_score": f1_score,
            "per_sample": single_value_precision_recall_f1_agg(matches),
        }


.. GENERATED FROM PYTHON SOURCE LINES 381-384

We can see that we now get the values per datapoint (as before) and the values without previous aggregation.
From a scientific perspective, we can see that these values are quite different.
Again, which version to choose for scoring will depend on the use case.
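
Why the two variants can differ is easiest to see with a small plain-Python sketch.
The counts below are made up and `f1_from_counts` is a hypothetical helper, not the `precision_recall_f1_score`
function used in this example.

.. code-block:: python

    from statistics import mean


    def f1_from_counts(tp: int, fp: int, fn: int) -> float:
        """F1-score from true-positive, false-positive, and false-negative counts."""
        return 2 * tp / (2 * tp + fp + fn)


    # Hypothetical (tp, fp, fn) counts for two datapoints of very different size.
    counts = [(90, 5, 5), (1, 0, 9)]

    # Variant 1: calculate F1 per datapoint, then average.
    per_datapoint_f1 = mean(f1_from_counts(*c) for c in counts)

    # Variant 2: pool all counts first, then calculate a single F1.
    pooled = [sum(col) for col in zip(*counts)]
    pooled_f1 = f1_from_counts(*pooled)

    print(f"mean of per-datapoint F1: {per_datapoint_f1:.3f}")
    print(f"F1 over pooled events:    {pooled_f1:.3f}")

In the per-datapoint variant every datapoint has the same weight, regardless of how many events it contains.
In the pooled variant every event has the same weight, so datapoints with many events dominate the result.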

.. GENERATED FROM PYTHON SOURCE LINES 384-387

.. code-block:: default

    complicated_agg, complicated_single = Scorer(score)(pipe, example_data)
    complicated_agg


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Datapoints:   0%|          | 0/12 [00:00<?, ?it/s]    Datapoints:  58%|█████▊    | 7/12 [00:00<00:00, 63.69it/s]    Datapoints: 100%|██████████| 12/12 [00:00<00:00, 65.77it/s]
    SingleValuePrecisionRecallF1 Aggregator called

    {'precision': 0.9929358534618008, 'recall': 0.6737755326205007, 'f1_score': 0.7089727629059107, 'per_sample__precision': np.float64(0.990271060623102), 'per_sample__recall': np.float64(0.6957054245189839), 'per_sample__f1_score': np.float64(0.8172557027823545)}


.. GENERATED FROM PYTHON SOURCE LINES 388-389

The raw matches array is still available in the `single` results.

.. GENERATED FROM PYTHON SOURCE LINES 389-391

.. code-block:: default

    complicated_single["per_sample"]


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    [array([[0.000e+00, 0.000e+00],
           [1.000e+00, 1.000e+00],
           [2.000e+00, 2.000e+00],
           ...,
           [      nan, 3.670e+02],
           [      nan, 4.930e+02],
           [      nan, 1.906e+03]], shape=(2273, 2)), array([[0.000e+00, 1.000e+00],
           [1.000e+00, 2.000e+00],
           [2.000e+00, 3.000e+00],
           ...,
           [      nan, 2.171e+03],
           [      nan, 2.175e+03],
           [      nan, 2.179e+03]], shape=(2207, 2)), array([[1.000e+00, 1.000e+00],
           [2.000e+00, 2.000e+00],
           [3.000e+00, 3.000e+00],
           ...,
           [      nan, 2.201e+03],
           [      nan, 2.202e+03],
           [      nan, 2.222e+03]], shape=(2290, 2)), array([[0.000e+00, 0.000e+00],
           [1.000e+00, 1.000e+00],
           [2.000e+00, 2.000e+00],
           ...,
           [      nan, 2.417e+03],
           [      nan, 2.443e+03],
           [      nan, 2.445e+03]], shape=(2624, 2)), array([[0.000e+00, 0.000e+00],
           [1.000e+00, 1.000e+00],
           [2.000e+00, 2.000e+00],
           ...,
           [      nan, 2.012e+03],
           [      nan, 2.014e+03],
           [      nan, 2.020e+03]], shape=(2050, 2)), array([[0.000e+00, 3.000e+01],
           [1.000e+00, 4.380e+02],
           [2.000e+00, 4.410e+02],
           ...,
           [      nan, 1.760e+03],
           [      nan, 1.761e+03],
           [      nan, 1.762e+03]], shape=(1763, 2)), array([[0.000e+00, 7.430e+02],
           [1.000e+00, 7.440e+02],
           [2.000e+00, 7.450e+02],
           ...,
           [      nan, 1.876e+03],
           [      nan, 1.877e+03],
           [      nan, 1.878e+03]], shape=(1879, 2)), array([[1.000e+00, 0.000e+00],
           [2.000e+00, 1.000e+00],
           [3.000e+00, 2.000e+00],
           ...,
           [      nan, 1.848e+03],
           [      nan, 1.859e+03],
           [      nan, 1.860e+03]], shape=(2417, 2)), array([[0.000e+00, 0.000e+00],
           [1.000e+00, 1.000e+00],
           [2.000e+00, 2.000e+00],
           ...,
           [1.987e+03,       nan],
           [      nan, 3.100e+01],
           [      nan, 1.699e+03]], shape=(1990, 2)), array([[0.000e+00, 4.000e+00],
           [1.000e+00, 2.480e+02],
           [2.000e+00, 2.610e+02],
           ...,
           [      nan, 1.860e+03],
           [      nan, 1.861e+03],
           [      nan, 1.862e+03]], shape=(1863, 2)), array([[0.000e+00, 0.000e+00],
           [1.000e+00, 1.000e+00],
           [2.000e+00, 2.000e+00],
           ...,
           [1.515e+03, 1.515e+03],
           [1.516e+03, 1.516e+03],
           [1.517e+03, 1.517e+03]], shape=(1518, 2)), array([[0.000e+00, 1.000e+00],
           [1.000e+00, 3.000e+00],
           [2.000e+00, 5.000e+00],
           ...,
           [      nan, 2.596e+03],
           [      nan, 2.597e+03],
           [      nan, 2.599e+03]], shape=(2610, 2))]


.. GENERATED FROM PYTHON SOURCE LINES 392-394

However, we can customize this behaviour for our specific use case by creating an instance of the aggregator with the
`return_raw_scores` parameter set to `False`.

.. GENERATED FROM PYTHON SOURCE LINES 394-419

.. code-block:: default

    single_value_precision_recall_f1_agg_no_raw = SingleValuePrecisionRecallF1(
        return_raw_scores=False
    )


    def score(pipeline: MyPipeline, datapoint: ECGExampleData):
        # We use the `safe_run` wrapper instead of just run. This is always a good idea.
        # We don't need to clone the pipeline here, as GridSearch will already clone the pipeline internally and `run`
        # will clone it again.
        pipeline = pipeline.safe_run(datapoint)
        tolerance_s = 0.02  # We just use 20 ms for this example
        matches = match_events_with_reference(
            pipeline.r_peak_positions_.to_numpy(),
            datapoint.r_peak_positions_.to_numpy(),
            tolerance=tolerance_s * datapoint.sampling_rate_hz,
        )
        precision, recall, f1_score = precision_recall_f1_score(matches)
        return {
            "precision": precision,
            "recall": recall,
            "f1_score": f1_score,
            "per_sample": single_value_precision_recall_f1_agg_no_raw(matches),
        }


.. GENERATED FROM PYTHON SOURCE LINES 420-423

Now we can see that the raw matches array is no longer returned.
If the score function returned only a single value instead of a dict, the single results would just be `None`,
instead of a dict with the respective key missing.

.. GENERATED FROM PYTHON SOURCE LINES 423-429

.. code-block:: default

    complicated_agg_no_raw, complicated_single_no_raw = Scorer(score)(
        pipe, example_data
    )
    complicated_single_no_raw.keys()


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Datapoints:   0%|          | 0/12 [00:00<?, ?it/s]    Datapoints:  58%|█████▊    | 7/12 [00:00<00:00, 64.05it/s]    Datapoints: 100%|██████████| 12/12 [00:00<00:00, 66.00it/s]
    SingleValuePrecisionRecallF1 Aggregator called

    dict_keys(['precision', 'recall', 'f1_score'])


.. GENERATED FROM PYTHON SOURCE LINES 430-441

Generalizing the custom aggregator
----------------------------------
In the previous examples, we created an aggregator that calculates its values after concatenating the raw values of
all datapoints.
However, it only works for the precision, recall and f1-score.
We can generalize this by extracting the calculation of the precision, recall and f1-score into a parameter of
the aggregator.
This way, we can use the same aggregator for different scores.

Note that we don't provide such generalized aggregators in tpcp on purpose, as they really depend on the specific
use case, your data, and the type of scores you want to calculate.
Hence, we recommend using these examples as a starting point for implementing your own custom aggregators.

.. GENERATED FROM PYTHON SOURCE LINES 441-460

.. code-block:: default

    from typing import Callable, Union


    class SingleValueAggregator(Aggregator[np.ndarray]):
        def __init__(
            self,
            func: Callable[[Sequence[np.ndarray]], Union[float, dict[str, float]]],
            *,
            return_raw_scores: bool = True,
        ):
            self.func = func
            super().__init__(return_raw_scores=return_raw_scores)

        def aggregate(
            self, /, values: Sequence[np.ndarray], **_
        ) -> dict[str, float]:
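            # Pass the per-datapoint arrays on unchanged, as the type hint of `func` states.
            # `func` itself decides how to combine them (e.g. via `np.vstack`).
            return self.func(values)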


.. GENERATED FROM PYTHON SOURCE LINES 461-462

With this, our aggregator from before becomes just a special case of the new aggregator.

.. GENERATED FROM PYTHON SOURCE LINES 462-497

.. code-block:: default

    def calculate_precision_recall_f1(
        matches: Sequence[np.ndarray],
    ) -> dict[str, float]:
        precision, recall, f1_score = precision_recall_f1_score(np.vstack(matches))
        return {"precision": precision, "recall": recall, "f1_score": f1_score}


    single_value_precision_recall_f1_agg_from_gen = SingleValueAggregator(
        calculate_precision_recall_f1
    )


    def score(pipeline: MyPipeline, datapoint: ECGExampleData):
        # We use the `safe_run` wrapper instead of just run. This is always a good idea.
        # We don't need to clone the pipeline here, as GridSearch will already clone the pipeline internally and `run`
        # will clone it again.
        pipeline = pipeline.safe_run(datapoint)
        tolerance_s = 0.02  # We just use 20 ms for this example
        matches = match_events_with_reference(
            pipeline.r_peak_positions_.to_numpy(),
            datapoint.r_peak_positions_.to_numpy(),
            tolerance=tolerance_s * datapoint.sampling_rate_hz,
        )
        precision, recall, f1_score = precision_recall_f1_score(matches)
        return {
            "precision": precision,
            "recall": recall,
            "f1_score": f1_score,
            "per_sample": single_value_precision_recall_f1_agg_from_gen(matches),
        }


    complicated_agg, complicated_single = Scorer(score)(pipe, example_data)
    complicated_agg


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Datapoints:   0%|          | 0/12 [00:00<?, ?it/s]    Datapoints:  58%|█████▊    | 7/12 [00:00<00:00, 64.21it/s]    Datapoints: 100%|██████████| 12/12 [00:00<00:00, 66.02it/s]

    {'precision': 0.9929358534618008, 'recall': 0.6737755326205007, 'f1_score': 0.7089727629059107, 'per_sample__precision': np.float64(0.990271060623102), 'per_sample__recall': np.float64(0.6957054245189839), 'per_sample__f1_score': np.float64(0.8172557027823545)}
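
As a tpcp-independent illustration of the idea, the following sketch uses a hypothetical `ToyPooledAggregator` (not
part of tpcp, and much simpler than a real :class:`~tpcp.validate.Aggregator`) to show how passing the calculation in
as a parameter lets one class serve different scores.
All values are made up.

.. code-block:: python

    from collections.abc import Callable, Sequence
    from itertools import chain
    from statistics import median


    class ToyPooledAggregator:
        """Hypothetical stand-in for an aggregator. This class is not part of tpcp."""

        def __init__(self, func: Callable[[list[float]], float]):
            self.func = func

        def aggregate(self, values: Sequence[Sequence[float]]) -> float:
            # Pool the per-datapoint values into one flat list, then apply `func` once.
            pooled = list(chain.from_iterable(values))
            return self.func(pooled)


    # Made-up per-datapoint values (e.g. errors of individual events).
    per_datapoint_values = [[1.0, 2.0, 3.0], [10.0], [4.0, 5.0]]

    pooled_median = ToyPooledAggregator(median).aggregate(per_datapoint_values)
    pooled_max = ToyPooledAggregator(max).aggregate(per_datapoint_values)
    print(pooled_median, pooled_max)

Swapping `median` for `max` changes the calculated score without touching the class itself.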


.. GENERATED FROM PYTHON SOURCE LINES 498-509

We can even move the initialization of the aggregator into the score function, or pass the Aggregator itself as a
parameter.
This allows us to make the score function itself generalizable.

This works because we only check that all aggregators have the same config; we don't require them to all be the
same object.

This allows for quite powerful and flexible scoring functions that we could then use with `partial` to create
different versions of the score function.

While we are at it, we also make the `tolerance_s` a parameter of the score function.

.. GENERATED FROM PYTHON SOURCE LINES 509-536

.. code-block:: default


    def score(
        pipeline: MyPipeline,
        datapoint: ECGExampleData,
        *,
        tolerance_s: float,
        per_sample_agg: Aggregator[np.ndarray],
    ):
        # We use the `safe_run` wrapper instead of just run. This is always a good idea.
        # We don't need to clone the pipeline here, as GridSearch will already clone the pipeline internally and `run`
        # will clone it again.
        pipeline = pipeline.safe_run(datapoint)
        matches = match_events_with_reference(
            pipeline.r_peak_positions_.to_numpy(),
            datapoint.r_peak_positions_.to_numpy(),
            tolerance=tolerance_s * datapoint.sampling_rate_hz,
        )
        precision, recall, f1_score = precision_recall_f1_score(matches)
        return {
            "precision": precision,
            "recall": recall,
            "f1_score": f1_score,
            "per_sample": per_sample_agg(matches),
        }


.. GENERATED FROM PYTHON SOURCE LINES 537-538

With that, we can reconstruct the `return_raw_scores=False` behaviour from before using a partial or a lambda.

.. GENERATED FROM PYTHON SOURCE LINES 538-554

.. code-block:: default

    from functools import partial

    complicated_agg_no_raw, complicated_single_no_raw = Scorer(
        partial(
            score,
            per_sample_agg=SingleValueAggregator(
                calculate_precision_recall_f1, return_raw_scores=False
            ),
            tolerance_s=0.02,
        ),
        # Note: You could also run this with multiple jobs, but this creates issues with the way we test the examples.
        n_jobs=1,
    )(pipe, example_data)

    complicated_agg_no_raw


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Datapoints:   0%|          | 0/12 [00:00<?, ?it/s]    Datapoints:  58%|█████▊    | 7/12 [00:00<00:00, 64.03it/s]    Datapoints: 100%|██████████| 12/12 [00:00<00:00, 65.90it/s]

    {'precision': 0.9929358534618008, 'recall': 0.6737755326205007, 'f1_score': 0.7089727629059107, 'per_sample__precision': np.float64(0.990271060623102), 'per_sample__recall': np.float64(0.6957054245189839), 'per_sample__f1_score': np.float64(0.8172557027823545)}


.. GENERATED FROM PYTHON SOURCE LINES 555-558

.. code-block:: default

    complicated_single_no_raw.keys()


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    dict_keys(['precision', 'recall', 'f1_score'])
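
The choice between a partial and a lambda mentioned above can be illustrated without tpcp.
The following is a minimal sketch with a hypothetical `toy_score` function: both variants fix the keyword-only
configuration and leave a callable with a single required argument.

.. code-block:: python

    from functools import partial


    def toy_score(value: float, *, tolerance: float, scale: float = 1.0) -> bool:
        """Hypothetical score function with keyword-only configuration."""
        return abs(value) * scale <= tolerance


    # Both callables fix `tolerance` and leave `value` as the only required argument.
    strict_partial = partial(toy_score, tolerance=0.02)
    strict_lambda = lambda value: toy_score(value, tolerance=0.02)  # noqa: E731
    # A second version of the same function with a different configuration.
    lenient_partial = partial(toy_score, tolerance=0.5)

    results = [f(0.1) for f in (strict_partial, strict_lambda, lenient_partial)]
    print(results)

A `partial` object additionally exposes the bound configuration through its `keywords` attribute, which can help when
inspecting which version of a score function is in use.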


.. GENERATED FROM PYTHON SOURCE LINES 559-575

After Score Function
--------------------
While writing custom aggregators is a nice way of configuring your scoring on the level of single datapoints, at some
point you might be hiding a lot of complexity in these custom aggregators that is not easy to understand and
discover.
In this case, the `final_aggregator` parameter of the :class:`~tpcp.validate.Scorer` class serves as an escape
hatch.
This function is called with the aggregated and the raw results of the score function, the pipeline object, and the
dataset.
So you should have all the information available to perform any kind of aggregation you want.

Here we will demonstrate how you could implement the custom scorer example from above in a `final_aggregator`.
Which method you prefer is up to you.

For this to work, we pass the raw matches out of the score function to use them in the final aggregation.
There we have full control over the results and can calculate the precision, recall and f1-score across all
datapoints and remove the raw matches from the final results.

.. GENERATED FROM PYTHON SOURCE LINES 575-619

.. code-block:: default


    def score(
        pipeline: MyPipeline, datapoint: ECGExampleData, *, tolerance_s: float
    ):
        # We use the `safe_run` wrapper instead of just run. This is always a good idea.
        # We don't need to clone the pipeline here, as GridSearch will already clone the pipeline internally and `run`
        # will clone it again.
        pipeline = pipeline.safe_run(datapoint)
        matches = match_events_with_reference(
            pipeline.r_peak_positions_.to_numpy(),
            datapoint.r_peak_positions_.to_numpy(),
            tolerance=tolerance_s * datapoint.sampling_rate_hz,
        )
        precision, recall, f1_score = precision_recall_f1_score(matches)
        return {
            "precision": precision,
            "recall": recall,
            "f1_score": f1_score,
            "_raw": no_agg(matches),
        }


    def final_aggregator(
        agg_results: dict[str, float],
        raw_results: dict[str, list],
        pipeline: MyPipeline,
        dataset: ECGExampleData,
    ):
        # We use pop, as we don't want to have the raw matches in our final results, but this is up to you.
        raw_matches = raw_results.pop("_raw")
        matches = np.vstack(raw_matches)
        precision, recall, f1_score = precision_recall_f1_score(matches)
        agg_results["per_sample__precision"] = precision
        agg_results["per_sample__recall"] = recall
        agg_results["per_sample__f1_score"] = f1_score
        return agg_results, raw_results


    agg_final_agg, raw_final_agg = Scorer(
        partial(score, tolerance_s=0.02), final_aggregator=final_aggregator
    )(pipe, example_data)

    agg_final_agg


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Datapoints:   0%|          | 0/12 [00:00<?, ?it/s]    Datapoints:  58%|█████▊    | 7/12 [00:00<00:00, 64.21it/s]    Datapoints: 100%|██████████| 12/12 [00:00<00:00, 65.87it/s]

    {'precision': 0.9929358534618008, 'recall': 0.6737755326205007, 'f1_score': 0.7089727629059107, 'per_sample__precision': np.float64(0.990271060623102), 'per_sample__recall': np.float64(0.6957054245189839), 'per_sample__f1_score': np.float64(0.8172557027823545)}


.. GENERATED FROM PYTHON SOURCE LINES 620-626

We can see that this results in the same output as before, but the calculation is more explicit and not hidden away
in one of the aggregators.
Which version you choose is up to you.

Speaking from experience, using the `final_aggregator` is a better choice for complex one-off evaluations.
In case you plan to reuse the same aggregation multiple times, a custom aggregator might be the better choice.
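
The `precision` and `per_sample__precision` values in the outputs above differ because they pool the data
differently.
A minimal sketch with made-up counts (plain Python, no tpcp involved) shows the effect: averaging per-datapoint
values weights every datapoint equally, while pooling first weights every event equally.

.. code-block:: python

    # Made-up (true positives, false positives) counts for three datapoints.
    counts = [(90, 10), (5, 5), (50, 0)]

    # Variant 1: calculate the precision per datapoint, then take the mean over datapoints.
    per_datapoint = [tp / (tp + fp) for tp, fp in counts]
    mean_precision = sum(per_datapoint) / len(per_datapoint)

    # Variant 2: pool all counts first, then calculate a single precision value.
    total_tp = sum(tp for tp, _ in counts)
    total_fp = sum(fp for _, fp in counts)
    pooled_precision = total_tp / (total_tp + total_fp)

    print(f"mean of per-datapoint precision: {mean_precision:.3f}")
    print(f"pooled precision: {pooled_precision:.3f}")

In this toy data, the small second datapoint pulls the mean down considerably, but has little influence on the pooled
value.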

.. GENERATED FROM PYTHON SOURCE LINES 628-629

Finally, we remove the cache so that other examples are not affected.

.. GENERATED FROM PYTHON SOURCE LINES 629-632

.. code-block:: default

    from tpcp.caching import remove_any_cache

    remove_any_cache(MyPipeline)


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 6.688 seconds)

**Estimated memory usage:**  24 MB


.. _sphx_glr_download_auto_examples_validation__03_custom_scorer.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example


    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: _03_custom_scorer.py <_03_custom_scorer.py>`

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: _03_custom_scorer.ipynb <_03_custom_scorer.ipynb>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_