.. DO NOT EDIT. .. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY. .. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE: .. "auto_examples/recipies/_04_typed_iterator.py" .. LINE NUMBERS ARE GIVEN BELOW. .. only:: html .. note:: :class: sphx-glr-download-link-note :ref:`Go to the end ` to download the full example code .. rst-class:: sphx-glr-example-title .. _sphx_glr_auto_examples_recipies__04_typed_iterator.py: .. _typed_iterator: TypedIterator ============= This example shows how to use the :class:`~tpcp.misc.TypedIterator` class, which might be helpful, when iterating over data and needing to store multiple results for each iteration. The Problem ----------- A very common pattern when working with any type of data is to iterate over it and then apply a series of operations to it. In simple cases you might only want to store the final result, but often you are also interested in intermediate or alternative outputs. What typically happens, is that you create multiple empty lists or dictionaries (one for each result) and then append the results to them during the iteration. At the end you might apply further operations to the results, e.g. aggregations. Below is a simple example of this pattern: .. GENERATED FROM PYTHON SOURCE LINES 23-44 .. code-block:: default data = [1, 2, 3, 4, 5] result_1 = [] result_2 = [] result_3 = [] for d in data: intermediate_result_1 = d * 3 result_1.append(intermediate_result_1) intermediate_result_2 = intermediate_result_1 * 2 result_2.append(intermediate_result_2) final_result_3 = intermediate_result_2 - 4 result_3.append(final_result_3) # An example aggregation result_1 = sum(result_1) print(result_1) print(result_2) print(result_3) .. rst-class:: sphx-glr-script-out .. code-block:: none 45 [6, 12, 18, 24, 30] [2, 8, 14, 20, 26] .. GENERATED FROM PYTHON SOURCE LINES 45-65 Fundamentally, this pattern works well. However, it does not really fit into the idea of declarative code that we are trying to achieve with tpcp. While programming, there are 3 places where you need to think about the result and the result types. This makes it harder to reason about the code and also makes it harder to change the code later on. In addition, the main pipeline code, which should be the most important part of the code, is cluttered with boilerplate code concerned with just storing the results. While we could fix some of these issues by refactoring a little, with `TypedIterator` we provide (in our opinion) a much cleaner solution. The basic idea of `TypedIterator` is to provide a way to specify all configuration (i.e. what results to expect and how to aggregate them) in one place at the beginning. It further simplifies how to store results, by inverting the data structure. Instead of worrying about one data structure for each result, you only need to worry about one data structure for each iteration. Using dataclasses, these objects are also typed, preventing typos and providing IDE support. Let's rewrite the above example using `TypedIterator`: 1. We define our result-datatype as a dataclass. .. GENERATED FROM PYTHON SOURCE LINES 65-75 .. code-block:: default from dataclasses import dataclass @dataclass class ResultType: result_1: int result_2: int result_3: int .. GENERATED FROM PYTHON SOURCE LINES 76-79 2. We define the aggregations we want to apply to the results. If we don't want to aggregate a result, we simply don't add it to the list. We provide some more explanation on aggregations below, just accept this for now. .. GENERATED FROM PYTHON SOURCE LINES 79-83 .. code-block:: default aggregations = [ ("result_1", lambda _, results: sum(results)), ] .. GENERATED FROM PYTHON SOURCE LINES 84-87 3. We create a new instance of `TypedIterator` with the result type and the aggregations. We use the "square bracket" typing syntax to bind the output datatype. This way, our IDE is able to autocomplete the attributes of the result type. .. GENERATED FROM PYTHON SOURCE LINES 87-91 .. code-block:: default from tpcp.misc import TypedIterator iterator = TypedIterator[ResultType](ResultType, aggregations=aggregations) .. GENERATED FROM PYTHON SOURCE LINES 92-93 Now we can iterate over our data and get a result object for each iteration, that we can then fill with the results. .. GENERATED FROM PYTHON SOURCE LINES 93-98 .. code-block:: default for d, r in iterator.iterate(data): r.result_1 = d * 3 r.result_2 = r.result_1 * 2 r.result_3 = r.result_2 - 4 .. GENERATED FROM PYTHON SOURCE LINES 99-103 You can access the data in two different ways. 1. Using the ``results_`` attribute, which is an instance of ``ResultType`. Just note that the typing of the attributes is incorrect. .. GENERATED FROM PYTHON SOURCE LINES 103-105 .. code-block:: default iterator.results_ .. rst-class:: sphx-glr-script-out .. code-block:: none ResultType(result_1=45, result_2=[6, 12, 18, 24, 30], result_3=[2, 8, 14, 20, 26]) .. GENERATED FROM PYTHON SOURCE LINES 106-107 However, the big advantage of this approach is that your IDE should be able to autocomplete the attributes. .. GENERATED FROM PYTHON SOURCE LINES 107-109 .. code-block:: default iterator.results_.result_1 .. rst-class:: sphx-glr-script-out .. code-block:: none 45 .. GENERATED FROM PYTHON SOURCE LINES 110-113 2. Alternative you can access the results as dynamically assignes attributes of the iterator. Note, that you need to add a trailing underscore to the attribute name. As we are following the typically tpcp convention of using trailing underscores for result attributes. .. GENERATED FROM PYTHON SOURCE LINES 113-117 .. code-block:: default print(iterator.result_1_) print(iterator.result_2_) print(iterator.result_3_) .. rst-class:: sphx-glr-script-out .. code-block:: none 45 [6, 12, 18, 24, 30] [2, 8, 14, 20, 26] .. GENERATED FROM PYTHON SOURCE LINES 118-119 The raw results are available as a list of dataclass instances. .. GENERATED FROM PYTHON SOURCE LINES 119-121 .. code-block:: default iterator.raw_results_ .. rst-class:: sphx-glr-script-out .. code-block:: none [ResultType(result_1=3, result_2=6, result_3=2), ResultType(result_1=6, result_2=12, result_3=8), ResultType(result_1=9, result_2=18, result_3=14), ResultType(result_1=12, result_2=24, result_3=20), ResultType(result_1=15, result_2=30, result_3=26)] .. GENERATED FROM PYTHON SOURCE LINES 122-135 While this version of the code required a couple more lines, it is much easier to understand and reason about. It clearly separates the configuration from the actual code and the core pipeline code is much cleaner. A real-world example -------------------- Below we apply this pattern to a pipeline that iterates over an actual dataset. The return types are a little bit more complex to show some more advanced features of aggregations. For this example we apply the QRS detection algorithm to the ECG dataset demonstrated in some of the other examples. The QRS detection algorithm only has a single output. Hence, we use the "number of r-peaks" as a second result here to demonstrate the use case. Again we start by defining the result dataclass. .. GENERATED FROM PYTHON SOURCE LINES 135-146 .. code-block:: default import pandas as pd @dataclass class QRSResultType: """The result type of the QRS detection algorithm.""" r_peak_positions: pd.Series n_r_peaks: int .. GENERATED FROM PYTHON SOURCE LINES 147-153 For the aggregations, we want to concatenate the r-peak positions. The aggregation function gets the list of inputs as the first argument and the list of results as the second argument. We can use this to create a combined dataframe with a proper index. We turn the `n_r_peaks` into a dictionary, to make it easier to map the results back to the inputs. .. GENERATED FROM PYTHON SOURCE LINES 153-165 .. code-block:: default aggregations = [ ( "r_peak_positions", lambda datapoints, results: pd.concat(results, keys=[d.group_label for d in datapoints]), ), ( "n_r_peaks", lambda datapoints, results: dict(zip([d.group_label for d in datapoints], results)), ), ] .. GENERATED FROM PYTHON SOURCE LINES 166-167 Now we can create the iterator and iterate over the dataset. .. GENERATED FROM PYTHON SOURCE LINES 167-186 .. code-block:: default from pathlib import Path from examples.algorithms.algorithms_qrs_detection_final import QRSDetector from examples.datasets.datasets_final_ecg import ECGExampleData iterator = TypedIterator[QRSResultType](QRSResultType, aggregations=aggregations) try: HERE = Path(__file__).parent except NameError: HERE = Path().resolve() data_path = HERE.parent.parent / "example_data/ecg_mit_bih_arrhythmia/data" dataset = ECGExampleData(data_path) for d, r in iterator.iterate(dataset): r.r_peak_positions = QRSDetector().detect(d.data["ecg"], sampling_rate_hz=d.sampling_rate_hz).r_peak_positions_ r.n_r_peaks = len(r.r_peak_positions) .. GENERATED FROM PYTHON SOURCE LINES 187-188 Finally we can inspect the results stored on the iterator. .. GENERATED FROM PYTHON SOURCE LINES 188-190 .. code-block:: default iterator.results_ .. rst-class:: sphx-glr-script-out .. code-block:: none QRSResultType(r_peak_positions=group_1 100 0 77 1 370 2 663 3 947 4 1231 ... group_3 200 1448 647546 1449 648357 1450 648629 1451 649409 1452 649928 Length: 17782, dtype: int64, n_r_peaks={ECGExampleDataGroupLabel(patient_group='group_1', participant='100'): 2270, ECGExampleDataGroupLabel(patient_group='group_2', participant='102'): 1710, ECGExampleDataGroupLabel(patient_group='group_3', participant='104'): 2066, ECGExampleDataGroupLabel(patient_group='group_1', participant='105'): 2567, ECGExampleDataGroupLabel(patient_group='group_2', participant='106'): 1704, ECGExampleDataGroupLabel(patient_group='group_3', participant='108'): 78, ECGExampleDataGroupLabel(patient_group='group_1', participant='114'): 30, ECGExampleDataGroupLabel(patient_group='group_2', participant='116'): 2392, ECGExampleDataGroupLabel(patient_group='group_3', participant='119'): 1988, ECGExampleDataGroupLabel(patient_group='group_1', participant='121'): 6, ECGExampleDataGroupLabel(patient_group='group_2', participant='123'): 1518, ECGExampleDataGroupLabel(patient_group='group_3', participant='200'): 1453}) .. GENERATED FROM PYTHON SOURCE LINES 191-192 Note, that `r_peak_positions_` is a single dataframe now and not a list of dataframes. .. GENERATED FROM PYTHON SOURCE LINES 192-194 .. code-block:: default iterator.r_peak_positions_ .. rst-class:: sphx-glr-script-out .. code-block:: none group_1 100 0 77 1 370 2 663 3 947 4 1231 ... group_3 200 1448 647546 1449 648357 1450 648629 1451 649409 1452 649928 Length: 17782, dtype: int64 .. GENERATED FROM PYTHON SOURCE LINES 195-196 The `n_r_peaks_` is still a dictionary, as excpected. .. GENERATED FROM PYTHON SOURCE LINES 196-198 .. code-block:: default iterator.n_r_peaks_ .. rst-class:: sphx-glr-script-out .. code-block:: none {ECGExampleDataGroupLabel(patient_group='group_1', participant='100'): 2270, ECGExampleDataGroupLabel(patient_group='group_2', participant='102'): 1710, ECGExampleDataGroupLabel(patient_group='group_3', participant='104'): 2066, ECGExampleDataGroupLabel(patient_group='group_1', participant='105'): 2567, ECGExampleDataGroupLabel(patient_group='group_2', participant='106'): 1704, ECGExampleDataGroupLabel(patient_group='group_3', participant='108'): 78, ECGExampleDataGroupLabel(patient_group='group_1', participant='114'): 30, ECGExampleDataGroupLabel(patient_group='group_2', participant='116'): 2392, ECGExampleDataGroupLabel(patient_group='group_3', participant='119'): 1988, ECGExampleDataGroupLabel(patient_group='group_1', participant='121'): 6, ECGExampleDataGroupLabel(patient_group='group_2', participant='123'): 1518, ECGExampleDataGroupLabel(patient_group='group_3', participant='200'): 1453} .. GENERATED FROM PYTHON SOURCE LINES 199-200 The raw results are still available a list of dataclass instances. .. GENERATED FROM PYTHON SOURCE LINES 200-202 .. code-block:: default iterator.raw_results_ .. rst-class:: sphx-glr-script-out .. code-block:: none [QRSResultType(r_peak_positions=0 77 1 370 2 663 3 947 4 1231 ... 2265 648978 2266 649232 2267 649485 2268 649734 2269 649992 Length: 2270, dtype: int64, n_r_peaks=2270), QRSResultType(r_peak_positions=0 409 1 697 2 988 3 1304 4 1613 ... 1705 648639 1706 648930 1707 649243 1708 649553 1709 649851 Length: 1710, dtype: int64, n_r_peaks=1710), QRSResultType(r_peak_positions=0 17 1 314 2 613 3 899 4 1186 ... 2061 648729 2062 649021 2063 649298 2064 649578 2065 649874 Length: 2066, dtype: int64, n_r_peaks=2066), QRSResultType(r_peak_positions=0 197 1 459 2 708 3 964 4 1221 ... 2562 648733 2563 648977 2564 649221 2565 649471 2566 649740 Length: 2567, dtype: int64, n_r_peaks=2567), QRSResultType(r_peak_positions=0 351 1 725 2 1086 3 1448 4 1830 ... 1699 648969 1700 649161 1701 649335 1702 649792 1703 649990 Length: 1704, dtype: int64, n_r_peaks=1704), QRSResultType(r_peak_positions=0 10875 1 168524 2 169689 3 170426 4 170802 ... 73 343872 74 359503 75 361856 76 472918 77 526420 Length: 78, dtype: int64, n_r_peaks=78), QRSResultType(r_peak_positions=0 281594 1 281953 2 282291 3 299048 4 300134 5 300486 6 303833 7 304565 8 305674 9 306034 10 307475 11 314265 12 354268 13 469019 14 477999 15 512726 16 513064 17 513384 18 627927 19 629093 20 629709 21 630859 22 631156 23 636224 24 636519 25 636821 26 637118 27 637399 28 637652 29 638046 dtype: int64, n_r_peaks=30), QRSResultType(r_peak_positions=0 16 1 284 2 562 3 838 4 1105 ... 2387 648934 2388 649192 2389 649444 2390 649703 2391 649958 Length: 2392, dtype: int64, n_r_peaks=2392), QRSResultType(r_peak_positions=0 309 1 504 2 977 3 1315 4 1651 ... 1983 648792 1984 649129 1985 649468 1986 649788 1987 649985 Length: 1988, dtype: int64, n_r_peaks=1988), QRSResultType(r_peak_positions=0 1569 1 88217 2 92814 3 168263 4 301711 5 581676 dtype: int64, n_r_peaks=6), QRSResultType(r_peak_positions=0 71 1 551 2 1022 3 1499 4 1926 ... 1513 648248 1514 648627 1515 648999 1516 649343 1517 649690 Length: 1518, dtype: int64, n_r_peaks=1518), QRSResultType(r_peak_positions=0 488 1 965 2 1434 3 1883 4 2332 ... 1448 647546 1449 648357 1450 648629 1451 649409 1452 649928 Length: 1453, dtype: int64, n_r_peaks=1453)] .. GENERATED FROM PYTHON SOURCE LINES 203-204 And the inputs are stored as well. .. GENERATED FROM PYTHON SOURCE LINES 204-206 .. code-block:: default iterator.inputs_ .. rst-class:: sphx-glr-script-out .. code-block:: none [ECGExampleData [1 groups/rows] patient_group participant 0 group_1 100, ECGExampleData [1 groups/rows] patient_group participant 0 group_2 102, ECGExampleData [1 groups/rows] patient_group participant 0 group_3 104, ECGExampleData [1 groups/rows] patient_group participant 0 group_1 105, ECGExampleData [1 groups/rows] patient_group participant 0 group_2 106, ECGExampleData [1 groups/rows] patient_group participant 0 group_3 108, ECGExampleData [1 groups/rows] patient_group participant 0 group_1 114, ECGExampleData [1 groups/rows] patient_group participant 0 group_2 116, ECGExampleData [1 groups/rows] patient_group participant 0 group_3 119, ECGExampleData [1 groups/rows] patient_group participant 0 group_1 121, ECGExampleData [1 groups/rows] patient_group participant 0 group_2 123, ECGExampleData [1 groups/rows] patient_group participant 0 group_3 200] .. GENERATED FROM PYTHON SOURCE LINES 207-215 Custom Iterators ---------------- When passing an iterable directly is not really convenient, you can also create a custom iterator class. This class can reimplement ``iterate`` with custom logic. For example, you could provide a custom iterator that takes a data and a sections parameter and then loops over the sections of the data. For this we need to create a custom subclass inheriting from ``BaseTypedIterator``. .. GENERATED FROM PYTHON SOURCE LINES 215-231 .. code-block:: default from collections.abc import Iterator from typing import Generic, TypeVar from tpcp.misc import BaseTypedIterator CustomTypeT = TypeVar("CustomTypeT") class SectionIterator(BaseTypedIterator[CustomTypeT], Generic[CustomTypeT]): def iterate(self, data: pd.DataFrame, sections: pd.DataFrame) -> Iterator[tuple[pd.DataFrame, CustomTypeT]]: # We turn the sections into a generator of dataframes data_iterable = (data.iloc[s.start : s.end] for s in sections.itertuples(index=False)) # We use the `_iterate` method to do the heavy lifting yield from self._iterate(data_iterable) .. GENERATED FROM PYTHON SOURCE LINES 232-233 We create some dummy data and sections to test the iterator. .. GENERATED FROM PYTHON SOURCE LINES 233-237 .. code-block:: default dummy_data = pd.DataFrame({"data": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}) dummy_sections = pd.DataFrame({"start": [0, 5], "end": [5, 10]}) .. GENERATED FROM PYTHON SOURCE LINES 238-241 Now we can use the iterator to iterate over the data. We skip any form of aggregation here, as it is not really relevant for this example, but it would work the same way as before. .. GENERATED FROM PYTHON SOURCE LINES 241-252 .. code-block:: default @dataclass class SimpleResultType: n_samples: int custom_iterator = SectionIterator[SimpleResultType](SimpleResultType) for d, r in custom_iterator.iterate(dummy_data, dummy_sections): print(d) r.n_samples = len(d) .. rst-class:: sphx-glr-script-out .. code-block:: none data 0 1 1 2 2 3 3 4 4 5 data 5 6 6 7 7 8 8 9 9 10 .. GENERATED FROM PYTHON SOURCE LINES 253-255 We can see that the iterator iterated over the two sections of the data. And the raw results contain two instances of the result dataclass. .. GENERATED FROM PYTHON SOURCE LINES 255-257 .. code-block:: default custom_iterator.raw_results_ .. rst-class:: sphx-glr-script-out .. code-block:: none [SimpleResultType(n_samples=5), SimpleResultType(n_samples=5)] .. GENERATED FROM PYTHON SOURCE LINES 258-260 .. code-block:: default custom_iterator.results_ .. rst-class:: sphx-glr-script-out .. code-block:: none SimpleResultType(n_samples=[5, 5]) .. GENERATED FROM PYTHON SOURCE LINES 261-263 .. code-block:: default custom_iterator.n_samples_ .. rst-class:: sphx-glr-script-out .. code-block:: none [5, 5] .. GENERATED FROM PYTHON SOURCE LINES 264-265 .. code-block:: default custom_iterator.inputs_ .. rst-class:: sphx-glr-script-out .. code-block:: none [ data 0 1 1 2 2 3 3 4 4 5, data 5 6 6 7 7 8 8 9 9 10] .. rst-class:: sphx-glr-timing **Total running time of the script:** (0 minutes 6.141 seconds) **Estimated memory usage:** 14 MB .. _sphx_glr_download_auto_examples_recipies__04_typed_iterator.py: .. only:: html .. container:: sphx-glr-footer sphx-glr-footer-example .. container:: sphx-glr-download sphx-glr-download-python :download:`Download Python source code: _04_typed_iterator.py <_04_typed_iterator.py>` .. container:: sphx-glr-download sphx-glr-download-jupyter :download:`Download Jupyter notebook: _04_typed_iterator.ipynb <_04_typed_iterator.ipynb>` .. only:: html .. rst-class:: sphx-glr-signature `Gallery generated by Sphinx-Gallery `_