Algorithm Validation in tpcp#

Note

If you are unsure about algorithm validation from a general scientific point of view, have a look at our general guide here first.

Pre-Requisites#

To use the algorithm validation tools in tpcp, you first need to represent your data as a Dataset and implement the algorithms you want to validate as Pipelines. All parameters that should be optimized (either internally or using an external wrapper) as part of a parameter search should be exposed as parameters in the `__init__`.
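
As a rough sketch (the class, parameter, and attribute names here are made up for illustration), a minimal pipeline exposing a single tunable parameter might look like this:

>>> from tpcp import Pipeline
>>>
>>> class MyPipeline(Pipeline):
...     def __init__(self, threshold=1.0):
...         # Parameters are simply stored on the instance unmodified.
...         self.threshold = threshold
...
...     def run(self, datapoint):
...         # Use `self.threshold` and the data of the single datapoint to produce a result.
...         # By convention, results are stored on the instance with a trailing underscore.
...         self.result_ = ...
...         return self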

Train-Test Splits#

As part of any validation for algorithms that require any form of data-driven optimization, you need to create a hold-out test set. For this purpose, you can simply use the respective functions from sklearn (sklearn.model_selection.train_test_split). In case you are planning to use cross-validation (next section), you can also use any of the sklearn CV splitters (e.g. sklearn.model_selection.KFold).

As Dataset classes implement an iterator interface, train-test splits work just like with any other list-like structure. Have a look at the Custom Dataset - Basics example for practical examples.
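
For a simple hold-out split without stratification or grouping, this can be as short as the following sketch (CustomDatasetClass stands in for your own dataset class):

>>> from sklearn.model_selection import KFold, train_test_split
>>>
>>> data = CustomDatasetClass(...)
>>> # A single hold-out split:
>>> train_data, test_data = train_test_split(data, test_size=0.3, random_state=42)
>>> # Or iterate over cross-validation folds by indexing the dataset with the generated indices:
>>> for train_index, test_index in KFold(n_splits=5).split(data):
...     train_data = data[train_index]
...     test_data = data[test_index]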

The only important thing you need to keep in mind is that in tpcp we put all information into a single object. This means we don’t have a separation between data and labels on a data-structure level. In case you need to perform a stratified or a grouped split, you need to temporarily create an array with the required labels (for a stratified split) or groups (for a grouped split) and then pass it as the y or groups argument to the split method of your splitter.

For a stratified split, this might look like this:

>>> from sklearn.model_selection import StratifiedKFold
>>>
>>> splitter = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
>>> data = CustomDatasetClass(...)
>>> label_array = [d.label for d in data]
>>> for train_index, test_index in splitter.split(data, label_array):
...     train_data = data[train_index]
...     test_data = data[test_index]
...     # do something with the data

For a grouped split it might look like this:

>>> from sklearn.model_selection import GroupKFold
>>>
>>> splitter = GroupKFold(n_splits=2)
>>> data = CustomDatasetClass(...)
>>> # You can use the `create_string_group_labels` method to create an array of group labels based on the dataset index
>>> groups = data.create_string_group_labels("patient_groups")
>>> for train_index, test_index in splitter.split(data, groups=groups):
...     train_data = data[train_index]
...     test_data = data[test_index]
...     # do something with the data

This works well when you iterate over your folds on your own. If you are planning to use cross_validate, you need to handle these special cases a little differently. More about that in the next section.

Cross Validation#

Instead of doing a single train-test split, a cross-validation is usually preferred. Analogous to the sklearn function, we provide a cross_validate function. Its API is kept as similar as possible to the sklearn equivalent.

Have a look at the full example for basic usage of cross_validate: Cross Validation.

A couple of things you should keep in mind:

  • The first parameter must be an Optimizer, not just an optimizable Pipeline. If you have an optimizable pipeline you want to cross-validate without an external parameter search, you need to wrap it into an Optimize object (see the sketch after this list).

  • If you want to use a pipeline without optimization in the cross_validate function, you can wrap it in a DummyOptimize object. This object has the correct optimization interface, but does not perform any optimization. In such a case you would usually not need to use a cross-validation, but it might be helpful to run a non-optimizable algorithm on the exact same folds as an optimizable algorithm you want to compare it to. This way you get comparable means and standard deviations over the cross-validation folds.

  • If you want to use stratified or grouped splits, you need to create the arrays for the labels or groups as above and then pass them as the mock_labels or groups parameter, respectively. Note that the mock_labels will really only be used for the CV splitter and not for the actual evaluation of the algorithm.
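
Putting these pieces together, a call to cross_validate might look roughly like the sketch below. MyOptimizablePipeline, MyBaselinePipeline, and my_score_func are placeholder names, and the exact keyword arguments may differ slightly depending on your tpcp version; see the linked example for a complete version.

>>> from sklearn.model_selection import GroupKFold
>>> from tpcp.optimize import DummyOptimize, Optimize
>>> from tpcp.validate import cross_validate
>>>
>>> data = CustomDatasetClass(...)
>>> # Optimizable pipeline: wrap it in `Optimize`, so it is re-optimized on every train fold.
>>> results = cross_validate(
...     Optimize(MyOptimizablePipeline()), data, scoring=my_score_func, cv=5
... )
>>> # Non-optimizable pipeline evaluated on the exact same folds for comparison.
>>> baseline_results = cross_validate(
...     DummyOptimize(MyBaselinePipeline()), data, scoring=my_score_func, cv=5
... )
>>> # Grouped splits: create the group labels as shown above and pass them via `groups`.
>>> group_labels = data.create_string_group_labels("patient_groups")
>>> grouped_results = cross_validate(
...     Optimize(MyOptimizablePipeline()),
...     data,
...     scoring=my_score_func,
...     cv=GroupKFold(n_splits=5),
...     groups=group_labels,
... )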

Custom Scoring#

In tpcp we assume that your problem is likely complex enough to require a custom scoring function. Therefore, we don’t provide anything pre-defined. However, we want to make it as easy as possible to pass through all the information you need to evaluate your algorithm.

A scoring function can return any number of metrics (as a dict of values). Further, we allow returning non-numeric values (e.g. metadata or “raw results”) from scoring functions (a regular frustration I had with sklearn). These non-numeric values can either be passed through all cross-validation or optimization methods unaggregated by wrapping them with NoAgg, or handled by any form of custom aggregator (learn more about that here).
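
As a sketch of what such a custom scoring function could look like (my_score_func, result_, and reference are hypothetical names for your scoring function, pipeline output, and ground truth):

>>> from tpcp.validate import NoAgg
>>>
>>> def my_score_func(pipeline, datapoint):
...     # Run the pipeline on a single datapoint; results end up on the returned pipeline object.
...     pipeline = pipeline.safe_run(datapoint)
...     # `result_` and `reference` are placeholders for your pipeline output and ground truth.
...     error = abs(pipeline.result_ - datapoint.reference)
...     return {
...         "error": error,  # numeric value -> aggregated (mean by default) over all datapoints
...         "raw_result": NoAgg(pipeline.result_),  # non-numeric value passed through unaggregated
...     }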