Algorithm Validation in tpcp#

Note

If you are unsure about algorithm validation from a general scientific point of view, have a look at our general guide here first.

Pre-Requisites#

To use the algorithm validation tools in tpcp, you first need to represent your data as a Dataset and implement the algorithms you want to validate as a Pipeline. All parameters that should be optimized (either internally or using an external wrapper) as part of a parameter search should be exposed as parameters in the init.
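
To give a rough idea of what this means in practice, a minimal pipeline could look like the sketch below. MyPipeline, threshold, and some_algorithm are hypothetical placeholders for your own pipeline, parameters, and algorithm:

>>> from tpcp import Pipeline
>>>
>>> class MyPipeline(Pipeline):
...     def __init__(self, threshold: float = 0.5):
...         # All parameters that should be tunable are exposed (and only assigned) in the init.
...         self.threshold = threshold
...     def run(self, datapoint):
...         # Apply the algorithm to a single datapoint and store the outputs
...         # as result attributes with a trailing underscore.
...         self.result_ = some_algorithm(datapoint, self.threshold)
...         return self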

Train-Test Splits#

As part of any validation of algorithms that require some form of data-driven optimization, you need to create a hold-out test set. For this purpose you can simply use the respective functions from sklearn (sklearn.model_selection.train_test_split). In case you are planning to use cross-validation (next section), you can also use any of the sklearn CV splitters (e.g. sklearn.model_selection.KFold).

As Dataset classes implement an iterator interface, train-test splits work just like with any other list-like structure. Have a look at the Custom Dataset - Basics example for practical examples.
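
For a simple hold-out split, one way that works with any Dataset is to split an array of integer indices and then index the dataset with the result (CustomDatasetClass is a placeholder for your own dataset class):

>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>>
>>> data = CustomDatasetClass(...)
>>> # Split the integer indices and use them to select the train and test subsets.
>>> train_index, test_index = train_test_split(
...     np.arange(len(data)), test_size=0.3, random_state=42
... )
>>> train_data = data[train_index]
>>> test_data = data[test_index]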

The only important thing to keep in mind is that in tpcp we put all information into a single object. This means we don’t have a separation between data and labels on the data-structure level. In case you need to perform a stratified or a grouped split, you need to temporarily create an array with the required labels (for a stratified split) or groups (for a grouped split) and then pass it as y (stratified) or groups (grouped) to the split method of your splitter.

For a stratified split, this might look like this:

>>> from sklearn.model_selection import StratifiedKFold
>>>
>>> splitter = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
>>> data = CustomDatasetClass(...)
>>> label_array = [d.label for d in data]
>>> for train_index, test_index in splitter.split(data, label_array):
...     train_data = data[train_index]
...     test_data = data[test_index]
...     # do something with the data

For a grouped split it might look like this:

>>> from sklearn.model_selection import GroupKFold
>>>
>>> splitter = GroupKFold(n_splits=2)
>>> data = CustomDatasetClass(...)
>>> # You can use `create_group_labels` method to create an array of group labels based on the dataset index
>>> groups = data.create_group_labels("patient_groups")
>>> for train_index, test_index in splitter.split(data, groups=groups):
...     train_data = data[train_index]
...     test_data = data[test_index]
...     # do something with the data

This works well when you iterate over your folds on your own. If you are planning to use cross_validate, you need to handle these special cases a little differently. More about that in the next section.

Cross Validation#

Instead of doing a single train-test split, a cross-validation is usually preferred. Analogous to sklearn, we provide a cross_validate function. Its API is kept as similar as possible to the sklearn equivalent.

Have a look at the full example for cross-validate for basic usage: Cross Validation.

A couple of things you should keep in mind:

  • The first parameter must be an Optimizer, not just an optimizable Pipeline. If you have an optimizable pipeline you want to cross-validate without an external parameter search, you need to wrap it into an Optimize object (see the sketch after this list).

  • If you want to use a pipeline without optimization in the cross_validate function, you can wrap it in a DummyOptimize object. This object has the correct optimization interface, but does not perform any optimization. In such a case you would usually not need a cross-validation, but it might be helpful to run a non-optimizable algorithm on the exact same folds as an optimizable algorithm you want to compare it to. This way you get comparable means and standard deviations over the cross-validation folds.

  • If you want to use stratified or grouped splits, you need to create the arrays for the labels or groups as shown above and then pass them as the mock_labels or groups parameter, respectively. Note that the mock_labels are really only used by the CV splitter and not for the actual evaluation of the algorithm.
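
Putting these pieces together, a grouped cross-validation could look like the following sketch. MyOptimizablePipeline and my_score_func are hypothetical placeholders for your own pipeline and scoring function; treat the exact keyword arguments as a sketch and check the API reference of cross_validate for details:

>>> from sklearn.model_selection import GroupKFold
>>> from tpcp.optimize import Optimize
>>> from tpcp.validate import cross_validate
>>>
>>> pipeline = MyOptimizablePipeline()
>>> data = CustomDatasetClass(...)
>>> groups = data.create_group_labels("patient_groups")
>>> results = cross_validate(
...     Optimize(pipeline),  # or DummyOptimize(pipeline) for a pipeline without optimization
...     data,
...     groups=groups,
...     scoring=my_score_func,
...     cv=GroupKFold(n_splits=5),
... )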

Custom Scoring#

In tpcp we assume that your problem is likely complex enough to require a custom scoring function. Therefore, we don’t provide anything pre-defined. However, we want to make it as easy as possible to pass through all the information you need to evaluate your algorithm.

A scoring function can return any number of metrics (as a dict of values). Beyond that, we allow scoring functions to return non-numeric values (e.g. meta-data or “raw results”), which was a regular frustration I had with sklearn. These non-numeric values can either be passed through all cross-validation or optimization methods by wrapping them with NoAgg, or aggregated by any form of custom aggregator (learn more about that here).
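
As an illustration, a custom scoring function could look like the sketch below. It is called once per datapoint with the (optimized) pipeline and that datapoint; calculate_errors is a hypothetical helper and result_ a hypothetical result attribute of your pipeline:

>>> import numpy as np
>>> from tpcp.validate import NoAgg
>>>
>>> def my_score_func(pipeline, datapoint):
...     # Run the pipeline on a single datapoint and compare its output to the reference.
...     pipeline = pipeline.safe_run(datapoint)
...     errors = calculate_errors(pipeline.result_, datapoint)
...     return {
...         "mean_error": float(np.mean(errors)),  # numeric -> aggregated (averaged) over datapoints
...         "raw_errors": NoAgg(errors),  # non-numeric/raw -> passed through without aggregation
...     }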