Algorithm Validation in tpcp#
Note
If you are unsure about algorithm validation from a general scientific point of view, have a look at our general guide here first.
Pre-Requisites#
To use the algorithm validation tools in tpcp, you first need to represent your data as a Dataset and
implement the algorithms you want to validate as a Pipeline.
All parameters that should be optimized as part of a parameter search (either internally or using an external wrapper)
should be exposed as parameters in the `__init__`.
Train-Test Splits#
As part of any validation of algorithms that require some form of data-driven optimization, you need to create a
hold-out test set.
For this purpose you can simply use the respective functions from sklearn
(sklearn.model_selection.train_test_split).
In case you are planning to use cross-validation (next section), you can also use any of the sklearn CV splitters
(e.g. sklearn.model_selection.KFold).
As Dataset classes implement an iterator interface, the train-test splits work just like with any other
list-like structure.
Have a look at the Custom Dataset - Basics example for practical examples.
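As a minimal sketch of such a hold-out split, the snippet below uses a plain Python list as a stand-in for a Dataset (an assumption made so that the example runs without any custom dataset class). The participant names are hypothetical.

```python
from sklearn.model_selection import train_test_split

# Plain list standing in for a Dataset; each entry represents one datapoint.
data = [f"participant_{i}" for i in range(8)]

# Hold out 25% of the datapoints as the test set.
# A fixed random_state makes the split reproducible.
train_data, test_data = train_test_split(data, test_size=0.25, random_state=42)

print(len(train_data), len(test_data))
```

The test set should only be touched once, for the final evaluation, after all optimization has been performed on the train set.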
The only important thing you need to keep in mind is that in tpcp we put all information into a single object.
This means we don’t have a separation between data and labels on a data-structure level.
In case you need to perform a stratified or a grouped split, you need to temporarily create an array with the required
labels (for a stratified split) or groups (for a grouped split) and then pass it as y or as groups, respectively, to
the split method of your splitter.
For a stratified split, this might look like this:
>>> from sklearn.model_selection import StratifiedKFold
>>>
>>> splitter = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
>>> data = CustomDatasetClass(...)
>>> label_array = [d.label for d in data]
>>> for train_index, test_index in splitter.split(data, label_array):
...     train_data = data[train_index]
...     test_data = data[test_index]
...     # do something with the data
For a grouped split it might look like this:
>>> from sklearn.model_selection import GroupKFold
>>>
>>> splitter = GroupKFold(n_splits=2)
>>> data = CustomDatasetClass(...)
>>> # You can use the `create_string_group_labels` method to create an array of group labels based on the dataset index
>>> groups = data.create_string_group_labels("patient_groups")
>>> for train_index, test_index in splitter.split(data, groups=groups):
...     train_data = data[train_index]
...     test_data = data[test_index]
...     # do something with the data
This works well when you iterate over your folds on your own.
If you are planning to use cross_validate, you need to handle these special cases a little
differently.
More about that in the next section.
Cross Validation#
Instead of doing a single train-test split, a cross-validation is usually preferred.
Analogous to the sklearn function, we provide a cross_validate function.
The API of this function is as similar as possible to the sklearn function.
For basic usage, have a look at the full cross_validate example: Cross Validation.
A couple of things you should keep in mind:
- The first parameter must be an Optimizer, not just an optimizable Pipeline. If you have an optimizable pipeline that you want to cross-validate without an external parameter search, you need to wrap it in an Optimize object.
- If you want to use a pipeline without optimization in the cross_validate function, you can wrap it in a DummyOptimize object. This object has the correct optimization interface, but does not perform any optimization. In such a case you would usually not need a cross-validation, but it can be helpful to run a non-optimizable algorithm on the exact same folds as an optimizable algorithm you want to compare it to. This way you get comparable means and standard deviations over the cross-validation folds.
- If you want to use stratified or grouped splits, you need to create the arrays for the labels or groups as above and then pass them as the mock_labels or groups parameter, respectively. Note that the mock_labels are only used by the CV splitter and not for the actual evaluation of the algorithm.
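Comparing two algorithms on the exact same folds only works if the splitter is deterministic. The following sketch uses plain sklearn and a list as a stand-in for a Dataset (both are assumptions made for illustration). It checks that a shuffled KFold with a fixed integer random_state produces identical folds on repeated calls. Reusing such a splitter for both cross-validation runs is one way to obtain comparable folds.

```python
from sklearn.model_selection import KFold

# List standing in for a Dataset with 10 datapoints.
data = list(range(10))

# An integer random_state makes the shuffling reproducible across calls to split().
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Generate the folds twice, as two separate cross-validation runs would do.
folds_first = [(train.tolist(), test.tolist()) for train, test in cv.split(data)]
folds_second = [(train.tolist(), test.tolist()) for train, test in cv.split(data)]

folds_identical = folds_first == folds_second
print(folds_identical)
```

Without a fixed random_state, a shuffled splitter can draw different folds on every call, and per-fold results of the two algorithms would no longer be directly comparable.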
Custom Scoring#
In tpcp we assume that your problem is likely complex enough to require a custom scoring function. Therefore, we don’t provide anything pre-defined. However, we want to make it as easy as possible to pass through all the information you need to evaluate your algorithm.
A scoring function can return any number of metrics (as a dict of values).
Beyond that, scoring functions are also allowed to return non-numeric values (e.g. meta-data or “raw results”)
(not being able to do this was a regular frustration I had with sklearn).
These non-numeric values can either be passed through all cross-validation or optimization methods unchanged by
wrapping them with NoAgg, or be processed by any form of custom aggregator (learn more about that
here).
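To illustrate the shape of such a scoring function, here is a minimal sketch. It assumes the scoring function is called with a pipeline and a single datapoint and returns a dict of metrics. FakeDatapoint and FakePeakPipeline are hypothetical plain-Python stand-ins, not tpcp classes, so that the snippet runs on its own; the peak-detection logic is likewise only a placeholder.

```python
from dataclasses import dataclass


@dataclass
class FakeDatapoint:
    """Hypothetical stand-in for a single datapoint of a Dataset."""

    signal: list
    reference_peaks: list


class FakePeakPipeline:
    """Hypothetical stand-in for a Pipeline with one exposed parameter."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold

    def safe_run(self, datapoint):
        # Results are stored on the instance with a trailing underscore.
        self.peaks_ = [i for i, v in enumerate(datapoint.signal) if v > self.threshold]
        return self


def score(pipeline, datapoint):
    """Run the pipeline on one datapoint and return multiple metrics as a dict."""
    detected = set(pipeline.safe_run(datapoint).peaks_)
    reference = set(datapoint.reference_peaks)
    true_positives = len(detected & reference)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    # A raw, non-numeric result (e.g. the detected peaks themselves) could be added
    # here as well; with tpcp it would be wrapped in NoAgg so that it is passed through.
    return {"precision": precision, "recall": recall}


datapoint = FakeDatapoint(signal=[0.1, 0.9, 0.2, 0.8, 0.7], reference_peaks=[1, 3])
metrics = score(FakePeakPipeline(threshold=0.5), datapoint)
print(metrics)
```

Each key of the returned dict is treated as a separate metric, so the per-datapoint values can be aggregated independently.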