Algorithm Validation in tpcp
If you are unsure about algorithm validation from a general scientific point of view, have a look at our general guide here first.
To use the algorithm validation tools in tpcp, you need to first represent your data as a
implement the algorithms you want to validate as
All parameters that should be optimized (either internally or using an external wrapper) as part of a parameter search
should be exposed as parameters in the init.
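To illustrate why exposing parameters in the init matters, here is a minimal plain-Python sketch. `ThresholdDetector`, `get_params`, and `clone_with` are hypothetical names made up for this illustration and are not the tpcp API; the sketch only shows that an external search can discover and vary parameters generically when all of them are init arguments.

```python
import inspect
from itertools import product


class ThresholdDetector:
    """Hypothetical algorithm (not a tpcp class) with all tunables in __init__."""

    def __init__(self, threshold=0.5, min_length=3):
        self.threshold = threshold
        self.min_length = min_length

    def get_params(self):
        # Discover the tunable parameters from the __init__ signature.
        names = [n for n in inspect.signature(type(self).__init__).parameters if n != "self"]
        return {n: getattr(self, n) for n in names}

    def clone_with(self, **params):
        # Create a fresh instance with some parameters replaced.
        return type(self)(**{**self.get_params(), **params})

    def run(self, signal):
        # Count runs of at least `min_length` consecutive samples above `threshold`.
        count = run_length = 0
        for value in signal:
            if value > self.threshold:
                run_length += 1
                if run_length == self.min_length:
                    count += 1
            else:
                run_length = 0
        return count


# A generic grid search only needs the parameter names, nothing algorithm-specific.
grid = {"threshold": [0.3, 0.6], "min_length": [2, 3]}
signal = [0.4, 0.5, 0.1, 0.7, 0.8, 0.9, 0.2]
results = {}
for values in product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    candidate = ThresholdDetector().clone_with(**params)
    results[tuple(params.items())] = candidate.run(signal)
```

Because every candidate is built purely from init arguments, the search loop never has to know what the parameters mean.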
As part of any validation of algorithms that require any form of data-driven optimization, you need to create a
hold-out test set.
For this purpose you can simply use the respective functions from sklearn.
In case you are planning to use cross-validation (next section), you can also use any of the sklearn CV splitters.
The only important thing to keep in mind is that in tpcp we put all information into a single object.
This means we don’t have a separation between data and labels on a data-structure level.
In case you need to perform a stratified or a grouped split, you need to temporarily create an array with the required
labels (for a stratified split) or groups (for a grouped split) and then pass it as
y (labels) or groups (groups) to the
split method of your splitter.
For a stratified split, this might look like this:
>>> from sklearn.model_selection import StratifiedKFold
>>>
>>> splitter = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
>>> data = CustomDatasetClass(...)
>>> label_array = [d.label for d in data]
>>> for train_index, test_index in splitter.split(data, label_array):
...     train_data = data[train_index]
...     test_data = data[test_index]
...     # do something with the data
For a grouped split it might look like this:
>>> from sklearn.model_selection import GroupKFold
>>>
>>> splitter = GroupKFold(n_splits=2)
>>> data = CustomDatasetClass(...)
>>> # You can use `create_string_group_labels` method to create an array of group labels based on the dataset index
>>> groups = data.create_string_group_labels("patient_groups")
>>> for train_index, test_index in splitter.split(data, groups=groups):
...     train_data = data[train_index]
...     test_data = data[test_index]
...     # do something with the data
This works well when you iterate over your folds on your own.
If you are planning to use
cross_validate you need to handle these special cases a little
More about that in the next section.
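To make the idea of a temporary group array concrete, the following sketch builds a grouped hold-out split by hand using only the standard library. It is a hand-rolled stand-in for illustration, not what sklearn or tpcp do internally; `grouped_holdout_indices` and the toy `rows` are hypothetical.

```python
import random


def grouped_holdout_indices(groups, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx) so that no group appears in both sets."""
    unique_groups = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique_groups)
    # Reserve at least one whole group for the test set.
    n_test = max(1, round(len(unique_groups) * test_fraction))
    test_groups = set(unique_groups[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx


# Toy "dataset": each row bundles a participant id with a recording name.
rows = [
    ("p1", "walk"), ("p1", "run"),
    ("p2", "walk"), ("p2", "run"),
    ("p3", "walk"),
    ("p4", "walk"), ("p4", "run"), ("p4", "sit"),
]
# The group array exists only for the split; it is derived from the data itself.
groups = [participant for participant, _ in rows]

train_idx, test_idx = grouped_holdout_indices(groups, test_fraction=0.25, seed=42)
train_groups = {groups[i] for i in train_idx}
test_groups = {groups[i] for i in test_idx}
```

The group array is thrown away after the split; train and test subsets are selected from the single data object by index.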
Instead of doing a single train-test split, a cross-validation is usually preferred.
Analogous to the sklearn function, we provide a
The API of this function is as similar as possible to the sklearn function.
Have a look at the full example for cross-validate for basic usage: Cross Validation.
A couple of things you should keep in mind:
The first parameter must be an Optimizer, not just an optimizable Pipeline. If you have an optimizable pipeline you want to cross-validate without external parameter search, you need to wrap it into an
If you want to use a pipeline without optimization in the cross-validate function, you can wrap it in a
DummyOptimize object. This object has the correct optimization interface, but does not perform any optimization. In such a case you would usually not need a cross-validation, but it can be helpful to run a non-optimizable algorithm on the exact same folds as an optimizable algorithm you want to compare it to. This way you get comparable means and standard deviations over the cross-validation folds.
If you want to use stratified or grouped splits, you need to create the arrays for the labels or groups as above and then pass them as the
mock_labels parameter. Note that the
mock_labels will only be used by the CV splitter and not for the actual evaluation of the algorithm.
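The point about running a non-optimizable algorithm on identical folds can be illustrated without tpcp. In this plain-Python sketch, `NoOpOptimizer`, `SelfOptimizer`, `MeanPredictor`, `FixedPredictor`, and `evaluate_on_folds` are all hypothetical names; the sketch only mirrors the idea of a wrapper that offers an optimization interface without optimizing anything.

```python
from statistics import mean


class MeanPredictor:
    """Hypothetical optimizable algorithm: predicts the mean of the training data."""

    def __init__(self, prediction=0.0):
        self.prediction = prediction

    def fit(self, train):
        return MeanPredictor(prediction=mean(train))

    def score(self, test):
        # Mean absolute error of the constant prediction.
        return mean(abs(v - self.prediction) for v in test)


class FixedPredictor(MeanPredictor):
    """Hypothetical non-optimizable algorithm: the prediction never changes."""


class SelfOptimizer:
    """Wrapper that trains the wrapped algorithm on the training fold."""

    def __init__(self, pipeline):
        self.pipeline = pipeline

    def optimize(self, train):
        self.optimized_pipeline_ = self.pipeline.fit(train)
        return self


class NoOpOptimizer:
    """Wrapper with the same interface that performs no optimization."""

    def __init__(self, pipeline):
        self.pipeline = pipeline

    def optimize(self, train):
        self.optimized_pipeline_ = self.pipeline
        return self


def evaluate_on_folds(make_optimizer, data, folds):
    scores = []
    for train_idx, test_idx in folds:
        train = [data[i] for i in train_idx]
        test = [data[i] for i in test_idx]
        optimizer = make_optimizer().optimize(train)
        scores.append(optimizer.optimized_pipeline_.score(test))
    return scores


data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
# The folds are created once and reused, so both algorithms see identical splits.
folds = [([2, 3, 4, 5], [0, 1]), ([0, 1, 4, 5], [2, 3]), ([0, 1, 2, 3], [4, 5])]

learned_scores = evaluate_on_folds(lambda: SelfOptimizer(MeanPredictor()), data, folds)
fixed_scores = evaluate_on_folds(lambda: NoOpOptimizer(FixedPredictor(3.5)), data, folds)
```

Because both evaluations loop over the same fold definitions, their per-fold scores, means, and standard deviations can be compared directly.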
In tpcp we assume that your problem is likely complex enough to require a custom scoring function. Therefore, we don’t provide anything pre-defined. However, we want to make it as easy as possible to pass through all the information you need to evaluate your algorithm.
A scoring function can return any number of metrics (as a dict of values).
Beyond that, scoring functions may also return non-numeric values (e.g. meta-data or “raw-results”)
(a regular frustration I had with sklearn).
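As a rough sketch of what such a scoring function could look like, the plain-Python example below returns two numeric metrics and one non-numeric entry per datapoint. The `score` and `aggregate` functions are hypothetical and not part of tpcp; the aggregation shown (mean for numbers, untouched list for everything else) is only one possible convention, assumed here for illustration.

```python
from numbers import Real
from statistics import mean


def score(predicted, reference):
    """Hypothetical scoring function returning several metrics as a dict."""
    errors = [p - r for p, r in zip(predicted, reference)]
    return {
        "mae": mean(abs(e) for e in errors),
        "max_error": max(abs(e) for e in errors),
        "raw_errors": errors,  # non-numeric "raw result" kept for later inspection
    }


def aggregate(per_datapoint_scores):
    """Average numeric metrics; pass everything else through as a list."""
    aggregated = {}
    for key in per_datapoint_scores[0]:
        values = [s[key] for s in per_datapoint_scores]
        if all(isinstance(v, Real) and not isinstance(v, bool) for v in values):
            aggregated[key] = mean(values)
        else:
            aggregated[key] = values
    return aggregated


scores = [
    score([1.0, 2.0], [1.5, 2.0]),
    score([3.0, 5.0], [2.0, 4.0]),
]
summary = aggregate(scores)
```

Here `summary["mae"]` and `summary["max_error"]` are single numbers, while `summary["raw_errors"]` keeps the per-datapoint lists unchanged.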
These non-numeric values can either be passed through all cross-validation or optimization methods by wrapping them in
NoAgg or passed through any form of custom aggregator (learn more about that