Algorithm Validation in tpcp
============================

.. note:: If you are unsure about algorithm validation from a general scientific point of view, have a look at our
   general guide :ref:`here ` first.

Pre-Requisites
--------------

To use the algorithm validation tools in tpcp, you first need to represent your data as a :class:`~tpcp.Dataset` and
implement the algorithms you want to validate as a :class:`~tpcp.Pipeline`.
All parameters that should be optimized (either internally or using an external wrapper) as part of a parameter
search should be exposed as parameters in the `__init__`.

Train-Test Splits
-----------------

As part of any validation of algorithms that require some form of data-driven optimization, you need to create a
hold-out test set.
For this purpose you can simply use the respective functions from sklearn
(:func:`sklearn.model_selection.train_test_split`).
In case you are planning to use cross-validation (next section), you can also use any of the sklearn CV splitters
(e.g. :class:`sklearn.model_selection.KFold`).
As :class:`~tpcp.Dataset` classes implement an iterator interface, train-test splits work just like with any other
list-like structure.
Have a look at the :ref:`custom_dataset_basics` example for practical examples.

The only important thing to keep in mind is that in tpcp we put all information into a single object.
This means we don't have a separation between data and labels on a data-structure level.
In case you need to perform a stratified or a grouped split, you need to temporarily create an array with the
required labels (for a stratified split) or groups (for a grouped split) and then pass it as `y` (or `groups`) to
the `split` method of your splitter.

For a stratified split, this might look like this:

>>> from sklearn.model_selection import StratifiedKFold
>>>
>>> splitter = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
>>> data = CustomDatasetClass(...)
>>> label_array = [d.label for d in data]
>>> for train_index, test_index in splitter.split(data, label_array):
...     train_data = data[train_index]
...     test_data = data[test_index]
...     # do something with the data

For a grouped split it might look like this:

>>> from sklearn.model_selection import GroupKFold
>>>
>>> splitter = GroupKFold(n_splits=2)
>>> data = CustomDatasetClass(...)
>>> # You can use the `create_string_group_labels` method to create an array of group labels based on the dataset index
>>> groups = data.create_string_group_labels("patient_groups")
>>> for train_index, test_index in splitter.split(data, groups=groups):
...     train_data = data[train_index]
...     test_data = data[test_index]
...     # do something with the data

This works well when you iterate over your folds on your own.
If you are planning to use :func:`~tpcp.validate.cross_validate`, you need to handle these special cases a little
differently.
More about that in the next section.

Cross Validation
----------------

Instead of doing a single train-test split, a cross-validation is usually preferred.
Analog to the sklearn function, we provide a :func:`~tpcp.validate.cross_validate` function.
The API of this function is as similar as possible to the sklearn function.
Have a look at the full example for cross-validate for basic usage: :ref:`cross_validation`.

A couple of things you should keep in mind (a minimal usage sketch follows the list):

- The first parameter must be an **Optimizer**, not just an optimizable Pipeline.
  If you have an optimizable pipeline that you want to cross-validate without an external parameter search, you need
  to wrap it into an :class:`~tpcp.optimize.Optimize` object.
- If you want to use a pipeline without optimization in the cross-validate function, you can wrap it in a
  :class:`~tpcp.optimize.DummyOptimize` object.
  This object has the correct optimization interface, but does not perform any optimization.
  In such a case you would usually not need a cross-validation, but it might be helpful to run a non-optimizable
  algorithm on the exact same folds as an optimizable algorithm you want to compare it to.
  This way you get comparable means and standard deviations over the cross-validation folds.
- If you want to use stratified or grouped splits, you need to create the arrays for the labels or groups as above
  and then pass them as the `groups` or `mock_labels` parameter.
  Note that the `mock_labels` will really only be used for the CV splitter and not for the actual evaluation of the
  algorithm.
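To make the wrapping explicit, here is a minimal sketch of how the pieces could fit together.
`MyOptimizablePipeline` and `my_score` are placeholders for your own pipeline and scoring function (scoring is
covered in the next section); only the wrapping in :class:`~tpcp.optimize.Optimize` and the general call structure
of :func:`~tpcp.validate.cross_validate` are the relevant parts.

>>> from tpcp.optimize import Optimize
>>> from tpcp.validate import cross_validate
>>>
>>> data = CustomDatasetClass(...)
>>> # Wrapping the (placeholder) optimizable pipeline in `Optimize` turns it into an Optimizer
>>> # that `cross_validate` can re-optimize on the train set of each fold.
>>> optimizer = Optimize(MyOptimizablePipeline())
>>> # `my_score` is a placeholder scoring function (see the next section).
>>> results = cross_validate(optimizer, data, scoring=my_score, cv=5)
>>> # For a pipeline without any optimization, `DummyOptimize(MyPipeline())` could be passed instead.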
Custom Scoring
--------------

In tpcp we assume that your problem is likely complex enough to require a custom scoring function.
Therefore, we don't provide anything pre-defined.
However, we want to make it as easy as possible to pass through all the information you need to evaluate your
algorithm.
A scoring function can return any number of metrics (as a dict of values).
Going further, we also allow returning non-numeric values (e.g. meta-data or "raw results") from scoring functions
(a regular frustration I had with sklearn).
These non-numeric values can either be passed through all cross-validation or optimization methods by wrapping them
with :class:`~tpcp.validate.NoAgg`, or handled by any form of custom aggregator (learn more about that :ref:`here`).
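As an illustration, a custom scoring function might look like the following sketch.
The pipeline attribute (`predictions_`) and the datapoint attribute (`reference`) are made-up placeholders; the
relevant parts are the overall structure (a callable that takes the pipeline and a single datapoint and returns a
dict) and the :class:`~tpcp.validate.NoAgg` wrapper for values that should be passed through without aggregation.

>>> from tpcp.validate import NoAgg
>>>
>>> def my_score(pipeline, datapoint):
...     # Run the pipeline on a single datapoint of the dataset.
...     pipeline = pipeline.safe_run(datapoint)
...     # `predictions_` and `reference` are hypothetical attributes of your pipeline and dataset.
...     errors = abs(pipeline.predictions_ - datapoint.reference)
...     # Plain numeric values are aggregated over all datapoints (by default, the mean is taken).
...     # Values wrapped in `NoAgg` are passed through unaggregated.
...     return {"mean_error": float(errors.mean()), "raw_errors": NoAgg(errors)}

Such a function can then be passed as the `scoring` parameter of :func:`~tpcp.validate.cross_validate`, as in the
sketch in the previous section.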