Optimization and Training
Pre-Requisites
To use the algorithm validation tools in tpcp, you first need to represent your data as a Dataset and implement the algorithms you want to validate as a Pipeline.
All parameters that should be optimized as part of a parameter search (either internally or using an external wrapper) should be exposed as parameters in the __init__.
General Concepts
In tpcp, we use the term Optimization as an umbrella term for any form of data-driven parameter optimization.
This can be traditional ML training of model weights, black-box optimization of hyperparameters, or a simple grid search of thresholds in classical algorithms.
We therefore aim for a unified interface for all these cases.
This is achieved by defining “optimization” as any form of data-driven optimization of the “parameters” specified in the __init__ of an algorithm.
This optimization can be performed internally, via a self_optimize method implemented on the pipeline, or externally, via a wrapper like GridSearch.
>>> from tpcp.optimize import GridSearch
>>>
>>> my_pipeline = MyPipeline(val1="initial_value")
>>> gs = GridSearch(my_pipeline, {"val1": ["optimized_value_1", "optimized_value_2"]})
>>> gs = gs.optimize(train_data)
>>> my_optimized_pipeline = gs.optimized_pipeline_
>>> my_optimized_pipeline.val1
'optimized_value_2'
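Conceptually, an external optimizer only needs the parameters exposed in the __init__: it builds one pipeline per parameter combination, scores each candidate, and keeps the best one. The following is a minimal plain-Python sketch of that idea, not the GridSearch implementation of tpcp. ThresholdPipeline, accuracy, and grid_search are hypothetical names introduced purely for illustration.

```python
from itertools import product


class ThresholdPipeline:
    """Hypothetical stand-in for a pipeline (not a tpcp class)."""

    def __init__(self, threshold=0.5):
        # Everything that should be searchable is exposed in __init__.
        self.threshold = threshold

    def run(self, values):
        self.result_ = [value > self.threshold for value in values]
        return self


def accuracy(pipeline, values, labels):
    """Fraction of values that are classified correctly."""
    predictions = pipeline.run(values).result_
    return sum(p == l for p, l in zip(predictions, labels)) / len(labels)


def grid_search(pipeline_cls, param_grid, values, labels):
    """Try every parameter combination and keep the best-scoring one."""
    names = list(param_grid)
    best_score, best_params = float("-inf"), None
    for combination in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combination))
        score = accuracy(pipeline_cls(**params), values, labels)
        if score > best_score:
            best_score, best_params = score, params
    return pipeline_cls(**best_params), best_score


train_values = [0.1, 0.4, 0.6, 0.9]
train_labels = [False, False, True, True]
best_pipeline, best_score = grid_search(
    ThresholdPipeline, {"threshold": [0.2, 0.5, 0.8]}, train_values, train_labels
)
```

Because the search only touches the constructor arguments, the same loop works for any pipeline that follows this convention.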
For pipelines that implement a self_optimize method, it is recommended to use the Optimize wrapper instead of calling self_optimize directly.
>>> from tpcp.optimize import Optimize
>>>
>>> my_optimizable_pipeline = MyOptimizablePipeline(val1="initial_value")
>>> my_optimized_pipeline = Optimize(my_optimizable_pipeline).optimize(train_data).optimized_pipeline_
>>> my_optimized_pipeline.val1
'optimized_val1'
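One plausible reason for preferring a wrapper is that the caller gets an optimized copy back while the original pipeline keeps its initial parameters. The sketch below illustrates this pattern in plain Python under that assumption. MeanThresholdPipeline and optimize_copy are hypothetical names and do not reproduce the actual behavior of Optimize.

```python
import copy


class MeanThresholdPipeline:
    """Hypothetical pipeline with an internal optimization (not a tpcp class)."""

    def __init__(self, threshold=0.0):
        self.threshold = threshold

    def self_optimize(self, values):
        # Algorithm-specific "training": use the mean of the training data.
        self.threshold = sum(values) / len(values)
        return self


def optimize_copy(pipeline, values):
    """Optimize a deep copy so that the passed pipeline is left untouched."""
    return copy.deepcopy(pipeline).self_optimize(values)


original = MeanThresholdPipeline(threshold=0.0)
optimized = optimize_copy(original, [1.0, 2.0, 3.0])
```

After this runs, optimized carries the learned threshold, while original can still be reused with its initial value, for example in another cross-validation fold.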
Parameter Annotations
When talking about optimization, it becomes clear that we need to distinguish between the different types of parameters an algorithm or pipeline might have. They fall into three categories:
optimizable parameters: These parameters represent values and models that are (or can be) optimized using the self_optimize method.
hyper-parameters: These are parameters that change how the optimization in self_optimize is performed.
“normal” parameters: Basically everything else. These parameters neither influence nor are influenced by self_optimize. They only influence the output of the action method of the pipeline.
See the evaluation guide to better understand the distinction between parameters and hyper-parameters.
To make these distinctions clear (for humans and machines), tpcp provides a set of type hints that can be applied to class-level attributes to annotate the respective parameters:
>>> import numpy as np
>>> from tpcp import HyperParameter, OptimizableParameter, OptimizablePipeline, Parameter
>>>
>>> class MyOptimizablePipeline(OptimizablePipeline):
...     # Class-level annotations mark the role of each __init__ parameter.
...     nn_weight_vector: OptimizableParameter[np.ndarray]
...     simple_threshold: Parameter[int]
...     my_hyper_parameter: HyperParameter[float]
...
...     def __init__(self, nn_weight_vector: np.ndarray, simple_threshold: int, my_hyper_parameter: float):
...         ...
This not only helps with documentation, but can also be used to perform sanity checks when running the optimization.
For example, if none of the optimizable parameters has changed after self_optimize of a pipeline was called, something has likely gone wrong.
Such checks can be performed by the Optimize class or the make_optimize_safe decorator based on the provided parameter annotations.
Have a look at the documentation there to understand which checks are performed.
To see these parameter annotations in action, check out this example.
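To make the idea of annotation-driven checks more concrete, the following plain-Python sketch marks a field with typing.Annotated and then tests whether any marked field changed during optimization. This is only an illustration of the principle. The OPTIMIZABLE marker, ForgetfulPipeline, and both helper functions are hypothetical and are not the mechanism used by Optimize or make_optimize_safe.

```python
import copy
from typing import Annotated, get_type_hints

OPTIMIZABLE = "optimizable"  # hypothetical marker, stands in for a real annotation


class ForgetfulPipeline:
    threshold: Annotated[float, OPTIMIZABLE]
    scale: float  # a "normal" parameter without marker

    def __init__(self, threshold=0.0, scale=1.0):
        self.threshold = threshold
        self.scale = scale

    def self_optimize(self, values):
        # Bug on purpose: the result is computed but never stored on self.
        new_threshold = sum(values) / len(values)
        return self


def optimizable_fields(cls):
    """Names of all class-level attributes that carry the marker."""
    hints = get_type_hints(cls, include_extras=True)
    return [
        name
        for name, hint in hints.items()
        if OPTIMIZABLE in getattr(hint, "__metadata__", ())
    ]


def nothing_changed(pipeline, values):
    """True if no marked parameter differs after self_optimize ran on a copy."""
    optimized = copy.deepcopy(pipeline).self_optimize(values)
    return all(
        getattr(optimized, name) == getattr(pipeline, name)
        for name in optimizable_fields(type(pipeline))
    )


suspicious = nothing_changed(ForgetfulPipeline(), [1.0, 2.0, 3.0])
if suspicious:
    print("self_optimize did not change any optimizable parameter.")
```

Here the check flags the pipeline, because its self_optimize never writes the learned value back to the annotated attribute.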
Note
One special case of parameter annotations is tpcp.PureParameter.
It can be used to annotate a parameter that does not influence the self_optimize method of a pipeline, i.e. a parameter that is only used by and relevant for the action method.
This can be useful information for parameter search methods like GridSearchCV, as they don’t need to rerun the optimization when only pure parameters are changed.
For GridSearchCV, this performance optimization can be enabled via the pure_parameters parameter.
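The benefit of knowing which parameters are pure can be sketched as a simple cache: the expensive optimization is keyed only by the non-pure parameters, so changing a pure parameter reuses the stored result. The example below is a plain-Python illustration under that assumption and does not show how GridSearchCV implements this internally. All names (expensive_optimize, search_with_cache, window, output_scale) are hypothetical.

```python
from itertools import product

optimize_log = []  # records every (expensive) optimization run


def expensive_optimize(window, train_values):
    """Stand-in for a costly optimization that depends only on `window`."""
    optimize_log.append(window)
    return sum(train_values[:window]) / window


def search_with_cache(param_grid, pure_names, train_values):
    names = list(param_grid)
    cache = {}
    results = []
    for combination in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combination))
        # Only non-pure parameters are part of the cache key.
        key = tuple((n, v) for n, v in params.items() if n not in pure_names)
        if key not in cache:
            cache[key] = expensive_optimize(params["window"], train_values)
        # The pure parameter only affects the final output.
        results.append((params, cache[key] * params["output_scale"]))
    return results


grid = {"window": [2, 4], "output_scale": [1, 10, 100]}
results = search_with_cache(
    grid, pure_names={"output_scale"}, train_values=[1.0, 3.0, 5.0, 7.0]
)
```

With this grid, six parameter combinations are evaluated, but the optimization itself runs only twice, once per value of window.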
External Optimization vs self_optimize
When implementing a new algorithm or pipeline that should have optimizable parameters, you need to decide whether to implement an explicit self_optimize method or to use (or create) an external parameter optimizer like GridSearch.
The simple advice here is that you should never “re-implement” any form of “dumb” search within a self_optimize method.
The self_optimize method should only be used if there are algorithm-specific details or methods that can optimize the parameters far more efficiently than random search (or similar).
For example, the backpropagation logic for a neural network would be a candidate for self_optimize.
It is domain-specific (i.e. not generic) and hence less suited for a general “parameter” optimizer class.
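A much smaller analogue of this trade-off is fitting the slope of a line through the origin: the algorithm-specific route uses the closed-form least-squares solution, while a generic search can only pick the best value from a list of candidates. The sketch below is plain Python with hypothetical names (SlopePipeline, squared_error). It stands in for the general idea and is not an example of backpropagation.

```python
class SlopePipeline:
    """Hypothetical model y = slope * x (not a tpcp class)."""

    def __init__(self, slope=1.0):
        self.slope = slope

    def self_optimize(self, xs, ys):
        # Algorithm-specific knowledge: closed-form least-squares solution.
        self.slope = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
        return self


def squared_error(slope, xs, ys):
    return sum((slope * x - y) ** 2 for x, y in zip(xs, ys))


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

# Internal optimization: one pass over the data yields the optimum.
fitted = SlopePipeline().self_optimize(xs, ys)

# Generic search: limited to the candidates that happen to be in the grid.
candidates = [0.5, 1.0, 1.5, 2.25]
grid_best = min(candidates, key=lambda slope: squared_error(slope, xs, ys))
```

The closed-form fit recovers the exact slope directly, whereas the grid search can only return the closest candidate, no matter how many evaluations it spends.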
However, in the end, the line between the two domains is a bit fuzzy.
You might very well decide to implement something in the self_optimize method and later decide to move this logic into a more generic optimizer class.
Or you might start with a generic GridSearch and move to a self_optimize method once you realise that you need very specific modifications for your algorithm or group of algorithms.