Multiprocessing Caveats#
tpcp uses joblib for multiprocessing in validation and optimization helpers.
This page is intended to be the durable documentation version of the multiprocessing notes that were previously tracked
in GitHub issue #119.
The main point is that most multiprocessing problems in tpcp are not specific to tpcp itself.
They come from Python process semantics, joblib worker reuse, and serialization.
However, these issues surface frequently in tpcp because validation, scoring, caching, and optimization often run
many jobs in parallel and often rely on runtime configuration.
When this matters#
These caveats matter whenever you use joblib-based multiprocessing, for example via:
optimizers that expose
n_jobsyour own joblib parallel code that reuses
tpcp.parallel
Global variables and runtime modifications#
In child processes, global variables are reset to their import-time values. This is expected: the code that mutated those globals in the parent process is not replayed automatically in workers.
This becomes relevant whenever your runtime behavior depends on global state, for example:
sklearn or pandas global configuration
global registries
cache decorators or other runtime patching
monkey-patching a class or function after import
In practice this means that code can work in the parent process and then behave differently in workers without any local code changes.
This is also relevant beyond tpcp.
For example, scikit-learn has its own workaround for some of its internal Parallel usage, but those fixes do not
automatically apply to arbitrary user-defined multiprocessing or to tpcp-specific wrappers.
tpcp workaround: tpcp.parallel#
tpcp provides tpcp.parallel as a workaround for this class of problem.
Its delayed wrapper captures registered state in the parent process and restores it in workers.
The workflow is:
register a callback with
register_global_parallel_callbackthe callback returns a
(value, setter)pairtpcp.parallel.delayedcaptures the value in the parent processthe setter is called in the worker before the actual function executes
This is useful whenever some global configuration or global runtime setup must be visible in worker processes.
Example:
from joblib import Parallel
from sklearn import get_config, set_config
from tpcp.parallel import delayed, register_global_parallel_callback, remove_global_parallel_callback
def callback():
def setter(config):
set_config(**config)
return get_config(), setter
name = register_global_parallel_callback(callback)
try:
Parallel(n_jobs=2)(delayed(worker_func)() for _ in range(2))
finally:
remove_global_parallel_callback(name)
Related background:
Note
The workaround is only applied when tpcp.parallel.delayed is used.
If your own code uses joblib.delayed directly, registered callbacks will not run.
Warning
Different libraries may implement their own worker-state restoration logic. These fixes are not guaranteed to be compatible with each other without additional configuration.
Process pools are reused#
When using joblib’s default loky backend, worker processes are often reused.
This happens not only within a single Parallel(...) call, but can also happen across later parallel calls.
This has an important consequence: if a worker mutates global state, that modified state can still be present when a later job is executed in the same worker.
This can lead to surprising behavior such as:
state “leaking” between logically independent parallel jobs
nondeterministic test failures that depend on execution order
interactions with the global-state workaround above that are hard to reason about
If you need a clean pool, for example in tests, shut the reusable executor down explicitly:
from joblib.externals.loky import get_reusable_executor
get_reusable_executor().shutdown(wait=True, kill_workers=True)
This is mainly useful in tests or in debugging scenarios where you need to eliminate worker reuse as a variable.
Serialization is often the real problem#
Many multiprocessing failures are actually serialization failures. To execute work in child processes, joblib must serialize the function to call and all relevant inputs.
For custom classes, functions, and closures this can become subtle.
At a high level, standard pickle prefers passing objects by reference:
for global objects, it stores import path + object name
for instances, it stores the parent type plus object state
This creates a number of common failure modes.
Common serialization traps#
The most common problematic cases are:
instances of classes whose type is not defined globally
classes defined inside functions
lambdas or other callables without a stable global reference
functions or classes that only exist in
__main__globally defined objects that are replaced or modified at runtime before serialization
The __main__ case is particularly common and confusing.
Objects defined in the currently executing script do not have a stable import path from the point of view of worker
processes.
Typical fixes#
If you see serialization failures:
if the error mentions
__main__, move the relevant class/function to an importable moduleif objects are created dynamically from runtime config, move the object creation into the worker and only pass the config
if the object depends on runtime patching, restore that patch explicitly in workers using
register_global_parallel_callback
About cloudpickle#
joblib often falls back to cloudpickle, which can serialize many objects that plain pickle cannot.
This helps with many dynamic objects and closures.
However:
it is slower
it can hide the underlying reason why serialization is fragile
when it also fails, the error messages can become harder to interpret
So while cloudpickle is often helpful, it should not be treated as a guarantee that any dynamic runtime structure is
safe for multiprocessing.
Runtime patching and caching#
Runtime patching deserves separate attention because it is common in tpcp.
For example, you might apply caching decorators or other wrappers after import or after class definition.
This is convenient, but multiprocessing changes the situation:
workers may import the original object, not the patched one
runtime changes in the parent process may not be replayed in workers
if worker pools are reused, worker-local patched state may persist longer than expected
This is particularly relevant when using tpcp caching utilities with multiprocessing. If a runtime-applied cache or decorator must also exist in workers, do not assume that parent-side setup is enough. Use the documented restoration mechanisms and verify the behavior in parallel explicitly.
The caching recipe already hints at this caveat:
examples/recipies/_01_caching.py
Imports can dominate runtime#
Because objects are reconstructed in worker processes, all relevant imports must also resolve correctly in workers.
This can have a substantial runtime cost. For heavy optional dependencies, import overhead can dominate the actual work.
TensorFlow is a typical example: importing it can take seconds, which can erase much of the benefit of multiprocessing for smaller tasks.
A useful rule is:
if a dependency is optional and only needed in some code paths, delay the import until it is actually needed
This does not solve all multiprocessing issues, but it can materially improve runtime behavior.
Practical debugging checklist#
If multiprocessing behaves strangely, check in this order:
Is the problem actually missing global runtime state in workers?
Does the failure mention pickling,
__main__, or an import path?Are you depending on runtime patching or monkey-patching?
Could worker reuse be leaking state between jobs?
Is import overhead larger than the actual parallel workload?
In many cases, the fastest path to clarity is to temporarily run with n_jobs=1.
If the problem disappears, focus on worker state and serialization next.
tpcp-specific takeaways#
If global configuration is missing in workers, use
tpcp.parallel.If you use custom parallel code together with
tpcpworker-state restoration, make sure you usedelayed, notjoblib.delayed.If tests depend on clean workers, explicitly shut down the reusable
lokyexecutor.If a pickle error mentions
__main__, move the relevant object to a normal module first.If runtime-created objects are difficult to serialize, pass configuration into workers and build the object there.
If the multiprocessing setup becomes too fragile,
n_jobs=1is often the pragmatic fallback.