Multiprocessing Caveats#

tpcp uses joblib for multiprocessing in its validation and optimization helpers. This page is the permanent documentation of the multiprocessing notes that were previously tracked in GitHub issue #119.

Most multiprocessing problems in tpcp are not specific to tpcp itself. They come from Python process semantics, joblib worker reuse, and serialization. They surface frequently in tpcp because validation, scoring, caching, and optimization tend to run many jobs in parallel and to rely on runtime configuration.

When this matters#

These caveats matter whenever you use joblib-based multiprocessing, for example via:

validate
cross_validate
Scorer
optimizers that expose n_jobs
your own joblib parallel code that reuses tpcp.parallel

Global variables and runtime modifications#

In child processes, global variables are reset to their import-time values. This is expected: a worker imports the modules it needs from scratch, and the code that mutated those globals in the parent process is not replayed automatically.

This becomes relevant whenever your runtime behavior depends on global state, for example:

sklearn or pandas global configuration
global registries
cache decorators or other runtime patching
monkey-patching a class or function after import

In practice this means that code can work in the parent process and then behave differently in workers without any local code changes.
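The effect can be reproduced with the standard library alone. The following is a minimal sketch, not tpcp code: `demo_settings` and `MODE` are hypothetical names, and a fresh Python interpreter started via `subprocess` stands in for a worker process.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical module with a single global, written to a temporary directory.
module_dir = tempfile.mkdtemp()
Path(module_dir, "demo_settings.py").write_text("MODE = 'default'\n")

# Parent process: import the module, then change the global at runtime.
sys.path.insert(0, module_dir)
import demo_settings

demo_settings.MODE = "patched"
parent_value = demo_settings.MODE

# Fresh interpreter (stand-in for a worker): it imports the module again,
# so it only sees the import-time value.
result = subprocess.run(
    [sys.executable, "-c", "import demo_settings; print(demo_settings.MODE)"],
    env={**os.environ, "PYTHONPATH": module_dir},
    capture_output=True,
    text=True,
    check=True,
)
worker_value = result.stdout.strip()

print(parent_value)  # patched
print(worker_value)  # default
```

The parent and the fresh interpreter run the same module code, yet disagree about `MODE`, because only the parent executed the assignment.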

This is also relevant beyond tpcp. For example, scikit-learn has its own workaround for some of its internal Parallel usage, but those fixes do not automatically apply to arbitrary user-defined multiprocessing or to tpcp-specific wrappers.

tpcp workaround: `tpcp.parallel`#

tpcp provides tpcp.parallel as a workaround for this class of problem. Its delayed wrapper captures registered state in the parent process and restores it in workers.

The workflow is:

register a callback with register_global_parallel_callback
the callback returns a (value, setter) pair
tpcp.parallel.delayed captures the value in the parent process
the setter is called in the worker before the actual function executes

This is useful whenever some global configuration or global runtime setup must be visible in worker processes.

Example:

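from joblib import Parallel
from sklearn import get_config, set_config
from tpcp.parallel import delayed, register_global_parallel_callback, remove_global_parallel_callback

def worker_func():
    # Placeholder job: reports the sklearn config that the worker sees.
    return get_config()

def callback():
    # Runs in the parent process and returns the (value, setter) pair.
    def setter(config):
        # Runs in the worker before the actual job function executes.
        set_config(**config)

    return get_config(), setter

name = register_global_parallel_callback(callback)
try:
    Parallel(n_jobs=2)(delayed(worker_func)() for _ in range(2))
finally:
    # Always unregister, so the callback does not affect later parallel calls.
    remove_global_parallel_callback(name)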
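To make the capture-and-restore workflow more concrete, here is a simplified, single-process sketch of the idea. It is not tpcp's implementation: `register_callback_sketch`, `delayed_sketch`, and `CONFIG` are hypothetical names, the "worker" is simulated by resetting the global by hand, and the wrapper returns a `functools.partial` instead of the tuple that joblib expects.

```python
import functools

# Hypothetical registry: name -> callback returning a (value, setter) pair.
_callbacks = {}


def register_callback_sketch(name, callback):
    _callbacks[name] = callback


def _run_with_restored_state(restore_pairs, func, args, kwargs):
    # "Worker" side: apply every setter before the actual function runs.
    for value, setter in restore_pairs:
        setter(value)
    return func(*args, **kwargs)


def delayed_sketch(func):
    def wrapper(*args, **kwargs):
        # "Parent" side: capture the current values when the job is created.
        restore_pairs = [callback() for callback in _callbacks.values()]
        return functools.partial(
            _run_with_restored_state, restore_pairs, func, args, kwargs
        )

    return wrapper


CONFIG = {"mode": "default"}


def set_mode(value):
    CONFIG["mode"] = value


def read_mode():
    return CONFIG["mode"]


register_callback_sketch("mode", lambda: (CONFIG["mode"], set_mode))

CONFIG["mode"] = "patched"          # runtime change in the "parent"
task = delayed_sketch(read_mode)()  # the value is captured here
CONFIG["mode"] = "default"          # simulate a worker with import-time state
restored = task()                   # setter runs first, then read_mode
print(restored)  # patched
```

The important design point is the timing: the value is captured when the job is created, and the setter runs immediately before the job function, so the job sees the parent's state even though the process-wide default differs.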

Note

The workaround is only applied when tpcp.parallel.delayed is used. If your own code uses joblib.delayed directly, registered callbacks will not run.

Warning

Different libraries may implement their own worker-state restoration logic. These fixes are not guaranteed to be compatible with each other without additional configuration.

Process pools are reused#

When using joblib’s default loky backend, worker processes are often reused. This happens not only within a single Parallel(...) call, but can also happen across later parallel calls.

This has an important consequence: if a worker mutates global state, that modified state can still be present when a later job is executed in the same worker.

This can lead to surprising behavior such as:

state “leaking” between logically independent parallel jobs
nondeterministic test failures that depend on execution order
interactions with the global-state workaround above that are hard to reason about

If you need a clean pool, for example in tests, shut the reusable executor down explicitly:

from joblib.externals.loky import get_reusable_executor

get_reusable_executor().shutdown(wait=True, kill_workers=True)

This is mainly useful in tests or in debugging scenarios where you need to eliminate worker reuse as a variable.
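The leak itself can be illustrated without joblib. The sketch below uses the standard library's `ProcessPoolExecutor` with a single worker as a stand-in for a reused loky worker. It assumes a POSIX system where the `fork` start method is available, and it uses the interpreter's recursion limit purely as an easily observable piece of process-wide state.

```python
import multiprocessing as mp
import sys
from concurrent.futures import ProcessPoolExecutor

parent_limit = sys.getrecursionlimit()
new_limit = parent_limit + 500

# One worker only, so both jobs are guaranteed to run in the same process.
with ProcessPoolExecutor(max_workers=1, mp_context=mp.get_context("fork")) as pool:
    # Job 1 changes process-wide state inside the worker.
    pool.submit(sys.setrecursionlimit, new_limit).result()
    # Job 2 is logically independent, but still observes that change.
    leaked_limit = pool.submit(sys.getrecursionlimit).result()

print(leaked_limit == new_limit)                # True
print(sys.getrecursionlimit() == parent_limit)  # True
```

The parent's own limit is untouched, which is why this kind of leak is easy to miss: the state only exists inside the reused worker.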

Serialization is often the real problem#

Many multiprocessing failures are actually serialization failures. To execute work in child processes, joblib must serialize the function to call and all relevant inputs.

For custom classes, functions, and closures this can become subtle.

At a high level, standard pickle prefers passing objects by reference:

for global objects, it stores import path + object name
for instances, it stores a reference to the instance's class plus the object state
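A quick way to see the by-reference behavior is to look at the pickled bytes. The following sketch uses only standard-library objects as examples; `window` is an arbitrary attribute name chosen for illustration.

```python
import pickle
from collections import OrderedDict
from types import SimpleNamespace

# A global object (here a class) is stored as module name + object name.
class_payload = pickle.dumps(OrderedDict)

# An instance is stored as a reference to its class plus its state.
instance_payload = pickle.dumps(SimpleNamespace(window=5))

print(b"collections" in class_payload, b"OrderedDict" in class_payload)       # True True
print(b"SimpleNamespace" in instance_payload, b"window" in instance_payload)  # True True
```

Neither payload contains the code of the class. The receiving process must be able to import the same name from the same module.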

This creates a number of common failure modes.

Common serialization traps#

The most common problematic cases are:

instances of classes whose type is not defined globally
classes defined inside functions
lambdas or other callables without a stable global reference
functions or classes that only exist in __main__
globally defined objects that are replaced or modified at runtime before serialization

The __main__ case is particularly common and confusing. Objects defined in the currently executing script do not have a stable import path from the point of view of worker processes.
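Two of these traps can be reproduced with plain `pickle`. This is a small sketch with hypothetical names (`LocalModel`, `is_picklable`). Note that joblib's default backend may still manage to serialize such objects via cloudpickle, so a failure here indicates fragility rather than a guaranteed error in joblib.

```python
import pickle


def make_local_instance():
    class LocalModel:  # class defined inside a function
        pass

    return LocalModel()


def is_picklable(obj):
    # The exact exception type differs between Python versions.
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


results = {
    "builtin function": is_picklable(sorted),
    "lambda": is_picklable(lambda x: x),
    "instance of local class": is_picklable(make_local_instance()),
}
print(results)
# {'builtin function': True, 'lambda': False, 'instance of local class': False}
```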

Typical fixes#

If you see serialization failures:

if the error mentions __main__, move the relevant class/function to an importable module
if objects are created dynamically from runtime config, move the object creation into the worker and only pass the config
if the object depends on runtime patching, restore that patch explicitly in workers using register_global_parallel_callback
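The second fix, passing configuration and building the object in the worker, could look roughly like the sketch below. The names `build_scaler` and `job` are hypothetical, and the pickle round trip only simulates sending the config to a worker.

```python
import pickle


def build_scaler(config):
    # Dynamically created object: a closure that plain pickle cannot serialize.
    factor = config["factor"]
    return lambda value: value * factor


def job(config, value):
    # The object is built inside the job, so only `config` crosses the
    # process boundary.
    scaler = build_scaler(config)
    return scaler(value)


config = {"factor": 2}

# Plain data survives serialization without special handling.
shipped_config = pickle.loads(pickle.dumps(config))

result = job(shipped_config, 21)
print(result)  # 42
```

With this layout, the job function and its inputs are ordinary importable or plain-data objects, and the hard-to-serialize object never needs to be pickled.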

About `cloudpickle`#

joblib often falls back to cloudpickle, which can serialize many objects that plain pickle cannot. This helps with many dynamic objects and closures.

However:

it is slower
it can hide the underlying reason why serialization is fragile
when it also fails, the error messages can become harder to interpret

So while cloudpickle is often helpful, it should not be treated as a guarantee that any dynamic runtime structure is safe for multiprocessing.

Runtime patching and caching#

Runtime patching deserves separate attention because it is common in tpcp. For example, you might apply caching decorators or other wrappers after import or after class definition.

This is convenient, but multiprocessing changes the situation:

workers may import the original object, not the patched one
runtime changes in the parent process may not be replayed in workers
if worker pools are reused, worker-local patched state may persist longer than expected

This is particularly relevant when using tpcp caching utilities with multiprocessing. If a runtime-applied cache or decorator must also exist in workers, do not assume that parent-side setup is enough. Use the documented restoration mechanisms and verify the behavior in parallel explicitly.

The caching recipe already hints at this caveat:

examples/recipies/_01_caching.py
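The first point, workers importing the original object rather than the patched one, can be shown with a runtime-applied `functools.lru_cache`. This is a sketch under stated assumptions: `demo_algo` and `compute` are hypothetical names, `lru_cache` stands in for any runtime-applied caching decorator, and a fresh interpreter started via `subprocess` stands in for a worker.

```python
import functools
import os
import pickle
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical module with a plain function, written to a temporary directory.
module_dir = tempfile.mkdtemp()
Path(module_dir, "demo_algo.py").write_text("def compute(x):\n    return x * 2\n")
sys.path.insert(0, module_dir)
import demo_algo

# Runtime patch in the parent: wrap the function in a cache after import.
demo_algo.compute = functools.lru_cache(maxsize=None)(demo_algo.compute)
parent_has_cache = hasattr(demo_algo.compute, "cache_info")

# Pickling succeeds, but only stores the reference "demo_algo.compute".
payload = pickle.dumps(demo_algo.compute)

# Fresh interpreter: resolving the reference yields the unpatched function.
check = (
    "import pickle, sys; "
    "func = pickle.loads(bytes.fromhex(sys.argv[1])); "
    "print(hasattr(func, 'cache_info'))"
)
result = subprocess.run(
    [sys.executable, "-c", check, payload.hex()],
    env={**os.environ, "PYTHONPATH": module_dir},
    capture_output=True,
    text=True,
    check=True,
)
worker_has_cache = result.stdout.strip() == "True"

print(parent_has_cache)  # True
print(worker_has_cache)  # False
```

No error is raised anywhere, which is what makes this case hard to spot: the job simply runs without the cache.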

Imports can dominate runtime#

Because objects are reconstructed in worker processes, all relevant imports must also resolve correctly in workers.

This import work is repeated in each worker process and can have a substantial runtime cost. For heavy optional dependencies, import overhead can dominate the actual work.

TensorFlow is a typical example: importing it can take seconds, which can erase much of the benefit of multiprocessing for smaller tasks.

A useful rule is:

if a dependency is optional and only needed in some code paths, delay the import until it is actually needed

This does not solve all multiprocessing issues, but it can materially improve runtime behavior.
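A deferred import can be as simple as moving the import statement into the code path that needs it. In the sketch below, the standard-library `statistics` module stands in for a heavy optional dependency, and `summarize` is a hypothetical function.

```python
def summarize(values, robust=False):
    if robust:
        # Deferred import: only processes that reach this branch pay for it.
        import statistics  # stand-in for a heavy optional dependency

        return statistics.median(values)
    return sum(values) / len(values)


print(summarize([1, 2, 6]))               # 3.0
print(summarize([1, 2, 6], robust=True))  # 2
```

Workers that only ever take the default path never import the optional dependency at all.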

Practical debugging checklist#

If multiprocessing behaves strangely, check in this order:

Is the problem actually missing global runtime state in workers?
Does the failure mention pickling, __main__, or an import path?
Are you depending on runtime patching or monkey-patching?
Could worker reuse be leaking state between jobs?
Is import overhead larger than the actual parallel workload?

In many cases, the fastest path to clarity is to temporarily run with n_jobs=1. If the problem disappears, focus on worker state and serialization next.

tpcp-specific takeaways#

If global configuration is missing in workers, use tpcp.parallel.
If you use custom parallel code together with tpcp worker-state restoration, make sure you use tpcp.parallel.delayed, not joblib.delayed.
If tests depend on clean workers, explicitly shut down the reusable loky executor.
If a pickle error mentions __main__, move the relevant object to a normal module first.
If runtime-created objects are difficult to serialize, pass configuration into workers and build the object there.
If the multiprocessing setup becomes too fragile, n_jobs=1 is often the pragmatic fallback.
