- Dec 2023
-
pythonspeed.com
-
Running the code in a subprocess is much slower than running a thread, not because the computation is slower, but because of the overhead of copying and (de)serializing the data. So how do you avoid this overhead?
Reducing the performance hit of copying data between processes:
Option #1: Just use threads
Processes have this copying overhead; threads do not, since they share the same memory. And while it’s true that generic Python code won’t parallelize well when using multiple threads, that’s not necessarily true for your Python code. For example, NumPy releases the GIL for many of its operations, which means you can use multiple CPU cores even with threads.
```
# numpy_gil.py
import numpy as np
from time import time
from multiprocessing.pool import ThreadPool

arr = np.ones((1024, 1024, 1024))

start = time()
for i in range(10):
    arr.sum()
print("Sequential:", time() - start)

expected = arr.sum()

start = time()
with ThreadPool(4) as pool:
    result = pool.map(np.sum, [arr] * 10)
    assert result == [expected] * 10
print("4 threads:", time() - start)
```
When run, we see that NumPy uses multiple cores just fine when using threads, at least for this operation:
$ python numpy_gil.py
Sequential: 4.253053188323975
4 threads: 1.3854241371154785
Pandas is built on NumPy, so many numeric operations will likely release the GIL as well. However, anything involving strings, or Python objects in general, will not. So another approach is to use a library like Polars, which is designed from the ground up for parallelism, to the point where you don’t have to think about it at all: it has an internal thread pool.
Option #2: Live with it
If you’re stuck with using processes, you might just decide to live with the overhead of pickling. In particular, if you minimize how much data gets passed back and forth between processes, and the computation in each process is significant enough, the cost of copying and serializing data might not significantly impact your program’s runtime. Spending a few seconds on pickling doesn’t really matter if your subsequent computation takes 10 minutes.
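To judge whether the overhead matters in your case, you can measure it. Here is a minimal sketch using only the standard library. The helper name `pickle_roundtrip_seconds` and the sample payload are illustrative assumptions. It only approximates the real cost, since it leaves out copying the bytes between processes.

```python
import pickle
from time import perf_counter


def pickle_roundtrip_seconds(obj) -> tuple[float, int]:
    """Return (seconds spent pickling and unpickling obj, pickled size in bytes)."""
    start = perf_counter()
    data = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    pickle.loads(data)
    return perf_counter() - start, len(data)


if __name__ == "__main__":
    # Hypothetical payload standing in for whatever you send to a worker.
    payload = list(range(1_000_000))
    seconds, size = pickle_roundtrip_seconds(payload)
    print(f"Round trip: {seconds:.3f}s for {size / 1e6:.1f} MB")
```

If the round trip is a small fraction of the time the worker spends computing, living with it is probably fine.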
Option #3: Write the data to disk
Instead of passing data directly, you can write the data to disk, and then pass the path to this file:
* to the subprocess (as an argument)
* to the parent process (as the return value of the function running in the worker process).
The recipient process can then parse the file.
```
import pandas as pd
import multiprocessing as mp
from pathlib import Path
from tempfile import mkdtemp
from time import time


def noop(df: pd.DataFrame):
    # real code would process the dataframe here
    pass


def noop_from_path(path: Path):
    df = pd.read_parquet(path, engine="fastparquet")
    # real code would process the dataframe here
    pass


def main():
    df = pd.DataFrame({"column": list(range(10_000_000))})

    with mp.get_context("spawn").Pool(1) as pool:
        # Pass the DataFrame to the worker process
        # directly, via pickling:
        start = time()
        pool.apply(noop, (df,))
        print("Pickling-based:", time() - start)

        # Write the DataFrame to a file, pass the path to
        # the file to the worker process:
        start = time()
        path = Path(mkdtemp()) / "temp.parquet"
        df.to_parquet(
            path,
            engine="fastparquet",
            # Run faster by skipping compression:
            compression="uncompressed",
        )
        pool.apply(noop_from_path, (path,))
        print("Parquet-based:", time() - start)


if __name__ == "__main__":
    main()
```
Option #4: multiprocessing.shared_memory
Because processes sometimes do want to share memory, operating systems typically provide facilities for explicitly creating shared memory between processes. Python wraps these facilities in the multiprocessing.shared_memory module.
However, unlike threads, where the same memory address space allows trivially sharing Python objects, in this case you’re mostly limited to sharing arrays. And as we’ve seen, NumPy releases the GIL for expensive operations, which means you can just use threads, which is much simpler. Still, in case you ever need it, it’s worth knowing this module exists.
Note: The module also includes ShareableList, which is a bit like a Python list but limited to int, float, bool, small str and bytes, and None. But this doesn’t help you cheaply share an arbitrary Python object.
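For a sense of what the module looks like in use, here is a minimal sketch that stores four floats in a shared block and reads them back by name. The function names are hypothetical. To keep it short, the reader attaches from the same process; in real use you would pass `shm.name` (a short string) to a worker and attach there. With NumPy you could instead wrap `shm.buf` in an array via `np.ndarray(..., buffer=shm.buf)`.

```python
import struct
from multiprocessing import shared_memory

VALUES = (1.0, 2.0, 3.0, 4.0)
FORMAT = f"{len(VALUES)}d"  # four C doubles


def write_values() -> shared_memory.SharedMemory:
    """Create a shared memory block and store VALUES in it."""
    shm = shared_memory.SharedMemory(create=True, size=struct.calcsize(FORMAT))
    struct.pack_into(FORMAT, shm.buf, 0, *VALUES)
    return shm


def read_values(name: str) -> tuple:
    """Attach to an existing block by name, as a worker process would."""
    shm = shared_memory.SharedMemory(name=name)
    try:
        return struct.unpack_from(FORMAT, shm.buf, 0)
    finally:
        shm.close()  # detach only; the creator is responsible for unlink()


def demo() -> tuple:
    shm = write_values()
    try:
        # Only the short name string needs to be sent to another process.
        return read_values(shm.name)
    finally:
        shm.close()
        shm.unlink()  # free the block once nobody needs it


if __name__ == "__main__":
    print(demo())
```

Note the manual lifecycle: every process calls `close()`, and exactly one calls `unlink()`. That bookkeeping is part of why threads are the simpler choice when they work.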
A bad option for Linux: the "fork" context
You may have noticed we did multiprocessing.get_context("spawn").Pool() to create a process pool. This is because Python has multiple implementations of multiprocessing on some OSes. "spawn" is the only option on Windows, the only non-broken option on macOS, and available on Linux. When using "spawn", a completely new process is created, so you always have to copy data across.
On Linux, the default is "fork": the new child process has a complete copy of the memory of the parent process at the time of the child process’ creation. This means any objects in the parent (arrays, giant dicts, whatever) that were created before the child process was created, and were stored somewhere helpful like a module, are accessible to the child. Which means you don’t need to pickle/unpickle to access them.
Sounds useful, right? There’s only one problem: the "fork" context is super-broken, which is why it will stop being the default in Python 3.14.
Consider the following program:
```
import threading
import sys
from multiprocessing import Process


def thread1():
    for i in range(1000):
        print("hello", file=sys.stderr)


threading.Thread(target=thread1).start()


def foo():
    pass


Process(target=foo).start()
```
On my computer, this program consistently deadlocks: it freezes and never exits. Any time you have threads in the parent process, the "fork" context can cause deadlocks, or even corrupted memory, in the child process.
You might think that you’re fine because you don’t start any threads. But many Python libraries start a thread pool on import, for example NumPy. If you’re using NumPy, Pandas, or any other library that depends on NumPy, you are running a threaded program, and therefore at risk of deadlocks, segfaults, or data corruption when using the "fork" multiprocessing context. For more details see this article on why multiprocessing’s default is broken on Linux.
You’re just shooting yourself in the foot if you take this approach.
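If you do need processes, you can avoid depending on the platform default by asking for a start method explicitly, as the earlier example did with get_context("spawn"). Here is a minimal sketch, assuming a plain pool is all you need. The helper name `make_pool` is hypothetical.

```python
import math
import multiprocessing as mp


def make_pool(processes: int = 2, method: str = "spawn"):
    """Create a process pool with an explicit start method.

    Asking for a context avoids relying on the platform default
    (which may be "fork" on Linux).
    """
    if method not in mp.get_all_start_methods():
        raise ValueError(f"start method {method!r} is not available here")
    return mp.get_context(method).Pool(processes)


if __name__ == "__main__":
    print("Available:", mp.get_all_start_methods())
    with make_pool() as pool:
        # math.sqrt is importable by name, so it pickles cleanly under "spawn".
        print(pool.map(math.sqrt, [1, 4, 9]))
```

Using a context object rather than a global `set_start_method()` call keeps the choice local to your code, so it won't clash with other libraries that also use multiprocessing.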
-
-
www.bitecode.dev
-
Both are bad if you need to cancel tasks, collaborate a lot between tasks, deal precisely with the task lifecycle, need a huge number of workers, or want to milk every last bit of performance. You won’t get anywhere near Rust-level speed.
-
Thread pools are good for:
- Tasks (network, file, etc.) that need fewer than 10_000 I/O interactions per second. The number is higher than you would expect, because threads are surprisingly cheap nowadays, and you can spawn a lot of them without bloating memory too much. The limit is more the price of context switching. This is not a scientific number; it's a general direction that you should challenge by measuring your own particular case.
- When you need to share data between the tasks.
- When you are not CPU bound.
- When you are OK with executing tasks a bit slower to ensure you are not blocking any of them (E.G: a user interface and a long calculation).
- When you are CPU bound, but the CPU calculations are delegated to a C extension that releases the GIL, such as numpy. Free parallelism on the cheap, yeah!
E.G: a web scraper, a GUI to zip files, a development server, sending emails without blocking web page rendering, etc.
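To make the I/O-bound case concrete, here is a minimal sketch of a thread pool running blocking calls concurrently. It assumes `concurrent.futures.ThreadPoolExecutor` from the standard library, and `fake_io_task` is a hypothetical stand-in that sleeps instead of doing real network or file work.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_io_task(task_id: int, delay: float = 0.05) -> int:
    """Stand-in for a network or file call; sleeping releases the GIL."""
    time.sleep(delay)
    return task_id


def run_in_threads(n_tasks: int, max_workers: int = 8) -> list[int]:
    """Run n_tasks blocking calls on a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(fake_io_task, range(n_tasks)))


if __name__ == "__main__":
    start = time.perf_counter()
    results = run_in_threads(8)
    print(results, f"in {time.perf_counter() - start:.2f}s")
```

With eight workers, the eight waits overlap, so the total time should be close to one task's delay rather than the sum. As the list above says, measure your own case.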
-