- Dec 2023
-
pythonspeed.com
-
When you’re writing Python, though, you want to share Python objects between processes.
To enable this, when you pass Python objects between processes using Python’s multiprocessing library:
- On the sender side, the arguments get serialized to bytes with the pickle module.
- On the receiver side, the bytes are deserialized using pickle.
This serialization and deserialization process involves computation, which can potentially be slow.
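As a rough illustration of that cost, the sketch below performs by hand the same round trip that multiprocessing performs on arguments. The helper name `pickle_round_trip` and the sample data are made up for this example, and the timing will vary by machine.

```python
import pickle
import time

def pickle_round_trip(obj):
    """Serialize obj to bytes and back, mimicking a sender/receiver pair."""
    start = time.perf_counter()
    payload = pickle.dumps(obj)        # sender side: object -> bytes
    restored = pickle.loads(payload)   # receiver side: bytes -> new object
    elapsed = time.perf_counter() - start
    return restored, len(payload), elapsed

# Hypothetical sample data, chosen only to make the cost measurable.
data = {"values": list(range(100_000)), "label": "example"}
restored, size, elapsed = pickle_round_trip(data)
print(f"{size} bytes, {elapsed * 1000:.2f} ms round trip")
```

The restored object is equal to the original but is a separate copy, so both the CPU time and the extra memory are paid on every transfer.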
-
Threads vs. processes
Multiple threads let you run code in parallel, potentially on multiple CPUs. In Python, however, the global interpreter lock makes this parallelism harder to achieve.
Multiple processes also let you run code in parallel—so what’s the difference between threads and processes?
All the threads inside a single process share the same memory address space. If thread 1 in a process stores some data at address 0x7f0cd1a88810, thread 2 can access the same data at the same address. That means passing objects between threads is cheap: you just need to get the pointer to the memory address from one thread to the other. A memory address is 8 bytes on a 64-bit system: this is not a lot of data to move around.
In contrast, processes do not share the same memory space. Operating systems typically provide some shared memory facilities, and we’ll get to those later. But by default, no memory is shared. That means you can’t just share the address of your data across processes: you have to copy the data.
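A small sketch of the difference, under one loud assumption: to keep the example single-process, a pickle round trip stands in for what crossing a process boundary does. The `worker` function and variable names are hypothetical.

```python
import pickle
import queue
import threading

def worker(inbox, outbox):
    obj = inbox.get()            # receives a reference to the same object
    obj.append("from thread")    # mutation is visible to the main thread
    outbox.put(id(obj))          # report the object's identity

data = ["from main"]
inbox, outbox = queue.Queue(), queue.Queue()
thread = threading.Thread(target=worker, args=(inbox, outbox))
thread.start()
inbox.put(data)                  # only a reference is handed over
thread.join()
same_object = outbox.get() == id(data)

# Stand-in for a process boundary: the data is copied, not shared.
copy = pickle.loads(pickle.dumps(data))
copy.append("from copy")

print(same_object)   # True
print(data)          # ['from main', 'from thread']
print(copy)          # ['from main', 'from thread', 'from copy']
```

The thread sees and modifies the original list, while the change made to the pickled copy never reaches `data`.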
-
-
tonybaloney.github.io
-
Inter-Worker communication
Whether using sub interpreters or multiprocessing, you cannot simply send existing Python objects to worker processes.
Multiprocessing uses pickle by default. When you start a process or use a process pool, you can use pipes, queues and shared memory as mechanisms for sending data between the workers and the main process. These mechanisms revolve around pickling. Pickle is Python’s built-in serialization library; it can convert most Python objects into a byte string and back into a Python object.
Pickle is very flexible. You can serialize many different types of Python objects (but not all), and Python objects can even define a method for how they are serialized. It also handles nested objects and properties. However, with that flexibility comes a performance hit. Pickle is slow. So if you have a worker model that relies upon continuous inter-worker communication of complex pickled data, you’ll likely see a bottleneck.
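To make the "most objects, but not all" point concrete, here is a minimal sketch using the standard `__getstate__` and `__setstate__` hooks. The `Counter` class and the `can_pickle` helper are hypothetical names invented for this example.

```python
import pickle
import threading

class Counter:
    """Hypothetical class mixing plain data with an unpicklable lock."""

    def __init__(self, count=0):
        self.count = count
        self.lock = threading.Lock()

    def __getstate__(self):
        # Serialize only the plain data; the lock cannot be pickled.
        state = self.__dict__.copy()
        del state["lock"]
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.lock = threading.Lock()   # recreate a fresh lock on load

def can_pickle(obj):
    """Return True if pickle can serialize obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, AttributeError, pickle.PicklingError):
        return False

clone = pickle.loads(pickle.dumps(Counter(5)))
print(clone.count)                    # 5
print(can_pickle(threading.Lock()))   # False
```

A bare lock cannot be pickled, but a class that contains one can still cross the boundary by describing how its own state should be saved and rebuilt.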
Sub interpreters can accept pickled data. They also have a second mechanism called shared data. Shared data is a high-speed shared memory space that interpreters can write to in order to share data with other interpreters. It supports only immutable types:
- Strings
- Byte Strings
- Integers and Floats
- Boolean and None
- Tuples (and tuples of tuples)
To share data with an interpreter, you can either set it as initialization data or you can send it through a channel.
-