Basic Process Creation with Multiprocessing

This snippet demonstrates how to create and start new processes with the multiprocessing module, covering the fundamental steps of parallel execution in Python.

Code Example

This code creates three separate processes, each executing the worker function. The multiprocessing.Process class creates a new process: the target argument names the function to run in that process, and args supplies its arguments. The start() method launches the process, and join() blocks until it finishes. The if __name__ == '__main__': guard is crucial: under the spawn start method (the default on Windows and, since Python 3.8, on macOS), each child process imports the main module, and without the guard the process-creation code would run again on import, spawning processes recursively.

import multiprocessing
import time

def worker(num):
    print(f'Worker {num}: Starting')
    time.sleep(2)  # Simulate some work
    print(f'Worker {num}: Finishing')

if __name__ == '__main__':
    processes = []
    for i in range(3):
        p = multiprocessing.Process(target=worker, args=(i,))
        processes.append(p)
        p.start()

    for p in processes:
        p.join()  # Wait for all processes to finish

    print('All workers finished.')

Concepts Behind the Snippet

The core idea is to achieve parallelism by distributing tasks across multiple processes. Each process has its own memory space, meaning variables and data are not shared by default. This avoids common concurrency issues like race conditions that are prevalent in multithreaded environments. The multiprocessing module creates entirely new Python interpreters, each with its own Global Interpreter Lock (GIL), allowing true parallel execution on multi-core systems. The GIL limits true parallelism in multithreaded Python programs because only one thread can hold control of the Python interpreter at any given time. Processes circumvent this limitation.
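
To make the isolation concrete, here is a minimal sketch (the append_item function and the module-level data list are illustrative, not part of the snippet above): the child process mutates its own copy of data, and the parent never sees the change.

import multiprocessing

data = []  # each process gets its own copy of this list

def append_item():
    data.append('from child')          # mutates the child's copy only
    print(f'child sees:  {data}')      # ['from child']

if __name__ == '__main__':
    p = multiprocessing.Process(target=append_item)
    p.start()
    p.join()
    print(f'parent sees: {data}')      # [] -- the change never crossed over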

Real-Life Use Case

Imagine you're building a web scraper that needs to fetch data from hundreds of websites. Scraping each website sequentially would be very slow. By using multiprocessing, you can scrape multiple websites concurrently, significantly reducing the overall scraping time. Another example would be image or video processing, where large files can be split into smaller chunks and processed in parallel by different processes.
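
As a hedged sketch of that pattern, the following fans a list of URLs out across a pool of worker processes; fetch is a stand-in that sleeps instead of downloading, and the URLs are placeholders. For heavily I/O-bound scraping, the threading and asyncio alternatives discussed later are often lighter-weight; processes pay off when parsing or processing the fetched data is CPU-heavy.

import multiprocessing
import time

def fetch(url):
    time.sleep(1)                      # stand-in for a real network request
    return f'contents of {url}'

if __name__ == '__main__':
    urls = [f'https://example.com/page{i}' for i in range(8)]
    with multiprocessing.Pool(processes=4) as pool:
        pages = pool.map(fetch, urls)  # 8 URLs handled 4 at a time
    print(f'fetched {len(pages)} pages')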

Best Practices

  • Always use the if __name__ == '__main__': guard, especially on Windows, so that child processes importing the main module do not re-run the process-creation code.
  • Consider a process pool (multiprocessing.Pool) for managing a group of worker processes efficiently.
  • Be mindful of inter-process communication (IPC): sharing data between processes requires explicit mechanisms like pipes, queues, or shared memory, as in the sketch after this list.
  • Keep each process's workload as independent as possible to minimize communication overhead.
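
As a minimal IPC sketch (the square function and the fan-out of three workers are illustrative): each worker puts its result on a multiprocessing.Queue, and the parent drains the queue before joining. Draining first matters, because a child that still has items buffered for the queue will not exit, so joining before draining can deadlock.

import multiprocessing

def square(n, results):
    results.put((n, n * n))            # send the result back to the parent

if __name__ == '__main__':
    results = multiprocessing.Queue()
    procs = [multiprocessing.Process(target=square, args=(i, results))
             for i in range(3)]
    for p in procs:
        p.start()
    for _ in procs:
        print(results.get())           # drain the queue before joining
    for p in procs:
        p.join()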

Interview Tip

Be prepared to explain the difference between threads and processes in Python. Focus on the GIL limitation in multithreading and how multiprocessing overcomes it by creating separate Python interpreters. Also, discuss the challenges of inter-process communication and the available mechanisms for sharing data between processes.
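
One rough way to demonstrate the difference (a sketch; exact timings vary by machine, and it assumes a standard GIL-enabled CPython build): run the same CPU-bound function in two threads, then in two processes. The threaded run takes about as long as running the function twice serially, while the two processes genuinely overlap.

import multiprocessing
import threading
import time

def burn_cpu():
    total = 0
    for i in range(10_000_000):        # pure bytecode, so the GIL dominates
        total += i

if __name__ == '__main__':
    start = time.perf_counter()
    threads = [threading.Thread(target=burn_cpu) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(f'2 threads:   {time.perf_counter() - start:.2f}s')

    start = time.perf_counter()
    procs = [multiprocessing.Process(target=burn_cpu) for _ in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(f'2 processes: {time.perf_counter() - start:.2f}s')  # roughly half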

When to Use Processes

Use processes when you need true parallel execution on multi-core systems and are dealing with CPU-bound tasks. Processes are also suitable when you need isolation between parts of your application, since each process has its own memory space. Prefer processes when data sharing is minimal: objects sent between processes are pickled and unpickled, which adds overhead.
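
For instance, a CPU-bound workload like the deliberately naive prime counter below (illustrative only) parallelizes cleanly with a pool: each call is independent, and only a small integer result travels back between processes.

import multiprocessing

def count_primes(limit):
    # naive trial division, purely to keep a core busy
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count

if __name__ == '__main__':
    limits = [20_000, 30_000, 40_000, 50_000]
    with multiprocessing.Pool() as pool:
        print(pool.map(count_primes, limits))  # each count can run on its own core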

Memory Footprint

Processes generally have a higher memory footprint compared to threads because each process has its own copy of the Python interpreter and the program's data. This can be a significant consideration if you need to create a large number of processes.

Alternatives

Alternatives to multiprocessing include multithreading (the threading module), asynchronous programming (asyncio), and distributed computing frameworks like Dask or Spark. Multithreading suits I/O-bound tasks, where threads spend most of their time waiting and the GIL is not a major bottleneck. Asynchronous programming is a good fit for large numbers of concurrent I/O operations multiplexed on a single-threaded event loop. Dask and Spark are designed for large-scale data processing distributed across multiple machines.

Pros

  • True parallelism on multi-core systems.
  • Isolation between processes, preventing crashes in one process from affecting others.
  • Overcomes the GIL limitation of Python threads for CPU-bound tasks.

Cons

  • Higher memory overhead compared to threads.
  • More complex inter-process communication.
  • Process creation can be slower than thread creation.

FAQ

  • What is the Global Interpreter Lock (GIL)?

    The GIL is a mechanism in CPython that allows only one thread to hold control of the Python interpreter at any given time. This prevents multiple threads from executing Python bytecode in parallel, effectively limiting true parallelism in multithreaded Python programs. Processes circumvent this limitation because each process has its own Python interpreter and GIL.

  • How do I share data between processes?

    You can share data between processes using mechanisms like multiprocessing.Queue, multiprocessing.Pipe, and multiprocessing.sharedctypes. Queues provide a process- and thread-safe way to pass messages between processes. Pipes allow bidirectional communication between two processes. Shared memory (sharedctypes) lets processes read and modify the same memory region. Choose the mechanism based on how much data you share and how much synchronization you need; a shared-memory sketch follows below.
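
As one concrete example, here is a shared-memory sketch using multiprocessing.Value, which is built on sharedctypes (the increment function and the counts are illustrative): four workers increment a single counter that lives in shared memory, taking the value's built-in lock around each update so no increment is lost.

import multiprocessing

def increment(counter, times):
    for _ in range(times):
        with counter.get_lock():       # guard against lost updates
            counter.value += 1

if __name__ == '__main__':
    counter = multiprocessing.Value('i', 0)   # a C int in shared memory
    procs = [multiprocessing.Process(target=increment, args=(counter, 1000))
             for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(counter.value)               # 4000: every increment was applied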