
What are processes (`multiprocessing`)?

Understanding Processes with multiprocessing in Python

This tutorial explores the concept of processes in Python, specifically how to leverage the multiprocessing module to achieve true parallelism. We will cover creating and managing processes and communicating between them, with practical code examples and explanations.

Introduction to Processes and multiprocessing

What are Processes?

In computing, a process is an instance of a program in execution, with its own memory space, resources, and at least one thread of execution. Unlike threads within a single process, processes run independently and do not share memory by default. This isolation avoids shared-state hazards such as race conditions that arise when multiple threads access the same data concurrently (though processes can still deadlock on the IPC primitives they share).

Why use multiprocessing?

The multiprocessing module lets you spawn processes with an API deliberately similar to the threading module. The key difference is that each process runs its own Python interpreter, so processes sidestep CPython's Global Interpreter Lock (GIL) and can execute in parallel on multiple cores. This is crucial for CPU-bound tasks, where distributing the workload across cores yields significant speedups.
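
Before spawning workers, it is often useful to know how many cores are available so you can size the number of processes accordingly. A quick check using only the standard library:

import multiprocessing

# Report how many CPUs the machine exposes; a common default for pool sizes.
print(f'Available CPUs: {multiprocessing.cpu_count()}')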

Creating and Starting Processes

Code Explanation

  1. Import multiprocessing: We start by importing the necessary module.
  2. Define a Worker Function: The worker function represents the task that each process will execute. In this example, it performs a computationally intensive task (calculating the sum of squares) to simulate CPU-bound work.
  3. Create Process Objects: Inside the if __name__ == '__main__': block (which is essential whenever the 'spawn' start method is used, the default on Windows and macOS), we create a list of Process objects. Each Process is initialized with:
    • target: The function to be executed (worker).
    • args: A tuple of arguments to pass to the worker function.
  4. Start the Processes: We iterate through the list of processes and call the start() method on each one. This creates a new process and begins executing the worker function within it.
  5. Join the Processes: The join() method blocks the main process until the corresponding process has completed its execution. This ensures that the main program waits for all worker processes to finish before exiting.

import multiprocessing

def worker(num):
    """Worker function to be executed in a separate process."""
    print(f'Process {num}: Starting')
    # Simulate some work
    result = sum(i * i for i in range(1000000))
    print(f'Process {num}: Finished, Result: {result}')

if __name__ == '__main__':
    processes = []
    for i in range(3):
        p = multiprocessing.Process(target=worker, args=(i,))
        processes.append(p)
        p.start()

    for p in processes:
        p.join() # Wait for all processes to complete

    print('All processes completed.')
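
Note that the 'Starting' and 'Finished' lines from the three workers may interleave in any order from run to run; the processes execute concurrently and nothing synchronizes their output.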

Communication between Processes (Pipes)

Using Pipes for Inter-Process Communication

Pipes provide a simple mechanism for communication between two processes. The multiprocessing.Pipe() function returns a pair of Connection objects representing the two ends of the pipe. By default the pipe is duplex (two-way); pass duplex=False for a one-way pipe. One process can send data through one end (conn.send()), and another process can receive it from the other end (conn.recv()).

Code Explanation

  1. Create a Pipe: parent_conn, child_conn = multiprocessing.Pipe() creates two connection objects. Data written to child_conn can be read from parent_conn (and, since the pipe is duplex, vice versa).
  2. Define Sender and Receiver Functions: The sender function sends a list of messages through its connection and then closes it; the receiver function receives and prints messages until it encounters an EOFError, which is raised only once every copy of the sending end has been closed.
  3. Create Processes: Two processes are created: one for sending and one for receiving.
  4. Start Processes, Close Unused Ends, and Join: The processes are started, the main process closes its own copies of the connections (otherwise the receiver would never see EOFError), and then waits for both processes to complete.

import multiprocessing

def sender(conn, messages):
    """Sends messages through the connection."""
    for message in messages:
        conn.send(message)
        print(f'Sent: {message}')
    conn.close()

def receiver(conn):
    """Receives messages from the connection."""
    while True:
        try:
            message = conn.recv()
            print(f'Received: {message}')
        except EOFError:
            break
    print('Receiver finished.')

if __name__ == '__main__':
    parent_conn, child_conn = multiprocessing.Pipe()
    messages_to_send = ['Hello', 'from', 'process', 'land!']

    p1 = multiprocessing.Process(target=sender, args=(child_conn, messages_to_send))
    p2 = multiprocessing.Process(target=receiver, args=(parent_conn,))

    p1.start()
    p2.start()

    # Close the main process's copies of both ends. recv() raises EOFError only
    # when every copy of the sending end is closed; with the 'fork' start method
    # the receiver also inherits child_conn, so a sentinel value (as in the
    # queue example below) is the more portable shutdown signal.
    child_conn.close()
    parent_conn.close()

    p1.join()
    p2.join()

    print('Done!')

Communication between Processes (Queues)

Using Queues for Inter-Process Communication

multiprocessing.Queue provides a thread-safe, process-safe FIFO queue. It is often preferred over pipes for more complex communication patterns. Queues automatically handle locking, making it easier to exchange data safely between processes.

Code Explanation

  1. Create a Queue: q = multiprocessing.Queue() creates a new queue object.
  2. Define Producer and Consumer Functions: The producer function puts items into the queue, and the consumer function retrieves and processes items from the queue. The producer adds None to the queue to signal the consumer to stop.
  3. Create Processes: Two processes are created: one to produce items and one to consume them.
  4. Start Processes and Join: The processes are started and the main process waits for them to complete.

import multiprocessing

def producer(queue, items):
    """Adds items to the queue."""
    for item in items:
        queue.put(item)
        print(f'Produced: {item}')
    queue.put(None)  # Signal the consumer to stop

def consumer(queue):
    """Consumes items from the queue."""
    while True:
        item = queue.get()
        if item is None:
            break
        print(f'Consumed: {item}')
    print('Consumer finished.')

if __name__ == '__main__':
    q = multiprocessing.Queue()
    items_to_produce = ['Apple', 'Banana', 'Cherry']

    p1 = multiprocessing.Process(target=producer, args=(q, items_to_produce))
    p2 = multiprocessing.Process(target=consumer, args=(q,))

    p1.start()
    p2.start()

    p1.join()
    p2.join()

    print('Done!')
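
Because queue.get() blocks until an item is available, the consumer needs the None sentinel to know when to stop; with several consumers, you would put one sentinel per consumer on the queue.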

Concepts Behind the Snippets

The snippets illustrate core concepts of process-based concurrency:

  • Process Creation: Creating independent execution environments using multiprocessing.Process.
  • Parallel Execution: Leveraging multiple CPU cores to execute tasks concurrently, overcoming the GIL limitation.
  • Inter-Process Communication (IPC): Using pipes and queues to exchange data and synchronize activities between processes.

Real-Life Use Cases

Consider a scenario where you need to process a large image dataset. Instead of processing images sequentially in a single process, you can distribute the processing across multiple processes using multiprocessing. Each process can handle a subset of the images, significantly reducing the overall processing time.
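
A minimal sketch of the image scenario, assuming a hypothetical process_image function and placeholder file names; Pool.map splits the list across worker processes automatically:

import multiprocessing

def process_image(path):
    """Placeholder for real per-image work (decode, resize, filter, ...)."""
    return f'{path}: processed'

if __name__ == '__main__':
    paths = ['img_001.png', 'img_002.png', 'img_003.png', 'img_004.png']  # hypothetical files
    with multiprocessing.Pool() as pool:  # defaults to one worker per CPU
        results = pool.map(process_image, paths)
    print(results)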

Web scraping is sometimes suggested as well, but note that scraping is mostly I/O-bound (waiting on network responses), so threading or asyncio is usually the better fit; multiple processes pay off mainly when parsing or post-processing each page is itself CPU-intensive.

Best Practices

  • Use if __name__ == '__main__':: This is crucial on platforms where 'spawn' is the default start method (Windows and macOS) to prevent recursive process creation.
  • Clean Up Resources: Ensure that processes are properly terminated and resources (like connections and queues) are closed to avoid resource leaks.
  • Handle Exceptions: Implement robust error handling in worker functions to prevent processes from crashing unexpectedly.
  • Consider Process Pools: For managing many tasks, use multiprocessing.Pool or the higher-level concurrent.futures.ProcessPoolExecutor to reuse a fixed set of worker processes and reduce overhead (see the sketch after this list).
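
The pool and error-handling advice can be combined in a few lines. A minimal sketch using the standard library's concurrent.futures.ProcessPoolExecutor; risky_work is a made-up task that fails for one input, to show how worker exceptions surface in the parent:

import concurrent.futures

def risky_work(n):
    """Made-up task that fails for one input."""
    if n == 3:
        raise ValueError('bad input')
    return n * n

if __name__ == '__main__':
    with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor:
        futures = {executor.submit(risky_work, n): n for n in range(5)}
        for future in concurrent.futures.as_completed(futures):
            n = futures[future]
            try:
                print(f'{n} -> {future.result()}')  # result() re-raises worker exceptions here
            except ValueError as exc:
                print(f'{n} failed: {exc}')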

Interview Tip

Be prepared to discuss the differences between threads and processes, the limitations of the GIL, and the advantages of using multiprocessing for CPU-bound tasks. Also, understand the different IPC mechanisms available and their trade-offs (e.g., pipes vs. queues).

When to Use Them

Use processes when:

  • You have CPU-bound tasks that can benefit from parallel execution.
  • You need to avoid the GIL limitation of Python threads.
  • You require strong isolation between different parts of your program.

Memory Footprint

Each process has its own memory space, so the memory footprint is higher than for threads within a single process. With the 'spawn' start method (the default on Windows and macOS), each worker starts a fresh interpreter and re-imports your modules; with 'fork' (common on Linux), child processes initially share the parent's memory pages copy-on-write, and pages are duplicated only once written to. Consider this when designing your application, especially with very large datasets or many processes.

Alternatives

Alternatives to multiprocessing include:

  • threading: Suitable for I/O-bound tasks where the GIL is not a major bottleneck.
  • asyncio: A single-threaded concurrency model based on coroutines, ideal for I/O-bound, high-concurrency scenarios (see the sketch after this list).
  • Dask: A flexible parallel computing library for Python that can scale from single-machine to distributed clusters.
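
Of these, asyncio ships with the standard library. A minimal sketch of its single-threaded model, using asyncio.sleep as a stand-in for real I/O such as a network request:

import asyncio

async def fetch(name, delay):
    """Stand-in for an I/O-bound operation such as an HTTP request."""
    await asyncio.sleep(delay)  # yields control to other coroutines while 'waiting'
    return f'{name}: done after {delay}s'

async def main():
    # All three coroutines wait concurrently on a single thread.
    results = await asyncio.gather(fetch('a', 1), fetch('b', 1), fetch('c', 1))
    print(results)  # finishes in about 1 second, not 3

if __name__ == '__main__':
    asyncio.run(main())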

Pros

  • True Parallelism: Bypasses the GIL and utilizes multiple CPU cores effectively.
  • Fault Isolation: If one process crashes, it typically does not affect other processes.
  • Improved Performance for CPU-Bound Tasks: Significant speedups can be achieved by distributing computationally intensive tasks across multiple processes.

Cons

  • Higher Memory Overhead: Each process has its own memory space.
  • More Complex IPC: Communication between processes requires explicit mechanisms like pipes or queues.
  • Process Creation Overhead: Creating and managing processes can be more expensive than creating and managing threads.

FAQ

  • What is the Global Interpreter Lock (GIL)?

    The GIL is a mutex in CPython that allows only one thread to execute Python bytecode at a time, even on multi-core processors. This prevents threads from achieving true parallelism for CPU-bound tasks.

  • When should I use multiprocessing instead of threading?

    Use multiprocessing for CPU-bound tasks where you want to achieve true parallel execution and bypass the GIL limitation. Use threading for I/O-bound tasks where threads spend most of their time waiting for external operations to complete.

  • How do I share data between processes?

    You can share data between processes using various IPC mechanisms provided by the multiprocessing module, such as pipes, queues, shared memory, and managers (a shared-memory sketch appears at the end of this FAQ).

  • What are process pools?

    A process pool is a collection of worker processes that are created at the start of a program and are reused to execute multiple tasks. Using process pools can improve performance by reducing the overhead of creating and destroying processes for each task.
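
As a small illustration of the shared-memory option mentioned above, a minimal sketch using multiprocessing.Value, which allocates a single typed value in shared memory (the 'i' type code is a C int):

import multiprocessing

def increment(counter, times):
    """Increment a shared counter; the lock prevents lost updates."""
    for _ in range(times):
        with counter.get_lock():
            counter.value += 1

if __name__ == '__main__':
    counter = multiprocessing.Value('i', 0)  # a C int living in shared memory
    workers = [multiprocessing.Process(target=increment, args=(counter, 1000))
               for _ in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print(counter.value)  # 4000; no increments are lost thanks to the lock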