PyTorch Autograd Example: Linear Regression

This snippet demonstrates PyTorch's automatic differentiation (autograd) system by building and training a simple linear regression model. It shows how to define tensors, specify that gradients need to be tracked, perform computations, and automatically compute gradients of the loss with respect to the model's parameters.

Setting up the Data and Model

First, we import the `torch` library. Then, we create our input (X) and output (y) tensors. `X` is a tensor of 100 samples, each with one feature, and `y` is generated from `X` using a linear relationship (y = 3X + 2) with added noise. Crucially, `requires_grad=True` is set for the weights (`w`) and bias (`b`). This tells PyTorch to track operations performed on these tensors so that gradients can be computed later.

import torch

# Create input and output tensors
X = torch.randn(100, 1, requires_grad=False)  # Input feature
y = 3*X + 2 + torch.randn(100, 1) * 0.1         # Target variable (with some noise)

# Initialize weights and bias (w is shaped (1, 1) so that X @ w + b matches y's shape)
w = torch.randn(1, 1, requires_grad=True)    # Weight (requires gradient tracking)
b = torch.randn(1, requires_grad=True)       # Bias (requires gradient tracking)

Defining the Loss Function and Optimizer

Here, we define the Mean Squared Error (MSE) loss function, which measures the difference between the model's predictions and the true values. We also define an optimizer using `torch.optim.SGD`. The optimizer is responsible for updating the model's parameters (`w` and `b`) based on the calculated gradients and the specified learning rate. `torch.optim.SGD` implements the stochastic gradient descent algorithm.

# Define the loss function (Mean Squared Error)
def mse_loss(y_pred, y_true):
    return torch.mean((y_pred - y_true)**2)

# Define the optimizer (Stochastic Gradient Descent)
learning_rate = 0.01
optimizer = torch.optim.SGD([w, b], lr=learning_rate)

The Training Loop

This is the core of the training process. In each epoch:

1. **Forward Pass:** We calculate the model's prediction (`y_pred`) using the current values of `w` and `b`. This is simply `y_pred = X @ w + b`.
2. **Compute Loss:** We calculate the MSE loss between the predicted values (`y_pred`) and the true values (`y`).
3. **Backward Pass:** We call `loss.backward()`. This crucial step automatically computes the gradients of the loss with respect to all tensors that have `requires_grad=True` (in this case, `w` and `b`). These gradients are stored in the `.grad` attribute of the respective tensors.
4. **Update Parameters:** We use the optimizer to update the weight and bias based on the computed gradients. `optimizer.step()` performs one optimization step (here, a gradient descent update).
5. **Zero Gradients:** `optimizer.zero_grad()` is extremely important. It clears the gradients so they do not carry over into the next iteration. If you don't zero the gradients, they accumulate, leading to incorrect parameter updates.
6. **Print Progress:** We print the loss, weight, and bias every 10 epochs to monitor the training progress.

# Training loop
num_epochs = 100
for epoch in range(num_epochs):
    # Forward pass: compute predicted y
    y_pred = X @ w + b

    # Compute loss
    loss = mse_loss(y_pred, y)

    # Backward pass: compute gradients of the loss with respect to w and b
    loss.backward()

    # Update weights and bias
    optimizer.step()

    # Zero the gradients (important!)
    optimizer.zero_grad()

    # Print progress
    if (epoch+1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}, w: {w.item():.4f}, b: {b.item():.4f}')

Accessing Gradients

After a `backward()` call, the gradients of the loss with respect to `w` and `b` are stored in `w.grad` and `b.grad`, respectively, and they are only available *after* a backward pass. Note that in the training loop above, `optimizer.zero_grad()` is the last call in each iteration, so by the time the loop finishes the gradients have already been cleared (printed as zeros, or `None` on recent PyTorch versions). To inspect meaningful gradient values, read them right after `loss.backward()` and before zeroing, as in the standalone sketch further below.

# Note: the final optimizer.zero_grad() in the loop has already cleared these,
# so this may print zeros (or None, depending on the PyTorch version)
print(f'Gradient of w: {w.grad}')
print(f'Gradient of b: {b.grad}')
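
To see non-cleared gradients, run a backward pass and read the `.grad` attributes before any zeroing. A minimal standalone sketch (mirroring the data setup above, not the exact training loop):

import torch

# Standalone sketch: one forward/backward pass, with no zero_grad() afterwards
w = torch.randn(1, 1, requires_grad=True)
b = torch.randn(1, requires_grad=True)
X = torch.randn(100, 1)
y = 3 * X + 2

loss = torch.mean((X @ w + b - y) ** 2)
loss.backward()

# The gradients are populated here and stay until they are cleared
print(f'Gradient of w: {w.grad}')
print(f'Gradient of b: {b.grad}')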

Concepts Behind the Snippet

  • Autograd: PyTorch's automatic differentiation system. It tracks operations on tensors and automatically computes gradients (a minimal sketch follows this list).
  • requires_grad: A tensor property that indicates whether gradients should be tracked for that tensor.
  • backward(): A method called on a loss tensor that computes gradients of the loss with respect to all tensors that have `requires_grad=True`.
  • SGD: Stochastic Gradient Descent, a common optimization algorithm.
  • Gradients: The rate of change of a function with respect to its variables. In this case, the gradients indicate how much the loss would change if we slightly change the weights and bias.
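
The sketch below is a small, self-contained illustration of these concepts on a scalar function (the function f(x) = x² + 3x is chosen only for this example):

import torch

# f(x) = x**2 + 3*x, so the analytic derivative is df/dx = 2*x + 3
x = torch.tensor(2.0, requires_grad=True)  # track operations on x
f = x ** 2 + 3 * x                         # forward computation builds the graph
f.backward()                               # autograd applies the chain rule
print(x.grad)                              # tensor(7.) == 2*2 + 3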

Real-Life Use Case

This simple linear regression example is a foundation for understanding how autograd works. In real-world scenarios, autograd is used extensively in training complex neural networks for tasks such as image recognition, natural language processing, and reinforcement learning. These models have many more parameters, but the underlying principle of automatically computing gradients remains the same.

Best Practices

  • Always zero the gradients after each optimization step using `optimizer.zero_grad()` to prevent accumulation of gradients from previous iterations.
  • Make sure `requires_grad=True` for the tensors you want to compute gradients for.
  • Use a suitable optimizer for your problem (e.g., Adam, SGD, RMSprop); a sketch using Adam follows this list.
  • Monitor the loss during training to ensure that the model is learning.
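
As a sketch of these practices combined, the same regression can be written with a built-in `nn.Linear` module and the Adam optimizer (both are choices made for this example, not part of the snippet above):

import torch
import torch.nn as nn

# Same data-generating process as the snippet above
X = torch.randn(100, 1)
y = 3 * X + 2 + torch.randn(100, 1) * 0.1

model = nn.Linear(1, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)

for epoch in range(100):
    optimizer.zero_grad()          # clear old gradients first
    loss = criterion(model(X), y)  # forward pass + loss
    loss.backward()                # compute gradients
    optimizer.step()               # update parameters
    if (epoch + 1) % 20 == 0:
        print(f'Epoch {epoch + 1}, Loss: {loss.item():.4f}')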

Interview Tip

Be prepared to explain the role of `requires_grad`, `backward()`, and `optimizer.zero_grad()` in PyTorch's autograd system. Understand the concept of computational graphs and how they are used to compute gradients. Explain how autograd simplifies the process of training complex machine learning models.
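
A quick way to see the computational graph is to inspect the `grad_fn` attribute that autograd attaches to every tensor produced by a tracked operation, for example:

import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 2          # recorded as a PowBackward node
z = y + 3 * x       # recorded as an AddBackward node

# Each intermediate tensor points to the operation that created it;
# backward() walks these nodes in reverse to apply the chain rule.
print(y.grad_fn)
print(z.grad_fn)

z.backward()
print(x.grad)       # dz/dx = 2*x + 3 = 7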

When to Use Autograd

Autograd is used whenever you need to train a model using gradient-based optimization. This is the standard approach for training most deep learning models. It simplifies the computation of gradients, allowing you to focus on designing the model architecture and loss function.

Memory Footprint

Autograd can have a significant memory footprint because it stores the computational graph and intermediate values needed for gradient computation. For very large models, techniques like gradient checkpointing can be used to reduce memory usage by recomputing some intermediate values during the backward pass.
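
A rough sketch of gradient checkpointing with `torch.utils.checkpoint` on recent PyTorch versions (the layer sizes and blocks here are arbitrary and only illustrate the idea):

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# An arbitrary stack of layers, used only to illustrate checkpointing
block1 = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512), nn.ReLU())
block2 = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 1))

x = torch.randn(64, 512)

# block1's intermediate activations are not stored; they are recomputed
# during the backward pass, trading extra compute for lower memory
h = checkpoint(block1, x, use_reentrant=False)
loss = block2(h).mean()
loss.backward()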

Alternatives

While PyTorch's autograd is widely used, other deep learning frameworks like TensorFlow also provide automatic differentiation capabilities. The core concept is the same, but the implementation details may differ.
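
For comparison, a rough TensorFlow 2.x sketch of the same gradient computation using `tf.GradientTape` (assuming TensorFlow is installed; shapes mirror the PyTorch example above):

import tensorflow as tf

w = tf.Variable(tf.random.normal([1, 1]))
b = tf.Variable(tf.zeros([1]))
X = tf.random.normal([100, 1])
y = 3.0 * X + 2.0

# Operations inside the tape are recorded so gradients can be computed later
with tf.GradientTape() as tape:
    y_pred = tf.matmul(X, w) + b
    loss = tf.reduce_mean(tf.square(y_pred - y))

dw, db = tape.gradient(loss, [w, b])
print(dw, db)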

Pros

  • Ease of Use: Autograd makes it easy to compute gradients without manually deriving them.
  • Flexibility: It supports a wide range of operations and custom functions.
  • Dynamic Graphs: PyTorch uses dynamic computational graphs, which means the graph is built on the fly as operations are executed, allowing for more flexibility in model design (a small sketch follows this list).
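
A small sketch of what a dynamic graph allows: ordinary Python control flow inside the forward computation, with a fresh graph built on every call (the branching function here is made up for illustration):

import torch

def forward(x, w):
    # Ordinary Python control flow becomes part of this call's graph
    if x.sum() > 0:
        return (x * w).sum()
    else:
        return (x * w * 2).sum()

w = torch.randn(3, requires_grad=True)

# The graph is rebuilt on every call, so each input can take a different path
for x in (torch.ones(3), -torch.ones(3)):
    out = forward(x, w)
    out.backward()
    print(w.grad)
    w.grad.zero_()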

Cons

  • Memory Consumption: Autograd can consume a significant amount of memory, especially for large models.
  • Performance Overhead: There is a slight performance overhead associated with tracking operations for gradient computation.
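
When gradients are not needed (for example during inference), the tracking overhead can be switched off with `torch.no_grad()`; a minimal sketch:

import torch

model_w = torch.randn(1, 1, requires_grad=True)
X_new = torch.randn(5, 1)

# Disable gradient tracking for pure inference to avoid the bookkeeping overhead
with torch.no_grad():
    preds = X_new @ model_w
print(preds.requires_grad)  # False -- no graph was recorded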

FAQ

  • Why do I need to call `optimizer.zero_grad()`?

    The `optimizer.zero_grad()` function clears the gradients of all optimized `torch.Tensor`s. By default, gradients accumulate in `.grad` attributes. Without zeroing them, gradients from previous iterations would be added to the current gradients, leading to incorrect parameter updates. A short demo at the end of this FAQ illustrates the accumulation.
  • What does `loss.backward()` do?

    The `loss.backward()` function computes the gradient of the loss with respect to all parameters that have `requires_grad=True`. It traverses the computational graph, applying the chain rule to calculate the gradients at each node.
  • What happens if I don't set `requires_grad=True`?

    If `requires_grad=True` is not set for a tensor, PyTorch will not track operations performed on that tensor, and no gradients will be computed for it. This is useful for tensors that represent data inputs or pre-trained weights that you don't want to update during training.
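
A short, self-contained demo of the last two answers: gradients accumulate across `backward()` calls until they are cleared, and a tensor without `requires_grad=True` never receives a gradient (the values in the comments assume this exact code):

import torch

x = torch.tensor(2.0, requires_grad=True)
frozen = torch.tensor(5.0)   # requires_grad defaults to False

# Gradients accumulate across backward() calls
(x * 3).backward()
print(x.grad)                # tensor(3.)
(x * 3).backward()
print(x.grad)                # tensor(6.) -- accumulated, not replaced
x.grad.zero_()               # the manual equivalent of optimizer.zero_grad()

# No operations on `frozen` are tracked, so it never gets a gradient
y = frozen * 3
print(y.requires_grad)       # False
print(frozen.grad)           # None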