PyTorch Autograd Example: Linear Regression
This snippet demonstrates PyTorch's automatic differentiation (autograd) system by building and training a simple linear regression model. It shows how to define tensors, mark the ones whose gradients should be tracked, perform computations, and automatically compute gradients of the loss with respect to those tensors (here, the model's weight and bias).
Setting up the Data and Model
First, we import the `torch` library. Then, we create our input (X) and output (y) tensors. `X` is a tensor of 100 samples, each with one feature, and `y` is generated from `X` using a linear relationship (y = 3X + 2) with added noise. Crucially, `requires_grad=True` is set for the weights (`w`) and bias (`b`). This tells PyTorch to track operations performed on these tensors so that gradients can be computed later.
import torch
# Create input and output tensors
X = torch.randn(100, 1, requires_grad=False) # Input feature
y = 3*X + 2 + torch.randn(100, 1) * 0.1 # Target variable (with some noise)
# Initialize weights and bias
w = torch.randn(1, 1, requires_grad=True) # Weight, shape (1, 1) so that X @ w keeps shape (100, 1); requires gradient tracking
b = torch.randn(1, requires_grad=True) # Bias (requires gradient tracking)
Defining the Loss Function and Optimizer
Here, we define the Mean Squared Error (MSE) loss function, which measures the difference between the model's predictions and the true values. We also define an optimizer using `torch.optim.SGD`. The optimizer is responsible for updating the model's parameters (`w` and `b`) based on the calculated gradients and the specified learning rate. `torch.optim.SGD` implements the stochastic gradient descent algorithm.
# Define the loss function (Mean Squared Error)
def mse_loss(y_pred, y_true):
    return torch.mean((y_pred - y_true)**2)
# Define the optimizer (Stochastic Gradient Descent)
learning_rate = 0.01
optimizer = torch.optim.SGD([w, b], lr=learning_rate)
The Training Loop
This is the core of the training process. In each epoch:

1. **Forward Pass:** We calculate the model's prediction (`y_pred`) using the current values of `w` and `b`. This is simply `y_pred = X @ w + b`.
2. **Compute Loss:** We calculate the MSE loss between the predicted values (`y_pred`) and the true values (`y`).
3. **Backward Pass:** We call `loss.backward()`. This crucial step automatically computes the gradients of the loss with respect to all tensors that have `requires_grad=True` (in this case, `w` and `b`). These gradients are stored in the `.grad` attribute of the respective tensors.
4. **Update Parameters:** We use the optimizer to update the weights and bias based on the computed gradients. `optimizer.step()` performs one optimization step (here, a gradient descent update).
5. **Zero Gradients:** `optimizer.zero_grad()` is extremely important. It clears the gradients from the previous iteration. If you don't zero the gradients, they will accumulate, leading to incorrect parameter updates.
6. **Print Progress:** We print the loss, weight, and bias every 10 epochs to monitor the training progress.
# Training loop
num_epochs = 100
for epoch in range(num_epochs):
    # Forward pass: compute predicted y
    y_pred = X @ w + b
    # Compute loss
    loss = mse_loss(y_pred, y)
    # Backward pass: compute gradients of the loss with respect to w and b
    loss.backward()
    # Update weights and bias
    optimizer.step()
    # Zero the gradients (important!)
    optimizer.zero_grad()
    # Print progress
    if (epoch+1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}, w: {w.item():.4f}, b: {b.item():.4f}')
Accessing Gradients
After a `backward()` call, the gradients of the loss with respect to `w` and `b` are stored in `w.grad` and `b.grad`, respectively, and you can inspect them to analyze the learning process. Note that gradients only exist *after* a backward pass, and that `optimizer.zero_grad()` clears them again (zeroing them or setting them to `None`, depending on the PyTorch version). Because the loop above ends with `zero_grad()`, we run one more forward and backward pass before printing the gradients.
# Run one extra forward/backward pass so w.grad and b.grad are populated
# (the final zero_grad() call in the loop cleared them)
loss = mse_loss(X @ w + b, y)
loss.backward()
print(f'Gradient of w: {w.grad}')
print(f'Gradient of b: {b.grad}')
Real-Life Use Case
This simple linear regression example is a foundation for understanding how autograd works. In real-world scenarios, autograd is used extensively in training complex neural networks for tasks such as image recognition, natural language processing, and reinforcement learning. These models have many more parameters, but the underlying principle of automatically computing gradients remains the same.
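To make this concrete, here is a minimal sketch (not part of the original snippet) showing that the same `backward()` / `step()` / `zero_grad()` pattern carries over unchanged to an `nn.Module`; the `TinyNet` architecture and all sizes below are arbitrary choices for illustration.
import torch
import torch.nn as nn

# Hypothetical two-layer network; the layer sizes are arbitrary.
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 1))

    def forward(self, x):
        return self.layers(x)

model = TinyNet()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()
X = torch.randn(64, 10)   # dummy inputs
y = torch.randn(64, 1)    # dummy targets

for epoch in range(5):
    loss = loss_fn(model(X), y)   # forward pass builds the graph
    loss.backward()               # autograd fills every parameter's .grad
    optimizer.step()              # one gradient descent step
    optimizer.zero_grad()         # clear gradients for the next iteration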
Interview Tip
Be prepared to explain the role of `requires_grad`, `backward()`, and `optimizer.zero_grad()` in PyTorch's autograd system. Understand the concept of computational graphs and how they are used to compute gradients. Explain how autograd simplifies the process of training complex machine learning models.
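If it helps to make the computational-graph idea concrete, the short sketch below (values chosen arbitrarily) shows how each intermediate tensor records the operation that produced it via `grad_fn`, and how `backward()` walks that graph with the chain rule.
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x * 3      # recorded as a MulBackward0 node
z = y ** 2     # recorded as a PowBackward0 node

# Each result points back to the operation that created it.
print(z.grad_fn)                       # <PowBackward0 ...>
print(z.grad_fn.next_functions[0][0])  # <MulBackward0 ...>

z.backward()   # traverse the graph backwards, applying the chain rule
print(x.grad)  # dz/dx = 18x = 36.0 at x = 2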
When to Use Autograd
Autograd is used whenever you need to train a model using gradient-based optimization. This is the standard approach for training most deep learning models. It simplifies the computation of gradients, allowing you to focus on designing the model architecture and loss function.
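As a small illustration (not from the original snippet), autograd can also compute one-off gradients outside any training loop, for example with `torch.autograd.grad`:
import torch

# Gradient of f(x) = x**3 + 2x at x = 2, with no model or optimizer involved.
x = torch.tensor(2.0, requires_grad=True)
f = x ** 3 + 2 * x

# torch.autograd.grad returns the gradient directly instead of
# accumulating it into x.grad.
(df_dx,) = torch.autograd.grad(f, x)
print(df_dx)   # 3*x**2 + 2 = 14.0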
Memory Footprint
Autograd can have a significant memory footprint because it stores the computational graph and intermediate values needed for gradient computation. For very large models, techniques like gradient checkpointing can be used to reduce memory usage by recomputing some intermediate values during the backward pass.
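As a rough sketch of that idea (the block and sizes below are arbitrary), `torch.utils.checkpoint` recomputes a block's intermediate activations during the backward pass instead of storing them:
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

# Arbitrary block whose intermediate activations we choose not to keep.
block = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256))
x = torch.randn(32, 256, requires_grad=True)

# Activations inside `block` are recomputed during backward(),
# trading extra compute for a smaller memory footprint.
out = checkpoint(block, x, use_reentrant=False)
out.sum().backward()
print(x.grad.shape)   # gradients still flow back to x: torch.Size([32, 256])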
Alternatives
While PyTorch's autograd is widely used, other deep learning frameworks like TensorFlow also provide automatic differentiation capabilities. The core concept is the same, but the implementation details may differ.
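For comparison, a rough TensorFlow sketch of the same gradient computation (assuming TensorFlow is installed; variable names mirror the PyTorch code above) uses `tf.GradientTape` to record operations and then asks the tape for the gradients:
import tensorflow as tf

w = tf.Variable(tf.random.normal([1]))
b = tf.Variable(tf.random.normal([1]))
X = tf.random.normal([100, 1])
y = 3 * X + 2

with tf.GradientTape() as tape:
    y_pred = X * w + b                            # operations are recorded on the tape
    loss = tf.reduce_mean(tf.square(y_pred - y))

dw, db = tape.gradient(loss, [w, b])              # gradients w.r.t. the variables
print(dw, db)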
FAQ
Why do I need to call `optimizer.zero_grad()`?
The `optimizer.zero_grad()` function clears the gradients of all optimized `torch.Tensor`s. By default, gradients accumulate in `.grad` attributes. Without zeroing them, gradients from previous iterations would be added to the current gradients, leading to incorrect parameter updates.
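A tiny demonstration of that accumulation (not part of the original snippet):
import torch

w = torch.tensor(1.0, requires_grad=True)

(w * 2).backward()
print(w.grad)    # tensor(2.)

# A second backward() call adds to the existing gradient rather than replacing it.
(w * 2).backward()
print(w.grad)    # tensor(4.)

# Roughly what optimizer.zero_grad() does for each parameter
# (recent PyTorch versions set .grad to None instead of zeroing it).
w.grad.zero_()
print(w.grad)    # tensor(0.)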
What does `loss.backward()` do?
The `loss.backward()` function computes the gradient of the loss with respect to all parameters that have `requires_grad=True`. It traverses the computational graph, applying the chain rule to calculate the gradients at each node.
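For example, in the small sketch below (arbitrary values), a single `backward()` call fills in the gradient for every leaf tensor that requires one:
import torch

a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)

loss = (a * b + a) ** 2   # autograd records every operation in this expression

loss.backward()           # chain rule applied node by node

print(a.grad)   # d/da (a*b + a)**2 = 2*(a*b + a)*(b + 1) = 64.0
print(b.grad)   # d/db (a*b + a)**2 = 2*(a*b + a)*a       = 32.0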
What happens if I don't set `requires_grad=True`?
If `requires_grad=True` is not set for a tensor, PyTorch will not track operations performed on that tensor, and no gradients will be computed for it. This is useful for tensors that represent data inputs or pre-trained weights that you don't want to update during training.
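A short illustration of the difference (not part of the original snippet):
import torch

x = torch.randn(3)                 # requires_grad defaults to False
y = x * 2
print(y.requires_grad, y.grad_fn)  # False None -- nothing was recorded

w = torch.randn(3, requires_grad=True)
z = w * 2
print(z.requires_grad, z.grad_fn)  # True <MulBackward0 ...>

# Tracking can also be disabled temporarily, e.g. for inference:
with torch.no_grad():
    frozen = w * 2
print(frozen.requires_grad)        # False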