Machine learning > Model Interpretability > Interpretation Techniques > Partial Dependence Plots

Partial Dependence Plots (PDPs): Visualizing Feature Effects

Partial Dependence Plots (PDPs) are a powerful technique for visualizing the marginal effect of one or two features on the predicted outcome of a machine learning model. They show how, on average, changes in specific feature values influence the model's predictions, with the effects of all other features averaged out. This tutorial provides a comprehensive guide to PDPs, including code examples and practical considerations.

What are Partial Dependence Plots?

A Partial Dependence Plot (PDP) shows the average predicted outcome as a function of one or two input features. It visualizes the functional relationship the model has learned between its predictions and the chosen features, marginalizing over the values of all other input features.

Mathematically, for a model f(x) and a set of features of interest x_S, the partial dependence function is defined as:

PDP(x_S) = E_{x_C}[ f(x_S, x_C) ]

Where:

  • x_S is the set of features for which we want to plot the partial dependence.
  • x_C is the set of all other (complement) features.
  • E_{x_C} denotes the expectation over the marginal distribution of x_C.
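In practice, the expectation is approximated by averaging the model's predictions over the n training samples:

PDP(x_S) ≈ (1/n) Σ_{i=1}^{n} f(x_S, x_C^(i))

where x_C^(i) is the observed value of the complement features for the i-th training instance. The step-by-step procedure in the next section computes exactly this estimate.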

Key Concepts Behind the PDP Calculation

The core idea behind PDPs is to estimate the model's average prediction as the feature(s) of interest vary, while the remaining features keep the values observed in the data. To do this practically, we:

  1. Select the feature(s) we want to analyze.
  2. Define a range of values for the selected feature(s).
  3. For each value in the range:
    • Replace the values of the selected feature(s) in the original dataset with that value.
    • Make predictions with the model using this modified dataset.
    • Average the predictions across all rows of the modified dataset.
  4. Plot the average predictions against the values of the selected feature(s).

This process reveals how the model's output changes, on average, as we vary the chosen feature(s).
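To make steps 1-4 concrete, here is a minimal brute-force sketch in NumPy. It assumes a fitted regression model named model with a scikit-learn-style predict method and a 2-D feature matrix X; both are hypothetical placeholders, not objects defined elsewhere in this tutorial.

import numpy as np
import matplotlib.pyplot as plt

def manual_pdp(model, X, feature_idx, grid_points=50):
    """Compute a one-feature PDP by brute force (steps 1-4 above)."""
    # Step 2: define a grid over the observed range of the feature.
    grid = np.linspace(X[:, feature_idx].min(), X[:, feature_idx].max(), grid_points)
    avg_preds = []
    for value in grid:
        # Step 3a: overwrite the feature with the current grid value in every row.
        X_mod = X.copy()
        X_mod[:, feature_idx] = value
        # Steps 3b-3c: predict on the modified data and average over all rows.
        avg_preds.append(model.predict(X_mod).mean())
    return grid, np.array(avg_preds)

# Step 4: plot the average predictions against the grid values.
# grid, avg_preds = manual_pdp(model, X, feature_idx=0)
# plt.plot(grid, avg_preds)
# plt.xlabel('feature 0'); plt.ylabel('average prediction')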

Python Implementation with scikit-learn

This code snippet demonstrates how to generate PDPs using scikit-learn. First, a sample dataset is created using make_friedman1. Then, a GradientBoostingRegressor is trained on the training data. Finally, PartialDependenceDisplay.from_estimator is used to generate the plots. The features argument specifies which features to plot. Here, we plot the partial dependence of feature 0, feature 1, and the interaction between features 0 and 1.

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_friedman1
import matplotlib.pyplot as plt

# Generate a sample dataset
X, y = make_friedman1(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Train a Gradient Boosting Regressor
gbr = GradientBoostingRegressor(n_estimators=100, random_state=0)
gbr.fit(X_train, y_train)

# Create the Partial Dependence Plot
features = [0, 1, (0, 1)] # Features to plot: 0, 1, and the interaction between 0 and 1
fig, ax = plt.subplots(figsize=(12, 4))
PartialDependenceDisplay.from_estimator(gbr, X_train, features, ax=ax)
plt.suptitle('Partial Dependence Plots')
plt.tight_layout(rect=[0, 0.03, 1, 0.95]) # Adjust layout to prevent overlap
plt.show()

Explanation of the Code

  • Import necessary libraries: GradientBoostingRegressor for the model, PartialDependenceDisplay for creating the plots, train_test_split for splitting the data, make_friedman1 for creating a sample dataset, and matplotlib.pyplot for plotting.
  • Generate a sample dataset: The make_friedman1 function creates a dataset with 10 features, where the target variable is a non-linear function of the first 5 features.
  • Train a Gradient Boosting Regressor: A GradientBoostingRegressor is trained on the training data.
  • Create the Partial Dependence Plot: The PartialDependenceDisplay.from_estimator function takes the trained model, the training data, and the features to plot as input. It computes the PDPs and draws them on the supplied axes; matplotlib.pyplot.show() then displays the figure.
  • Features argument: This list specifies the features for which we want to plot partial dependencies. It can contain single integers for individual features, or tuples of integers for two-way interactions. The interaction plots show how the effect of one feature depends on the value of the other.
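If you need the raw numbers instead of a figure, scikit-learn also exposes the lower-level partial_dependence function. A minimal sketch, reusing the gbr model and X_train from the code above; note that the key holding the grid is grid_values in recent scikit-learn releases (older releases use values):

from sklearn.inspection import partial_dependence

# Compute the partial dependence of feature 0 on a 50-point grid.
result = partial_dependence(gbr, X_train, features=[0], grid_resolution=50)

# result['average'] holds the averaged predictions, one row per output;
# the grid itself is under result['grid_values'] (or 'values' in older releases).
print(result['average'].shape)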

Interpreting the PDPs

The y-axis of a PDP represents the average predicted outcome as the selected feature(s) vary. Depending on the implementation, the values are shown on the scale of the predictions themselves or centered relative to a baseline (scikit-learn, for example, centers the plot only when centered=True is passed to from_estimator). The x-axis represents the values of the feature(s) being analyzed.

  • Individual Feature Plots: These plots show the average effect of a single feature on the prediction. For example, a plot that slopes upwards indicates that increasing the feature value tends to increase the predicted outcome.
  • Two-Way Interaction Plots: These plots show how the effect of one feature depends on the value of another feature. They can reveal complex relationships that are not apparent from individual feature plots. The color intensity usually indicates the average prediction value.

Real-Life Use Case

Credit Risk Assessment: In credit risk modeling, PDPs can help understand how factors like income, credit score, and employment history influence the probability of loan default. Banks can use PDPs to identify which features have the most significant impact on creditworthiness and adjust their lending criteria accordingly.

By visualizing the partial dependence of loan default probability on income and credit score, a bank can understand whether a higher income compensates for a lower credit score or vice versa. This helps in making more informed lending decisions.
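A minimal sketch of such an analysis, assuming a hypothetical fitted binary classifier clf (predicting default) and a pandas DataFrame X_loans with income and credit_score columns; neither object is defined in the code above:

from sklearn.inspection import PartialDependenceDisplay
import matplotlib.pyplot as plt

# For a binary classifier, the y-axis is the predicted probability of the
# positive class (here, default). Column names can be used as features
# when X is a DataFrame.
features = ['income', 'credit_score', ('income', 'credit_score')]
PartialDependenceDisplay.from_estimator(clf, X_loans, features)
plt.suptitle('Partial dependence of default probability')
plt.show()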

Best Practices

  • Choose relevant features: Focus on features that are likely to have a significant impact on the prediction or that are of particular interest from a business perspective.
  • Consider feature interactions: Explore two-way interactions to uncover complex relationships between features.
  • Use a sufficient number of data points: PDPs are based on averaging predictions, so a larger dataset will generally lead to more stable and reliable plots.
  • Be aware of extrapolation: PDPs can be misleading if the model is extrapolated to feature values outside the range seen in the training data.
  • Combine with other interpretability techniques: PDPs are just one tool in the interpretability toolbox. Combine them with feature importance measures, SHAP values, and other techniques for a more comprehensive understanding of the model.

Interview Tip

When discussing PDPs in an interview, be prepared to explain:

  • The underlying concept of marginalizing over other features.
  • How to interpret the plots.
  • The limitations of PDPs (e.g., they assume feature independence).
  • Real-world examples of how PDPs can be used.

Demonstrating a practical understanding of how to use and interpret PDPs will impress the interviewer.

When to Use Them

PDPs are particularly useful when you want to:

  • Understand the overall relationship between a feature and the prediction.
  • Compare the effects of different features.
  • Identify feature interactions.
  • Explain model predictions to stakeholders.

Memory Footprint

The memory footprint of calculating PDPs depends on the size of the dataset, the number of features, and the complexity of the model. The main cost comes from copying the dataset for each grid value of the selected feature(s) and running the model's predictions on those copies. For very large datasets, consider subsampling the rows and/or coarsening the grid, as in the sketch below.
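A minimal sketch of both mitigations, reusing gbr and X_train from the earlier example (the subsample size of 200 is an arbitrary choice for illustration):

import numpy as np
from sklearn.inspection import PartialDependenceDisplay

# Subsample rows to bound memory, and coarsen the grid to bound compute.
rng = np.random.default_rng(0)
subset = rng.choice(len(X_train), size=200, replace=False)
PartialDependenceDisplay.from_estimator(gbr, X_train[subset], features=[0, 1], grid_resolution=20)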

Alternatives

Alternatives to PDPs include:

  • Individual Conditional Expectation (ICE) plots: ICE plots show the prediction for each individual instance as a function of the feature of interest. Unlike PDPs, ICE plots display a separate line for each sample, revealing heterogeneity in how different instances respond to changes in the feature (see the sketch after this list).
  • SHAP values: SHAP (SHapley Additive exPlanations) values provide a more granular explanation of the contribution of each feature to each individual prediction.
  • Feature Importance: Feature importance scores provide a summary of how important each feature is to the model, but do not show the direction of the effect.
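scikit-learn can overlay ICE curves on a PDP via the kind argument of PartialDependenceDisplay.from_estimator. A minimal sketch, reusing gbr and X_train from the earlier example (kind='individual' or 'both' supports only one-dimensional features, not interaction tuples):

from sklearn.inspection import PartialDependenceDisplay
import matplotlib.pyplot as plt

# kind='both' draws one thin ICE line per sample plus the thick PDP average.
PartialDependenceDisplay.from_estimator(gbr, X_train, features=[0, 1], kind='both')
plt.suptitle('PDP (average) with ICE curves (individual samples)')
plt.show()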

Pros

  • Easy to understand: PDPs are relatively easy to interpret, even for non-technical stakeholders.
  • Visualize feature effects: They provide a clear visualization of the relationship between a feature and the prediction.
  • Can reveal feature interactions: Two-way PDPs can uncover complex interactions between features.

Cons

  • Assume feature independence: PDPs assume that the feature(s) of interest are independent of the remaining features, which is often not the case in real-world datasets. When features are correlated, the averaging step evaluates the model on unrealistic feature combinations, which can produce misleading plots.
  • Can be computationally expensive: Calculating PDPs can be computationally expensive for large datasets and complex models.
  • Averaged effects: PDPs show the average effect of a feature, which may not be representative of all instances.

FAQ

  • What is the difference between PDP and ICE plots?

    PDPs show the average effect of a feature on the prediction, while ICE plots show the effect for each individual instance. ICE plots can reveal heterogeneity in how different instances respond to changes in the feature, which is hidden in PDPs.

  • How do PDPs handle categorical features?

    For categorical features, the x-axis of the PDP represents the different categories. The PDP shows the average predicted outcome for each category.

  • Can PDPs be used for classification models?

    Yes, PDPs can be used for classification models. In this case, the y-axis represents the predicted probability of belonging to a particular class.

  • How does feature dependence affect PDP interpretation?

    PDPs assume feature independence. If features are highly correlated, substituting grid values for one feature while keeping the others at their observed values can produce unrealistic or nonsensical data points, potentially leading to misinterpretations. Alternatives that respect the joint distribution of the features, such as Accumulated Local Effects (ALE) plots or conditional dependence plots, may be more appropriate in such cases.