Understanding Underfitting in Machine Learning

Underfitting occurs when a machine learning model is too simple to capture the underlying patterns in the training data. This leads to poor performance on both the training data and unseen data. This tutorial will explore the concept of underfitting, its causes, consequences, and methods to mitigate it.

Defining Underfitting

Underfitting happens when a model fails to learn the underlying relationship between the input features and the target variable. It typically results from using a model that is too simple for the data (e.g., linear regression on a non-linear dataset) or from not providing enough relevant features. The telltale sign is that performance is poor on both the training set and the test set.

Causes of Underfitting

Several factors can contribute to underfitting:

  • Model Complexity: Using a model that is too simple for the data. For instance, fitting a linear model to data that exhibits a complex, non-linear relationship.
  • Insufficient Features: Not providing the model with enough relevant features to learn from. If crucial predictive variables are missing, the model will struggle to capture the underlying patterns.
  • Excessive Regularization: Applying strong regularization techniques (e.g., L1 or L2 regularization) that constrain the model too much, preventing it from learning the data's intricacies.
  • Over-simplification: Using a high-bias model, one whose built-in assumptions (such as linearity) simplify the data so much that it cannot capture the relationships between the features and the target.
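
To make the excessive-regularization cause concrete, here is a minimal sketch that is not part of the original tutorial. It fits ridge (L2) regression in closed form with NumPy on synthetic quadratic data. The data-generating function, the lambda values, and the helper name ridge_train_mse are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-5, 5, 200)
y = 2 * x**2 + 3 * x + 1 + rng.normal(0, 10, size=x.size)  # assumed toy data

# Quadratic features, standardized so the penalty treats both columns equally
features = np.column_stack([x, x**2])
features = (features - features.mean(axis=0)) / features.std(axis=0)
y_centered = y - y.mean()  # centering stands in for an unpenalized intercept


def ridge_train_mse(lam):
    """Closed-form ridge: w = (X^T X + lam * I)^-1 X^T y. Returns training MSE."""
    n_features = features.shape[1]
    gram = features.T @ features + lam * np.eye(n_features)
    w = np.linalg.solve(gram, features.T @ y_centered)
    residuals = y_centered - features @ w
    return float(np.mean(residuals**2))


mse_by_lambda = {lam: ridge_train_mse(lam) for lam in (0.0, 1.0, 1e3, 1e6)}
for lam, mse in mse_by_lambda.items():
    print(f"lambda={lam:>9}: training MSE = {mse:.1f}")
```

As lambda grows, the weights shrink toward zero and the training MSE climbs toward the variance of y, which is what a regularization-induced underfit looks like even though the model has the right features.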

Consequences of Underfitting

The main consequence of underfitting is poor predictive performance. An underfit model will have:

  • Low Training Accuracy: The model doesn't fit the training data well.
  • Low Test Accuracy: The model generalizes poorly to unseen data.
  • High Bias: The model makes strong assumptions about the data that are not correct.
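
The "High Bias" point can be stated more precisely with the standard bias-variance decomposition of expected squared error. The notation here is an assumption introduced for illustration: the data follow y = f(x) + ε with zero-mean noise of variance σ², and f̂ is the model fitted on a random training set.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

An underfit model is dominated by the bias term: its average prediction is systematically far from f(x), so the error stays high on training and test data alike.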

Identifying Underfitting

You can identify underfitting by observing the performance of your model. Specifically, look for the following signs:

  • Consistently Low Accuracy: The model achieves low accuracy (or high error) on both the training and validation datasets.
  • Large Gap Between Expected and Actual Performance: You have a strong prior belief about the relationship between the features and the target, but the model performs much worse than expected.
  • Learning Curves: Plotting the training and validation loss curves can reveal underfitting. If both curves plateau at a high loss value and stay close to each other, the model is underfitting.
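
The learning-curve signal can also be checked numerically without a plot. The following is a rough sketch under assumed conditions, not the tutorial's own code: it fits a straight line to synthetic quadratic data for growing training-set sizes and records training and validation MSE. The helper names make_data and mse, and the chosen sizes, are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_data(n):
    """Synthetic quadratic data with Gaussian noise (std 10)."""
    x = rng.uniform(-5, 5, size=n)
    y = 2 * x**2 + 3 * x + 1 + rng.normal(0, 10, size=n)
    return x, y


def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))


x_pool, y_pool = make_data(2000)  # pool that training subsets are drawn from
x_val, y_val = make_data(1000)    # fixed validation set

curve = []  # (n_train, train_mse, val_mse)
for n_train in (20, 50, 200, 1000, 2000):
    x_tr, y_tr = x_pool[:n_train], y_pool[:n_train]
    slope, intercept = np.polyfit(x_tr, y_tr, deg=1)  # straight-line fit
    train_mse = mse(y_tr, slope * x_tr + intercept)
    val_mse = mse(y_val, slope * x_val + intercept)
    curve.append((n_train, train_mse, val_mse))
    print(f"n_train={n_train:>4}  train MSE={train_mse:7.1f}  val MSE={val_mse:7.1f}")
```

With enough data the two errors should settle near each other at a level far above the noise variance of 100. That combination, converged but high, is the underfitting pattern; a large gap between the curves would point to overfitting instead.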

Code Example: Demonstrating Underfitting

This code generates non-linear data and then attempts to fit a linear regression model to it. The resulting plot and MSE scores show that the linear model cannot capture the underlying quadratic pattern, which is underfitting. The MSE on both the training and test sets will be high, well above the noise variance of 100 (the noise has standard deviation 10) that even a perfect model would be expected to leave.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Fix the random seed so the noise (and the printed MSE values) are reproducible
np.random.seed(42)

# Generate some non-linear (quadratic) data
X = np.linspace(-5, 5, 100).reshape(-1, 1)
y = 2 * X**2 + 3 * X + 1 + np.random.randn(100, 1) * 10  # Gaussian noise, std 10

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit a linear regression model (underfitting)
model = LinearRegression()
model.fit(X_train, y_train)
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Evaluate the model
train_mse = mean_squared_error(y_train, y_train_pred)
test_mse = mean_squared_error(y_test, y_test_pred)

print(f'Training Mean Squared Error: {train_mse}')
print(f'Testing Mean Squared Error: {test_mse}')

# Plot the data and the model's predictions
plt.figure(figsize=(8, 6))
plt.scatter(X, y, label='Data')
plt.plot(X, model.predict(X), color='red', label='Linear Regression Model')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Underfitting Example: Linear Regression on Non-linear Data')
plt.legend()
plt.show()

Concepts Behind the Snippet

The code snippet demonstrates the principle of model selection: choosing the right model complexity is critical. Linear regression assumes a linear relationship between the input and the output, so when the data follows a non-linear pattern, a plain linear model will underfit. The Mean Squared Error (MSE) quantifies the average squared difference between the predicted and actual values. Because MSE depends on the scale of the target, judge whether it is high by comparing it with a reference point such as the variance of the target or the known noise level.
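
One way to put an MSE value in context is to compare it with simple reference points. The sketch below is an illustration under assumptions (NumPy only, in-sample errors, the same quadratic generating function as the example above, and a seed chosen arbitrarily): it computes the MSE of a predict-the-mean baseline, a straight line, and a quadratic, next to the known noise variance.

```python
import numpy as np

rng = np.random.default_rng(42)
noise_std = 10.0
x = np.linspace(-5, 5, 100)
y = 2 * x**2 + 3 * x + 1 + rng.normal(0, noise_std, size=x.size)


def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))


# Reference point 1: always predict the mean of y (equals the variance of y)
baseline_mse = mse(y, np.full_like(y, y.mean()))
# Straight line: the underfit model
linear_mse = mse(y, np.polyval(np.polyfit(x, y, 1), x))
# Quadratic: matches the form of the generating process
quadratic_mse = mse(y, np.polyval(np.polyfit(x, y, 2), x))

print(f"mean-predictor baseline MSE: {baseline_mse:.1f}")
print(f"linear model MSE:            {linear_mse:.1f}")
print(f"quadratic model MSE:         {quadratic_mse:.1f}")
print(f"noise variance (floor):      {noise_std**2:.1f}")
```

A model whose error sits much closer to the baseline than to the noise floor is leaving learnable structure on the table. In real problems the noise floor is unknown, so the baseline and a more flexible model are the practical comparison points.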

Real-Life Use Case

Consider predicting housing prices based solely on square footage using a linear model when other factors like location, number of bedrooms, and age of the house significantly influence the price. Using just square footage (a single feature, simple model) would lead to underfitting because it fails to capture the complexities of the housing market.
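
The housing scenario can be mimicked with entirely synthetic data. Everything in this sketch is made up for illustration: the feature ranges, the price coefficients, the noise level, and the helper fit_rmse. It only shows how leaving out influential features leaves a large error that no amount of fitting on square footage alone can remove.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Synthetic "housing" data; all coefficients below are invented for the demo
sqft = rng.uniform(500, 3500, size=n)
bedrooms = rng.integers(1, 6, size=n).astype(float)  # 1 to 5 bedrooms
age_years = rng.uniform(0, 80, size=n)
location_score = rng.uniform(0, 10, size=n)
price = (150 * sqft + 10_000 * bedrooms - 1_000 * age_years
         + 40_000 * location_score + rng.normal(0, 20_000, size=n))


def fit_rmse(feature_columns):
    """Ordinary least squares with an intercept; returns in-sample RMSE."""
    design = np.column_stack([np.ones(n)] + feature_columns)
    coef, *_ = np.linalg.lstsq(design, price, rcond=None)
    return float(np.sqrt(np.mean((price - design @ coef) ** 2)))


rmse_sqft_only = fit_rmse([sqft])
rmse_all = fit_rmse([sqft, bedrooms, age_years, location_score])

print(f"RMSE, square footage only: {rmse_sqft_only:,.0f}")
print(f"RMSE, all four features:   {rmse_all:,.0f}")
```

The model family is the same in both fits (linear regression); only the available features differ. This is underfitting caused by insufficient features rather than by the choice of algorithm.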

Solutions to Underfitting

There are several ways to address underfitting:

  • Increase Model Complexity: Use a more complex model that can capture non-linear relationships. For example, switch from linear regression to polynomial regression, decision trees, or neural networks.
  • Feature Engineering: Add more relevant features to the model. This could involve creating new features from existing ones or incorporating external data sources.
  • Reduce Regularization: If using regularization techniques, reduce the regularization strength to allow the model to learn more complex patterns.
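  • Use More Data: More data rarely fixes underfitting on its own, because a model that is too simple plateaus at a high error no matter how many examples it sees. Extra data helps mainly once you have moved to a more complex model, which needs more examples to learn effectively.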

Code Example: Fixing Underfitting with Polynomial Regression

This code builds on the previous example by using PolynomialFeatures to transform the input data into polynomial features (in this case, degree 2). A linear regression model is then fit to these transformed features. This lets the model capture the non-linear relationship in the data and removes most of the underfitting. The MSE should be much lower than in the previous linear regression example, though it will not fall far below the noise variance of about 100, because the noise itself cannot be predicted.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Fix the random seed so the noise (and the printed MSE values) are reproducible
np.random.seed(42)

# Generate some non-linear (quadratic) data
X = np.linspace(-5, 5, 100).reshape(-1, 1)
y = 2 * X**2 + 3 * X + 1 + np.random.randn(100, 1) * 10  # Gaussian noise, std 10

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Transform features to polynomial features
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

# Fit a linear regression model on the polynomial features
model = LinearRegression()
model.fit(X_train_poly, y_train)
y_train_pred = model.predict(X_train_poly)
y_test_pred = model.predict(X_test_poly)

# Evaluate the model
train_mse = mean_squared_error(y_train, y_train_pred)
test_mse = mean_squared_error(y_test, y_test_pred)

print(f'Training Mean Squared Error: {train_mse}')
print(f'Testing Mean Squared Error: {test_mse}')

# Plot the data and the model's predictions
plt.figure(figsize=(8, 6))
plt.scatter(X, y, label='Data')
X_plot = np.linspace(-5, 5, 100).reshape(-1, 1)
X_plot_poly = poly.transform(X_plot)
y_plot_pred = model.predict(X_plot_poly)
plt.plot(X_plot, y_plot_pred, color='red', label='Polynomial Regression Model')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Polynomial Regression on Non-linear Data')
plt.legend()
plt.show()

Best Practices

  • Start with a Simple Model: Begin with a simple model and gradually increase complexity.
  • Monitor Training and Validation Performance: Keep track of the model's performance on both the training and validation datasets to identify underfitting or overfitting.
  • Use Cross-Validation: Employ cross-validation to obtain a more robust estimate of the model's generalization performance.
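  • Feature Importance: Analyze feature importance to identify and potentially discard irrelevant or redundant features. Pruning features mainly helps against overfitting; for an underfit model, the more useful question is which informative features are missing.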
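
The cross-validation practice above can be sketched by hand in a few lines. This is a simplified illustration, assuming NumPy only and synthetic quadratic data; the helper cross_val_mse and the fold count are hypothetical choices, and a library routine would normally replace the manual loop.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-5, 5, size=200)
y = 2 * x**2 + 3 * x + 1 + rng.normal(0, 10, size=x.size)

# Shuffle once and reuse the same 5 folds for every candidate model
n_folds = 5
folds = np.array_split(rng.permutation(x.size), n_folds)


def cross_val_mse(degree):
    """Mean and std of held-out MSE across folds for a polynomial of this degree."""
    scores = []
    for i, val_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
        preds = np.polyval(coeffs, x[val_idx])
        scores.append(np.mean((y[val_idx] - preds) ** 2))
    return float(np.mean(scores)), float(np.std(scores))


cv_results = {degree: cross_val_mse(degree) for degree in (1, 2, 3)}
for degree, (mean_mse, std_mse) in cv_results.items():
    print(f"degree {degree}: CV MSE = {mean_mse:.1f} (+/- {std_mse:.1f})")
```

Starting at degree 1 and stepping up follows the "start simple" advice: the cross-validated error should drop sharply from degree 1 to degree 2 and then flatten, which suggests stopping at degree 2.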

Interview Tip

When discussing underfitting in an interview, be prepared to explain the concept in simple terms, provide examples of its causes and consequences, and describe methods for mitigating it. Highlight your understanding of the trade-off between model complexity and generalization ability. Be ready to discuss specific algorithms and techniques like polynomial regression or adding interaction terms as ways to address underfitting.

When to Address Underfitting

Understanding when a model is underfitting can guide model selection and adjustments. An underfitting model isn't always bad, but it does mean that the model is not capturing all the available information. When the goal is to predict as precisely as possible based on the available data, addressing underfitting can improve accuracy and reliability.

Alternatives

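Alternatives for addressing underfitting depend on the situation. Besides increasing model complexity and adding features, consider ensemble methods. Gradient Boosting combines many weak learners sequentially, each one correcting the errors of the ensemble so far, which directly reduces bias. Random Forest averages many deep decision trees; averaging mainly reduces variance, but the trees themselves are flexible enough to capture non-linear relationships that a linear model misses. Also explore data preprocessing techniques (e.g., scaling, normalization), which can help models that are sensitive to feature scale learn more effectively.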
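
The idea of combining weak learners can be illustrated with a bare-bones gradient boosting loop for squared error. This is a teaching sketch under assumptions, not how any particular library implements boosting: the weak learner is a one-split regression stump, and the names fit_stump and predict_stump, the learning rate, and the number of rounds are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-5, 5, size=200))
y = 2 * x**2 + 3 * x + 1 + rng.normal(0, 10, size=x.size)


def fit_stump(x, residuals):
    """Find the single split that minimizes squared error on the residuals."""
    best, best_err = None, np.inf
    for threshold in (x[:-1] + x[1:]) / 2:  # midpoints between sorted neighbors
        left = residuals[x <= threshold]
        right = residuals[x > threshold]
        if left.size == 0 or right.size == 0:
            continue
        err = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if err < best_err:
            best_err = err
            best = (threshold, left.mean(), right.mean())
    return best


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)


learning_rate = 0.3
prediction = np.full_like(y, y.mean())  # round 0: constant model
mse_history = [float(np.mean((y - prediction) ** 2))]
for _ in range(50):
    stump = fit_stump(x, y - prediction)  # fit the current residuals
    prediction = prediction + learning_rate * predict_stump(stump, x)
    mse_history.append(float(np.mean((y - prediction) ** 2)))

print(f"training MSE with constant model: {mse_history[0]:.1f}")
print(f"training MSE after 1 stump:       {mse_history[1]:.1f}")
print(f"training MSE after 50 stumps:     {mse_history[-1]:.1f}")
```

A single stump underfits badly, but each added stump removes part of the remaining systematic error. Only training error is tracked here; in practice the number of rounds should be chosen on held-out data, since boosting for too long moves from underfitting to overfitting.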

Pros

While underfitting is undesirable, very simple models are fast to train, easy to interpret, and require minimal resources. They also serve as a useful baseline against which to measure more complex models.

Cons

The main con is poor accuracy and generalization ability. Underfitting models fail to capture the underlying relationships and patterns in the data. A model that underfits the training data is likely to perform poorly on new, unseen data.

FAQ

  • What is the difference between underfitting and overfitting?

    Underfitting occurs when a model is too simple to capture the underlying patterns in the data, resulting in poor performance on both the training and test sets. Overfitting occurs when a model is too complex and learns the noise in the training data, resulting in excellent performance on the training set but poor performance on the test set.
  • How can I tell if my model is underfitting?

    You can identify underfitting by observing low accuracy on both the training and validation datasets, and by analyzing learning curves that plateau at a high loss value.
  • Is underfitting always a bad thing?

    Generally, yes. While simple models can be computationally efficient, the goal of most machine learning tasks is to build a model that accurately predicts outcomes. Underfitting indicates that the model is not effectively learning from the data, thus sacrificing predictive accuracy. In certain situations when interpretability is paramount and high accuracy is not required, a simpler, underfit model might be acceptable.
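
The contrast between underfitting and overfitting described in the first answer can be seen by fitting polynomials of different degrees to the same small training set. This sketch is an illustration under assumptions: NumPy only, synthetic quadratic data, and degrees 1, 2, and 15 picked to represent too simple, about right, and too complex.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)


def make_data(n):
    x = rng.uniform(-5, 5, size=n)
    y = 2 * x**2 + 3 * x + 1 + rng.normal(0, 10, size=n)
    return x, y


x_train, y_train = make_data(30)   # small training set makes overfitting visible
x_test, y_test = make_data(1000)

results = {}  # degree -> (train_mse, test_mse)
for degree in (1, 2, 15):
    # Fixing the domain keeps test points inside the range the fit was scaled for
    model = Polynomial.fit(x_train, y_train, degree, domain=[-5, 5])
    train_mse = float(np.mean((y_train - model(x_train)) ** 2))
    test_mse = float(np.mean((y_test - model(x_test)) ** 2))
    results[degree] = (train_mse, test_mse)
    print(f"degree {degree:>2}: train MSE={train_mse:8.1f}  test MSE={test_mse:8.1f}")
```

Degree 1 should show high error on both sets (underfitting), degree 2 moderate error on both, and degree 15 a low training error paired with a noticeably worse test error (overfitting).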