Linear Regression with Scikit-learn

This snippet demonstrates how to perform linear regression using Scikit-learn. Linear regression is a fundamental supervised learning algorithm used for predicting a continuous target variable based on one or more predictor variables.

Import Necessary Libraries

This section imports the required libraries. `numpy` is used for numerical operations, `LinearRegression` from `sklearn.linear_model` is the linear regression model, `train_test_split` from `sklearn.model_selection` is used to split the dataset into training and testing sets, and `mean_squared_error` from `sklearn.metrics` is used to evaluate the model's performance.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

Generate Sample Data

Here, we create a simple dataset with one feature (X) and a corresponding target variable (y). In real-world scenarios, you would load your data from a file (e.g., a CSV) or a database; a pandas-based sketch of that follows the snippet below. Note that X is a two-dimensional array of shape (n_samples, n_features): scikit-learn estimators require 2D feature input (any array-like will do, not just NumPy arrays), while y can remain one-dimensional.

X = np.array([[1], [2], [3], [4], [5]])  # Input features
y = np.array([2, 4, 5, 4, 5])  # Target variable
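
For illustration, the same arrays could be loaded from a CSV file with pandas; the file name data.csv and its column names below are hypothetical:

import pandas as pd

# Hypothetical file 'data.csv' with columns 'feature' and 'target'
df = pd.read_csv('data.csv')
X = df[['feature']].values  # double brackets keep X two-dimensional
y = df['target'].values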

Split Data into Training and Testing Sets

We split the data into training and testing sets using `train_test_split`. `test_size=0.2` reserves 20% of the data for testing (with only five samples, that is a single test example), and `random_state=42` makes the split reproducible. The training data is used to fit the model; the test data is used to evaluate its performance on unseen data.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Create and Train the Linear Regression Model

This creates an instance of the `LinearRegression` model and trains it on the training data. The `fit` method estimates the intercept and coefficients by ordinary least squares, learning the relationship between the input features (X_train) and the target variable (y_train).

model = LinearRegression()
model.fit(X_train, y_train)
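
Once fitted, the learned parameters are available on the estimator, which is a quick way to sanity-check the fitted line:

# For a single feature, the fitted line is y = coef_[0] * x + intercept_
print(f'Coefficient: {model.coef_[0]}')
print(f'Intercept: {model.intercept_}')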

Make Predictions

After training the model, we use it to make predictions on the test data (X_test). The `predict` method returns the predicted values for the target variable.

y_pred = model.predict(X_test)

Evaluate the Model

We evaluate the model's performance using the mean squared error (MSE). MSE measures the average squared difference between the predicted values and the actual values. A lower MSE indicates better performance.
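
Concretely, for n test samples with actual values y_i and predictions ŷ_i:

\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2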

mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')
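
MSE is expressed in squared units of the target, so it is often reported alongside R-squared. Every scikit-learn regressor exposes R-squared via its score method; since the single test sample here is too small for a meaningful R-squared, the sketch below scores the training set purely for illustration:

# R-squared on the training data; with only one test sample,
# R-squared on the test set would not be well-defined
r2_train = model.score(X_train, y_train)
print(f'Training R-squared: {r2_train}')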

Complete Code

This section provides the complete code for the linear regression example. It includes all the steps from importing libraries to evaluating the model's performance.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Generate sample data
X = np.array([[1], [2], [3], [4], [5]])  # Input features
y = np.array([2, 4, 5, 4, 5])  # Target variable

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')

Concepts Behind the Snippet

This snippet demonstrates linear regression, a supervised learning algorithm that models the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data. Key concepts include:

  • Features: Input variables used to predict the target.
  • Target Variable: The variable we are trying to predict.
  • Training Data: Data used to train the model.
  • Testing Data: Data used to evaluate the model's performance on unseen data.
  • Mean Squared Error (MSE): A common metric used to evaluate the performance of regression models.

Real-Life Use Case

Linear regression can be used in various real-life scenarios, such as:

  • Predicting house prices based on features like square footage and location.
  • Forecasting sales based on advertising spend.
  • Estimating customer lifetime value based on demographics and purchase history.
  • Predicting stock prices.

Best Practices

Some best practices to keep in mind when using linear regression:

  • Ensure that the relationship between the features and the target variable is approximately linear.
  • Handle outliers appropriately, as they can significantly distort the fitted line.
  • Consider feature scaling if the features have very different scales (see the sketch after this list).
  • Regularly evaluate the model's performance using appropriate metrics.
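
As a minimal sketch of the scaling point, StandardScaler and the model can be chained in a pipeline so that scaling is fit on the training data only. For plain least squares this leaves predictions unchanged, but it matters for regularized variants:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# The pipeline standardizes features (zero mean, unit variance)
# before fitting the linear model
scaled_model = make_pipeline(StandardScaler(), LinearRegression())
scaled_model.fit(X_train, y_train)
print(scaled_model.predict(X_test))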

Interview Tip

When discussing linear regression in interviews, be prepared to explain the assumptions of the model (linearity, independence of errors, homoscedasticity, normality of errors), how to handle multicollinearity, and common evaluation metrics like MSE, R-squared, and adjusted R-squared.

When to Use Them

Linear Regression works best when the relationship between the independent and dependent variables is approximately linear. It's a good starting point when you have a continuous target variable and want to understand the relationship between features and the target. If the relationship is non-linear, consider using other models like polynomial regression or non-linear models.
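
As a sketch of the polynomial option, PolynomialFeatures can expand the input (here to degree 2, an arbitrary choice) while the model itself stays linear in the expanded features:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Degree-2 polynomial regression: expands x into [1, x, x^2] and
# fits a linear model on the expanded features
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X_train, y_train)
print(poly_model.predict(X_test))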

Memory Footprint

Linear Regression has a small memory footprint: the fitted model stores only one coefficient per feature plus an intercept. Memory usage during training is dominated by the size of the input data itself.

Alternatives

Alternatives to Linear Regression include:

  • Polynomial Regression: for non-linear relationships.
  • Decision Tree Regression: for complex relationships with interactions between features.
  • Support Vector Regression (SVR): for non-linear relationships and high-dimensional data.
  • Random Forest Regression: an ensemble method that combines multiple decision trees.
  • Gradient Boosting Regression: another ensemble method that builds models sequentially to improve performance.
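
All of these alternatives follow scikit-learn's shared fit/predict interface, so swapping one in only requires changing the estimator; a sketch with Random Forest Regression (n_estimators=100 is an arbitrary example value):

from sklearn.ensemble import RandomForestRegressor

# Same fit/predict API as LinearRegression
forest = RandomForestRegressor(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)
print(forest.predict(X_test))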

Pros

Advantages of Linear Regression:

  • Simple to understand and implement.
  • Computationally efficient.
  • Provides insights into the relationship between features and the target.
  • Serves as a useful baseline for more complex models.

Cons

Disadvantages of Linear Regression:

  • Assumes a linear relationship between features and the target.
  • Sensitive to outliers.
  • Can suffer from multicollinearity if features are highly correlated.
  • May not perform well on complex datasets with non-linear relationships.

FAQ

  • What is the difference between training and testing data?

    Training data is used to train the model, allowing it to learn the relationship between the input features and the target variable. Testing data is used to evaluate the model's performance on unseen data, providing an estimate of how well the model will generalize to new data.
  • What is mean squared error (MSE)?

    Mean squared error (MSE) is a metric used to evaluate the performance of regression models. It measures the average squared difference between the predicted values and the actual values. A lower MSE indicates better performance.
  • How can I improve the performance of a linear regression model?

    Several techniques can improve linear regression performance, including: feature engineering (creating new features or transforming existing ones); regularization (adding a penalty term to prevent overfitting; see the sketch below); handling outliers (removing or transforming them); and addressing multicollinearity (removing highly correlated features or using techniques like principal component analysis).
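
    For the regularization point, a minimal sketch using Ridge, scikit-learn's L2-regularized linear regression; alpha=1.0 below is just an example value, not a tuned one:

    from sklearn.linear_model import Ridge
    from sklearn.metrics import mean_squared_error

    # Ridge adds an L2 penalty on the coefficients; larger alpha means
    # stronger shrinkage (alpha=1.0 is an arbitrary example value)
    ridge = Ridge(alpha=1.0)
    ridge.fit(X_train, y_train)
    ridge_mse = mean_squared_error(y_test, ridge.predict(X_test))
    print(f'Ridge MSE: {ridge_mse}')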