Grid Search for Hyperparameter Tuning with Scikit-learn
This snippet demonstrates how to use Grid Search with Scikit-learn to find optimal hyperparameters for a machine learning model. Hyperparameter tuning is crucial for maximizing model performance: Grid Search systematically explores a predefined set of hyperparameter values, evaluating the model with cross-validation for each combination. This example uses a Support Vector Classifier (SVC), but the same process applies to other algorithms.
Import Necessary Libraries
This imports the necessary libraries: GridSearchCV from sklearn.model_selection performs the grid search, SVC from sklearn.svm is the model to tune, load_iris from sklearn.datasets provides a sample dataset, and train_test_split from sklearn.model_selection splits the data.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
Load and Split the Data
Loads the Iris dataset and splits it into training and testing sets. test_size=0.3 means 30% of the data is held out for testing, and random_state ensures reproducibility.
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=42)
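Iris happens to be balanced, but with imbalanced data it is safer to stratify the split so class proportions are preserved in both sets; a minor variation on the call above:
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42, stratify=iris.target
)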
Define the Parameter Grid
This defines the grid of hyperparameters to search over: C is the regularization parameter, gamma is the kernel coefficient, and kernel specifies the kernel type. Grid Search will try all possible combinations of these values.
param_grid = {
'C': [0.1, 1, 10, 100],
'gamma': [0.01, 0.1, 1, 'scale'],
'kernel': ['rbf', 'linear']
}
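You can count the candidates before launching the search; ParameterGrid, from sklearn.model_selection, performs the same Cartesian expansion:
from sklearn.model_selection import ParameterGrid
# 4 values of C x 4 of gamma x 2 kernels = 32 candidates; with cv=3 that is 96 fits
print(len(ParameterGrid(param_grid)))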
Instantiate GridSearchCV
Instantiates the GridSearchCV object. SVC() is the model to be tuned, param_grid is the parameter grid defined earlier, refit=True refits the best model on the full training data passed to fit once the search is complete, verbose=2 prints detailed progress during the search, and cv=3 specifies 3-fold cross-validation.
grid = GridSearchCV(SVC(), param_grid, refit=True, verbose=2, cv=3)
Fit the Grid Search
Fits the Grid Search to the training data. This will train and evaluate the model for each combination of hyperparameters in the grid.
grid.fit(X_train, y_train)
Print Results
Prints the best hyperparameters found by Grid Search, the best estimator (the model with the best hyperparameters), and the best score (the mean cross-validated score for the best estimator).
print(f'Best parameters: {grid.best_params_}')
print(f'Best estimator: {grid.best_estimator_}')
print(f'Best score: {grid.best_score_}')
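Beyond these three attributes, the full search history lives in grid.cv_results_; loading it into a DataFrame makes it easy to compare all candidates (assumes pandas is installed):
import pandas as pd
results = pd.DataFrame(grid.cv_results_)
# Top candidates by mean cross-validation score (rank 1 is best)
print(results[['params', 'mean_test_score', 'rank_test_score']]
      .sort_values('rank_test_score')
      .head())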
Evaluate the Best Model on the Test Set
Evaluates the best model found by Grid Search on the test set and prints the accuracy. This provides an estimate of how well the model will generalize to unseen data.
accuracy = grid.score(X_test, y_test)
print(f'Test set accuracy: {accuracy}')
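Accuracy alone can hide per-class errors. Because refit=True lets grid delegate predict to the best estimator, a per-class breakdown is one call away (a sketch using sklearn.metrics):
from sklearn.metrics import classification_report
y_pred = grid.predict(X_test)
print(classification_report(y_test, y_pred, target_names=iris.target_names))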
Concepts Behind the Snippet
Grid Search automates the process of finding the best hyperparameter combination for a model. It exhaustively searches through a grid of specified parameter values. For each combination, it performs cross-validation to estimate the model's performance. The hyperparameter combination that yields the best cross-validation score is selected as the optimal set of hyperparameters.
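To make the concept concrete, here is a minimal sketch of what GridSearchCV automates, written out with explicit loops and cross_val_score (illustrative only; the real implementation also handles parallelism, refitting, and result bookkeeping):
from itertools import product
from sklearn.model_selection import cross_val_score
best_score, best_params = -1.0, None
for C, gamma, kernel in product([0.1, 1, 10, 100], [0.01, 0.1, 1, 'scale'], ['rbf', 'linear']):
    # Mean 3-fold cross-validation score for this combination
    score = cross_val_score(SVC(C=C, gamma=gamma, kernel=kernel), X_train, y_train, cv=3).mean()
    if score > best_score:
        best_score, best_params = score, {'C': C, 'gamma': gamma, 'kernel': kernel}
print(best_params, best_score)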
Real-Life Use Case
Consider optimizing a spam filter model. Using Grid Search, you can find the optimal combination of hyperparameters for a Support Vector Machine (SVM) classifier, such as the regularization parameter (C) and kernel type (e.g., linear, RBF). This helps improve the accuracy of the spam filter, reducing false positives and false negatives.
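A sketch of what that search might look like, assuming raw message strings and 0/1 spam labels (emails and labels below are hypothetical placeholders); putting the vectorizer in a Pipeline ensures it is refit inside each cross-validation fold:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
spam_pipe = Pipeline([('tfidf', TfidfVectorizer()), ('svc', SVC())])
spam_grid = {'svc__C': [0.1, 1, 10], 'svc__kernel': ['linear', 'rbf']}
spam_search = GridSearchCV(spam_pipe, spam_grid, cv=3)
# spam_search.fit(emails, labels)  # emails: list of message strings, labels: 0/1 spam flags (hypothetical data)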
Best Practices
Keep a held-out test set that the search never sees, as this snippet does. Put preprocessing such as feature scaling inside a Pipeline so each cross-validation fold is transformed independently and nothing leaks from the validation folds. Set n_jobs=-1 to parallelize the search across CPU cores, and remember that the winning candidate's cross-validation score is slightly optimistic, which is why the final check on the test set matters.
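For example, a scaled SVC tuned inside a pipeline (a sketch; note the 'svc__' step-name prefix on the grid keys):
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])
pipe_grid = {'svc__C': [0.1, 1, 10, 100], 'svc__gamma': [0.01, 0.1, 1, 'scale']}
pipe_search = GridSearchCV(pipe, pipe_grid, cv=3, n_jobs=-1)
pipe_search.fit(X_train, y_train)
print(pipe_search.best_params_)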
Interview Tip
Be prepared to discuss the trade-offs between different hyperparameter values. Explain how each hyperparameter affects the model's performance and generalization ability. Be ready to compare Grid Search with other hyperparameter optimization techniques, such as Randomized Search and Bayesian Optimization.
When to Use Them
Use Grid Search when you have a limited number of hyperparameters to tune and a relatively small search space. It's a good starting point for hyperparameter optimization, but for more complex models with a large number of hyperparameters, consider using Randomized Search or Bayesian Optimization.
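For comparison, a minimal RandomizedSearchCV sketch over the same model, sampling 20 settings from continuous distributions (assumes scipy is installed for loguniform):
from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV
param_dist = {'C': loguniform(1e-2, 1e2), 'gamma': loguniform(1e-3, 1e1), 'kernel': ['rbf', 'linear']}
rand_search = RandomizedSearchCV(SVC(), param_dist, n_iter=20, cv=3, random_state=42)
rand_search.fit(X_train, y_train)
print(rand_search.best_params_)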
Memory Footprint
The memory footprint of Grid Search depends on the size of the data, the complexity of the model, and the number of hyperparameter combinations being evaluated. Grid Search can be memory-intensive, especially when using cross-validation and storing multiple models.
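When running in parallel, the pre_dispatch argument caps how many jobs are queued at once, which bounds peak memory; '2*n_jobs' is scikit-learn's default and can be lowered if memory is tight:
grid_low_mem = GridSearchCV(SVC(), param_grid, cv=3, n_jobs=2, pre_dispatch='2*n_jobs')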
Alternatives
RandomizedSearchCV samples a fixed number of settings from given distributions and scales better to large search spaces. Bayesian optimization (e.g., Optuna or scikit-optimize) models the objective and concentrates trials on promising regions. Scikit-learn also offers HalvingGridSearchCV, which uses successive halving to discard weak candidates early on small resource budgets.
Pros
Exhaustive within the grid, so it is guaranteed to find the best of the specified combinations. Simple to set up, fully reproducible, and easy to parallelize with n_jobs.
Cons
Cost grows multiplicatively with every added hyperparameter and value. It only tries the values you list, so good settings between grid points are missed, and redundant combinations (here, gamma is ignored by the linear kernel) waste compute.
FAQ
What is the difference between GridSearchCV and RandomizedSearchCV?
GridSearchCV exhaustively searches all combinations in the provided parameter grid, while RandomizedSearchCV samples a specified number of parameter settings from the given distributions. RandomizedSearchCV is more efficient for high-dimensional parameter spaces.
What does refit=True do in GridSearchCV?
refit=True means that once the best hyperparameters are found, the entire training dataset is used to retrain the model with those hyperparameters. The result is a final model trained on all available training data, ready for deployment or further evaluation.
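Conversely, with refit=False no final model is trained and best_estimator_ is unavailable; you would rebuild it yourself from best_params_, a sketch:
best_model = SVC(**grid.best_params_)
best_model.fit(X_train, y_train)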