Python > Data Science and Machine Learning Libraries > Scikit-learn > Unsupervised Learning (Clustering, Dimensionality Reduction)
K-Means Clustering with Scikit-learn
This snippet demonstrates how to use K-Means clustering from scikit-learn to group data points into clusters based on their proximity to cluster centers. We will generate synthetic data, apply K-Means, and visualize the results.
Import Libraries
This section imports necessary libraries:
numpy
: For numerical operations, particularly array manipulation.matplotlib.pyplot
: For creating visualizations.sklearn.cluster.KMeans
: The K-Means clustering algorithm.sklearn.datasets.make_blobs
: A function to generate synthetic datasets for clustering.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
Generate Sample Data
We generate a synthetic dataset using make_blobs
. This function creates clusters of data points that are normally distributed. The parameters are:
n_samples
: The total number of points equally divided among clusters (300 in this case).centers
: The number of clusters (4 in this case).cluster_std
: The standard deviation of the clusters (0.60 in this case). This controls how spread out the clusters are.random_state
: A seed for the random number generator, ensuring reproducibility.
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
Initialize and Fit K-Means
Here, we initialize the K-Means algorithm and fit it to the data:
KMeans(n_clusters=4)
: Creates a KMeans object with 4 clusters. It is important to select a correct n_cluster.init='k-means++'
: Specifies the method for initializing the cluster centers. 'k-means++' is a smart initialization technique that spreads out the initial centers to speed up convergence.max_iter=300
: The maximum number of iterations for the algorithm to run.n_init=10
: The number of times the K-means algorithm will be run with different centroid seeds. The final results will be the best output of n_init consecutive runs in terms of inertia.random_state=0
: For reproducibility.kmeans.fit_predict(X)
: Fits the K-Means model to the data X
and predicts the cluster assignments for each data point. The result y_kmeans
is an array of cluster labels (0, 1, 2, or 3 in this case).
kmeans = KMeans(n_clusters=4, init='k-means++', max_iter=300, n_init=10, random_state=0)
y_kmeans = kmeans.fit_predict(X)
Visualize the Clusters
This section visualizes the results:
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
: Creates a scatter plot of the data points. X[:, 0]
and X[:, 1]
represent the first and second features of the data. c=y_kmeans
colors the points according to their cluster assignments. s=50
sets the size of the points, and cmap='viridis'
specifies the color map.centers = kmeans.cluster_centers_
: Retrieves the coordinates of the cluster centers from the fitted K-Means model.plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.75)
: Plots the cluster centers as red dots. s=200
sets the size of the centers, and alpha=0.75
sets the transparency.
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.75);
plt.title('K-Means Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
Complete Code
This section provides the complete code for easy copy-pasting and execution.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
# Generate sample data
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
# Initialize and fit K-Means
kmeans = KMeans(n_clusters=4, init='k-means++', max_iter=300, n_init=10, random_state=0)
y_kmeans = kmeans.fit_predict(X)
# Visualize the clusters
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.75);
plt.title('K-Means Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
Concepts Behind the Snippet
K-Means clustering is a centroid-based algorithm that aims to partition n data points into k clusters, where each data point belongs to the cluster with the nearest mean (cluster center or centroid). The algorithm iteratively refines the cluster assignments until the cluster centers stabilize. It's a popular algorithm due to its simplicity and efficiency, but it assumes that the data has a somewhat spherical shape and that the number of clusters is known beforehand. The main idea is to minimize the within-cluster variance (sum of squared distances between data points and their cluster centroid).
Real-Life Use Case
Best Practices
StandardScaler
or MinMaxScaler
from scikit-learn.init='k-means++'
for better initialization of cluster centers.n_init
> 1) to avoid getting stuck in a local minimum.
Interview Tip
Be prepared to discuss the assumptions of K-Means (e.g., spherical clusters, equal variance), its limitations (sensitivity to outliers, difficulty with non-convex clusters), and alternative clustering algorithms like DBSCAN or hierarchical clustering. Also, be ready to explain how to choose the optimal number of clusters.
When to Use Them
Use K-Means when:
Memory Footprint
The memory footprint of K-Means depends on the size of the dataset (number of samples and features), the number of clusters, and the number of iterations. For large datasets, consider using mini-batch K-Means (MiniBatchKMeans
) to reduce memory consumption.
Alternatives
Pros
Cons
FAQ
-
How do I choose the optimal number of clusters (k)?
Use the Elbow Method or Silhouette analysis. The Elbow Method involves plotting the within-cluster sum of squares (WCSS) for different values of k and choosing the 'elbow' point where the WCSS starts to diminish less rapidly. Silhouette analysis calculates a silhouette coefficient for each data point, which measures how well it fits into its assigned cluster compared to other clusters. Choose the k that maximizes the average silhouette score. -
What is the difference between K-Means and K-Means++?
K-Means++ is an initialization technique for K-Means that helps to avoid poor cluster assignments due to random initialization. It selects initial cluster centers that are far apart from each other, leading to faster convergence and better clustering results. K-Means uses a random initialization by default. -
What should I do if my data is not normally distributed?
K-Means assumes that clusters are somewhat spherical and have equal variance. If your data is not normally distributed or has non-convex clusters, consider using alternative clustering algorithms like DBSCAN, hierarchical clustering, or Gaussian Mixture Models (GMMs).