
K-Means Clustering with Scikit-learn

This snippet demonstrates how to use K-Means clustering from scikit-learn to group data points into clusters based on their proximity to cluster centers. We will generate synthetic data, apply K-Means, and visualize the results.

Import Libraries

This section imports necessary libraries:

  • numpy: For numerical operations, particularly array manipulation.
  • matplotlib.pyplot: For creating visualizations.
  • sklearn.cluster.KMeans: The K-Means clustering algorithm.
  • sklearn.datasets.make_blobs: A function to generate synthetic datasets for clustering.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

Generate Sample Data

We generate a synthetic dataset using make_blobs. This function creates isotropic Gaussian blobs, i.e., clusters of normally distributed points, which makes it ideal for testing clustering algorithms. The parameters are:

  • n_samples: The total number of points equally divided among clusters (300 in this case).
  • centers: The number of clusters (4 in this case).
  • cluster_std: The standard deviation of the clusters (0.60 in this case). This controls how spread out the clusters are.
  • random_state: A seed for the random number generator, ensuring reproducibility.

X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
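
Note that X is a (300, 2) NumPy array of point coordinates (make_blobs generates 2 features by default), while y holds the ground-truth blob labels. K-Means does not use y; it is kept here only for reference.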

Initialize and Fit K-Means

Here, we initialize the K-Means algorithm and fit it to the data:

  • KMeans(n_clusters=4): Creates a KMeans object with 4 clusters. Choosing an appropriate n_clusters is important; see the FAQ below on the Elbow Method and silhouette analysis.
  • init='k-means++': Specifies the method for initializing the cluster centers. 'k-means++' is a smart initialization technique that spreads out the initial centers to speed up convergence.
  • max_iter=300: The maximum number of iterations for the algorithm to run.
  • n_init=10: The number of times the K-Means algorithm will be run with different centroid seeds. The final result will be the best output of the n_init runs in terms of inertia.
  • random_state=0: For reproducibility.
  • kmeans.fit_predict(X): Fits the K-Means model to the data X and predicts the cluster assignments for each data point. The result y_kmeans is an array of cluster labels (0, 1, 2, or 3 in this case).

kmeans = KMeans(n_clusters=4, init='k-means++', max_iter=300, n_init=10, random_state=0)
y_kmeans = kmeans.fit_predict(X)
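
After fitting, the model exposes attributes worth inspecting. A quick sanity check using the kmeans and y_kmeans objects from above:

print(kmeans.cluster_centers_)  # (4, 2) array of centroid coordinates
print(kmeans.inertia_)          # within-cluster sum of squared distances (WCSS)
print(np.unique(y_kmeans))      # the four cluster labels: [0 1 2 3]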

Visualize the Clusters

This section visualizes the results:

  • plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis'): Creates a scatter plot of the data points. X[:, 0] and X[:, 1] represent the first and second features of the data. c=y_kmeans colors the points according to their cluster assignments. s=50 sets the size of the points, and cmap='viridis' specifies the color map.
  • centers = kmeans.cluster_centers_: Retrieves the coordinates of the cluster centers from the fitted K-Means model.
  • plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.75): Plots the cluster centers as red dots. s=200 sets the size of the centers, and alpha=0.75 sets the transparency.
  • The remaining lines add a title and axis labels.

plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')

centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.75)
plt.title('K-Means Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()

Complete Code

This section provides the complete code for easy copy-pasting and execution.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate sample data
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Initialize and fit K-Means
kmeans = KMeans(n_clusters=4, init='k-means++', max_iter=300, n_init=10, random_state=0)
y_kmeans = kmeans.fit_predict(X)

# Visualize the clusters
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')

centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.75)
plt.title('K-Means Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()

Concepts Behind the Snippet

K-Means clustering is a centroid-based algorithm that aims to partition n data points into k clusters, where each data point belongs to the cluster with the nearest mean (cluster center or centroid). The algorithm iteratively refines the cluster assignments until the cluster centers stabilize. It's a popular algorithm due to its simplicity and efficiency, but it assumes that the data has a somewhat spherical shape and that the number of clusters is known beforehand. The main idea is to minimize the within-cluster variance (sum of squared distances between data points and their cluster centroid).
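
To make the objective concrete, here is a minimal sketch that recomputes the within-cluster sum of squares by hand from the fitted model above and verifies that it matches kmeans.inertia_:

# Recompute the K-Means objective (WCSS) from the fitted model
wcss = 0.0
for k in range(kmeans.n_clusters):
    members = X[y_kmeans == k]               # points assigned to cluster k
    center = kmeans.cluster_centers_[k]      # that cluster's centroid
    wcss += ((members - center) ** 2).sum()  # squared distances to the centroid
print(np.isclose(wcss, kmeans.inertia_))     # True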

Real-Life Use Case

  • Customer Segmentation: Grouping customers based on purchasing behavior, demographics, or other characteristics to tailor marketing campaigns.
  • Image Compression: Reducing the number of colors in an image by clustering similar colors together (a sketch follows this list).
  • Anomaly Detection: Identifying unusual data points that don't fit into any cluster.
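
A minimal sketch of the image-compression use case, using scikit-learn's bundled sample image (loading it requires Pillow). Fitting on a random subsample of pixels keeps it fast; the subsample size and the 16-color palette are illustrative choices, not canonical ones:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_sample_image

image = load_sample_image('china.jpg')                    # (H, W, 3) uint8 RGB image
pixels = image.reshape(-1, 3).astype(np.float64) / 255.0  # one row per pixel

# Fit on a 1,000-pixel subsample, then quantize every pixel to its nearest centroid
rng = np.random.default_rng(0)
sample = pixels[rng.choice(len(pixels), 1000, replace=False)]
km = KMeans(n_clusters=16, n_init=10, random_state=0).fit(sample)
compressed = km.cluster_centers_[km.predict(pixels)].reshape(image.shape)

plt.imshow(compressed)
plt.axis('off')
plt.show()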

Best Practices

  • Scaling: Scale your data before applying K-Means, as it is sensitive to the scale of the features. Use StandardScaler or MinMaxScaler from scikit-learn (a sketch follows this list).
  • Choosing K: Use techniques like the Elbow Method or Silhouette analysis to determine the optimal number of clusters.
  • Initialization: Use init='k-means++' for better initialization of cluster centers.
  • Multiple Runs: Run K-Means multiple times with different initializations (n_init > 1) to avoid getting stuck in a local minimum.
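
A minimal sketch of the scaling advice, bundling the scaler and the clusterer in a Pipeline so they are always applied together (our synthetic blobs already share a scale, so this matters most with real, mixed-unit features):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

pipeline = make_pipeline(
    StandardScaler(),  # zero mean, unit variance per feature
    KMeans(n_clusters=4, init='k-means++', n_init=10, random_state=0),
)
labels = pipeline.fit_predict(X)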

Interview Tip

Be prepared to discuss the assumptions of K-Means (e.g., spherical clusters, equal variance), its limitations (sensitivity to outliers, difficulty with non-convex clusters), and alternative clustering algorithms like DBSCAN or hierarchical clustering. Also, be ready to explain how to choose the optimal number of clusters.

When to Use Them

Use K-Means when:

  • You have a relatively large dataset.
  • The clusters are expected to be somewhat spherical and well-separated.
  • You have a good estimate of the number of clusters.
  • Interpretability is important.

Memory Footprint

The memory footprint of K-Means depends on the size of the dataset (number of samples and features), the number of clusters, and the number of iterations. For large datasets, consider using mini-batch K-Means (MiniBatchKMeans) to reduce memory consumption.
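
A drop-in sketch on the same X: MiniBatchKMeans updates centroids from small random batches rather than the full dataset, trading a little accuracy for a much smaller memory footprint (batch_size=100 is an illustrative choice):

from sklearn.cluster import MiniBatchKMeans

mbk = MiniBatchKMeans(n_clusters=4, batch_size=100, n_init=10, random_state=0)
y_mbk = mbk.fit_predict(X)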

Alternatives

  • DBSCAN: Density-based clustering algorithm that doesn't require specifying the number of clusters. Suitable for non-convex clusters.
  • Hierarchical Clustering: Creates a hierarchy of clusters, allowing you to choose the desired level of granularity.
  • Gaussian Mixture Models (GMM): A probabilistic clustering algorithm that assumes the data is generated from a mixture of Gaussian distributions.
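
All three alternatives follow scikit-learn's estimator API, so they can be swapped in with minimal changes. A sketch on the same X (eps=0.5 is a placeholder; DBSCAN's eps is data-dependent and usually needs tuning):

from sklearn.cluster import DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture

db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)          # -1 marks noise points
agg_labels = AgglomerativeClustering(n_clusters=4).fit_predict(X)  # hierarchical
gmm_labels = GaussianMixture(n_components=4, random_state=0).fit_predict(X)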

Pros

  • Simple to implement and understand.
  • Efficient for large datasets.
  • Scalable.

Cons

  • Sensitive to initial centroid positions.
  • Assumes clusters are spherical and equally sized.
  • Requires specifying the number of clusters (k) beforehand.
  • Sensitive to outliers.

FAQ

  • How do I choose the optimal number of clusters (k)?

    Use the Elbow Method or silhouette analysis. The Elbow Method involves plotting the within-cluster sum of squares (WCSS) for different values of k and choosing the 'elbow' point where the WCSS starts to diminish less rapidly. Silhouette analysis calculates a silhouette coefficient for each data point, which measures how well it fits into its assigned cluster compared to other clusters. Choose the k that maximizes the average silhouette score. A code sketch for both techniques follows at the end of this FAQ.
  • What is the difference between K-Means and K-Means++?

    K-Means++ is an initialization technique for K-Means that helps to avoid poor cluster assignments due to random initialization. It selects initial cluster centers that are far apart from each other, leading to faster convergence and better clustering results. The classical K-Means algorithm initializes centers randomly, which can land in poor local minima; note that scikit-learn's KMeans already uses init='k-means++' by default.
  • What should I do if my data is not normally distributed?

    K-Means assumes that clusters are somewhat spherical and have equal variance. If your data is not normally distributed or has non-convex clusters, consider using alternative clustering algorithms like DBSCAN, hierarchical clustering, or Gaussian Mixture Models (GMMs).
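
A minimal sketch combining the Elbow Method and silhouette analysis on the X generated earlier (the range of candidate k values is an arbitrary choice for illustration):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

inertias, sil_scores = [], []
ks = range(2, 11)
for k in ks:
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=0)
    labels = km.fit_predict(X)
    inertias.append(km.inertia_)                    # WCSS for the elbow plot
    sil_scores.append(silhouette_score(X, labels))  # mean silhouette coefficient

plt.plot(ks, inertias, marker='o')
plt.xlabel('k')
plt.ylabel('WCSS (inertia)')
plt.title('Elbow Method')
plt.show()

print('Best k by silhouette score:', ks[sil_scores.index(max(sil_scores))])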