DEV Community

Hierarchical Clustering

The Problem It Solves

Many clustering algorithms split data into a fixed number of groups. Once a data point belongs to a cluster, that's the end of the story. But real-world data is rarely that simple. Customers, products, organizations, and even biological species naturally form nested relationships.

For example: A startup grows into a mid-sized company. Several mid-sized companies belong to the same enterprise. That enterprise belongs to an industry. These relationships exist at multiple levels.

Hierarchical Clustering is designed to discover this hierarchy. Instead of producing one flat grouping, it builds an entire tree that shows how clusters gradually merge together. This lets you analyze your data at different levels of detail.

Core Intuition

Imagine you're organizing hundreds of old family photographs. You don't immediately decide there should be exactly five albums. Instead, you start by finding the two most similar photos. Perhaps they're from the same birthday party. You group them together. Next, you find another similar pair. Eventually those small groups begin joining together. Birthday albums merge into childhood albums. Childhood albums merge into family collections. Eventually every photo belongs to one giant family archive.

Now imagine drawing a horizontal line across that family tree. Wherever you cut the tree determines how many groups you end up with. Cut lower โ†’ many small groups. Cut higher โ†’ fewer large groups.

That's exactly how Hierarchical Clustering works. It builds the entire hierarchy first. You decide later how many clusters you actually want.

How the Algorithm Works

The most common version is Agglomerative Hierarchical Clustering, which follows a bottom-up approach.

Step 1 - Start with Individual Clusters
Initially, every single observation forms its own cluster. If you have 500 samples, you begin with 500 clusters.

Step 2 - Compute Pairwise Distances
The algorithm calculates the distance between every pair of clusters. For individual points, this is usually Euclidean Distance. This produces a complete distance matrix.

Step 3 - Merge the Closest Clusters
The two closest clusters are merged together. After merging, the distance matrix is updated. The algorithm repeats this process until every observation belongs to one single cluster.

Step 4 - Build the Dendrogram
Every merge is recorded. The result is a tree called a Dendrogram. The height of each branch represents the distance at which clusters were merged. The taller the merge, the less similar those clusters were.

Linkage Methods

One important question remains: How do we measure the distance between two clusters? This depends on the Linkage Method.

Single Linkage
Measures the distance between the two closest points. It tends to create long chain-like clusters. Useful for detecting irregular shapes. Prone to chaining.

Complete Linkage
Measures the distance between the two farthest points. Produces compact clusters. More resistant to noise.

Average Linkage
Uses the average distance between all pairs of points. Provides a balance between Single and Complete Linkage.

Ward's Linkage
Rather than measuring distance directly, Ward's method chooses the merge that causes the smallest increase in within-cluster variance. This usually produces balanced, compact clusters. It is the most commonly used linkage in machine learning.

What Is the Algorithm Optimizing?

Unlike K-Means, Hierarchical Clustering doesn't optimize a global objective function. Instead, it greedily builds the cluster hierarchy based on local merge decisions. The quality of the resulting tree is often evaluated using the Cophenetic Correlation Coefficient, which measures how well the dendrogram preserves the original pairwise distances.

When Should You Use Hierarchical Clustering?

Hierarchical Clustering works well when:

  • The natural hierarchy matters.
  • You don't know the correct number of clusters beforehand.
  • The dataset is relatively small.
  • Understanding relationships is more important than prediction speed.

Typical applications include:

  • Customer segmentation
  • Biological taxonomy
  • Document organization
  • Gene expression analysis
  • Product categorization
  • Organizational structure analysis

Advantages

Hierarchical Clustering offers several important benefits.

  • No need to specify the number of clusters beforehand.
  • Produces a complete hierarchy of relationships.
  • Easy to visualize using a dendrogram.
  • Works with many different distance metrics.
  • Flexible through different linkage methods.

When It Starts Breaking Down

Like every algorithm, Hierarchical Clustering has its limitations.

Poor Scalability
This algorithm must compute and store every pairwise distance.
Time Complexity: O(Nยณ)
Memory Complexity: O(Nยฒ)
As datasets grow, memory usage increases rapidly. For hundreds of thousands of observations, Hierarchical Clustering becomes impractical.

Sensitive to Noise
A few noisy observations can significantly change the structure of the tree, especially with Single Linkage.

Greedy Decisions
Once two clusters are merged, that decision can never be undone. If an early merge is incorrect, the algorithm carries that mistake throughout the rest of the hierarchy.

Linkage Selection Matters
Different linkage methods can produce completely different dendrograms on the same dataset. Choosing the wrong linkage may hide the true underlying structure.

Python Implementation

import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.preprocessing import StandardScaler

# Create sample dataset
np.random.seed(42)
X = np.vstack([
    np.random.normal([3, 1], [1, 0.5], (15, 2)),
    np.random.normal([15, 8], [2, 1.5], (15, 2)),
    np.random.normal([80, 25], [10, 4], (15, 2))
])
df = pd.DataFrame(X, columns=["Seat_Count", "Support_Tickets"])

# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df)

# Build hierarchy using Ward linkage
Z = linkage(X_scaled, method="ward")

# Cut the tree into 3 clusters
df["Cluster"] = fcluster(Z, t=3, criterion="maxclust")

print(df["Cluster"].value_counts())
print("\nCluster Means")
print(df.groupby("Cluster").mean())

How to Evaluate Hierarchical Clustering

Since Hierarchical Clustering is unsupervised, there is no prediction accuracy. Instead, we evaluate the quality of the hierarchy.

Dendrogram
The dendrogram is the primary evaluation tool. Large vertical jumps indicate natural places to cut the tree into clusters.

Cophenetic Correlation Coefficient
Measures how faithfully the dendrogram preserves the original pairwise distances. Higher values indicate a better hierarchy.

Silhouette Score
Once a cut is chosen, Silhouette Score can measure how well-separated the resulting clusters are. Higher values indicate cleaner clusters.

Real-World Engineering Notes

Some practical lessons you'll quickly discover:

  • Always standardize features before clustering.
  • Ward Linkage generally produces the most stable business clusters.
  • Hierarchical Clustering is excellent for exploratory analysis but rarely used on massive production datasets.
  • Dendrograms become unreadable once the dataset grows beyond a few thousand observations.
  • For very large datasets, K-Means or DBSCAN are usually better choices.

Hierarchical Clustering vs K-Means

Although both perform clustering, they work very differently.

Hierarchical Clustering K-Means
Builds a hierarchy Produces flat clusters
No need to specify K initially Must choose K beforehand
Creates a dendrogram Creates centroids
Slower on large datasets Much faster
Better for relationship discovery Better for large-scale segmentation

Key Takeaways

Hierarchical Clustering is an unsupervised clustering algorithm that builds a nested hierarchy of groups. It starts with every observation as its own cluster and repeatedly merges the closest clusters until only one remains. The resulting dendrogram allows you to choose the number of clusters after training rather than before. Different linkage methods produce different cluster structures, with Ward Linkage being the most common. The algorithm works well for small to medium datasets where understanding relationships is more important than computational efficiency. Its biggest limitation is scalability, as both memory and computation grow rapidly with dataset size.

Comments

No comments yet. Start the discussion.