Why Look Beyond K-Means?
In the previous chapter you learned K-Means Clustering: pick k, drop k centroids, assign each point to its nearest centroid, repeat. It is fast and famous — but it makes two big assumptions that often fail in real data.
- It needs you to know
kin advance. - It assumes clusters are round, roughly equal-sized blobs, because every point simply joins the nearest centre.
Real data is messier. A rider-density map of Bengaluru is not made of tidy circles; customer segments do not come in equal sizes; and fraud or sensor faults show up as outliers that k-means is forced to shove into some cluster anyway. This chapter covers two methods that relax those assumptions.
- Hierarchical clustering builds a tree of clusters and lets you decide
kafter looking at the structure — no upfront guess needed. - DBSCAN groups points by density, so it finds arbitrary shapes (crescents, rings, snakes) and explicitly labels sparse points as noise / outliers instead of forcing them into a cluster.
Intuitive analogy. Imagine organising a wedding guest list. Hierarchical thinking is genealogical: start with individuals, merge siblings into families, families into extended families, and so on up to the whole clan — then "cut" the family tree at whatever level of grouping you want. DBSCAN thinking is like spotting friend-circles at the reception: wherever people are packed tightly together, that is a group; a couple standing alone in a corner belongs to no group at all — they are noise, and that is a perfectly valid answer.
Both are unsupervised — there are no labels, only the structure the data reveals.
Hierarchical (Agglomerative) Clustering
The most common form is agglomerative (bottom-up) hierarchical clustering. It works like this:
1. Start: every point is its own cluster (n points → n clusters).
2. Find the two CLOSEST clusters and merge them into one.
3. Repeat step 2 — each round reduces the cluster count by one.
4. Stop when everything is in a single cluster (1 big cluster).
Because it records the entire sequence of merges, you get a full hierarchy. You never had to commit to a number of clusters while running the algorithm — you choose k afterwards by deciding where to "cut" the tree.
There is also a top-down variant called divisive clustering (start with one cluster, keep splitting), but it is rarely used in practice, so we focus on the agglomerative version.
Linkage: How "Distance Between Clusters" Is Defined
Step 2 says "find the two closest clusters" — but what does distance between two groups of points mean? That choice is called the linkage criterion, and it dramatically changes the shapes you get. Let A and B be two clusters.
Single linkage : distance = MIN distance between any point in A and any point in B
Complete linkage : distance = MAX distance between any point in A and any point in B
Average linkage : distance = AVERAGE of all pairwise A–B distances
Ward's linkage : merge the pair that increases total within-cluster variance the LEAST
Each behaves differently:
| Linkage | How it measures cluster distance | Tends to produce | Watch out for |
|---|---|---|---|
| Single | Nearest pair (min) | Long, straggly "chains"; can follow curved shapes | Chaining — two blobs joined by a thin bridge merge into one |
| Complete | Farthest pair (max) | Compact, roughly equal-diameter clusters | Sensitive to outliers (one far point inflates the max) |
| Average | Mean of all pairs | A balance between single and complete | Less interpretable than the extremes |
| Ward | Minimises variance increase | Compact, similar-sized clusters (very k-means-like) | Requires Euclidean distance; assumes spherical-ish clusters |
Ward's method is the most popular default for general use because it produces clean, balanced clusters. Single linkage is the one to reach for when clusters are genuinely elongated or non-convex — but beware the chaining effect.
The Dendrogram: Reading and Cutting the Tree
Every agglomerative run produces a dendrogram — a tree diagram where the height of each merge equals the distance at which two clusters joined. Low merges = very similar points joined early; high merges = dissimilar groups joined late.
Height (merge distance)
|
| ┌──────────────┐ <- tall bar: two DISSIMILAR
| │ │ groups merge here
| ┌──────┘ ┌────┘
| ┌──┘ ┌──┘
| ┌──┴─┐ ┌───┴──┐
| ┌─┴┐ ┌─┴┐ ┌─┴─┐ ┌─┴─┐
P1 P2 P3 P4 P5 P6 P7 P8 <- individual points at the bottom
To pick k, draw a horizontal cut across the dendrogram. The number of vertical lines the cut crosses is the number of clusters.
Rule of thumb for choosing where to cut:
→ Cut across the TALLEST vertical gap that no horizontal merge crosses.
→ A tall gap means: "merging further would join very dissimilar groups,"
so stopping there gives well-separated clusters.
This is the hierarchical answer to k-means' elbow method: instead of running many models, you inspect one tree and read k straight off it.
Hierarchical Clustering in scikit-learn
AgglomerativeClustering does the clustering; SciPy is used separately to draw the dendrogram.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering
# Customers of an Indian D2C brand: annual spend (₹000) and visits/month
data = pd.DataFrame({
"annual_spend": [12, 15, 11, 88, 92, 85, 45, 48, 43, 90],
"visits": [ 2, 3, 2, 14, 15, 13, 7, 8, 6, 16],
})
# ALWAYS scale first — distance-based methods are sensitive to feature units
X = StandardScaler().fit_transform(data)
# n_clusters=3 chosen after inspecting the dendrogram (see next snippet)
model = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = model.fit_predict(X)
data["cluster"] = labels
print(data)
annual_spend visits cluster
0 12 2 1
1 15 3 1
2 11 2 1
3 88 14 0
4 92 15 0
5 85 13 0
6 45 7 2
7 48 8 2
8 43 6 2
9 90 16 0
Three clear segments emerge: low-spend/low-visit (1), mid-tier (2), and high-value loyalists (0). Note there is no .predict() for new points — agglomerative clustering has no cluster "centres" to compare against, so to label a new customer you must re-fit or fall back to a nearest-neighbour rule.
To choose k visually, draw the dendrogram with SciPy:
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
Z = linkage(X, method="ward") # the full merge history
plt.figure(figsize=(8, 4))
dendrogram(Z)
plt.axhline(y=6, color="red", linestyle="--") # horizontal "cut" line
plt.xlabel("Sample index")
plt.ylabel("Merge distance")
plt.title("Ward Dendrogram — cut crosses 3 lines → k = 3")
plt.show()
The linkage matrix Z from SciPy is the raw record of every merge; AgglomerativeClustering gives you the final labels. In practice you plot the dendrogram, read off k, then set n_clusters=k.
DBSCAN: Density-Based Clustering
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. Instead of centres or trees, it uses a simple, powerful idea: a cluster is a dense region of points separated from other dense regions by sparse space. Points in sparse space are labelled noise.
It has just two parameters:
eps : the radius of a point's neighbourhood (how close counts as "near")
min_samples : the minimum number of points required within eps
(including the point itself) to call a region "dense"
Every point falls into one of three roles:
Core point : has AT LEAST min_samples points within distance eps.
(It sits in the dense interior of a cluster.)
Border point : within eps of a core point, but does NOT itself have
min_samples neighbours. (It sits on a cluster's edge.)
Noise point : neither core nor border — too isolated to belong anywhere.
DBSCAN labels these as -1.
The algorithm then grows clusters outward from core points: start at an unvisited core point, sweep in every core point reachable through overlapping eps-neighbourhoods, attach the border points on the fringe, and stop. Whatever is left over is noise.
Why DBSCAN Shines
- No
krequired. The number of clusters emerges from the density structure. - Arbitrary shapes. Because clusters grow by connectivity, DBSCAN happily traces crescents, rings, and spirals that k-means would slice in half.
- Built-in outlier detection. Sparse points become label
-1— you never have to force a lone anomaly into a cluster. This makes DBSCAN a genuinely useful fraud / fault-detection tool.
DBSCAN in scikit-learn
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
# Two interleaving crescents — the classic case k-means fails on
X_raw, _ = make_moons(n_samples=300, noise=0.06, random_state=42)
X = StandardScaler().fit_transform(X_raw)
db = DBSCAN(eps=0.3, min_samples=5)
labels = db.fit_predict(X)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)
print("Clusters found:", n_clusters)
print("Noise points:", n_noise)
Clusters found: 2
Noise points: 4
DBSCAN recovers both crescents correctly and flags a handful of stray points as noise — something k-means with k=2 cannot do, because it would cut each moon straight down the middle. Notice we never told DBSCAN there were two clusters; it discovered that from the data.
Choosing eps and min_samples
These two knobs make or break DBSCAN, so tune them deliberately.
min_samples : rule of thumb → at least (number_of_dimensions + 1).
A common default is 4 or 5 for 2-D data. Larger values
make the algorithm more conservative (more noise).
eps : use a "k-distance plot". For every point, compute the
distance to its k-th nearest neighbour (k = min_samples),
sort those distances ascending, and plot them. The sharp
"elbow / knee" in the curve is a good value for eps.
from sklearn.neighbors import NearestNeighbors
import numpy as np
k = 5 # match your intended min_samples
nn = NearestNeighbors(n_neighbors=k).fit(X)
distances, _ = nn.kneighbors(X)
kth = np.sort(distances[:, -1]) # distance to the k-th neighbour, sorted
# Plot `kth`; the y-value at the sharp upward bend ≈ good eps
# (a flat stretch that suddenly rises marks the density boundary)
Because DBSCAN uses a single global eps, it struggles when clusters have very different densities — one eps cannot be tight enough for a dense cluster and loose enough for a sparse one at the same time. The variant HDBSCAN (a separate library) addresses this by allowing variable density, and is worth knowing about once the basics click.
When Does Each Method Shine?
None of these three is "best" — they encode different assumptions. Match the method to your data.
| Scenario | Best fit | Why |
|---|---|---|
You already know k; clusters are round and similar-sized; data is large | K-Means | Fast, scales to millions of rows, simple to explain |
You do not know k and want to explore structure at multiple levels | Hierarchical | The dendrogram reveals nested structure and lets you pick k after the fact |
| Clusters are elongated, curved, or wildly non-spherical | DBSCAN | Grows clusters by density/connectivity, not by distance to a centre |
| Data contains outliers you want isolated, not absorbed | DBSCAN | Labels sparse points as noise (-1) instead of forcing membership |
| Clusters have very different densities | K-Means or HDBSCAN | Single-eps DBSCAN handles one density well; k-means or HDBSCAN cope better |
| Small dataset, interpretability matters (e.g. taxonomy) | Hierarchical | The tree is a rich, human-readable artefact |
A quick head-to-head on the properties people care about most:
| Property | K-Means | Hierarchical (Agglomerative) | DBSCAN |
|---|---|---|---|
Need to pre-specify k? | Yes | No (choose by cutting the tree) | No |
| Cluster shape it finds | Spherical / convex | Depends on linkage (Ward is convex) | Arbitrary |
| Handles noise / outliers? | No (forces every point in) | No (assigns every point) | Yes (labels as -1) |
| Main parameters | k | n_clusters, linkage | eps, min_samples |
| Can label brand-new points? | Yes (.predict) | No (re-fit needed) | No (approximate re-fit) |
| Typical time complexity | O(n · k · iters) | O(n^2) to O(n^3) | roughly O(n log n) with an index |
Scales to very large n? | Excellent | Poor (memory/time heavy) | Moderate to good |
The scalability row is the practical dealbreaker: standard agglomerative clustering stores an n x n distance matrix, so a few hundred thousand rows can exhaust memory. K-means and DBSCAN (with a spatial index) scale far better.
A Side-by-Side Worked Comparison
Here is the same non-spherical dataset run through all three, so you can see the difference in one place.
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
X_raw, _ = make_moons(n_samples=400, noise=0.07, random_state=0)
X = StandardScaler().fit_transform(X_raw)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
agg = AgglomerativeClustering(n_clusters=2, linkage="single").fit_predict(X)
db = DBSCAN(eps=0.25, min_samples=5).fit_predict(X)
for name, lbl in [("KMeans", km), ("Agglo-single", agg), ("DBSCAN", db)]:
n_clusters = len(set(lbl)) - (1 if -1 in lbl else 0)
print(f"{name:14s} clusters={n_clusters} noise={(lbl == -1).sum()}")
KMeans clusters=2 noise=0 <- cuts each moon in half (wrong grouping)
Agglo-single clusters=2 noise=0 <- single linkage traces the crescents
DBSCAN clusters=2 noise=6 <- traces crescents AND isolates 6 stray points
On crescent-shaped data, k-means splits the moons the wrong way (it draws a straight boundary through the middle), while single-linkage hierarchical and DBSCAN both follow the true curved shapes. DBSCAN additionally reports the noisy fringe points as outliers. This is the whole story of this chapter in one experiment.
Common Mistakes
1. Forgetting to scale features
All three methods are distance-based. If annual_spend runs 0–100,000 and visits runs 0–30, spend dominates every distance and visits are effectively ignored. Always standardise (StandardScaler) or normalise before clustering, exactly as covered in the Feature Engineering & Scaling chapter.
2. Treating eps as a magic constant
eps is not transferable across datasets — it depends on scale, dimensionality, and density. Copying an eps from a tutorial (including this one) onto your data will usually give one giant cluster or all-noise. Tune it with a k-distance plot every time.
3. Expecting DBSCAN to handle very uneven densities
A single global eps cannot serve a dense cluster and a sparse cluster simultaneously. If your clusters have very different densities, DBSCAN will either merge the sparse one into noise or split the dense one. Reach for HDBSCAN or k-means instead.
4. Ignoring the -1 noise label in DBSCAN
The -1 label is not a cluster — it is "unassigned." A frequent bug is looping over set(labels) and counting -1 as a real cluster, or plotting it with the others as if it were meaningful. Always subtract it out: len(set(labels)) - (1 if -1 in labels else 0).
5. Cutting a dendrogram at an arbitrary height
Do not just eyeball "somewhere in the middle." Cut across the tallest vertical gap that no merge crosses — that is where clusters are most separated. Cutting through a busy region gives unstable, meaningless groupings.
6. Using single linkage without expecting chaining
Single linkage is powerful for elongated shapes, but the same min-distance rule causes chaining: two distinct blobs joined by even a thin trail of points merge into one cluster. If your clusters look suspiciously merged, try Ward or complete linkage.
Practice Exercises
-
Given five 1-D points at positions
2, 5, 9, 10, 25, perform agglomerative clustering by hand using single linkage. Write out the merge order and the distance at each merge. At what cut height do you get exactly 2 clusters? -
You run
AgglomerativeClusteringwithlinkage="ward"and get three roughly equal, compact clusters. You suspect the true clusters are actually two long curved bands. Which linkage would you switch to, and why? -
On a dataset, DBSCAN with
eps=0.3, min_samples=5returns almost all points as noise (-1). List two changes to the parameters that could reduce the amount of noise, and explain the trade-off of each. -
Explain in one or two sentences why DBSCAN can find two interleaving crescent shapes but k-means with
k=2cannot. -
You have 2 million rows and need clusters quickly for a nightly job. Which of the three methods would you rule out on scalability grounds, and which would you try first?
-
Sketch (in words) a k-distance plot and describe how you would read a good value of
epsoff it. What doeskcorrespond to in that plot?
Summary
In this chapter you learned:
- Hierarchical (agglomerative) clustering starts with every point as its own cluster and repeatedly merges the two closest clusters bottom-up, recording the full hierarchy — no need to pre-specify
k. - The linkage criterion defines cluster-to-cluster distance: single (
min, chains and curves), complete (max, compact), average (balanced), and Ward (minimises variance increase, the popular default). - A dendrogram visualises the merges; you choose
kby drawing a horizontal cut across the tallest uncrossed vertical gap. - DBSCAN clusters by density using two parameters —
eps(neighbourhood radius) andmin_samples(points needed to be dense) — classifying points as core, border, or noise (-1). - DBSCAN needs no
k, finds arbitrary shapes, and isolates outliers as noise — but a single globalepsstruggles with clusters of very different densities. - Choose by data: k-means for large, round, known-
kdata; hierarchical to explore unknown structure and pickkafterwards; DBSCAN for non-spherical shapes and outlier-heavy data. - Always scale features first, tune
epswith a k-distance plot, and never treat the-1noise label as a real cluster.
Clustering groups rows; next we turn to the columns. When you have dozens or hundreds of features, distances become unreliable and visualisation impossible — so we compress the feature space while keeping the signal.
Next up: Dimensionality Reduction & PCA — how principal component analysis rotates your data onto a handful of high-variance axes to fight the curse of dimensionality, speed up models, and let you actually plot high-dimensional data.