Agglomerative Clustering With Different Metrics in Scikit-learn

Demonstrates the effect of different metrics on hierarchical clustering.

The example is engineered to show the effect of the choice of metric. It is applied to waveforms, which can be seen as high-dimensional vectors. Indeed, the difference between metrics is usually more pronounced in high dimensions (in particular for euclidean and cityblock).

We generate data from three groups of waveforms. Two of the waveforms (waveform 1 and waveform 2) are proportional to each other. The cosine distance is invariant to scaling of the data, so it cannot distinguish these two waveforms. Thus, even with no noise, clustering using this distance will not separate waveforms 1 and 2.
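
The scale invariance can be checked numerically. The sketch below is illustrative only: the helper `cosine_distance` and the toy waveform are assumptions, not part of the original example.

```python
import numpy as np


def cosine_distance(u, v):
    """Cosine distance: 1 - cos(angle between u and v)."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))


# Hypothetical toy waveform and a scaled copy (same shape, 4x the amplitude)
t = np.linspace(0, np.pi, 200)
wave = np.sign(np.cos(6 * t))
scaled = 4.0 * wave

d_cosine = cosine_distance(wave, scaled)
d_euclidean = np.linalg.norm(wave - scaled)

print("cosine distance:", d_cosine)
print("euclidean distance:", d_euclidean)
```

The cosine distance between the two vectors is zero up to rounding, while the euclidean distance grows with the scale factor.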

We add observation noise to these waveforms. We generate very sparse noise: only 6% of the time points contain noise. As a result, this noise weighs much less in the l1 norm (i.e. “cityblock” distance) than in the l2 norm (“euclidean” distance), relative to the dense differences between waveforms. This can be seen in the inter-class distance matrices: the values on the diagonal, which characterize the spread of each class, are much bigger for the Euclidean distance than for the cityblock distance.
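
One way to read this is to compare the size of a sparse noise vector with the size of a dense difference between two classes, under each norm. The sketch below uses made-up vectors (`dense_diff`, `sparse_noise`) purely for illustration; they are not the data used in this example.

```python
import numpy as np

n_features = 2000

# Dense difference between two hypothetical class means: 1 at every time point
dense_diff = np.ones(n_features)

# Sparse noise: large spikes at only a handful of time points
sparse_noise = np.zeros(n_features)
sparse_noise[::400] = 12.0  # 5 spikes out of 2000 points


def l1(v):
    """l1 norm (cityblock length)."""
    return np.abs(v).sum()


def l2(v):
    """l2 norm (euclidean length)."""
    return np.sqrt((v ** 2).sum())


# Noise size relative to the class difference, under each norm
ratio_l1 = l1(sparse_noise) / l1(dense_diff)
ratio_l2 = l2(sparse_noise) / l2(dense_diff)

print("noise / signal under l1:", ratio_l1)
print("noise / signal under l2:", ratio_l2)
```

Under the l2 norm the few large spikes are comparable in size to the class difference, whereas under the l1 norm they are small next to it.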

When we apply clustering to the data, we find that the clustering reflects what was in the distance matrices. Indeed, for the Euclidean distance, the classes are ill-separated because of the noise, and thus the clustering does not separate the waveforms. For the cityblock distance, the separation is good and the waveform classes are recovered. Finally, the cosine distance does not separate waveforms 1 and 2 at all, so the clustering puts them in the same cluster.
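
The cosine behaviour can be reproduced on a tiny dataset. This is a minimal sketch, not the example's own code: the helper `average_linkage_labels` and the points in `X_toy` are hypothetical, and the `try`/`except` is only there because the distance parameter is named `affinity` in older scikit-learn releases and `metric` in newer ones.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering


def average_linkage_labels(X, n_clusters, metric):
    """Cluster X with average linkage under the given metric."""
    try:
        # Newer scikit-learn releases name the parameter `metric`
        model = AgglomerativeClustering(n_clusters=n_clusters,
                                        linkage="average", metric=metric)
    except TypeError:
        # Older releases name it `affinity`
        model = AgglomerativeClustering(n_clusters=n_clusters,
                                        linkage="average", affinity=metric)
    return model.fit(X).labels_


X_toy = np.array([
    [1.0, 0.0], [1.25, 0.0],   # group A
    [4.0, 0.0], [5.0, 0.0],    # group B: same direction as A, larger scale
    [0.0, 2.0], [0.0, 2.5],    # group C: a different direction
])

# Cosine sees A and B as identical, so two clusters are all it can find
cosine_labels = average_linkage_labels(X_toy, n_clusters=2, metric="cosine")
# Euclidean distance separates all three groups
euclidean_labels = average_linkage_labels(X_toy, n_clusters=3,
                                          metric="euclidean")

print("cosine:   ", cosine_labels)
print("euclidean:", euclidean_labels)
```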

Version¶

In [1]:
import sklearn
sklearn.__version__

Out[1]:
'0.18'

Imports¶

This tutorial imports AgglomerativeClustering and pairwise_distances.

In [2]:
import plotly.plotly as py
import plotly.graph_objs as go
from plotly import tools

import matplotlib.pyplot as plt
import numpy as np

from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import pairwise_distances


Calculations¶

In [3]:
np.random.seed(0)

# Generate waveform data
n_features = 2000
t = np.pi * np.linspace(0, 1, n_features)


def sqr(x):
    # Square wave: the sign of a cosine
    return np.sign(np.cos(x))


X = list()
y = list()
for i, (phi, a) in enumerate([(.5, .15), (.5, .6), (.3, .2)]):
    for _ in range(30):
        phase_noise = .01 * np.random.normal()
        amplitude_noise = .04 * np.random.normal()
        additional_noise = 1 - 2 * np.random.rand(n_features)
        # Make the noise sparse (the .997 threshold is an assumption,
        # taken from the upstream scikit-learn example)
        additional_noise[np.abs(additional_noise) < .997] = 0

        X.append(12 * ((a + amplitude_noise)
                       * (sqr(6 * (t + phi + phase_noise)))
                       + additional_noise))
        y.append(i)

X = np.array(X)
y = np.array(y)

n_clusters = 3

labels = ('Waveform 1', 'Waveform 2', 'Waveform 3')
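
A common way to make uniform noise sparse is to zero every entry whose magnitude falls below a threshold; for noise uniform on [-1, 1], the expected fraction of surviving entries is then 1 - threshold. The sketch below illustrates this rule. The threshold of 0.94 is a hypothetical value picked to match the 6% figure quoted in the introduction, not necessarily the value used to produce the figures here.

```python
import numpy as np

rng = np.random.RandomState(0)
n_points = 200000

# Uniform noise in [-1, 1], built the same way as `additional_noise`
noise = 1 - 2 * rng.rand(n_points)

# Hypothetical threshold chosen so that about 6% of the entries survive
threshold = 0.94
noise[np.abs(noise) < threshold] = 0

fraction_nonzero = np.count_nonzero(noise) / n_points
print("fraction of noisy points:", fraction_nonzero)
```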


Plot Ground Truth¶

In [4]:
ground_truth = []
c = ['red', 'green', 'blue']
for l, n in zip(range(n_clusters), labels):
    for i in range(len(X[y == l])):
        # Show a single legend entry per waveform class
        legend = (i == 1)
        lines = go.Scatter(y=X[y == l][i], name=n,
                           showlegend=legend,
                           line=dict(color=c[l],
                                     width=0.3))
        ground_truth.append(lines)

layout = go.Layout(title='Ground Truth',
                   xaxis=dict(zeroline=False, ticks='',
                              showticklabels=False, showgrid=False),
                   yaxis=dict(zeroline=False, ticks='',
                              showticklabels=False, showgrid=False))

fig = go.Figure(data=ground_truth, layout=layout)

py.iplot(fig)

Out[4]:

Plot the Distances¶

In [5]:
def matplotlib_to_plotly(cmap, pl_entries):
    # Sample a matplotlib colormap into a Plotly colorscale
    h = 1.0/(pl_entries-1)
    pl_colorscale = []

    for k in range(pl_entries):
        # list() keeps C subscriptable on Python 3, where map() is lazy
        C = list(map(int, np.array(cmap(k*h)[:3])*255))
        pl_colorscale.append([k*h, 'rgb'+str((C[0], C[1], C[2]))])

    return pl_colorscale

annotation = [[], [], []]
distance = [[], [], []]

for index, metric in enumerate(["cosine", "euclidean", "cityblock"]):
    avg_dist = np.zeros((n_clusters, n_clusters))
    for i in range(n_clusters):
        for j in range(n_clusters):
            avg_dist[i, j] = pairwise_distances(X[y == i], X[y == j],
                                                metric=metric).mean()
    # Normalize so that the largest average distance is 1
    avg_dist /= avg_dist.max()

    for i in range(n_clusters):
        for j in range(n_clusters):
            annotation_ = dict(x=i, y=j,
                               showarrow=False,
                               text='%5.3f' % avg_dist[i, j])
            annotation[index].append(annotation_)

    distance_ = go.Heatmap(z=avg_dist,
                           showscale=True,
                           colorscale=matplotlib_to_plotly(plt.cm.gnuplot2,
                                                           len(avg_dist)))
    distance[index].append(distance_)
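
The matrix computation can also be written as a standalone function, which makes it easy to try on small inputs. This is a sketch under assumptions: the name `average_interclass_distances` and the toy data are hypothetical, and only `pairwise_distances` comes from the code above.

```python
import numpy as np
from sklearn.metrics import pairwise_distances


def average_interclass_distances(X, y, metric):
    """Mean pairwise distance between every pair of classes, scaled to max 1."""
    classes = np.unique(y)
    avg = np.zeros((len(classes), len(classes)))
    for i, ci in enumerate(classes):
        for j, cj in enumerate(classes):
            avg[i, j] = pairwise_distances(X[y == ci], X[y == cj],
                                           metric=metric).mean()
    return avg / avg.max()


# Hypothetical toy data: two tight classes that lie far apart
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
y_toy = np.array([0, 0, 1, 1])

D = average_interclass_distances(X_toy, y_toy, metric="cityblock")
print(D)
```

Small diagonal entries relative to the off-diagonal ones indicate classes that are tight compared with the distance between them.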



Interclass cosine distances

In [6]:
layout = go.Layout(title='Interclass cosine distances',
                   annotations=annotation[0],
                   xaxis=dict(zeroline=False, ticks='',
                              showticklabels=False, showgrid=False),
                   yaxis=dict(zeroline=False, ticks='',
                              showticklabels=False, showgrid=False))

fig = go.Figure(data=distance[0], layout=layout)
py.iplot(fig)

Out[6]:

Interclass euclidean distances

In [7]:
layout = go.Layout(title='Interclass euclidean distances',
                   annotations=annotation[1],
                   xaxis=dict(zeroline=False, ticks='',
                              showticklabels=False, showgrid=False),
                   yaxis=dict(zeroline=False, ticks='',
                              showticklabels=False, showgrid=False))

fig = go.Figure(data=distance[1], layout=layout)
py.iplot(fig)

Out[7]:

Interclass cityblock distances

In [8]:
layout = go.Layout(title='Interclass cityblock distances',
                   annotations=annotation[2],
                   xaxis=dict(zeroline=False, ticks='',
                              showticklabels=False, showgrid=False),
                   yaxis=dict(zeroline=False, ticks='',
                              showticklabels=False, showgrid=False))

fig = go.Figure(data=distance[2], layout=layout)
py.iplot(fig)

Out[8]:

Plot Clustering Results¶

In [9]:
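cluster = [[], [], []]

for index, metric in enumerate(["cosine", "euclidean", "cityblock"]):
    # linkage="average" is an assumption, taken from the upstream scikit-learn
    # example (the default ward linkage only accepts euclidean distances).
    # From scikit-learn 1.2 on, `affinity` is named `metric`.
    model = AgglomerativeClustering(n_clusters=n_clusters,
                                    linkage="average", affinity=metric)
    model.fit(X)
    for l, c in zip(np.arange(model.n_clusters), ['red', 'green', 'blue', 'black']):
        for i in range(len(X[model.labels_ == l])):
            lines = go.Scatter(y=X[model.labels_ == l][i],
                               showlegend=False,
                               name=metric,
                               line=dict(color=c,
                                         width=0.3))
            cluster[index].append(lines)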
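
The plots below compare the clusterings visually. A numeric complement, which is not part of the original example, is the adjusted Rand index from `sklearn.metrics`: it equals 1 when two labelings agree up to a renaming of the clusters and drops when classes are merged or split. The label lists in this sketch are made up for illustration.

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical ground truth with three classes of two samples each
y_true = [0, 0, 1, 1, 2, 2]

# Same partition with different cluster ids: still a perfect match
relabelled = [2, 2, 0, 0, 1, 1]
# Two classes merged into one cluster, as cosine does for waveforms 1 and 2
merged = [0, 0, 0, 0, 1, 1]

score_relabelled = adjusted_rand_score(y_true, relabelled)
score_merged = adjusted_rand_score(y_true, merged)

print("relabelled:", score_relabelled)
print("merged:    ", score_merged)
```

In this notebook, the same function could be applied as `adjusted_rand_score(y, model.labels_)` inside the loop over metrics.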


AgglomerativeClustering (affinity=cosine):

In [10]:
layout = go.Layout(title='AgglomerativeClustering (affinity=cosine)',
                   xaxis=dict(zeroline=False, ticks='',
                              showticklabels=False, showgrid=False),
                   yaxis=dict(zeroline=False, ticks='',
                              showticklabels=False, showgrid=False))

fig = go.Figure(data=cluster[0], layout=layout)
py.iplot(fig)

Out[10]:

AgglomerativeClustering (affinity=euclidean):

In [11]:
layout = go.Layout(title='AgglomerativeClustering (affinity=euclidean)',
                   xaxis=dict(zeroline=False, ticks='',
                              showticklabels=False, showgrid=False),
                   yaxis=dict(zeroline=False, ticks='',
                              showticklabels=False, showgrid=False))

fig = go.Figure(data=cluster[1], layout=layout)
py.iplot(fig)

Out[11]:

AgglomerativeClustering (affinity=cityblock):

In [12]:
layout = go.Layout(title='AgglomerativeClustering (affinity=cityblock)',
                   xaxis=dict(zeroline=False, ticks='',
                              showticklabels=False, showgrid=False),
                   yaxis=dict(zeroline=False, ticks='',
                              showticklabels=False, showgrid=False))

fig = go.Figure(data=cluster[2], layout=layout)
py.iplot(fig)

Out[12]:

Author:

    Gael Varoquaux

License:

    BSD 3-Clause or CC-0