Show Sidebar Hide Sidebar

# Gaussian Mixture Model Sine Curve in Scikit-learn

This example demonstrates the behavior of Gaussian mixture models fit on data that was not sampled from a mixture of Gaussian random variables. The dataset is formed by 100 points loosely spaced following a noisy sine curve.

There is therefore no ground truth value for the number of Gaussian components. The first model is a classical Gaussian Mixture Model with 10 components fit with the Expectation-Maximization algorithm.

The second model is a Bayesian Gaussian Mixture Model with a Dirichlet process prior fit with variational inference. The low value of the concentration prior makes the model favor a lower number of active components. This models “decides” to focus its modeling power on the big picture of the structure of the dataset: groups of points with alternating directions modeled by non-diagonal covariance matrices. Those alternating directions roughly capture the alternating nature of the original sine signal.

The third model is also a Bayesian Gaussian mixture model with a Dirichlet process prior but this time the value of the concentration prior is higher giving the model more liberty to model the fine-grained structure of the data.

The result is a mixture with a larger number of active components that is similar to the first model where we arbitrarily decided to fix the number of components to 10. Which model is the best is a matter of subjective judgement: do we want to favor models that only capture the big picture to summarize and explain most of the structure of the data while ignoring the details or do we prefer models that closely follow the high density regions of the signal?

The last two panels show how we can sample from the last two models. The resulting samples distributions do not look exactly like the original data distribution. The difference primarily stems from the approximation error we made by using a model that assumes that the data was generated by a finite number of Gaussian components instead of a continuous noisy sine curve.

#### New to Plotly?¶

You can set up Plotly to work in online or offline mode, or in jupyter notebooks.
We also have a quick-reference cheatsheet (new!) to help you get started!

### Version¶

In [1]:
import sklearn
sklearn.__version__

Out[1]:
'0.18.1'

### Imports¶

In [2]:
import plotly.plotly as py
import plotly.graph_objs as go
from plotly import tools

import itertools
import math
import numpy as np
from scipy import linalg

from sklearn import mixture


### Calculations¶

In [3]:
color_iter = itertools.cycle(['navy', 'cyan', 'cornflowerblue', 'gold',
'darkorange'])

def plot_results(X, Y, means, covariances):
data= []

for i, (mean, covar, color) in enumerate(zip(
means, covariances, color_iter)):
v, w = linalg.eigh(covar)
v = 2. * np.sqrt(2.) * np.sqrt(v)
u = w[0] / linalg.norm(w[0])
# as the DP will not use every component it has access to
# unless it needs it, we shouldn't plot the redundant
# components.
if not np.any(Y == i):
continue
trace = go.Scatter(x=X[Y == i, 0],
y=X[Y == i, 1],
mode='markers', marker=dict(color=color,
line=dict(color='black', width=1)))
data.append(trace)

# Plot an ellipse to show the Gaussian component
a =  v[0]
b =  v[1]
x_origin = mean[0]
y_origin = mean[1]
x_ = [ ]
y_ = [ ]
angle = np.arctan(u[1] / u[0])
angle = 180. * angle / np.pi
for t in range(0, 360,10):
x_.append(x)
y_.append(y)

elle = go.Scatter(x=x_ , y=y_, mode='lines',
showlegend=False,
line=dict(color=color, width=1))
data.append(elle)

return data

def plot_samples(X, Y, n_components):
data = []
for i, color in zip(range(n_components), color_iter):
# as the DP will not use every component it has access to
# unless it needs it, we shouldn't plot the redundant
# components.
if not np.any(Y == i):
continue
trace = go.Scatter(x=X[Y == i, 0],
y=X[Y == i, 1],
mode='markers',marker=dict(color=color,
line=dict(color='black', width=1)))
data.append(trace)
return data


### Plot Results¶

In [4]:
titles=('Expectation-maximization',
"Bayesian Gaussian mixture models with a Dirichlet process prior "
"for gamma=0.01.",
"Gaussian mixture with a Dirichlet process prior "
r"for gamma=0.01 sampled with 2000 samples.",
"Bayesian Gaussian mixture models with a Dirichlet process prior "
r"for gamma_0=100",
"Gaussian mixture with a Dirichlet process prior "
r"for gamma=100 sampled with 2000 samples.")

fig = tools.make_subplots(rows=5, cols=1,
print_grid=False,
subplot_titles=titles)

In [5]:
# Parameters
n_samples = 100

# Generate random sample following a sine curve
np.random.seed(0)
X = np.zeros((n_samples, 2))
step = 4. * np.pi / n_samples

for i in range(X.shape[0]):
x = i * step - 6.
X[i, 0] = x + np.random.normal(0, 0.1)
X[i, 1] = 3. * (np.sin(x) + np.random.normal(0, .2))


Expectation-maximization

In [6]:
# Fit a Gaussian mixture with EM using ten components
gmm = mixture.GaussianMixture(n_components=10, covariance_type='full',
max_iter=100).fit(X)
trace = plot_results(X, gmm.predict(X), gmm.means_, gmm.covariances_)

for i in range(0, len(trace)):
fig.append_trace(trace[i], 1, 1)


Bayesian Gaussian mixture models with a Dirichlet process prior for gamma=0.01.

In [7]:
dpgmm = mixture.BayesianGaussianMixture(
n_components=10, covariance_type='full', weight_concentration_prior=1e-2,
weight_concentration_prior_type='dirichlet_process',
mean_precision_prior=1e-2, covariance_prior=1e0 * np.eye(2),
init_params="random", max_iter=100, random_state=2).fit(X)
trace = plot_results(X, dpgmm.predict(X), dpgmm.means_, dpgmm.covariances_)
for i in range(0, len(trace)):
fig.append_trace(trace[i], 2, 1)


Gaussian mixture with a Dirichlet process prior for gamma=0.01 sampled with 2000 samples.

In [8]:
X_s, y_s = dpgmm.sample(n_samples=2000)
trace = plot_samples(X_s, y_s, dpgmm.n_components)
for i in range(0, len(trace)):
fig.append_trace(trace[i], 3, 1)


Bayesian Gaussian mixture models with a Dirichlet process prior for gamma_0=100

In [9]:
dpgmm = mixture.BayesianGaussianMixture(
n_components=10, covariance_type='full', weight_concentration_prior=1e+2,
weight_concentration_prior_type='dirichlet_process',
mean_precision_prior=1e-2, covariance_prior=1e0 * np.eye(2),
init_params="kmeans", max_iter=100, random_state=2).fit(X)
trace = plot_results(X, dpgmm.predict(X), dpgmm.means_, dpgmm.covariances_)

for i in range(0, len(trace)):
fig.append_trace(trace[i], 4, 1)


Gaussian mixture with a Dirichlet process prior for gamma=100 sampled with 2000 samples.

In [10]:
X_s, y_s = dpgmm.sample(n_samples=2000)
trace = plot_samples(X_s, y_s, dpgmm.n_components)

for i in range(0, len(trace)):
fig.append_trace(trace[i], 5, 1)

In [11]:
for i in map(str,range(1,6)):
y = 'yaxis'+i
x = 'xaxis'+i
fig['layout'][y].update(showticklabels=False, ticks='',
zeroline=False, showgrid=False)
fig['layout'][x].update(showticklabels=False, ticks='',
zeroline=False, showgrid=False)

fig['layout'].update(height=1200, hovermode='closest',
showlegend=False)

py.iplot(fig)

Out[11]:
Still need help?