
GMM Covariances in Scikit-learn

Demonstration of several covariance types for Gaussian mixture models.

See Gaussian mixture models for more information on the estimator.

Although GMMs are often used for clustering, we can compare the obtained clusters with the actual classes from the dataset. We initialize the means of the Gaussians with the means of the classes from the training set to make this comparison valid.

We plot predicted labels on both training and held-out test data using a variety of GMM covariance types on the iris dataset. We compare GMMs with spherical, diagonal, full, and tied covariance matrices in increasing order of performance. Although one would expect full covariance to perform best in general, it is prone to overfitting on small datasets and does not generalize well to held-out test data. On the plots, training data is shown as filled dots, while test data is shown as open circles. The iris dataset is four-dimensional; only the first two dimensions are shown here, so some points that are separated in the remaining dimensions appear to overlap.
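The four covariance_type options constrain the fitted covariances differently, and the constraint shows up directly in the shape of the fitted covariances_ attribute. Here is a minimal standalone sketch, assuming three components on the four iris features:

from sklearn import datasets
from sklearn.mixture import GaussianMixture

iris = datasets.load_iris()
for cov_type in ['spherical', 'diag', 'tied', 'full']:
    gmm = GaussianMixture(n_components=3, covariance_type=cov_type,
                          random_state=0).fit(iris.data)
    # spherical -> (3,), diag -> (3, 4), tied -> (4, 4), full -> (3, 4, 4)
    print(cov_type, gmm.covariances_.shape)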

New to Plotly?

Plotly's Python library is free and open source! Get started by downloading the client and reading the primer.
You can set up Plotly to work in online or offline mode, or in Jupyter notebooks.
We also have a quick-reference cheatsheet (new!) to help you get started!
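If you would rather run this tutorial without a Plotly account, the library also ships an offline mode. A minimal setup sketch:

import plotly.offline
plotly.offline.init_notebook_mode(connected=True)  # render figures inline in the notebook
# plotly.offline.iplot(fig) can then stand in for the py.iplot(fig) call at the end of this tutorial.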

Version

In [1]:
import sklearn
sklearn.__version__
Out[1]:
'0.18.1'

Imports

This tutorial imports GaussianMixture and StratifiedKFold.

In [2]:
import plotly.plotly as py
import plotly.graph_objs as go
from plotly import tools

import numpy as np
import math
from sklearn import datasets
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import StratifiedKFold

Calculations

In [3]:
colors = ['navy', 'turquoise', 'darkorange']

def make_ellipses(gmm):
    """Build one scatter trace per component tracing its covariance ellipse."""
    data_ = []
    for n, color in enumerate(colors):
        # Extract the covariance of the first two features for component n.
        if gmm.covariance_type == 'full':
            covariances = gmm.covariances_[n][:2, :2]
        elif gmm.covariance_type == 'tied':
            covariances = gmm.covariances_[:2, :2]
        elif gmm.covariance_type == 'diag':
            covariances = np.diag(gmm.covariances_[n][:2])
        elif gmm.covariance_type == 'spherical':
            covariances = np.eye(gmm.means_.shape[1]) * gmm.covariances_[n]
        v, w = np.linalg.eigh(covariances)
        # Orientation of the major axis: the eigenvector of the largest eigenvalue.
        u = w[:, 1] / np.linalg.norm(w[:, 1])
        angle = math.atan2(u[1], u[0])
        v = 2. * np.sqrt(2.) * np.sqrt(v)  # scale eigenvalues to ellipse axis lengths

        a = v[1]  # axis along the major direction
        b = v[0]  # axis along the minor direction
        x_origin = gmm.means_[n, :2][0]
        y_origin = gmm.means_[n, :2][1]
        x_ = []
        y_ = []

        # Trace the ellipse, rotating the parametrized points by angle so that
        # full and tied covariances are drawn with their true orientation.
        for t in range(0, 361, 10):
            rad = math.radians(t)
            x_.append(x_origin + a * math.cos(rad) * math.cos(angle)
                      - b * math.sin(rad) * math.sin(angle))
            y_.append(y_origin + a * math.cos(rad) * math.sin(angle)
                      + b * math.sin(rad) * math.cos(angle))

        elle = go.Scatter(x=x_, y=y_, mode='lines',
                          showlegend=False,
                          line=dict(color=color, width=2))
        data_.append(elle)

    return data_

iris = datasets.load_iris()

# Break up the dataset into non-overlapping training (75%) and testing
# (25%) sets.
skf = StratifiedKFold(n_splits=4)
# Only take the first fold.
train_index, test_index = next(iter(skf.split(iris.data, iris.target)))


X_train = iris.data[train_index]
y_train = iris.target[train_index]
X_test = iris.data[test_index]
y_test = iris.target[test_index]

n_classes = len(np.unique(y_train))
titles = []
data_ = []
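To see what make_ellipses is computing, here is a quick standalone check of the eigendecomposition on a hand-picked 2x2 covariance matrix (the values are made up for illustration):

cov = np.array([[2.0, 1.2],
                [1.2, 1.0]])
v, w = np.linalg.eigh(cov)            # eigenvalues ascending, eigenvectors in columns
angle = math.atan2(w[1, 1], w[0, 1])  # orientation of the major axis
print('axis lengths:', 2. * np.sqrt(2.) * np.sqrt(v))
print('rotation (deg): %.1f' % math.degrees(angle))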

Plot Results

In [4]:
# Try GMMs using different types of covariances.
estimators = dict((cov_type, GaussianMixture(n_components=n_classes,
                   covariance_type=cov_type, max_iter=20, random_state=0))
                  for cov_type in ['spherical', 'diag', 'tied', 'full'])

n_estimators = len(estimators)

for index, (name, estimator) in enumerate(estimators.items()):
    # Since we have class labels for the training data, we can
    # initialize the GMM parameters in a supervised manner.
    estimator.means_init = np.array([X_train[y_train == i].mean(axis=0)
                                    for i in range(n_classes)])

    # Train the other parameters using the EM algorithm.
    estimator.fit(X_train)
    data_.append(make_ellipses(estimator))

    # Show the legend only once, on the first subplot.
    leg = (index == 0)

    # Plot the full dataset, colored by true class.
    for n, color in enumerate(colors):
        data = iris.data[iris.target == n]
        trace = go.Scatter(x=data[:, 0], y=data[:, 1], 
                           mode='markers',
                           marker=dict(color=color),
                           showlegend=leg,
                           name=iris.target_names[n])
        data_[index].append(trace)
        
    # Plot the test data with circles
    for n, color in enumerate(colors):
        data = X_test[y_test == n]
        trace = go.Scatter(x=data[:, 0], y=data[:, 1], 
                           mode='markers',
                           showlegend=False,
                           marker=dict(color='white', size=14,
                                       line=dict(color=color, width=1)))
        data_[index].append(trace)

    y_train_pred = estimator.predict(X_train)
    train_accuracy = np.mean(y_train_pred.ravel() == y_train.ravel()) * 100

    y_test_pred = estimator.predict(X_test)
    test_accuracy = np.mean(y_test_pred.ravel() == y_test.ravel()) * 100

    titles.append(name +
                  '<br> Train accuracy: %.1f' % train_accuracy +
                  '<br> Test accuracy: %.1f' % test_accuracy)
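Accuracy against the true labels is one way to compare the covariance types; since the estimators are now fit, you could also rank them by average per-sample log-likelihood on the held-out data. A sketch of that comparison (a hypothetical addition, not part of the original example):

for name, estimator in estimators.items():
    # GaussianMixture.score returns the mean log-likelihood per sample.
    print('%-9s test log-likelihood: %.2f' % (name, estimator.score(X_test)))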
In [5]:
fig = tools.make_subplots(rows=2, cols=2,
                          print_grid=False,
                          subplot_titles=tuple(titles[:4]))
fig['layout'].update(height=900, hovermode='closest')

for i in range(len(data_)):
    for j in range(len(data_[i])):
        # Map estimator index i onto the 2x2 grid: 0->(1,1), 1->(1,2), 2->(2,1), 3->(2,2).
        # Integer division is required; i / 2 would produce a float row index on Python 3.
        fig.append_trace(data_[i][j], i // 2 + 1, i % 2 + 1)
In [6]:
py.iplot(fig)
Out[6]:
(Interactive figure: a 2x2 grid of subplots, one per covariance type, each showing the component ellipses, training points as filled dots, and test points as open circles, with train/test accuracy in the subplot titles.)
License

Author:

    Ron Weiss <ronweiss@gmail.com>, Gael Varoquaux

License:

    BSD 3 clause