
Sparse Inverse Covariance Estimation in Scikit-learn

Using the GraphLasso estimator to learn a covariance and a sparse precision matrix from a small number of samples. When estimating a probabilistic model (e.g. a Gaussian model), estimating the precision matrix, that is the inverse covariance matrix, is just as important as estimating the covariance matrix: indeed, a Gaussian model is parametrized by its precision matrix.
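
As a minimal illustration (my addition, not part of the original example), the precision matrix is simply the matrix inverse of the covariance matrix, and in a Gaussian model its zero entries encode conditional independence between the corresponding pair of variables:

import numpy as np
from scipy import linalg

cov_2d = np.array([[1.0, 0.8],
                   [0.8, 1.0]])  # toy 2-dimensional covariance matrix
prec_2d = linalg.inv(cov_2d)     # the corresponding precision matrix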

To be in favorable recovery conditions, we sample the data from a model with a sparse inverse covariance matrix. In addition, we ensure that the data are not too correlated (by limiting the largest coefficient of the precision matrix) and that there are no small coefficients in the precision matrix that cannot be recovered. Moreover, with a small number of observations, it is easier to recover a correlation matrix than a covariance matrix, so we scale the time series.

Here, the number of samples is slightly larger than the number of dimensions, so the empirical covariance is still invertible. However, as the observations are strongly correlated, the empirical covariance matrix is ill-conditioned, and as a result its inverse, the empirical precision matrix, is very far from the ground truth. If we use l2 shrinkage, as with the Ledoit-Wolf estimator, the small number of samples forces us to shrink a lot. As a result, the Ledoit-Wolf precision is fairly close to the ground truth precision, which is not far from being diagonal, but its off-diagonal structure is lost.
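
For intuition, here is a minimal sketch (my addition, with a hand-picked shrinkage value; the Ledoit-Wolf estimator chooses the shrinkage analytically) showing why heavy l2 shrinkage flattens the off-diagonal structure: the shrunk covariance is a convex combination of the empirical covariance and a scaled identity.

import numpy as np

def shrink_covariance(emp_cov, shrinkage):
    # Blend the empirical covariance with mu * I, where mu is the average
    # variance trace(S) / n_features; shrinkage close to 1 gives a matrix
    # that is nearly diagonal.
    n_features = emp_cov.shape[0]
    mu = np.trace(emp_cov) / n_features
    return (1. - shrinkage) * emp_cov + shrinkage * mu * np.eye(n_features)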

The l1-penalized estimator can recover part of this off-diagonal structure: it learns a sparse precision matrix. It is not able to recover the exact sparsity pattern: it detects too many non-zero coefficients. However, the highest non-zero coefficients of the l1 estimate correspond to the non-zero coefficients of the ground truth. Finally, the coefficients of the l1 precision estimate are biased toward zero: because of the penalty, they are all smaller than the corresponding ground truth values, as can be seen in the figure.
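
For background (my addition, not part of the original example): the graphical lasso estimates the precision matrix K by minimizing tr(S K) - log det(K) + alpha * ||K||_1 over positive-definite matrices, where S is the empirical covariance and ||K||_1 is the sum of absolute values of the off-diagonal coefficients; larger alpha values therefore drive more coefficients exactly to zero. With a fixed, hand-picked alpha the estimator can be used directly:

from sklearn.covariance import GraphLasso

# The alpha value here is arbitrary, chosen only to illustrate the effect
# of the penalty strength.
sparse_model = GraphLasso(alpha=0.05)
# After sparse_model.fit(X), the covariance_ and precision_ attributes are
# available; more entries of precision_ are zeroed out as alpha grows.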

Note that the color range of the precision matrices is tweaked to improve readability of the figure: the full range of values of the empirical precision is not displayed.

The alpha parameter of the GraphLasso, which sets the sparsity of the model, is chosen by internal cross-validation in GraphLassoCV. As can be seen in the model selection figure at the end, the grid on which the cross-validation score is computed is iteratively refined in the neighborhood of the maximum.
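
This refinement can be tuned through the constructor; a minimal sketch (the values shown are, to my knowledge, the defaults in scikit-learn 0.18):

from sklearn.covariance import GraphLassoCV

# alphas: number of points on each grid of alpha values to try;
# n_refinements: how many times the grid is refined around the best alpha.
model = GraphLassoCV(alphas=4, n_refinements=4)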


Version

In [1]:
import sklearn
sklearn.__version__
Out[1]:
'0.18'

Imports

This tutorial imports make_sparse_spd_matrix, GraphLassoCV and ledoit_wolf.

In [2]:
import plotly.plotly as py
import plotly.graph_objs as go
from plotly import tools


import numpy as np
from scipy import linalg
from sklearn.datasets import make_sparse_spd_matrix
from sklearn.covariance import GraphLassoCV, ledoit_wolf
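
Note that these names are tied to the scikit-learn version pinned above: in scikit-learn 0.20, GraphLasso and GraphLassoCV were renamed to GraphicalLasso and GraphicalLassoCV (the old names were later removed), so on a modern version the equivalent import would be:

from sklearn.covariance import GraphicalLassoCV, ledoit_wolf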

Calculations

Generate the data

In [3]:
n_samples = 60
n_features = 20

prng = np.random.RandomState(1)
# Ground-truth sparse precision matrix; alpha here controls the probability
# of zero coefficients, not a regularization strength.
prec = make_sparse_spd_matrix(n_features, alpha=.98,
                              smallest_coef=.4,
                              largest_coef=.7,
                              random_state=prng)
cov = linalg.inv(prec)
# Rescale the covariance to a correlation matrix (unit diagonal) and apply
# the matching rescaling to the precision matrix.
d = np.sqrt(np.diag(cov))
cov /= d
cov /= d[:, np.newaxis]
prec *= d
prec *= d[:, np.newaxis]
# Sample the observations and standardize each feature.
X = prng.multivariate_normal(np.zeros(n_features), cov, size=n_samples)
X -= X.mean(axis=0)
X /= X.std(axis=0)
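
A quick sanity check (my addition, not part of the original notebook) confirms that the rescaled matrix is indeed a correlation matrix and that the samples have been standardized:

assert np.allclose(np.diag(cov), 1.)    # unit diagonal: a correlation matrix
assert np.allclose(X.mean(axis=0), 0.)  # each feature is centered
assert np.allclose(X.std(axis=0), 1.)   # each feature has unit variance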

Estimate the covariance

In [4]:
# Maximum-likelihood estimate of the covariance (X is already centered).
emp_cov = np.dot(X.T, X) / n_samples

# Sparse inverse covariance estimate, with alpha set by cross-validation.
model = GraphLassoCV()
model.fit(X)
cov_ = model.covariance_
prec_ = model.precision_

# Ledoit-Wolf shrunk covariance and its inverse, for comparison.
lw_cov_, _ = ledoit_wolf(X)
lw_prec_ = linalg.inv(lw_cov_)
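
As a quick check of the claims above (my addition, not part of the original notebook), we can compare the sparsity pattern of the l1 estimate with the ground truth; the estimate typically keeps the true non-zeros but adds spurious ones:

true_support = prec != 0
est_support = prec_ != 0
print('non-zeros in true precision:      %d' % true_support.sum())
print('non-zeros in estimated precision: %d' % est_support.sum())
print('true non-zeros recovered:         %d'
      % (true_support & est_support).sum())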

Plot the Covariances

In [5]:
fig = tools.make_subplots(rows=1, cols=4,
                          subplot_titles=('Empirical','Ledoit-Wolf',
                                          'GraphLasso', 'True')
                         )

covs = [('Empirical', emp_cov), ('Ledoit-Wolf', lw_cov_),
        ('GraphLasso', cov_), ('True', cov)]

for i, (name, this_cov) in enumerate(covs):
    trace = go.Heatmap(z=this_cov,
                       showscale=False,
                       colorscale='RdBu_r')
    fig.append_trace(trace, 1, i+1)

# Hide the ticks and reverse the y axes so the heatmaps read like matrices.
for i in map(str, range(1, 5)):
    y = 'yaxis' + i
    x = 'xaxis' + i
    fig['layout'][y].update(autorange='reversed',
                            showticklabels=False, ticks='')
    fig['layout'][x].update(showticklabels=False, ticks='')

fig['layout'].update(title='Covariances')
This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]  [ (1,3) x3,y3 ]  [ (1,4) x4,y4 ]

In [6]:
py.iplot(fig)
Out[6]: (interactive figure: heatmaps of the Empirical, Ledoit-Wolf, GraphLasso, and True covariance matrices)

Plot the Precisions

In [7]:
fig = tools.make_subplots(rows=1, cols=4,
                          subplot_titles=('Empirical','Ledoit-Wolf',
                                          'GraphLasso', 'True')
                         )

precs = [('Empirical', linalg.inv(emp_cov)), ('Ledoit-Wolf', lw_prec_),
         ('GraphLasso', prec_), ('True', prec)]
# Clip the color range: the empirical precision has extreme values that
# would otherwise wash out the structure of the other matrices.
vmax = .9 * prec_.max()

for i, (name, this_prec) in enumerate(precs):
    # Mask exact zeros so they show up as the gray plot background.
    trace = go.Heatmap(z=np.ma.masked_equal(this_prec, 0),
                       zmin=-vmax, zmax=vmax,
                       showscale=False,
                       colorscale='RdBu_r')
    fig.append_trace(trace, 1, i+1)

for i in map(str, range(1, 5)):
    y = 'yaxis' + i
    x = 'xaxis' + i
    fig['layout'][y].update(autorange='reversed',
                            showticklabels=False, ticks='',
                            zeroline=False, showgrid=False)
    fig['layout'][x].update(showticklabels=False, ticks='',
                            zeroline=False, showgrid=False)

fig['layout'].update(title='Precisions',
                     plot_bgcolor='gray')
This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]  [ (1,3) x3,y3 ]  [ (1,4) x4,y4 ]

In [8]:
py.iplot(fig)
Out[8]: (interactive figure: heatmaps of the Empirical, Ledoit-Wolf, GraphLasso, and True precision matrices; exact zeros are masked and show as the gray background)

Plot the Model Selection Metric

In [9]:
# Mean cross-validation score for each alpha tried by GraphLassoCV (the
# grid_scores attribute was renamed grid_scores_ in later scikit-learn
# versions).
trace = go.Scatter(x=model.cv_alphas_,
                   y=np.mean(model.grid_scores, axis=1),
                   showlegend=False,
                  )

# Vertical dashed line marking the alpha selected by cross-validation;
# the y range is hard-coded to span the score curve.
axis_line = go.Scatter(x=[model.alpha_, model.alpha_],
                       y=[-45, -20],
                       showlegend=False,
                       mode='lines',
                       line=dict(color='black',
                                 width=1,
                                 dash='dash'))

data = [trace, axis_line]

layout = go.Layout(title='Model selection',
                   xaxis=dict(title='alpha'),
                   yaxis=dict(title='Cross-validation score'))
fig = go.Figure(data=data, layout=layout)
In [10]:
py.iplot(fig)
Out[10]: (interactive figure: cross-validation score as a function of alpha, with the selected alpha marked by the dashed line)
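
As a small follow-up (my addition, not part of the original notebook), the selected penalty is exposed directly on the fitted estimator:

print('alpha chosen by cross-validation: %.4f' % model.alpha_)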

License

Author:

    Gael Varoquaux <gael.varoquaux@inria.fr>

License:

    BSD 3 clause

Copyright:

    INRIA