
# Robust vs Empirical Covariance Estimate in Scikit-learn

The usual maximum likelihood estimate of covariance is very sensitive to the presence of outliers in the data set. In such cases, it is better to use a robust estimator of covariance so that the estimation is resistant to "erroneous" observations in the data set.

### Minimum Covariance Determinant Estimator¶

The Minimum Covariance Determinant (MCD) estimator is a robust, high-breakdown-point estimator of covariance: it can be used to estimate the covariance matrix of highly contaminated datasets, with up to (n_samples - n_features - 1)/2 outliers. The idea is to find the (n_samples + n_features + 1)/2 observations whose empirical covariance has the smallest determinant, yielding a "pure" subset of observations from which to compute standard estimates of location and covariance. After a correction step aimed at compensating for the fact that the estimates were learned from only a portion of the initial data, we end up with robust estimates of the data set's location and covariance.
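As a quick sanity check, the subset size and breakdown point for this example's dimensions (n_samples=80, n_features=5, used in the Calculations section below) follow directly from the formulas above; this is an illustrative sketch, not part of the original example:

```python
# Breakdown arithmetic for the MCD estimator, following the formulas above.
# Dimensions match the example settings used later in this tutorial.
n_samples, n_features = 80, 5

# size of the "pure" subset whose empirical covariance determinant is minimized
subset_size = (n_samples + n_features + 1) // 2   # 43 observations

# maximum number of outliers the MCD estimator can withstand
max_outliers = (n_samples - n_features - 1) // 2  # 37 outliers
```

So with 80 samples in 5 dimensions, MCD searches for the 43-observation subset with the smallest covariance determinant and remains usable with up to 37 contaminated observations.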

The Minimum Covariance Determinant (MCD) estimator was introduced by P. J. Rousseeuw [1].

### Evaluation¶

In this example, we compare the estimation errors made when using various types of location and covariance estimates on contaminated, Gaussian-distributed data sets:

- The mean and the empirical covariance of the full dataset, which break down as soon as there are outliers in the data set.
- The robust MCD, which has a low error provided n_samples > 5 * n_features.
- The mean and the empirical covariance of the observations that are known to be good ones. This can be considered a "perfect" MCD estimation, so one can trust our implementation by comparing to this case.
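The comparison described above can be previewed on a small contaminated sample before running the full experiment; this is an illustrative sketch with hypothetical variable names, not part of the original example:

```python
import numpy as np
from sklearn.covariance import EmpiricalCovariance, MinCovDet

rng = np.random.RandomState(42)
n_samples, n_features, n_outliers = 80, 5, 20
X = rng.randn(n_samples, n_features)  # true location is 0, true covariance is I
X[:n_outliers] += 10.0                # contaminate the first 20 observations

# robust vs. empirical location estimates, measured against the true mean (0)
err_loc_mcd = np.sum(MinCovDet(random_state=42).fit(X).location_ ** 2)
err_loc_emp = np.sum(X.mean(0) ** 2)
# the robust estimate stays close to the truth; the empirical mean breaks down
```

With 25% contamination, the empirical mean is dragged far from the origin while the MCD location estimate remains close to the true mean.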

#### New to Plotly?¶

You can set up Plotly to work in online or offline mode, or in Jupyter notebooks.
We also have a quick-reference cheatsheet (new!) to help you get started!

### Version¶

In [1]:
import sklearn
sklearn.__version__

Out[1]:
'0.18'

### Imports¶

This tutorial imports EmpiricalCovariance and MinCovDet.

In [1]:
import plotly.plotly as py
import plotly.graph_objs as go

print(__doc__)

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.font_manager

from sklearn.covariance import EmpiricalCovariance, MinCovDet

Automatically created module for IPython interactive environment


### Calculations¶

In [2]:
# example settings
n_samples = 80
n_features = 5
repeat = 10

range_n_outliers = np.concatenate(
    (np.linspace(0, n_samples / 8, 5),
     np.linspace(n_samples / 8, n_samples / 2, 5)[1:-1])).astype(int)

# definition of arrays to store results
err_loc_mcd = np.zeros((range_n_outliers.size, repeat))
err_cov_mcd = np.zeros((range_n_outliers.size, repeat))
err_loc_emp_full = np.zeros((range_n_outliers.size, repeat))
err_cov_emp_full = np.zeros((range_n_outliers.size, repeat))
err_loc_emp_pure = np.zeros((range_n_outliers.size, repeat))
err_cov_emp_pure = np.zeros((range_n_outliers.size, repeat))

# computation
for i, n_outliers in enumerate(range_n_outliers):
    for j in range(repeat):

        rng = np.random.RandomState(i * j)

        # generate data
        X = rng.randn(n_samples, n_features)
        # contaminate a random subset of the observations
        outliers_index = rng.permutation(n_samples)[:n_outliers]
        outliers_offset = 10. * \
            (np.random.randint(2, size=(n_outliers, n_features)) - 0.5)
        X[outliers_index] += outliers_offset
        # keep track of the uncontaminated observations
        inliers_mask = np.ones(n_samples).astype(bool)
        inliers_mask[outliers_index] = False

        # fit a Minimum Covariance Determinant (MCD) robust estimator to data
        mcd = MinCovDet().fit(X)
        # compare raw robust estimates with the true location and covariance
        err_loc_mcd[i, j] = np.sum(mcd.location_ ** 2)
        err_cov_mcd[i, j] = mcd.error_norm(np.eye(n_features))

        # compare estimators learned from the full data set with true
        # parameters
        err_loc_emp_full[i, j] = np.sum(X.mean(0) ** 2)
        err_cov_emp_full[i, j] = EmpiricalCovariance().fit(X).error_norm(
            np.eye(n_features))

        # compare with an empirical covariance learned from a pure data set
        # (i.e. "perfect" mcd)
        pure_X = X[inliers_mask]
        pure_location = pure_X.mean(0)
        pure_emp_cov = EmpiricalCovariance().fit(pure_X)
        err_loc_emp_pure[i, j] = np.sum(pure_location ** 2)
        err_cov_emp_pure[i, j] = pure_emp_cov.error_norm(np.eye(n_features))

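A note on `error_norm`, used throughout the loop above: by default it returns the squared Frobenius norm of the difference between the estimated covariance and a reference matrix, scaled by the number of features (parameters `norm='frobenius'`, `scaling=True`, `squared=True`). A minimal sketch checking this against a manual computation:

```python
import numpy as np
from sklearn.covariance import EmpiricalCovariance

rng = np.random.RandomState(0)
X = rng.randn(200, 3)

est = EmpiricalCovariance().fit(X)
target = np.eye(3)  # the true covariance of standard normal data

# default error_norm: squared Frobenius norm of the difference, / n_features
manual = np.sum((est.covariance_ - target) ** 2) / 3
reported = est.error_norm(target)
```

This is why comparing `err_cov_*` values across estimators is meaningful: they all measure the same scaled squared distance to the true (identity) covariance.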

### Plot Results¶

Influence of outliers on the location estimation

In [3]:
robust_location = go.Scatter(x=range_n_outliers,
                             y=err_loc_mcd.mean(1),
                             error_y=dict(visible=True,
                                          arrayminus=err_loc_mcd.std(1) / np.sqrt(repeat)),
                             name="Robust location",
                             mode='lines',
                             line=dict(color='magenta')
                             )

full_data_set_mean = go.Scatter(x=range_n_outliers,
                                y=err_loc_emp_full.mean(1),
                                error_y=dict(visible=True,
                                             arrayminus=err_loc_emp_full.std(1) / np.sqrt(repeat)),
                                mode='lines',
                                name="Full data set mean",
                                line=dict(color='green')
                                )

pure_data_set_mean = go.Scatter(x=range_n_outliers,
                                y=err_loc_emp_pure.mean(1),
                                error_y=dict(visible=True,
                                             arrayminus=err_loc_emp_pure.std(1) / np.sqrt(repeat)),
                                mode='lines',
                                name="Pure data set mean",
                                line=dict(color='black')
                                )

layout = go.Layout(title='Influence of outliers on the location estimation',
                   yaxis=dict(title="Error"),
                   xaxis=dict(title='Amount of contamination (%)'))

fig = go.Figure(data=[robust_location, pure_data_set_mean, full_data_set_mean],
                layout=layout)

In [4]:
py.iplot(fig)

Out[4]:

Influence of outliers on the covariance estimation

In [5]:
x_size = range_n_outliers.size

robust_covariance = go.Scatter(x=range_n_outliers,
                               y=err_cov_mcd.mean(1),
                               error_y=dict(visible=True,
                                            arrayminus=err_cov_mcd.std(1)),
                               name="Robust covariance (mcd)",
                               mode='lines',
                               line=dict(color='magenta')
                               )

full_data_set1 = go.Scatter(x=range_n_outliers[:(x_size // 5 + 1)],
                            y=err_cov_emp_full.mean(1)[:(x_size // 5 + 1)],
                            error_y=dict(visible=True,
                                         arrayminus=err_cov_emp_full.std(1)[:(x_size // 5 + 1)]),
                            name="Full data set empirical covariance",
                            mode='lines',
                            line=dict(color='green')
                            )

full_data_set2 = go.Scatter(x=range_n_outliers[(x_size // 5):(x_size // 2 - 1)],
                            y=err_cov_emp_full.mean(1)[(x_size // 5):(x_size // 2 - 1)],
                            name="Full data set empirical covariance",
                            showlegend=False,
                            mode='lines',
                            line=dict(color='green',
                                      dash='dash')
                            )

pure_data_set = go.Scatter(x=range_n_outliers,
                           y=err_cov_emp_pure.mean(1),
                           error_y=dict(visible=True,
                                        arrayminus=err_cov_emp_pure.std(1)),
                           mode='lines',
                           name="Pure data set empirical covariance",
                           line=dict(color='black')
                           )

layout = go.Layout(title='Influence of outliers on the covariance estimation',
                   yaxis=dict(title="RMSE"),
                   xaxis=dict(title='Amount of contamination (%)')
                   )

fig = go.Figure(data=[robust_covariance, full_data_set1, full_data_set2, pure_data_set],
                layout=layout)

In [6]:
py.iplot(fig)

Out[6]:

### References¶

1. P. J. Rousseeuw. Least median of squares regression. Journal of the American Statistical Association, 79(388):871-880, 1984.

2. J. Hardin and D. M. Rocke. The distribution of robust distances. Journal of Computational and Graphical Statistics, 14(4):928-946, December 2005.

3. A. M. Zoubir, V. Koivunen, Y. Chakhchoukh and M. Muma. Robust estimation in signal processing: A tutorial-style treatment of fundamental concepts. IEEE Signal Processing Magazine, 29(4):61-80, 2012.
