
# Outlier Detection in Scikit-learn

This example illustrates the need for robust covariance estimation on a real data set. It is useful both for outlier detection and for a better understanding of the data structure.

We selected two sets of two variables from the Boston housing data set to illustrate the kind of analysis that can be done with several outlier detection tools. For visualization purposes we work with two-dimensional examples, but one should be aware that things are not so trivial in high dimensions, as pointed out below.

In both examples below, the main result is that the empirical covariance estimate, being non-robust, is highly influenced by the heterogeneous structure of the observations. The robust covariance estimate is able to focus on the main mode of the data distribution, but it still assumes that the data are Gaussian distributed, which yields a somewhat biased, though still reasonably accurate, estimate of the data structure. The One-Class SVM does not assume any parametric form for the data distribution and can therefore model the complex shape of the data much better.
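To make the contrast between the two covariance estimates concrete, here is a minimal sketch on synthetic data (not the Boston housing variables). The sample sizes, the true covariance and the uniform contamination are assumptions chosen only for illustration; `EmpiricalCovariance` and `MinCovDet` are the scikit-learn estimators behind the non-robust and robust fits.

```python
import numpy as np
from sklearn.covariance import EmpiricalCovariance, MinCovDet

rng = np.random.RandomState(0)

# Assumed toy data: a Gaussian main mode plus scattered contaminating points
true_cov = np.array([[1.0, 0.6], [0.6, 1.0]])
inliers = rng.multivariate_normal([0.0, 0.0], true_cov, size=400)
outliers = rng.uniform(-10.0, 10.0, size=(60, 2))
X = np.vstack([inliers, outliers])

empirical = EmpiricalCovariance().fit(X)
robust = MinCovDet(random_state=0).fit(X)

# Frobenius-norm error of each estimate against the main mode's covariance
err_empirical = np.linalg.norm(empirical.covariance_ - true_cov)
err_robust = np.linalg.norm(robust.covariance_ - true_cov)

print("covariance error (empirical): %.2f" % err_empirical)
print("covariance error (robust):    %.2f" % err_robust)
```

On data like this, the robust estimate should stay much closer to the covariance of the main mode, because the Minimum Covariance Determinant fit is driven by the most concentrated subset of observations.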

### Version

In [1]:
import sklearn
sklearn.__version__

Out[1]:
'0.18'

### Imports

This tutorial imports EllipticEnvelope, OneClassSVM and load_boston.

In [2]:
from plotly import tools
import plotly.plotly as py
import plotly.graph_objs as go

import numpy as np
import matplotlib.font_manager
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import load_boston
from sklearn.svm import OneClassSVM


### First Example

The first example illustrates how robust covariance estimation can help concentrate on a relevant cluster when another one exists. Here, many observations are confounded into one point, which breaks down the empirical covariance estimate. Of course, some screening tools (Support Vector Machines, Gaussian Mixture Models, univariate outlier detection, ...) would have pointed out the presence of two clusters. But had it been a high-dimensional example, none of these could have been applied that easily.
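The two-cluster situation can be sketched with synthetic data, as below. The cluster positions and sizes are hypothetical stand-ins for the housing variables, and a tight secondary cluster replaces the exactly confounded points. `support_fraction=1.` makes `EllipticEnvelope` use every observation, as in the "Empirical Covariance" entry used later in this tutorial.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.RandomState(42)

# Assumed toy data: a main cluster and a tight secondary cluster
main = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(300, 2))
secondary_center = np.array([10.0, 10.0])
secondary = rng.normal(loc=secondary_center, scale=0.3, size=(100, 2))
X = np.vstack([main, secondary])

# Fit on all observations versus the default robust (MCD) subset
empirical = EllipticEnvelope(support_fraction=1., contamination=0.261).fit(X)
robust = EllipticEnvelope(contamination=0.261, random_state=0).fit(X)

# Squared Mahalanobis distance of the secondary cluster centre under each fit
center = secondary_center.reshape(1, -1)
d2_empirical = empirical.mahalanobis(center)[0]
d2_robust = robust.mahalanobis(center)[0]

# Share of the secondary cluster that the robust envelope flags as outliers (-1)
flagged_by_robust = np.mean(robust.predict(secondary) == -1)

print("squared distance (empirical): %.1f" % d2_empirical)
print("squared distance (robust):    %.1f" % d2_robust)
print("secondary cluster flagged by robust fit: %.2f" % flagged_by_robust)
```

Under the robust fit the secondary cluster sits far outside the envelope, whereas the fit on all observations is stretched toward it and assigns it a much smaller distance.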

### Second Example

The second example shows the ability of the Minimum Covariance Determinant robust estimator of covariance to concentrate on the main mode of the data distribution: the location seems to be well estimated, although the covariance is hard to estimate because of the banana-shaped distribution. Even so, it lets us get rid of some outlying observations. The One-Class SVM is able to capture the real data structure, but the difficulty lies in adjusting its kernel bandwidth parameter so as to obtain a good compromise between the shape of the data scatter matrix and the risk of over-fitting the data.
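A rough way to explore the bandwidth trade-off is to refit the One-Class SVM with several values of `gamma` and look at how many training points become support vectors. The sketch below uses a hypothetical banana-shaped sample (a noisy parabola), and the `gamma` values are arbitrary choices for illustration, not recommendations.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)

# Assumed toy data: a noisy parabola standing in for the banana-shaped set
x = rng.uniform(-2.0, 2.0, size=300)
y = x ** 2 + rng.normal(scale=0.3, size=300)
X = np.column_stack([x, y])

nu = 0.261
sv_fraction = {}
for gamma in (0.05, 1.0, 100.0):
    clf = OneClassSVM(nu=nu, gamma=gamma).fit(X)
    # Fraction of training points kept as support vectors
    sv_fraction[gamma] = len(clf.support_) / float(len(X))
    # Fraction of training points predicted as outliers (-1)
    flagged = np.mean(clf.predict(X) == -1)
    print("gamma=%-6g support vectors: %.2f  flagged as outliers: %.2f"
          % (gamma, sv_fraction[gamma], flagged))
```

The parameter `nu` is a lower bound on the fraction of support vectors, so every fit keeps at least that share. A large `gamma` tends to rely on many more support vectors, which is one sign that the frontier is following individual points rather than the overall shape.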

### Plotting and Calculations

In [4]:
print(__doc__)

X1 = load_boston()['data'][:, [8, 10]]  # two clusters
X2 = load_boston()['data'][:, [5, 12]]  # "banana"-shaped

fig = tools.make_subplots(rows=1, cols=2,
                          subplot_titles=('Outlier detection on a real data set (boston housing)',
                                          'Outlier detection on a real data set (boston housing)'))

# Define "classifiers" to be used
classifiers = {
    "Empirical Covariance": EllipticEnvelope(support_fraction=1.,
                                             contamination=0.261),
    "Robust Covariance (Minimum Covariance Determinant)":
    EllipticEnvelope(contamination=0.261),
    "OCSVM": OneClassSVM(nu=0.261, gamma=0.05)}
colors = ['purple', 'green', 'blue']

contour_1 = []
contour_2 = []
# Learn a frontier for outlier detection with several classifiers
xx1, yy1 = np.meshgrid(np.linspace(-8, 28, 500),
                       np.linspace(3, 40, 500))
xx2, yy2 = np.meshgrid(np.linspace(3, 10, 500),
                       np.linspace(-5, 45, 500))

# y coordinates of the grid rows, used as the y axis of each contour trace
m1 = []
m2 = []
for l in range(0, len(yy1)):
    m1.append(yy1[l][0])

for l in range(0, len(yy2)):
    m2.append(yy2[l][0])

scatter2 = go.Scatter(x=X2[:, 0], y=X2[:, 1], mode="markers",
                      marker=dict(color='black'),
                      showlegend=False)
fig.append_trace(scatter2, 1, 2)

for i, (clf_name, clf) in enumerate(classifiers.items()):
    # Constant colorscale so each classifier's frontier has a single color
    colorscale = [[0, colors[i]],
                  [0.5, colors[i]],
                  [1, colors[i]]]

    # Frontier on the first data set (two clusters)
    clf.fit(X1)
    Z1 = clf.decision_function(np.c_[xx1.ravel(), yy1.ravel()])
    Z1 = Z1.reshape(xx1.shape)
    contour1 = go.Contour(z=Z1, x=xx1[0], y=m1, ncontours=1,
                          showscale=False,
                          line=dict(width=3),
                          contours=dict(coloring='lines'),
                          colorscale=colorscale)
    fig.append_trace(contour1, 1, 1)

    # Frontier on the second data set ("banana"-shaped)
    clf.fit(X2)
    Z2 = clf.decision_function(np.c_[xx2.ravel(), yy2.ravel()])
    Z2 = Z2.reshape(xx2.shape)
    contour2 = go.Contour(z=Z2, x=xx2[0], y=m2,
                          ncontours=1,
                          showscale=False,
                          line=dict(width=2),
                          contours=dict(coloring='lines'),
                          colorscale=colorscale)
    fig.append_trace(contour2, 1, 2)

scatter1 = go.Scatter(x=X1[:, 0], y=X1[:, 1], mode="markers",
                      marker=dict(color='black'), showlegend=False)
fig.append_trace(scatter1, 1, 1)

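# X1 holds columns 8 and 10: accessibility to radial highways (x)
# and pupil-teacher ratio by town (y)
fig['layout']['xaxis1'].update(title='accessibility to radial highways',
                               zeroline=False)
fig['layout']['yaxis1'].update(title='pupil-teacher ratio by town')

fig['layout']['xaxis2'].update(title='average number of rooms per dwelling')
fig['layout']['yaxis2'].update(zeroline=False)

fig['layout'].update(annotations=[
    dict(
        x=23.81,
        y=20.20,
        xref='x',  # annotation references use axis ids ('x', 'y'), not layout keys
        yref='y',
        text='Several Confounded Points',
        bordercolor='black',
        borderwidth=1,
        bgcolor='rgb(211,211,211)',
        showarrow=True)]
)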

Automatically created module for IPython interactive environment
This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]
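The frontier drawn for each classifier is the zero level of its decision function evaluated on a grid. The sketch below reproduces that step without Plotly, on assumed synthetic data, and also checks how `contamination` relates to the share of training points that end up outside the frontier. The grid limits and sample are illustrative only, and the sign convention shown is that of recent scikit-learn releases, which may differ from version 0.18.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.RandomState(1)

# Assumed toy data with different spreads along each axis
X = rng.normal(size=(400, 2)) * [0.7, 5.0] + [6.0, 12.0]

clf = EllipticEnvelope(contamination=0.261, random_state=0).fit(X)

# Evaluate the decision function on a regular grid, as done for the contours
xx, yy = np.meshgrid(np.linspace(3, 10, 100), np.linspace(-5, 45, 100))
grid = np.c_[xx.ravel(), yy.ravel()]
Z = clf.decision_function(grid).reshape(xx.shape)

# Grid cells predicted as inliers (+1) are those with a non-negative score
inside = clf.predict(grid).reshape(xx.shape) == 1

# Share of training points falling outside the learned frontier
train_flagged = np.mean(clf.predict(X) == -1)
print("training points flagged as outliers: %.3f" % train_flagged)
```

The flagged share should land close to the requested `contamination`, which is why the tutorial's value of 0.261 directly controls how tight each elliptic frontier is.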


In [5]:
py.iplot(fig, filename="outlier")

Out[5]:

    Virgile Fritsch <virgile.fritsch@inria.fr>

    BSD 3 clause