# Outlier Detection with Several Methods in Scikit-learn

This example illustrates three different ways of performing novelty and outlier detection when the amount of contamination is known:

• based on a robust estimator of covariance, which assumes that the data are Gaussian distributed and performs better than the One-Class SVM in that case;

• using the One-Class SVM and its ability to capture the shape of the data set, hence performing better when the data are strongly non-Gaussian, for example with two well-separated clusters;

• using the Isolation Forest algorithm, which is based on random forests and hence better suited to high-dimensional settings, even though it also performs quite well in the two-dimensional examples below.

The ground truth about inliers and outliers is given by the colors of the points (white markers for true inliers, black markers for true outliers), while the shaded contour background shows the decision function of each method.

Here, we assume that we know the fraction of outliers in the datasets. Thus, rather than using the 'predict' method of the objects, we set the threshold on the decision_function to separate out the corresponding fraction.
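The thresholding idea can be sketched in isolation. This is a minimal illustration rather than the tutorial's own code: the helper name `labels_from_scores` and the score values are made up, and in practice the scores would come from a fitted detector's decision_function.

```python
import numpy as np

def labels_from_scores(scores, outliers_fraction):
    """Label the lowest-scoring fraction as outliers (-1), the rest as inliers (+1)."""
    scores = np.asarray(scores, dtype=float).ravel()
    # Score below which `outliers_fraction` of the samples fall
    threshold = np.percentile(scores, 100 * outliers_fraction)
    labels = np.where(scores > threshold, 1, -1)
    return labels, threshold

# Made-up scores for illustration; real ones would come from clf.decision_function(X)
scores = np.array([2.1, 1.8, -0.5, 1.9, -3.0, 2.4, 1.7, 2.0])
labels, threshold = labels_from_scores(scores, outliers_fraction=0.25)
print(threshold)  # about 1.15
print(labels)     # [ 1  1 -1  1 -1  1  1  1]
```

With 8 samples and a fraction of 0.25, the two lowest scores are labeled as outliers, whatever cutoff the detector would apply by default.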

### Version

In [1]:
import sklearn
sklearn.__version__

Out[1]:
'0.18'

### Imports

This tutorial imports EllipticEnvelope and IsolationForest.

In [2]:
print(__doc__)

import plotly.plotly as py
import plotly.graph_objs as go
from plotly import tools

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import matplotlib.font_manager

from sklearn import svm
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest

Automatically created module for IPython interactive environment


### Calculations

In [3]:
rng = np.random.RandomState(42)

# Example settings
n_samples = 200
outliers_fraction = 0.25
clusters_separation = [0, 1, 2]

# define the three outlier detection tools to be compared
classifiers = {
    "One-Class SVM": svm.OneClassSVM(nu=0.95 * outliers_fraction + 0.05,
                                     kernel="rbf", gamma=0.1),
    "Robust covariance": EllipticEnvelope(contamination=outliers_fraction),
    "Isolation Forest": IsolationForest(max_samples=n_samples,
                                        contamination=outliers_fraction,
                                        random_state=rng)}

# Compare given classifiers under given settings
xx, yy = np.meshgrid(np.linspace(-7, 7, 500), np.linspace(-7, 7, 500))
n_inliers = int((1. - outliers_fraction) * n_samples)
n_outliers = int(outliers_fraction * n_samples)
ground_truth = np.ones(n_samples, dtype=int)
ground_truth[-n_outliers:] = -1
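As a small sketch of how this label layout is used, the snippet below builds the same kind of ground-truth vector for a tiny sample and counts disagreements with a prediction vector. The sizes and the `y_pred` values are hypothetical and chosen only for illustration.

```python
import numpy as np

# Small values for illustration only; the tutorial uses n_samples = 200
n_samples, outliers_fraction = 8, 0.25
n_outliers = int(outliers_fraction * n_samples)

ground_truth = np.ones(n_samples, dtype=int)
ground_truth[-n_outliers:] = -1  # outliers are stored last

# Hypothetical predictions: one inlier wrongly flagged, one outlier missed
y_pred = np.array([1, 1, -1, 1, 1, 1, -1, 1])
n_errors = int((y_pred != ground_truth).sum())

print(ground_truth)  # [ 1  1  1  1  1  1 -1 -1]
print(n_errors)      # 2
```

Because the outliers occupy the last n_outliers positions, the data generated later must append the outlier points after the inlier points for the comparison to be meaningful.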


### Plot Results

In [4]:
# Subplot titles are hard-coded and assume the classifiers are plotted in this column order
fig = tools.make_subplots(rows=3, cols=3,
                          print_grid=False,
                          subplot_titles=('1. Isolation Forest (Errors 0)',
                                          '2. One-Class SVM (Errors 8)',
                                          '3. Robust Covariance (Errors 0)',

                                          '1. Isolation Forest (Errors 2)',
                                          '2. One-Class SVM (Errors 10)',
                                          '3. Robust Covariance (Errors 8)',

                                          '1. Isolation Forest (Errors 6)',
                                          '2. One-Class SVM (Errors 14)',
                                          '3. Robust Covariance (Errors 14)')
                          )

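def matplotlib_to_plotly(cmap, pl_entries):
    # Sample a Matplotlib colormap at pl_entries evenly spaced points
    # and return it as a Plotly colorscale
    h = 1.0/(pl_entries-1)
    pl_colorscale = []

    for k in range(pl_entries):
        # Build a list of plain ints: on Python 3, map() returns an iterator
        # that cannot be indexed
        C = [int(c) for c in np.array(cmap(k*h)[:3])*255]
        pl_colorscale.append([k*h, 'rgb'+str((C[0], C[1], C[2]))])

    return pl_colorscale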
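To make the colorscale format concrete, here is a dependency-free sketch of the same conversion. The names `to_plotly_colorscale` and `gray` are hypothetical; `gray` is a stand-in for a Matplotlib colormap, which is likewise called with a value in [0, 1] and returns an RGBA tuple.

```python
def to_plotly_colorscale(cmap, n_entries):
    """Sample `cmap` at n_entries evenly spaced points in [0, 1]."""
    step = 1.0 / (n_entries - 1)
    colorscale = []
    for k in range(n_entries):
        # Keep the RGB channels, drop alpha, and scale to 0..255 (truncating)
        r, g, b = (int(255 * channel) for channel in cmap(k * step)[:3])
        colorscale.append([k * step, "rgb({}, {}, {})".format(r, g, b)])
    return colorscale

# Hypothetical stand-in for a Matplotlib colormap: maps t in [0, 1] to an RGBA gray
def gray(t):
    return (t, t, t, 1.0)

result = to_plotly_colorscale(gray, 3)
print(result)
# [[0.0, 'rgb(0, 0, 0)'], [0.5, 'rgb(127, 127, 127)'], [1.0, 'rgb(255, 255, 255)']]
```

Each entry pairs a position between 0 and 1 with an 'rgb(...)' string, which is the list-of-pairs form passed as the colorscale argument of the contour trace.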

In [5]:
row = 1

# Fit the problem with varying cluster separation
for offset in clusters_separation:
    np.random.seed(42)
    # Data generation: two Gaussian inlier clusters, then uniform outliers
    X1 = 0.3 * np.random.randn(n_inliers // 2, 2) - offset
    X2 = 0.3 * np.random.randn(n_inliers // 2, 2) + offset
    X = np.r_[X1, X2]
    X = np.r_[X, np.random.uniform(low=-6, high=6, size=(n_outliers, 2))]

    for col, (clf_name, clf) in enumerate(classifiers.items(), start=1):

        # fit the data and tag outliers
        clf.fit(X)
        scores_pred = np.ravel(clf.decision_function(X))
        threshold = stats.scoreatpercentile(scores_pred,
                                            100 * outliers_fraction)
        # label the lowest-scoring fraction as outliers (-1), the rest as inliers (+1)
        y_pred = np.where(scores_pred > threshold, 1, -1)
        n_errors = (y_pred != ground_truth).sum()

        # plot the level lines and the points
        Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
        Z = Z.reshape(xx.shape)

        back = go.Contour(x=xx[0],
                          y=yy[:, 0],
                          z=Z,
                          contours=dict(showlines=False,),
                          showscale=False,
                          colorscale=matplotlib_to_plotly(plt.cm.Blues, 10))

        b = go.Scatter(x=X[:-n_outliers, 0],
                       y=X[:-n_outliers, 1],
                       showlegend=False,
                       name='True Inliers',
                       mode='markers',
                       marker=dict(color='white', line=dict(color='black', width=1))
                       )

        c = go.Scatter(x=X[-n_outliers:, 0],
                       y=X[-n_outliers:, 1],
                       showlegend=False,
                       name='True Outliers',
                       mode='markers',
                       marker=dict(color='black')
                       )
        fig.append_trace(back, row, col)
        fig.append_trace(b, row, col)
        fig.append_trace(c, row, col)

    # move to the next subplot row for the next cluster separation
    row += 1

fig['layout'].update(height=900,
                     hovermode='closest')

# Hide tick labels on all nine subplots
for i in map(str, range(1, 10)):
    x = 'xaxis' + i
    y = 'yaxis' + i
    fig['layout'][x].update(showticklabels=False, ticks='')
    fig['layout'][y].update(showticklabels=False, ticks='')

In [6]:
py.iplot(fig)

Out[6]: