
# Outlier Detection with Several Methods in Scikit-learn

This example illustrates three ways of performing novelty and outlier detection when the amount of contamination is known:

• based on a robust estimator of covariance, which assumes that the data are Gaussian distributed and performs better than the One-Class SVM in that case;

• using the One-Class SVM and its ability to capture the shape of the data set, hence performing better when the data are strongly non-Gaussian, e.g. with two well-separated clusters;

• using the Isolation Forest algorithm, which is based on random forests and hence better suited to high-dimensional settings, even though it also performs quite well in the low-dimensional examples below.
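
As a minimal sketch of the three approaches listed above, the snippet below fits each detector on a small synthetic data set (two Gaussian clusters plus uniform noise) and counts how many points each one labels differently from the known ground truth. The fixed seed, the cluster offset of 2, and the names `detectors` and `n_errors` are illustrative assumptions; no particular error counts are implied.

```python
import numpy as np
from sklearn import svm
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
n_samples, outliers_fraction = 200, 0.25
n_outliers = int(outliers_fraction * n_samples)
n_inliers = n_samples - n_outliers

# Two well-separated Gaussian clusters (inliers), then uniform noise (outliers)
X = np.r_[0.3 * rng.randn(n_inliers // 2, 2) - 2,
          0.3 * rng.randn(n_inliers // 2, 2) + 2,
          rng.uniform(low=-6, high=6, size=(n_outliers, 2))]

# Ground truth uses the scikit-learn convention: +1 for inliers, -1 for outliers
ground_truth = np.ones(n_samples, dtype=int)
ground_truth[-n_outliers:] = -1

detectors = {
    "Robust covariance": EllipticEnvelope(contamination=outliers_fraction,
                                          random_state=0),
    "One-Class SVM": svm.OneClassSVM(nu=0.95 * outliers_fraction + 0.05,
                                     kernel="rbf", gamma=0.1),
    "Isolation Forest": IsolationForest(contamination=outliers_fraction,
                                        random_state=0),
}

n_errors = {}
for name, clf in detectors.items():
    # predict returns +1 for points judged inliers and -1 for outliers
    y_pred = clf.fit(X).predict(X)
    n_errors[name] = int((y_pred != ground_truth).sum())
    print(name, "errors:", n_errors[name])
```

All three estimators share the same `fit` / `predict` / `decision_function` interface, which is what lets the example loop over them uniformly.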

The ground truth about inliers and outliers is given by the point colors (white markers for true inliers, black markers for true outliers), while the blue-shaded contours show the decision function learned by each method.

Here, we assume that we know the fraction of outliers in the datasets. Thus, rather than using the 'predict' method of the objects, we set the threshold on the decision_function to separate out the corresponding fraction.
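
The thresholding idea described above can be sketched as follows: compute the decision function on the training data, take the percentile that matches the known outlier fraction, and flag everything at or below that score. This is a hedged illustration, not the exact code of this example; the synthetic data, the seed, and the use of `np.percentile` (instead of `scipy.stats.scoreatpercentile`) are assumptions.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.RandomState(0)
n_samples, outliers_fraction = 200, 0.25
n_outliers = int(outliers_fraction * n_samples)
n_inliers = n_samples - n_outliers

# One Gaussian cluster of inliers plus uniformly scattered outliers
X = np.r_[0.3 * rng.randn(n_inliers, 2),
          rng.uniform(low=-6, high=6, size=(n_outliers, 2))]

clf = EllipticEnvelope(random_state=0).fit(X)

# Higher scores mean "more normal"; ravel() guards against a column-vector result
scores = clf.decision_function(X).ravel()

# Score below which the assumed fraction of points falls
threshold = np.percentile(scores, 100 * outliers_fraction)

# Label the lowest-scoring fraction as outliers (-1), the rest as inliers (+1)
y_pred = np.where(scores > threshold, 1, -1)
print("flagged as outliers:", int((y_pred == -1).sum()), "of", n_samples)
```

Because the cut is placed at a percentile of the scores, the number of flagged points is fixed by `outliers_fraction` rather than by the estimator's own default offset.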


### Version

In [1]:
import sklearn
sklearn.__version__

Out[1]:
'0.18'

### Imports

This tutorial imports EllipticEnvelope and IsolationForest.

In [2]:
print(__doc__)

import plotly.plotly as py
import plotly.graph_objs as go
from plotly import tools

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import matplotlib.font_manager

from sklearn import svm
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest

Automatically created module for IPython interactive environment


### Calculations

In [3]:
rng = np.random.RandomState(42)

# Example settings
n_samples = 200
outliers_fraction = 0.25
clusters_separation = [0, 1, 2]

# define the three outlier detection tools to be compared
classifiers = {
    "One-Class SVM": svm.OneClassSVM(nu=0.95 * outliers_fraction + 0.05,
                                     kernel="rbf", gamma=0.1),
    "Robust covariance": EllipticEnvelope(contamination=outliers_fraction),
    "Isolation Forest": IsolationForest(max_samples=n_samples,
                                        contamination=outliers_fraction,
                                        random_state=rng)}

# Compare given classifiers under given settings
xx, yy = np.meshgrid(np.linspace(-7, 7, 500), np.linspace(-7, 7, 500))
n_inliers = int((1. - outliers_fraction) * n_samples)
n_outliers = int(outliers_fraction * n_samples)
ground_truth = np.ones(n_samples, dtype=int)
ground_truth[-n_outliers:] = -1
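
To make the settings above concrete, the following sketch works through the arithmetic for the values used in this example. It only restates quantities already defined in the cell; the interpretation of `nu` (an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors) comes from the scikit-learn documentation for `OneClassSVM`.

```python
import numpy as np

n_samples = 200
outliers_fraction = 0.25

# nu passed to OneClassSVM: slightly above the outlier fraction (about 0.2875)
nu = 0.95 * outliers_fraction + 0.05

# Split of the 200 samples into inliers and outliers
n_inliers = int((1. - outliers_fraction) * n_samples)
n_outliers = int(outliers_fraction * n_samples)

# Labels follow the scikit-learn convention: +1 inlier, -1 outlier.
# The outliers are the last n_outliers rows of the data set.
ground_truth = np.ones(n_samples, dtype=int)
ground_truth[-n_outliers:] = -1

print("nu:", round(nu, 4))
print("inliers:", n_inliers, "outliers:", n_outliers)
print("labels of last three samples:", ground_truth[-3:])
```

Since the outliers are appended last when the data are generated, slicing with `[-n_outliers:]` is enough to mark them in `ground_truth`.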


### Plot Results

In [4]:
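# 3x3 grid: one row per cluster separation, one column per detector.
# The error counts in the titles are hard-coded from the original run.
fig = tools.make_subplots(rows=3, cols=3,
                          print_grid=False,
                          subplot_titles=('1. Isolation Forest(Errors 0)',
                                          '2. One-Class SVM (Errors 8)',
                                          '3. Robust Covariance (Errors 0)',

                                          '1. Isolation Forest(Errors 2)',
                                          '2. One-Class SVM (Errors 10)',
                                          '3. Robust Covariance (Errors 8)',

                                          '1. Isolation Forest(Errors 6)',
                                          '2. One-Class SVM (Errors 14)',
                                          '3. Robust Covariance (Errors 14)')
                          )

def matplotlib_to_plotly(cmap, pl_entries):
    """Sample a matplotlib colormap into a Plotly colorscale list."""
    h = 1.0/(pl_entries-1)
    pl_colorscale = []

    for k in range(pl_entries):
        # Convert the RGB floats in [0, 1] to integers in [0, 255].
        # A list is built explicitly because map() is lazy in Python 3.
        C = [int(v) for v in np.array(cmap(k*h)[:3])*255]
        pl_colorscale.append([k*h, 'rgb'+str((C[0], C[1], C[2]))])

    return pl_colorscale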
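
To show what the colorscale helper produces, here is a standalone variant written for Python 3 (where `map` returns an iterator rather than a list), followed by a call on the `Blues` colormap. Importing the colormap through `matplotlib.cm` and the variable name `colorscale` are choices made for this sketch.

```python
import numpy as np
from matplotlib import cm

def matplotlib_to_plotly(cmap, pl_entries):
    """Sample pl_entries evenly spaced colors from a matplotlib colormap."""
    h = 1.0 / (pl_entries - 1)
    pl_colorscale = []
    for k in range(pl_entries):
        # cmap(...) returns RGBA floats in [0, 1]; keep RGB and scale to 0..255
        r, g, b = (int(v) for v in np.array(cmap(k * h)[:3]) * 255)
        pl_colorscale.append([k * h, 'rgb' + str((r, g, b))])
    return pl_colorscale

colorscale = matplotlib_to_plotly(cm.Blues, 10)
for position, color in colorscale:
    print(round(position, 3), color)
```

Each entry pairs a normalized position between 0 and 1 with an `rgb(r, g, b)` string, which is the list-of-pairs form that Plotly accepts for the `colorscale` argument.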

In [5]:
row = 1

# Fit the problem with varying cluster separation
for i, offset in enumerate(clusters_separation):
    np.random.seed(42)
    # Data generation
    X1 = 0.3 * np.random.randn(n_inliers // 2, 2) - offset
    X2 = 0.3 * np.random.randn(n_inliers // 2, 2) + offset
    X = np.r_[X1, X2]
    # Add outliers
    X = np.r_[X, np.random.uniform(low=-6, high=6, size=(n_outliers, 2))]

    # j indexes the subplot column; a separate name avoids shadowing the outer i
    for j, (clf_name, clf) in enumerate(classifiers.items()):

        # fit the data and tag outliers
        clf.fit(X)
        scores_pred = clf.decision_function(X)
        # score below which the assumed fraction of outliers falls
        # (computed for reference; not used by the Plotly traces below)
        threshold = stats.scoreatpercentile(scores_pred,
                                            100 * outliers_fraction)
        y_pred = clf.predict(X)
        n_errors = (y_pred != ground_truth).sum()

        # plot the level lines and the points
        Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
        Z = Z.reshape(xx.shape)

        # x values run along a row of xx, y values along a column of yy
        back = go.Contour(x=xx[0],
                          y=yy[:, 0],
                          z=Z,
                          contours=dict(showlines=False,),
                          showscale=False,
                          colorscale=matplotlib_to_plotly(plt.cm.Blues, 10))

        b = go.Scatter(x=X[:-n_outliers, 0],
                       y=X[:-n_outliers, 1],
                       showlegend=False,
                       name='True Inliers',
                       mode='markers',
                       marker=dict(color='white', line=dict(color='black', width=1))
                       )

        c = go.Scatter(x=X[-n_outliers:, 0],
                       y=X[-n_outliers:, 1],
                       showlegend=False,
                       name='True Outliers',
                       mode='markers',
                       marker=dict(color='black')
                       )
        fig.append_trace(back, row, j+1)
        fig.append_trace(b, row, j+1)
        fig.append_trace(c, row, j+1)

    # move to the next subplot row for the next cluster separation
    row += 1

fig['layout'].update(height=900,
                     hovermode='closest')

# hide tick labels on all nine subplots
for i in map(str, range(1, 10)):
    x = 'xaxis' + i
    y = 'yaxis' + i
    fig['layout'][x].update(showticklabels=False, ticks='')
    fig['layout'][y].update(showticklabels=False, ticks='')

In [6]:
py.iplot(fig)

Out[6]: