
IsolationForest in Scikit-learn

An example using IsolationForest for anomaly detection.

The IsolationForest ‘isolates’ observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature.

Since recursive partitioning can be represented by a tree structure, the number of splittings required to isolate a sample is equivalent to the path length from the root node to the terminating node.
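The normalization behind this idea comes from [1]: the expected path length of an unsuccessful search in a binary search tree, c(n) = 2H(n−1) − 2(n−1)/n, where H(i) is the harmonic number (approximated as ln(i) plus Euler's constant), makes path lengths comparable across sub-sample sizes. A small sketch of that constant (the function name here is illustrative, not part of scikit-learn's public API):

```python
import numpy as np

def average_path_length(n):
    """Expected path length c(n) of an unsuccessful BST search,
    used in [1] to normalize isolation-tree path lengths."""
    if n <= 1:
        return 0.0
    harmonic = np.log(n - 1) + np.euler_gamma  # H(i) ~ ln(i) + Euler-Mascheroni constant
    return 2.0 * harmonic - 2.0 * (n - 1) / n

# The paper defines the anomaly score as s(x, n) = 2 ** (-E[h(x)] / c(n)),
# where E[h(x)] is the path length averaged over the trees:
# s close to 1 suggests an anomaly, s well below 0.5 a normal point.
c256 = average_path_length(256)
```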

This path length, averaged over a forest of such random trees, is a measure of normality and our decision function. Random partitioning produces noticeably shorter paths for anomalies. Hence, when a forest of random trees collectively produces shorter path lengths for particular samples, those samples are highly likely to be anomalies.
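As a minimal illustration of that decision function (not part of the original example; the cluster and the anomaly location are arbitrary choices), a lone far-away point receives a lower score than points inside a tight cluster:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X_normal = 0.3 * rng.randn(200, 2)   # tight cluster around the origin
X_anomaly = np.array([[4.0, 4.0]])   # a single point far from the cluster

clf = IsolationForest(random_state=0).fit(X_normal)

# A lower decision_function value means a shorter average path length,
# i.e. the sample was easier to isolate and is more likely an anomaly.
score_normal = clf.decision_function(X_normal).mean()
score_anomaly = clf.decision_function(X_anomaly)[0]
```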

[1] Liu, Fei Tony, Ting, Kai Ming, and Zhou, Zhi-Hua. “Isolation Forest.” 2008 Eighth IEEE International Conference on Data Mining (ICDM ’08), IEEE, 2008.

New to Plotly?

Plotly's Python library is free and open source! Get started by downloading the client and reading the primer.
You can set up Plotly to work in online or offline mode, or in jupyter notebooks.
We also have a quick-reference cheatsheet (new!) to help you get started!

Version

In [1]:
import sklearn
sklearn.__version__
Out[1]:
'0.18.1'

Imports

This tutorial imports IsolationForest.

In [2]:
print(__doc__)

import plotly.plotly as py
import plotly.graph_objs as go

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest
Automatically created module for IPython interactive environment

Calculations

In [3]:
rng = np.random.RandomState(42)

# Generate train data
X = 0.3 * rng.randn(100, 2)
X_train = np.r_[X + 2, X - 2]
# Generate some regular novel observations
X = 0.3 * rng.randn(20, 2)
X_test = np.r_[X + 2, X - 2]
# Generate some abnormal novel observations
X_outliers = rng.uniform(low=-4, high=4, size=(20, 2))

# fit the model
clf = IsolationForest(max_samples=100, random_state=rng)
clf.fit(X_train)
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)
y_pred_outliers = clf.predict(X_outliers)

# plot the line, the samples, and the nearest vectors to the plane
# evaluate the decision function on a 50x50 grid
xx, yy = np.meshgrid(np.linspace(-5, 5, 50), np.linspace(-5, 5, 50))
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
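As a quick sanity check on the fit above (written as a self-contained snippet that regenerates the same data), `predict` returns +1 for inliers and -1 for outliers, so the fraction of correctly labelled points in each set can be measured directly:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = 0.3 * rng.randn(100, 2)
X_train = np.r_[X + 2, X - 2]           # two training clusters
X = 0.3 * rng.randn(20, 2)
X_test = np.r_[X + 2, X - 2]            # new regular observations
X_outliers = rng.uniform(low=-4, high=4, size=(20, 2))

clf = IsolationForest(max_samples=100, random_state=rng)
clf.fit(X_train)

# fraction of regular test points labelled +1, and of outliers labelled -1
acc_test = (clf.predict(X_test) == 1).mean()
acc_out = (clf.predict(X_outliers) == -1).mean()
```

Most of the new regular observations should be labelled inliers, and most of the uniformly scattered points should be labelled outliers; a few uniform points that happen to land near the clusters will be missed.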

Plot Results

In [4]:
def matplotlib_to_plotly(cmap, pl_entries):
    """Convert a matplotlib colormap to a Plotly colorscale."""
    h = 1.0 / (pl_entries - 1)
    pl_colorscale = []

    for k in range(pl_entries):
        # list() is needed on Python 3, where map returns an iterator
        C = list(map(np.uint8, np.array(cmap(k * h)[:3]) * 255))
        pl_colorscale.append([k * h, 'rgb' + str((C[0], C[1], C[2]))])

    return pl_colorscale
In [5]:
back = go.Contour(x=np.linspace(-5, 5, 50),
                  y=np.linspace(-5, 5, 50),
                  z=Z,
                  colorscale=matplotlib_to_plotly(plt.cm.Blues_r, len(Z)),
                  showscale=False,
                  line=dict(width=0)
                 )

b1 = go.Scatter(x=X_train[:, 0],
                y=X_train[:, 1],
                name="training observations",
                mode='markers',
                marker=dict(color='white', size=7,
                            line=dict(color='black', width=1))
               )
b2 = go.Scatter(x=X_test[:, 0], 
                y=X_test[:, 1], 
                name="new regular observations",
                mode='markers',
                marker=dict(color='green', size=6,
                            line=dict(color='black', width=1))
               )
c = go.Scatter(x=X_outliers[:, 0], 
               y=X_outliers[:, 1],
               name="new abnormal observations",
               mode='markers',
               marker=dict(color='red', size=6,
                           line=dict(color='black', width=1))
              )

layout = go.Layout(title="IsolationForest",
                   hovermode='closest')
data = [back, b1, b2, c]

fig = go.Figure(data=data, layout=layout)
In [6]:
py.iplot(fig)
Out[6]: