# IsolationForest in Scikit-learn

An example using IsolationForest for anomaly detection.

The IsolationForest ‘isolates’ observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature.

Since recursive partitioning can be represented by a tree structure, the number of splittings required to isolate a sample is equivalent to the path length from the root node to the terminating node.

This path length, averaged over a forest of such random trees, is a measure of normality and serves as our decision function. Random partitioning produces noticeably shorter paths for anomalies. Hence, when a forest of random trees collectively produces shorter path lengths for particular samples, those samples are highly likely to be anomalies.
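To make the path-length idea concrete, here is a minimal pure-Python sketch. It is not scikit-learn's implementation: it follows a single point through repeated random splits instead of building subsampled trees, and the helper names (`isolation_path_length`, `mean_path_length`) and the toy data are assumptions made for illustration only.

```python
import random


def isolation_path_length(point, data, rng, max_depth=50):
    """Count the random splits needed to separate `point` from the rest of `data`."""
    rows = list(data)
    depth = 0
    while len(rows) > 1 and depth < max_depth:
        # Randomly select a feature, then a split value between its min and max
        feature = rng.randrange(len(point))
        values = [row[feature] for row in rows]
        lo, hi = min(values), max(values)
        if lo == hi:
            break  # this feature cannot separate the remaining rows
        split = rng.uniform(lo, hi)
        # Keep only the rows on the same side of the split as `point`
        if point[feature] < split:
            rows = [row for row in rows if row[feature] < split]
        else:
            rows = [row for row in rows if row[feature] >= split]
        depth += 1
    return depth


rng = random.Random(0)

# Toy data: one tight Gaussian cluster plus a single far-away point
cluster = [(rng.gauss(0, 0.3), rng.gauss(0, 0.3)) for _ in range(200)]
outlier = (4.0, 4.0)
data = cluster + [outlier]

# Use the most central cluster point as the "normal" reference sample
inlier = min(cluster, key=lambda p: p[0] ** 2 + p[1] ** 2)


def mean_path_length(point, n_trees=200):
    """Average the path length over many independent random partitionings."""
    total = sum(isolation_path_length(point, data, rng) for _ in range(n_trees))
    return total / n_trees


outlier_mean = mean_path_length(outlier)
inlier_mean = mean_path_length(inlier)
print("mean path length, outlier:", outlier_mean)
print("mean path length, inlier: ", inlier_mean)
```

The far-away point is usually cut off within the first few splits, while the central point needs many more, so its average path is longer.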

[1] Liu, Fei Tony, Ting, Kai Ming and Zhou, Zhi-Hua. “Isolation forest.” Data Mining, 2008. ICDM '08. Eighth IEEE International Conference on.

### Version

In [1]:
import sklearn
sklearn.__version__

Out[1]:
'0.18.1'

### Imports

This tutorial imports IsolationForest.

In [2]:
print(__doc__)

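# Note: in plotly 4 and later, this module lives in the separate
# chart_studio package (import chart_studio.plotly as py).
import plotly.plotly as py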
import plotly.graph_objs as go

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest

### Calculations

In [3]:
rng = np.random.RandomState(42)

# Generate train data
X = 0.3 * rng.randn(100, 2)
X_train = np.r_[X + 2, X - 2]
# Generate some regular novel observations
X = 0.3 * rng.randn(20, 2)
X_test = np.r_[X + 2, X - 2]
# Generate some abnormal novel observations
X_outliers = rng.uniform(low=-4, high=4, size=(20, 2))

# fit the model
clf = IsolationForest(max_samples=100, random_state=rng)
clf.fit(X_train)
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)
y_pred_outliers = clf.predict(X_outliers)

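# Evaluate the decision function on a 50 x 50 grid covering the plot area.
# xx and yy stay 1-D because go.Contour expects axis vectors plus a 2-D z.
xx = yy = np.linspace(-5, 5, 50)
grid_x, grid_y = np.meshgrid(xx, yy)
Z = clf.decision_function(np.c_[grid_x.ravel(), grid_y.ravel()])
Z = Z.reshape(grid_x.shape)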
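
The arrays returned by `predict` use +1 for samples judged to be inliers and -1 for samples judged to be outliers. As a rough sketch of how to read them, the snippet below regenerates the same data so that it runs on its own and reports the fraction of each set that the model flags. The helper `flagged_fraction` and the `summary` dictionary are names introduced here for illustration, and the exact fractions depend on the scikit-learn version and its default `contamination` setting.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)

# Same synthetic data as in the cell above
X = 0.3 * rng.randn(100, 2)
X_train = np.r_[X + 2, X - 2]
X = 0.3 * rng.randn(20, 2)
X_test = np.r_[X + 2, X - 2]
X_outliers = rng.uniform(low=-4, high=4, size=(20, 2))

clf = IsolationForest(max_samples=100, random_state=rng)
clf.fit(X_train)


def flagged_fraction(samples):
    """Fraction of samples that predict() labels as outliers (-1)."""
    return float(np.mean(clf.predict(samples) == -1))


summary = {
    "train": flagged_fraction(X_train),
    "test": flagged_fraction(X_test),
    "outliers": flagged_fraction(X_outliers),
}
for name, fraction in summary.items():
    print("%-9s flagged as outliers: %.2f" % (name, fraction))
```

If the model behaves as intended, the uniformly scattered points should be flagged far more often than the new regular observations.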


### Plot Results

In [4]:
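def matplotlib_to_plotly(cmap, pl_entries):
    # Sample the matplotlib colormap at pl_entries evenly spaced points
    h = 1.0 / (pl_entries - 1)
    pl_colorscale = []

    for k in range(pl_entries):
        # Convert RGB floats in [0, 1] to plain integers in [0, 255].
        # A list is built here because map() returns an iterator in
        # Python 3 and cannot be indexed.
        C = [int(v) for v in np.array(cmap(k * h)[:3]) * 255]
        pl_colorscale.append([k * h, 'rgb' + str((C[0], C[1], C[2]))])

    return pl_colorscale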

In [5]:
back = go.Contour(x=xx,
                  y=yy,
                  z=Z,
                  colorscale=matplotlib_to_plotly(plt.cm.Blues_r, len(Z)),
                  showscale=False,
                  line=dict(width=0)
                 )

b1 = go.Scatter(x=X_train[:, 0],
                y=X_train[:, 1],
                name="training observations",
                mode='markers',
                marker=dict(color='white', size=7,
                            line=dict(color='black', width=1))
               )
b2 = go.Scatter(x=X_test[:, 0],
                y=X_test[:, 1],
                name="new regular observations",
                mode='markers',
                marker=dict(color='green', size=6,
                            line=dict(color='black', width=1))
               )
c = go.Scatter(x=X_outliers[:, 0],
               y=X_outliers[:, 1],
               name="new abnormal observations",
               mode='markers',
               marker=dict(color='red', size=6,
                           line=dict(color='black', width=1))
              )

layout = go.Layout(title="IsolationForest",
                   hovermode='closest')
data = [back, b1, b2, c]

fig = go.Figure(data=data, layout=layout)

In [6]:
py.iplot(fig)

