
Hashing Feature Transformation using Totally Random Trees in Scikit-learn

RandomTreesEmbedding provides a way to map data to a very high-dimensional, sparse representation, which might be beneficial for classification. The mapping is completely unsupervised and very efficient.
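As a rough illustration of what "high-dimensional, sparse" means here, the sketch below fits an embedding on the same kind of synthetic data used later in this example and inspects the result. The names `X_demo`, `embedder`, `X_demo_hashed` and `nonzeros_per_row` are placeholders chosen for this sketch. It assumes the default sparse output, where each sample gets one active feature per tree (the leaf it falls into).

```python
import numpy as np
from scipy import sparse
from sklearn.datasets import make_circles
from sklearn.ensemble import RandomTreesEmbedding

# Small synthetic dataset (100 samples, 2 features by default)
X_demo, _ = make_circles(factor=0.5, random_state=0, noise=0.05)

# 10 totally random trees, each with at most 2**3 = 8 leaves
embedder = RandomTreesEmbedding(n_estimators=10, max_depth=3, random_state=0)
X_demo_hashed = embedder.fit_transform(X_demo)

# Count the non-zero entries in every row of the sparse result
nonzeros_per_row = np.asarray((X_demo_hashed != 0).sum(axis=1)).ravel()

print("sparse output:", sparse.issparse(X_demo_hashed))
print("shape:", X_demo_hashed.shape)
print("non-zeros per sample:", np.unique(nonzeros_per_row))
```

Each row is a concatenation of one-hot leaf indicators, so the number of columns is bounded by `n_estimators * 2 ** max_depth`, while only `n_estimators` entries per row are non-zero.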

This example visualizes the partitions given by several trees and shows how the transformation can also be used for non-linear dimensionality reduction or non-linear classification.

Neighboring points often fall into the same leaf of a tree and therefore share large parts of their hashed representation. This makes it possible to separate two concentric circles using only the principal components of the transformed data, computed with truncated SVD.
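The claim about neighboring points can be checked informally. The sketch below counts, for every sample, how many trees place it in the same leaf as its nearest neighbor and as its farthest point. It is only an illustration under the same settings as this example; `shared`, `nearest`, `farthest` and the two mean variables are names introduced for this sketch.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.ensemble import RandomTreesEmbedding
from sklearn.metrics import pairwise_distances

X_demo, _ = make_circles(factor=0.5, random_state=0, noise=0.05)
embedder = RandomTreesEmbedding(n_estimators=10, max_depth=3, random_state=0)
X_demo_hashed = embedder.fit_transform(X_demo)

# Entry (i, j) is the number of trees in which samples i and j share a leaf
shared = X_demo_hashed.dot(X_demo_hashed.T).toarray()

# Nearest and farthest other sample in the original 2D space
dist = pairwise_distances(X_demo)
np.fill_diagonal(dist, np.inf)    # exclude each point from its own nearest search
nearest = dist.argmin(axis=1)
np.fill_diagonal(dist, -np.inf)   # exclude each point from its own farthest search
farthest = dist.argmax(axis=1)

rows = np.arange(len(X_demo))
mean_shared_nearest = shared[rows, nearest].mean()
mean_shared_farthest = shared[rows, farthest].mean()

print("mean shared leaves with nearest neighbor:", mean_shared_nearest)
print("mean shared leaves with farthest point:  ", mean_shared_farthest)
```

The diagonal of `shared` equals the number of trees, since every sample shares all of its leaves with itself.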

In high-dimensional spaces, linear classifiers often achieve excellent accuracy. For sparse binary data, BernoulliNB is particularly well suited. The final figure compares the decision boundary obtained by BernoulliNB in the transformed space with that of an ExtraTreesClassifier forest learned on the original data.
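One way to combine the two steps is to chain the embedding and the classifier in a scikit-learn pipeline. The following is a minimal sketch rather than part of the original example: the larger sample size, the train/test split and the names `model` and `accuracy` are assumptions made here so that a held-out score can be computed.

```python
from sklearn.datasets import make_circles
from sklearn.ensemble import RandomTreesEmbedding
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Larger sample than in the main example so a held-out set is meaningful
X_demo, y_demo = make_circles(n_samples=400, factor=0.5,
                              random_state=0, noise=0.05)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, random_state=0)

# Unsupervised hashing step followed by Naive Bayes on the binary features
model = make_pipeline(
    RandomTreesEmbedding(n_estimators=10, max_depth=3, random_state=0),
    BernoulliNB(),
)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print("held-out accuracy:", accuracy)
```

The embedding ignores the labels during fitting, so only the BernoulliNB step is supervised.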


Version¶

In [1]:
import sklearn
sklearn.__version__

Out[1]:
'0.18.1'

Imports¶

In [2]:
import plotly.plotly as py
import plotly.graph_objs as go
from plotly import tools

import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import make_circles
from sklearn.ensemble import RandomTreesEmbedding, ExtraTreesClassifier
from sklearn.decomposition import TruncatedSVD
from sklearn.naive_bayes import BernoulliNB


Calculations¶

In [3]:
# make a synthetic dataset
X, y = make_circles(factor=0.5, random_state=0, noise=0.05)

# use RandomTreesEmbedding to transform data
hasher = RandomTreesEmbedding(n_estimators=10, random_state=0, max_depth=3)
X_transformed = hasher.fit_transform(X)

# Visualize result after dimensionality reduction using truncated SVD
svd = TruncatedSVD(n_components=2)
X_reduced = svd.fit_transform(X_transformed)

# Learn a Naive Bayes classifier on the transformed data
nb = BernoulliNB()
nb.fit(X_transformed, y)

# Learn an ExtraTreesClassifier for comparison
trees = ExtraTreesClassifier(max_depth=3, n_estimators=10, random_state=0)
trees.fit(X, y)

Out[3]:
ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
max_depth=3, max_features='auto', max_leaf_nodes=None,
min_impurity_split=1e-07, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=10, n_jobs=1, oob_score=False, random_state=0,
verbose=0, warm_start=False)
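The dimensionality reduction step can be examined on its own. The sketch below applies TruncatedSVD directly to the sparse embedding and reports how much variance the two components capture. Fixing `random_state` on the SVD is an addition made here for reproducibility, and the `_demo` names are placeholders for this sketch.

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomTreesEmbedding

X_demo, _ = make_circles(factor=0.5, random_state=0, noise=0.05)
embedder = RandomTreesEmbedding(n_estimators=10, max_depth=3, random_state=0)
X_demo_hashed = embedder.fit_transform(X_demo)

# TruncatedSVD accepts the sparse matrix directly, without densifying it
svd_demo = TruncatedSVD(n_components=2, random_state=0)
X_demo_reduced = svd_demo.fit_transform(X_demo_hashed)

print("reduced shape:", X_demo_reduced.shape)
print("explained variance ratio:", svd_demo.explained_variance_ratio_)
```

Unlike PCA, truncated SVD does not center the data, which is what allows it to work on sparse input.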

Scatter Plot of Original and Reduced Data¶

In [4]:
fig = tools.make_subplots(rows=1, cols=2,
subplot_titles=("Original Data (2d)",
"Truncated SVD reduction (2d)<br>of transformed data (%dd)" %
X_transformed.shape[1])
)

original = go.Scatter(x=X[:, 0], y=X[:, 1],
mode='markers',
showlegend=False,
marker=dict(color=y,
colorscale='Viridis'),
)
fig.append_trace(original, 1, 1)

reduced = go.Scatter(x=X_reduced[:, 0], y=X_reduced[:, 1],
mode='markers',
showlegend=False,
marker=dict(color=y,
colorscale='Viridis')
)
fig.append_trace(reduced, 1, 2)

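# Hide grid lines, zero lines and ticks on both subplots.
# The loop variables are named x_axis / y_axis so that the label array y is not overwritten.
for i in map(str, range(1, 3)):
    x_axis = 'xaxis' + i
    y_axis = 'yaxis' + i
    fig['layout'][x_axis].update(showgrid=False, zeroline=False,
                                 showticklabels=False, ticks='')
    fig['layout'][y_axis].update(showgrid=False, zeroline=False,
                                 showticklabels=False, ticks='')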

This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]


In [5]:
py.iplot(fig)

Out[5]:

Plot the Decision Surface in the Original Space¶

We assign a color to each point of the mesh [x_min, x_max] x [y_min, y_max], based on the predicted class probability at that point.
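The mesh evaluation follows a common NumPy pattern: build the grid, flatten it into a list of points, evaluate a function on the points, then reshape the result back to the grid shape. The tiny sketch below uses a made-up stand-in function instead of a classifier so the layout is easy to follow; all names ending in `_demo`, as well as `grid_points`, `values` and `z`, are illustrative.

```python
import numpy as np

# A 3 x 2 grid: x takes the values 0, 1, 2 and y takes the values 0, 1
xx_demo, yy_demo = np.meshgrid(np.arange(0.0, 3.0, 1.0),
                               np.arange(0.0, 2.0, 1.0))

# Flatten the grid into an (n_points, 2) array of (x, y) pairs
grid_points = np.c_[xx_demo.ravel(), yy_demo.ravel()]

# Stand-in for predict_proba(...)[:, 1]: any function of the points works
values = grid_points[:, 0] + 10 * grid_points[:, 1]

# Reshape back so that z[row, col] belongs to (xx_demo[0][col], yy_demo[:, 0][row])
z = values.reshape(xx_demo.shape)

print(xx_demo[0])      # x coordinate of each column
print(yy_demo[:, 0])   # y coordinate of each row
print(z)
```

In this layout the column coordinates are given by the first row of `xx_demo` and the row coordinates by the first column of `yy_demo`, which is the pairing a heatmap of `z` expects for its x and y axes.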

In [6]:
h = .01
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# transform grid using RandomTreesEmbedding
transformed_grid = hasher.transform(np.c_[xx.ravel(), yy.ravel()])
y_grid_pred = nb.predict_proba(transformed_grid)[:, 1]

fig = tools.make_subplots(rows=1, cols=2,
subplot_titles=("Naive Bayes on Transformed data",
"ExtraTrees predictions")
)

# Class labels used to color the points. They are regenerated with the same
# seed so that this cell does not rely on the name `y` still holding the labels.
_, labels = make_circles(factor=0.5, random_state=0, noise=0.05)

# Heatmap axes: columns follow xx[0], rows follow yy[:, 0]
trace1 = go.Heatmap(x=xx[0], y=yy[:, 0], z=y_grid_pred.reshape(xx.shape),
                    colorscale='Viridis',
                    showscale=False)
trace2 = go.Scatter(x=X[:, 0], y=X[:, 1],
                    mode='markers',
                    showlegend=False,
                    marker=dict(size=10,
                                color=labels,
                                colorscale='Viridis',
                                line=dict(color='black', width=1))
                    )

fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 1)

# predict class probabilities on the grid using ExtraTreesClassifier
y_grid_pred = trees.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]

trace3 = go.Heatmap(x=xx[0], y=yy[:, 0],
                    z=y_grid_pred.reshape(xx.shape),
                    colorscale='Viridis',
                    showscale=False)

trace4 = go.Scatter(x=X[:, 0], y=X[:, 1],
                    mode='markers',
                    showlegend=False,
                    marker=dict(size=10,
                                color=labels,
                                colorscale='Viridis',
                                line=dict(color='black', width=1))
                    )
fig.append_trace(trace3, 1, 2)
fig.append_trace(trace4, 1, 2)

# Hide grid lines, zero lines and ticks on both subplots
for i in map(str, range(1, 3)):
    x_axis = 'xaxis' + i
    y_axis = 'yaxis' + i
    fig['layout'][x_axis].update(showgrid=False, zeroline=False,
                                 showticklabels=False, ticks='')
    fig['layout'][y_axis].update(showgrid=False, zeroline=False,
                                 showticklabels=False, ticks='')

This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]


In [7]:
py.iplot(fig)

Out[7]: