
# Robust Scaling on Toy Data in Scikit-learn

Making sure that each feature has approximately the same scale can be a crucial preprocessing step. However, when data contains outliers, StandardScaler can often be misled. In such cases, it is better to use a scaler that is robust to outliers.

Here, we demonstrate this on a toy dataset in which a single data point is a large outlier.
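
As a minimal standalone illustration (separate from the dataset built below), the following sketch shows why the two scalers behave differently: StandardScaler centers on the mean and scales by the standard deviation, both of which a single outlier can drag far off, while RobustScaler centers on the median and scales by the interquartile range, which are barely affected.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

# One feature: nine ordinary values plus one large outlier
x = np.array([[1.0], [2.0], [2.5], [3.0], [3.0],
              [3.5], [4.0], [4.5], [5.0], [1000.0]])

# StandardScaler's center is the mean, dominated by the outlier
print(StandardScaler().fit(x).mean_)    # mean ~ 102.85

# RobustScaler's center is the median, which ignores the outlier
print(RobustScaler().fit(x).center_)    # median = 3.25
```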

#### New to Plotly?

Plotly's Python library is free and open source! Get started by downloading the client and reading the primer.
You can set up Plotly to work in online or offline mode, or in jupyter notebooks.
We also have a quick-reference cheatsheet (new!) to help you get started!

### Version

In [1]:
import sklearn
sklearn.__version__

Out[1]:
'0.18.1'

### Imports

This tutorial imports StandardScaler and RobustScaler.

In [2]:
from __future__ import print_function
print(__doc__)

import plotly.plotly as py
import plotly.graph_objs as go
from plotly import tools

import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

Automatically created module for IPython interactive environment


### Calculations

In [3]:
# Create training and test data
np.random.seed(42)
n_datapoints = 100
Cov = [[0.9, 0.0], [0.0, 20.0]]
mu1 = [100.0, -3.0]
mu2 = [101.0, -3.0]
X1 = np.random.multivariate_normal(mean=mu1, cov=Cov, size=n_datapoints)
X2 = np.random.multivariate_normal(mean=mu2, cov=Cov, size=n_datapoints)
Y_train = np.hstack([[-1]*n_datapoints, [1]*n_datapoints])
X_train = np.vstack([X1, X2])

X1 = np.random.multivariate_normal(mean=mu1, cov=Cov, size=n_datapoints)
X2 = np.random.multivariate_normal(mean=mu2, cov=Cov, size=n_datapoints)
Y_test = np.hstack([[-1]*n_datapoints, [1]*n_datapoints])
X_test = np.vstack([X1, X2])

X_train[0, 0] = -1000  # a fairly large outlier

# Scale data
standard_scaler = StandardScaler()
Xtr_s = standard_scaler.fit_transform(X_train)
Xte_s = standard_scaler.transform(X_test)

robust_scaler = RobustScaler()
Xtr_r = robust_scaler.fit_transform(X_train)
Xte_r = robust_scaler.transform(X_test)
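
To see the effect of the injected outlier on the fitted statistics, it can help to inspect the scalers directly. The following is a sketch on synthetic data built the same way as above (not the exact arrays from this notebook): the outlier drags the mean of the first feature well below 100 and inflates its standard deviation, while the median stays near 100.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

np.random.seed(42)
X = np.random.multivariate_normal(mean=[100.0, -3.0],
                                  cov=[[0.9, 0.0], [0.0, 20.0]],
                                  size=200)
X[0, 0] = -1000  # inject one large outlier, as in the tutorial

# StandardScaler's mean/std for feature 0 are badly distorted;
# RobustScaler's median is essentially unchanged.
print("StandardScaler center:", StandardScaler().fit(X).mean_)
print("RobustScaler   center:", RobustScaler().fit(X).center_)
```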


### Plot Results

In [4]:
fig = tools.make_subplots(rows=1, cols=3,
                          print_grid=False,
                          subplot_titles=("Unscaled data",
                                          "After standard scaling (zoomed in)",
                                          "After robust scaling (zoomed in)"))

fig.append_trace(go.Scatter(x=X_train[:, 0],
                            y=X_train[:, 1],
                            mode='markers',
                            marker=dict(color=np.where(Y_train > 0,
                                                       'red', 'blue'))),
                 1, 1)

fig.append_trace(go.Scatter(x=Xtr_s[:, 0],
                            y=Xtr_s[:, 1],
                            mode='markers',
                            marker=dict(color=np.where(Y_train > 0,
                                                       'red', 'blue'))),
                 1, 2)

fig.append_trace(go.Scatter(x=Xtr_r[:, 0],
                            y=Xtr_r[:, 1],
                            mode='markers',
                            marker=dict(color=np.where(Y_train > 0,
                                                       'red', 'blue'))),
                 1, 3)

fig['layout']['yaxis1'].update(zeroline=False)
fig['layout']['xaxis1'].update(zeroline=False)

for i in map(str, range(2, 4)):
    y = 'yaxis' + i
    x = 'xaxis' + i
    fig['layout'][y].update(range=[-3, 3], zeroline=False)
    fig['layout'][x].update(range=[-3, 3], zeroline=False)

fig['layout'].update(showlegend=False)

In [5]:
py.iplot(fig)

Out[5]:
(Interactive Plotly figure: scatter subplots of the unscaled, standard-scaled, and robust-scaled training data)

### Classify using k-NN

In [8]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
knn.fit(Xtr_s, Y_train)
acc_s = knn.score(Xte_s, Y_test)
print("Testset accuracy using standard scaler: %.3f" % acc_s)
knn.fit(Xtr_r, Y_train)
acc_r = knn.score(Xte_r, Y_test)
print("Testset accuracy using robust scaler:   %.3f" % acc_r)

Testset accuracy using standard scaler: 0.545
Testset accuracy using robust scaler:   0.705
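
In practice, the scaler and classifier can be chained so that the scaling statistics are always fit on the training data only and reused for the test data automatically. A minimal sketch using sklearn's make_pipeline on a small synthetic two-class problem (not the dataset from this notebook):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.neighbors import KNeighborsClassifier

np.random.seed(0)
# Two well-separated Gaussian classes, with one outlier in the training set
X_train = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 3])
y_train = np.hstack([[0] * 50, [1] * 50])
X_train[0, 0] = 1000.0  # outlier

X_test = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 3])
y_test = np.hstack([[0] * 50, [1] * 50])

# fit() learns the robust scaling statistics on X_train only;
# score() applies those same statistics to X_test before classifying.
clf = make_pipeline(RobustScaler(), KNeighborsClassifier())
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))
```

Chaining the steps this way also avoids accidentally fitting the scaler on the test data, which would leak information into the evaluation.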


Author: Thomas Unterthiner

License: BSD 3 clause