
# Robust Scaling on Toy Data in Scikit-learn

Making sure that each feature has approximately the same scale can be a crucial preprocessing step. However, when the data contains outliers, StandardScaler can easily be misled. In such cases, it is better to use a scaler that is robust against outliers.

Here, we demonstrate this on a toy dataset in which a single datapoint is a large outlier.
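As a quick illustration of the difference (a minimal sketch, separate from the dataset used below): StandardScaler centers with the mean and scales by the standard deviation, both of which a single outlier can dominate, while RobustScaler uses the median and interquartile range, which are largely unaffected by it.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

# One feature with a single extreme outlier
x = np.array([1.0, 2.0, 3.0, 4.0, 1000.0]).reshape(-1, 1)

# StandardScaler: (x - mean) / std -- the outlier inflates both statistics,
# so the four inliers end up squashed into a narrow band
x_std = StandardScaler().fit_transform(x)

# RobustScaler: (x - median) / IQR -- median is 3, IQR is 2, so the
# inliers keep a sensible spread and only the outlier maps far away
x_rob = RobustScaler().fit_transform(x)

print(x_std.ravel())
print(x_rob.ravel())  # [-1.0, -0.5, 0.0, 0.5, 498.5]
```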

#### New to Plotly?

You can set up Plotly to work in online or offline mode, or in Jupyter notebooks.
We also have a quick-reference cheatsheet (new!) to help you get started!

### Version

In [1]:
import sklearn
sklearn.__version__

Out[1]:
'0.18.1'

### Imports

This tutorial imports StandardScaler and RobustScaler.

In [2]:
from __future__ import print_function
print(__doc__)

import plotly.plotly as py
import plotly.graph_objs as go
from plotly import tools

import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

Automatically created module for IPython interactive environment


### Calculations

In [3]:
# Create training and test data
np.random.seed(42)
n_datapoints = 100
Cov = [[0.9, 0.0], [0.0, 20.0]]
mu1 = [100.0, -3.0]
mu2 = [101.0, -3.0]
X1 = np.random.multivariate_normal(mean=mu1, cov=Cov, size=n_datapoints)
X2 = np.random.multivariate_normal(mean=mu2, cov=Cov, size=n_datapoints)
Y_train = np.hstack([[-1]*n_datapoints, [1]*n_datapoints])
X_train = np.vstack([X1, X2])

X1 = np.random.multivariate_normal(mean=mu1, cov=Cov, size=n_datapoints)
X2 = np.random.multivariate_normal(mean=mu2, cov=Cov, size=n_datapoints)
Y_test = np.hstack([[-1]*n_datapoints, [1]*n_datapoints])
X_test = np.vstack([X1, X2])

X_train[0, 0] = -1000  # a fairly large outlier

# Scale data
standard_scaler = StandardScaler()
Xtr_s = standard_scaler.fit_transform(X_train)
Xte_s = standard_scaler.transform(X_test)

robust_scaler = RobustScaler()
Xtr_r = robust_scaler.fit_transform(X_train)
Xte_r = robust_scaler.transform(X_test)
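To see why standard scaling goes wrong here, it can help to inspect the fitted statistics (a quick check, not part of the original notebook, using the scalers' `mean_`, `center_`, and `scale_` attributes): the outlier at -1000 drags the mean of the first feature far below 100 and inflates its standard deviation, while the median and IQR used by RobustScaler stay close to the bulk of the data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

# Rebuild X_train exactly as in the cell above
np.random.seed(42)
n_datapoints = 100
Cov = [[0.9, 0.0], [0.0, 20.0]]
X1 = np.random.multivariate_normal(mean=[100.0, -3.0], cov=Cov, size=n_datapoints)
X2 = np.random.multivariate_normal(mean=[101.0, -3.0], cov=Cov, size=n_datapoints)
X_train = np.vstack([X1, X2])
X_train[0, 0] = -1000  # the outlier

standard_scaler = StandardScaler().fit(X_train)
robust_scaler = RobustScaler().fit(X_train)

# The single outlier pulls the mean of feature 0 well below 100
# and blows up the standard deviation
print("mean / std   :", standard_scaler.mean_[0], standard_scaler.scale_[0])
# The median and IQR barely notice it
print("median / IQR :", robust_scaler.center_[0], robust_scaler.scale_[0])
```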


### Plot Results

In [4]:
fig = tools.make_subplots(rows=1, cols=3,
                          print_grid=False,
                          subplot_titles=("Unscaled data",
                                          "After standard scaling (zoomed in)",
                                          "After robust scaling (zoomed in)"))

fig.append_trace(go.Scatter(x=X_train[:, 0],
                            y=X_train[:, 1],
                            mode='markers',
                            marker=dict(
                                color=np.where(Y_train > 0, 'red', 'blue'))),
                 1, 1)

fig.append_trace(go.Scatter(x=Xtr_s[:, 0],
                            y=Xtr_s[:, 1],
                            mode='markers',
                            marker=dict(
                                color=np.where(Y_train > 0, 'red', 'blue'))),
                 1, 2)

fig.append_trace(go.Scatter(x=Xtr_r[:, 0],
                            y=Xtr_r[:, 1],
                            mode='markers',
                            marker=dict(
                                color=np.where(Y_train > 0, 'red', 'blue'))),
                 1, 3)

fig['layout']['yaxis1'].update(zeroline=False)
fig['layout']['xaxis1'].update(zeroline=False)

for i in map(str, range(2, 4)):
    y = 'yaxis' + i
    x = 'xaxis' + i
    fig['layout'][y].update(range=[-3, 3], zeroline=False)
    fig['layout'][x].update(range=[-3, 3], zeroline=False)

fig['layout'].update(showlegend=False)

In [5]:
py.iplot(fig)

Out[5]:

(Interactive Plotly figure: three scatter subplots showing the unscaled, standard-scaled, and robust-scaled data.)
### Classify using k-NN

In [8]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
knn.fit(Xtr_s, Y_train)
acc_s = knn.score(Xte_s, Y_test)
print("Testset accuracy using standard scaler: %.3f" % acc_s)
knn.fit(Xtr_r, Y_train)
acc_r = knn.score(Xte_r, Y_test)
print("Testset accuracy using robust scaler:   %.3f" % acc_r)

Testset accuracy using standard scaler: 0.545
Testset accuracy using robust scaler:   0.705
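In practice, the scaler and classifier are usually chained so that the scaler is fit on the training data only and then reused on the test data automatically. A minimal sketch using sklearn.pipeline (a follow-up suggestion, not part of the original notebook), reusing the same toy setup as above:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.neighbors import KNeighborsClassifier

# Toy data: two overlapping blobs plus a single large outlier,
# mirroring the setup above
np.random.seed(42)
n = 100
Cov = [[0.9, 0.0], [0.0, 20.0]]
X_train = np.vstack([
    np.random.multivariate_normal([100.0, -3.0], Cov, size=n),
    np.random.multivariate_normal([101.0, -3.0], Cov, size=n),
])
y_train = np.hstack([[-1] * n, [1] * n])
X_train[0, 0] = -1000  # the outlier

X_test = np.vstack([
    np.random.multivariate_normal([100.0, -3.0], Cov, size=n),
    np.random.multivariate_normal([101.0, -3.0], Cov, size=n),
])
y_test = np.hstack([[-1] * n, [1] * n])

# The pipeline fits RobustScaler on the training data, then applies the
# same (already-fitted) transform to the test data inside score()
model = make_pipeline(RobustScaler(), KNeighborsClassifier())
model.fit(X_train, y_train)
print("Testset accuracy: %.3f" % model.score(X_test, y_test))
```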


Author: Thomas Unterthiner

License: BSD 3 clause