
Robust Scaling on Toy Data in Scikit-learn

Making sure that each feature has approximately the same scale can be a crucial preprocessing step. However, when the data contains outliers, StandardScaler can often be misled. In such cases, it is better to use a scaler that is robust to outliers.

Here, we demonstrate this on a toy dataset, where a single datapoint is a large outlier.
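To see why a single outlier matters, compare the statistics the two scalers rely on. StandardScaler centers and scales with the mean and standard deviation, while RobustScaler uses the median and interquartile range (IQR). A minimal sketch (the array below is illustrative, not the toy dataset used later):

```python
import numpy as np

# One large outlier drags the mean and inflates the standard deviation,
# while the median and IQR barely move.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 1000.0])

print("mean:  ", np.mean(x))    # pulled far toward the outlier
print("median:", np.median(x))  # stays in the middle of the bulk of the data
print("std:   ", np.std(x))     # blown up by the outlier
print("IQR:   ", np.percentile(x, 75) - np.percentile(x, 25))  # stays small
```

Dividing by the inflated standard deviation squashes all the non-outlier points into a tiny range, which is exactly what happens in the plot below.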

New to Plotly?

Plotly's Python library is free and open source! Get started by downloading the client and reading the primer.
You can set up Plotly to work in online or offline mode, or in jupyter notebooks.
We also have a quick-reference cheatsheet (new!) to help you get started!

Version

In [1]:
import sklearn
sklearn.__version__
Out[1]:
'0.18.1'

Imports

This tutorial imports StandardScaler and RobustScaler.

In [2]:
from __future__ import print_function
print(__doc__)

import plotly.plotly as py
import plotly.graph_objs as go
from plotly import tools

import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler
Automatically created module for IPython interactive environment

Calculations

In [3]:
# Create training and test data
np.random.seed(42)
n_datapoints = 100
Cov = [[0.9, 0.0], [0.0, 20.0]]
mu1 = [100.0, -3.0]
mu2 = [101.0, -3.0]
X1 = np.random.multivariate_normal(mean=mu1, cov=Cov, size=n_datapoints)
X2 = np.random.multivariate_normal(mean=mu2, cov=Cov, size=n_datapoints)
Y_train = np.hstack([[-1]*n_datapoints, [1]*n_datapoints])
X_train = np.vstack([X1, X2])

X1 = np.random.multivariate_normal(mean=mu1, cov=Cov, size=n_datapoints)
X2 = np.random.multivariate_normal(mean=mu2, cov=Cov, size=n_datapoints)
Y_test = np.hstack([[-1]*n_datapoints, [1]*n_datapoints])
X_test = np.vstack([X1, X2])

X_train[0, 0] = -1000  # a fairly large outlier


# Scale data
standard_scaler = StandardScaler()
Xtr_s = standard_scaler.fit_transform(X_train)
Xte_s = standard_scaler.transform(X_test)

robust_scaler = RobustScaler()
Xtr_r = robust_scaler.fit_transform(X_train)
Xte_r = robust_scaler.transform(X_test)
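It can be instructive to inspect the statistics each scaler learned from the training data. A small sketch on synthetic data with the same kind of single outlier (the data here is generated fresh for illustration, not the `X_train` above):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

rng = np.random.RandomState(42)
X = rng.normal(loc=100.0, scale=1.0, size=(100, 2))
X[0, 0] = -1000  # a single large outlier, as in the training data

ss = StandardScaler().fit(X)
rs = RobustScaler().fit(X)

# StandardScaler centers on the mean, which the outlier drags down...
print("StandardScaler center:", ss.mean_[0])
# ...while RobustScaler centers on the median, which stays near 100.
print("RobustScaler center:  ", rs.center_[0])
```

The first feature's mean is pulled far below 100 by the single corrupted point, while the median is essentially unaffected.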

Plot Results

In [4]:
fig = tools.make_subplots(rows=1, cols=3,
                          print_grid=False,
                          subplot_titles=("Unscaled data",
                                          "After standard scaling (zoomed in)",
                                          "After robust scaling (zoomed in)"))

fig.append_trace(go.Scatter(x=X_train[:, 0],
                            y=X_train[:, 1],
                            mode='markers',
                            marker=dict(color=
                                        np.where(Y_train > 0, 'red', 'blue'))), 1, 1)
                            
fig.append_trace(go.Scatter(x=Xtr_s[:, 0], 
                            y=Xtr_s[:, 1], 
                            mode='markers',
                            marker=dict(color=
                                        np.where(Y_train > 0, 'red', 'blue'))), 1, 2)

fig.append_trace(go.Scatter(x=Xtr_r[:, 0], 
                            y=Xtr_r[:, 1], 
                            mode='markers',
                            marker=dict(color=
                                        np.where(Y_train > 0, 'red', 'blue'))), 1, 3)

fig['layout']['yaxis1'].update(zeroline=False)
fig['layout']['xaxis1'].update(zeroline=False)

for i in map(str, range(2, 4)):
    y = 'yaxis' + i
    x = 'xaxis' + i
    fig['layout'][y].update(range=[-3, 3], zeroline=False)
    fig['layout'][x].update(range=[-3, 3], zeroline=False)

fig['layout'].update(showlegend=False)
In [5]:
py.iplot(fig)
Out[5]:

Classify using k-NN

In [8]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
knn.fit(Xtr_s, Y_train)
acc_s = knn.score(Xte_s, Y_test)
print("Test set accuracy using standard scaler: %.3f" % acc_s)
knn.fit(Xtr_r, Y_train)
acc_r = knn.score(Xte_r, Y_test)
print("Test set accuracy using robust scaler:   %.3f" % acc_r)
Test set accuracy using standard scaler: 0.545
Test set accuracy using robust scaler:   0.705
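In practice, the scale-then-fit steps above are often bundled into a Pipeline, which guarantees the test set is only ever transformed with statistics learned on the training set. A minimal sketch with synthetic stand-in data (the arrays below are illustrative, not the toy dataset from this example):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
X_train = rng.normal(size=(100, 2))
y_train = rng.randint(0, 2, 100)
X_test = rng.normal(size=(40, 2))
y_test = rng.randint(0, 2, 40)

# fit() scales the training data and fits k-NN; score() reuses the
# same fitted scaler on the test data automatically.
clf = make_pipeline(RobustScaler(), KNeighborsClassifier())
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```

This avoids the easy mistake of calling `fit_transform` on the test set, which would leak test-set statistics into the preprocessing.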

License

Code source:

        Thomas Unterthiner

License:

        BSD 3 clause