Show Sidebar Hide Sidebar

Theil-Sen Regression in Scikit-learn

Computes a Theil-Sen Regression on a synthetic dataset.

See Theil-Sen estimator: generalized-median-based estimator for more information on the regressor.

Compared to the OLS (ordinary least squares) estimator, the Theil-Sen estimator is robust against outliers. It has a breakdown point of about 29.3% in case of a simple linear regression which means that it can tolerate arbitrary corrupted data (outliers) of up to 29.3% in the two-dimensional case.

The estimation of the model is done by calculating the slopes and intercepts of a subpopulation of all possible combinations of p subsample points. If an intercept is fitted, p must be greater than or equal to n_features + 1. The final slope and intercept is then defined as the spatial median of these slopes and intercepts.

In certain cases Theil-Sen performs better than RANSAC which is also a robust method. This is illustrated in the second example below where outliers with respect to the x-axis perturb RANSAC. Tuning the residual_threshold parameter of RANSAC remedies this but in general a priori knowledge about the data and the nature of the outliers is needed. Due to the computational complexity of Theil-Sen it is recommended to use it only for small problems in terms of number of samples and features. For larger problems the max_subpopulation parameter restricts the magnitude of all possible combinations of p subsample points to a randomly chosen subset and therefore also limits the runtime. Therefore, Theil-Sen is applicable to larger problems with the drawback of losing some of its mathematical properties since it then works on a random subset.

New to Plotly?

Plotly's Python library is free and open source! Get started by downloading the client and reading the primer.
You can set up Plotly to work in online or offline mode, or in jupyter notebooks.
We also have a quick-reference cheatsheet (new!) to help you get started!

Version

In [1]:
import sklearn
sklearn.__version__
Out[1]:
'0.18.1'

Imports

This tutorial imports LinearRegression, TheilSenRegressor and RANSACRegressor.

In [2]:
print(__doc__)

import plotly.plotly as py
import plotly.graph_objs as go

import time
import numpy as np
from sklearn.linear_model import LinearRegression, TheilSenRegressor
from sklearn.linear_model import RANSACRegressor
Automatically created module for IPython interactive environment

Calculations

In [3]:
estimators = [('OLS', LinearRegression()),
              ('Theil-Sen', TheilSenRegressor(random_state=42)),
              ('RANSAC', RANSACRegressor(random_state=42)), ]
colors = {'OLS': 'turquoise', 'Theil-Sen': 'gold', 'RANSAC': 'lightgreen'}
lw = 2

Outliers only in the y direction

In [4]:
np.random.seed(0)
n_samples = 200
# Linear model y = 3*x + N(2, 0.1**2)
x = np.random.randn(n_samples)
w = 3.
c = 2.
noise = 0.1 * np.random.randn(n_samples)
y = w * x + c + noise
# 10% outliers
y[-20:] += -20 * x[-20:]
X = x[:, np.newaxis]
In [5]:
data = []
trace = go.Scatter(x=x, y=y, 
                   mode='markers', 
                   marker=dict(color='indigo'),
                   showlegend=False)
data.append(trace)

line_x = np.array([-3, 3])

for name, estimator in estimators:
    t0 = time.time()
    estimator.fit(X, y)
    elapsed_time = time.time() - t0
    y_pred = estimator.predict(line_x.reshape(2, 1))
    trace1 = go.Scatter(x=line_x, y=y_pred, 
                        mode='lines', 
                        line=dict(color=colors[name], width=lw),
                        name='%s (fit time: %.2fs)' % (name, elapsed_time))
    data.append(trace1)

layout = go.Layout(title="Corrupt y",
                   xaxis=dict(zeroline=False, showgrid=False),
                   yaxis=dict(zeroline=False, showgrid=False),
                  )
fig = go.Figure(data=data, layout=layout)
In [6]:
py.iplot(fig)
Out[6]:

Outliers only in the x direction

In [7]:
np.random.seed(0)
# Linear model y = 3*x + N(2, 0.1**2)
x = np.random.randn(n_samples)
noise = 0.1 * np.random.randn(n_samples)
y = 3 * x + 2 + noise
# 10% outliers
x[-20:] = 9.9
y[-20:] += 22
X = x[:, np.newaxis]
In [8]:
data = []
trace = go.Scatter(x=x, y=y, 
                   mode='markers', 
                   marker=dict(color='indigo'),
                   showlegend=False)
data.append(trace)

line_x = np.array([-3, 10])

for name, estimator in estimators:
    t0 = time.time()
    estimator.fit(X, y)
    elapsed_time = time.time() - t0
    y_pred = estimator.predict(line_x.reshape(2, 1))

    trace1 = go.Scatter(x=line_x, y=y_pred, 
                        mode='lines', 
                        line=dict(color=colors[name], width=lw),
                        name='%s (fit time: %.2fs)' % (name, elapsed_time))
    data.append(trace1)

layout = go.Layout(title="Corrupt x",
                   xaxis=dict(zeroline=False, showgrid=False),
                   yaxis=dict(zeroline=False, showgrid=False),
                  )
fig = go.Figure(data=data, layout=layout)
In [9]:
py.iplot(fig)
Out[9]:

License

Author:

    Florian Wilhelm -- <florian.wilhelm@gmail.com>

License:

    BSD 3 clause
Still need help?
Contact Us

For guaranteed 24 hour response turnarounds, upgrade to a Developer Support Plan.