Show Sidebar Hide Sidebar

# Theil-Sen Regression in Scikit-learn

Computes a Theil-Sen Regression on a synthetic dataset.

Compared to the OLS (ordinary least squares) estimator, the Theil-Sen estimator is robust against outliers. It has a breakdown point of about 29.3% in case of a simple linear regression which means that it can tolerate arbitrary corrupted data (outliers) of up to 29.3% in the two-dimensional case.

The estimation of the model is done by calculating the slopes and intercepts of a subpopulation of all possible combinations of p subsample points. If an intercept is fitted, p must be greater than or equal to n_features + 1. The final slope and intercept is then defined as the spatial median of these slopes and intercepts.

In certain cases Theil-Sen performs better than RANSAC which is also a robust method. This is illustrated in the second example below where outliers with respect to the x-axis perturb RANSAC. Tuning the residual_threshold parameter of RANSAC remedies this but in general a priori knowledge about the data and the nature of the outliers is needed. Due to the computational complexity of Theil-Sen it is recommended to use it only for small problems in terms of number of samples and features. For larger problems the max_subpopulation parameter restricts the magnitude of all possible combinations of p subsample points to a randomly chosen subset and therefore also limits the runtime. Therefore, Theil-Sen is applicable to larger problems with the drawback of losing some of its mathematical properties since it then works on a random subset.

#### New to Plotly?¶

You can set up Plotly to work in online or offline mode, or in jupyter notebooks.
We also have a quick-reference cheatsheet (new!) to help you get started!

### Version¶

In [1]:
import sklearn
sklearn.__version__

Out[1]:
'0.18.1'

### Imports¶

This tutorial imports LinearRegression, TheilSenRegressor and RANSACRegressor.

In [2]:
print(__doc__)

import plotly.plotly as py
import plotly.graph_objs as go

import time
import numpy as np
from sklearn.linear_model import LinearRegression, TheilSenRegressor
from sklearn.linear_model import RANSACRegressor

Automatically created module for IPython interactive environment


### Calculations¶

In [3]:
estimators = [('OLS', LinearRegression()),
('Theil-Sen', TheilSenRegressor(random_state=42)),
('RANSAC', RANSACRegressor(random_state=42)), ]
colors = {'OLS': 'turquoise', 'Theil-Sen': 'gold', 'RANSAC': 'lightgreen'}
lw = 2


### Outliers only in the y direction¶

In [4]:
np.random.seed(0)
n_samples = 200
# Linear model y = 3*x + N(2, 0.1**2)
x = np.random.randn(n_samples)
w = 3.
c = 2.
noise = 0.1 * np.random.randn(n_samples)
y = w * x + c + noise
# 10% outliers
y[-20:] += -20 * x[-20:]
X = x[:, np.newaxis]

In [5]:
data = []
trace = go.Scatter(x=x, y=y,
mode='markers',
marker=dict(color='indigo'),
showlegend=False)
data.append(trace)

line_x = np.array([-3, 3])

for name, estimator in estimators:
t0 = time.time()
estimator.fit(X, y)
elapsed_time = time.time() - t0
y_pred = estimator.predict(line_x.reshape(2, 1))
trace1 = go.Scatter(x=line_x, y=y_pred,
mode='lines',
line=dict(color=colors[name], width=lw),
name='%s (fit time: %.2fs)' % (name, elapsed_time))
data.append(trace1)

layout = go.Layout(title="Corrupt y",
xaxis=dict(zeroline=False, showgrid=False),
yaxis=dict(zeroline=False, showgrid=False),
)
fig = go.Figure(data=data, layout=layout)

In [6]:
py.iplot(fig)

Out[6]:

### Outliers only in the x direction¶

In [7]:
np.random.seed(0)
# Linear model y = 3*x + N(2, 0.1**2)
x = np.random.randn(n_samples)
noise = 0.1 * np.random.randn(n_samples)
y = 3 * x + 2 + noise
# 10% outliers
x[-20:] = 9.9
y[-20:] += 22
X = x[:, np.newaxis]

In [8]:
data = []
trace = go.Scatter(x=x, y=y,
mode='markers',
marker=dict(color='indigo'),
showlegend=False)
data.append(trace)

line_x = np.array([-3, 10])

for name, estimator in estimators:
t0 = time.time()
estimator.fit(X, y)
elapsed_time = time.time() - t0
y_pred = estimator.predict(line_x.reshape(2, 1))

trace1 = go.Scatter(x=line_x, y=y_pred,
mode='lines',
line=dict(color=colors[name], width=lw),
name='%s (fit time: %.2fs)' % (name, elapsed_time))
data.append(trace1)

layout = go.Layout(title="Corrupt x",
xaxis=dict(zeroline=False, showgrid=False),
yaxis=dict(zeroline=False, showgrid=False),
)
fig = go.Figure(data=data, layout=layout)

In [9]:
py.iplot(fig)

Out[9]:

Author:

    Florian Wilhelm -- <florian.wilhelm@gmail.com>



    BSD 3 clause