
# Gradient Boosting Regression in Scikit-learn

Demonstrate Gradient Boosting on the Boston housing dataset.

This example fits a Gradient Boosting model with least squares loss and 500 regression trees of depth 4.

#### New to Plotly?

You can set up Plotly to work in online or offline mode, or in Jupyter notebooks.
We also have a quick-reference cheat sheet (new!) to help you get started!

### Version

In [1]:
import sklearn
sklearn.__version__

Out[1]:
'0.18.1'

### Imports

This tutorial imports shuffle from sklearn.utils and mean_squared_error from sklearn.metrics, along with NumPy, Matplotlib, and Plotly's graph objects.

In [2]:
print(__doc__)

import plotly.plotly as py
import plotly.graph_objs as go

import numpy as np
import matplotlib.pyplot as plt

from sklearn import ensemble
from sklearn import datasets
from sklearn.utils import shuffle
from sklearn.metrics import mean_squared_error

Automatically created module for IPython interactive environment


### Calculations

In [3]:
boston = datasets.load_boston()
X, y = shuffle(boston.data, boston.target, random_state=13)
X = X.astype(np.float32)
offset = int(X.shape[0] * 0.9)
X_train, y_train = X[:offset], y[:offset]
X_test, y_test = X[offset:], y[offset:]
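
The manual 90/10 slicing above can also be written with scikit-learn's train_test_split, which shuffles for you. The sketch below uses a synthetic stand-in for the Boston data (load_boston was removed in scikit-learn 1.2, so this keeps the example runnable on any version):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 506 samples with 13 features, matching the
# shape of the Boston housing data used in this tutorial.
X, y = make_regression(n_samples=506, n_features=13, noise=10.0,
                       random_state=13)

# Equivalent 90/10 split; shuffling is on by default.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=13)

print(X_train.shape, X_test.shape)  # (455, 13) (51, 13)
```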


Fit regression model

In [4]:
params = {'n_estimators': 500, 'max_depth': 4, 'min_samples_split': 2,
          'learning_rate': 0.01, 'loss': 'ls'}
clf = ensemble.GradientBoostingRegressor(**params)

clf.fit(X_train, y_train)
mse = mean_squared_error(y_test, clf.predict(X_test))
print("MSE: %.4f" % mse)

MSE: 6.6706
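
The MSE above depends on the trade-off between learning_rate and n_estimators: a smaller shrinkage generally needs more trees to reach the same training loss but often generalizes better. A minimal sketch of that trade-off, on synthetic Friedman #1 data rather than the tutorial's dataset (the exact numbers will differ from the ones above):

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression benchmark; stands in for the Boston data.
X, y = make_friedman1(n_samples=1200, random_state=13)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=13)

# Same tree budget, three different shrinkage rates.
for lr in (0.3, 0.1, 0.01):
    model = GradientBoostingRegressor(n_estimators=500, max_depth=4,
                                      learning_rate=lr, random_state=13)
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    print("learning_rate=%.2f  MSE: %.4f" % (lr, mse))
```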


### Plot Training Deviance

In [5]:
test_score = np.zeros((params['n_estimators'],), dtype=np.float64)

for i, y_pred in enumerate(clf.staged_predict(X_test)):
    test_score[i] = clf.loss_(y_test, y_pred)

train = go.Scatter(x=np.arange(params['n_estimators']) + 1,
                   y=clf.train_score_,
                   name='Training Set Deviance',
                   mode='lines',
                   line=dict(color='blue')
                   )
test = go.Scatter(x=np.arange(params['n_estimators']) + 1,
                  y=test_score,
                  mode='lines',
                  name='Test Set Deviance',
                  line=dict(color='red')
                  )

layout = go.Layout(title='Deviance',
                   xaxis=dict(title='Boosting Iterations'),
                   yaxis=dict(title='Deviance')
                   )
fig = go.Figure(data=[test, train], layout=layout)

In [6]:
py.iplot(fig)

Out[6]:
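
Note that clf.loss_ is an internal attribute and was removed in recent scikit-learn releases. A version-agnostic way to build the same test curve is to score each staged prediction with mean_squared_error, since staged_predict yields the model's predictions after every boosting iteration. A sketch on synthetic Friedman #1 data:

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_friedman1(n_samples=1200, random_state=13)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=13)

clf = GradientBoostingRegressor(n_estimators=100, max_depth=4,
                                learning_rate=0.1, random_state=13)
clf.fit(X_train, y_train)

# One test error per boosting iteration, without touching clf.loss_.
test_score = np.array([mean_squared_error(y_test, y_pred)
                       for y_pred in clf.staged_predict(X_test)])

print(test_score.shape)  # (100,) -- one entry per iteration
```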

### Plot Feature Importance

In [7]:
feature_importance = clf.feature_importances_

# make importances relative to max importance
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance)
pos = np.arange(sorted_idx.shape[0]) + .5

trace = go.Bar(x=feature_importance[sorted_idx],
               y=boston.feature_names[sorted_idx],
               orientation='h'
               )

layout = go.Layout(xaxis=dict(title='Relative Importance'),
                   yaxis=dict(title='Variable Importance')
                   )
fig = go.Figure(data=[trace], layout=layout)

In [8]:
py.iplot(fig)

Out[8]:
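
Impurity-based feature_importances_ are computed on the training set and can favor features with many possible split points. scikit-learn 0.22 and later also offers permutation importance, measured on held-out data, as a complementary check (not part of this tutorial's original code). A sketch on synthetic Friedman #1 data, where only the first five features are informative:

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Friedman #1: 10 features, of which only features 0-4 matter.
X, y = make_friedman1(n_samples=800, random_state=13)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=13)

clf = GradientBoostingRegressor(n_estimators=200, random_state=13)
clf.fit(X_train, y_train)

# Importance = drop in score when a feature's column is shuffled,
# evaluated on the held-out test split.
result = permutation_importance(clf, X_test, y_test, n_repeats=10,
                                random_state=13)
for idx in np.argsort(result.importances_mean)[::-1][:5]:
    print("feature %d: %.4f" % (idx, result.importances_mean[idx]))
```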

Author:

    Peter Prettenhofer <peter.prettenhofer@gmail.com>

License:

    BSD 3 clause