Gradient Boosting Out-of-Bag Estimates in Scikit-learn

Out-of-bag (OOB) estimates can be a useful heuristic for estimating the “optimal” number of boosting iterations. OOB estimates are almost identical to cross-validation estimates, but they can be computed on the fly without repeated model fitting. They are only available for stochastic gradient boosting (i.e. subsample < 1.0): the estimates are derived from the improvement in loss on the examples not included in the bootstrap sample (the so-called out-of-bag examples). The OOB estimator is a pessimistic estimator of the true test loss, but it remains a fairly good approximation for a small number of trees.
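
To make the out-of-bag idea concrete, here is a minimal sketch in plain Python that does not rely on scikit-learn. It assumes that each boosting iteration draws a fraction `subsample` of the training rows without replacement; that sampling scheme and the helper name `split_in_bag_oob` are assumptions made for illustration, not a description of the library's internals.

```python
import random


def split_in_bag_oob(n_samples, subsample, rng):
    """Return (in_bag, oob) row indices for one boosting iteration.

    Hypothetical helper for illustration only; assumes rows are drawn
    without replacement.
    """
    n_in_bag = int(subsample * n_samples)
    in_bag = sorted(rng.sample(range(n_samples), n_in_bag))
    in_bag_set = set(in_bag)
    # Every row that was not drawn is "out of bag" for this iteration.
    oob = [i for i in range(n_samples) if i not in in_bag_set]
    return in_bag, oob


rng = random.Random(0)
in_bag, oob = split_in_bag_oob(n_samples=10, subsample=0.5, rng=rng)
print("in-bag rows:", in_bag)
print("out-of-bag rows:", oob)

# With subsample=1.0 every row is in the bag, so no OOB rows remain
# and no OOB estimate can be computed.
_, oob_full = split_in_bag_oob(n_samples=10, subsample=1.0, rng=rng)
print("OOB rows when subsample=1.0:", oob_full)
```

The empty list in the last line is why the OOB estimate requires subsample < 1.0: without held-out rows there is nothing to measure the improvement on.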

The figure shows the cumulative sum of the negative OOB improvements as a function of the boosting iteration. As you can see, it tracks the test loss for the first hundred iterations but then diverges in a pessimistic way. The figure also shows the performance of 3-fold cross-validation, which usually gives a better estimate of the test loss but is computationally more demanding.
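
The curve described above can be sketched in a few lines of plain Python. The improvement values below are made up for illustration and do not come from the fitted model; a positive value means the OOB loss decreased at that iteration.

```python
from itertools import accumulate

# Hypothetical per-iteration OOB improvements (positive = loss decreased).
oob_improvement = [0.30, 0.20, 0.10, 0.05, -0.02, -0.04]

# Cumulative sum of the negated improvements: an estimate of the loss
# relative to the starting point.
cumulative_loss = list(accumulate(-v for v in oob_improvement))

# Iterations are 1-based, so add 1 to the index of the minimum.
best_iter = min(range(len(cumulative_loss)),
                key=cumulative_loss.__getitem__) + 1
print("best iteration according to OOB:", best_iter)
```

Once the improvements turn negative, the cumulative curve starts to rise again, so its minimum marks the iteration count the OOB heuristic would pick.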

Version

In [1]:
import sklearn
sklearn.__version__
Out[1]:
'0.18.1'

Imports

This tutorial imports KFold and train_test_split.

In [2]:
print(__doc__)

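# Note: in plotly 4 and later this module is provided by the separate
# chart-studio package (import chart_studio.plotly as py).
import plotly.plotly as py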
import plotly.graph_objs as go

import numpy as np
from sklearn import ensemble
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
Automatically created module for IPython interactive environment

Calculations

In [3]:
# Generate data (adapted from G. Ridgeway's gbm example)
n_samples = 1000
random_state = np.random.RandomState(13)
x1 = random_state.uniform(size=n_samples)
x2 = random_state.uniform(size=n_samples)
x3 = random_state.randint(0, 4, size=n_samples)

p = 1 / (1.0 + np.exp(-(np.sin(3 * x1) - 4 * x2 + x3)))
y = random_state.binomial(1, p, size=n_samples)

X = np.c_[x1, x2, x3]

X = X.astype(np.float32)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5,
                                                    random_state=9)

# Fit classifier with out-of-bag estimates
params = {'n_estimators': 1200, 'max_depth': 3, 'subsample': 0.5,
          'learning_rate': 0.01, 'min_samples_leaf': 1, 'random_state': 3}
clf = ensemble.GradientBoostingClassifier(**params)

clf.fit(X_train, y_train)
acc = clf.score(X_test, y_test)
print("Accuracy: {:.4f}".format(acc))

n_estimators = params['n_estimators']
x = np.arange(n_estimators) + 1


def heldout_score(clf, X_test, y_test):
    """Compute deviance scores on ``X_test`` and ``y_test``."""
    score = np.zeros((n_estimators,), dtype=np.float64)
    for i, y_pred in enumerate(clf.staged_decision_function(X_test)):
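        # clf.loss_ matches the scikit-learn version shown above (0.18.1);
        # newer releases have removed this attribute.
        score[i] = clf.loss_(y_test, y_pred)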
    return score


def cv_estimate(n_splits=3):
    cv = KFold(n_splits=n_splits)
    cv_clf = ensemble.GradientBoostingClassifier(**params)
    val_scores = np.zeros((n_estimators,), dtype=np.float64)
    for train, test in cv.split(X_train, y_train):
        cv_clf.fit(X_train[train], y_train[train])
        val_scores += heldout_score(cv_clf, X_train[test], y_train[test])
    val_scores /= n_splits
    return val_scores


# Per-iteration loss estimated with 3-fold cross-validation
cv_score = cv_estimate(3)

# Per-iteration loss on the held-out test data
test_score = heldout_score(clf, X_test, y_test)

# negative cumulative sum of oob improvements
cumsum = -np.cumsum(clf.oob_improvement_)

# min loss according to OOB
oob_best_iter = x[np.argmin(cumsum)]

# min loss according to test (normalize such that first loss is 0)
test_score -= test_score[0]
test_best_iter = x[np.argmin(test_score)]

# min loss according to cv (normalize such that first loss is 0)
cv_score -= cv_score[0]
cv_best_iter = x[np.argmin(cv_score)]

# colors for the three curves
oob_color = 'purple'
test_color = 'green'
cv_color = 'orange'
Accuracy: 0.6840
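
The averaging and normalization steps used for the cross-validation curve can be illustrated on their own. This is a minimal sketch in plain Python with made-up fold scores; it mirrors the idea of `cv_estimate` followed by subtracting the first loss, not its exact implementation.

```python
# Hypothetical per-iteration validation losses, one list per fold.
fold_scores = [
    [0.70, 0.60, 0.55, 0.57],
    [0.68, 0.61, 0.56, 0.55],
    [0.72, 0.62, 0.57, 0.59],
]
n_splits = len(fold_scores)

# Average the folds iteration by iteration.
mean_scores = [sum(column) / n_splits for column in zip(*fold_scores)]

# Normalize so that the first loss is 0, which makes the curve
# comparable with the cumulative OOB curve.
normalized = [score - mean_scores[0] for score in mean_scores]

# Iterations are 1-based, so add 1 to the index of the minimum.
best_iter = normalized.index(min(normalized)) + 1
print("best iteration according to CV:", best_iter)
```

Shifting every curve to start at 0 does not move its minimum, so the selected iteration is the same as for the unnormalized scores.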

Plot Results

In [4]:
p1 = go.Scatter(x=x, y=cumsum, 
                name='OOB loss', 
                mode='lines',
                line=dict(color=oob_color, width=1)
                )

p2 = go.Scatter(x=x, y=test_score, 
                name='Test loss', 
                mode='lines',
                line=dict(color=test_color, width=1)
                )

p3 = go.Scatter(x=x, y=cv_score, 
                name='CV loss',
                mode='lines',
                line=dict(color=cv_color, width=1)
                )

p4 = go.Scatter(x=2 * [oob_best_iter], 
                y=[-0.3, 0.1],
                showlegend=False,
                mode='lines',
                line=dict(color=oob_color, width=1)
                )

p5 = go.Scatter(x=2 * [test_best_iter],
                y=[-0.3, 0.1],
                showlegend=False,
                mode='lines',
                line=dict(color=test_color, width=1)
                )

p6 = go.Scatter(x=2 * [cv_best_iter], 
                y=[-0.3, 0.1],
                showlegend=False,
                mode='lines',
                line=dict(color=cv_color, width=1)
                )

layout = go.Layout(xaxis=dict(title='number of iterations',
                              ticktext=['0','OOB','CV','Test','400',
                                        '600','800','1000','1200'],
                              tickvals=[0, oob_best_iter, cv_best_iter,
                                        test_best_iter, 400, 600, 800,
                                        1000, 1200]),
                   yaxis=dict(title='normalized loss'),
                   hovermode='closest'
                   )

fig = go.Figure(data=[p1, p2, p3, p4, p5, p6], layout=layout)

In [5]:
py.iplot(fig)
Out[5]:

License

Author:

    Peter Prettenhofer <peter.prettenhofer@gmail.com>

License:

    BSD 3 clause