# Gradient Boosting Out-of-Bag Estimates in Scikit-learn

Out-of-bag (OOB) estimates can be a useful heuristic for estimating the “optimal” number of boosting iterations. OOB estimates are almost identical to cross-validation estimates, but they can be computed on the fly without repeated model fitting. They are only available for Stochastic Gradient Boosting (i.e. subsample < 1.0): the estimates are derived from the improvement in loss on the examples not included in the bootstrap sample (the so-called out-of-bag examples). The OOB estimator is a pessimistic estimator of the true test loss, but remains a fairly good approximation for a small number of trees.
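As a minimal sketch of the subsample requirement, the snippet below fits two small classifiers and checks whether each fitted model exposes `oob_improvement_`. The synthetic data from `make_classification` and the variable names are illustrative assumptions, not part of this tutorial's dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Illustrative synthetic data; not the dataset used in this tutorial.
X_demo, y_demo = make_classification(n_samples=200, n_features=5,
                                     random_state=0)

# subsample < 1.0 turns on stochastic gradient boosting, so OOB
# improvements are recorded, one value per boosting iteration.
stochastic = GradientBoostingClassifier(n_estimators=20, subsample=0.5,
                                        random_state=0).fit(X_demo, y_demo)
print(stochastic.oob_improvement_.shape)

# With subsample=1.0 every example is used for every tree, so there are
# no out-of-bag examples and the attribute is not expected to exist.
full = GradientBoostingClassifier(n_estimators=20, subsample=1.0,
                                  random_state=0).fit(X_demo, y_demo)
print(hasattr(full, "oob_improvement_"))
```

The first check should report one entry per boosting iteration, and the second should come back false.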

The figure shows the cumulative sum of the negative OOB improvements as a function of the boosting iteration. As you can see, it tracks the test loss for the first hundred iterations but then diverges in a pessimistic way. The figure also shows the performance of 3-fold cross-validation, which usually gives a better estimate of the test loss but is computationally more demanding.
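To make the cumulative sum of the negative OOB improvements concrete, here is a toy sketch with hand-picked numbers (the values in `toy_improvements` are made up for illustration). Each improvement is the drop in OOB loss at one iteration, so the negated running sum traces a loss curve relative to the starting point, and its minimum marks the suggested number of iterations.

```python
import numpy as np

# Made-up per-iteration OOB improvements: positive means the loss dropped,
# negative means the extra tree made the OOB loss worse.
toy_improvements = np.array([0.5, 0.3, 0.1, -0.05, -0.1])

# Negated running sum = OOB loss relative to the loss before boosting.
relative_oob_loss = -np.cumsum(toy_improvements)

# Iterations are counted from 1, hence the +1 on the argmin index.
best_iter = int(np.argmin(relative_oob_loss)) + 1

print(relative_oob_loss)
print(best_iter)
```

In this toy case the curve bottoms out at the third iteration, after which further trees only increase the estimated loss.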

### Version

In [1]:
import sklearn
sklearn.__version__

Out[1]:
'0.18.1'

### Imports

This tutorial imports KFold and train_test_split.

In [2]:
print(__doc__)

import plotly.plotly as py
import plotly.graph_objs as go

import numpy as np
from sklearn import ensemble
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split

Automatically created module for IPython interactive environment


### Calculations

In [3]:
# Generate data (adapted from G. Ridgeway's gbm example)
n_samples = 1000
random_state = np.random.RandomState(13)
x1 = random_state.uniform(size=n_samples)
x2 = random_state.uniform(size=n_samples)
x3 = random_state.randint(0, 4, size=n_samples)

p = 1 / (1.0 + np.exp(-(np.sin(3 * x1) - 4 * x2 + x3)))
y = random_state.binomial(1, p, size=n_samples)

X = np.c_[x1, x2, x3]

X = X.astype(np.float32)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5,
                                                    random_state=9)

# Fit classifier with out-of-bag estimates
params = {'n_estimators': 1200, 'max_depth': 3, 'subsample': 0.5,
          'learning_rate': 0.01, 'min_samples_leaf': 1, 'random_state': 3}
clf = ensemble.GradientBoostingClassifier(**params)

clf.fit(X_train, y_train)
acc = clf.score(X_test, y_test)
print("Accuracy: {:.4f}".format(acc))

n_estimators = params['n_estimators']
x = np.arange(n_estimators) + 1

def heldout_score(clf, X_test, y_test):
    """Compute deviance scores on ``X_test`` and ``y_test``."""
    score = np.zeros((n_estimators,), dtype=np.float64)
    # One deviance value per boosting stage
    for i, y_pred in enumerate(clf.staged_decision_function(X_test)):
        score[i] = clf.loss_(y_test, y_pred)
    return score

def cv_estimate(n_splits=3):
    cv = KFold(n_splits=n_splits)
    cv_clf = ensemble.GradientBoostingClassifier(**params)
    val_scores = np.zeros((n_estimators,), dtype=np.float64)
    # Average the per-stage deviance over the folds
    for train, test in cv.split(X_train, y_train):
        cv_clf.fit(X_train[train], y_train[train])
        val_scores += heldout_score(cv_clf, X_train[test], y_train[test])
    val_scores /= n_splits
    return val_scores

# Estimate the deviance curve using cross-validation
cv_score = cv_estimate(3)

# Compute the deviance curve on the test data
test_score = heldout_score(clf, X_test, y_test)

# negative cumulative sum of oob improvements
cumsum = -np.cumsum(clf.oob_improvement_)

# min loss according to OOB
oob_best_iter = x[np.argmin(cumsum)]

# min loss according to test (normalize such that first loss is 0)
test_score -= test_score[0]
test_best_iter = x[np.argmin(test_score)]

# min loss according to cv (normalize such that first loss is 0)
cv_score -= cv_score[0]
cv_best_iter = x[np.argmin(cv_score)]

# color brew for the three curves
oob_color = 'purple'
test_color = 'green'
cv_color = 'orange'

Accuracy: 0.6840
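The `heldout_score` helper relies on `clf.loss_`, which exists in the scikit-learn version shown above (0.18.1); newer releases appear to have deprecated it, so treat its availability as version-dependent. As a hedged alternative for the binary case, the sketch below computes a binomial deviance directly from raw decision-function values. The function name `binomial_deviance` is hypothetical, and the factor of 2 is an assumption that follows the conventional deviance scaling; a different constant factor would not move the location of the minimum.

```python
import numpy as np

def binomial_deviance(y_true, raw_scores):
    """Twice the mean negative log-likelihood of a logistic model.

    y_true: array of 0/1 labels.
    raw_scores: log-odds, e.g. one stage yielded by
    staged_decision_function (flattened to 1-D here).
    """
    y_true = np.asarray(y_true, dtype=np.float64).ravel()
    raw_scores = np.asarray(raw_scores, dtype=np.float64).ravel()
    # log(1 + exp(s)) is computed stably via logaddexp(0, s)
    return -2.0 * np.mean(y_true * raw_scores
                          - np.logaddexp(0.0, raw_scores))

# Uninformative scores (log-odds 0) versus confident, correct scores.
uninformative = binomial_deviance([0, 1], [0.0, 0.0])
confident = binomial_deviance([0, 1], [-5.0, 5.0])
print(uninformative, confident)
```

Inside `heldout_score`, the call `clf.loss_(y_test, y_pred)` could then be swapped for `binomial_deviance(y_test, y_pred)` under these assumptions.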


### Plot Results

In [4]:
p1 = go.Scatter(x=x, y=cumsum,
                name='OOB loss',
                mode='lines',
                line=dict(color=oob_color, width=1)
               )

p2 = go.Scatter(x=x, y=test_score,
                name='Test loss',
                mode='lines',
                line=dict(color=test_color, width=1)
               )

p3 = go.Scatter(x=x, y=cv_score,
                name='CV loss',
                mode='lines',
                line=dict(color=cv_color, width=1)
               )

p4 = go.Scatter(x=2 * [oob_best_iter],
                y=[-0.3, 0.1],
                showlegend=False,
                mode='lines',
                line=dict(color=oob_color, width=1)
               )

p5 = go.Scatter(x=2 * [test_best_iter],
                y=[-0.3, 0.1],
                showlegend=False,
                mode='lines',
                line=dict(color=test_color, width=1)
               )

p6 = go.Scatter(x=2 * [cv_best_iter],
                y=[-0.3, 0.1],
                showlegend=False,
                mode='lines',
                line=dict(color=cv_color, width=1)
               )

layout = go.Layout(xaxis=dict(title='number of iterations',
                              ticktext=['0', 'OOB', 'CV', 'Test', '400',
                                        '600', '800', '1000', '1200'],
                              tickvals=[0, oob_best_iter, cv_best_iter,
                                        test_best_iter, 400, 600, 800,
                                        1000, 1200]),
                   yaxis=dict(title='normalized loss'),
                   hovermode='closest'
                  )

fig = go.Figure(data=[p1, p2, p3, p4, p5, p6], layout=layout)
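The three vertical marker traces `p4`, `p5` and `p6` differ only in their x position and color. As an optional refactoring sketch, a small helper can build them. The name `vline_trace` is hypothetical, and the helper returns a plain dictionary on the assumption that Plotly accepts trace dictionaries with a `type` key in a figure's `data` list.

```python
def vline_trace(x_pos, color, y_range=(-0.3, 0.1)):
    """Describe a vertical line at x_pos as a scatter-trace dictionary."""
    return dict(
        type='scatter',
        x=[x_pos, x_pos],          # same x twice gives a vertical segment
        y=list(y_range),
        mode='lines',
        showlegend=False,
        line=dict(color=color, width=1),
    )

# Example values only; in the notebook these would be oob_best_iter,
# test_best_iter and cv_best_iter with their matching colors.
markers = [vline_trace(it, c)
           for it, c in [(150, 'purple'), (300, 'green'), (250, 'orange')]]
print(markers[0]['x'])
```

Under that assumption, the resulting list could stand in for `p4`, `p5` and `p6` when assembling the figure's `data`.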

In [5]:
py.iplot(fig)

Out[5]:

Author:

    Peter Prettenhofer <peter.prettenhofer@gmail.com>

License:

    BSD 3 clause