# Nested Versus Non-Nested Cross-Validation in Scikit-learn

This example compares non-nested and nested cross-validation strategies for a classifier on the iris data set. Nested cross-validation (CV) is often used to train a model whose hyperparameters also need to be optimized. It estimates the generalization error of the underlying model together with its (hyper)parameter search. Choosing the parameters that maximize the non-nested CV score biases the model toward the dataset and yields an overly optimistic score.

Model selection without nested CV uses the same data to tune model parameters and to evaluate model performance. Information may thus “leak” into the model, which then overfits the data. The magnitude of this effect depends primarily on the size of the dataset and the stability of the model. See Cawley and Talbot [1] for an analysis of these issues.
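The optimistic bias can be illustrated without any real model. The sketch below is a simplified simulation, not part of the original example: every hypothetical candidate has the same true accuracy (0.5), each is scored on a small validation set, and the best score is reported. The helper name `best_of_k_score` and all of its defaults are assumptions chosen for illustration.

```python
import random
import statistics


def best_of_k_score(rng, n_candidates=20, n_samples=50, true_accuracy=0.5):
    """Score equally good candidates on a small sample and return the best score."""
    scores = []
    for _ in range(n_candidates):
        # Each candidate classifies each sample correctly with the same probability.
        correct = sum(rng.random() < true_accuracy for _ in range(n_samples))
        scores.append(correct / n_samples)
    # Reporting the maximum mimics "pick the setting with the best CV score".
    return max(scores)


rng = random.Random(0)
selected_scores = [best_of_k_score(rng) for _ in range(200)]
mean_selected = statistics.mean(selected_scores)

print("True accuracy of every candidate: 0.50")
print("Mean score of the selected candidate: {0:.2f}".format(mean_selected))
```

The selected score sits above 0.5 on average even though no candidate is actually better than another, which is the kind of bias that an outer evaluation loop is meant to expose.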

To avoid this problem, nested CV effectively uses a series of train/validation/test set splits. In the inner loop, the score is approximately maximized by fitting a model to each training set, and then directly maximized in selecting (hyper)parameters over the validation set. In the outer loop, generalization error is estimated by averaging test set scores over several dataset splits.
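The two loops described above can be written out by hand. The following is a minimal sketch of roughly what `cross_val_score` wrapped around `GridSearchCV` does, assuming accuracy as the score and a plain mean over inner folds; the variable names (`X_dev`, `best_params`, `outer_scores`) are illustrative and do not come from the original example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, ParameterGrid
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
p_grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1]}
outer_cv = KFold(n_splits=4, shuffle=True, random_state=0)
inner_cv = KFold(n_splits=4, shuffle=True, random_state=0)

outer_scores = []
for dev_idx, test_idx in outer_cv.split(X):
    X_dev, y_dev = X[dev_idx], y[dev_idx]

    # Inner loop: score every candidate on validation folds cut from the
    # development data only; the outer test fold is never used here.
    best_params, best_mean = None, -np.inf
    for params in ParameterGrid(p_grid):
        val_scores = []
        for train_idx, val_idx in inner_cv.split(X_dev):
            model = SVC(kernel="rbf", **params)
            model.fit(X_dev[train_idx], y_dev[train_idx])
            val_scores.append(model.score(X_dev[val_idx], y_dev[val_idx]))
        mean_val = np.mean(val_scores)
        if mean_val > best_mean:
            best_params, best_mean = params, mean_val

    # Outer loop: refit the chosen setting on all development data and
    # score it once on the held-out test fold.
    final_model = SVC(kernel="rbf", **best_params)
    final_model.fit(X_dev, y_dev)
    outer_scores.append(final_model.score(X[test_idx], y[test_idx]))

nested_estimate = float(np.mean(outer_scores))
print("Nested CV accuracy estimate: {0:.3f}".format(nested_estimate))
```

Because the hyperparameters are re-selected inside every outer fold, the test folds measure the whole procedure (search plus fit) rather than one fixed parameter setting.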

The example below uses a support vector classifier with a non-linear kernel to build a model with optimized hyperparameters by grid search. We compare the performance of non-nested and nested CV strategies by taking the difference between their scores.

### Version

In [1]:
import sklearn
sklearn.__version__

Out[1]:
'0.18.1'

### Imports

This tutorial imports load_iris, SVC, GridSearchCV, cross_val_score and KFold.

In [2]:
import plotly.plotly as py
import plotly.graph_objs as go
from plotly import tools

from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score, KFold
import numpy as np


### Calculations

In [3]:
# Number of random trials
NUM_TRIALS = 30

# Load the dataset
iris = load_iris()
X_iris = iris.data
y_iris = iris.target

# Set up possible values of parameters to optimize over
p_grid = {"C": [1, 10, 100],
          "gamma": [.01, .1]}

# We will use a Support Vector Classifier with "rbf" kernel
svm = SVC(kernel="rbf")

# Arrays to store scores
non_nested_scores = np.zeros(NUM_TRIALS)
nested_scores = np.zeros(NUM_TRIALS)

# Loop for each trial
for i in range(NUM_TRIALS):

    # Choose cross-validation techniques for the inner and outer loops,
    # independently of the dataset.
    # E.g. "GroupKFold", "LeaveOneOut", "LeaveOneGroupOut", etc.
    inner_cv = KFold(n_splits=4, shuffle=True, random_state=i)
    outer_cv = KFold(n_splits=4, shuffle=True, random_state=i)

    # Non-nested parameter search and scoring
    clf = GridSearchCV(estimator=svm, param_grid=p_grid, cv=inner_cv)
    clf.fit(X_iris, y_iris)
    non_nested_scores[i] = clf.best_score_

    # Nested CV with parameter optimization: the grid search is re-run
    # inside each outer training split
    nested_score = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv)
    nested_scores[i] = nested_score.mean()

score_difference = non_nested_scores - nested_scores

print("Average difference of {0:6f} with std. dev. of {1:6f}."
      .format(score_difference.mean(), score_difference.std()))

Average difference of 0.007742 with std. dev. of 0.007688.
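To see what the nested search selects in each outer fold, one option is `cross_validate` with `return_estimator=True`. This is a sketch that assumes scikit-learn 0.20 or later (newer than the 0.18.1 shown above, where this option is not available); the fixed `random_state=0` and the variable names are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_validate
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
p_grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1]}
inner_cv = KFold(n_splits=4, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=4, shuffle=True, random_state=0)

search = GridSearchCV(estimator=SVC(kernel="rbf"), param_grid=p_grid, cv=inner_cv)

# return_estimator=True keeps the GridSearchCV object fitted on each outer
# training split (assumption: scikit-learn >= 0.20).
results = cross_validate(search, X, y, cv=outer_cv, return_estimator=True)

for fold, (fitted, score) in enumerate(zip(results["estimator"],
                                           results["test_score"])):
    print("outer fold {0}: best_params={1}, test score={2:.3f}"
          .format(fold, fitted.best_params_, score))
```

If the selected parameters change from fold to fold, the search is sensitive to the particular split, which is one reason the non-nested score can differ from the nested estimate.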


### Plot Results

In [4]:
fig = tools.make_subplots(rows=2, cols=1,
                          print_grid=False)

# Top panel: scores of both strategies for each trial
non_nested_scores_line = go.Scatter(y=non_nested_scores,
                                    mode='lines',
                                    line=dict(color='red', width=2),
                                    name="Non-Nested CV")
fig.append_trace(non_nested_scores_line, 1, 1)
nested_line = go.Scatter(y=nested_scores,
                         mode='lines',
                         line=dict(color='blue', width=2),
                         name="Nested CV")
fig.append_trace(nested_line, 1, 1)
fig['layout']['yaxis1'].update(title='score')
fig['layout'].update(
    title='Non-Nested and Nested Cross Validation on Iris Dataset',
    hovermode='closest', height=700)

# Bottom panel: bar chart of the difference.
difference_plot = go.Bar(x=list(range(NUM_TRIALS)),
                         y=score_difference,
                         showlegend=False)
fig.append_trace(difference_plot, 2, 1)
fig['layout']['xaxis2'].update(title='Individual Trial #')
fig['layout']['yaxis2'].update(title='score difference')

In [5]:
py.iplot(fig)

Out[5]:
