
Cross-Validation on Diabetes Dataset Exercise in Scikit-learn

A tutorial exercise that uses cross-validation with linear models.

This exercise is used in the Cross-validated estimators part of the Model selection: choosing estimators and their parameters section of A tutorial on statistical-learning for scientific data processing.

New to Plotly?

Plotly's Python library is free and open source! Get started by downloading the client and reading the primer.
You can set up Plotly to work in online or offline mode, or in Jupyter notebooks.
We also have a quick-reference cheatsheet (new!) to help you get started!

Version

In [1]:
import sklearn
sklearn.__version__
Out[1]:
'0.18.1'

Imports

This tutorial imports LassoCV, Lasso, KFold, and cross_val_score.

In [2]:
from __future__ import print_function
print(__doc__)

import plotly.plotly as py
import plotly.graph_objs as go

import numpy as np
from sklearn import datasets
from sklearn.linear_model import LassoCV
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
Automatically created module for IPython interactive environment

Calculations

In [3]:
diabetes = datasets.load_diabetes()
X = diabetes.data[:150]   # use only the first 150 samples
y = diabetes.target[:150]

lasso = Lasso(random_state=0)
alphas = np.logspace(-4, -0.5, 30)   # 30 candidate alphas, log-spaced

scores = list()
scores_std = list()

n_folds = 3

# cross-validate the lasso at each candidate alpha and record the
# mean and standard deviation of the fold scores
for alpha in alphas:
    lasso.alpha = alpha
    this_scores = cross_val_score(lasso, X, y, cv=n_folds, n_jobs=1)
    scores.append(np.mean(this_scores))
    scores_std.append(np.std(this_scores))

scores, scores_std = np.array(scores), np.array(scores_std)
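
As a quick sanity check (a minimal sketch, not part of the original exercise), you can read off the alpha that maximizes the mean cross-validation score directly from the arrays computed above:

# index of the alpha with the highest mean CV score
best_idx = np.argmax(scores)
print("best alpha: {0:.5f}".format(alphas[best_idx]))
print("mean CV score there: {0:.5f} (std {1:.5f})".format(
    scores[best_idx], scores_std[best_idx]))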

Plot Cross-Validation with Linear Models

In [4]:
# mean CV score as a function of alpha
p1 = go.Scatter(x=alphas, y=scores,
                mode='lines',
                line=dict(color='blue'),
                fill='tonexty'
               )

# plot error lines showing +/- std. errors of the scores
std_error = scores_std / np.sqrt(n_folds)

p2 = go.Scatter(x=alphas, y=scores + std_error, 
                mode='lines',
                line=dict(color='blue', dash='dash'),
                )

p3 = go.Scatter(x=alphas, y=scores - std_error,
                mode='lines',
                line=dict(color='blue', dash='dash'),
                fill='tonexty')

# horizontal reference line at the maximum mean CV score
line = go.Scatter(y=[np.max(scores), np.max(scores)],
                  x=[min(alphas), max(alphas)],
                  mode='lines',
                  line=dict(color='black', dash='dash',
                            width=1),
                  )


layout = go.Layout(xaxis=dict(title='alpha', type='log'),
                   yaxis=dict(title='CV score +/- std error'),
                   showlegend=False
                   )
# trace order matters for 'tonexty': p1 fills against p2 (upper band),
# p3 fills against p1 (lower band), producing the +/- std error band
fig = go.Figure(data=[p2, p1, p3, line], layout=layout)
In [5]:
py.iplot(fig)
Out[5]:
[Interactive Plotly figure: mean CV score +/- std error vs. alpha (log scale)]
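
py.iplot renders the figure via your Plotly online account. If you prefer the offline mode mentioned above, a minimal sketch (assuming a Jupyter notebook environment) is:

from plotly.offline import init_notebook_mode, iplot

init_notebook_mode(connected=True)  # load plotly.js into the notebook
iplot(fig)                          # render the same figure inline, no account needed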
Bonus Question

Bonus: how much can you trust the selection of alpha?

In [6]:
# To answer this question we use the LassoCV object that sets its alpha
# parameter automatically from the data by internal cross-validation (i.e. it
# performs cross-validation on the training data it receives).
# We use external cross-validation to see how much the automatically obtained
# alphas differ across different cross-validation folds.
lasso_cv = LassoCV(alphas=alphas, random_state=0)
k_fold = KFold(3)

print("Answer to the bonus question:",
      "how much can you trust the selection of alpha?")
print()
print("Alpha parameters maximising the generalization score on different")
print("subsets of the data:")
for k, (train, test) in enumerate(k_fold.split(X, y)):
    lasso_cv.fit(X[train], y[train])
    print("[fold {0}] alpha: {1:.5f}, score: {2:.5f}".
          format(k, lasso_cv.alpha_, lasso_cv.score(X[test], y[test])))
print()
print("Answer: Not very much since we obtained different alphas for different")
print("subsets of the data and moreover, the scores for these alphas differ")
print("quite substantially.")
Answer to the bonus question: how much can you trust the selection of alpha?

Alpha parameters maximising the generalization score on different
subsets of the data:
[fold 0] alpha: 0.10405, score: 0.53573
[fold 1] alpha: 0.05968, score: 0.16278
[fold 2] alpha: 0.10405, score: 0.44437

Answer: Not very much since we obtained different alphas for different
subsets of the data and moreover, the scores for these alphas differ
quite substantially.
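
A more compact way to quantify this instability (a sketch building on the objects defined above, not part of the original exercise) is to nest LassoCV inside cross_val_score, so the inner loop selects alpha while the outer loop estimates generalization performance:

# inner CV (inside LassoCV) picks alpha on each training split;
# outer CV (cross_val_score) scores the fitted model on held-out data
nested_scores = cross_val_score(LassoCV(alphas=alphas, random_state=0),
                                X, y, cv=k_fold)
print("Nested CV scores:", nested_scores)
print("Mean: {0:.5f} (std {1:.5f})".format(nested_scores.mean(),
                                           nested_scores.std()))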