
# Pipelining in Scikit-learn

Pipelining: chaining a PCA and a logistic regression

The PCA does an unsupervised dimensionality reduction, while the logistic regression does the prediction.
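As a minimal sketch of that chaining, the snippet below builds the two-step pipeline and scores it on a held-out split of the digits data. The train/test split, `n_components=20`, and `max_iter=2000` are illustrative assumptions and are not part of the original example.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Assumed, untuned values chosen only for illustration
pipe = Pipeline(steps=[
    ("pca", PCA(n_components=20)),                    # unsupervised reduction
    ("logistic", LogisticRegression(max_iter=2000)),  # supervised prediction
])

# fit() runs PCA.fit_transform on the training data, then fits the classifier
pipe.fit(X_train, y_train)

# The fitted PCA step can be inspected on its own
reduced = pipe.named_steps["pca"].transform(X_test)
score = pipe.score(X_test, y_test)
print(reduced.shape, score)
```

Calling `pipe.score` applies the PCA fitted on the training data to the test data before classifying, so the reduction is never refit on held-out samples.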


### Version

In [1]:
import sklearn
sklearn.__version__

Out[1]:
'0.18'

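### Imports

This tutorial uses Pipeline to chain the two estimators and GridSearchCV to select the number of PCA components and the regularization strength C of the logistic regression.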

In [2]:
import plotly.plotly as py
import plotly.graph_objs as go

import numpy as np
from sklearn import linear_model, decomposition, datasets
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV


### Calculations

In [3]:
logistic = linear_model.LogisticRegression()

pca = decomposition.PCA()
# Chain the two estimators: the PCA output feeds the logistic regression
pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])

# Load the handwritten digits dataset
digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target


### PCA Spectrum Plot

In [4]:
pca.fit(X_digits)

trace1 = go.Scatter(
    y=pca.explained_variance_,
    mode="lines",
    line=dict(width=2, color='blue'),
    name="PCA Spectrum",
)
layout1 = go.Layout(
    xaxis=dict(title="n_components"),
    yaxis=dict(title="explained_variance_"),
)
fig1 = go.Figure(data=[trace1], layout=layout1)
py.iplot(fig1, filename="PCA-Spectrum")

Out[4]:

### Prediction Plot

In [5]:
n_components = [20, 40, 64]
Cs = np.logspace(-4, 4, 3)

# Parameters of pipelines can be set using '__'-separated parameter names:
estimator = GridSearchCV(pipe,
                         dict(pca__n_components=n_components,
                              logistic__C=Cs))

estimator.fit(X_digits, y_digits)
# Number of PCA components selected by the grid search
x_ = estimator.best_estimator_.named_steps['pca'].n_components

# Vertical dotted line marking the chosen n_components
trace2 = go.Scatter(
    x=[x_, x_],
    y=[0, 1],
    mode="lines",
    line=dict(width=2, dash='dot'),
    name="n_components chosen",
)
layout2 = go.Layout(showlegend=True)
fig2 = go.Figure(data=[trace2], layout=layout2)

py.iplot(fig2, filename="Prediction")

Out[5]:
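The `'__'` naming convention used in the grid above can be sketched in isolation. In this small example the step names `pca` and `logistic` match the pipeline defined earlier, and the values passed to `set_params` are arbitrary illustrations.

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipe = Pipeline(steps=[("pca", PCA()), ("logistic", LogisticRegression())])

# '<step name>__<parameter name>' addresses a parameter of a single step
pipe.set_params(pca__n_components=20, logistic__C=0.01)

# get_params() exposes the same flattened names
params = pipe.get_params()
print(params["pca__n_components"], params["logistic__C"])

# These are the keys GridSearchCV expects in its parameter grid
param_grid = {
    "pca__n_components": [20, 40, 64],
    "logistic__C": [1e-4, 1.0, 1e4],  # same values as np.logspace(-4, 4, 3)
}
valid_keys = all(key in params for key in param_grid)
print(valid_keys)
```

Because the keys are derived from the step names, renaming a step in the `Pipeline` constructor also requires renaming the matching keys in the grid.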

### Combined Plot

In [6]:
trace2 = go.Scatter(
    x=[x_, x_],
    y=[0, 178],
    mode="lines",
    line=dict(width=1, dash='dot', color="rgb(10, 10, 240)"),
    name="n_components chosen",
)
layout3 = go.Layout(
    xaxis=dict(title="n_components"),
    yaxis=dict(title="explained_variance_"),
)
fig3 = go.Figure(data=[trace1, trace2], layout=layout3)
py.iplot(fig3, filename="pipeline")

Out[6]:

Code source: Gaël Varoquaux

License: BSD 3 clause