Dimensionality Reduction in Scikit-learn

Selecting dimensionality reduction with Pipeline and GridSearchCV

This example constructs a pipeline that performs dimensionality reduction followed by prediction with a support vector classifier. It demonstrates how GridSearchCV and Pipeline can optimize over different classes of estimators in a single CV run: unsupervised PCA and NMF dimensionality reductions are compared to univariate feature selection during the grid search.
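A single grid search can range over different classes of estimators because a Pipeline step can itself be treated as a parameter. The following is a minimal sketch of that mechanism, assuming a scikit-learn version in which Pipeline.set_params accepts a step name. The step names reduce_dim and classify mirror the ones used later in this tutorial, and nothing is fitted here.

```python
from sklearn.decomposition import NMF, PCA
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipe = Pipeline([
    ('reduce_dim', PCA()),
    ('classify', LinearSVC()),
])

# Replace the whole 'reduce_dim' step with an estimator of a different class.
pipe.set_params(reduce_dim=NMF())
print(type(pipe.named_steps['reduce_dim']).__name__)

# Replace the step and set parameters of individual steps in one call,
# using the <step name>__<parameter> naming convention.
pipe.set_params(reduce_dim=SelectKBest(chi2), reduce_dim__k=4, classify__C=10)
print(pipe.named_steps['reduce_dim'].k, pipe.named_steps['classify'].C)
```

GridSearchCV applies each parameter combination to a clone of the pipeline in this same way. A grid entry such as 'reduce_dim': [PCA(), NMF()] therefore swaps the estimator, while 'reduce_dim__n_components' tunes whichever estimator currently occupies that step.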

Version

In [1]:
import sklearn
sklearn.__version__
Out[1]:
'0.18'

Imports

This tutorial imports load_digits, GridSearchCV, Pipeline, LinearSVC, PCA, NMF, SelectKBest, and chi2 from scikit-learn, along with NumPy for the calculations and Plotly for the plot.

In [2]:
# __future__ imports must come before any other statement when run as a script
from __future__ import print_function, division

import plotly.plotly as py
import plotly.graph_objs as go

import numpy as np

from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.decomposition import PCA, NMF
from sklearn.feature_selection import SelectKBest, chi2

Calculations

In [3]:
print(__doc__)

pipe = Pipeline([
    ('reduce_dim', PCA()),
    ('classify', LinearSVC())
])

N_FEATURES_OPTIONS = [2, 4, 8]
C_OPTIONS = [1, 10, 100, 1000]
param_grid = [
    {
        'reduce_dim': [PCA(iterated_power=7), NMF()],
        'reduce_dim__n_components': N_FEATURES_OPTIONS,
        'classify__C': C_OPTIONS
    },
    {
        'reduce_dim': [SelectKBest(chi2)],
        'reduce_dim__k': N_FEATURES_OPTIONS,
        'classify__C': C_OPTIONS
    },
]
reducer_labels = ['PCA', 'NMF', 'KBest(chi2)']

grid = GridSearchCV(pipe, cv=3, n_jobs=2, param_grid=param_grid)
digits = load_digits()
grid.fit(digits.data, digits.target)

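mean_scores = np.array(grid.cv_results_['mean_test_score'])
# Scores follow param_grid iteration order: all candidates from the first
# dict (PCA, NMF) come first, then those from the second dict (KBest).
# Within each dict the keys are iterated alphabetically, so C varies slowest.
n_c, n_feat = len(C_OPTIONS), len(N_FEATURES_OPTIONS)
n_first = n_c * 2 * n_feat  # number of candidates in the first dict
mean_scores = np.concatenate(
    [mean_scores[:n_first].reshape(n_c, 2, n_feat),
     mean_scores[n_first:].reshape(n_c, 1, n_feat)], axis=1)
# select score for best C
mean_scores = mean_scores.max(axis=0)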
bar_offsets = (np.arange(len(N_FEATURES_OPTIONS)) *
               (len(reducer_labels) + 1) + .5)
Automatically created module for IPython interactive environment
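To see which candidate each entry of cv_results_['mean_test_score'] belongs to, it can help to enumerate the grid without fitting anything. The sketch below assumes that GridSearchCV visits candidates in the same order as sklearn.model_selection.ParameterGrid. The helper describe is a hypothetical name introduced only for this illustration.

```python
from sklearn.decomposition import NMF, PCA
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import ParameterGrid

N_FEATURES_OPTIONS = [2, 4, 8]
C_OPTIONS = [1, 10, 100, 1000]
param_grid = [
    {
        'reduce_dim': [PCA(iterated_power=7), NMF()],
        'reduce_dim__n_components': N_FEATURES_OPTIONS,
        'classify__C': C_OPTIONS,
    },
    {
        'reduce_dim': [SelectKBest(chi2)],
        'reduce_dim__k': N_FEATURES_OPTIONS,
        'classify__C': C_OPTIONS,
    },
]

# Expand the list of dicts into the full list of candidate settings.
candidates = list(ParameterGrid(param_grid))
print(len(candidates))


def describe(params):
    """Hypothetical helper: summarise a candidate as (C, reducer, size)."""
    reducer = type(params['reduce_dim']).__name__
    size = params.get('reduce_dim__n_components', params.get('reduce_dim__k'))
    return (params['classify__C'], reducer, size)


order = [describe(p) for p in candidates]
# Print a few positions to show how the candidates are laid out.
for position in (0, 3, 6, 24, 27):
    print(position, order[position])
```

The grid expands to 4 × 2 × 3 + 4 × 1 × 3 = 36 candidates. The 24 PCA and NMF candidates from the first dictionary come first, followed by the 12 SelectKBest candidates from the second. Within each dictionary the regularisation parameter C changes slowest and the number of features changes fastest.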

Plotting Comparison of feature reduction techniques

In [4]:
data = []
COLORS = ['blue', 'green', 'red']
for i, (label, reducer_scores) in enumerate(zip(reducer_labels, mean_scores)):
    # one bar trace per reducer, shifted by i so the bars sit side by side
    trace = go.Bar(x=bar_offsets + i, y=reducer_scores, name=label,
                   marker=dict(color=COLORS[i]))
    data.append(trace)
layout = go.Layout(
    title="Comparing feature reduction techniques",
    xaxis=dict(
        dtick=2,
        title="Reduced number of features"),
    yaxis=dict(
        title="Digit classification accuracy",
        range=[0, 1]))
fig = go.Figure(data=data, layout=layout)

py.iplot(fig, filename="dimensionality-reduction")
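Out[4]:
[Figure: grouped bar chart "Comparing feature reduction techniques"; x-axis: reduced number of features; y-axis: digit classification accuracy; one bar series each for PCA, NMF, and KBest(chi2)]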

License

Authors:

    Robert McGibbon
    Joel Nothman