
# Dimensionality Reduction in Scikit-learn

Selecting dimensionality reduction with Pipeline and GridSearchCV

This example constructs a pipeline that does dimensionality reduction followed by prediction with a support vector classifier. It demonstrates the use of GridSearchCV and Pipeline to optimize over different classes of estimators in a single CV run – unsupervised PCA and NMF dimensionality reductions are compared to univariate feature selection during the grid search.
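The key pattern, used in full in the Calculations section below, is that a named pipeline step can itself be a grid-search parameter: listing estimator instances under the step's name in param_grid lets GridSearchCV swap whole reducers in and out. A minimal sketch of just that idea (the toy grid here is illustrative, not the grid used in this example):

from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import LinearSVC

# 'reduce_dim' is a placeholder step; GridSearchCV can replace the estimator
# bound to it just like any other hyperparameter.
pipe = Pipeline([('reduce_dim', PCA()), ('classify', LinearSVC())])

# Each dict in param_grid is expanded separately, so parameters that exist on
# only one reducer (n_components vs. k) never clash with each other.
toy_grid = [
    {'reduce_dim': [PCA()], 'reduce_dim__n_components': [2, 4]},
    {'reduce_dim': [SelectKBest(chi2)], 'reduce_dim__k': [2, 4]},
]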

#### New to Plotly?

You can set up Plotly to work in online or offline mode, or in Jupyter notebooks.
We also have a quick-reference cheatsheet (new!) to help you get started!
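For the offline case, a typical notebook setup (a sketch using the legacy plotly.offline API that matches the plotly.plotly import used below; the exact calls vary between Plotly versions) looks roughly like this:

from plotly.offline import init_notebook_mode, iplot

init_notebook_mode(connected=True)  # render figures inline in the notebook
# iplot(fig) then draws a figure locally instead of uploading it to Plotly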

### Version

In [1]:
import sklearn
sklearn.__version__

Out[1]:
'0.18'

### Imports

This tutorial imports load_digits, GridSearchCV, Pipeline, LinearSVC, PCA, NMF, SelectKBest, and chi2.

In [2]:
import plotly.plotly as py
import plotly.graph_objs as go

from __future__ import print_function, division
import numpy as np

from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.decomposition import PCA, NMF
from sklearn.feature_selection import SelectKBest, chi2


### Calculations

In [3]:
print(__doc__)

pipe = Pipeline([
    ('reduce_dim', PCA()),
    ('classify', LinearSVC())
])

N_FEATURES_OPTIONS = [2, 4, 8]
C_OPTIONS = [1, 10, 100, 1000]
param_grid = [
    {
        'reduce_dim': [PCA(iterated_power=7), NMF()],
        'reduce_dim__n_components': N_FEATURES_OPTIONS,
        'classify__C': C_OPTIONS
    },
    {
        'reduce_dim': [SelectKBest(chi2)],
        'reduce_dim__k': N_FEATURES_OPTIONS,
        'classify__C': C_OPTIONS
    },
]
reducer_labels = ['PCA', 'NMF', 'KBest(chi2)']

grid = GridSearchCV(pipe, cv=3, n_jobs=2, param_grid=param_grid)
digits = load_digits()
grid.fit(digits.data, digits.target)

mean_scores = np.array(grid.cv_results_['mean_test_score'])
# scores are in the order of param_grid iteration, which is alphabetical
mean_scores = mean_scores.reshape(len(C_OPTIONS), -1, len(N_FEATURES_OPTIONS))
# select score for best C
mean_scores = mean_scores.max(axis=0)
bar_offsets = (np.arange(len(N_FEATURES_OPTIONS)) *
               (len(reducer_labels) + 1) + .5)

Automatically created module for IPython interactive environment
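Once the search has finished, the best parameter combination and its cross-validated score can be read directly off the fitted grid object; a quick check along these lines (not part of the original example) is often helpful before plotting:

# Which reducer, how many components/features, and which C won overall
print(grid.best_params_)
# Mean cross-validated accuracy of that winning combination
print(grid.best_score_)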


### Plotting the Comparison of Feature Reduction Techniques

In [4]:
data = []
COLORS = ['blue', 'green', 'red']
for i, (label, reducer_scores) in enumerate(zip(reducer_labels, mean_scores)):
    trace = go.Bar(x=bar_offsets + i, y=reducer_scores, name=label,
                   marker=dict(color=COLORS[i]))
    data.append(trace)

layout = go.Layout(
    title="Comparing feature reduction techniques",
    xaxis=dict(
        dtick=2,
        title="Reduced number of features"),
    yaxis=dict(
        title="Digit classification accuracy",
        range=[0, 1]))

fig = go.Figure(data=data, layout=layout)

py.iplot(fig, filename="dimensionality-reduction")

Out[4]:

[Interactive Plotly bar chart: "Comparing feature reduction techniques", showing digit classification accuracy for PCA, NMF, and KBest(chi2) at 2, 4, and 8 features]

Authors: Robert McGibbon, Joel Nothman