Show Sidebar Hide Sidebar

Concatenating Multiple Feature Extraction Methods in Scikit-learn

In many real-world examples, there are many ways to extract features from a dataset. Often it is beneficial to combine several methods to obtain good performance. This example shows how to use FeatureUnion to combine features obtained by PCA and univariate selection.

Combining features using this transformer has the benefit that it allows cross validation and grid searches over the whole process.

The combination used in this example is not particularly helpful on this dataset and is only used to illustrate the usage of FeatureUnion.

New to Plotly?

Plotly's Python library is free and open source! Get started by downloading the client and reading the primer.
You can set up Plotly to work in online or offline mode, or in jupyter notebooks.
We also have a quick-reference cheatsheet (new!) to help you get started!

Version

In [1]:
import sklearn
print sklearn.__version__
0.17.1

Imports

In [2]:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

Calculations

In [3]:
iris = load_iris()

X, y = iris.data, iris.target

# This dataset is way to high-dimensional. Better do PCA:
pca = PCA(n_components=2)

# Maybe some original features where good, too?
selection = SelectKBest(k=1)

# Build estimator from PCA and Univariate selection:

combined_features = FeatureUnion([("pca", pca), ("univ_select", selection)])

# Use combined features to transform dataset:
X_features = combined_features.fit(X, y).transform(X)

svm = SVC(kernel="linear")

# Do grid search over k, n_components and C:

pipeline = Pipeline([("features", combined_features), ("svm", svm)])

param_grid = dict(features__pca__n_components=[1, 2, 3],
                  features__univ_select__k=[1, 2],
                  svm__C=[0.1, 1, 10])

grid_search = GridSearchCV(pipeline, param_grid=param_grid, verbose=10)
grid_search.fit(X, y)
print(grid_search.best_estimator_)
Fitting 3 folds for each of 18 candidates, totalling 54 fits
[CV] features__pca__n_components=1, svm__C=0.1, features__univ_select__k=1
[CV]  features__pca__n_components=1, svm__C=0.1, features__univ_select__k=1, score=0.960784 -   0.0s
[CV] features__pca__n_components=1, svm__C=0.1, features__univ_select__k=1
[CV]  features__pca__n_components=1, svm__C=0.1, features__univ_select__k=1, score=0.901961 -   0.0s
[CV] features__pca__n_components=1, svm__C=0.1, features__univ_select__k=1
[CV]  features__pca__n_components=1, svm__C=0.1, features__univ_select__k=1, score=0.979167 -   0.0s
[CV] features__pca__n_components=1, svm__C=1, features__univ_select__k=1
[CV]  features__pca__n_components=1, svm__C=1, features__univ_select__k=1, score=0.941176 -   0.0s
[CV] features__pca__n_components=1, svm__C=1, features__univ_select__k=1
[CV]  features__pca__n_components=1, svm__C=1, features__univ_select__k=1, score=0.921569 -   0.0s
[CV] features__pca__n_components=1, svm__C=1, features__univ_select__k=1
[CV]  features__pca__n_components=1, svm__C=1, features__univ_select__k=1, score=0.979167 -   0.0s
[CV] features__pca__n_components=1, svm__C=10, features__univ_select__k=1
[CV]  features__pca__n_components=1, svm__C=10, features__univ_select__k=1, score=0.960784 -   0.0s
[CV] features__pca__n_components=1, svm__C=10, features__univ_select__k=1
[CV]  features__pca__n_components=1, svm__C=10, features__univ_select__k=1, score=0.921569 -   0.0s
[CV] features__pca__n_components=1, svm__C=10, features__univ_select__k=1
[CV]  features__pca__n_components=1, svm__C=10, features__univ_select__k=1, score=0.979167 -   0.0s
[CV] features__pca__n_components=1, svm__C=0.1, features__univ_select__k=2
[CV]  features__pca__n_components=1, svm__C=0.1, features__univ_select__k=2, score=0.960784 -   0.0s
[CV] features__pca__n_components=1, svm__C=0.1, features__univ_select__k=2
[CV]  features__pca__n_components=1, svm__C=0.1, features__univ_select__k=2, score=0.921569 -   0.0s
[CV] features__pca__n_components=1, svm__C=0.1, features__univ_select__k=2
[CV]  features__pca__n_components=1, svm__C=0.1, features__univ_select__k=2, score=0.979167 -   0.0s
[CV] features__pca__n_components=1, svm__C=1, features__univ_select__k=2
[CV]  features__pca__n_components=1, svm__C=1, features__univ_select__k=2, score=0.960784 -   0.0s
[CV] features__pca__n_components=1, svm__C=1, features__univ_select__k=2
[CV]  features__pca__n_components=1, svm__C=1, features__univ_select__k=2, score=0.921569 -   0.0s
[CV] features__pca__n_components=1, svm__C=1, features__univ_select__k=2
[CV]  features__pca__n_components=1, svm__C=1, features__univ_select__k=2, score=1.000000 -   0.0s
[CV] features__pca__n_components=1, svm__C=10, features__univ_select__k=2
[CV]  features__pca__n_components=1, svm__C=10, features__univ_select__k=2, score=0.980392 -   0.0s
[CV] features__pca__n_components=1, svm__C=10, features__univ_select__k=2
[CV]  features__pca__n_components=1, svm__C=10, features__univ_select__k=2, score=0.901961 -   0.0s
[CV] features__pca__n_components=1, svm__C=10, features__univ_select__k=2
[CV]  features__pca__n_components=1, svm__C=10, features__univ_select__k=2, score=1.000000 -   0.0s
[CV] features__pca__n_components=2, svm__C=0.1, features__univ_select__k=1
[CV]  features__pca__n_components=2, svm__C=0.1, features__univ_select__k=1, score=0.960784 -   0.0s
[CV] features__pca__n_components=2, svm__C=0.1, features__univ_select__k=1
[CV]  features__pca__n_components=2, svm__C=0.1, features__univ_select__k=1, score=0.901961 -   0.0s
[CV] features__pca__n_components=2, svm__C=0.1, features__univ_select__k=1
[CV]  features__pca__n_components=2, svm__C=0.1, features__univ_select__k=1, score=0.979167 -   0.0s
[CV] features__pca__n_components=2, svm__C=1, features__univ_select__k=1
[CV]  features__pca__n_components=2, svm__C=1, features__univ_select__k=1, score=0.980392 -   0.0s
[CV] features__pca__n_components=2, svm__C=1, features__univ_select__k=1
[CV]  features__pca__n_components=2, svm__C=1, features__univ_select__k=1, score=0.941176 -   0.0s
[CV] features__pca__n_components=2, svm__C=1, features__univ_select__k=1
[CV]  features__pca__n_components=2, svm__C=1, features__univ_select__k=1, score=0.979167 -   0.0s
[CV] features__pca__n_components=2, svm__C=10, features__univ_select__k=1
[CV]  features__pca__n_components=2, svm__C=10, features__univ_select__k=1, score=0.980392 -   0.0s
[CV] features__pca__n_components=2, svm__C=10, features__univ_select__k=1
[CV]  features__pca__n_components=2, svm__C=10, features__univ_select__k=1, score=0.941176 -   0.0s
[CV] features__pca__n_components=2, svm__C=10, features__univ_select__k=1
[CV]  features__pca__n_components=2, svm__C=10, features__univ_select__k=1, score=0.979167 -   0.0s
[CV] features__pca__n_components=2, svm__C=0.1, features__univ_select__k=2
[Parallel(n_jobs=1)]: Done   1 tasks       | elapsed:    0.0s
[Parallel(n_jobs=1)]: Done   4 tasks       | elapsed:    0.0s
[Parallel(n_jobs=1)]: Done   7 tasks       | elapsed:    0.1s
[Parallel(n_jobs=1)]: Done  12 tasks       | elapsed:    0.1s
[Parallel(n_jobs=1)]: Done  17 tasks       | elapsed:    0.1s
[Parallel(n_jobs=1)]: Done  24 tasks       | elapsed:    0.2s
[CV]  features__pca__n_components=2, svm__C=0.1, features__univ_select__k=2, score=0.980392 -   0.0s
[CV] features__pca__n_components=2, svm__C=0.1, features__univ_select__k=2
[CV]  features__pca__n_components=2, svm__C=0.1, features__univ_select__k=2, score=0.941176 -   0.0s
[CV] features__pca__n_components=2, svm__C=0.1, features__univ_select__k=2
[CV]  features__pca__n_components=2, svm__C=0.1, features__univ_select__k=2, score=0.979167 -   0.0s
[CV] features__pca__n_components=2, svm__C=1, features__univ_select__k=2
[CV]  features__pca__n_components=2, svm__C=1, features__univ_select__k=2, score=1.000000 -   0.0s
[CV] features__pca__n_components=2, svm__C=1, features__univ_select__k=2
[CV]  features__pca__n_components=2, svm__C=1, features__univ_select__k=2, score=0.960784 -   0.0s
[CV] features__pca__n_components=2, svm__C=1, features__univ_select__k=2
[CV]  features__pca__n_components=2, svm__C=1, features__univ_select__k=2, score=0.979167 -   0.0s
[CV] features__pca__n_components=2, svm__C=10, features__univ_select__k=2
[CV]  features__pca__n_components=2, svm__C=10, features__univ_select__k=2, score=0.980392 -   0.0s
[CV] features__pca__n_components=2, svm__C=10, features__univ_select__k=2
[CV]  features__pca__n_components=2, svm__C=10, features__univ_select__k=2, score=0.921569 -   0.0s
[CV] features__pca__n_components=2, svm__C=10, features__univ_select__k=2
[CV]  features__pca__n_components=2, svm__C=10, features__univ_select__k=2, score=1.000000 -   0.0s
[CV] features__pca__n_components=3, svm__C=0.1, features__univ_select__k=1
[CV]  features__pca__n_components=3, svm__C=0.1, features__univ_select__k=1, score=0.980392 -   0.0s
[CV] features__pca__n_components=3, svm__C=0.1, features__univ_select__k=1
[CV]  features__pca__n_components=3, svm__C=0.1, features__univ_select__k=1, score=0.941176 -   0.0s
[CV] features__pca__n_components=3, svm__C=0.1, features__univ_select__k=1
[CV]  features__pca__n_components=3, svm__C=0.1, features__univ_select__k=1, score=0.979167 -   0.0s
[CV] features__pca__n_components=3, svm__C=1, features__univ_select__k=1
[CV]  features__pca__n_components=3, svm__C=1, features__univ_select__k=1, score=1.000000 -   0.0s
[CV] features__pca__n_components=3, svm__C=1, features__univ_select__k=1
[CV]  features__pca__n_components=3, svm__C=1, features__univ_select__k=1, score=0.941176 -   0.0s
[CV] features__pca__n_components=3, svm__C=1, features__univ_select__k=1
[CV]  features__pca__n_components=3, svm__C=1, features__univ_select__k=1, score=0.979167 -   0.0s
[CV] features__pca__n_components=3, svm__C=10, features__univ_select__k=1
[CV]  features__pca__n_components=3, svm__C=10, features__univ_select__k=1, score=1.000000 -   0.0s
[CV] features__pca__n_components=3, svm__C=10, features__univ_select__k=1
[CV]  features__pca__n_components=3, svm__C=10, features__univ_select__k=1, score=0.921569 -   0.0s
[CV] features__pca__n_components=3, svm__C=10, features__univ_select__k=1
[CV]  features__pca__n_components=3, svm__C=10, features__univ_select__k=1, score=1.000000 -   0.0s
[CV] features__pca__n_components=3, svm__C=0.1, features__univ_select__k=2
[CV]  features__pca__n_components=3, svm__C=0.1, features__univ_select__k=2, score=0.980392 -   0.0s
[CV] features__pca__n_components=3, svm__C=0.1, features__univ_select__k=2
[CV]  features__pca__n_components=3, svm__C=0.1, features__univ_select__k=2, score=0.941176 -   0.0s
[CV] features__pca__n_components=3, svm__C=0.1, features__univ_select__k=2
[CV]  features__pca__n_components=3, svm__C=0.1, features__univ_select__k=2, score=0.979167 -   0.0s
[CV] features__pca__n_components=3, svm__C=1, features__univ_select__k=2
[CV]  features__pca__n_components=3, svm__C=1, features__univ_select__k=2, score=1.000000 -   0.0s
[CV] features__pca__n_components=3, svm__C=1, features__univ_select__k=2
[CV]  features__pca__n_components=3, svm__C=1, features__univ_select__k=2, score=0.960784 -   0.0s
[CV] features__pca__n_components=3, svm__C=1, features__univ_select__k=2
[CV]  features__pca__n_components=3, svm__C=1, features__univ_select__k=2, score=0.979167 -   0.0s
[CV] features__pca__n_components=3, svm__C=10, features__univ_select__k=2
[CV]  features__pca__n_components=3, svm__C=10, features__univ_select__k=2, score=1.000000 -   0.0s
[CV] features__pca__n_components=3, svm__C=10, features__univ_select__k=2
[CV]  features__pca__n_components=3, svm__C=10, features__univ_select__k=2, score=0.921569 -   0.0s
[CV] features__pca__n_components=3, svm__C=10, features__univ_select__k=2
[CV]  features__pca__n_components=3, svm__C=10, features__univ_select__k=2, score=1.000000 -   0.0s
Pipeline(steps=[('features', FeatureUnion(n_jobs=1,
       transformer_list=[('pca', PCA(copy=True, n_components=2, whiten=False)), ('univ_select', SelectKBest(k=2, score_func=<function f_classif at 0x7f8c6948dc08>))],
       transformer_weights=None)), ('svm', SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])
[Parallel(n_jobs=1)]: Done  31 tasks       | elapsed:    0.2s
[Parallel(n_jobs=1)]: Done  40 tasks       | elapsed:    0.3s
[Parallel(n_jobs=1)]: Done  49 tasks       | elapsed:    0.4s
[Parallel(n_jobs=1)]: Done  54 out of  54 | elapsed:    0.4s finished

License

Author:

    Andreas Mueller (amueller@ais.uni-bonn.de)

License:

    BSD 3 clause
Still need help?
Contact Us

For guaranteed 24 hour response turnarounds, upgrade to a Developer Support Plan.