
# FeatureHasher and DictVectorizer Comparison in Scikit-learn

This example compares FeatureHasher and DictVectorizer by using both to vectorize text documents.

The example demonstrates syntax and speed only; it doesn't actually do anything useful with the extracted vectors. See the example scripts {document_classification_20newsgroups,clustering}.py for actual learning on text documents.

A discrepancy between the number of terms reported for DictVectorizer and for FeatureHasher is to be expected due to hash collisions.
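
For intuition, here is a minimal sketch (separate from the benchmark below) of how a deliberately tiny hash space forces distinct tokens into the same column; the fruit tokens are made up for illustration:

from sklearn.feature_extraction import FeatureHasher

# 9 distinct tokens into 8 columns: at least one collision by pigeonhole.
# FeatureHasher also uses a signed hash, so colliding tokens may partially cancel.
tiny_hasher = FeatureHasher(n_features=8, input_type="string")
X = tiny_hasher.transform([["apple", "banana", "cherry", "date", "elderberry",
                            "fig", "grape", "honeydew", "kiwi"]])
print(X.toarray())  # fewer than 9 non-zero entries, since some tokens share a column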

#### New to Plotly?

You can set up Plotly to work in online or offline mode, or in Jupyter notebooks.
We also have a quick-reference cheatsheet (new!) to help you get started!

### Version

In [1]:
import sklearn
sklearn.__version__

Out[1]:
'0.18.1'

### Imports

This tutorial imports fetch_20newsgroups, DictVectorizer and FeatureHasher.

In [2]:
from __future__ import print_function
from collections import defaultdict
import re
import sys
from time import time

import numpy as np

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction import DictVectorizer, FeatureHasher


### Calculations

In [3]:
def n_nonzero_columns(X):
    """Returns the number of non-zero columns in a CSR matrix X."""
    return len(np.unique(X.nonzero()[1]))


def tokens(doc):
    """Extract tokens from doc.

    This uses a simple regex to break strings into tokens. For a more
    principled approach, see CountVectorizer or TfidfVectorizer.
    """
    return (tok.lower() for tok in re.findall(r"\w+", doc))


def token_freqs(doc):
    """Extract a dict mapping tokens from doc to their frequencies."""
    freq = defaultdict(int)
    for tok in tokens(doc):
        freq[tok] += 1
    return freq


categories = [
    'alt.atheism',
    'comp.graphics',
    'comp.sys.ibm.pc.hardware',
    'misc.forsale',
    'rec.autos',
    'sci.space',
    'talk.religion.misc',
]
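
As a quick illustration (not part of the benchmark), token_freqs turns a raw string into the token-to-count mapping that both vectorizers consume:

print(dict(token_freqs("To be, or not to be")))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}  (key order may vary)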


To use a larger set (11k+ documents), set categories = None.

The default number of features is 2**18.

In [4]:
n_features = 2 ** 18

print("Loading 20 newsgroups training data")
raw_data = fetch_20newsgroups(subset='train', categories=categories).data
data_size_mb = sum(len(s.encode('utf-8')) for s in raw_data) / 1e6
print("%d documents - %0.3fMB" % (len(raw_data), data_size_mb))
print()

Loading 20 newsgroups training data
3803 documents - 6.245MB
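
To get a feel for the input, you can peek at the start of one raw post (output omitted here):

print(raw_data[0][:300])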


In [5]:
print("DictVectorizer")
t0 = time()
vectorizer = DictVectorizer()
vectorizer.fit_transform(token_freqs(d) for d in raw_data)
duration = time() - t0
print("done in %fs at %0.3fMB/s" % (duration, data_size_mb / duration))
print("Found %d unique terms" % len(vectorizer.get_feature_names()))
print()

DictVectorizer
done in 1.647244s at 3.791MB/s
Found 47918 unique terms
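
Unlike the hasher, DictVectorizer builds an explicit vocabulary while fitting, so column indices can be mapped back to terms. A quick sketch using the fitted vectorizer above ('space' is just an assumed example term):

feature_names = vectorizer.get_feature_names()
print(feature_names[:5])                 # first few learned terms (sorted)
print(vectorizer.vocabulary_['space'])   # column index, assuming 'space' occurs in the corpus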


In [6]:
print("FeatureHasher on frequency dicts")
t0 = time()
hasher = FeatureHasher(n_features=n_features)
X = hasher.transform(token_freqs(d) for d in raw_data)
duration = time() - t0
print("done in %fs at %0.3fMB/s" % (duration, data_size_mb / duration))
print("Found %d unique terms" % n_nonzero_columns(X))
print()

FeatureHasher on frequency dicts
done in 1.319619s at 4.732MB/s
Found 43865 unique terms
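
Because hashing is stateless, the same hasher can transform new batches without any fitting, which is what makes it suitable for streaming and out-of-core learning. A small sketch reusing the hasher above:

X_batch = hasher.transform(token_freqs(d) for d in raw_data[:100])
print(X_batch.shape)  # (100, 262144): the width is fixed by n_features, not by the data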


In [7]:
print("FeatureHasher on raw tokens")
t0 = time()
hasher = FeatureHasher(n_features=n_features, input_type="string")
X = hasher.transform(tokens(d) for d in raw_data)
duration = time() - t0
print("done in %fs at %0.3fMB/s" % (duration, data_size_mb / duration))
print("Found %d unique terms" % n_nonzero_columns(X))

FeatureHasher on raw tokens
done in 1.444398s at 4.323MB/s
Found 43865 unique terms
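
These counts make the collision caveat from the introduction concrete: DictVectorizer found 47918 distinct terms, while the hashed matrix has only 43865 non-zero columns, so at least 4053 terms must share a column (sign cancellation in the signed hash could only push the true number higher). Using the figures from the runs above:

n_terms = 47918   # unique terms reported by DictVectorizer
n_cols = 43865    # non-zero columns reported for FeatureHasher
print("at least %d collisions" % (n_terms - n_cols))  # at least 4053 collisions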


Author:

    Lars Buitinck

License:

    BSD 3 clause