
Classification of Text Documents using Sparse Features in Scikit-learn

This example shows how scikit-learn can be used to classify documents by topic using a bag-of-words approach. It uses a scipy.sparse matrix to store the features and demonstrates various classifiers that can handle sparse matrices efficiently.

The dataset used in this example is the 20 newsgroups dataset. It will be automatically downloaded, then cached.

The bar plot shows the accuracy, training time (normalized), and test time (normalized) of each classifier.
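As a reminder of what the bag-of-words representation looks like, here is a minimal sketch on a made-up three-document corpus, showing that scikit-learn's vectorizers return a scipy.sparse matrix (one row per document, one column per vocabulary token) rather than a dense array:

```python
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus: three short "documents".
docs = ["the cat sat", "the dog barked", "the cat barked"]

# CountVectorizer builds the vocabulary and counts token occurrences.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# The result is sparse: 3 documents x 5 vocabulary terms.
print(sp.issparse(X), X.shape)   # True (3, 5)
```

The tutorial below uses TfidfVectorizer and HashingVectorizer instead, but both likewise produce sparse matrices that the benchmarked classifiers consume directly.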

New to Plotly?

Plotly's Python library is free and open source! Get started by downloading the client and reading the primer.
You can set up Plotly to work in online or offline mode, or in jupyter notebooks.
We also have a quick-reference cheatsheet (new!) to help you get started!

Version

In [1]:
import sklearn
sklearn.__version__
Out[1]:
'0.18.1'

Imports

In [2]:
import plotly.plotly as py
import plotly.graph_objs as go

from __future__ import print_function

import logging
import numpy as np
from optparse import OptionParser
import sys
from time import time

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import RidgeClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import Perceptron
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import NearestCentroid
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.extmath import density
from sklearn import metrics

Calculations

Display progress logs on stdout

In [3]:
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(message)s')

Parse command-line arguments

In [4]:
op = OptionParser()
op.add_option("--report",
              action="store_true", dest="print_report",
              help="Print a detailed classification report.")
op.add_option("--chi2_select",
              action="store", type="int", dest="select_chi2",
              help="Select some number of features using a chi-squared test")
op.add_option("--confusion_matrix",
              action="store_true", dest="print_cm",
              help="Print the confusion matrix.")
op.add_option("--top10",
              action="store_true", dest="print_top10",
              help="Print ten most discriminative terms per class"
                   " for every classifier.")
op.add_option("--all_categories",
              action="store_true", dest="all_categories",
              help="Whether to use all categories or not.")
op.add_option("--use_hashing",
              action="store_true",
              help="Use a hashing vectorizer.")
op.add_option("--n_features",
              action="store", type=int, default=2 ** 16,
              help="n_features when using the hashing vectorizer.")
op.add_option("--filtered",
              action="store_true",
              help="Remove newsgroup information that is easily overfit: "
                   "headers, signatures, and quoting.")

op.print_help()
Usage: __main__.py [options]

Options:
  -h, --help            show this help message and exit
  --report              Print a detailed classification report.
  --chi2_select=SELECT_CHI2
                        Select some number of features using a chi-squared
                        test
  --confusion_matrix    Print the confusion matrix.
  --top10               Print ten most discriminative terms per class for
                        every classifier.
  --all_categories      Whether to use all categories or not.
  --use_hashing         Use a hashing vectorizer.
  --n_features=N_FEATURES
                        n_features when using the hashing vectorizer.
  --filtered            Remove newsgroup information that is easily overfit:
                        headers, signatures, and quoting.

To parse real command-line arguments, add

(opts, args) = op.parse_args()

and then read the option values:

all_categories = opts.all_categories
filtered = opts.filtered
use_hashing = opts.use_hashing
n_features = opts.n_features
select_chi2 = opts.select_chi2
print_cm = opts.print_cm
print_top10 = opts.print_top10
print_report = opts.print_report
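optparse has been deprecated since Python 2.7 in favor of the standard-library argparse module. As a sketch, here is an argparse equivalent covering just two of the options above (only the `--report` and `--chi2_select` flags are reproduced):

```python
import argparse

# argparse equivalent of two of the optparse options defined above.
parser = argparse.ArgumentParser()
parser.add_argument("--report", action="store_true", dest="print_report",
                    help="Print a detailed classification report.")
parser.add_argument("--chi2_select", type=int, dest="select_chi2",
                    help="Select some number of features using a chi-squared test")

# In a notebook, pass an explicit argument list instead of reading sys.argv.
opts = parser.parse_args(["--report", "--chi2_select", "10"])
print(opts.print_report, opts.select_chi2)   # True 10
```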

For this tutorial we hard-code the following values. Note that select_chi2 = 10 keeps only the ten highest-scoring features for all 20 classes, which is why the accuracies reported below are so low:

In [5]:
all_categories = True
filtered = True
use_hashing = True
n_features = 2 ** 16
select_chi2 = 10
print_cm = True
print_top10 = True
print_report = True

Load some categories from the training set

In [6]:
if all_categories:
    categories = None
else:
    categories = [
        'alt.atheism',
        'talk.religion.misc',
        'comp.graphics',
        'sci.space',
    ]
In [7]:
if filtered:
    remove = ('headers', 'footers', 'quotes')
else:
    remove = ()

print("Loading 20 newsgroups dataset for categories:")
print(categories if categories else "all")

data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=remove)

data_test = fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=42,
                               remove=remove)
print('data loaded')
Loading 20 newsgroups dataset for categories:
all
data loaded

The order of labels in target_names can differ from the order given in categories

In [8]:
target_names = data_train.target_names

def size_mb(docs):
    return sum(len(s.encode('utf-8')) for s in docs) / 1e6
In [9]:
data_train_size_mb = size_mb(data_train.data)
data_test_size_mb = size_mb(data_test.data)

print("%d documents - %0.3fMB (training set)" % (
    len(data_train.data), data_train_size_mb))
print("%d documents - %0.3fMB (test set)" % (
    len(data_test.data), data_test_size_mb))
11314 documents - 13.782MB (training set)
7532 documents - 8.262MB (test set)

Split a training set and a test set

In [10]:
y_train, y_test = data_train.target, data_test.target

print("Extracting features from the training data using a sparse vectorizer")
t0 = time()
if use_hashing:
    vectorizer = HashingVectorizer(stop_words='english', non_negative=True,
                                   n_features=n_features)
    X_train = vectorizer.transform(data_train.data)
else:
    vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                                 stop_words='english')
    X_train = vectorizer.fit_transform(data_train.data)
duration = time() - t0
print("done in %fs at %0.3fMB/s" % (duration, data_train_size_mb / duration))
print("n_samples: %d, n_features: %d" % X_train.shape)
print()

print("Extracting features from the test data using the same vectorizer")
t0 = time()
X_test = vectorizer.transform(data_test.data)
duration = time() - t0
print("done in %fs at %0.3fMB/s" % (duration, data_test_size_mb / duration))
print("n_samples: %d, n_features: %d" % X_test.shape)
print()
Extracting features from the training data using a sparse vectorizer
done in 3.209865s at 4.294MB/s
n_samples: 11314, n_features: 65536

Extracting features from the test data using the same vectorizer
done in 1.774443s at 4.656MB/s
n_samples: 7532, n_features: 65536

Mapping from integer feature name to original token string

In [11]:
if use_hashing:
    feature_names = None
else:
    feature_names = vectorizer.get_feature_names()

if select_chi2:
    print("Extracting %d best features by a chi-squared test" %
          select_chi2)
    t0 = time()
    ch2 = SelectKBest(chi2, k=select_chi2)
    X_train = ch2.fit_transform(X_train, y_train)
    X_test = ch2.transform(X_test)
    if feature_names:
        # keep selected feature names
        feature_names = [feature_names[i] for i
                         in ch2.get_support(indices=True)]
    print("done in %fs" % (time() - t0))
    print()

if feature_names:
    feature_names = np.asarray(feature_names)


def trim(s):
    """Trim string to fit on terminal (assuming 80-column display)"""
    return s if len(s) <= 80 else s[:77] + "..."
Extracting 10 best features by a chi-squared test
done in 0.427792s
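To see what SelectKBest with chi2 does here, the following is a small sketch on a made-up 4x3 count matrix with two classes; the two features whose counts differ most between the classes survive the selection:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Toy term-count matrix: 4 documents, 3 features, 2 classes.
X = np.array([[1, 0, 3],
              [0, 2, 1],
              [1, 0, 4],
              [0, 3, 0]])
y = np.array([0, 1, 0, 1])

# Keep the k=2 features with the highest chi-squared statistic.
selector = SelectKBest(chi2, k=2).fit(X, y)
print(selector.get_support(indices=True))   # [1 2]
```

Feature 0 has nearly the same counts in both classes, so it is dropped; features 1 and 2 are strongly class-dependent and are kept. The tutorial applies the same transform with k=10 to the 65536-dimensional hashed features.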

Benchmark classifiers

In [12]:
def benchmark(clf):
    print('_' * 80)
    print("Training: ")
    print(clf)
    t0 = time()
    clf.fit(X_train, y_train)
    train_time = time() - t0
    print("train time: %0.3fs" % train_time)

    t0 = time()
    pred = clf.predict(X_test)
    test_time = time() - t0
    print("test time:  %0.3fs" % test_time)

    score = metrics.accuracy_score(y_test, pred)
    print("accuracy:   %0.3f" % score)

    if hasattr(clf, 'coef_'):
        print("dimensionality: %d" % clf.coef_.shape[1])
        print("density: %f" % density(clf.coef_))

        if print_top10 and feature_names is not None:
            print("top 10 keywords per class:")
            for i, label in enumerate(target_names):
                top10 = np.argsort(clf.coef_[i])[-10:]
                print(trim("%s: %s" % (label, " ".join(feature_names[top10]))))
        print()

    if print_report:
        print("classification report:")
        print(metrics.classification_report(y_test, pred,
                                            target_names=target_names))

    if print_cm:
        print("confusion matrix:")
        print(metrics.confusion_matrix(y_test, pred))

    print()
    clf_descr = str(clf).split('(')[0]
    return clf_descr, score, train_time, test_time
In [13]:
results = []
for clf, name in (
        (RidgeClassifier(tol=1e-2, solver="lsqr"), "Ridge Classifier"),
        (Perceptron(n_iter=50), "Perceptron"),
        (PassiveAggressiveClassifier(n_iter=50), "Passive Aggressive"),
        (KNeighborsClassifier(n_neighbors=10), "kNN"),
        (RandomForestClassifier(n_estimators=100), "Random forest")):
    print('=' * 80)
    print(name)
    results.append(benchmark(clf))

for penalty in ["l2", "l1"]:
    print('=' * 80)
    print("%s penalty" % penalty.upper())
    # Train Liblinear model
    results.append(benchmark(LinearSVC(loss='l2', penalty=penalty,
                                            dual=False, tol=1e-3)))

    # Train SGD model
    results.append(benchmark(SGDClassifier(alpha=.0001, n_iter=50,
                                           penalty=penalty)))
================================================================================
Ridge Classifier
________________________________________________________________________________
Training: 
RidgeClassifier(alpha=1.0, class_weight=None, copy_X=True, fit_intercept=True,
        max_iter=None, normalize=False, random_state=None, solver='lsqr',
        tol=0.01)
train time: 0.216s
test time:  0.001s
accuracy:   0.186
dimensionality: 10
density: 1.000000

classification report:
                          precision    recall  f1-score   support

             alt.atheism       0.00      0.00      0.00       319
           comp.graphics       0.00      0.00      0.00       389
 comp.os.ms-windows.misc       0.53      0.49      0.51       394
comp.sys.ibm.pc.hardware       0.00      0.00      0.00       392
   comp.sys.mac.hardware       0.00      0.00      0.00       385
          comp.windows.x       0.00      0.00      0.00       395
            misc.forsale       0.00      0.00      0.00       390
               rec.autos       0.69      0.34      0.45       396
         rec.motorcycles       0.85      0.25      0.39       398
      rec.sport.baseball       0.00      0.00      0.00       397
        rec.sport.hockey       0.07      0.96      0.12       399
               sci.crypt       0.67      0.41      0.50       396
         sci.electronics       0.00      0.00      0.00       393
                 sci.med       0.00      0.00      0.00       396
               sci.space       0.56      0.30      0.39       394
  soc.religion.christian       0.48      0.42      0.45       398
      talk.politics.guns       0.56      0.17      0.26       364
   talk.politics.mideast       0.88      0.23      0.36       376
      talk.politics.misc       0.00      0.00      0.00       310
      talk.religion.misc       0.00      0.00      0.00       251

             avg / total       0.27      0.19      0.18      7532

confusion matrix:
[[  0   0   0   0   0   0   0   0   1   0 239   1   0   0   6  70   1   1
    0   0]
 [  0   0  32   0   0   0   0   2   0   0 329   4   0   0  20   0   1   1
    0   0]
 [  0   0 192   0   0   0   0   0   0   0 195   0   0   0   5   0   2   0
    0   0]
 [  0   0  38   0   0   0   0   0   0   0 340   3   0   0  11   0   0   0
    0   0]
 [  0   0   3   0   0   0   0   1   0   0 369   4   0   0   5   3   0   0
    0   0]
 [  0   0  64   0   0   0   0   0   1   0 312   8   0   0   8   1   0   1
    0   0]
 [  0   0  12   0   0   0   0  17   4   0 335   7   0   0  14   1   0   0
    0   0]
 [  0   0   5   0   0   0   0 133   2   0 252   1   0   0   1   0   2   0
    0   0]
 [  0   0   0   0   0   0   0  10 101   0 278   3   0   0   1   3   2   0
    0   0]
 [  0   0   0   0   0   0   0   1   1   0 378   8   0   0   2   5   2   0
    0   0]
 [  0   0   1   0   0   0   0   0   4   0 382   7   0   0   0   4   1   0
    0   0]
 [  0   0   2   0   0   0   0   2   0   0 227 161   0   0   2   1   1   0
    0   0]
 [  0   0   6   0   0   0   0  12   1   0 351  15   0   0   6   1   1   0
    0   0]
 [  0   0   1   0   0   0   0   1   1   0 377   2   0   0   1  11   1   1
    0   0]
 [  0   0   0   0   0   0   0   1   0   0 272   1   0   0 118   2   0   0
    0   0]
 [  0   0   1   0   0   0   0   0   0   0 218   3   0   0   3 168   0   5
    0   0]
 [  0   0   2   0   0   0   0   1   0   0 285   2   0   0   2   8  63   1
    0   0]
 [  0   0   0   0   0   0   0   6   3   0 256   9   0   0   2   9   6  85
    0   0]
 [  0   0   4   0   0   0   0   4   0   0 273   2   0   0   1   5  21   0
    0   0]
 [  0   0   2   0   0   0   0   2   0   0 180   1   0   0   1  55   8   2
    0   0]]

================================================================================
Perceptron
________________________________________________________________________________
Training: 
Perceptron(alpha=0.0001, class_weight=None, eta0=1.0, fit_intercept=True,
      n_iter=50, n_jobs=1, penalty=None, random_state=0, shuffle=True,
      verbose=0, warm_start=False)
train time: 0.417s
test time:  0.001s
accuracy:   0.129
dimensionality: 10
density: 1.000000

classification report:
                          precision    recall  f1-score   support

             alt.atheism       0.22      0.11      0.15       319
           comp.graphics       0.04      0.01      0.02       389
 comp.os.ms-windows.misc       0.00      0.00      0.00       394
comp.sys.ibm.pc.hardware       0.00      0.00      0.00       392
   comp.sys.mac.hardware       0.00      0.00      0.00       385
          comp.windows.x       0.07      0.02      0.03       395
            misc.forsale       0.00      0.00      0.00       390
               rec.autos       0.68      0.33      0.44       396
         rec.motorcycles       0.91      0.25      0.39       398
      rec.sport.baseball       0.00      0.00      0.00       397
        rec.sport.hockey       0.00      0.00      0.00       399
               sci.crypt       0.87      0.33      0.48       396
         sci.electronics       0.00      0.00      0.00       393
                 sci.med       0.00      0.00      0.00       396
               sci.space       0.64      0.29      0.40       394
  soc.religion.christian       0.05      0.78      0.10       398
      talk.politics.guns       0.62      0.12      0.20       364
   talk.politics.mideast       0.23      0.27      0.25       376
      talk.politics.misc       0.00      0.00      0.00       310
      talk.religion.misc       0.00      0.00      0.00       251

             avg / total       0.22      0.13      0.13      7532

confusion matrix:
[[ 35   0   0   0   0   0   0   0   1   0   0   1   0   0   5 274   0   3
    0   0]
 [  0   4   0   0   0   7   0   0   0   0   0   0   0   0  10 328   0  39
    1   0]
 [  0  80   0   0   0   1   0   0   0   0   0   0   0   0   3 194   2 113
    1   0]
 [  0   5   0   0   0   4   0   0   0   0   0   1   0   0  10 340   0  32
    0   0]
 [  0   2   0   0   0   4   0   1   0   0   0   0   0   0   4 372   0   2
    0   0]
 [  0  16   0   0   0   7   0   0   0   0   0   1   0   0   5 313   0  53
    0   0]
 [  1   1   0   0   0   6   0  19   3   0   0   1   0   0   5 335   0  19
    0   0]
 [  0   0   0   0   0   1   0 131   1   0   0   0   0   0   1 252   2   5
    3   0]
 [  2   0   0   0   0   3   0  12  99   0   0   0   0   0   1 280   0   1
    0   0]
 [  0   0   0   0   0   5   0   1   1   0   0   4   0   0   2 382   1   1
    0   0]
 [  1   0   0   0   0   4   0   0   3   0   0   3   0   0   0 385   1   2
    0   0]
 [  0   0   0   0   0  32   0   1   0   0   0 132   0   0   0 228   1   1
    1   0]
 [  0   1   0   0   0  12   0  12   1   0   0   4   0   0   6 352   1   4
    0   0]
 [  4   0   0   0   0   1   0   1   0   0   0   1   0   0   0 383   0   6
    0   0]
 [  2   0   0   0   0   0   0   1   0   0   0   1   0   0 114 273   0   3
    0   0]
 [ 76   1   0   0   0   3   0   0   0   0   0   0   0   0   3 310   0   5
    0   0]
 [  3   0   0   0   0   2   0   2   0   0   0   0   0   0   4 290  43  20
    0   0]
 [  5   0   0   0   0   2   0   5   0   0   0   2   0   0   2 260   0 100
    0   0]
 [  2   1   0   0   0   2   0   5   0   0   0   0   0   0   1 275  12  12
    0   0]
 [ 29   0   1   0   0   0   0   3   0   0   0   1   0   0   1 203   6   7
    0   0]]

================================================================================
Passive Aggressive
________________________________________________________________________________
Training: 
PassiveAggressiveClassifier(C=1.0, class_weight=None, fit_intercept=True,
              loss='hinge', n_iter=50, n_jobs=1, random_state=None,
              shuffle=True, verbose=0, warm_start=False)
train time: 0.323s
test time:  0.001s
accuracy:   0.170
dimensionality: 10
density: 1.000000

classification report:
                          precision    recall  f1-score   support

             alt.atheism       0.00      0.00      0.00       319
           comp.graphics       0.00      0.00      0.00       389
 comp.os.ms-windows.misc       0.55      0.49      0.52       394
comp.sys.ibm.pc.hardware       0.00      0.00      0.00       392
   comp.sys.mac.hardware       0.00      0.00      0.00       385
          comp.windows.x       0.00      0.00      0.00       395
            misc.forsale       0.00      0.00      0.00       390
               rec.autos       0.06      0.97      0.12       396
         rec.motorcycles       0.85      0.26      0.40       398
      rec.sport.baseball       1.00      0.00      0.01       397
        rec.sport.hockey       0.00      0.00      0.00       399
               sci.crypt       0.69      0.41      0.52       396
         sci.electronics       0.00      0.00      0.00       393
                 sci.med       0.00      0.00      0.00       396
               sci.space       0.58      0.29      0.39       394
  soc.religion.christian       0.50      0.42      0.46       398
      talk.politics.guns       0.54      0.17      0.26       364
   talk.politics.mideast       0.83      0.23      0.36       376
      talk.politics.misc       0.00      0.00      0.00       310
      talk.religion.misc       0.00      0.00      0.00       251

             avg / total       0.29      0.17      0.16      7532

confusion matrix:
[[  0   0   0   0   0   0   0 239   1   0   0   1   0   0   6  68   2   2
    0   0]
 [  0   0  23   0   0   3   0 339   0   0   0   3   0   0  18   0   1   2
    0   0]
 [  0   0 192   0   0   0   0 195   0   0   0   0   0   0   4   0   3   0
    0   0]
 [  0   0  38   0   0   0   0 341   0   0   0   3   0   0  10   0   0   0
    0   0]
 [  0   0   3   0   0   0   0 370   0   0   0   4   0   0   5   3   0   0
    0   0]
 [  0   0  60   0   0   0   0 316   1   0   0   8   0   0   8   1   0   1
    0   0]
 [  0   0  13   0   0   0   0 353   4   0   0   7   0   0  12   1   0   0
    0   0]
 [  0   0   5   0   0   0   0 384   3   0   0   1   0   0   1   0   2   0
    0   0]
 [  0   0   0   0   0   0   0 285 105   0   0   2   0   0   1   3   2   0
    0   0]
 [  0   0   0   0   0   0   0 379   1   1   0   8   0   0   2   4   2   0
    0   0]
 [  0   0   1   0   0   0   0 382   4   0   0   7   0   0   0   3   2   0
    0   0]
 [  0   0   0   0   0   0   0 228   0   0   0 163   0   0   2   1   1   1
    0   0]
 [  0   0   5   0   0   0   0 364   1   0   0  15   0   0   6   1   1   0
    0   0]
 [  0   0   1   0   0   0   0 378   1   0   0   2   0   0   1  11   1   1
    0   0]
 [  0   0   1   0   0   0   0 273   0   0   0   2   0   0 116   2   0   0
    0   0]
 [  0   0   1   0   0   0   0 220   0   0   0   1   0   0   3 168   0   5
    0   0]
 [  0   0   2   0   0   0   0 287   0   0   0   1   0   0   2   8  63   1
    0   0]
 [  0   0   0   0   0   0   0 265   3   0   0   6   0   0   2   7   6  87
    0   0]
 [  0   0   3   0   0   0   0 277   0   0   0   2   0   0   1   5  22   0
    0   0]
 [  0   0   3   0   0   0   0 183   0   0   0   1   0   0   1  50   8   5
    0   0]]

================================================================================
kNN
________________________________________________________________________________
Training: 
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=10, p=2,
           weights='uniform')
train time: 0.001s
test time:  0.804s
accuracy:   0.161
classification report:
                          precision    recall  f1-score   support

             alt.atheism       0.23      0.02      0.03       319
           comp.graphics       0.43      0.01      0.02       389
 comp.os.ms-windows.misc       0.06      0.96      0.12       394
comp.sys.ibm.pc.hardware       0.05      0.00      0.00       392
   comp.sys.mac.hardware       0.00      0.00      0.00       385
          comp.windows.x       0.25      0.03      0.05       395
            misc.forsale       0.50      0.02      0.03       390
               rec.autos       0.72      0.31      0.44       396
         rec.motorcycles       0.89      0.26      0.40       398
      rec.sport.baseball       0.00      0.00      0.00       397
        rec.sport.hockey       0.12      0.00      0.00       399
               sci.crypt       0.72      0.39      0.50       396
         sci.electronics       0.00      0.00      0.00       393
                 sci.med       0.20      0.00      0.00       396
               sci.space       0.61      0.29      0.39       394
  soc.religion.christian       0.48      0.39      0.43       398
      talk.politics.guns       0.58      0.17      0.26       364
   talk.politics.mideast       0.87      0.25      0.39       376
      talk.politics.misc       0.00      0.00      0.00       310
      talk.religion.misc       0.00      0.00      0.00       251

             avg / total       0.34      0.16      0.16      7532

confusion matrix:
[[  5   0 238   1   0   0   0   0   1   0   0   1   0   0   5  64   1   1
    0   2]
 [  0   3 347   1   1  11   1   1   0   0   0   3   0   0  17   0   1   3
    0   0]
 [  0   2 379   3   0   4   0   0   0   0   0   0   0   0   4   0   2   0
    0   0]
 [  0   0 374   1   0   4   1   0   0   0   0   2   0   0  10   0   0   0
    0   0]
 [  1   0 372   0   0   1   0   1   0   0   0   3   0   0   5   2   0   0
    0   0]
 [  0   2 364   2   0  10   1   0   0   0   0   8   0   0   6   1   0   1
    0   0]
 [  0   0 344   2   0   3   6  16   4   0   0   6   1   0   7   1   0   0
    0   0]
 [  0   0 257   6   0   1   0 124   3   0   1   1   0   0   1   0   2   0
    0   0]
 [  0   0 278   1   0   0   0   6 102   0   0   4   0   1   1   3   2   0
    0   0]
 [  0   0 378   0   1   0   1   1   1   0   0   6   0   0   2   4   2   0
    0   1]
 [  0   0 383   0   0   0   0   0   3   0   1   6   1   0   0   4   1   0
    0   0]
 [  0   0 232   1   1   0   1   1   0   2   2 153   0   0   0   2   1   0
    0   0]
 [  0   0 355   0   1   2   0  12   1   0   0  14   0   0   6   1   1   0
    0   0]
 [  1   0 376   1   0   2   0   0   0   0   0   1   1   1   0  10   0   3
    0   0]
 [  0   0 272   0   0   1   1   1   0   0   0   1   1   0 114   3   0   0
    0   0]
 [ 11   0 219   0   0   0   0   0   0   0   1   0   1   0   3 156   0   4
    0   3]
 [  1   0 287   0   0   0   0   1   0   0   1   1   0   0   2   7  62   2
    0   0]
 [  0   0 258   1   0   0   0   4   0   0   2   1   0   2   2   9   2  95
    0   0]
 [  0   0 276   0   0   0   0   3   0   0   0   2   0   1   1   6  21   0
    0   0]
 [  3   0 181   0   0   1   0   2   0   0   0   0   0   0   1  55   8   0
    0   0]]

================================================================================
Random forest
________________________________________________________________________________
Training: 
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=100, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)
train time: 1.167s
test time:  0.335s
accuracy:   0.167
classification report:
                          precision    recall  f1-score   support

             alt.atheism       0.16      0.03      0.05       319
           comp.graphics       0.18      0.02      0.04       389
 comp.os.ms-windows.misc       0.57      0.33      0.42       394
comp.sys.ibm.pc.hardware       0.10      0.01      0.02       392
   comp.sys.mac.hardware       0.00      0.00      0.00       385
          comp.windows.x       0.23      0.04      0.07       395
            misc.forsale       0.25      0.03      0.05       390
               rec.autos       0.68      0.26      0.37       396
         rec.motorcycles       0.76      0.25      0.38       398
      rec.sport.baseball       0.06      0.95      0.12       397
        rec.sport.hockey       0.09      0.01      0.01       399
               sci.crypt       0.79      0.38      0.51       396
         sci.electronics       0.06      0.00      0.00       393
                 sci.med       0.11      0.00      0.00       396
               sci.space       0.65      0.23      0.34       394
  soc.religion.christian       0.47      0.26      0.33       398
      talk.politics.guns       0.44      0.14      0.21       364
   talk.politics.mideast       0.77      0.23      0.36       376
      talk.politics.misc       0.00      0.00      0.00       310
      talk.religion.misc       0.19      0.04      0.07       251

             avg / total       0.34      0.17      0.17      7532

confusion matrix:
[[ 10   0   1   0   0   0   0   0   1 238   2   2   0   0   5  48   1   3
    1   7]
 [  0   8  11   7   1   6   3   2   0 328   1   2   1   1  11   0   4   2
    1   0]
 [  0  17 131  11   1  25   7   1   0 194   0   1   0   0   5   0   1   0
    0   0]
 [  0   5  19   4   2   7   3   0   2 340   0   2   1   0   6   0   0   0
    1   0]
 [  1   0   2   1   0   1   0   0   1 371   0   3   0   0   3   2   0   0
    0   0]
 [  0   8  37   6   0  16   2   1   0 312   0   1   1   2   2   0   3   3
    0   1]
 [  0   1   9   1   0   3  12  11   6 334   0   5   1   0   3   1   1   1
    1   0]
 [  1   0   4   1   0   1   6 102  16 252   3   0   3   0   2   2   1   2
    0   0]
 [  2   1   0   1   0   0   1   7 101 280   1   1   0   1   0   1   1   0
    0   0]
 [  0   0   0   0   0   1   3   1   0 378   1   5   0   0   0   2   2   0
    1   3]
 [  0   0   1   0   0   0   0   0   3 382   2   4   1   0   0   2   2   0
    1   1]
 [  2   0   1   0   1   4   0   1   0 227   3 149   2   0   1   1   1   1
    1   1]
 [  0   0   5   1   1   2   1  11   1 351   1   9   1   0   4   1   3   0
    1   0]
 [  2   0   0   0   0   0   0   0   0 376   2   1   1   1   1   4   2   3
    1   2]
 [  1   3   3   0   2   2   6   1   1 272   0   1   2   0  90   2   3   1
    4   0]
 [ 26   0   2   4   0   0   0   1   0 218   2   0   1   0   3 102   8   4
    0  27]
 [  2   0   2   0   0   0   3   6   0 285   0   1   0   1   1   6  51   4
    2   0]
 [  5   0   1   0   0   0   0   1   0 256   4   1   1   2   0   9   2  88
    4   2]
 [  2   1   1   1   0   1   0   2   1 271   0   1   1   1   0   6  19   0
    0   2]
 [  8   0   1   3   0   0   1   2   0 180   0   0   0   0   1  30  10   2
    2  11]]

================================================================================
L2 penalty
________________________________________________________________________________
Training: 
LinearSVC(C=1.0, class_weight=None, dual=False, fit_intercept=True,
     intercept_scaling=1, loss='l2', max_iter=1000, multi_class='ovr',
     penalty='l2', random_state=None, tol=0.001, verbose=0)
train time: 0.176s
test time:  0.002s
accuracy:   0.186
dimensionality: 10
density: 1.000000

classification report:
                          precision    recall  f1-score   support

             alt.atheism       0.00      0.00      0.00       319
           comp.graphics       0.00      0.00      0.00       389
 comp.os.ms-windows.misc       0.52      0.49      0.51       394
comp.sys.ibm.pc.hardware       0.00      0.00      0.00       392
   comp.sys.mac.hardware       0.00      0.00      0.00       385
          comp.windows.x       0.00      0.00      0.00       395
            misc.forsale       0.00      0.00      0.00       390
               rec.autos       0.70      0.33      0.45       396
         rec.motorcycles       0.85      0.26      0.40       398
      rec.sport.baseball       0.00      0.00      0.00       397
        rec.sport.hockey       0.00      0.00      0.00       399
               sci.crypt       0.67      0.41      0.51       396
         sci.electronics       0.00      0.00      0.00       393
                 sci.med       0.06      0.95      0.12       396
               sci.space       0.57      0.29      0.39       394
  soc.religion.christian       0.49      0.42      0.45       398
      talk.politics.guns       0.56      0.17      0.26       364
   talk.politics.mideast       0.87      0.23      0.36       376
      talk.politics.misc       0.00      0.00      0.00       310
      talk.religion.misc       0.00      0.00      0.00       251

             avg / total       0.27      0.19      0.18      7532

confusion matrix:
[[  0   0   0   0   0   0   0   0   1   1   0   1   0 238   6  70   1   1
    0   0]
 [  0   0  32   0   0   1   0   2   0   0   0   4   0 328  20   0   1   1
    0   0]
 [  0   0 193   0   0   0   0   0   0   1   0   0   0 194   4   0   2   0
    0   0]
 [  0   0  39   0   0   0   0   0   0   0   0   3   0 340  10   0   0   0
    0   0]
 [  0   0   3   0   0   0   0   1   0   0   0   4   0 369   5   3   0   0
    0   0]
 [  0   0  64   0   0   0   0   0   1   0   0   8   0 312   8   1   0   1
    0   0]
 [  0   0  13   0   0   0   0  17   4   0   0   7   0 335  13   1   0   0
    0   0]
 [  0   0   5   0   0   0   0 132   3   0   0   1   0 252   1   0   2   0
    0   0]
 [  0   0   0   0   0   0   0   7 105   0   0   2   0 278   1   3   2   0
    0   0]
 [  0   0   0   0   0   0   0   1   1   0   0   8   0 378   2   5   2   0
    0   0]
 [  0   0   1   0   0   0   0   0   4   0   0   7   0 382   0   4   1   0
    0   0]
 [  0   0   2   0   0   0   0   1   0   0   0 162   0 227   2   1   1   0
    0   0]
 [  0   0   6   0   0   0   0  12   1   0   0  15   0 351   6   1   1   0
    0   0]
 [  0   0   1   0   0   0   0   1   1   1   0   2   0 376   1  11   1   1
    0   0]
 [  0   0   1   0   0   0   0   1   0   0   0   2   0 272 116   2   0   0
    0   0]
 [  0   0   1   0   0   0   0   0   0   0   0   3   0 218   3 168   0   5
    0   0]
 [  0   0   2   0   0   0   0   1   0   0   0   2   0 285   2   8  63   1
    0   0]
 [  0   0   0   0   0   0   0   6   3   1   0   8   0 256   2   8   6  86
    0   0]
 [  0   0   4   0   0   0   0   4   0   2   0   2   0 271   1   5  21   0
    0   0]
 [  0   0   2   0   0   0   0   2   0   0   0   1   0 180   1  54   8   3
    0   0]]

________________________________________________________________________________
Training: 
SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', n_iter=50, n_jobs=1,
       penalty='l2', power_t=0.5, random_state=None, shuffle=True,
       verbose=0, warm_start=False)
train time: 0.487s
test time:  0.001s
accuracy:   0.186
dimensionality: 10
density: 1.000000

classification report:
                          precision    recall  f1-score   support

             alt.atheism       0.00      0.00      0.00       319
           comp.graphics       0.00      0.00      0.00       389
 comp.os.ms-windows.misc       0.52      0.49      0.50       394
comp.sys.ibm.pc.hardware       0.00      0.00      0.00       392
   comp.sys.mac.hardware       0.00      0.00      0.00       385
          comp.windows.x       0.00      0.00      0.00       395
            misc.forsale       0.00      0.00      0.00       390
               rec.autos       0.69      0.34      0.45       396
         rec.motorcycles       0.85      0.26      0.39       398
      rec.sport.baseball       0.00      0.00      0.00       397
        rec.sport.hockey       0.00      0.00      0.00       399
               sci.crypt       0.66      0.41      0.51       396
         sci.electronics       0.00      0.00      0.00       393
                 sci.med       0.06      0.95      0.12       396
               sci.space       0.57      0.30      0.39       394
  soc.religion.christian       0.49      0.42      0.45       398
      talk.politics.guns       0.56      0.17      0.26       364
   talk.politics.mideast       0.87      0.23      0.36       376
      talk.politics.misc       0.00      0.00      0.00       310
      talk.religion.misc       0.00      0.00      0.00       251

             avg / total       0.27      0.19      0.18      7532

confusion matrix:
[[  0   0   0   0   0   0   0   0   1   0   0   2   0 238   6  70   1   1
    0   0]
 [  0   0  34   0   0   0   0   2   0   0   0   4   0 328  19   0   1   1
    0   0]
 [  0   0 193   0   0   0   0   0   0   0   0   0   0 194   4   0   3   0
    0   0]
 [  0   0  39   0   0   0   0   0   0   0   0   3   0 340  10   0   0   0
    0   0]
 [  0   0   3   0   0   0   0   1   0   0   0   4   0 369   5   3   0   0
    0   0]
 [  0   0  64   0   0   0   0   0   1   0   0   8   0 312   8   1   0   1
    0   0]
 [  0   0  13   0   0   0   0  18   4   0   0   7   0 334  13   1   0   0
    0   0]
 [  0   0   5   0   0   0   0 133   2   0   0   1   0 252   1   0   2   0
    0   0]
 [  0   0   0   0   0   0   0  10 102   0   0   2   0 278   1   3   2   0
    0   0]
 [  0   0   0   0   0   0   0   1   1   0   0   8   0 378   2   5   2   0
    0   0]
 [  0   0   1   0   0   0   0   0   4   0   0   7   0 382   0   4   1   0
    0   0]
 [  0   0   2   0   0   0   0   1   0   0   0 162   0 227   2   1   1   0
    0   0]
 [  0   0   6   0   0   0   0  12   1   0   0  15   0 351   6   1   1   0
    0   0]
 [  0   0   1   0   0   0   0   1   1   0   0   3   0 376   1  11   1   1
    0   0]
 [  0   0   1   0   0   0   0   1   0   0   0   1   0 272 117   2   0   0
    0   0]
 [  0   0   1   0   0   0   0   0   0   0   0   3   0 218   3 168   0   5
    0   0]
 [  0   0   2   0   0   0   0   1   0   0   0   2   0 285   2   8  63   1
    0   0]
 [  0   0   0   0   0   0   0   6   3   0   0   9   0 256   2   8   6  86
    0   0]
 [  0   0   4   0   0   0   0   4   0   0   0   3   0 271   1   6  21   0
    0   0]
 [  0   0   2   0   0   0   0   2   0   0   0   1   0 180   1  54   8   3
    0   0]]

================================================================================
L1 penalty
________________________________________________________________________________
Training: 
LinearSVC(C=1.0, class_weight=None, dual=False, fit_intercept=True,
     intercept_scaling=1, loss='l2', max_iter=1000, multi_class='ovr',
     penalty='l1', random_state=None, tol=0.001, verbose=0)
train time: 0.023s
test time:  0.001s
accuracy:   0.186
dimensionality: 10
density: 0.990000

classification report:
                          precision    recall  f1-score   support

             alt.atheism       0.00      0.00      0.00       319
           comp.graphics       0.00      0.00      0.00       389
 comp.os.ms-windows.misc       0.52      0.49      0.51       394
comp.sys.ibm.pc.hardware       0.00      0.00      0.00       392
   comp.sys.mac.hardware       0.00      0.00      0.00       385
          comp.windows.x       0.00      0.00      0.00       395
            misc.forsale       0.00      0.00      0.00       390
               rec.autos       0.70      0.33      0.45       396
         rec.motorcycles       0.85      0.26      0.40       398
      rec.sport.baseball       0.00      0.00      0.00       397
        rec.sport.hockey       0.00      0.00      0.00       399
               sci.crypt       0.67      0.41      0.51       396
         sci.electronics       0.00      0.00      0.00       393
                 sci.med       0.06      0.95      0.12       396
               sci.space       0.57      0.29      0.39       394
  soc.religion.christian       0.49      0.42      0.45       398
      talk.politics.guns       0.56      0.17      0.26       364
   talk.politics.mideast       0.84      0.23      0.36       376
      talk.politics.misc       0.00      0.00      0.00       310
      talk.religion.misc       0.00      0.00      0.00       251

             avg / total       0.27      0.19      0.18      7532

confusion matrix:
[[  0   0   0   0   0   0   0   0   1   1   0   1   0 238   6  69   1   2
    0   0]
 [  0   0  33   0   0   1   0   2   0   0   0   4   0 328  19   0   1   1
    0   0]
 [  0   0 193   0   0   0   0   0   0   1   0   0   0 194   4   0   2   0
    0   0]
 [  0   0  39   0   0   0   0   0   0   0   0   3   0 340  10   0   0   0
    0   0]
 [  0   0   3   0   0   0   0   1   0   0   0   4   0 369   5   3   0   0
    0   0]
 [  0   0  64   0   0   0   0   0   1   0   0   8   0 312   8   1   0   1
    0   0]
 [  0   0  13   0   0   0   0  17   4   0   0   7   0 335  13   1   0   0
    0   0]
 [  0   0   5   0   0   0   0 132   3   0   0   1   0 252   1   0   2   0
    0   0]
 [  0   0   0   0   0   0   0   7 105   0   0   2   0 278   1   3   2   0
    0   0]
 [  0   0   0   0   0   0   0   1   1   0   0   8   0 378   2   5   2   0
    0   0]
 [  0   0   1   0   0   0   0   0   4   0   0   7   0 382   0   4   1   0
    0   0]
 [  0   0   0   0   0   0   0   1   0   0   0 163   0 227   2   1   1   1
    0   0]
 [  0   0   6   0   0   0   0  12   1   0   0  15   0 351   6   1   1   0
    0   0]
 [  0   0   1   0   0   0   0   1   1   1   0   2   0 376   1  11   1   1
    0   0]
 [  0   0   1   0   0   0   0   1   0   0   0   2   0 272 116   2   0   0
    0   0]
 [  0   0   1   0   0   0   0   0   0   0   0   3   0 218   3 168   0   5
    0   0]
 [  0   0   2   0   0   0   0   1   0   0   0   2   0 285   2   8  63   1
    0   0]
 [  0   0   0   0   0   0   0   6   3   2   0   7   0 256   2   7   6  87
    0   0]
 [  0   0   4   0   0   0   0   4   0   2   0   2   0 271   1   5  21   0
    0   0]
 [  0   0   2   0   0   0   0   2   0   0   0   1   0 180   1  52   8   5
    0   0]]

________________________________________________________________________________
Training: 
SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', n_iter=50, n_jobs=1,
       penalty='l1', power_t=0.5, random_state=None, shuffle=True,
       verbose=0, warm_start=False)
train time: 0.433s
test time:  0.001s
accuracy:   0.162
dimensionality: 10
density: 0.455000

classification report:
                          precision    recall  f1-score   support

             alt.atheism       0.00      0.00      0.00       319
           comp.graphics       0.00      0.00      0.00       389
 comp.os.ms-windows.misc       0.06      0.98      0.12       394
comp.sys.ibm.pc.hardware       0.00      0.00      0.00       392
   comp.sys.mac.hardware       0.00      0.00      0.00       385
          comp.windows.x       0.00      0.00      0.00       395
            misc.forsale       0.00      0.00      0.00       390
               rec.autos       0.69      0.33      0.45       396
         rec.motorcycles       0.85      0.26      0.40       398
      rec.sport.baseball       0.00      0.00      0.00       397
        rec.sport.hockey       0.00      0.00      0.00       399
               sci.crypt       0.66      0.41      0.50       396
         sci.electronics       0.00      0.00      0.00       393
                 sci.med       0.00      0.00      0.00       396
               sci.space       0.57      0.29      0.39       394
  soc.religion.christian       0.49      0.42      0.46       398
      talk.politics.guns       0.56      0.17      0.26       364
   talk.politics.mideast       0.84      0.23      0.36       376
      talk.politics.misc       0.00      0.00      0.00       310
      talk.religion.misc       0.00      0.00      0.00       251

             avg / total       0.24      0.16      0.15      7532

confusion matrix:
[[  0   0 238   0   0   0   0   0   1   0   0   2   0   0   6  68   2   2
    0   0]
 [  0   0 362   0   0   0   0   2   0   0   0   4   0   0  19   0   1   1
    0   0]
 [  0   0 386   0   0   0   0   1   0   0   0   1   0   0   4   0   2   0
    0   0]
 [  0   0 379   0   0   0   0   0   0   0   0   3   0   0  10   0   0   0
    0   0]
 [  0   0 372   0   0   0   0   1   0   0   0   4   0   0   5   3   0   0
    0   0]
 [  0   0 377   0   0   0   0   0   1   0   0   7   0   0   8   1   0   1
    0   0]
 [  0   0 347   0   0   0   0  17   4   0   0   7   0   0  14   1   0   0
    0   0]
 [  0   0 257   0   0   0   0 132   3   0   0   1   0   0   1   0   2   0
    0   0]
 [  0   0 278   0   0   0   0   7 105   0   0   2   0   0   1   3   2   0
    0   0]
 [  0   0 378   0   0   0   0   1   1   0   0   8   0   0   2   5   2   0
    0   0]
 [  0   0 383   0   0   0   0   0   4   0   0   7   0   0   0   4   1   0
    0   0]
 [  0   0 229   0   0   0   0   1   0   0   0 161   0   0   2   1   1   1
    0   0]
 [  0   0 357   0   0   0   0  12   1   0   0  15   0   0   6   1   1   0
    0   0]
 [  0   0 377   0   0   0   0   1   1   0   0   3   0   0   1  11   1   1
    0   0]
 [  0   0 273   0   0   0   0   1   0   0   0   2   0   0 116   2   0   0
    0   0]
 [  0   0 219   0   0   0   0   0   0   0   0   3   0   0   3 168   0   5
    0   0]
 [  0   0 287   0   0   0   0   1   0   0   0   2   0   0   2   8  63   1
    0   0]
 [  0   0 256   0   0   0   0   6   3   0   0   9   0   0   2   7   6  87
    0   0]
 [  0   0 275   0   0   0   0   4   0   0   0   3   0   0   1   6  21   0
    0   0]
 [  0   0 182   0   0   0   0   3   0   0   0   1   0   0   1  51   8   5
    0   0]]
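The `density` value reported for each linear model comes from `sklearn.utils.extmath.density` (imported at the top of this example): it is the fraction of nonzero entries in the learned coefficient matrix, so the L1-penalized run's 0.455 means roughly half of its weights were driven to exactly zero. A minimal numpy sketch of the same quantity, on a made-up coefficient matrix:

```python
import numpy as np

# Hypothetical 3-class x 5-feature coefficient matrix; an L1 penalty
# tends to zero out many of these entries during training.
coef = np.array([[0.0, 1.2, 0.0, 0.0, -0.4],
                 [0.3, 0.0, 0.0, 0.0,  0.0],
                 [0.0, 0.0, 2.1, 0.0,  0.0]])

# Density = fraction of nonzero coefficients, which is what the
# benchmark's "density" line reports.
density = np.count_nonzero(coef) / coef.size
print(density)  # 4 nonzeros out of 15 entries -> ~0.267
```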

Train SGD with Elastic Net penalty

In [14]:
print('=' * 80)
print("Elastic-Net penalty")
results.append(benchmark(SGDClassifier(alpha=.0001, n_iter=50,
                                       penalty="elasticnet")))
================================================================================
Elastic-Net penalty
________________________________________________________________________________
Training: 
SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', n_iter=50, n_jobs=1,
       penalty='elasticnet', power_t=0.5, random_state=None, shuffle=True,
       verbose=0, warm_start=False)
train time: 0.488s
test time:  0.001s
accuracy:   0.180
dimensionality: 10
density: 0.845000

classification report:
                          precision    recall  f1-score   support

             alt.atheism       0.00      0.00      0.00       319
           comp.graphics       0.06      0.84      0.11       389
 comp.os.ms-windows.misc       0.52      0.49      0.50       394
comp.sys.ibm.pc.hardware       0.00      0.00      0.00       392
   comp.sys.mac.hardware       0.00      0.00      0.00       385
          comp.windows.x       0.00      0.00      0.00       395
            misc.forsale       0.00      0.00      0.00       390
               rec.autos       0.69      0.34      0.45       396
         rec.motorcycles       0.85      0.26      0.39       398
      rec.sport.baseball       0.00      0.00      0.00       397
        rec.sport.hockey       0.00      0.00      0.00       399
               sci.crypt       0.66      0.41      0.51       396
         sci.electronics       0.00      0.00      0.00       393
                 sci.med       0.00      0.00      0.00       396
               sci.space       0.57      0.30      0.39       394
  soc.religion.christian       0.49      0.42      0.45       398
      talk.politics.guns       0.56      0.17      0.26       364
   talk.politics.mideast       0.87      0.23      0.36       376
      talk.politics.misc       0.00      0.00      0.00       310
      talk.religion.misc       0.00      0.00      0.00       251

             avg / total       0.27      0.18      0.18      7532

confusion matrix:
[[  0 238   0   0   0   0   0   0   1   0   0   2   0   0   6  70   1   1
    0   0]
 [  0 328  34   0   0   0   0   2   0   0   0   4   0   0  19   0   1   1
    0   0]
 [  0 194 193   0   0   0   0   0   0   0   0   0   0   0   4   0   3   0
    0   0]
 [  0 340  39   0   0   0   0   0   0   0   0   3   0   0  10   0   0   0
    0   0]
 [  0 369   3   0   0   0   0   1   0   0   0   4   0   0   5   3   0   0
    0   0]
 [  0 312  64   0   0   0   0   0   1   0   0   8   0   0   8   1   0   1
    0   0]
 [  0 334  13   0   0   0   0  18   4   0   0   7   0   0  13   1   0   0
    0   0]
 [  0 252   5   0   0   0   0 133   2   0   0   1   0   0   1   0   2   0
    0   0]
 [  0 278   0   0   0   0   0  10 102   0   0   2   0   0   1   3   2   0
    0   0]
 [  0 378   0   0   0   0   0   1   1   0   0   8   0   0   2   5   2   0
    0   0]
 [  0 382   1   0   0   0   0   0   4   0   0   7   0   0   0   4   1   0
    0   0]
 [  0 227   2   0   0   0   0   1   0   0   0 162   0   0   2   1   1   0
    0   0]
 [  0 351   6   0   0   0   0  12   1   0   0  15   0   0   6   1   1   0
    0   0]
 [  0 376   1   0   0   0   0   1   1   0   0   3   0   0   1  11   1   1
    0   0]
 [  0 272   1   0   0   0   0   1   0   0   0   1   0   0 117   2   0   0
    0   0]
 [  0 218   1   0   0   0   0   0   0   0   0   3   0   0   3 168   0   5
    0   0]
 [  0 285   2   0   0   0   0   1   0   0   0   2   0   0   2   8  63   1
    0   0]
 [  0 256   0   0   0   0   0   6   3   0   0   9   0   0   2   8   6  86
    0   0]
 [  0 271   4   0   0   0   0   4   0   0   0   3   0   0   1   6  21   0
    0   0]
 [  0 180   2   0   0   0   0   2   0   0   0   1   0   0   1  54   8   3
    0   0]]
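The elastic-net repr above shows `l1_ratio=0.15`: the penalty is a convex mix of the L1 and L2 terms, which is why its reported density (0.845) lands between the pure-L2 run (1.0) and the pure-L1 run (0.455). A sketch of the mixed penalty term on a made-up weight vector (the 1/2 factor on the L2 term is the usual convention):

```python
import numpy as np

def elasticnet_penalty(w, alpha=0.0001, l1_ratio=0.15):
    """Convex combination of the L1 and L2 penalties, mixed by
    l1_ratio the way SGDClassifier's 'elasticnet' penalty does:
    l1_ratio=1 is pure L1, l1_ratio=0 is pure L2."""
    l1 = np.sum(np.abs(w))
    l2 = 0.5 * np.sum(w ** 2)
    return alpha * (l1_ratio * l1 + (1.0 - l1_ratio) * l2)

w = np.array([0.5, -1.0, 0.0, 2.0])           # made-up weight vector
print(elasticnet_penalty(w, l1_ratio=0.0))    # pure L2 penalty
print(elasticnet_penalty(w, l1_ratio=1.0))    # pure L1 penalty
print(elasticnet_penalty(w))                  # default mix, l1_ratio=0.15
```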

Train NearestCentroid without threshold

In [15]:
print('=' * 80)
print("NearestCentroid (aka Rocchio classifier)")
results.append(benchmark(NearestCentroid()))
================================================================================
NearestCentroid (aka Rocchio classifier)
________________________________________________________________________________
Training: 
NearestCentroid(metric='euclidean', shrink_threshold=None)
train time: 0.020s
test time:  0.002s
accuracy:   0.180
classification report:
                          precision    recall  f1-score   support

             alt.atheism       0.17      0.01      0.02       319
           comp.graphics       1.00      0.01      0.02       389
 comp.os.ms-windows.misc       0.57      0.48      0.52       394
comp.sys.ibm.pc.hardware       0.00      0.00      0.00       392
   comp.sys.mac.hardware       0.00      0.00      0.00       385
          comp.windows.x       0.22      0.02      0.04       395
            misc.forsale       0.00      0.00      0.00       390
               rec.autos       0.70      0.34      0.45       396
         rec.motorcycles       0.90      0.25      0.39       398
      rec.sport.baseball       0.00      0.00      0.00       397
        rec.sport.hockey       0.17      0.00      0.00       399
               sci.crypt       0.71      0.34      0.46       396
         sci.electronics       0.00      0.00      0.00       393
                 sci.med       0.06      0.95      0.12       396
               sci.space       0.61      0.29      0.40       394
  soc.religion.christian       0.50      0.36      0.42       398
      talk.politics.guns       0.58      0.16      0.26       364
   talk.politics.mideast       0.90      0.23      0.36       376
      talk.politics.misc       0.03      0.00      0.01       310
      talk.religion.misc       0.18      0.04      0.06       251

             avg / total       0.37      0.18      0.18      7532

confusion matrix:
[[  3   0   0   0   0   0   0   0   1   0   0   0   1 238   5  58   1   1
    2   9]
 [  0   3  21   1   3  11   0   0   0   0   0   3   2 328  15   0   1   1
    0   0]
 [  0   0 188   0   0   5   0   0   0   0   0   0   0 194   4   0   2   0
    1   0]
 [  0   0  33   0   0   5   0   0   0   0   0   1   1 341  11   0   0   0
    0   0]
 [  1   0   3   0   0   0   0   1   0   0   0   4   0 369   5   0   0   0
    0   2]
 [  0   0  56   1   2   8   0   0   0   0   0   7   1 313   5   1   0   1
    0   0]
 [  0   0  13   0   7   0   0  18   4   0   0   6   1 334   6   1   0   0
    0   0]
 [  0   0   5   0   0   1   0 133   1   0   0   1   0 252   1   0   2   0
    0   0]
 [  0   0   0   0   0   0   0  11  99   0   1   3   0 278   1   4   1   0
    0   0]
 [  0   0   0   0   0   0   0   1   1   0   0   5   1 378   2   4   2   0
    2   1]
 [  0   0   1   0   0   0   0   0   3   0   1   6   0 382   0   3   1   0
    2   0]
 [  0   0   2   0   0   0   0   2   0   0   0 133   3 228   2   2   2   0
   22   0]
 [  0   0   3   0   0   3   0  12   1   0   0  15   0 351   6   1   1   0
    0   0]
 [  1   0   0   0   1   1   0   0   0   1   1   1   3 376   0   9   0   0
    1   1]
 [  0   0   0   0   1   0   0   1   0   0   0   0   0 272 116   3   0   0
    1   0]
 [  6   0   1   0   0   0   0   0   0   0   0   0   3 218   3 142   0   5
    0  20]
 [  0   0   2   0   0   0   1   2   0   0   0   1   1 285   3   6  60   1
    0   2]
 [  0   0   0   0   0   0   2   3   0   0   3   0   7 258   2   6   2  85
    5   3]
 [  1   0   2   1   0   1   0   4   0   1   0   1   1 271   1   3  20   0
    1   2]
 [  6   0   1   0   0   1   0   2   0   0   0   0   0 180   1  43   8   0
    0   9]]
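`NearestCentroid` implements the Rocchio rule: training just averages each class's vectors into a centroid, and prediction assigns a document to the class whose centroid is nearest (Euclidean by default, per the repr above). A toy 2-D sketch of the idea (not the actual tf-idf features):

```python
import numpy as np

# Toy training data: two classes in 2-D.
X = np.array([[0.0, 0.0], [1.0, 0.0],   # class 0
              [5.0, 5.0], [6.0, 5.0]])  # class 1
y = np.array([0, 0, 1, 1])

# One mean vector per class -- the whole "model".
centroids = np.array([X[y == c].mean(axis=0) for c in np.unique(y)])

def predict(x):
    # Assign to the class whose centroid is closest
    # (Euclidean metric, matching NearestCentroid's default).
    return int(np.argmin(np.linalg.norm(centroids - x, axis=1)))

print(predict(np.array([0.5, 0.2])))  # near class 0's centroid -> 0
print(predict(np.array([5.5, 4.0])))  # near class 1's centroid -> 1
```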

Train sparse Naive Bayes classifiers

In [16]:
print('=' * 80)
print("Naive Bayes")
results.append(benchmark(MultinomialNB(alpha=.01)))
results.append(benchmark(BernoulliNB(alpha=.01)))
================================================================================
Naive Bayes
________________________________________________________________________________
Training: 
MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True)
train time: 0.009s
test time:  0.001s
accuracy:   0.183
dimensionality: 10
density: 1.000000

classification report:
                          precision    recall  f1-score   support

             alt.atheism       0.00      0.00      0.00       319
           comp.graphics       0.19      0.01      0.01       389
 comp.os.ms-windows.misc       0.54      0.47      0.50       394
comp.sys.ibm.pc.hardware       0.33      0.00      0.01       392
   comp.sys.mac.hardware       0.00      0.00      0.00       385
          comp.windows.x       0.27      0.01      0.01       395
            misc.forsale       0.00      0.00      0.00       390
               rec.autos       0.71      0.32      0.44       396
         rec.motorcycles       0.85      0.26      0.40       398
      rec.sport.baseball       0.08      0.01      0.02       397
        rec.sport.hockey       0.07      0.96      0.12       399
               sci.crypt       0.79      0.36      0.49       396
         sci.electronics       0.00      0.00      0.00       393
                 sci.med       0.14      0.00      0.00       396
               sci.space       0.59      0.29      0.39       394
  soc.religion.christian       0.48      0.42      0.45       398
      talk.politics.guns       0.60      0.17      0.26       364
   talk.politics.mideast       0.90      0.23      0.36       376
      talk.politics.misc       0.00      0.00      0.00       310
      talk.religion.misc       0.00      0.00      0.00       251

             avg / total       0.34      0.18      0.18      7532

confusion matrix:
[[  0   0   0   0   0   0   0   0   1   0 239   1   0   0   5  71   1   1
    0   0]
 [  0   3  29   0   4   2   1   1   0   2 328   2   0   1  14   0   1   1
    0   0]
 [  0   5 184   0   0   3   1   0   0   0 195   0   0   0   4   0   2   0
    0   0]
 [  0   1  35   1   0   2   0   0   0   2 340   1   0   0  10   0   0   0
    0   0]
 [  0   0   3   0   0   0   0   1   0   3 369   1   0   0   5   3   0   0
    0   0]
 [  0   4  58   0   0   3   0   0   0   0 313   7   0   1   7   1   0   1
    0   0]
 [  0   1  12   0   0   0   0  17   4   5 335   2   1   0  12   1   0   0
    0   0]
 [  0   0   5   0   0   0   3 127   3   0 252   1   2   0   1   0   2   0
    0   0]
 [  0   0   0   0   0   0   0   7 104   2 278   1   0   0   1   3   2   0
    0   0]
 [  0   0   0   0   0   0   0   1   1   5 378   4   0   0   3   4   1   0
    0   0]
 [  0   0   1   0   0   0   0   0   4   4 382   3   0   0   0   4   1   0
    0   0]
 [  0   0   0   2   2   0   0   1   0  17 228 142   0   1   0   2   1   0
    0   0]
 [  0   1   4   0   0   1   0  12   1   7 351   8   0   0   6   1   1   0
    0   0]
 [  0   0   1   0   0   0   0   0   1   1 379   1   1   1   0  11   0   0
    0   0]
 [  0   1   0   0   1   0   0   1   0   0 272   1   0   1 115   2   0   0
    0   0]
 [  0   0   1   0   0   0   0   0   0   4 218   0   0   1   3 167   0   4
    0   0]
 [  0   0   2   0   0   0   1   1   0   2 285   0   0   0   3   8  61   1
    0   0]
 [  0   0   0   0   0   0   0   5   3   3 261   4   2   0   3   9   1  85
    0   0]
 [  0   0   4   0   0   0   0   3   0   2 273   0   1   0   1   6  20   0
    0   0]
 [  0   0   2   0   0   0   0   2   0   0 181   0   0   1   1  55   8   1
    0   0]]

________________________________________________________________________________
Training: 
BernoulliNB(alpha=0.01, binarize=0.0, class_prior=None, fit_prior=True)
train time: 0.006s
test time:  0.001s
accuracy:   0.184
dimensionality: 10
density: 1.000000

classification report:
                          precision    recall  f1-score   support

             alt.atheism       0.00      0.00      0.00       319
           comp.graphics       0.00      0.00      0.00       389
 comp.os.ms-windows.misc       0.52      0.49      0.50       394
comp.sys.ibm.pc.hardware       0.00      0.00      0.00       392
   comp.sys.mac.hardware       0.00      0.00      0.00       385
          comp.windows.x       0.00      0.00      0.00       395
            misc.forsale       0.00      0.00      0.00       390
               rec.autos       0.69      0.33      0.45       396
         rec.motorcycles       0.85      0.26      0.40       398
      rec.sport.baseball       0.06      0.95      0.12       397
        rec.sport.hockey       0.00      0.00      0.00       399
               sci.crypt       0.65      0.41      0.50       396
         sci.electronics       0.00      0.00      0.00       393
                 sci.med       0.00      0.00      0.00       396
               sci.space       0.58      0.29      0.39       394
  soc.religion.christian       0.47      0.42      0.44       398
      talk.politics.guns       0.53      0.16      0.24       364
   talk.politics.mideast       0.87      0.22      0.35       376
      talk.politics.misc       0.00      0.00      0.00       310
      talk.religion.misc       0.25      0.00      0.01       251

             avg / total       0.28      0.18      0.18      7532

confusion matrix:
[[  0   0   0   0   0   0   0   0   1 238   0   2   0   0   5  70   2   1
    0   0]
 [  0   0  36   0   0   0   0   2   0 328   0   4   0   0  16   0   1   2
    0   0]
 [  0   0 193   0   0   0   0   1   0 194   0   1   0   0   3   0   2   0
    0   0]
 [  0   0  39   0   0   0   0   0   0 340   0   3   0   0  10   0   0   0
    0   0]
 [  0   0   3   0   0   0   0   1   0 369   0   5   0   0   4   3   0   0
    0   0]
 [  0   0  65   0   0   0   0   0   1 312   0   7   0   0   8   1   0   1
    0   0]
 [  0   0  13   0   0   0   0  18   4 334   0   7   0   0  12   2   0   0
    0   0]
 [  0   0   5   0   0   0   1 130   3 252   0   1   0   0   1   0   3   0
    0   0]
 [  0   0   0   0   0   0   0   7 105 278   0   2   0   0   1   3   2   0
    0   0]
 [  0   0   0   0   0   0   0   1   1 378   0   8   0   0   3   4   1   0
    0   1]
 [  0   0   1   0   0   0   0   0   4 382   0   7   0   0   0   4   1   0
    0   0]
 [  0   0   0   0   0   0   0   0   0 227   0 162   0   0   2   2   2   1
    0   0]
 [  0   0   6   0   0   0   0  11   1 351   0  15   0   0   6   1   2   0
    0   0]
 [  0   0   1   0   0   0   0   1   1 376   0   3   0   0   1  11   1   1
    0   0]
 [  0   0   2   0   0   0   0   2   0 272   0   1   0   0 114   3   0   0
    0   0]
 [  0   0   1   0   0   0   0   0   0 218   0   3   0   0   2 166   1   5
    0   2]
 [  0   0   2   0   0   0   0   1   0 285   0   5   0   0   5   8  57   1
    0   0]
 [  0   0   0   0   0   0   0   6   3 256   0   9   0   0   3  11   5  83
    0   0]
 [  0   0   4   0   0   0   0   4   0 271   0   4   0   0   1   6  20   0
    0   0]
 [  0   0   3   0   0   0   0   3   0 180   0   0   0   0   1  55   8   0
    0   1]]
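The two Naive Bayes variants read the features differently: `MultinomialNB` models the term counts (or tf-idf weights) directly, while `BernoulliNB` first binarizes every feature at its `binarize=0.0` threshold and models presence/absence of each term. A small numpy sketch of both views, on made-up term counts:

```python
import numpy as np

alpha = 0.01  # the smoothing value used in the benchmark above

# Hypothetical term counts: rows = documents, columns = vocabulary terms.
X = np.array([[3, 0, 1, 0],
              [0, 2, 0, 0],
              [1, 1, 0, 2]])
y = np.array([0, 0, 1])

# MultinomialNB view: smoothed per-class term probabilities from raw counts.
for c in (0, 1):
    counts = X[y == c].sum(axis=0)
    theta = (counts + alpha) / (counts.sum() + alpha * X.shape[1])
    print(c, theta)

# BernoulliNB view: binarize first (its binarize=0.0 threshold),
# then model presence/absence of each term.
print((X > 0).astype(int))
```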

LinearSVC with L1-based feature selection

In [17]:
print('=' * 80)
print("LinearSVC with L1-based feature selection")
# The smaller C, the stronger the regularization.
# The more regularization, the more sparsity.
results.append(benchmark(Pipeline([
  ('feature_selection', LinearSVC(penalty="l1", dual=False, tol=1e-3)),
  ('classification', LinearSVC())
])))
================================================================================
LinearSVC with L1-based feature selection
________________________________________________________________________________
Training: 
Pipeline(steps=[('feature_selection', LinearSVC(C=1.0, class_weight=None, dual=False, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l1', random_state=None, tol=0.001,
     verbose=0)), ('classification', LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0))])
train time: 0.436s
test time:  0.002s
accuracy:   0.186
classification report:
                          precision    recall  f1-score   support

             alt.atheism       0.00      0.00      0.00       319
           comp.graphics       0.00      0.00      0.00       389
 comp.os.ms-windows.misc       0.52      0.49      0.51       394
comp.sys.ibm.pc.hardware       0.00      0.00      0.00       392
   comp.sys.mac.hardware       0.00      0.00      0.00       385
          comp.windows.x       0.00      0.00      0.00       395
            misc.forsale       0.00      0.00      0.00       390
               rec.autos       0.70      0.33      0.45       396
         rec.motorcycles       0.85      0.26      0.40       398
      rec.sport.baseball       0.00      0.00      0.00       397
        rec.sport.hockey       0.00      0.00      0.00       399
               sci.crypt       0.67      0.41      0.51       396
         sci.electronics       0.00      0.00      0.00       393
                 sci.med       0.06      0.95      0.12       396
               sci.space       0.57      0.29      0.39       394
  soc.religion.christian       0.49      0.42      0.45       398
      talk.politics.guns       0.56      0.17      0.26       364
   talk.politics.mideast       0.87      0.23      0.36       376
      talk.politics.misc       0.00      0.00      0.00       310
      talk.religion.misc       0.00      0.00      0.00       251

             avg / total       0.27      0.19      0.18      7532

confusion matrix:
[[  0   0   0   0   0   0   0   0   1   1   0   1   0 238   6  70   1   1
    0   0]
 [  0   0  32   0   0   1   0   2   0   0   0   4   0 328  20   0   1   1
    0   0]
 [  0   0 193   0   0   0   0   0   0   1   0   0   0 194   4   0   2   0
    0   0]
 [  0   0  39   0   0   0   0   0   0   0   0   3   0 340  10   0   0   0
    0   0]
 [  0   0   3   0   0   0   0   1   0   0   0   4   0 369   5   3   0   0
    0   0]
 [  0   0  64   0   0   0   0   0   1   0   0   8   0 312   8   1   0   1
    0   0]
 [  0   0  13   0   0   0   0  17   4   0   0   7   0 335  13   1   0   0
    0   0]
 [  0   0   5   0   0   0   0 132   3   0   0   1   0 252   1   0   2   0
    0   0]
 [  0   0   0   0   0   0   0   7 105   0   0   2   0 278   1   3   2   0
    0   0]
 [  0   0   0   0   0   0   0   1   1   0   0   8   0 378   2   5   2   0
    0   0]
 [  0   0   1   0   0   0   0   0   4   0   0   7   0 382   0   4   1   0
    0   0]
 [  0   0   2   0   0   0   0   1   0   0   0 162   0 227   2   1   1   0
    0   0]
 [  0   0   6   0   0   0   0  12   1   0   0  15   0 351   6   1   1   0
    0   0]
 [  0   0   1   0   0   0   0   1   1   1   0   2   0 376   1  11   1   1
    0   0]
 [  0   0   1   0   0   0   0   1   0   0   0   2   0 272 116   2   0   0
    0   0]
 [  0   0   1   0   0   0   0   0   0   0   0   3   0 218   3 168   0   5
    0   0]
 [  0   0   2   0   0   0   0   1   0   0   0   2   0 285   2   8  63   1
    0   0]
 [  0   0   0   0   0   0   0   6   3   1   0   8   0 256   2   8   6  86
    0   0]
 [  0   0   4   0   0   0   0   4   0   2   0   2   0 271   1   5  21   0
    0   0]
 [  0   0   2   0   0   0   0   2   0   0   0   1   0 180   1  54   8   3
    0   0]]
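This pipeline relies on the first `LinearSVC` exposing a `transform` method that keeps the features with nonzero L1 coefficients; later scikit-learn releases removed that method, and the supported spelling wraps the selector in `SelectFromModel`. A hedged sketch of the modern equivalent on synthetic data (the dataset parameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Small synthetic stand-in for the sparse text features.
X, y = make_classification(n_samples=200, n_features=30,
                           n_informative=5, random_state=0)

pipe = Pipeline([
    # The L1-penalized LinearSVC zeroes out weak features;
    # SelectFromModel keeps only the columns with nonzero weight.
    ('feature_selection',
     SelectFromModel(LinearSVC(penalty="l1", dual=False, tol=1e-3))),
    ('classification', LinearSVC(dual=False)),
])
pipe.fit(X, y)
print(pipe.score(X, y))
```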

Plot Results

In [18]:
indices = np.arange(len(results))

results = [[x[i] for x in results] for i in range(4)]

clf_names, score, training_time, test_time = results
training_time = np.array(training_time) / np.max(training_time)
test_time = np.array(test_time) / np.max(test_time)

p1 = go.Bar(x=indices, y=score, 
            name="score", 
            marker=dict(color='navy'))

# Offset the timing bars slightly so each classifier's three bars
# sit side by side instead of landing on a neighbor's position.
p2 = go.Bar(x=indices + 0.3, y=training_time, 
            name="training time",
            marker=dict(color='cyan'))

p3 = go.Bar(x=indices + 0.6, y=test_time, 
            name="test time", 
            marker=dict(color='darkorange'))


layout = go.Layout(title="Score")
fig = go.Figure(data=[p1, p2, p3], layout=layout)
In [19]:
py.iplot(fig)
Out[19]:
[Interactive Plotly bar chart: accuracy ("score") and normalized training and test times for each classifier.]
License

Authors:

     Peter Prettenhofer <peter.prettenhofer@gmail.com>

     Olivier Grisel <olivier.grisel@ensta.org>

     Mathieu Blondel <mathieu@mblondel.org>

     Lars Buitinck

License:

     BSD 3 clause