
Randomly Generated Multilabel Dataset in Scikit-learn

This example illustrates the datasets.make_multilabel_classification dataset generator. Each sample consists of counts of two features (up to 50 in total), which are distributed differently in each class.
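As a quick orientation (this sketch is not part of the original notebook, and the parameter values are illustrative), the generator returns a matrix X of feature counts and a binary label-indicator matrix Y:

from sklearn.datasets import make_multilabel_classification

X, Y = make_multilabel_classification(n_samples=5, n_features=2, n_classes=3,
                                      n_labels=1, length=50, random_state=0)
print(X.shape, Y.shape)   # (5, 2) (5, 3)
print(X[0], Y[0])         # one sample's feature counts and its label-indicator row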

Points are labeled as follows, where Y means the class is present:

1   2   3   Color
Y   N   N   Red
N   Y   N   Blue
N   N   Y   Yellow
Y   Y   N   Purple
Y   N   Y   Orange
N   Y   Y   Green
Y   Y   Y   Brown
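This table is exactly what the color lookup in the code below encodes: treating classes 1, 2 and 3 as bits with weights 1, 2 and 4 maps each label combination to a unique index into the COLORS array (a small illustrative check, not part of the original notebook):

import numpy as np

combos = np.array([[1, 0, 0],   # Red
                   [0, 1, 0],   # Blue
                   [0, 0, 1],   # Yellow
                   [1, 1, 0],   # Purple
                   [1, 0, 1],   # Orange
                   [0, 1, 1],   # Green
                   [1, 1, 1]])  # Brown
print((combos * [1, 2, 4]).sum(axis=1))   # [1 2 4 3 5 6 7]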

A big circle marks the expected sample for each class; its size reflects the probability of selecting that class label.

The left and right examples highlight the n_labels parameter: more of the samples in the right plot have 2 or 3 labels.
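The difference can also be checked numerically by counting how many labels each generated sample carries (a rough sketch with illustrative parameters, not part of the original notebook):

import numpy as np
from sklearn.datasets import make_multilabel_classification

for n_labels in (1, 3):
    _, Y = make_multilabel_classification(n_samples=1000, n_features=2,
                                          n_classes=3, n_labels=n_labels,
                                          length=50, allow_unlabeled=False,
                                          random_state=0)
    # Number of samples carrying exactly 1, 2 and 3 labels.
    print(n_labels, np.bincount(Y.sum(axis=1), minlength=4)[1:])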

Note that this two-dimensional example is very degenerate: generally the number of features would be much greater than the “document length”, while here we have much larger documents than vocabulary. Similarly, with n_classes > n_features, it is much less likely that a feature distinguishes a particular class.


Version

In [1]:
import sklearn
sklearn.__version__
Out[1]:
'0.18'

Imports

This tutorial imports make_multilabel_classification (used below under the alias make_ml_clf).

In [2]:
from __future__ import print_function

import numpy as np

import plotly.plotly as py
import plotly.graph_objs as go

from sklearn.datasets import make_multilabel_classification as make_ml_clf

Calculations

In [3]:
COLORS = np.array(['!',        # placeholder at index 0; never used (allow_unlabeled=False)
                   '#FF3333',  # red
                   '#0198E1',  # blue
                   '#BF5FFF',  # purple
                   '#FCD116',  # yellow
                   '#FF7216',  # orange
                   '#4DBD33',  # green
                   '#87421F'   # brown
                   ])

# Use same random seed for multiple calls to make_multilabel_classification to
# ensure same distributions
RANDOM_SEED = np.random.randint(2 ** 10)

def plot_2d(n_labels=1, n_classes=3, length=50):
    X, Y, p_c, p_w_c = make_ml_clf(n_samples=150, n_features=2,
                                   n_classes=n_classes, n_labels=n_labels,
                                   length=length, allow_unlabeled=False,
                                   return_distributions=True,
                                   random_state=RANDOM_SEED)

    # Samples, colored by label combination: classes 1, 2, 3 are weighted as
    # bits 1, 2, 4 to index into COLORS.
    trace1 = go.Scatter(x=X[:, 0], y=X[:, 1],
                        mode='markers',
                        showlegend=False,
                        marker=dict(size=8,
                                    color=COLORS.take((Y * [1, 2, 4]).sum(axis=1)))
                        )
    # Expected sample for each single class, at length * P(w|C); marker size
    # grows with the probability of selecting that class label.
    trace2 = go.Scatter(x=p_w_c[0] * length, y=p_w_c[1] * length,
                        mode='markers',
                        showlegend=False,
                        marker=dict(color=COLORS.take([1, 2, 4]),
                                    size=14 + 40 * p_c ** 2,
                                    line=dict(width=1, color='black'))
                        )
    
    data = [trace1, trace2]
    return data, p_c, p_w_c
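
Because both plotting calls below pass the same RANDOM_SEED, make_multilabel_classification draws identical class priors (p_c) and per-class feature distributions (p_w_c) each time, which is why the distributions printed at the end describe both plots. A small sanity check, not part of the original notebook:

_, p_c_a, p_w_c_a = plot_2d(n_labels=1)
_, p_c_b, p_w_c_b = plot_2d(n_labels=3)
# Same seed -> same generating distributions for both plots.
print(np.allclose(p_c_a, p_c_b), np.allclose(p_w_c_a, p_w_c_b))   # expected: True True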

Plot Results

n_labels=1

In [4]:
data, p_c, p_w_c = plot_2d(n_labels=1)

layout=go.Layout(title='n_labels=1, length=50',
                 xaxis=dict(title='Feature 0 count',
                            showgrid=False),
                 yaxis=dict(title='Feature 1 count',
                            showgrid=False),
                )

fig = go.Figure(data=data, layout=layout)
py.iplot(fig)
Out[4]:

n_labels=3

In [5]:
data, p_c, p_w_c = plot_2d(n_labels=3)

layout=go.Layout(title='n_labels=3, length=50',
                 xaxis=dict(title='Feature 0 count',
                            showgrid=False),
                 yaxis=dict(title='Feature 1 count',
                            showgrid=False),
                )

fig = go.Figure(data=data, layout=layout)
py.iplot(fig)
Out[5]:
In [6]:
print('The data was generated from (random_state=%d):' % RANDOM_SEED)
print('Class', 'P(C)', 'P(w0|C)', 'P(w1|C)', sep='\t')
for k, p, p_w in zip(['red', 'blue', 'yellow'], p_c, p_w_c.T):
    print('%s\t%0.2f\t%0.2f\t%0.2f' % (k, p, p_w[0], p_w[1]))
The data was generated from (random_state=701):
Class	P(C)	P(w0|C)	P(w1|C)
red	0.11	0.66	0.34
blue	0.59	0.52	0.48
yellow	0.30	0.66	0.34
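
As an illustrative follow-up (not part of the original notebook), the big circles sit at the expected feature counts length * P(w|C); with the distributions printed above, the red circle, for example, lies near (0.66 * 50, 0.34 * 50) = (33, 17):

# Expected circle positions, matching trace2 in plot_2d (length=50).
for k, p_w in zip(['red', 'blue', 'yellow'], p_w_c.T):
    print('%s circle at (%.1f, %.1f)' % (k, p_w[0] * 50, p_w[1] * 50))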