Show Sidebar Hide Sidebar

# Scaling the Regularization Parameter for SVCs in Scikit-learn

The following example illustrates the effect of scaling the regularization parameter when using Support Vector Machines for classification. For SVC classification, we are interested in a risk minimization for the equation:

where

• is used to set the amount of regularization

• is a loss function of our samples and our model parameters.

• is a penalty function of our model parameters

If we consider the loss function to be the individual error per sample, then the data-fit term, or the sum of the error for each sample, will increase as we add more samples. The penalization term, however, will not increase. When using, for example, cross validation, to set the amount of regularization with C, there will be a different amount of samples between the main problem and the smaller problems within the folds of the cross validation.

Since our loss function is dependent on the amount of samples, the latter will influence the selected value of C. The question that arises is How do we optimally adjust C to account for the different amount of training samples? The figures below are used to illustrate the effect of scaling our C to compensate for the change in the number of samples, in the case of using an l1 penalty, as well as the l2 penalty.

### l1-penalty case¶

In the l1 case, theory says that prediction consistency (i.e. that under given hypothesis, the estimator learned predicts as well as a model knowing the true distribution) is not possible because of the bias of the l1. It does say, however, that model consistency, in terms of finding the right set of non-zero parameters as well as their signs, can be achieved by scaling C1.

### l2-penalty case¶

The theory says that in order to achieve prediction consistency, the penalty parameter should be kept constant as the number of samples grow.

### Simulations¶

The two figures below plot the values of C on the x-axis and the corresponding cross-validation scores on the y-axis, for several different fractions of a generated data-set.

In the l1 penalty case, the cross-validation-error correlates best with the test-error, when scaling our C with the number of samples, n, which can be seen in the first figure.

For the l2 penalty case, the best result comes from the case where C is not scaled.

#### New to Plotly?¶

You can set up Plotly to work in online or offline mode, or in jupyter notebooks.
We also have a quick-reference cheatsheet (new!) to help you get started!

### Version¶

In [1]:
import sklearn
sklearn.__version__

Out[1]:
'0.18.1'

### Imports¶

This tutorial imports LinearSVC, ShuffleSplit, GridSearchCV and check_random_state.

In [2]:
import plotly.plotly as py
import plotly.graph_objs as go
from plotly import tools

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import GridSearchCV
from sklearn.utils import check_random_state
from sklearn import datasets


### Calculations¶

In [3]:
rnd = check_random_state(1)

# set up dataset
n_samples = 100
n_features = 300


l1 data (only 5 informative features)

In [4]:
X_1, y_1 = datasets.make_classification(n_samples=n_samples,
n_features=n_features, n_informative=5,
random_state=1)


l2 data: non sparse, but less features

In [5]:
y_2 = np.sign(.5 - rnd.rand(n_samples))
X_2 = rnd.randn(n_samples, n_features / 5) + y_2[:, np.newaxis]
X_2 += 5 * rnd.randn(n_samples, n_features / 5)

In [6]:
clf_sets = [(LinearSVC(penalty='l1', loss='squared_hinge', dual=False,
tol=1e-3),
np.logspace(-2.3, -1.3, 10), X_1, y_1),
(LinearSVC(penalty='l2', loss='squared_hinge', dual=True,
tol=1e-4),
np.logspace(-4.5, -2, 10), X_2, y_2)]

colors = ['navy', 'cyan', 'darkorange']
lw = 2
data = []
titles= []


### Plot Results¶

In [7]:
for fignum, (clf, cs, X, y) in enumerate(clf_sets):
# set up the plot for each regressor
data.append([[],[]])

for k, train_size in enumerate(np.linspace(0.3, 0.7, 3)[::-1]):
param_grid = dict(C=cs)

# To get nice curve, we need a large number of iterations to
# reduce the variance
grid = GridSearchCV(clf, refit=False, param_grid=param_grid,
cv=ShuffleSplit(train_size=train_size,
n_splits=250, random_state=1))
grid.fit(X, y)
scores = grid.cv_results_['mean_test_score']

scales = [(1, 'No scaling'),
((n_samples * train_size), '1/n_samples'),
]

for subplotnum, (scaler, name) in enumerate(scales):

grid_cs = cs * float(scaler)  # scale the C's

trace = go.Scatter(x=grid_cs, y=scores,
name="fraction %.2f" %
train_size,
mode='lines',
line=dict(color=colors[k], width=lw))
data[fignum][subplotnum].append(trace)
titles.append('scaling=%s, penalty=%s, loss=%s' %
(name, clf.penalty, clf.loss))



### Plot l1-penalty¶

In [8]:
fig1 = tools.make_subplots(rows=2, cols=1,
subplot_titles=tuple(titles[ :2]))

for i in range(0, len(data[0][0])):
fig1.append_trace(data[0][0][i], 1, 1)

for i in range(0, len(data[0][1])):
fig1.append_trace(data[0][1][i], 2, 1)

for i in map(str, range(1, 3)):
y = 'yaxis' + i
x = 'xaxis' + i
fig1['layout'][y].update(title='CV Score')
fig1['layout'][x].update(title='C', type='log')
fig1['layout'].update(height=700)

This is the format of your plot grid:
[ (1,1) x1,y1 ]
[ (2,1) x2,y2 ]


In [9]:
py.iplot(fig1)

Out[9]:

### Plot l2-penalty¶

In [10]:
fig2 = tools.make_subplots(rows=2, cols=1,
subplot_titles=tuple(titles[6 : 8]))

for i in range(0, len(data[1][0])):
fig2.append_trace(data[1][0][i], 1, 1)

for i in range(0, len(data[1][1])):
fig2.append_trace(data[1][1][i], 2, 1)

for i in map(str, range(1, 3)):
y = 'yaxis' + i
x = 'xaxis' + i
fig2['layout'][y].update(title='CV Score')
fig2['layout'][x].update(title='C', type='log')
fig2['layout'].update(height=700)

This is the format of your plot grid:
[ (1,1) x1,y1 ]
[ (2,1) x2,y2 ]


In [11]:
py.iplot(fig2)

Out[11]:

Author:

     Andreas Mueller <amueller@ais.uni-bonn.de>

Jaques Grobler <jaques.grobler@inria.fr>



     BSD 3 clause