Show Sidebar Hide Sidebar

# Compare Cross Decomposition Methods in Scikit-learn

Simple usage of various cross decomposition algorithms: - PLSCanonical - PLSRegression, with multivariate response, a.k.a. PLS2 - PLSRegression, with univariate response, a.k.a. PLS1 - CCA

Given 2 multivariate covarying two-dimensional datasets, X, and Y, PLS extracts the â€˜directions of covarianceâ€™, i.e. the components of each datasets that explain the most shared variance between both datasets. This is apparent on the scatterplot matrix display: components 1 in dataset X and dataset Y are maximally correlated (points lie around the first diagonal). This is also true for components 2 in both dataset, however, the correlation across datasets for different components is weak: the point cloud is very spherical.

#### New to Plotly?¶

You can set up Plotly to work in online or offline mode, or in jupyter notebooks.
We also have a quick-reference cheatsheet (new!) to help you get started!

### Version¶

In [1]:
import sklearn
sklearn.__version__

Out[1]:
'0.18'

### Imports¶

This tutorails imports PLSCanonical, PLSRegression and CCA.

In [2]:
print(__doc__)

import plotly.plotly as py
import plotly.graph_objs as go
from plotly import tools

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cross_decomposition import PLSCanonical, PLSRegression, CCA

Automatically created module for IPython interactive environment


### Calculations¶

Dataset based latent variables model

In [3]:
n = 500
# 2 latents vars:
l1 = np.random.normal(size=n)
l2 = np.random.normal(size=n)

latents = np.array([l1, l1, l2, l2]).T
X = latents + np.random.normal(size=4 * n).reshape((n, 4))
Y = latents + np.random.normal(size=4 * n).reshape((n, 4))

X_train = X[:n / 2]
Y_train = Y[:n / 2]
X_test = X[n / 2:]
Y_test = Y[n / 2:]

print("Corr(X)")
print(np.round(np.corrcoef(X.T), 2))
print("Corr(Y)")
print(np.round(np.corrcoef(Y.T), 2))

Corr(X)
[[ 1.    0.5  -0.06 -0.1 ]
[ 0.5   1.    0.02 -0.01]
[-0.06  0.02  1.    0.43]
[-0.1  -0.01  0.43  1.  ]]
Corr(Y)
[[ 1.    0.48 -0.11 -0.06]
[ 0.48  1.   -0.03  0.01]
[-0.11 -0.03  1.    0.51]
[-0.06  0.01  0.51  1.  ]]


### Canonical (symmetric) PLS¶

Transform Data

In [4]:
plsca = PLSCanonical(n_components=2)
plsca.fit(X_train, Y_train)
X_train_r, Y_train_r = plsca.transform(X_train, Y_train)
X_test_r, Y_test_r = plsca.transform(X_test, Y_test)


Scatter plot of scores

In [5]:
fig = tools.make_subplots(rows=2, cols=2,
print_grid=False,
subplot_titles=('Comp. 1: X vs Y (test corr = %.2f)' %
np.corrcoef(X_test_r[:, 0], Y_test_r[:, 0])[0, 1],
'X comp. 1 vs X comp. 2 (test corr = %.2f)'
% np.corrcoef(X_test_r[:, 0], X_test_r[:, 1])[0, 1],
'Y comp. 1 vs Y comp. 2 , (test corr = %.2f)'
% np.corrcoef(Y_test_r[:, 0], Y_test_r[:, 1])[0, 1],
'Comp. 2: X vs Y (test corr = %.2f)' %
np.corrcoef(X_test_r[:, 1], Y_test_r[:, 1])[0, 1]))

# 1) On diagonal plot X vs Y scores on each components

comp1 = go.Scatter(x=X_train_r[:, 0],
y=Y_train_r[:, 0],
name="train",
mode='markers',
marker=dict(color='red',
line=dict(color='black', width=1))
)
comp1_ = go.Scatter(x=X_test_r[:, 0],
y=Y_test_r[:, 0],
name="test",
mode='markers',
marker=dict(color='green',
line=dict(color='black', width=1))
)
fig.append_trace(comp1, 1, 1)
fig.append_trace(comp1_, 1, 1)

fig['layout']['xaxis1'].update(title='x scores', zeroline=False,
showgrid=False)
fig['layout']['yaxis1'].update(title='y scores', zeroline=False,
showgrid=False)

comp2 = go.Scatter(x=X_train_r[:, 1],
y=Y_train_r[:, 1],
name="train",
showlegend=False,
mode='markers',
marker=dict(color='red',
line=dict(color='black', width=1))
)

comp2_ = go.Scatter(x=X_test_r[:, 1],
y=Y_test_r[:, 1],
name="test",
showlegend=False,
mode='markers',
marker=dict(color='green',
line=dict(color='black', width=1))
)
fig.append_trace(comp2, 2, 2)
fig.append_trace(comp2_, 2, 2)

fig['layout']['xaxis4'].update(title='x scores', zeroline=False,
showgrid=False)
fig['layout']['yaxis4'].update(title='y scores', zeroline=False,
showgrid=False)

# 2) Off diagonal plot components 1 vs 2 for X and Y

xcomp = go.Scatter(x=X_train_r[:, 0],
y=X_train_r[:, 1],
name="train",
showlegend=False,
mode='markers',
marker=dict(color='red',
line=dict(color='black', width=1))
)
xcomp_ = go.Scatter(x=X_test_r[:, 0],
y=X_test_r[:, 1],
name="test",
showlegend=False,
mode='markers',
marker=dict(color='green',
line=dict(color='black', width=1))
)

fig.append_trace(xcomp, 1, 2)
fig.append_trace(xcomp_, 1, 2)

fig['layout']['xaxis2'].update(title='X comp. 1', zeroline=False,
showgrid=False)
fig['layout']['yaxis2'].update(title='X comp. 2', zeroline=False,
showgrid=False)

ycomp1 = go.Scatter(x=Y_train_r[:, 0],
y=Y_train_r[:, 1],
name="train",
showlegend=False,
mode='markers',
marker=dict(color='red',
line=dict(color='black', width=1))
)

ycomp1_ = go.Scatter(x=Y_test_r[:, 0],
y=Y_test_r[:, 1],
name="test",
showlegend=False,
mode='markers',
marker=dict(color='green',
line=dict(color='black', width=1))
)
fig.append_trace(ycomp1, 2, 1)
fig.append_trace(ycomp1_, 2, 1)

fig['layout']['xaxis3'].update(title='Y comp. 1', zeroline=False,
showgrid=False)
fig['layout']['yaxis3'].update(title='Y comp. 2', zeroline=False,
showgrid=False)

fig['layout'].update(height=800)

In [6]:
py.iplot(fig)

Out[6]:

### PLS Regression¶

PLS regression, with multivariate response, a.k.a. PLS2

In [7]:
n = 1000
q = 3
p = 10
X = np.random.normal(size=n * p).reshape((n, p))
B = np.array([[1, 2] + [0] * (p - 2)] * q).T
# each Yj = 1*X1 + 2*X2 + noize
Y = np.dot(X, B) + np.random.normal(size=n * q).reshape((n, q)) + 5

pls2 = PLSRegression(n_components=3)
pls2.fit(X, Y)
print("True B (such that: Y = XB + Err)")
print(B)
# compare pls2.coef_ with B
print("Estimated B")
print(np.round(pls2.coef_, 1))
pls2.predict(X)

True B (such that: Y = XB + Err)
[[1 1 1]
[2 2 2]
[0 0 0]
[0 0 0]
[0 0 0]
[0 0 0]
[0 0 0]
[0 0 0]
[0 0 0]
[0 0 0]]
Estimated B
[[ 1.   1.   1. ]
[ 2.   2.   2. ]
[ 0.   0.   0. ]
[ 0.   0.  -0. ]
[-0.  -0.1 -0. ]
[ 0.   0.   0. ]
[-0.   0.  -0. ]
[-0.  -0.  -0. ]
[-0.  -0.  -0. ]
[ 0.1  0.1 -0. ]]

Out[7]:
array([[ 2.2696561 ,  2.32455797,  2.40508248],
[ 9.78256978,  9.95096021,  9.88771175],
[ 6.86635142,  6.99968069,  6.7605165 ],
...,
[ 6.75327401,  6.87061734,  6.72441019],
[ 7.57312605,  7.71579299,  7.56704522],
[ 3.11627201,  3.19632039,  3.08518641]])

PLS regression, with univariate response, a.k.a. PLS1

In [8]:
n = 1000
p = 10
X = np.random.normal(size=n * p).reshape((n, p))
y = X[:, 0] + 2 * X[:, 1] + np.random.normal(size=n * 1) + 5
pls1 = PLSRegression(n_components=3)
pls1.fit(X, y)
# note that the number of components exceeds 1 (the dimension of y)
print("Estimated betas")
print(np.round(pls1.coef_, 1))

Estimated betas
[[ 1. ]
[ 2. ]
[-0. ]
[ 0. ]
[-0. ]
[-0.1]
[-0. ]
[-0. ]
[-0. ]
[ 0. ]]


CCA (PLS mode B with symmetric deflation)

In [9]:
cca = CCA(n_components=2)
cca.fit(X_train, Y_train)
X_train_r, Y_train_r = plsca.transform(X_train, Y_train)
X_test_r, Y_test_r = plsca.transform(X_test, Y_test)

Still need help?