
Model Selection with Probabilistic PCA and Factor Analysis (FA) in Scikit-learn

Probabilistic PCA and Factor Analysis are probabilistic models, so the likelihood of new data can be used for model selection and covariance estimation. Here we compare PCA and FA with cross-validation on low-rank data corrupted with homoscedastic noise (the noise variance is the same for each feature) or heteroscedastic noise (the noise variance differs from feature to feature). In a second step we compare the model likelihoods to the likelihoods obtained from shrinkage covariance estimators.
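As a minimal illustration of this idea (a sketch on hypothetical synthetic data, not part of the example below), both estimators expose a score method that returns the average log-likelihood of new samples; this is exactly the quantity the cross-validation below compares:

import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis

# Hypothetical train/test split for illustration only
rng = np.random.RandomState(0)
X_train, X_test = rng.randn(500, 20), rng.randn(100, 20)

# score() returns the average log-likelihood of unseen samples;
# higher is better, so it can drive model selection
pca = PCA(n_components=5).fit(X_train)
fa = FactorAnalysis(n_components=5).fit(X_train)
print(pca.score(X_test), fa.score(X_test))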

One can observe that with homoscedastic noise both FA and PCA succeed in recovering the size of the low-rank subspace. In this case the likelihood with PCA is higher than with FA. However, PCA fails and overestimates the rank when heteroscedastic noise is present. Under appropriate circumstances the low-rank models are more likely than the shrinkage models.

The automatic dimensionality estimation from Thomas P. Minka, "Automatic Choice of Dimensionality for PCA", NIPS 2000: 598-604, is also compared.

New to Plotly?

Plotly's Python library is free and open source! Get started by downloading the client and reading the primer.
You can set up Plotly to work in online or offline mode, or in jupyter notebooks.
We also have a quick-reference cheatsheet (new!) to help you get started!

Version

In [1]:
import sklearn
sklearn.__version__
Out[1]:
'0.18'

Imports

In [2]:
import plotly.plotly as py
import plotly.graph_objs as go
from plotly import tools

import numpy as np
from scipy import linalg

from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.covariance import ShrunkCovariance, LedoitWolf
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

Calculations

Create the data

In [3]:
n_samples, n_features, rank = 1000, 50, 10
sigma = 1.
rng = np.random.RandomState(42)

# Build low-rank data: draw a random orthonormal basis and project
# rank-dimensional latent factors into the 50-dimensional feature space
U, _, _ = linalg.svd(rng.randn(n_features, n_features))
X = np.dot(rng.randn(n_samples, rank), U[:, :rank].T)

# Adding homoscedastic noise
X_homo = X + sigma * rng.randn(n_samples, n_features)

# Adding heteroscedastic noise
sigmas = sigma * rng.rand(n_features) + sigma / 2.
X_hetero = X + rng.randn(n_samples, n_features) * sigmas
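As a quick sanity check (illustration only, not part of the original example), the noiseless X has rank 10 by construction, while the noisy copies are full rank:

# Sanity check: X is rank-10 by construction; the noisy versions are full rank
print(np.linalg.matrix_rank(X))        # 10
print(np.linalg.matrix_rank(X_homo))   # 50
print(np.linalg.matrix_rank(X_hetero)) # 50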

Fit the models

In [4]:
n_components = np.arange(0, n_features, 5)  # options for n_components


def compute_scores(X):
    pca = PCA(svd_solver='full')
    fa = FactorAnalysis()

    pca_scores, fa_scores = [], []
    for n in n_components:
        pca.n_components = n
        fa.n_components = n
        pca_scores.append(np.mean(cross_val_score(pca, X)))
        fa_scores.append(np.mean(cross_val_score(fa, X)))

    return pca_scores, fa_scores


def shrunk_cov_score(X):
    shrinkages = np.logspace(-2, 0, 30)
    cv = GridSearchCV(ShrunkCovariance(), {'shrinkage': shrinkages})
    return np.mean(cross_val_score(cv.fit(X).best_estimator_, X))


def lw_score(X):
    return np.mean(cross_val_score(LedoitWolf(), X))
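For example (a sketch reusing the data created above), cross_val_score scores each fold with the estimator's score method, i.e. the average held-out log-likelihood, so each call below returns one log-likelihood per fold; compute_scores then averages these values for every candidate n_components:

# Illustration only: per-fold held-out log-likelihoods at the true rank
print(cross_val_score(PCA(svd_solver='full', n_components=rank), X_homo))
print(cross_val_score(FactorAnalysis(n_components=rank), X_homo))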

Plot Results

In [5]:
data = [[], []]

for i, (X, title) in enumerate([(X_homo, 'Homoscedastic Noise'),
                                (X_hetero, 'Heteroscedastic Noise')]):
    pca_scores, fa_scores = compute_scores(X)
    n_components_pca = n_components[np.argmax(pca_scores)]
    n_components_fa = n_components[np.argmax(fa_scores)]

    pca = PCA(svd_solver='full', n_components='mle')
    pca.fit(X)
    n_components_pca_mle = pca.n_components_
    
    print(title)
    print("best n_components by PCA CV = %d" % n_components_pca)
    print("best n_components by FactorAnalysis CV = %d" % n_components_fa)
    print("best n_components by PCA MLE = %d" % n_components_pca_mle)
    

    pcascores = go.Scatter(x=n_components, 
                            y=pca_scores, 
                            name='PCA scores',
                            mode='lines',
                            line=dict(color='blue')
                           )
    data[i].append(pcascores)
    
    fascores = go.Scatter(x=n_components, 
                           y=fa_scores, 
                           name='FA scores',
                           mode='lines',
                           line=dict(color='red')
                          )
    data[i].append(fascores)
    
    truth = go.Scatter(x=[rank, rank], 
                       y=[min(pca_scores)-1, max(fa_scores)+1], 
                       name='TRUTH: %d' % rank,
                       mode='lines',
                       line=dict(color='green', dash='dash')
                      )
    data[i].append(truth) 
    
    pca_cv = go.Scatter(x=[n_components_pca, n_components_pca],
                        y=[min(pca_scores)-1, max(fa_scores)+1],
                        name='PCA CV: %d' % n_components_pca,
                        mode='lines',
                        line=dict(color='blue', dash='dash')
                       )
    data[i].append(pca_cv)
    
    factor_analysis = go.Scatter(x=[n_components_fa, n_components_fa],
                                 y=[min(pca_scores)-1, max(fa_scores)+1],
                                 name='FactorAnalysis CV: %d' % n_components_fa,
                                 mode='lines',
                                 line=dict(color='red', dash='dash')
                                )
    data[i].append(factor_analysis)
    pca_mle = go.Scatter(x=[n_components_pca_mle, n_components_pca_mle],
                         y=[min(pca_scores)-1, max(fa_scores)+1],
                         name='PCA MLE: %d' % n_components_pca_mle,
                         mode='lines',
                         line=dict(color='black', dash='dash')
                        )
    data[i].append(pca_mle)
    # Compare with other covariance estimators (flat lines across all
    # n_components); compute each score once since it involves a full CV run
    shrunk_score = shrunk_cov_score(X)
    shrunk_covariance = go.Scatter(x=[min(n_components), max(n_components)],
                                   y=[shrunk_score, shrunk_score],
                                   name='Shrunk Covariance MLE',
                                   mode='lines',
                                   line=dict(color='violet', dash='dash')
                                  )
    data[i].append(shrunk_covariance)

    lw = lw_score(X)
    ledoit_wolf = go.Scatter(x=[min(n_components), max(n_components)],
                             y=[lw, lw],
                             name='LedoitWolf MLE',
                             mode='lines',
                             line=dict(color='orange', dash='dashdot')
                            )
    data[i].append(ledoit_wolf)
    
Homoscedastic Noise
best n_components by PCA CV = 10
best n_components by FactorAnalysis CV = 10
best n_components by PCA MLE = 10
Heteroscedastic Noise
best n_components by PCA CV = 40
best n_components by FactorAnalysis CV = 10
best n_components by PCA MLE = 38

Homoscedastic Noise

In [6]:
layout = go.Layout(title='Homoscedastic Noise',
                   xaxis=dict(title='nb of components'),
                   yaxis=dict(title='CV scores')
                  )
fig = go.Figure(data=data[0], layout=layout)

py.iplot(fig)
Out[6]:

Heteroscedastic Noise

In [7]:
layout = go.Layout(title='Heteroscedastic Noise',
                   xaxis=dict(title='nb of components'),
                   yaxis=dict(title='CV scores')
                  )
fig = go.Figure(data=data[1], layout=layout)
py.iplot(fig)
Out[7]:

License

Authors: Alexandre Gramfort, Denis A. Engemann

License: BSD 3 clause