
# Wikipedia Principal Eigenvector in Scikit-learn

A classical way to assess the relative importance of vertices in a graph is to compute the principal eigenvector of the adjacency matrix, assigning to each vertex the value of the corresponding component of that eigenvector as a centrality score:

https://en.wikipedia.org/wiki/Eigenvector_centrality

On the graph of webpages and links, Google calls these values the PageRank scores. The goal of this example is to analyze the graph of links between Wikipedia articles and to rank articles by relative importance according to this eigenvector centrality. The traditional way to compute the principal eigenvector is to use the power iteration method:

https://en.wikipedia.org/wiki/Power_iteration

Here the computation is achieved thanks to Martinsson's Randomized SVD algorithm implemented in scikit-learn. The graph data is fetched from the DBpedia dumps. DBpedia is an extraction of the latent structured data of the Wikipedia content.
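As a toy illustration (not part of the example below), power iteration repeatedly multiplies a score vector by the transposed adjacency matrix and renormalizes; the vector converges to the principal eigenvector, whose largest components mark the most linked-to vertices. The 4-node graph here is made up for the sketch:

```python
import numpy as np

# Hypothetical directed graph: A[i, j] = 1 if node i links to node j.
# Node 2 receives links from every other node.
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 0],
              [0, 0, 1, 0]], dtype=float)

v = np.ones(A.shape[0]) / A.shape[0]  # uniform initial guess
for _ in range(100):
    v_next = A.T @ v                  # incoming links drive a node's score
    v_next /= np.linalg.norm(v_next)  # renormalize to avoid over/underflow
    if np.abs(v_next - v).max() < 1e-10:
        break                         # converged to the principal eigenvector
    v = v_next

print(v)  # the largest component belongs to node 2
```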


### Version

In [1]:
import sklearn
sklearn.__version__

Out[1]:
'0.18'

### Imports

In [2]:
from __future__ import print_function

from bz2 import BZ2File

import os
from datetime import datetime
from pprint import pprint
from time import time

import numpy as np

from scipy import sparse

from sklearn.decomposition import randomized_svd
from sklearn.externals.joblib import Memory
from sklearn.externals.six.moves.urllib.request import urlopen
from sklearn.externals.six import iteritems

print(__doc__)

Automatically created module for IPython interactive environment
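Before the main computation, here is a quick self-contained sketch of the `randomized_svd` call used further down, on a small random sparse matrix. Note that in recent scikit-learn releases the function lives in `sklearn.utils.extmath` rather than `sklearn.decomposition`:

```python
import numpy as np
from scipy import sparse
from sklearn.utils.extmath import randomized_svd  # location in recent releases

# Rank-5 truncated SVD of a 100 x 80 random sparse matrix,
# same call shape as the one applied to the Wikipedia graph below.
rng = np.random.RandomState(0)
A = sparse.random(100, 80, density=0.05, format="csr", random_state=rng)
U, s, V = randomized_svd(A, n_components=5, n_iter=3, random_state=rng)
print(U.shape, s.shape, V.shape)  # (100, 5) (5,) (5, 80)
print(s)  # singular values, largest first
```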


### Calculations

In [3]:
redirects_url = "http://downloads.dbpedia.org/3.5.1/en/redirects_en.nt.bz2"
redirects_filename = redirects_url.rsplit("/", 1)[1]

page_links_url = "http://downloads.dbpedia.org/3.5.1/en/page_links_en.nt.bz2"
page_links_filename = page_links_url.rsplit("/", 1)[1]

resources = [
    (redirects_url, redirects_filename),
    (page_links_url, page_links_filename),
]

for url, filename in resources:
    if not os.path.exists(filename):
        print("Downloading data from '%s', please wait..." % url)
        opener = urlopen(url)
        open(filename, 'wb').write(opener.read())
        print()


In [4]:
memory = Memory(cachedir=".")


def index(redirects, index_map, k):
    """Find the index of an article name after redirect resolution"""
    k = redirects.get(k, k)
    return index_map.setdefault(k, len(index_map))


DBPEDIA_RESOURCE_PREFIX_LEN = len("http://dbpedia.org/resource/")
SHORTNAME_SLICE = slice(DBPEDIA_RESOURCE_PREFIX_LEN + 1, -1)


def short_name(nt_uri):
    """Remove the < and > URI markers and the common URI prefix"""
    return nt_uri[SHORTNAME_SLICE]

def get_redirects(redirects_filename):
    """Parse the redirections and build a transitively closed map out of it"""
    redirects = {}
    print("Parsing the NT redirect file")
    for l, line in enumerate(BZ2File(redirects_filename)):
        split = line.split()
        if len(split) != 4:
            print("ignoring malformed line: " + line)
            continue
        redirects[short_name(split[0])] = short_name(split[2])
        if l % 1000000 == 0:
            print("[%s] line: %08d" % (datetime.now().isoformat(), l))

    # compute the transitive closure
    print("Computing the transitive closure of the redirect relation")
    for l, source in enumerate(redirects.keys()):
        transitive_target = None
        target = redirects[source]
        seen = set([source])
        while True:
            transitive_target = target
            target = redirects.get(target)
            if target is None or target in seen:
                break
        redirects[source] = transitive_target
        if l % 1000000 == 0:
            print("[%s] line: %08d" % (datetime.now().isoformat(), l))

    return redirects

# disabling joblib as the pickling of large dicts seems much too slow
#@memory.cache
def get_adjacency_matrix(redirects_filename, page_links_filename, limit=None):
    """Extract the adjacency graph as a scipy sparse matrix

    Redirects are resolved first.

    Returns X, the scipy sparse adjacency matrix, redirects as python
    dict from article names to article names and index_map a python dict
    from article names to python int (article indexes).
    """
    print("Computing the redirect map")
    redirects = get_redirects(redirects_filename)

    print("Computing the integer index map")
    index_map = dict()
    links = list()
    for l, line in enumerate(BZ2File(page_links_filename)):
        split = line.split()
        if len(split) != 4:
            print("ignoring malformed line: " + line)
            continue
        i = index(redirects, index_map, short_name(split[0]))
        j = index(redirects, index_map, short_name(split[2]))
        links.append((i, j))
        if l % 1000000 == 0:
            print("[%s] line: %08d" % (datetime.now().isoformat(), l))

        if limit is not None and l >= limit - 1:
            break

    # convert the links to a sparse matrix
    print("Computing the adjacency matrix")
    X = sparse.lil_matrix((len(index_map), len(index_map)), dtype=np.float32)
    for i, j in links:
        X[i, j] = 1.0
    del links
    print("Converting to CSR representation")
    X = X.tocsr()
    print("CSR conversion done")
    return X, redirects, index_map


# stop after 5M links to make it possible to work in RAM
X, redirects, index_map = get_adjacency_matrix(
    redirects_filename, page_links_filename, limit=5000000)
names = dict((i, name) for name, i in iteritems(index_map))

print("Computing the principal singular vectors using randomized_svd")
t0 = time()
U, s, V = randomized_svd(X, 5, n_iter=3)
print("done in %0.3fs" % (time() - t0))

# print the names of the strongest components of the principal singular
# vector, which should be similar to the principal eigenvector
pprint([names[i] for i in np.abs(U.T[0]).argsort()[-10:]])
pprint([names[i] for i in np.abs(V[0]).argsort()[-10:]])

def centrality_scores(X, alpha=0.85, max_iter=100, tol=1e-10):
    """Power iteration computation of the principal eigenvector

    This method is also known as Google PageRank and the implementation
    is based on the one from the NetworkX project (BSD licensed too)
    with copyrights by:

      Aric Hagberg <hagberg@lanl.gov>
      Dan Schult <dschult@colgate.edu>
      Pieter Swart <swart@lanl.gov>
    """
    n = X.shape[0]
    X = X.copy()
    incoming_counts = np.asarray(X.sum(axis=1)).ravel()

    print("Normalizing the graph")
    for i in incoming_counts.nonzero()[0]:
        X.data[X.indptr[i]:X.indptr[i + 1]] *= 1.0 / incoming_counts[i]
    dangle = np.asarray(np.where(X.sum(axis=1) == 0, 1.0 / n, 0)).ravel()

    scores = np.ones(n, dtype=np.float32) / n  # initial guess
    for i in range(max_iter):
        print("power iteration #%d" % i)
        prev_scores = scores
        scores = (alpha * (scores * X + np.dot(dangle, prev_scores))
                  + (1 - alpha) * prev_scores.sum() / n)
        # check convergence: normalized l_inf norm
        scores_max = np.abs(scores).max()
        if scores_max == 0.0:
            scores_max = 1.0
        err = np.abs(scores - prev_scores).max() / scores_max
        print("error: %0.6f" % err)
        if err < n * tol:
            return scores

    return scores

print("Computing principal eigenvector score using a power iteration method")
t0 = time()
scores = centrality_scores(X, max_iter=100, tol=1e-10)
print("done in %0.3fs" % (time() - t0))
pprint([names[i] for i in np.abs(scores).argsort()[-10:]])

Computing the redirect map
Parsing the NT redirect file
[2016-11-03T22:06:59.153450] line: 00000000
[2016-11-03T22:07:08.892231] line: 01000000
[2016-11-03T22:07:19.669552] line: 02000000
[2016-11-03T22:07:27.035018] line: 03000000
[2016-11-03T22:07:33.749849] line: 04000000
Computing the transitive closure of the redirect relation
[2016-11-03T22:07:34.456301] line: 00000000
[2016-11-03T22:07:35.584369] line: 01000000
[2016-11-03T22:07:36.678192] line: 02000000
[2016-11-03T22:07:37.803847] line: 03000000
[2016-11-03T22:07:38.943277] line: 04000000
Computing the integer index map
[2016-11-03T22:07:39.158285] line: 00000000
[2016-11-03T22:07:47.741059] line: 01000000
[2016-11-03T22:07:54.770050] line: 02000000
[2016-11-03T22:08:01.567734] line: 03000000
[2016-11-03T22:08:08.569797] line: 04000000
Converting to CSR representation
CSR conversion done
Computing the principal singular vectors using randomized_svd
done in 6.959s
['1989',
'1971',
'1975',
'1970',
'2006',
'1972',
'1996',
'1966',
'1967',
'2007']
['Soviet_Union',
'Spain',
'Italy',
'Japan',
'Germany',
'World_War_II',
'France',
'United_Kingdom',
'United_States']
Computing principal eigenvector score using a power iteration method
Normalizing the graph
power iteration #0
error: 0.988448
power iteration #1
error: 0.495765
power iteration #2
error: 0.304177
power iteration #3
error: 0.206610
power iteration #4
error: 0.149243
power iteration #5
error: 0.112218
power iteration #6
error: 0.086735
power iteration #7
error: 0.068370
power iteration #8
error: 0.054681
power iteration #9
error: 0.044216
power iteration #10
error: 0.036059
power iteration #11
error: 0.029603
power iteration #12
error: 0.024433
power iteration #13
error: 0.020252
power iteration #14
error: 0.016845
power iteration #15
error: 0.014051
power iteration #16
error: 0.011748
power iteration #17
error: 0.009841
power iteration #18
error: 0.008257
power iteration #19
error: 0.006937
power iteration #20
error: 0.005834
power iteration #21
error: 0.004912
power iteration #22
error: 0.004138
power iteration #23
error: 0.003489
power iteration #24
error: 0.002943
power iteration #25
error: 0.002484
power iteration #26
error: 0.002097
power iteration #27
error: 0.001771
power iteration #28
error: 0.001496
power iteration #29
error: 0.001264
power iteration #30
error: 0.001068
power iteration #31
error: 0.000903
power iteration #32
error: 0.000763
power iteration #33
error: 0.000645
power iteration #34
error: 0.000546
power iteration #35
error: 0.000461
power iteration #36
error: 0.000390
power iteration #37
error: 0.000330
power iteration #38
error: 0.000279
power iteration #39
error: 0.000236
power iteration #40
error: 0.000200
power iteration #41
error: 0.000169
power iteration #42
error: 0.000143
power iteration #43
error: 0.000121
power iteration #44
error: 0.000102
done in 1.726s
['Africa',
'Animal',
'New_York',
'Jews',
'Philosophy',
'Byzantine_Empire',
'New_York_City',
'World_War_I',
'France',
'United_States']
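The update rule inside `centrality_scores` can be checked on a tiny graph. The sketch below uses a hypothetical 4-page link graph (pages and links are made up for illustration) and reproduces the same normalization, dangling-node handling, and damped update:

```python
import numpy as np
from scipy import sparse

# Hypothetical link graph: 0->1, 0->2, 1->2, 2->0; page 3 is dangling.
A = sparse.csr_matrix(np.array([[0, 1, 1, 0],
                                [0, 0, 1, 0],
                                [1, 0, 0, 0],
                                [0, 0, 0, 0]], dtype=np.float64))
n = A.shape[0]
alpha = 0.85  # damping factor, as in centrality_scores

# Row-normalize so each page distributes its score over its out-links.
out_degree = np.asarray(A.sum(axis=1)).ravel()
for i in out_degree.nonzero()[0]:
    A.data[A.indptr[i]:A.indptr[i + 1]] /= out_degree[i]
# Dangling pages spread their score uniformly over all pages.
dangle = np.asarray(np.where(out_degree == 0, 1.0 / n, 0)).ravel()

scores = np.ones(n) / n
for _ in range(200):
    prev = scores
    # damped update: followed links + dangling mass + uniform teleport
    scores = alpha * (prev * A + np.dot(dangle, prev)) + (1 - alpha) / n
    if np.abs(scores - prev).max() < 1e-12:
        break

print(scores)  # page 2, with the most in-links, scores highest
```

The total score stays equal to 1 at every iteration, so the result can be read directly as a probability distribution over pages.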


Author: Olivier Grisel <olivier.grisel@ensta.org>

License: BSD 3 clause