
Wikipedia Principal Eigenvector in Scikit-learn

A classical way to assess the relative importance of vertices in a graph is to compute the principal eigenvector of the adjacency matrix, assigning to each vertex the value of its component in that eigenvector as a centrality score:

https://en.wikipedia.org/wiki/Eigenvector_centrality

On the graph of webpages and links, these values are known as the PageRank scores, popularized by Google. The goal of this example is to analyze the graph of links between Wikipedia articles and rank articles by relative importance according to this eigenvector centrality. The traditional way to compute the principal eigenvector is the power iteration method:

https://en.wikipedia.org/wiki/Power_iteration

Here the computation is achieved with Martinsson's randomized SVD algorithm as implemented in scikit-learn. The graph data is fetched from the DBpedia dumps; DBpedia is an extraction of the latent structured data embedded in Wikipedia content.
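As a warm-up, here is a minimal, self-contained sketch (not part of the original example) of eigenvector centrality computed by power iteration on a tiny hand-made adjacency matrix: the node that receives the most links ends up with the largest component of the principal eigenvector. The graph and the iteration cap are arbitrary illustration choices.

import numpy as np

# Toy directed graph: entry (i, j) = 1 means a link from page i to page j.
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)

# Power iteration: repeatedly multiply by A.T (scores flow along incoming
# links) and renormalize; the vector converges to the principal eigenvector.
v = np.ones(A.shape[0]) / A.shape[0]
for _ in range(100):
    v_new = A.T.dot(v)
    v_new /= np.linalg.norm(v_new)
    if np.allclose(v_new, v):
        break
    v = v_new

print("toy centrality scores:", np.round(v, 3))  # page 2 collects the most links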


Version

In [1]:
import sklearn
sklearn.__version__
Out[1]:
'0.18'

Imports

In [2]:
from __future__ import print_function

from bz2 import BZ2File


import os
from datetime import datetime
from pprint import pprint
from time import time

import numpy as np

from scipy import sparse

from sklearn.decomposition import randomized_svd
from sklearn.externals.joblib import Memory
from sklearn.externals.six.moves.urllib.request import urlopen
from sklearn.externals.six import iteritems


print(__doc__)
Automatically created module for IPython interactive environment

Calculations

Where to download the data, if not already on disk

In [3]:
redirects_url = "http://downloads.dbpedia.org/3.5.1/en/redirects_en.nt.bz2"
redirects_filename = redirects_url.rsplit("/", 1)[1]

page_links_url = "http://downloads.dbpedia.org/3.5.1/en/page_links_en.nt.bz2"
page_links_filename = page_links_url.rsplit("/", 1)[1]

resources = [
    (redirects_url, redirects_filename),
    (page_links_url, page_links_filename),
]

for url, filename in resources:
    if not os.path.exists(filename):
        print("Downloading data from '%s', please wait..." % url)
        opener = urlopen(url)
        with open(filename, 'wb') as f:
            f.write(opener.read())
        print()

Loading the redirect files

In [4]:
memory = Memory(cachedir=".")


def index(redirects, index_map, k):
    """Find the index of an article name after redirect resolution"""
    k = redirects.get(k, k)
    return index_map.setdefault(k, len(index_map))


DBPEDIA_RESOURCE_PREFIX_LEN = len("http://dbpedia.org/resource/")
SHORTNAME_SLICE = slice(DBPEDIA_RESOURCE_PREFIX_LEN + 1, -1)


def short_name(nt_uri):
    """Remove the < and > URI markers and the common URI prefix"""
    return nt_uri[SHORTNAME_SLICE]


def get_redirects(redirects_filename):
    """Parse the redirections and build a transitively closed map out of it"""
    redirects = {}
    print("Parsing the NT redirect file")
    for l, line in enumerate(BZ2File(redirects_filename)):
        split = line.split()
        if len(split) != 4:
            print("ignoring malformed line: " + line)
            continue
        redirects[short_name(split[0])] = short_name(split[2])
        if l % 1000000 == 0:
            print("[%s] line: %08d" % (datetime.now().isoformat(), l))

    # compute the transitive closure
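    # e.g. if A redirects to B and B redirects to C, both A and B end up
    # mapping directly to C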
    print("Computing the transitive closure of the redirect relation")
    for l, source in enumerate(redirects.keys()):
        transitive_target = None
        target = redirects[source]
        seen = set([source])
        while True:
            transitive_target = target
            target = redirects.get(target)
            if target is None or target in seen:
                break
            seen.add(target)
        redirects[source] = transitive_target
        if l % 1000000 == 0:
            print("[%s] line: %08d" % (datetime.now().isoformat(), l))

    return redirects


# disabling joblib as the pickling of large dicts seems much too slow
#@memory.cache
def get_adjacency_matrix(redirects_filename, page_links_filename, limit=None):
    """Extract the adjacency graph as a scipy sparse matrix

    Redirects are resolved first.

    Returns X, the scipy sparse adjacency matrix, redirects as python
    dict from article names to article names and index_map a python dict
    from article names to python int (article indexes).
    """

    print("Computing the redirect map")
    redirects = get_redirects(redirects_filename)

    print("Computing the integer index map")
    index_map = dict()
    links = list()
    for l, line in enumerate(BZ2File(page_links_filename, 'rb')):
        split = line.split()
        if len(split) != 4:
            print("ignoring malformed line: " + line)
            continue
        i = index(redirects, index_map, short_name(split[0]))
        j = index(redirects, index_map, short_name(split[2]))
        links.append((i, j))
        if l % 1000000 == 0:
            print("[%s] line: %08d" % (datetime.now().isoformat(), l))

        if limit is not None and l >= limit - 1:
            break

    print("Computing the adjacency matrix")
    X = sparse.lil_matrix((len(index_map), len(index_map)), dtype=np.float32)
    for i, j in links:
        X[i, j] = 1.0
    del links
    print("Converting to CSR representation")
    X = X.tocsr()
    print("CSR conversion done")
    return X, redirects, index_map


# stop after 5M links to make it possible to work in RAM
X, redirects, index_map = get_adjacency_matrix(
    redirects_filename, page_links_filename, limit=5000000)
names = dict((i, name) for name, i in iteritems(index_map))

print("Computing the principal singular vectors using randomized_svd")
t0 = time()
U, s, V = randomized_svd(X, 5, n_iter=3)
print("done in %0.3fs" % (time() - t0))

# print the names of the Wikipedia pages with the strongest components in the
# principal singular vectors, which should be similar to the principal eigenvector
print("Top wikipedia pages according to principal singular vectors")
pprint([names[i] for i in np.abs(U.T[0]).argsort()[-10:]])
pprint([names[i] for i in np.abs(V[0]).argsort()[-10:]])


def centrality_scores(X, alpha=0.85, max_iter=100, tol=1e-10):
    """Power iteration computation of the principal eigenvector

    This method is also known as Google PageRank and the implementation
    is based on the one from the NetworkX project (BSD licensed too)
    with copyrights by:

      Aric Hagberg <hagberg@lanl.gov>
      Dan Schult <dschult@colgate.edu>
      Pieter Swart <swart@lanl.gov>
    """
    n = X.shape[0]
    X = X.copy()
    out_degree = np.asarray(X.sum(axis=1)).ravel()

    print("Normalizing the graph")
    for i in out_degree.nonzero()[0]:
        X.data[X.indptr[i]:X.indptr[i + 1]] *= 1.0 / out_degree[i]
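    # pages without outgoing links ("dangling" nodes) get a uniform 1 / n entry
    # so that their score is redistributed over the whole graph at each step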
    dangle = np.asarray(np.where(X.sum(axis=1) == 0, 1.0 / n, 0)).ravel()

    scores = np.ones(n, dtype=np.float32) / n  # initial guess
    for i in range(max_iter):
        print("power iteration #%d" % i)
        prev_scores = scores
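        # damped PageRank update: with probability alpha follow a link (or the
        # dangling redistribution), otherwise jump to a uniformly random page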
        scores = (alpha * (scores * X + np.dot(dangle, prev_scores))
                  + (1 - alpha) * prev_scores.sum() / n)
        # check convergence: normalized l_inf norm
        scores_max = np.abs(scores).max()
        if scores_max == 0.0:
            scores_max = 1.0
        err = np.abs(scores - prev_scores).max() / scores_max
        print("error: %0.6f" % err)
        if err < n * tol:
            return scores

    return scores

print("Computing principal eigenvector score using a power iteration method")
t0 = time()
scores = centrality_scores(X, max_iter=100, tol=1e-10)
print("done in %0.3fs" % (time() - t0))
pprint([names[i] for i in np.abs(scores).argsort()[-10:]])
Computing the redirect map
Parsing the NT redirect file
[2016-11-03T22:06:59.153450] line: 00000000
[2016-11-03T22:07:08.892231] line: 01000000
[2016-11-03T22:07:19.669552] line: 02000000
[2016-11-03T22:07:27.035018] line: 03000000
[2016-11-03T22:07:33.749849] line: 04000000
Computing the transitive closure of the redirect relation
[2016-11-03T22:07:34.456301] line: 00000000
[2016-11-03T22:07:35.584369] line: 01000000
[2016-11-03T22:07:36.678192] line: 02000000
[2016-11-03T22:07:37.803847] line: 03000000
[2016-11-03T22:07:38.943277] line: 04000000
Computing the integer index map
[2016-11-03T22:07:39.158285] line: 00000000
[2016-11-03T22:07:47.741059] line: 01000000
[2016-11-03T22:07:54.770050] line: 02000000
[2016-11-03T22:08:01.567734] line: 03000000
[2016-11-03T22:08:08.569797] line: 04000000
Computing the adjacency matrix
Converting to CSR representation
CSR conversion done
Computing the principal singular vectors using randomized_svd
done in 6.959s
Top wikipedia pages according to principal singular vectors
['1989',
 '1971',
 '1975',
 '1970',
 '2006',
 '1972',
 '1996',
 '1966',
 '1967',
 '2007']
['Soviet_Union',
 'Spain',
 'Italy',
 'Canada',
 'Japan',
 'Germany',
 'World_War_II',
 'France',
 'United_Kingdom',
 'United_States']
Computing principal eigenvector score using a power iteration method
Normalizing the graph
power iteration #0
error: 0.988448
power iteration #1
error: 0.495765
power iteration #2
error: 0.304177
power iteration #3
error: 0.206610
power iteration #4
error: 0.149243
power iteration #5
error: 0.112218
power iteration #6
error: 0.086735
power iteration #7
error: 0.068370
power iteration #8
error: 0.054681
power iteration #9
error: 0.044216
power iteration #10
error: 0.036059
power iteration #11
error: 0.029603
power iteration #12
error: 0.024433
power iteration #13
error: 0.020252
power iteration #14
error: 0.016845
power iteration #15
error: 0.014051
power iteration #16
error: 0.011748
power iteration #17
error: 0.009841
power iteration #18
error: 0.008257
power iteration #19
error: 0.006937
power iteration #20
error: 0.005834
power iteration #21
error: 0.004912
power iteration #22
error: 0.004138
power iteration #23
error: 0.003489
power iteration #24
error: 0.002943
power iteration #25
error: 0.002484
power iteration #26
error: 0.002097
power iteration #27
error: 0.001771
power iteration #28
error: 0.001496
power iteration #29
error: 0.001264
power iteration #30
error: 0.001068
power iteration #31
error: 0.000903
power iteration #32
error: 0.000763
power iteration #33
error: 0.000645
power iteration #34
error: 0.000546
power iteration #35
error: 0.000461
power iteration #36
error: 0.000390
power iteration #37
error: 0.000330
power iteration #38
error: 0.000279
power iteration #39
error: 0.000236
power iteration #40
error: 0.000200
power iteration #41
error: 0.000169
power iteration #42
error: 0.000143
power iteration #43
error: 0.000121
power iteration #44
error: 0.000102
done in 1.726s
['Africa',
 'Animal',
 'New_York',
 'Jews',
 'Philosophy',
 'Byzantine_Empire',
 'New_York_City',
 'World_War_I',
 'France',
 'United_States']
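
As a small, hypothetical follow-up (not part of the original example), the power iteration score of an individual article can be looked up through the index_map built above, resolving a possible redirect first just like the index() helper does; 'United_States' is only an illustrative key.

# hypothetical follow-up: look up the centrality score of a single article
article = 'United_States'
resolved = redirects.get(article, article)  # resolve a possible redirect first
idx = index_map.get(resolved)
if idx is not None:
    print("centrality score of %s: %g" % (article, scores[idx]))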

License

Author:

    Olivier Grisel <olivier.grisel@ensta.org>

License:

    BSD 3 clause