Algorithms for Large Data Sets
Ziv Bar-Yossef
Lecture 5, April 23, 2006
http://www.ee.technion.ac.il/courses/049011

Ranking Algorithms
PageRank [Page, Brin, Motwani, Winograd 1998] • Motivating principles • Rank of p should be proportional to the rank of the pages that point to p • Recommendations from Bill Gates & Steve Jobs vs. from Moishale and Ahuva • Rank of p should depend on the number of pages “co-cited” with p • Compare: Bill Gates recommends only me vs. Bill Gates recommends everyone on earth
PageRank, Attempt #1 • Basic relation: rank flows along links, r(p) = Σ_{q → p} r(q)/outdeg(q) • Additional conditions: • r is non-negative: r ≥ 0 • r is normalized: ||r||_1 = 1 • B = normalized adjacency matrix: B_{p,q} = 1/outdeg(p) if p → q, and 0 otherwise • Then: r^T B = r^T, i.e., r is a non-negative normalized left eigenvector of B with eigenvalue 1
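A minimal numpy sketch of this condition on a toy four-page graph; the graph, the variable names, and the use of a dense eigendecomposition are illustrative assumptions, not part of the course material:

```python
import numpy as np

# A toy 4-page web graph; A[p, q] = 1 iff page p links to page q.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 0, 1, 0]], dtype=float)

# B = row-normalized adjacency matrix: B[p, q] = 1/outdeg(p) if p -> q, else 0.
outdeg = A.sum(axis=1, keepdims=True)
B = np.divide(A, outdeg, out=np.zeros_like(A), where=outdeg > 0)

# Attempt #1 asks for r >= 0 with ||r||_1 = 1 and r^T B = r^T,
# i.e. a left eigenvector of B with eigenvalue 1 (an eigenvector of B^T).
vals, vecs = np.linalg.eig(B.T)
i = np.argmax(vals.real)
r = np.abs(vecs[:, i].real)
print(np.round(vals.real, 3), np.round(r / r.sum(), 3))
```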
PageRank, Attempt #1 • A solution exists only if B has eigenvalue 1 • Problem: B may not have 1 as an eigenvalue, because some of its rows may be all zeros (pages with no outlinks) • Example: a two-page graph where page 1 links to page 2 and page 2 has no outlinks; then B = [[0,1],[0,0]], whose only eigenvalue is 0
PageRank, Attempt #2 • Relax the relation by a normalization constant α: r(p) = α · Σ_{q → p} r(q)/outdeg(q) • Then r^T B = (1/α)·r^T, i.e., r is a non-negative normalized left eigenvector of B with eigenvalue 1/α
PageRank, Attempt #2 • Any nonzero eigenvalue λ of B may give a solution • α = 1/λ • r = any non-negative normalized left eigenvector of B with eigenvalue λ • Which solution to pick? • Pick a "principal eigenvector" (i.e., one corresponding to the maximal λ) • How to find a solution? • Power iterations
PageRank, Attempt #2 • Problem #1: The maximal eigenvalue may have multiplicity > 1 • Several possible solutions • Happens, for example, when the graph is disconnected • Problem #2: Rank accumulates at sinks • Only sinks, or nodes from which a sink cannot be reached, can have nonzero rank mass
PageRank, Final Definition • Add a "rank source" to the relation: r(p) = α·(Σ_{q → p} r(q)/outdeg(q) + e(p)) • e = "rank source" vector • Standard setting: e(p) = ε/n for all p (ε < 1) • 1 = the all-1's vector • Then: r^T(B + 1e^T) = (1/α)·r^T, i.e., r is a non-negative normalized left eigenvector of (B + 1e^T) with eigenvalue 1/α
PageRank, Final Definition • Any nonzero eigenvalue of (B + 1e^T) may give a solution • Pick r to be a principal left eigenvector of (B + 1e^T) • Will show: • The principal eigenvalue has multiplicity 1, for any graph • There exists a non-negative left eigenvector • Hence, PageRank always exists and is uniquely defined • Due to the rank source vector, rank no longer accumulates at sinks
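The slides propose power iterations for finding this principal eigenvector; here is a hedged numpy sketch of that idea for the final definition. The function name `pagerank`, the value ε = 0.15, and the toy graph are illustrative choices, not the course's code:

```python
import numpy as np

def pagerank(A, eps=0.15, iters=100):
    """Power iteration for the principal left eigenvector of (B + 1 e^T),
    where B is the row-normalized adjacency matrix and e = (eps/n) * 1
    is the rank-source vector."""
    n = A.shape[0]
    outdeg = A.sum(axis=1, keepdims=True)
    B = np.divide(A, outdeg, out=np.zeros_like(A), where=outdeg > 0)
    e = np.full(n, eps / n)
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        # (B + 1 e^T)^T r = B^T r + e * (1^T r) = B^T r + e, since ||r||_1 = 1
        r = B.T @ r + e
        r /= r.sum()          # renormalize so that ||r||_1 = 1
    return r

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 0, 1, 0]], dtype=float)
print(np.round(pagerank(A), 3))
```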
An Alternative View of PageRank: The Random Surfer Model • When visiting a page p, a "random surfer": • With probability 1 − d, selects a random outlink p → q and goes to visit q ("focused browsing") • With probability d, jumps to a random web page q ("loss of interest") • If p has no outlinks, assume it has a self loop • P: probability transition matrix: P_{p,q} = (1 − d)·B_{p,q} + d/n (with the self-loop rows used for sinks)
PageRank & Random Surfer Model • When the rank source e is matched to the random jump (e proportional to (d/n)·1), r is a principal left eigenvector of (B + 1e^T) if and only if it is a principal left eigenvector of P
PageRank & Markov Chains • The PageRank vector is the normalized principal left eigenvector of (B + 1e^T) • Hence, the PageRank vector is also a principal left eigenvector of P • Conclusion: PageRank is the unique stationary distribution of the random-surfer Markov chain • PageRank(p) = r(p) = the probability that the random surfer visits page p, in the limit • Note: the "random jump" guarantees that the Markov chain is ergodic
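To illustrate the Markov-chain view, here is a toy Monte Carlo sketch of the random surfer; the helper name `random_surfer`, the parameter values, and the toy graph are invented for illustration, and the empirical visit frequencies should only approximate the stationary distribution (PageRank):

```python
import numpy as np

def random_surfer(A, d=0.15, steps=100_000, seed=0):
    """Simulate the random surfer and return empirical visit frequencies."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    visits = np.zeros(n)
    p = rng.integers(n)
    for _ in range(steps):
        visits[p] += 1
        if rng.random() < d:
            p = rng.integers(n)                       # "loss of interest": jump anywhere
        else:
            out = np.flatnonzero(A[p])
            p = rng.choice(out) if out.size else p    # sink: self-loop, as on the slide
    return visits / steps

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 0, 1, 0]], dtype=float)
print(np.round(random_surfer(A), 3))   # should be close to the power-iteration PageRank above
```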
HITS: Hubs and Authorities [Kleinberg, 1997] • HITS: Hyperlink Induced Topic Search • Main principle: every page p is associated with two scores: • Authority score: how “authoritative” a page is about the query’s topic • Ex: query: “IR”; authorities: scientific IR papers • Ex: query: “automobile manufacturers”; authorities: Mazda, Toyota, and GM web sites • Hub score: how good the page is as a “resource list” about the query’s topic • Ex: query: “IR”; hubs: surveys and books about IR • Ex: query: “automobile manufacturers”; hubs: KBB, car link lists
Mutual Reinforcement HITS principles: • p is a good authority if it is linked to by many good hubs. • p is a good hub if it points to many good authorities.
HITS: Algebraic Form • a: authority vector • h: hub vector • A: adjacency matrix • Then: a ∝ A^T·h and h ∝ A·a • Therefore: a ∝ A^T A·a and h ∝ A A^T·h • a is a principal eigenvector of A^T A • h is a principal eigenvector of A A^T
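A short numpy sketch of the mutual-reinforcement iterations, assuming the same toy adjacency matrix as before; the function name `hits` and the iteration count are illustrative:

```python
import numpy as np

def hits(A, iters=50):
    """Iterate a <- A^T h, h <- A a with L2 normalization; a and h converge to
    principal eigenvectors of A^T A and A A^T respectively."""
    n = A.shape[0]
    a = np.ones(n)
    h = np.ones(n)
    for _ in range(iters):
        a = A.T @ h
        h = A @ a
        a /= np.linalg.norm(a)
        h /= np.linalg.norm(h)
    return a, h

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 0, 1, 0]], dtype=float)
a, h = hits(A)
print(np.round(a, 3), np.round(h, 3))
```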
Co-Citation and Bibliographic Coupling • A^T A: co-citation matrix • (A^T A)_{p,q} = # of pages that link to both p and q • Thus: authority scores propagate through co-citation • A A^T: bibliographic coupling matrix • (A A^T)_{p,q} = # of pages that both p and q link to • Thus: hub scores propagate through bibliographic coupling
Principal Eigenvector Computation • E: n × n matrix • |λ_1| > |λ_2| ≥ |λ_3| ≥ … ≥ |λ_n|: eigenvalues of E • Suppose λ_1 > 0 • v_1,…,v_n: corresponding eigenvectors • The eigenvectors form an orthonormal basis • Input: • The matrix E • A unit vector u, which is not orthogonal to v_1 • Goal: compute λ_1 and v_1
Why Does It Work? • Theorem: As t → ∞, the power iterates w_t = E^t·u / ||E^t·u|| converge to c·v_1 (c is a constant) • Convergence rate: proportional to (|λ_2|/|λ_1|)^t • The larger the "spectral gap" |λ_1| − |λ_2|, the faster the convergence
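A minimal power-iteration sketch, reusing the toy graph from the earlier examples; `power_iteration`, the iteration count, and the Rayleigh-quotient estimate are illustrative choices:

```python
import numpy as np

def power_iteration(E, u, iters=100):
    """Power iteration: w_t = E^t u / ||E^t u||.  If u is not orthogonal to v1,
    w_t converges to (a sign multiple of) the principal eigenvector v1,
    at a rate governed by |lambda2 / lambda1|."""
    w = u / np.linalg.norm(u)
    for _ in range(iters):
        w = E @ w
        w /= np.linalg.norm(w)
    lam1 = w @ (E @ w)   # once w ~ v1, this Rayleigh quotient ~ lambda1
    return lam1, w

# Example: principal eigenvector of A^T A for the toy graph used above.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 0, 1, 0]], dtype=float)
lam1, v1 = power_iteration(A.T @ A, np.ones(4))
print(round(lam1, 3), np.round(v1, 3))
```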
Outline • Motivation: synonymy and polysemy • Latent Semantic Indexing (LSI) • Singular Value Decomposition (SVD) • LSI via SVD • Why does LSI work? • HITS and SVD
Synonymy and Polysemy • Synonymy: multiple terms with (almost) the same meaning • Ex: cars, autos, vehicles • Harms recall • Polysemy: a term with multiple meanings • Ex: java (programming language, coffee, island) • Harms precision
Traditional Solutions • Query expansion • Synonymy: OR on all synonyms • Manual/automatic use of thesauri • Too few synonyms: recall still low • Too many synonyms: harms precision • Polysemy: AND on term and additional specializing terms • Ex: +java +”programming language” • Too broad terms: precision still low • Too narrow terms: harms recall
Syntactic Space • D: document collection, |D| = n • T: term space, |T| = m • A: m × n term-document matrix (rows: terms, columns: documents) • A_{t,d}: "weight" of t in d (e.g., TF-IDF) • A^T A: pairwise document similarities • A A^T: pairwise term similarities
Syntactic Indexing • Index keys: terms • Limitations • Synonymy: (near)-identical rows • Polysemy • Space inefficiency: the matrix is usually not full rank • Gap between syntax and semantics: the information need is semantic, but the index and query are syntactic
Semantic Space • C: concept space, |C| = r • B: r × n concept-document matrix (rows: concepts, columns: documents) • B_{c,d}: "weight" of c in d • Change of basis • Compare to wavelet and Fourier transforms
Latent Semantic Indexing (LSI)[Deerwester et al. 1990] • Index keys: concepts • Documents & query: mixtures of concepts • Given a query, finds the most similar documents • Bridges the syntax-semantics gap • Space-efficient • Concepts are orthogonal • Matrix is full rank • Questions • What is the concept space? • What is the transformation from the syntax space to the semantic space? • How to filter “noise concepts”?
Singular Values • A: m × n real matrix • Definition: σ ≥ 0 is a singular value of A if there exists a pair of vectors u,v s.t. Av = σu and A^T u = σv; u and v are called singular vectors • Ex: σ = ||A||_2 = max_{||x||_2 = 1} ||Ax||_2 • Corresponding singular vectors: v = the x that maximizes ||Ax||_2, and u = Av / ||A||_2 • Note: A^T A·v = σ²·v and A A^T·u = σ²·u • σ² is an eigenvalue of A^T A and of A A^T • v is an eigenvector of A^T A • u is an eigenvector of A A^T
Singular Value Decomposition (SVD) • Theorem: For every m × n real matrix A, there exists a singular value decomposition A = U Σ V^T • σ_1 ≥ … ≥ σ_r > 0 (r = rank(A)): the singular values of A • Σ = Diag(σ_1,…,σ_r) • U: column-orthonormal m × r matrix (U^T U = I) • V: column-orthonormal n × r matrix (V^T V = I)
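A quick numpy illustration of the theorem on a toy 5-term × 4-document matrix; the matrix itself is invented for illustration:

```python
import numpy as np

# Toy 5-term x 4-document matrix (rows: terms, columns: documents).
A = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 1, 0],
              [0, 0, 0, 2],
              [0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U diag(s) Vt
r = np.linalg.matrix_rank(A)
U, s, Vt = U[:, :r], s[:r], Vt[:r, :]              # keep only the r nonzero singular values

print(np.allclose(A, U @ np.diag(s) @ Vt))         # A = U Sigma V^T
print(np.allclose(U.T @ U, np.eye(r)))             # U^T U = I
print(np.allclose(Vt @ Vt.T, np.eye(r)))           # V^T V = I
```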
Singular Values vs. Eigenvalues • A = U Σ V^T • σ_1,…,σ_r: singular values of A • σ_1²,…,σ_r²: the non-zero eigenvalues of A^T A and of A A^T • u_1,…,u_r: columns of U • Orthonormal basis for span(columns of A) • Left singular vectors of A • Eigenvectors of A A^T • v_1,…,v_r: columns of V • Orthonormal basis for span(rows of A) • Right singular vectors of A • Eigenvectors of A^T A
LSI as SVD • A = U Σ V^T, hence U^T A = Σ V^T • u_1,…,u_r: concept basis • B = Σ V^T: the LSI matrix • A_d: d-th column of A • B_d: d-th column of B • B_d = U^T A_d • B_d[c] = u_c^T A_d
Noisy Concepts • B = U^T A = Σ V^T • B_d[c] = σ_c·v_c[d] • If σ_c is small, then B_d[c] is small for every d • k = the largest i s.t. σ_i is "large" • For all c = k+1,…,r and for all d, c is a low-weight concept in d • Main idea: filter out all concepts c = k+1,…,r • Space efficient: # of index terms = k (vs. r or m) • Better retrieval: noisy concepts are filtered out across the board
Low-rank SVD • B = U^T A = Σ V^T • U_k = (u_1,…,u_k) • V_k = (v_1,…,v_k) • Σ_k = upper-left k × k sub-matrix of Σ • A_k = U_k Σ_k V_k^T • B_k = Σ_k V_k^T • rank(A_k) = rank(B_k) = k
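A sketch of rank-k LSI and concept-space retrieval under these definitions; the helper names `lsi` and `retrieve`, the cosine-similarity ranking, the query, and the toy data are illustrative assumptions, not the course's prescription:

```python
import numpy as np

def lsi(A, k):
    """Rank-k LSI: returns U_k, Sigma_k, V_k^T and the k x n LSI matrix
    B_k = Sigma_k V_k^T (documents as mixtures of the top-k concepts)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]
    Bk = np.diag(sk) @ Vtk
    return Uk, sk, Vtk, Bk

def retrieve(q, Uk, Bk):
    """Project a term-space query q into concept space (q_k = U_k^T q) and
    rank documents by cosine similarity against the columns of B_k."""
    qk = Uk.T @ q
    sims = (qk @ Bk) / (np.linalg.norm(qk) * np.linalg.norm(Bk, axis=0) + 1e-12)
    return np.argsort(-sims)

A = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 1, 0],
              [0, 0, 0, 2],
              [0, 0, 1, 1]], dtype=float)
Uk, sk, Vtk, Bk = lsi(A, k=2)
q = np.array([1, 1, 0, 0, 0], dtype=float)   # query over the first two terms
print(retrieve(q, Uk, Bk))                   # documents ranked by concept-space similarity
```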
Low Dimensional Embedding • Frobenius norm: ||A||_F = (Σ_{t,d} A_{t,d}²)^{1/2} • Fact: ||A − A_k||_F² = σ_{k+1}² + … + σ_r² • Therefore, if σ_{k+1}² + … + σ_r² is small, then for "most" pairs d,d′, A_d·A_{d′} ≈ (A_k)_d·(A_k)_{d′} • A_k preserves pairwise similarities among documents at least as well as A does for retrieval
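A short numerical check of this fact on the same toy matrix (illustrative only; the choice k = 2 is arbitrary):

```python
import numpy as np

A = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 1, 0],
              [0, 0, 0, 2],
              [0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# ||A - A_k||_F^2 equals the sum of the discarded squared singular values.
print(np.isclose(np.linalg.norm(A - Ak, 'fro') ** 2, np.sum(s[k:] ** 2)))  # True
```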
Computing SVD • Compute the singular values of A by computing the eigenvalues of A^T A • Compute U, V by computing the eigenvectors of A^T A and A A^T • Running time not too good: O(m²n + mn²) • Not practical for huge corpora • Sub-linear time algorithms for estimating A_k [Frieze, Kannan, Vempala 1998]
HITS and SVD • A: adjacency matrix of a web (sub-)graph G • a: authority vector • h: hub vector • a is a principal eigenvector of A^T A • h is a principal eigenvector of A A^T • Therefore: a and h give A_1, the rank-1 SVD of A • Generalization: using A_k, we can get k authority and hub vectors, corresponding to other topics in G
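A small sketch checking this correspondence numerically on the toy graph from before; computing the HITS vectors via `numpy.linalg.eigh` and comparing them to the top singular vectors is an illustrative choice:

```python
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 0, 1, 0]], dtype=float)

# HITS scores: principal eigenvectors of A^T A (authorities) and A A^T (hubs).
_, va = np.linalg.eigh(A.T @ A)
_, vh = np.linalg.eigh(A @ A.T)
a, h = np.abs(va[:, -1]), np.abs(vh[:, -1])     # eigh sorts eigenvalues ascending

# Rank-1 SVD of A: its top right/left singular vectors carry the same scores.
U, s, Vt = np.linalg.svd(A)
print(np.allclose(a, np.abs(Vt[0])), np.allclose(h, np.abs(U[:, 0])))   # True True
```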
Why is LSI Better? [Papadimitriou et al. 1998] [Azar et al. 2001] • LSI summary • Documents are embedded in a low-dimensional space (m → k) • Pairwise similarities are preserved • More space-efficient • But why is retrieval better? • Synonymy • Polysemy
Generative Model • A corpus model M = (T, C, W, D) • T: term space, |T| = m • C: concept space, |C| = k • Concept: a distribution over terms • W: topic space • Topic: a distribution over concepts • D: document distribution • A distribution over W × N • A document d is generated as follows: • Sample a topic w and a length n according to D • Repeat n times: • Sample a concept c from C according to w • Sample a term t from T according to c
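A toy sketch of this generative process; the specific terms, concepts, topics, and distributions below are invented purely for illustration and are not part of the model's definition:

```python
import numpy as np

def sample_document(topics, concepts, terms, doc_dist, rng):
    """Sample one document from a corpus model M = (T, C, W, D):
    pick (topic, length) from D, then for each position pick a concept
    from the topic and a term from that concept."""
    w, n = doc_dist(rng)                               # topic index w, document length n
    doc = []
    for _ in range(n):
        c = rng.choice(len(concepts), p=topics[w])     # concept ~ topic w
        t = rng.choice(len(terms), p=concepts[c])      # term ~ concept c
        doc.append(terms[t])
    return doc

# Illustrative toy model: 2 concepts over 4 terms, 2 (pure) topics.
terms = ["car", "auto", "java", "coffee"]
concepts = np.array([[0.5, 0.5, 0.0, 0.0],     # "vehicles" concept
                     [0.0, 0.0, 0.6, 0.4]])    # "java" concept
topics = np.array([[1.0, 0.0],                 # topic 0: pure concept 0
                   [0.0, 1.0]])                # topic 1: pure concept 1
doc_dist = lambda rng: (rng.integers(2), rng.poisson(8) + 1)

rng = np.random.default_rng(0)
print(sample_document(topics, concepts, terms, doc_dist, rng))
```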
Simplifying Assumptions • Every document has a single topic (W = C) • For every two concepts c,c′, ||c − c′|| ≥ 1 − ε • The probability of every term under a concept c is at most some small constant
LSI Works • A: m × n term-document matrix, representing n documents generated according to the model • Theorem [Papadimitriou et al. 1998]: With high probability, for every two documents d,d′ (A^k_d = the d-th column of A_k): • If topic(d) = topic(d′), then A^k_d and A^k_{d′} are (almost) parallel • If topic(d) ≠ topic(d′), then A^k_d and A^k_{d′} are (almost) orthogonal
Proof • For simplicity, assume ε = 0 • Want to show: • If topic(d) = topic(d′), then A^k_d ∥ A^k_{d′} • If topic(d) ≠ topic(d′), then A^k_d ⊥ A^k_{d′} • D_c: documents whose topic is the concept c • T_c: terms in supp(c) • Since ||c − c′|| = 1, T_c ∩ T_{c′} = Ø • A has non-zeroes only in blocks B_1,…,B_k, where B_c is the sub-matrix of A with rows in T_c and columns in D_c • A^T A is a block-diagonal matrix with blocks B_1^T B_1,…,B_k^T B_k • (i,j)-th entry of B_c^T B_c: similarity (shared-term weight) between the i-th and j-th documents whose topic is the concept c • B_c^T B_c: adjacency matrix of a (multi-)graph G_c on D_c, obtained from the bipartite term-document graph of block c
Proof (cont.) • G_c is a "random" graph • The first and second eigenvalues of B_c^T B_c are well separated • For all c,c′, the second eigenvalue of B_c^T B_c is smaller than the first eigenvalue of B_{c′}^T B_{c′} • Hence, the top k eigenvalues of A^T A are the principal eigenvalues of B_c^T B_c for c = 1,…,k • Let u_1,…,u_k be the corresponding left singular vectors of A (u_c is supported on T_c) • For every document d on topic c, A_d is orthogonal to all of u_1,…,u_k except u_c • Hence A^k_d is a scalar multiple of u_c
Extensions [Azar et al. 2001] • A more general generative model • Also explains the improved treatment of polysemy