Algorithms for Large Data Sets

Algorithms for Large Data Sets. Ziv Bar-Yossef. Lecture 5 April 23, 2006. http://www.ee.technion.ac.il/courses/049011. Ranking Algorithms. PageRank [Page, Brin, Motwani, Winograd 1998]. Motivating principles Rank of p should be proportional to the rank of the pages that point to p


Presentation Transcript


  1. Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 5 April 23, 2006 http://www.ee.technion.ac.il/courses/049011

  2. Ranking Algorithms

  3. PageRank [Page, Brin, Motwani, Winograd 1998] • Motivating principles • Rank of p should be proportional to the rank of the pages that point to p • Recommendations from Bill Gates & Steve Jobs vs. from Moishale and Ahuva • Rank of p should depend on the number of pages “co-cited” with p • Compare: Bill Gates recommends only me vs. Bill Gates recommends everyone on earth

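  4. PageRank, Attempt #1 • $r(p) = \sum_{q \to p} r(q)/\mathrm{outdeg}(q)$ • Additional conditions: • r is non-negative: $r \ge 0$ • r is normalized: $\|r\|_1 = 1$ • B = normalized adjacency matrix: $B_{q,p} = 1/\mathrm{outdeg}(q)$ if $q \to p$, and 0 otherwise • Then: $r^T B = r^T$ • r is a non-negative normalized left eigenvector of B with eigenvalue 1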

  5. PageRank, Attempt #1 • Solution exists only if B has eigenvalue 1 • Problem: B may not have 1 as an eigenvalue • Because some of its rows are 0. • Example:
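
The failure described above can be checked numerically. The following is a minimal sketch, assuming a hypothetical three-page chain 0 → 1 → 2 in which page 2 is a sink; the graph and the variable names are illustrative and do not come from the lecture.

```python
import numpy as np

# Hypothetical 3-page chain: 0 -> 1 -> 2, where page 2 has no outlinks (a sink).
A = np.array([[0, 1, 0],
              [0, 0, 1],
              [0, 0, 0]], dtype=float)

outdeg = A.sum(axis=1)
# Normalized adjacency matrix: divide each row by its out-degree,
# leaving the rows of sinks as all zeros.
B = np.divide(A, outdeg[:, None], out=np.zeros_like(A), where=outdeg[:, None] > 0)

eigenvalues = np.linalg.eigvals(B)
has_eigenvalue_one = bool(np.any(np.isclose(eigenvalues, 1.0)))
print(B)
print(has_eigenvalue_one)
```

Here B is strictly upper triangular, so all of its eigenvalues are 0 and no vector r can satisfy r^T B = r^T with ||r||_1 = 1.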

  6. PageRank, Attempt #2 • $r(p) = \alpha \sum_{q \to p} r(q)/\mathrm{outdeg}(q)$ • $\alpha$ = normalization constant • r is a non-negative normalized left eigenvector of B with eigenvalue $1/\alpha$

  7. PageRank, Attempt #2 • Any nonzero eigenvalue $\lambda$ of B may give a solution • $\alpha = 1/\lambda$ • r = any non-negative normalized left eigenvector of B with eigenvalue $\lambda$ • Which solution to pick? • Pick a “principal eigenvector” (i.e., corresponding to maximal $\lambda$) • How to find a solution? • Power iterations

  8. PageRank, Attempt #2 • Problem #1: Maximal eigenvalue may have multiplicity > 1 • Several possible solutions • Happens, for example, when graph is disconnected • Problem #2: Rank accumulates at sinks. • Only sinks or nodes, from which a sink cannot be reached, can have nonzero rank mass.

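  9. PageRank, Final Definition • $r(p) = \alpha \left( \sum_{q \to p} r(q)/\mathrm{outdeg}(q) + e(p) \right)$ • e = “rank source” vector • Standard setting: $e(p) = \epsilon/n$ for all p ($\epsilon < 1$) • $\mathbf{1}$ = the all-ones vector • Then: $r^T = \alpha\, r^T (B + \mathbf{1}e^T)$, since $r^T\mathbf{1} = \|r\|_1 = 1$ • r is a non-negative normalized left eigenvector of $(B + \mathbf{1}e^T)$ with eigenvalue $1/\alpha$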

  10. PageRank, Final Definition • Any nonzero eigenvalue of $(B + \mathbf{1}e^T)$ may give a solution • Pick r to be a principal left eigenvector of $(B + \mathbf{1}e^T)$ • Will show: • Principal eigenvalue has multiplicity 1, for any graph • There exists a non-negative left eigenvector • Hence, PageRank always exists and is uniquely defined • Due to the rank source vector, rank no longer accumulates at sinks

  11. An Alternative View of PageRank: The Random Surfer Model • When visiting a page p, a “random surfer”: • With probability 1 − d, selects a random outlink p → q and goes to visit q (“focused browsing”) • With probability d, jumps to a random web page q (“loss of interest”) • If p has no outlinks, assume it has a self loop. • P: probability transition matrix: $P = (1-d)\,B + \frac{d}{n}\,\mathbf{1}\mathbf{1}^T$

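  12. PageRank & Random Surfer Model • Suppose: $e = \frac{\epsilon}{n}\mathbf{1}$ with $\epsilon = d/(1-d)$ • Then: $P = (1-d)\,B + \frac{d}{n}\mathbf{1}\mathbf{1}^T = (1-d)\,(B + \mathbf{1}e^T)$ • Therefore, r is a principal left eigenvector of $(B + \mathbf{1}e^T)$ if and only if it is a principal left eigenvector of P.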

  13. PageRank & Markov Chains • PageRank vector is the normalized principal left eigenvector of $(B + \mathbf{1}e^T)$. • Hence, PageRank vector is also a principal left eigenvector of P • Conclusion: PageRank is the unique stationary distribution of the random surfer Markov chain. • PageRank(p) = r(p) = probability of the random surfer visiting page p in the limit. • Note: “Random jump” guarantees that the Markov chain is ergodic.
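
The stationary-distribution view suggests a direct way to compute PageRank. The sketch below is one possible implementation, not the lecture's own code: it assumes the transition matrix P = (1 − d)·B + (d/n)·11^T, with sinks given self loops as in the random surfer model, and iterates r ← r^T P from the uniform distribution. The function name `pagerank`, the default d = 0.15, and the example graph are all hypothetical.

```python
import numpy as np

def pagerank(A, d=0.15, tol=1e-12, max_iter=1000):
    """Approximate the stationary distribution of the random-surfer chain."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    # Give every sink a self loop, as the random surfer model assumes.
    for p in range(n):
        if A[p].sum() == 0:
            A[p, p] = 1.0
    B = A / A.sum(axis=1, keepdims=True)        # row-stochastic link matrix
    P = (1 - d) * B + d * np.ones((n, n)) / n   # follow a link w.p. 1-d, jump w.p. d
    r = np.full(n, 1.0 / n)                     # start from the uniform distribution
    for _ in range(max_iter):
        r_next = r @ P                          # left multiplication: r^T P
        if np.abs(r_next - r).sum() < tol:
            break
        r = r_next
    return r_next, P

# Hypothetical 3-page graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
A = [[0, 1, 1],
     [0, 0, 1],
     [1, 0, 0]]
r, P = pagerank(A)
print(r)
```

Note that d here is the jump probability, following the slide; many other presentations use the symbol for the probability of following a link instead.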

  14. HITS: Hubs and Authorities [Kleinberg, 1997] • HITS: Hyperlink Induced Topic Search • Main principle: every page p is associated with two scores: • Authority score: how “authoritative” a page is about the query’s topic • Ex: query: “IR”; authorities: scientific IR papers • Ex: query: “automobile manufacturers”; authorities: Mazda, Toyota, and GM web sites • Hub score: how good the page is as a “resource list” about the query’s topic • Ex: query: “IR”; hubs: surveys and books about IR • Ex: query: “automobile manufacturers”; hubs: KBB, car link lists

  15. Mutual Reinforcement HITS principles: • p is a good authority, if it is linked by many good hubs. • p is a good hub, if it points to many good authorities.

  16. HITS: Algebraic Form • a: authority vector • h: hub vector • A: adjacency matrix • Then: $a \propto A^T h$ and $h \propto A\,a$ • Therefore: $a \propto A^TA\,a$ and $h \propto AA^T h$ • a is a principal eigenvector of $A^TA$ • h is a principal eigenvector of $AA^T$

  17. Co-Citation and Bibliographic Coupling • $A^TA$: co-citation matrix • $(A^TA)_{p,q}$ = # of pages that link both to p and to q. • Thus: authority scores propagate through co-citation. • $AA^T$: bibliographic coupling matrix • $(AA^T)_{p,q}$ = # of pages that both p and q link to. • Thus: hub scores propagate through bibliographic coupling.
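
The mutual reinforcement principle and the two matrices above can be illustrated on a toy graph. This is a sketch under stated assumptions: the update rule a ← A^T h, h ← A a with L2 normalization is a standard way to iterate HITS, and the five-page graph, the function name `hits`, and the iteration count are hypothetical.

```python
import numpy as np

def hits(A, iters=100):
    """Iterate the mutual-reinforcement updates; returns (authority, hub) vectors."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    a = np.ones(n) / np.sqrt(n)
    h = np.ones(n) / np.sqrt(n)
    for _ in range(iters):
        a = A.T @ h                 # authority: sum of hub scores of in-neighbors
        a /= np.linalg.norm(a)
        h = A @ a                   # hub: sum of authority scores of out-neighbors
        h /= np.linalg.norm(h)
    return a, h

# Hypothetical graph: pages 0 and 1 both link to 2 and 3; page 4 links to 3.
A = np.zeros((5, 5))
A[0, [2, 3]] = 1
A[1, [2, 3]] = 1
A[4, 3] = 1

a, h = hits(A)
cocitation = A.T @ A    # entry (p, q): number of pages linking to both p and q
coupling = A @ A.T      # entry (p, q): number of pages that both p and q link to
print(a, h)
```

In this example pages 2 and 3 are co-cited by two pages, and pages 0 and 1 share two outlinks, which is why they end up as the top authorities and hubs respectively.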

  18. Principal Eigenvector Computation • E: n × n matrix • $|\lambda_1| > |\lambda_2| \ge |\lambda_3| \ge \dots \ge |\lambda_n|$: eigenvalues of E • Suppose $\lambda_1 > 0$ • $v_1,\dots,v_n$: corresponding eigenvectors • Eigenvectors form an orthonormal basis • Input: • The matrix E • A unit vector u, which is not orthogonal to $v_1$ • Goal: compute $\lambda_1$ and $v_1$

  19. The Power Method

  20. Why Does It Work? • Theorem: As $t \to \infty$, $w \to c \cdot v_1$ (c is a constant) • Convergence rate: proportional to $(|\lambda_2|/|\lambda_1|)^t$ • The larger the “spectral gap” $|\lambda_1| - |\lambda_2|$, the faster the convergence.
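
The slide's own statement of the power method did not survive in this transcript, so the following is a sketch of the textbook formulation under the assumptions listed on the previous slides (a symmetric E, a start vector u not orthogonal to v_1). The function name `power_method`, the Rayleigh-quotient estimate of λ_1, and the 2 × 2 example are assumptions made for illustration.

```python
import numpy as np

def power_method(E, u, iters=200):
    """Repeatedly apply E and renormalize; returns (lambda_1 estimate, v_1 estimate)."""
    w = u / np.linalg.norm(u)
    for _ in range(iters):
        w = E @ w
        w /= np.linalg.norm(w)
    lam = w @ E @ w          # Rayleigh quotient of the unit vector w
    return lam, w

# Hypothetical symmetric matrix with eigenvalues (5 +/- sqrt(5)) / 2.
E = np.array([[2.0, 1.0],
              [1.0, 3.0]])
u = np.array([1.0, 0.0])     # not orthogonal to the principal eigenvector
lam1, v1 = power_method(E, u)
print(lam1, v1)
```

For this matrix the ratio |λ_2|/|λ_1| is about 0.38, so the component of w along v_2 shrinks by roughly that factor in every iteration.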

  21. Spectral Methods in Information Retrieval

  22. Outline • Motivation: synonymy and polysemy • Latent Semantic Indexing (LSI) • Singular Value Decomposition (SVD) • LSI via SVD • Why does LSI work? • HITS and SVD

  23. Synonymy and Polysemy • Synonymy: multiple terms with (almost) the same meaning • Ex: cars, autos, vehicles • Harms recall • Polysemy: a term with multiple meanings • Ex: java (programming language, coffee, island) • Harms precision

  24. Traditional Solutions • Query expansion • Synonymy: OR on all synonyms • Manual/automatic use of thesauri • Too few synonyms: recall still low • Too many synonyms: harms precision • Polysemy: AND on term and additional specializing terms • Ex: +java +"programming language" • Terms too broad: precision still low • Terms too narrow: harms recall

  25. Syntactic Space • D: document collection, |D| = n • T: term space, |T| = m • A: m × n term-document matrix (rows: terms, columns: documents) • $A_{t,d}$: “weight” of t in d (e.g., TF-IDF) • $A^TA$: pairwise document similarities • $AA^T$: pairwise term similarities

  26. Syntactic Indexing • Index keys: terms • Limitations • Synonymy • (Near)-identical rows • Polysemy • Space inefficiency • Matrix usually is not full rank • Gap between syntax and semantics: Information need is semantic but index and query are syntactic.

  27. Semantic Space • C: concept space, |C| = r • B: r × n concept-document matrix (rows: concepts, columns: documents) • $B_{c,d}$: “weight” of c in d • Change of basis • Compare to wavelet and Fourier transforms

  28. Latent Semantic Indexing (LSI) [Deerwester et al. 1990] • Index keys: concepts • Documents & query: mixtures of concepts • Given a query, finds the most similar documents • Bridges the syntax-semantics gap • Space-efficient • Concepts are orthogonal • Matrix is full rank • Questions • What is the concept space? • What is the transformation from the syntactic space to the semantic space? • How to filter “noise concepts”?

  29. Singular Values • A: m×n real matrix • Definition: $\sigma \ge 0$ is a singular value of A if there exists a pair of unit vectors u, v s.t. $Av = \sigma u$ and $A^Tu = \sigma v$; u and v are called singular vectors. • Ex: $\sigma = \|A\|_2 = \max_{\|x\|_2 = 1} \|Ax\|_2$. • Corresponding singular vectors: the x that maximizes $\|Ax\|_2$ (as v) and $y = Ax/\|A\|_2$ (as u). • Note: $A^TAv = \sigma^2 v$ and $AA^Tu = \sigma^2 u$ • $\sigma^2$ is an eigenvalue of $A^TA$ and $AA^T$ • v is an eigenvector of $A^TA$ • u is an eigenvector of $AA^T$

  30. Singular Value Decomposition (SVD) • Theorem: For every m×n real matrix A, there exists a singular value decomposition: $A = U \Sigma V^T$ • $\sigma_1 \ge \dots \ge \sigma_r > 0$ (r = rank(A)): singular values of A • $\Sigma = \mathrm{Diag}(\sigma_1,\dots,\sigma_r)$ • U: column-orthonormal m×r matrix ($U^TU = I$) • V: column-orthonormal n×r matrix ($V^TV = I$)

  31. Singular Values vs. Eigenvalues • $A = U \Sigma V^T$ • $\sigma_1,\dots,\sigma_r$: singular values of A • $\sigma_1^2,\dots,\sigma_r^2$: non-zero eigenvalues of $A^TA$ and $AA^T$ • $u_1,\dots,u_r$: columns of U • Orthonormal basis for span(columns of A) • Left singular vectors of A • Eigenvectors of $AA^T$ • $v_1,\dots,v_r$: columns of V • Orthonormal basis for span(rows of A) • Right singular vectors of A • Eigenvectors of $A^TA$
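
The relationship between singular values and eigenvalues is easy to verify numerically. The snippet below is a small sanity check rather than part of the lecture; the random 5 × 3 matrix and the seed are arbitrary choices, and it relies on `numpy.linalg.svd` returning singular values in descending order.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))     # hypothetical 5x3 matrix (full column rank almost surely)

# Thin SVD: A = U @ diag(s) @ Vt, with s sorted in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Squared singular values are the non-zero eigenvalues of A^T A (and of A A^T).
eig_AtA = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]

# Right singular vectors (columns of V) are eigenvectors of A^T A;
# left singular vectors (columns of U) are eigenvectors of A A^T.
v1, u1 = Vt[0], U[:, 0]
print(s ** 2)
print(eig_AtA)
```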

  32. LSI as SVD • $A = U\Sigma V^T \Rightarrow U^TA = \Sigma V^T$ • $u_1,\dots,u_r$: concept basis • $B = \Sigma V^T$: LSI matrix • $A_d$: d-th column of A • $B_d$: d-th column of B • $B_d = U^TA_d$ • $B_d[c] = u_c^T A_d$

  33. Noisy Concepts • $B = U^TA = \Sigma V^T$ • $B_d[c] = \sigma_c\, v_c[d]$ • If $\sigma_c$ is small, then $B_d[c]$ is small for all d • k = largest i s.t. $\sigma_i$ is “large” • For all c = k+1,…,r, and for all d, c is a low-weight concept in d • Main idea: filter out all concepts c = k+1,…,r • Space efficient: # of index terms = k (vs. r or m) • Better retrieval: noisy concepts are filtered out across the board

  34. Low-rank SVD • $B = U^TA = \Sigma V^T$ • $U_k = (u_1,\dots,u_k)$ • $V_k = (v_1,\dots,v_k)$ • $\Sigma_k$ = upper-left k×k sub-matrix of $\Sigma$ • $A_k = U_k \Sigma_k V_k^T$ • $B_k = \Sigma_k V_k^T$ • rank($A_k$) = rank($B_k$) = k
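
To make the low-rank construction concrete, here is a sketch of LSI on a tiny term-document matrix. Everything about the corpus is hypothetical: five terms, four documents with 0/1 weights instead of TF-IDF, and k = 2. Mapping a query q into concept space as U_k^T q mirrors the relation B_d = U^T A_d from the slides.

```python
import numpy as np

terms = ["car", "auto", "engine", "java", "coffee"]
# Hypothetical corpus. Columns are documents:
# d0 = {car, engine}, d1 = {auto, engine}, d2 = {java, coffee}, d3 = {coffee}.
A = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 1, 1]], dtype=float)

def cosine(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k]
A_k = U_k @ S_k @ Vt_k          # rank-k approximation of A
B_k = S_k @ Vt_k                # k x n matrix: documents in concept space

# Fold the one-term query "car" into concept space.
q = np.zeros(len(terms))
q[terms.index("car")] = 1.0
q_k = U_k.T @ q

syntactic = cosine(A[:, 0], A[:, 1])        # d0 vs d1 in term space
semantic = cosine(B_k[:, 0], B_k[:, 1])     # d0 vs d1 in concept space
query_vs_auto_doc = cosine(q_k, B_k[:, 1])  # "car" vs the document about "auto"
print(syntactic, semantic, query_vs_auto_doc)
```

In term space the query "car" shares no term with d1, so their inner product is zero, yet in the two-dimensional concept space the two point in the same direction because "car" and "auto" both co-occur with "engine".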

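  35. Low Dimensional Embedding • Frobenius norm: $\|A\|_F = \sqrt{\sum_{t,d} A_{t,d}^2}$ • Fact: $\|A - A_k\|_F^2 = \sum_{i=k+1}^{r} \sigma_i^2$ • Therefore, if $\|A - A_k\|_F$ is small, then for “most” d, d’, $\langle (A_k)_d, (A_k)_{d'} \rangle \approx \langle A_d, A_{d'} \rangle$, where $(A_k)_d$ is the d-th column of $A_k$. • $A_k$ preserves pairwise similarities among documents ⇒ at least as good as A for retrieval.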
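
The Frobenius-norm identity for the truncated SVD, and its consequence for pairwise document similarities, can be checked on a small example. This is an illustrative check only; the random 6 × 4 matrix, the seed, and k = 2 are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((6, 4))                      # hypothetical 6-term, 4-document matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]    # rank-k truncation

# Squared approximation error equals the sum of the discarded squared singular values.
err_sq = np.linalg.norm(A - A_k, "fro") ** 2
tail_sq = np.sum(s[k:] ** 2)

# Pairwise document similarities: A^T A versus A_k^T A_k.
# Their difference is V diag(0, ..., 0, s_{k+1}^2, ..., s_r^2) V^T.
gram_err = np.linalg.norm(A.T @ A - A_k.T @ A_k, "fro")
print(err_sq, tail_sq, gram_err)
```

The last quantity shows why small trailing singular values imply that the inner products between document columns are nearly unchanged by the truncation.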

  36. Computing SVD • Compute singular values of A by computing eigenvalues of $A^TA$ • Compute U and V by computing eigenvectors of $AA^T$ and $A^TA$, respectively • Running time not too good: $O(m^2n + mn^2)$ • Not practical for huge corpora • Sub-linear time algorithms for estimating $A_k$ [Frieze, Kannan, Vempala 1998]

  37. HITS and SVD • A: adjacency matrix of a web (sub-)graph G • a: authority vector • h: hub vector • a is a principal eigenvector of $A^TA$ • h is a principal eigenvector of $AA^T$ • Therefore: a and h give $A_1$: the rank-1 SVD of A • Generalization: using $A_k$, we can get k authority and hub vectors, corresponding to other topics in G.

  38. Why is LSI Better? [Papadimitriou et al. 1998] [Azar et al. 2001] • LSI summary • Documents are embedded in a low dimensional space (m → k) • Pairwise similarities are preserved • More space-efficient • But why is retrieval better? • Synonymy • Polysemy

  39. Generative Model • A corpus model M = (T, C, W, D) • T: Term space, |T| = m • C: Concept space, |C| = k • Concept: distribution over terms • W: Topic space • Topic: distribution over concepts • D: Document distribution • Distribution over $W \times \mathbb{N}$ • A document d is generated as follows: • Sample a topic w and a length n according to D • Repeat n times: • Sample a concept c from C according to w • Sample a term t from T according to c
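
The generation procedure above translates almost line by line into a sampler. The sketch below assumes concepts are given as rows of a k × m probability matrix and that the document distribution D is supplied as a callable; the names `generate_document` and `pure_topic_D`, the two toy concepts, and the length range are all hypothetical.

```python
import numpy as np

def generate_document(rng, concepts, sample_topic_and_length):
    """Sample one document as a vector of term counts.

    concepts: k x m array, row c is concept c's distribution over the m terms.
    sample_topic_and_length: callable(rng) -> (w, n), playing the role of D,
        where w is a distribution over the k concepts and n is the length.
    """
    w, n = sample_topic_and_length(rng)
    k, m = concepts.shape
    counts = np.zeros(m, dtype=int)
    for _ in range(n):
        c = rng.choice(k, p=w)              # sample a concept according to the topic
        t = rng.choice(m, p=concepts[c])    # sample a term according to the concept
        counts[t] += 1
    return counts

# Two hypothetical concepts with disjoint term supports.
concepts = np.array([[0.5, 0.5, 0.0, 0.0],
                     [0.0, 0.0, 0.3, 0.7]])

def pure_topic_D(rng):
    """Hypothetical D: each topic is a single concept; length uniform in [20, 40)."""
    c = rng.integers(2)
    return np.eye(2)[c], int(rng.integers(20, 40))

rng = np.random.default_rng(0)
docs = np.array([generate_document(rng, concepts, pure_topic_D) for _ in range(10)])
print(docs)
```

Stacking such count vectors as columns gives a term-document matrix of the kind the analysis of LSI works with.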

  40. Simplifying Assumptions • Every document has a single topic (W = C) • For every two concepts c, c’, $\|c - c'\| \ge 1 - \epsilon$ • The probability of every term under a concept c is at most some constant $\tau$.

  41. LSI Works • A: m×n term-document matrix, representing n documents generated according to the model • Theorem [Papadimitriou et al. 1998]: With high probability, for every two documents d, d’, • If topic(d) = topic(d’), then $(A_k)_d$ and $(A_k)_{d'}$ are nearly parallel • If topic(d) ≠ topic(d’), then $(A_k)_d$ and $(A_k)_{d'}$ are nearly orthogonal

  42. Proof • For simplicity, assume $\epsilon = 0$ • Want to show: • If topic(d) = topic(d’), $(A_k)_d \parallel (A_k)_{d'}$ • If topic(d) ≠ topic(d’), $(A_k)_d \perp (A_k)_{d'}$ • $D_c$: documents whose topic is the concept c • $T_c$: terms in supp(c) • Since $\|c - c'\| = 1$, $T_c \cap T_{c'} = \emptyset$ • A has non-zeroes only in blocks $B_1,\dots,B_k$, where $B_c$: sub-matrix of A with rows in $T_c$ and columns in $D_c$ • $A^TA$ is a block diagonal matrix with blocks $B_1^TB_1,\dots,B_k^TB_k$ • (i,j)-th entry of $B_c^TB_c$: term similarity between i-th and j-th documents whose topic is the concept c • $B_c^TB_c$: adjacency matrix of a bipartite (multi-)graph $G_c$ on $D_c$

  43. Proof (cont.) • $G_c$ is a “random” graph • First and second eigenvalues of $B_c^TB_c$ are well separated • For all c, c’, second eigenvalue of $B_c^TB_c$ is smaller than first eigenvalue of $B_{c'}^TB_{c'}$ • Top k eigenvalues of $A^TA$ are the principal eigenvalues of $B_c^TB_c$ for c = 1,…,k • Let $u_1,\dots,u_k$ be corresponding eigenvectors • For every document d on topic c, $A_d$ is orthogonal to all $u_1,\dots,u_k$, except for $u_c$. • $(A_k)_d$ is a scalar multiple of $u_c$.
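
The block structure used in the proof can be seen on a hand-built example. This is an illustration, not the probabilistic argument itself: the two blocks below are fixed, hypothetical matrices chosen so that each block's second eigenvalue is smaller than the other block's first eigenvalue, which the proof establishes with high probability for random documents.

```python
import numpy as np

# Hypothetical blocks: topic 1 has 3 terms and 2 documents, topic 2 has 2 terms and 3 documents.
B1 = np.array([[2, 1],
               [1, 2],
               [1, 1]], dtype=float)
B2 = np.array([[1, 2, 1],
               [2, 1, 2]], dtype=float)

# Term-document matrix with non-zeroes only inside the two blocks.
A = np.zeros((5, 5))
A[:3, :2] = B1
A[3:, 2:] = B2

gram = A.T @ A                               # block diagonal: B1^T B1 and B2^T B2
top2 = np.sort(np.linalg.eigvalsh(gram))[::-1][:2]
block_principal = sorted(
    [np.linalg.eigvalsh(B1.T @ B1).max(), np.linalg.eigvalsh(B2.T @ B2).max()],
    reverse=True,
)

def cosine(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]

same_topic_1 = cosine(A_k[:, 0], A_k[:, 1])  # both documents on topic 1
same_topic_2 = cosine(A_k[:, 2], A_k[:, 3])  # both documents on topic 2
cross_topic = cosine(A_k[:, 0], A_k[:, 2])   # documents on different topics
print(top2, same_topic_1, same_topic_2, cross_topic)
```

After truncation to rank k = 2, every document column collapses onto the principal direction of its own block, so same-topic columns become parallel and cross-topic columns stay orthogonal.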

  44. Extensions [Azar et al. 2001] • A more general generative model • Also explains the improved treatment of polysemy

  45. End of Lecture 5
