The math behind PageRank • A detailed analysis of the mathematical aspects of PageRank • Computational Mathematics class presentation • Ravi S Sinha, LIT lab, UNT
Partial citations of references • The Anatomy of a Large-Scale Hypertextual Web Search Engine (Sergey Brin and Lawrence Page) • Inside PageRank (Monica Bianchini, Marco Gori, and Franco Scarselli) • Deeper Inside PageRank (Amy Langville and Carl Meyer) • Efficient Computation of PageRank (Taher Haveliwala) • Topic-Sensitive PageRank (Taher Haveliwala)
Overview of the talk • Why PageRank • What is PageRank • How PageRank is used • Math • More math • Remaining math
Why PageRank • Need to build a better automatic search engine • Why? • Human-maintained lists are subjective and expensive to build (non-automatic) • Automatic engines based on keyword matching do a poor job (page content alone is not enough; cleverly placed words in a page can mislead search engines) • Advertisers sometimes mislead search engines • Solution: Google [modern day: much more than PageRank; getting smarter] • Exact technology: not publicly disclosed • Core technology: PageRank (utilizes link structure) • Other uses • Any problem that can be modeled as a graph in which the centrality of the vertices needs to be computed (NLP, etc.)
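What is PageRank • A way to find the most ‘important’ vertices in a graph • PR(A) = (1-d) + d [ PR(T1) / C(T1) + … + PR(Tn) / C(Tn) ] • T1 … Tn = pages linking to A; C(T) = number of outgoing links of T; d = damping factor • Forms a probability distribution over the vertices [sum = 1] when the constant term is (1-d)/N for N vertices; with (1-d) as written and no dangling pages, the scores sum to N • How does this relate to Web search? • Vertices = pages • Incoming edges = hyperlinks from other pages • Outgoing edges = hyperlinks to other pages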
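The formula on this slide can be iterated directly. The following is a minimal sketch, not the production algorithm: it assumes every page has at least one outgoing link, and it uses the (1-d)/N form of the constant term so that the scores sum to 1 (the (1-d) form gives the same ranking, scaled by N). The three-page graph and the names `pagerank`, `links`, and `tol` are made up for illustration.

```python
def pagerank(links, d=0.85, tol=1e-10, max_iter=1000):
    """links maps each page to the list of pages it links to.

    Assumes every page appears as a key and has at least one out-link.
    """
    pages = sorted(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}          # start from the uniform distribution
    for _ in range(max_iter):
        new = {p: (1.0 - d) / n for p in pages}   # constant term (1-d)/N
        for src, targets in links.items():
            share = d * pr[src] / len(targets)    # d * PR(T) / C(T)
            for t in targets:
                new[t] += share
        delta = sum(abs(new[p] - pr[p]) for p in pages)
        pr = new
        if delta < tol:                           # scores stopped changing
            break
    return pr


# Hypothetical toy graph: A -> B, A -> C, B -> C, C -> A
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

In this toy graph C collects votes from both A and B, so it ends up with the highest score even though A receives C's entire vote.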
Simple visualization: the simplest variant of PageRank in use • Random surfer [user behavior] • Only one incoming link, yet high PageRank • Damping factor
Lexical Substitution: A crash course • Example sentence: “There are different types of managed care systems”
PageRank in use: Lexical Substitution • Weights: word similarity • Directed/undirected: whole other realm
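A rough sketch of how edge weights can enter the computation: each vertex splits its score among its neighbours in proportion to the edge weight instead of equally. The graph is treated as undirected here, which is only one of the options the slide mentions. The words and similarity values below are invented for illustration and are not taken from the talk or from any similarity measure.

```python
def weighted_pagerank(weights, d=0.85, iters=100):
    """weights maps an undirected edge (u, v) to a similarity weight."""
    nbrs = {}
    for (u, v), w in weights.items():
        nbrs.setdefault(u, {})[v] = w   # store the edge in both directions
        nbrs.setdefault(v, {})[u] = w
    n = len(nbrs)
    score = {v: 1.0 / n for v in nbrs}
    for _ in range(iters):
        new = {}
        for v in nbrs:
            # each neighbour u passes on the fraction w(u, v) / (total weight at u)
            incoming = sum(
                w * score[u] / sum(nbrs[u].values())
                for u, w in nbrs[v].items()
            )
            new[v] = (1.0 - d) / n + d * incoming
        score = new
    return score


# Hypothetical similarity weights between words
weights = {
    ("managed", "supervised"): 0.8,
    ("managed", "controlled"): 0.6,
    ("managed", "handled"): 0.7,
    ("supervised", "controlled"): 0.2,
}
scores = weighted_pagerank(weights)
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```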
The math behind PageRank • Intuitive correctness • Mathematical foundation • Stability • Complexity of computational scheme • Critical role of the parameters involved • The distribution of the page score • Role of dangling pages • How to promote certain vertices (Web pages)
Intuitive correctness • Concept of ‘voting’ • Related to citation in scientific literature • More citations indicate a great/important piece of work • Random surfer / random walk • A page with many links to it must be important • A very important page must point to something equally important
Mathematical foundation • Most researchers: Markov chains • Caveat: only applicable in the absence of dangling nodes • Basic idea: authority of a Web page is unrelated to its contents [comes from the link structure] • Simple representation • Vector representation • IN = [1, 1, 1 … 1]’ • Transition matrix: ∑(each column) = 1 or 0 [0 for the column of a dangling page]
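To make the "1 or 0" column sums concrete, here is a small sketch that builds the transition matrix from a link list. It assumes the column convention: column j holds the probabilities of moving from page j to each other page. The four-page graph and the function name `transition_matrix` are hypothetical.

```python
def transition_matrix(links, pages):
    """Return W as a list of rows; W[i][j] is the probability of going from page j to page i."""
    index = {p: i for i, p in enumerate(pages)}
    n = len(pages)
    W = [[0.0] * n for _ in range(n)]
    for src in pages:
        targets = links.get(src, [])      # a dangling page has no targets
        for t in targets:
            W[index[t]][index[src]] += 1.0 / len(targets)
    return W


# Hypothetical graph in which D has no outgoing links (a dangling page)
pages = ["A", "B", "C", "D"]
links = {"A": ["B", "C"], "B": ["C"], "C": ["A", "D"]}
W = transition_matrix(links, pages)

n = len(pages)
col_sums = [sum(W[i][j] for i in range(n)) for j in range(n)]
print(col_sums)
```

The columns for A, B, and C sum to 1, while the column for the dangling page D sums to 0, which is why W on its own is not a stochastic matrix.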
Mathematical foundation (2) • Google’s iterative version: converges to a stationary solution • Jacobi algorithm • Alternative computation: normalized so that ||x(t)||1 = 1
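One way to write out the computations named on this slide, offered as a reading aid and not as the original slide content. It assumes the column convention (W is the transition matrix, d the damping factor), writes the slide's all-ones vector IN as \mathbf{1}_N, and introduces x(t) for the score vector at step t and N for the number of pages. The Jacobi identification additionally assumes that no page links to itself, and the normalization statement assumes that every column of W sums to 1.

```latex
\begin{align*}
  % Iterative version
  x(t+1) &= d\,W\,x(t) + (1-d)\,\mathbf{1}_N \\
  % Its fixed point solves a linear system
  (I - d\,W)\,x &= (1-d)\,\mathbf{1}_N \\
  % Jacobi for A x = b with A = I - dW and b = (1-d) 1_N:
  % when diag(W) = 0 the diagonal of A is I, so the Jacobi update is
  x(t+1) &= b - (A - I)\,x(t) = d\,W\,x(t) + (1-d)\,\mathbf{1}_N \\
  % Normalized alternative: the 1-norm is preserved at every step
  x(t+1) &= d\,W\,x(t) + \frac{1-d}{N}\,\mathbf{1}_N,
  \qquad \|x(t)\|_1 = 1 \;\Rightarrow\; \|x(t+1)\|_1 = d + (1-d) = 1
\end{align*}
```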
Even more on energy [community promotion] • Split same content into smaller vertices • Avoid dangling pages • Avoid many outgoing links
Page promotion • Treat certain pages as communities • Bias certain pages by using a non-uniform distribution in the vector IN • Tinker with the connectivity [PageRank has been shown to be affected by the regularity of the connection pattern]
Computation of PageRank • PageRank can be computed on a graph changing over time • Practical interest [the Web is alive] • An optimal algorithm exists for computing PageRank • Practical applications: search engines, PageRank on billions of pages – efficiency! • O(|H| log(1/ε)) [|H| = number of hyperlinks, ε = required precision] • NOT dependent on the connectivity or other dimensions • Ideal computation: stops when the ranking of vertices does not change between two consecutive computations [convergence]
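The stopping rule in the last bullet can be sketched as follows. This is an illustration under simple assumptions (no dangling pages, uniform constant term), and the rule is a heuristic: an ordering that is unchanged between two consecutive iterations may still change later, so practical implementations usually also check the change in the scores. The graph and the name `pagerank_until_ranking_stable` are made up.

```python
def pagerank_until_ranking_stable(links, d=0.85, max_iter=1000):
    """Iterate until two consecutive iterations produce the same ordering of pages."""
    pages = sorted(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    order = None
    for it in range(1, max_iter + 1):
        new = {p: (1.0 - d) / n for p in pages}
        for src, targets in links.items():
            for t in targets:
                new[t] += d * pr[src] / len(targets)
        pr = new
        # rank by descending score; ties are broken alphabetically
        new_order = sorted(pages, key=lambda p: (-pr[p], p))
        if new_order == order:
            return pr, new_order, it
        order = new_order
    return pr, order, max_iter


# Hypothetical toy graph: A -> B, A -> C, B -> C, C -> A
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
scores, order, iterations = pagerank_until_ranking_stable(links)
print(order, "after", iterations, "iterations")
```

Each iteration touches every hyperlink once, which is where the |H| factor in the cost comes from.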
The Markov model from the Web • A unique PageRank vector is guaranteed only if the Markov chain is irreducible • By nature, the Web is non-bipartite, sparse, and produces a reducible Markov chain • The Web hyperlink matrix is forced to be • Stochastic [non-negative entries, all columns sum up to 1] • Remove dangling nodes/ replace the relevant rows/columns with a small value, usually (1/n)·eT • Introduce a personalization vector • Primitive • Non-negative • At least one positive element on the main diagonal • Irreducible
More on the Markov structure • The final matrix is a convex combination of the original stochastic matrix and a stochastic perturbation matrix • Produces a stochastic, irreducible matrix • The PageRank vector is guaranteed to exist for this matrix • Every node is directly connected to every other node; all transition probabilities are non-zero • Irreducible, aperiodic Markov chain; the iteration will converge
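The two fixes described on these slides (patching dangling pages, then mixing with a perturbation matrix) can be sketched as below. This is an illustrative construction under assumed conventions, not the exact formulation from the cited papers: columns are the "from" pages, dangling columns are replaced by the uniform value 1/n, and the personalization vector `v` defaults to uniform. The small matrix `W` and the name `google_matrix` are hypothetical.

```python
def google_matrix(W, d=0.85, v=None):
    """Return G = d * S + (1 - d) * (v repeated in every column).

    S is W with every all-zero (dangling) column replaced by 1/n.
    """
    n = len(W)
    if v is None:
        v = [1.0 / n] * n                 # uniform personalization vector
    G = [[0.0] * n for _ in range(n)]
    for j in range(n):
        col_sum = sum(W[i][j] for i in range(n))
        for i in range(n):
            s_ij = W[i][j] if col_sum > 0 else 1.0 / n   # dangling fix
            G[i][j] = d * s_ij + (1.0 - d) * v[i]        # convex combination
    return G


# Hypothetical 4-page transition matrix; the last column is a dangling page
W = [
    [0.0, 0.0, 0.5, 0.0],
    [0.5, 0.0, 0.0, 0.0],
    [0.5, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.0],
]
G = google_matrix(W)
n = len(G)
col_sums = [sum(G[i][j] for i in range(n)) for j in range(n)]
smallest = min(min(row) for row in G)
print([round(s, 12) for s in col_sums], smallest)
```

Every entry of G is strictly positive, so every page can be reached from every other page in one step, and every column sums to 1.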
There’s more to PageRank • Computation • Power method • Notoriously slow in general • Still the method of choice for PageRank • Requires no computation of intermediate matrices • Converges quickly in practice • Linear systems method • The damping factor [usually 0.85] • Greater value: more iterations required, but ‘truer’ PageRanks • Dangling pages • Storage issues
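A back-of-the-envelope sketch of why a larger damping factor costs more iterations. It assumes the common rule of thumb that the power-method error shrinks roughly like d^k after k iterations, so reaching a tolerance ε takes about log(ε)/log(d) steps. The function name and the tolerance 1e-8 are illustrative choices.

```python
import math


def iterations_needed(d, eps=1e-8):
    """Smallest k with d**k <= eps, assuming the error decays like d**k."""
    return math.ceil(math.log(eps) / math.log(d))


for d in (0.5, 0.85, 0.99):
    print(d, iterations_needed(d))
```

Under this estimate, moving d from 0.85 towards 1 raises the iteration count from about a hundred to well over a thousand.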
The end [for today] Thanks for listening!