Presented by Zheng Zhao. Originally designed by Soumya Sanyal. http://ranger.uta.edu/~gdas/Courses/Spring2005/DBIR/slides/The%20PageRank%20Citation%20Ranking%20-%20Redone.ppt
The PageRank Citation Ranking: Bringing Order to the Web • Page L., Brin S., Motwani R., Winograd T. • Stanford Digital Library Technologies Project • http://dbpubs.stanford.edu/pub/1999-66
Outline • Paper Citations and the Web : Motivation • PageRank : Why it should be considered? • More PageRank: Nuts and bolts • PageRank Unleashed: Looking under the hood • Convergence and Random Walks : Why does it work? • Implementation: Getting your hands dirty • Personalized PageRank: The invisible source • Applications: What wasn’t apparent already • Conclusions
Paper Citations and the Web : Motivation • Academic citations link to other well-known papers • But academic papers are peer reviewed and have quality control • The web of academic documents is homogeneous in quality, usage, citation & length • Most web pages link to other web pages as well • The quality of a web page, though, is subjective to the user • The importance of a page is a quantity that is not intuitively easy to capture
Contd. • A user wants to see what is most applicable to her needs first. • The job of the retrieval system is to present the more relevant documents up front. • The notion of quality or relative importance of a web page is magnified • The average quality experienced by a user is higher than the quality of the average web page. • Notations Used: • Backlinks (inedges) : Links that point to a certain page • Forward Links (outedges): Links that emanate from that page
PageRank : Why it should be considered? • Think of a color palette • Colors are formed by mixing one or more colors • The amount and intensity of each color you mix ultimately governs the color of the final mixture, not the number of colors! • Now think of a Web page • A number of back links (inedges) point to this Web page • Say a certain back link came from Yahoo! and another came from an obscure home page. • Think of the importance of the Yahoo! page as opposed to the importance of the ‘home page’. • Now say the importance of the Yahoo! page was mapped to the amount (intensity) of one color and that of the ‘home page’ to another color • What matters is the importance of the back links rather than their number.
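More PageRank: Nuts and bolts • For any Web page u, let $F_u$ be the set of pages u points to (forward links), $B_u$ the set of pages that point to u (back links), and $N_u = |F_u|$ the number of forward links • $R(u)$ = rank of page u ; c = normalization constant • Simplified ranking: $R(u) = c \sum_{v \in B_u} R(v) / N_v$ • Note: c < 1 to cover for pages with no outgoing links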
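A minimal sketch of one application of the simplified ranking rule R(u) = c · Σ R(v)/N_v over the back links v of u. The three-page graph and the names `forward_links`, `backlinks` and `rank_step` are assumptions made for illustration; they do not come from the paper.

```python
# Hypothetical 3-page graph: page -> pages it links to (its forward links F_u).
forward_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}


def backlinks(forward_links):
    """Invert the forward links to get B_u, the pages that point to each page u."""
    back = {u: [] for u in forward_links}
    for v, targets in forward_links.items():
        for u in targets:
            back[u].append(v)
    return back


def rank_step(rank, forward_links, c=1.0):
    """Apply R(u) = c * sum(R(v) / N_v for v in B_u) once, for every page u."""
    back = backlinks(forward_links)
    return {
        u: c * sum(rank[v] / len(forward_links[v]) for v in back[u])
        for u in forward_links
    }


# Start from a uniform rank and apply the rule once.
rank = {u: 1.0 / len(forward_links) for u in forward_links}
new_rank = rank_step(rank, forward_links)
```

Page c receives half of a's rank and all of b's rank, so a single step already separates it from b, which only receives half of a's rank.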
Contd.. • So what does the overall picture look like? • A is a square matrix whose rows and columns correspond to Web pages, with $A_{u,v} = 1/N_u$ if there is a link from u to v and 0 otherwise
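One way to hold the link structure as a matrix, sketched in plain Python. The orientation used here (entry `[v][u]` is 1/N_u when u links to v, so every column sums to 1 and the update reads R = M R) is an assumption chosen to match the transposed form used in the slides; `link_matrix` and the sample graph are hypothetical.

```python
def link_matrix(forward_links, pages):
    """Build M with M[v][u] = 1/N_u if page u links to page v, else 0.

    Assumes each page lists a target at most once.
    """
    index = {page: i for i, page in enumerate(pages)}
    n = len(pages)
    m = [[0.0] * n for _ in range(n)]
    for u, targets in forward_links.items():
        for v in targets:
            m[index[v]][index[u]] = 1.0 / len(targets)
    return m


# Hypothetical 3-page graph: page -> pages it links to.
pages = ["a", "b", "c"]
forward_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
m = link_matrix(forward_links, pages)

# Each column distributes exactly one unit of rank across the page's forward links.
column_sums = [sum(m[i][j] for i in range(len(pages))) for j in range(len(pages))]
```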
Contd.. (Matrices Revisited) • Eigenvectors and eigenvalues • Given that A is a matrix and R is a vector over all the Web pages, the dominant eigenvector is the one associated with the maximal eigenvalue. • It can be found by iterating the previous equation until the recurrence converges. • The eigenvectors associated with a given eigenvalue (together with the zero vector) form what is called its eigenspace.
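Iterating until convergence is the power method. The sketch below assumes a small column-normalised link matrix (hypothetical values) and rescales the vector to unit L1 norm after every step; `mat_vec` and `power_iteration` are illustrative names.

```python
def mat_vec(m, x):
    """Multiply matrix m (a list of rows) by vector x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in m]


def power_iteration(m, iterations=100):
    """Approximate the dominant eigenvector of m by repeated multiplication."""
    n = len(m)
    x = [1.0 / n] * n
    for _ in range(iterations):
        y = mat_vec(m, x)
        norm = sum(abs(v) for v in y)
        x = [v / norm for v in y]  # keep the L1 norm at 1
    return x


# Hypothetical 3-page graph (a -> b, a -> c, b -> c, c -> a), columns sum to 1.
m = [
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
]
r = power_iteration(m)
```

For this graph the iteration settles on ranks of 0.4, 0.2 and 0.4 for a, b and c, which can be checked by substituting them back into R = M R.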
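Contd.. (A Walk Through Example) • Let’s take an example • [Figure: example matrix $A^T$]
Contd.. • Matrix notation: $R = cAR = MR$ • R : eigenvector of A, with eigenvalue $1/c$ (since $AR = R/c$) • Eigenvalue problem: $Ax = \lambda x$, i.e. $(A - \lambda I)x = 0$, which has a nonzero solution only when $|A - \lambda I| = 0$ • [Figure: worked example showing A, R and the normalized R]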
Contd.. (Markov Chains) • Random surfer model • Description of a random walk through the Web graph • M is interpreted as a transition matrix, and the rank of a page as the asymptotic probability that the surfer is currently browsing that page • This notion is fundamental to any Markovian system. In discrete time, the following is assumed: • $R_t = M R_{t-1}$ • M: transition matrix for a first-order Markov chain (stochastic) • The question is: does it converge to some sensible solution (as $t \to \infty$) regardless of the initial ranks?
Contd.. (Issues..) • The above iteration would converge were it not for a little problem • This problem is called the ‘Rank Sink’ problem. • The sink (a loop of pages with no links leaving it) accumulates rank, but never distributes it!
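The rank sink can be seen on a toy graph. In this sketch (a hypothetical four-page graph, with c = 1), pages a and b pass rank into the loop {x, y}, and nothing in the loop links back out.

```python
# Hypothetical graph: a and b feed rank into the loop {x, y}, which has no way out.
forward_links = {"a": ["b", "x"], "b": ["a"], "x": ["y"], "y": ["x"]}


def rank_step(rank, forward_links):
    """One update R(u) = sum of R(v) / N_v over the back links v of u (c = 1)."""
    new_rank = {u: 0.0 for u in forward_links}
    for v, targets in forward_links.items():
        share = rank[v] / len(targets)
        for u in targets:
            new_rank[u] += share
    return new_rank


rank = {u: 0.25 for u in forward_links}
for _ in range(200):
    rank = rank_step(rank, forward_links)

outside = rank["a"] + rank["b"]  # rank left outside the sink
inside = rank["x"] + rank["y"]   # rank trapped inside the sink
```

After enough steps essentially all of the rank sits in x and y, while a and b are drained towards zero.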
Contd.. • In general, many Web pages lack either backlinks or forward links. • This results in dangling edges in the graph • No parent → rank 0 • $M^T$ converges to a matrix whose last column is all zero • No children → no solution • $M^T$ converges to the zero matrix
Contd.. (More Random Surfer) • How do we escape from this? • A: We actually ‘escape’ from it. • Say a surfer is randomly clicking and hopping from one page to another. • If this surfer keeps going back to the ‘same’ set of pages, she will get bored (in reality too) and try to ‘escape’ from this set of pages. • Hence, we associate an ‘escape’ factor E to account for this ‘boredom’. • How do we model this escape probability? • We define E as a vector over all the Web pages that accounts for each page’s escape probability.
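Contd.. • Given this escape vector, how do we associate it with the original model? • $R'(u) = c \sum_{v \in B_u} R'(v) / N_v + c\,E(u)$ • In matrix notation $R' = c(AR' + E)$, where $\|R'\|_1 = 1$ • It can be rewritten as $R' = c(A + E \times \mathbf{1})R'$, where $\mathbf{1}$ is the vector of all ones • Hence $R'$ is an eigenvector of $(A + E \times \mathbf{1})$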
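Folding E into the matrix works because a rank vector with unit L1 norm satisfies (E × 1) R' = E. The sketch below checks numerically that c(A R' + E) and c(A + E × 1) R' agree; the matrix, E, the test vector and c are all hypothetical values picked for the check.

```python
def mat_vec(m, x):
    """Multiply matrix m (a list of rows) by vector x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in m]


# Hypothetical values: a 3-page link matrix, an escape vector and a rank vector.
a = [
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
]
e = [0.05, 0.05, 0.05]
r = [0.2, 0.3, 0.5]  # entries are non-negative and sum to 1
c = 0.9

# Form 1: c * (A r + E)
form_1 = [c * (ar_i + e_i) for ar_i, e_i in zip(mat_vec(a, r), e)]

# Form 2: c * (A + E x 1) r, where (E x 1)[i][j] = E[i] for every column j
a_plus = [[a[i][j] + e[i] for j in range(3)] for i in range(3)]
form_2 = [c * v for v in mat_vec(a_plus, r)]

max_gap = max(abs(p - q) for p, q in zip(form_1, form_2))
```

The two forms differ only by floating-point rounding, and only because the entries of `r` sum to 1; for an unnormalised vector the identity does not hold.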
PageRank Unleashed: Looking under the hood • The main algorithm: • What can we say about $d$ and $\epsilon$? • $d_1$ is called the eigengap and it controls the rate of convergence • $\epsilon$ is the convergence threshold
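The iteration in the paper applies the link matrix, measures the rank lost in that step (d), re-injects the loss along E, and stops once the L1 change between successive vectors drops below ε. The sketch below follows those steps under two labelled assumptions: the link matrix is pre-scaled by 1 − `alpha` with `alpha = 0.15`, and E is normalised to sum to 1 so that total rank is conserved. The function names and the four-page graph are hypothetical.

```python
def mat_vec(m, x):
    """Multiply matrix m (a list of rows) by vector x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in m]


def l1_norm(x):
    return sum(abs(v) for v in x)


def pagerank(m, e, start, epsilon=1e-10, max_iter=1000):
    """Iterate R <- M R + d E until the L1 change is at most epsilon."""
    r = list(start)
    for _ in range(max_iter):
        r_next = mat_vec(m, r)
        d = l1_norm(r) - l1_norm(r_next)  # rank lost in this step
        r_next = [v + d * e_v for v, e_v in zip(r_next, e)]  # re-inject along E
        delta = l1_norm([p - q for p, q in zip(r_next, r)])
        r = r_next
        if delta <= epsilon:
            break
    return r


# Hypothetical graph with a rank sink: a -> b, a -> x, b -> a, x -> y, y -> x.
links = [
    [0.0, 1.0, 0.0, 0.0],  # a
    [0.5, 0.0, 0.0, 0.0],  # b
    [0.5, 0.0, 0.0, 1.0],  # x
    [0.0, 0.0, 1.0, 0.0],  # y
]
alpha = 0.15  # assumed escape probability
damped = [[(1 - alpha) * v for v in row] for row in links]
e = [0.25, 0.25, 0.25, 0.25]  # assumed uniform E, normalised to sum to 1
ranks = pagerank(damped, e, start=e)
```

With the escape term in place, a and b keep a nonzero rank even though x and y form a sink, and the iteration stops well before `max_iter`.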
Convergence and Random Walks : Why does it work? • Irreducible, aperiodic Markov chains with a primitive transition probability matrix • What is the issue all about? • We need a transition matrix model that is guaranteed to converge, and that converges to a unique stationary distribution vector.
Contd.. • Adding the escape vector E allows us to make the original matrix A both primitive and stochastic • This guarantees convergence • What about the addition of new links? • Are link analysis algorithms based on eigenvectors stable, in the sense that results don’t change significantly? • Suppose the connectivity of a portion of the graph is changed arbitrarily • How will it affect the results of the algorithms? • Ng et al. (2001) IJCAI and Bianchini et al. (2002) WWW’02 • It is possible to perturb a symmetric matrix by a quantity that grows as $d_1$ and that produces a constant perturbation of the dominant eigenvector
Contd.. • Convergence experiment(s) • Expander graphs and $d_1$ (every subset S has a neighborhood bounded below by some factor times |S|) • Rapidly mixing random walk: convergence is guaranteed in time logarithmic in the size of the graph
Implementation: Getting your hands dirty • In 1998 • 24 million Web pages • Crawler builds an index of links • To do this in 5 days, about 50 Web pages/second need to be crawled • With an average outdegree of 11, that is 550 links/second • 75 million unique URLs to be compared against • URLs are hashed to unique integer IDs • No dangling links are kept initially • Vector E also helps with convergence issues • Weights were kept for 75 million URLs @ 4 bytes/weight (300 MB) • Access to the link database is linear since it is sorted • ’99 – 800 million pages; ’00 – 2 billion; ’01 – 4 billion
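The crawl and storage figures can be checked with a few lines of arithmetic. The inputs below are the numbers quoted on the slide; the variable names are illustrative.

```python
# Crawl rate needed to fetch 24 million pages in 5 days.
pages = 24_000_000
seconds_in_5_days = 5 * 24 * 60 * 60
pages_per_second = pages / seconds_in_5_days  # about 55.6, quoted as roughly 50

# Link throughput at the quoted 50 pages/second and an average outdegree of 11.
links_per_second = 50 * 11

# Memory for one 4-byte weight per URL, in bytes and in (decimal) megabytes.
weight_bytes = 75_000_000 * 4
weight_megabytes = weight_bytes / 1_000_000
```

The exact crawl rate comes out slightly above the quoted 50 pages/second, and the weight vector takes 300 MB, matching the slide.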
Personalized PageRank: The invisible source • $\|E\|_1 = 0.15$ • Uniform E: Web pages are valued because they exist! • Web pages with many related links receive an overly high ranking • The other extreme: E concentrated on just one Web page • Netscape home page and John McCarthy’s home page
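A minimal sketch of how the choice of E personalises the ranking, comparing a uniform E with an E concentrated on a single page. It makes the same labelled assumptions as a typical damped iteration: the link matrix is scaled by 1 − 0.15 and E is normalised to sum to 1, so the slide's 0.15 appears as `alpha` rather than as the norm of E. The graph and all names are hypothetical.

```python
def pagerank(m, e, iterations=300):
    """Fixed number of steps of R <- M R + d E, with d the rank lost per step."""
    r = list(e)
    for _ in range(iterations):
        r_next = [sum(m_ij * r_j for m_ij, r_j in zip(row, r)) for row in m]
        d = sum(r) - sum(r_next)  # all entries are non-negative, so sums are L1 norms
        r = [v + d * e_v for v, e_v in zip(r_next, e)]
    return r


# Hypothetical graph: a -> b, a -> x, b -> a, x -> y, y -> x (columns sum to 1).
links = [
    [0.0, 1.0, 0.0, 0.0],  # a
    [0.5, 0.0, 0.0, 0.0],  # b
    [0.5, 0.0, 0.0, 1.0],  # x
    [0.0, 0.0, 1.0, 0.0],  # y
]
alpha = 0.15
damped = [[(1 - alpha) * v for v in row] for row in links]

uniform_e = [0.25, 0.25, 0.25, 0.25]  # every page is an equally likely escape target
single_e = [1.0, 0.0, 0.0, 0.0]       # every escape lands on page a

uniform_ranks = pagerank(damped, uniform_e)
personal_ranks = pagerank(damped, single_e)
```

Concentrating E on page a roughly doubles a's rank in this toy graph and lifts its neighbour b as well, which is the effect the Netscape and John McCarthy examples illustrate at Web scale.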
Applications: What wasn’t apparent already • Estimating Web traffic • How PageRank corresponds to actual usage • Internet proxy cache from NLANR compared to PageRank • 2.6 million pages intersect with PageRank’s indexed 75 million • Web-based email access is one plausible reason for this disparity • People look at certain pages but never link to them • Backlink predictor • PageRank is a better predictor of future citation counts than citation counts themselves. • Experiment starts out with one URL and no other information • Goal is to crawl Web pages in the order of their importance • Importance being an evaluation function on citation counts (number of backlinks) • PageRank escapes local minima; citation counts get stuck in these.
Conclusions • In essence, the importance of one page being dependent on the importance of its predecessors is like a ‘peer’ review. • NASDAQ – 17th February, 2005 - $197.41 : Need I say More?