Lecture 4. CS492 Special Topics in Computer Science: Distributed Algorithms and Systems. “The PageRank Citation Ranking: Bringing Order to the Web”. L. Page, S. Brin, R. Motwani, T. Winograd, 1998.
Lecture 4 CS492 Special Topics in Computer Science Distributed Algorithms and Systems
“The PageRank Citation Ranking: Bringing Order to the Web”
L. Page, S. Brin, R. Motwani, T. Winograd, 1998
Fall 2008 CS492
Origin of “Google”
• Googol
  • 10^100
• Motivation behind
  • Human-maintained indices such as Yahoo!
  • Explosive growth
[Figure: Hostnames Active, source: http://news.netcraft.com]
Design Goals of Google
• Improved search quality
  • In 1997, only 1 of the top 4 search engines could find itself (return its own page for a query on its own name)
  • High precision in finding relevant documents was necessary
• Academic search engine research
  • Search engine technology had gone commercial and become a black art
  • To build systems that a good number of people could use
  • To build an architecture to support novel research on large-scale Web data
Weakness of Existing Approaches
• Calculate similarities
• Based on a flat, vector-space model of each page
• Prone to cheating (Web spamming or search engine persuasion)
Basic Idea of PageRank
Exploit the topological structure of hypertextual systems
Simple Example
[Figure: three pages with ranks A = 0.4, B = 0.2, C = 0.4]
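The ranks in this example can be checked with a few lines of Python. The slide's arrows did not survive extraction, so the link structure below (A links to B and C, B links to C, C links to A) is an assumption chosen because it is consistent with the ranks shown; `propagate` is a hypothetical helper name. The sketch shows that, under this assumed structure, one round of rank passing leaves the ranks unchanged.

```python
# Assumed link structure (not given on the slide): A -> B, A -> C, B -> C, C -> A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = {"A": 0.4, "B": 0.2, "C": 0.4}


def propagate(links, ranks):
    """Each page splits its current rank evenly among the pages it links to."""
    new_ranks = {page: 0.0 for page in links}
    for page, targets in links.items():
        share = ranks[page] / len(targets)
        for target in targets:
            new_ranks[target] += share
    return new_ranks


# One round of propagation; for a steady state the ranks should not move.
after = propagate(links, ranks)
print(after)
```

If the assumed links are right, `after` matches `ranks` up to floating-point error, which is what "steady state" means here.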
Related Work
• Academic citation analysis
• Similarities
  • Graph structure: paper = node and citation = link; web page = node and URL = link
  • “Node” authority is independent of “node” content
• Differences
  • Uniform unit of information (paper) versus great variability in quality, usage, citations, and length
  • Equal link weight versus variable importance: a backlink from Yahoo! versus one from a friend
Which Page Should Be Ranked Higher?
[Figure: pages A and B; label “JohnDoe”]
Simple Expression
$$R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}$$
• $R(u)$: page rank of $u$
• $B_u$: set of pages pointing at $u$
• $N_v$: out-degree of $v$
Question: role of $c$?
Answer: it keeps the total rank of all web pages constant
Dangling links
• Pages without outgoing pointers
• Example: pages not yet downloaded
• Do not affect the calculation much
• Remove them, calculate ranks, and add them back
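The removal step can be sketched as follows. This is a minimal illustration, not the procedure from the paper: the function name `remove_dangling`, the example graph, and the choice to repeat the removal until no dangling page is left are all assumptions, and the sketch assumes every page appears as a key of the dictionary. It covers only the "remove them" step, not the rank calculation or the adding back.

```python
def remove_dangling(links):
    """Repeatedly drop pages with no outgoing links.

    Returns the pruned graph and the removed pages in removal order.
    """
    pruned = {page: list(targets) for page, targets in links.items()}
    removed = []
    while True:
        dangling = [page for page, targets in pruned.items() if not targets]
        if not dangling:
            return pruned, removed
        removed.extend(dangling)
        for page in dangling:
            del pruned[page]
        # Dropping a page also drops the links that pointed at it,
        # which may leave further pages without outgoing links.
        for page in pruned:
            pruned[page] = [t for t in pruned[page] if t not in dangling]


# Hypothetical graph: E is dangling, and D becomes dangling once E is gone.
links = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A"], "D": ["E"], "E": []}
pruned, removed = remove_dangling(links)
print(pruned)
print(removed)
```

Keeping `removed` in removal order makes it possible to add the pages back in reverse order once the ranks of the remaining pages are known.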
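Loop
[Figure: pages A, B, and C linked in a loop]
Question: ranks of A, B, and C?
Answer: infinite! (rank sink) The loop keeps accumulating rank from incoming links but never passes any of it back out.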
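A small simulation can make the rank sink visible. The graph below is hypothetical: pages X and Y link to each other, X also links into the loop A -> B -> C -> A, and nothing links back out of the loop. The update is the undamped rule with the normalizing constant fixed at 1, which is an assumption made to keep the sketch short; under it the total rank stays at 1, so the sink shows up as the loop absorbing the whole total rather than as a literally unbounded value.

```python
# Hypothetical graph: X <-> Y, X -> A, and the loop A -> B -> C -> A with no way out.
links = {"X": ["Y", "A"], "Y": ["X"], "A": ["B"], "B": ["C"], "C": ["A"]}


def propagate(links, ranks):
    """Each page splits its current rank evenly among the pages it links to."""
    new_ranks = {page: 0.0 for page in links}
    for page, targets in links.items():
        share = ranks[page] / len(targets)
        for target in targets:
            new_ranks[target] += share
    return new_ranks


ranks = {page: 1.0 / len(links) for page in links}  # uniform start
for _ in range(50):
    ranks = propagate(links, ranks)

loop_share = ranks["A"] + ranks["B"] + ranks["C"]
print(f"rank held by the loop after 50 rounds: {loop_share:.6f}")
```

After enough rounds X and Y are left with almost nothing, even though they are the only pages that link to anything outside their own cycle.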
Basic Algorithm
$$R(u) = (1 - d) + d \sum_{v \in B_u} \frac{R(v)}{N_v}$$
• $R(u)$: page rank of $u$
• $B_u$: set of pages pointing at $u$
• $N_v$: out-degree of $v$
• $d$: damping factor
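The damped update can be sketched in Python as below, assuming the rule from [BP98], R(u) = (1 - d) + d * sum of R(v) / N_v over the pages v that point at u. The function name `pagerank`, the default d = 0.85, the tolerance, and the example graph are assumptions made for illustration; dangling pages simply pass on no rank in this sketch.

```python
def pagerank(links, d=0.85, tol=1e-10, max_iter=1000):
    """Iterate R(u) = (1 - d) + d * sum(R(v) / N_v) until the ranks stop changing."""
    ranks = {page: 1.0 for page in links}
    for _ in range(max_iter):
        new_ranks = {page: 1.0 - d for page in links}
        for page, targets in links.items():
            if not targets:
                continue  # dangling page: passes on no rank in this sketch
            share = d * ranks[page] / len(targets)
            for target in targets:
                new_ranks[target] += share
        change = max(abs(new_ranks[p] - ranks[p]) for p in links)
        ranks = new_ranks
        if change < tol:
            break
    return ranks


# Hypothetical graph with a loop (A -> B -> C -> A) that nothing links out of.
links = {"X": ["Y", "A"], "Y": ["X"], "A": ["B"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
for page, rank in sorted(ranks.items()):
    print(page, round(rank, 4))
```

With d < 1 every page keeps at least 1 - d of rank, so the loop can no longer absorb everything; the pages outside the loop end up with small but nonzero ranks.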
Matrix Representation
[Equation lost in extraction]
Question: Where to start?
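The slide's equation is not recoverable, so the following is only one plausible matrix form of the damped update, written under assumed notation: $\mathbf{x}$ is the column vector of page ranks, $W$ is the link matrix, $N_v$ is the out-degree of page $v$, $d$ is the damping factor, and $\mathbf{1}$ is the all-ones vector. None of these symbols is confirmed by the source.

```latex
\mathbf{x} = d\,W\,\mathbf{x} + (1 - d)\,\mathbf{1},
\qquad
W_{uv} =
\begin{cases}
  1/N_v & \text{if page } v \text{ links to page } u, \\
  0     & \text{otherwise,}
\end{cases}
\qquad
\mathbf{1} = (1, \dots, 1)^{\mathsf{T}}
```

Row $u$ of this equation reads $x_u = (1 - d) + d \sum_{v \to u} x_v / N_v$, which is the per-page damped update written for all pages at once.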
Iterative Algorithm
[Equation lost in extraction]
Question: Will it converge?
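One standard way to approach the convergence question is sketched below. It assumes the iteration has the form $\mathbf{x}^{(t+1)} = d\,W\,\mathbf{x}^{(t)} + (1-d)\,\mathbf{1}$, where column $v$ of $W$ holds $1/N_v$ in the rows of the pages that $v$ links to and zeros elsewhere; this notation is an assumption, since the slide's own equation is not recoverable. Each column of $W$ sums to at most 1, so $\lVert W \rVert_1 \le 1$, and for $0 \le d < 1$ a fixed point $\mathbf{x}^{*}$ exists because $I - dW$ is invertible.

```latex
\begin{aligned}
\mathbf{x}^{(t+1)} &= d\,W\,\mathbf{x}^{(t)} + (1 - d)\,\mathbf{1}, \\
\mathbf{x}^{*} &= d\,W\,\mathbf{x}^{*} + (1 - d)\,\mathbf{1}, \\
\mathbf{x}^{(t+1)} - \mathbf{x}^{*} &= d\,W\,\bigl(\mathbf{x}^{(t)} - \mathbf{x}^{*}\bigr), \\
\bigl\lVert \mathbf{x}^{(t)} - \mathbf{x}^{*} \bigr\rVert_1
  &\le d^{\,t}\,\bigl\lVert \mathbf{x}^{(0)} - \mathbf{x}^{*} \bigr\rVert_1
  \xrightarrow[t \to \infty]{} 0 .
\end{aligned}
```

Under these assumptions the error shrinks by at least a factor of $d$ per step regardless of the start vector, so a smaller damping factor converges faster.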
Example [LM04]
Turn the Problem into a Markov Process [LM04]
Evenly Split Rank of Dangling Links [LM04]
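The two steps named on these slides can be sketched together in Python. This is an illustration under stated assumptions, not the worked example from [LM04]: the function names, the three-page graph, and the teleportation term with alpha = 0.85 are all assumptions. The teleportation term is added so that the chain has a unique steady state. A dangling page gets a uniform row, which is one reading of splitting its rank evenly over all pages.

```python
def google_matrix(adjacency, alpha=0.85):
    """Build a row-stochastic transition matrix from a 0/1 adjacency matrix."""
    n = len(adjacency)
    rows = []
    for row in adjacency:
        out_degree = sum(row)
        if out_degree == 0:
            # Dangling page: split its rank evenly over all pages.
            hyperlink_row = [1.0 / n] * n
        else:
            hyperlink_row = [entry / out_degree for entry in row]
        # Assumed teleportation step: jump to a random page with probability 1 - alpha.
        rows.append([alpha * h + (1.0 - alpha) / n for h in hyperlink_row])
    return rows


def steady_state(matrix, tol=1e-12, max_iter=10000):
    """Power iteration pi <- pi * matrix, starting from the uniform distribution."""
    n = len(matrix)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [sum(pi[i] * matrix[i][j] for i in range(n)) for j in range(n)]
        if sum(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi


# Hypothetical graph: page 0 -> 1, 0 -> 2, 1 -> 2; page 2 is dangling.
adjacency = [[0, 1, 1], [0, 0, 1], [0, 0, 0]]
G = google_matrix(adjacency)
pi = steady_state(G)
print([round(p, 4) for p in pi])
```

Every row of `G` sums to 1, so `G` describes a Markov chain, and the vector returned by `steady_state` satisfies pi = pi G up to the tolerance.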
Final Solution
Eigenvector of P with eigenvalue 1 = steady-state rank
Spam Rank [BGS05]
Questions
• Where to start?
  • Find a nondegenerate start vector
• What if there are two pages that point to each other and no one else, and there is a page that points to one of them?
  • The damping factor guarantees that there is no rank sink
References
[BP98] Sergey Brin, Lawrence Page, “The Anatomy of a Large-Scale Hypertextual Web Search Engine,” Computer Networks and ISDN Systems, Vol. 30, 1998.
[BGS05] Monica Bianchini, Marco Gori, Franco Scarselli, “Inside PageRank,” ACM Transactions on Internet Technology, Vol. 5, No. 1, Feb. 2005.
[LM04] Amy N. Langville, Carl Meyer, “Deeper Inside PageRank,” Internet Mathematics, Vol. 1, No. 3, 2004.
[K99] Jon Kleinberg, “Authoritative Sources in a Hyperlinked Environment,” Journal of the ACM, Vol. 46, No. 5, 1999.