How PageRank Works Ketan Mayer-Patel University of North Carolina January 31, 2011
Me vs. Jeff • Me • High school: Public school in Texas • College: The University of California, Berkeley • Faculty member at: UNC • Jeff • High school: Hoity-toity, private all-boys school in Jersey • College: Stanford • Faculty member at: Duke
The World Wide Web • A Simple Request/Response System • Request for web page. • Web page returned.
Making The Request • How do you make a web request? • Use a browser. • Specify what you want directly. • Follow a link. • Turns out we very rarely specify documents directly. • Uniform Resource Locator (URL) • http://server-name.com/path/to/a/page • Two key characteristics of hyperlinks: • Directional • Unilateral
Web Search In Three Easy Steps • What’s step one? • Cut a hole in the box.
Web Search In Three Easy Steps • First, crawl. • Try to find all of the web pages. • Follow the links. • Second, index. • Organize what you find. • Lots of secret sauce here. • Third, query. • Usually, text query words. • Retrieves a list of related pages. • Usually because they contain the query text.
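The three steps can be sketched on a toy scale. This is a minimal illustration, not how any real search engine is built: the in-memory `TOY_WEB` dictionary, its URLs, and the function names are all made up for the example, and the query step returns matches unranked.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "a.example/home": ("unc basketball news", ["a.example/scores", "b.example/blog"]),
    "a.example/scores": ("basketball scores", ["a.example/home"]),
    "b.example/blog": ("duke and unc rivalry", []),
    "c.example/island": ("unreachable basketball page", ["a.example/home"]),
}

def crawl(seed):
    """Step 1: follow links breadth-first from a seed and return the pages found."""
    seen, queue = {seed}, deque([seed])
    while queue:
        url = queue.popleft()
        for target in TOY_WEB[url][1]:
            if target in TOY_WEB and target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

def build_index(urls):
    """Step 2: map each word to the set of crawled pages that contain it."""
    index = {}
    for url in urls:
        for word in TOY_WEB[url][0].split():
            index.setdefault(word, set()).add(url)
    return index

def query(index, text):
    """Step 3: return the pages containing every query word (sorted, not ranked)."""
    word_sets = [index.get(word, set()) for word in text.split()]
    return sorted(set.intersection(*word_sets)) if word_sets else []

pages = crawl("a.example/home")
index = build_index(pages)
results = query(index, "unc basketball")
```

Because links are directional, the crawler never discovers `c.example/island`: that page links out, but nothing links in to it.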
Which to list first? • Possible clues: • Number of times the query term appears • Where it appears • Title, body text, URL, metadata, etc. • How it appears • Style of text • Role of text • Position in the document graph • This is what distinguished Google from other search engines at the time.
PageRank • Supposedly named after Larry Page • Part of his research in grad school • Patented while in grad school; the patent is held by Stanford. • Licensed to Google for ~ 1.8 million shares of Google. • Stanford sold the shares for about $336M
Probability Distribution of a Random Walk • Start walking the graph. • After some reasonably long amount of time, stop. • What’s the chance that you are on a particular page? • Larger chance => more important page • Is this actually true? • Maybe, maybe not
Trapdoors and Dead Ends • Hotel California: Can’t ever leave. • Shangri-La: Can’t ever get here.
Fixing Our Random Walk • What can we do to fix it? • Add a bit more randomness. • At each step, with probability α, jump to any random page. • Otherwise, randomly follow a link. • Provides a way into and out of trapdoors, dead ends, and spider traps.
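The fixed random walk can be simulated directly. This is a rough sketch under stated assumptions: the four-page graph is invented, the default α of 0.15 is just a common choice, and a walker standing on a dead end always jumps (the slide does not say what happens there).

```python
import random

def random_walk_visits(links, alpha=0.15, steps=100_000, seed=0):
    """Estimate the fraction of time a random surfer spends on each page."""
    rng = random.Random(seed)
    pages = sorted(links)
    visits = {page: 0 for page in pages}
    current = pages[0]
    for _ in range(steps):
        visits[current] += 1
        outgoing = links[current]
        # Assumption: with no links to follow, the walker always jumps.
        if not outgoing or rng.random() < alpha:
            current = rng.choice(pages)      # jump to any random page
        else:
            current = rng.choice(outgoing)   # randomly follow a link
    return {page: count / steps for page, count in visits.items()}

# Hypothetical graph: D is a dead end reachable only through B.
links = {
    "A": ["B", "C"],
    "B": ["C", "D"],
    "C": ["A"],
    "D": [],
}
freq = random_walk_visits(links)
```

The resulting visit fractions are only an estimate of the underlying distribution; longer walks give a closer one.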
Random Walk Scalability • Problem: Would need to simulate the random walk over and over again to even come close to discovering the underlying probability distribution. • Easy to do for small graphs. • Pain in the ass for large ones. • Markov Chain • Tool for analyzing stochastic processes. • Power method
Power Method Equation • N : Number of documents • Rk : Page rank of document k • Lk : Number of outgoing links in k • δ(k,j) : Delta function for links between k and j • δ(k,j) = 1 if and only if there exists a link from document k to document j
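Read together with the intuition stated later in the deck (a document's page rank is the sum of its fair share of the page ranks of the pages that link to it), these symbols suggest the update rule below. It is a reconstruction from the definitions, not necessarily the exact form shown on the slide, and the iteration superscript $(i)$ is an added assumption.

$$R_j^{(i+1)} = \sum_{k=1}^{N} \delta(k,j)\,\frac{R_k^{(i)}}{L_k}$$

Terms for documents with $L_k = 0$ drop out, because $\delta(k,j) = 0$ for every $j$ when document $k$ has no outgoing links.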
Power Method Equation • Our definition is circular. • To calculate page rank of a page we need to already know the page rank of other pages. • Iterative solution. • Start with an initial assignment. • Basically set the page rank of every page to 1/N. • Why 1/N? • Calculate an updated value for every page using the current values. • Keep repeating until the values are stable.
Power Method Equation • Intuition: • Page rank of a document is the sum of its fair share of the page ranks of the pages that link to the document.
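The iterative solution and the fair-share intuition can be sketched in a few lines. The graph and the fixed iteration count are illustrative assumptions; this version deliberately has no fix for dead ends.

```python
def power_method_v1(links, iterations=10):
    """Repeatedly give each page the fair shares of the pages that link to it."""
    n = len(links)
    rank = {page: 1.0 / n for page in links}  # initial assignment: 1/N each
    for _ in range(iterations):
        updated = {page: 0.0 for page in links}
        for page, outgoing in links.items():
            for target in outgoing:
                # A page splits its rank equally over its outgoing links.
                updated[target] += rank[page] / len(outgoing)
        rank = updated
    return rank

# Hypothetical graph with one dead end.
links = {
    "A": ["B", "C"],
    "B": ["C", "D"],
    "C": ["A"],
    "D": [],  # dead end: whatever rank arrives here is never passed on
}
rank = power_method_v1(links)
total = sum(rank.values())
```

On this graph `total` falls below 1 after the first iteration and keeps shrinking, because the rank that reaches `D` is dropped at every step.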
Example i= 0 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
Example i= 1 0.025 0.075 0.125 0.05 0.1 0.1 0.1 0.2 0.125 0
Example Something is wrong! The values no longer sum to 1 (they total about 0.74). i= 10 0.015 0.051 0.189 0.036 0.134 0.072 0.154 0.071 0.015 0
Power Method v2 • Dead ends leak. • Spider traps slowly collect everything. • Translating our random walk solution: • Add a “virtual” link from every document to every other document. • Define a weighting factor α between 0.0 and 1.0 • Distribute α proportion of your page rank over the virtual links • Distribute (1- α) proportion of your page rank over the real links
Convergence • Typical value for α is 0.15. • Convergence typically occurs in about 50 iterations even for large graphs.
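A sketch of the power method with virtual links and a stability check follows. Several details are assumptions rather than anything stated on the slides: the virtual links include a link from each page to itself (so the α share is split over all N pages), dead ends spread all of their rank over the virtual links, and the tolerance, iteration cap, and graph are illustrative.

```python
def power_method_v2(links, alpha=0.15, tol=1e-9, max_iterations=200):
    """Power method with virtual links; returns (ranks, iterations used)."""
    n = len(links)
    rank = {page: 1.0 / n for page in links}
    for iteration in range(1, max_iterations + 1):
        total = sum(rank.values())
        dead_end_mass = sum(rank[p] for p, out in links.items() if not out)
        # Rank spread evenly over the virtual links: the alpha share of every
        # page, plus the remaining share of dead ends (an assumption).
        shared = alpha * total + (1.0 - alpha) * dead_end_mass
        updated = {page: shared / n for page in links}
        for page, outgoing in links.items():
            for target in outgoing:
                # The (1 - alpha) share goes over the real links.
                updated[target] += (1.0 - alpha) * rank[page] / len(outgoing)
        change = sum(abs(updated[p] - rank[p]) for p in links)
        rank = updated
        if change < tol:  # values are stable
            return rank, iteration
    return rank, max_iterations

# Hypothetical graph: D is a dead end, E is a one-page spider trap.
links = {
    "A": ["B", "C", "E"],
    "B": ["C", "D"],
    "C": ["A"],
    "D": [],
    "E": ["E"],
}
rank, iterations = power_method_v2(links)
```

With this handling the ranks keep summing to 1, and the spider trap `E` no longer collects everything, since an α share of its rank is redistributed at every step.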
Example i= 10 0.024 0.074 0.115 0.061 0.112 0.073 0.107 0.105 0.034 0.011
Example i= 10, side by side • Power method v1: 0.015 0.051 0.189 0.036 0.134 0.072 0.154 0.071 0.015 0 • Power method v2: 0.024 0.074 0.115 0.061 0.112 0.073 0.107 0.105 0.034 0.011
Billions and billions • How do you do this with billions of documents? • Can be implemented using matrix math. • Special techniques for sparse matrices. • PageRank is roughly equivalent to the first (principal) eigenvector of the link matrix.
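The matrix view can be illustrated without any libraries. This is only a sketch of the idea, not a technique for billions of documents: it stores just the nonzero entries of the link matrix, applies the matrix repeatedly, and then checks that the result behaves like an eigenvector with eigenvalue 1. The function names, the three-page graph (chosen with no dead ends to keep it short), and α = 0.15 are assumptions.

```python
def build_sparse_columns(links, pages):
    """Keep only nonzero entries: per source page, a list of (target index, weight)."""
    position = {page: i for i, page in enumerate(pages)}
    return [
        [(position[target], 1.0 / len(links[page])) for target in links[page]]
        for page in pages
    ]

def apply_damped_matrix(columns, vector, alpha=0.15):
    """Multiply the vector by the link matrix mixed with uniform random jumps."""
    n = len(vector)
    result = [alpha * sum(vector) / n] * n
    for source, entries in enumerate(columns):
        for target, weight in entries:
            result[target] += (1.0 - alpha) * weight * vector[source]
    return result

pages = ["A", "B", "C"]
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
columns = build_sparse_columns(links, pages)

vector = [1.0 / len(pages)] * len(pages)
for _ in range(100):
    vector = apply_damped_matrix(columns, vector)

# An eigenvector with eigenvalue 1 is unchanged by one more multiplication.
after = apply_damped_matrix(columns, vector)
residual = max(abs(a - b) for a, b in zip(after, vector))
```

The work per multiplication is proportional to the number of links rather than to N squared, which is the point of keeping the matrix sparse.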
Gaming The System • Google Bomb! • Create a lot of links to the page that you want to be highly ranked. • Create your own spider trap. • Relatively easy to combat by discounting links that come from the same domain. • Comment spam. • Porn trap.
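One simple reading of discounting same-domain links is to drop them before ranking. The sketch below makes two simplifying assumptions: the discount is total (such links count for nothing), and "domain" means the URL's hostname. The URLs and the function name are invented for the example.

```python
from urllib.parse import urlsplit

def cross_domain_links(links):
    """Drop every link whose source and target share a hostname."""
    filtered = {}
    for source, targets in links.items():
        source_host = urlsplit(source).hostname
        filtered[source] = [
            target for target in targets
            if urlsplit(target).hostname != source_host
        ]
    return filtered

# Hypothetical link farm: spam.example pages all point at one target page.
links = {
    "http://spam.example/a": ["http://spam.example/b", "http://spam.example/target"],
    "http://spam.example/b": ["http://spam.example/target"],
    "http://honest.example/post": ["http://spam.example/target"],
}
kept = cross_domain_links(links)
```

After filtering, only the link from `honest.example` still points at the target page, so the farm's internal links no longer inflate its rank.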
Last Notes • Stanford Sucks! • GO HEELS!
Bad Math • When originally presented, the final version of the power method equation was shown as: • The simplification for the first term is wrong and should have been: