Explore how Google orders search results using traditional information retrieval techniques and the PageRank algorithm, which depends on the link structure of the web. This article provides an overview of PageRank and its mathematical definition.
How Google Relies on Discrete Mathematics
Gerald Kruse, Juniata College, Huntingdon, PA
kruse@juniata.edu
http://faculty.juniata.edu/kruse
How does Google order search results so well?
• A mix of traditional information retrieval techniques and PageRank
• PageRank is not a simple citation index
• The algorithm to determine a web-page's PageRank depends SOLELY on the link structure of the web, and NOT on the content of the web-page
• Link information can be determined after web-crawlers traverse each link on each web-page
• Primary Source: Larry Page, Sergey Brin, et al., The PageRank Citation Ranking: Bringing Order to the Web, Stanford Digital Library Technologies Project, 1998.
PageRank is analogous to popularity
• The web as a graph: each page is a vertex, each hyperlink a directed edge
• I am a popular page if a few very popular pages point (via hyperlinks) to me
• I am a popular page if many not-necessarily popular pages point (via hyperlinks) to me
[Figure: three linked pages, Page A, Page B, and Page C.]
Which of these three has the highest page rank?
So what is the mathematical definition of PageRank?
My page's rank is equal to the sum of the ranks of all the pages pointing to me. Note the scaling of each page rank: a page's rank is divided by the number of links leaving that page before it is added to the sum.
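The verbal definition above can be written as a single formula. This is a sketch that follows the simplified definition in the primary source. The symbols are assumptions introduced here, not notation taken from the slides: R(u) is the rank of page u, B_u is the set of pages that link to u, and N_v is the number of links leaving page v.

```latex
R(u) \;=\; \sum_{v \in B_u} \frac{R(v)}{N_v}
```

The division by N_v is the scaling mentioned above: a page with many outgoing links passes only a small share of its rank along each one.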
Writing out the equation for each web-page in our example gives one equation each for Page A, Page B, and Page C.
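The link structure of the example did not survive in the text, so the system below is only an illustration. It assumes a hypothetical set of links (A links to B and C, B links to C, C links to A), chosen because it reproduces the ranks 0.4, 0.2, and 0.4 quoted for this example. The notation R(A), R(B), R(C) for the three ranks is also an assumption.

```latex
\begin{aligned}
R(A) &= R(C) \\
R(B) &= \tfrac{1}{2}\,R(A) \\
R(C) &= \tfrac{1}{2}\,R(A) + R(B)
\end{aligned}
```

Under these assumed links, Page A has two outgoing links, so it passes half of its rank to each of B and C, while B and C each have one outgoing link and pass their full rank along it.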
Even though this is a circular definition, we can calculate the ranks. Rewrite the system of equations as a matrix-vector product. The PageRank vector is simply an eigenvector (scalar*vector = matrix*vector) of the coefficient matrix! (Note: we choose the eigenvector whose entries sum to 1.)
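As a minimal sketch of the eigenvector view, the snippet below builds a coefficient matrix for a hypothetical three-page web (A links to B and C, B links to C, C links to A; this link structure is an assumption, not taken from the slides) and uses NumPy to extract the eigenvector for the eigenvalue 1, scaled so that its entries sum to 1.

```python
import numpy as np

# Hypothetical link structure (an assumption, not from the slides):
# A -> B, A -> C, B -> C, C -> A.
# Entry (i, j) is 1/outdegree(j) if page j links to page i, else 0.
M = np.array([
    [0.0, 0.0, 1.0],   # A receives all of C's rank
    [0.5, 0.0, 0.0],   # B receives half of A's rank
    [0.5, 1.0, 0.0],   # C receives half of A's rank and all of B's
])

eigenvalues, eigenvectors = np.linalg.eig(M)

# Pick the eigenvector whose eigenvalue is closest to 1.
k = np.argmin(np.abs(eigenvalues - 1.0))
rank = np.real(eigenvectors[:, k])

# Scale so the entries sum to 1 (this also fixes the sign).
rank = rank / rank.sum()

for page, value in zip("ABC", rank):
    print(f"Page {page}: {value:.3f}")
```

A full eigendecomposition is only practical for tiny examples like this one. It is used here to check the definition, not as a method that would scale to the web.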
[Figure: the example graph of Page A, Page B, and Page C, labeled with PageRank values 0.4, 0.2, and 0.4.]
Note that the coefficient matrix is stochastic. The eigenvector giving the rank is associated with the dominant eigenvalue of 1.
Some computational issues remain:
- Rank-sinks (endless hyperlink loops)
- Eigenvector calculation on huge matrix
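The rank-sink issue can be seen on a tiny example. The sketch below assumes a hypothetical three-page web in which page 0 links to page 1, while pages 1 and 2 link only to each other. Repeatedly applying the simple formula drains all rank out of page 0 into the loop, and the rank then bounces between pages 1 and 2 without settling.

```python
import numpy as np

# Hypothetical web (an assumption for illustration):
# page 0 -> page 1, page 1 -> page 2, page 2 -> page 1.
# Pages 1 and 2 form a loop with no links leading out of it.
M = np.array([
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
])

# Each column sums to 1, so the matrix is still stochastic.
assert np.allclose(M.sum(axis=0), 1.0)

r = np.full(3, 1.0 / 3.0)   # start with equal rank everywhere
history = []
for step in range(6):
    r = M @ r               # one application of the simple formula
    history.append(r.copy())
    print(step + 1, np.round(r, 3))
```

After the first step page 0 holds no rank at all, and the remaining rank alternates between the two looping pages, so the iteration never converges to a single vector.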
Surf's Up!
Add a random-surfer term to the simple PageRank formula. This models the behavior of a real web-surfer, who might jump to another page by directly typing in a URL or by choosing a bookmark, rather than clicking on a hyperlink.
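One common way to write the random-surfer version is sketched below. The exact form used on the slide is not preserved, so every symbol here is an assumption: d is a damping factor between 0 and 1 (the probability of following a hyperlink), n is the total number of pages, B_u is the set of pages linking to u, and N_v is the number of links leaving page v. The jump is assumed to land on each page with equal probability.

```latex
R(u) \;=\; d \sum_{v \in B_u} \frac{R(v)}{N_v} \;+\; \frac{1 - d}{n}
```

The first term is the simple PageRank formula scaled by d. The second term gives every page a small baseline share of rank, which is what lets rank escape from loops.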
This gives a regular matrix
• In matrix notation we have [equation]
• Since [condition] we can rewrite as [equation]
• The new coefficient matrix is regular, so we can calculate the eigenvector iteratively.
• This iterative process is a series of matrix-vector products, beginning with an initial vector (typically the previous PageRank vector). These products can be calculated without explicitly creating the huge coefficient matrix.
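The iterative process described above can be sketched without ever building the coefficient matrix, by looping over each page's outgoing links. This is an illustration under stated assumptions, not the implementation used by Google: the function name `pagerank`, the damping factor `d = 0.85`, the uniform random jump, the uniform starting vector, and the example link structure are all assumptions, and every page is assumed to have at least one outgoing link.

```python
def pagerank(links, d=0.85, tol=1e-10, max_iter=1000):
    """Iteratively estimate ranks from a dict mapping page -> list of linked pages.

    Assumes every page appears as a key and has at least one outgoing link.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # initial vector: equal ranks

    for _ in range(max_iter):
        # Random-surfer term: every page starts with the same baseline share.
        new_rank = {p: (1.0 - d) / n for p in pages}

        # Link-following term: each page splits d * its rank among its targets.
        # This is the matrix-vector product, done link by link.
        for page, targets in links.items():
            share = d * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share

        change = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if change < tol:                         # stop once the vector settles
            break
    return rank


# Hypothetical three-page web: A -> B, A -> C, B -> C, C -> A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
for page in sorted(ranks):
    print(f"Page {page}: {ranks[page]:.4f}")
```

Each pass costs time proportional to the number of links rather than the square of the number of pages, which is the point of avoiding the explicit matrix. Passing in the previous PageRank vector as the starting point, as the slide suggests, would be a small extension of this sketch.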
Any Questions?
Handouts and slides are also available at http://faculty.juniata.edu/kruse