The PageRank Citation Ranking: Bringing Order to the Web
Presented by Aishwarya Rengamannan 1000669605
Instructor: Dr. Gautam Das
Motivation
• The WWW is huge and heterogeneous.
• Web pages proliferate free of quality control.
• There is commercial interest in manipulating rankings.
• The ‘quality’ of a web page is subjective and depends on the user.
Problem: We need to approximate the overall relative ‘importance’ of web pages.
Solution: Take advantage of the link structure of the web.
Link structure of the Web
• Forward links (outedges): the outgoing links from a web page. C is a forward link of both A and B.
• Back links (inedges): the incoming links to a web page. A and B are back links of C.
Related Work
• Academic paper citations
• Link-based analysis
• Clustering methods that take link structure into account
• Modeling the web as hubs and authorities
Ranking Intuition
• The more backlinks a web page has, the more important it is.
• Backlinks from high-quality pages increase the ranking further.
“A page has high rank if the sum of the ranks of its backlinks is high.”
How about a backlink from www.yahoo.com? One link from such a highly ranked page can outweigh many links from obscure pages.
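Naïve PageRank Calculation
$$R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}$$
• $u$, $v$ --> web pages
• $B_u$ --> the set of pages that link to $u$ (its backlinks)
• $N_v$ --> the number of forward links from $v$
• $R$ --> the ranks of the web pages
• $c < 1$ --> a factor used for normalization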
Matrix Representation
$A$ is a square adjacency matrix with
• rows and columns corresponding to web pages ($u$ and $v$)
• $A_{u,v} = 1/N_u$ if there is an edge from $u$ to $v$
• $A_{u,v} = 0$ if there is no edge
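A minimal sketch of how this matrix could be built from a link list. The three-page graph and the helper name `link_matrix` are assumptions made for illustration; they are not taken from the slides.

```python
def link_matrix(links, pages):
    """Return A as nested lists, with A[u][v] = 1/N_u if u links to v, else 0."""
    index = {page: i for i, page in enumerate(pages)}
    n = len(pages)
    A = [[0.0] * n for _ in range(n)]
    for u, targets in links.items():
        targets = set(targets)              # ignore duplicate links
        for v in targets:
            A[index[u]][index[v]] = 1.0 / len(targets)
    return A

# Hypothetical graph: A and B both link to C, and C links back to A and B.
pages = ["A", "B", "C"]
links = {"A": ["C"], "B": ["C"], "C": ["A", "B"]}
A = link_matrix(links, pages)
```

Each row that belongs to a page with at least one forward link sums to 1, because the page splits its vote evenly among its N_u targets.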
Matrices Revisited
Eigenvalues and eigenvectors:
• Let $A$ be an $n \times n$ matrix.
• $\lambda$ is an eigenvalue of $A$ if there exists a non-zero vector $v$ such that $Av = \lambda v$.
• The vector $v$ is called an eigenvector of $A$ corresponding to $\lambda$.
• We can rewrite $Av = \lambda v$ as $(A - \lambda I)v = 0$, where $I$ is the $n \times n$ identity matrix.
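Matrices Revisited (contd…)
How do we solve for the eigenvalues and eigenvectors?
• Solve the characteristic equation $\det(A - \lambda I) = 0$ for $\lambda$.
• For each $\lambda$, solve $(A - \lambda I)v = 0$ for a non-zero $v$.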
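As a small worked illustration, the sketch below solves the characteristic equation of a 2×2 matrix and reads an eigenvector off the first row of (A − λI)v = 0. The function name `eigen_2x2` and the sample matrix are assumptions for illustration, and the code only handles the case of real eigenvalues with a non-zero upper-right entry.

```python
import math

def eigen_2x2(a, b, c, d):
    """Eigenpairs of [[a, b], [c, d]], assuming real eigenvalues and b != 0."""
    trace, det = a + d, a * d - b * c
    # Characteristic equation: lambda^2 - trace * lambda + det = 0
    disc = math.sqrt(trace * trace - 4 * det)
    pairs = []
    for lam in ((trace + disc) / 2, (trace - disc) / 2):
        # First row of (A - lam*I)v = 0 reads (a - lam)*x + b*y = 0,
        # which v = (b, lam - a) satisfies.
        pairs.append((lam, (b, lam - a)))
    return pairs

# Hypothetical matrix [[2, 1], [1, 2]]
pairs = eigen_2x2(2, 1, 1, 2)
```

For this matrix the eigenvalues are 3 and 1, with eigenvectors proportional to (1, 1) and (1, -1).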
Sample Calculation
[Figure: sample graph with four pages, numbered 1 to 4]
Matrix Representation (contd…)
• $A$ --> square matrix over web pages
• $R$ --> vector over web pages
• To find: the eigenvector corresponding to the dominant (maximum) eigenvalue.
• It can be computed by applying $A$ to a starting vector repeatedly until the result converges to the dominant eigenvector.
Matrix notation gives $R = cAR$, so $R$ is an eigenvector of $A$ with eigenvalue $1/c$.
[Slide values for $R$ and normalized $R$ are not recoverable from the text]
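The repeated iteration can be sketched as a plain power iteration. This is an illustrative sketch, not the presenter's code: the three-page graph is hypothetical, and the matrix is stored so that `M[i][j] = 1/N_j` when page j links to page i (the transpose of the A_{u,v} layout), so that multiplying by a rank vector sums over backlinks.

```python
def power_iteration(M, iterations=100):
    """Return (eigenvalue estimate, L1-normalized eigenvector estimate) for M."""
    n = len(M)
    r = [1.0 / n] * n                       # arbitrary starting vector
    eigenvalue = 0.0
    for _ in range(iterations):
        nxt = [sum(M[i][j] * r[j] for j in range(n)) for i in range(n)]
        # Growth factor of the L1 norm; a reasonable eigenvalue estimate
        # for non-negative matrices, since sum(|r|) == 1 before the step.
        eigenvalue = sum(abs(x) for x in nxt)
        r = [x / eigenvalue for x in nxt]
    return eigenvalue, r

# Hypothetical graph: page 0 links to 1 and 2, page 1 links to 2, page 2 links to 0.
M = [
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
]
eigenvalue, ranks = power_iteration(M)
```

On this graph the iteration settles on ranks proportional to (2, 1, 2), with a dominant eigenvalue of 1.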
Problem with Naïve PageRank
Rank sink:
• Consider two web pages that point to each other but to no other page, and a third page that points to one of them.
• During iteration, the loop accumulates rank but never distributes it, since it has no outedges to the rest of the web.
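The rank sink can be reproduced on a three-page example. This is a sketch under assumptions: the page names are hypothetical, and the normalization factor c is left out so that the total rank is easy to follow.

```python
def naive_step(links, ranks):
    """One naive update: each page splits its rank evenly among its forward links."""
    new = {page: 0.0 for page in ranks}
    for v, targets in links.items():
        for u in targets:
            new[u] += ranks[v] / len(targets)
    return new

# p1 and p2 point only to each other; p3 points into the loop.
links = {"p1": ["p2"], "p2": ["p1"], "p3": ["p1"]}
ranks = {page: 1.0 / 3 for page in links}
for _ in range(10):
    ranks = naive_step(links, ranks)
```

After the first step p3 holds no rank at all, and everything it passed on stays inside the p1/p2 loop.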
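Solution – Extended version of PageRank
Introducing a rank source:
• $E(u)$: a vector over the web pages that corresponds to a source of rank.
$$R'(u) = c \sum_{v \in B_u} \frac{R'(v)}{N_v} + c\,E(u)$$
such that $c$ is maximized and $\|R'\|_1 = 1$.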
Random Surfer Model
• Random surfer: clicks on successive links at random.
• The factor ‘E’ can be viewed as modeling this behavior.
• The surfer periodically gets bored and jumps to a random page chosen according to E.
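The surfer can be simulated directly. The sketch below is illustrative only: the graph, the boredom probability of 0.15, the uniform choice of jump target (a uniform E), and the fixed seed are all assumptions that do not come from the slides.

```python
import random

def random_surfer(links, pages, steps=20000, bored=0.15, seed=0):
    """Return the fraction of steps the surfer spends on each page."""
    rng = random.Random(seed)
    visits = {page: 0 for page in pages}
    page = rng.choice(pages)
    for _ in range(steps):
        visits[page] += 1
        targets = links.get(page, [])
        if not targets or rng.random() < bored:
            page = rng.choice(pages)        # bored or stuck: jump using a uniform E
        else:
            page = rng.choice(targets)      # follow a random forward link
    return {p: count / steps for p, count in visits.items()}

# Hypothetical graph: a, x and y all link to hub; hub links to a.
pages = ["hub", "a", "x", "y"]
links = {"hub": ["a"], "a": ["hub"], "x": ["hub"], "y": ["hub"]}
freq = random_surfer(links, pages)
```

Pages that are reached only through random jumps, such as x and y here, end up with a small visit share, while the heavily linked hub is visited far more often.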
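PageRank Computation
$R_0 \leftarrow S$ - initialize vector over web pages
Loop:
  $R_{i+1} \leftarrow A R_i$ - new ranks: sum of normalized backlink ranks
  $d \leftarrow \|R_i\|_1 - \|R_{i+1}\|_1$ - compute normalizing factor
  $R_{i+1} \leftarrow R_{i+1} + dE$ - add escape term
  $\delta \leftarrow \|R_{i+1} - R_i\|_1$ - control parameter
While $\delta > \epsilon$ - stop when converged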
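The loop on this slide can be sketched in a few lines. This is a minimal sketch under assumptions: the uniform starting vector, the uniform E, the iteration cap, and the four-page graph are chosen for illustration, and E is assumed to sum to 1 so that the total rank stays at 1.

```python
def pagerank(links, E, epsilon=1e-10, max_iterations=1000):
    """links: page -> list of pages it links to. E: page -> share of the rank source."""
    pages = sorted(E)
    R = {p: 1.0 / len(pages) for p in pages}       # initialize (uniform start assumed)
    for _ in range(max_iterations):
        new = {p: 0.0 for p in pages}
        for v, targets in links.items():
            for u in targets:
                new[u] += R[v] / len(targets)       # sum of normalized backlink ranks
        d = sum(R.values()) - sum(new.values())     # rank lost in this step
        for p in pages:
            new[p] += d * E[p]                      # add escape term
        delta = sum(abs(new[p] - R[p]) for p in pages)
        R = new
        if delta <= epsilon:                        # stop when converged
            break
    return R

# Hypothetical graph; page "d" has no forward links, so rank leaks through it.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a", "d"]}
E = {page: 0.25 for page in "abcd"}
R = pagerank(links, E)
```

In this sketch d is non-zero only because page "d" has no forward links; on a graph where every page links somewhere, the matrix step already preserves the total and the escape term adds nothing.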
Another Problem?
Dangling links:
• Links that point to a page with no outgoing links
• It is not clear where their weight should be distributed
Solution: Remove them from the system until all other PageRanks have been calculated, then add them back.
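One way the removal step might look is sketched below. The helper name `remove_dangling` and the sample graph are assumptions; the sketch repeats the removal because dropping one dangling link can leave another page without forward links.

```python
def remove_dangling(links):
    """Drop links to pages with no forward links, repeating until none remain."""
    links = {u: set(vs) for u, vs in links.items() if vs}
    while True:
        # A target is dangling if it has no forward links of its own.
        dangling = {v for vs in links.values() for v in vs if v not in links}
        if not dangling:
            return links
        links = {u: vs - dangling for u, vs in links.items()}
        links = {u: vs for u, vs in links.items() if vs}

# Hypothetical graph: d has no forward links, and c links only to d.
links = {"a": ["b", "c"], "b": ["a"], "c": ["d"], "d": []}
core = remove_dangling(links)
```

Here removing the link to d leaves c without forward links, so the link from a to c is removed on the next pass and only the a/b pair remains.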
Implementation
• The web crawler keeps a database of URLs so that it can discover all URLs on the web
• To implement PageRank, the web crawler builds an index of links as it crawls
Problems???
• Infinitely large sites
• Incorrect or broken HTML
• Sites that are down
• The web is always changing
PageRank Implementation
• Convert each URL into a unique integer ID
• Sort the link structure by ID
• Remove dangling links
• Make an initial assignment of ranks and iterate until convergence
• Add the dangling links back
• Iterate again to assign weights to the dangling links
• The link database A is read linearly because it is sorted, so it can be kept on disk while the rank weights are held in memory
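The first two steps can be sketched as follows. The function name `assign_ids`, the in-memory dictionary, and the example.com URLs are assumptions for illustration; a real crawler would persist this mapping rather than hold it in a Python dict.

```python
def assign_ids(url_links):
    """Map each URL to an integer ID and return the links as sorted (source, target) pairs."""
    ids = {}

    def get_id(url):
        if url not in ids:
            ids[url] = len(ids)             # next unused integer
        return ids[url]

    pairs = []
    for source, targets in url_links.items():
        for target in targets:
            pairs.append((get_id(source), get_id(target)))
    pairs.sort()                            # link structure sorted by ID
    return ids, pairs

# Hypothetical crawl result
url_links = {
    "http://example.com/b": ["http://example.com/a"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
}
ids, pairs = assign_ids(url_links)
```

Sorting the pairs by source ID means every later pass over the links is a single sequential scan.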
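Convergence Properties
• Interpret the web as an expander-like graph.
• A graph is an expander if every (not too large) subset of nodes S has a neighborhood that is larger than some factor α times |S|.
• Verification: a graph has good expansion if and only if its largest eigenvalue is sufficiently larger than its second-largest eigenvalue.
• A random walk on an expander converges quickly to its limiting distribution, which is why PageRank needs relatively few iterations.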
Applications of PageRank
• Search, browsing, and traffic estimation
• Helping users decide whether a site is trustworthy
• Estimating web traffic
• Spam detection and prevention
• Predicting citation counts