Ranking the Web Frontier. Nadav Eiron, Kevin S. McCurley, John A. Tomlin, IBM Almaden Research Center. WWW’04. CSE 450 Web Mining. Presented by Zaihan Yang.
Introduction & Contribution • Propose algorithmic innovations for the basic PageRank paradigm. • Problem of the Web Frontier (Dangling Nodes) • Distinguish different types of dangling nodes • Propose four techniques for penalty pages • Problem of computing PageRank and of rank manipulation • Explore the Web's hierarchical structure • HostRank & DirRank algorithms
PageRank • Backlinks, random surfer, and recursive computation • Ideal model, in element-wise or matrix form • The web graph should be strongly connected. • A should be stochastic (and the chain irreducible and aperiodic).
PageRank • Improved Model: add a link from each page to every page and give each link a small transition probability controlled by a parameter α. • Random jump (teleportation) • Virtual node n+1 • Variations and issues • Parameter α • Random jump: uniform distribution • Dangling nodes
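The improved model on this slide can be sketched as a plain power iteration. This is a minimal illustration, not the authors' implementation: the function name `pagerank`, the toy graph, and the convention that `alpha` is the probability of following a link (so `1 - alpha` is the teleportation probability) are all assumptions. Rank held by dangling nodes is spread uniformly here, which is only one of the options discussed in the talk.

```python
def pagerank(out_links, alpha=0.85, tol=1e-10, max_iter=200):
    """Power iteration on a dict {node: [successors]}.

    Assumes every successor also appears as a key of out_links.
    alpha: probability of following a link (assumed convention).
    """
    nodes = sorted(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Rank sitting on dangling nodes is redistributed uniformly.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        base = (1.0 - alpha) / n + alpha * dangling / n
        new = {v: base for v in nodes}
        for v in nodes:
            for w in out_links[v]:
                new[w] += alpha * rank[v] / len(out_links[v])
        delta = sum(abs(new[v] - rank[v]) for v in nodes)
        rank = new
        if delta < tol:
            break
    return rank


# Hypothetical four-page graph; "d" is a dangling node with no inlinks.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": []}
ranks = pagerank(graph)
```

The uniform teleportation term is what keeps the chain irreducible even though the toy graph is not strongly connected.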
Dangling Nodes • Dangling nodes: nodes that either have no outlinks or for which no outlinks are known. • How do pages become dangling nodes? • Crawlers might not have crawled them (e.g. dynamic pages). • Protected by robots.txt. • Genuinely have no outlinks: PS, PDF files. • A meta tag tells crawlers not to follow links.
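The definition above (no outlinks, or no known outlinks) can be sketched as a small helper. The function name `find_dangling` and the toy crawl are hypothetical; the sketch assumes the crawl is stored as a dict from URL to the list of outlinks discovered on that page.

```python
def find_dangling(out_links):
    """Return nodes with no known outlinks.

    Covers both crawled pages with an empty outlink list and link targets
    that were never crawled (they do not appear as keys at all).
    """
    seen = set(out_links)
    for targets in out_links.values():
        seen.update(targets)
    return sorted(v for v in seen if not out_links.get(v))


# Hypothetical crawl: "paper.pdf" genuinely has no outlinks,
# "uncrawled" was discovered as a link target but never fetched.
crawl = {
    "home": ["about", "paper.pdf", "uncrawled"],
    "about": ["home"],
    "paper.pdf": [],
}
dangling = find_dangling(crawl)
```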
Handling Dangling Nodes • Remove them, then add them back afterwards. • Random jump • Reduced eigen-system. • Power iteration. • A single step
Penalty Pages and Link Rot • Penalty pages: dangling pages that return an HTTP 403 or 404 status code. • Link rot: links that used to work but are now broken. (Penalty link, dangling link)
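A minimal sketch of this distinction, assuming the crawler records an HTTP status per URL (with `None` for pages never fetched). The names `classify` and `penalty_links` and the sample data are hypothetical illustrations, not part of the paper.

```python
PENALTY_STATUS = {403, 404}  # status codes named on the slide


def classify(page_status, out_links):
    """Label each URL as 'penalty', 'dangling', or 'ordinary'."""
    labels = {}
    for url, status in page_status.items():
        if status in PENALTY_STATUS:
            labels[url] = "penalty"
        elif not out_links.get(url):
            labels[url] = "dangling"
        else:
            labels[url] = "ordinary"
    return labels


def penalty_links(url, out_links, labels):
    """Outlinks of url that point at penalty pages (rotted links)."""
    return [t for t in out_links.get(url, []) if labels.get(t) == "penalty"]


# Hypothetical crawl records.
status = {"a": 200, "b": 404, "c": 403, "d": 200, "e": None}
links = {"a": ["b", "d", "e"], "d": []}
labels = classify(status, links)
```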
Effects of Dangling Nodes on Ranking • Whether teleportation may land on dangling nodes. • Yes: node 3 has the highest rank score. • No: [0.31746, 0.31746, 0.365079]; node 3 scores 0.269841, less than nodes 1 and 2. • The number of dangling links. • 1 link: [0.198684, 0.283124, 0.283124, 0.235068] • 4 links: [0.195954, 0.229266, 0.279234, 0.29554]
Push-back algorithm • If a page has a link to a penalty page, its rank is reduced by a fraction, and the excess rank is returned to the pages that pushed rank to it in the previous iteration. • The page retains the remaining fraction of its rank and distributes the excess among its backlinks.
Self-Loop algorithm • Augment each page with a self-loop link to itself. Page i follows this link with a page-specific probability. b_i is the number of outlinks from i to penalty pages. g_i is the number of outlinks from i to non-penalty pages. • The complementary probability is adjusted accordingly. • Some variations.
Jump-weighting algorithm • Instead of redistributing rank evenly, bias the redistribution so that penalized pages receive less rank. • A straightforward method • Weight the link from the virtual node • to an unpenalized node in C (the strongly connected node set) by a base weight • to a penalized node i by that weight scaled by g_i/(g_i+b_i)
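The weighting above can be sketched as a teleportation distribution. This is an illustration under stated assumptions: the function name `jump_weights` is hypothetical, an unpenalized page is given raw weight 1, and the raw weights are normalized to sum to one (the normalization is an assumption, not something the slide states).

```python
def jump_weights(bad, good):
    """Teleportation distribution biased against penalized pages.

    bad[i]  : number of outlinks from page i to penalty pages (b_i)
    good[i] : number of outlinks from page i to non-penalty pages (g_i)
    """
    raw = {}
    for page in good:
        b, g = bad.get(page, 0), good[page]
        # Unpenalized pages keep full weight; others are scaled by g/(g+b).
        raw[page] = 1.0 if b == 0 else g / (g + b)
    total = sum(raw.values())
    # Normalization to a probability distribution is an assumption.
    return {page: w / total for page, w in raw.items()}


# Hypothetical counts: "p" has no bad links, "r" has mostly bad links.
bad = {"p": 0, "q": 1, "r": 3}
good = {"p": 4, "q": 1, "r": 1}
weights = jump_weights(bad, good)
```

A page whose outlinks mostly lead to penalty pages ends up with a smaller share of the random-jump mass than a page with only working links.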
BHITS algorithm • Random walk in both forward and backward directions. • Forward step: the same as ordinary PageRank. • Backward step: • Non-dangling nodes: self-loop. • Dangling nodes: • Non-penalty nodes: forward the score to the virtual node. • Penalty nodes: divide the score by the number of inlinks and propagate it equally along the backward links. • A penalty page traverses to a random seed node. • Matrix representation
HostRank algorithm • Web hierarchical structure • 62.4% of links are internal to a site. • 82% of outlinks point to the top level of sites. • Do not jump uniformly, but to portal or top-level pages. • Consider all pages on a site as a single body. • Assign them all a rank based on the collective value of information on that site. • Each site is represented by one node in the graph. • The graph becomes smaller and the computation cheaper.
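Collapsing a URL-level graph into one node per site can be sketched with the standard library's `urllib.parse.urlsplit`. The function name `host_graph` and the sample URLs are hypothetical, and two choices here are assumptions rather than details taken from the talk: links internal to a host are dropped, and parallel links between two hosts are counted as an edge multiplicity.

```python
from urllib.parse import urlsplit


def host_graph(url_links):
    """Collapse (source_url, target_url) pairs into host-level edges."""
    edges = {}
    for src, dst in url_links:
        s, d = urlsplit(src).hostname, urlsplit(dst).hostname
        if s is None or d is None or s == d:
            continue  # skip malformed URLs and site-internal links (assumption)
        edges[(s, d)] = edges.get((s, d), 0) + 1
    return edges


# Hypothetical URL-level links.
url_links = [
    ("http://a.example/x", "http://a.example/y"),   # internal to a.example
    ("http://a.example/x", "http://b.example/"),
    ("http://a.example/z", "http://b.example/p"),
    ("http://b.example/", "http://a.example/"),
]
hosts = host_graph(url_links)
```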
DirRank algorithm • HostRank is too coarse a level of granularity, and host sizes follow a heavy-tailed distribution. • DirRank graph • Nodes: groups of URLs sharing the prefix up to the last “/” or “?” (a virtual directory). • Edges: an edge exists if there is a link from a URL in the source virtual directory to a URL in the destination virtual directory.
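The node and edge definitions above can be sketched as follows. The names `virtual_directory` and `dir_graph` are hypothetical, and keeping the delimiter as part of the prefix is an assumption about how the grouping key is formed.

```python
def virtual_directory(url):
    """Prefix of url up to and including its last '/' or '?'."""
    cut = max(url.rfind("/"), url.rfind("?"))
    return url if cut < 0 else url[: cut + 1]


def dir_graph(url_links):
    """Set of (source_dir, target_dir) edges induced by URL-level links."""
    return {(virtual_directory(s), virtual_directory(d)) for s, d in url_links}


# Hypothetical links: both pairs fall into the same pair of virtual directories.
url_links = [
    ("http://example.com/a/b/page.html", "http://example.com/cgi/view?id=7"),
    ("http://example.com/a/b/other.html", "http://example.com/cgi/view?id=9"),
]
edges = dir_graph(url_links)
```

Grouping on "?" as well as "/" puts all pages generated by one dynamic script into a single node.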
Experimental Results • Setup: • Crawl performed at IBM Almaden • More than 1 billion pages; 37 billion links; 4.75 billion URLs. • Results: • Reduced computation. • DirRank: 114 million nodes / 15 billion edges • HostRank: 19.7 million hosts (nodes) / 1.1 billion edges • Enhanced resistance to link manipulation. • 11/20 in 100 million pages vs. 14/100 hostnames • Virtual node probability: 0.82 vs. 0.17
Conclusions • PageRank with uniform teleportation is easily subject to link manipulation. • The HostRank and DirRank algorithms are both cheaper to compute and less subject to link manipulation. • The four proposed techniques for penalty pages can reduce bias and improve ranking performance. • Future work: place the problem of web page ranking on a firmer scientific foundation, in addition to its commercial and economic dimensions.