
Crawling The Web For a Search Engine

This talk covers web crawling for search engines: what crawlers do, the problems they face, the RankMass crawler, refresh policies, and duplicate (DUST) detection. The core of the talk addresses the web coverage problem: formalizing coverage through personalized PageRank, lower-bounding the RankMass of a downloaded collection, and crawling with the L-Neighbor, RankMass, and Windowed RankMass algorithms, including page-level prioritization and a simplified, practical variant of the RankMass crawler.

Presentation Transcript


  1. Crawling The Web For a Search Engine Or Why Crawling is Cool

  2. Talk Outline • What is a crawler? • Some of the interesting problems • RankMass Crawler • As time permits: • Refresh Policies • Duplicate Detection

  3. What is a Crawler? [Diagram: the basic crawl loop — initialize the to-visit URLs with the initial URLs; repeatedly get the next URL, fetch the page from the web, record it in the visited URLs and page repository, extract its URLs, and add them to the to-visit queue]

  4. Applications • Internet Search Engines • Google, Yahoo, MSN, Ask • Comparison Shopping Services • Shopping • Data mining • Stanford Web Base, IBM Web Fountain

  5. Is that it? Not quite

  6. Crawling: The Big Picture • Duplicate Pages • Mirror Sites • Identifying Similar Pages • Templates • Deep Web • When to stop? • Incremental Crawler • Refresh Policies • Evolution of the Web • Crawling the “good” pages first • Focused Crawling • Distributed Crawlers • Crawler Friendly Webservers

  7. Today’s Focus • A crawler which guarantees coverage of the Web • As time permits: • Refresh Policies • Duplicate Detection Techniques

  8. RankMass Crawler A Crawler with High Personalized PageRank Coverage Guarantee

  9. Motivation • Impossible to download the entire web: • Example: many pages from one calendar • When can we stop? • How to gain the most benefit from the pages we download

  10. Main Issues • Crawler guarantee: • A guarantee on how much of the “important” part of the Web the crawler “covers” when it stops crawling • If we don’t see the pages, how do we know how important they are? • Crawler efficiency: • Download “important” pages early during a crawl • Obtain the coverage guarantee with a minimum number of downloads

  11. Outline • Formalize coverage metric • L-Neighbor: Crawling with RankMass guarantee • RankMass: Crawling to achieve high RankMass • Windowed RankMass: How greedy do you want to be? • Experimental Results

  12. Web Coverage Problem • D – The potentially infinite set of documents of the web • DC – The finite set of documents in our document collection • Assign importance weights to each page

  13. Web Coverage Problem • What weights? Per query? Topic? Font? • PageRank? Why PageRank? • Useful as an importance measure • Random surfer model • Effective for ranking

  14. PageRank: a Short Review [Diagram: a small example graph of pages p1–p4 and the links between them]

  15. Now it’s Personal • Personalized PageRank, TrustRank, general PageRank: the variants differ in the trust (random-jump) vector [Diagram: the same example graph, with random jumps directed only to the trusted pages]
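For reference, the personalized PageRank the rest of the talk builds on can be written in the standard formulation (d is the damping factor, ci the out-degree of pi, ti the trust weight of pi with Σi ti = 1): ri = d · Σ over pages pj linking to pi of (rj / cj) + (1 − d) · ti. Setting all ti = 1/N gives ordinary PageRank; concentrating the ti on a few trusted pages gives the TrustRank-style personalization used here.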

  16. RankMass Defined • Using personalized PageRank, formally define the RankMass of DC as RM(DC) = Σ_{pi ∈ DC} ri, the sum of the personalized PageRank of the downloaded pages • Coverage guarantee: • We seek a crawler that, given ε, guarantees that when it stops the downloaded pages DC satisfy RM(DC) ≥ 1 − ε • Efficient crawling: • We seek a crawler that, for a given N, downloads |DC| = N pages s.t. RM(DC) is greater than or equal to that of any other DC′ ⊆ D with |DC′| = N

  17. How to Calculate RankMass • Based on PageRank • How do you compute RM(DC) without downloading the entire web? • We can’t compute the exact value, but we can lower-bound it • Let’s start with a simple case

  18. Single Trusted Page • T(1): t1 = 1; ti = 0 for i ≠ 1 • Always jump to p1 when bored • We can place a lower bound on the probability of being within L links of p1 • NL(p1) = the pages reachable from p1 in L links

  19. Single Trusted Page

  20. Lower Bound Guarantee: Single Trusted Page • Theorem 1: • Assuming the trust vector T(1), the sum of the PageRank values of all L-neighbors of p1 is at least d^(L+1) close to 1. That is: Σ_{pi ∈ NL(p1)} ri ≥ 1 − d^(L+1)

  21. Lower Bound Guarantee: General Case • Theorem 2: • The RankMass of the L-neighbors of the group of all trusted pages G, NL(G), is at least d^(L+1) close to 1. That is: Σ_{pi ∈ NL(G)} ri ≥ 1 − d^(L+1)

  22. RankMass Lower Bound • Lower bound given a single trusted page • Extension: Given a set of trusted pages G • That’s the basis of the crawling algorithm with a coverage guarantee

  23. The L-Neighbor Crawler • L := 0 • N[0] = {pi | ti > 0} // Start with the trusted pages • While (ε < d^(L+1)): • Download all uncrawled pages in N[L] • N[L + 1] = {all pages linked to by a page in N[L]} • L = L + 1
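A minimal Python sketch of the L-Neighbor loop, assuming a fetch_page(url) helper that downloads a page and returns its out-links (the helper and graph access are illustrative, not part of the talk):

    def l_neighbor_crawl(trusted_urls, fetch_page, d=0.85, epsilon=0.01):
        # Breadth-first crawl outward from the trusted pages until the
        # guarantee RankMass >= 1 - d^(L+1) >= 1 - epsilon holds.
        crawled = {}                   # url -> list of out-links
        frontier = set(trusted_urls)   # N[0]: the trusted pages
        L = 0
        while epsilon < d ** (L + 1):  # stop once d^(L+1) <= epsilon
            next_frontier = set()
            for url in frontier:
                if url not in crawled:
                    crawled[url] = fetch_page(url)   # download uncrawled page in N[L]
                next_frontier.update(crawled[url])   # pages linked to by N[L]
            frontier = next_frontier                 # N[L + 1]
            L += 1
        return crawled

The stopping test mirrors the slide: once d^(L+1) drops below ε, Theorem 2 guarantees the crawled L-neighborhood already holds RankMass of at least 1 − ε.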

  24. But What About Efficiency? • L-Neighbor is similar to BFS • L-Neighbor is simple and efficient • But we may wish to prioritize certain neighborhoods first • Page-level prioritization (e.g., two trusted pages with t0 = 0.99 and t1 = 0.01: the 0.99 neighborhood deserves attention first)

  25. Page Level Prioritizing • We want a more fine-grained, page-level priority • The idea: • Estimate PageRank on a per-page basis • Give high priority to pages with a high PageRank estimate • We cannot calculate the exact PageRank • Instead, calculate a PageRank lower bound for undownloaded pages • …but how?

  26. Probability of Being at Page P [Diagram: at each step the random surfer either clicks a link on the current page or is interrupted and jumps to a trusted page]

  27. Calculating the PageRank Lower Bound • PageRank(p) = the probability that the random surfer is at p • Break each surfer path down by “interrupts”, i.e., jumps to a trusted page • Sum up all paths that start with an interrupt and end with p (see the sketch below) [Diagram: a path starting with an interrupt — a jump to trusted page pj with probability (1 − d)·tj — followed by link clicks toward pi, each contributing a factor d·(1/out-degree), e.g. d·1/3 or d·1/5]
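As a concrete illustration of the path-sum idea (not code from the talk), the contribution of one such path can be computed as follows; trust and out_degree are assumed lookup tables:

    def path_probability(path, trust, out_degree, d=0.85):
        # One surfer path: an interrupt jumps to the trusted page path[0]
        # with probability (1 - d) * t_j, then each link click contributes
        # a factor d * (1 / out-degree of the page being left).
        prob = (1 - d) * trust[path[0]]
        for page in path[:-1]:
            prob *= d / out_degree[page]
        return prob

PageRank(p) is the sum of this quantity over all such paths ending at p, so summing only over the paths discovered so far yields a valid lower bound.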

  28. RankMass Basic Idea [Diagram: a growing crawl frontier in which each page is annotated with its current PageRank lower bound — p1: 0.99, p2: 0.01, p3/p4/p5: 0.25, p6/p7: 0.09 — and the page with the highest bound is downloaded next]

  29. RankMass Crawler: High Level • But that sounds complicated?! • Luckily we don’t need all that • Based on this idea: • Dynamically update lower bound on PageRank • Update total RankMass • Download page with highest lower bound

  30. RankMass Crawler (Shorter) • Variables: • CRM: RankMass lower bound of crawled pages • rmi: lower bound of the PageRank of pi • RankMassCrawl() • CRM = 0 • rmi = (1 − d)·ti for each ti > 0 • While (CRM < 1 − ε): • Pick the pi with the largest rmi • Download pi if not downloaded yet • CRM = CRM + rmi • Foreach pj linked to by pi: • rmj = rmj + (d / ci)·rmi • rmi = 0
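A sketch of this greedy loop in Python, under the same assumed fetch_page(url) helper as before and a trust dict mapping the trusted URLs to their ti values:

    def rankmass_crawl(trust, fetch_page, d=0.85, epsilon=0.01):
        # Greedy RankMass crawl: repeatedly take the page with the
        # highest current PageRank lower bound rm_i.
        rm = {url: (1 - d) * t for url, t in trust.items() if t > 0}
        crawled = {}                    # url -> list of out-links
        crm = 0.0                       # CRM: RankMass lower bound of crawled pages
        while crm < 1 - epsilon and rm:
            url = max(rm, key=rm.get)   # pick p_i with the largest rm_i
            rm_i = rm.pop(url)          # removing it plays the role of rm_i = 0
            if url not in crawled:
                crawled[url] = fetch_page(url)       # download p_i if new
            crm += rm_i                              # its mass is now collected
            links = crawled[url]
            for out in links:                        # rm_j += (d / c_i) * rm_i
                rm[out] = rm.get(out, 0.0) + d * rm_i / len(links)
        return crawled, crm

Because every propagated term corresponds to a real surfer path, rm_i never overestimates PageRank, so CRM remains a true lower bound on the RankMass collected so far.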

  31. Greedy vs Simple • L-Neighbor is simple • RankMass is very greedy. • Update expensive: random access to web graph • Compromise? • Batching • downloads together • updates together

  32. Windowed RankMass • Variables: • CRM: RankMass lower bound of crawled pages • rmi: lower bound of the PageRank of pi • Crawl() • rmi = (1 − d)·ti for each ti > 0 • While (CRM < 1 − ε): • Download the top window% of pages according to rmi • Foreach downloaded page pi ∈ DC: • CRM = CRM + rmi • Foreach pj linked to by pi: • rmj = rmj + (d / ci)·rmi • rmi = 0
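The batched variant differs only in downloading the top window% of the frontier each round before applying the updates; a sketch under the same assumptions:

    def windowed_rankmass_crawl(trust, fetch_page, d=0.85,
                                epsilon=0.01, window=0.20):
        # Windowed RankMass: per round, download the top `window` fraction
        # of pages by rm_i, then batch all the lower-bound updates.
        rm = {url: (1 - d) * t for url, t in trust.items() if t > 0}
        crawled = {}
        crm = 0.0
        while crm < 1 - epsilon and rm:
            ranked = sorted(rm, key=rm.get, reverse=True)
            batch = ranked[:max(1, int(len(ranked) * window))]
            for url in batch:                        # batched downloads
                if url not in crawled:
                    crawled[url] = fetch_page(url)
            for url in batch:                        # batched updates
                rm_i = rm.pop(url)
                crm += rm_i
                links = crawled[url]
                for out in links:
                    rm[out] = rm.get(out, 0.0) + d * rm_i / len(links)
        return crawled, crm

Large windows behave more like L-Neighbor (cheap, batch-friendly updates); a window of a single page degenerates to the greedy RankMass crawler.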

  33. Experimental Setup • HTML files only • Algorithms simulated over the web graph • Crawled between Dec 2003 and Jan 2004 • 141 million URLs spanning 6.9 million host names • 233 top-level domains

  34. Metrics Of Evaluation • How much RankMass is collected during the crawl • How much RankMass is “known” to have been collected during the crawl • How much computational and performance overhead the algorithm introduces.

  35. L-Neighbor

  36. RankMass

  37. Windowed RankMass

  38. Window Size

  39. Algorithm Efficiency

  40. Algorithm Running Time

  41. Refresh Policies

  42. Refresh Policy: Problem Definition • You have N URLs that you want to keep fresh • Limited resources: f documents / second • Choose a download order that maximizes average freshness • What do you do? • Note: we can’t always know how the page currently looks

  43. The Optimal Solution • Depends on the definition of freshness • Boolean freshness: • A page is either fresh or not • One small change makes it stale

  44. Understanding Freshness Better • A two-page database • Pd changes daily • Pw changes once a week • We can refresh one page per week • How should we visit the pages? • Uniform: Pd, Pw, Pd, Pw, Pd, Pw, … • Proportional (visits in proportion to change rate): Pd, Pd, Pd, Pd, Pd, Pd, Pw, … • Other?

  45. Proportional Is Often Not Good! • Visiting the fast-changing Pd gains about half a day of freshness • Visiting the slow-changing Pw gains about half a week of freshness • Visiting Pw is the better deal!
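A back-of-the-envelope way to see this (an illustrative model, not the talk's exact analysis): right after a refresh, a page that changes at a known interval stays fresh for roughly half that interval on average, so each visit buys about half a day of freshness for Pd but about half a week for Pw:

    # Illustrative model: the page changes once per interval, at a uniformly
    # random moment, so a refresh keeps it fresh for ~interval/2 on average.
    def expected_fresh_days(change_interval_days):
        return change_interval_days / 2.0

    print(expected_fresh_days(1))   # Pd: ~0.5 days of freshness per visit
    print(expected_fresh_days(7))   # Pw: ~3.5 days of freshness per visit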

  46. Optimal Refresh Frequency • Problem: given the pages’ change frequencies λ1, …, λN and the total download resource f, find the refresh frequencies f1, …, fN (whose sum is constrained by f) that maximize the average freshness of the collection

  47. Optimal Refresh Frequency • Shape of curve is the same in all cases • Holds for any change frequency distribution

  48. Do Not Crawl in the DUST: Different URLs, Similar Text • Ziv Bar-Yossef (Technion and Google) • Idit Keidar (Technion) • Uri Schonfeld (UCLA)

  49. DUST – Different URLs, Similar Text • Even the WWW gets dusty. Examples: • Default directory files: “/index.html” ↔ “/” • Domain names and virtual hosts: “news.google.com” ↔ “google.com/news” • Aliases and symbolic links: “~shuri” ↔ “/people/shuri” • Parameters with little effect on content: “?Print=1” • URL transformations: “/story_<num>” → “story?id=<num>”
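To make the idea concrete, here is a small illustrative sketch (not the DustBuster algorithm from the paper) that canonicalizes URLs by applying rewrite rules of the kind listed above; the rules themselves are made-up examples:

    import re

    # Example DUST-style rules as (pattern, replacement) pairs; illustrative only.
    RULES = [
        (re.compile(r"/index\.html$"), "/"),                    # default directory file
        (re.compile(r"^http://news\.google\.com/"), "http://google.com/news/"),
        (re.compile(r"[?&]Print=1$"), ""),                      # parameter with no effect
    ]

    def canonicalize(url):
        # Apply the rules until a fixed point, so every DUST alias maps to
        # one canonical URL and is fetched and indexed only once.
        changed = True
        while changed:
            changed = False
            for pattern, repl in RULES:
                new_url = pattern.sub(repl, url)
                if new_url != url:
                    url, changed = new_url, True
        return url

    print(canonicalize("http://news.google.com/index.html"))    # http://google.com/news/

The paper's contribution is mining such rules automatically from URL lists; the sketch only shows how known rules would be applied.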

  50. Why Care About DUST? • Reduce crawling and indexing: avoid fetching the same document more than once • Canonization for better ranking: references to a document may be split among its aliases • Avoid returning duplicate results • Many algorithms that use URLs as unique IDs will benefit
