
Crawling The Web For a Search Engine

This talk covers web crawling for search engines: what crawlers do, the problems they face, the RankMass crawler, refresh policies, and duplicate (DUST) detection. The core of the talk addresses the web coverage problem: formalizing coverage through personalized PageRank, lower-bounding the RankMass of a downloaded collection, and crawling with the L-Neighbor, RankMass, and Windowed RankMass algorithms, including page-level prioritization and a simplified, practical variant of the RankMass crawler.

Presentation Transcript


  1. Crawling The Web For a Search Engine Or Why Crawling is Cool

  2. Talk Outline • What is a crawler? • Some of the interesting problems • RankMass Crawler • As time permits: • Refresh Policies • Duplicate Detection

  3. What is a Crawler? [Diagram: the basic crawl loop — initialize the to-visit URLs with the initial URLs; repeatedly get the next URL, fetch the page from the web, record it in the visited URLs and page repository, extract its URLs, and add them to the to-visit queue]

  4. Applications • Internet Search Engines • Google, Yahoo, MSN, Ask • Comparison Shopping Services • Shopping • Data mining • Stanford Web Base, IBM Web Fountain

  5. Is that it? Not quite

  6. Crawling: The Big Picture • Duplicate Pages • Mirror Sites • Identifying Similar Pages • Templates • Deep Web • When to stop? • Incremental Crawler • Refresh Policies • Evolution of the Web • Crawling the “good” pages first • Focused Crawling • Distributed Crawlers • Crawler Friendly Webservers

  7. Today’s Focus • A crawler which guarantees coverage of the Web • As time permits: • Refresh Policies • Duplicate Detection Techniques

  8. RankMass Crawler A Crawler with High Personalized PageRank Coverage Guarantee

  9. Motivation • Impossible to download the entire web: • Example: many pages from one calendar • When can we stop? • How to gain the most benefit from the pages we download

  10. Main Issues • Crawler guarantee: • A guarantee on how much of the “important” part of the Web the crawler “covers” when it stops crawling • If we don’t see the pages, how do we know how important they are? • Crawler efficiency: • Download “important” pages early during a crawl • Obtain the coverage guarantee with a minimum number of downloads

  11. Outline • Formalize coverage metric • L-Neighbor: Crawling with RankMass guarantee • RankMass: Crawling to achieve high RankMass • Windowed RankMass: How greedy do you want to be? • Experimental Results

  12. Web Coverage Problem • D – The potentially infinite set of documents of the web • DC – The finite set of documents in our document collection • Assign importance weights to each page

  13. Web Coverage Problem • What weights? Per query? Topic? Font? • PageRank? Why PageRank? • Useful as an importance measure • Random surfer model • Effective for ranking

  14. PageRank: a Short Review [Diagram: a small example graph of pages p1–p4 and the links between them]

  15. Now it’s Personal • Personalized PageRank, TrustRank, general PageRank: the variants differ in the trust (random-jump) vector [Diagram: the same example graph, with random jumps directed only to the trusted pages]
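For reference, the personalized PageRank the rest of the talk builds on can be written in the standard formulation (d is the damping factor, ci the out-degree of pi, ti the trust weight of pi with Σi ti = 1): ri = d · Σ over pages pj linking to pi of (rj / cj) + (1 − d) · ti. Setting all ti = 1/N gives ordinary PageRank; concentrating the ti on a few trusted pages gives the TrustRank-style personalization used here.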

  16. RankMass Defined • Using personalized PageRank, formally define the RankMass of DC as RM(DC) = Σ_{pi ∈ DC} ri, the sum of the personalized PageRank of the downloaded pages • Coverage guarantee: • We seek a crawler that, given ε, guarantees that when it stops the downloaded pages DC satisfy RM(DC) ≥ 1 − ε • Efficient crawling: • We seek a crawler that, for a given N, downloads |DC| = N pages s.t. RM(DC) is greater than or equal to that of any other DC′ ⊆ D with |DC′| = N

  17. How to Calculate RankMass • Based on PageRank • How do you compute RM(DC) without downloading the entire web? • We can’t compute the exact value, but we can lower-bound it • Let’s start with a simple case

  18. Single Trusted Page • T(1): t1 = 1; ti = 0 for i ≠ 1 • Always jump to p1 when bored • We can place a lower bound on the probability of being within L links of p1 • NL(p1) = the pages reachable from p1 in L links

  19. Single Trusted Page

  20. Lower Bound Guarantee: Single Trusted Page • Theorem 1: • Assuming the trust vector T(1), the sum of the PageRank values of all L-neighbors of p1 is at least d^(L+1) close to 1. That is: Σ_{pi ∈ NL(p1)} ri ≥ 1 − d^(L+1)

  21. Lower Bound Guarantee: General Case • Theorem 2: • The RankMass of the L-neighbors of the group of all trusted pages G, NL(G), is at least d^(L+1) close to 1. That is: Σ_{pi ∈ NL(G)} ri ≥ 1 − d^(L+1)

  22. RankMass Lower Bound • Lower bound given a single trusted page • Extension: Given a set of trusted pages G • That’s the basis of the crawling algorithm with a coverage guarantee

  23. The L-Neighbor Crawler • L := 0 • N[0] = {pi | ti > 0} // Start with the trusted pages • While (ε < d^(L+1)): • Download all uncrawled pages in N[L] • N[L + 1] = {all pages linked to by a page in N[L]} • L = L + 1
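A minimal Python sketch of the L-Neighbor loop, assuming a fetch_page(url) helper that downloads a page and returns its out-links (the helper and graph access are illustrative, not part of the talk):

    def l_neighbor_crawl(trusted_urls, fetch_page, d=0.85, epsilon=0.01):
        # Breadth-first crawl outward from the trusted pages until the
        # guarantee RankMass >= 1 - d^(L+1) >= 1 - epsilon holds.
        crawled = {}                   # url -> list of out-links
        frontier = set(trusted_urls)   # N[0]: the trusted pages
        L = 0
        while epsilon < d ** (L + 1):  # stop once d^(L+1) <= epsilon
            next_frontier = set()
            for url in frontier:
                if url not in crawled:
                    crawled[url] = fetch_page(url)   # download uncrawled page in N[L]
                next_frontier.update(crawled[url])   # pages linked to by N[L]
            frontier = next_frontier                 # N[L + 1]
            L += 1
        return crawled

The stopping test mirrors the slide: once d^(L+1) drops below ε, Theorem 2 guarantees the crawled L-neighborhood already holds RankMass of at least 1 − ε.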

  24. But What About Efficiency? • L-Neighbor is similar to BFS • L-Neighbor is simple and efficient • But we may wish to prioritize certain neighborhoods first • Page-level prioritization (e.g., two trusted pages with t0 = 0.99 and t1 = 0.01: the 0.99 neighborhood deserves attention first)

  25. Page Level Prioritizing • We want a more fine-grained, page-level priority • The idea: • Estimate PageRank on a per-page basis • Give high priority to pages with a high PageRank estimate • We cannot calculate the exact PageRank • Instead, calculate a PageRank lower bound for undownloaded pages • …but how?

  26. Probability of Being at Page P [Diagram: at each step the random surfer either clicks a link on the current page or is interrupted and jumps to a trusted page]

  27. Calculating the PageRank Lower Bound • PageRank(p) = the probability that the random surfer is at p • Break each surfer path down by “interrupts”, i.e., jumps to a trusted page • Sum up all paths that start with an interrupt and end with p (see the sketch below) [Diagram: a path starting with an interrupt — a jump to trusted page pj with probability (1 − d)·tj — followed by link clicks toward pi, each contributing a factor d·(1/out-degree), e.g. d·1/3 or d·1/5]
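As a concrete illustration of the path-sum idea (not code from the talk), the contribution of one such path can be computed as follows; trust and out_degree are assumed lookup tables:

    def path_probability(path, trust, out_degree, d=0.85):
        # One surfer path: an interrupt jumps to the trusted page path[0]
        # with probability (1 - d) * t_j, then each link click contributes
        # a factor d * (1 / out-degree of the page being left).
        prob = (1 - d) * trust[path[0]]
        for page in path[:-1]:
            prob *= d / out_degree[page]
        return prob

PageRank(p) is the sum of this quantity over all such paths ending at p, so summing only over the paths discovered so far yields a valid lower bound.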

  28. RankMass Basic Idea [Diagram: a growing crawl frontier in which each page is annotated with its current PageRank lower bound — p1: 0.99, p2: 0.01, p3/p4/p5: 0.25, p6/p7: 0.09 — and the page with the highest bound is downloaded next]

  29. RankMass Crawler: High Level • But that sounds complicated?! • Luckily we don’t need all that • Based on this idea: • Dynamically update lower bound on PageRank • Update total RankMass • Download page with highest lower bound

  30. RankMass Crawler (Shorter) • Variables: • CRM: RankMass lower bound of crawled pages • rmi: lower bound of the PageRank of pi • RankMassCrawl() • CRM = 0 • rmi = (1 − d)·ti for each ti > 0 • While (CRM < 1 − ε): • Pick the pi with the largest rmi • Download pi if not downloaded yet • CRM = CRM + rmi • Foreach pj linked to by pi: • rmj = rmj + (d / ci)·rmi • rmi = 0
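A sketch of this greedy loop in Python, under the same assumed fetch_page(url) helper as before and a trust dict mapping the trusted URLs to their ti values:

    def rankmass_crawl(trust, fetch_page, d=0.85, epsilon=0.01):
        # Greedy RankMass crawl: repeatedly take the page with the
        # highest current PageRank lower bound rm_i.
        rm = {url: (1 - d) * t for url, t in trust.items() if t > 0}
        crawled = {}                    # url -> list of out-links
        crm = 0.0                       # CRM: RankMass lower bound of crawled pages
        while crm < 1 - epsilon and rm:
            url = max(rm, key=rm.get)   # pick p_i with the largest rm_i
            rm_i = rm.pop(url)          # removing it plays the role of rm_i = 0
            if url not in crawled:
                crawled[url] = fetch_page(url)       # download p_i if new
            crm += rm_i                              # its mass is now collected
            links = crawled[url]
            for out in links:                        # rm_j += (d / c_i) * rm_i
                rm[out] = rm.get(out, 0.0) + d * rm_i / len(links)
        return crawled, crm

Because every propagated term corresponds to a real surfer path, rm_i never overestimates PageRank, so CRM remains a true lower bound on the RankMass collected so far.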

  31. Greedy vs Simple • L-Neighbor is simple • RankMass is very greedy. • Update expensive: random access to web graph • Compromise? • Batching • downloads together • updates together

  32. Windowed RankMass • Variables: • CRM: RankMass lower bound of crawled pages • rmi: lower bound of the PageRank of pi • Crawl() • rmi = (1 − d)·ti for each ti > 0 • While (CRM < 1 − ε): • Download the top window% of pages according to rmi • Foreach downloaded page pi ∈ DC: • CRM = CRM + rmi • Foreach pj linked to by pi: • rmj = rmj + (d / ci)·rmi • rmi = 0
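The batched variant differs only in downloading the top window% of the frontier each round before applying the updates; a sketch under the same assumptions:

    def windowed_rankmass_crawl(trust, fetch_page, d=0.85,
                                epsilon=0.01, window=0.20):
        # Windowed RankMass: per round, download the top `window` fraction
        # of pages by rm_i, then batch all the lower-bound updates.
        rm = {url: (1 - d) * t for url, t in trust.items() if t > 0}
        crawled = {}
        crm = 0.0
        while crm < 1 - epsilon and rm:
            ranked = sorted(rm, key=rm.get, reverse=True)
            batch = ranked[:max(1, int(len(ranked) * window))]
            for url in batch:                        # batched downloads
                if url not in crawled:
                    crawled[url] = fetch_page(url)
            for url in batch:                        # batched updates
                rm_i = rm.pop(url)
                crm += rm_i
                links = crawled[url]
                for out in links:
                    rm[out] = rm.get(out, 0.0) + d * rm_i / len(links)
        return crawled, crm

Large windows behave more like L-Neighbor (cheap, batch-friendly updates); a window of a single page degenerates to the greedy RankMass crawler.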

  33. Experimental Setup • HTML files only • Algorithms simulated over the web graph • Crawled between Dec 2003 and Jan 2004 • 141 million URLs spanning 6.9 million host names • 233 top-level domains

  34. Metrics Of Evaluation • How much RankMass is collected during the crawl • How much RankMass is “known” to have been collected during the crawl • How much computational and performance overhead the algorithm introduces.

  35. L-Neighbor

  36. RankMass

  37. Windowed RankMass

  38. Window Size

  39. Algorithm Efficiency

  40. Algorithm Running Time

  41. Refresh Policies

  42. Refresh Policy: Problem Definition • You have N URLs that you want to keep fresh • Limited resources: f documents / second • Choose a download order that maximizes average freshness • What do you do? • Note: we can’t always know how the page currently looks

  43. The Optimal Solution • Depends on the definition of freshness • Boolean freshness: • A page is either fresh or not • One small change makes it stale

  44. Understanding Freshness Better • A two-page database • Pd changes daily • Pw changes once a week • We can refresh one page per week • How should we visit the pages? • Uniform: Pd, Pw, Pd, Pw, Pd, Pw, … • Proportional (visits in proportion to change rate): Pd, Pd, Pd, Pd, Pd, Pd, Pw, … • Other?

  45. Proportional Is Often Not Good! • Visiting the fast-changing Pd gains about half a day of freshness • Visiting the slow-changing Pw gains about half a week of freshness • Visiting Pw is the better deal!
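A back-of-the-envelope way to see this (an illustrative model, not the talk's exact analysis): right after a refresh, a page that changes at a known interval stays fresh for roughly half that interval on average, so each visit buys about half a day of freshness for Pd but about half a week for Pw:

    # Illustrative model: the page changes once per interval, at a uniformly
    # random moment, so a refresh keeps it fresh for ~interval/2 on average.
    def expected_fresh_days(change_interval_days):
        return change_interval_days / 2.0

    print(expected_fresh_days(1))   # Pd: ~0.5 days of freshness per visit
    print(expected_fresh_days(7))   # Pw: ~3.5 days of freshness per visit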

  46. Optimal Refresh Frequency • Problem: given the pages’ change frequencies λ1, …, λN and the total download resource f, find the refresh frequencies f1, …, fN (whose sum is constrained by f) that maximize the average freshness of the collection

  47. Optimal Refresh Frequency • Shape of curve is the same in all cases • Holds for any change frequency distribution

  48. Do Not Crawl in the DUST: Different URLs, Similar Text • Ziv Bar-Yossef (Technion and Google) • Idit Keidar (Technion) • Uri Schonfeld (UCLA)

  49. DUST – Different URLs, Similar Text • Even the WWW gets dusty. Examples: • Default directory files: “/index.html” ↔ “/” • Domain names and virtual hosts: “news.google.com” ↔ “google.com/news” • Aliases and symbolic links: “~shuri” ↔ “/people/shuri” • Parameters with little effect on content: “?Print=1” • URL transformations: “/story_<num>” → “story?id=<num>”
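To make the idea concrete, here is a small illustrative sketch (not the DustBuster algorithm from the paper) that canonicalizes URLs by applying rewrite rules of the kind listed above; the rules themselves are made-up examples:

    import re

    # Example DUST-style rules as (pattern, replacement) pairs; illustrative only.
    RULES = [
        (re.compile(r"/index\.html$"), "/"),                    # default directory file
        (re.compile(r"^http://news\.google\.com/"), "http://google.com/news/"),
        (re.compile(r"[?&]Print=1$"), ""),                      # parameter with no effect
    ]

    def canonicalize(url):
        # Apply the rules until a fixed point, so every DUST alias maps to
        # one canonical URL and is fetched and indexed only once.
        changed = True
        while changed:
            changed = False
            for pattern, repl in RULES:
                new_url = pattern.sub(repl, url)
                if new_url != url:
                    url, changed = new_url, True
        return url

    print(canonicalize("http://news.google.com/index.html"))    # http://google.com/news/

The paper's contribution is mining such rules automatically from URL lists; the sketch only shows how known rules would be applied.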

  50. Why Care About DUST? • Reduce crawling and indexing: avoid fetching the same document more than once • Canonization for better ranking: references to a document may be split among its aliases • Avoid returning duplicate results • Many algorithms that use URLs as unique IDs will benefit
