
Crawling the Web




  1. Crawling the Web Discovery and Maintenance of Large-Scale Web Data Junghoo Cho Stanford University

  2. What is a Crawler? [Diagram: "init" seeds the "to visit urls" queue with the initial urls; the crawler loop repeatedly does get next url, get page (from the web), and extract urls, feeding new urls back into the queue while recording visited urls and the downloaded web pages]
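The loop in the diagram can be sketched in a few lines of Python (a minimal sketch; `fetch_page` and `extract_urls` are hypothetical stand-ins for an HTTP client and an HTML link parser):

```python
from collections import deque

def crawl(initial_urls, fetch_page, extract_urls, max_pages=100):
    """Minimal crawler loop: init, get next url, get page, extract urls."""
    to_visit = deque(initial_urls)   # "to visit urls" queue
    visited = set()                  # "visited urls"
    pages = {}                       # downloaded "web pages"
    while to_visit and len(pages) < max_pages:
        url = to_visit.popleft()     # get next url
        if url in visited:
            continue
        visited.add(url)
        page = fetch_page(url)       # get page (from the web)
        pages[url] = page
        for link in extract_urls(page):   # extract urls
            if link not in visited:
                to_visit.append(link)
    return pages

# Tiny in-memory "web": each page's content is just its list of outlinks
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
pages = crawl(["a"], fetch_page=lambda u: web[u], extract_urls=lambda p: p)
```

A real crawler replaces the two lambdas with network fetching and HTML parsing, but the queue-and-set skeleton is the same.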

  3. Applications • Internet Search Engines • Google, AltaVista • Comparison Shopping Services • My Simon, BizRate • Data mining • Stanford Web Base, IBM Web Fountain

  4. WebBase Crawler • WebBase Project • BackRub Crawler, PageRank • Google • New WebBase Crawler • 20,000 lines in C/C++ • 130M pages collected

  5. Crawling Issues (1) • Load at visited web sites • Space out requests to a site • Limit number of requests to a site per day • Limit depth of crawl
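These politeness rules can be enforced with a small per-site bookkeeping structure. A sketch (the delay, daily cap, and depth limit are illustrative values, not the ones used by the WebBase crawler):

```python
import time
from urllib.parse import urlparse

class PolitenessPolicy:
    """Space out requests to a site, cap requests per site per day,
    and limit crawl depth."""
    def __init__(self, min_delay=10.0, max_per_day=3000, max_depth=8):
        self.min_delay = min_delay      # seconds between hits to one site
        self.max_per_day = max_per_day  # requests per site per day
        self.max_depth = max_depth      # limit depth of crawl
        self.last_hit = {}              # site -> time of last request
        self.count = {}                 # site -> requests so far today

    def may_fetch(self, url, depth, now=None):
        now = time.time() if now is None else now
        site = urlparse(url).netloc
        if depth > self.max_depth:
            return False
        if self.count.get(site, 0) >= self.max_per_day:
            return False
        if now - self.last_hit.get(site, -float("inf")) < self.min_delay:
            return False
        self.last_hit[site] = now
        self.count[site] = self.count.get(site, 0) + 1
        return True

policy = PolitenessPolicy(min_delay=10.0, max_per_day=2, max_depth=3)
ok_first = policy.may_fetch("http://example.com/a", depth=1, now=0.0)
too_soon = policy.may_fetch("http://example.com/b", depth=1, now=5.0)   # < 10s gap
ok_later = policy.may_fetch("http://example.com/b", depth=1, now=20.0)
over_cap = policy.may_fetch("http://example.com/c", depth=1, now=40.0)  # daily cap hit
too_deep = policy.may_fetch("http://other.com/x", depth=9, now=0.0)
```

A denied URL would simply be re-queued for later rather than dropped.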

  6. Crawling Issues (2) • Load at crawler • Parallelize [Diagram: two crawler instances share the initial urls and the "to visit urls" queue, each running its own get next url / get page / extract urls loop over the shared visited urls and web pages stores]

  7. Crawling Issues (3) • Scope of crawl • Not enough space for “all” pages • Not enough time to visit “all” pages • Solution: Visit “important” pages [Diagram: the crawler's visited pages concentrate on important sites such as Intel]

  8. Crawling Issues (4) • Replication • Pages mirrored at multiple locations

  9. Crawling Issues (5) • Incremental crawling • How do we avoid crawling from scratch? • How do we keep pages “fresh”?

  10. Summary of My Research • Load on sites [PAWS00] • Parallel crawler [Tech Report 01] • Page selection [WWW7] • Replicated page detection [SIGMOD00] • Page freshness [SIGMOD00] • Crawler architecture [VLDB00]

  11. Outline of This Talk How can we keep pages fresh? • How does the Web change? • What do we mean by “fresh” pages? • How should we refresh pages?

  12. Web Evolution Experiment • How often does a Web page change? • How long does a page stay on the Web? • How long does it take for 50% of the Web to change? • How do we model Web changes?

  13. Experimental Setup • February 17 to June 24, 1999 • 270 sites visited (with permission) • identified 400 sites with highest “PageRank” • contacted administrators • 720,000 pages collected • 3,000 pages from each site daily • start at root, visit breadth first (get new & old pages) • ran only 9pm - 6am, 10 seconds between site requests

  14. Average Change Interval [Plot: fraction of pages vs. average change interval]

  15. Change Interval – By Domain [Plot: fraction of pages vs. average change interval, broken down by domain]

  16. Modeling Web Evolution • Poisson process with rate λ • T is time to next event • fT(t) = λ e^(−λt)  (t > 0)
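Under this model the time to the next change is exponentially distributed with mean 1/λ. A quick simulation sketch, using inverse-transform sampling, checks that sampled intervals behave as fT predicts:

```python
import math
import random

def sample_change_intervals(lam, n, rng):
    """Draw n inter-change times T with density f_T(t) = lam * exp(-lam * t).
    Inverse-transform sampling: T = -ln(U) / lam for U uniform on (0, 1)."""
    return [-math.log(rng.random() or 1e-12) / lam for _ in range(n)]

rng = random.Random(0)
lam = 0.1                      # a page that changes every 10 days on average
ts = sample_change_intervals(lam, 100_000, rng)

mean_interval = sum(ts) / len(ts)               # should be close to 1/lam = 10 days
frac_within_10 = sum(t <= 10 for t in ts) / len(ts)
# P(T <= 10) = 1 - exp(-1) ≈ 0.632: most changes arrive before the mean interval
```

The same memoryless property is what makes the later freshness formulas tractable.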

  17. Change Interval of Pages [Plot: fraction of changes with given interval vs. interval in days, for pages that change every 10 days on average, compared against the Poisson model]

  18. Change Metrics • Freshness • Freshness of element ei at time t is F(ei; t) = 1 if ei is up-to-date at time t, 0 otherwise • Freshness of the database S at time t is F(S; t) = (1/N) Σ i=1..N F(ei; t) • (Assume “equal importance” of pages) [Diagram: each element ei in the web database mirrors an element ei on the web]

  19. Change Metrics • Age • Age of element ei at time t is A(ei; t) = 0 if ei is up-to-date at time t, t − (modification time of ei) otherwise • Age of the database S at time t is A(S; t) = (1/N) Σ i=1..N A(ei; t) • (Assume “equal importance” of pages)
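Both metrics are direct to compute from per-page bookkeeping. A sketch, where each page is a hypothetical (last-modified, last-refreshed) timestamp pair:

```python
def freshness(pages, t):
    """F(S; t): fraction of pages whose local copy is up-to-date at time t.
    A copy is up-to-date iff it was refreshed at or after the page's
    last modification."""
    up_to_date = sum(1 for modified, refreshed in pages if refreshed >= modified)
    return up_to_date / len(pages)

def age(pages, t):
    """A(S; t): average of t - (modification time) over stale pages,
    counting up-to-date pages as age 0."""
    total = sum(0 if refreshed >= modified else t - modified
                for modified, refreshed in pages)
    return total / len(pages)

# Four pages observed at t = 10: two refreshed after their last change, two stale
pages = [(2, 5), (3, 8), (6, 4), (9, 1)]
f = freshness(pages, 10)   # 2 of 4 up-to-date -> 0.5
a = age(pages, 10)         # ((10 - 6) + (10 - 9)) / 4 = 1.25
```

In practice the crawler does not observe modification times directly, which is exactly the estimation problem taken up in slides 37–38.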

  20. Change Metrics • Time averages [Plot: F(ei) is 1 until the element updates, drops to 0, and returns to 1 at the next refresh; A(ei) is 0 until an update, then grows linearly and resets to 0 at the next refresh]

  21. Refresh Order • Fixed order • Explicit list of URLs to visit • Random order • Start from seed URLs & follow links • Purely random • Refresh pages on demand, as requested by user

  22. Freshness vs. Revisit Frequency [Plot: time-average freshness vs. r] • r = λ / f = average change frequency / average visit frequency

  23. Age vs. Revisit Frequency [Plot: time-average age, normalized by the time to refresh all N elements, vs. r] • r = λ / f = average change frequency / average visit frequency

  24. Trick Question • Two page database • e1 changes daily • e2 changes once a week • Can visit one page per week • How should we visit pages? • e1 e2 e1 e2 e1 e2 e1 e2 ... [uniform] • e1 e1 e1 e1 e1 e1 e1 e2 e1 e1 ... [proportional] • e1 e1 e1 e1 e1 e1 ... • e2 e2 e2 e2 e2 e2 ... • ?

  25. Proportional Often Not Good! • Visit fast-changing e1 → get 1/2 day of freshness • Visit slow-changing e2 → get 1/2 week of freshness • Visiting e2 is a better deal!
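This claim can be checked numerically. For Poisson changes at rate λ and evenly spaced revisits at frequency f, the time-average freshness of one page works out to (f/λ)(1 − e^(−λ/f)) (averaging P(up-to-date) = e^(−λt) over one revisit interval). A sketch comparing the candidate policies for the two-page database of the trick question:

```python
import math

def avg_freshness(lam, f):
    """Time-average freshness of one page: Poisson changes at rate lam,
    evenly spaced revisits at frequency f (f = 0 means never refreshed)."""
    if f == 0:
        return 0.0
    return (f / lam) * (1 - math.exp(-lam / f))

# e1 changes daily (7/week), e2 weekly (1/week); budget: 1 visit/week in total
lam1, lam2 = 7.0, 1.0

def db_freshness(f1, f2):
    """Equal-importance average over the two pages."""
    return 0.5 * (avg_freshness(lam1, f1) + avg_freshness(lam2, f2))

uniform      = db_freshness(0.5, 0.5)     # alternate e1, e2
proportional = db_freshness(7/8, 1/8)     # visits proportional to change rate
only_e2      = db_freshness(0.0, 1.0)     # spend the whole budget on e2
# only_e2 > uniform > proportional: proportional is often not good!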

  26. Optimal Refresh Frequency • Problem: Given change rates λ1, …, λN and a total visit frequency f, find visit frequencies f1, …, fN that maximize the time-average freshness F(S)

  27. Solution • Compute ∂F(S)/∂fi • Lagrange multiplier method: at the optimum every visited page has the same marginal freshness gain ∂F(S)/∂fi = μ • All (λi, fi) pairs lie on one solution curve
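The Lagrange condition can be solved numerically: bisect on the multiplier μ until the per-page visit rates consume the whole budget. A sketch (pure Python, illustrative rates from the trick question; not the paper's derivation, just the same stationarity condition applied to the closed-form per-page freshness):

```python
import math

def marginal_gain(lam, f):
    """d/df of (f/lam)(1 - exp(-lam/f)), the per-page freshness gain;
    equals (1/lam)(1 - (1 + lam/f) exp(-lam/f)), decreasing in f."""
    if f == 0:
        return 1.0 / lam            # limit as f -> 0
    return (1 - math.exp(-lam / f)) / lam - math.exp(-lam / f) / f

def visit_rate(lam, mu, hi=1e6):
    """f with marginal_gain(lam, f) = mu, or 0 if even f = 0 gains less."""
    if marginal_gain(lam, 0) <= mu:
        return 0.0
    lo_f, hi_f = 1e-9, hi
    for _ in range(200):            # bisection: gain is decreasing in f
        mid = (lo_f + hi_f) / 2
        if marginal_gain(lam, mid) > mu:
            lo_f = mid
        else:
            hi_f = mid
    return lo_f

def optimal_rates(lams, budget):
    """Bisect on the multiplier mu until the rates use the whole budget."""
    lo, hi = 0.0, max(1.0 / l for l in lams)
    for _ in range(200):
        mu = (lo + hi) / 2
        if sum(visit_rate(l, mu) for l in lams) > budget:
            lo = mu
        else:
            hi = mu
    return [visit_rate(l, (lo + hi) / 2) for l in lams]

# e1 changes 7 times/week, e2 once/week; budget of 1 visit/week total
f1, f2 = optimal_rates([7.0, 1.0], 1.0)
# Essentially the whole budget goes to the slow-changing page e2
```

Because the marginal gain of a page is bounded by 1/λ even at f → 0, very fast-changing pages can be skipped entirely at the optimum, which is exactly the trick-question answer.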

  28. Optimal Refresh Frequency • Shape of curve is the same in all cases • Holds for any change frequency distribution

  29. Optimal Refresh for Age • Shape of curve is the same in all cases • Holds for any change frequency distribution

  30. Comparing Policies • Based on statistics from the experiment and a revisit frequency of once every month

  31. Topics to Follow • Weighted Freshness • Non-Poisson Model • Change Frequency Estimation

  32. Not Every Page is Equal! • In general, some pages are “more important” • e1: accessed by users 10 times/day • e2: accessed by users 20 times/day

  33. Weighted Freshness [Plot: optimal revisit frequency f vs. change frequency λ, for weights w = 1 and w = 2]

  34. Non-Poisson Model [Plot: fraction of changes with given interval vs. interval in days; a heavy-tail distribution fits the data better than the Poisson model]

  35. Optimal Revisit Frequency for Heavy-Tail Distribution [Plot: optimal revisit frequency f vs. change frequency λ]

  36. Principle of Diminishing Return • T: time to next change • Distribution of T: continuous, differentiable • Every page changes • Definition of change rate λ

  37. Change Frequency Estimation [Timeline: the page is visited once a day; a change is only detected at the next visit, so changes between visits can be missed] • How to estimate change frequency? • Naïve Estimator: X/T • X: number of detected changes • T: monitoring period • 2 changes in 10 days: 0.2 times/day • Problem: the observed change history is incomplete

  38. Improved Estimator • Based on the Poisson model: λ̂ = −f log(1 − X/N) • X: number of detected changes • N: number of accesses • f: access frequency • 3 changes in 10 days: 0.36 times/day • Accounts for “missed” changes
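The two estimators are easy to compare side by side. Under the Poisson model, the probability that a change is detected between consecutive accesses is 1 − e^(−λ/f); inverting that from the detection ratio X/N gives an estimator consistent with the slide's numbers (a sketch):

```python
import math

def naive_estimator(X, T):
    """X detected changes over monitoring period T (e.g. in days)."""
    return X / T

def improved_estimator(X, N, f):
    """Poisson-model estimator: X detections over N accesses at access
    frequency f. The chance of detecting a change between two accesses
    is 1 - exp(-lam/f), so solve X/N = 1 - exp(-lam/f) for lam.
    This accounts for changes "missed" between consecutive accesses."""
    return -f * math.log(1 - X / N)

naive = naive_estimator(2, 10)             # 2 changes in 10 days -> 0.2 / day
improved = improved_estimator(3, 10, 1.0)  # 3 detections in 10 daily visits
# improved ≈ 0.36 / day, above the 0.3 / day the naive count would report
```

The improved estimate always exceeds the raw detection ratio, since each detected change may stand in for several missed ones.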

  39. Improved Estimator • Bias • Efficiency • Consistency

  40. Improvement Significant? • Application to a Web crawler • Visit pages once every week for 5 weeks • Estimate change frequency • Adjust revisit frequency based on the estimate • Uniform: do not adjust • Naïve: based on the naïve estimator • Ours: based on our improved estimator

  41. Improvement from Our Estimator (9,200,000 visits in total)

  42. Other Estimators • Irregular access interval • Last-modified date • Categorization

  43. Summary • Web evolution experiment • Change metric • Refresh policy • Frequency estimator

  44. Contribution • Freshness [SIGMOD00] • Page selection [WWW7] • Replicated page detection [SIGMOD00] • Load on sites [PAWS00] • Parallel crawler [Tech Report 01] • Crawler architecture [VLDB00]

  45. The End • Thank you for your attention • For more information visit http://www-db.stanford.edu/~cho/
