450 likes | 462 Views
This study explores web crawling techniques for collecting large-scale web data, addressing issues such as load balancing, depth limitations, and page freshness. The research covers web evolution experiments and optimal refresh frequencies for maintaining page relevance.
E N D
Crawling the Web Discovery and Maintenance of Large-Scale Web Data Junghoo Cho Stanford University
What is a Crawler? initial urls init to visit urls get next url get page visited urls web extract urls web pages
Applications • Internet Search Engines • Google, AltaVista • Comparison Shopping Services • My Simon, BizRate • Data mining • Stanford Web Base, IBM Web Fountain
WebBase Crawler • Web Base Project • BackRub Crawler, PageRank • Google • New Web Base Crawler • 20,000 lines in C/C++ • 130M pages collected
Crawling Issues (1) • Load at visited web sites • Space out requests to a site • Limit number of requests to a site per day • Limit depth of crawl
? Crawling Issues (2) • Load at crawler • Parallelize initial urls init init to visit urls get next url get next url get page get page extract urls extract urls visited urls web pages
Solution: Visit “important” pages Intel visited pages Intel Crawling Issues (3) • Scope of crawl • Not enough space for “all” pages • Not enough time to visit “all” pages
Crawling Issues (4) • Replication • Pages mirrored at multiple locations
Crawling Issues (5) • Incremental crawling • How do we avoid crawling from scratch? • How do we keep pages “fresh”?
Summary of My Research • Load on sites [PAWS00] • Parallel crawler [Tech Report 01] • Page selection [WWW7] • Replicated page detection [SIGMOD00] • Page freshness [SIGMOD00] • Crawler architecture [VLDB00]
Outline of This Talk How can we maintain pages fresh? • How does the Web change? • What do we mean by “fresh” pages? • How should we refresh pages?
Web Evolution Experiment • How often does a Web page change? • How long does a page stay on the Web? • How long does it take for 50% of the Web to change? • How do we model Web changes?
Experimental Setup • February 17 to June 24, 1999 • 270 sites visited (with permission) • identified 400 sites with highest “PageRank” • contacted administrators • 720,000 pages collected • 3,000 pages from each site daily • start at root, visit breadth first (get new & old pages) • ran only 9pm - 6am, 10 seconds between site requests
Average Change Interval fraction of pages ¾ ¾ average change interval
Change Interval – By Domain fraction of pages ¾ ¾ average change interval
Modeling Web Evolution • Poisson process with rate • T is time to next event • fT(t) = e-t (t > 0)
Change Interval of Pages for pages that change every 10 days on average fraction of changes with given interval Poisson model interval in days
web database ei ei • Freshness of the database S at time t isF( S ; t ) = F( ei ; t ) • (Assume “equal importance” of pages) ... ... N 1 N i=1 Change Metrics • Freshness • Freshness of element ei at time t isF ( ei ; t ) = 1 if ei is up-to-date at time t 0 otherwise
web database ei ei • Age of the database S at time t isA( S ; t ) = A( ei ; t ) • (Assume “equal importance” of pages) ... ... N 1 N i=1 Change Metrics • Age • Age of element ei at time t is A( ei ; t ) = 0 if ei is up-to-date at time tt - (modification ei time) otherwise
Time averages: Change Metrics F(ei) 1 0 time A(ei) 0 time refresh update
Refresh Order • Fixed order • Explicit list of URLs to visit • Random order • Start from seed URLs & follow links • Purely random • Refresh pages on demand, as requested by user database web ei ei ... ...
Freshness vs. Revisit Frequency r = / f = average change frequency / average visit frequency
= Age / time to refresh all N elements Age vs. Revisit Frequency r = / f = average change frequency / average visit frequency
Trick Question • Two page database • e1changes daily • e2changes once a week • Can visit one page per week • How should we visit pages? • e1e2 e1e2 e1e2 e1e2... [uniform] • e1e1e1e1e1e1e1e2e1e1 …[proportional] • e1e1e1e1e1e1 ... • e2e2e2e2e2e2 ... • ? e1 e1 e2 e2 web database
Proportional Often Not Good! • Visit fast changing e1 get 1/2 day of freshness • Visit slow changing e2 get 1/2 week of freshness • Visiting e2is a better deal!
Optimal Refresh Frequency Problem Given and f , find that maximize
Solution • Compute • Lagrange multiplier method • All
Optimal Refresh Frequency • Shape of curve is the same in all cases • Holds for any change frequency distribution
Optimal Refresh for Age • Shape of curve is the same in all cases • Holds for any change frequency distribution
Comparing Policies Based on Statistics from experiment and revisit frequency of every month
Topics to Follow • Weighted Freshness • Non-Poisson Model • Change Frequency Estimation
In general, Not Every Page is Equal! Some pages are “more important” e1 Accessed by users 10 times/day e2 Accessed by users 20 times/day
Weighted Freshness f w = 2 w = 1 l
Heavy-tail distribution Poisson model Non-Poisson Model fraction of changes with given interval interval in days
f l Optimal Revisit Frequencyfor Heavy-Tail Distribution
Principle of Diminishing Return • T: time to next change • : continuous, differentiable • Every page changes • Definition of change rate l
Page changed Page visited 1 day Change detected Change Frequency Estimation • How to estimate change frequency? • Naïve Estimator: X/T • X: number of detected changes • T: monitoring period • 2 changes in 10 days: 0.2 times/day • Incomplete change history
Improved Estimator • Based on the Poisson model • X: number of detected changes • N: number of accesses • f : access frequency • 3 changes in 10 days: 0.36 times/day • Accounts for “missed” changes
Improved Estimator • Bias • Efficiency • Consistency
Improvement Significant? • Application to a Web crawler • Visit pages once every week for 5 weeks • Estimate change frequency • Adjust revisit frequency based on the estimate • Uniform: do not adjust • Naïve: based on the naïve estimator • Ours: based on our improved estimator
Improvement from Our Estimator (9,200,000 visits in total)
Other Estimators • Irregular access interval • Last-modified date • Categorization
Summary • Web evolution experiment • Change metric • Refresh policy • Frequency estimator
Contribution • Freshness [SIGMOD00] • Page selection [WWW7] • Replicated page detection [SIGMOD00] • Load on sites [PAWS00] • Parallel crawler [Tech Report 01] • Crawler architecture [VLDB00]
The End • Thank you for your attention • For more information visit http://www-db.stanford.edu/~cho/