Uncover the challenges and tactics of web crawling for effective information retrieval, discussing web evolution experiments, data metrics, and various crawling methods. Find out how to enhance freshness and efficiency in crawling.
How to Crawl the Web • Hector Garcia-Molina, Stanford University • Joint work with Junghoo Cho
Stanford InterLib Technologies • Physical Barriers → Mobile Access • Economic Concerns → IP Infrastructure • Information Loss → Archival Repository • Information Overload → Value Filtering • Service Heterogeneity → Interoperability
WebBase Architecture [diagram: web crawlers fetch pages from the WWW into a page repository; indexes and a feature repository are built from it; clients access the system through the WebBase API, a retrieval engine, and a multicast engine]
What is a Crawler? [diagram: initialize the to-visit URLs with initial URLs; repeatedly get the next URL, get the page from the web, extract its URLs, and add unseen ones back to the to-visit set; record visited URLs and store the web pages]
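The loop in the diagram can be sketched in a few lines of Python. This is an illustrative toy, not the WebBase crawler: the `fetch` function is passed in so the loop can run against canned pages as easily as against the live web, and the regex link extractor stands in for a real HTML parser.

```python
import re
from collections import deque
from urllib.parse import urljoin

LINK_RE = re.compile(r'href=["\'](.*?)["\']', re.IGNORECASE)

def crawl(seed_urls, fetch, max_pages=100):
    """The loop from the diagram: seed the to-visit queue, then repeatedly
    get the next URL, get the page, extract its URLs, and enqueue new ones."""
    to_visit = deque(seed_urls)   # "to visit urls"
    visited = set()               # "visited urls"
    pages = {}                    # "web pages"
    while to_visit and len(pages) < max_pages:
        url = to_visit.popleft()              # get next url
        if url in visited:
            continue
        visited.add(url)
        html = fetch(url)                     # get page (None on failure)
        if html is None:
            continue
        pages[url] = html
        for link in LINK_RE.findall(html):    # extract urls
            absolute = urljoin(url, link)
            if absolute not in visited:
                to_visit.append(absolute)
    return pages
```

Passing a dictionary's `get` over a canned site lets the loop run offline; a real crawler would substitute an HTTP fetch and a proper HTML parser.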
Crawling at Stanford • WebBase Project • BackRub Search Engine, Page Rank • Google • New WebBase Crawler
Crawling Issues (1) • Load at visited web sites • robots.txt files • space out requests to a site • limit number of requests to a site per day • limit depth of crawl • multicast
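The per-site politeness rules above can be sketched as a small bookkeeping class. This is a hypothetical helper, not code from the talk: the 10-second spacing and 3,000-requests-per-day cap echo the numbers used in the experiment described later, and robots.txt handling uses Python's standard `urllib.robotparser`.

```python
import urllib.robotparser
from urllib.parse import urlparse

class PoliteFetcher:
    """Per-site politeness bookkeeping (an illustrative sketch): honor
    robots.txt, space out requests to a host, and cap requests per day."""

    def __init__(self, delay_seconds=10.0, max_per_day=3000):
        self.delay = delay_seconds   # seconds between requests to one host
        self.cap = max_per_day       # daily request budget per host
        self.last_hit = {}           # host -> time of most recent request
        self.count = {}              # host -> requests made so far today
        self.robots = {}             # host -> parsed robots.txt rules

    def allowed(self, url):
        """False if robots.txt forbids the URL or the daily cap is spent."""
        host = urlparse(url).netloc
        rules = self.robots.get(host)
        if rules is not None and not rules.can_fetch("*", url):
            return False
        return self.count.get(host, 0) < self.cap

    def wait_time(self, url, now):
        """Seconds to wait before this host may be contacted again."""
        last = self.last_hit.get(urlparse(url).netloc)
        return 0.0 if last is None else max(0.0, self.delay - (now - last))

    def record(self, url, now):
        """Note that a request to this host happened at time `now`."""
        host = urlparse(url).netloc
        self.last_hit[host] = now
        self.count[host] = self.count.get(host, 0) + 1
```

The crawler loop would call `allowed` before enqueueing a fetch, sleep for `wait_time`, then `record` the request; limiting crawl depth and multicasting results are separate concerns.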
Crawling Issues (2) • Load at crawler • parallelize [diagram: two crawler instances, each running the get-next-url / get-page / extract-urls loop, sharing the to-visit URLs, visited URLs, and web pages]
Crawling Issues (3) • Scope of crawl • not enough space for "all" pages • not enough time to visit "all" pages • Solution: visit "important" pages
Crawling Issues (4) • Incremental crawling (the focus of this talk) • how do we keep pages "fresh"? • how do we avoid crawling from scratch?
Web Evolution Experiment • How often does a web page change? • What is the lifespan of a page? • How long does it take for 50% of the web to change? • How do we model web changes?
Experimental Setup • February 17 to June 24, 1999 • 270 sites visited (with permission) • identified 400 sites with highest “page rank” • contacted administrators • 720,000 pages collected • 3,000 pages from each site daily • start at root, visit breadth first (get new & old pages) • ran only 9pm - 6am, 10 seconds between site requests
How Often Does a Page Change? • Example: 50 visits to a page, 5 changes seen → average change interval = 50/5 = 10 days • Is this correct? [diagram: the page is visited once a day; changes between visits can be missed]
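Why the naive estimate can be wrong: a once-a-day visitor cannot see two changes in the same day. A small simulation (an illustrative sketch, assuming the Poisson change model introduced later in the talk) shows the bias for a fast-changing page.

```python
import random

random.seed(0)

def naive_interval_estimate(true_mean_interval, days=100_000):
    """Simulate a page whose changes arrive as a Poisson process
    (exponential gaps with the given mean, in days) and a crawler that
    checks once per day.  The naive estimator visits/changes counts a
    burst of same-day changes as a single detected change."""
    rate = 1.0 / true_mean_interval
    next_change = random.expovariate(rate)
    detected = 0
    for day in range(1, days + 1):
        changed_today = False
        while next_change <= day:       # consume all changes before this visit
            changed_today = True
            next_change += random.expovariate(rate)
        if changed_today:
            detected += 1
    return days / detected              # naive "average change interval"

# A page that truly changes every half day is estimated at well over a day,
# because at most one change per day can be observed.
print(naive_interval_estimate(0.5))
```

Under the Poisson model the daily detection probability is 1 − e^(−λ), so for λ = 2 changes/day the naive estimate converges to 1/(1 − e^(−2)) ≈ 1.16 days rather than the true 0.5.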
Average Change Interval [chart: fraction of pages vs. average change interval]
Average Change Interval, by Domain [chart: fraction of pages vs. average change interval, broken down by domain]
How Long Does a Page Live? [diagram: page lifetimes overlapping the experiment duration; a lifetime can start before or end after the experiment, so observed lifespans are truncated]
Page Lifespans [charts: fraction of pages vs. lifespan; the second chart shows the estimates with Method 1 used]
Time for a 50% Change [chart: fraction of unchanged pages vs. days]
Modeling Web Evolution • Poisson process with rate λ • T is time to next event • fT(t) = λe^(−λt) (t > 0)
Change Interval of Pages [chart: fraction of changes with a given interval vs. interval in days, for pages that change every 10 days on average, with the Poisson model curve overlaid]
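The Poisson model behind this chart is easy to sanity-check numerically: drawing exponential inter-change times with rate λ = 0.1/day should give a mean interval of 10 days, with about 63% (1 − 1/e) of intervals shorter than 10 days. A minimal sketch:

```python
import math
import random

random.seed(1)

# Poisson change model: inter-change time T has density fT(t) = rate * e^(-rate*t).
rate = 0.1  # one change per 10 days on average, as in the chart
samples = [random.expovariate(rate) for _ in range(200_000)]

mean_interval = sum(samples) / len(samples)   # expect about 1/rate = 10 days
frac_within_10 = sum(t < 10 for t in samples) / len(samples)
# P(T < 10) = 1 - e^(-rate*10) = 1 - 1/e, i.e. about 63% of intervals
print(mean_interval, frac_within_10)
```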
Change Metrics • Freshness • Freshness of element ei at time t: F(ei; t) = 1 if ei is up-to-date at time t, 0 otherwise • Freshness of the database S at time t: F(S; t) = (1/N) Σ i=1..N F(ei; t)
Change Metrics • Age • Age of element ei at time t: A(ei; t) = 0 if ei is up-to-date at time t, t − (time ei was modified) otherwise • Age of the database S at time t: A(S; t) = (1/N) Σ i=1..N A(ei; t)
Change Metrics • Time averages: F(ei) = lim t→∞ (1/t) ∫ 0..t F(ei; s) ds and F(S) = lim t→∞ (1/t) ∫ 0..t F(S; s) ds • similar for age... [diagram: F(ei) drops from 1 to 0 at a web update and returns to 1 at a crawler refresh; A(ei) grows linearly from an update until the next refresh]
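These time averages can be estimated by simulation. The sketch below is illustrative (one element, periodic refresh, Poisson changes); it also checks the simulated freshness against the closed form F = (1 − e^(−λI))/(λI), which follows from the Poisson model but is not stated on the slide.

```python
import math
import random

random.seed(2)

def time_averages(rate, interval, cycles=200_000):
    """Monte-Carlo time averages of F(ei) and A(ei) for one element:
    changes are Poisson with the given rate, and the crawler refreshes
    the element every `interval` time units.  Within a refresh cycle the
    element is fresh until its first change; after that its age grows
    linearly until the refresh resets it."""
    fresh_time = 0.0
    age_area = 0.0
    for _ in range(cycles):
        first_change = random.expovariate(rate)
        if first_change >= interval:
            fresh_time += interval        # no change: fresh all cycle
        else:
            fresh_time += first_change    # fresh only until the change...
            age_area += (interval - first_change) ** 2 / 2  # ...then age grows
    total_time = cycles * interval
    return fresh_time / total_time, age_area / total_time

F, A = time_averages(rate=0.25, interval=1.0)
closed_form = (1 - math.exp(-0.25)) / 0.25   # F = (1 - e^(-rate*I)) / (rate*I)
print(F, closed_form, A)
```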
Crawler Types • In-place vs. shadow • Steady vs. batch [diagrams: a shadow crawler maintains a separate shadow database beside the web database; a batch crawler alternates crawler-on and crawler-off periods over time]
Comparison: Batch vs. Steady [chart: freshness over time for a batch-mode in-place crawler, with crawler-running periods marked, vs. a steady in-place crawler]
Shadowing Steady Crawler [chart: freshness over time of the crawler's collection and the current collection, compared against the same crawler without shadowing]
Shadowing Batch Crawler [chart: freshness over time of the crawler's collection and the current collection, compared against the same crawler without shadowing]
Experimental Data • Pages change on average every 4 months • Batch crawler works one week out of 4 [chart: freshness of alternatives 1 and 2, 0.63 vs. 0.50]
Refresh Order • Fixed order • Example: explicit list of URLs to visit • Random order • Example: start from seed URLs & follow links • Purely random • Example: refresh pages on demand, as requested by users
Freshness vs. Order [chart: freshness vs. r for a steady, in-place crawler with pages changing at the same average rate; r = λ/f = average change frequency / average visit frequency]
Age vs. Order [chart: normalized age (age / time to refresh all N elements) vs. r; r = λ/f = average change frequency / average visit frequency]
Non-Uniform Change Rates [chart: fraction of elements vs. change rate λ, the distribution g(λ)]
Trick Question • Two page database • e1 changes daily • e2 changes once a week • Can visit pages once a week • How should we visit pages? • e1 e1 e1 e1 e1 e1 ... [only e1] • e2 e2 e2 e2 e2 e2 ... [only e2] • e1 e2 e1 e2 e1 e2 ... [uniform] • e1 e1 e1 e1 e1 e1 e2 e1 e1 ... [proportional] • ?
Proportional Often Not Good! • Visiting fast-changing e1 buys only 1/2 day of freshness on average • Visiting slow-changing e2 buys 1/2 week of freshness • Visiting e2 is a better deal!
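The back-of-the-envelope argument can be made precise with the closed form for freshness under periodic refresh, F = (1 − e^(−λI))/(λI), which follows from the Poisson model. This is an illustrative sketch under assumptions of mine: the crawler makes one visit per week in total, and "proportional" is read as a 7-to-1 visit ratio matching the two change rates.

```python
import math

def freshness_periodic(rate, interval):
    """Expected time-average freshness of a page with Poisson change rate
    `rate` (per day), refreshed every `interval` days: (1 - e^(-x)) / x."""
    x = rate * interval
    return (1.0 - math.exp(-x)) / x

rate_e1 = 1.0        # e1 changes daily
rate_e2 = 1.0 / 7.0  # e2 changes once a week

# Uniform policy: e1 e2 e1 e2 ... with weekly visits, so each page
# is refreshed every 14 days.
uniform = 0.5 * (freshness_periodic(rate_e1, 14) +
                 freshness_periodic(rate_e2, 14))

# Proportional policy: 7 visits to e1 for every visit to e2 (8-week cycle).
# e1 then sees six 7-day refresh gaps and one 14-day gap; e2 one 56-day gap.
e1_prop = (6 * 7 * freshness_periodic(rate_e1, 7) +
           14 * freshness_periodic(rate_e1, 14)) / 56
e2_prop = freshness_periodic(rate_e2, 56)
proportional = 0.5 * (e1_prop + e2_prop)

print(f"uniform: {uniform:.3f}  proportional: {proportional:.3f}")
```

Uniform roughly doubles average freshness here: the fast page is nearly always stale under either policy, so the visits lavished on it by the proportional policy are largely wasted.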
Selecting Optimal Refresh Frequency • Analysis is complex • Shape of curve is the same in all cases • Holds for any distribution g(λ)
Optimal Refresh Frequency for Age • Analysis is also complex • Shape of curve is the same in all cases • Holds for any distribution g(λ)
Comparing Policies • In-place, steady crawler • Based on our experimental data (pages change at different frequencies, as measured in the experiment)
Building an Incremental Crawler • Steady • In-place • Variable visit frequencies • Fixed order improves freshness! • Also need: • Page change frequency estimator • Page replacement policy • Page ranking policy (what are "important" pages?) • See papers...
The End • Thank you for your attention... • For more information, visit http://www-db.stanford.edu, follow the "Searchable Publications" link, and search for author = "Cho"