WebBase and Stanford Digital Library Project
Junghoo Cho, Stanford University
Technologies for Digital Libraries
• Physical Barriers → Mobile Access
• Economic Weaknesses → IP Infrastructure
• Information Loss → Archival Repository
• Information Overload → Value Filtering
• Service Heterogeneity → Interoperability
WebBase Architecture
[Architecture diagram: a set of web crawlers fetch pages from the WWW into a page repository; retrieval indexes and a feature repository are built over it; a multicast engine and the WebBase API deliver pages and index data to clients.]
What is a Crawler?
[Flow diagram: the crawler is initialized with a set of initial urls, which seed a queue of "to visit urls"; it repeatedly gets the next url, fetches the page from the web, records the url as visited, stores the web page, extracts the urls on the page, and adds the unseen ones to the queue.]
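To make the loop concrete, here is a minimal sketch of the crawl cycle pictured above, written in Python for illustration. It is not the actual WebBase crawler: the fetch and link-extraction steps are deliberately crude (urllib plus a regex), and there is no politeness delay or scope control, which the next slides discuss.

```python
import re
import urllib.request
from collections import deque

def crawl(initial_urls, max_pages=50):
    to_visit = deque(initial_urls)   # "to visit urls", seeded with the initial urls
    visited = set()                  # "visited urls"
    pages = {}                       # collected "web pages"

    while to_visit and len(pages) < max_pages:
        url = to_visit.popleft()                     # get next url
        if url in visited:
            continue
        visited.add(url)
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:   # get page from the web
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue
        pages[url] = html
        # extract urls: a crude regex is enough for a sketch (a real crawler parses HTML)
        for link in re.findall(r'href="(https?://[^"]+)"', html):
            if link not in visited:
                to_visit.append(link)
    return pages

# Example (placeholder URL): pages = crawl(["https://example.com/"], max_pages=10)
```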
Crawling Issues (1) • Load at visited web sites • Load at crawlers • Scope of the crawl
Crawling Issues (2) • Maintaining pages “fresh” • How does the web change over time? • What does fresh page/database mean? • How can we increase “freshness”?
Outline • Web evolution experiments • Freshness metrics • Crawling policy
Web Evolution Experiment • How often does a web page change? • What is the lifespan of a page? • How long does it take for 50% of the web to change?
Experimental Setup • February 17 to June 24, 1999 • 270 sites visited (with permission) • identified 400 sites with the highest PageRank • contacted the administrators • 720,000 pages collected • 3,000 pages from each site daily • start at the root page, visit breadth-first (getting both new and old pages) • ran only 9 pm to 6 am, 10 seconds between requests to a site
Page Change
[Timeline: a page is visited once a day; some visits find the page changed.]
• Example: 50 visits to a page, 5 changes detected
• Average change interval = 50 days / 5 changes = 10 days
• Is this correct?
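A tiny numeric sketch of the estimate above (Python, with the slide's numbers). The caveat in the comments is one reading of the slide's "Is this correct?" question: daily visits only reveal whether a page changed since the last visit, not how many times.

```python
# Naive estimator from the slide: visit a page once a day for 50 days, detect 5 changes.
visits = 50              # one visit per day, so 50 days of observation
changes_detected = 5

avg_change_interval = visits / changes_detected
print(avg_change_interval)   # 10.0 days

# Why "Is this correct?" is a fair question: a daily visit only shows that the page
# changed at least once since yesterday. Several changes within one day are counted
# as a single change, so the true change interval may be shorter than 10 days.
```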
Average Change Interval
[Histogram: fraction of pages by average change interval]
Change Interval - By Domain
[Histogram: fraction of pages by change interval, broken down by domain]
Modeling Web Evolution
• Poisson process with rate λ
• T is the time to the next event
• f_T(t) = λ·e^(−λt) for t > 0
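A short sketch of this model in Python. The change rate below (one change every 10 days) is just an example value, chosen to match the "10 days on average" pages on the next slide.

```python
import math
import random

lam = 1 / 10.0   # example rate: one change every 10 days on average

def f_T(t, lam=lam):
    """Density of the time T to the next change: f_T(t) = lam * exp(-lam * t), t > 0."""
    return lam * math.exp(-lam * t)

# Under the Poisson model, change intervals are exponential with mean 1/lam.
random.seed(0)
intervals = [random.expovariate(lam) for _ in range(100_000)]
print(sum(intervals) / len(intervals))   # close to 10 days
print(f_T(10.0))                         # density of a 10-day interval
```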
Change Interval of Pages
[Histogram: fraction of changes with a given interval vs. interval in days, for pages that change every 10 days on average, with the Poisson model curve overlaid]
Change Metrics: Freshness
• Freshness of page ei at time t: F(ei; t) = 1 if ei is up-to-date at time t, 0 otherwise
• Freshness of the database S (the local copy of N web pages) at time t: F(S; t) = (1/N) Σ_{i=1..N} F(ei; t)
Change Metrics: Age
• Age of page ei at time t: A(ei; t) = 0 if ei is up-to-date at time t, otherwise t - (time ei was last modified)
• Age of the database S at time t: A(S; t) = (1/N) Σ_{i=1..N} A(ei; t)
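A small Python sketch of both metrics for a toy two-page database. The page contents, timestamps, and the up-to-date test (simple equality of our copy with the live page) are illustrative assumptions, not part of the slides.

```python
def freshness(db, web):
    # F(S; t) = (1/N) * sum_i F(e_i; t); F(e_i; t) is 1 iff our copy equals the live page.
    return sum(1 if db[e] == web[e] else 0 for e in db) / len(db)

def age(db, web, last_modified, t):
    # A(S; t) = (1/N) * sum_i A(e_i; t); A(e_i; t) is 0 if up to date,
    # otherwise t minus the time the live page last changed.
    return sum(0 if db[e] == web[e] else t - last_modified[e] for e in db) / len(db)

web = {"e1": "v3", "e2": "v1"}            # current pages on the web
db = {"e1": "v2", "e2": "v1"}             # our crawled copies (e1 is stale)
last_modified = {"e1": 8.0, "e2": 2.0}    # day each live page last changed
t = 10.0                                  # "now", in days

print(freshness(db, web))                 # 0.5: one of the two copies is up to date
print(age(db, web, last_modified, t))     # 1.0: (2 days of staleness for e1 + 0) / 2
```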
Change Metrics: Time Averages
• Time-averaged freshness of a page: F(ei) = lim_{t→∞} (1/t) ∫_{0..t} F(ei; t) dt
• Time-averaged freshness of the database: F(S) = lim_{t→∞} (1/t) ∫_{0..t} F(S; t) dt
• Similar definitions for age
[Plot: when the live page is updated, F(ei) drops from 1 to 0 and A(ei) starts growing; when we refresh our copy, F(ei) returns to 1 and A(ei) resets to 0]
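These time averages can be estimated by simulation. The sketch below (Python) follows one page that changes as a Poisson process with rate lam and is refreshed on a fixed interval, and measures the fraction of time the copy is fresh. The rate and refresh interval are illustrative, not values from the experiment.

```python
import random

def avg_freshness(lam, refresh_interval, horizon=50_000.0, step=0.05, seed=0):
    """Estimate the time-averaged freshness F(e_i) of one page by simulation."""
    random.seed(seed)
    t, fresh_time = 0.0, 0.0
    next_change = random.expovariate(lam)      # next time the live page changes
    next_refresh = refresh_interval            # next time we re-crawl the page
    up_to_date = True
    while t < horizon:
        if up_to_date:
            fresh_time += step
        t += step
        if t >= next_change:                   # live page changed: our copy goes stale
            up_to_date = False
            next_change = t + random.expovariate(lam)
        if t >= next_refresh:                  # we refresh: copy is up to date again
            up_to_date = True
            next_refresh += refresh_interval
    return fresh_time / horizon

# A page changing every 10 days on average, refreshed once a week:
print(avg_freshness(lam=1 / 10.0, refresh_interval=7.0))   # roughly 0.7
```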
Trick Question
• Two-page database
• e1 changes daily
• e2 changes once a week
• Can visit pages once a week
• How should we visit pages?
  • e1 e1 e1 e1 e1 e1 ...
  • e2 e2 e2 e2 e2 e2 ...
  • e1 e2 e1 e2 e1 e2 ... [uniform]
  • e1 e1 e1 e1 e1 e1 e1 e2 e1 e1 ... [proportional]
  • ?
Proportional Often Not Good!
• Visit the fast-changing e1 → gain about 1/2 day of freshness per visit
• Visit the slow-changing e2 → gain about 1/2 week of freshness per visit
• Visiting e2 is a better deal!
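To put rough numbers on this, here is a back-of-the-envelope comparison in Python. It uses the standard Poisson-model result (not shown on the slides) that a page with change rate lam, refreshed every I days, has expected freshness (1 - e^(-lam*I)) / (lam*I), and it assumes a total budget of two visits per week, so the uniform policy visits each page weekly. Both the formula and the budget are assumptions for illustration.

```python
import math

def expected_freshness(lam, refresh_interval):
    """Time-averaged freshness of one page under the Poisson model (assumed closed form)."""
    x = lam * refresh_interval
    return (1 - math.exp(-x)) / x

lam1 = 1.0        # e1 changes daily (rate per day)
lam2 = 1.0 / 7.0  # e2 changes once a week

# Uniform: each page visited once a week (refresh interval = 7 days for both).
uniform = (expected_freshness(lam1, 7.0) + expected_freshness(lam2, 7.0)) / 2

# Proportional: visits split 7:1 in favour of e1 with the same total budget,
# so e1 is refreshed every 4 days and e2 every 28 days.
proportional = (expected_freshness(lam1, 4.0) + expected_freshness(lam2, 28.0)) / 2

print(uniform)        # about 0.39
print(proportional)   # about 0.25 -- proportional is clearly worse here
```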
Selecting Optimal Refresh Frequency • Analysis is complex • Shape of the curve is the same in all cases • Holds for any change-rate distribution g(λ)
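As a concrete stand-in for the curve the slide refers to, here is a small numerical sketch in Python: it reuses a rate-based variant of the assumed closed-form freshness from the previous sketch and grid-searches how to split a fixed budget of two visits per week between e1 and e2. The budget and the grid search are assumptions for illustration, not the slide's analysis.

```python
import math

def expected_freshness(lam, refresh_rate):
    """Time-averaged freshness under the Poisson model (assumed closed form), rate in visits/day."""
    if refresh_rate <= 0:
        return 0.0
    x = lam / refresh_rate
    return (1 - math.exp(-x)) / x

lam1, lam2 = 1.0, 1.0 / 7.0      # e1 changes daily, e2 weekly
budget = 2.0 / 7.0               # assumed budget: 2 visits per week, in visits/day

candidates = [i / 1000.0 * budget for i in range(1001)]   # possible visit rates for e1

def total_freshness(r1):
    return expected_freshness(lam1, r1) + expected_freshness(lam2, budget - r1)

best = max(candidates, key=total_freshness)
print(best / budget)   # share of the visit budget given to the fast-changing page e1

# With these example numbers the best split gives e1 well under half of the visits:
# the slowly changing e2 is worth more of the budget, echoing the previous slide.
```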
Optimal Refresh Frequency for Age • Analysis is also complex • Shape of the curve is the same in all cases • Holds for any change-rate distribution g(λ)
Comparing Policies • Based on the change statistics from the experiment and a revisit frequency of once every month
Summary • Maintaining a fresh collection: • Web evolution experiment • Change metrics • Optimal refresh policy • The intuitive policy does not always perform well • Be careful when choosing a revisit policy