WebBase and Stanford Digital Library Project

WebBase and Stanford Digital Library Project. Junghoo Cho Stanford University. Technologies for Digital Libraries. Physical Barriers. Mobile Access. Economic Weaknesses. IP Infrastructure. Information Loss. Archival Repository. Information Overload. Value Filtering.

Presentation Transcript


  1. WebBase and Stanford Digital Library Project Junghoo Cho Stanford University

  2. Technologies for Digital Libraries • Physical Barriers: Mobile Access • Economic Weaknesses: IP Infrastructure • Information Loss: Archival Repository • Information Overload: Value Filtering • Service Heterogeneity: Interoperability

  3. WebBase Architecture [diagram; labeled components: WWW, Web Crawlers, Repository, Feature Repository, Retrieval Indexes, Multicast Engine, WebBase API, Clients]

  4. What is a Crawler? [flow diagram; labeled steps: initial urls, init, get next url, get page (from the web), extract urls; labeled stores: to visit urls, visited urls, web pages]
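The loop in the crawler diagram can be sketched roughly as follows. This is a minimal illustration, not WebBase's actual crawler: the in-memory `FAKE_WEB` dictionary, the `extract_urls` helper, and the `crawl` function are all hypothetical names introduced here, and a dictionary lookup stands in for real HTTP fetching.

```python
import re
from collections import deque

# Hypothetical in-memory "web": URL -> page text (stands in for HTTP fetches).
FAKE_WEB = {
    "a": '<a href="b"> <a href="c">',
    "b": '<a href="c"> <a href="a">',
    "c": '<a href="d">',
    "d": "no links here",
}

def extract_urls(page):
    """Return the href targets found in a page, in order of appearance."""
    return re.findall(r'href="([^"]+)"', page)

def crawl(initial_urls, fetch):
    """Visit pages reachable from initial_urls; return (visit order, pages)."""
    to_visit = deque(initial_urls)  # "to visit urls"
    visited = set()                 # "visited urls"
    pages = {}                      # "web pages"
    order = []
    while to_visit:
        url = to_visit.popleft()    # "get next url" (FIFO gives breadth-first order)
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)           # "get page"
        if page is None:            # unknown URL: nothing to store or extract
            continue
        pages[url] = page
        order.append(url)
        for link in extract_urls(page):  # "extract urls"
            if link not in visited:
                to_visit.append(link)
    return order, pages

order, pages = crawl(["a"], FAKE_WEB.get)
print(order)
```

A FIFO queue is used here only because the experimental setup later in the talk mentions breadth-first visiting; a real crawler would also need politeness delays and error handling.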

  5. Crawling Issues (1) • Load at visited web sites • Load at crawlers • Scope of the crawl

  6. Crawling Issues (2) • Maintaining pages “fresh” • How does the web change over time? • What does fresh page/database mean? • How can we increase “freshness”?

  7. Outline • Web evolution experiments • Freshness metrics • Crawling policy

  8. Web Evolution Experiment • How often does a web page change? • What is the lifespan of a page? • How long does it take for 50% of the web to change?

  9. Experimental Setup • February 17 to June 24, 1999 • 270 sites visited (with permission) • identified 400 sites with highest “page rank” • contacted administrators • 720,000 pages collected • 3,000 pages from each site daily • start at root, visit breadth first (get new & old pages) • ran only 9pm - 6am, 10 seconds between site requests

  10. Page Change • Example: 50 visits to page, 5 changes → average change interval = 50/5 = 10 days • Is this correct? [timeline figure: page visited once per day (1 day between visits), with changes marked]
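The arithmetic in this example can be written out as a small sketch. The function name `naive_change_interval` and the use of per-visit checksums are assumptions made for illustration; the slide gives only the visit and change counts.

```python
def naive_change_interval(snapshots, days_between_visits=1.0):
    """Estimate the average change interval from consecutive page checksums.

    Divides the number of visits by the number of detected changes,
    as in the 50/5 example. Returns None if no change was detected.
    """
    visits = len(snapshots)
    changes = sum(1 for prev, cur in zip(snapshots, snapshots[1:]) if prev != cur)
    if changes == 0:
        return None
    return visits * days_between_visits / changes

# 50 daily visits; the (hypothetical) checksum changes 5 times over the period.
snapshots = [(i + 5) // 10 for i in range(50)]
print(naive_change_interval(snapshots))
```

One reason to ask whether the estimate is correct: with one visit per day, several changes between two visits are detected as a single change, so this estimate tends to overstate the interval for pages that change more than once a day.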

  11. Average Change Interval [histogram; y-axis: fraction of pages]

  12. Change Interval - By Domain [histogram; y-axis: fraction of pages]

  13. Modeling Web Evolution • Poisson process with rate $\lambda$ • $T$ is time to next event • $f_T(t) = \lambda e^{-\lambda t}$ $(t > 0)$
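To make the model concrete: if changes follow a Poisson process with rate λ, the time to the next change is exponentially distributed, so the probability that a page changes within t days is 1 − e^(−λt) and the mean interval is 1/λ. The sketch below assumes that standard reading of the slide; the function names and the rate of 0.1 changes per day are illustrative choices, not values from the talk.

```python
import math
import random

def prob_changed_within(rate, t):
    """P(T <= t) for an exponential time-to-next-change with the given rate."""
    return 1.0 - math.exp(-rate * t)

def simulate_mean_interval(rate, n, seed=0):
    """Average of n simulated change intervals; should approach 1 / rate."""
    rng = random.Random(seed)
    return sum(rng.expovariate(rate) for _ in range(n)) / n

rate = 0.1  # assumed: one change every 10 days on average
print(prob_changed_within(rate, 1))    # chance of a change within one day
print(prob_changed_within(rate, 10))   # chance of a change within ten days
print(simulate_mean_interval(rate, 100_000))
```

Under this assumed rate, a page has roughly a 9.5% chance of changing within a day and roughly a 63% chance of changing within ten days.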

  14. Change Interval of Pages • For pages that change every 10 days on average [plot: fraction of changes with given interval vs. interval in days, compared with the Poisson model]

  15. Change Metrics • Freshness • Freshness of page $e_i$ at time $t$ is $F(e_i; t) = 1$ if $e_i$ is up-to-date at time $t$, $0$ otherwise • Freshness of the database $S$ at time $t$ is $F(S; t) = \frac{1}{N} \sum_{i=1}^{N} F(e_i; t)$ [diagram: pages $e_i$ on the web and their copies in the database]

  16. Change Metrics • Age • Age of page $e_i$ at time $t$ is $A(e_i; t) = 0$ if $e_i$ is up-to-date at time $t$, $t - (\text{modification time of } e_i)$ otherwise • Age of the database $S$ at time $t$ is $A(S; t) = \frac{1}{N} \sum_{i=1}^{N} A(e_i; t)$ [diagram: pages $e_i$ on the web and their copies in the database]
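The two metrics can be computed for a snapshot of the database as in the sketch below. It assumes that "modification time" means the time of the earliest source change not yet reflected in the local copy; the `PageState` class and its `stale_since` field are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class PageState:
    # Time of the earliest source modification missing from the local copy,
    # or None when the local copy is up to date (assumed representation).
    stale_since: Optional[float]

def freshness(page: PageState) -> float:
    """1 if the local copy is up to date, 0 otherwise."""
    return 1.0 if page.stale_since is None else 0.0

def age(page: PageState, t: float) -> float:
    """0 if up to date, else time elapsed since the unseen modification."""
    return 0.0 if page.stale_since is None else t - page.stale_since

def database_freshness(pages: List[PageState]) -> float:
    """Average freshness over the N pages of the database."""
    return sum(freshness(p) for p in pages) / len(pages)

def database_age(pages: List[PageState], t: float) -> float:
    """Average age over the N pages of the database at time t."""
    return sum(age(p, t) for p in pages) / len(pages)

# Four pages observed at t = 10: two up to date, two stale since t = 3 and t = 7.
db = [PageState(None), PageState(None), PageState(3.0), PageState(7.0)]
print(database_freshness(db), database_age(db, 10.0))
```

In this example half of the pages are fresh, and the average age is (0 + 0 + 7 + 3) / 4 = 2.5 time units.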

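  17. Change Metrics • Time averages: $\bar{F}(e_i) = \lim_{t \to \infty} \frac{1}{t} \int_0^t F(e_i; t)\, dt$ and $\bar{F}(S) = \lim_{t \to \infty} \frac{1}{t} \int_0^t F(S; t)\, dt$ • Similar for age [figure: $F(e_i)$ (between 0 and 1) and $A(e_i)$ plotted over time, with refresh and update events marked]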
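As a worked instance of the time average, suppose a page changes as a Poisson process with rate λ and is refreshed every I time units. A copy refreshed s time units ago is still fresh with probability e^(−λs), so averaging over one refresh cycle gives (1 − e^(−λI)) / (λI). The sketch below checks that closed form against a numerical integration. The Poisson and periodic-refresh assumptions, the function names, and the example values are all introduced here for illustration.

```python
import math

def avg_freshness_closed_form(rate, interval):
    """Time-averaged freshness of one page: (1 - e^(-x)) / x with x = rate * interval."""
    x = rate * interval
    return (1.0 - math.exp(-x)) / x

def avg_freshness_numeric(rate, interval, steps=100_000):
    """Midpoint-rule average of e^(-rate * s) over one refresh cycle [0, interval]."""
    h = interval / steps
    total = sum(math.exp(-rate * (k + 0.5) * h) for k in range(steps))
    return total * h / interval

rate, interval = 0.1, 30.0  # assumed: changes every 10 days, refreshed every 30 days
print(avg_freshness_closed_form(rate, interval))
print(avg_freshness_numeric(rate, interval))
```

With these assumed values the page is fresh a little under a third of the time, even though it is refreshed regularly.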

  18. Trick Question • Two page database • e1 changes daily • e2 changes once a week • Can visit pages once a week • How should we visit pages? • e1 e1 e1 e1 e1 e1 ... • e2 e2 e2 e2 e2 e2 ... • e1 e2 e1 e2 e1 e2 ... [uniform] • e1 e1 e1 e1 e1 e1 e1 e2 e1 e1 ... [proportional] • ? [diagram: e1 and e2 on the web and their copies in the database]

  19. Proportional Often Not Good! • Visit fast changing e1 → get 1/2 day of freshness • Visit slow changing e2 → get 1/2 week of freshness • Visiting e2 is a better deal!
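The comparison can be checked numerically under some loudly stated assumptions: changes are Poisson (e1 at 7 changes per week, e2 at 1 change per week), the budget is one page visit per week in total, each page is refreshed at regular intervals, and a page refreshed every I weeks has time-averaged freshness (1 − e^(−λI)) / (λI). The names `RATES`, `POLICIES`, and `database_freshness` are hypothetical, and this is a sketch of the intuition rather than the analysis from the talk.

```python
import math

def avg_freshness(rate, visits_per_week):
    """Time-averaged freshness of one Poisson-changing page under periodic refresh."""
    if visits_per_week == 0:
        return 0.0  # a page that is never refreshed is stale in the long run
    x = rate / visits_per_week  # rate * refresh interval
    return (1.0 - math.exp(-x)) / x

RATES = {"e1": 7.0, "e2": 1.0}  # assumed changes per week

# Each policy splits a budget of one visit per week between the two pages.
POLICIES = {
    "only e1": {"e1": 1.0, "e2": 0.0},
    "only e2": {"e1": 0.0, "e2": 1.0},
    "uniform": {"e1": 0.5, "e2": 0.5},
    "proportional": {"e1": 7 / 8, "e2": 1 / 8},
}

def database_freshness(allocation):
    """Average of the per-page time-averaged freshness values."""
    return sum(avg_freshness(RATES[p], allocation[p]) for p in RATES) / len(RATES)

results = {name: database_freshness(alloc) for name, alloc in POLICIES.items()}
for name, value in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{name:12s} {value:.3f}")
```

Under these assumptions the proportional policy scores below the uniform one, and spending the whole budget on the slowly changing e2 scores highest of the four, which matches the slide's point that visiting e2 is the better deal.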

  20. Selecting Optimal Refresh Frequency • Analysis is complex • Shape of curve is the same in all cases • Holds for any distribution $g(\lambda)$

  21. Optimal Refresh Frequency for Age • Analysis is also complex • Shape of curve is the same in all cases • Holds for any distribution $g(\lambda)$

  22. Comparing Policies • Based on statistics from the experiment and a revisit frequency of once every month

  23. Summary • Maintaining the collection fresh: • Web evolution experiment • Change metrics • Optimal policy • Intuitive policy does not always perform well • Should be careful in deciding revisit policy
