
Lazy Preservation, Warrick, and the Web Infrastructure

Lazy Preservation, Warrick, and the Web Infrastructure. Frank McCown Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007 Vancouver, BC June 19, 2007. Outline. What is the Web Infrastructure (WI)? How can the WI be used for preservation?


Presentation Transcript


  1. Lazy Preservation, Warrick, and the Web Infrastructure Frank McCown Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007 Vancouver, BC June 19, 2007

  2. Outline • What is the Web Infrastructure (WI)? • How can the WI be used for preservation? • Web-repository crawling with Warrick • Understanding the WI • Caching experiment • Reconstruction experiments • Search engine sampling and IA overlap experiment • Recovering web server components from the WI • Brass: Queueing manager for Warrick

  3. Web Infrastructure

  4. Alternative Models of Preservation • Lazy Preservation • Let Google, IA et al. preserve your website • Just-In-Time Preservation • Wait for it to disappear first, then a “good enough” version • Shared Infrastructure Preservation • Push your content to sites that might preserve it • Web Server Enhanced Preservation • Use Apache modules to create archival-ready resources

  5. Black hat: http://img.webpronews.com/securitypronews/110705blackhat.jpg Virus image: http://polarboing.com/images/topics/misc/story.computer.virus_1137794805.jpg Hard drive: http://www.datarecoveryspecialist.com/images/head-crash-2.jpg

  6. Crawling the Crawlers

  7. Cached Image

  8. Cached PDF http://www.fda.gov/cder/about/whatwedo/testtube.pdf canonical MSN version Yahoo version Google version

  9. Web-repository Crawler

  10. McCown, et al., Brass: A Queueing Manager for Warrick, IWAW 2007. • McCown, et al., Factors Affecting Website Reconstruction from the Web Infrastructure, ACM IEEE JCDL 2007. • McCown and Nelson, Evaluation of Crawling Policies for a Web-Repository Crawler, HYPERTEXT 2006. • McCown, et al., Lazy Preservation: Reconstructing Websites by Crawling the Crawlers, ACM WIDM 2006. Available at http://warrick.cs.odu.edu/

  11. What Types of Websites Are Lost? Marshall, McCown, and Nelson, Evaluating Personal Archiving Strategies for Internet-based Information, IS&T Archiving 2007.

  12. Outline • What is the Web Infrastructure (WI)? • How can the WI be used for preservation? • Web-repository crawling with Warrick • Understanding the WI • Caching experiment • Reconstruction experiments • Search engine sampling and IA overlap experiment • Recovering web server components from the WI • Brass: Queueing manager for Warrick

  13. Understanding the WI • How quickly do search engines acquire and purge their caches? • Do search engines prefer caching one type of resource over another? • How much overlap is there between the search engines' caches and IA holdings? • How successfully can we reconstruct a lost website? • Are some resources more recoverable than others?

  14. Timeline of Web Resource

  15. Web Caching Experiment • Create 4 websites composed of HTML, PDFs, and images • http://www.owenbrau.com/ • http://www.cs.odu.edu/~fmccown/lazy/ • http://www.cs.odu.edu/~jsmit/ • http://www.cs.odu.edu/~mln/lazp/ • Remove pages each day • Query GMY every day using identifiers McCown, et al., Lazy Preservation: Reconstructing Websites by Crawling the Crawlers, ACM WIDM 2006.

  16. Where is the Internet Archive? • No crawls from Alexa, IA’s provider • Even if they had crawled us, the content would not be accessible from IA for 6-12 months • Short-lived web content is likely to be lost for good

  17. 2005 Reconstruction Experiment • Crawl and reconstruct 24 sites of various sizes: 1. small (1-150 resources) 2. medium (151-499 resources) 3. large (500+ resources) • Perform 5 reconstructions for each website • One using all four repositories together • Four using each repository separately • Calculate reconstruction vector for each reconstruction (changed%, missing%, added%)

  18. How Much Did We Reconstruct? [Diagram: "lost" web site (resources A, B, C, D, E, F) next to the reconstructed web site (resources A, B’, C’, E, G)] Four categories of recovered resources: 1) Identical: A, E 2) Changed: B, C 3) Missing: D, F 4) Added: G • Missing link to D; points to old resource G • F can’t be found

  19. Reconstruction Diagram added 20% changed 33% missing 17% identical 50%
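
The reconstruction vector (changed%, missing%, added%) from the slides above can be sketched as follows. This is a minimal illustration, not Warrick's own code. It assumes that changed and missing are measured against the number of resources in the lost site, and added against the number of resources in the reconstructed site. The function name and the toy data are hypothetical and do not reproduce the slide's example.

```python
def reconstruction_vector(original, reconstructed):
    """Return (changed, missing, added) as fractions between 0 and 1.

    Both arguments map a resource URL to its content bytes.
    Assumption: changed and missing are relative to the original site,
    added is relative to the reconstructed site.
    """
    orig_urls = set(original)
    recon_urls = set(reconstructed)

    # Recovered resources whose content differs from the lost version.
    changed = sum(1 for url in orig_urls & recon_urls
                  if original[url] != reconstructed[url])
    # Lost resources that were not recovered at all.
    missing = len(orig_urls - recon_urls)
    # Recovered resources that were not part of the lost site.
    added = len(recon_urls - orig_urls)

    n_orig = len(orig_urls) or 1    # avoid division by zero
    n_recon = len(recon_urls) or 1
    return (changed / n_orig, missing / n_orig, added / n_recon)


# Hypothetical toy data: four lost resources, three recovered ones.
lost = {"/a": b"A", "/b": b"B", "/c": b"C", "/d": b"D"}
found = {"/a": b"A", "/b": b"B changed", "/e": b"E"}
vector = reconstruction_vector(lost, found)
```

Here one of four lost resources changed, two are missing, and one of the three recovered resources is new.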

  20. Recovery Success by MIME Type

  21. Repository Contributions

  22. 2006 Reconstruction Experiment • 300 websites chosen randomly from Open Directory Project (dmoz.org) • Crawled and reconstructed each website every week for 14 weeks • Examined change rates, age, decay, growth, recoverability McCown, et al., Factors Affecting Website Reconstruction from the Web Infrastructure, ACM IEEE JCDL 2007.

  23. Success of website recovery each week *On average, we recovered 61% of a website on any given week.

  24. Statistics for Repositories

  25. Experiment: Sample Search Engine Caches • Feb 2006 • Submitted 5200 one-term queries to Ask, Google, MSN, and Yahoo • Randomly selected 1 result from first 100 • Download resource and cached page • Check for overlap with Internet Archive McCown, et al., Brass: A Queueing Manager for Warrick, IWAW 2007.

  26. Distribution of Top Level Domains

  27. Cached Resource Size Distributions [figure; labeled values: 976 KB, 977 KB, 215 KB, 1 MB]

  28. Cache Freshness [Timeline: a resource is crawled and cached (fresh), then changed on the web server (stale), then crawled and cached again (fresh)] Staleness = max(0, Last-Modified HTTP header – cached date)
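
The staleness formula above can be sketched as a small helper. This is an illustration only: the function name is hypothetical, the result is expressed in days, and it assumes the cached date is already available as a timezone-aware datetime (each search engine reports it in its own format).

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def staleness_days(last_modified_header, cached_at):
    """Staleness = max(0, Last-Modified - cached date), in days.

    last_modified_header: value of the HTTP Last-Modified header.
    cached_at: timezone-aware datetime of the cached copy (assumed input).
    """
    last_modified = parsedate_to_datetime(last_modified_header)
    delta = last_modified - cached_at
    return max(0.0, delta.total_seconds() / 86400)


# Hypothetical dates: the cached copy was taken on 1 Feb 2006 at noon UTC.
cached = datetime(2006, 2, 1, 12, 0, 0, tzinfo=timezone.utc)
stale = staleness_days("Fri, 03 Feb 2006 12:00:00 GMT", cached)  # changed later
fresh = staleness_days("Mon, 30 Jan 2006 08:00:00 GMT", cached)  # changed before
```

A resource modified after it was cached gets a positive staleness. One modified before the cached date is clamped to zero, so it counts as fresh.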

  29. Cache Staleness • 46% of resources had Last-Modified header • 71% also had cached date • 16% were at least 1 day stale

  30. Similarity vs. Staleness

  31. Internet Archive? How much of the Web is indexed? Estimates from “The Indexable Web is More than 11.5 billion pages” by Gulli and Signorini (WWW’05)

  32. Overlap with Internet Archive

  33. Overlap with Internet Archive

  34. Distribution of Sampled URLs

  35. Problem: WI currently only stores the client-side representation of a website. Server components (scripts, databases, configuration files, etc.) are not accessible from the WI

  36. Outline • What is the Web Infrastructure (WI)? • How can the WI be used for preservation? • Web-repository crawling with Warrick • Understanding the WI • Caching experiment • Reconstruction experiments • Search engine sampling and IA overlap experiment • Recovering web server components from the WI • Brass: Queueing manager for Warrick

  37. Web Server Static files (HTML files, PDFs, images, style sheets, JavaScript, etc.) Web Infrastructure Recoverable config Perl script Dynamic page Database Not Recoverable

  38. Injecting Server Components into Crawlable Pages Erasure codes HTML pages Recover at least m blocks
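
The idea of recovering a server component from a subset of its blocks can be illustrated with the simplest erasure code: m data blocks plus one XOR parity block, so that any m of the m + 1 blocks are enough. This is a swapped-in toy scheme chosen for brevity, not necessarily the code used in the talk. A practical system would more likely use a code that tolerates several lost blocks. All names and the sample data are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def encode(data, m):
    """Split data into m equal blocks (zero-padded) and append one parity block."""
    size = -(-len(data) // m)  # ceiling division
    padded = data.ljust(size * m, b"\0")
    blocks = [padded[i * size:(i + 1) * size] for i in range(m)]
    return blocks + [xor_blocks(blocks)]


def decode(blocks, length):
    """Rebuild the data from m + 1 blocks, at most one of which is None."""
    missing = [i for i, block in enumerate(blocks) if block is None]
    if len(missing) > 1:
        raise ValueError("single parity can only repair one lost block")
    blocks = list(blocks)
    if missing:
        # The XOR of all surviving blocks equals the lost block.
        survivors = [block for block in blocks if block is not None]
        blocks[missing[0]] = xor_blocks(survivors)
    return b"".join(blocks[:-1])[:length]  # drop parity and padding


# Hypothetical server-side file split into 4 blocks; one block is then lost.
secret = b"DB_PASSWORD=example"
pieces = encode(secret, 4)
pieces[2] = None
recovered = decode(pieces, len(secret))
```

Each block could then be embedded in a different crawlable HTML page, so the file survives as long as enough of those pages are cached somewhere in the WI.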

  39. Brass: A Queueing Manager for Warrick • Warrick requires some technical expertise to download, install, and run • Warrick uses search engine APIs which allow limited requests per IP address (or key) • Google no longer provides new keys for accessing their API
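
Because each API key (or IP address) allows only a limited number of requests, a queueing manager has to spread reconstruction jobs over several days. The sketch below shows one simple way to do that with a first-in, first-out queue and a daily budget. It is not Brass's actual design. The class name, the quota of 1000 queries, and the job sizes are all hypothetical.

```python
from collections import deque


class QuotaQueue:
    """FIFO queue of reconstruction jobs limited by a daily query budget."""

    def __init__(self, daily_quota):
        self.daily_quota = daily_quota
        self.jobs = deque()  # each job is [site, queries_remaining]

    def submit(self, site, queries_needed):
        self.jobs.append([site, queries_needed])

    def run_day(self):
        """Spend one day's budget in FIFO order; return the sites that finished."""
        budget = self.daily_quota
        finished = []
        while self.jobs and budget > 0:
            job = self.jobs[0]
            spent = min(budget, job[1])
            job[1] -= spent
            budget -= spent
            if job[1] == 0:
                finished.append(job[0])
                self.jobs.popleft()
        return finished


# Hypothetical workload: two sites that together exceed one day's quota.
queue = QuotaQueue(daily_quota=1000)
queue.submit("site-a.example", 600)
queue.submit("site-b.example", 900)
day1 = queue.run_day()
day2 = queue.run_day()
```

The first site finishes on day one. The second uses the remaining budget and completes on day two, without the user having to rerun anything by hand.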
