
Michael L. Nelson, Frank McCown, Joan A. Smith, Martin Klein Old Dominion University

How much preservation do I get if I do absolutely nothing? Using the Web Infrastructure for Digital Preservation. Michael L. Nelson, Frank McCown, Joan A. Smith, Martin Klein Old Dominion University Norfolk VA, USA {mln,fmccown,jsmit,mklein}@cs.odu.edu Media Production Berlin 2006


Presentation Transcript


  1. How much preservation do I get if I do absolutely nothing? Using the Web Infrastructure for Digital Preservation Michael L. Nelson, Frank McCown, Joan A. Smith, Martin Klein Old Dominion University Norfolk VA, USA {mln,fmccown,jsmit,mklein}@cs.odu.edu Media Production Berlin 2006 Berlin, Germany December 8, 2006 Research supported in part by NSF, Library of Congress and Andrew Mellon Foundation

  2. Preservation: Fortress Model Five Easy Steps for Preservation: • Get a lot of $ • Buy a lot of disks, machines, tapes, etc. • Hire an army of staff • Load a small amount of data • “Look on my archive ye Mighty, and despair!” image from: http://www.itunisie.com/tourisme/excursion/tabarka/images/fort.jpg

  3. Alternate Models of Preservation • Lazy Preservation • Let Google, IA et al. preserve your website • Just-In-Time Preservation • Wait for it to disappear first, then find a “good enough” version • Shared Infrastructure Preservation • Push your content to sites that might preserve it • Web Server Enhanced Preservation • Use Apache modules to create archival-ready resources image from: http://www.proex.ufes.br/arsm/knots_interlaced.htm

  4. Lazy Preservation

  5. Research Questions • How much digital preservation of websites is afforded by lazy preservation? • Can we reconstruct entire websites from the Web Infrastructure (WI)? • What factors contribute to the success of website reconstruction? • Can we predict how much of a lost website can be recovered? • How can the WI be utilized to provide preservation of server-side components?

  6. Warrick: Crawling the Crawlers • Is website reconstruction from the WI feasible? • Web repositories: Google (G), MSN (M), Yahoo (Y), Internet Archive (IA) • Reconstructed 24 websites • How long do search engines keep cached content after it is removed?

  7. SE Caching Experiment • Create HTML, PDF, and image files • Place files on 4 web servers • Remove files on a regular schedule • Examine web server logs to determine when each page is crawled and by whom • Query each search engine daily using a unique identifier to see if it has cached the page or image • Reference: Joan A. Smith, Frank McCown, and Michael L. Nelson. Observed Web Robot Behavior on Decaying Web Subsites. D-Lib Magazine, February 2006, 12(2)
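The log-examination step can be sketched as follows. This is only an illustration, not the scripts used in the experiment: it assumes the servers write Apache "combined" format logs, and the `CRAWLERS` table of user-agent substrings is a hypothetical mapping chosen for the example.

```python
import re
from datetime import datetime

# Apache "combined" log format (an assumption; the slides do not name the format).
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

# Hypothetical user-agent substrings mapped to a crawler label.
CRAWLERS = {
    "Googlebot": "Google",
    "msnbot": "MSN",
    "Slurp": "Yahoo",
    "ia_archiver": "Internet Archive",
}


def crawler_for(agent):
    """Return the crawler label for a user-agent string, or None."""
    lowered = agent.lower()
    for needle, label in CRAWLERS.items():
        if needle.lower() in lowered:
            return label
    return None


def first_crawls(lines):
    """Map (path, crawler) to the earliest request time seen in the log."""
    seen = {}
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        crawler = crawler_for(match.group("agent"))
        if crawler is None:
            continue  # ordinary browser or unknown robot
        when = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
        key = (match.group("path"), crawler)
        if key not in seen or when < seen[key]:
            seen[key] = when
    return seen
```

Comparing these first-crawl times with the file removal schedule gives, per resource and per crawler, how long a page was available before it was visited.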

  8. Caching of HTML Resources - mln

  9. Reconstructing a Website [Diagram labels: Original URL, Web Repo, Warrick, Starting URL, Results page, Cached URL, Retrieved resource, File system, Cached resource] • Pull resources from all web repositories • Strip off extra header and footer HTML • Store most recently cached version or canonical version • Parse HTML for links to other resources
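The steps above can be sketched as a small breadth-first crawl over cached copies. This is not Warrick's code: the `repositories` argument is a hypothetical in-memory stand-in for real repository queries, the link extraction is a simple regular expression, and the header/footer stripping step is omitted.

```python
import re
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse

HREF = re.compile(r'href=["\']([^"\']+)["\']', re.IGNORECASE)


def extract_links(base_url, html):
    """Return absolute, fragment-free URLs found in href attributes."""
    links = []
    for raw in HREF.findall(html):
        absolute, _fragment = urldefrag(urljoin(base_url, raw))
        links.append(absolute)
    return links


def reconstruct(start_url, repositories):
    """Recover one site from several caches, breadth first.

    `repositories` maps a repository name to a dict of
    url -> (cache_date, html). Both are illustrative stand-ins;
    cache_date is assumed to be an ISO date string so that it sorts.
    """
    site = urlparse(start_url).netloc
    recovered = {}
    queue = deque([start_url])
    queued = {start_url}
    while queue:
        url = queue.popleft()
        # Ask every repository and keep the most recently cached copy.
        copies = [
            (repo[url][0], name, repo[url][1])
            for name, repo in repositories.items()
            if url in repo
        ]
        if not copies:
            continue  # not found in any repository
        cache_date, source, html = max(copies)
        recovered[url] = (source, cache_date, html)
        # Follow links that stay on the same site.
        for link in extract_links(url, html):
            if urlparse(link).netloc == site and link not in queued:
                queued.add(link)
                queue.append(link)
    return recovered
```

Resources that no repository holds are simply skipped, which is what later shows up as "missing" when the reconstruction is compared with the original site.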

  10. How Much Did We Reconstruct? [Figure: side-by-side graphs of the “lost” web site and the reconstructed web site; node labels in the figure: A, B, C, D, E, F, G, B’, C’] • Missing link to D; points to old resource G • F can’t be found

  11. Reconstruction Diagram • identical 50% • changed 33% • missing 17% • added 20%

  12. Websites to Reconstruct • Reconstruct 24 sites in 3 categories: 1. small (1-150 resources) 2. medium (150-499 resources) 3. large (500+ resources) • Use Wget to download current website • Use Warrick to reconstruct • Calculate reconstruction vector
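A comparison between the Wget copy and the Warrick copy might be computed along these lines. The slides do not define the reconstruction vector, so the denominators here are an assumption: identical, changed, and missing are taken as fractions of the original site, and added as a fraction of the reconstructed site. The function name and the toy data in the usage note are hypothetical and are not the example from the slides.

```python
def reconstruction_summary(original, reconstructed):
    """Compare two sites given as dicts of url -> content (or content hash).

    Assumed definitions (not stated in the slides):
      identical, changed, missing: fraction of the original resources
      added: fraction of the reconstructed resources
    """
    if not original or not reconstructed:
        raise ValueError("both sites must contain at least one resource")

    identical = changed = missing = 0
    for url, content in original.items():
        if url not in reconstructed:
            missing += 1
        elif reconstructed[url] == content:
            identical += 1
        else:
            changed += 1

    # Resources recovered from the repositories that the current site lacks.
    added = sum(1 for url in reconstructed if url not in original)

    total = len(original)
    return {
        "identical": identical / total,
        "changed": changed / total,
        "missing": missing / total,
        "added": added / len(reconstructed),
    }
```

For instance, calling it with an original site of four resources and a reconstruction that holds one identical copy, one changed copy, and one extra resource reports half of the original as missing and one third of the reconstruction as added.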

  13. Results Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. Reconstructing Websites for the Lazy Webmaster, Technical Report, arXiv cs.IR/0512069, 2005.

  14. Web Repository Contributions

  15. Warrick Milestones • www2006.org – first lost website reconstructed (Nov 2005) • DCkickball.org – first website someone else reconstructed without our help (late Jan 2006) • www.iclnet.org – first website we reconstructed for someone else (mid Mar 2006) • Internet Archive officially “blesses” Warrick (mid Mar 2006); see http://frankmccown.blogspot.com/2006/03/warrick-is-gaining-traction.html

  16. Shared Infrastructure Preservation (slightly less lazy)

  17. Shared, Existing Infrastructure • Can we (re)use existing installed network infrastructure for preservation purposes? Who has the Bigger Fortress?

  18. Research Objective • Premise: use common Internet Protocol implementations to replicate repository contents • Inject the contents of an OAI-PMH repository directly into: • Email (SMTP) • Usenet News (NNTP) • Instrument existing email, news servers • Use mod_oai (www.modoai.org) to do resource harvesting • complex object formats (e.g. MPEG-21 DIDL) used to encode the resources as “lumps of XML” • results are generalizable to any repository system • Analyze testbed, simulate very large collections

  19. Prototype Architecture complex objects

  20. Test Repository • Website with 72 files • HTML, PDF, PNG, JPEG, GIF • 1 KB to 1.5 MB • Used a script to harvest the MPEG-21 DIDLs, and then: • attach to outbound email messages • post to a moderated newsgroup (repository.odu.test1)
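Attaching a harvested DIDL to an outbound message could look roughly like this, using Python's standard `email` package. It is a sketch of the general idea rather than the script used for the testbed: the function names and the choice of a MIME attachment with type `application/xml` are assumptions made for the example.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def attach_didl(msg, didl_xml, filename="record.xml"):
    """Attach a DIDL document to an existing message.

    Passing bytes to add_attachment makes the email package
    base64-encode the part by default.
    """
    msg.add_attachment(
        didl_xml.encode("utf-8"),
        maintype="application",
        subtype="xml",
        filename=filename,
    )
    return msg


def extract_didls(raw_message):
    """Recover attached XML documents from a raw message, as a receiver might."""
    msg = message_from_bytes(raw_message, policy=policy.default)
    return [
        part.get_content().decode("utf-8")
        for part in msg.iter_attachments()
        if part.get_content_type() == "application/xml"
    ]
```

The original message body is left untouched; the message simply becomes multipart, with the DIDL carried as an extra base64 part that a cooperating receiver can decode.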

  21. Email Headers [Figure labels: OAI-PMH & HTTP headers; original email message; base64-encoded DIDL]

  22. News Posting [Figure labels: OAI-PMH & HTTP headers; base64-encoded DIDL]

  23. Simulation Parameters • Repository: 100,000 items; 1 MB/item; 100 daily additions; 400 daily updates • Time: 2000 days (5.5 years) • Email: granularity = 1; follows ODU power law example • News: servers hold contents for 30 days

  24. News Policies

  25. NNTP Results

  26. SMTP Policies • Passive, “piggybacking” on existing email traffic • History list of receiver domains: • not maintained (history pointer off): duplicates are sent • maintained (history pointer on): no duplicates are sent • Granularity filter for emails: • every Gth email will be processed
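One way to read these policies is sketched below. The interpretation is an assumption: with the history pointer on, each receiver domain gets its own position in the record list and so never receives the same record twice; with it off, a record is picked at random and duplicates can occur. The class and method names are hypothetical.

```python
import random


class PiggybackSender:
    """Choose which repository record, if any, rides along on each email."""

    def __init__(self, records, granularity=1, use_history=True, rng=None):
        if granularity < 1:
            raise ValueError("granularity must be at least 1")
        if not records:
            raise ValueError("records must not be empty")
        self.records = list(records)
        self.granularity = granularity
        self.use_history = use_history
        self._rng = rng or random.Random()
        self._emails_seen = 0
        self._pointer = {}  # receiver domain -> index of next record to send

    def record_for(self, recipient):
        """Return the record to attach to this email, or None."""
        self._emails_seen += 1
        # Granularity filter: only every Gth email is processed.
        if self._emails_seen % self.granularity != 0:
            return None
        if not self.use_history:
            # No history: nothing prevents sending a duplicate.
            return self._rng.choice(self.records)
        domain = recipient.rsplit("@", 1)[-1].lower()
        position = self._pointer.get(domain, 0)
        if position >= len(self.records):
            return None  # this domain has already received every record
        self._pointer[domain] = position + 1
        return self.records[position]
```

With `granularity=1`, as in the simulation parameters, every outbound email is a candidate carrier; larger values of G thin out the piggybacked traffic.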

  27. SMTP Results [Figure panels: no history pointer; with history pointer; G = 1]

  28. Summary • Shared Infrastructure Preservation provides a communications channel with unknown, future trading partners • SMTP approach is only feasible for “advertising” the existence of the repository • NNTP approach is promising for holding content • Lazy Preservation has been used to restore several dozen websites • but is it an archival strategy? depends on your tolerance for risk • prediction: search engines will see preservation as a business opportunity
