Thinking Differently About Web Page Preservation Michael L. Nelson, Frank McCown, Joan A. Smith Old Dominion University, Norfolk, VA {mln,fmccown,jsmit}@cs.odu.edu Library of Congress Brown Bag Seminar, June 29, 2006 Research supported in part by NSF, the Library of Congress, and the Andrew W. Mellon Foundation
Background • “We can’t save everything!” • if not “everything”, then how much? • what does “save” mean?
“Women and Children First” HMS Birkenhead, Cape Danger, 1852: 638 passengers, 193 survivors, including all 7 women & 13 children image from: http://www.btinternet.com/~palmiped/Birkenhead.htm
We should probably save a copy of this…
Or maybe we don’t have to… the Wikipedia link is in the top 10, so we’re ok, right?
Surely we’re saving copies of this…
2 copies in the UK, 2 Dublin Core records. That’s probably good enough…
What about the things that we know we don’t need to keep? You DO support recycling, right?
A higher moral calling for pack rats?
Preservation metadata is like a David Hockney Polaroid collage: each image is both true and incomplete, and while the result is not faithful, it does capture the “essence” Lessons Learned from the AIHT (Boring stuff: D-Lib Magazine, December 2005) images from: http://facweb.cs.depaul.edu/sgrais/collage.htm
Preservation: Fortress Model Five Easy Steps for Preservation: • Get a lot of $ • Buy a lot of disks, machines, tapes, etc. • Hire an army of staff • Load a small amount of data • “Look upon my archive, ye Mighty, and despair!” image from: http://www.itunisie.com/tourisme/excursion/tabarka/images/fort.jpg
Alternate Models of Preservation • Lazy Preservation • Let Google, IA et al. preserve your website • Just-In-Time Preservation • Wait for it to disappear first, then recover a “good enough” version • Shared Infrastructure Preservation • Push your content to sites that might preserve it • Web Server Enhanced Preservation • Use Apache modules to create archival-ready resources image from: http://www.proex.ufes.br/arsm/knots_interlaced.htm
Lazy Preservation: “How much preservation do I get if I do nothing?” Frank McCown
Outline: Lazy Preservation • Web Infrastructure as a Resource • Reconstructing Web Sites • Research Focus
[Figure: preservation approaches plotted by the publisher’s cost (time, equipment, knowledge) against coverage of the Web, for both client-view and server-view representations: filesystem backups, Furl/Spurl, browser cache, InfoMonitor, LOCKSS, Hanzo:web, iPROXY, TTApache, web archives, SE caches]
Outline: Lazy Preservation • Web Infrastructure as a Resource • Reconstructing Web Sites • Research Focus
Research Questions • How much digital preservation of websites is afforded by lazy preservation? • Can we reconstruct entire websites from the WI? • What factors contribute to the success of website reconstruction? • Can we predict how much of a lost website can be recovered? • How can the WI be utilized to provide preservation of server-side components?
Prior Work • Is website reconstruction from the WI feasible? • Web repositories: Google, MSN, Yahoo, Internet Archive (G, M, Y, IA) • Web-repository crawler: Warrick • Reconstructed 24 websites • How long do search engines keep cached content after it is removed?
Timeline of SE Resource Acquisition and Release • Vulnerable resource – not yet cached (t_ca is not defined) • Replicated resource – available on the web server and in the SE cache (t_ca < current time < t_r) • Endangered resource – removed from the web server but still cached (t_r < current time < t_cr) • Unrecoverable resource – missing from both the web server and the cache (t_ca < t_cr < current time) Joan A. Smith, Frank McCown, and Michael L. Nelson. Observed Web Robot Behavior on Decaying Web Subsites, D-Lib Magazine, 12(2), February 2006. Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. Reconstructing Websites for the Lazy Webmaster, Technical Report, arXiv cs.IR/0512069, 2005.
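A minimal sketch of how these four states could be assigned from the three timestamps. This is illustrative only: the timestamp names t_ca (cached), t_r (removed from the server), and t_cr (removed from the cache) follow the slide, but the function and its interface are my own.

```python
from enum import Enum

class State(Enum):
    VULNERABLE = "vulnerable"        # on the web server, not yet cached
    REPLICATED = "replicated"        # on the web server and in an SE cache
    ENDANGERED = "endangered"        # gone from the server, still cached
    UNRECOVERABLE = "unrecoverable"  # gone from both the server and the cache

def classify(now, t_ca=None, t_r=None, t_cr=None):
    """Classify a resource from the time it was cached (t_ca), removed from
    the web server (t_r), and purged from the cache (t_cr).
    None means the event has not happened (yet)."""
    if t_ca is None or now < t_ca:
        return State.VULNERABLE
    if t_r is None or now < t_r:
        return State.REPLICATED
    if t_cr is None or now < t_cr:
        return State.ENDANGERED
    return State.UNRECOVERABLE

# Cached on day 3, removed from the server on day 10, purged from the cache on day 40:
print(classify(now=20, t_ca=3, t_r=10, t_cr=40))  # State.ENDANGERED
```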
Cached PDF http://www.fda.gov/cder/about/whatwedo/testtube.pdf [Figure: the canonical PDF alongside the MSN, Yahoo, and Google cached versions]
Web Repository Characteristics (legend) • C – canonical version is stored • M – modified version is stored (modified images are thumbnails; all others are HTML conversions) • ~R – indexed but not retrievable • ~S – indexed but not stored
SE Caching Experiment • Create HTML, PDF, and image resources • Place the files on 4 web servers • Remove the files on a regular schedule • Examine web server logs to determine when each page is crawled and by whom • Query each search engine daily using a unique identifier to see if it has cached the page or image Joan A. Smith, Frank McCown, and Michael L. Nelson. Observed Web Robot Behavior on Decaying Web Subsites, D-Lib Magazine, 12(2), February 2006.
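A rough sketch of the daily cache check. This is illustrative only: the search endpoint below is hypothetical, and the real experiment queried each engine through its own interface; only the idea of searching for a planted unique identifier comes from the slide.

```python
import urllib.parse
import urllib.request

# Hypothetical endpoint, for illustration only.
SEARCH_URL = "https://search.example.com/search?q={query}"

def is_cached(unique_id: str) -> bool:
    """Return True if a page containing the planted identifier shows up in the
    (hypothetical) search engine's results, i.e. the page has been indexed/cached."""
    url = SEARCH_URL.format(query=urllib.parse.quote(unique_id))
    with urllib.request.urlopen(url, timeout=30) as response:
        results_page = response.read().decode("utf-8", errors="replace")
    return unique_id in results_page

# Run once a day (e.g. from cron) and log the result for each test resource:
# print(is_cached("odu-cache-test-resource-042"))
```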
Reconstructing a Website [Figure: Warrick’s recovery loop, from a starting URL through each web repository’s results page and cached URL to a retrieved resource stored on the file system] • Pull resources from all web repositories • Strip off the extra header and footer HTML • Store the most recently cached version or the canonical version • Parse the HTML for links to other resources (a rough sketch of the loop follows)
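A minimal sketch of that loop. This is not Warrick’s actual implementation: the repository lookup and banner-stripping steps are stubbed out as hypothetical functions, and only the overall pull, strip, store, parse cycle comes from the slide.

```python
import re
from collections import deque
from pathlib import Path
from urllib.parse import urljoin

def lookup_in_repositories(url):
    """Hypothetical stub: ask Google, MSN, Yahoo, and the Internet Archive for
    cached copies of `url`; return a list of (html_text, cache_timestamp) pairs."""
    return []

def strip_cache_banner(html):
    """Hypothetical stub: remove the header/footer a search engine wraps
    around a cached page, leaving the original markup."""
    return html

def reconstruct(start_url, out_dir="reconstructed"):
    seen, queue = {start_url}, deque([start_url])
    while queue:
        url = queue.popleft()
        candidates = lookup_in_repositories(url)
        if not candidates:
            continue                          # lost: no repository has a copy
        # Keep the most recently cached copy (or the canonical one, if stored).
        html, _ = max(candidates, key=lambda c: c[1])
        html = strip_cache_banner(html)
        path = Path(out_dir) / re.sub(r"[^\w.-]", "_", url)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(html, encoding="utf-8")
        # Queue links found in the recovered HTML that belong to the same site.
        for href in re.findall(r'href=["\'](.*?)["\']', html, flags=re.I):
            link = urljoin(url, href)
            if link.startswith(start_url) and link not in seen:
                seen.add(link)
                queue.append(link)
```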
How Much Did We Reconstruct? [Diagram: a “lost” web site (resources A–G) compared with its reconstruction; some resources come back identical, B and C come back changed (B’, C’), the link to D is missing and points to an old resource, and F can’t be found]
Reconstruction Diagram: identical 50%, changed 33%, missing 17%, added 20%
Websites to Reconstruct • Reconstruct 24 sites in 3 categories: 1. small (1-150 resources) 2. medium (150-499 resources) 3. large (500+ resources) • Use Wget to download the current website • Use Warrick to reconstruct it • Calculate the reconstruction vector (a sketch of the calculation follows)
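A small sketch of how a reconstruction vector could be computed by comparing the Wget copy with the Warrick copy. The (changed, missing, added) layout follows the percentages in the diagram above; the hashing and file-walking details are my own illustration.

```python
import hashlib
from pathlib import Path

def file_hashes(root):
    """Map each file's path (relative to root) to an MD5 digest of its contents."""
    return {
        p.relative_to(root).as_posix(): hashlib.md5(p.read_bytes()).hexdigest()
        for p in Path(root).rglob("*") if p.is_file()
    }

def reconstruction_vector(original_dir, reconstructed_dir):
    """Return (changed, missing, added): changed and missing as fractions of the
    original site, added as a fraction of the reconstructed site."""
    orig, recon = file_hashes(original_dir), file_hashes(reconstructed_dir)
    changed = sum(1 for f in orig if f in recon and orig[f] != recon[f])
    missing = sum(1 for f in orig if f not in recon)
    added = sum(1 for f in recon if f not in orig)
    return (changed / len(orig), missing / len(orig), added / len(recon))

# e.g. reconstruction_vector("wget_copy/", "warrick_copy/")
```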
Results Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. Reconstructing Websites for the Lazy Webmaster, Technical Report, arXiv cs.IR/0512069, 2005.
Warrick Milestones • www2006.org – first lost website reconstructed (Nov 2005) • DCkickball.org – first website someone else reconstructed without our help (late Jan 2006) • www.iclnet.org – first website we reconstructed for someone else (mid Mar 2006) • Internet Archive officially “blesses” Warrick (mid Mar 2006) [1] [1] http://frankmccown.blogspot.com/2006/03/warrick-is-gaining-traction.html
Outline: Lazy Preservation • Web Infrastructure as a Resource • Reconstructing Web Sites • Research Focus
Proposed Work • How lazy can we afford to be? • Find factors influencing the success of website reconstruction from the WI • Perform search engine cache characterization • Inject server-side components into the WI for complete website reconstruction • Improving the Warrick crawler • Evaluate different crawling policies • Frank McCown and Michael L. Nelson, Evaluation of Crawling Policies for a Web-repository Crawler, ACM Hypertext 2006. • Development of a web-repository API for inclusion in Warrick
Factors Influencing Website Recoverability from the WI • The previous study did not find a statistically significant relationship between recoverability and website size or PageRank • Methodology • Sample a large number of websites from dmoz.org • Perform several reconstructions over time using the same policy • Download the sites several times over the same period to capture change rates
Evaluation • Use statistical analysis to test for the following factors: • Size • Makeup • Path depth • PageRank • Change rate • Create a predictive model – how much of my lost website do I expect to get back?
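Since the predictive model itself was still proposed work, the following is only an illustrative sketch of what one could look like: an ordinary least-squares fit of the recovered fraction against a few of the candidate factors. The feature set and every number below are made up purely so the example runs.

```python
import numpy as np

# Each row: [size, path depth, PageRank, change rate]; y: fraction recovered.
# All values are fabricated for illustration.
X = np.array([
    [120, 2, 4.0, 0.10],
    [480, 3, 5.5, 0.30],
    [900, 5, 6.0, 0.05],
    [150, 2, 3.0, 0.50],
], dtype=float)
y = np.array([0.95, 0.78, 0.85, 0.40])

# Add an intercept column and fit by ordinary least squares.
A = np.hstack([np.ones((len(X), 1)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict_recovery(size, depth, pagerank, change_rate):
    """Predicted fraction of a lost website we expect to get back."""
    x = np.array([1.0, size, depth, pagerank, change_rate])
    return float(np.clip(x @ coef, 0.0, 1.0))

print(predict_recovery(300, 3, 4.5, 0.2))
```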
We can recover the missing page and PDF, but what about the services?
Recovery of Web Server Components • Recovering the client-side representation is not enough to reconstruct a dynamically produced website • How can we inject the server-side functionality into the WI? • Web repositories like HTML: • canonical versions are stored by all web repos • it is text-based • comments can be inserted without changing the appearance of the page • Injection: use erasure codes to break a server file into chunks and insert the chunks into the HTML comments of different pages (a toy sketch follows)
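A toy sketch of the injection idea, purely my own illustration (the slide proposes erasure codes in general, not this particular scheme): split the file into k base64-encoded chunks plus one XOR parity chunk, so any k of the k+1 chunks recover the file, and wrap each chunk in an HTML comment destined for a different page.

```python
import base64

def encode_chunks(data: bytes, k: int = 4):
    """Split data into k equal-size chunks plus one XOR parity chunk.
    This is a (k+1, k) erasure code: any k of the k+1 chunks, together with
    the original length, are enough to rebuild the file."""
    size = -(-len(data) // k)                     # ceiling division
    chunks = [bytearray(data[i * size:(i + 1) * size].ljust(size, b"\0"))
              for i in range(k)]
    parity = bytearray(size)
    for chunk in chunks:
        for i, b in enumerate(chunk):
            parity[i] ^= b
    return [bytes(c) for c in chunks] + [bytes(parity)]

def as_html_comment(chunk: bytes, idx: int, total: int, name: str, length: int) -> str:
    """Wrap one chunk in an HTML comment to be planted in one crawled page.
    The comment format ('recover-chunk ...') is made up for illustration."""
    payload = base64.b64encode(chunk).decode("ascii")
    return (f"<!-- recover-chunk name={name} idx={idx} of={total} "
            f"len={length} data={payload} -->")

# Example: spread a server-side script across 5 pages. If one page is never
# cached, the missing data chunk is the XOR of the other four chunks.
script = b"#!/usr/bin/perl\nprint \"hello\";\n"   # stand-in for a real server file
pieces = encode_chunks(script, k=4)
comments = [as_html_comment(c, i, len(pieces), "search.cgi", len(script))
            for i, c in enumerate(pieces)]
print(comments[0])
```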
Evaluation • Find the most efficient values for n and r (chunks created/recovered) • Security • Develop a simple mechanism for selecting files that can be injected into the WI • Address encryption issues • Reconstruct an EPrints website with a few hundred resources
SE Cache Characterization • Web characterization is an active field • Search engine caches have never been characterized • Methodology • Randomly sample URLs from four popular search engines: Google, MSN, Yahoo, Ask • Download the cached version and the live version from the Web • Examine HTTP headers and page content • Test for overlap with the Internet Archive • Attempt to access various resource types (PDF, Word, PS, etc.) in each SE cache
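A rough sketch of the cached-vs-live comparison step. Illustrative only: the cached-copy URL would come from the search engine’s results page, which is not shown here, and the particular headers and measures chosen are my own.

```python
import hashlib
import urllib.request

def fetch(url):
    """Return (headers, body) for a URL; HTTP errors are left to the caller."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.headers, resp.read()

def compare(live_url, cached_url):
    """Compare a live page with its cached copy: a few headers of interest
    plus whether the two bodies are byte-identical."""
    live_hdrs, live_body = fetch(live_url)
    cache_hdrs, cache_body = fetch(cached_url)
    return {
        "live_last_modified": live_hdrs.get("Last-Modified"),
        "live_content_type": live_hdrs.get("Content-Type"),
        "cached_content_type": cache_hdrs.get("Content-Type"),
        "identical": hashlib.md5(live_body).digest() == hashlib.md5(cache_body).digest(),
        "size_delta": len(cache_body) - len(live_body),
    }

# compare("http://example.com/page.html", "<cached-copy URL taken from the SE results page>")
```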
Summary: Lazy Preservation When this work is completed, we will have… • demonstrated and evaluated the lazy preservation technique • provided a reference implementation • characterized SE caching behavior • provided a layer of abstraction on top of SE behavior (API) • explored how much we store in the WI (server-side vs. client-side representations)
Web Server Enhanced Preservation: “How much preservation do I get if I do just a little bit?” Joan A. Smith