WIDM 2006
Lazy Preservation: Reconstructing Websites by Crawling the Crawlers
Frank McCown, Joan A. Smith, Michael L. Nelson, & Johan Bollen
Old Dominion University, Norfolk, Virginia, USA
Arlington, Virginia, November 10, 2006
Outline • Web page threats • Web Infrastructure • Web caching experiment • Web repository crawling • Website reconstruction experiment
Image credits:
Black hat: http://img.webpronews.com/securitypronews/110705blackhat.jpg
Virus image: http://polarboing.com/images/topics/misc/story.computer.virus_1137794805.jpg
Hard drive: http://www.datarecoveryspecialist.com/images/head-crash-2.jpg
How much of the Web is indexed? Estimates from “The Indexable Web is More than 11.5 billion pages” by Gulli and Signorini (WWW’05)
Cached PDF example: http://www.fda.gov/cder/about/whatwedo/testtube.pdf (canonical version shown alongside the MSN, Yahoo, and Google cached versions)
Web Repository Characteristics • C: canonical version is stored • M: modified version is stored (modified images are thumbnails; all others are HTML conversions) • ~R: indexed but not retrievable • ~S: indexed but not stored
Web Caching Experiment • Create 4 websites composed of HTML, PDF, and image resources • http://www.owenbrau.com/ • http://www.cs.odu.edu/~fmccown/lazy/ • http://www.cs.odu.edu/~jsmit/ • http://www.cs.odu.edu/~mln/lazp/ • Remove pages from each site every day • Query Google, MSN, and Yahoo (GMY) each day using the pages' identifiers (see the sketch below)
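The daily polling step can be pictured with a short sketch. This is not the scripts used in the experiment: the search endpoints, the "q" parameter, and the containment test below are hypothetical placeholders standing in for each engine's real query interface.

```python
# A minimal sketch of the daily polling step, not the scripts used in the
# experiment. Each test page carries a unique identifier string; each day we
# ask Google, MSN, and Yahoo whether that identifier is still findable.
# SEARCH_ENDPOINTS, the "q" parameter, and the containment test are
# hypothetical placeholders for each engine's real query interface.

import datetime
import urllib.parse
import urllib.request

SEARCH_ENDPOINTS = {  # hypothetical endpoints, for illustration only
    "google": "https://search.example/google",
    "msn": "https://search.example/msn",
    "yahoo": "https://search.example/yahoo",
}

def is_indexed(engine: str, identifier: str) -> bool:
    """Return True if the engine's result page still mentions the identifier."""
    url = SEARCH_ENDPOINTS[engine] + "?" + urllib.parse.urlencode({"q": identifier})
    with urllib.request.urlopen(url, timeout=30) as resp:
        return identifier.encode() in resp.read()

def daily_check(identifiers: list[str]) -> dict:
    """Record, for today's date, which engines still return each test page."""
    today = datetime.date.today().isoformat()
    return {
        "date": today,
        "results": {
            ident: {engine: is_indexed(engine, ident) for engine in SEARCH_ENDPOINTS}
            for ident in identifiers
        },
    }
```

Running daily_check once per day over the removed pages' identifiers yields the cache-lifetime data the experiment tracks.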
Warrick • First developed in fall of 2005 • Available for download at http://www.cs.odu.edu/~fmccown/warrick/ • www2006.org – first lost website reconstructed (Nov 2005) • DCkickball.org – first website someone else reconstructed without our help (late Jan 2006) • www.iclnet.org – first website we reconstructed for someone else (mid Mar 2006) • Internet Archive officially endorses Warrick (mid Mar 2006)
How Much Did We Reconstruct?
[Diagram: the "lost" web site (resources A–F) beside the reconstructed web site; annotations note a missing link to D that points to an old resource, and that F can't be found]
Four categories of recovered resources: 1) Identical: A, E 2) Changed: B, C 3) Missing: D, F 4) Added: G
(A classification sketch follows below.)
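To make the four categories concrete, here is a minimal sketch of how recovered resources could be sorted into them. It assumes each site is summarized as a dict mapping resource URI to a content hash, and it is an illustration rather than Warrick's actual code.

```python
# A minimal sketch (not Warrick's actual code) of sorting recovered resources
# into the four categories above. Each site is summarized as a dict mapping
# resource URI -> content hash; the md5 choice is illustrative.

import hashlib

def content_hash(data: bytes) -> str:
    """Hash a resource's bytes; used when building the URI -> hash dicts."""
    return hashlib.md5(data).hexdigest()

def classify(lost: dict, reconstructed: dict):
    """Return (identical, changed, missing, added) sets of resource URIs."""
    identical = {u for u in lost if reconstructed.get(u) == lost[u]}   # e.g. A, E
    changed = {u for u in lost if u in reconstructed} - identical      # e.g. B, C
    missing = set(lost) - set(reconstructed)                           # e.g. D, F
    added = set(reconstructed) - set(lost)                             # e.g. G
    return identical, changed, missing, added
```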
Reconstruction Diagram: identical 50%, changed 33%, missing 17%, added 20%
Reconstruction Experiment • Crawl and reconstruct 24 sites of various sizes: 1. small (1-150 resources) 2. medium (151-499 resources) 3. large (500+ resources) • Perform 5 reconstructions for each website • One using all four repositories together • Four using each repository separately • Calculate a reconstruction vector for each reconstruction: (changed%, missing%, added%); see the sketch below
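A reconstruction vector can then be computed from the category sets produced above. The normalization here is my reading, not a quotation from the slides: changed% and missing% are taken relative to the original site's resource count, and added% relative to the reconstructed site's.

```python
# A minimal sketch of the per-reconstruction vector. The normalization is an
# assumption: changed% and missing% relative to the original site's resource
# count, added% relative to the reconstructed site's.

def reconstruction_vector(identical: set, changed: set, missing: set, added: set):
    """Return (changed%, missing%, added%) for one reconstruction."""
    original_size = len(identical) + len(changed) + len(missing)
    reconstructed_size = len(identical) + len(changed) + len(added)
    return (
        100.0 * len(changed) / original_size,
        100.0 * len(missing) / original_size,
        100.0 * len(added) / reconstructed_size,
    )

# Example: 5 identical, 3 changed, 2 missing, and 2 added resources
print(reconstruction_vector(set("abcde"), set("fgh"), set("ij"), set("xy")))
# -> (30.0, 20.0, 20.0)
```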
Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. Reconstructing Websites for the Lazy Webmaster, Technical Report, arXiv cs.IR/0512069, 2005.
Current & Future Work • Building a web interface for Warrick • Currently crawling & reconstructing 300 randomly sampled websites each week • Move from a descriptive model to a prescriptive & predictive model • Injecting server-side functionality into the WI (web infrastructure) • Recover the PHP code, not just the HTML