Just-In-Time Recovery of Missing Web Pages Hypertext 2006 Odense, Denmark August 25, 2006 Terry L. Harrison & Michael L. Nelson Old Dominion University Norfolk VA, USA
Preservation: Fortress Model
Five Easy Steps for Preservation:
• Get a lot of $
• Buy a lot of disks, machines, tapes, etc.
• Hire an army of staff
• Load a small amount of data
• “Look upon my archive ye Mighty, and despair!”
image from: http://www.itunisie.com/tourisme/excursion/tabarka/images/fort.jpg
Alternate Models of Preservation
• Lazy Preservation
  • Let Google, IA et al. preserve your website
• Just-In-Time Preservation
  • Find a “good enough” replacement web page
• Shared Infrastructure Preservation
  • Push your content to sites that might preserve it
• Web Server Enhanced Preservation
  • Use Apache modules to create archival-ready resources
image from: http://www.proex.ufes.br/arsm/knots_interlaced.htm
Outline
• The 404 problem
• Component technologies
  • web infrastructure
  • lexical signatures
  • OAI-PMH
• Opal
  • architectural description
  • analysis
404 Problem
• Kahle (97) - Average page lifetime 44 days
• Koehler (99, 04) - 67% URLs lost in 4 years
• Lawrence et al. (01) - 23%-53% URLs in CiteSeer papers invalid over 5 year span (3% of invalid URLs “unfindable”)
• Spinellis (03) - 27% URLs in CACM/Computer papers gone in 5 years
• Chan et al. (03) - 11 year half-life for URLs in D-Lib Magazine articles
• Nelson & Allen (02) - 3% objects in digital library gone in 1 year
Examples:
• ECDL 1999 - “good enough” page available
• PSP 2003 - exact copy at new URL
• Greynet 99 - unavailable at any URL?
Lexical Signatures
• “Robust Hyperlinks Cost Just Five Words Each”
  • Phelps & Wilensky (2000)
  • http://www.cs.odu.edu/~tharriso/?lex-sig=terry+harrison+thesis+jcdl+awarded
• “Analysis of Lexical Signatures for Improving Information Presence on the World Wide Web”
  • Park et al. (2004)
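The example URL on this slide carries a five-word lexical signature in a `lex-sig` query parameter. A minimal Python sketch of how such a link could be built and read back is shown here; the function names `robust_hyperlink` and `extract_signature` are hypothetical and are not taken from Phelps & Wilensky or from Opal.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def robust_hyperlink(url: str, signature_terms: list[str]) -> str:
    """Append a lexical signature to a URL as a `lex-sig` query parameter."""
    query = urlencode({"lex-sig": " ".join(signature_terms)})
    # Use "&" if the URL already has a query string, "?" otherwise.
    separator = "&" if urlsplit(url).query else "?"
    return f"{url}{separator}{query}"


def extract_signature(url: str) -> list[str]:
    """Recover the signature terms from a link, or [] if none are present."""
    values = parse_qs(urlsplit(url).query).get("lex-sig", [])
    return values[0].split() if values else []


link = robust_hyperlink(
    "http://www.cs.odu.edu/~tharriso/",
    ["terry", "harrison", "thesis", "jcdl", "awarded"],
)
print(link)
print(extract_signature(link))
```

If the page later moves, the terms recovered by `extract_signature` could be submitted to a search engine as a query to look for the page at its new location.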
OAI-PMH
• Data Providers / Repositories: “A repository is a network accessible server that can process the 6 OAI-PMH requests … A repository is managed by a data provider to expose metadata to harvesters.”
• Service Providers / Harvesters: “A harvester is a client application that issues OAI-PMH requests. A harvester is operated by a service provider as a means of collecting metadata from repositories.”
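The six OAI-PMH requests are issued as HTTP requests with a `verb` argument. A small Python sketch of how a harvester might assemble such a request URL follows; the base URL `http://opal.foo.edu/oai` and the helper name `oai_request_url` are assumptions for illustration only.

```python
from urllib.parse import urlencode

# The six request types defined by OAI-PMH 2.0.
OAI_PMH_VERBS = (
    "Identify",
    "ListMetadataFormats",
    "ListSets",
    "ListIdentifiers",
    "ListRecords",
    "GetRecord",
)


def oai_request_url(base_url: str, verb: str, **arguments: str) -> str:
    """Build the URL for one OAI-PMH request (hypothetical helper)."""
    if verb not in OAI_PMH_VERBS:
        raise ValueError(f"not an OAI-PMH verb: {verb}")
    return f"{base_url}?{urlencode({'verb': verb, **arguments})}"


# Hypothetical repository endpoint; oai_dc is the Dublin Core format
# that every OAI-PMH repository is required to support.
url = oai_request_url("http://opal.foo.edu/oai", "ListRecords", metadataPrefix="oai_dc")
print(url)
```

A harvester would fetch this URL and parse the XML response; that step is omitted here.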
OAI-PMH Aggregators
• aggregators allow for:
  • scalability for OAI-PMH
  • load balancing
  • community building
  • discovery
[Diagram: data providers (repositories), service providers (harvesters), aggregator]
Observations
• One reason why the original Phelps & Wilensky vision was never realized is that it required a priori LS calculation
  • idea: use the Web Infrastructure to calculate LSs as they are needed
• Mass adoption of a system will occur only if it is really, really easy to do so
  • idea: digital preservation systems should require only a small number of “heroes”
Description & Use Cases
• Allow many web servers to use a few Opal servers that use the caches of the Web Infrastructure to generate Lexical Signatures of recently 404 URLs to find either:
  • the same page at a new URL
    • example: bookmarked colleague is now 404
    • cached info is not useful
    • similar pages probably not useful
  • a “good enough” replacement page
    • example: bookmarked recipe is now 404
    • cached info is useful
    • similar pages probably useful
Opal Configuration: “Configure Two Things”
• edit httpd.conf
• add / edit custom 404 page
Opal High-Level Architecture
[Diagram: Interactive User, www.bar.org, opal.foo.edu]
1. Get URL X
2. Custom 404 page
3. Pagetag redirects User to Opal server
4. Opal searches WI caches; creates LS
5. Opal gives user navigation options
Locating Caches
• http://www.google.com/search?hl=en&ie=ISO-8859-1&q=http://www.cs.odu.edu/~tharriso
• http://search.yahoo.com/search?fr=FP-pull-web-t&ei=UTF8&p=http://www.cs.odu.edu/~tharriso
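Both lookups on this slide follow the same pattern: the missing URL is passed as the search term to a search engine. A Python sketch of building those lookup URLs is given here, assuming the endpoints and fixed parameters exactly as they appear on the slide (they date from 2006 and may no longer behave the same way). The names `SEARCH_ENDPOINTS` and `cache_lookup_urls` are hypothetical.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint, name of the query parameter, and fixed parameters, copied from
# the two example URLs on the slide.
SEARCH_ENDPOINTS = {
    "google": ("http://www.google.com/search", "q", {"hl": "en", "ie": "ISO-8859-1"}),
    "yahoo": ("http://search.yahoo.com/search", "p", {"fr": "FP-pull-web-t", "ei": "UTF8"}),
}


def cache_lookup_urls(missing_url: str) -> dict[str, str]:
    """Return one search URL per engine, using the missing URL as the query."""
    lookups = {}
    for name, (endpoint, query_key, fixed) in SEARCH_ENDPOINTS.items():
        params = {**fixed, query_key: missing_url}
        lookups[name] = f"{endpoint}?{urlencode(params)}"
    return lookups


lookups = cache_lookup_urls("http://www.cs.odu.edu/~tharriso")
for engine, lookup in lookups.items():
    print(engine, lookup)
```

Unlike the slide, `urlencode` percent-encodes the missing URL inside the query string, which is the safer form when the missing URL itself contains `&` or `?`.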
WI Caches Last 7-51 days*
• IA caches forever, but:
  • may not ever crawl you
  • ~12 month latency
  • no internal backups
* Frank McCown, Joan A. Smith, Michael L. Nelson, Johan Bollen, Reconstructing Websites for the Lazy Webmaster, arXiv cs.IR/0512069, 2005. http://arxiv.org/abs/cs.IR/0512069
Term Frequency Inverse Document Frequency
• Calculating Term Frequency is easy
  • frequency of term in this document
• Calculating Document Frequency is hard
  • frequency of term in all documents
  • assumes knowledge of entire corpus!
• “Good terms” appear:
  • frequently in a single document
  • infrequently across all documents
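The slide does not give a formula, so the Python sketch below uses one common TF-IDF weighting, tf × log(N / df), over a tiny in-memory corpus. The corpus, the function names, and the choice of weighting are all assumptions for illustration; a real system does not have the whole corpus at hand, which is exactly the difficulty the slide points out.

```python
import math
from collections import Counter


def tf_idf(document: list[str], corpus: list[list[str]]) -> dict[str, float]:
    """Score each term of `document` as tf * log(N / df) over `corpus`."""
    term_frequency = Counter(document)
    corpus_size = len(corpus)
    scores = {}
    for term, tf in term_frequency.items():
        # Document frequency: number of documents containing the term.
        # Floor at 1 so a document outside the corpus cannot divide by zero.
        df = max(1, sum(1 for doc in corpus if term in doc))
        scores[term] = tf * math.log(corpus_size / df)
    return scores


def lexical_signature(document: list[str], corpus: list[list[str]], size: int = 5) -> list[str]:
    """Pick the `size` highest-scoring terms; ties are broken alphabetically."""
    scores = tf_idf(document, corpus)
    return sorted(scores, key=lambda term: (-scores[term], term))[:size]


# Toy corpus (assumed data, not from the presentation).
corpus = [
    ["terry", "harrison", "thesis", "jcdl", "the", "the"],
    ["the", "web", "page"],
    ["the", "jcdl", "paper"],
]
scores = tf_idf(corpus[0], corpus)
signature = lexical_signature(corpus[0], corpus, size=3)
print(signature)
```

In this toy corpus "the" occurs most often in the first document but appears in every document, so its score is zero; the terms unique to that document rank highest.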
Scraping Google to Approximate DF • Frequency of term across all documents: • How many documents?
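Once a search engine reports how many documents match a term, and some estimate of the total number of indexed documents is chosen, an approximate IDF follows directly. The Python sketch below takes both numbers as inputs and does no scraping; the function name `approximate_idf`, the log(N / df) form, and the figures used in the example are assumptions, not values from the presentation.

```python
import math


def approximate_idf(hit_count: int, corpus_size: int) -> float:
    """Estimate IDF as log(corpus_size / df), with df taken from a hit count."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    # Clamp df into [1, corpus_size]: zero hits would divide by zero, and a
    # reported count above the assumed corpus size would give a negative IDF.
    df = min(max(hit_count, 1), corpus_size)
    return math.log(corpus_size / df)


# Assumed numbers for illustration only.
assumed_corpus_size = 1_000_000
rare = approximate_idf(1_000, assumed_corpus_size)
common = approximate_idf(900_000, assumed_corpus_size)
print(rare, common)
```

The estimate is only as good as the assumed corpus size, which is the open question the slide raises.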
GUI (cont)
<url:similarURL datestamp="2005-05-13" votes="1"
simURL="http://www.cs.odu.edu/~tharriso/" baseURL="http://invivo_test.com">
<![CDATA[<p class=g>
<a href="javascript:popUp('demo_dev.pl?method=vote&url=http://www.cs.odu.edu/~tharriso
&match=http://www.cs.odu.edu/~tharriso/')">
<b>Terry</b> <b>Harrison</b> Profile Page</a><br><font size=-1>Burning Man Images Other Images
(not really well sorted, sorry!) Email <b>Terry</b> <b>...</b><br>
(May 2003), AR Zipf Fellowship <b>Awarded</b> to <b>Terry</b> <b>Harrison</b> - Press Release
<b>...</b><br><font color=#008000>www.cs.odu.edu/~tharriso/ - 12k - </font></font>]]>
</url:similarURL>
Opal Server Databases
• URL database
  • 404 URL (LS, similarURL1, similarURL2, …, similarURLN)
  • similarURL (URL, datestamp, votes, Opal server)
• Term database
  • term (Opal server, source, datestamp, DF, corpus size, IDF)
Define each URL and Term as OAI-PMH Records and we can harvest what an Opal server has “learned”
- can accommodate late arrivers (no “cold start” for them)
- pool the learning of multiple servers
- incentives to cooperate
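The record layouts listed on this slide can be pictured as three small record types. The Python sketch below mirrors the listed fields; the class names, field types, and the `ranked` helper (ordering candidates by vote count) are assumptions, since the presentation does not show the actual schema or ranking rule.

```python
from dataclasses import dataclass, field


@dataclass
class SimilarURL:
    """One candidate replacement: (URL, datestamp, votes, Opal server)."""
    url: str
    datestamp: str
    votes: int
    opal_server: str


@dataclass
class MissingURL:
    """A 404 URL with its lexical signature and candidate replacements."""
    url: str
    lexical_signature: list[str]
    similar_urls: list[SimilarURL] = field(default_factory=list)

    def ranked(self) -> list[SimilarURL]:
        # Assumed ranking rule: most user votes first.
        return sorted(self.similar_urls, key=lambda s: s.votes, reverse=True)


@dataclass
class TermRecord:
    """One learned term: (Opal server, source, datestamp, DF, corpus size, IDF)."""
    term: str
    opal_server: str
    source: str
    datestamp: str
    df: int
    corpus_size: int
    idf: float


# Example values: the first candidate reuses the sample record from the GUI
# slide; the second candidate and its vote count are made up.
missing = MissingURL(
    url="http://invivo_test.com",
    lexical_signature=["terry", "harrison", "thesis", "jcdl", "awarded"],
    similar_urls=[
        SimilarURL("http://www.cs.odu.edu/~tharriso/", "2005-05-13", 1, "opal.foo.edu"),
        SimilarURL("http://www.example.org/other", "2005-05-14", 3, "opal.foo.edu"),
    ],
)
print([candidate.url for candidate in missing.ranked()])
```

Because each record carries its originating Opal server and a datestamp, the same fields map naturally onto harvestable records with per-record provenance.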
Opal Synchronization
• Other architectures possible
• Harvesting frequency determined by individual nodes
[Diagram: Group 1: Opal A, Opal B, Opal C, Opal D*; Group 2: Opal D.1, Opal D.2, Opal D.3; labels: Terms, URLs]
* Opal D aggregates D.1-D.3 to Group 1
* Opal D aggregates A-C to Group 2
Connection Costs
• Cost_cache = (WI * N) + R
  • WI = # of web infrastructure caches
  • N = connections for each WI
  • R = connection to get a datestamp
• Cost_paths = R_c + T + R_l
  • R_c = connections to get a cached copy
  • T = connections required for each term
  • R_l = connections to use LS
Example: Cost_cache = 3*1 + 1 = 4; Cost_paths = 1 + T + 1
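The two cost formulas on this slide are simple enough to check by hand, and the Python sketch below just encodes them. The function names are hypothetical, and the value T = 5 in the second example is an assumption (one connection for each term of a five-word signature); the slide itself leaves T open.

```python
def cost_cache(wi: int, n: int, r: int) -> int:
    """Connections to check the caches: (WI * N) + R."""
    return wi * n + r


def cost_paths(rc: int, t: int, rl: int) -> int:
    """Connections to build and use a lexical signature: Rc + T + Rl."""
    return rc + t + rl


# Values from the slide: 3 caches, 1 connection each, 1 datestamp lookup.
print(cost_cache(wi=3, n=1, r=1))

# Assumed T = 5: one connection per term of a five-word signature.
print(cost_paths(rc=1, t=5, rl=1))
```

Under that assumption, generating and using a signature costs 7 connections, against 4 for checking the caches.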
Analysis - Cumulative Terms Learned
[Plot: 1 million terms, 30,000 documents; results averaged over 100 iterations]
Analysis - Terms Learned Per Document
[Plot: 1 million terms, 30,000 documents; results averaged over 100 iterations]
Future Work
• Testing on departmental server
  • hard to test in-the-small
• Code optimizations
  • many short cuts taken for demo system
  • G & Y APIs not used; screen scraping only
• Lexical Signatures
  • describe changes over time
• IDF calculation metrics
  • is scraping Google valid? is it nice?
• Learning new code
  • use OAI-PMH to update the system
• OpenURL resolver
  • 404 URL = referent
Conclusions
• Lexical signatures can be generated just-in-time from WI caches as pages disappear
• Many web servers can be easily configured to use a single Opal server
• Multiple Opal servers can harvest each other to learn Terms and URLs more quickly