
Just-In-Time Recovery of Missing Web Pages

Hypertext 2006, Odense, Denmark, August 25, 2006. Terry L. Harrison & Michael L. Nelson, Old Dominion University, Norfolk, VA, USA.


Presentation Transcript


  1. Just-In-Time Recovery of Missing Web Pages
     Hypertext 2006, Odense, Denmark, August 25, 2006
     Terry L. Harrison & Michael L. Nelson
     Old Dominion University, Norfolk, VA, USA

  2. Preservation: Fortress Model
     Five Easy Steps for Preservation:
     • Get a lot of $
     • Buy a lot of disks, machines, tapes, etc.
     • Hire an army of staff
     • Load a small amount of data
     • “Look upon my archive ye Mighty, and despair!”
     image from: http://www.itunisie.com/tourisme/excursion/tabarka/images/fort.jpg

  3. Alternate Models of Preservation
     • Lazy Preservation
       • Let Google, IA et al. preserve your website
     • Just-In-Time Preservation
       • Find a “good enough” replacement web page
     • Shared Infrastructure Preservation
       • Push your content to sites that might preserve it
     • Web Server Enhanced Preservation
       • Use Apache modules to create archival-ready resources
     image from: http://www.proex.ufes.br/arsm/knots_interlaced.htm

  4. Outline
     • The 404 problem
     • Component technologies
       • web infrastructure
       • lexical signatures
       • OAI-PMH
     • Opal
       • architectural description
       • analysis

  5. 404 Problem
     • Kahle (97) - Average page lifetime 44 days
     • Koehler (99, 04) - 67% URLs lost in 4 years
     • Lawrence et al. (01) - 23%-53% URLs in CiteSeer papers invalid over 5 year span (3% of invalid URLs “unfindable”)
     • Spinellis (03) - 27% URLs in CACM/Computer papers gone in 5 years
     • Chan et al. (03) - 11 year half-life for URLs in D-Lib Magazine articles
     • Nelson & Allen (02) - 3% objects in digital library gone in 1 year
     Examples shown:
     • ECDL 1999: “good enough” page available
     • PSP 2003: exact copy at new URL
     • Greynet 99: unavailable at any URL?

  6. Web Infrastructure: Refreshing & Migrating

  7. Lexical Signatures
     • “Robust Hyperlinks Cost Just Five Words Each”
       • Phelps & Wilensky (2000)
       http://www.cs.odu.edu/~tharriso/?lex-sig=terry+harrison+thesis+jcdl+awarded
     • “Analysis of Lexical Signatures for Improving Information Presence on the World Wide Web”
       • Park et al. (2004)
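The example URL on this slide carries a five-term lexical signature in a `lex-sig` query parameter. A minimal sketch of how such a robust hyperlink could be built and read back, assuming the signature terms are already known (the function names `robust_url` and `extract_signature` are illustrative, not taken from the presentation):

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def robust_url(base_url: str, signature_terms: list[str]) -> str:
    """Append a lexical signature to a URL as a `lex-sig` query parameter."""
    # urlencode joins the space-separated terms with '+', as on the slide.
    query = urlencode({"lex-sig": " ".join(signature_terms)})
    return f"{base_url}?{query}"


def extract_signature(url: str) -> list[str]:
    """Recover the signature terms from a robust hyperlink (empty if absent)."""
    params = parse_qs(urlsplit(url).query)
    return params.get("lex-sig", [""])[0].split()


url = robust_url(
    "http://www.cs.odu.edu/~tharriso/",
    ["terry", "harrison", "thesis", "jcdl", "awarded"],
)
print(url)
print(extract_signature(url))
```

If the page later moves, the recovered terms can be submitted to a search engine as an ordinary query to look for the page at its new location.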

  8. OAI-PMH
     • Data Providers / Repositories
       “A repository is a network accessible server that can process the 6 OAI-PMH requests … A repository is managed by a data provider to expose metadata to harvesters.”
     • Service Providers / Harvesters
       “A harvester is a client application that issues OAI-PMH requests. A harvester is operated by a service provider as a means of collecting metadata from repositories.”
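The six OAI-PMH requests mentioned in the quote are the protocol verbs Identify, ListMetadataFormats, ListSets, ListIdentifiers, ListRecords, and GetRecord. A rough sketch of how a harvester might form request URLs follows; the base URL `http://opal.foo.edu/oai` is a hypothetical endpoint chosen for illustration, and no network call is made:

```python
from typing import Optional
from urllib.parse import urlencode

# The six request types defined by OAI-PMH 2.0.
OAI_PMH_VERBS = (
    "Identify",
    "ListMetadataFormats",
    "ListSets",
    "ListIdentifiers",
    "ListRecords",
    "GetRecord",
)


def oai_request(base_url: str, verb: str, args: Optional[dict] = None) -> str:
    """Build an OAI-PMH request URL for one of the six protocol verbs."""
    if verb not in OAI_PMH_VERBS:
        raise ValueError(f"not an OAI-PMH verb: {verb}")
    # A dict is used for arguments because `from` is a Python keyword.
    params = {"verb": verb, **(args or {})}
    return f"{base_url}?{urlencode(params)}"


BASE = "http://opal.foo.edu/oai"  # hypothetical repository endpoint
print(oai_request(BASE, "Identify"))
print(oai_request(BASE, "ListRecords",
                  {"metadataPrefix": "oai_dc", "from": "2005-05-13"}))
```

The `from` argument is what makes incremental harvesting cheap: a harvester only asks for records changed since its last visit.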

  9. OAI-PMH Aggregators
     • aggregators allow for:
       • scalability for OAI-PMH
       • load balancing
       • community building
       • discovery
     [diagram labels: data providers (repositories), aggregator, service providers (harvesters)]

  10. Observations
      • One reason why the original Phelps & Wilensky vision was never realized is that it required a priori LS calculation
        • idea: use the Web Infrastructure to calculate LSs as they are needed
      • Mass adoption of a system will occur only if it is really, really easy to do so
        • idea: digital preservation systems should require only a small number of “heroes”

  11. Description & Use Cases
      • Allow many web servers to use a few Opal servers that use the caches of the Web Infrastructure to generate Lexical Signatures of recently 404 URLs to find either:
        • the same page at a new URL
          • example: bookmarked colleague is now 404
          • cached info is not useful
          • similar pages probably not useful
        • a “good enough” replacement page
          • example: bookmarked recipe is now 404
          • cached info is useful
          • similar pages probably useful

  12. Opal Configuration: “Configure Two Things”
      • edit httpd.conf
      • add / edit custom 404 page

  13. Opal High-Level Architecture
      1. Interactive User requests URL X from www.bar.org
      2. www.bar.org returns its custom 404 page
      3. Pagetag redirects User to Opal server (opal.foo.edu)
      4. Opal searches WI caches; creates LS
      5. Opal gives user navigation options
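Steps 4 and 5 of this architecture can be sketched as a small function: look for a cached copy in each Web Infrastructure cache, derive a lexical signature from the first copy found, and return navigation options. This is only an illustrative outline. The cache lookups, signature builder, and search function are passed in as stand-ins, and every name here (`recover`, `make_signature`, `search`) is an assumption rather than part of Opal itself:

```python
from typing import Callable, Optional

CacheLookup = Callable[[str], Optional[str]]  # URL -> cached text, or None


def recover(
    missing_url: str,
    caches: dict[str, CacheLookup],
    make_signature: Callable[[str], list[str]],
    search: Callable[[list[str]], list[str]],
) -> Optional[dict]:
    """Return navigation options for a 404 URL, or None if no cache has it."""
    for name, lookup in caches.items():
        cached = lookup(missing_url)
        if cached is None:
            continue  # this cache has no copy; try the next one
        signature = make_signature(cached)
        return {
            "cache": name,
            "cached_copy": cached,
            "signature": signature,
            "similar_urls": search(signature),
        }
    return None


# Stand-in components so the sketch runs without any network access.
caches = {
    "google": lambda url: None,
    "yahoo": lambda url: "terry harrison thesis jcdl awarded",
}
result = recover(
    "http://www.bar.org/X",
    caches,
    make_signature=lambda text: text.split()[:5],
    search=lambda sig: ["http://www.bar.org/new/X"],
)
print(result)
```

Passing the components in as callables keeps the control flow testable while leaving open how each cache is actually queried.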

  14. Locating Caches
      http://www.google.com/search?hl=en&ie=ISO-8859-1&q=http://www.cs.odu.edu/~tharriso
      http://search.yahoo.com/search?fr=FP-pull-web-t&ei=UTF8&p=http://www.cs.odu.edu/~tharriso
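The two query URLs above differ only in host and parameter names, so they can be generated from the missing URL. The sketch below copies the parameters from the slide; whether these endpoints still respond the same way today is not something the slide tells us, and the helper names are illustrative:

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def google_query(url: str) -> str:
    """Search Google for a URL, using the parameters shown on the slide."""
    params = {"hl": "en", "ie": "ISO-8859-1", "q": url}
    return "http://www.google.com/search?" + urlencode(params)


def yahoo_query(url: str) -> str:
    """Search Yahoo for a URL, using the parameters shown on the slide."""
    params = {"fr": "FP-pull-web-t", "ei": "UTF8", "p": url}
    return "http://search.yahoo.com/search?" + urlencode(params)


target = "http://www.cs.odu.edu/~tharriso"
print(google_query(target))
print(yahoo_query(target))
```

Unlike the slide, `urlencode` percent-encodes the `:` and `/` characters inside the query value; the two forms decode to the same target URL.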

  15. Internet Archive

  16. WI Caches Last 7-51 days*
      • IA caches forever, but:
        • may not ever crawl you
        • ~12 month latency
        • no internal backups
      * Frank McCown, Joan A. Smith, Michael L. Nelson, Johan Bollen, Reconstructing Websites for the Lazy Webmaster, arXiv cs.IR/0512069, 2005. http://arxiv.org/abs/cs.IR/0512069

  17. Term Frequency × Inverse Document Frequency
      • Calculating Term Frequency is easy
        • frequency of term in this document
      • Calculating Document Frequency is hard
        • frequency of term in all documents
        • assumes knowledge of entire corpus!
      • “Good terms” appear:
        • frequently in a single document
        • infrequently across all documents
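The scoring described on this slide can be illustrated with a small TF × IDF calculation. This is a sketch under stated assumptions: the document-frequency table and corpus size are made-up illustrative numbers, IDF is taken as log(corpus size / DF), and terms missing from the table are assumed to occur in one document. The presentation does not specify its exact IDF formula.

```python
import math
from collections import Counter


def lexical_signature(
    text: str, doc_freq: dict[str, int], corpus_size: int, k: int = 5
) -> list[str]:
    """Return the k terms with the highest TF x IDF score."""
    tf = Counter(text.lower().split())  # term frequency in this document
    scores = {
        # Assumption: unseen terms are treated as appearing in one document.
        term: count * math.log(corpus_size / doc_freq.get(term, 1))
        for term, count in tf.items()
    }
    # Sort by descending score; break ties alphabetically for stable output.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Hypothetical document frequencies in a corpus of 1000 documents.
doc_freq = {"the": 900, "on": 800, "web": 300,
            "thesis": 20, "terry": 10, "harrison": 5}
text = "terry harrison thesis on the web the thesis"
print(lexical_signature(text, doc_freq, corpus_size=1000, k=3))
# → ['thesis', 'harrison', 'terry']
```

Common words such as "the" score near zero despite appearing twice, which is the behaviour the “good terms” bullets describe.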

  18. Scraping Google to Approximate DF • Frequency of term across all documents: • How many documents?

  19. GUI - Bootstrapping

  20. GUI - Learned

  21. GUI (cont)
      <url:similarURL datestamp="2005-05-13" votes="1"
      simURL="http://www.cs.odu.edu/~tharriso/" baseURL="http://invivo_test.com">
      <![CDATA[<p class=g>
      <a href="javascript:popUp('demo_dev.pl?method=vote&url=http://www.cs.odu.edu/~tharriso
      &match=http://www.cs.odu.edu/~tharriso/')">
      <b>Terry</b> <b>Harrison</b> Profile Page</a><br><font size=-1>Burning Man Images Other Images
      (not really well sorted, sorry!) Email <b>Terry</b> <b>...</b><br>
      (May 2003), AR Zipf Fellowship <b>Awarded</b> to <b>Terry</b> <b>Harrison</b> - Press Release
      <b>...</b><br><font color=#008000>www.cs.odu.edu/~tharriso/ - 12k - </font></font>]]>
      </url:similarURL>

  22. Opal Server Databases
      • URL database
        • 404 URL → (LS, similarURL1, similarURL2, …, similarURLN)
        • similarURL → (URL, datestamp, votes, Opal server)
      • Term database
        • term → (Opal server, source, datestamp, DF, corpus size, IDF)
      Define each URL and Term as OAI-PMH Records and we can harvest what an Opal server has “learned”
      - can accommodate late arrivers (no “cold start” for them)
      - pool the learning of multiple servers
      - incentives to cooperate
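The two databases on this slide map naturally onto simple record types, and harvesting amounts to merging another server's records into the local store. The sketch below is an illustration only: the class and field names are invented, IDF is derived as log(corpus size / DF) instead of being stored, and the merge rule (keep the record with the newer datestamp) is an assumption the slides do not state.

```python
import math
from dataclasses import dataclass, field


@dataclass
class SimilarURL:
    url: str
    datestamp: str  # ISO date, e.g. "2005-05-13"
    votes: int
    opal_server: str


@dataclass
class MissingURL:
    url: str
    lexical_signature: list[str]
    similar: list[SimilarURL] = field(default_factory=list)


@dataclass
class TermRecord:
    term: str
    opal_server: str
    source: str
    datestamp: str
    df: int
    corpus_size: int

    @property
    def idf(self) -> float:
        # Assumed formula; the slide lists IDF as a stored value.
        return math.log(self.corpus_size / self.df)


def harvest(local: dict[str, TermRecord], remote: list[TermRecord]) -> int:
    """Merge harvested term records; return how many were added or refreshed."""
    changed = 0
    for record in remote:
        known = local.get(record.term)
        # ISO datestamps compare correctly as plain strings.
        if known is None or record.datestamp > known.datestamp:
            local[record.term] = record
            changed += 1
    return changed


opal_a = {"thesis": TermRecord("thesis", "opal-a", "google", "2005-05-13", 20, 1000)}
opal_b: dict[str, TermRecord] = {}  # a late arriver with nothing learned yet
print(harvest(opal_b, list(opal_a.values())), opal_b["thesis"].idf)
```

A second harvest of the same records changes nothing, so servers can poll each other repeatedly without duplicating work.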

  23. Opal Synchronization
      • Other architectures possible
      • Harvesting frequency determined by individual nodes
      [diagram: two synchronization groups (Group 1, Group 2) of Opal servers (Opal A, Opal B, Opal C, Opal D, Opal D.1, Opal D.2, Opal D.3) exchanging Terms and URLs]
      * Opal D aggregates D.1-D.3 to Group 1
      * Opal D aggregates A-C to Group 2

  24. Discovery via OAI-PMH

  25. Connection Costs
      • $Cost_{cache} = (WI \times N) + R$
        • $WI$ = # of web infrastructure caches
        • $N$ = connections for each WI
        • $R$ = connection to get a datestamp
      • $Cost_{paths} = R_c + T + R_l$
        • $R_c$ = connections to get a cached copy
        • $T$ = connections required for each term
        • $R_l$ = connections to use LS
      Example: $Cost_{cache} = 3 \times 1 + 1 = 4$ and $Cost_{paths} = 1 + T + 1$
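The two cost formulas are simple enough to check by direct calculation. The sketch below reproduces the slide's example values; the choice of T = 5 in the last line is an assumption (one connection per term of a five-term signature), not a figure from the slide.

```python
def cost_cache(wi: int, n: int, r: int) -> int:
    """Connections needed to locate cached copies: (WI * N) + R."""
    return wi * n + r


def cost_paths(rc: int, t: int, rl: int) -> int:
    """Connections needed to build and use a lexical signature: Rc + T + Rl."""
    return rc + t + rl


# Slide example: 3 caches, 1 connection each, 1 datestamp lookup.
print(cost_cache(wi=3, n=1, r=1))   # → 4
# Assumed T = 5: one DF lookup per term of a five-term signature.
print(cost_paths(rc=1, t=5, rl=1))  # → 7
```

Since T is the only term that grows with the signature, terms whose document frequency is already known locally or has been harvested from another Opal server reduce the cost directly.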

  26. Analysis - Cumulative Terms Learned
      1 million terms; 30,000 documents; results averaged over 100 iterations

  27. Analysis - Terms Learned Per Document
      1 million terms; 30,000 documents; results averaged over 100 iterations

  28. Load Estimation

  29. Future Work
      • Testing on departmental server
        • hard to test in-the-small
      • Code optimizations
        • many short cuts taken for demo system
        • G & Y APIs not used; screen scraping only
      • Lexical Signatures
        • describe changes over time
      • IDF calculation metrics
        • is scraping Google valid? is it nice?
      • Learning new code
        • use OAI-PMH to update the system
      • OpenURL resolver
        • 404 URL = referent

  30. Conclusions
      • Lexical signatures can be generated just-in-time from WI caches as pages disappear
      • Many web servers can be easily configured to use a single Opal server
      • Multiple Opal servers can harvest each other to learn Terms and URLs more quickly
