
SiteStory: Archiving Done Differently
http://mementoweb.github.io/SiteStory/
Martin Klein, @mart1nkle1n, martinklein0815@gmail.com
Justin F. Brunelle, jbrunelle@cs.odu.edu

LANL SiteStory Team lead developer

Archiving - the traditional way
• Actively crawl the web
• For example, using Heritrix

Archiving - the traditional way
• Issues with crawler-based archiving:
  • Requests can be rejected (robots.txt, user-agent, IP)
  • Crawler can be deceived (geo-location, user-agent)
  • Crawler can be trapped (crawl my calendar!)
  • Requires constant and massive bandwidth
  • Implied timing problem: when to crawl?
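The first issue above (a request rejected on the basis of robots.txt and user-agent) can be illustrated with Python's standard `urllib.robotparser`. This is only a sketch: the robots.txt content, the crawler name `ExampleArchiveBot`, and the example.org URLs are hypothetical and do not come from the slides.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named crawler is turned away entirely,
# everyone else is only kept out of /private/.
ROBOTS_TXT = """\
User-agent: ExampleArchiveBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_fetch(user_agent: str, url: str) -> bool:
    """Return True if the given user agent is allowed to request the URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "http://www.example.org/news/today.html"
    print("crawler allowed:", may_fetch("ExampleArchiveBot", page))
    print("browser allowed:", may_fetch("Mozilla/5.0", page))
```

A polite crawler that honors these rules never sees the page, while a regular browser request for the same URL goes through.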

Archiving - the traditional way
• Timing problem:
  • Update 1 viewed but not archived
[Figure: timeline t1 to t6 with the events R created, R update1, R update2, browser visit1, crawler visit1, browser visit2]
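The timing problem can be sketched as a small simulation. The event ordering used here (R created at t1, browser visit1 at t2, crawler visit1 at t3, R update1 at t4, browser visit2 at t5, R update2 at t6) is an assumption reconstructed from the figure labels, and the function names are hypothetical.

```python
import bisect

# Assumed timeline (see lead-in): (time, version of resource R).
VERSIONS = [(1, "R created"), (4, "R update1"), (6, "R update2")]
BROWSER_VISITS = [2, 5]
CRAWLER_VISITS = [3]


def version_at(t):
    """Return the version of R that is live at time t, or None before creation."""
    times = [vt for vt, _ in VERSIONS]
    idx = bisect.bisect_right(times, t) - 1
    return VERSIONS[idx][1] if idx >= 0 else None


def viewed_but_not_archived(browser_visits, archive_times):
    """Versions some browser saw that no archiving event captured."""
    viewed = {version_at(t) for t in browser_visits}
    archived = {version_at(t) for t in archive_times}
    return sorted(v for v in viewed - archived if v is not None)


if __name__ == "__main__":
    # Crawler-based archiving: capture happens only when the crawler visits.
    print(viewed_but_not_archived(BROWSER_VISITS, CRAWLER_VISITS))
    # Transactional archiving: every browser visit is also a capture.
    print(viewed_but_not_archived(BROWSER_VISITS, BROWSER_VISITS))
```

Under these assumptions the crawler misses update 1, whereas archiving on each browser transaction leaves no viewed version uncaptured.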

Archiving - the SiteStory way
• Transactional Web archiving
• Archive accepts HTTP transaction between browser and server

Archiving - the traditional way
• Timing problem:
  • Update 1 viewed and archived
[Figure: timeline t1 to t6 with the events R created, R update1, R update2, browser visit1, crawler visit1, browser visit2]

Archiving - the SiteStory way
• Challenges with transactional archiving:
  • To be archived, the server has to cooperate
  • Transfer data to the archive in batch mode or in real time
  • Archive must trust the transmission to be authentic
  • Resources from external servers have to be archived out-of-band
  • Deduplication challenges
    • Alias: different URI, same response
    • Conneg: same URI, different response
  • Determine “significant” content change
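The two deduplication cases (alias and conneg) can be sketched with a toy store that keeps each distinct body once and records which bodies were observed under which URI. The class name, the in-memory dictionaries, and the choice of SHA-256 are assumptions for illustration, not SiteStory's actual design.

```python
import hashlib
from collections import defaultdict


class DedupStore:
    """Toy archive: bodies are stored once, keyed by a hash of their bytes."""

    def __init__(self):
        self.bodies = {}                       # digest -> body bytes
        self.observations = defaultdict(list)  # URI -> digests seen there

    def record(self, uri: str, body: bytes) -> bool:
        """Record a response; return True if the body had to be stored anew."""
        digest = hashlib.sha256(body).hexdigest()
        is_new = digest not in self.bodies
        if is_new:
            self.bodies[digest] = body
        if digest not in self.observations[uri]:
            self.observations[uri].append(digest)
        return is_new


if __name__ == "__main__":
    store = DedupStore()
    # Alias: two URIs, identical response, so one stored body.
    store.record("/index.html", b"<html>home</html>")
    store.record("/", b"<html>home</html>")
    # Conneg: one URI, two representations, so two stored bodies.
    store.record("/report", b"<html>report</html>")
    store.record("/report", b"%PDF report")
    print(len(store.bodies), len(store.observations["/report"]))
```

A byte-level hash treats any difference as a new version, so it does not address the separate question of what counts as a “significant” content change.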

SiteStory Status Quo
• mod_sitestory sends an HTTP PUT to the SiteStory Web Archive upon a client’s GET request
  • not for POST, DELETE, etc.
  • for HTTP response codes 200, 302, 303
• Client IP can be included in stored headers (configurable)
• Header info stored in BerkeleyDB, response body in the file system (FS)
• Dedup via hash(body)
• Offloading content as WARC files possible (read: recommended)
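The selection rules above (GET requests only, response codes 200, 302, 303, optional client IP in the stored headers) can be sketched as follows. This is not the mod_sitestory source: the function names and the `X-Client-IP` header name are hypothetical.

```python
# Response codes the slide lists as being archived.
ARCHIVED_STATUS_CODES = {200, 302, 303}


def should_archive(method: str, status: int) -> bool:
    """Archive only GET transactions with one of the listed response codes."""
    return method.upper() == "GET" and status in ARCHIVED_STATUS_CODES


def headers_to_store(response_headers: dict, client_ip: str,
                     include_client_ip: bool) -> dict:
    """Copy the response headers, optionally adding the client IP.

    "X-Client-IP" is a hypothetical header name used for illustration.
    """
    stored = dict(response_headers)
    if include_client_ip:
        stored["X-Client-IP"] = client_ip
    return stored


if __name__ == "__main__":
    print(should_archive("GET", 200))   # archived
    print(should_archive("POST", 200))  # skipped: not a GET
    print(should_archive("GET", 404))   # skipped: status not listed
    print(headers_to_store({"Content-Type": "text/html"}, "192.0.2.7", True))
```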

SiteStory Use Case
• http://www.dans.knaw.nl
• LANL has been archiving the DANS website (forever)
• ~32 GB since mid-April 2013
• >200k resources

To Appear: TPDL 2013
• SiteStory benchmark with ab & wget
  • ApacheBench (ab): server stress test tool
  • wget: Web page download
    • All content: -p
• Local network
• Negligible difference between SiteStory and No SiteStory
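A benchmark run of this kind could be driven from a short script like the one below. The request count, concurrency level, and target URL are placeholders and not the settings used in the paper; only `-p` for wget comes from the slide, while `-n` (number of requests) and `-c` (concurrency) are standard ab options.

```python
import shutil
import subprocess


def build_ab_command(url: str, requests: int = 100, concurrency: int = 1) -> list:
    """ApacheBench: issue `requests` requests, `concurrency` at a time."""
    return ["ab", "-n", str(requests), "-c", str(concurrency), url]


def build_wget_command(url: str) -> list:
    """wget with -p, which also downloads the page's embedded resources."""
    return ["wget", "-p", url]


def run_if_available(command: list):
    """Run the command if its tool is installed; otherwise return None."""
    if shutil.which(command[0]) is None:
        return None
    return subprocess.run(command, capture_output=True, text=True, check=False)


if __name__ == "__main__":
    target = "http://www.example.org/"  # placeholder URL
    print(" ".join(build_ab_command(target, requests=1000, concurrency=10)))
    print(" ".join(build_wget_command(target)))
```

Running the same commands once with mod_sitestory enabled and once with it disabled gives the on/off comparison the slide refers to.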

Re-executed on testbed
[Figure: testbed hosts megalodon.lanl.gov, ws-dl-03.cs.odu.edu, and a series of hosts (…, x99) @AWS]

Testing with ab

Testing with wget

Round Trip Time -- Distributed

Results
• Distributed: higher variance
  • Increased delay due to network
• SiteStory on vs. off: results still comparable
• Viable solution without crippling service

SiteStory Installation
• Apache module mod_sitestory
  • Option to exclude a list of directories
• SiteStory Web Archive
  • Trivial for existing Tomcat environments
  • Tanuki Java wrapper (stand-alone) available
• Configure, open ports, go! Or…

SiteStory Testbed
• We have a SiteStory Web Archive installed for you!
• Install and configure mod_sitestory
• Send an email containing:
  • Your contact info
  • Web server IP address
  • Server domain name used
• Happy SiteStory’ing!
• mailto:SiteStory-Testbed@googlegroups.com
• http://mementoweb.github.io/SiteStory/

