SiteStory Archiving Done Differently http://mementoweb.github.io/SiteStory/ Martin Klein @mart1nkle1n martinklein0815@gmail.com Justin F. Brunelle jbrunelle@cs.odu.edu
LANL SiteStory Team lead developer
Archiving - the traditional way
• Actively crawl the web
• For example, using Heritrix
Archiving - the traditional way
• Issues with crawler-based archiving:
  • Requests can be rejected (robots.txt, user-agent, IP)
  • Crawler can be deceived (geo-location, user-agent)
  • Crawler can be trapped (crawl my calendar!)
  • Requires constant and massive bandwidth
  • Implied timing problem: when to crawl?
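
The robots.txt rejection and the calendar trap can be illustrated with Python's standard `urllib.robotparser`. This is only a sketch: the robots.txt content, the `heritrix` user-agent token, and the example.org URLs are assumptions made up for the illustration, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one crawler is blocked entirely, and every
# other client is kept out of the (potentially endless) calendar pages.
ROBOTS_TXT = """\
User-agent: heritrix
Disallow: /

User-agent: *
Disallow: /calendar/
"""


def allowed(user_agent: str, url: str) -> bool:
    """Return True if the robots.txt above lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("heritrix", "browser"):
        for url in ("http://example.org/index.html",
                    "http://example.org/calendar/2013/"):
            print(agent, url, allowed(agent, url))
```

A crawler that honors these rules never sees the pages, while a transactional archive sits next to the server and is not subject to them.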
Archiving - the traditional way
• Timing problem:
  • Update 1 viewed but not archived
[Timeline figure: times t1 to t6; resource events "R created", "R update1", "R update2"; visits "browser visit1", "crawler visit1", "browser visit2"]
Archiving - the SiteStory way
• Transactional Web archiving
• Archive accepts HTTP transaction between browser and server
Archiving - the SiteStory way
• Timing problem:
  • Update 1 viewed and archived
[Timeline figure: times t1 to t6; resource events "R created", "R update1", "R update2"; visits "browser visit1", "crawler visit1", "browser visit2"]
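
The timing problem can be made concrete with a small simulation. This is a sketch under stated assumptions: the slides name six points in time and six events but the extracted text does not preserve which event happens when, so the ordering below (R created at 1, update1 at 2, browser visit1 at 3, update2 at 4, crawler visit1 at 5, browser visit2 at 6) is an assumption chosen to reproduce the "viewed but not archived" case. The function names are hypothetical.

```python
from bisect import bisect_right

# Assumed timeline (see lead-in): when R changes and who visits when.
CHANGE_TIMES = [1, 2, 4]      # R created, R update1, R update2
BROWSER_VISITS = [3, 6]       # browser visit1, browser visit2
CRAWLER_VISITS = [5]          # crawler visit1


def version_at(change_times, t):
    """Index of the version of R live at time t (0 = as created).

    Returns None if t is before R was created.
    """
    i = bisect_right(change_times, t)
    return i - 1 if i > 0 else None


def versions_seen(change_times, visit_times):
    """Set of version indexes observed by visits at the given times."""
    seen = {version_at(change_times, t) for t in visit_times}
    seen.discard(None)
    return seen


viewed = versions_seen(CHANGE_TIMES, BROWSER_VISITS)

# Crawler-based archive: only what the crawler happened to fetch.
crawled = versions_seen(CHANGE_TIMES, CRAWLER_VISITS)

# Transactional archive: every served response is captured, so the
# archived set equals the viewed set by construction.
transactional = set(viewed)

if __name__ == "__main__":
    print("viewed but missed by crawler:", sorted(viewed - crawled))
    print("viewed but missed transactionally:", sorted(viewed - transactional))
```

Under these assumed times the crawler misses version 1 (update 1), which a browser did see, while the transactional archive misses nothing that was served.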
Archiving - the SiteStory way
• Challenges with transactional archiving:
  • To be archived, the server has to cooperate
  • Transfer data to the archive in batch mode or in real time
  • Archive must trust the transmission to be authentic
  • Resources from external servers have to be archived out-of-band
  • Deduplication challenges
    • Alias: different URI, same response
    • Conneg: same URI, different response
  • Determine “significant” content change
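
The two deduplication cases above (alias and content negotiation) can be sketched with a small index keyed on a body hash. This is an illustration only, not SiteStory's implementation: the class name `DedupIndex`, its methods, and the choice of SHA-256 are assumptions.

```python
import hashlib


class DedupIndex:
    """Toy index relating URIs to hashes of the response bodies seen."""

    def __init__(self):
        self.uris_by_digest = {}   # digest -> set of URIs with that body
        self.digests_by_uri = {}   # URI -> set of digests seen for it

    def record(self, uri: str, body: bytes) -> str:
        """Register one observed response and return its body digest."""
        digest = hashlib.sha256(body).hexdigest()  # hash choice is an assumption
        self.uris_by_digest.setdefault(digest, set()).add(uri)
        self.digests_by_uri.setdefault(uri, set()).add(digest)
        return digest

    def aliases(self):
        """Groups of different URIs that returned the same body."""
        return sorted(sorted(uris) for uris in self.uris_by_digest.values()
                      if len(uris) > 1)

    def varying_uris(self):
        """URIs that returned more than one distinct body."""
        return sorted(uri for uri, digests in self.digests_by_uri.items()
                      if len(digests) > 1)


if __name__ == "__main__":
    index = DedupIndex()
    index.record("/index.html", b"<h1>Home</h1>")
    index.record("/", b"<h1>Home</h1>")               # alias of /index.html
    index.record("/report", b"<html>report</html>")   # HTML variant
    index.record("/report", b'{"report": true}')      # JSON variant
    print(index.aliases())
    print(index.varying_uris())
```

A body hash alone cannot tell a negotiated variant from a genuine update of the resource; separating the two would also require the request headers (for example `Accept`), and deciding whether a change is “significant” needs more than byte equality.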
SiteStory Status Quo
• mod_sitestory sends an HTTP PUT to the SiteStory Web Archive upon a client’s GET request
  • not for POST, DELETE, etc.
  • for HTTP response codes 200, 302, 303
• Client IP can be included in stored headers (configurable)
• Header info stored in BerkeleyDB, response body in the file system
• Dedup via hash(body)
• Offloading content as WARC files possible (read: recommended)
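
The rules on this slide can be summarized in a toy model. It is a sketch, not the mod_sitestory code: the names `should_archive` and `ToyArchive` are hypothetical, in-memory Python containers stand in for BerkeleyDB and the file system, and SHA-256 is an assumed choice since the slide only says hash(body).

```python
import hashlib

ARCHIVED_METHODS = {"GET"}            # not POST, DELETE, etc.
ARCHIVED_STATUSES = {200, 302, 303}   # response codes named on the slide


def should_archive(method: str, status: int) -> bool:
    """True if a transaction with this method and status gets archived."""
    return method.upper() in ARCHIVED_METHODS and status in ARCHIVED_STATUSES


class ToyArchive:
    """In-memory stand-in for the archive's header and body stores."""

    def __init__(self, store_client_ip: bool = False):
        self.store_client_ip = store_client_ip  # configurable, off by default here
        self.header_records = []  # stands in for the header store
        self.bodies = {}          # digest -> body, stands in for the body store

    def put(self, uri, method, status, headers, body, client_ip=None) -> bool:
        """Archive one transaction; return False if it is filtered out."""
        if not should_archive(method, status):
            return False
        digest = hashlib.sha256(body).hexdigest()  # assumed hash function
        record = {"uri": uri, "status": status,
                  "headers": dict(headers), "body": digest}
        if self.store_client_ip and client_ip is not None:
            record["client_ip"] = client_ip
        self.header_records.append(record)
        # Dedup: an identical body is stored only once.
        self.bodies.setdefault(digest, body)
        return True


if __name__ == "__main__":
    archive = ToyArchive()
    archive.put("/a", "GET", 200, {"Content-Type": "text/html"}, b"same")
    archive.put("/b", "GET", 200, {"Content-Type": "text/html"}, b"same")
    archive.put("/a", "POST", 200, {}, b"ignored")
    print(len(archive.header_records), "records,", len(archive.bodies), "bodies")
```

Keeping one header record per transaction while storing each distinct body once is what lets repeated requests for an unchanged resource stay cheap in storage.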
SiteStory Use Case
• http://www.dans.knaw.nl
• LANL has been archiving the DANS website (forever)
• ~32 GB since mid April 2013
• >200k resources
To Appear: TPDL 2013
• SiteStory benchmark with ab & wget
  • ApacheBench (ab): server stress test tool
  • wget: Web page download
    • All content: -p
• Local network
• Negligible difference between SiteStory and No SiteStory
Re-executed on testbed
[Testbed figure: megalodon.lanl.gov, ws-dl-03.cs.odu.edu, and hosts @AWS (labeled "x99")]
Results
• Distributed: higher variance
• Increased delay due to network
• SiteStory on vs. off: results still comparable
• Viable solution without crippling service
SiteStory Installation
• Apache module mod_sitestory
  • Option to exclude a list of directories
• SiteStory Web Archive
  • Trivial for existing Tomcat environments
  • Tanuki Java wrapper (stand-alone) available
• Configure, open ports, go! Or…
SiteStory Testbed
• We have a SiteStory Web Archive installed for you!
• Install and configure mod_sitestory
• Send an email containing:
  • Your contact info
  • Web server IP address
  • Server domain name used
• Happy SiteStory’ing!
• mailto: SiteStory-Testbed@googlegroups.com
• http://mementoweb.github.io/SiteStory/