
Martin Klein @mart1nkle1n martinklein0815@gmail

SiteStory Archiving Done Differently http://mementoweb.github.io/SiteStory/. Martin Klein @mart1nkle1n martinklein0815@gmail.com. Justin F. Brunelle jbrunelle@cs.odu.edu. LANL SiteStory Team. lead developer. Archiving - the traditional way. Actively crawl the web


Presentation Transcript


  1. SiteStory Archiving Done Differently http://mementoweb.github.io/SiteStory/ Martin Klein @mart1nkle1n martinklein0815@gmail.com Justin F. Brunelle jbrunelle@cs.odu.edu

  2. LANL SiteStory Team lead developer

  3. Archiving - the traditional way • Actively crawl the web • For example, using Heritrix

  4. Archiving - the traditional way • Issues with crawler-based archiving: • Requests can be rejected (robots.txt, user-agent, IP) • Crawler can be deceived (geo-location, user-agent) • Crawler can be trapped (crawl my calendar!) • Requires constant and massive bandwidth • Implied timing problem: when to crawl?

  5. Archiving - the traditional way • Timing problem: • Update 1 viewed but not archived • [Timeline figure. Labels: browser visit1, crawler visit1, browser visit2; t1 to t6; R created, R update1, R update2]

  6. Archiving - the SiteStory way • Transactional Web archiving • Archive accepts HTTP transaction between browser and server

  7. Archiving - the traditional way • Timing problem: • Update 1 viewed and archived • [Timeline figure. Labels: browser visit1, crawler visit1, browser visit2; t1 to t6; R created, R update1, R update2]
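The timing problem on the slides above can be sketched as a small simulation. The ordering of events below (R created at t1, browser visit1 at t2, crawler visit1 at t3, R update1 at t4, browser visit2 at t5, R update2 at t6) is an assumption read off the figure labels, not something the slides state explicitly; the function names are hypothetical.

```python
from bisect import bisect_right

# Assumed timeline (hypothetical ordering inferred from the slide labels).
CHANGES = [(1, "created"), (4, "update1"), (6, "update2")]
BROWSER_VISITS = [2, 5]
CRAWLER_VISITS = [3]


def version_at(t, changes=CHANGES):
    """Return the version of R that is live at time t, or None before creation."""
    times = [time for time, _ in changes]
    i = bisect_right(times, t)
    return changes[i - 1][1] if i else None


def versions_seen(visits):
    """Set of versions that a series of visits would have observed."""
    return {version_at(t) for t in visits} - {None}


viewed = versions_seen(BROWSER_VISITS)

# Crawler-based archive: only versions live at crawl time are captured.
missed_by_crawler = viewed - versions_seen(CRAWLER_VISITS)

# Transactional archive: every version served to a browser is captured.
missed_transactionally = viewed - versions_seen(BROWSER_VISITS)

print("viewed but not archived (crawler):", missed_by_crawler)
print("viewed but not archived (transactional):", missed_transactionally)
```

Under these assumed times, the crawler misses update1 because it visited before the change, while an archive fed by browser transactions captures every version that was actually served. Neither approach captures update2, which nobody requested.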

  8. Archiving - the SiteStory way • Challenges with transactional archiving: • To be archived, the server has to cooperate • Transfer data to archive, batch mode or real-time • Archive must trust transmission to be authentic • Resources from external servers have to be archived out-of-band • Deduplication challenges • Alias: different URI, same response • Conneg: same URI, different response • Determine “significant” content change
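The two deduplication cases named above (alias and content negotiation) can be illustrated with a short sketch. The observation tuples, the `variant` field (standing in for whatever request header drove negotiation, such as Accept-Language), and both function names are hypothetical; this is not how SiteStory itself detects these cases.

```python
from collections import defaultdict


def find_aliases(observations):
    """Digests returned under more than one URI (different URI, same response)."""
    by_digest = defaultdict(set)
    for uri, _variant, digest in observations:
        by_digest[digest].add(uri)
    return {d: uris for d, uris in by_digest.items() if len(uris) > 1}


def find_negotiated(observations):
    """URIs that returned more than one digest (same URI, different response)."""
    by_uri = defaultdict(set)
    for uri, variant, digest in observations:
        by_uri[uri].add((variant, digest))
    return {
        uri: seen
        for uri, seen in by_uri.items()
        if len({digest for _, digest in seen}) > 1
    }


# Hypothetical observations: (URI, negotiated variant, body digest).
observations = [
    ("/index.html", "en", "d1"),
    ("/", "en", "d1"),
    ("/about", "en", "d2"),
    ("/about", "nl", "d3"),
]

print(find_aliases(observations))
print(find_negotiated(observations))
```

A real archive would also have to separate negotiated variants from genuine updates over time, which is where the "significant" content change question comes in.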

  9. SiteStory Status Quo • mod_sitestory sends HTTP PUT to SiteStory Web Archive upon client’s GET request • not for POST, DELETE, etc. • for HTTP response codes 200, 302, 303 • Client IP can be included in stored headers, configurable • Header info stored in BerkeleyDB, response body in FS • Dedup via hash(body) • Offloading content as WARC files possible (read: recommended)
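The rules on this slide can be sketched as a small in-memory model: archive only GET transactions with status 200, 302, or 303, optionally keep the client IP, and deduplicate bodies by hash. Everything below is an illustration under stated assumptions, not SiteStory's code: the class name `ArchiveSketch` is hypothetical, SHA-256 is an assumed choice (the slide only says hash(body)), the `X-Client-IP` key is invented for the example, and plain Python containers stand in for BerkeleyDB and the file system.

```python
import hashlib

ARCHIVED_METHODS = {"GET"}
ARCHIVED_STATUSES = {200, 302, 303}


class ArchiveSketch:
    """Toy stand-in for the archive side; not the real SiteStory API."""

    def __init__(self, store_client_ip=False):
        self.store_client_ip = store_client_ip
        self.header_records = []  # stand-in for the header store
        self.bodies = {}          # stand-in for the file system, keyed by digest

    @staticmethod
    def should_archive(method, status):
        """Apply the method and status-code filter from the slide."""
        return method.upper() in ARCHIVED_METHODS and status in ARCHIVED_STATUSES

    def record(self, uri, method, status, headers, body, client_ip=None):
        """Store one transaction; return the body digest, or None if skipped."""
        if not self.should_archive(method, status):
            return None
        digest = hashlib.sha256(body).hexdigest()  # assumed hash function
        if digest not in self.bodies:
            self.bodies[digest] = body  # body stored once per distinct digest
        stored = dict(headers)
        if self.store_client_ip and client_ip is not None:
            stored["X-Client-IP"] = client_ip  # hypothetical key name
        self.header_records.append(
            {"uri": uri, "status": status, "headers": stored, "digest": digest}
        )
        return digest


archive = ArchiveSketch()
first = archive.record("/a", "GET", 200, {"Content-Type": "text/html"}, b"<p>hi</p>")
second = archive.record("/b", "GET", 200, {}, b"<p>hi</p>")
skipped = archive.record("/form", "POST", 200, {}, b"ok")
```

In this sketch each transaction gets its own header record while identical bodies share one stored copy; whether the real archive records headers for every duplicate is not stated on the slide.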

  10. SiteStory Use Case • http://www.dans.knaw.nl • LANL has been archiving the DANS website (forever) • ~32 GB since mid April 2013 • >200k resources

  11. To Appear: TPDL 2013 • SiteStory benchmark with ab & wget • ApacheBench (ab): server stress test tool • wget: Web page download • All content: -p • Local network • Negligible difference between SiteStory and No SiteStory

  12. Re-executed on testbed • [Diagram labels: x99, megalodon.lanl.gov, @AWS, ws-dl-03.cs.odu.edu]

  13. Testing with ab

  14. Testing with wget

  15. Round Trip Time -- Distributed

  16. Results • Distributed: higher variance • Increased delay due to network • Performance with SiteStory on vs. off remains comparable • Viable solution without crippling service

  17. SiteStory Installation • Apache module mod_sitestory • Option to exclude a list of directories • SiteStory Web Archive • Trivial for existing Tomcat environments • Tanuki Java wrapper (stand-alone) available • Configure, open ports, go! Or…

  18. SiteStory Testbed • We have a SiteStory Web Archive installed for you! • Install and configure mod_sitestory • Send an email containing: • Your contact info • Web server IP address • Server domain name used • Happy SiteStory’ing! • mailto:SiteStory-Testbed@googlegroups.com • http://mementoweb.github.io/SiteStory/

  19. SiteStory Archiving Done Differently http://mementoweb.github.io/SiteStory/ Martin Klein @mart1nkle1n martinklein0815@gmail.com Justin F. Brunelle jbrunelle@cs.odu.edu
