Archive-It Architecture Introduction • April 3, 2006 • Dan Avery • Internet Archive
Archive-It Components • Crawling • User Interface • Storage • Playback • Text Indexing • Integration
Crawling • Heritrix ( http://crawler.archive.org/ ) • Java application • Open source (LGPL) • Crawls for completeness/depth • Highly configurable
Crawling - Distributed Crawling • Heritrix Cluster Controller (HCC) • Java component, open source, developed by IA • http://crawler.archive.org/hcc • Provides proxy access to a pool of Heritrix instances through a JMX interface • Provides crawler control and status reporting • Currently controlling 33 crawler instances on three commodity dual-Opteron machines; upper bound unknown
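The HCC's role of fronting a pool of crawlers can be sketched as a simple round-robin dispatcher. This is illustrative only: the real HCC speaks JMX to each Heritrix instance, and the class and host names below are invented for the example.

```python
from itertools import cycle

class CrawlerPool:
    """Illustrative stand-in for the HCC's proxy to a pool of Heritrix
    instances (the real HCC controls crawlers over JMX)."""

    def __init__(self, hosts):
        # e.g. a few machines, each able to run many Heritrix instances
        self.hosts = list(hosts)
        self._next = cycle(self.hosts)

    def submit(self, seed_url):
        """Assign a crawl job to the next crawler host in round-robin order."""
        host = next(self._next)
        return (host, seed_url)

pool = CrawlerPool(["opteron1", "opteron2", "opteron3"])
assignments = [pool.submit(f"http://example.org/{i}") for i in range(4)]
```

With three hosts, the fourth job wraps back to the first host, which is the basic load-spreading behavior a proxy over a crawler pool provides.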
Archive-It Web Application • User Interface and Crawl Scheduling • Gets seed URLs and crawl parameters from users • Schedules new periodic crawls • Talks to crawler pool through HCC • Provides access, search, and crawl history UI
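The "schedules new periodic crawls" step amounts to computing the next run time from a seed list's crawl frequency. A minimal sketch, assuming hypothetical frequency names (the actual Archive-It options are not shown in the slides):

```python
from datetime import datetime, timedelta

# Hypothetical frequency table; real Archive-It frequencies may differ.
FREQUENCIES = {
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
    "monthly": timedelta(days=30),
}

def next_crawl(last_run, frequency):
    """Compute the next scheduled crawl time for a seed list."""
    return last_run + FREQUENCIES[frequency]

nxt = next_crawl(datetime(2006, 4, 3), "weekly")
```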
Storage • archive.org ARC repository • Custom Perl system • Simple storage on primary/backup pairs • Monthly MD5 digest verification • Robust, non-proprietary file format • Mirrors in Alexandria (Egypt) and Amsterdam
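The monthly MD5 verification of primary/backup pairs can be sketched directly: stream each copy through MD5 and compare digests. The file names below are invented; the real system is a custom Perl setup.

```python
import hashlib
import os
import tempfile

def md5_digest(path, chunk_size=1 << 20):
    """Stream a (possibly multi-GB) ARC file through MD5 in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_pair(primary, backup):
    """True if the primary and backup copies are byte-identical."""
    return md5_digest(primary) == md5_digest(backup)

# Demo with two identical dummy files standing in for an ARC pair.
tmp = tempfile.mkdtemp()
primary = os.path.join(tmp, "IA-001.arc.gz")
backup = os.path.join(tmp, "IA-001-backup.arc.gz")
for path in (primary, backup):
    with open(path, "wb") as f:
        f.write(b"dummy ARC bytes")
ok = verify_pair(primary, backup)
```

Chunked reading matters here: ARC files are large, and hashing them in fixed-size chunks keeps memory use constant.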
Access • Internet Archive Wayback Machine • Replaying archived web pages since 2001 • Current IA version written in Perl and C, with components distributed across various machines • Current version not open source, but an open-source beta (in Java) is available now
Full-Text Indexing • Nutch (http://nutch.org) • NutchWAX (http://archive-access.sf.net) additions create and search indexes of stored ARC files • Standard text search plus link analysis • Can rank results by date instead of relevance, useful for individual archives
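Ranking by date rather than relevance is just a different sort key over the same hit list. A toy sketch (the real NutchWAX API differs; the hits here are plain dicts with invented fields):

```python
# Illustrative search hits; NutchWAX returns richer objects than this.
hits = [
    {"url": "http://example.org/a", "score": 0.91, "date": "20060115"},
    {"url": "http://example.org/b", "score": 0.75, "date": "20060401"},
    {"url": "http://example.org/c", "score": 0.88, "date": "20051203"},
]

# Relevance ranking: highest score first.
by_relevance = sorted(hits, key=lambda h: h["score"], reverse=True)

# Date ranking: newest capture first (YYYYMMDD strings sort correctly).
by_date = sorted(hits, key=lambda h: h["date"], reverse=True)
```

For a single curated archive, the newest capture of a page is often more useful than the "most relevant" one, which is why date ordering matters here.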
Text Indexing Challenges • Some parts are distributable, some are not • Incremental indexing: goal is new crawls in the index within 72 hours • Working on an Archive-It-usable map/reduce version, targeted for July • In the meantime, a lot of workarounds
Integration • Group of Perl and bash scripts; the planning was more complex than the execution • Most components available individually • Decentralized control, centralized monitoring • Each component operates almost entirely independently
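"Decentralized control, centralized monitoring" can be sketched as a monitor that only polls component status and never commands them. The component names and probes below are invented for illustration; the real glue is Perl/bash.

```python
def collect_status(components):
    """Poll each component's status probe and aggregate the results.
    The monitor observes; it does not control the components."""
    report = {}
    for name, probe in components.items():
        try:
            report[name] = probe()
        except Exception as exc:
            # A dead component must not take the monitor down with it.
            report[name] = f"unreachable: {exc}"
    return report

def failing_probe():
    raise RuntimeError("down")

components = {
    "crawler-pool": lambda: "33 instances running",
    "arc-storage": lambda: "verified",
    "indexer": failing_probe,
}
status = collect_status(components)
```

Because each probe is independent, one failing component is reported rather than halting the whole status sweep, matching the "components operate almost entirely independently" design.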
Future Challenges • Crawler trap detection • Scalability • Current setup can accommodate 300 partners at current crawling rates • During pilot we crawled/indexed/stored just over 100,000,000 documents (~4TB) in eight weeks • More machines can be easily added to storage and crawling clusters
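The pilot figures above imply an average document size and a daily crawl rate, worked out below (assuming decimal terabytes; the slide gives only the rounded totals):

```python
docs = 100_000_000          # just over 100M documents in the pilot
bytes_total = 4 * 10**12    # ~4 TB stored (decimal TB assumed)
weeks = 8

avg_doc_kb = bytes_total / docs / 1000   # average bytes per document, in KB
docs_per_day = docs / (weeks * 7)        # sustained crawl rate per day
```

This works out to roughly 40 KB per document and on the order of 1.8 million documents per day, which gives a feel for the load the storage and crawling clusters absorbed.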
Scalability • Current Nutch is between versions • Old version has some non-distributable pieces • New version is much more distributable and scalable (map/reduce - Hadoop), but not ready for incremental indexing
Looking ahead • After basic UI/archiving/indexing... • Time-based search UI • Analyzing archives for research and ongoing collection improvement • Content classification • Rate of change • New site suggestions
RLG’s Web Archiving Program • Collaborative collection development • Descriptive metadata for web archives • Usability/user studies • Intellectual property concerns • Web Archiving 101 • Web archiving services and software