Archive-It Architecture Introduction • April 18, 2006 • Dan Avery • Internet Archive
Archive-It Components • Crawling • User Interface • Storage • Playback • Text Indexing • Integration
Crawling • Heritrix (http://crawler.archive.org/) • Java application • Open source (LGPL) • Crawls for completeness/depth • Highly configurable
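A minimal, illustrative sketch of the crawl policy described above: a breadth-first frontier that stops scheduling links beyond a configured depth. This is not Heritrix code or its API; the class and method names are invented for illustration.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Illustrative only: a breadth-first frontier with a per-seed depth limit,
// the kind of completeness/depth policy a Heritrix crawl is configured for.
public class DepthLimitedFrontier {
    static class CrawlUri {
        final String uri;
        final int depth;
        CrawlUri(String uri, int depth) { this.uri = uri; this.depth = depth; }
    }

    private final Queue<CrawlUri> queue = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();
    private final int maxDepth;

    public DepthLimitedFrontier(int maxDepth) { this.maxDepth = maxDepth; }

    public void addSeed(String uri) { schedule(uri, 0); }

    // Links discovered while fetching a page are scheduled one hop deeper.
    public void addDiscovered(CrawlUri parent, String uri) {
        schedule(uri, parent.depth + 1);
    }

    private void schedule(String uri, int depth) {
        if (depth <= maxDepth && seen.add(uri)) {
            queue.add(new CrawlUri(uri, depth));
        }
    }

    public CrawlUri next() { return queue.poll(); }  // null when the crawl is finished
}
```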
Crawling - Distributed Crawling • Heritrix Cluster Controller • Java component, open source, developed by IA • http://crawler.archive.org/hcc • Provides proxy access to a pool of Heritrix instances through a JMX interface • Provides crawler control and status • Currently controlling 33 crawler instances on three commodity dual Opterons; the upper bound is unknown
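Because the HCC reaches the crawler pool over JMX, a controlling application talks to an individual Heritrix instance with the standard javax.management remote API. The sketch below shows that pattern only; the host, port, MBean ObjectName, and attribute name are hypothetical placeholders, not the actual Heritrix or HCC names.

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Sketch of querying a remote crawler instance over JMX. All names below
// (host, port, ObjectName, attribute) are illustrative placeholders.
public class CrawlerStatusProbe {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://crawler01.example.org:8849/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbeans = connector.getMBeanServerConnection();
            // Hypothetical MBean name and attribute; the real names depend on
            // the Heritrix build and HCC configuration.
            ObjectName crawler = new ObjectName("org.example:type=CrawlJob,name=job-1");
            Object status = mbeans.getAttribute(crawler, "Status");
            System.out.println("Crawler status: " + status);
        } finally {
            connector.close();
        }
    }
}
```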
Archive-It Web Application • User Interface and Crawl Scheduling • Gets seed URLs and crawl parameters from users • Schedules new periodic crawls • Talks to the crawler pool through the HCC • Provides the access, search, and crawl-history UI
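A rough sketch of the kind of data the web application collects from a partner and how a periodic crawl could be judged due. The class and field names are illustrative, not Archive-It's actual schema.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Illustrative crawl specification: seeds plus crawl parameters, with a
// simple "is the next periodic crawl due?" check.
public class CrawlSpec {
    final List<String> seeds;   // seed URLs entered by the partner
    final int maxHops;          // crawl depth parameter
    final Duration frequency;   // e.g. weekly or monthly
    Instant lastCrawlStart;

    CrawlSpec(List<String> seeds, int maxHops, Duration frequency) {
        this.seeds = seeds;
        this.maxHops = maxHops;
        this.frequency = frequency;
    }

    // A crawl is due once the configured interval has elapsed since the last run.
    boolean isDue(Instant now) {
        return lastCrawlStart == null
                || !now.isBefore(lastCrawlStart.plus(frequency));
    }
}
```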
Storage • archive.org ARC repository • Custom Perl system • Simple storage on primary/backup pairs • Monthly MD5 digest verification • Robust, non-proprietary file format • Alexandria (Egypt) / Amsterdam
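The repository itself is a custom Perl system, but the monthly verification step is easy to sketch: recompute each ARC file's MD5 and compare the primary and backup copies. A minimal Java illustration:

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.HexFormat;

// Sketch of the monthly digest check: hash a file and compare the
// primary/backup pair. Illustrative only; the real system is Perl.
public class DigestCheck {
    static String md5(Path file) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        return HexFormat.of().formatHex(md.digest());
    }

    static boolean copiesMatch(Path primary, Path backup) throws Exception {
        return md5(primary).equals(md5(backup));
    }
}
```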
Access • Internet Archive Wayback Machine • Replaying archived web pages since 2001 • The current IA version is written in Perl and C, with components distributed across various machines • Not open source, but an open-source beta (in Java) is available now
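Replay in the Wayback Machine addresses a capture by a 14-digit UTC timestamp followed by the original URL. A small sketch of building such a replay URL; the public web.archive.org host is used here purely as an example.

```java
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

// Sketch of Wayback-style replay addressing: yyyyMMddHHmmss timestamp + original URL.
public class ReplayUrl {
    private static final DateTimeFormatter TS =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    static String forCapture(String originalUrl, ZonedDateTime captureTime) {
        String timestamp = captureTime.withZoneSameInstant(ZoneOffset.UTC).format(TS);
        return "http://web.archive.org/web/" + timestamp + "/" + originalUrl;
    }

    public static void main(String[] args) {
        System.out.println(forCapture("http://www.archive.org/",
                ZonedDateTime.of(2006, 4, 18, 0, 0, 0, 0, ZoneOffset.UTC)));
        // -> http://web.archive.org/web/20060418000000/http://www.archive.org/
    }
}
```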
Full-Text Indexing • Nutch (http://nutch.org) • NutchWAX (http://archive-access.sf.net) additions create and search indexes of stored ARC files • Standard text search plus link analysis • Can search by date instead of relevance, which is useful for individual archives
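A conceptual sketch of the date-versus-relevance ordering mentioned above. The Hit class is invented for illustration; it is not the Nutch or NutchWAX API.

```java
import java.util.Comparator;
import java.util.List;

// Illustrative result ordering: the usual relevance ranking versus the
// newest-capture-first ordering that suits an individual archive.
public class HitSorting {
    static class Hit {
        final String url;
        final String captureDate;  // 14-digit timestamp, e.g. "20060418120000"
        final float score;         // relevance score from the text index
        Hit(String url, String captureDate, float score) {
            this.url = url; this.captureDate = captureDate; this.score = score;
        }
    }

    // Relevance order: highest-scoring hits first (the usual web-search view).
    static void sortByRelevance(List<Hit> hits) {
        hits.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
    }

    // Archive view: newest captures first, regardless of score.
    static void sortByDate(List<Hit> hits) {
        hits.sort(Comparator.comparing((Hit h) -> h.captureDate).reversed());
    }
}
```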
Text Indexing Challenges • Some parts are distributable, some are not • Incremental indexing: the goal is to have new crawls in the index within 72 hours • Working on an Archive-It-usable map/reduce version, targeted for July • In the meantime, a lot of workarounds
Integration • A group of Perl and bash scripts; the planning was more complex than the execution • Most components are available individually • Decentralized control, centralized monitoring • Each component operates almost entirely independently
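The real glue is Perl and bash, but the centralized-monitoring idea is simple to sketch: one job that periodically checks that each independently running component still answers. The component names and status URLs below are hypothetical.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Map;

// Illustrative central monitor: poll each component's (hypothetical) status URL.
public class ComponentMonitor {
    public static void main(String[] args) {
        Map<String, String> components = Map.of(
                "web-app", "http://app.example.org/status",
                "hcc", "http://hcc.example.org/status",
                "wayback", "http://wayback.example.org/status");

        components.forEach((name, statusUrl) -> {
            try {
                HttpURLConnection conn =
                        (HttpURLConnection) new URL(statusUrl).openConnection();
                conn.setRequestMethod("HEAD");
                conn.setConnectTimeout(5000);
                System.out.println(name + ": HTTP " + conn.getResponseCode());
            } catch (IOException e) {
                System.out.println(name + ": UNREACHABLE (" + e.getMessage() + ")");
            }
        });
    }
}
```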
Future Challenges • Crawler trap detection • Scalability • The current setup can accommodate 300 partners at current crawling rates • During the pilot we crawled, indexed, and stored just over 100,000,000 documents (~4 TB) in eight weeks • More machines can easily be added to the storage and crawling clusters
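As a back-of-the-envelope check, those pilot figures work out to roughly 100,000,000 documents / 56 days ≈ 1.8 million documents per day (about 20 documents per second sustained), at an average of roughly 4 TB / 100 M ≈ 40 KB per document.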
Scalability • The current Nutch is between versions • The old version has some non-distributable pieces • The new version is much more distributable and scalable (map/reduce via Hadoop), but not yet ready for incremental indexing
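A plain-Java sketch of why map/reduce fits index building: the map step over each archived page can run on any machine in parallel, and the reduce step merges the emitted (term, URL) pairs into postings. This is illustrative only, not the Hadoop API the newer Nutch uses.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Conceptual inverted-index build expressed as a map step and a reduce step.
public class InvertedIndexSketch {
    // map: one (url, text) pair -> a list of (term, url) pairs
    static List<String[]> map(String url, String text) {
        List<String[]> emitted = new ArrayList<>();
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                emitted.add(new String[] { term, url });
            }
        }
        return emitted;
    }

    // reduce: group the emitted pairs by term to get term -> list of URLs
    static Map<String, List<String>> reduce(List<String[]> pairs) {
        Map<String, List<String>> index = new HashMap<>();
        for (String[] pair : pairs) {
            index.computeIfAbsent(pair[0], t -> new ArrayList<>()).add(pair[1]);
        }
        return index;
    }
}
```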
Looking ahead • After basic UI/archiving/indexing... • Time-based search UI • Analyzing archives for research and ongoing collection improvement • Content classification • Rate of change • New site suggestions
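One of the analyses listed above, rate of change, could be estimated by comparing content digests of successive captures of the same URL. A small illustrative sketch, with the capture representation assumed:

```java
import java.util.List;

// Illustrative rate-of-change estimate: the fraction of successive captures
// of one URL whose content digest differs from the previous capture.
public class ChangeRate {
    // digests: content hashes of the same URL, ordered by capture date
    static double fractionChanged(List<String> digests) {
        if (digests.size() < 2) {
            return 0.0;
        }
        int changes = 0;
        for (int i = 1; i < digests.size(); i++) {
            if (!digests.get(i).equals(digests.get(i - 1))) {
                changes++;
            }
        }
        return (double) changes / (digests.size() - 1);
    }
}
```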