Slide 1:
Gordon Mohr
Chief Technologist, Web Projects
Internet Archive
An Introduction To Heritrix
Slide 2: Web Collection
Since 1996
Over 4×10¹⁰ resources (URI + time)
Over 400 TB (compressed)
Slide 3: Web Collection: via Alexa
Alexa Internet
Private company
Crawling for IA since 1996
2-month rolling snapshots
Recent: 3 billion URIs, 35 million websites, 20 TB
Crawling software
Sophisticated
Weighted towards popular sites
Proprietary: we only receive the data
Slide 4: Heritrix: Motivations #1
Deeper, specialized, in-house crawling
Sites of topical interest
Contractual crawls for libraries and governments
US Library of Congress
Elections, current events, government websites
UK Public Record Office, US National Archives
Government websites
Using our own software & machines
Slide 5: Heritrix: Motivations #2
Open source
Encourage collaboration on features and best practices
Avoid duplication of work, incompatibilities
Archival-quality
Perfect copies
Keep up with changing web
Meet evolving needs of the Internet Archive (IA) and the International Internet Preservation Consortium (IIPC)
Slide 6: Heritrix
New
Open-source
Extensible
Web-scale
Archival-quality
Web crawling software
Slide 7: Heritrix: Use Cases
Broad Crawling
Large-scale, collect as much as possible
Focused Crawling
Collect specific sites/topics deeply
Continuous Crawling
Revisit changed sites
Experimental Crawling
Novel approaches
Slide 8: Heritrix: Project
Heritrix means "heiress"
Java, modular
Project website: http://crawler.archive.org
News, downloads, documentation
Sourceforge: open source hosting site
Source-code control (CVS)
Issue databases
“Lesser” GPL license
Outside contributions
Slide 9: http://crawler.archive.org
Slide 10: Heritrix: Milestones
Summer 2003: Prototypes created and tested against existing crawlers; requirements collected from IA and IIPC
October 2003–April 2004: Nordic Web Archive programmers join project, add capabilities
January 2004: First public beta (0.2.0)
Used for all in-house crawling since
February & June 2004: Workshops for Heritrix users at national libraries
August 2004: Version 1.0.0 released
Slide 11: Heritrix: Architecture
Basic loop:
1. Choose a URI from among all those scheduled
2. Fetch that URI
3. Analyze or archive the results
4. Select discovered URIs of interest, and add to those scheduled
5. Note that the URI is done and repeat
Parallelized across threads (and eventually, machines); a sketch of the loop follows below
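A minimal Java sketch of this basic loop, assuming hypothetical Frontier, Processor, and CrawlUri types (illustrative names, not Heritrix's actual classes):

import java.util.List;

// Hypothetical types sketching the loop above; not Heritrix's real API.
interface CrawlUri { List<CrawlUri> discovered(); }

interface Frontier {
    boolean isEmpty();
    CrawlUri next();               // choose a URI from among those scheduled
    void schedule(CrawlUri uri);   // add a newly discovered URI
    void finished(CrawlUri uri);   // note that the URI is done
}

interface Processor { void process(CrawlUri uri); }  // one fetch/analyze/archive step

class CrawlLoop implements Runnable {
    private final Frontier frontier;
    private final List<Processor> chain;

    CrawlLoop(Frontier frontier, List<Processor> chain) {
        this.frontier = frontier;
        this.chain = chain;
    }

    @Override
    public void run() {                                // one loop per worker thread
        while (!frontier.isEmpty()) {
            CrawlUri uri = frontier.next();            // 1. choose a scheduled URI
            for (Processor p : chain) {                // 2-3. fetch, analyze, archive
                p.process(uri);
            }
            for (CrawlUri found : uri.discovered()) {  // 4. schedule discoveries of interest
                frontier.schedule(found);
            }
            frontier.finished(uri);                    // 5. mark done and repeat
        }
    }
}

In Heritrix the same structure runs concurrently on many worker threads, each pulling the next scheduled URI from the shared Frontier.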
Slide 12: Key components of Heritrix
Scope
which URIs should be included
(seeds + rules; a Scope sketch follows this list)
Frontier
which URIs are done, or waiting to be done
(queues and lists/maps)
Processor chains
configurable sequential tasks to do to each URI
(code modules + configuration)
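A rough sketch of how a Scope might combine seeds and rules to decide inclusion; the class shape and rules here are illustrative assumptions, not Heritrix's real API:

import java.net.URI;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Illustrative Scope: seeds are always included; other URIs must pass every rule.
class Scope {
    private final Set<URI> seeds;
    private final List<Predicate<URI>> rules;  // e.g. same-host-as-a-seed, path-depth limit

    Scope(Set<URI> seeds, List<Predicate<URI>> rules) {
        this.seeds = seeds;
        this.rules = rules;
    }

    boolean accepts(URI candidate) {
        return seeds.contains(candidate)
                || rules.stream().allMatch(rule -> rule.test(candidate));
    }

    public static void main(String[] args) {
        Set<URI> seeds = Set.of(URI.create("http://crawler.archive.org/"));
        Predicate<URI> sameHost = u ->
                seeds.stream().anyMatch(s -> s.getHost().equals(u.getHost()));
        Scope scope = new Scope(seeds, List.of(sameHost));
        System.out.println(scope.accepts(URI.create("http://crawler.archive.org/docs")));  // true
        System.out.println(scope.accepts(URI.create("http://example.com/")));              // false
    }
}

Here a single same-host rule keeps the crawl on the seeds' sites; a real scope composes many such inclusion and exclusion rules.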
Slide 13: Heritrix: Architecture
Slide 14: Heritrix: Processor Chains
Prefetch
Ensure conditions are met
Fetch
Network activity (HTTP, DNS, FTP, etc.)
Extract
Analyze – especially for new URIs
Write
Save archival copy to disk
Postprocess
Feed URIs back to the Frontier, update crawler state (the five stages are sketched below)
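Wired in order, the five chains behave like the toy sketch below; the printed actions stand in for Heritrix's real, configurable processor modules:

import java.util.List;

// Toy stand-ins for the five processor chains; real stages are configurable code modules.
interface Stage { void process(String uri); }

class Chains {
    static final List<Stage> STAGES = List.of(
            uri -> System.out.println("prefetch:    check robots.txt and DNS for " + uri),
            uri -> System.out.println("fetch:       HTTP GET " + uri),
            uri -> System.out.println("extract:     scan response for new URIs"),
            uri -> System.out.println("write:       save archival copy to disk"),
            uri -> System.out.println("postprocess: feed discovered URIs to the Frontier"));

    public static void main(String[] args) {
        for (Stage stage : STAGES) {
            stage.process("http://crawler.archive.org/");  // stages run in sequence per URI
        }
    }
}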
Slide 15: Heritrix: Features & Limitations
Other key features:
Web UI console to control & monitor crawl
Very configurable inclusion, exclusion, and politeness policies (see the politeness sketch after this list)
Limitations:
Requires sophisticated operator
Large crawls hit single-machine limits
No capacity for automatic revisit of changed material
Generally:
Good for focused & experimental crawling use cases; not yet for broad and continuous
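As one example of a configurable politeness policy, crawlers of this kind commonly wait a multiple of the last fetch's duration before hitting the same host again. A minimal sketch, with illustrative parameter names:

// Minimal sketch of an adaptive politeness delay: wait a multiple of the last
// fetch's duration, clamped to a configured range. Parameter names are illustrative.
class PolitenessPolicy {
    private final double delayFactor;  // multiple of the last fetch duration
    private final long minDelayMs;
    private final long maxDelayMs;

    PolitenessPolicy(double delayFactor, long minDelayMs, long maxDelayMs) {
        this.delayFactor = delayFactor;
        this.minDelayMs = minDelayMs;
        this.maxDelayMs = maxDelayMs;
    }

    long delayBeforeNextFetch(long lastFetchDurationMs) {
        long delay = (long) (delayFactor * lastFetchDurationMs);
        return Math.max(minDelayMs, Math.min(maxDelayMs, delay));
    }

    public static void main(String[] args) {
        PolitenessPolicy policy = new PolitenessPolicy(5.0, 500, 5_000);
        System.out.println(policy.delayBeforeNextFetch(400));  // 2000: slow hosts get longer pauses
    }
}

With a factor of 5, a host that took 400 ms to respond is left alone for 2 seconds, so slower servers automatically get gentler treatment.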
Slide 16: Heritrix console
Slide 17: Heritrix settings
Slide 18: Heritrix logs
Slide 19: Heritrix reports
Slide 20: Heritrix: Current Uses
Weekly, monthly, 6-monthly, and special one-time crawls
Hundreds to thousands of specific target sites
Over 20 million collected URIs per crawl
Crawls run for 1-2 weeks
Slide 21: Heritrix: Performance
Not yet stress-tested or optimized
Current crawls limited by the material to crawl and chosen politeness settings, not by crawler performance
Typical observed rates (actual focused crawls)
20-40 URIs/sec (peaking over 60)
2-3 Mbps (peaking over 20 Mbps)
Limits imposed by memory usage
Over 10,000 hosts / over 10 million URIs on a 512 MB machine; more on larger machines
Slide 22: Heritrix: Future Plans
Larger-scale crawl capacity
Giant focused crawls
Broad whole-web crawls
New protocols & formats
Automate expert operator tasks
Continuous and dynamic crawling
Revisit sites as they change
Dynamically rank sites and URIs
Slide 23: Latest Developments
1.2 Release (next week)
Configurable URI canonicalization
Handles common session IDs and URI variations (see the sketch after this list)
Politeness by IP address
Experimental, more memory-efficient Frontier
Bug fixes
1.4 Release (January 2005)
Memory robustness
Experimental multi-machine distribution support
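As a sketch of the kind of canonicalization rule 1.2 introduces, the following strips one common session-ID form so URI variants collapse to a single canonical URI; the pattern and class are illustrative, not Heritrix's actual rule modules:

import java.util.regex.Pattern;

// Illustrative canonicalization rule: strip a common session-ID so that URI
// variants of the same page map to one canonical form and are fetched once.
class StripJsessionId {
    private static final Pattern JSESSIONID =
            Pattern.compile(";jsessionid=[0-9A-Za-z]+", Pattern.CASE_INSENSITIVE);

    static String canonicalize(String uri) {
        return JSESSIONID.matcher(uri).replaceAll("");
    }

    public static void main(String[] args) {
        // Both variants reduce to http://example.com/page?x=1
        System.out.println(canonicalize("http://example.com/page;jsessionid=0A1B2C3D?x=1"));
        System.out.println(canonicalize("http://example.com/page?x=1"));
    }
}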
Slide 24: The End
Questions?