
An Introduction To Heritrix


Presentation Transcript


    Slide 1: An Introduction To Heritrix. Gordon Mohr, Chief Technologist, Web Projects, Internet Archive.

    Slide 2: Web Collection. Since 1996. Over 4×10^10 resources (URI + time). Over 400 TB (compressed).

    Slide 3: Web Collection: via Alexa. Alexa Internet: private company; crawling for IA since 1996; 2-month rolling snapshots; recent: 3 billion URIs, 35 million websites, 20 TB. Crawling software: sophisticated; weighted towards popular sites; proprietary (we only receive the data).

    Slide 4: Heritrix: Motivations #1. Deeper, specialized, in-house crawling: sites of topical interest; contractual crawls for libraries and governments. US Library of Congress: elections, current events, government websites. UK Public Records Office, US National Archives: government websites. Using our own software & machines.

    Slide 5: Heritrix: Motivations #2. Open source: encourage collaboration on features and best practices; avoid duplication of work, incompatibilities. Archival-quality: perfect copies; keep up with changing web. Meet evolving needs of Internet Archive and International Internet Preservation Consortium.

    Slide 6: Heritrix: new, open-source, extensible, web-scale, archival-quality web crawling software.

    Slide 7: Heritrix: Use Cases. Broad crawling: large, as-much-as-possible. Focused crawling: collect specific sites/topics deeply. Continuous crawling: revisit changed sites. Experimental crawling: novel approaches.

    Slide 8: Heritrix: Project. "Heritrix" means heiress. Java, modular. Project website: http://crawler.archive.org (news, downloads, documentation). Sourceforge (open source hosting site): source-code control (CVS), issue databases. "Lesser" GPL license. Outside contributions.

    Slide 9: http://crawler.archive.org

    Slide 10: Heritrix: Milestones. Summer 2003: prototypes created and tested against existing crawlers; requirements collected from IA and IIPC. October 2003-April 2004: Nordic Web Archive programmers join project, add capabilities. January 2004: first public beta (0.2.0); used for all in-house crawling since. February & June 2004: workshops for Heritrix users at national libraries. August 2004: version 1.0.0 released.

    Slide 11: Heritrix: Architecture. Basic loop: (1) choose a URI from among all those scheduled; (2) fetch that URI; (3) analyze or archive the results; (4) select discovered URIs of interest, and add to those scheduled; (5) note that the URI is done and repeat. Parallelized across threads (and eventually, machines).
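
The five-step loop on this slide can be sketched in a few lines of Java. This is a minimal, single-threaded illustration and not Heritrix's actual code: the class `CrawlLoopSketch`, the method `crawl`, and the in-memory `web` map (URI to outlinks, standing in for real fetching and link extraction) are all hypothetical names chosen for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

/** Illustrative single-threaded crawl loop; hypothetical names, not Heritrix classes. */
class CrawlLoopSketch {

    /**
     * Runs the five-step loop over an in-memory "web" (URI -> outlinks).
     * The map is an assumption that replaces network fetching and extraction.
     * Returns the URIs in the order they were completed.
     */
    static List<String> crawl(List<String> seeds,
                              Map<String, List<String>> web,
                              Predicate<String> inScope) {
        Deque<String> scheduled = new ArrayDeque<>(); // URIs waiting to be fetched
        Set<String> seen = new HashSet<>();           // URIs already scheduled or done
        List<String> done = new ArrayList<>();        // completion order

        for (String seed : seeds) {
            if (seen.add(seed)) {
                scheduled.add(seed);
            }
        }
        while (!scheduled.isEmpty()) {
            String uri = scheduled.poll();                          // 1. choose a scheduled URI
            List<String> links = web.getOrDefault(uri, List.of());  // 2. "fetch" it (simulated)
            // 3. analyze or archive: here the outlinks are the only result kept
            for (String link : links) {                             // 4. schedule in-scope, unseen URIs
                if (inScope.test(link) && seen.add(link)) {
                    scheduled.add(link);
                }
            }
            done.add(uri);                                          // 5. note the URI is done, repeat
        }
        return done;
    }

    public static void main(String[] args) {
        Map<String, List<String>> web = Map.of(
            "http://example.org/", List.of("http://example.org/a", "http://other.example/"),
            "http://example.org/a", List.of("http://example.org/"));
        System.out.println(crawl(List.of("http://example.org/"), web,
                                 uri -> uri.startsWith("http://example.org/")));
    }
}
```

In this sketch the `seen` set and the `scheduled` queue together play the role the slides assign to the Frontier, and the `inScope` predicate plays the role of the Scope. A real crawler would run many such loops in parallel threads and would add politeness delays between requests to the same host.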

    Slide 12: Key components of Heritrix. Scope: which URIs should be included (seeds + rules). Frontier: which URIs are done, or waiting to be done (queues and lists/maps). Processor chains: configurable sequential tasks to do to each URI (code modules + configuration).

    Slide 13: Heritrix: Architecture

    Slide 14: Heritrix: Processor Chains. Prefetch: ensure conditions are met. Fetch: network activity (HTTP, DNS, FTP, etc.). Extract: analyze, especially for new URIs. Write: save archival copy to disk. Postprocess: feed URIs back to Frontier, update crawler state.
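
The idea of a processor chain, a fixed sequence of stages applied to each URI, can be sketched as follows. All names here (`ProcessorChainSketch`, `CrawlItem`, `Stage`, `defaultChain`, `run`) are hypothetical and are not Heritrix's API, and the stage bodies are placeholders. The sketch also assumes that when the prefetch stage rejects a URI, the middle stages are skipped but postprocessing still runs so crawler state can be updated; that behavior is a design choice of this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Illustrative processor chain; hypothetical names, not Heritrix's API. */
class ProcessorChainSketch {

    /** Per-URI state carried through the chain. */
    static class CrawlItem {
        final String uri;
        final List<String> trace = new ArrayList<>(); // names of stages that ran, in order
        boolean skip = false;                         // set when a precondition fails
        CrawlItem(String uri) { this.uri = uri; }
    }

    /** One named step in the chain. */
    static class Stage {
        final String name;
        final Consumer<CrawlItem> work;
        Stage(String name, Consumer<CrawlItem> work) {
            this.name = name;
            this.work = work;
        }
    }

    /** The five stages named on the slide, with placeholder bodies. */
    static List<Stage> defaultChain() {
        List<Stage> chain = new ArrayList<>();
        // Assumed precondition for the example: only plain http:// URIs are fetched.
        chain.add(new Stage("prefetch", item -> item.skip = !item.uri.startsWith("http://")));
        chain.add(new Stage("fetch", item -> { }));       // network activity would go here
        chain.add(new Stage("extract", item -> { }));     // link extraction would go here
        chain.add(new Stage("write", item -> { }));       // archival write would go here
        chain.add(new Stage("postprocess", item -> { })); // feed URIs back, update state
        return chain;
    }

    /** Runs the stages in order; a skipped item jumps to the final stage. */
    static CrawlItem run(List<Stage> chain, CrawlItem item) {
        for (int i = 0; i < chain.size(); i++) {
            boolean last = (i == chain.size() - 1);
            if (item.skip && !last) {
                continue;
            }
            Stage stage = chain.get(i);
            item.trace.add(stage.name);
            stage.work.accept(item);
        }
        return item;
    }

    public static void main(String[] args) {
        System.out.println(run(defaultChain(), new CrawlItem("http://example.org/")).trace);
        System.out.println(run(defaultChain(), new CrawlItem("mailto:someone@example.org")).trace);
    }
}
```

Keeping each stage behind the same small interface is what makes the chain configurable: stages can be added, removed, or reordered in the list without changing the loop in `run`.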

    Slide 15: Heritrix: Features & Limitations. Other key features: web UI console to control & monitor crawl; very configurable inclusion, exclusion, politeness policies. Limitations: requires sophisticated operator; large crawls hit single-machine limits; no capacity for automatic revisit of changed material. Generally: good for focused & experimental crawling use cases; not yet for broad and continuous.

    Slide 16: Heritrix console

    Slide 17: Heritrix settings

    Slide 18: Heritrix logs

    Slide 19: Heritrix reports

    Slide 20: Heritrix: Current Uses. Weekly, monthly, 6-monthly, and special one-time crawls. Hundreds to thousands of specific target sites. Over 20 million collected URIs per crawl. Crawls run for 1-2 weeks.

    Slide 21: Heritrix: Performance. Not yet stressed or optimized. Current crawls are limited by the material to crawl and the chosen politeness, not by our performance. Typical observed rates (actual focused crawls): 20-40 URIs/sec (peaking over 60); 2-3 Mbps (peaking over 20 Mbps). Limits imposed by memory usage: over 10,000 hosts / over 10 million URIs (512 MB machine; more on larger machines).
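
As a rough consistency check, the quoted rates can be compared with the crawl sizes and durations given on the "Current Uses" slide (over 20 million URIs per crawl, crawls running 1-2 weeks). The sketch below assumes a constant average rate over the whole crawl, which is a simplification; `CrawlDurationSketch` and `daysToCrawl` are hypothetical names.

```java
/** Back-of-the-envelope crawl duration estimate; hypothetical helper for illustration. */
class CrawlDurationSketch {

    private static final double SECONDS_PER_DAY = 86_400.0;

    /** Days needed to collect uriCount URIs at a constant average rate. */
    static double daysToCrawl(long uriCount, double urisPerSecond) {
        double seconds = uriCount / urisPerSecond;
        return seconds / SECONDS_PER_DAY;
    }

    public static void main(String[] args) {
        long uris = 20_000_000L; // "over 20 million collected URIs per crawl"
        System.out.println("at 20 URIs/sec: " + daysToCrawl(uris, 20.0) + " days");
        System.out.println("at 40 URIs/sec: " + daysToCrawl(uris, 40.0) + " days");
    }
}
```

Under that assumption, 20 million URIs take about 11.6 days at 20 URIs/sec and about 5.8 days at 40 URIs/sec, which is in line with the stated 1-2 week crawl durations.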

    Slide 22: Heritrix: Future Plans. Larger scale crawl capacity: giant focused crawls; broad whole-web crawls. New protocols & formats. Automate expert operator tasks. Continuous and dynamic crawling: revisit sites as they change; dynamically rank sites and URIs.

    Slide 23: Latest Developments. 1.2 release (next week): configurable canonicalization (handles common session-IDs, URI variations); politeness by IP address; experimental more memory-efficient Frontier; bug fixes. 1.4 release (January 2005): memory robustness; experimental multi-machine distribution support.
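
To illustrate what canonicalization of session-IDs and URI variations means in practice, here is a minimal sketch. The rules shown (dropping `;jsessionid=` path parameters, dropping `PHPSESSID` and `sid` query parameters, lowercasing the scheme and host) are assumptions chosen for the example, not the rule set Heritrix ships, and `CanonicalizerSketch` is a hypothetical name.

```java
import java.util.Locale;
import java.util.regex.Pattern;

/** Illustrative URI canonicalizer; the rules are assumptions, not Heritrix's rule set. */
class CanonicalizerSketch {

    // ";jsessionid=..." path parameter, up to the query, fragment, or end of string
    private static final Pattern JSESSIONID =
        Pattern.compile(";jsessionid=[^?#]*", Pattern.CASE_INSENSITIVE);

    // "PHPSESSID=..." or "sid=..." query parameter, with a trailing '&' if present
    private static final Pattern SESSION_PARAM =
        Pattern.compile("(?<=[?&])(?:phpsessid|sid)=[^&#]*&?", Pattern.CASE_INSENSITIVE);

    static String canonicalize(String uri) {
        String result = JSESSIONID.matcher(uri).replaceAll("");
        result = SESSION_PARAM.matcher(result).replaceAll("");
        // Drop a dangling '?' or '&' left at the end after parameter removal.
        result = result.replaceAll("[?&]+$", "");
        // Lowercase everything up to the first '/' after "://" (scheme and authority).
        int schemeEnd = result.indexOf("://");
        if (schemeEnd >= 0) {
            int hostEnd = result.indexOf('/', schemeEnd + 3);
            if (hostEnd < 0) {
                hostEnd = result.length();
            }
            result = result.substring(0, hostEnd).toLowerCase(Locale.ROOT)
                   + result.substring(hostEnd);
        }
        return result;
    }

    public static void main(String[] args) {
        // prints http://example.org/shop?item=7
        System.out.println(canonicalize("HTTP://Example.ORG/shop;jsessionid=ABC123?item=7"));
        // both print http://example.org/page?id=1
        System.out.println(canonicalize("http://example.org/page?PHPSESSID=xyz&id=1"));
        System.out.println(canonicalize("http://example.org/page?id=1&sid=42"));
    }
}
```

A crawler would typically compare canonical forms when deciding whether a discovered URI has already been seen, so that the same page reached under different session IDs is scheduled only once.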

    Slide 24: The End. Questions?
