Archive-It Training University of Maryland July 12, 2007
Archive-It Mission Help memory institutions preserve the Web • Provide web-based archiving and storage capabilities • No technical infrastructure required • User-friendly application
Archive-It Application Open Source Components • Heritrix: web crawler • ARC file: archival record format (ISO work item) • Wayback Machine: access tool for viewing archived websites (ARC files) • NutchWAX: a bundling of Nutch (an open source search engine) used to make archived sites full-text searchable • All developed by the Internet Archive: http://archive-access.sourceforge.net/
Web Archiving Definitions • Host: a single machine or set of networked machines, designated by its Internet hostname (e.g., archive.org) • Scope: rules for where a crawler can go • Sub-domains: divisions of a larger site, named to the left of the host name (e.g., crawler.archive.org)
Web Archiving Definitions • Seed: starting point URL for the crawler. The crawler will follow linked pages from your seed URL and archive them if they are in scope. • Document: any file with a distinct URL (image, PDF, HTML, etc.)
General Crawling Limitations Some web content cannot be archived: • JavaScript: can be difficult to capture and even more difficult to display • Streaming media • Password-protected sites • Form-driven content: if you have to interact with the site to get content, it cannot be captured • Robots.txt: the crawler respects all robots.txt files (go to yourseed.com/robots.txt to see if our crawler is blocked)
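One quick way to perform that robots.txt check programmatically is Python's standard robotparser module. This is a minimal sketch: yourseed.com is the slide's placeholder host, and the user-agent token is illustrative only, not necessarily the one Heritrix actually sends.

    from urllib import robotparser

    # A robots.txt that blocks all crawlers from an entire site looks like:
    #   User-agent: *
    #   Disallow: /

    rp = robotparser.RobotFileParser("http://yourseed.com/robots.txt")
    rp.read()
    # "archive.org_bot" is used here only as an illustrative user-agent token.
    print(rp.can_fetch("archive.org_bot", "http://yourseed.com/"))  # False if blocked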
Archive-It Crawling Scope • Heritrix will follow links within your seed site to capture pages • Links are in scope if the seed is included in the root of their URL • All embedded content on seed pages is captured • Sub-domains are NOT automatically crawled • Can specify path (i.e., limit crawler to a single directory* of a host) - ex: www.archive.org/about/ *Always end seed directories with a ‘/’
Seed and Scope Examples Example seed: www.archive.org • link: www.archive.org/about.html is in scope • link: www.yahoo.com is NOT in scope • embedded PDF: www.rlg.org/studies/metadata.pdf is in scope • embedded image: www.rlg.org/logo.jpg is in scope • link: crawler.archive.org is NOT in scope Example seed: www.archive.org/about/ • link: www.archive.org/webarchive.html is NOT in scope
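The in-scope rule behind these examples can be sketched in a few lines of Python. This is a deliberate simplification of Heritrix's actual scoping logic, intended only to make the link examples above concrete; the function name and URL normalization are hypothetical, and the sketch does not model the embedded-content exception (embedded files are captured regardless of scope).

    def in_scope(url: str, seed: str) -> bool:
        """A link is in scope if the seed forms the root of its URL."""
        def strip_scheme(u: str) -> str:
            # str.removeprefix requires Python 3.9+
            return u.lower().removeprefix("http://").removeprefix("https://")
        return strip_scheme(url).startswith(strip_scheme(seed))

    # The slide's link examples, with seed www.archive.org:
    print(in_scope("http://www.archive.org/about.html", "www.archive.org"))  # True
    print(in_scope("http://www.yahoo.com", "www.archive.org"))               # False
    print(in_scope("http://crawler.archive.org", "www.archive.org"))         # False (sub-domain)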
Changing Crawl Scope • Expand crawl scope to automatically include sub-domains using Scope Rules on the ‘edit’ seed page • Use ‘crawl settings’ to constrain your crawl: limit the overall number of documents archived, or block or limit specific hosts by document count or regular expression
Access • Archived pages are accessible in the Wayback Machine 1 hour after the crawl is complete (longer for larger crawls) • Text searchable 7 days after the crawl is complete • The public can see your archives through text search on www.archive-it.org, Archive-It template web pages (hosted on archive-it.org), or partner-built portals
Creating Collections Your collection needs: • A name chosen by your institution • A unique collection identifier: an abbreviated version of your collection name • Seeds: the starting point URLs where the crawler will begin its captures • Crawl frequency: how often your collection will be crawled (you can change this at the seed level once the collection is created) • Metadata: optional for your collection, except for the collection description, which will appear on the public Archive-It site
Crawl Frequency Options • Daily crawls last 24 hours; all other crawls last 72. • Seed URLs within the same collection can be set to different frequencies. • The Test frequency allows you to crawl seeds without gathering any data, so the crawl will not count against your total budget. All regular reports are still generated for a test crawl. Test crawls run for up to 72 hours and crawl up to 1 million documents. • Test crawls must be started manually (from the Crawls menu).
Seed Statuses • Enabled: scheduled for crawling (limited to 3) • Disabled: publicly accessible, not scheduled for crawling (unlimited) • Dormant: publicly accessible, not scheduled for crawling (unlimited)
Crawl Settings • Advanced crawl controls: crawl and host constraints • All controls found under crawl settings link
Crawl Constraints • Limit the number of documents captured per crawl instance (by frequency) • Captured URL totals can be up to 30 documents over the limit, due to URLs already in the crawler queue when the limit is reached
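A toy sketch of why that overshoot happens. This is not Heritrix code; the limit and batch size are invented purely to illustrate that URLs already queued when the cap is hit still finish.

    LIMIT = 1000       # hypothetical per-crawl document limit
    IN_FLIGHT = 30     # hypothetical number of URLs queued to workers at once

    archived = 0
    while archived < LIMIT:      # the limit is only checked between batches
        archived += IN_FLIGHT    # an entire in-flight batch completes first
    print(archived)              # 1020 -- up to one queue's worth over the limit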
Host Constraints • Block or limit specified hosts from being crawled • Blocks/limits apply to all named sub-domains of a host • Using Regular Expressions here is OPTIONAL
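For the optional regular expressions, here is a hypothetical example of a pattern a partner might enter to block or limit every URL under the /calendar/ directory of example.edu, including its sub-domains (the host and path are invented for illustration):

    import re

    pattern = re.compile(r"^https?://([^/]*\.)?example\.edu/calendar/")

    print(bool(pattern.match("http://www.example.edu/calendar/2007.html")))  # True -> constrained
    print(bool(pattern.match("http://example.edu/about.html")))              # False -> crawled normally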
Manually Starting a Crawl • Select the crawl frequency you want to start • Using this feature will change your future crawl schedule • Should always be used to start test crawls • The crawl should start within 5 minutes of the request
Archive-It provides 4 downloadable, post-crawl reports • Top 20 Hosts: lists the top 20 hosts archived • Seed Status: reports whether each seed was crawled, whether the seed redirected to a different URL, and whether a robots.txt file blocked the crawler • Seed Source: shows how many documents and which hosts were archived per seed • MIME Type: lists all the different types of files archived
Reports can be opened in Excel. [Screenshot: a portion of the seed source report]
Offsite Hosts in Reports • Embedded content on a website can have a different originating host than the main site address • www.archive.org can contain content from www.rlg.org in the form of a logo or any other embedded element on a www.archive.org page • When seed www.archive.org is crawled, rlg.org will show up in the host reports even though it was not a seed
Wayback Machine • Displays a page as it was on the date of capture • The date of capture is displayed in the archival URL and breaks down as yyyymmddhhmmss: http://wayback.archive-it.org/270/20060801211637/http://sfpl.lib.ca.us/ was captured on August 1, 2006 at 21:16:37 GMT
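Reading that timestamp back out of an archival URL is mechanical; a minimal Python sketch, assuming the URL shape shown above:

    from datetime import datetime

    url = "http://wayback.archive-it.org/270/20060801211637/http://sfpl.lib.ca.us/"
    stamp = url.split("/")[4]                         # "20060801211637"
    captured = datetime.strptime(stamp, "%Y%m%d%H%M%S")
    print(captured)                                   # 2006-08-01 21:16:37 (GMT)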
Archive-It Help • Online help wiki (link within the application) • Partner Specialist for support (including technical) • Listserv: archiveitmembers@archive.org • Report all technical bugs, issues, and questions to archiveit@archive.org