Archiving and Preserving the Web

Archiving and Preserving the Web Kristine Hanna Internet Archive April 2006

Internet Archive: Universal Access to Human Knowledge • A 501(c)(3) non-profit • Located in the Presidio, San Francisco, California • Founded in 1996 to build an ‘Internet library’ • Provides permanent access for researchers, historians, and scholars to historical collections that exist in digital format • Built on open source principles • Open source software developed by Internet Archive and the IIPC

Internet Archive Stats • Largest public web archive • 60 billion pages, 55 million sites • Have expanded to include texts, audio, moving images, and software: 2.6 million downloads a day • 60,000 unique users a day

What do we collect? Web Archive • Take a broad snapshot of the web every 2 months • 2 billion pages a month • Websites from every domain (.org, .com, .edu, etc.) • Content in 21 languages • Entire archive accessible for free to the public via the website at www.archive.org

Why try to collect and preserve it all? • Web has no boundaries, no limits • What will be important to future generations? • What is there today may be gone tomorrow • “Capture now, ask why later” • “Grab it while you can, work it out later” • “Lose as little as possible”

How do we collect it? Open source technology primarily developed by Internet Archive and the IIPC • Heritrix: web crawler • Wayback Machine: access tool for rendering and viewing files • Nutch and NutchWAX: search engine • ARC file: archival record format (ISO work item)
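As a rough illustration of the ARC archival record format listed above, the sketch below parses the one-line header that precedes each captured document in a version 1 ARC file. It assumes the five space-separated fields of the published ARC v1 URL record (URL, IP address, 14-digit capture date, content type, length in bytes). The class name, function name, and sample values are hypothetical and are not part of Heritrix or any Internet Archive tool.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ArcRecordHeader:
    """Fields of an ARC v1 URL-record header line (illustrative names)."""
    url: str
    ip_address: str
    archive_date: datetime
    content_type: str
    length: int


def parse_arc_header(line: str) -> ArcRecordHeader:
    """Parse one ARC v1 header line: URL, IP, date, content type, length."""
    fields = line.strip().split(" ")
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    url, ip_address, date_text, content_type, length_text = fields
    # The capture date is written as YYYYMMDDhhmmss.
    archive_date = datetime.strptime(date_text, "%Y%m%d%H%M%S")
    return ArcRecordHeader(url, ip_address, archive_date, content_type, int(length_text))


# Hypothetical sample header, not taken from a real crawl.
sample = "http://www.example.org/ 192.0.2.1 20060401120000 text/html 2048"
header = parse_arc_header(sample)
```

The length field tells a reader how many bytes of captured content follow the header, which is what lets many records be stored back to back in one file.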

Wayback Machine
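The public Wayback Machine addresses an archived page by combining a 14-digit capture timestamp with the original URL. The minimal sketch below builds such an address, assuming the familiar web.archive.org/web/ prefix. The function name and the example page are hypothetical, and the service may redirect to the capture closest to the requested time.

```python
from datetime import datetime

# Assumed public prefix of the Wayback Machine replay service.
WAYBACK_PREFIX = "http://web.archive.org/web"


def wayback_url(original_url: str, captured_at: datetime) -> str:
    """Build a replay address of the form prefix/YYYYMMDDhhmmss/original-url."""
    timestamp = captured_at.strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


# Hypothetical example page and capture time.
example = wayback_url("http://www.example.org/", datetime(2006, 4, 1, 12, 0, 0))
```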

Preservation • Store multiple copies of each archive • 1300 machines/servers • Multiple copies at different geographical locations (U.S., Alexandria, Amsterdam) • Standard storage boxes, open source design

Archiving Next Steps Institutions: • Need to create collections around web material • Want to dig deeper in crawls for their specific websites • Want more control and access • Want a technology partner that could harvest, index, access, store and preserve their collections for them

1. Partner Contract Crawls • In 2002, began to form partnerships with the Library of Congress, NARA and national libraries, including those of Australia, France and Italy • Dedicated crawl engineer - customized crawling • Library of Congress collections (sample): • Iraq War: 450 million documents and growing • 2004 U.S. National Elections: 88 million documents • Supreme Court Nomination 2005: 100 million documents

2. Archive-It • Last year, in early 2005, we had requests from state archivists, university librarians and other memory institutions to: • Develop an application for smaller institutions that have resource constraints • A web-based service that allows partners to create, manage, search and store their web archives • User-friendly web interface • Does not require technical expertise or infrastructure • Pilot launched in September 2005

Pilot Partners • Center for Research Libraries • Research Libraries Group (U of Toronto, U of Indiana, Haverford and Swarthmore Colleges, IISH) • University of Texas • Library of Virginia • State Archives South Dakota • State Archives North Carolina • State Archives Alabama • Minnesota Historical Society • Institut d'Etude Politique de Grenoble

Archive-It Access • All collections are accessible for free to the general public, with text search, at www.archive-it.org • Partners' websites with links • Plus, member web application with login

Screen shot here • Public site

Test Drive the Application

Screen shots here • Monitor page • Reports page • XML feed

Search • Your archived web pages are searchable by text or URL

Stored Online • We provide copies of the files on a hard drive that we can ship to your institution up to twice a year

Archive-It Releases • 1.0 (February 8) • 1.5 (April 19) • 2.0 (July 29)

Challenges we face • Making the collections useful for a variety of end users (e.g., general public, researchers) • Making sure we capture the best and most relevant content • Continuing to develop our tools for access and harvesting (crawler.archive.org)

Internet Archive’s priorities • Collaboration and partnerships • Continue to act as a technology partner in providing web archiving services to government and memory institutions • Continue to develop open source software • Develop common tools, storage formats and standards through the IIPC (International Internet Preservation Consortium) • Open Content Alliance (OCA) digital books project • Multiple copies across the world • Within IA’s own facilities and with partners such as LC, BnF, Library of Alexandria
