Archiving and Preserving the Web: Challenges and the Future

Archiving and Preserving the Web Dan Avery Kristine Hanna Merrilee Proffitt Internet Archive RLG April 2006

Agenda RLG Internet Archive Archive-It Challenges The Future Q&A

The importance of archiving the web • The web contains much of what will be the basis of scholarship in the future • record of events • official publications • personal viewpoints • ephemeral material

RLG’s interest • RLG’s mission includes working with its member organizations to enhance their ability to provide research resources • RLG members have long participated in web archiving, but so far the activity has been restricted to large organizations

Members active in web archiving • Bibliothèque Nationale de France • British National Library • California Digital Library • Library of Congress • National Library of Australia • National Library of New Zealand

Archive-It pilot partners • Indiana University • International Institute of Social History • University of Toronto • Swarthmore/Haverford College

About Internet Archive • Founded in 1996 • Largest public web archive • 60 billion pages, 55 million sites • Have expanded to include texts, audio, moving images, and software: 2.6 million downloads a day

What do we collect? Web Archive • Take a broad snapshot of the web every 2 months • 2 billion pages a month • Websites from every domain (.org, .com, .edu, etc.) • Content in 21 languages

Policy • We follow the Oakland Archive Policy, 2002 • Founded by commercial and noncommercial organizations • Opt-out policy • We collect it all, and make it inaccessible if requested by site owner • Site owner directly blocks harvester on website
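The opt-out step in the last bullet can be sketched in code. This is a minimal illustration, assuming the site owner blocks the harvester through a robots.txt exclusion; the robots.txt content, the crawler name `ia_archiver`, the example URLs, and the helper `may_harvest` are all assumptions made for this example, not details from the slides.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a site owner might publish to opt out.
# "ia_archiver" is an assumed crawler name used only for illustration.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_harvest(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "http://example.org/index.html"
    print("ia_archiver allowed:", may_harvest(ROBOTS_TXT, "ia_archiver", page))
    print("other-bot allowed:", may_harvest(ROBOTS_TXT, "other-bot", page))
```

A crawler that honors the policy would run a check like this before fetching each page and skip anything the site owner has excluded.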

Access to Web Archive • Entire archive accessible for free to the public via the website at www.archive.org • Receives 100 hits/second • 60k unique users per day • Evolving/Fluid: through public use we hope to find out what is important and to continuously improve

Why try to collect and preserve it all? • Web has no boundaries, no limits • What will be important? • What is there today may be gone tomorrow • “Capture now, ask why later” • “Grab it while you can, work it out later” • “Lose as little as possible”

How do we collect it? Open source technology primarily developed by Internet Archive and IIPC • Heritrix: web crawler • Wayback Machine: access tool for rendering and viewing files • NutchWAX: search engine • ARC file: archival record format (ISO work item)

Wayback Machine
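As a rough illustration of how the Wayback Machine is addressed, the public service accepts URLs of the form `http://web.archive.org/web/<14-digit timestamp>/<original URL>`. The sketch below only builds such a URL; the helper name `wayback_url` and the example site are assumptions for illustration, and the code does not check whether a capture exists for that date.

```python
from datetime import datetime

# Public Wayback Machine prefix; captures are addressed by timestamp and URL.
WAYBACK_PREFIX = "http://web.archive.org/web"


def wayback_url(original_url: str, when: datetime) -> str:
    """Build a Wayback Machine URL for original_url at the given moment."""
    # Timestamps use the form YYYYMMDDhhmmss.
    timestamp = when.strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


if __name__ == "__main__":
    # Hypothetical lookup: the RLG home page as of 1 April 2006.
    print(wayback_url("http://www.rlg.org/", datetime(2006, 4, 1)))
```

When no capture exists at the exact timestamp, the service typically redirects to the closest one it holds.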

Preservation • Store multiple copies of each archive • 1300 machines/servers • Multiple copies at different geographical locations (U.S., Alexandria, Amsterdam) • Standard storage boxes, open source design

Next Steps Institutions: • need to create collections around primary source web material • want to do more than broad crawling with specific and complete web archives • want a technology partner that could harvest, index, access, store and preserve their collections for them.

1. Partner Contract Crawls • In 2002, began to form partnerships with Library of Congress, NARA and other national libraries, including Australia and France. • Library of Congress collections: • Iraq War: 450,000,000 documents and growing • U.S. National Elections • 2000: 131,331,973 documents • 2004: 87,481,265 documents • Supreme Court Nomination 2005: 100 million documents

2. Archive-It • Last year, in early 2005, we had requests from state archivists, university librarians and other memory institutions to expand our archiving services and develop an application that acknowledges resource constraints • Developed Archive-It, a web-based service that allows partners to create, manage, search and store their web archives through an easy-to-use web interface • Does not require technical expertise or infrastructure • Pilot launched in September 2005 • 1.0 Release in February • 1.5 Release in April • 2.0 Release in July

Pilot Partners • Center for Research Libraries • Research Libraries Group (U of Toronto, U of Indiana, Haverford and Swarthmore Colleges, IISH) • University of Texas • Library of Virginia • State Archives South Dakota • State Archives North Carolina • State Archives Alabama • Minnesota Historical Society • Institut d'Etude Politique de Grenoble

Archive-It • 1.0 Release in February • 1.5 Release in April • 2.0 Release in July

Archive-It Collections Some samples: Virginia’s political landscape, 2005 (Gov. Mark Warner) Hurricane Katrina Jamestown 2007 Commemoration

Archive-It Access • All collections are accessible for free to the general public, with text search, at: • www.archive-it.org • Partners' websites with links • Plus, member web application with login

Demo

Dan’s slides Tech

Challenges we face • Making the collections useful for a variety of end users (e.g. general public, researchers) • Making sure we capture the best and most relevant content • Continuing to develop our tools for access and harvesting (crawler.archive.org)

Internet Archive’s priorities • Collaboration and Partnerships • Continue to act as a technology partner in providing web archiving services to government and memory institutions • Continue to develop Open Source software • Develop common tools, storage formats and standards through the IIPC (International Internet Preservation Consortium) • Open Content Alliance (OCA) digital books project • Multiple copies across the world • Within IA’s own facilities and with partners such as LC, BnF, Library of Alexandria

RLG’s web archiving program • Collaborative collection development • Descriptive metadata for web archives • Usability/user studies • Intellectual property concerns • Web Archiving 101 • Web archiving services and software
