Digital Preservation and the Open Web: A Curatorial Perspective

Digital Preservation and the Open Web: A Curatorial Perspective Terence K. Huwe Institute of Industrial Relations University of California, Berkeley Computers In Libraries March, 2006

Overview • A Brief Description of “The Web at Risk” Project • How it’s organized, who’s involved • Objectives of the Project • Preservation of the open Web • Development of an open source “Tool Kit” • How it works, where it’s going, from a “special collections” perspective

The Web at Risk Project • 3-year, $2.4 million grant from the Library of Congress National Digital Information Infrastructure and Preservation Program (NDIIPP) • Coordinating Agency: The California Digital Library • Primary focus on developing open access archiving tools that can be applied to any discipline with Web content worth keeping • Extensible, modular, easily configured to work with technologies that are already in place

Project Stages • Content Identification and Selection: key issues for analysis, framework for sample crawls, working with collection partners, exploring extensibility • Content Acquisition: content harvest and acquisition, configuring of Web Crawler, Analyzer, Content User Interface (CUI), Export/Import Handler • Content Retention: data model for Web Archive Digital Objects (WADO) testing and modification, assessing the CDL Digital Preservation Repository for ingest and retention • Partnership Building: model agreements for content retention, evaluate future steps, assess costs of sustaining a distributed approach to Web archiving

Partners in this NDIIPP Grant • Main Partners: • New York University • University of North Texas, The Libraries • Texas Center for Digital Knowledge • Technical Partners: • UC San Diego Supercomputer Center • Stanford University Computer Science Department • Sun Microsystems, Inc.

National Curatorial Partners • Arizona State University Library and Archive • New York University Tamiment Library • University of North Texas, The Libraries • Stanford University Library’s Social Sciences Research Center

University of California Curatorial Partners • UCLA Online Campaign Literature Archive • UC Berkeley Institute of Governmental Studies Library • UC Berkeley Institute of Industrial Relations Library • Eight UC Libraries in the Federal Depository Library Program: • Berkeley, Davis, Irvine, UCLA, Riverside, San Diego, Santa Barbara, Santa Cruz

The Institute of Industrial Relations: Capturing Labor History in Action • News, data and links are being generated by unions at both the international and local level • Union priorities are necessarily “just in time” and they operate in a state of high triage • Preserving these data is a high priority for IIR and the NYU Tamiment Library • It’s not likely that a non-academic host will do so, making the challenge more urgent

Where Things Stand Now • We’ve got a Wiki and curators are in touch • IIR and NYU/Tamiment are coordinating on labor issues • Technical issues have moved to the fore • Figuring out the configuration of the crawler, what to crawl • The first crawl report has come back • The results are provocative and interesting

First Crawl Highlights • 30 sites crawled, max set to 1 gigabyte • 18 hit the 1 gigabyte limit • Average files on host: 6,359 • Average with Linked hosts included: 17,247 • Most files on a single server: 46,197 • Median Duration of crawl (host): 7hr 33m • The crawler, Heritrix 1.5.1, returned different data than other crawlers (HTTrack, Wget)
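The headline figures above can be put in proportion with a few lines of arithmetic. This is only a sketch that reuses the numbers on the slide; the variable names are illustrative and are not taken from the crawl report itself.

```python
# Figures copied from the "First Crawl Highlights" slide.
SITES_CRAWLED = 30
SITES_AT_LIMIT = 18                    # sites that reached the 1 gigabyte cap
AVG_FILES_ON_HOST = 6_359
AVG_FILES_WITH_LINKED_HOSTS = 17_247

# Share of crawled sites whose capture was cut off by the size cap.
capped_share = SITES_AT_LIMIT / SITES_CRAWLED

# How much the average file count grows once linked hosts are included.
linked_ratio = AVG_FILES_WITH_LINKED_HOSTS / AVG_FILES_ON_HOST

print(f"{capped_share:.0%} of sites hit the size cap")
print(f"Linked hosts multiply the average file count by {linked_ratio:.2f}")
```

In other words, a majority of the sample crawls were truncated by the cap, and following links to other hosts nearly tripled the average number of files per site.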

Rights and Permissions Vary According to Host A three level scheme for future rights management: • Consent Implied: Crawl without permission • 14 sites in this category • Consent Sought: Crawl but also identify and notify the data owner • 13 sites in this category • Consent Required: Advance permission needed • 3 sites in this category
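One way to picture the three-level scheme is as a small data structure that a curator's tooling could consult before a crawl. The sketch below is an assumption made for illustration: the names `Consent`, `may_crawl`, and `must_notify_owner` are hypothetical and are not part of the project's toolkit. Only the three tiers and the site counts come from the slide.

```python
from enum import Enum


class Consent(Enum):
    """The three rights tiers named on the slide."""
    IMPLIED = "crawl without permission"
    SOUGHT = "crawl, but identify and notify the data owner"
    REQUIRED = "advance permission needed"


# Site counts per tier, as reported for the first crawl.
SITES_PER_TIER = {Consent.IMPLIED: 14, Consent.SOUGHT: 13, Consent.REQUIRED: 3}


def may_crawl(tier, permission_granted=False):
    """Hypothetical rule: only the REQUIRED tier blocks a crawl until permission arrives."""
    if tier is Consent.REQUIRED:
        return permission_granted
    return True


def must_notify_owner(tier):
    """Hypothetical rule: the SOUGHT tier obliges the curator to contact the data owner."""
    return tier is Consent.SOUGHT


for tier, count in SITES_PER_TIER.items():
    print(f"{tier.name}: {count} sites, crawl now? {may_crawl(tier)}")
```

The counts across the three tiers add up to the 30 sites in the first crawl.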

WEb aRchive Access (WERA) • An open source tool for viewing crawl results • Very new, very much still in development • Relies upon a search query to display the crawled resources • Does not yet show how an average user would use a finished collection

The Fine Print Matters • Heritrix 1.5.1 doesn’t capture the directory tree of servers; it follows links • Many domains involve multiple servers, and crucial files (such as CSS libraries) need to be captured • The value of capturing linked files varies from site to site, from irrelevant to vitally important
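The difference between walking a directory tree and following links can be illustrated with a short sketch. This is not how Heritrix is implemented; it is a minimal illustration using Python's standard library, and the class name `LinkCollector`, the function `split_by_host`, and the example.org addresses are all hypothetical. It shows why a link-following crawler only sees what pages point to, and how a stylesheet on a second server ends up counted as a "linked host" file.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

# Tag/attribute pairs that point at other resources (a deliberately short list).
LINK_ATTRS = {("a", "href"), ("link", "href"), ("img", "src"), ("script", "src")}


class LinkCollector(HTMLParser):
    """Collects absolute URLs for every link found in one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value and (tag, name) in LINK_ATTRS:
                # Resolve relative links against the page's own address.
                self.links.append(urljoin(self.base_url, value))


def split_by_host(base_url, html):
    """Separate a page's links into same-host links and links to other hosts."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    seed_host = urlsplit(base_url).hostname
    on_host, off_host = [], []
    for link in collector.links:
        target = on_host if urlsplit(link).hostname == seed_host else off_host
        target.append(link)
    return on_host, off_host


# Made-up page: the stylesheet lives on a different server than the page itself.
SAMPLE_PAGE = """
<html><head><link rel="stylesheet" href="http://static.example.org/site.css"></head>
<body><a href="/news/index.html">News</a>
<a href="http://partner.example.net/report.pdf">Report</a></body></html>
"""

on_host, off_host = split_by_host("http://www.example.org/", SAMPLE_PAGE)
print("same host:", on_host)
print("other hosts:", off_host)
```

A crawl restricted to the seed host would keep the news page but miss the stylesheet, which is the kind of crucial linked file the slide warns about. Files on the server that no page links to would not be found either way.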

Curator Perspectives • Need to capture “new publications” as they appear • By a slight majority, monthly intervals are favored for crawl frequency • How much multimedia should be captured? The 1 gigabyte limit obscured the answer • About 70 percent of curators rated the crawl as “mostly effective” • Curators approached the process collaboratively from the very beginning, communicating proactively. This implies that collaborative collection development is viable

What’s Needed • Curators want to see some sort of user interface to evaluate the experience of viewing archived Web resources • The relationship between a particular host and whatever it links to is stimulating debate; probably both need to be captured • Long term sustainability of this project will depend on attracting interest from government and industry

Looking Ahead • The Open Access toolkit will be rigorously tested (and will not appear for at least 2 years) • This approach places most responsibility with curators, just as special collection development activity would mandate • This is a new stream of work for information professionals, but the standardization of the toolkit could be an important innovation

Conclusions • The profession-wide culture of collaborative collection development is alive and well, and digesting new digital collection strategies • The combination of a toolkit “deliverable” and the pooled experience of the cohort will be enormously useful for all digital librarians • Hands-on collection experts are in an excellent position to advise technologists in the creation of new digital archiving tools at the ground level

URLs Referenced • The Web at Risk: http://www.cdlib.org/inside/projects/preservation/webatrisk/ • Heritrix Web Site: http://crawler.archive.org/ • WEb aRchive Access: http://nea.nb.no/ • UCLA Campaign Literature Archive: http://digital.library.ucla.edu/campaign • The AFL-CIO: http://www.aflcio.org • Service Employees International Union: http://www.seiu.org • Change to Win: http://www.changetowin.org • The Institute of Industrial Relations Library: http://www.iir.berkeley.edu/library
