Learn about NDIIPP's web preservation strategies, partnerships, and challenges in preserving born-digital and "at-risk" web content. Discover the collaborative initiatives and collection strategies shaping the future of web archiving.
LC Perspective: Preservation Partnerships • Martha Anderson, Program Officer, NDIIPP • Office of Strategic Initiatives, Library of Congress • April 2005
Born Digital “At-Risk” Web Sites http://www.loc.gov/minerva/collect/elec2000 http://www.loc.gov/minerva/collect/sept11
NDIIPP Strategic Direction: take actions that are • Catalytic: invest in existing strengths • Collaborative: engage partners in areas of mutual interest and expertise • Iterative: learn by doing • Strategic: a broad spectrum of balanced short-term & long-term investments
Web of projects [diagram; labels: NARA, GPO, LC Web Projects, UIUC, IA, Preservation Partners, IIPC, AIHT, NDIIPP, CDL, States Initiative]
Library of Congress Web Archiving Strategy • Collaborate with partners working on the same preservation issues • Develop collection strategies to leverage available resources • Learn by doing
Collaborate with partners working on the same preservation issues • Membership in the International Internet Preservation Consortium (IIPC) • Cooperative projects with NDIIPP Preservation Partners: California Digital Library; University of Illinois at Urbana-Champaign • Technical information sharing with other US government agencies: Government Printing Office; National Archives and Records Administration
Develop collection strategies to leverage available resources • Collect thematically, both by crawling and by acquiring collections gathered by others
Learn by doing • Case studies and regular collection of theme-based collections • Participate in tools development with IIPC • Archive Ingest & Handling Test (AIHT)
Challenges of collecting from the Web • Characteristics of the resource--dynamic, deep, linked • Intellectual property laws and regulations • Tension of preservation vs access goals • Degree of alignment with current collection policies for other media • Curation strategy • Tools for identification and selection • Tools for collection, curation, and archiving of large web collections
Average Web Collection • Begins with a theme or event • Usually does not include commercial sites • Starts with a list of about 200 URLs • Is crawled by a vendor • Yields about 1 TB of data per month • Is crawled about once a week
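The figures on this slide allow a rough back-of-envelope estimate of how much data a single crawl produces. The sketch below is only illustrative arithmetic: the slide's numbers are approximate, and the conversion factors (52 crawls per year, decimal units) are assumptions, not values from the presentation.

```python
# Figures quoted on the slide (approximate).
TB_PER_MONTH = 1.0
SEED_URLS = 200

# Assumption: "once a week" means 52 crawls per year, about 4.33 per month.
CRAWLS_PER_MONTH = 52 / 12
# Assumption: decimal units, 1 TB = 1000 GB.
GB_PER_TB = 1000


def gb_per_crawl(tb_per_month=TB_PER_MONTH, crawls_per_month=CRAWLS_PER_MONTH):
    """Average data volume of one weekly crawl, in GB."""
    return tb_per_month * GB_PER_TB / crawls_per_month


def gb_per_seed_per_crawl(seed_urls=SEED_URLS):
    """Average data volume gathered per seed URL in one crawl, in GB."""
    return gb_per_crawl() / seed_urls


if __name__ == "__main__":
    print(f"{gb_per_crawl():.0f} GB per weekly crawl")
    print(f"{gb_per_seed_per_crawl():.2f} GB per seed URL per crawl")
```

Under these assumptions a typical crawl comes to roughly 231 GB, or a little over 1 GB per seed URL, which suggests each seed stands for a whole site rather than a single page.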
Web Collections to date at LC • Event-based: US National Elections (2000, 2002, 2004); War in Iraq; September 11 • Public Policy Topics: Health Care; Legislative Branch; Terrorism • 26 TB
Archive Ingest & Handling Test (AIHT) • AIHT is a first test of the proposed NDIIPP preservation architecture. • The test is conducted with a common data set: the George Mason University 9/11 Archive. • Phase I tests ingest and data handling in local systems. • Phase II tests export and import between institutions. • Phase III explores format migration.
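One common way to check that a data set survives ingest and transfer between institutions intact is a checksum manifest. The slides do not say which method the AIHT participants used, so the following is only a minimal sketch of the general technique; the function names and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its checksum."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify_manifest(root, manifest):
    """Return the sorted relative paths that are missing or have changed."""
    current = build_manifest(root)
    return sorted(
        name for name, digest in manifest.items() if current.get(name) != digest
    )
```

In this sketch the exporting institution would run `build_manifest` before sending the archive, and the importing institution would run `verify_manifest` on its copy; an empty list means every listed file arrived unchanged.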
[Diagram: GMU 9/11 Archive; participants exchange the archive; participants demonstrate capabilities]
Participants • Old Dominion University, Department of Computer Science • Stanford University Libraries & Academic Information Resources • The Johns Hopkins University, Sheridan Libraries • Harvard University Library
George Mason University 9/11 Archive: Breakdown by File Types • 57,450+ files • 12 GB • Originally stored in a Linux environment
Goals of AIHT • Gain practical experience with multiple institutions • Document transfer and ingest processes for multiple systems • Determine next set of tasks for developing interfaces between layers and institutions
Status of AIHT • All phases completed. • Imports focused on technical assessment of the archive and on developing tools to examine it • Exports included METS and MPEG-21 DID objects • Migrations included transforms to JPEG 2000 and TIFF, and some exploration of HTML to XML and AVI to MPEG • Full report expected by early summer.
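For readers unfamiliar with METS, the export format named above, the sketch below builds a skeletal METS document that lists files and references them from a structural map. It is not the structure any AIHT participant actually produced: the `build_mets` function, the file ID scheme, and the sample paths are hypothetical, and the output has not been validated against the METS schema.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def build_mets(files):
    """Build a skeletal METS tree from (relative_path, mime_type) pairs."""
    root = ET.Element(f"{{{METS_NS}}}mets")
    file_sec = ET.SubElement(root, f"{{{METS_NS}}}fileSec")
    file_grp = ET.SubElement(file_sec, f"{{{METS_NS}}}fileGrp")
    struct_map = ET.SubElement(root, f"{{{METS_NS}}}structMap")
    top_div = ET.SubElement(struct_map, f"{{{METS_NS}}}div")

    for index, (path, mime_type) in enumerate(files, start=1):
        file_id = f"FILE{index:05d}"  # hypothetical ID scheme
        file_el = ET.SubElement(
            file_grp,
            f"{{{METS_NS}}}file",
            {"ID": file_id, "MIMETYPE": mime_type},
        )
        # FLocat records where the file's content can be found.
        ET.SubElement(
            file_el,
            f"{{{METS_NS}}}FLocat",
            {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": path},
        )
        # The structural map points back at the file by its ID.
        ET.SubElement(top_div, f"{{{METS_NS}}}fptr", {"FILEID": file_id})
    return root


if __name__ == "__main__":
    tree = build_mets([("index.html", "text/html"), ("img/photo.jpg", "image/jpeg")])
    print(ET.tostring(tree, encoding="unicode"))
```

A real export would also carry descriptive and administrative metadata sections; they are omitted here to keep the file inventory and structural map visible.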
For more information…. • NDIIPP Technical Architecture version 0.2 http://www.digitalpreservation.gov • International Internet Preservation Consortium http://netpreserve.org/about/index.php • MINERVA: Mapping the INternet Electronic Resources Virtual Archive http://www.loc.gov/minerva/
Martha Anderson • NDIIPP Program Officer • Office of Strategic Initiatives • The Library of Congress, Washington, DC • mande@loc.gov