Challenges in Web Archiving: Library of Congress Edition
Office of Strategic Initiatives
Abbie Grotke, Web Archiving Team
NDIIPP Partner Meeting, July 21, 2010
All Hands Meeting-March 2010

Library of Congress Web Archiving Program
• 10 years of archiving
• 5 full-time OSI staff on our team, plus 2 contractors and other IT and Web Services support
• 80+ staff selecting content for our collections: Library Services, Law Library, and Congressional Research Service
• 30+ event and thematic collections
• 12,500+ URLs processed and permissions sent
• 181 TB of content collected

What We Do Pretty Well At This Point
• Web archiving workflows and processes have evolved and become more institutionalized
• Improved crawling strategies so we can react more quickly, manage our archive data better, and better serve our customers at LC
• Large-scale contract crawling by Internet Archive
• A move from collection-by-collection crawling to monthly and weekly “crawl buckets”
• Small-scale in-house crawling now available
  • Tests, emergency crawls

What We Do Pretty Well At This Point
• Better tools now to more easily manage our team’s work and all data about various activities: nomination, permissions, crawling, quality review, reporting, etc.
• Automation of manual activities to reduce time spent processing URLs for our nominators and our team

Ongoing Challenges
• Selection
  • What to select: so many URLs, so little time
  • No full-time selection staff, everyone is busy
• Quality Review
  • Training to involve Nominators more in the process: “Did we get what you wanted us to get?”
• Team Resources
  • 14 web archive projects actively crawling
  • Testing our bandwidth

Ongoing Challenges
• Legal
  • Permissions: still only about a 50% response rate
  • Access for Researchers
• Harvesting
  • Collection of specific types of content: rapidly changing news content, YouTube
  • Training Nominators re: frequency of collection
  • Ramping up in-house crawling (Can we? Should we?)
• The Data
  • How do we transfer this content easily, both from IA and within LC?
  • How do we manage it, store it, and preserve it?

More Information
• Web Archiving Team Public Page (about the activity):
  http://www.loc.gov/webarchiving/
• Library of Congress Web Archives (our collections):
  http://lcweb2.loc.gov/diglib/lcwa/html/lcwa-home.html
• Digital Preservation Video on Web Archiving:
  http://www.digitalpreservation.gov/videos/webarch09/index.html
• Contact: Abbie Grotke, abgr@loc.gov