Office of Strategic Initiatives

Challenges in Web Archiving: Library of Congress Edition. Office of Strategic Initiatives. Abbie Grotke, Web Archiving Team. NDIIPP Partner Meeting, July 21, 2010. All Hands Meeting-March 2010. Library of Congress Web Archiving Program.


Presentation Transcript


  1. Challenges in Web Archiving: Library of Congress Edition. Office of Strategic Initiatives. Abbie Grotke, Web Archiving Team. NDIIPP Partner Meeting, July 21, 2010. All Hands Meeting-March 2010.

  2. Library of Congress Web Archiving Program
     • 10 years of archiving
     • 5 full-time OSI staff on our team, plus 2 contractors, and other IT and Web Services support
     • 80+ staff selecting content for our collections: Library Services, Law Library, and Congressional Research Service
     • 30+ event and thematic collections
     • 12,500+ URLs processed and permissions sent
     • 181 TB of content collected

  3. What We Do Pretty Well At This Point
     • Web archiving workflows and processes have evolved and become more institutionalized
     • Improved crawling strategies so we can react more quickly, manage our archive data better, and better serve our customers at LC
     • Large-scale contract crawling by Internet Archive
     • A move from collection-by-collection crawling to monthly and weekly “crawl buckets”
     • Small-scale in-house crawling now available
       • tests, emergency crawls

  4. What We Do Pretty Well At This Point
     • Better tools now to more easily manage our team’s work and all data about various activities: nomination, permissions, crawling, quality review, reporting, etc.
     • Automation of manual activities to reduce time spent processing URLs for our nominators and our team

  5. Ongoing Challenges
     • Selection
       • What to select - so many URLs, so little time
       • No full-time selection staff, everyone is busy
     • Quality Review
       • Training to involve Nominators more in the process – “Did we get what you wanted us to get?”
     • Team Resources:
       • 14 web archive projects actively crawling
       • Testing our bandwidth

  6. Ongoing Challenges
     • Legal
       • Permissions: still only about 50% response rate
       • Access for Researchers
     • Harvesting:
       • Collection of specific types of content: rapidly changing news content, YouTube
       • Training Nominators re: frequency of collection
       • Ramping up in-house crawling (Can we? Should we?)
     • The Data:
       • How do we transfer this content easily? From IA and within LC
       • How do we manage it, store it, and preserve it?

  7. More Information
     • Web Archiving Team Public Page (about the activity):
       • http://www.loc.gov/webarchiving/
     • Library of Congress Web Archives (our collections):
       • http://lcweb2.loc.gov/diglib/lcwa/html/lcwa-home.html
     • Digital Preservation Video on Web Archiving:
       • http://www.digitalpreservation.gov/videos/webarch09/index.html
     • Contact: Abbie Grotke, abgr@loc.gov
