
Metadata Extraction & Web Archives: Automating the Record Creation Process



Presentation Transcript


  1. Metadata Extraction & Web Archives: Automating the Record Creation Process
     Abbie Grotke / abgr@loc.gov
     Gina Jones / gjon@loc.gov
     Library of Congress, Office of Strategic Initiatives, Web Capture Team

  2. Library of Congress Web Archives
     • Since 2000, 20+ thematic, event-based collections
     • 100 TB+ of data collected
     • 12,500+ URLs
     http://www.loc.gov/lcwa

  3. Web Archiving Tools
     • Crawling:
       • Heritrix
       • WARC
     • Access:
       • Wayback Machine
       • NutchWAX
     International Internet Preservation Consortium: netpreserve.org

  4. LC’s Web Archive Workflow
     • Identify & select URLs (LS or LAW)
     • Determine crawl strategy, create a seed list for crawling (OSI)
     • Sites harvested by Internet Archive or in-house crawlers (OSI)
     • Quality Review (OSI & curators)
     • Create “catalogers list” (OSI) and XML MODS template (LS) for metadata extraction

  5. Describing the Archives
     • Collection-level MARC record in OPAC
     • Item-level MODS records in LCWA
     • One record per recommended URL for each distinct collection
     • With so many thousands of URLs to process, how do we streamline the process?

  6. XML MODS Template

  7. Metadata Extraction
     • For each URL that will be cataloged:
       • Get archived web site metadata
       • Combine with URL Nominations Database metadata
       • If elections/campaign web site, metadata also pulled from our candidate Access database (used to create subject terms)
     • Using XML template, we add collection- and record-level metadata
     • Create a single file for delivery
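
The steps on this slide can be sketched in Python as follows. This is a minimal illustration, not the Library of Congress's actual tooling: the function names, the dictionary keys (`url`, `subjects`, `title`, `abstract`), and the choice of which MODS elements receive which values are all assumptions made for the example. Only the MODS namespace and element names (`modsCollection`, `mods`, `titleInfo`, `abstract`, `subject`, `location`, `relatedItem`) come from the MODS schema itself.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)


def q(tag):
    """Return the namespace-qualified form of a MODS element name."""
    return f"{{{MODS_NS}}}{tag}"


def build_record(nomination, archived, collection_title):
    """Merge the per-URL sources into one MODS record (hypothetical mapping)."""
    mods = ET.Element(q("mods"))

    # Title from the archived site, falling back to the URL itself.
    title_info = ET.SubElement(mods, q("titleInfo"))
    ET.SubElement(title_info, q("title")).text = (
        archived.get("title") or nomination["url"]
    )

    if archived.get("abstract"):
        ET.SubElement(mods, q("abstract")).text = archived["abstract"]

    # Subject terms from the nominations data (assumed key: "subjects").
    for term in nomination.get("subjects", []):
        subject = ET.SubElement(mods, q("subject"))
        ET.SubElement(subject, q("topic")).text = term

    location = ET.SubElement(mods, q("location"))
    ET.SubElement(location, q("url")).text = nomination["url"]

    # Collection-level metadata, identical for every record in the collection.
    related = ET.SubElement(mods, q("relatedItem"), {"type": "host"})
    related_title = ET.SubElement(related, q("titleInfo"))
    ET.SubElement(related_title, q("title")).text = collection_title
    return mods


def build_delivery_file(nominations, archived_by_url, collection_title):
    """Create a single XML document holding one record per cataloged URL."""
    collection = ET.Element(q("modsCollection"))
    for nomination in nominations:
        archived = archived_by_url.get(nomination["url"], {})
        collection.append(build_record(nomination, archived, collection_title))
    return ET.tostring(collection, encoding="unicode")
```

Calling `build_delivery_file(nominations, archived_by_url, "Some Collection")` returns one XML string with a record for every nominated URL, which corresponds to the "single file for delivery" step; URLs with no archived metadata still get a record titled with the URL.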

  8. Data Sources for Metadata Extraction

  9. URL Nominations Database
     • URL
     • Access Rights
     • Language(s)
     • Category
     • Subject Terms

  10. Election Candidate Metadata
     • Name
     • URL
     • Party Affiliation
     • State
     • Race
     • District (House)
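
As a rough illustration of how candidate fields like these could be turned into subject terms, here is a small Python sketch. The `Candidate` class, the `subject_terms` function, and the wording of the generated terms are hypothetical; the slides do not say how the actual subject terms are constructed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    """One row of candidate metadata (field names are assumptions)."""
    name: str
    url: str
    party: str
    state: str
    race: str
    district: Optional[str] = None  # only meaningful for House races


def subject_terms(candidate: Candidate) -> List[str]:
    """Derive a flat list of subject terms from one candidate row."""
    terms = [candidate.name, candidate.party, candidate.state, candidate.race]
    if candidate.race.lower() == "house" and candidate.district:
        terms.append(f"{candidate.state}, District {candidate.district}")
    # Drop empty values so blank database cells do not become subjects.
    return [term for term in terms if term]
```

Keeping the derivation in one function means the same rules apply to every candidate site in a collection, which is the point of automating record creation.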

  11. Archived Web Site Metadata
     • From 1st capture:
       • Document Title
       • Keywords
       • Abstract
       • Mime Types
     • From Wayback index:
       • Capture Dates (First & Last)
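
The two kinds of archived-site metadata listed here can be sketched with the Python standard library. Several things are assumptions for the sake of the example: that the abstract comes from the page's `<meta name="description">` tag, that keywords come from `<meta name="keywords">`, and that each index line is space-separated with a URL key in the first field and a 14-digit timestamp in the second, a layout commonly seen in CDX index files. The function and class names are hypothetical.

```python
from datetime import datetime
from html.parser import HTMLParser


class HeadMetadataParser(HTMLParser):
    """Collect the <title> text and named <meta> tags from an HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content"):
            self.meta[attrs["name"].lower()] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_head_metadata(html_text):
    """Return title, keywords, and abstract from the first capture's HTML."""
    parser = HeadMetadataParser()
    parser.feed(html_text)
    parser.close()
    return {
        "title": parser.title.strip(),
        "keywords": parser.meta.get("keywords", ""),
        "abstract": parser.meta.get("description", ""),  # assumed mapping
    }


def capture_range(index_lines, url_key):
    """Return (first, last) capture dates for url_key, or None if absent."""
    stamps = []
    for line in index_lines:
        fields = line.split()
        if len(fields) < 2 or fields[0] != url_key:
            continue
        stamps.append(datetime.strptime(fields[1], "%Y%m%d%H%M%S"))
    if not stamps:
        return None
    return min(stamps).date(), max(stamps).date()
```

Taking the minimum and maximum timestamps, rather than the first and last lines, avoids depending on the index being sorted.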

  12. Combined Data in Template

  13. Combined Data in Template

  14. Combined Data in Template
