Metadata Extraction & Web Archives: Automating the Record Creation Process

Abbie Grotke / abgr@loc.gov
Gina Jones / gjon@loc.gov
Library of Congress
Office of Strategic Initiatives
Web Capture Team

Library of Congress Web Archives
• Since 2000, 20+ thematic, event-based collections
• 100 TB+ of data collected
• 12,500+ URLs
http://www.loc.gov/lcwa

Web Archiving Tools
• Crawling:
  • Heritrix
  • WARC
• Access:
  • Wayback Machine
  • NutchWAX
International Internet Preservation Consortium
netpreserve.org

LC’s Web Archive Workflow
• Identify & select URLs (LS or LAW)
• Determine crawl strategy, create a seed list for crawling (OSI)
• Sites harvested by Internet Archive or in-house crawlers (OSI)
• Quality Review (OSI & curators)
• Create “catalogers list” (OSI) and XML MODS template (LS) for metadata extraction

Describing the Archives
• Collection-level MARC record in OPAC
• Item-level MODS records in LCWA
• One record per recommended URL for each distinct collection
• With so many thousands of URLs to process, how do we streamline the process?

XML MODS Template

Metadata Extraction
• For each URL that will be cataloged:
  • Get archived web site metadata
  • Combine with URL Nominations Database metadata
  • If elections/campaign web site, metadata also pulled from our candidate Access database (used to create subject terms)
• Using XML template, we add collection and record level metadata
• Create a single file for delivery
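The per-URL combination step above can be sketched as a simple merge of three lookups keyed by URL. This is a minimal illustration only, not the Library's actual tooling: the function name, the dictionary field names, and the way a subject term is built from candidate data are all hypothetical.

```python
def build_record(url, archive_meta, nominations, candidates=None):
    """Merge metadata for one URL from the three sources into one dict.

    All three inputs are assumed (hypothetically) to be dicts keyed by URL.
    """
    record = {"url": url}
    # 1. Metadata pulled from the archived web site itself.
    record.update(archive_meta.get(url, {}))
    # 2. Curator-supplied metadata from the URL Nominations Database.
    record.update(nominations.get(url, {}))
    # 3. Election/campaign sites only: derive an extra subject term
    #    from the candidate data (the term format is an assumption).
    if candidates and url in candidates:
        c = candidates[url]
        term = f"{c['name']} ({c['party']}, {c['state']})"
        record["subjects"] = record.get("subjects", []) + [term]
    return record


archive_meta = {"http://example.org/": {"title": "Example Campaign",
                                        "first_capture": "2008-01-15"}}
nominations = {"http://example.org/": {"category": "Campaign",
                                       "subjects": ["Elections"]}}
candidates = {"http://example.org/": {"name": "Jane Doe",
                                      "party": "Independent",
                                      "state": "Ohio"}}

record = build_record("http://example.org/", archive_meta,
                      nominations, candidates)
```

Later sources overwrite earlier ones on a key clash, so in this sketch the nominations data wins over values extracted from the archive; a real workflow may want the opposite or a field-by-field rule.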

Data Sources for Metadata Extraction

URL Nominations Database
• URL
• Access Rights
• Language(s)
• Category
• Subject Terms

Election Candidate Metadata
• Name
• URL
• Party Affiliation
• State
• Race
• District (House)

Archived Web Site Metadata
• From 1st capture:
  • Document Title
  • Keywords
  • Abstract
  • Mime Types
• From Wayback index:
  • Capture Dates (First & Last)
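As an illustration of what pulling these fields from a capture might look like, the sketch below reads the title, keywords, and description out of an HTML page's head, and derives first and last capture dates from a list of timestamps. It assumes the abstract comes from the page's "description" meta tag and that capture timestamps arrive as 14-digit YYYYMMDDhhmmss strings; the class and function names are hypothetical, and MIME type handling is left out.

```python
from datetime import datetime
from html.parser import HTMLParser


class HeadMetadataParser(HTMLParser):
    """Collect <title> text and selected <meta name=...> values."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            a = dict(attrs)
            name = (a.get("name") or "").lower()
            if name in ("keywords", "description") and a.get("content"):
                self.meta[name] = a["content"].strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_head_metadata(html):
    """Return title, keywords, and abstract found in one HTML document."""
    parser = HeadMetadataParser()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "keywords": parser.meta.get("keywords", ""),
        "abstract": parser.meta.get("description", ""),
    }


def capture_range(timestamps):
    """Return (first, last) capture dates as ISO strings."""
    parsed = sorted(datetime.strptime(ts, "%Y%m%d%H%M%S") for ts in timestamps)
    return parsed[0].date().isoformat(), parsed[-1].date().isoformat()


page = ('<html><head><title> Vote Smith </title>'
        '<meta name="Keywords" content="election, senate">'
        '<meta name="description" content="Campaign site">'
        '</head><body>Welcome</body></html>')
head = extract_head_metadata(page)
first, last = capture_range(["20081104120000", "20080115093000"])
```

Pages without a title or meta tags simply yield empty strings here, which leaves room for a cataloger to fill the gaps by hand.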

Combined Data in Template
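One way to picture the combined data landing in the template is the sketch below, which fills a tiny stand-in MODS skeleton using Python's standard library. The skeleton is not the Library's actual template (which is not reproduced here); it only uses a handful of MODS elements (titleInfo/title, abstract, location/url, originInfo/dateCaptured), and the record field names are hypothetical.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)

# Stand-in template: empty elements waiting for per-record values.
TEMPLATE = (
    f'<mods:mods xmlns:mods="{MODS_NS}">'
    "<mods:titleInfo><mods:title/></mods:titleInfo>"
    "<mods:abstract/>"
    "<mods:location><mods:url/></mods:location>"
    "<mods:originInfo>"
    '<mods:dateCaptured point="start"/>'
    '<mods:dateCaptured point="end"/>'
    "</mods:originInfo>"
    "</mods:mods>"
)


def fill_template(record):
    """Return one MODS record (as a string) for a merged metadata dict."""
    ns = {"mods": MODS_NS}
    root = ET.fromstring(TEMPLATE)
    root.find("mods:titleInfo/mods:title", ns).text = record["title"]
    root.find("mods:abstract", ns).text = record.get("abstract", "")
    root.find("mods:location/mods:url", ns).text = record["url"]
    for el in root.findall("mods:originInfo/mods:dateCaptured", ns):
        key = "first_capture" if el.get("point") == "start" else "last_capture"
        el.text = record[key]
    # ElementTree escapes characters such as "&" when serializing.
    return ET.tostring(root, encoding="unicode")


xml_record = fill_template({
    "title": "Smith & Jones for Senate",
    "abstract": "Campaign site",
    "url": "http://example.org/",
    "first_capture": "2008-01-15",
    "last_capture": "2008-11-04",
})
```

Setting element text through the XML library rather than by string substitution means special characters in titles or abstracts are escaped automatically, so each generated record stays well-formed.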
