Metadata Extraction & Web Archives: Automating the Record Creation Process
Abbie Grotke / abgr@loc.gov
Gina Jones / gjon@loc.gov
Library of Congress
Office of Strategic Initiatives
Web Capture Team
Library of Congress Web Archives
• Since 2000, 20+ thematic, event-based collections
• 100 TB+ of data collected
• 12,500+ URLs
http://www.loc.gov/lcwa
Web Archiving Tools
• Crawling:
  • Heritrix
  • WARC
• Access:
  • Wayback Machine
  • NutchWAX
International Internet Preservation Consortium (netpreserve.org)
LC’s Web Archive Workflow
• Identify & select URLs (LS or LAW)
• Determine crawl strategy, create a seed list for crawling (OSI)
• Sites harvested by Internet Archive or in-house crawlers (OSI)
• Quality review (OSI & curators)
• Create “catalogers list” (OSI) and XML MODS template (LS) for metadata extraction
Describing the Archives
• Collection-level MARC record in OPAC
• Item-level MODS records in LCWA
• One record per recommended URL for each distinct collection
• With so many thousands of URLs to process, how do we streamline the process?
Metadata Extraction
• For each URL that will be cataloged:
• Get archived web site metadata
• Combine with URL Nominations Database metadata
• For election/campaign web sites, metadata is also pulled from our candidate Access database (used to create subject terms)
• Using an XML template, we add collection- and record-level metadata
• Create a single file for delivery
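The combining step above can be sketched in Python. This is a minimal illustration, not the team's actual tooling: the dictionary keys, the function names, and the rule for turning candidate data into a subject term are all hypothetical, and the sketch builds elements directly rather than filling in the XML template. Only the MODS namespace and element names (`modsCollection`, `mods`, `titleInfo`, `abstract`, `location/url`, `accessCondition`, `relatedItem`, `subject/topic`) come from the MODS schema.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)


def q(tag):
    """Qualify a tag name with the MODS namespace."""
    return f"{{{MODS_NS}}}{tag}"


def build_record(archived, nomination, collection_title, candidate=None):
    """Merge the per-URL metadata sources into one MODS record element.

    All dictionary keys here are hypothetical stand-ins for the real fields.
    """
    mods = ET.Element(q("mods"))

    # Record-level metadata taken from the archived web site.
    title_info = ET.SubElement(mods, q("titleInfo"))
    ET.SubElement(title_info, q("title")).text = archived["title"]
    ET.SubElement(mods, q("abstract")).text = archived.get("abstract", "")

    # Metadata taken from the URL Nominations Database.
    location = ET.SubElement(mods, q("location"))
    ET.SubElement(location, q("url")).text = nomination["url"]
    ET.SubElement(mods, q("accessCondition")).text = nomination["access_rights"]

    # Collection-level metadata, recorded as the host collection.
    host = ET.SubElement(mods, q("relatedItem"), {"type": "host"})
    host_title = ET.SubElement(host, q("titleInfo"))
    ET.SubElement(host_title, q("title")).text = collection_title

    # Subject terms; the candidate-derived term is an assumed format.
    subjects = list(nomination.get("subject_terms", []))
    if candidate is not None:
        subjects.append(f"{candidate['name']} ({candidate['party']}, {candidate['state']})")
    for term in subjects:
        subject = ET.SubElement(mods, q("subject"))
        ET.SubElement(subject, q("topic")).text = term
    return mods


def build_delivery_file(records):
    """Wrap all records in one modsCollection and serialize it to a string."""
    collection = ET.Element(q("modsCollection"))
    collection.extend(records)
    return ET.tostring(collection, encoding="unicode")
```

Calling `build_record` once per cataloged URL and passing the results to `build_delivery_file` yields the single delivery file described in the last bullet.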
URL Nominations Database
• URL
• Access Rights
• Language(s)
• Category
• Subject Terms
Election Candidate Metadata
• Name
• URL
• Party Affiliation
• State
• Race
• District (House)
Archived Web Site Metadata
• From 1st capture:
  • Document Title
  • Keywords
  • Abstract
  • Mime Types
• From Wayback index:
  • Capture Dates (First & Last)
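As a rough illustration of how these fields might be pulled from an archived page, the Python sketch below reads the title and `<meta>` tags from the HTML of a first capture and derives first and last capture dates from Wayback-style 14-digit timestamps (YYYYMMDDhhmmss). It is a sketch under assumptions: the class and function names are hypothetical, mapping the `description` meta tag to the abstract is a guess about the source of that field, and mime types are left out.

```python
from datetime import datetime
from html.parser import HTMLParser


class HeadMetadataParser(HTMLParser):
    """Collect the <title> text and name/content pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content"):
            # Meta names are compared case-insensitively.
            self.meta[attrs["name"].lower()] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_page_metadata(html):
    """Return title, keywords, and abstract found in a captured HTML page."""
    parser = HeadMetadataParser()
    parser.feed(html)
    keywords = parser.meta.get("keywords", "")
    return {
        "title": parser.title.strip(),
        "keywords": [k.strip() for k in keywords.split(",") if k.strip()],
        # Assumption: the abstract comes from the "description" meta tag.
        "abstract": parser.meta.get("description", ""),
    }


def capture_date_range(timestamps):
    """Return (first, last) capture dates from 14-digit timestamps."""
    dates = sorted(datetime.strptime(ts, "%Y%m%d%H%M%S") for ts in timestamps)
    return dates[0].date(), dates[-1].date()
```

Pages with no keywords or description simply yield an empty list or empty string, so a cataloger can see which records still need manual description.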