Learn how to normalize metadata effectively to enhance services, improve quality, and ensure predictability for reharvesting in your repository. Understand the principles, methodology, and benefits of normalizing metadata for efficient data management.
The NSDL, OAI and Your Metadata Core Infrastructure Metadata Repository (“union catalog”) Naomi Dushay Cornell University
OAI in the NSDL Infrastructure (diagram labels) • Your collection’s metadata • Your collection’s OAI server • NSDL Metadata Repository (MR) • NSDL MR OAI server • Your collection’s metadata, scrubbed & normalized • NSDL Search Service • NSDL Archive Service • other OAI Services • http://nsdl.org
The Metadata Repository • Designed to be scalable • Based on automated harvest/expose model, with OAI at each end • A notion of “normalized” metadata with Qualified Dublin Core as its base
Why do we normalize metadata? • Improve services (e.g. search results, or UI display) • Improve metadata quality, when possible • Enhance predictability of data for reharvesting services
How do we normalize metadata? • Perform “safe” transforms to “smarten up” metadata • XSL stylesheets -- from your XML metadata to our normalized XML metadata • Principles: • Do no harm (Don’t lose information) • Add information, when possible • Indicate schemes for valid values • Remove meaningless text • “…”, “not available”, “-” • Empty elements • Correct wrong information • “text/pdf” → “application/pdf” • Remove characters that impede functionality or display • Encoding fixes (e.g. “&”, double XML encodings, bad UTF-8 …) • Scrub URLs
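To make the "remove meaningless text" and "correct wrong information" principles concrete, here is a minimal sketch. The slides say the real transforms are XSL stylesheets; Python is swapped in here purely for illustration. The `normalize` function, the `MEANINGLESS` set, the `MIME_FIXES` table, and the (element, value) record shape are all hypothetical, not part of the NSDL code.

```python
# Hypothetical illustration of two "safe transform" principles.
# This is NOT the NSDL stylesheet logic, only a sketch of the idea.

# Placeholder values that carry no information (taken from the slide's examples).
MEANINGLESS = {"", "...", "…", "not available", "-"}

# Known-wrong values mapped to their corrections (slide example: text/pdf).
MIME_FIXES = {"text/pdf": "application/pdf"}


def normalize(record):
    """Clean a record given as a list of (element_name, value) pairs."""
    cleaned = []
    for name, value in record:
        text = value.strip()
        if text.lower() in MEANINGLESS:
            continue  # drop empty elements and meaningless placeholder text
        if name == "dc:format":
            # correct wrong information; unknown values pass through unchanged
            text = MIME_FIXES.get(text.lower(), text)
        cleaned.append((name, text))
    return cleaned
```

Elements that are not recognized as meaningless or wrong are passed through untouched, which is one way to respect "do no harm".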
Automated MR Ingest process • Your collection info and harvesting info are registered • OAI validation – can we run our harvester on your OAI server? (see handout) • OAI harvest of your metadata (nsdl_dc if available; oai_dc if not ...) • XML schema validation of all of your metadata • UTF-8 encoding validation; bad UTF-8 chars are made into harmless ones. • Normalized nsdl_dc created. • Your metadata, “raw” and normalized, is loaded into the MR tables and made available to the NSDL’s MR OAI server.
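The UTF-8 validation step can be sketched as follows. The slides do not say what a "harmless" character is; this sketch assumes substitution with U+FFFD (the Unicode replacement character), which is what Python's `errors="replace"` does. The function name `scrub_utf8` is hypothetical.

```python
def scrub_utf8(raw: bytes):
    """Return (text, was_valid) for a byte string that should be UTF-8.

    Assumption: invalid byte sequences are replaced with U+FFFD rather
    than rejected, so processing can continue and the problem can be
    reported to the collection.
    """
    try:
        return raw.decode("utf-8"), True
    except UnicodeDecodeError:
        # Lossy fallback: each undecodable sequence becomes U+FFFD.
        return raw.decode("utf-8", errors="replace"), False
```

Returning the validity flag alongside the text lets a caller both load the scrubbed metadata and notify the collection that its encoding needs attention.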
Automated MR ingest process (diagram labels) • Your collection’s OAI server • Validation • OAI Harvest • NSDL Collection Registration • “raw” or “native” metadata • Validation • Normalize • normalized metadata • NSDL MR OAI server • Metadata Repository • Notify collection of problems; may need to halt processing
OAI-PMH: Key points • OAI-PMH requests are embedded in HTTP • it’s a web service, not a flat file • XML, not HTML • multiple metadata formats are allowed • OAI ≠ simple DC only! • Each metadata format MUST have a valid XML schema
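Since an OAI-PMH request is just an HTTP request with the verb and its arguments as parameters, building one can be sketched with the standard library. The base URL `http://example.org/oai` and the helper name `oai_request_url` are hypothetical; `verb`, `metadataPrefix`, `from`, `until`, and `set` are the protocol's own argument names.

```python
from urllib.parse import urlencode


def oai_request_url(base_url, verb, arguments=None):
    """Build a GET URL for an OAI-PMH request.

    arguments is a dict such as {"metadataPrefix": "oai_dc"}; a dict is
    used because "from" is a reserved word in Python and cannot be a
    keyword argument.
    """
    params = {"verb": verb}
    params.update(arguments or {})
    # urlencode applies URL encoding to every value
    return base_url + "?" + urlencode(params)


url = oai_request_url(
    "http://example.org/oai",  # hypothetical baseURL
    "ListRecords",
    {"metadataPrefix": "oai_dc", "from": "2004-01-01"},
)
```

The response to such a request is an XML document, which is why the schema-validity requirements on the following slides apply to every verb.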
MR ingest requires: compliant OAI 2.0 server • Correctly implements OAI-PMH; queries to all verbs respond correctly. • Every OAI response must be (deeply) XML schema valid • Encodes properly in proper places • XML encoding • URL encoding • UTF-8 encoding
OAI 2.0 – Identify • baseURL • email address • protocol version • description for OAI identifier syntax, especially if adhering to oai-identifier syntax described in Implementation Guidelines
OAI 2.0 – ListMetadataFormats • correct XML namespace for each format • a valid XML schema for each format • targetNamespace MUST match XML namespace above • super easy out: use oai_dc • easy out: use nsdl_dc
OAI 2.0 – ListSets • super easy out: if all your metadata is NSDL relevant, don’t use sets for our sake. • if you want the NSDL to harvest only SOME of your OAI server’s metadata, then use sets. • We will harvest only the sets you specify … but our default is to harvest all of them. • super easy setSpec strings: use only alpha-num characters
OAI 2.0 – ListRecords • Every metadata record served must (deeply) validate to its indicated XML schema • If used, resumptionTokens must be implemented properly • RT is an exclusive argument • Last response has an empty RT • Selective Harvesting works properly • “from” and “until” arguments do limit the results appropriately • “set” arguments do limit the results appropriately, if implemented
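From the harvester's side, the two resumptionToken rules (the token is an exclusive argument, and the last response carries an empty token) might look like the loop below. This is a sketch, not the NSDL harvester: `harvest` and the `fetch` callable are hypothetical, and `fetch` is assumed to take a dict of request parameters and return the response XML as a string.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"  # OAI-PMH 2.0 namespace


def harvest(fetch, metadata_prefix):
    """Yield the OAI identifier of every record, following resumptionTokens."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        root = ET.fromstring(fetch(params))
        for header in root.iter(OAI + "header"):
            yield header.findtext(OAI + "identifier")
        token = root.find(".//" + OAI + "resumptionToken")
        if token is None or not (token.text or "").strip():
            break  # no token, or an empty one: the list is complete
        # Exclusive argument: the follow-up request carries only the verb
        # and the token, never metadataPrefix, from, until, or set.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
```

A server that expects `metadataPrefix` to be repeated alongside the token, or that never sends the final empty token, would break a loop written this way.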
Common Points of Confusion - 1: about the metadata vs. about the resource • identifiers: OAI vs. DC • record/header/identifier vs. record/metadata/../dc:identifier • dates: OAI vs. DC • record/header/datestamp vs. record/metadata/../dc:date • OAI about containers are about the metadata • rights: OAI about vs. DC • record/about/../(dc:rights?) vs. record/metadata/../dc:rights
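The distinction is easier to see in an actual record. In the sketch below, the sample record, its identifiers, dates, and URL are all invented for illustration, and `describe` is a hypothetical helper; only the element names and namespaces come from OAI-PMH and Dublin Core.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Invented sample record: the header describes the metadata record,
# the dc:* elements describe the resource itself.
RECORD = """<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:example.org:rec-42</identifier>
    <datestamp>2004-03-01</datestamp>
  </header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:identifier>http://example.org/lessons/volcanoes.html</dc:identifier>
      <dc:date>1998-07-15</dc:date>
    </oai_dc:dc>
  </metadata>
</record>"""


def describe(record_xml):
    """Separate what is said about the metadata from what is said about the resource."""
    rec = ET.fromstring(record_xml)
    return {
        # about the metadata record
        "metadata_record_id": rec.findtext("oai:header/oai:identifier", namespaces=NS),
        "metadata_last_changed": rec.findtext("oai:header/oai:datestamp", namespaces=NS),
        # about the resource
        "resource_id": rec.findtext(".//dc:identifier", namespaces=NS),
        "resource_date": rec.findtext(".//dc:date", namespaces=NS),
    }
```

Here the header datestamp (2004) says when the metadata record last changed, while dc:date (1998) is a date associated with the resource; the two need not agree.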
OAI identifiers • Must uniquely identify individual metadata records at your site for OAI harvest and OAI reharvest • Must stay the same for your metadata records • metadata is updated; OAI identifier unchanged
Common Points of Confusion - 2 • Dates • format confusion • OAI dates must be encoded as ISO8601 and must be in UTC (≈ GMT) • OAI-PMH allows YYYY-MM-DD and YYYY-MM-DDThh:mm:ssZ. • DC date encoding – “Recommended best practice for encoding the date value is defined in a profile of ISO 8601 [W3CDTF] and follows the YYYY-MM-DD format.” • <responseDate> (All OAI-PMH responses) • Time when OAI server responds to a request • OAI-PMH says: ‘must be the time and date of the response in UTC. This is encoded using the "Complete date plus hours, minutes, and seconds" variant of ISO8601. This format is YYYY-MM-DDThh:mm:ssZ.’ • <datestamp> (OAI-PMH <record>/<header>) • “from” and “until” arguments in OAI requests • <dc:date>
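A small sketch of the two OAI date rules: producing a `<responseDate>` in UTC, and checking that a datestamp uses one of the two allowed forms. The function names are hypothetical, and the check is deliberately strict about the two patterns named on the slide.

```python
import re
from datetime import datetime, timezone

# The two granularities the slide lists for OAI-PMH dates.
_OAI_DATE = re.compile(r"\d{4}-\d{2}-\d{2}(T\d{2}:\d{2}:\d{2}Z)?")


def format_response_date(moment):
    """Format a timezone-aware datetime as YYYY-MM-DDThh:mm:ssZ in UTC."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def is_oai_datestamp(text):
    """True if text is YYYY-MM-DD or YYYY-MM-DDThh:mm:ssZ and a real date."""
    if not _OAI_DATE.fullmatch(text):
        return False
    fmt = "%Y-%m-%dT%H:%M:%SZ" if "T" in text else "%Y-%m-%d"
    try:
        datetime.strptime(text, fmt)  # rejects impossible dates such as Feb 30
        return True
    except ValueError:
        return False
```

A local time with an offset, such as 2004-03-01T07:30:00-05:00, is legal W3CDTF for `<dc:date>` but fails this check, which is exactly the confusion the slide warns about.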
When a Collection Deletes Records • if not indicated in OAI server • incremental harvest for MR never shows update; MR copy never deleted! • if indicated in OAI server transiently • reharvested soon enough – • not reharvested soon enough – incremental harvest for MR never shows update; MR copy never deleted! • if indicated in OAI server persistently • MR finds delete on incremental harvest –
Deleted Records – Our Solution “Full reharvest” • Mark all the site’s records in MR “deleted” • Harvest all metadata records for the collection • As we ingest each newly retrieved record into the MR, if we over-write an old record, “un-delete” it. • Expensive • network bandwidth • processing time • Okay for small collections (under ~15,000) • Okay for metadata that changes infrequently
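The full-reharvest steps above can be sketched as a mark-and-sweep over the repository. This is only an in-memory illustration under stated assumptions: the real MR uses database tables, and the `full_reharvest` function and the dict layout (OAI identifier mapped to a `metadata` value and a `deleted` flag) are hypothetical.

```python
def full_reharvest(repository, harvested):
    """Refresh one collection's records and flag the ones that vanished.

    repository: dict mapping OAI identifier -> {"metadata": ..., "deleted": bool}
    harvested:  iterable of (oai_identifier, metadata) from a complete harvest
    """
    # Step 1: mark all of the site's records as deleted.
    for entry in repository.values():
        entry["deleted"] = True
    # Steps 2-3: ingest every harvested record; writing a record
    # (new or over-written) leaves it "un-deleted".
    for oai_id, metadata in harvested:
        repository[oai_id] = {"metadata": metadata, "deleted": False}
    return repository
```

Records that the collection no longer serves are never touched in the second loop, so they stay marked deleted. The cost is one pass over every record on both sides, which matches the slide's note that this is acceptable only for small or slowly changing collections. The approach also depends on OAI identifiers staying the same across harvests.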
In an ideal world, we’d like • nsdl_dc • Information about nsdl_dc, example records and its XML schemas is in the NSDL Metadata Primer. • Persistent deleted records • OAI identifier syntax, per OAI Implementation Guidelines