
The NSDL, OAI and Your Metadata

Learn how to normalize metadata effectively to enhance services, improve quality, and ensure predictability for reharvesting in your repository. Understand the principles, methodology, and benefits of normalizing metadata for efficient data management.


Presentation Transcript


  1. The NSDL, OAI and Your Metadata Core Infrastructure Metadata Repository (“union catalog”) Naomi Dushay Cornell University

  2. other OAI Services OAI in the NSDL Infrastructure Your collection’s metadata Your collection’s OAI server NSDL Metadata Repository (MR) NSDL MR OAI server Your collection’s metadata, scrubbed & normalized NSDL Search Service NSDL Archive Service http://nsdl.org

  3. The Metadata Repository • Designed to be scalable • Based on automated harvest/expose model, with OAI at each end • A notion of “normalized” metadata with Qualified Dublin Core as its base

  4. Why do we normalize metadata? • Improve services (e.g. search results, or UI display) • Improve metadata quality, when possible • Enhance predictability of data for reharvesting services

  5. How do we normalize metadata? • Perform “safe” transforms to “smarten up” metadata • XSL stylesheets -- from your XML metadata to our normalized XML metadata • Principles: • Do no harm (Don’t lose information) • Add information, when possible • Indicate schemes for valid values • Remove meaningless text • “…”, “not available”, “-” • Empty elements • Correct wrong information • “text/pdf” → “application/pdf” • Remove characters that impede functionality or display • Encoding fixes (e.g. “&”, double XML encodings, bad UTF-8 …) • Scrub URLs
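
A few of the principles above can be sketched in code. The slides say the actual transforms are XSL stylesheets; the Python below is only an illustration, and the helper names and the placeholder list are assumptions rather than the NSDL rules.

```python
# Illustrative only: the real normalization is done with XSL stylesheets.
# The placeholder set and function names are assumptions for this sketch.
MEANINGLESS = {"", "...", "…", "-", "not available"}
MIME_FIXES = {"text/pdf": "application/pdf"}


def normalize_values(values):
    """Drop empty or placeholder values and trim whitespace; keep the rest."""
    cleaned = []
    for value in values:
        text = value.strip()
        if text.lower() in MEANINGLESS:
            continue  # "remove meaningless text" and empty elements
        cleaned.append(text)
    return cleaned


def normalize_format(value):
    """Correct a known-wrong MIME type; pass unknown values through unchanged."""
    text = value.strip()
    return MIME_FIXES.get(text.lower(), text)
```

Values that are not recognized as placeholders or known errors pass through untouched, which is one way to honor "do no harm".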

  6. Automated MR Ingest process • Your collection info and harvesting info are registered • OAI validation – can we run our harvester on your OAI server? (see handout) • OAI harvest of your metadata (nsdl_dc if available; oai_dc if not ...) • XML schema validation of all of your metadata • UTF-8 encoding validation, and make bad UTF-8 chars into harmless ones. • Normalized nsdl_dc created. • Your metadata, “raw” and normalized, is loaded into the MR tables and made available to the NSDL’s MR OAI server.
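
The UTF-8 step above could look something like the following. The slides do not say how bad characters are made harmless; substituting U+FFFD (the Unicode replacement character) is an assumption of this sketch, and `scrub_utf8` is a hypothetical name.

```python
def scrub_utf8(raw_bytes):
    """Decode bytes as UTF-8, replacing invalid sequences with U+FFFD.

    Assumption: "harmless" is taken to mean the Unicode replacement
    character; the MR may substitute something else.
    """
    return raw_bytes.decode("utf-8", errors="replace")
```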

  7. Automated MR ingest process Your collection’s OAI server Validation OAI Harvest NSDL Collection Registration “raw” or “native” metadata Validation Normalize normalized metadata NSDL MR OAI server Metadata Repository Notify collection of problems; May need to halt processing

  8. OAI-PMH: Key points • OAI-PMH requests are embedded in HTTP • it’s a web service, not a flat file • XML, not HTML • multiple metadata formats are allowed • OAI ≠ simple DC only! • Each metadata format MUST have a valid XML schema
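
Since OAI-PMH requests are ordinary HTTP requests, a request is just a base URL plus a `verb` and its arguments. A minimal sketch follows; the function name and the base URL in the usage note are placeholders, not a real server.

```python
from urllib.parse import urlencode


def oai_request_url(base_url, verb, arguments=None):
    """Build an OAI-PMH GET URL from a base URL, a verb, and optional arguments."""
    params = {"verb": verb}
    params.update(arguments or {})
    # urlencode applies URL encoding to each argument value.
    return f"{base_url}?{urlencode(params)}"
```

For example, `oai_request_url("http://example.org/oai", "ListRecords", {"metadataPrefix": "oai_dc", "from": "2004-01-01"})` gives `http://example.org/oai?verb=ListRecords&metadataPrefix=oai_dc&from=2004-01-01`. The response to such a request is XML, not HTML.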

  9. Metadata Formats and Schemas

  10. MR ingest requires: compliant OAI 2.0 server • Correctly implements OAI-PMH; queries to all verbs respond correctly. • Every OAI response must be (deeply) XML schema valid • Encodes properly in proper places • XML encoding • URL encoding • UTF-8 encoding

  11. OAI 2.0 – Identify • baseURL • email address • protocol version • description for OAI identifier syntax, especially if adhering to oai-identifier syntax described in Implementation Guidelines

  12. OAI 2.0 – ListMetadataFormats • correct XML namespace for each format • a valid XML schema for each format • targetNamespace MUST match XML namespace above • super easy out: use oai_dc • easy out: use nsdl_dc
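
The targetNamespace requirement above can be checked mechanically. This sketch only compares the schema's `targetNamespace` attribute with the namespace advertised for the format; it does not validate the schema itself, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def schema_target_namespace(xsd_text):
    """Return the targetNamespace attribute of an XML schema document, or None."""
    root = ET.fromstring(xsd_text)
    return root.get("targetNamespace")


def namespace_matches(metadata_namespace, xsd_text):
    """True if the schema's targetNamespace equals the advertised namespace."""
    return schema_target_namespace(xsd_text) == metadata_namespace
```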

  13. OAI 2.0 – ListSets • super easy out: if all your metadata is NSDL relevant, don’t use sets for our sake. • if you want the NSDL to harvest only SOME of your OAI server’s metadata, then use sets. • We will harvest only the sets you specify … but our default is to harvest all of them. • super easy setSpec strings: use only alpha-num characters

  14. OAI 2.0 – ListRecords • Every metadata record served must (deeply) validate to its indicated XML schema • If used, resumptionTokens must be implemented properly • RT is an exclusive argument • Last response has an empty RT • Selective Harvesting works properly • “from” and “until” arguments do limit the results appropriately • “set” arguments do limit the results appropriately, if implemented
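
The resumptionToken rules above are easiest to see from the harvester's side. The sketch below is not the NSDL harvester: `fetch` is a hypothetical callable that takes a dict of OAI arguments and returns the response XML as text, and OAI error responses are ignored for brevity.

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def harvest_identifiers(fetch, metadata_prefix):
    """Yield record identifiers from ListRecords, following resumptionTokens.

    `fetch` is an assumed callable: dict of OAI arguments -> response XML text.
    """
    arguments = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        root = ET.fromstring(fetch(arguments))
        for header in root.iter(OAI_NS + "header"):
            yield header.findtext(OAI_NS + "identifier")
        token = root.find(".//" + OAI_NS + "resumptionToken")
        # No token, or an empty one, marks the last response.
        if token is None or not (token.text or "").strip():
            break
        # The token is an exclusive argument: send it with the verb and nothing else.
        arguments = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
```

A server that repeats `metadataPrefix`, `from`, `until`, or `set` requirements alongside the token, or that never returns an empty token, would break a loop like this.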

  15. Common Points of Confusion - 1 about the metadata vs. about the resource • identifiers: OAI vs. DC • record/header/identifier vs. record/metadata/../dc:identifier • dates: OAI vs. DC • record/header/datestamp vs. record/metadata/../dc:date • OAI about containers are about the metadata • rights: OAI about vs. DC • record/about/../(dc:rights?) vs. record/metadata/../dc:rights

  16. OAI identifiers • Must uniquely identify individual metadata records at your site for OAI harvest and OAI reharvest • Must stay the same for your metadata records • metadata is updated; OAI identifier unchanged
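
One way to keep identifiers unique and stable is to follow the oai-identifier syntax from the OAI Implementation Guidelines (`oai:` + a domain-style namespace + `:` + a local identifier). The pattern below is a simplified approximation for illustration, not the official schema pattern, and `is_oai_identifier` is a hypothetical helper.

```python
import re

# Simplified approximation of the oai-identifier syntax (an assumption,
# looser than the official pattern on the local-identifier part).
OAI_IDENTIFIER = re.compile(
    r"oai:[A-Za-z][A-Za-z0-9-]*(\.[A-Za-z][A-Za-z0-9-]*)+:\S+"
)


def is_oai_identifier(value):
    """True if value looks like oai:<domain-style namespace>:<local id>."""
    return OAI_IDENTIFIER.fullmatch(value) is not None
```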

  17. Common Points of Confusion - 2 • Dates • format confusion • OAI dates must be encoded as ISO 8601 and must be in UTC (≈ GMT) • OAI-PMH allows YYYY-MM-DD and YYYY-MM-DDThh:mm:ssZ. • DC date encoding – “Recommended best practice for encoding the date value is defined in a profile of ISO 8601 [W3CDTF] and follows the YYYY-MM-DD format.” • <responseDate> (All OAI-PMH responses) • Time when OAI server responds to a request • OAI-PMH says: ‘must be the time and date of the response in UTC. This is encoded using the "Complete date plus hours, minutes, and seconds" variant of ISO8601. This format is YYYY-MM-DDThh:mm:ssZ.’ • <datestamp> (OAI-PMH <record>/<header>) • “from” and “until” arguments in OAI requests • <dc:date>
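
The two OAI date forms above can be produced from a timezone-aware timestamp as follows. This is a minimal sketch with hypothetical function names; it assumes the input datetime carries a timezone, since a naive datetime would be interpreted as local time.

```python
from datetime import datetime, timedelta, timezone


def oai_utc_timestamp(moment):
    """Format an aware datetime as YYYY-MM-DDThh:mm:ssZ, converted to UTC."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def oai_utc_day(moment):
    """Format an aware datetime as YYYY-MM-DD, using the UTC calendar day."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%d")


# 8:30 pm at UTC-5 is already the next day in UTC.
local = datetime(2004, 3, 1, 20, 30, 0, tzinfo=timezone(timedelta(hours=-5)))
print(oai_utc_timestamp(local))  # 2004-03-02T01:30:00Z
print(oai_utc_day(local))        # 2004-03-02
```

The example shows why the UTC conversion matters for datestamps and for “from”/“until” arguments: a local date can differ from the UTC date by a day.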

  18. When a Collection Deletes Records • if not indicated in OAI server • incremental harvest for MR never shows update; MR copy never deleted! • if indicated in OAI server transiently • reharvested soon enough: MR finds delete • not reharvested soon enough: incremental harvest for MR never shows update; MR copy never deleted! • if indicated in OAI server persistently • MR finds delete on incremental harvest

  19. Deleted Records – Our Solution “Full reharvest” • Mark all the site’s records in MR “deleted” • Harvest all metadata records for the collection • As we ingest each newly retrieved record into the MR, if we over-write an old record, “un-delete” it. • Expensive • network bandwidth • processing time • Okay for small collections (under ~15,000) • Okay for metadata that changes infrequently
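
The "full reharvest" steps above amount to a mark-and-sweep over one collection's records. A minimal sketch, assuming the MR is modeled as a plain dict keyed by OAI identifier (the real MR uses database tables, and `full_reharvest` is a hypothetical name):

```python
def full_reharvest(repository, harvested_records):
    """Mark every stored record deleted, then un-delete whatever is re-harvested.

    repository: dict mapping OAI identifier -> {"metadata": ..., "deleted": bool}
    harvested_records: iterable of (identifier, metadata) from a complete harvest
    """
    # Step 1: mark all of the site's records as deleted.
    for entry in repository.values():
        entry["deleted"] = True
    # Steps 2 and 3: ingest each harvested record; overwriting un-deletes it.
    for identifier, metadata in harvested_records:
        repository[identifier] = {"metadata": metadata, "deleted": False}
    return repository
```

Anything the collection no longer serves stays marked deleted. The cost is that every record is fetched and processed again, which is why the slide limits this to small or slowly changing collections.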

  20. In an ideal world, we’d like • nsdl_dc • Information about nsdl_dc, example records and its XML schemas is in the NSDL Metadata Primer. • Persistent deleted records • OAI identifier syntax, per OAI Implementation Guidelines
