Introduction to the Open Archives Initiative Protocol for Metadata Harvesting
Timothy W. Cole (t-cole3@uiuc.edu), Mathematics Librarian
William H. Mischo (w-mischo@uiuc.edu), Engineering Librarian
Thomas G. Habing (thabing@uiuc.edu), Research Programmer
Grainger Engineering Library Information Center, University of Illinois at Urbana-Champaign
Presented 27 May 2003 in conjunction with JCDL 2003, Houston, TX
http://dli.grainger.uiuc.edu/Publications/TWCole/JCDL-OAI
Today’s Agenda (Part 1) • Overview of OAI (Mischo) • What it is, where it comes from, what it’s used for • Relation to HTTP, XML, Dublin Core, & Z39.50 • Basic Concepts & Definitions (Cole) • OAI verbs • OAI transactions • Protocol details & architecture options • Illustrations • Implementation Guidelines for Repositories (Cole) • Tools & program layout options • Metadata generation / mapping • Optional protocol elements • Error handling & deleted records JCDL 2003 OAI Intro / t-cole3@uiuc.edu
Today’s Agenda (Part 2) • Tools, testing, & problems (Cole) • XML & OAI validation tools • Common problems • Implementation Guidelines for Harvesters (Mischo) • How to harvest • Harvesting policies & strategies • Harvester Technologies • Advanced topics (Cole) • Communities • OAI Static Repository • OAI & SOAP • Where do you go from here?
OAI as a tool • All about moving metadata around • Designed to be a building block, usable by many different communities • Can facilitate (in some cases enable) services & functions • Assumes widely distributed content, but centralized indexing (!) & services • Build once, use for many applications • Focus of OAI is interoperability
Metadata vs. Information Resources • Resource refers to information objects or digital representations of information objects • Metadata item is a collection of properties about a resource (e.g. title, author, etc.) • Metadata record is a metadata item expressed in a specific syntax according to an XSD • OAI focuses on metadata, with the implicit understanding that metadata contains useful links to the source information object(s)
OAI Antecedents • Call to other E-Print archives (July 1999) Paul Ginsparg, Rick Luce, & Herbert Van de Sompel: “…mobilize core group to work towards achieving a universal service for author self-archived scholarly literature.” • Santa Fe Mtgs. (Oct. 1999 & June 2000) • OAI-PMH version history: • First Alpha Release, Sept. 2000 • 1.0 (Beta) Release January 2001 • 1.1 (Beta 2) Release July 2001 • 2.0 (Production) Release June 2002
Original OAI Organization • OAI Executive: • Carl Lagoze & Herbert Van de Sompel • OAI Steering Committee: • Co-Chairs: Dan Greenstein, Cliff Lynch • OAI Technical Committee • Funded by NSF, DLF & CNI • Seeks to be user community driven • Adopters (selective list): • NSDL, NDLTD, Open Archives Forum (EU), JISC/DNER (UK) • E-Prints.Org, DLXS, DSpace, ContentDM, ENCompass
OAI Protocol for Metadata Harvesting • Harvesting approach to interoperability at metadata level • Divides world into Metadata Providers & Service Providers • Builds on HTTP, XML, & Dublin Core • http://www.openarchives.org/
Harvesting/Federation vs. Broadcast • Competing approaches to interoperability • Distributed/Broadcast searching: search and discovery run against remote services and data at query time • Harvesting: data/metadata is transferred ahead of time from the remote source to the destination where the services are located (e.g. Union catalogs) • OAI designed to make it easy for providers • Low barrier design • OAI focuses on harvesting
Data and Service Providers • Data Providers (Repositories) are entities that possess resources & metadata and are willing to share metadata with others via well-defined OAI protocols • Service Providers (Harvesters) are entities that harvest metadata from Data Providers in order to supply higher-level services to users (e.g. search & discovery) • OAI uses these designations for its client/server model (data=server, service=client)
Reliance on HTTP & XML • OAI-PMH is a REpresentational State Transfer (REST) protocol (unlike RPC, SOAP) • OAI requests and responses are sent via the HTTP protocol • OAI Requests are encoded as HTTP GET or POST operations • OAI Responses are valid XML documents
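A minimal sketch of how a harvester might encode an OAI request as an HTTP GET URL. The helper name `build_oai_request` is hypothetical, and the base URL is the placeholder used on the sample-URL slides, not a live repository.

```python
from urllib.parse import urlencode


def build_oai_request(base_url, verb, **arguments):
    """Return the HTTP GET URL for a single OAI request."""
    # Put the verb first, then the other arguments in sorted order
    # so the same request always yields the same URL.
    params = [("verb", verb)] + sorted(arguments.items())
    return base_url + "?" + urlencode(params)


url = build_oai_request(
    "http://www.anarchive.org/cgi-bin/OAI",  # placeholder base URL
    "GetRecord",
    identifier="oai:test:123",
    metadataPrefix="oai_dc",
)
print(url)
```

`urlencode` percent-encodes reserved characters such as the colons in the identifier, so the argument values survive transport unchanged.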
XML Namespaces and Schema • Consistency and data “quality” are supported by validating all responses against XML Schema Definitions (XSD) • XML Namespaces are used where necessary to clearly define which parts of the responses are actual metadata and which support the Metadata Harvesting Protocol
OAI-PMH Use of Dublin Core • DC is OAI’s lowest common denominator • OAI supports & encourages use of other, community-driven metadata schemas • Typically, metadata provider stores metadata in ‘best’ schema as dictated by material & resources • Crosswalk (semantic mapping) to simpler schemas • Semantic mapping at metadata delivery (rather than at time of search) • As with Z39.50, can’t search for what’s not there
As Compared to Z39.50
What OAI Is Not • Not search • Not database • Not metadata • Not OAIS
What OAI is good for • Where content is widely distributed, in different kinds of non-Z39.50 enabled locations • Metadata provider more lightweight than Z39.50 • Metadata provider scales well • Service provider scales according to search capability • Metadata is sufficient for services desired • Normalization, deduplication, augmentation desired • Not mutually exclusive • Portals can use both Z39.50 & OAI
The NSDL metadata repository • The metadata repository is a resource for service providers. It holds information about every collection and item known to the NSDL. • [Diagram labels: Users, Services, Metadata repository, Collections] • From “The NSDL Metadata Strategy,” a presentation by William Y. Arms and Diane I. Hillmann. Available: http://nsdl.comm.nsdlib.org/allprojects01/metastrategy.ppt
NSDL Metadata strategy • Support eight standard formats • Collect all existing metadata in these formats • Provide crosswalks to Dublin Core • Expose records in the metadata repository for service providers to harvest • Concentrate human effort on collection-level metadata • Use automatic generation to augment item-level metadata • From “The NSDL Metadata Strategy,” a presentation by William Y. Arms and Diane I. Hillmann. Available: http://nsdl.comm.nsdlib.org/allprojects01/metastrategy.ppt
IMLS Digital Collections & Content • Build a registry of all National Leadership Grant collections with digital content. • Assist and guide NLG projects in making item-level metadata sharable using OAI. • Build a repository and search & discovery tools for integrated access to the content of NLG collections (unique metadata schema?). • Research best practices for sharing metadata about diverse digital content and for supporting the interests of diverse user communities.
http://imlsdcc.grainger.uiuc.edu/
Open Language Archives Community • Supports the “OLAC Protocol for Metadata Harvesting” – based on OAI • Includes metadata extensions to DC • Supports Qualified DC refinements and encodings and unique OLAC attribute “code” to hold restricted element values • Also supports “OLAC Static Repository Gateway” – based on OAI Static Repository (still alpha) • Developing an “OLAC Repository Editor” for creating a metadata provider
Basic Concepts & Definitions • OAI verbs • OAI transactions • Protocol Details • Architecture Options • Illustrations
How OAI Works • OAI “verbs”: Identify, ListMetadataFormats, ListSets, ListIdentifiers, ListRecords, GetRecord • [Diagram: the Service Provider (harvester) sends an OAI HTTP request (OAI verb) to the Metadata Provider (repository), which returns an OAI HTTP response (valid XML)]
Identify • Purpose • Return general information about the archive and its policies (e.g., datestamp granularity) • Parameters • None • Sample URL • http://www.anarchive.org/cgi-bin/OAI?verb=Identify
ListSets • Purpose • Provide a listing of sets in which records may be organized (may be hierarchical, overlapping, or flat) • Parameters • None • Sample URL • http://www.anarchive.org/cgi-bin/OAI?verb=ListSets
ListMetadataFormats • Purpose • List metadata formats supported by the archive as well as their schema locations and namespaces • Parameters • identifier – for a specific record (O) • Sample URL • http://www.anarchive.org/cgi-bin/OAI?verb=ListMetadataFormats
ListIdentifiers • Purpose • List headers for all items corresponding to the specified parameters • Parameters • from – start date (O) • until – end date (O) • set – set to harvest from (O) • metadataPrefix – metadata format to list identifiers for (R) • resumptionToken – flow control mechanism (X) • Sample URL • http://www.anarchive.org/cgi-bin/OAI?verb=ListIdentifiers&metadataPrefix=oai_dc
GetRecord • Purpose • Returns the metadata for a single item in the form of an OAI record • Parameters • identifier – unique id for item (R) • metadataPrefix – metadata format for the record (R) • Sample URL • http://www.anarchive.org/cgi-bin/OAI?verb=GetRecord&identifier=oai:test:123&metadataPrefix=oai_dc
ListRecords • Purpose • Retrieves metadata records for multiple items • Parameters • from – start date (O) • until – end date (O) • set – set to harvest from (O) • resumptionToken – flow control mechanism (X) • metadataPrefix – metadata format (R) • Sample URL • http://www.anarchive.org/cgi-bin/OAI?verb=ListRecords&metadataPrefix=oai_dc&from=2001-01-01
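The required (R), optional (O), and exclusive (X) markings on the verb slides can be turned into a small argument checker. This is only a sketch: the table below copies the slides and is not a complete statement of the protocol, and `check_arguments` is a hypothetical helper name. The error codes `badVerb` and `badArgument` are OAI-PMH error codes.

```python
# (required, optional) arguments for each verb, as listed on the slides.
VERB_ARGUMENTS = {
    "Identify": (set(), set()),
    "ListMetadataFormats": (set(), {"identifier"}),
    "ListSets": (set(), set()),
    "GetRecord": ({"identifier", "metadataPrefix"}, set()),
    "ListIdentifiers": ({"metadataPrefix"}, {"from", "until", "set"}),
    "ListRecords": ({"metadataPrefix"}, {"from", "until", "set"}),
}
# Verbs whose slides list resumptionToken as an exclusive (X) argument.
TOKEN_VERBS = {"ListIdentifiers", "ListRecords"}


def check_arguments(verb, args):
    """Return None if args are acceptable for verb, else an OAI error code."""
    if verb not in VERB_ARGUMENTS:
        return "badVerb"
    names = set(args)
    if "resumptionToken" in names:
        # Exclusive: it must be the only argument besides the verb.
        if verb in TOKEN_VERBS and names == {"resumptionToken"}:
            return None
        return "badArgument"
    required, optional = VERB_ARGUMENTS[verb]
    if not required <= names or not names <= required | optional:
        return "badArgument"
    return None
```

For example, `check_arguments("ListRecord", {"metadataPrefix": "oai_dc"})` returns `"badVerb"`, because the verb name is misspelled.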
Protocol Details • OAI Transaction == An OAI request (HTTP) & corresponding OAI response (XML) • Optional: use resumptionToken & other flow control mechanisms to manage service load • Item Identifiers – Persistence & Uniqueness • Item Datestamps – Date of last metadata change; supports selective harvesting
Examples of OAI Requests • http://www.language-archives.org/cgi-bin/olaca3.pl?verb=Identify • http://publications.uu.se/portal/OAI?verb=ListSets • http://www.language-archives.org/cgi-bin/olaca3.pl?verb=ListMetadataFormats • http://www.language-archives.org/cgi-bin/olaca3.pl?verb=ListIdentifiers&metadataPrefix=oai_dc&from=2002-12-01 • http://www.language-archives.org/cgi-bin/olaca3.pl?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai%3Aacl.sr.language-archives.org%3AA00-1006
An OAI Response
<?xml version="1.0" encoding="UTF-8" ?>
<OAI-PMH xmlns=… xmlns:xsi=… xsi:schemaLocation=…>
  <responseDate>2002-05-01T19:20:30Z</responseDate>
  <request verb="GetRecord" identifier="oai:arXiv:hep-th/9901001" metadataPrefix="oai_dc">http://an.oa.org/OAI-script</request>
  <GetRecord>
    <record> ... </record>
  </GetRecord>
</OAI-PMH>
An OAI Record
<header>
  <identifier>oai:arXiv:cs/0112017</identifier>
  <datestamp>2002-02-28</datestamp>
  <setSpec>cs</setSpec>
</header>
<metadata>
  <oai_dc:dc xmlns…>
    <dc:title>Using Structural Metadata…</dc:title>
    …
  </oai_dc:dc>
</metadata>
<about>
  <provenance xmlns…> …. </provenance>
</about>
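A harvester might pull the header fields and Dublin Core titles out of such a record along the following lines. This is a sketch, not a full parser: `summarize_record` is a hypothetical name, the sample title is a placeholder, and the namespace URIs are the standard OAI-PMH 2.0, oai_dc, and Dublin Core ones, supplied here because the slide elides them.

```python
import xml.etree.ElementTree as ET

# Namespace prefixes used in the lookups below (standard URIs; the slide elides them).
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

SAMPLE = """<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:arXiv:cs/0112017</identifier>
    <datestamp>2002-02-28</datestamp>
    <setSpec>cs</setSpec>
  </header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>Placeholder title</dc:title>
    </oai_dc:dc>
  </metadata>
</record>"""


def summarize_record(xml_text):
    """Extract the header fields and any dc:title values from one record."""
    record = ET.fromstring(xml_text)
    header = record.find("oai:header", NS)
    return {
        "identifier": header.findtext("oai:identifier", namespaces=NS),
        "datestamp": header.findtext("oai:datestamp", namespaces=NS),
        "sets": [s.text for s in header.findall("oai:setSpec", NS)],
        "titles": [t.text for t in record.iterfind(".//dc:title", NS)],
    }


summary = summarize_record(SAMPLE)
```

Looking elements up by namespace URI rather than by literal prefix keeps the code working when a repository chooses different prefixes.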
Unique Identifiers • Each item must have a unique identifier • Identifiers must follow rules for valid URIs • Example: • oai:<archiveId>:<recordId> • oai:etd.vt.edu:etd-1234567890 • Each identifier must resolve to a single item and always to the same item • Can’t reuse OAI item identifiers
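The `oai:<archiveId>:<recordId>` pattern above can be checked with a simple regular expression. The pattern below is a deliberately simplified assumption (letters, digits, dots, and hyphens in the archive id; any non-space characters in the record id), not the official oai-identifier schema, and `split_oai_identifier` is a hypothetical helper.

```python
import re

# Simplified pattern: "oai", an archive id, then a non-empty record id.
OAI_ID = re.compile(r"^oai:([A-Za-z0-9][A-Za-z0-9.-]*):(\S+)$")


def split_oai_identifier(identifier):
    """Return (archive_id, record_id) or raise ValueError if malformed."""
    match = OAI_ID.match(identifier)
    if match is None:
        raise ValueError("not an oai: identifier: %r" % identifier)
    return match.group(1), match.group(2)


archive_id, record_id = split_oai_identifier("oai:etd.vt.edu:etd-1234567890")
```

A syntax check like this says nothing about persistence or reuse; those remain policy commitments of the repository.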
Datestamps • Needed for every OAI record to support incremental harvesting • Must be updated whenever a record is added, modified, or deleted, so that changes are correctly propagated to harvesters • Different from dates within the metadata – OAI datestamp is used only for harvesting • Can be either YYYY-MM-DD or YYYY-MM-DDThh:mm:ssZ (must be in UTC, i.e. GMT)
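The two datestamp granularities can be produced and compared with the standard library, as in this sketch. The function names are hypothetical, and the comparison assumes the `from` bound is inclusive.

```python
from datetime import datetime, timedelta, timezone

DAY = "%Y-%m-%d"                 # YYYY-MM-DD
SECONDS = "%Y-%m-%dT%H:%M:%SZ"   # YYYY-MM-DDThh:mm:ssZ


def format_datestamp(moment, granularity=DAY):
    """Render a timezone-aware datetime as an OAI datestamp in UTC."""
    return moment.astimezone(timezone.utc).strftime(granularity)


def parse_datestamp(text):
    """Parse either granularity into a UTC datetime."""
    fmt = SECONDS if "T" in text else DAY
    return datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)


def changed_since(record_stamp, from_stamp):
    """True if the record falls inside a harvest starting at from_stamp."""
    return parse_datestamp(record_stamp) >= parse_datestamp(from_stamp)


# A change made at 20:00 in a UTC-6 zone falls on the next day in UTC.
local = datetime(2002, 2, 28, 20, 0, 0, tzinfo=timezone(timedelta(hours=-6)))
stamp = format_datestamp(local, SECONDS)
```

Converting to UTC before formatting matters for incremental harvesting: a datestamp written in local time could place a change on the wrong side of a harvester's `from` date.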
OAI Provider Architectures • [Diagram components: HTML <meta>, XML, and DBMS sources of descriptive metadata; DBMS for OAI administrative metadata; OAI application (CGI, ASP, PHP, etc.); Webserver – HTTP; OAI harvesters]
Architecture Options • Metadata items in database • If individual metadata items are stored in a database, usually requires programmatic mapping to DC • Metadata items as XML files • If individual metadata items already in XML, can do without the database component, or can use database to cache and/or hold OAI administrative metadata • May use XSLT stylesheets to extract / map metadata • Metadata elements in HTML files • As with XML file system options • “Static” repository option (more later)
Technology Options • WWW Server (e.g., Apache, MS IIS) • Protocol may be implemented in many forms • CGI Script (Perl, C++, Java) • Java Servlet • PHP • Metadata (e.g. database) access mechanism required • See www.openarchives.org for list of publicly available software templates • See www.SourceForge.Net for UIUC OAI tools
Illustrations • Identify • ListSets • ListMetadataFormats • ListIdentifiers • GetRecord oai_dc • GetRecord olac • ListRecords • Error
*** 15 Minute Break ***
Implementation Guidelines for Repositories • Tools Required • Basic program layout (incl. object-oriented approaches) • Optional container elements • Metadata generation / mapping, data cleaning • Sets • resumptionToken, flow control, load-balancing • Denial-of-service prevention • Error handling • Deleted metadata records
Typical Pre-Requisites • Metadata & Web server • Code templates if available (available for many languages) • Basic Web programming environment • XML parsers (for non-trivial encoding) • Database access libraries/drivers (e.g. ODBC, JDBC)
Basic program layout
parse WWW request to extract parameters
if (verb='Identify') Validate arguments; ProcessIdentify;
else if (verb='ListMetadataFormats') Validate arguments; ProcessListMetadataFormats;
else if (verb='ListSets') Validate arguments; ProcessListSets;
else if (verb='GetRecord') Validate arguments; ProcessGetRecord;
else if (verb='ListIdentifiers') Validate arguments; ProcessListIdentifiers;
else if (verb='ListRecords') Validate arguments; ProcessListRecords;
else ReportError ('badVerb');
Re-usable subroutines to extract / clean up / transform metadata, generate standard error messages, etc.
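The same layout can be sketched in Python with a dispatch table instead of an if/else chain. Everything here is illustrative: the handlers are placeholders standing in for ProcessIdentify, ProcessListSets, and so on, and the error output is abbreviated rather than a complete OAI-PMH response.

```python
from urllib.parse import parse_qs

VERBS = ("Identify", "ListMetadataFormats", "ListSets",
         "GetRecord", "ListIdentifiers", "ListRecords")


def make_handler(verb):
    # Placeholder for the real Process<Verb> routine.
    def handler(arguments):
        return "<%s/>" % verb
    return handler


HANDLERS = {verb: make_handler(verb) for verb in VERBS}


def report_error(code, message):
    """Abbreviated error element; a real response wraps this in <OAI-PMH>."""
    return '<error code="%s">%s</error>' % (code, message)


def handle_request(query_string):
    """Parse the request, pick the verb handler, or report an error."""
    parsed = parse_qs(query_string, keep_blank_values=True)
    verbs = parsed.pop("verb", [])
    # A missing, repeated, or unknown verb is a badVerb error.
    if len(verbs) != 1 or verbs[0] not in HANDLERS:
        return report_error("badVerb", "illegal, missing, or repeated verb")
    # Any other repeated argument is a badArgument error.
    if any(len(values) > 1 for values in parsed.values()):
        return report_error("badArgument", "repeated argument")
    arguments = {name: values[0] for name, values in parsed.items()}
    return HANDLERS[verbs[0]](arguments)
```

A table keeps verb lookup in one place, so argument validation can be added per verb without lengthening the control flow.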
Object-Oriented Approaches • Cleaner separation of protocol, database access and metadata generation • Example approaches • Each service request is handled by an object • Simpler incremental development • Protocol, Database and Metadata are objects • Greater portability of code • Inheritance from a basic OAI data provider
Provider Performance Issues • Database design impacts performance • Work required to map to DC • Use of resumptionTokens is one way to improve performance • Fetch only records needed to satisfy current request • Queries only retrieve needed records • resumptionTokens should retain state information for best performance and for idempotency
Optional Container Elements • <Identify><description> • Additional information about repository • oai-identifier, eprints, friends, branding, other… • <ListSets><setDescription> • Additional information describing a set • <metadata> • Other metadata besides Dublin Core • rfc1807, marc21, oai_marc, mods, other… • <about> • Meta-metadata, e.g. record-level rights
Metadata Generation / Mapping • Approaches • Map from source to each metadata format • Use multiple crosswalks (may use XSLT) to transform to multiple metadata formats
source (e.g., DB)   dc        rfc1807
name                title     title
author              creator   author
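The name/author mapping on this slide can be expressed as one lookup table per target format. This is a sketch of the "map from source to each format" approach; the source field names come from the slide, while `crosswalk` and the sample row are hypothetical.

```python
# One crosswalk per target format: source field -> target element.
CROSSWALKS = {
    "dc": {"name": "title", "author": "creator"},
    "rfc1807": {"name": "title", "author": "author"},
}


def crosswalk(source_row, target_format):
    """Map a source row to (element, value) pairs in the target format."""
    mapping = CROSSWALKS[target_format]
    return [
        (mapping[field], value)
        for field, value in source_row.items()
        if field in mapping and value  # skip unmapped or empty fields
    ]


row = {"name": "A Sample Title", "author": "Doe, J.", "internal_id": 7}
dc_pairs = crosswalk(row, "dc")
rfc_pairs = crosswalk(row, "rfc1807")
```

Fields with no entry in a crosswalk, such as the internal id here, are dropped, which is the usual consequence of mapping a richer source into a simpler schema.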
Data Cleaning • Escape special XML characters (<, >, &, ") • Convert to UTF-8 encoding of Unicode • Convert entity references (e.g., &copy;) • Remove extraneous whitespace • URLs • /?#=&:;+ must be encoded as escape sequences
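These cleaning steps map onto standard-library calls, as in the sketch below. It assumes the input has already been decoded to a Python string; `clean_text` and `encode_url_value` are hypothetical helper names.

```python
import html
import re
from urllib.parse import quote
from xml.sax.saxutils import escape


def clean_text(raw):
    """Prepare a metadata value for inclusion in an XML response."""
    text = html.unescape(raw)                 # &copy; becomes the (c) sign, &amp; becomes &
    text = re.sub(r"\s+", " ", text).strip()  # collapse extraneous whitespace
    return escape(text, {'"': "&quot;"})      # escape &, <, > and double quotes


def encode_url_value(value):
    """Percent-encode a value so reserved URL characters are escaped."""
    return quote(value, safe="")


cleaned = clean_text("  Smith &amp; Jones   &copy; 2003 <draft> ")
xml_bytes = cleaned.encode("utf-8")  # serialize the response as UTF-8
encoded = encode_url_value("oai:test:123/a?b")
```

Entity references are resolved before escaping so that an `&amp;` in the source is not escaped a second time.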
Sets – another option for selective harvesting • Optional: no well-defined semantics – depends completely on local data providers • Must provide setSpec & setName, may provide setDescription, for each Set in repository • Sets may be hierarchical (use “:”); may overlap • Allows for harvesting of sub-collections • May be pre-defined by arrangement between data providers and service providers • E.g. Subject areas, years, author names (but must be pre-defined – for ListSets) • Not a substitute for searching!
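Hierarchical setSpecs can be matched with a prefix test on the “:” separator. The sketch below assumes that membership in a child set implies membership in its ancestors; `in_set` is a hypothetical helper and the setSpec values are made up for illustration.

```python
def in_set(record_set_specs, requested):
    """True if a record is in the requested set or one of its descendants."""
    return any(
        spec == requested or spec.startswith(requested + ":")
        for spec in record_set_specs
    )


# A record may carry several setSpecs, since sets can overlap.
record_sets = ["cs:ai", "year:2002"]
matches_cs = in_set(record_sets, "cs")
matches_math = in_set(record_sets, "math")
```

Appending the separator before the prefix test prevents a set such as `csx` from being treated as a child of `cs`.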
resumptionToken, flow control, load-balancing • Incomplete response: resumptionToken can be used to return partial results – the client is issued a token which may be presented to the server to receive more results • resumptionToken embeds state information, allowing OAI to be stateless even for incomplete response model • HTTP 503 “Retry-After” mechanism can be used to support server-side delaying of a client’s request • HTTP 302 / 303 can be used for load balancing • HTTP 4xx can be used to deny a harvester
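One way a repository could embed state in a resumptionToken so that the server stays stateless is sketched below. The token layout (base64-encoded JSON holding an offset) is purely an assumption made for illustration; the protocol leaves the token format to the repository, and all function names are hypothetical.

```python
import base64
import json


def encode_token(state):
    """Pack the paging state into an opaque, URL-safe token."""
    payload = json.dumps(state, sort_keys=True).encode("utf-8")
    return base64.urlsafe_b64encode(payload).decode("ascii")


def decode_token(token):
    """Return the state dict, or None if the token is not one of ours."""
    try:
        state = json.loads(base64.urlsafe_b64decode(token.encode("ascii")))
    except ValueError:  # covers bad base64, bad UTF-8, and bad JSON
        return None
    if isinstance(state, dict) and isinstance(state.get("offset"), int):
        return state
    return None


def list_page(identifiers, page_size, token=None):
    """Return one page of results plus the token for the next page, if any."""
    state = {"offset": 0} if token is None else decode_token(token)
    if state is None:
        raise ValueError("badResumptionToken")
    start = state["offset"]
    page = identifiers[start:start + page_size]
    more = start + page_size < len(identifiers)
    next_token = encode_token({"offset": start + page_size}) if more else None
    return page, next_token


ids = ["oai:test:%d" % i for i in range(5)]
first_page, token = list_page(ids, 2)
```

A production token would usually also carry the original from, until, set, and metadataPrefix values, since a request that uses resumptionToken may not repeat them.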