The OAI Protocol for Metadata Harvesting
Andy Powell
a.powell@ukoln.ac.uk
UKOLN, University of Bath
IVOA Registry Meeting, London
March 2003
Contents
• a brief history of OAI
• 10 technical things you should know about the OAI-PMH
OAI roots
• the roots of OAI lie in the development of eprint archives…
  • arXiv, CogPrints, NACA (NASA), RePEc, NDLTD, NCSTRL
• each offered a Web interface for deposit of articles and for end-user searches
• difficult for end-users to work across archives without having to learn multiple different interfaces
• recognised need for a single search interface to all archives
  • Universal Pre-print Service (UPS)
Searching vs. harvesting
• two possible approaches to building a single search interface to multiple eprint archives…
  • cross-searching multiple archives based on a protocol like Z39.50
  • harvesting metadata into one or more ‘central’ services – bulk move data to the user-interface
• US digital library experience in this area indicated that cross-searching is not the preferred approach
  • distributed searching of N nodes viable, but only for small values of N
Searching vs. harvesting
[diagram: search service …or… search service]
Harvesting requirements
• in order that the harvesting approach can work there need to be agreements about…
  • transport protocols – HTTP vs. FTP vs. …
  • metadata formats – DC vs. MARC vs. …
  • quality assurance – mandatory elements, mechanisms for naming of people, subjects, etc., handling duplicated records, best-practice
  • intellectual property and usage rights – who can do what with the records
• work in this area resulted in the “Santa Fe Convention”
Development of OAI-PMH
• 2 year metamorphosis through various names
  • Santa Fe Convention, OAI-PMH versions 1.0, 1.1…
  • OAI Protocol for Metadata Harvesting 2.0
• development steered by international technical committee
• inter-version stability helped developer confidence
• move from focus on eprints to more generic protocol
• move from OAI-specific metadata schema to mandatory support for DC
Bluffer’s guide to OAI
http://www.openarchives.org/
• OAI-PMH is a low-cost mechanism for harvesting metadata records
  • from ‘data providers’ to ‘service providers’
• allows ‘service provider’ to say ‘give me some or all of your metadata records’
  • where ‘some’ is based on date-stamps, sets, metadata formats
• not limited to repositories of eprints
  • images, museum artefacts, learning objects, …
• based on HTTP and XML
  • simple, Web-friendly, autonomous
  • fast, flexible deployment
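The ‘give me some or all of your metadata records’ request can be sketched as an ordinary HTTP GET URL. This is a minimal sketch, not a definitive harvester: the base URL `http://repository.example.org/oai` and the set name `physics` are hypothetical, and the function name is an assumption made for illustration. The argument names (`verb`, `metadataPrefix`, `from`, `until`, `set`) are those defined by OAI-PMH 2.0.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc",
                     from_date=None, until_date=None, set_spec=None):
    """Build a ListRecords request URL for selective harvesting.

    'Some' of the records is narrowed by date-stamp (from/until),
    by set, and by metadata format (metadataPrefix).
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        params["from"] = from_date
    if until_date is not None:
        params["until"] = until_date
    if set_spec is not None:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# Hypothetical repository and set name, for illustration only.
url = list_records_url("http://repository.example.org/oai",
                       from_date="2003-01-01", set_spec="physics")
print(url)
```

Leaving out `from`, `until` and `set` asks for all records in the given metadata format.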
Bluffer’s guide to OAI
• OAI-PMH is not a search protocol
  • but can underpin search-based services based on Z39.50 or SRW or SOAP or…
• OAI-PMH carries only metadata
  • content (e.g. full-text or image) made available separately – typically at URL in metadata
• mandates simple DC as record format
  • but extensible to any XML format – IMS, ONIX, MARC, METS, etc.
• extensible framework for metadata about
  • repository, resources, ‘items’, sets
  • can include rights metadata
Bluffer’s guide to OAI
• metadata and ‘content’ often made freely available – but not a requirement
  • OAI-PMH can be used between closed groups
  • or, can make metadata available but restrict access to content in some way
• underlying HTTP protocol provides
  • access control – e.g. HTTP BASIC
  • compression mechanisms (for improving performance of harvesters)
  • could, in theory, also provide encryption if required
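Because access control and compression come from HTTP rather than from OAI-PMH itself, a harvester only has to set standard HTTP headers. The following is a minimal sketch using the Python standard library; the function names, URL and credentials are hypothetical, and no network request is made here.

```python
import base64
import gzip
from urllib.request import Request


def build_request(url, username, password):
    """Prepare a request that uses HTTP Basic auth and asks for gzip."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return Request(url, headers={
        # Note: Basic credentials are only encoded, not encrypted.
        "Authorization": "Basic " + token.decode("ascii"),
        "Accept-Encoding": "gzip",
    })


def decode_body(body, content_encoding):
    """Undo gzip compression if the server applied it."""
    if content_encoding == "gzip":
        return gzip.decompress(body)
    return body


# Hypothetical repository URL and credentials, for illustration only.
req = build_request("http://repository.example.org/oai?verb=Identify",
                    "harvester", "secret")
```

In a real harvester the request would be passed to `urllib.request.urlopen`, and the response's `Content-Encoding` header would be handed to `decode_body` along with the raw bytes.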
Resources, items and records
[diagram: resource (‘David’); item = identifier, all available metadata about David; records = Dublin Core metadata, MARC metadata, SPECTRUM metadata]
Protocol requests
• six different request types
  • Identify
  • ListMetadataFormats
  • ListSets
  • ListIdentifiers
  • ListRecords
  • GetRecord
• harvester need not use all types
• repository must implement all types
• required and optional arguments
  • on request types
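The six request types and their arguments can be written down as a small lookup table. This is a sketch under stated assumptions: the required and optional argument names follow OAI-PMH 2.0, but the flow-control argument `resumptionToken` is deliberately left out to keep the example short, and `check_request` is a hypothetical helper, not part of any toolkit.

```python
# verb -> (required arguments, optional arguments), per OAI-PMH 2.0.
# resumptionToken (used to continue incomplete lists) is omitted here.
VERB_ARGUMENTS = {
    "Identify": (set(), set()),
    "ListMetadataFormats": (set(), {"identifier"}),
    "ListSets": (set(), set()),
    "ListIdentifiers": ({"metadataPrefix"}, {"from", "until", "set"}),
    "ListRecords": ({"metadataPrefix"}, {"from", "until", "set"}),
    "GetRecord": ({"identifier", "metadataPrefix"}, set()),
}


def check_request(verb, arguments):
    """Return a list of problems with a request (empty if it looks valid)."""
    if verb not in VERB_ARGUMENTS:
        return [f"unknown verb: {verb}"]
    required, optional = VERB_ARGUMENTS[verb]
    given = set(arguments)
    problems = [f"missing required argument: {name}"
                for name in sorted(required - given)]
    problems += [f"unexpected argument: {name}"
                 for name in sorted(given - required - optional)]
    return problems


# Hypothetical identifier, for illustration only.
print(check_request("GetRecord", {"identifier": "oai:example.org:1"}))
```

A repository would run a check like this on incoming requests; a harvester could use it to catch mistakes before sending anything.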
Record structure
• metadata about a resource in a particular XML format
• header (mandatory)
  • identifier (1)
  • datestamp (1)
  • setSpec elements (*)
  • status attribute for deleted item (?)
• metadata (mandatory)
  • XML encoded metadata within root tag which provides namespace and schema
  • repositories must support Dublin Core
• about (optional)
  • rights statements
  • provenance statements
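The header/metadata/about layout can be illustrated by parsing a single record with the Python standard library. This is a minimal sketch: the sample record, its identifier and its set name are invented for illustration, and `summarize_record` is a hypothetical helper. The namespace URIs are those of OAI-PMH 2.0, oai_dc and Dublin Core.

```python
import xml.etree.ElementTree as ET

NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Invented sample record, for illustration only.
SAMPLE_RECORD = """
<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:example.org:1</identifier>
    <datestamp>2003-03-01</datestamp>
    <setSpec>physics</setSpec>
  </header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>An example eprint</dc:title>
    </oai_dc:dc>
  </metadata>
</record>
"""


def summarize_record(xml_text):
    """Pull the header fields out of one OAI-PMH record."""
    record = ET.fromstring(xml_text)
    header = record.find("oai:header", NS)
    return {
        "identifier": header.findtext("oai:identifier", namespaces=NS),
        "datestamp": header.findtext("oai:datestamp", namespaces=NS),
        "sets": [e.text for e in header.findall("oai:setSpec", NS)],
        # The optional status attribute marks a deleted item.
        "deleted": header.get("status") == "deleted",
        "has_metadata": record.find("oai:metadata", NS) is not None,
        "has_about": record.find("oai:about", NS) is not None,
    }


summary = summarize_record(SAMPLE_RECORD)
print(summary)
```

The sketch checks for `metadata` rather than assuming it, since a record flagged as deleted may arrive with a header only.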
Dublin Core
http://dublincore.org/
• OAI-PMH mandates use of simple DC as lowest common denominator
• agreed XML schema – ‘oai_dc’
• simple DC – 15 metadata properties
• all DC properties optional and repeatable
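A simple DC description in the ‘oai_dc’ container can be sketched as follows. The 15 property names and the namespace URIs are those of simple Dublin Core and oai_dc; the helper `simple_dc`, the sample values, and the omission of the `xsi:schemaLocation` attribute that a real repository response would carry are simplifications made for illustration.

```python
import xml.etree.ElementTree as ET

OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC = "http://purl.org/dc/elements/1.1/"

# The 15 properties of simple Dublin Core.
DC_PROPERTIES = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}

ET.register_namespace("oai_dc", OAI_DC)
ET.register_namespace("dc", DC)


def simple_dc(properties):
    """Build an oai_dc:dc element from {property name: [values]}.

    Every property is optional and repeatable, so each name maps
    to a (possibly empty) list of values.
    """
    root = ET.Element(f"{{{OAI_DC}}}dc")
    for name, values in properties.items():
        if name not in DC_PROPERTIES:
            raise ValueError(f"not a simple DC property: {name}")
        for value in values:
            ET.SubElement(root, f"{{{DC}}}{name}").text = value
    return root


# Invented sample values, for illustration only.
element = simple_dc({
    "title": ["An example eprint"],
    "creator": ["A. Author", "B. Author"],  # repeated property
})
print(ET.tostring(element, encoding="unicode"))
```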
OAI demonstration
• repository explorer demo
OAI and Google
[diagram: eprint archive(s), Web site(s), multimedia database(s); DP9 gateway]
• OAI gateway makes harvested metadata available to Google…
Implementing OAI
• OAI protocol is relatively simple
• implementation and deployment tends to be very fast
• lots of available toolkits
  • Java, Perl, PHP, etc.
• complete tools also available
  • e.g. tools that sit in front of existing databases
• see ‘tools’ area on the OAI Web site…
Creative Commons
http://www.creativecommons.org/
• CC is “devoted to expanding the range of creative work available for others to build upon and share”
• provides ‘standard’ licences for content
  • attribution
  • noncommercial
  • no derivative works
  • share alike
• mechanisms for indicating licence on Web pages
• need similar mechanism in OAI