
Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)

Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). Alon Kadury. Content. Reminders History OAI overview Technical introduction Conclusions Demonstrations Resources. Definition- A Digital Library is a:. 1. Collection of digital objects


Presentation Transcript


  1. Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) Alon Kadury

  2. Content • Reminders • History • OAI overview • Technical introduction • Conclusions • Demonstrations • Resources

  3. Definition- A Digital Library is a: 1. Collection of digital objects 2. Collection of knowledge structures 3. Collection of library services 4. Domain/Focus/Topic 5. Quality Control 6. Preservation/Persistence

  4. Types of DLs • Single Digital Library (SDL) • also Stand-alone, Self-contained • Federated Digital Library (FDL) • also confederated, distributed • Harvested Digital Library (HDL)

  5. Single Digital Library (SDL) • A regular DL • Self-contained material: • purchased • scanned/digitized • Usually localized

  6. Federated Digital Library (FDL) • Contains many autonomous libraries • Usually heterogeneous repositories • Connected via network • Forms a virtual distributed library • Transparent user interface • The major problem is interoperability.

  7. Harvested Digital Library (HDL) • Does not contain data, just metadata • Objects harvested into summaries • Regular DL characteristics: • fine granularity • rich library services • high quality control • annotated

  8. History • As the Web evolved, the number of Web sites and search engines increased. A similar process happened with e-prints and digital libraries. • The growth in the number of DLs led to the development of the OAI-PMH protocol, as we are about to see.

  9. History - Problems The development of e-prints and digital libraries led to several problems, such as: • Many user interfaces - Each DL offered its own Web interface for depositing articles and for end-user searches. The result: end users found it difficult to work across archives without learning several different interfaces.

  10. History - Problems • Different query syntaxes - The result: it was difficult for users to keep track of the search syntax of each SDL, and difficult to create an FDL that could query many SDLs. • Many metadata formats - SDL metadata could be kept in any format the SDL wanted. The result: FDLs had a hard time, because they had to know the formats of every SDL they harvested.

  11. History – Possible solutions • These problems led researchers to recognise the need for a single search interface to all archives - the Universal Pre-print Service (UPS). • Two possible approaches to building the UPS were considered:

  12. History – Solution 1 Cross-searching multiple archives: In this approach a client sends requests to several servers and then combines the data. The client and servers work with a known and agreed protocol (for example Z39.50). However, studies showed that this is not the preferred approach for distributed searching across large numbers of nodes, mainly because of performance issues and the problem of knowing which collections to search.

  13. History – Solution 2 Harvesting metadata into a ‘Central Server’: This approach harvests the metadata and stores it in a central server, on which searches are made. • The idea was demonstrated in a convention held at Santa Fe NM, October 21-22, 1999. • UPS was soon renamed the Open Archives Initiative (OAI) http://www.openarchives.org/ More reading: http://www.dlib.org/dlib/february00/02contents.html

  14. OAI overview- definitions Let's start with a few definitions: • Interoperability • Open Archives Initiative (OAI) • Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)

  15. OAI overview- definitions • What is Interoperability? • Interoperability refers to the ability of two or more systems to interact with one another and exchange data according to a prescribed method in order to achieve predictable results.

  16. OAI overview- definitions • In order to exchange data we need to agree on things like: • requests format • results format • transport protocols (HTTP vs FTP vs….) • Metadata formats (DC vs MARC vs…) • Usage rights (who can do what with the records) • We need someone to organize it and “set the rules”.

  17. OAI overview- definitions • Who will organize it? • Open Archives Initiative - “The Open Archives Initiative develops and promotes interoperability standards that aim to facilitate the efficient dissemination of content.” (http://www.openarchives.org/organization/index.html)

  18. OAI overview- definitions • What will the interoperability standards be called? Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)

  19. OAI overview- Key players • When talking about OAI-PMH we see three main players: • Data Providers • Service Providers • The protocol (OAI-PMH)

  20. OAI overview- Data Provider • Data Provider: • Handles deposit/publishing of resources in the archive. • Exposes metadata about resources in the archive (using the OAI-PMH protocol/interface). • Data Providers may support any metadata format, but must support the Dublin Core (DC) metadata format. • Offers free access to the archive (at least to the metadata). • A network-accessible server that can process OAI-PMH requests correctly is often called a Repository.

  21. OAI overview - Service Provider • Service Provider: • Harvests metadata from data providers and uses it to offer a single user interface across all harvested metadata. • May enrich metadata. • Offers (value-added) services on the basis of the metadata. • A client application that issues OAI-PMH requests is often referred to as a Harvester.

  22. OAI overview - Providers

  23. OAI overview - Providers [Diagram labels: Service Provider (native end-user interface); native harvesting interfaces; Data Providers (input interface; native end-user interface, optional, e.g., RePEc)]

  24. OAI overview - Providers [Diagram labels: Data providers; harvesting based on OAI-PMH; Service providers]

  25. OAI overview - Model [Layered diagram, Layers 1 to 4; labels: Web, SDL (three instances), OAI-PMH, Service Provider (FDL/HDL), Web interfaces]

  26. Technical introduction Since the days of the Santa Fe convention the protocol has gone through several versions. Version 2.0 is the latest and is considered stable. The technical introduction refers to this version.

  27. Tech’- protocol versions • Santa Fe convention: nature: experimental; verbs: Dienst; requests: HTTP GET/POST; responses: XML; transport: HTTP; metadata: OAMS; about: eprints; model: metadata harvesting • OAI-PMH v.1.0/1.1: nature: experimental; verbs: OAI-PMH; requests: HTTP GET/POST; responses: XML; transport: HTTP; metadata: unqualified Dublin Core; about: document like objects; model: metadata harvesting • OAI-PMH v.2.0: nature: stable; verbs: OAI-PMH; requests: HTTP GET/POST; responses: XML; transport: HTTP; metadata: unqualified Dublin Core; about: resources; model: metadata harvesting

  28. Tech’- request & response The requests of the protocol are HTTP based. The response contents of the protocol are XML based. Question: why? Answer: • Simple protocol based on existing standards, which allows rapid development and easy implementation. • Systems can be deployed in a variety of configurations. • Low-barrier interoperability specification. • Internet/firewall friendly.

  29. Tech’- request & response There are six request types, which are called verbs. The request type and additional information are passed as parameters using the HTTP GET or POST methods. [Diagram labels: Harvester (Service Provider); Repository (Data Provider); requests (based on HTTP); metadata (encoded in XML); “service” metadata; metadata (documents)]
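As a minimal sketch of how a verb and its arguments end up in an HTTP GET request, the Python helper below builds the request URL. The name `build_request_url` is illustrative and not part of the protocol, and the base URL is the placeholder used on these slides with an `http://` scheme assumed.

```python
from urllib.parse import urlencode


def build_request_url(base_url, verb, **arguments):
    """Build an OAI-PMH GET request URL from a verb and its arguments."""
    # The verb is placed first; the remaining arguments keep the caller's order.
    params = {"verb": verb, **arguments}
    return f"{base_url}?{urlencode(params)}"


# Placeholder base URL taken from the slide examples.
identify_url = build_request_url("http://archive.org/oai-script", "Identify")
records_url = build_request_url(
    "http://archive.org/oai-script",
    "ListRecords",
    metadataPrefix="oai_dc",
    set="biology",
)
```

Because `from` is a reserved word in Python, that argument has to be passed by unpacking a dictionary, for example `build_request_url(base, "ListIdentifiers", **{"metadataPrefix": "oai_dc", "from": "2002-12-01"})`.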

  30. Demo Let's see a demonstration of how we can create an FDL, and then we will look at what happens behind the scenes.

  31. Tech’– more definition [Diagram labels: Harvester (Service Provider); Repositories (Data Providers): e-prints, Images, OPAC, Museum, Archive] Requests: Identify, ListMetadataFormats, ListSets, ListIdentifiers, ListRecords, GetRecord. Responses: general information, metadata formats, set structure, record identifier, metadata.

  32. Tech’–Request Types • Six different request types • Identify • ListMetadataFormats • ListSets • ListIdentifiers • ListRecords • GetRecord • Harvester does not have to use all types. • Repository must implement all request types fully (all required and optional arguments for each of the requests).

  33. Tech’- Request Type: Identify • function: retrieve description and general information about an archive. • example: archive.org/oai-script?verb=Identify • parameters: none • errors / exceptions: badArgument, e.g. archive.org/oai-script?verb=Identify&set=biology

  34. Tech’- Request Type: Identify Response format

  35. Tech’- Request Type: Identify Response in XML format http://cs1.ist.psu.edu/cgi-bin/oai.cgi?verb=Identify

  36. Tech’- Request Type: ListMetadataFormats • function: retrieve available metadata formats from an archive. Remember that each archive must implement at least DC. • example: archive.org/oai-script?verb=ListMetadataFormats • parameters: identifier (optional) • errors / exceptions: badArgument; idDoesNotExist, e.g. archive.org/oai-script?verb=ListMetadataFormats&identifier=really-wrong-identifier; noMetadataFormats

  37. Tech’- Request Type: ListMetadataFormats Response in XML format http://cs1.ist.psu.edu/cgi-bin/oai.cgi?verb=ListMetadataFormats

  38. Tech’- Request Type: ListSets • Q: What are sets? A: Sets are a logical partitioning of a repository. • Q: Why use sets? A: The set mechanism is intended to enable selective harvesting. • Data providers don’t have to define sets. • Sets are not strictly hierarchical.

  39. Tech’- Request Type: ListSets • function: retrieve the set structure of a repository • example: archive.org/oai-script?verb=ListSets • parameters: resumptionToken (exclusive) • errors / exceptions: badArgument; badResumptionToken, e.g. archive.org/oai-script?verb=ListSets&resumptionToken=any-wrong-token; noSetHierarchy

  40. Tech’- Request Type: ListSets Response in XML format http://theses.lub.lu.se/oai-service/xerxes/?verb=ListSets

  41. Tech’- Request Type: ListIdentifiers • function: abbreviated form of ListRecords, retrieving only headers • example: archive.org/oai-script?verb=ListIdentifiers&metadataPrefix=oai_dc&from=2002-12-01 • parameters: from (optional); until (optional); metadataPrefix (required); set (optional); resumptionToken (exclusive) • errors / exceptions: badArgument, e.g. …&from=2002-12-01-13:45:00; badResumptionToken; cannotDisseminateFormat; noRecordsMatch; noSetHierarchy

  42. Tech’- Request Type: ListIdentifiers Response in XML format http://theses.lub.lu.se/oai-service/xerxes/?verb=ListIdentifiers&metadataPrefix=oai_dc

  43. Tech’- Request Type: ListRecords • function: harvest records from a repository • example: archive.org/oai-script?verb=ListRecords&metadataPrefix=oai_dc&set=biology • parameters: from (optional); until (optional); metadataPrefix (required); set (optional); resumptionToken (exclusive) • errors / exceptions: badArgument; badResumptionToken; cannotDisseminateFormat; noRecordsMatch; noSetHierarchy
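The argument rules shared by ListIdentifiers and ListRecords can be sketched as a small check. This is an illustrative helper, not code from any OAI-PMH library: the name `check_list_arguments` is hypothetical, and it only covers the rules listed on the slides (required `metadataPrefix`, exclusive `resumptionToken`, and well-formed `from`/`until` datestamps). It does not check that both dates use the same granularity or that the repository supports that granularity.

```python
import re

ALLOWED = {"from", "until", "metadataPrefix", "set", "resumptionToken"}

# Day granularity (YYYY-MM-DD) or seconds granularity in UTC (YYYY-MM-DDThh:mm:ssZ).
DATESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}(T\d{2}:\d{2}:\d{2}Z)?")


def check_list_arguments(arguments):
    """Return None if the arguments are acceptable, else the error code 'badArgument'."""
    if set(arguments) - ALLOWED:
        return "badArgument"  # unknown argument name
    if "resumptionToken" in arguments:
        # Exclusive: no other argument may accompany a resumptionToken.
        return None if len(arguments) == 1 else "badArgument"
    if "metadataPrefix" not in arguments:
        return "badArgument"  # required when no resumptionToken is given
    for name in ("from", "until"):
        if name in arguments and not DATESTAMP.fullmatch(arguments[name]):
            return "badArgument"  # malformed datestamp
    return None
```

With this check, the slide's bad example `from=2002-12-01-13:45:00` is rejected, while `from=2002-12-01` is accepted.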

  44. Tech’- Request Type: GetRecord • function: retrieve an individual metadata record from a repository • example: archive.org/oai-script?verb=GetRecord&identifier=oai:HUBerlin.de:3000218&metadataPrefix=oai_dc • parameters: identifier (required); metadataPrefix (required) • errors / exceptions: badArgument; cannotDisseminateFormat; idDoesNotExist
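Error conditions such as idDoesNotExist come back inside a normal XML response, as `error` elements carrying a `code` attribute. The sketch below shows one way a harvester might read them with Python's standard library. The sample response is hand-written for illustration (the message text and dates are made up), and `extract_errors` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# XML namespace of OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def extract_errors(response_xml):
    """Return a list of (code, message) pairs found in an OAI-PMH response."""
    root = ET.fromstring(response_xml)
    return [
        (error.get("code"), (error.text or "").strip())
        for error in root.findall("oai:error", OAI_NS)
    ]


# Hand-written sample response, for illustration only.
SAMPLE_ERROR = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2002-12-01T13:45:00Z</responseDate>
  <request>http://archive.org/oai-script</request>
  <error code="idDoesNotExist">No such identifier</error>
</OAI-PMH>"""

errors = extract_errors(SAMPLE_ERROR)
```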

  45. Tech’- Records, items & DC, or setting the record straight [Diagram labels: resource (David); item = identifier; item: all available metadata about David; records: Dublin Core metadata, MARC metadata, SPECTRUM metadata]

  46. Tech’- Records, items & DC A record consists of: • Header (mandatory) • identifier (1) • datestamp (1) • setSpec elements (*) • status attribute for deleted item (?) • Metadata (mandatory) • XML encoded metadata with root tag, namespace • repositories must support Dublin Core • About (optional) • rights statements • provenance statements
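To make the record structure concrete, here is a minimal sketch that reads the header and the Dublin Core titles from one record using Python's standard library. The sample record is hand-written for illustration: the identifier is the one from the GetRecord slide, while the datestamp, set and title values are made up. `parse_record` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def parse_record(record_xml):
    """Extract header fields and Dublin Core titles from a single <record>."""
    record = ET.fromstring(record_xml)
    header = record.find("oai:header", NS)
    return {
        "identifier": header.findtext("oai:identifier", namespaces=NS),
        "datestamp": header.findtext("oai:datestamp", namespaces=NS),
        # setSpec may occur zero or more times.
        "set_specs": [s.text for s in header.findall("oai:setSpec", NS)],
        # The optional status attribute marks a deleted item.
        "deleted": header.get("status") == "deleted",
        "titles": [t.text for t in record.findall(".//dc:title", NS)],
    }


# Hand-written sample record, for illustration only.
SAMPLE_RECORD = """<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:HUBerlin.de:3000218</identifier>
    <datestamp>2002-12-01</datestamp>
    <setSpec>biology</setSpec>
  </header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>An example title</dc:title>
      <dc:creator>Example, Author</dc:creator>
    </oai_dc:dc>
  </metadata>
</record>"""

parsed = parse_record(SAMPLE_RECORD)
```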

  47. Tech’- Records, items & DC • OAI-PMH supports dissemination of multiple metadata formats from a repository. • Properties of metadata formats: • id string to specify the format (metadataPrefix) • metadata schema URL (XML schema to test validity) • XML namespace URI (global identifier for metadata format) • Repositories must be able to disseminate unqualified DC. • Arbitrary metadata formats can be defined and transported via the OAI-PMH. • Returned metadata must comply with XML namespace specification.
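The three properties of a metadata format (prefix, schema URL, namespace URI) are what a ListMetadataFormats response reports for each format. The following sketch parses such a response into a dictionary keyed by `metadataPrefix`. The sample response is hand-written and abbreviated (it lists only `oai_dc`), and `parse_metadata_formats` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def parse_metadata_formats(response_xml):
    """Map each metadataPrefix to its (schema URL, namespace URI) pair."""
    root = ET.fromstring(response_xml)
    formats = {}
    for fmt in root.findall(".//oai:metadataFormat", OAI_NS):
        prefix = fmt.findtext("oai:metadataPrefix", namespaces=OAI_NS)
        schema = fmt.findtext("oai:schema", namespaces=OAI_NS)
        namespace = fmt.findtext("oai:metadataNamespace", namespaces=OAI_NS)
        formats[prefix] = (schema, namespace)
    return formats


# Hand-written, abbreviated sample response, for illustration only.
SAMPLE_FORMATS = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListMetadataFormats>
    <metadataFormat>
      <metadataPrefix>oai_dc</metadataPrefix>
      <schema>http://www.openarchives.org/OAI/2.0/oai_dc.xsd</schema>
      <metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace>
    </metadataFormat>
  </ListMetadataFormats>
</OAI-PMH>"""

formats = parse_metadata_formats(SAMPLE_FORMATS)
```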

  48. Tech’- Records, items & DC As mentioned before the minimum standard is unqualified Dublin Core (http://dublincore.org/). • Dublin Core Metadata Element Set contains 15 elements. • All elements are optional. • All elements may be repeated. The Dublin Core Metadata Element Set:
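For reference, the fifteen unqualified Dublin Core elements can be written down as a Python tuple, together with a small check that a record uses only those element names. The check reflects the two rules above: every element is optional, and every element may repeat, so a record is modelled here as a dictionary from element name to a list of values. That modelling choice and the helper name `unknown_dc_elements` are assumptions made for this sketch.

```python
# The fifteen elements of the Dublin Core Metadata Element Set.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def unknown_dc_elements(record):
    """Return the sorted element names in `record` that are not unqualified DC.

    `record` maps element names to lists of values; an empty record is valid
    because every element is optional.
    """
    return sorted(set(record) - set(DC_ELEMENTS))


# "author" is not a DC element; the DC name for that role is "creator".
problems = unknown_dc_elements({"title": ["A title"], "author": ["Someone"]})
```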

  49. Tech’- Records, items & DC Response in XML format http://cs1.ist.psu.edu/cgi-bin/oai.cgi?verb=GetRecord&identifier=oai:CiteSeerPSU:1&metadataPrefix=oai_dc

  50. Tech’- Flow control • Some requests can generate a very long response (for example, think about asking CiteSeer or the Library of Congress to list ALL their records using the ListRecords verb). • To avoid generating long responses that would overload the server, a flow control mechanism was added to the protocol. • Splitting long responses into shorter ones is solely the server's responsibility; the client has no control over the length of the responses.
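From the harvester's side, flow control amounts to a loop: issue the request, read the headers, and, if the response carries a non-empty `resumptionToken`, repeat the request with only that token. The sketch below illustrates the loop for ListIdentifiers. To keep it runnable without a network, the HTTP call is replaced by a `fetch` callable, and `fake_fetch` serves two hand-written pages; all identifiers, the token value and the helper names are made up for the example.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"


def harvest_identifiers(fetch, metadata_prefix="oai_dc"):
    """Yield every identifier, following resumptionTokens until the list ends.

    `fetch` takes a dict of request parameters and returns the response XML.
    """
    params = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    while True:
        root = ET.fromstring(fetch(params))
        for header in root.iter(OAI + "header"):
            yield header.findtext(OAI + "identifier")
        token = root.findtext(f".//{OAI}resumptionToken")
        if not token:  # missing or empty token: the list is complete
            return
        # resumptionToken is exclusive, so it replaces all other arguments.
        params = {"verb": "ListIdentifiers", "resumptionToken": token}


def _page(identifiers, token):
    """Build a hand-written response page (illustration only)."""
    headers = "".join(
        f"<header><identifier>{i}</identifier>"
        "<datestamp>2002-12-01</datestamp></header>"
        for i in identifiers
    )
    return (
        '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListIdentifiers>'
        f"{headers}<resumptionToken>{token}</resumptionToken>"
        "</ListIdentifiers></OAI-PMH>"
    )


# Two fake pages: the first ends with a token, the last with an empty token.
PAGES = {
    None: _page(["oai:example:1", "oai:example:2"], "page-2"),
    "page-2": _page(["oai:example:3"], ""),
}


def fake_fetch(params):
    """Stand-in for an HTTP request; looks up the page by resumptionToken."""
    return PAGES[params.get("resumptionToken")]


identifiers = list(harvest_identifiers(fake_fetch))
```

In a real harvester, `fetch` would perform the HTTP GET or POST against the repository's base URL; the loop itself would stay the same.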
