1. Extracting XML from Unicorn with OAI and SRU
European Unicorn User Group Conference
Glasgow Caledonian University
September 7th & 8th, 2006
2. Agenda
Introduction – Unicorn interfaces
Part 1: An OAI frontend for Unicorn
Part 2: An SRU frontend for Unicorn
Short description of OAI and SRU protocols
Overview of technical implementation
Use cases and demos
3. OAI and SRU are ‘open’ protocols that permit exchange of metadata between information systems
Well-known Unicorn interfaces:
Unicorn API server
Unicorn Webcat/iBistro/iLink server
Unicorn Z39.50 server
All follow the request/response model
7. API: Proprietary
low interoperability level
HTML: Record data not well structured
low reusability level
Z39.50: Protocol specific
more difficult to implement (steep learning curve)
Z39.50 is stateful
Difficult to integrate into today’s web services environments
communication: use HTTP
information exchange: use open protocols (like OAI and SRU)
record data structure: use XML (according to a well-defined XML Schema)
8. HTTP / Open / XML
OAI-PMH: Open Archives Initiative – Protocol for Metadata Harvesting
SRU: Search and Retrieve via URL
10. ‘Harvester collects metadata from archives’
Stateless protocol: sequence of OAI requests/responses over HTTP
Just harvesting -- NOT searching
11. OAI requests
HTTP GET|POST requests
Syntax
BASE URL
host + port + path of OAI request handler
key=value pairs
Examples:
http://www.cible.ulb.ac.be:80/cgi-bin/OAI20/catalog?verb=Identify
http://www.biomedcentral.com/oai/1.1/bmcoai.asp?verb=GetRecord&identifier=oai:bmc:1471-2105-1-1&metadataPrefix=oai_dc
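The request syntax above (base URL plus key=value pairs) can be sketched in a few lines of Python. This is only an illustration: the helper name oai_request_url is made up for this example, and the colon is left unescaped so that the output matches the URLs shown on the slides.

```python
from urllib.parse import urlencode

def oai_request_url(base_url, verb, **arguments):
    """Return base_url + '?' + key=value pairs, with the verb first."""
    pairs = [("verb", verb)] + list(arguments.items())
    # safe=":" keeps identifiers such as oai:ulbcat:245000 readable.
    return base_url + "?" + urlencode(pairs, safe=":")

BASE_URL = "http://www.cible.ulb.ac.be:80/cgi-bin/OAI20/catalog"

print(oai_request_url(BASE_URL, "Identify"))
print(oai_request_url(BASE_URL, "GetRecord",
                      identifier="oai:ulbcat:245000",
                      metadataPrefix="oai_dc"))
```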
12. OAI responses
XML encoded bytestreams, containing the records
Record = triplet
header (unique OAI identifier)
metadata
about
Metadata schemes
XML Schema
Minimum: unqualified Dublin Core
Community specific
Example of a record (catkey 450000 from ULB catalogue):
oai_dc marc21 umods
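To make the header/metadata/about triplet concrete, here is a minimal Python sketch that takes one record apart with the standard XML parser. The sample record is invented for illustration (placeholder datestamp and title); only the identifier pattern oai:ulbcat:<catkey> comes from the slides.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

# Invented sample record; a real response wraps records in more envelope XML.
SAMPLE = """<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:ulbcat:450000</identifier>
    <datestamp>2006-08-01</datestamp>
  </header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>Placeholder title</dc:title>
    </oai_dc:dc>
  </metadata>
</record>"""

def split_record(xml_text):
    """Return the header fields, the Dublin Core titles and whether 'about' is present."""
    record = ET.fromstring(xml_text)
    header = record.find(OAI + "header")
    return {
        "identifier": header.findtext(OAI + "identifier"),
        "datestamp": header.findtext(OAI + "datestamp"),
        "titles": [t.text for t in record.iter(DC + "title")],
        "has_about": record.find(OAI + "about") is not None,
    }

print(split_record(SAMPLE))
```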
13. Simple: 6 OAI requests/responses
Identify
http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=Identify
ListMetadataFormats [identifier]
http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=ListMetadataFormats
ListSets
http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=ListSets
GetRecord identifier, metadataPrefix
http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=GetRecord&identifier=oai:ulbcat:245000&metadataPrefix=marc21
14. Simple: 6 OAI requests/responses
ListRecords metadataPrefix, [from,until,set]
http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=ListRecords&metadataPrefix=oai_dc
http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=ListRecords&metadataPrefix=mhld21&set=elper
http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=ListRecords&metadataPrefix=marc21&from=2006-08-01
ListIdentifiers metadataPrefix, [from,until,set]
http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=ListIdentifiers&metadataPrefix=oai_dc
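The six requests and their arguments, as listed on these two slides, can be written down as a small lookup table. The Python sketch below checks a request against that table; the function name check_arguments is hypothetical, and the resumptionToken argument defined by OAI-PMH is deliberately left out because the slides do not cover it.

```python
# verb -> (required arguments, optional arguments), as listed on the slides.
VERB_ARGUMENTS = {
    "Identify": (set(), set()),
    "ListMetadataFormats": (set(), {"identifier"}),
    "ListSets": (set(), set()),
    "GetRecord": ({"identifier", "metadataPrefix"}, set()),
    "ListRecords": ({"metadataPrefix"}, {"from", "until", "set"}),
    "ListIdentifiers": ({"metadataPrefix"}, {"from", "until", "set"}),
}

def check_arguments(verb, arguments):
    """Return a list of problems; an empty list means the request is well-formed."""
    if verb not in VERB_ARGUMENTS:
        return ["unknown verb: " + verb]
    required, optional = VERB_ARGUMENTS[verb]
    given = set(arguments)
    problems = ["missing argument: " + name
                for name in sorted(required - given)]
    problems += ["unexpected argument: " + name
                 for name in sorted(given - required - optional)]
    return problems

print(check_arguments("GetRecord", {"identifier": "oai:ulbcat:245000"}))
```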
15. Implementation of the data provider functionality (2001)
http://www.openarchives.org/tools/tools.html
pick a template and interface with Unicorn through Unicorn database tools
Our choice: object-oriented Perl frontend (H. Suleman – Virginia Tech)
17. Example: implementation of the GetRecord request
http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=GetRecord&identifier=oai:ulbcat:245000&metadataPrefix=oai_dc
1. Get metadata from Unicorn for catkey 245000
$record = `echo $catkey | catalogdump -of | filtermarc -iALL -od -Ds`;
@dates = split('\|', `echo $catkey | selcatalog -iK -opr`);
2. Convert ANSEL character set into ISO-LATIN-1
3. Map from MARC to oai_dc
4. Format into XML
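Steps 3 and 4 could look roughly like the Python sketch below. It is not the Perl code used at ULB: the crosswalk table is a hypothetical four-entry example, the input is assumed to be already decoded text (step 2 is skipped), and the sample field values in the usage lines are invented.

```python
import xml.etree.ElementTree as ET

OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical crosswalk: (MARC tag, subfield code) -> Dublin Core element.
MARC_TO_DC = {
    ("100", "a"): "creator",
    ("245", "a"): "title",
    ("260", "b"): "publisher",
    ("260", "c"): "date",
}

def marc_to_oai_dc(fields):
    """fields: list of (tag, subfield code, value) tuples, already decoded to text."""
    ET.register_namespace("oai_dc", OAI_DC_NS)
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("{%s}dc" % OAI_DC_NS)
    for tag, code, value in fields:
        element = MARC_TO_DC.get((tag, code))
        if element is not None:  # unmapped fields are silently dropped
            ET.SubElement(root, "{%s}%s" % (DC_NS, element)).text = value
    return ET.tostring(root, encoding="unicode")

# Invented sample values, for illustration only.
print(marc_to_oai_dc([("245", "a", "Relativity"),
                      ("100", "a", "Einstein, Albert")]))
```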
18. Example: implementation of the ‘set’ parameter of the ListRecords request
http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=ListRecords&metadataPrefix=oai_dc&set=elper
Precompile set as a file of catkeys
file name: « <name of set>_catkeys »
einstein_albert_catkeys
elper_catkeys
sd_catkeys
all_catkeys
through periodic execution of « mkoaisets » custom report
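Serving a set then comes down to reading the matching file. A minimal Python sketch, assuming one catkey per line (the slides do not state the file layout); the directory and the catkeys in the demo are throwaway values.

```python
import os
import tempfile

def catkeys_for_set(directory, set_name):
    """Return the catkeys stored in <directory>/<set_name>_catkeys."""
    path = os.path.join(directory, set_name + "_catkeys")
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]

# Demo with a temporary directory and invented catkeys.
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "elper_catkeys"), "w") as handle:
        handle.write("245000\n450000\n")
    demo_keys = catkeys_for_set(demo_dir, "elper")

print(demo_keys)
```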
19. Example: implementation of the ‘from/until’ parameters of the ListRecords request
http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=ListRecords&metadataPrefix=oai_dc&from=2006-08-01&until=2006-08-31
BRS index on creation/modification date?
Every Unicorn record that gets created or modified is ‘touched’ in the ‘textedit’ and ‘browsedit’ directories
Custom report ‘cadutext’
saves catkeys to <ud>/Savedkeys/adutext/rptid
adds line ‘rptid|date|status’ to <ud>/Lastruns/cadutext
Example: « from=2006-08-01&until=2006-08-31 »
obtain report ids for all runs of cadutext after 2006-08-01 and before 2006-08-31 from the file <ud>/Lastruns/cadutext
for each of these report ids: obtain catkeys from <ud>/Savedkeys/adutext/rptid and save them to randomnumber_catkeys file
sort and uniq the randomnumber_catkeys file
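The three steps of this example can be sketched in Python as follows. Two things are assumptions, not facts from the slides: that the date column in Lastruns/cadutext is written as YYYY-MM-DD, and that each saved-keys file holds one catkey per line. The function name catkeys_between is made up.

```python
import os

def catkeys_between(unicorn_dir, from_date, until_date):
    """Collect sorted, de-duplicated catkeys from runs dated within the range."""
    catkeys = set()
    lastruns = os.path.join(unicorn_dir, "Lastruns", "cadutext")
    with open(lastruns) as runs:
        for line in runs:
            # Each line has the form rptid|date|status.
            rptid, run_date, _status = line.rstrip("\n").split("|")
            # ISO dates compare correctly as plain strings.
            if from_date <= run_date <= until_date:
                saved = os.path.join(unicorn_dir, "Savedkeys", "adutext", rptid)
                with open(saved) as keys:
                    catkeys.update(k.strip() for k in keys if k.strip())
    # The set removes duplicates (uniq); sorting is numeric on the catkey.
    return sorted(catkeys, key=int)
```

Called as catkeys_between(ud, "2006-08-01", "2006-08-31"), where ud stands for the <ud> directory, it returns the list that the slide stores in the randomnumber_catkeys file.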
20. Limitations of implementation:
ListRecords/ListIdentifiers:
The from and until parameters are not permitted if the set parameter is given on the request
The from and until parameters are permitted if the set parameter is not given on the request, but their values should fall within a certain date range (at this moment arbitrarily set to ‘today - 2 months’ and ‘today’)
Deleted records
Complete source code and documentation available on the API Repository (http://sirsiapi.org)
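The two from/until restrictions listed above amount to a small validation step. The Python sketch below is one possible reading, not the published source: 'today - 2 months' is approximated as 60 days, and the function name and error messages are invented.

```python
from datetime import date, timedelta

def check_selective_harvest(arguments, today):
    """Return an error message, or None if the from/until/set combination is accepted."""
    has_dates = "from" in arguments or "until" in arguments
    if has_dates and "set" in arguments:
        return "from/until cannot be combined with set"
    earliest = today - timedelta(days=60)  # rough stand-in for 'today - 2 months'
    for name in ("from", "until"):
        if name in arguments:
            value = date.fromisoformat(arguments[name])
            if not earliest <= value <= today:
                return name + " is outside the supported date range"
    return None

print(check_selective_harvest({"from": "2006-08-01", "set": "elper"},
                              today=date(2006, 9, 7)))
```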
23. Use case 1: Vlink - OpenURL resolver system
OpenURL sent from iLink
http://bibdev.vub.ac.be/cgi-bin/openurlulb?sid=ULB:Webcat&id=oai:ulbcat:617924
This OpenURL does not contain enough metadata for the specific item ==> Vlink does a fetch back to Unicorn through an OAI GetRecord request to obtain a full MARC21 bibliographic description
http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=GetRecord&identifier=oai:ulbcat:617924&metadataPrefix=marc21
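The fetch back can be pictured as a URL-to-URL rewrite: take the id from the incoming OpenURL and place it in a GetRecord request. The Python sketch below illustrates that idea only; it says nothing about how Vlink is actually implemented, and the function name is hypothetical.

```python
from urllib.parse import urlsplit, parse_qs, urlencode

OAI_BASE_URL = "http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog"

def getrecord_url_for(openurl, metadata_prefix="marc21"):
    """Turn the 'id' carried by an incoming OpenURL into an OAI GetRecord URL."""
    params = parse_qs(urlsplit(openurl).query)
    identifier = params["id"][0]
    query = urlencode(
        [("verb", "GetRecord"),
         ("identifier", identifier),
         ("metadataPrefix", metadata_prefix)],
        safe=":",  # keep oai:ulbcat:617924 readable
    )
    return OAI_BASE_URL + "?" + query

print(getrecord_url_for(
    "http://bibdev.vub.ac.be/cgi-bin/openurlulb?sid=ULB:Webcat&id=oai:ulbcat:617924"))
```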
24. Use case 1: Vlink - OpenURL resolver system
Feed Vlink Knowledge Base through OAI harvesting
25. Use case 2: Unicat - Virtual Union Catalog of Belgium
27. ‘Client searches and retrieves metadata records from an archive’
Stateless protocol: sequence of SRU requests/responses over HTTP
Search and Retrieve (<-> OAI: harvesting)
28. SRU requests
HTTP GET requests
Syntax
BASE URL
host + port + path of SRU request handler
key=value pairs
3 possible requests (operations)
explain
describes the facilities available at an SRU server
used by clients to self-configure
returned explain record is in XML and follows the ZeeRex Schema
Example: http://z3950.loc.gov:7090/voyager?version=1.1&operation=explain
scan
allows the client to request a range of the available terms at a given point within a list of indexed terms
enables clients to present an ordered list of values and, if supported, how many hits there would be for a search on that term
searchRetrieve
29. searchRetrieve operation
searchRetrieve (principal) parameters
version: version of the request; current protocol version: 1.1
query: query expressed in CQL
startRecord: position within the sequence of matched records of the first record to be returned
maximumRecords: number of records requested to be returned
recordSchema: schema requested for the records to be returned
stylesheet: URL of an XML stylesheet. The client requests that the server simply return this URL in the response.
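The parameters above combine into a request URL in the usual key=value way. A minimal Python sketch follows, with a made-up helper name and defaults chosen only for the example; unlike the hand-typed URLs on the slides, it percent-encodes spaces and quotes in the query.

```python
from urllib.parse import urlencode, quote

def search_retrieve_url(base_url, query, start_record=1, maximum_records=10,
                        record_schema=None, stylesheet=None):
    """Build an SRU 1.1 searchRetrieve URL from the principal parameters."""
    pairs = [
        ("version", "1.1"),
        ("operation", "searchRetrieve"),
        ("maximumRecords", maximum_records),
        ("startRecord", start_record),
        ("query", query),
    ]
    if record_schema:
        pairs.append(("recordSchema", record_schema))
    if stylesheet:
        pairs.append(("stylesheet", stylesheet))
    # quote_via=quote encodes spaces as %20 rather than '+'.
    return base_url + "?" + urlencode(pairs, quote_via=quote)

print(search_retrieve_url("http://bib49.ulb.ac.be:9000/Cible",
                          'title all "einstein albert"',
                          record_schema="dc"))
```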
CQL
« Traditionally, query languages have fallen into two camps: Powerful, expressive languages, not easily readable nor writable by non-experts (e.g. SQL, PQF, and XQuery); or simple and intuitive languages not powerful enough to express complex concepts (e.g. CCL and google). CQL tries to combine simplicity and intuitiveness of expression for simple, every day queries, with the richness of more expressive languages to accommodate complex concepts when necessary. » (http://www.loc.gov/standards/sru/cql)
30. searchRetrieve operation
Examples of CQL queries:
dinosaur
title = "complete dinosaur"
title exact "the complete dinosaur"
dinosaur not reptile
dinosaur and bird or dinobird
publicationYear < 1980
title all "complete dinosaur"
title contains all of the words: ‘complete’, and ‘dinosaur’
title any "dinosaur bird reptile"
title contains any of the words: ‘dinosaur’, ‘bird’, or ‘reptile’
ribs prox/distance<=5 chevrons
a more specific proximity query: ‘ribs’ within 5 words of ‘chevrons’
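Queries of the index/relation/term shape shown above are easy to assemble from user input, provided the term is quoted. The Python sketch below is a hypothetical pair of helpers (cql_clause and cql_and are not part of any SRU library); it escapes embedded double quotes with a backslash, as CQL expects inside quoted terms.

```python
def cql_clause(index, relation, term):
    """Return 'index relation "term"', escaping backslashes and quotes in the term."""
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    return '%s %s "%s"' % (index, relation, escaped)

def cql_and(*clauses):
    """Join clauses with the boolean 'and', parenthesising each one."""
    return " and ".join("(%s)" % clause for clause in clauses)

print(cql_clause("title", "all", "complete dinosaur"))
print(cql_and(cql_clause("title", "any", "dinosaur bird reptile"),
              "publicationYear < 1980"))
```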
31. searchRetrieve operation -- examples
http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&query=author=einstein
http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=author=einstein
http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=author=einstein&recordSchema=dc
http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=author all "einstein albert"
http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=title all "einstein albert"
http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=title all "einstein albert"&stylesheet=http://bib49.ulb.ac.be/cibleCanevas.xsl
http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=title all "einstein albert"&stylesheet=http://bib49.ulb.ac.be/cibleTypo3.xsl
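The examples above all ask for the first ten records. Fetching later pages only means changing startRecord, which counts from 1. A small Python sketch of that arithmetic, assuming the total hit count is already known from an earlier response (the function name is made up):

```python
def page_windows(total_hits, page_size):
    """Yield (startRecord, maximumRecords) pairs covering total_hits records."""
    start = 1  # SRU record positions are 1-based
    while start <= total_hits:
        yield start, min(page_size, total_hits - start + 1)
        start += page_size

for start_record, maximum_records in page_windows(25, 10):
    print("startRecord=%d&maximumRecords=%d" % (start_record, maximum_records))
```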
34. SRU/Z39.50 Gateway: YAZ Proxy (Index Data)
Implemented at ULB: 7/2006 (2 days)
config.xml
<target name="cible" default="1">
<url>bib7.ulb.ac.be:2200</url>
<xi:include href="explain.xml"/>
<cql2rpn>pqf.properties</cql2rpn>
</target>
<target name="slavko" default="1">
<url>velma.library.mun.ca:2200</url>
<xi:include href="explain.slavko.xml"/>
<cql2rpn>pqf.slavko.properties</cql2rpn>
</target>
explain.xml
ZeeRex XML record as response to ‘explain’ operation
pqf.properties
specifies the mapping of various CQL indexes, relations, etc. into Type-1 query attributes
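What such a mapping achieves can be shown with a toy translation from a CQL index to a Type-1 query written in prefix notation (PQF). The Python sketch below is a simplification under stated assumptions: the two-entry table is illustrative (Bib-1 use attributes 4 for title and 1003 for author), it is not the content of the ULB pqf.properties file, and relations, booleans and quote escaping are ignored.

```python
# Illustrative mapping only: CQL index -> Bib-1 use attribute.
INDEX_TO_USE_ATTRIBUTE = {
    "title": "1=4",
    "author": "1=1003",
}

def clause_to_pqf(index, term):
    """Translate a single 'index = term' clause into a PQF string."""
    attribute = INDEX_TO_USE_ATTRIBUTE[index]
    return '@attr %s "%s"' % (attribute, term)

print(clause_to_pqf("author", "einstein"))
print(clause_to_pqf("title", "einstein albert"))
```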
35. YAZ Proxy
http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=title all "einstein albert"&stylesheet=http://bib49.ulb.ac.be/cibleTypo3.xsl
http://bib49.ulb.ac.be:9000/Slavko?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=title all "einstein albert"&stylesheet=http://bib49.ulb.ac.be/cibleTypo3.xsl
36. Seamless integration of catalog searches in CMS
Typo3
Example
HTML page containing a biography of the famous Belgian historian Henri Pirenne
frame pointing to the following URL:
http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=pirenne%20and%20epub-dnu-*&stylesheet=http://bib49.ulb.ac.be/cibleTypo3.xsl
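A CMS template could generate such a frame from a CQL query instead of hard-coding the URL. The Python sketch below assumes an HTML iframe is used (the slide only says "frame") and uses a hypothetical helper name; it percent-encodes the query and escapes the ampersands for use inside an HTML attribute.

```python
from html import escape
from urllib.parse import quote

def catalog_frame(base_url, cql_query, stylesheet):
    """Return an iframe tag whose src is an SRU searchRetrieve URL."""
    src = (base_url
           + "?version=1.1&operation=searchRetrieve"
           + "&maximumRecords=10&startRecord=1"
           + "&query=" + quote(cql_query, safe="*")  # keep the truncation '*'
           + "&stylesheet=" + stylesheet)
    return '<iframe src="%s"></iframe>' % escape(src, quote=True)

print(catalog_frame("http://bib49.ulb.ac.be:9000/Cible",
                    "pirenne and epub-dnu-*",
                    "http://bib49.ulb.ac.be/cibleTypo3.xsl"))
```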
Project
Unicorn contains descriptions of databases, websites, etc., with local thematic classification codes in field 653
create thematic websites within our CMS, containing frames that list available databases per theme