The Open Archives Initiative Protocol for Metadata Harvesting CRIS + Open Access = The Route to Research Knowledge on the GRID Brussels – 21 September 2004 Andy Powell, UKOLN, University of Bath a.powell@ukoln.ac.uk UKOLN is supported by: www.bath.ac.uk www.ukoln.ac.uk A centre of expertise in digital information management
Contents
• a brief history of OAI
• 10 technical things you should know about the OAI-PMH
• potential impact…
  • institutional context
  • the role of the library?
  • the researcher
• current activities/issues
• OAI and the semantic Web
note: primary focus is on the technology
OAI roots
• the roots of OAI lie in the development of eprint archives…
  • arXiv, CogPrints, NACA (NASA), RePEc, NDLTD, NCSTRL
• each offered Web interface for deposit of articles and for end-user searches
• difficult for end-users to work across archives without having to learn multiple different interfaces
• recognised need for single search interface to all archives
  • Universal Pre-print Service (UPS)
Searching vs. harvesting
• two possible approaches to building a single search interface to multiple eprint archives…
  • cross-searching multiple archives based on protocol like Z39.50
  • harvesting metadata into one or more ‘central’ services – bulk move data to the user-interface
• US digital library experience in this area indicated that cross-searching not preferred approach
  • distributed searching of N nodes viable, but only for small values of N
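The harvesting approach can be sketched in a few lines: records are bulk-copied from each archive into one central index, and searches then run against that index only. This is a minimal illustration, not a real harvester. The archive names, the sample records and the in-memory dictionary standing in for the ‘central’ service are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical sample data standing in for what each archive would return
# when harvested; the archive names and records are illustrative only.
ARCHIVES = {
    "archive-a": [{"id": "a:1", "title": "Quantum gravity notes"}],
    "archive-b": [
        {"id": "b:1", "title": "Gravity models of trade"},
        {"id": "b:2", "title": "Labour economics survey"},
    ],
}


def harvest(archives):
    """Bulk-copy every record from every archive into one central index."""
    index = defaultdict(set)  # word -> set of record ids
    records = {}              # record id -> record
    for recs in archives.values():
        for rec in recs:
            records[rec["id"]] = rec
            for word in rec["title"].lower().split():
                index[word].add(rec["id"])
    return index, records


def search(index, records, word):
    """Search the central index only; no archive is contacted at query time."""
    return sorted(records[i]["title"] for i in index.get(word.lower(), ()))


index, records = harvest(ARCHIVES)
print(search(index, records, "gravity"))
```

The cost of a query here does not depend on the number of archives, which is the practical argument for harvesting over cross-searching N nodes.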
Harvesting requirements
• in order that harvesting approach can work there need to be agreements about…
  • transport protocols – HTTP vs. FTP vs. …
  • metadata formats – DC vs. MARC vs. …
  • quality assurance – mandatory elements, mechanisms for naming of people, subjects, etc., handling duplicated records, best-practice
  • intellectual property and usage rights – who can do what with the records
• work in this area resulted in the “Santa Fe Convention”
Development of OAI-PMH
• 2 year metamorphosis through various names
  • Santa Fe Convention, OAI-PMH versions 1.0, 1.1…
  • OAI Protocol for Metadata Harvesting 2.0
• development steered by international technical committee
• inter-version stability helped developer confidence
• move from focus on eprints to more generic protocol
• move from OAI-specific metadata schema to mandatory support for DC
Bluffer’s guide to OAI http://www.openarchives.org/
• OAI-PMH short for Open Archives Initiative Protocol for Metadata Harvesting
• a low-cost mechanism for harvesting metadata records
  • from ‘data providers’ to ‘service providers’
• allows ‘service provider’ to say ‘give me some or all of your metadata records’
  • where ‘some’ is based on date-stamps, sets, metadata formats
• eprint heritage but widely deployed
  • images, museum artefacts, learning objects, …
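As a rough illustration of ‘give me some or all of your metadata records’, the sketch below builds a ListRecords request URL. The verb and the argument names (`metadataPrefix`, `from`, `until`, `set`) are those of OAI-PMH 2.0; the base URL, the set name and the function name are hypothetical, chosen only for the example.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc",
                     from_date=None, until_date=None, set_spec=None):
    """Build a selective-harvesting request URL for a data provider.

    base_url is a hypothetical repository endpoint; the query arguments
    are the OAI-PMH 2.0 ones for date-stamps, sets and metadata formats.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date    # records changed on or after this date
    if until_date:
        params["until"] = until_date  # records changed on or before this date
    if set_spec:
        params["set"] = set_spec      # restrict to one set of the repository
    return base_url + "?" + urlencode(params)


# 'All' of the records in simple DC:
print(list_records_url("http://example.org/oai"))
# 'Some' of the records: one set, changed since a given date:
print(list_records_url("http://example.org/oai",
                       from_date="2004-01-01", set_spec="physics"))
```

A service provider would typically remember the date of its last harvest and pass it as `from` on the next run, so that only new or changed records are transferred.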
Bluffer’s guide to OAI
• based on HTTP and XML
  • simple, Web-friendly, fast deployment
• OAI-PMH is not a search protocol
  • but its use can underpin search-based services based on Z39.50 or SRW or SOAP or…
• OAI-PMH typically carries metadata
  • content (e.g. full-text or image) made available separately – typically at URL in metadata
• mandates simple DC as record format
  • but extensible to any XML format – IEEE LOM, ONIX, MARC, METS, MPEG-21, etc.
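Because responses are plain XML over HTTP, a harvester needs little more than an XML parser. The sketch below reads a hand-written sample response and pulls out, for each record, the OAI identifier and any URL given in the metadata, which is where the content itself would typically be fetched from. The namespace URIs are the standard OAI-PMH 2.0, oai_dc and Dublin Core ones; the sample record, the identifiers and the function name are invented for the example.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hand-written sample response (assumption: a real one would also carry
# responseDate and request elements, omitted here for brevity).
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:1</identifier>
        <datestamp>2004-09-01</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example eprint</dc:title>
          <dc:identifier>http://example.org/eprints/1.pdf</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def content_links(xml_text):
    """Return (OAI identifier, [content URLs]) for each harvested record."""
    root = ET.fromstring(xml_text)
    results = []
    for rec in root.iterfind(".//oai:record", NS):
        oai_id = rec.findtext("oai:header/oai:identifier", namespaces=NS)
        urls = [el.text for el in rec.iterfind(".//dc:identifier", NS)
                if el.text and el.text.startswith("http")]
        results.append((oai_id, urls))
    return results


print(content_links(SAMPLE))
```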
Bluffer’s guide to OAI
• metadata and ‘content’ often made freely available – but not a requirement
  • OAI-PMH can be used between closed groups
  • or, can make metadata available but restrict access to content in some way
• underlying HTTP protocol provides
  • access control – e.g. HTTP BASIC
  • compression mechanisms (for improving performance of harvesters)
  • could, in theory, also provide encryption if required
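The HTTP features listed above need no protocol-specific support; they are ordinary request headers. The sketch below prepares a request that offers gzip compression and, optionally, HTTP BASIC credentials, and shows how a compressed body would be decoded. The URL, user name and password are placeholders, and no request is actually sent.

```python
import base64
import gzip
import urllib.request


def build_request(url, user=None, password=None):
    """Prepare a harvesting request with compression and optional BASIC auth."""
    headers = {"Accept-Encoding": "gzip"}  # ask the server to compress
    if user is not None:
        token = base64.b64encode(f"{user}:{password}".encode("utf-8"))
        headers["Authorization"] = "Basic " + token.decode("ascii")
    return urllib.request.Request(url, headers=headers)


def decode_body(raw, content_encoding):
    """Undo gzip compression if the server applied it, then decode as UTF-8."""
    if content_encoding == "gzip":
        raw = gzip.decompress(raw)
    return raw.decode("utf-8")


# Placeholder endpoint and credentials; nothing is sent over the network.
req = build_request("http://example.org/oai?verb=Identify", "harvester", "secret")
print(req.header_items())
print(decode_body(gzip.compress(b"<OAI-PMH/>"), "gzip"))
```

Encryption would come from using an `https` URL rather than from anything in the request itself.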
Dublin Core http://dublincore.org/
• OAI-PMH mandates use of simple DC as lowest common denominator
  • agreed XML schema – ‘oai_dc’
• simple DC – 15 metadata properties
• all DC properties optional and repeatable
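One practical consequence of ‘optional and repeatable’ is that a harvested record is most naturally held as a mapping from each of the 15 properties to a list of values, possibly empty. A minimal sketch, assuming a hand-written oai_dc fragment with invented values:

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# The 15 properties of simple Dublin Core.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)

# Invented sample: two creators (repeatable), no date (optional).
SAMPLE_DC = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                          xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An example eprint</dc:title>
  <dc:creator>Author, A.</dc:creator>
  <dc:creator>Other, B.</dc:creator>
</oai_dc:dc>"""


def parse_oai_dc(xml_text):
    """Map every simple DC property to a list of its values (maybe empty)."""
    root = ET.fromstring(xml_text)
    record = {name: [] for name in DC_ELEMENTS}
    for name in DC_ELEMENTS:
        for el in root.iter(f"{{{DC_NS}}}{name}"):
            if el.text and el.text.strip():
                record[name].append(el.text.strip())
    return record


record = parse_oai_dc(SAMPLE_DC)
print(record["creator"], record["date"])
```

Code that consumes such records cannot assume that any given property is present, which is why quality-assurance agreements about mandatory elements matter to service providers.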
OAI and Google
• OAI gateway makes harvested metadata available to Google…
[diagram: eprint archive(s) → OAI-PMH → OAI gateway → HTTP → Google]
• examples…
  • DSpace and Google
  • OAIster and Yahoo
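One way to picture such a gateway is as a program that turns each harvested record into an ordinary Web page that a crawler can fetch over HTTP. The sketch below shows only that rendering step. The record layout, field names and page structure are assumptions for illustration and do not describe how any of the named services actually work.

```python
from html import escape


def render_page(record):
    """Render one harvested record as a crawlable HTML page.

    The record is a hypothetical dict with 'title', 'creators' and 'url'.
    """
    title = escape(record.get("title", "Untitled"))
    creators = ", ".join(escape(c) for c in record.get("creators", []))
    url = escape(record["url"], quote=True)
    return (
        "<html><head><title>{t}</title></head><body>"
        "<h1>{t}</h1><p>{c}</p>"
        '<p><a href="{u}">Full text</a></p>'
        "</body></html>"
    ).format(t=title, c=creators, u=url)


page = render_page({
    "title": "Fish & chips",
    "creators": ["A. Author"],
    "url": "http://example.org/eprints/1.pdf",
})
print(page)
```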
Impact on institutions…
• OAI-PMH technology provides an open, relatively stable technical framework
  • allows institution to re-consider management of intellectual output
  • greater confidence in availability of external services (e.g. discovery, access, analysis)
• the technical bit is easy
  • eprints.org software (Southampton), DSpace (MIT/HP), Fedora
• but, technical solutions are always easy!
  • real problem is cultural change required to get academics to deposit
Impact on libraries…
• library is natural choice as ‘managing agent’ for the institutional repository
  • quality control
  • metadata enhancement
  • preservation
• but libraries often weak technically (not always!) therefore technical collaboration within institution may be required
• beginning to see some evidence of externally ‘hosted’ repository services being offered
Impact on researchers…
• OAI-PMH technology provides a ‘disruptive’ technical framework that supports
  • new ways for individual researcher to disclose his/her research output
  • development of new kinds of ‘research’ discovery services
• can use ‘personal’ OAI repository
• but, need to
  • clarify roles of institutional, discipline and personal repositories
  • overcome FUD – IPR, peer-review, ability to ‘publish’, quality control, inertia
Current activities/issues
• protocol now stable and few changes being discussed
• some lightweight noises about re-implementing OAI-PMH using SOAP (Web services) but little enthusiasm for pushing these kinds of changes forward
• some work on OAI-rights issues – formalising mechanisms for attaching IPR statements and/or licences to the records being exchanged using the protocol, e.g. Creative Commons
Creative Commons http://www.creativecommons.org/
• CC is “devoted to expanding the range of creative work available for others to build upon and share”
• provides ‘standard’ licences for content
  • attribution
  • noncommercial
  • no derivative works
  • share alike
• mechanisms for indicating licence on Web pages
Works vs. manifestations
• implementers have tended to see ‘eprints’ as single-entity objects
• some evidence that this is too simplistic
• some repositories expose metadata about the ‘work’, others expose metadata about the ‘expressions’
• need more consistency in our use of the OAI-PMH to expose metadata about both ‘works’ and ‘manifestations’
• complex objects encoded using METS or MPEG-21 DIDL (may include ‘objects’ as well as ‘metadata about objects’)
OAI and the SW
• most metadata carried by the protocol currently is not RDF
  • not suitable for processing directly by semantic Web applications
  • need to build ‘knowledge’ about the structure of the metadata formats in use into the harvesting application
• but could use the protocol to carry RDF/XML
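To make the point about built-in ‘knowledge’ concrete, the sketch below shows a harvesting application that knows, by hard-coded rule, that each simple DC element name corresponds to a property URI in the Dublin Core namespace, and uses that rule to emit RDF triples in N-Triples syntax. The subject URI, the sample record and the function name are invented, and the literal escaping is deliberately simplified.

```python
DC = "http://purl.org/dc/elements/1.1/"


def to_ntriples(subject_uri, record):
    """Turn a dict of simple DC values into N-Triples lines.

    The mapping from element name to property URI is the 'knowledge'
    built into this application; it is not carried in the record itself.
    """
    lines = []
    for name, values in sorted(record.items()):
        for value in values:
            # Simplified escaping: backslash, double quote and newline only.
            literal = (value.replace("\\", "\\\\")
                            .replace('"', '\\"')
                            .replace("\n", "\\n"))
            lines.append(f'<{subject_uri}> <{DC}{name}> "{literal}" .')
    return lines


triples = to_ntriples(
    "http://example.org/eprints/1",
    {"title": ["An example eprint"], "creator": ["Author, A."]},
)
print("\n".join(triples))
```

If repositories exposed RDF/XML directly through the protocol, this per-format mapping step would not have to be written into every harvester.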