Institutional Archives Technology Overview Michael L. Nelson Old Dominion University mln@cs.odu.edu http://www.cs.odu.edu/~mln/ Institutional Archives & Repositories: What this digital movement means for Federal Libraries Library of Congress Workshop September 12, 2003
Acknowledgements • ODU: K. Maly, M. Zubair, J. Bollen • LANL: R. Luce, X. Liu • NASA: G. Roncaglia, J. Rocker • Cornell: C. Lagoze, S. Warner • MAGiC (UK): Paul Needham • and, of course, Herbert Van de Sompel (LANL) • the OpenURL slides are nicked from his presentations
Outline • A bit of history • Core technologies • OAI-PMH • OpenURL • Example implementations • Download and go…
Background • I met Herbert Van de Sompel in April 1999... • we spoke of a demonstration project he had in mind, for which he had received sponsorship from Paul Ginsparg and Rick Luce • We wanted to demonstrate a multi-disciplinary DL that leveraged the large number of high quality, yet often isolated, tech report servers, e-print servers, etc. • most digital libraries (DLs) had grown up along single disciplines or institutions • little to no interoperability; isolated DL “gardens”
Universal Preprint Service • A cross-archive DL that provides services on a collection of metadata harvested from multiple archives • Nelson: NCSTRL+; a modified version of Dienst • support for “clustering” • support for “buckets” • Krichel: ReDIF metadata format • Van de Sompel: SFX Linking • Demonstrated at Santa Fe NM, October 21-22, 1999 • http://web.archive.org/web/*/http://ups.cs.odu.edu/ • D-Lib Magazine, 6(2) 2000 (2 articles) • http://www.dlib.org/dlib/february00/02contents.html • UPS was soon renamed the Open Archives Initiative (OAI) http://www.openarchives.org/
Data and Service Providers • Self-describing archives • Much of the learning about the constituent UPS archives occurred out of band… • Data Providers • publishing into an archive • providing methods for metadata “harvesting” • also provide non-technical context for sharing information • Service Providers • harvest metadata from providers • implement user interface to data • Even if these are done by the same DL, these are distinct roles
Metadata Harvesting • Move away from distributed searching • Extract metadata from various sources • Build services on local copies of metadata • data remains at remote repositories • [Figure: a user searches for “cfd applications” against a local copy of metadata, where all searching, browsing, etc. is performed; the metadata is harvested offline from each node; each node is independently maintained and can still support direct user interaction]
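As a minimal sketch of the idea on this slide, the search below runs entirely against a local copy of harvested metadata, with no remote repository contacted at query time. The records, repository names, and the `search` helper are hypothetical and only stand in for whatever a real service provider would store.

```python
# Hypothetical harvested records: (source repository, identifier, title).
# In a real service provider these would have been harvested offline.
LOCAL_METADATA = [
    ("repo-a", "oai:repo-a:1", "CFD applications in rotorcraft design"),
    ("repo-b", "oai:repo-b:7", "Wind tunnel calibration methods"),
    ("repo-c", "oai:repo-c:3", "Applications of CFD to inlet flows"),
]


def search(query, records=LOCAL_METADATA):
    """Return records whose title contains every query term (case-insensitive)."""
    terms = query.lower().split()
    return [r for r in records if all(t in r[2].lower() for t in terms)]


hits = search("cfd applications")
```

The full text stays at the remote repositories; only the metadata is copied, so the identifiers in `hits` are what a user would follow back to the source.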
Result… OAI • The OAI was the result of the demonstration and discussion during the Santa Fe meeting • OAI = a bunch of people, a religion, a cult, etc. • OAI Protocol For Metadata Harvesting (OAI-PMH) = the protocol created and maintained by the OAI • Initial focus was on federating collections of scholarly e-print materials… • …however, interest grew and the scope and application of OAI-PMH expanded to become a generic bulk metadata transport protocol • Note: • OAI-PMH is only about metadata -- not full text! • but what is metadata vs. full-text? • OAI is neutral with respect to the nature of the metadata or the resources the metadata describes • read: commercial publishers have an interest in OAI-PMH too...
Open Archives Initiative • Open: the protocol is openly documented, and metadata is “exposed” to at least some peer group (note: rights management still applies!) • Archives: archive defined as a “collection of stuff” -- not the archivist’s definition of “archive”. “Repository” used in most OAI documents. • Initiative: TLA; needed another vowel...
OAI-PMH Mechanics • Request is encoded in HTTP • Response is encoded in XML • XML Schemas for the responses are defined in the OAI-PMH document
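A minimal sketch of these mechanics, assuming Python's standard library: the request is an ordinary HTTP GET with a `verb` argument, and the response is XML in the OAI-PMH 2.0 namespace. The sample response is hand-written and abbreviated (the repository name is made up), and `build_request` and `parse_identify` are illustrative helpers, not part of any OAI toolkit.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def build_request(base_url, verb, **args):
    """Encode an OAI-PMH request as an HTTP GET URL."""
    return f"{base_url}?{urlencode({'verb': verb, **args})}"


# Hand-written, abbreviated sample in the shape of an Identify response.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2003-09-12T12:00:00Z</responseDate>
  <request verb="Identify">http://naca.larc.nasa.gov/oai2.0/</request>
  <Identify>
    <repositoryName>Example Repository</repositoryName>
    <baseURL>http://naca.larc.nasa.gov/oai2.0/</baseURL>
    <protocolVersion>2.0</protocolVersion>
  </Identify>
</OAI-PMH>"""


def parse_identify(xml_text):
    """Pull a couple of fields out of an Identify response."""
    ident = ET.fromstring(xml_text).find("oai:Identify", OAI_NS)
    return {
        "name": ident.findtext("oai:repositoryName", namespaces=OAI_NS),
        "version": ident.findtext("oai:protocolVersion", namespaces=OAI_NS),
    }


url = build_request("http://naca.larc.nasa.gov/oai2.0/", "Identify")
info = parse_identify(SAMPLE_RESPONSE)
```

No network call is made here; a real harvester would fetch `url` and pass the body to the parser.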
Overview of OAI-PMH Verbs • [table of verbs not preserved; row groups: archival metadata verbs, harvesting verbs] • most verbs take arguments: dates, sets, ids, metadata formats and resumption token (for flow control)
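To make the flow-control role of the resumption token concrete, here is a hedged sketch of a harvesting loop. The six verb names are those of OAI-PMH 2.0; everything else (the `fetch` callable, the canned pages, the `oai:example:*` identifiers) is hypothetical so the example runs without a network.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"

# The six OAI-PMH 2.0 verbs.
VERBS = ("Identify", "ListMetadataFormats", "ListSets",
         "ListIdentifiers", "ListRecords", "GetRecord")


def harvest_identifiers(fetch, metadata_prefix="oai_dc"):
    """Yield identifiers from ListIdentifiers, following resumption tokens.

    `fetch` is a hypothetical callable: dict of request arguments -> XML text.
    """
    args = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    while True:
        root = ET.fromstring(fetch(args))
        for header in root.iter(OAI + "header"):
            yield header.findtext(OAI + "identifier")
        token = root.findtext(f".//{OAI}resumptionToken")
        if not token:  # missing or empty token: the list is complete
            break
        # Follow-up requests carry only the verb and the token.
        args = {"verb": "ListIdentifiers", "resumptionToken": token}


def _page(ids, token):
    """Build a canned, abbreviated ListIdentifiers response."""
    headers = "".join(
        f"<header><identifier>{i}</identifier>"
        "<datestamp>2003-09-12</datestamp></header>" for i in ids)
    return ('<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
            f"<ListIdentifiers>{headers}"
            f"<resumptionToken>{token}</resumptionToken>"
            "</ListIdentifiers></OAI-PMH>")


PAGES = {None: _page(["oai:example:1", "oai:example:2"], "t1"),
         "t1": _page(["oai:example:3"], "")}


def fake_fetch(args):
    """Stand-in for an HTTP GET against a repository."""
    return PAGES[args.get("resumptionToken")]


identifiers = list(harvest_identifiers(fake_fetch))
```

Swapping `fake_fetch` for a function that performs the HTTP request would turn this into a very small harvester; error handling and selective harvesting by date or set are left out.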
OAI-PMH Data Model • [Figure: a resource (David); the item is all available metadata about David; the records are Dublin Core metadata, MARC metadata, SPECTRUM metadata] • item = identifier • record = identifier + metadata format + datestamp • set-membership is an item-level property
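The data model can be sketched with two small classes, assuming nothing beyond what the slide states: an item is addressed by an identifier and owns set membership, and a record is one metadata format of that item with its own datestamp. Class names, field names, the `marc` prefix, and the sample values are illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Record:
    """record = identifier + metadata format + datestamp (plus the payload)."""
    identifier: str
    metadata_prefix: str
    datestamp: str
    metadata: str


@dataclass
class Item:
    """item = identifier; set membership is a property of the item."""
    identifier: str
    set_specs: set = field(default_factory=set)
    _formats: dict = field(default_factory=dict)

    def add_metadata(self, prefix, datestamp, metadata):
        """Attach metadata in one format to this item."""
        self._formats[prefix] = (datestamp, metadata)

    def record(self, prefix):
        """Disseminate the item in one metadata format as a Record."""
        datestamp, metadata = self._formats[prefix]
        return Record(self.identifier, prefix, datestamp, metadata)


# Hypothetical item describing the slide's example resource.
david = Item("oai:example:david", {"sculpture"})
david.add_metadata("oai_dc", "2003-09-01", "<dc>...</dc>")
david.add_metadata("marc", "2003-09-10", "<marc>...</marc>")
dc_record = david.record("oai_dc")
marc_record = david.record("marc")
```

Both records share the item's identifier but differ in format and datestamp, which is the distinction the slide draws.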
Data Providers / Service Providers • [Figure: service providers (harvesters) harvest from data providers (repositories)]
Aggregators • aggregators allow for: • scalability for OAI-PMH • load balancing • community building • discovery • [Figure: an aggregator sits between service providers (harvesters) and data providers (repositories)]
Aggregators • Frequently interchangeable terms: • aggregators: likely to be community / institutionally focused • caches: stores a copy, less likely to be community-oriented • proxies: less likely to store a copy, may gateway between OAI-PMH and other protocols • Dienst / OAI Gateway; Harrison, Nelson, Zubair, JCDL 03 • To learn more about aggregators, caches & proxies: • http://www.openarchives.org/OAI/2.0/guidelines-aggregator.htm • http://www.cs.odu.edu/~mln/jcdl03/
Example Aggregators • Arc - http://arc.cs.odu.edu/ • first described “hierarchical harvesting” in D-Lib Magazine, 7(4) 2001 • http://www.dlib.org/dlib/april01/liu/04liu.html • Celestial - http://celestial.eprints.org/ • among other services, it provides a history of harvests (successful vs. errors) • http://celestial.eprints.org/cgi-bin/status
OAI-PMH 2.0 Registration • 75 repositories registered • ??? unregistered repositories • unregistered because: • testing / development • not for public harvesting • public, but “low-profile” • never got around to it… • ??? • DP:SP ~= 5:1 • Data Providers: http://www.openarchives.org/Register/BrowseSites.pl • Service Providers: http://www.openarchives.org/service/listproviders.html
Registration is Nice……But Not Required • OAI-PMH is (becoming) the “http” for digital libraries • there is no central registry of http servers • remember the NCSA “What’s New” page? (ca. 1994) • There will never be “registration support” in OAI-PMH • registries are a type of service provider, built on top of OAI-PMH • registration will be an integral part of community building • friends…
NASA <friends> example • a harvester issues Identify; the response carries <friends>…</friends> pointing to: • http://techreports.larc.nasa.gov/ltrs/oai2.0/ • http://naca.larc.nasa.gov/oai2.0/ • http://ston.jsc.nasa.gov/collections/TRS/oai/ • http://ntrs.nasa.gov/oai2.0/ • http://horus.riacs.edu/perl/oai/
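A hedged sketch of how a harvester might read the friends list: in OAI-PMH 2.0 the optional `friends` container sits inside a `description` element of the Identify response and lists other repositories' base URLs. The sample response below is hand-written and abbreviated (the repository name is made up, and only two of the slide's URLs are included); `friend_base_urls` is an illustrative helper.

```python
import xml.etree.ElementTree as ET

FRIENDS = "{http://www.openarchives.org/OAI/2.0/friends/}"

# Hand-written, abbreviated Identify response carrying a friends container.
SAMPLE_IDENTIFY = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <Identify>
    <repositoryName>Example NASA repository</repositoryName>
    <baseURL>http://ntrs.nasa.gov/oai2.0/</baseURL>
    <description>
      <friends xmlns="http://www.openarchives.org/OAI/2.0/friends/">
        <baseURL>http://techreports.larc.nasa.gov/ltrs/oai2.0/</baseURL>
        <baseURL>http://naca.larc.nasa.gov/oai2.0/</baseURL>
      </friends>
    </description>
  </Identify>
</OAI-PMH>"""


def friend_base_urls(identify_xml):
    """Return the base URLs listed in the friends container, if any."""
    root = ET.fromstring(identify_xml)
    # The repository's own <baseURL> is in the OAI-PMH namespace,
    # so matching on the friends namespace skips it.
    return [el.text.strip() for el in root.iter(FRIENDS + "baseURL")]


friends = friend_base_urls(SAMPLE_IDENTIFY)
```

A harvester could issue Identify against each returned URL in turn, discovering repositories without any central registry.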
Field of Dreams • It should be easy to be a data provider, even if it makes more work for the service provider. • if enough data providers exist, the service providers will come (DPs >> SPs) • Open-source / freely available tools • “drop-in” data providers • at the end of this presentation • tools to make your existing DL a data provider: • http://www.openarchives.org/tools/tools.htm • also: OAI-implementers mailing list / mail archive! • service providers: • http://oaiarc.sourceforge.net/
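OAI-PMH Meeting History • meetings compared: OAI Open Day, Washington DC (1/2001); 2nd OAI Workshop, CERN (10/2002) • topic categories: protocol definition, development tools; DPs, retrofitting existing DLs; SPs, new services; socio-economic-political issues • [chart of presentation counts per topic not preserved; extracted values: 4 5 0 1 11 4 6 1]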
Shift of Topics • From the protocol itself, supporting & debugging tools and how to retrofit (existing) DLs… • …to building (new) services that use the OAI-PMH as a core technology and reporting on their impact to the institution/community
Arc • http://arc.cs.odu.edu/ • harvests all known archives • first end-user service provider • source available through SourceForge • hierarchical harvesting
NCSTRL • http://www.ncstrl.org/ • metadata harvesting replacement for Dienst-based NCSTRL • based on Arc • computer science metadata
Archon • http://archon.cs.odu.edu/ • physics metadata • based on Arc • features: • citation indexing • equation-based searching
Torii • http://torii.sissa.it/ • physics metadata • features • personalization • recommendations • WAP access
iCite • http://icite.sissa.it/ • physics metadata • features • citation based access to arXiv metadata
my.OAI • http://www.myoai.com/ • covers all registered metadata • features • result sets • personalization • many other advanced features
Cyclades • http://www.ercim.org/cyclades • scientific metadata • features • personalization • recommendations • collaboration • status?
citebase • http://citebase.eprints.org/ • arXiv metadata • citation based indexing, reporting
OAIster • http://oaister.umdl.umich.edu/ • harvests all known archives
Others… • Commercial publishers • American Physical Society (APS) • Institute of Physics • Elsevier / Scirus (www.scirus.com) • Department of Energy • OSTI • LANL • Institutional servers • DSpace (MIT; www.dspace.org) • Eprints (www.eprints.org) • DARE (All Dutch universities)
NACA Technical Report Server • http://naca.larc.nasa.gov/ • OAI-PMH baseURL: http://naca.larc.nasa.gov/oai2.0/ • publicly available • began in 1996 • details in NASA TM-1999-209127 • scanned reports from 1917-1958 • NACA = predecessor to NASA • contents mirrored with the MAGiC project • a UK-based grey-literature preservation project • OAI-PMH used to mirror contents
NACA Report 1345 as seen through its native DL http://naca.larc.nasa.gov/
NACA Report 1345 as seen through MAGiC http://www.magic.ac.uk/
NACA Report 1345 as seen through Scirus (Elsevier) http://www.scirus.com/
NACA Report 1345 as seen through my.OAI (FS Consulting) http://www.myoai.com/
NASA Technical Report Server • replacement for the previous distributed searching version of NTRS • MySQL • Va Tech harvester • modified “bucket” • details in Nelson, Rocker, Harrison, Library Hi-Tech, 21(2) (March 2003) • a service provider & aggregator • same OAI baseURL as used for interactive searching http://ntrs.nasa.gov/
NASA Technical Report Server • advanced, fielded search • explicit query routing • 10 NASA repositories • 4 non-NASA repositories • turned “off” by default
non-NASA repositories > 0.5M records
NASA DLs in the Larger STI Realm • [Figure: NTRS sits above the NASA DLs (CASITRS, LTRS, ATRS, …) and alongside DOE, Publishers, Universities, DOD, International, …; this could be a fully connected graph] • NTRS could also be a data provider from the point of view of other DLs, allowing the harvesting of NASA report metadata. • NTRS could also harvest metadata from other DLs, and provide access to non-NASA content. • We hope to influence the direction of the science.gov effort to use OAI-PMH
Service Providers • It is clear that SPs are proliferating, despite (because of?) the inherent bias toward DPs in the protocol • easy to be a DP -> many DPs -> SPs eventually emerge • hard to be a DP -> SPs starve • currently about 5x more DPs than SPs • SPs are beginning to offer increasingly sophisticated services • competitive market originally envisioned for SPs is emerging
Origins & Motivation • The Context: Library Automation Environment anno 1998 • distributed information environment • local & remote A&I databases • rapidly growing e-journal collection • need to interlink the available information • The Problem: • links are delivered by info providers • links are not sensitive to user’s context • appropriate copy problem • links dependent on business agreements between information vendors • links don’t cover the complete collection
Origins & Motivation • The Context: Library Automation Environment anno 1998 • distributed information environment • local & remote A&I databases • rapidly growing e-journal collection • need to interlink the available information • The REAL Problem: • libraries have no say in linking • libraries are losing core part of the “organizing information” task • expensive collection is not used optimally • users are not well served
Origins & Motivation • The Solution: • In information services: • DO NOT provide a link which is an actual service related to a referenced item (e.g. a link from a record in an A&I database to the corresponding full-text) • BUT rather provide • a link that transports metadata about the referenced item (an OpenURL) • to • others that are better placed to provide service links (a linking server operated by the library)
Non-OpenURL linking • [Figure: link source (resource) -> link -> link destination (resource)] • reference -> resolution of metadata into link -> link to referenced work
OpenURL linking • [Figure: link source -> OpenURL -> linking server -> links -> multiple link destinations] • link source: provision of an OpenURL for a reference • OpenURL: transportation of metadata & identifiers • linking server: user-specific, context-sensitive resolution of metadata & identifiers into services
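The two halves of OpenURL linking can be sketched as follows: the link source only packages metadata about the reference into a URL aimed at the user's linking server, and the linking server decides which services to offer. The resolver address, the ISSN, the holdings table, and the key names (`genre`, `issn`, `volume`, `spage`, in the style of early OpenURL drafts) are assumptions for illustration, and `resolve` is a toy stand-in for a real linking server.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def make_openurl(resolver_base, **metadata):
    """Link source side: transport metadata, not a hard-wired target link."""
    return f"{resolver_base}?{urlencode(metadata)}"


def resolve(openurl, holdings):
    """Linking server side: turn metadata into context-sensitive services.

    `holdings` is a hypothetical map of ISSN -> full-text URL for the
    library running this resolver.
    """
    query = {k: v[0] for k, v in parse_qs(urlsplit(openurl).query).items()}
    issn = query.get("issn", "")
    services = []
    if issn in holdings:
        services.append(("full text", holdings[issn]))
    # Always offer a fallback service built from the same metadata.
    services.append(("catalog lookup", "issn=" + issn))
    return services


# Hypothetical resolver address and citation metadata.
url = make_openurl("http://resolver.example.edu/menu",
                   genre="article", issn="1234-5678", volume="12", spage="34")

subscribed = resolve(url, {"1234-5678": "http://fulltext.example.org/"})
unsubscribed = resolve(url, {})
```

The same OpenURL yields different service menus for the two libraries, which is the context sensitivity that links hard-wired by an information provider cannot offer.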
Evolution ~ 1998 • Nature of solution determined • Experiment with local databases at Ghent University • Demonstrated October 1998 at Belgian Library meeting • Problem statement & Experiment described in 2 D-Lib Magazine papers, April 1999