The Basics of OAI: An Introduction to the Protocol for Metadata Harvesting
Timothy W. Cole and Sarah Shreeves, University of Illinois at Urbana-Champaign
Martin Halbert, Emory University
Pre-Conference Workshop, Web-Wise 2004: Sharing Digital Resources
Chicago, IL - March 3, 2004
Outline
• Introductions and Why We’re Here
• The Open Archives Initiative Protocol for Metadata Harvesting
• OAI-PMH Implementation Guidelines
• Metadata Authoring for OAI & Interoperability – Experiences from OAI Service Providers
Introductions
• Presenters:
  • Tim Cole (t-cole3@uiuc.edu)
  • Sarah Shreeves (sshreeve@uiuc.edu) http://imlsdcc.grainger.uiuc.edu/
  • Martin Halbert (mhalber@emory.edu) http://www.metascholar.org/
Digital Collections vs. Digital Libraries
• Building Good Digital Collections
  • The IMLS / NISO Framework (http://www.imls.gov/pubs/forumframework.htm)
  • Focuses on Process of Creating Digital Content
• Implicit Assumption: Digital Collections are the Raw Materials on which Digital Library Services are Built
• Priority on Reusability, Persistence, Sustainability, Interoperability, Verification, and Documentation
• OAI-PMH Enables Value-Added Digital Library Services which use Harvested Metadata
IMLS DCC Project Foundation
• Implements Recommendations of the IMLS Digital Library Forum & Framework of Guidance for Building Good Digital Collections
  • Recommended Creation of IMLS NLG Collection Registry
  • Recommended Encouraging IMLS Projects to Author Metadata for Interoperability and Implement OAI-PMH
• Increase access to and visibility of IMLS-funded digital collections
• Build infrastructure for a digital library out of many digital collections
IMLS Digital Collections and Content
• Build a registry of all National Leadership Grant collections with digital content.
• Assist and guide NLG projects in making item-level metadata sharable via the OAI Protocol for Metadata Harvesting.
• Build a repository and search and discovery tools for integrated access to the content of NLG collections.
• Research best practices for sharing metadata about diverse digital content and for supporting the interests of diverse user communities.
Motivation to Consider OAI-PMH
• Access to / Sharing of Your Content
• Visibility for Your Content
• Opportunity to Participate in IMLS DCC Project
• Opportunity to Gain Experience / Prepare for Future Projects
Who uses OAI?
• Approximately 400 data providers
• Basic building block of the National Science Digital Library (NSDL)
• Incorporated into D-Space and Eprints.org
• Part of ContentDM, Michigan’s DLXS, and other products
• International use: Open Archives Forum in Europe, will be part of federation activities in the UK and EU
The Open Archives Initiative Protocol for Metadata Harvesting (www.openarchives.org)
OAI-PMH is a tool
• A protocol is a set of rules that defines the communication between systems (FTP and HTTP are other examples)
• All about moving metadata (not data) around
• Assumes widely distributed content, but centralized indexing & services
• Build once, use for many applications – a building block for digital library services
The purpose of OAI is to foster interoperability
OAI is not…
• Metadata
• A search tool
• A database
Brief History of OAI
• Originated in the e-print archive community
• Aim: tools for interoperability between archives of e-prints
• Santa Fe Meetings - 1999 and 2000
• Initiated by Paul Ginsparg, Rick Luce, & Herbert Van de Sompel
• OAI-PMH version history:
  • First Alpha Release, Sept. 2000
  • 1.0 (Beta) Release January 2001
  • 1.1 (Beta 2) Release July 2001
  • 2.0 (Production) Release June 2002
Some Basic OAI-PMH Concepts
• “Federated search” rather than “Broadcast search”
• Data providers – support OAI-PMH as a means to expose metadata
• Service providers – harvest metadata from data providers via OAI-PMH
• OAI-PMH is based upon HTTP and XML
• OAI-PMH requires use of simple Dublin Core
  • BUT supports and encourages use of other metadata schemas
Federated vs. Distributed
• Distributed/Broadcast searching: search and discovery run over remote services and data
• Federated/Harvesting: data/metadata is transferred from the remote source to the destination where the services are located (e.g., union catalogs)
Competing – but not incompatible – approaches to interoperability
As Compared to Z39.50
Why Use OAI?
• Content is widely distributed, in different kinds of non-Z39.50 enabled locations
• An OAI metadata provider is more lightweight than Z39.50 and scales well
• The service provider wishes to augment search services, or metadata normalization is needed
Data Providers can use both Z39.50 & OAI
How OAI Works
• 6 distinct ‘verbs’, or requests
• OAI requests are sent via HTTP
• Responses are sent in valid XML
[Diagram: the Service Provider's harvester sends an HTTP request (OAI verb) to the Data Provider's repository, which returns an HTTP response (valid XML)]
How OAI Works: OAI “VERBS”
• Identify
• ListMetadataFormats
• ListSets
• ListIdentifiers
• ListRecords
• GetRecord
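
Because every verb is just an HTTP query string, a request URL can be assembled with a few lines of code. The following is a minimal sketch, not part of the original slides: the function name `build_request_url` is hypothetical, and the base URL is the one used in this deck's sample URLs.

```python
from urllib.parse import urlencode

# The six OAI-PMH verbs listed on the slide.
OAI_VERBS = {
    "Identify", "ListMetadataFormats", "ListSets",
    "ListIdentifiers", "ListRecords", "GetRecord",
}


def build_request_url(base_url, verb, **params):
    """Return the HTTP GET URL for one OAI-PMH request (hypothetical helper)."""
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    query = {"verb": verb}
    # Skip arguments the caller left unset.
    query.update({k: v for k, v in params.items() if v is not None})
    return f"{base_url}?{urlencode(query)}"


# Base URL taken from the sample URLs in this deck.
BASE = "http://aerialphotos.grainger.uiuc.edu/oai.asp"

if __name__ == "__main__":
    print(build_request_url(BASE, "Identify"))
    print(build_request_url(BASE, "ListRecords", metadataPrefix="oai_dc"))
```

Note that `urlencode` percent-encodes reserved characters, so the colons in an identifier are sent as `%3A`; a repository decodes them back on receipt.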
Identify
• Purpose
  • Return general information about the archive and its policies (e.g., datestamp granularity)
• Parameters
  • None
• Sample URL
  • http://aerialphotos.grainger.uiuc.edu/oai.asp?verb=Identify
ListSets
• Purpose
  • Provide a listing of sets in which records may be organized (may be hierarchical, overlapping, or flat)
• Parameters
  • None
• Sample URL
  • http://aerialphotos.grainger.uiuc.edu/oai.asp?verb=ListSets
ListMetadataFormats
• Purpose
  • List metadata formats supported by the archive as well as their schema locations and namespaces
• Parameters
  • identifier – for a specific record (O)
• Sample URL
  • http://aerialphotos.grainger.uiuc.edu/oai.asp?verb=ListMetadataFormats
ListIdentifiers
• Purpose
  • List headers for all items corresponding to the specified parameters
• Parameters (R = required, O = optional, X = exclusive)
  • from – start date (O)
  • until – end date (O)
  • set – set to harvest from (O)
  • metadataPrefix – metadata format to list identifiers for (R)
  • resumptionToken – flow control mechanism (X)
• Sample URL
  • http://aerialphotos.grainger.uiuc.edu/oai.asp?verb=ListIdentifiers&metadataPrefix=oai_dc
GetRecord
• Purpose
  • Returns the metadata for a single item in the form of an OAI record
• Parameters
  • identifier – unique id for item (R)
  • metadataPrefix – metadata format for the record (R)
• Sample URL
  • http://aerialphotos.grainger.uiuc.edu/oai.asp?verb=GetRecord&identifier=oai:aerialphotos.grainger.uiuc.edu:AP-1A-1-1940&metadataPrefix=oai_dc
ListRecords
• Purpose
  • Retrieves metadata records for multiple items
• Parameters
  • from – start date (O)
  • until – end date (O)
  • set – set to harvest from (O)
  • resumptionToken – flow control mechanism (X)
  • metadataPrefix – metadata format (R)
• Sample URL
  • http://aerialphotos.grainger.uiuc.edu/oai.asp?verb=ListRecords&metadataPrefix=oai_dc
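
The resumptionToken flow control can be sketched as a loop: request a page, read its records, and if the response carries a non-empty resumptionToken, send that token back as the only argument besides the verb. This is an illustrative sketch rather than a full harvester. The `fetch` callable, the canned `PAGES`, and the `oai:example:…` identifiers are assumptions made so the example runs without a network; the XML namespace is the one defined by OAI-PMH 2.0.

```python
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 response elements, in ElementTree's {uri} form.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def harvest_identifiers(fetch, metadata_prefix="oai_dc"):
    """Yield record identifiers, following resumptionTokens until none remain.

    `fetch` is any callable that takes a dict of request arguments and
    returns the response XML as a string (a hypothetical stand-in for HTTP).
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        root = ET.fromstring(fetch(params))
        for header in root.iter(OAI_NS + "header"):
            yield header.findtext(OAI_NS + "identifier")
        token = root.find(f".//{OAI_NS}resumptionToken")
        if token is None or not (token.text or "").strip():
            break  # no token, or an empty one: the list is complete
        # Exclusive argument: send the token with the verb and nothing else.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}


# Two canned response pages standing in for a real repository.
_OPEN = '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListRecords>'
_CLOSE = "</ListRecords></OAI-PMH>"
PAGES = {
    None: _OPEN
    + "<record><header><identifier>oai:example:1</identifier></header></record>"
    + "<resumptionToken>page2</resumptionToken>" + _CLOSE,
    "page2": _OPEN
    + "<record><header><identifier>oai:example:2</identifier></header></record>"
    + "<resumptionToken/>" + _CLOSE,
}


def fake_fetch(params):
    return PAGES[params.get("resumptionToken")]


if __name__ == "__main__":
    print(list(harvest_identifiers(fake_fetch)))
```

Passing `fetch` in as a parameter keeps the paging logic separate from the HTTP layer, so the same loop can be exercised against canned responses or a live repository.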
Unique Identifiers
• Each OAI item must have a unique identifier
• Identifiers must follow rules for valid URIs
• Example:
  • oai:<archiveId>:<recordId>
  • oai:etd.vt.edu:etd-1234567890
• Each identifier must resolve to a single item and always to the same item
• Can’t reuse OAI item identifiers
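
As a small illustration of the `oai:<archiveId>:<recordId>` pattern shown above, the sketch below splits an identifier into its parts. The function name is hypothetical, and the check covers only this three-part pattern, not URI validity in general.

```python
def split_oai_identifier(identifier):
    """Split 'oai:<archiveId>:<recordId>' into (archive_id, record_id).

    Hypothetical helper; it checks only the three-part pattern from the
    slide. The record part may itself contain colons or slashes.
    """
    parts = identifier.split(":", 2)
    if len(parts) != 3 or parts[0] != "oai" or not parts[1] or not parts[2]:
        raise ValueError(f"not in oai:<archiveId>:<recordId> form: {identifier!r}")
    return parts[1], parts[2]


if __name__ == "__main__":
    print(split_oai_identifier("oai:etd.vt.edu:etd-1234567890"))
    print(split_oai_identifier("oai:arXiv:cs/0112017"))
```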
Datestamps
• Needed for every OAI record to support incremental harvesting
• Must be updated whenever a record is added, modified, or deleted, so that changes are correctly propagated to harvesters
• Different from dates within the metadata – OAI datestamp is used only for harvesting
• Can be either YYYY-MM-DD or YYYY-MM-DDThh:mm:ssZ (times must be in GMT/UTC)
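
The two permitted datestamp forms can be checked mechanically. The sketch below is an illustration with a hypothetical function name: it accepts either form and returns a timezone-aware UTC datetime, and it rejects anything else.

```python
from datetime import datetime, timezone

# The two granularities named on the slide: day, and seconds in UTC.
_FORMATS = ("%Y-%m-%dT%H:%M:%SZ", "%Y-%m-%d")


def parse_datestamp(stamp):
    """Parse an OAI-PMH datestamp into a UTC datetime (hypothetical helper)."""
    for fmt in _FORMATS:
        try:
            return datetime.strptime(stamp, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue  # try the next permitted form
    raise ValueError(f"not an OAI-PMH datestamp: {stamp!r}")


if __name__ == "__main__":
    print(parse_datestamp("2002-02-28"))
    print(parse_datestamp("2002-02-28T12:30:00Z"))
```

A harvester doing incremental harvesting would typically remember the time of its last successful harvest and pass it as the `from` argument on the next ListRecords or ListIdentifiers request.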
OAI Items vs. OAI Records
• An OAI ITEM is the complete set of metadata you possess describing an object in your repository
  • Items exist only in the OAI metadata provider's database
• An OAI RECORD is an OAI Item disseminated in a particular metadata format – e.g., DC or MARC
  • Records are what get harvested by OAI Service Providers
• OAI IDENTIFIERS are Item-Level
• OAI DATESTAMPS are Record-Level
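
One way to picture the item/record distinction is as a small data model: one item, one identifier, and one record per metadata format, each with its own datestamp. This is only a sketch; the class names, the `marc21` prefix, and the datestamps and metadata strings below are illustrative assumptions, not values from a real repository.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """One dissemination of an item in a single metadata format."""
    metadata_prefix: str
    datestamp: str      # record-level: each format carries its own datestamp
    metadata_xml: str


@dataclass
class Item:
    """Everything the provider knows about one object; never harvested directly."""
    identifier: str     # item-level: shared by all of the item's records
    records: dict = field(default_factory=dict)

    def add_record(self, record):
        self.records[record.metadata_prefix] = record

    def disseminate(self, metadata_prefix):
        """Return the record a GetRecord request for this format would expose."""
        return self.records[metadata_prefix]


if __name__ == "__main__":
    item = Item("oai:etd.vt.edu:etd-1234567890")
    # Illustrative datestamps and placeholder metadata.
    item.add_record(Record("oai_dc", "2004-01-15", "<oai_dc:dc>...</oai_dc:dc>"))
    item.add_record(Record("marc21", "2004-02-01", "<record>...</record>"))
    print(item.identifier, item.disseminate("oai_dc").datestamp)
```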
An OAI Record
<header>
  <identifier>oai:arXiv:cs/0112017</identifier>
  <datestamp>2002-02-28</datestamp>
  <setSpec>cs</setSpec>
</header>
<metadata>
  <oai_dc:dc xmlns…>
    <dc:title>Using Structural Metadata…</dc:title>
    …
  </oai_dc:dc>
</metadata>
<about>
  <provenance xmlns…>
    ….
  </provenance>
</about>
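
A harvester reads such a record by pulling the header fields and then the metadata elements. The sketch below parses a complete version of the record above. The slide elides the namespace declarations, so the ones in `SAMPLE` are supplied here as an assumption (they are the standard OAI-PMH 2.0, oai_dc, and Dublin Core namespaces); the `<about>` section is omitted, the truncated title is kept as the slide shows it, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Prefix-to-namespace map used in the XPath-style lookups below.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# The slide's record, with the elided xmlns declarations filled in (assumed).
SAMPLE = """<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:arXiv:cs/0112017</identifier>
    <datestamp>2002-02-28</datestamp>
    <setSpec>cs</setSpec>
  </header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>Using Structural Metadata…</dc:title>
    </oai_dc:dc>
  </metadata>
</record>"""


def summarize_record(xml_text):
    """Return the header fields and Dublin Core titles of one OAI record."""
    record = ET.fromstring(xml_text)
    return {
        "identifier": record.findtext("oai:header/oai:identifier", namespaces=NS),
        "datestamp": record.findtext("oai:header/oai:datestamp", namespaces=NS),
        "sets": [s.text for s in record.findall("oai:header/oai:setSpec", NS)],
        "titles": [t.text for t in record.findall(".//dc:title", NS)],
    }


if __name__ == "__main__":
    print(summarize_record(SAMPLE))
```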
Other Pieces of OAI
• Flow Control
• Sets
• Multiple metadata schemas
Break – 15 minutes
Implementing OAI-PMH
• Technical Approaches
• Resources for OAI Metadata Providers
• OAI Implementation Guidelines
Option 1 – Database Based System
• Good option for collections
  • Actively adding metadata to their collection
  • With a large collection of metadata (over 5000 records)
• Requirements:
  • Metadata
  • Database application (e.g. MySQL, Oracle, MS Access, MS SQL)
  • Web server with CGI capability (e.g. Apache/Tomcat, MS IIS)
  • Validating, transforming XML parser (e.g. Xerces, Sun’s JavaXMLPack, MSXML)
Option 2 – File Based System
• Good option for collections
  • Actively adding metadata to their collection
  • With a large collection of metadata (over 5000 records)
• Requirements:
  • Metadata in XML or available for IMLS DCC to put into XML
  • Web server with CGI capability (e.g. Apache/Tomcat, MS IIS)
  • Validating, transforming XML parser (e.g. Xerces, Sun’s JavaXMLPack, MSXML)
Option 3 – Static Repository
• Good option for collections:
  • No longer adding metadata to their collection
  • With small collections (fewer than 5000 records)
• Requirements:
  • Metadata in XML. (IMLS DCC will help with conversions.)
  • Available space on a web server for posting static XML files
Open Source OAI Tools
• Open Archives Initiative Tools
  • http://www.openarchives.org/tools/tools.html
• University of Illinois OAI Tools
  • http://uilib-oai.sourceforge.net/
• OAI tools on Sourceforge
  • http://www.sourceforge.net and search for OAI in the Software/Groups category
Commercial and open source turnkey solutions
• ContentDM
  • http://contentdm.com/
• Univ. of Michigan DLXS XPat
  • http://www.dlxs.org/
• D-Space
  • http://www.dspace.org/
• Endeavor Encompass (forthcoming)
  • http://encompass.endinfosys.com/
Resources for data providers
• OAI for beginners tutorial
  • http://www.oaforum.org/tutorial/
• Repository Explorer
  • http://purl.org/net/oai_explorer
• XML Schema Validator
  • http://www.w3.org/2001/03/webdata/xsv
• XML Tools at W3C
  • http://www.w3.org/XML/#software
Registering Your OAI Provider
• Register with the Official OAI Registry
  • http://www.openarchives.org/data/registerasprovider.html
• The UIUC Experimental OAI Registry
  • http://gita.grainger.uiuc.edu/registry/
• Test Before You Register
  • Repository Explorer @ Virginia Tech
  • Email us (sshreeve@uiuc.edu) for a Test Harvest
OAI Implementation Guidelines
http://www.openarchives.org/OAI/2.0/guidelines.htm
• Includes:
  • Guidelines for Repository Implementers
  • Guidelines for Harvester Implementers
  • Guidelines for Aggregators, Caches and Proxies
  • Specification for an OAI Static Repository…
  • Community-Specific Guidelines (OLAC, EPrints)
Metadata Authoring for OAI
• Lessons learned from Metascholar projects at Emory
• Lessons learned from UIUC’s initial OAI harvesting project
UIUC – Lessons Learned
Metadata aggregation challenges
• Heterogeneous resources from multiple communities
• Element usage practices
• Granularity of description
• Diverse vocabularies
UIUC – Lessons Learned
Challenge: Heterogeneity of content & providers
• Metadata describing digital and analog items – including images, texts, web pages, physical objects, finding aids, etc.
• Knowledge structures and ontologies differ
• Perspectives on use and presentation of digital resources differ
UIUC – Lessons Learned
Challenge: Variations in use of Dublin Core
UIUC – Lessons Learned
Excerpt of Metadata Record Describing "Cotton coverlet with embroidered butterfly design"
Description: Digital image of a single-sized cotton coverlet for a bed with embroidered butterfly design. Handmade by Anna F. Ginsberg Hayutin.
Source: Materials: cotton and embroidery floss. Dimensions: 71 in. x 86 in. Markings: top right hand corner has 1 1/2 in. x 1/2 in. label cut outs at upper left and right hand side for head board; fabric is woven in a variation of a rib weave; color each of yellow and gray; hand-embroidered cotton butterflies and flowers from two shades of each color of embroidery floss - blue, pink, green and purple and single top 20 in. bordered with blue and black cotton embroidery thread; stitches used for embroidery: running stitch, chain stitch, French knot and back stitches; selvage edges left unfinished; lower edges turned under and finished with large gray running stitches made with embroidery floss.
Format: Epson Expression 836 XL Scanner with Adobe Photoshop version 5.5; 300 dpi; 21-53K bytes. Available via the World Wide Web.
Coverage: —
Date: Created: 2001-09-19 09:45:18; Updated: 20011107162451; Created: 2001-04-05; Created: 1912-1920?
Type: Image
UIUC – Lessons Learned
Excerpt of Metadata Record Describing “American Woven Coverlet”
Description: Materials: Textile--Multi, Pigment—Dye; Manufacturing Process: Weaving--Hand, Spinning, Dyeing, Hand-loomed blue wool and white linen coverlet, worked in overshot weave in plain geometric variant of a checkerboard pattern. Coverlet is constructed from finely spun, indigo-dyed wool and undyed linen, woven with considerable skill. Although the pattern is simpler, the overall craftsmanship is higher than 1934.01.0094A. - D. Schrishuhn, 11/19/99 This coverlet is an example of early "overshot" weaving construction, probably dating to the 1820's and is not attributable to any particular weaver. -- Georgette Meredith, 10/9/1973
Source: —
Format: 228 x 169 x 1.2 cm (1,629 g)
Coverage: Euro-American; America, North; United States; Indiana? Illinois?
Date: Early 19th c. CE
Type: cultural; physical object; original
UIUC – Lessons Learned
Challenge: Range of vocabularies in use
Controlled Vocabularies in use for IMLS NLG projects (results from survey of 65 NLG projects with digital content)
Meeting the challenge – Data Providers
Data providers can:
• Create metadata for interoperability
  • Reusable metadata - Think beyond your local users and environment
• Use well structured and defined schemas
• Use and identify controlled vocabularies
• Use Sets
Meeting the challenge – Service Providers
Service providers can:
• Analyze metadata, then cluster and normalize some aspects of it
• Build indexes based on type of resource (image, text, physical object) rather than collection
• Build custom interfaces and selective views for target audiences / domains
Recap
• OAI is a tool to facilitate interoperability
• OAI is easy - metadata is hard
• Better metadata = better interoperability
Contact Information
• Tim Cole, PI, IMLS Digital Collections and Content, University of Illinois Library at Urbana-Champaign. Email: t-cole3@uiuc.edu
• Sarah Shreeves, Project Coordinator, IMLS Digital Collections and Content, University of Illinois Library at Urbana-Champaign. Email: sshreeve@uiuc.edu
• Martin Halbert, Director for Library Systems, Emory University. Email: mhalber@emory.edu