This project aims to make OAI-enabled metadata for digital objects accessible to the public. It offers one-stop shopping for freely accessible digital objects of any subject matter and format, with no dead ends. It uses the Java edition of the University of Illinois Urbana-Champaign open-source OAI protocol harvester, run in a Unix environment.
OAIster: Metadata Pointing to Digital Objects Kat Hagedorn Metadata Harvesting/DLXS Librarian University of Michigan Libraries February 18, 2004
background • One-year Mellon grant project to test the feasibility of making OAI-enabled metadata for digital objects accessible to the public • Digital Library Production Service at University of Michigan Libraries began work in December 2001 • Launched in June 2002
highlights • Any audience • Any subject matter • Any format • Freely accessible • No dead ends • One-stop shopping …retrieving the “hidden web”
tool we borrowed • University of Illinois Urbana-Champaign open-source OAI protocol harvester • Java edition for our Unix environment • Worked collaboratively to iron out kinks: • resumptionToken / retryAfter • inexplicable kill • bogus records in MySQL table
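The resumptionToken / retryAfter kinks concern flow control in OAI-PMH harvesting: a repository returns a long list in chunks, each ending in a resumptionToken, and may answer HTTP 503 with a Retry-After header when it wants the harvester to wait. The sketch below is not the UIUC harvester's code; the function names and the 60-second default are assumptions made for illustration, and it handles only the seconds form of Retry-After.

```python
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 responses.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def next_resumption_token(response_xml):
    """Return the resumptionToken in a response, or None if the list is complete.

    An absent or empty resumptionToken element means there are no more chunks.
    """
    root = ET.fromstring(response_xml)
    token = root.find(f".//{OAI_NS}resumptionToken")
    if token is None or not (token.text or "").strip():
        return None
    return token.text.strip()


def retry_delay_seconds(status, headers, default=60):
    """Return how long to wait before retrying, or None if no retry is needed.

    Only HTTP 503 triggers a retry here. The default delay is a hypothetical
    fallback for a missing or non-numeric Retry-After value.
    """
    if status != 503:
        return None
    value = headers.get("Retry-After", "")
    return int(value) if value.isdigit() else default
```

A harvester loop would call next_resumption_token after each chunk and reissue the request with that token until it returns None, sleeping for retry_delay_seconds whenever the repository answers 503.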
development environment • Digital Library Extension Service (DLXS) • Develops open-source middleware and licenses the XPAT search engine for building and mounting digital libraries • Middleware consists of document classes, i.e., Text, Image, Bib, FindAid • Originally designed to make SGML-encoded texts available online
tool we developed • Runs in DLXS environment using BibClass • Current BibClass web templates modified • Additional Java-based transformation tool: • Concatenates DC metadata records • Filters out records with no digital object • Counts records • Converts UTF-8 to ISO-8859-1 • XSLT used to transform DC records into BibClass records
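The filter, count, concatenate, and convert steps can be sketched as follows. This is an illustration in Python, not the Java tool described on the slide. The rule that a record has a digital object when a dc:identifier looks like a URL is an assumption, as are the function names and the choice to write characters outside ISO-8859-1 as numeric character references.

```python
import xml.etree.ElementTree as ET

# Dublin Core element namespace.
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def has_digital_object(record_xml):
    """Assumed heuristic: a record points to an object if an identifier is a URL."""
    root = ET.fromstring(record_xml)
    for ident in root.iter(f"{DC_NS}identifier"):
        text = (ident.text or "").strip().lower()
        if text.startswith(("http://", "https://", "ftp://")):
            return True
    return False


def filter_and_concatenate(records):
    """Drop records without a digital object, then join and re-encode the rest.

    Returns (ISO-8859-1 bytes, number of records kept). Characters that
    ISO-8859-1 cannot represent become numeric character references.
    """
    kept = [r for r in records if has_digital_object(r)]
    combined = "\n".join(kept)
    latin1 = combined.encode("iso-8859-1", errors="xmlcharrefreplace")
    return latin1, len(kept)
```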
system design • [Diagram; labeled components:] OAI-enabled DC records • Non-OAI-enabled DC records • UIUC harvester • Record storage • XSLT transformation tool • XSL stylesheets (per source type) • BibClass indexes • Search interface (XPAT)
result • One place to look for digital objects • Big • 3,016,251 metadata records • 267 institutions (as of last week…) • Popular • Averages 3300 search sessions / month • Picked up in March ‘03: average 3500 now • 43,894 searches in one year (June 2002 – July 2003)
repositories: e.g., • arXiv Eprint Archive: math and physics pre- and post-prints • Online Archive of California: manuscripts, photographs, and works of art held in institutions across California • Sammelpunkt, Elektronisch Archivierte Theorie: archive of philosophical publications • British Women Romantic Poets Project: collection of poems written by British women between 1789 and 1832
repositories: stats • As of February ‘04, out of 267 repositories… • International and U.S. • U.S.: 50.5% (135) • Intl: 49.5% (132) • By subject • Humanities: 24% (65) • Science: 30% (81) • Mixed: 46% (121) • E-prints and pre-prints • Using eprints.org software: 39% (104) • Not using eprints.org software: 61% (163)
major issues encountered • Metadata variation • Records not leading to digital objects • Access restrictions on digital objects described in records • Duplicate records for a single digital object
issue: metadata variation • With more records, users need more restrictions • Consistent metadata needed to facilitate these restrictions • One option: normalization of data
issue: metadata variation • Type: the obvious quick win • 240 metadata values mapped to four generic values (text, image, audio, video) • e.g.: • audio, sound = audio • motion, animation, newsreels, etc. = video • watercolour, watercolor, slides, etc. = image • article, articles, booklet, diss, story, etc. = text
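The type mapping amounts to a lookup table. A minimal sketch, containing only the example values listed on the slide (the full set of 240 values is not given here); the names TYPE_MAP and normalize_type are hypothetical.

```python
# Example values from the slide only; the real table covers 240 values.
TYPE_MAP = {
    "audio": "audio", "sound": "audio",
    "motion": "video", "animation": "video", "newsreels": "video",
    "watercolour": "image", "watercolor": "image", "slides": "image",
    "article": "text", "articles": "text", "booklet": "text",
    "diss": "text", "story": "text",
}


def normalize_type(raw):
    """Map a raw dc:type value to a generic type, or None if it is unmapped."""
    return TYPE_MAP.get(raw.strip().lower())
```

Returning None for unmapped values leaves the decision about a fallback category to the caller.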
issue: metadata variation • Date: where to begin? • Most records with at least one date • Some records include up to seven dates • No consistent style of date • Subject: out of context, what meaning? • Many records with at least one subject element • But over 100 records with more than 50 subjects • And one record with 1000!
issue: metadata variation • Sample date values <date>2-12-01</date> <date>2002-01-01</date> <date>0000-00-00</date> <date>1822</date> <date>between 1827 and 1833</date> <date>18--?</date> <date>November 13, 1947</date> <date>SEP 1958</date> <date>235 bce</date> <date>Summer, 1948</date>
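One conservative way to approach values like these is to pull out only unambiguous four-digit years and give up on everything else. The sketch below is an assumption about how date normalization might start, not something OAIster is described as doing; the accepted range (1000 to 2099) and the decision to skip BC/BCE dates are arbitrary choices for illustration.

```python
import re

# A standalone four-digit year between 1000 and 2099 (assumed range).
YEAR_RE = re.compile(r"(?<!\d)(1[0-9]{3}|20[0-9]{2})(?!\d)")
# Dates marked BC or BCE are skipped rather than guessed at.
BCE_RE = re.compile(r"\bbce?\b", re.IGNORECASE)


def extract_years(value):
    """Return every unambiguous four-digit year found in a date string."""
    if BCE_RE.search(value):
        return []
    return [int(y) for y in YEAR_RE.findall(value)]
```

Applied to the samples above, this recovers a year from values such as 2002-01-01, November 13, 1947, and SEP 1958, returns two years for between 1827 and 1833, and returns nothing for 2-12-01, 0000-00-00, 18--?, and 235 bce, which would still need manual rules.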
issue: metadata variation • Sample subject values <subject>30,51,52</subject> <subject>1852, Apr. 22. E[veritt] Judson, letter to Philuta [Judson].</subject> <subject>Slavery--United States--Controversial literature</subject> <subject>view of interior with John Henry sculpture</subject> <subject>Particles (Nuclear physics) -- Research.</subject>
issue: no digital objects • Some records contain links to further description of digital object • But not the digital object itself • Culling difficult • One option: add explanatory text to site • Or, unfortunately, spot-check and remove repositories with this issue
issue: access restrictions • No records where metadata itself is restricted in use (as far as we know!) • Definitely some records where objects are restricted to licensed users • One option: add explanatory text to site • Or sub-set OAIster into free and “partially” free repositories
issue: duplicate records • Two records harvested, different identifiers, same object described and pointed to • Two records harvested inadvertently through aggregators and original repositories
issue: duplicate records • Need algorithm to automate de-duplication • Were duplicates to be identified, how to deal with the issue? • Suppress? • Group? • Flag? • So far, not addressed in OAIster
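Since de-duplication is not yet addressed in OAIster, the following is only a hypothetical starting point: group records whose object URLs match after light normalization, which would catch the case of the same object harvested from both an aggregator and the original repository. The normalization rules (ignore scheme, lowercase host, drop trailing slash) and all names are assumptions; duplicates that point to different URLs for the same object would not be found this way.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def normalize_url(url):
    """Reduce a URL to a comparison key: lowercase host, path, and query."""
    parts = urlsplit(url.strip())
    key = parts.netloc.lower() + parts.path.rstrip("/")
    if parts.query:
        key += "?" + parts.query
    return key


def group_duplicates(records):
    """Group OAI identifiers that point to the same normalized URL.

    records is an iterable of (oai_identifier, object_url) pairs. Only groups
    with more than one identifier are returned.
    """
    groups = defaultdict(list)
    for identifier, url in records:
        groups[normalize_url(url)].append(identifier)
    return [ids for ids in groups.values() if len(ids) > 1]
```

Whether each group is then suppressed, grouped in the display, or flagged is the open question the slide raises; the sketch only identifies candidates.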
future of OAIster • Advanced searching • Grouping to aid browsing • Further normalization of data • Handling duplicate records • Saving/emailing/downloading records • Collaboration with other services: search, instructional… • More user testing…
current state of protocol • Popular • As Peter Suber says: • “…no other single idea or technology in the [open-source movement] has enjoyed this density of endorsement and adoption in a six month period.” • Data providers over one year: • June ‘02: 56 repositories / 274,062 records • June ‘03: 187 repositories / 1,246,953 records • Over three-fold increase for repositories • Over four-fold increase for records
future of protocol • Branching out • DC required vs. highly recommended • Use of OAI in closed environments • Static repository protocol • OAI-rights committee • OAI evangelism
contact info • Kat Hagedorn • University of Michigan Libraries, Digital Library Production Service • khage@umich.edu • http://www.oaister.org/