OAIster: What’s with the Weird Name? Kat Hagedorn UM Library Information Technology November 28, 2005
What is OAIster? • Is/was a means for UM to test the OAI protocol… (hence the name) • A method for sharing metadata among institutions and groups of people • A means of developing a search service for end-users worldwide
What does OAIster collect? • Harvests all metadata from all OAI data providers (within reason) • Only keeps metadata that points to digital objects, e.g., articles, photographs, datasets, etc. in digitized form • All available via search service…
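The "only keeps metadata that points to digital objects" rule can be illustrated with a small filter over an OAI-PMH ListRecords response. This is a minimal sketch, not OAIster's actual rule: it assumes a record "points to a digital object" when at least one dc:identifier is an http(s) URL, and the sample response and the function name records_with_digital_objects are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespaces defined by OAI-PMH 2.0 and simple Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical, hand-written response with two records.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Digitized photograph</dc:title>
          <dc:identifier>http://example.org/photos/1.jpg</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
    <record>
      <header><identifier>oai:example.org:2</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Catalog entry for a print-only item</dc:title>
          <dc:identifier>shelfmark 42</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""


def records_with_digital_objects(xml_text):
    """Return the OAI identifiers of records with at least one URL identifier.

    Assumption: an http(s) URL in dc:identifier is treated as a pointer
    to a digital object. The real selection rule is not given in the talk.
    """
    root = ET.fromstring(xml_text)
    kept = []
    for record in root.iterfind(".//oai:record", NS):
        identifiers = [
            element.text.strip()
            for element in record.iterfind(".//dc:identifier", NS)
            if element.text
        ]
        if any(i.lower().startswith(("http://", "https://")) for i in identifiers):
            kept.append(record.findtext("oai:header/oai:identifier", namespaces=NS))
    return kept


print(records_with_digital_objects(SAMPLE_RESPONSE))
```

In a real harvest the XML would come from a data provider's base URL with verb=ListRecords and metadataPrefix=oai_dc; the inline sample keeps the sketch self-contained.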
Searching OAIster • Time to show off OAIster… • http://www.oaister.org/
A little history • Service is now 3.5 years old • Started with 66 data providers and a little over 200K records • Now have 572 data providers and “a little” over 6 million records • 37% US, 63% international
Visibility of OAI • Surprising who hasn’t made their metadata shareable through OAI • Harvard, Yale, Stanford…the big ones • Initially perplexing, but now clearer: • OAI sharing has always been done at the end of projects • only recently thought of at initiation of projects • truthfully, many institutions are not collaborative…
Examples of data providers • Many data providers are huge, e.g., • arXiv: physics preprint and postprint articles • pubmed: medical articles, although restricted • pictureaustralia: images from government and academic institutions in Australia • lcoa: Library of Congress digital archives • usc: University of Southern California census data
Examples of data providers • Most are small, though • Many around 100 records • Value of making their records available • increased visibility • inclusion in bigger search service than theirs • incorporation in Yahoo! Search
Yahoo! Search • Two years ago, collaborated with team at Yahoo! Search to send our metadata to them for indexing • e.g., “gardens at albury” in Yahoo! Search • we know the hit comes from our metadata, not from crawling static HTML: • <dc:relation>IspartOf Victorian Railways collection.</dc:relation> • Many, many more hits • Also send metadata to Google
System design • [Diagram; components as labeled] • XSL stylesheets (per source type) • UM harvester • XSLT transformation tool • OAI-enabled DC records • Record storage • Non-OAI-enabled DC records • Search interface (XPAT) • BibClass indexes
Transformation of metadata • Most metadata needs to be brushed off • adding an http:// to the front of URLs • Or raked • removing instances of <![CDATA[ • Or wrung out • instead of “Where’s Waldo,” it’s “Where’s the incorrect UTF-8 character?” • And should be normalized…
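The three kinds of cleanup named on this slide can be sketched as small helpers. This is an illustration under stated assumptions, not the XSLT-based tooling the talk describes: the function names add_scheme, strip_cdata and find_invalid_utf8 are hypothetical, and the sketch unwraps CDATA sections while keeping their content.

```python
import re


def add_scheme(url):
    """Prefix 'http://' when a URL has no scheme (the 'brushed off' case)."""
    url = url.strip()
    if re.match(r"^[a-zA-Z][a-zA-Z0-9+.-]*://", url):
        return url
    return "http://" + url


def strip_cdata(text):
    """Remove <![CDATA[ ... ]]> wrappers, keeping the text inside (the 'raked' case)."""
    return re.sub(r"<!\[CDATA\[(.*?)\]\]>", r"\1", text, flags=re.DOTALL)


def find_invalid_utf8(raw):
    """Return the byte offset of the first invalid UTF-8 sequence, or None.

    This is the 'Where's the incorrect UTF-8 character?' search.
    """
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


print(add_scheme("www.oaister.org/"))
print(strip_cdata("<title><![CDATA[Where's Waldo]]></title>"))
print(find_invalid_utf8(b"caf\xe9"))  # Latin-1 byte where UTF-8 was expected
```

Reporting the offset rather than silently replacing the bad byte keeps the decision about how to repair the record with whoever maintains the source-specific rules.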
Why normalize? • Sample date values <date>2-12-01</date> <date>2002-01-01</date> <date>0000-00-00</date> <date>1822</date> <date>between 1827 and 1833</date> <date>18--?</date> <date>November 13, 1947</date> <date>SEP 1958</date> <date>235 bce</date> <date>Summer, 1948</date>
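To make the problem concrete, here is a minimal sketch that tries to reduce each sample date value above to a single four-digit year. It is an assumption-laden illustration, not OAIster's normalization: the function name normalize_year is hypothetical, only years 1000 to 2099 are recognized, and any value that is ambiguous (ranges, partial years, BCE dates, two-digit years) is deliberately left unresolved as None.

```python
import re

SAMPLE_DATES = [
    "2-12-01",
    "2002-01-01",
    "0000-00-00",
    "1822",
    "between 1827 and 1833",
    "18--?",
    "November 13, 1947",
    "SEP 1958",
    "235 bce",
    "Summer, 1948",
]


def normalize_year(value):
    """Return a four-digit year string, or None when no single year is recoverable."""
    text = value.strip()

    # ISO-like dates: take the year, but treat the all-zero placeholder as missing.
    iso = re.fullmatch(r"(\d{4})-\d{2}-\d{2}", text)
    if iso:
        return iso.group(1) if iso.group(1) != "0000" else None

    # BC/BCE dates do not fit a plain four-digit year; leave them unresolved.
    if re.search(r"\bbce?\b", text, flags=re.IGNORECASE):
        return None

    # Otherwise accept the value only if exactly one distinct year appears.
    years = set(re.findall(r"\b(1[0-9]{3}|20[0-9]{2})\b", text))
    if len(years) == 1:
        return years.pop()
    return None


for sample in SAMPLE_DATES:
    print(f"{sample!r:28} -> {normalize_year(sample)}")
```

Half of the ten samples come back as None, which is the point of the slide: free-text dates cannot all be repaired downstream, so sorting or limiting a search by date only works for the values that could be normalized.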
Why use a CV (controlled vocabulary)? • Sample subject values <subject>30,51,52</subject> <subject>1852, Apr. 22. E[veritt] Judson, letter to Philuta [Judson].</subject> <subject>Slavery--United States--Controversial literature</subject> <subject>view of interior with John Henry sculpture</subject> <subject>Particles (Nuclear physics) -- Research.</subject>
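A short sketch shows how far purely mechanical cleanup gets with the sample subject values. It assumes, hypothetically, that "--" separates subdivisions of a heading (as in the two heading-style samples) and that a trailing period is noise; the function name split_subject is not from the talk.

```python
import re


def split_subject(value):
    """Split a subject string on '--' and trim whitespace and trailing periods."""
    parts = re.split(r"\s*--\s*", value.strip())
    return [part.rstrip(".").strip() for part in parts if part.strip()]


print(split_subject("Slavery--United States--Controversial literature"))
print(split_subject("Particles (Nuclear physics) -- Research."))
print(split_subject("30,51,52"))
```

The two heading-style values split into comparable terms, but "30,51,52" stays an opaque code: string cleanup can align formatting, while only a shared vocabulary makes subject values from different providers mean the same thing.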
Best practices • Fixing the metadata of more than half of the data providers is cumbersome • Individuals at OAI-enabled institutions started a “Best Practices” group to inform data providers what they ought to do • http://oai-best.comm.nsdl.org/cgi-bin/wiki.pl?TableOfContents
2nd phase OAI • “Best Practices” group sponsored by the Digital Library Federation, which also… • Sponsors our latest grant, covering: • Better and more easily calculated statistics • Search interface improvements • Clustering / classification techniques • Using richer metadata
Clustering / classification • Using automated means to take a selection of metadata and determine “what it’s about” • Working with Emory University (one of our grant partners) to test their tool • Results will be integrated into search so users can search within a smaller group of OAIster records
Using richer metadata • Data providers must use simple Dublin Core • Very sparse schema for describing objects • dc:title must contain main title, sorted title and alternative titles • dc:subject doesn’t distinguish between geographical, hierarchical, temporal…
Using richer metadata • Encouraging use of richer metadata, especially MODS (Metadata Object Description Schema) from LOC • Developed testbed for grant deliverables • currently only shows MODS work… • http://www.hti.umich.edu/m/mods/
Other stuff • Well, make it smaller somehow… • Clean up Boolean interface • squinch fields together • include more normalization • Make it available through federated search • Proselytize sharing metadata • Test, test, test
Contact me • Kat Hagedorn • UM Library Information Technology • khage@umich.edu • www.oaister.org