Explore innovative ways to meet user demands efficiently by leveraging advanced technologies and data management strategies in library services. Discover current projects and research efforts aimed at improving data quality and organization for a better user experience.
New approaches to the catalog T. Hickey http://errol.oclc.org/laf/n82-54463.html Svensk Biblioteksförening 2005 October 28
OCLC • Founded 1967 • Nonprofit membership organization • > 53,000 libraries • 96 countries • ~1,000 employees • Cataloging • Interlibrary Loan • Preservation • Dewey Decimal Classification • netLibrary • FirstSearch
OCLC Research • Research for both • OCLC services • Membership • Metadata management • Knowledge organization • Content management • Interoperability • Systems & interaction design • ~30 employees
What do users want? • The right information • with minimum effort
How to give them what they want • Catch them where they are • Increase our data • Improve our data • Make the data work harder • Interconnect with other systems • Do all this efficiently
What has changed • Computers and telecommunications • User expectations • Digital materials • Remoteness of our users • Huge amounts of bandwidth, storage
The competition • Online booksellers • Reviews • Tables of contents • Excerpts • Inside-the-book searching • Web search engines • Speed • Full-text searching • Global coverage (of web resources) • Good enough • Ourselves • Electronic journals
Current projects (my group) • Live search • Registries, PURLs • Dewey browser • Harvesting, electronic theses • VIAF, LAF • SRU/W, OpenURLs, OAI • FRBR, xISBN • Beowulf cluster • Map-reduce • Text searching • Batch loading • Open WorldCat • WorldCat Wiki • Publisher Names • MXG
Other Research Projects • FictionFinder, Curiouser • Schema Transformation • Terminology Services • Digital Preservation • Collection Analysis • Dublin Core • FAST • User Studies • Data mining • Also: http://www.oclc.org/research/researchworks/
Catch them where they are • Google, Yahoo, etc. • Open WorldCat • OpenURL • OAI-PMH • Creation too • WorldCat Wiki • Tags?
OpenURL • OpenURL registry • Supports version 1.0 • Also registry of OpenURL servers • Used for WikiD
WorldCat ‘Wiki’ • Opening up WorldCat to user annotations • Reviews • Notes • Tables of contents • Cover art? • Book lists? • Based on WikiD software • Full Wiki • Many features off for WorldCat • Uses OpenURL 1.0 protocol internally • Allows collections of pages of arbitrary XML schemas • Tools for the creation of simple collections • Doesn’t look like a Wiki
Tags? • Folksonomies? • User-generated key words • We’ve been here before • Is it different? • Is there another direction?
More data • Harvesting • OAI-PMH • ETDs • Batch load • 60 million records • 3 million new manifestations • Other • Cover art • Reviews • WC
Better data and organization • VIAF • FRBR • Authority files in general • LAF • Publisher names • Genre • FAST • Registries • PURLs • Generalized solution? Get them nearer to creation
FRBR • Work-set algorithm • Keys based on author/title • Authority files • Auxiliary authority files • xISBN • Used for • xISBN • Open WorldCat • FirstSearch (coming) • Collection analysis (coming) • Research
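As a rough illustration of keys based on author/title, the sketch below groups records into work sets by a normalized key. The normalization rules, field names, and sample records are assumptions made for this example; the actual work-set algorithm also draws on authority files, which are not modeled here.

```python
import re
import unicodedata
from collections import defaultdict


def normalize(text):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", no_accents.lower())
    return " ".join(cleaned.split())


def work_key(author, title):
    """Build a hypothetical author/title key for one record."""
    return normalize(author) + "/" + normalize(title)


def group_into_work_sets(records):
    """Map each key to the ids of the records that share it."""
    work_sets = defaultdict(list)
    for record in records:
        key = work_key(record["author"], record["title"])
        work_sets[key].append(record["id"])
    return dict(work_sets)


# Illustrative records only; ids and field names are assumptions.
records = [
    {"id": 1, "author": "Lindgren, Astrid", "title": "Pippi Långstrump"},
    {"id": 2, "author": "LINDGREN, ASTRID.", "title": "Pippi Langstrump."},
    {"id": 3, "author": "Lindgren, Astrid", "title": "Emil i Lönneberga"},
]
work_sets = group_into_work_sets(records)
```

Records 1 and 2 differ only in case, punctuation, and diacritics, so they fall into the same work set, while record 3 gets its own.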
Authority Files • LAF • http://errol.oclc.org/laf/n82-54463.html • Publisher names • Not normally controlled • Looking for variations with ISBN prefixes • Also worked with dissertations
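One way to look for publisher name variations with ISBN prefixes is to collect every name string that appears under the same prefix. The sketch below assumes hyphenated ISBN-10s, so the prefix can be read off directly, and uses made-up ISBNs and names (check digits are not validated); it is not the project's actual method.

```python
from collections import defaultdict


def publisher_prefix(hyphenated_isbn):
    """Return 'group-publisher' from a hyphenated ISBN-10, e.g. '1-23'."""
    parts = hyphenated_isbn.split("-")
    if len(parts) != 4:
        raise ValueError("expected a hyphenated ISBN-10: " + hyphenated_isbn)
    return parts[0] + "-" + parts[1]


def name_variants_by_prefix(records):
    """Collect the distinct publisher name strings seen under each prefix."""
    variants = defaultdict(set)
    for isbn, publisher_name in records:
        variants[publisher_prefix(isbn)].add(publisher_name)
    return {prefix: sorted(names) for prefix, names in variants.items()}


# Made-up sample data for illustration only.
sample = [
    ("1-23-456789-X", "Example Press"),
    ("1-23-456790-3", "Example Pr."),
    ("9-87-654321-0", "Sample Books"),
]
variants = name_variants_by_prefix(sample)
```

Prefixes with more than one name string are candidates for review as variant forms of the same publisher.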
VIAF • Merge national-level files • Library of Congress (NACO) and Die Deutsche Bibliothek • Bibliographic records analyzed • 15% would be erroneous based just on names • Basic matching now completed • 435,000 matching names • < 1% mismatched • Working on • Public interface • OAI harvesting • Persistent identifiers
Registries • Show relationships between metadata • Often associated with an identifier • General solution? • Examples • Authority files • WorldCat • PURLs
PURLs • Persistent URLs • Map one URL to another • http://purl.org/hickey/outgoing -> • http://outgoing.typepad.com/ • 500,000+ PURLs • 111 million resolutions • Port to WikiD platform? • http://www.oclc.org/research/projects/wikid/ • String of PURL servers? • Use OAI-PMH for synchronization • Spread responsibility • Generalized solution?
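The idea of mapping one URL to another can be sketched as a small lookup table that answers with a redirect. The class name, the use of status 302, and the resolution counter are assumptions for this example and do not describe the production PURL server.

```python
class PurlResolver:
    """Toy in-memory resolver: persistent URL -> current target URL."""

    def __init__(self):
        self._targets = {}
        self.resolutions = 0  # number of successful lookups

    def register(self, purl, target_url):
        """Create or update a PURL; the PURL itself never changes."""
        self._targets[purl] = target_url

    def resolve(self, purl):
        """Return (status, location) as an HTTP redirect would."""
        target = self._targets.get(purl)
        if target is None:
            return 404, None
        self.resolutions += 1
        return 302, target


resolver = PurlResolver()
resolver.register("http://purl.org/hickey/outgoing", "http://outgoing.typepad.com/")
status, location = resolver.resolve("http://purl.org/hickey/outgoing")
```

If the target moves, only the table entry is updated; links that cite the PURL keep working.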
More connectivity • OpenURL • RSS feeds • OpenSearch, SRU/W • OAI-PMH
OpenURL • Developed to address the ‘appropriate copy’ problem • Transitioning to OpenURL 1.0 • OpenURL resolver • Accepts requests specifying • Resource • Services • Generalized syntax • Specifying a resource • Services to be performed • Metadata elements specified in registry • http://purl.org/openurl/
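A request that specifies a resource can be sketched as a query string in the OpenURL 1.0 key/encoded-value style. The resolver address below is hypothetical, the ISBN is a placeholder, and only a few book metadata keys are shown; no service description is included.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_openurl(resolver_base, isbn, book_title):
    """Build an OpenURL 1.0 (KEV) link describing a book as the referent."""
    params = {
        "url_ver": "Z39.88-2004",                    # OpenURL 1.0 version tag
        "rft_val_fmt": "info:ofi/fmt:kev:mtx:book",  # book metadata format
        "rft.isbn": isbn,
        "rft.btitle": book_title,
    }
    return resolver_base + "?" + urlencode(params)


# Hypothetical resolver address and placeholder ISBN.
openurl = build_openurl("http://resolver.example.org/openurl", "0000000000", "Beowulf")
fields = parse_qs(urlsplit(openurl).query)
```

The same description can be sent to any resolver, which then decides which copy or service is appropriate for the user in front of it.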
SRU • Simplified version of Z39.50 • Web based • SRW – SOAP • SRU – URL • Even simpler? • OpenSearch • No search syntax • Looking for common ground • MXG • Metasearch XML Gateway • Simplifies metasearchers’ lives
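Because SRU puts the whole request in the URL, a search can be sketched with nothing more than query-string encoding. The server address is hypothetical and the CQL index name is an assumption about what a given server supports; the parameter names are those of SRU 1.1.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_sru_url(base_url, cql_query, start_record=1, maximum_records=10):
    """Build an SRU searchRetrieve request as a plain URL."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,               # CQL search expression
        "startRecord": start_record,      # 1-based position of first record
        "maximumRecords": maximum_records,
    }
    return base_url + "?" + urlencode(params)


# Hypothetical server address.
sru_url = build_sru_url("http://sru.example.org/catalog", 'dc.title = "beowulf"')
sru_fields = parse_qs(urlsplit(sru_url).query)
```

The response is XML over plain HTTP, so the same request can be tested by pasting the URL into a browser.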
OAI-PMH • Method of harvesting metadata • More generally, a way of synchronizing databases • No real restriction to metadata • Becomes a repository protocol • Identifiers • Timestamps • Layered implementation • OAI • SRU • Pears
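The view of OAI-PMH as database synchronization rests on two things each record carries: an identifier and a timestamp. The sketch below imitates that exchange in memory, with no network access; the function names and the dictionary layout are assumptions for illustration, and the timestamps are ISO 8601 UTC strings so that plain string comparison orders them correctly.

```python
def list_records(repository, from_stamp=None):
    """Return (identifier, datestamp, record) for records changed since from_stamp.

    The lower bound is inclusive, as the protocol's 'from' argument is.
    """
    changed = [
        (identifier, stamp, record)
        for identifier, (stamp, record) in repository.items()
        if from_stamp is None or stamp >= from_stamp
    ]
    return sorted(changed, key=lambda item: item[1])


def synchronize(repository, local_copy, last_harvest=None):
    """Copy changed records into local_copy; return the newest datestamp seen."""
    newest = last_harvest
    for identifier, stamp, record in list_records(repository, last_harvest):
        local_copy[identifier] = record
        if newest is None or stamp > newest:
            newest = stamp
    return newest


# Illustrative repository: identifier -> (datestamp, record).
repository = {
    "oai:example.org:1": ("2005-10-01T00:00:00Z", "record A"),
    "oai:example.org:2": ("2005-10-20T00:00:00Z", "record B"),
}
local_copy = {}
first_harvest = synchronize(repository, local_copy)
```

Passing the returned timestamp to the next call fetches only what changed in between, and nothing in the exchange depends on the records being metadata.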
Efficient processing • Beowulf cluster • Map reduce • Text searching
Beowulf Cluster • 24 nodes • 2 processors, 4 gigabytes of RAM, 120 gigabytes disk • Gigabit network • Use it for • FRBR processing • Text indexing • Text searching • ~30-fold speedup on many tasks • 1 year ⇒ 2 weeks • 1 week ⇒ 1 day • 1 day ⇒ 1 hour • 1 hour ⇒ 2 minutes • Extremely cheap processing
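The before/after pairs on this slide are rounded, so each implies a somewhat different factor. A quick calculation, assuming a 365-day year:

```python
# All durations expressed in minutes.
MINUTE = 1
HOUR = 60 * MINUTE
DAY = 24 * HOUR
WEEK = 7 * DAY
YEAR = 365 * DAY  # assumption: 365-day year

examples = [
    ("1 year -> 2 weeks", YEAR, 2 * WEEK),
    ("1 week -> 1 day", WEEK, DAY),
    ("1 day -> 1 hour", DAY, HOUR),
    ("1 hour -> 2 minutes", HOUR, 2 * MINUTE),
]

# Implied speedup factor for each before/after pair.
speedups = {label: before / after for label, before, after in examples}

for label, factor in speedups.items():
    print(f"{label}: about {factor:.0f}x")
```

The pairs work out to roughly 7x to 30x, which suggests the ~30-fold figure describes the better cases rather than a uniform gain.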
Map reduce • Pioneered by Google • Petabytes of data on thousands of nodes • Adapted to our cluster • Tens of gigabytes of data on dozens of nodes • Simple functional programming paradigm • Allows batch processing across cluster
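The functional paradigm can be shown in a few lines: a map function emits key/value pairs, the framework groups values by key, and a reduce function combines each group. This is a single-process sketch with hypothetical function names and sample titles; on a cluster, the map calls and the per-key reduce calls are what get spread across nodes.

```python
from collections import defaultdict


def map_reduce(inputs, map_fn, reduce_fn):
    """Run map_fn on every input, group emitted values by key, then reduce."""
    groups = defaultdict(list)
    for item in inputs:
        for key, value in map_fn(item):   # map phase
            groups[key].append(value)     # shuffle: group by key
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def map_words(title):
    """Emit (word, 1) for each word in a title."""
    return [(word, 1) for word in title.lower().split()]


def reduce_counts(word, counts):
    """Sum the counts emitted for one word."""
    return sum(counts)


# Illustrative input only.
titles = ["Beowulf", "Beowulf a new translation", "A history of Sweden"]
word_counts = map_reduce(titles, map_words, reduce_counts)
```

Because the map and reduce functions have no shared state, the same two functions work unchanged whether the input is three titles or a full batch load.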
Text Searching • Spread database across cluster • Two levels of aggregation • 3 servers/node • 24-way aggregation • Aggregators run across cluster • SRU used • HTTP based • SRW (SOAP) slowed it down • Open source software
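Two levels of aggregation can be sketched as the same merge step applied twice: once over the servers on a node, then once over the nodes. The scoring, the hit format, and the tiny two-node cluster below are assumptions for illustration; in the setup described on the slide, each call to a server would be an SRU request over HTTP.

```python
import heapq


def make_server(documents):
    """Return a search function over one slice of the database."""
    def search(query):
        hits = []
        for doc_id, text in documents.items():
            score = text.lower().split().count(query.lower())
            if score:
                hits.append((score, doc_id))
        return hits
    return search


def merge_top_k(result_lists, k):
    """Merge several hit lists and keep the k best (score, doc_id) pairs."""
    return heapq.nlargest(k, (hit for hits in result_lists for hit in hits))


def search_cluster(cluster, query, k=3):
    """First aggregate the servers on each node, then aggregate the nodes."""
    node_results = []
    for node in cluster:
        server_results = [server(query) for server in node]
        node_results.append(merge_top_k(server_results, k))  # level 1
    return merge_top_k(node_results, k)                      # level 2


# Two nodes with three servers each (the slide describes 24 nodes).
cluster = [
    [make_server({"d1": "dewey dewey decimal"}),
     make_server({"d2": "dewey browser"}),
     make_server({"d3": "map reduce"})],
    [make_server({"d4": "dewey dewey dewey"}),
     make_server({}),
     make_server({"d5": "beowulf"})],
]
top_hits = search_cluster(cluster, "dewey", k=2)
```

Each aggregator only ever handles a handful of short hit lists, which keeps the merge cheap however large the database grows.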
Better interfaces • More interactive • Live search • Dewey Browser • Better connected
Post-coordination of Services • Systems that expose low level services • Higher level coordination of those services • Loosely coupled services • Examples from OCLC • Validation service • RSS feeds • SRU • OpenURL, OAI-PMH • xISBN • DDC Browser built this way • Very different interfaces have been built
DDC Browser XML
<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="/ddcbrowser/xsl/wcat.xsl" ?>
<cells>
  <language>swe</language>
  <cell ddc="330" count="23" />
  <cell ddc="331" count="28" />
  <cell ddc="332" count="5" />
  <cell ddc="333" count="7" />
  <cell ddc="334" count="2" />
  <cell ddc="335" count="1" />
  <cell ddc="336" count="3" />
  <cell ddc="337" count="2" />
  <cell ddc="338" count="26" />
  <cell ddc="339" count="5" />
</cells>
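A response like the one on this slide is easy for another interface to consume. The sketch below reads it with Python's standard XML parser; the function name is hypothetical, and the document is embedded as a literal rather than fetched from the service.

```python
import xml.etree.ElementTree as ET

# The XML from the slide, embedded as bytes so the encoding declaration is honored.
DDC_XML = b"""<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="/ddcbrowser/xsl/wcat.xsl" ?>
<cells>
  <language>swe</language>
  <cell ddc="330" count="23" />
  <cell ddc="331" count="28" />
  <cell ddc="332" count="5" />
  <cell ddc="333" count="7" />
  <cell ddc="334" count="2" />
  <cell ddc="335" count="1" />
  <cell ddc="336" count="3" />
  <cell ddc="337" count="2" />
  <cell ddc="338" count="26" />
  <cell ddc="339" count="5" />
</cells>"""


def read_cells(xml_bytes):
    """Return the language code and a {ddc_number: count} mapping."""
    root = ET.fromstring(xml_bytes)
    language = root.findtext("language")
    counts = {cell.get("ddc"): int(cell.get("count")) for cell in root.iter("cell")}
    return language, counts


language, counts = read_cells(DDC_XML)
busiest = max(counts, key=counts.get)  # DDC number with the most items
```

The stylesheet reference is only a hint for browsers; a client that wants a different presentation can ignore it and work from the counts directly.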
Do We Need It? • Just have Google harvest everything • Our experience with Google • Fielded searching • Reliable searching • Possibility of user-supplied metadata • Cost of good metadata • Cost of non-existent metadata
Conclusions • Shift to remote users • Online availability – trend towards centralization • More flexibility in implementations • Patrons are better served • Less emphasis on physical collections
Thank you T. Hickey http://errol.oclc.org/laf/n82-54463.html Swedish Library Association 2005 October 28