New approaches to the catalog T. Hickey http://errol.oclc.org/laf/n82-54463.html Svensk Biblioteksförening 2005 October 28
OCLC • Founded 1967 • Nonprofit membership organization • > 53,000 libraries • 96 countries • ~1,000 employees • Cataloging • Interlibrary Loan • Preservation • Dewey Decimal Classification • netLibrary • FirstSearch
OCLC Research • Research for both • OCLC services • Membership • Metadata management • Knowledge organization • Content management • Interoperability • Systems & interaction design • ~30 employees
What do users want? • The right information • with minimum effort
How to give them what they want • Catch them where they are • Increase our data • Improve our data • Make the data work harder • Interconnect with other systems • Do all this efficiently
What has changed • Computers and telecommunications • User expectations • Digital materials • Remoteness of our users • Huge amounts of bandwidth, storage
The competition • Online booksellers • Reviews • Tables of contents • Excerpts • Inside-the-book searching • Web search engines • Speed • Full-text searching • Global coverage (of web resources) • Good enough • Ourselves • Electronic journals
Current projects (my group) • Live search • Registries, PURLs • Dewey browser • Harvesting, electronic theses • VIAF, LAF • SRU/W, OpenURLs, OAI • FRBR, xISBN • Beowulf cluster • Map-reduce • Text searching • Batch loading • Open WorldCat • WorldCat Wiki • Publisher Names • MXG
Other Research Projects • FictionFinder, Curiouser • Schema Transformation • Terminology Services • Digital Preservation • Collection Analysis • Dublin Core • FAST • User Studies • Data mining • Also: http://www.oclc.org/research/researchworks/
Catch them where they are • Google, Yahoo, etc. • Open WorldCat • OpenURL • OAI-PMH • Creation too • WorldCat Wiki • Tags?
OpenURL • OpenURL registry • Supports version 1.0 • Also registry of OpenURL servers • Used for WikiD
WorldCat ‘Wiki’ • Opening up WorldCat to user annotations • Reviews • Notes • Tables of contents • Cover art? • Book lists? • Based on WikiD software • Full Wiki • Many features off for WorldCat • Uses OpenURL 1.0 protocol internally • Allows collections of pages of arbitrary XML schemas • Tools for the creation of simple collections • Doesn’t look like a Wiki
Tags? • Folksonomies? • User-generated keywords • We’ve been here before • Is it different? • Is there another direction?
More data • Harvesting • OAI-PMH • ETDs • Batch load • 60 million records • 3 million new manifestations • Other • Cover art • Reviews • WC
Better data and organization • VIAF • FRBR • Authority files in general • LAF • Publisher names • Genre • FAST • Registries • PURLs • Generalized solution? Get them nearer to creation
FRBR • Work-set algorithm • Keys based on author/title • Authority files • Auxiliary authority files • xISBN • Used for • xISBN • Open WorldCat • FirstSearch (coming) • Collection analysis (coming) • Research
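The idea of keys based on author/title can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the OCLC work-set algorithm: the record fields (`id`, `author`, `title`), the normalization rules, and the sample records are all hypothetical, and the authority-file lookups mentioned on the slide are left out.

```python
import re
import unicodedata
from collections import defaultdict


def normalize(text):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", no_accents.lower())
    return " ".join(cleaned.split())


def work_key(record):
    """Build a matching key from the normalized author and title (assumed fields)."""
    return normalize(record["author"]) + "/" + normalize(record["title"])


def group_into_work_sets(records):
    """Collect the ids of all records that share the same author/title key."""
    work_sets = defaultdict(list)
    for record in records:
        work_sets[work_key(record)].append(record["id"])
    return dict(work_sets)


# Hypothetical sample records: two variant descriptions of one work, plus another work.
records = [
    {"id": 1, "author": "Lindgren, Astrid", "title": "Pippi Långstrump"},
    {"id": 2, "author": "LINDGREN, ASTRID.", "title": "Pippi Langstrump"},
    {"id": 3, "author": "Lindgren, Astrid", "title": "Emil i Lönneberga"},
]
work_sets = group_into_work_sets(records)
for key, ids in work_sets.items():
    print(key, ids)
```

Records 1 and 2 fall into the same set because their keys agree after normalization. In practice, name variants that normalization cannot reconcile would need the authority files listed on the slide.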
Authority Files • LAF • http://errol.oclc.org/laf/n82-54463.html • Publisher names • Not normally controlled • Looking for variations with ISBN prefixes • Also worked with dissertations
VIAF • Merge national-level files • Library of Congress (NACO) and Die Deutsche Bibliothek • Bibliographic records analyzed • 15% would be erroneous based just on names • Basic matching now completed • 435,000 matching names • < 1% mismatched • Working on • Public interface • OAI harvesting • Persistent identifiers
Registries • Show relationships between metadata • Often associated with an identifier • General solution? • Examples • Authority files • WorldCat • PURLs
PURLs • Persistent URLs • Map one URL to another • http://purl.org/hickey/outgoing -> http://outgoing.typepad.com/ • 500,000+ PURLs • 111 million resolutions • Port to WikiD platform? • http://www.oclc.org/research/projects/wikid/ • String of PURL servers? • Use OAI-PMH for synchronization • Spread responsibility • Generalized solution?
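The core of "map one URL to another" is a lookup table plus a redirect. The sketch below is a toy model, not the PURL server software: the class name `PurlTable`, its methods, and the choice of a 302 status for the redirect are assumptions made for illustration. Only the example mapping comes from the slide.

```python
class PurlTable:
    """Toy in-memory model of a PURL resolver (hypothetical, for illustration)."""

    def __init__(self):
        self._targets = {}
        self.resolutions = 0

    def register(self, purl, target):
        """Create or retarget a PURL; the persistent URL itself never changes."""
        self._targets[purl] = target

    def resolve(self, purl):
        """Return (status, location): a redirect if the PURL is known, else 404."""
        target = self._targets.get(purl)
        if target is None:
            return 404, None
        self.resolutions += 1
        return 302, target


table = PurlTable()
table.register("http://purl.org/hickey/outgoing", "http://outgoing.typepad.com/")
status, location = table.resolve("http://purl.org/hickey/outgoing")
print(status, location)
```

If the target moves, only the table entry is updated with another `register` call, and every published link to the PURL keeps working.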
More connectivity • OpenURL • RSS feeds • OpenSearch, SRU/W • OAI-PMH
OpenURL • Developed to address the ‘appropriate copy’ problem • Transitioning to OpenURL 1.0 • OpenURL resolver • Accepts requests specifying • Resource • Services • Generalized syntax • Specifying a resource • Services to be performed • Metadata elements specified in registry • http://purl.org/openurl/
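A request that specifies a resource can be sketched as an OpenURL 1.0 query string in the key/encoded-value (KEV) form. The version string `Z39.88-2004` and the book metadata format identifier are taken from the OpenURL 1.0 standard as I understand it. The resolver address, the ISBN, and the helper name `build_openurl` are placeholders, not anything from the talk.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_openurl(resolver_base, isbn, title):
    """Build an OpenURL 1.0 KEV request describing a book (the referent)."""
    params = [
        ("url_ver", "Z39.88-2004"),                    # OpenURL 1.0 version tag
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:book"),  # registered book format
        ("rft.isbn", isbn),
        ("rft.btitle", title),
    ]
    return resolver_base + "?" + urlencode(params)


# Placeholder resolver and ISBN: substitute a real resolver and a real identifier.
openurl = build_openurl("http://resolver.example.org/openurl", "0123456789", "Beowulf")
print(openurl)
print(parse_qs(urlsplit(openurl).query))
```

The resolver decides which services to offer for the described resource, which is how the "appropriate copy" for a given user gets chosen.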
SRU • Simplified version of Z39.50 • Web based • SRW – SOAP • SRU – URL • Even simpler? • OpenSearch • No search syntax • Looking for common ground • MXG • Metasearch XML Gateway • Simplifies metasearchers’ lives
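Because SRU puts the whole search into a URL, a request can be built with nothing more than a query-string encoder. The sketch below assumes SRU version 1.1 parameter names (`operation`, `version`, `query`, `startRecord`, `maximumRecords`). The server address is a placeholder, and the `dc.title` index in the CQL query is an assumption that depends on what a given server supports.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_sru_url(base_url, cql_query, start_record=1, maximum_records=10):
    """Build an SRU searchRetrieve URL carrying a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,
        "startRecord": start_record,
        "maximumRecords": maximum_records,
    }
    return base_url + "?" + urlencode(params)


# Placeholder endpoint; the index name in the query varies from server to server.
sru_url = build_sru_url("http://sru.example.org/catalog", 'dc.title = "beowulf"')
print(sru_url)
```

The response is an XML document, so the same request can be issued from a browser, a script, or a metasearch gateway without a SOAP toolkit.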
OAI-PMH • Method of harvesting metadata • More generally, a way of synchronizing databases • No real restriction to metadata • Becomes a repository protocol • Identifiers • Timestamps • Layered implementation • OAI • SRU • Pears
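The synchronization reading of OAI-PMH rests on identifiers, datestamps, and resumption tokens. The sketch below walks a `ListRecords` response page by page and collects (identifier, datestamp) pairs. It assumes the OAI-PMH 2.0 XML namespace and takes the HTTP fetch as a function argument so it can run offline. The repository address, the canned pages, and the names `harvest` and `fake_fetch` are invented for the example.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def harvest(base_url, fetch, metadata_prefix="oai_dc", from_date=None):
    """Yield (identifier, datestamp) for every record, following resumption tokens."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date  # only records changed since this date
    while True:
        root = ET.fromstring(fetch(base_url + "?" + urlencode(params)))
        for header in root.iter(OAI_NS + "header"):
            yield (header.findtext(OAI_NS + "identifier"),
                   header.findtext(OAI_NS + "datestamp"))
        token = root.findtext(".//" + OAI_NS + "resumptionToken")
        if not token:  # missing or empty token means the list is complete
            break
        params = {"verb": "ListRecords", "resumptionToken": token}


# Canned two-page response standing in for a real repository (hypothetical data).
PAGE1 = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListRecords>
<record><header><identifier>oai:example.org:1</identifier>
<datestamp>2005-10-01</datestamp></header></record>
<resumptionToken>page2</resumptionToken></ListRecords></OAI-PMH>"""
PAGE2 = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListRecords>
<record><header><identifier>oai:example.org:2</identifier>
<datestamp>2005-10-02</datestamp></header></record>
<resumptionToken/></ListRecords></OAI-PMH>"""


def fake_fetch(url):
    """Return the second page when the resumption token is requested."""
    return PAGE2 if "resumptionToken=page2" in url else PAGE1


harvested = list(harvest("http://repository.example.org/oai", fake_fetch,
                         from_date="2005-10-01"))
print(harvested)
```

Storing the latest datestamp seen and passing it as `from_date` on the next run is what turns harvesting into incremental synchronization.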
Efficient processing • Beowulf cluster • Map reduce • Text searching
Beowulf Cluster • 24 nodes • 2 processors, 4 gigabytes of RAM, 120 gigabytes disk • Gigabit network • Use it for • FRBR processing • Text indexing • Text searching • ~30-fold speedup on many tasks • 1 year ⇒ 2 weeks • 1 week ⇒ 1 day • 1 day ⇒ 1 hour • 1 hour ⇒ 2 minutes • Extremely cheap processing
Map reduce • Pioneered by Google • Petabytes of data on thousands of nodes • Adapted to our cluster • Tens of gigabytes of data on dozens of nodes • Simple functional programming paradigm • Allows batch processing across cluster
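The functional paradigm can be shown in a single process: a map function emits key/value pairs for each record, the pairs are grouped by key, and a reduce function folds each group. The sketch below counts records per three-digit Dewey class. It is not the cluster implementation from the talk, and the record layout and function names are assumptions.

```python
from collections import defaultdict
from functools import reduce


def map_record(record):
    """Emit (three-digit DDC class, 1) for a record (assumed 'ddc' field)."""
    yield record["ddc"][:3], 1


def add(left, right):
    """Reduce step: sum the counts emitted for one key."""
    return left + right


def run_map_reduce(records, mapper, reducer):
    """Single-process stand-in for the map, shuffle, and reduce phases."""
    groups = defaultdict(list)
    for record in records:            # map phase
        for key, value in mapper(record):
            groups[key].append(value)  # shuffle: group values by key
    return {key: reduce(reducer, values) for key, values in groups.items()}


# Hypothetical records with Dewey numbers.
records = [{"ddc": "330.1"}, {"ddc": "331.88"}, {"ddc": "330.9"}, {"ddc": "338.5"}]
counts = run_map_reduce(records, map_record, add)
print(counts)
```

Because the mapper looks at one record at a time and the reducer only combines values for one key, both phases can be split across nodes without changing the two functions.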
Text Searching • Spread database across cluster • Two levels of aggregation • 3 servers/node • 24-way aggregation • Aggregators run across cluster • SRU used • HTTP based • SRW (SOAP) slowed it down • Open source software
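Two levels of aggregation can be sketched as the same merge applied twice: once per node across its servers, then once across the nodes. The sketch assumes each server returns hits already sorted by descending score as `(score, doc_id)` tuples. The sample hits and the function name are invented, and the SRU transport between the layers is omitted.

```python
import heapq
from itertools import islice


def aggregate(result_lists, limit):
    """Merge ranked hit lists (highest score first) and keep the top `limit`."""
    merged = heapq.merge(*result_lists, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, limit))


# Hypothetical hits from two nodes with three servers each.
node_a_servers = [[(0.9, "a1"), (0.4, "a2")], [(0.8, "a3")], [(0.1, "a4")]]
node_b_servers = [[(0.95, "b1")], [(0.5, "b2"), (0.3, "b3")], []]

limit = 3
# First level: each node merges the results of its own servers.
node_results = [aggregate(servers, limit) for servers in (node_a_servers, node_b_servers)]
# Second level: a top aggregator merges the per-node results.
top_hits = aggregate(node_results, limit)
print(top_hits)
```

Each level only has to return `limit` hits upward, so the amount of data crossing the network stays small regardless of how large the database slices are.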
Better interfaces • More interactive • Live search • Dewey Browser • Better connected
Post-coordination of Services • Systems that expose low level services • Higher level coordination of those services • Loosely coupled services • Examples from OCLC • Validation service • RSS feeds • SRU • OpenURL, OAI-PMH • xISBN • DDC Browser built this way • Very different interfaces have been built
DDC Browser XML • <?xml version="1.0" encoding="utf-8"?><?xml-stylesheet type="text/xsl" href="/ddcbrowser/xsl/wcat.xsl" ?> • <cells> • <language>swe</language> • <cell ddc="330" count="23" /> • <cell ddc="331" count="28" /> • <cell ddc="332" count="5" /> • <cell ddc="333" count="7" /> • <cell ddc="334" count="2" /> • <cell ddc="335" count="1" /> • <cell ddc="336" count="3" /> • <cell ddc="337" count="2" /> • <cell ddc="338" count="26" /> • <cell ddc="339" count="5" /> • </cells>
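A client of this low-level service only has to read the `cell` elements. The sketch below parses the XML from the slide with Python's standard library and picks out the total and the largest class. The variable and function names are mine, and nothing here is taken from the DDC Browser's own code.

```python
import xml.etree.ElementTree as ET

# The response shown on the slide, passed as bytes so the encoding declaration applies.
DDC_XML = b"""<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="/ddcbrowser/xsl/wcat.xsl" ?>
<cells>
<language>swe</language>
<cell ddc="330" count="23" /><cell ddc="331" count="28" />
<cell ddc="332" count="5" /><cell ddc="333" count="7" />
<cell ddc="334" count="2" /><cell ddc="335" count="1" />
<cell ddc="336" count="3" /><cell ddc="337" count="2" />
<cell ddc="338" count="26" /><cell ddc="339" count="5" />
</cells>"""


def parse_cells(xml_bytes):
    """Return the language code and a {ddc: count} mapping from the response."""
    root = ET.fromstring(xml_bytes)
    language = root.findtext("language")
    counts = {cell.get("ddc"): int(cell.get("count")) for cell in root.iter("cell")}
    return language, counts


language, counts = parse_cells(DDC_XML)
total = sum(counts.values())
top_class = max(counts.items(), key=lambda item: item[1])
print(language, total, top_class)
```

The same XML can be rendered by the XSL stylesheet named in the processing instruction or consumed by a script like this one, which is the point of keeping the service and the interface loosely coupled.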
Do We Need It? • Just have Google harvest everything • Our experience with Google • Fielded searching • Reliable searching • Possibility of user-supplied metadata • Cost of good metadata • Cost of non-existent metadata
Conclusions • Shift to remote users • Online availability – trend towards centralization • More flexibility in implementations • Patrons are better served • Less emphasis on physical collections
Thank you T. Hickey http://errol.oclc.org/laf/n82-54463.html Swedish Library Association 2005 October 28