Capturing Untapped Descriptive Data: Creating Value for Librarians and Users Lynn Silipigni Connaway OCLC Research ASIST 2006 Conference November 9, 2006
WorldCat: July 2006 Manifestations (records): 67,282,165 Works: 53,472,668 Total holdings: 1,071,507,045 Digital Items: 1,571,803 Institutions: 26,236 Physical Items*: ~1.6 billion *Estimated
Origin of materials represented in WorldCat Unknown 14% Rest of World 40% US 34% Canada 3% UK 9%
Some aspects of “Global WorldCat” … • Materials w/non-US origins: 35.3 million (52%) • Top 5: UK: 6.1 million; Germany: 4.0 million; France: 2.9 million; Netherlands: 2.2 million; Canada: 2.1 million • Content languages: 476 • 43% of WorldCat non-English • Top 5 non-English: German: 4.5 million; French: 4.2 million; Spanish: 2.9 million; Dutch: 2.1 million; Chinese: 1.6 million • Non-English metadata language: 9.3 million (20 languages) • Top 5: Dutch: 4.1 million; French: 1.4 million; German: 1.0 million; Japanese: 0.7 million; Finnish: 0.7 million
OCLC WorldCat™: Decision-making Resource • Collection management • Cooperative collection development • Comparative collection analysis • Collection assessment • Mass digitization • Off-site storage • Preservation • Services • Virtual reference • Recommender services • Systems • Precision
OCLC WorldCat™: Data Mining Research Projects • Audience Level • Publisher Name Server • WorldMap
Audience Level: Rationale and Objectives Holdings represent selection decisions by librarians … implies there are about 1 billion individual selection decisions in the WorldCat holdings file • Selections are made to serve the interests of a library’s target community … • Associate target community (audience level) to particular library profiles - e.g., ARL, non-ARL academic, public, K-12 school … ? • Implies: we can infer materials’ audience level from holdings patterns, which in turn can support: • Collection management • Readers’ advisory services • Reference services • Information retrieval
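The inference described above can be sketched as a holdings-weighted score. This is only an illustration of the idea, not the project's actual method: the library-type labels, the numeric weights in `TYPE_WEIGHTS`, and the function name `audience_level` are all assumptions introduced here.

```python
from collections import Counter

# Hypothetical weights per library type (NOT from the presentation):
# a higher value stands for a more specialized, research-oriented audience.
TYPE_WEIGHTS = {"k12": 0.1, "public": 0.3, "academic": 0.6, "arl": 0.9}


def audience_level(holdings):
    """Return the holdings-weighted mean of the library-type weights.

    `holdings` is an iterable of library-type labels, one per holding
    library, so each label stands for one selection decision.
    """
    counts = Counter(holdings)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no holdings to score")
    return sum(TYPE_WEIGHTS[t] * n for t, n in counts.items()) / total


# A title held mostly by research libraries scores higher than one
# held mostly by school and public libraries.
research_title = audience_level(["arl", "arl", "academic"])
popular_title = audience_level(["public", "public", "k12"])
```

Under these assumed weights, a title's score moves toward 0.9 as its holdings concentrate in ARL libraries and toward 0.1 as they concentrate in K-12 school libraries.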
Publisher Name Server: Research Objectives • Resolve for data mining and quality of WorldCat • ISBN prefixes to publisher name • Variant publisher names to a preferred form • Complement Collection Analysis Service • Librarians • Publishers • Capture and make available various attributes of individual publishers • Location of publisher • Language(s) of materials published • Genre(s)/format(s) of materials published • Dominant subject domain(s) of the publisher's output • Parent company and subsidiaries
Publisher Name Server: Methodology • Programmatically cluster publishers using ISBN prefixes • Data clustering (The Free Dictionary) • "The science of extracting useful information from large data sets or databases" • Classification of similar objects into different groups • Partitioning of a data set into subsets (clusters) • Data in each subset (ideally) share some common trait • Hand parse the entities and resolve ISBN prefixes
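The clustering step above might look like the following minimal sketch. It assumes the ISBNs are already hyphenated into four elements, so the prefix can be read off directly; real data would need the ISBN range tables to split unhyphenated numbers. The function names and the sample rows are illustrative and are not drawn from WorldCat.

```python
from collections import defaultdict


def isbn_prefix(hyphenated_isbn):
    """Return the group and publisher elements of a hyphenated ISBN-10."""
    parts = hyphenated_isbn.split("-")
    if len(parts) != 4:
        raise ValueError(f"expected 4 hyphenated elements: {hyphenated_isbn!r}")
    return "-".join(parts[:2])


def cluster_by_prefix(records):
    """Group the publisher-name strings that share an ISBN prefix.

    `records` is an iterable of (hyphenated ISBN, publisher name) pairs.
    """
    clusters = defaultdict(set)
    for isbn, name in records:
        clusters[isbn_prefix(isbn)].add(name)
    return dict(clusters)


# Illustrative rows only (assumed for this example).
sample = [
    ("0-13-110362-8", "Prentice Hall"),
    ("0-13-110370-9", "Prentice-Hall, Inc."),
    ("0-201-03801-3", "Addison-Wesley"),
]
clusters = cluster_by_prefix(sample)
```

Each resulting cluster collects the variant name forms seen under one prefix, which is the set a person would then hand parse to choose a preferred form.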
Publisher Name Server: Database • To date >800 records • Relational database, preserving hierarchical relationships • Begins with high-occurrence entities to identify: • “Top 10” lists (USA, UK, Canada, Australia, Germany, France, Japan, Italy) • Top university presses • Mergers and acquisitions
Top U.S. Publishing Entities in WorldCat (22,680,201 total U.S. records)
Publisher Name Server: Database • Database Fields: Publisher Name, Preferred Form; Source of Preferred Form; Former Names; Variant Forms; ISBN Prefixes; HQ City; HQ Country; Other Cities; URL ----- Languages; Formats; DDC Subjects; LCC Subjects • Data Sources: U.S. Library of Congress, National Authority File, 110 (Corporate Name) field; Books In Print Online (R.R. Bowker); The International ISBN Registry (K.G. Saur); Publishers’ Weekly Online; Hoover’s Handbook Online; Standard and Poor’s Corporate Descriptions; The Directory of Corporate Affiliations (DIALOG); Company websites • DATA MINING
Entity-Parsing in a World of Mergers and Acquisitions • Pearson PLC • Penguin Books • Pearson Canada • Pearson Technology Group • Allen Lane • Ladybird Books • Riverhead Books • Copp Clark • Adobe Press • Cisco Press • Puffin Books • Putnam Books • Berkley Publishing Group • Pearson Education, Inc. • Avery • Addison-Wesley Publishing Company • Allyn and Bacon • Prentice-Hall, Inc. • Dominie Press • Benjamin/Cummings Publishing Company • Scott, Foresman and Company • HarperCollins Educational Publishers • Longmans, Green, and Co.
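One way a relational database can preserve hierarchical relationships among entities like these is a self-referencing parent column, walked with a recursive query. The sketch below uses Python's built-in `sqlite3` module. The table name, column names, and the parent links in the sample rows are assumptions for illustration; they are not the project's schema or the slide's own diagram.

```python
import sqlite3


def build_db():
    """Create an in-memory table in which each publisher points to its parent."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE publisher ("
        " id INTEGER PRIMARY KEY,"
        " name TEXT NOT NULL UNIQUE,"
        " parent_id INTEGER REFERENCES publisher(id))"
    )
    # Sample rows: the parent links are assumed for this example.
    rows = [
        (1, "Pearson PLC", None),
        (2, "Penguin Books", 1),
        (3, "Puffin Books", 2),
        (4, "Pearson Education, Inc.", 1),
        (5, "Prentice-Hall, Inc.", 4),
    ]
    conn.executemany("INSERT INTO publisher VALUES (?, ?, ?)", rows)
    return conn


def ancestors(conn, name):
    """Return the chain from `name` up to its top-level parent."""
    sql = """
        WITH RECURSIVE chain(id, name, parent_id, depth) AS (
            SELECT id, name, parent_id, 0 FROM publisher WHERE name = ?
            UNION ALL
            SELECT p.id, p.name, p.parent_id, chain.depth + 1
            FROM publisher AS p JOIN chain ON p.id = chain.parent_id
        )
        SELECT name FROM chain ORDER BY depth
    """
    return [row[0] for row in conn.execute(sql, (name,))]


conn = build_db()
chain = ancestors(conn, "Puffin Books")
# chain == ["Puffin Books", "Penguin Books", "Pearson PLC"]
```

Recording a merger or acquisition then means updating a single `parent_id` value, and every imprint beneath that entity follows its new parent automatically.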
OCLC WorldMap™: Objectives • Geographically represent library data from UNESCO, ARL, and NCES • Number of libraries • Amount of library expenditures • Number of volumes and titles • Number of librarians • Number of users
OCLC WorldMap™: Objectives • Research prototype • Test geographical representation of WorldCat • Titles and holdings by country of publication • Support data mining research area • Visually display mined data to ease review and analysis • Internal use • Sales and marketing • External use • Library collection assessment and comparison • Complement the AAU/ARL Global Resources Network project • Project of the Council on Library and Information Resources (CLIR)
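The "titles and holdings by country of publication" view amounts to a group-by over bibliographic records. A minimal sketch, assuming each record has already been reduced to a (country, holdings count) pair; the record shape, the function name, and the sample values are hypothetical.

```python
from collections import defaultdict


def totals_by_country(records):
    """Sum the title and holdings counts per country of publication.

    `records` is an iterable of (country, holdings) pairs, one per title.
    """
    totals = defaultdict(lambda: {"titles": 0, "holdings": 0})
    for country, holdings in records:
        totals[country]["titles"] += 1
        totals[country]["holdings"] += holdings
    return dict(totals)


# Hypothetical sample records, not WorldCat data.
sample = [("UK", 12), ("UK", 3), ("Germany", 7)]
summary = totals_by_country(sample)
# summary["UK"] == {"titles": 2, "holdings": 15}
```

The per-country totals are the values a map layer would then shade or size.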
OCLC WorldMap™: Technology • First implemented in SVG • Open standard maintained by W3C • Simple XML file • Young technology • Browser support limited • Requires plug-in • Converted to Flash • Browser compatibility • Plug-in compatibility (if a plug-in was installed!) • For a detailed comparison of SVG and Flash, see: http://www.carto.net/papers/svg/comparison_flash_svg/
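To illustrate the "simple XML file" point, the sketch below writes a small SVG document with Python's standard `xml.etree.ElementTree` module, placing one circle per data point. The equirectangular projection, the radius scaling, the function name, and the sample point are all assumptions made for this example; it does not reproduce the WorldMap prototype.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def render_svg(points, width=360, height=180):
    """Return an SVG string with one circle per (label, lon, lat, value)."""
    ET.register_namespace("", SVG_NS)  # serialize without an ns0: prefix
    svg = ET.Element(
        f"{{{SVG_NS}}}svg",
        {"width": str(width), "height": str(height),
         "viewBox": f"0 0 {width} {height}"},
    )
    for label, lon, lat, value in points:
        # Equirectangular projection: longitude maps to x, latitude to y.
        x = (lon + 180) / 360 * width
        y = (90 - lat) / 180 * height
        circle = ET.SubElement(
            svg,
            f"{{{SVG_NS}}}circle",
            {"cx": f"{x:.1f}", "cy": f"{y:.1f}",
             "r": f"{max(1.0, value ** 0.5):.1f}"},
        )
        # A <title> child is commonly shown as a tooltip by SVG viewers.
        title = ET.SubElement(circle, f"{{{SVG_NS}}}title")
        title.text = f"{label}: {value}"
    return ET.tostring(svg, encoding="unicode")


# Hypothetical point at longitude 0, latitude 0.
svg_text = render_svg([("Sample", 0.0, 0.0, 4.0)])
```

Because the output is plain XML text, it can be generated by any script and inspected or edited by hand.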
Potential Future Projects • Audience Level • Integrate into WorldCat.org and OPACs to limit searches and retrieved sources • Publisher Name Server • Integrate into OCLC Collection Analysis Service for publisher business intelligence • WorldMap • Subject information “aboutness” • Language of item • Content language • Metadata language • Holdings by country of library
Presentation will be available at http://www.oclc.org/research/presentations/default.htm • Prototypes available at http://www.oclc.org/research/researchworks/default.htm • Project Web Site: http://www.oclc.org/research/projects/default.htm
Questions and Discussion Contact Information: connawal@oclc.org