
Capturing Untapped Descriptive Data: Creating Value for Librarians and Users

Capturing Untapped Descriptive Data: Creating Value for Librarians and Users. Lynn Silipigni Connaway OCLC Research ASIST 2006 Conference November 9, 2006. WorldCat: July 2006. Manifestations (records): 67,282,165. Works: 53,472,668. Total holdings: 1,071,507,045. Digital Items: 1,571,803.


Presentation Transcript


  1. Capturing Untapped Descriptive Data: Creating Value for Librarians and Users. Lynn Silipigni Connaway, OCLC Research. ASIST 2006 Conference, November 9, 2006

  2. WorldCat: July 2006 Manifestations (records): 67,282,165 Works: 53,472,668 Total holdings: 1,071,507,045 Digital Items: 1,571,803 Institutions: 26,236 Physical Items*: ~1.6 billion *Estimated

  3. Origin of materials represented in WorldCat Unknown 14% Rest of World 40% US 34% Canada 3% UK 9%

  4. Some aspects of “Global WorldCat” … • Materials with non-US origins: 35.3 million (52%) • Top 5: UK: 6.1 million; Germany: 4.0 million; France: 2.9 million; Netherlands: 2.2 million; Canada: 2.1 million • Content languages: 476 • 43% of WorldCat (WC) non-English • Top 5 non-English: German: 4.5 million; French: 4.2 million; Spanish: 2.9 million; Dutch: 2.1 million; Chinese: 1.6 million • Non-English metadata language: 9.3 million (20 languages) • Top 5: Dutch: 4.1 million; French: 1.4 million; German: 1.0 million; Japanese: 0.7 million; Finnish: 0.7 million

  5. OCLC WorldCat™: Decision-making Resource • Collection management • Cooperative collection development • Comparative collection analysis • Collection assessment • Mass digitization • Off-site storage • Preservation • Services • Virtual reference • Recommender services • Systems • Precision

  6. OCLC WorldCat™: Data Mining Research Projects • Audience Level • Publisher Name Server • WorldMap

  7. Audience Level: Rationale and Objectives Holdings represent selection decisions by librarians … implies there are about 1 billion individual selection decisions in the WorldCat holdings file • Selections are made to serve the interests of a library’s target community … • Associate target community (audience level) to particular library profiles - e.g., ARL, non-ARL academic, public, K-12 school … ? • Implies: we can infer materials’ audience level from holdings patterns, which in turn can support: • Collection management • Readers’ advisory services • Reference services • Information retrieval
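
The inference on slide 7 can be sketched as a holdings-weighted average. The library types come from the slide; the numeric score per type, the function name `audience_level`, and the sample holdings counts are hypothetical and chosen only for illustration. This is a minimal sketch under those assumptions, not the method used in the OCLC project.

```python
# Hypothetical scores per library type (0 = juvenile, 1 = research-level).
# These weights are illustrative assumptions, not values from the presentation.
TYPE_SCORES = {
    "k12_school": 0.1,
    "public": 0.4,
    "non_arl_academic": 0.7,
    "arl": 0.9,
}


def audience_level(holdings_by_type):
    """Return the holdings-weighted mean score for one title.

    `holdings_by_type` maps a library type to the number of libraries of
    that type holding the title. Returns None when there are no holdings.
    """
    total = sum(holdings_by_type.values())
    if total == 0:
        return None
    weighted = sum(TYPE_SCORES[t] * n for t, n in holdings_by_type.items())
    return weighted / total


# Made-up holdings counts for two imaginary titles.
picture_book = {"k12_school": 60, "public": 40}
monograph = {"arl": 30, "non_arl_academic": 10}

print(audience_level(picture_book))
print(audience_level(monograph))
```

A title held mostly by school and public libraries scores low, while one held mostly by research libraries scores high. A score of this kind is what could then feed the collection management, readers' advisory, reference, and retrieval uses listed on the slide.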

  8. Example: Mother Goose

  9. Publisher Name Server: Research Objectives • Resolve for data mining and quality of WorldCat • ISBN prefixes to publisher name • Variant publisher names to a preferred form • Complement Collection Analysis Service • Librarians • Publishers • Capture and make available various attributes of individual publishers • Location of publisher • Language(s) of materials published • Genre(s)/format(s) of materials published • Dominant subject domain(s) of the publisher's output • Parent company and subsidiaries

  10. Publisher Name Server: Methodology • Programmatically cluster publishers using ISBN prefixes • Data clustering (The Free Dictionary) • "The science of extracting useful information from large data sets or databases" • Classification of similar objects into different groups • Partitioning of a data set into subsets (clusters) • Data in each subset (ideally) share some common trait • Hand parse the entities and resolve ISBN prefixes
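
The programmatic step on slide 10 can be sketched as grouping publisher name strings by ISBN prefix and surfacing the most frequent variant in each group as a candidate for hand review. The function names, the input shape (prefixes already extracted from each ISBN), and the sample records are assumptions for illustration. The presentation does not describe the actual implementation.

```python
from collections import Counter, defaultdict


def cluster_by_prefix(records):
    """Group publisher name strings by ISBN publisher prefix.

    `records` is an iterable of (isbn_prefix, publisher_name) pairs.
    The prefix is assumed to have been extracted from the ISBN already.
    """
    clusters = defaultdict(Counter)
    for prefix, name in records:
        # Collapse runs of whitespace so trivially different strings match.
        clusters[prefix][" ".join(name.split())] += 1
    return clusters


def candidate_preferred_forms(clusters):
    """Pick the most frequent variant per cluster as a candidate only.

    The slide describes hand parsing as the final step, so this output
    is a starting point for review, not a resolved preferred form.
    """
    return {p: names.most_common(1)[0][0] for p, names in clusters.items()}


# Illustrative sample records (hypothetical data, not drawn from WorldCat).
sample = [
    ("0-13", "Prentice-Hall"),
    ("0-13", "Prentice Hall"),
    ("0-13", "Prentice-Hall"),
    ("0-14", "Penguin Books"),
    ("0-14", "Penguin"),
    ("0-14", "Penguin  Books"),
]

clusters = cluster_by_prefix(sample)
print(candidate_preferred_forms(clusters))
```

Each cluster keeps the full count of variants, so a reviewer can see every form that shares a prefix before choosing the preferred one.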

  11. Publisher Name Server: Database • To date >800 records • Relational database, preserving hierarchical relationships • Begins with high-occurrence entities to identify: • “Top 10” lists (USA, UK, Canada, Australia, Germany, France, Japan, Italy) • Top university presses • Mergers and acquisitions

  12. Top U.S. Publishing Entities in WorldCat (22,680,201 total U.S. records)

  13. Publisher Name Server: Database • Database fields: Publisher Name, Preferred Form; Source of Preferred Form; Former Names; Variant Forms; ISBN Prefixes; HQ City; HQ Country; Other Cities; URL; Languages; Formats; DDC Subjects; LCC Subjects • Data sources: U.S. Library of Congress, National Authority File, 110 (Corporate Name) field; Books In Print Online (R.R. Bowker); The International ISBN Registry (K.G. Saur); Publishers Weekly Online; Hoover’s Handbook Online; Standard and Poor’s Corporate Descriptions; The Directory of Corporate Affiliations (DIALOG); Company websites • Slide label: DATA MINING

  14. Entity-Parsing in a World of Mergers and Acquisitions: Pearson PLC; Penguin Books; Pearson Canada; Pearson Technology Group; Allen Lane; Ladybird Books; Riverhead Books; Copp Clark; Adobe Press; Cisco Press; Puffin Books; Putnam Books; Berkley Publishing Group; Pearson Education, Inc.; Avery; Addison-Wesley Publishing Company; Allyn and Bacon; Prentice-Hall, Inc.; Dominie Press; Benjamin/Cummings Publishing Company; Scott, Foresman and Company; HarperCollins Educational Publishers; Longmans, Green, and Co.
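
A relational database that preserves hierarchical relationships, as slide 11 describes, can be sketched with a self-referencing parent column and a recursive query. The table name, column names, and the `descendants` helper are hypothetical. The parent-child links in the sample rows are assumed for illustration, because the slide's diagram is not recoverable from this transcript.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE publisher ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " parent_id INTEGER REFERENCES publisher(id))"
)

# Sample rows: names appear on slide 14; the links between them are assumed.
rows = [
    (1, "Pearson PLC", None),
    (2, "Penguin Books", 1),
    (3, "Puffin Books", 2),
    (4, "Pearson Education, Inc.", 1),
    (5, "Prentice-Hall, Inc.", 4),
]
conn.executemany("INSERT INTO publisher VALUES (?, ?, ?)", rows)


def descendants(conn, root_name):
    """Return the names of all entities below `root_name` in the hierarchy."""
    sql = """
    WITH RECURSIVE tree(id, name) AS (
        SELECT id, name FROM publisher WHERE name = ?
        UNION ALL
        SELECT p.id, p.name
        FROM publisher AS p JOIN tree AS t ON p.parent_id = t.id
    )
    SELECT name FROM tree WHERE name <> ? ORDER BY name
    """
    return [row[0] for row in conn.execute(sql, (root_name, root_name))]


print(descendants(conn, "Pearson PLC"))
print(descendants(conn, "Penguin Books"))
```

Storing only the immediate parent keeps a merger or acquisition to a single-row update, while the recursive query still rolls every imprint up to its ultimate owner.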

  15. OCLC WorldMap™: Objectives • Geographically represent library data from UNESCO, ARL, and NCES • Number of libraries • Amount of library expenditures • Number of volumes and titles • Number of librarians • Number of users

  16. OCLC WorldMap™: Objectives • Research prototype • Test geographical representation of WorldCat • Titles and holdings by country of publication • Support data mining research area • Visually display mined data to ease review and analysis • Internal use • Sales and marketing • External use • Library collection assessment and comparison • Complement the AAU/ARL Global Resources Network project • Project of the Council on Library and Information Resources (CLIR)

  17. OCLC WorldMap™: Technology • First implemented in SVG • Open standard maintained by W3C • Simple XML file • Young technology • Browser support limited • Requires plug-in • Converted to Flash • Browser compatibility • Plug-in compatibility (if a plug-in was installed!) • For a detailed comparison of SVG and Flash, see: http://www.carto.net/papers/svg/comparison_flash_svg/

  18. OCLC WorldMap™

  19. Potential Future Projects • Audience Level • Integrate into WorldCat.org and OPACs to limit searches and retrieved sources • Publisher Name Server • Integrate into OCLC Collection Analysis Service for publisher business intelligence • WorldMap • Subject information (“aboutness”) • Language of item • Content language • Metadata language • Holdings by country of library

  20. Presentation will be available at http://www.oclc.org/research/presentations/default.htm • Prototypes available at http://www.oclc.org/research/researchworks/default.htm • Project Web Site: http://www.oclc.org/research/projects/default.htm

  21. Questions and Discussion Contact Information: connawal@oclc.org
