
Publisher Name Authority Project: An Attempt to Enhance Data Mining for Collection Analysis & Comparison

Lynn Silipigni Connaway, Consulting Research Scientist III, and Akeisha Heard, Technical Intern. XXV Annual Charleston Conference, 04 November 2005.


Presentation Transcript


  1. Publisher Name Authority Project: An Attempt to Enhance Data Mining for Collection Analysis & Comparison • Lynn Silipigni Connaway, Consulting Research Scientist III • Akeisha Heard, Technical Intern • XXV Annual Charleston Conference, 04 November 2005

  2. Introduction

  3. Research Goals • Develop a service to support advanced collection intelligence • Cluster collected objects based on their issuing entity • As can be determined via metadata about the objects • Gain intelligence about the nature of individual publishers • Collection intelligence • Acquisition patterns • User behavior

  4. Research Objectives • Resolve • ISBN prefixes to publisher name • Variant publisher names to a preferred form • Capture and make available for use various attributes of individual publishers • Location of publisher • Language(s) of materials published • Genre(s)/format(s) of materials published • Dominant subject domain(s) of the publisher's output • Parent company and subsidiaries
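
The two "resolve" objectives can be pictured as a pair of lookups. The following is a minimal sketch, not the project's implementation: the table contents, the function names, and the sample ISBN are assumptions made for illustration. Only the pairing of prefix 019 with Oxford University Press comes from the clustering example later in the deck.

```python
# Hypothetical lookup tables; entries are illustrative, not project data.
PREFIX_TO_PUBLISHER = {"019": "Oxford University Press"}
VARIANT_TO_PREFERRED = {
    "oxford univ press": "Oxford University Press",
    "oup": "Oxford University Press",
}


def normalize(name):
    """Lowercase, turn punctuation into spaces, and collapse whitespace."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(cleaned.split())


def publisher_from_isbn(isbn):
    """Return the publisher for the longest known ISBN prefix, or None."""
    digits = isbn.replace("-", "").replace(" ", "")
    for length in range(len(digits), 0, -1):  # try the longest prefix first
        publisher = PREFIX_TO_PUBLISHER.get(digits[:length])
        if publisher is not None:
            return publisher
    return None


def preferred_form(name):
    """Map a variant publisher name to its preferred form, or None if unknown."""
    return VARIANT_TO_PREFERRED.get(normalize(name))


# "0-19-000000-7" is a made-up ISBN that merely starts with the 019 prefix.
print(publisher_from_isbn("0-19-000000-7"))
print(preferred_form("Oxford Univ. Press"))
```

Returning None for unknown names, rather than guessing, leaves unmatched variants available for manual review.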

  5. Theoretical Foundation: Authority Control • Adhere to authorized form • Personal names • Corporate entities • Why no authorized form for publishing entities?

  6. Pragmatic Foundation: Collection Development • Identified publisher series • Retrospective conversion project (1984) • Family tree • Which publishers are related? • Approval plans • Which publishers publish which subjects?

  7. Pragmatic Foundation: OCLC WorldCat Data Mining • Collection Analysis • Which libraries have the most items by a publisher in a particular subject area? • How do library holdings by publisher compare? • E-books for a particular STM publisher (2000) • Cataloged as reproductions • 2 publishers!

  8. Pragmatic Foundation: Citation Analysis • Sweetland (1989) • Reader functions of citations • Information retrieval via citation databases • Document retrieval • Includes interlibrary loan verification • Bibliometrics • Faculty and researcher productivity measure • Other functions • Creation of references/bibliographies

  9. Pragmatic Foundation: Education for Librarians • Collection development & acquisitions librarian education • Subject focuses of publishers • Parent and subsidiary relationships

  10. Specialized Corporate Authority Files • ACOLIT (Ruggeri, 2004) • Names, uniform titles, Italian and international Catholic institutions, Catholic religious communities, and institutions • Related to the Catholic Church, Papal State, and Vatican City State • COPAR (Boddaert, 2004) • French official corporate bodies • Mainly national and preceding the French Revolution • CORELI (Boddaert, 2004) • Religious corporate bodies from 3 French ancient specialized catalogues

  11. Specialized Corporate Authority Files • Chinese Modern Author Authority Database (Hu, Tam & Lo, 2004) • Chinese authors of expanded works and Chinese corporate bodies since 1912 • Chinese Name Authority Database (Hu, Tam & Lo, 2004) • Mainly Taiwanese personal names with some Taiwanese corporate bodies

  12. Specialized Corporate Authority Files • Case study by Elias & Fair (1983) • Standard Oil Co.’s Media Query File • No authority control • 3 professionals in 6 months averaged 12 telephone calls/day from reporters • Decided against canonical list for media names • Noted 20 unique variants for Wall Street Journal including WSJ, Wall St. Jnl, Wall Street Jnl

  13. Specialized Corporate Authority Files • Case study by French, Powell & Schulman (1997, 2000) • Smithsonian Astrophysical Observatory’s Astrophysics Data System database • Programmatically identify author affiliations and map variant names to canonical name • Investigated various techniques separately and iteratively to bring variants together including: • Lexical cleanup • Data clustering algorithms • Approximate string-matching • Reduced number of unique strings by 55% • Required manual review of clusters
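
The techniques listed on this slide (lexical cleanup, clustering, approximate string-matching) can be combined in a few lines. The sketch below is not the method of French, Powell & Schulman: it swaps in Python's `difflib.SequenceMatcher` as the similarity measure and a simple greedy grouping. The 0.85 threshold and the sample affiliation strings are assumptions for illustration.

```python
from difflib import SequenceMatcher


def cleanup(text):
    """Lexical cleanup: lowercase, drop punctuation, collapse whitespace."""
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text.lower())
    return " ".join(kept.split())


def similar(a, b, threshold=0.85):
    """Approximate string match on already-cleaned strings."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def cluster(names, threshold=0.85):
    """Greedy clustering: join the first cluster whose representative is similar."""
    clusters = []  # each cluster is a list; its first member is the representative
    for name in names:
        key = cleanup(name)
        for group in clusters:
            if similar(key, cleanup(group[0]), threshold):
                group.append(name)
                break
        else:
            clusters.append([name])
    return clusters


# Illustrative variant strings, not taken from the Astrophysics Data System.
variants = [
    "Harvard-Smithsonian Center for Astrophysics",
    "Harvard Smithsonian Center for Astrophysics.",
    "Harvard-Smithsonian Centre for Astrophysics",
    "Space Telescope Science Institute",
]
groups = cluster(variants)
for group in groups:
    print(group)
```

A greedy pass like this depends on input order and on the threshold, which is one reason the resulting clusters still need manual review, as the slide notes.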

  14. Database Quality

  15. Literature: Database Quality • Review by O’Neill & Vizine-Goetz (1988) • Busch (1981) • < 35% of 141 OCLC libraries routinely reported errors • Pollock & Zamora (1983) • Noted misspellings comprise 90-96% of errors & include: • Omission • Insertion • Substitution • Transposition
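
The four misspelling types can be told apart mechanically when a word contains a single error. The function below is a sketch written for this transcript, not code from Pollock & Zamora. It assumes the correct spelling is known and classifies anything beyond a single error as "other".

```python
def classify_single_error(correct, typed):
    """Classify a one-error misspelling of `correct`, or return None if identical."""
    if correct == typed:
        return None
    if len(typed) == len(correct) - 1:
        # Omission: deleting one character from the correct word gives the typed word.
        for i in range(len(correct)):
            if correct[:i] + correct[i + 1:] == typed:
                return "omission"
    if len(typed) == len(correct) + 1:
        # Insertion: deleting one character from the typed word gives the correct word.
        for i in range(len(typed)):
            if typed[:i] + typed[i + 1:] == correct:
                return "insertion"
    if len(typed) == len(correct):
        diffs = [i for i, (a, b) in enumerate(zip(correct, typed)) if a != b]
        if len(diffs) == 1:
            return "substitution"
        if (
            len(diffs) == 2
            and diffs[1] == diffs[0] + 1
            and correct[diffs[0]] == typed[diffs[1]]
            and correct[diffs[1]] == typed[diffs[0]]
        ):
            return "transposition"
    return "other"


print(classify_single_error("Institute", "Institite"))
print(classify_single_error("Westminster", "Westminister"))
```

The two calls use misspellings that appear later in the deck: "Institite" differs from "Institute" by one substituted letter, and "Westminister" adds one letter to "Westminster".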

  16. Literature: Database Quality • Intner (1989) • Reviewed 215 matching records in OCLC and RLIN • Errors relating to publishers:

  17. Literature: Database Quality • Romero (1994) • Evaluated cataloging of library science students • Noted 221 errors (28.22%) in the publisher description area

  18. Issues: Historical Practices • Different rules for abbreviations • LC Rule Interpretation B.14 • State postal (2-letter) abbreviation if it appears in the item along with the place • Anglo-American Cataloguing Rules, Revised (2002) • Abbreviations included in Appendix B.14

  19. Issues: Historical Practices • ALA Catalog Rules (1941) • Multiple places of publication and publishers and neither or first is prominent • Include first listed first, indicate omission • Multiple places of publication and publishers and first is not prominent • Include prominent first • Include first listed second • Unknown place of publication – [n.p.]

  20. Issues: Historical Practices • Anglo-American Cataloging Rules (1967) • Multiple places of publication and publishers and neither or first is prominent • Include first listed only, omit others • Multiple places of publication and publishers and first is not prominent • Include prominent only, omit others • Unknown place of publication – [n. p.]

  21. Issues: Historical Practices • Anglo-American Cataloguing Rules, Revised (2002) • Multiple places of publication and publishers and neither or first is prominent • Include first listed only, omit others • Multiple places of publication and publishers and first is not prominent • Include first listed first • Include prominent second • Unknown place of publication – [S.l.]

  22. Issues: Historical and Local Practices • “u.a.” • At least one German institution uses “u.a.” as mark of omission • Means “et al.” • Not an AACR2r rule • Local practice? • Is local practice/policy an error?

  23. Issues: Historical and Local Practices • WorldCat enhanced records • Eliminate or lessen the probability of these issues

  24. Examining Quality of WorldCat

  25. WorldCat: Publisher Name Selection Criteria • Fixed field lang = “eng”

  26. WorldCat: ISBN Validation Errors • WorldCat records with ISBNs: 22.69%

  27. WorldCat: ISBN Validation Errors
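
The slides do not say how ISBNs were validated. One plausible check, assumed here, is the standard ISBN-10 check digit: the ten characters are weighted 10 down to 1, a final "X" counts as 10, and the weighted sum must be divisible by 11. The function name is hypothetical.

```python
def is_valid_isbn10(raw):
    """Return True if `raw` is a well-formed ISBN-10 with a correct check digit."""
    chars = raw.replace("-", "").replace(" ", "").upper()
    if len(chars) != 10:
        return False
    total = 0
    for position, ch in enumerate(chars):
        if ch in "0123456789":
            value = int(ch)
        elif ch == "X" and position == 9:
            value = 10  # "X" is only allowed as the check character
        else:
            return False
        total += (10 - position) * value  # weights run from 10 down to 1
    return total % 11 == 0


print(is_valid_isbn10("0-306-40615-2"))
print(is_valid_isbn10("0-306-40615-3"))
```

A check like this catches many transcription errors in the number itself, but it cannot tell whether a valid ISBN was attached to the wrong record.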

  28. WorldCat: MARC Tagging Errors • Examined English language records based on some known issues and manual evaluation • Total MARC tagging errors found: 11,874 (0.03%)

  29. WorldCat: MARC Tagging Errors • MARC 260 vs 300 tagging • In 260 field, information from 300 field in $a, $b, $c and/or $e • Dates tagging • Date in $a or $b • Five digit year • “cm” follows year
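
The date-tagging problems listed here lend themselves to pattern checks. The sketch below assumes a 260 field has already been parsed into a plain dictionary of subfield code to text. That representation, the regular expressions, and the sample values are all illustrative assumptions, not the project's actual tests.

```python
import re

FOUR_DIGIT_YEAR = re.compile(r"(?<!\d)\d{4}(?!\d)")
FIVE_DIGIT_YEAR = re.compile(r"(?<!\d)\d{5}(?!\d)")
CM_AFTER_YEAR = re.compile(r"\d{4}.*\bcm\b", re.IGNORECASE)


def flag_260(subfields):
    """Return a list of suspected tagging problems for one parsed 260 field."""
    flags = []
    for code in ("a", "b"):
        # A year in place ($a) or publisher ($b) suggests a misplaced date.
        if FOUR_DIGIT_YEAR.search(subfields.get(code, "")):
            flags.append(f"date in ${code}")
    date = subfields.get("c", "")
    if FIVE_DIGIT_YEAR.search(date):
        flags.append("five-digit year")
    if CM_AFTER_YEAR.search(date):
        # Dimensions belong in the 300 field, not after the date.
        flags.append("cm follows year")
    return flags


# Made-up field contents for illustration.
print(flag_260({"a": "New York :", "b": "Example Press,", "c": "c1999."}))
print(flag_260({"a": "London, 1987 :", "b": "Example Press,", "c": "19877."}))
print(flag_260({"a": "Boston :", "b": "Example Press,", "c": "2001. 24 cm."}))
```

These are heuristics only: a publisher name can legitimately contain a four-digit number, so flagged records would still need the manual evaluation the previous slide mentions.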

  30. WorldCat: Typographical Errors • Used “Typographical Errors in Library Databases” to identify and quantify English language WorldCat errors (Ballard, 2005) • Total errors: 26,599 (0.08%) • Require manual examination to determine if actual errors • Searching for Institi* • Misspelled: • American Institite of Physics • British Standards Institition • Spelled correctly: • Institiúid Ard-Léinn Bhaile Átha Cliath (Dublin Institute for Advanced Studies)
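
The Institi* example shows why truncation hits need manual examination. The sketch below imitates a truncated search with a regular expression. The function name is hypothetical, and the names are the ones quoted on the slide plus one correctly spelled control.

```python
import re


def truncation_hits(names, stem):
    """Return the names containing a word that starts with `stem` (case-insensitive)."""
    pattern = re.compile(r"\b" + re.escape(stem) + r"\w*", re.IGNORECASE)
    return [name for name in names if pattern.search(name)]


names = [
    "American Institite of Physics",
    "British Standards Institition",
    "Institiúid Ard-Léinn Bhaile Átha Cliath",
    "American Institute of Physics",
]
hits = truncation_hits(names, "Institi")
for hit in hits:
    print(hit)
```

The search returns the two misspellings and also the correctly spelled Irish name, so a hit count on its own overstates the number of errors.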

  31. WorldCat: Typographical Errors • Top words (10.4%):

  32. WorldCat: Typographical Errors • “Westminister” • Only included on Ballard list in combination with other words • Total errors in WorldCat: 628 (2.36%) • Require manual review

  33. Where are we now?

  34. WorldCat: MARC 260 Evaluation • Top 10 terms in 260 $b in WorldCat

  35. WorldCat: MARC 260 Evaluation • University Press names in 260 $b in WorldCat

  36. Clustering • Attempting programmatic clustering of publishers using ISBN prefixes • Data clustering (The Free Dictionary) • "The science of extracting useful information from large data sets or databases" • Classification of similar objects into different groups • Partitioning of a data set into subsets (clusters) • Data in each subset (ideally) share some common trait

  37. WorldCat: Clustering Example • Used ISBN prefix 019 (Oxford University Press) • Total WorldCat records: 58,004,317 • Records with ISBN prefix 019: 84,276 (0.15%) • Non-unique publisher names from ISBN prefix records: 91,528
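
A prefix-based cluster of this kind amounts to grouping records by the leading digits of the ISBN and counting the publisher strings in each group. The sketch below is an illustration under stated assumptions: the records are made up, the function name is hypothetical, and a fixed three-digit prefix is used only because it matches the 019 example (real ISBN prefixes vary in length). The last lines recompute the slide's percentage from its own counts.

```python
from collections import Counter, defaultdict


def names_by_prefix(records, prefix_length=3):
    """Group (isbn, publisher) pairs by ISBN prefix and count each name string."""
    groups = defaultdict(Counter)
    for isbn, publisher in records:
        digits = isbn.replace("-", "").replace(" ", "")
        groups[digits[:prefix_length]][publisher] += 1
    return groups


# Made-up records; the ISBNs are placeholders, not real identifiers.
records = [
    ("0-19-000000-7", "Oxford University Press"),
    ("0-19-000001-5", "Oxford University Press"),
    ("0-19-000002-3", "Oxford Univ. Press"),
    ("0-00-000000-0", "Another Publisher"),
]
groups = names_by_prefix(records)
print(groups["019"].most_common())

# Share of WorldCat records carrying prefix 019, from the counts on the slide.
share = 84_276 / 58_004_317 * 100
print(round(share, 2))
```

The counter for a prefix exposes the variant forms of name that would have to be reconciled into one preferred form.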

  38. Challenges: Publisher Name Authority File • Quality issue • Level of acceptance for cluster • What is acceptable? • Subsidiaries and Relationships • Oxford & Auckland • Examined manually to determine relationship • Form of name • What is acceptable? • Likely to use the most prominent form of name

  39. Questions and Discussion • Contact information: connawal@oclc.org, hearda@oclc.org • Project Web site: http://www.oclc.org/research/projects/publisherns/
