Publisher Name Authority Project: An Attempt to Enhance Data Mining for Collection Analysis & Comparison • Lynn Silipigni Connaway, Consulting Research Scientist III • Akeisha Heard, Technical Intern • XXV Annual Charleston Conference, 04 November 2005
Research Goals • Develop a service to support advanced collection intelligence • Cluster collected objects based on their issuing entity, as determined from metadata about the objects • Gain intelligence about the nature of individual publishers • Collection intelligence • Acquisition patterns • User behavior
Research Objectives • Resolve: ISBN prefixes to publisher name; variant publisher names to a preferred form • Capture and make available for use various attributes of individual publishers: location of publisher; language(s) of materials published; genre(s)/format(s) of materials published; dominant subject domain(s) of the publisher's output; parent company and subsidiaries
Theoretical Foundation: Authority Control • Adhere to authorized form • Personal names • Corporate entities • Why no authorized form for publishing entities?
Pragmatic Foundation: Collection Development • Identified publisher series • Retrospective conversion project (1984) • Family tree • Which publishers are related? • Approval plans • Which publishers publish which subjects?
Pragmatic Foundation: OCLC WorldCat Data Mining • Collection Analysis • Which libraries have the most items by a publisher in a particular subject area? • How do library holdings by publisher compare? • E-books for a particular STM publisher (2000) • Cataloged as reproductions • 2 publishers!
Pragmatic Foundation: Citation Analysis • Sweetland (1989) • Reader functions of citations • Information retrieval via citation databases • Document retrieval • Includes interlibrary loan verification • Bibliometrics • Faculty and researcher productivity measure • Other functions • Creation of references/bibliographies
Pragmatic Foundation: Education for Librarians • Collection development & acquisitions librarian education • Subject focuses of publishers • Parent and subsidiary relationships
Specialized Corporate Authority Files • ACOLIT (Ruggeri, 2004) • Names, uniform titles, Italian and international Catholic institutions, Catholic religious communities, and institutions related to the Catholic Church, Papal State, and Vatican City State • COPAR (Boddaert, 2004) • French official corporate bodies • Mainly national and preceding the French Revolution • CORELI (Boddaert, 2004) • Religious corporate bodies from 3 French ancient specialized catalogues
Specialized Corporate Authority Files • Chinese Modern Author Authority Database (Hu, Tam & Lo, 2004) • Chinese authors of expanded works and Chinese corporate bodies since 1912 • Chinese Name Authority Database (Hu, Tam & Lo, 2004) • Mainly Taiwanese personal names with some Taiwanese corporate bodies
Specialized Corporate Authority Files • Case study by Elias & Fair (1983) • Standard Oil Co.’s Media Query File • No authority control • 3 professionals in 6 months averaged 12 telephone calls/day from reporters • Decided against canonical list for media names • Noted 20 unique variants for Wall Street Journal including WSJ, Wall St. Jnl, Wall Street Jnl
Specialized Corporate Authority Files • Case study by French, Powell & Schulman (1997, 2000) • Smithsonian Astrophysical Observatory’s Astrophysics Data System database • Programmatically identify author affiliations and map variant names to canonical name • Investigated various techniques separately and iteratively to bring variants together including: • Lexical cleanup • Data clustering algorithms • Approximate string-matching • Reduced number of unique strings by 55% • Required manual review of clusters
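The three techniques named on this slide can be sketched in a few lines. This is only an illustration under stated assumptions, not the method French, Powell & Schulman used: the cleanup rules, the 0.85 similarity threshold, the greedy single-pass grouping, and the function names are all hypothetical, and Python's difflib stands in for whatever approximate string-matching they applied. The sample strings are the Wall Street Journal variants quoted on the previous slide.

```python
import re
from difflib import SequenceMatcher


def clean(name: str) -> str:
    """Lexical cleanup: lowercase, replace punctuation with spaces, collapse whitespace."""
    name = re.sub(r"[^\w\s]", " ", name.lower())
    return " ".join(name.split())


def similar(a: str, b: str, threshold: float) -> bool:
    """Approximate string match using difflib's similarity ratio (0.0 to 1.0)."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def cluster(names, threshold=0.85):
    """Greedy clustering: attach each name to the first cluster whose
    representative (the cleaned first member) is similar enough."""
    clusters = []
    for name in names:
        key = clean(name)
        for c in clusters:
            if similar(key, c["key"], threshold):
                c["members"].append(name)
                break
        else:
            # No existing cluster is close enough, so start a new one.
            clusters.append({"key": key, "members": [name]})
    return [c["members"] for c in clusters]


variants = [
    "Wall Street Journal",
    "wall street journal.",
    "Wall Street Jnl",
    "Wall St. Jnl",
    "WSJ",
]

for group in cluster(variants):
    print(group)
```

Initialisms such as "WSJ" share too few characters with the full name to be merged by string similarity alone, which is consistent with the slide's note that clusters still required manual review.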
Literature: Database Quality • Review by O’Neill & Vizine-Goetz (1988) • Busch (1981) • < 35% of 141 OCLC libraries routinely reported errors • Pollock & Zamora (1983) • Noted misspellings comprise 90-96% of errors & include: • Omission • Insertion • Substitution • Transposition
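The four misspelling types listed above correspond to the four single-character edits. A minimal sketch of how a known misspelling could be labelled with one of them follows; the function name is hypothetical and the approach (compare the strings after their common prefix) is an assumption, not something described by Pollock & Zamora. "Institite", "Institition" and "Westminister" are misspellings quoted later in this deck; "Pres" and "Unviersity" are invented for illustration.

```python
def classify_single_edit(wrong: str, right: str) -> str:
    """Name the single-character error that turns `right` into `wrong`.

    Returns 'omission', 'insertion', 'substitution', 'transposition',
    'none' (identical strings) or 'other' (more than one edit).
    """
    if wrong == right:
        return "none"
    lw, lr = len(wrong), len(right)
    # Length of the common prefix: the first position where the strings differ.
    i = 0
    while i < min(lw, lr) and wrong[i] == right[i]:
        i += 1
    if lw == lr - 1 and wrong[i:] == right[i + 1:]:
        return "omission"        # a character of `right` is missing
    if lw == lr + 1 and wrong[i + 1:] == right[i:]:
        return "insertion"       # an extra character was typed
    if lw == lr:
        if wrong[i + 1:] == right[i + 1:]:
            return "substitution"  # one character replaced by another
        if (i + 1 < lw
                and wrong[i] == right[i + 1]
                and wrong[i + 1] == right[i]
                and wrong[i + 2:] == right[i + 2:]):
            return "transposition"  # two adjacent characters swapped
    return "other"


for wrong, right in [
    ("Institite", "Institute"),
    ("Westminister", "Westminster"),
    ("Pres", "Press"),
    ("Unviersity", "University"),
]:
    print(wrong, "->", classify_single_edit(wrong, right))
```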
Literature: Database Quality • Intner (1989) • Reviewed 215 matching records in OCLC and RLIN • Errors relating to publishers:
Literature: Database Quality • Romero (1994) • Evaluated cataloging of library science students • Noted 221 errors (28.22%) in the publisher description area
Issues: Historical Practices • Different rules for abbreviations • LC Rule Interpretation B.14 • State postal (2-letter) abbreviation if it appears in the item along with the place • Anglo-American Cataloguing Rules, Revised (2002) • Abbreviations included in Appendix B.14
Issues: Historical Practices • ALA Catalog Rules (1941) • Multiple places of publication and publishers and neither or first is prominent • Include first listed first, indicate omission • Multiple places of publication and publishers and first is not prominent • Include prominent first • Include first listed second • Unknown place of publication – [n.p.]
Issues: Historical Practices • Anglo-American Cataloging Rules (1967) • Multiple places of publication and publishers and neither or first is prominent • Include first listed only, omit others • Multiple places of publication and publishers and first is not prominent • Include prominent only, omit others • Unknown place of publication – [n. p.]
Issues: Historical Practices • Anglo-American Cataloguing Rules, Revised (2002) • Multiple places of publication and publishers and neither or first is prominent • Include first listed only, omit others • Multiple places of publication and publishers and first is not prominent • Include first listed first • Include prominent second • Unknown place of publication – [S.l.]
Issues: Historical and Local Practices • “u.a.” • At least one German institution uses “u.a.” as mark of omission • Means “et al.” • Not an AACR2r rule • Local practice? • Is local practice/policy an error?
Issues: Historical and Local Practices • WorldCat enhanced records • Eliminate or lessen the probability of these issues
WorldCat: Publisher Name Selection Criteria • Fixed field lang = “eng”
WorldCat: ISBN Validation Errors • WorldCat records with ISBNs: 22.69%
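The slide does not say how ISBNs were validated. One plausible check, shown here as an assumption rather than the project's actual procedure, is the standard ISBN-10 check digit: the ten characters are weighted 10 down to 1, a final "X" counts as 10, and the weighted sum must be divisible by 11. The function name is hypothetical.

```python
import re


def is_valid_isbn10(raw: str) -> bool:
    """Return True if `raw` is a well-formed ISBN-10 with a correct check digit."""
    s = raw.replace("-", "").replace(" ", "").upper()
    # Nine digits followed by a digit or 'X' (which stands for the value 10).
    if not re.fullmatch(r"[0-9]{9}[0-9X]", s):
        return False
    total = sum((10 - i) * int(ch) for i, ch in enumerate(s[:9]))
    total += 10 if s[9] == "X" else int(s[9])
    return total % 11 == 0


for candidate in ["0-306-40615-2", "0-19-852663-6", "0-306-40615-3", "019852663"]:
    print(candidate, is_valid_isbn10(candidate))
```

A check of this kind catches most single-digit typos and adjacent transpositions, but it cannot tell whether a valid ISBN is attached to the right record.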
WorldCat: MARC Tagging Errors • Examined English language records based on some known issues and manual evaluation • Total MARC tagging errors found: 11,874 (0.03%)
WorldCat: MARC Tagging Errors • MARC 260 vs 300 tagging • In 260 field, information from 300 field in $a, $b, $c and/or $e • Dates tagging • Date in $a or $b • Five digit year • “cm” follows year
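The date-tagging symptoms listed on this slide lend themselves to simple pattern checks. The sketch below is a hedged illustration only: it assumes a 260 field has already been parsed into a dict of subfield code to text, the function name and regular expressions are hypothetical, and the year range (1500 to 2099) is an arbitrary choice. A publisher name that legitimately contains a year-like number would be flagged too, so hits would still need review.

```python
import re

# A plausible four-digit publication year; the range is an assumption.
YEAR = re.compile(r"\b(?:1[5-9]\d\d|20\d\d)\b")


def check_260(subfields: dict) -> list:
    """Flag the tagging symptoms from the slide in one parsed 260 field."""
    problems = []
    # Dates belong in $c, so a year in $a (place) or $b (publisher) is suspect.
    for code in ("a", "b"):
        if YEAR.search(subfields.get(code, "")):
            problems.append(f"date in ${code}")
    date = subfields.get("c", "")
    # A run of exactly five digits suggests a mistyped year.
    if re.search(r"\b\d{5}\b", date):
        problems.append("five-digit year")
    # "cm" after a year suggests 300-field (physical description) data in 260.
    if re.search(r"\d{4}.*\bcm\b", date):
        problems.append("cm follows year")
    return problems


samples = [
    {"a": "Oxford :", "b": "Oxford University Press,", "c": "2004."},
    {"a": "New York :", "b": "Wiley, 1999."},
    {"a": "London :", "b": "Routledge,", "c": "19998."},
    {"a": "Boston :", "b": "Beacon Press,", "c": "1998 ; 24 cm."},
]
for field in samples:
    print(field, check_260(field))
```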
WorldCat: Typographical Errors • Used “Typographical Errors in Library Databases” to identify and quantify English language WorldCat errors (Ballard, 2005) • Total errors: 26,599 (0.08%) • Require manual examination to determine if actual errors • Searching for Institi* • Misspelled: • American Institite of Physics • British Standards Institition • Spelled correctly: • Institiúid Ard-Léinn Bhaile Átha Cliath (Dublin Institute for Advanced Studies)
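The Institi* example shows why hits from a truncated search cannot be counted as errors automatically. A small sketch of such a prefix search, with a hypothetical function name and a name list built from the strings on this slide plus one correctly spelled control, makes the point: the Irish name matches the pattern just as the two misspellings do.

```python
def prefix_hits(names, prefix):
    """Return the names containing at least one word that starts with `prefix`
    (case-insensitive), mimicking a truncated search such as Institi*."""
    p = prefix.casefold()
    return [n for n in names if any(w.casefold().startswith(p) for w in n.split())]


names = [
    "American Institite of Physics",       # misspelled
    "British Standards Institition",       # misspelled
    "Institiúid Ard-Léinn Bhaile Átha Cliath",  # correct Irish spelling
    "American Institute of Physics",       # correct, should not match
]

for hit in prefix_hits(names, "Institi"):
    print(hit)
```

The search returns all three "Institi" strings; deciding which are genuine errors is left to a person, as the slide notes.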
WorldCat: Typographical Errors • Top words (10.4%):
WorldCat: Typographical Errors • “Westminister” • Only included on Ballard list in combination with other words • Total errors in WorldCat: 628 (2.36% of the 26,599 typographical errors found) • Require manual review
WorldCat: MARC 260 Evaluation • Top 10 terms in 260 $b in WorldCat
WorldCat: MARC 260 Evaluation • University Press names in 260 $b in WorldCat
Clustering • Attempting programmatic clustering of publishers using ISBN prefixes • Data clustering (The Free Dictionary) • "The science of extracting useful information from large data sets or databases" • Classification of similar objects into different groups • Partitioning of a data set into subsets (clusters) • Data in each subset (ideally) share some common trait
WorldCat: Clustering Example • Used ISBN prefix 019 (Oxford University Press) • Total WorldCat records: 58,004,317 • Records with ISBN prefix 019: 84,276 (0.15%) • Non-unique publisher names from ISBN prefix records: 91,528
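The example above amounts to grouping records by the leading digits of the ISBN and tallying the publisher-name strings found in each group. A minimal sketch follows, assuming each record is reduced to an (ISBN, publisher string) pair and that a fixed three-digit prefix is enough, as with 019; real publisher prefixes vary in length. The function name and the sample records (including the placeholder ISBNs) are hypothetical, and only the two totals come from the slide.

```python
from collections import Counter, defaultdict


def names_by_prefix(records, prefix_len=3):
    """Group publisher-name strings by the first `prefix_len` ISBN digits
    and count how often each name string occurs within a group."""
    groups = defaultdict(Counter)
    for isbn, publisher in records:
        digits = isbn.replace("-", "").replace(" ", "")
        groups[digits[:prefix_len]][publisher] += 1
    return groups


# Hypothetical records; the ISBNs are placeholders, not real identifiers.
records = [
    ("0-19-000000-1", "Oxford University Press"),
    ("0-19-000000-2", "Oxford Univ. Press"),
    ("0-19-000000-3", "Oxford University Press"),
    ("0-19-000000-4", "Clarendon Press"),
    ("0-52-000000-5", "Another Publisher"),
]

groups = names_by_prefix(records)
for name, count in groups["019"].most_common():
    print(count, name)

# Share of WorldCat records carrying the 019 prefix, from the slide's totals.
share = 84_276 / 58_004_317 * 100
print(f"{share:.2f}%")
```

The most frequent string in a group is a natural candidate for the preferred form, but variants such as imprints or subsidiaries sharing the prefix still have to be examined by hand.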
Challenges: Publisher Name Authority File • Quality issue • Level of acceptance for cluster • What is acceptable? • Subsidiaries and Relationships • Oxford & Auckland • Examined manually to determine relationship • Form of name • What is acceptable? • Likely to use the most prominent form of name
Questions and Discussion • Contact Information: connawal@oclc.org, hearda@oclc.org • Project Web Site: http://www.oclc.org/research/projects/publisherns/