Institutional classification schemes in bibliometrics Matthias Winterhager Bielefeld University euroCRIS Membership Meeting Bonn May 13, 2013
Institutions: a major object of bibliometric studies • Relevant data fields in bibliometric databases: the case of Web of Science • Institutional data: ready to use? • The advent of identifiers • Processing institutional data: how to count?
Institutional address data: two dimensions
Web of Science Institutional Data • Reprint address • Authors' institutional affiliation(s) (from 2008 onwards linked to author names) • Funding agencies • Publishers • Do not expect any institutional affiliation data before 1965
Address Data in Web of Science
Author Affiliations (“work done at”):
• Tech Univ Denmark, Tech Knowledge Ctr Denmark, DARC DTU Anal & Res Promot Ctr, Lyngby, Denmark
• Ctr Sci & Technol Studies CEST, Bern, Switzerland
• Inst Res Informat & Qual Assurance, Bonn, Germany
Reprint Address:
• Larsen, PO (reprint author), Marievej 10A,2, Hellerup, Denmark
Present Address (“current potential”):
• missing
Which bibliometric indicators depend on address data? • Almost all: • Every indicator that is (directly or indirectly) based on distinct sets of national, organisational or geographic entities • Any normalised indicator (like observed vs. expected citation ratios) that takes into account regional (e.g. EU, country, state) instead of world averages • Any indicator on cross-national, -organizational or -institutional cooperation as measured via co-authorships
Institutional address data: Issues (1) • Substantial amounts of records come without any address data (they may or may not be included in world total counts for expected ratios) • Proportions of missing address data differ by discipline (e.g. humanities) and document type • Few records with address data in “backfiles” (before 1966) • Spelling variants, misspellings, erroneous entries (samples following)
Institutional address data: Issues (2)
• “Reality gap”: names for entities which
  • never existed (fictitious institutes)
  • existed, but have been split, merged, renamed or closed
• Geographical and organisational aspects of an address can point in different directions; borderline cases can be complicated to assign (Max Planck Institutes outside Germany, EMBL, CERN, KIT, Charité Berlin)
Abbreviations in address records … can have different origins: author, publisher and database producer. “Corporate and institution names may or may not be abbreviated. To be comprehensive, search for the full name of the institution … as well as the abbreviation.” “Abbreviations for corporate and institution names used in the product database are listed below. Other address elements … may also be abbreviated.” (from: Web of Science Help)
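The advice to search for both the full name and the abbreviation can be sketched in a few lines. This is a minimal illustration only: the `ABBREVIATIONS` mapping and the function names are assumptions made for this example, not the official Web of Science abbreviation list, and the expansions are guesses based on the sample addresses shown earlier.

```python
import re

# Hypothetical, deliberately tiny mapping; the real abbreviation list
# published in the Web of Science Help is far longer and may differ.
ABBREVIATIONS = {
    "Univ": "University",
    "Inst": "Institute",
    "Ctr": "Centre",
    "Tech": "Technical",
    "Sci": "Science",
    "Technol": "Technology",
    "Res": "Research",
}


def expand_abbreviations(address: str) -> str:
    """Replace each whole word found in ABBREVIATIONS by its long form."""
    def replace(match: re.Match) -> str:
        word = match.group(0)
        return ABBREVIATIONS.get(word, word)

    # Match whole alphabetic words so that "Technol" is not read as "Tech".
    return re.sub(r"[A-Za-z]+", replace, address)


def search_variants(address: str) -> set:
    """Return the address as recorded plus its expanded form."""
    return {address, expand_abbreviations(address)}


print(expand_abbreviations("Tech Univ Denmark"))
print(search_variants("Ctr Sci & Technol Studies CEST, Bern, Switzerland"))
```

A one-way expansion like this cannot tell who introduced an abbreviation (author, publisher or database producer), so in practice both variants have to be searched.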
Cleaning of German address data (1) Project of the German competence centre for bibliometrics • Aim: assignment of (almost) every paper with at least one German address from Web of Science (or Scopus) to the relevant German institution(s) • Introduction of procedures to handle unstandardized, incomplete and incorrect address data • Testing different algorithms for reduction and standardization of data elements, extraction of cities and tree-structures • Use of geocoding services
Cleaning of German address data (2) • Maintaining a large base of institute-specific string patterns for thousands of German institutions • Assignment process still heavily based on pattern-matching procedures • Growing database of institutions (with “history”), currently ~2,000 main institutions (data from 1995-2011), mapping identifiers to external sources
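A pattern-based assignment with institutional “history” might look roughly like the sketch below. It is not the project's actual implementation: the `Institution` record, the identifiers, the regular expressions and the validity years are all hypothetical and chosen only to show the idea that a pattern is valid for a limited period (here, a university that later became part of a merged institution).

```python
import re
from dataclasses import dataclass


@dataclass
class Institution:
    inst_id: str      # hypothetical internal identifier
    name: str
    patterns: list    # regular expressions for known address variants
    valid_from: int   # first publication year in which the name applies
    valid_to: int     # last publication year in which the name applies


# Illustrative registry only; identifiers, patterns and years are assumptions.
REGISTRY = [
    Institution("DE-0001", "Bielefeld University",
                [r"\bUniv\w*\s+Bielefeld\b", r"\bBielefeld\s+Univ"], 1995, 2011),
    Institution("DE-0002", "University of Karlsruhe",
                [r"\bUniv\w*\s+Karlsruhe\b"], 1995, 2009),
    Institution("DE-0003", "Karlsruhe Institute of Technology",
                [r"\bKarlsruhe\s+Inst\w*\s+Technol", r"\bKIT\b"], 2009, 2011),
]


def assign(address: str, year: int) -> list:
    """Return the ids of all institutions whose patterns match the address
    and whose validity period covers the publication year."""
    normalized = " ".join(address.split())  # collapse irregular whitespace
    hits = []
    for inst in REGISTRY:
        if not inst.valid_from <= year <= inst.valid_to:
            continue
        if any(re.search(p, normalized, re.IGNORECASE) for p in inst.patterns):
            hits.append(inst.inst_id)
    return hits


print(assign("Univ Bielefeld, D-33501 Bielefeld, Germany", 2010))
print(assign("Univ Karlsruhe, Karlsruhe, Germany", 2005))
print(assign("Karlsruhe Inst Technol, Karlsruhe, Germany", 2011))
```

An address that matches no pattern returns an empty list, which is where manual checking, geocoding or new patterns would come in.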
Institutional Identifiers Identifiers of the project are currently being mapped to “Research Explorer”, the research directory of the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) and of the German Academic Exchange Service (DAAD) in cooperation with the German Rectors' Conference (HRK).
Institutional Identifier (I2) Initiatives are underway, but it will take several years to bring such standards into operation on a broad scale (as can be seen from the case of the author identifier initiative – ORCID)
Processing institutional data: How to count? Challenges coming from international clinical trials and from high energy physics: Hundreds of different addresses from thousands of authors on a single publication – how to attribute publication (and citation) counts in the right way?
Counting methods
• Complete (C): each basic unit gets 1 credit
• Complete-normalized (CN): all the basic units in a publication share 1 credit
• Straight (S): the first basic unit gets 1 credit
• Whole (W): each unique basic unit gets a credit of 1
• Whole-normalized (WN): all the unique basic units share 1 credit
Basic units can be countries, organisations, institutes. Normalized counts are often called “fractional”.
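The five methods can be compared on a single publication with a short sketch. It assumes that the basic units are given as an ordered list with one entry per address, so that a unit named in several addresses appears several times; that reading of “each basic unit”, the function name `credits` and the sample data are assumptions made for this illustration.

```python
from collections import Counter
from fractions import Fraction


def credits(units: list, method: str) -> dict:
    """Credit per basic unit for one publication.

    units  -- ordered list of basic units (e.g. country codes), one entry
              per address, repeats allowed
    method -- "C", "CN", "S", "W" or "WN"
    """
    if not units:
        return {}
    occurrences = Counter(units)   # how often each unit appears
    unique = list(occurrences)     # unique units, in order of first appearance
    if method == "C":    # every occurrence earns 1 credit
        return {u: Fraction(n) for u, n in occurrences.items()}
    if method == "CN":   # all occurrences together share 1 credit
        return {u: Fraction(n, len(units)) for u, n in occurrences.items()}
    if method == "S":    # only the first unit earns credit
        return {units[0]: Fraction(1)}
    if method == "W":    # every unique unit earns 1 credit
        return {u: Fraction(1) for u in unique}
    if method == "WN":   # unique units together share 1 credit
        return {u: Fraction(1, len(unique)) for u in unique}
    raise ValueError(f"unknown counting method: {method}")


# Hypothetical paper with four addresses: two German, one Danish, one Swiss.
paper = ["DE", "DE", "DK", "CH"]
for m in ("C", "CN", "S", "W", "WN"):
    print(m, {u: str(c) for u, c in credits(paper, m).items()})
```

For this paper Germany receives 2, 1/2, 1, 1 and 1/3 credits under C, CN, S, W and WN respectively, which shows how strongly the choice of method shapes the result.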
No “gold standard” for counting Small units may be favoured by whole counting (W) – but other effects make it difficult to give a general rule here. “40 years of publication counting have not resulted in general agreement on definitions of methods and terminology nor in any kind of standardization.” (Larsen, P. O.: The state of the art in publication counting, Scientometrics, 77(2), 2008, 235-251)