Citations, Bibliometrics, Links and PageRank • Thanks to Wolfgang Glänzel, Ray Mooney, Scott White, Bill Arms, Michael Nelson
Outline • Bibliometrics • Citations and link analysis • Link analysis and the web
What is bibliometrics? • The terms bibliometrics and scientometrics were almost simultaneously introduced by Pritchard and by Nalimov & Mulchenko in 1969. • According to Pritchard bibliometrics is • “the application of mathematical and statistical methods to books and other media of communication”. • Nalimov and Mulchenko defined scientometrics as • “the application of those quantitative methods which are dealing with the analysis of science viewed as an information process”. • The two terms have become almost synonyms; nowadays, the field informetrics (Gorkova, 1988) stands for a more general subfield of information science dealing with mathematical-statistical analysis of communication processes in science.
Facts about bibliometrics • Bibliometrics has evolved into a standard tool of science policy and research management. • Uses an array of indicators to measure and to map research activity and its progress. • Science indicators relying on comprehensive publication and citation statistics and other, more sophisticated bibliometric techniques are used in science policy and research management. • There is a growing, often controversial, policy interest in using bibliometric techniques to measure research productivity and efficiency. • Has mostly ignored computational metrics. • Highly dependent on the ISI database; does little data mining. • Bibliometric research is not that well cited!
Bibliometric research areas • Bibliometrics encompasses subareas such as structural, dynamic and evaluative scientometrics. • Structural scientometrics: the re-mapping of the epistemological structure of science. • Dynamic scientometrics: models of scientific growth, obsolescence, citation processes, etc. • Evaluative scientometrics: indicators to be used to characterise research performance at different levels of aggregation.
What is bibliometrics dealing with and what can bibliometrics not be responsible for? • Bibliometrics can be used to develop and provide tools to be applied to research evaluation, but is not yet designed to evaluate research results. • Bibliometrics tries to combine qualitative methods and quantitative approaches. • Bibliometrics has so far not been designed to correct or even substitute for peer review or evaluation by experts.
The three “components” of present-day bibliometrics according to its three main target-groups • Bibliometrics for bibliometricians (Methodology) • This is the domain of bibliometric “basic research”. • Bibliometrics for scientific disciplines (Scientific information) • A large but also the most diverse interest-group in bibliometrics. Due to the scientists’ primary scientific orientation, their interests are strongly related to their speciality. Here we also find a joint borderland with quantitative aspects of information retrieval. • Bibliometrics for science policy and management (Science policy) • At present the most important topic in the field. Here the national, regional, and institutional structures of science and their comparative presentation are in the foreground.
Data sources of bibliometric research and technology • Data sources of bibliometrics are bibliographies and bibliographic databases. Large scale analyses can only be based on bibliographic databases. • Prominent specialised databases are, e.g., Medline, Chemical Abstracts, INSPEC and Mathematical Reviews in the sciences and, e.g., Econlit, Sociological Abstracts and Humanities Abstracts in the social sciences and humanities. • Disadvantage: lack of reference literature, incomplete address recording. • The databases of the Institute for Scientific Information (Thomson - ISI) and the Science Citation Index (Expanded) have become heavily used sources of bibliometrics. • Free online sources: Google Scholar, Arxiv, CiteSeer, etc.
Problems/advantages of ISI • Problems • Not public - fee based • Journal oriented • User unfriendly • No API • Advantages: • Multidisciplinarity • Selectiveness • Completeness of addresses • Full coverage • Bibliographical references • Disadvantage: no individual subject classification for papers available.
Elements, units and measures of bibliometric research • Elements are, e.g., publications, (co‑)authors, references and citations. • Basic units in bibliometrics are usually not further subdivided. • Publications can be assigned to the journals in which they appeared, through the corporate addresses of their authors to institutions or countries, references and citations to subject categories, and so on. • Units are specific sets of elements, e.g., journals, subject categories, institutions, regions and countries to which elements can – not necessarily uniquely – be assigned. The clear definition of the assignment – or in mathematical parlance – of mappings between elements and units allows the application of mathematical models. • Accuracy of units usually not addressed • Entity disambiguation
Publication activity and authorship • Publication activity is influenced by several factors. At the micro level, we can distinguish the following four factors: • the subject matter • the author’s age • the author’s social status • the observation period • The publication activity in theoretical fields (e.g., mathematics) and in engineering is lower than in experimental fields or in the life sciences. • Cross-field comparison, without appropriate normalisation, would not be valid. This applies above all to comparative analyses at the meso level (universities and departments).
The notion of citations in information science and bibliometrics • Citations became a widely used measure of the impact of scientific publications. • Cozzens: “Citation is only secondarily a reward system. Primarily, it is rhetorical - part of persuasively arguing for the knowledge claims of the citing document.” • L. C. Smith: “citations are signposts left behind after information has been utilized”. • Cronin: Citations are “frozen footprints in the landscape of scholarly achievement … which bear witness to the passage of ideas”. • Citations are “one important form of use of scientific information within the framework of documented science communication”. Although citations cannot describe the totality of the reception process, they give “a formalised account of the information use and can be taken as a strong indicator of reception at this level.” • Westney: “Despite its flaws, citation analysis has demonstrated its reliability and usefulness as a tool for ranking and evaluating scholars and their publications. No other methodology permits such precise identification of the individuals who have influenced thought, theory, and practice in world science and technology.” • Garfield and Weinstock have listed 15 different reasons for giving citations to others’ work.
The process of re-interpreting the notion of citation and its consequences • Interpretation of citation in bibliometrics/information science (signpost of information use) • uncitedness: unused information • frequent cite: good reception • self-cite: part of scientific communication • Re-interpretation in research evaluation/science policy (rewarding system/quality measure) • uncitedness: low quality • frequent cite: high quality • self-cite: manipulation of impact • Repercussion: possible distortion of citation behaviour
Citations as Metrics of Scientific Contribution • Introduced in the 60’s with Garfield’s Science Citation Index • Manually done • Recently, made widely accessible through automatic clustering, linking, indexing - CiteSeer, Google Scholar, more to come! • Citation counting and journal impact factor quickly became commonplace • Widely accepted among scientists/scholars as a useful measure, but not the only one • Lack of citations indicative of problems • Issues of orders of magnitude versus disciplines
Early Citation Analysis • Citation analysis literature exploded • At least 2000 pubs by 1980 - Hjerppe, 1980 • Now over 10,000 • Not only tallies and trends • Analyzing document networks (White & McCain, 1998) • Financial impact of citations to authors (Kenny & Studley, 1995; Diamond, 1986; Sauer, 1988) • Science mapping (Hummon, 1989; Kustoff, 2001) • Small incomplete data sets • A few thousand documents • Manual data collection
Arguments against citations By poorly cited authors • Hicks and Potter, 1991 • “Citations form just a thin but glistening bank, sandwiched between the rock of eons. And it is this highly limited, highly unrepresentative, yet alluringly available bank of rock that the ISI has fetishized and turned into a highly desirable and marketable commodity.” • Anderson, 1992 • glass beads in a game that values glass beads
A Critical Look at Citations • MacRoberts & MacRoberts (M&M), 1989: six categories of problems • Omission • Bias • Ignored complexity • Variation across domains • Technical issues of citation indices • Incomplete accounting Much of the above can be accounted for using computational methods
Omission and Bias • Omission of formal influences • Rare according to Cole & Cole, 1972; Garfield, 1980 • M&M found 70% omission rate, 36% best • Bias • Matthew, halo effects • Obliteration • In-house bias • M&M found no correlation between freq. idea use and citation freq. • Modern internet effects - Lawrence, 2001 (Online or Invisible)
Ignored Complexity • Ignored complexity (semantics) in citation types • Chubin & Moitra, 1975 • Affirmative • Basic, subsidiary, additional, perfunctory • Negatory • Partial vs. total • Moravcsik & Murugesan, 1975 • Conceptual vs. operational • Organic vs. perfunctory • Evolutional vs. Juxtapositional • Confirmatory vs. negatory • Only small fraction negatory • Gilbert, 1977 points out these schemes are hard to implement without info re: author intention • Lagoze disagrees (paper in reviews)
Citation Indexing • Premise: authors already use citations… use these to navigate the corpus • previously, subject indexes (and later, title indexes) were hand crafted to organize the literature • using citations, the literature organizes itself • each paper contains ~ 15 citations (p. 2) • some more (e.g., biochemistry), some less (e.g. mathematics) (p. 248) • spans time, disciplines, terminology fads
Immediacy Index • The frequency of citations for research front literature is a measure of the “hardness” of a field (Price, 1970, quoted on p. 72) • Physics, Biochemistry: 60-70% • Radiology: 54-58% • Sociology: 46% • Literature: 10% • Citations are temporally dependent… research idea: do DLs impact immediacy index? • [Figure: timeline running from “5 years ago” to “now”]
What is (and is not) being cited? • ~ 25% of scientific papers not cited once (p. 240, citing Koshy, 1976) • Of papers that are cited, the average citation rate / year is 1.7 (p. 240, citing Garfield, 1977) • [Figure 9.7 in Lesk, from http://community.bellcore.com/lesk/columbia/session13/]
Are All Citations Good? • Self-citations: • citing yourself is not nearly as important as someone else citing your work… • but on the other hand, most work builds on previous work • estimated 10-20% of citations are self-citations (p. 149)
Are All Citations Good? • Negative citations: • “Smith & Jones (1987) were raging idiots, and they failed miserably” • “We have clearly shown our method to be superior to those of Harrison (1993), Gupta (1991), and Kim (1997)” • On the other hand… • so many papers languish uncited… if someone took the time to refute your paper, that could mean your paper was worth refuting in the first place • Very few negative citations in practice • Poor quality work ignored
Are All Citations Good? • Sloppy scholarship • write your paper, take a similar paper, copy most of their citations without considering their applicability • Ulterior motives • citing your (non co-author) friends • citing your boss, likely reviewers, etc.
Citations About Citations • Brooks, “Evidence of complex citer motivations”, JASIS 37(1), 34-36. • http://dx.doi.org/10.1002/(SICI)1097-4571(198601)37:1%3C34::AID-ASI5%3E3.0.CO;2-0 • Kim, “Motivations for hyperlinking in scholarly electronic articles: A qualitative study”, JASIS 51(10), 887-889 • http://dx.doi.org/10.1002/1097-4571(2000)51:10<887::AID-ASI20>3.0.CO;2-1 • MacRoberts & MacRoberts, “The Negational Reference: Or the Art of Dissembling”, Social Studies of Science, 14(1), 91-94 • (not online) • Work that says citations should not be used as a measure is not well cited !
Modern Citation Analysis • Computational not just quantitative. • Everything digital can be indexed and analyzed • Authors • Citations • Affiliations • Venue (conferences, journals, books, grey literature) • Discipline • Visualization • Other contributions • Federated metadata - Google Scholar • Effects of scale
Technical Issues • Variation across domains • Different citation behavior • Technical issues • Author name disambiguation • Sampling • Errors • Multiple or hyper-authorship • Incomplete accounting • Citation indexing ignores informal acknowledgement of contributions!
Modern Academic Document Analysis • Documents are digital • Automatic extraction of • Authors • Citations • Affiliations • Venue • Discipline • Visualization • Other contributions • Algorithms • Use of existing metadata • Information retrieval, graphs/networks • Heuristics • Machine learning
CiteSeer - Grouping identical citations (Automated Citation Indexing) • Citations can be written in very different formats • Rosenblatt F. (1961). Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Spartan Books, Washington, D.C. • [97] Rosenblatt, F. (1962). Principles of Neurodynamics. Washington, DC: Spartan. • [Ros62] F. Rosenblatt. Principles of Neurodynamics. Spartan Books, 1962. • All authors or first author et al. • All subfields can contain errors • Punctuation is not used consistently • Commas used to separate fields, but may also occur in the title • Sometimes there is no punctuation at all between fields • Broad classes of methods for grouping identical citations • Edit distance measures, word occurrence or word frequency measures, structure (subfields, e.g. title, author, etc.), machine learning • CiteSeer uses a regular expression, subfield method • Need a fast, efficient algorithm for reindexing millions of citations. • Subfield algorithm - regular expression • Syntactic structure utilized • Subfields follow • Check with database • (Giles, Bollacker, Lawrence, DL98)
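One of the method classes listed above, word occurrence, can be illustrated with a minimal Python sketch. This is not CiteSeer's regular expression/subfield algorithm; the function names, the overlap score and the 0.6 threshold are assumptions chosen only for illustration.

```python
import re

def normalize(citation):
    """Lowercase, drop bracketed tags and punctuation, return the set of words."""
    text = re.sub(r"\[[^\]]*\]", " ", citation.lower())  # drop tags like [97] or [Ros62]
    text = re.sub(r"[^a-z0-9 ]", " ", text)              # punctuation is inconsistent, so ignore it
    return {w for w in text.split() if len(w) > 2}       # skip initials and very short words

def word_overlap(a, b):
    """Shared words divided by the size of the shorter citation (hypothetical score)."""
    wa, wb = normalize(a), normalize(b)
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / min(len(wa), len(wb))

def group_citations(citations, threshold=0.6):
    """Greedy grouping: attach each citation to the first group whose
    representative (first member) overlaps enough, else start a new group."""
    groups = []
    for c in citations:
        for g in groups:
            if word_overlap(c, g[0]) >= threshold:
                g.append(c)
                break
        else:
            groups.append([c])
    return groups

examples = [
    "Rosenblatt F. (1961). Principles of Neurodynamics: Perceptrons and the "
    "Theory of Brain Mechanisms. Spartan Books, Washington, D.C.",
    "[97] Rosenblatt, F. (1962). Principles of Neurodynamics. Washington, DC: Spartan.",
    "[Ros62] F. Rosenblatt. Principles of Neurodynamics. Spartan Books, 1962.",
    "Kessler, M. M. (1963). Bibliographic coupling between scientific papers.",
]
groups = group_citations(examples)
```

With these inputs the three Rosenblatt variants fall into one group and the Kessler entry into another. A greedy pass like this compares every citation against every group, so it would likely need blocking or indexing before it could handle millions of citations.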
Citationology • 100K journals with an avg of 100 papers = 10 M papers with 20 citations • Scholarly books and conferences much less in terms of numbers • 200 M total citations per annum • # of unique citations = a (# of papers in one year, on average) • Over 10 years, 2G citations; all indexable and information retrieval tractable. • Gig of citation metadata easily searchable and indexable (terabytes)
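The back-of-the-envelope estimate on this slide can be checked with a few lines of Python. The figures are the slide's own round numbers; the variable names are just labels for this sketch.

```python
# Round figures taken from the slide (rough estimates, not measured data).
journals = 100_000
papers_per_journal = 100        # average papers per journal per year
citations_per_paper = 20        # average reference-list length

papers_per_year = journals * papers_per_journal              # 10 M papers
citations_per_year = papers_per_year * citations_per_paper   # 200 M citations
citations_per_decade = citations_per_year * 10               # 2 G citations

print(f"{papers_per_year:,} papers/year")
print(f"{citations_per_year:,} citations/year")
print(f"{citations_per_decade:,} citations over 10 years")
```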
Citation Measurements • Can measure citations of: • individual authors • citation counts • individual papers • citation counts • also -- co-citations of papers: papers A & B have a relationship if cited by paper C, paper D, etc. • journals • citation counts • normalize for journal size, age, frequency of publication? • institutions • fields • Many measures, hybrid measures, alternate weightings have been proposed (trans: I’m not going to look up all the refs – MLN)
Modern computational citation methods • New measures - total counts • Venues • Institutions • Departments • Disciplines/subfields
Temporal metrics • Author document distributions • One shot wonder vs continued contributions • Citations over time • Citation half life
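As a rough illustration of a temporal metric, the sketch below computes a per-paper citation half-life under one simple reading: the number of years after publication by which at least half of the observed citations had arrived. This definition and the function name are assumptions; journal-level half-life figures published by citation indexes are defined differently.

```python
def citation_half_life(publication_year, citing_years):
    """Smallest age (in years) by which at least half of the observed
    citations had been received. Returns None if there are no citations."""
    if not citing_years:
        return None
    ages = sorted(year - publication_year for year in citing_years)
    half = (len(ages) + 1) // 2       # number of citations that makes up "half"
    return ages[half - 1]

# Hypothetical paper published in 2000 and cited in the years listed.
half_life = citation_half_life(2000, [2001, 2001, 2002, 2003, 2006, 2010])
```

In this made-up example three of the six citations arrive within two years of publication, so the half-life is 2.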
Citation Graph • [Figure: a paper with edges labeled “cites” and “is cited by”] • Note that journal citations always refer to earlier work.
Graphical Analysis of Hyperlinks on the Web • [Figure: directed graph of pages numbered 1 to 6] • This page links to many other pages (hub) • Many pages link to this page (authority)
Bibliometrics: Citation Analysis • Many standard documents include bibliographies (or references), explicit citations to other previously published documents. • Using citations as links, standard corpora can be viewed as a graph. • The structure of this graph, independent of content, can provide interesting information about the similarity of documents and the structure of information. • Impact of paper!
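Viewing a corpus as a graph can be made concrete with a small Python sketch. The four papers and their reference lists are invented for illustration, and the dictionary-of-lists representation is only one possible choice.

```python
from collections import Counter

# Hypothetical corpus: each paper maps to the papers in its bibliography.
cites = {
    "A": ["C", "D"],
    "B": ["C"],
    "C": ["D"],
    "D": [],
}

def citation_counts(cites):
    """Count incoming edges (times cited) for every paper in the graph."""
    counts = Counter({paper: 0 for paper in cites})
    for paper, references in cites.items():
        for ref in references:
            counts[ref] += 1
    return dict(counts)

counts = citation_counts(cites)
```

Here the graph structure alone, with no document content, shows that C and D are each cited twice while A and B are uncited.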
Impact Factor • Developed by Garfield in 1972 to measure the importance (quality, influence) of scientific journals. • Measure of how often papers in the journal are cited by other scientists. • Computed and published annually by the Institute for Scientific Information (ISI). • The impact factor of a journal J in year Y is the average number of citations (from indexed documents published in year Y) to a paper published in J in year Y-1 or Y-2. • Does not account for the quality of the citing article.
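The definition above can be written as a short calculation. This is a sketch with made-up numbers; the function and parameter names are assumptions, and it ignores the question of which document types count as citable.

```python
def impact_factor(cites_to_prev1, cites_to_prev2, papers_prev1, papers_prev2):
    """Average number of citations received in year Y by papers that the
    journal published in the two preceding years.

    cites_to_prev1 / cites_to_prev2: citations in year Y to the journal's
        papers from the previous year and from two years before.
    papers_prev1 / papers_prev2: number of papers the journal published
        in those two years.
    """
    papers = papers_prev1 + papers_prev2
    if papers == 0:
        raise ValueError("journal published no papers in the two-year window")
    return (cites_to_prev1 + cites_to_prev2) / papers

# Hypothetical journal: 100 papers in each of the two preceding years,
# cited 150 and 250 times respectively during year Y.
jif = impact_factor(150, 250, 100, 100)
```

For these invented figures the result is 400 citations over 200 papers, an impact factor of 2.0.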
Bibliographic Coupling • Measure of similarity of documents introduced by Kessler in 1963. • The bibliographic coupling of two documents A and B is the number of documents cited by both A and B. • Size of the intersection of their bibliographies. • Maybe want to normalize by size of bibliographies?
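A minimal sketch of bibliographic coupling as a set intersection follows. The normalization shown (dividing by the size of the union of the two bibliographies) is just one possible answer to the question on the slide, not a prescribed one, and the reference lists are invented.

```python
def bibliographic_coupling(refs_a, refs_b):
    """Number of documents cited by both A and B."""
    return len(set(refs_a) & set(refs_b))

def normalized_coupling(refs_a, refs_b):
    """Coupling divided by the size of the combined bibliography
    (an assumed normalization; others are possible)."""
    union = set(refs_a) | set(refs_b)
    if not union:
        return 0.0
    return len(set(refs_a) & set(refs_b)) / len(union)

# Hypothetical bibliographies of documents A and B.
refs_a = ["P", "Q", "R"]
refs_b = ["Q", "R", "S", "T"]

coupling = bibliographic_coupling(refs_a, refs_b)
normalized = normalized_coupling(refs_a, refs_b)
```

A and B share two references (Q and R) out of five distinct ones, giving a coupling of 2 and a normalized value of 0.4.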
Co-Citation • An alternate citation-based measure of similarity introduced by Small in 1973. • Number of documents that cite both A and B. • Maybe want to normalize by total number of documents citing either A or B?
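Co-citation can be sketched the same way, this time scanning the bibliographies of the citing documents. The corpus below is invented, and the function names are assumptions; the normalized variant follows the suggestion on the slide.

```python
def co_citation(cites, a, b):
    """Number of documents whose bibliography contains both a and b."""
    return sum(1 for refs in cites.values() if a in refs and b in refs)

def normalized_co_citation(cites, a, b):
    """Co-citation count divided by the number of documents citing a or b."""
    either = sum(1 for refs in cites.values() if a in refs or b in refs)
    if either == 0:
        return 0.0
    return co_citation(cites, a, b) / either

# Hypothetical corpus: citing document -> documents it cites.
cites = {
    "C": ["A", "B"],
    "D": ["A", "B"],
    "E": ["A"],
    "F": ["B", "G"],
}

count = co_citation(cites, "A", "B")
ratio = normalized_co_citation(cites, "A", "B")
```

Documents C and D cite both A and B, while E and F cite only one of them, so the count is 2 and the normalized value is 0.5.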
Autonomous Citation Indexing • What if… • there was not so much manual labor invested in finding, normalizing, and entering citations? • the source input was not limited to journals? • citation indexing could be integrated with other, non-citation oriented services?
Autonomous Citation Indexing • Implemented in the CiteSeer prototype • http://citeseer.ist.psu.edu/ • Other models: Rexa, Cora • developed by Bollacker, Giles, Lawrence at the NEC Research Institute in NJ (1997) • now run @ PSU • Limited open source • PSU • Mirrors: MIT, Zurich, NUS • focused on computer science • but could be applied to any discipline
How ACI Works • Source acquisition • look for reports, pre-prints, e-prints, etc. at “well-known locations” (other DLs, departmental servers, etc.) • requires nothing of publishing site (cf. OAI model) • author submissions • download the PS, PDF, DVI, TeX, etc. source files • Extract citations • use the PS, PDF files directly as the input • automatically finds and parses citations • apply heuristics for determining different citation styles • maintain pointers to remote copies of the reports (and cache local copies for persistence) • create linkages between papers • link to paper if you have it online, otherwise just list the citation
CiteSeer demo • Go to Google
Impact of ACI/CiteSeer • Public information of citation rankings • Heavily used • Google Scholar • Ease of automated system • Errors in classification