Automatic Term Identification for Bibliometric Mapping. Nees Jan van Eck, Ludo Waltman, Erasmus University Rotterdam, The Netherlands {nvaneck,lwaltman}@few.eur.nl. Ed Noyons, Renald Buter, Centre for Science and Technology Studies, Leiden University, The Netherlands
Automatic Term Identification for Bibliometric Mapping • Nees Jan van Eck, Ludo Waltman, Erasmus University Rotterdam, The Netherlands {nvaneck,lwaltman}@few.eur.nl • Ed Noyons, Renald Buter, Centre for Science and Technology Studies, Leiden University, The Netherlands {noyons,buter}@cwts.leidenuniv.nl • 10th International Conference on Science and Technology Indicators, Vienna, September 18, 2008
Research problem • Important authors or journals in a field can be identified relatively easily based on number of citations (i.e., frequency of occurrence in reference lists) • Identifying important terms based on frequency of occurrence gives poor results, because many of the most frequent terms are very general • Terms are therefore usually identified manually based on expert judgment. This has the disadvantage of being subjective and labor-intensive • We propose a method for (semi-)automatic term identification
Method (1) • General overview of the proposed method: corpus → Step 1: Calculation of unithood → linguistic units → Step 2: Calculation of termhood → terms • Step 1 involves: • part-of-speech tagging • lemmatizing (stemming) • identifying noun phrases (linguistic filter) • identifying linguistic units (statistical filter; Dunning, 1993) • Step 1 results in a list of linguistic units (noun phrases) that may or may not be terms
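The statistical filter in Step 1 cites Dunning (1993), whose log-likelihood ratio is commonly used to test whether two words co-occur more often than chance. The following is a minimal sketch of that statistic for two-word candidates, not the authors' implementation. The function names `log_likelihood_ratio` and `bigram_scores` and the restriction to adjacent word pairs are assumptions made for illustration; the slides do not say how longer noun phrases are scored.

```python
import math
from collections import Counter


def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 contingency table of bigram counts.

    k11: occurrences of (w1, w2)
    k12: occurrences of w1 followed by a word other than w2
    k21: occurrences of w2 preceded by a word other than w1
    k22: all remaining bigrams
    """
    total = k11 + k12 + k21 + k22
    row_sums = (k11 + k12, k21 + k22)
    col_sums = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a cell with zero count contributes nothing
                expected = row_sums[i] * col_sums[j] / total
                g2 += o * math.log(o / expected)
    return 2.0 * g2


def bigram_scores(tokens):
    """Score every adjacent word pair in a token list (hypothetical helper)."""
    pairs = list(zip(tokens, tokens[1:]))
    pair_counts = Counter(pairs)
    first_counts = Counter(a for a, _ in pairs)
    second_counts = Counter(b for _, b in pairs)
    n = len(pairs)
    scores = {}
    for (a, b), k11 in pair_counts.items():
        k12 = first_counts[a] - k11
        k21 = second_counts[b] - k11
        k22 = n - k11 - k12 - k21
        scores[(a, b)] = log_likelihood_ratio(k11, k12, k21, k22)
    return scores


tokens = "linear programming model for linear programming problem".split()
scores = bigram_scores(tokens)
best = max(scores, key=scores.get)
```

A high score only says that the pair deviates from independence, so in practice the statistic is applied to candidates that have already passed the linguistic (noun phrase) filter.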
Method (2) • Step 2 is based on the following idea: A linguistic unit whose occurrences in a corpus of scientific texts are biased toward one or more topics is likely to refer to a domain-specific concept and, consequently, to be a term • [Example figure]
Method (3) • How can different topics be identified in a corpus of scientific texts? • We use a statistical latent class model called probabilistic latent semantic analysis (PLSA; Hofmann, 2001) • PLSA provides a kind of fuzzy clustering of the linguistic units occurring in a corpus • Each cluster corresponds to a topic
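To make the fuzzy clustering concrete, here is a rough sketch of PLSA fitted with the EM algorithm in its symmetric formulation, P(d, u) = Σ_z P(z) P(d|z) P(u|z). It assumes that the input is a table of (text, linguistic unit) occurrence counts; the slides do not state exactly which co-occurrence data the authors feed to PLSA, so that choice, along with the names `plsa` and `topic_given_unit`, is an assumption.

```python
import random


def plsa(counts, n_topics, n_iter=100, seed=0):
    """Fit PLSA by EM. counts maps (doc, unit) -> occurrence count."""
    rng = random.Random(seed)
    docs = sorted({d for d, _ in counts})
    units = sorted({u for _, u in counts})
    topics = range(n_topics)

    def random_dist(keys):
        # Random starting distribution; breaks the symmetry between topics.
        weights = {k: rng.random() + 1e-3 for k in keys}
        total = sum(weights.values())
        return {k: w / total for k, w in weights.items()}

    p_z = [1.0 / n_topics] * n_topics
    p_d_z = [random_dist(docs) for _ in topics]
    p_u_z = [random_dist(units) for _ in topics]

    for _ in range(n_iter):
        new_z = [0.0] * n_topics
        new_d = [dict.fromkeys(docs, 0.0) for _ in topics]
        new_u = [dict.fromkeys(units, 0.0) for _ in topics]
        for (d, u), n in counts.items():
            # E-step: posterior P(z | d, u) for this (doc, unit) pair.
            post = [p_z[z] * p_d_z[z][d] * p_u_z[z][u] for z in topics]
            norm = sum(post)
            if norm == 0.0:
                continue
            for z in topics:
                resp = n * post[z] / norm
                new_z[z] += resp
                new_d[z][d] += resp
                new_u[z][u] += resp
        # M-step: renormalize the expected counts.
        total = sum(new_z)
        for z in topics:
            if new_z[z] > 0.0:
                p_d_z[z] = {d: v / new_z[z] for d, v in new_d[z].items()}
                p_u_z[z] = {u: v / new_z[z] for u, v in new_u[z].items()}
            p_z[z] = new_z[z] / total
    return p_z, p_d_z, p_u_z


def topic_given_unit(p_z, p_u_z, unit):
    """P(z | unit): the fuzzy cluster membership of one linguistic unit."""
    joint = [p_z[z] * p_u_z[z][unit] for z in range(len(p_z))]
    total = sum(joint)
    return [j / total for j in joint]


# Toy counts (made up for illustration): two texts per topic, plus the
# general unit "model", which occurs in all four texts.
counts = {
    ("t1", "integer programming"): 3, ("t1", "model"): 1,
    ("t2", "integer programming"): 2, ("t2", "model"): 1,
    ("t3", "queueing network"): 3, ("t3", "model"): 1,
    ("t4", "queueing network"): 2, ("t4", "model"): 1,
}
p_z, p_d_z, p_u_z = plsa(counts, n_topics=2)
membership = topic_given_unit(p_z, p_u_z, "model")
```

The distribution returned by `topic_given_unit` is what makes the clustering fuzzy: a unit is not assigned to a single topic but spread over all of them.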
Method (4) • The termhood of a linguistic unit is determined using an entropy-like criterion
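The exact criterion is not reproduced in this transcript. As an illustration of what an entropy-like termhood score could look like, the sketch below uses one minus the normalized Shannon entropy of a unit's topic distribution P(z | unit). This particular formula and the name `termhood` are assumptions, not the authors' definition.

```python
import math


def termhood(topic_probs):
    """Illustrative score in [0, 1]: 1 - H(P(z | unit)) / log(K).

    0 means the unit is spread evenly over all K topics (very general);
    1 means all its occurrences belong to a single topic.
    """
    k = len(topic_probs)
    if k < 2:
        raise ValueError("need at least two topics")
    entropy = -sum(p * math.log(p) for p in topic_probs if p > 0)
    return 1.0 - entropy / math.log(k)


general_unit = termhood([0.25, 0.25, 0.25, 0.25])   # evenly spread
specific_unit = termhood([0.7, 0.1, 0.1, 0.1])      # biased toward one topic
```

Under this assumed score, units whose occurrences are biased toward one or a few topics rank higher than units that occur evenly across topics, which matches the idea behind Step 2.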
Application • The proposed method is used to construct a term map of the operations research (OR) field • The map is based on 7492 abstracts of papers published in OR journals between 2001 and 2005 • A two-step approach is taken: • First, terms are identified using the proposed method • Second, the relations between terms are visualized using the VOS method • The proposed method is evaluated in two ways: • Evaluation of the terms based on the criteria of precision and recall • Evaluation of the term map based on a survey among OR experts
Precision and recall • The proposed method (‘PLSA’) outperforms both a simple variant without PLSA (‘No PLSA’) and a naïve method based on frequency of occurrence (‘Frequency’)
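For reference, precision and recall of an identified term list against a manually verified list can be computed as sketched below. The function name `precision_recall` and the example term lists are made up for illustration and are not the evaluation data from the study.

```python
def precision_recall(identified, gold):
    """Precision and recall of identified terms against a gold-standard list."""
    identified, gold = set(identified), set(gold)
    true_positives = len(identified & gold)
    precision = true_positives / len(identified) if identified else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical lists: four identified units, three verified terms.
identified = ["linear programming", "queueing network", "paper", "result"]
gold = ["linear programming", "queueing network", "supply chain"]
precision, recall = precision_recall(identified, gold)
```

Precision is the share of identified units that are real terms, and recall is the share of real terms that were identified, so a method that outputs many general units loses precision even when its recall is high.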
Survey • So far, 3 OR experts have responded (2 assistant professors and 1 full professor)
Conclusions • The results of the proposed method for (semi-)automatic term identification seem promising • For accurate results, manual verification of the identified terms remains necessary • The proposed method should be seen as a first step toward more accurate term maps for science policy decision making