Studying the History of Ideas Using Topic Models. D. Hall, D. Jurafsky, & C. D. Manning. Stanford University. EMNLP 2008. Agenda. Introduction. Methodology. Historical trends in computational linguistics. Is computational linguistics becoming more applied?
Studying the History of Ideas Using Topic Models D. Hall, D. Jurafsky, & C. D. Manning Stanford University EMNLP 2008
Agenda • Introduction • Methodology • Historical trends in computational linguistics • Is computational linguistics becoming more applied? • Differences and similarities among COLING, ACL, and EMNLP • Conclusion
Goal • Identify and study the exploration of ideas in a scientific field over time. • Periods of gradual development. • Major ruptures. • Waxing and waning of both topic areas and connections with applied topics and nearby fields.
Change of ideas • Rather than dealing with papers or authors, this paper focuses on the change of ideas in a field over time. • Apply Kuhn’s insight that vocabulary and vocabulary shift are crucial indicators of ideas and shifts in ideas. • Operationalize this insight with the unsupervised topic model Latent Dirichlet Allocation, LDA (Blei et al. 2003).
Analyzing the trends in CL • 12,500 documents of the ACL Anthology have been analyzed. • Has the CL field gotten more theoretical or more applied? • Which topics have declined over the years, and which ones have remained constant? • How have fields like Dialogue or MT changed over the years? • Are there differences among the conferences?
ACL Anthology • A public repository of all papers in the major journals, conferences, and workshops. • Computational Linguistics. • ACL, COLING, EMNLP, and so on. • Comprises over 14,000 documents. • From 1965 to 2008. • Indexed by conference and year. • Used as the basis of citation analysis work. (Joseph & Radev, 2007)
Latent Dirichlet Allocation (LDA) • A generative latent variable model that treats documents as bags of words generated by one or more topics. • Each document is represented as a multinomial distribution over topics. • Each topic is in turn characterized by a multinomial distribution over words. • Parameter estimation using collapsed Gibbs sampling (Griffiths & Steyvers, 2004)
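The model and the collapsed Gibbs sampling step named on this slide can be illustrated with a small sketch. This is not the authors' implementation: the function name `lda_gibbs`, the symmetric hyperparameters `alpha` and `beta`, and the iteration count are assumptions chosen for illustration. Documents are assumed to be lists of integer word ids.

```python
import random

def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01,
              iters=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative, not optimized).

    docs: list of documents, each a list of integer word ids.
    Returns (assignments, doc_topic, topic_word) where the last two are
    count tables n(d, k) and n(k, w).
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Random initialization of the topic assignment of every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current token from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional p(z_i = j | all other assignments).
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][w] + beta)
                    / (topic_total[j] + vocab_size * beta)
                    for j in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                # Add the token back under its newly sampled topic.
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return assignments, doc_topic, topic_word
```

Each sweep resamples one token's topic at a time from its conditional distribution, with the document-level and topic-level multinomials integrated out; that integration is what "collapsed" refers to.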
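Topic Modeling • The empirical probability that an arbitrary paper d written in year y was about topic z: $p(z \mid y) = \sum_{d : t_d = y} p(z \mid d)\, p(d \mid y) = \frac{1}{C} \sum_{d : t_d = y} \sum_{z'_i \in d} I(z'_i = z)$ • I is the indicator function, t_d is the year document d was written, and p(d|y) is set to the constant 1/C.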
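The per-year topic estimate described on this slide can be sketched as a simple counting step over sampled topic assignments. The slide does not define C; the sketch below assumes C is the number of topic-assigned tokens in year y, so that the values for each year sum to one. The function name `topic_by_year` and its inputs are hypothetical.

```python
from collections import defaultdict

def topic_by_year(assignments, years, num_topics):
    """Estimate p(z | y) by counting topic assignments per year.

    assignments: one list of topic ids per document (one id per token).
    years: the year t_d of each document, aligned with assignments.
    Assumption: C is the total token count for the year.
    """
    counts = defaultdict(lambda: [0] * num_topics)
    for zs, year in zip(assignments, years):
        for z in zs:
            counts[year][z] += 1  # the indicator I(z'_i = z) summed over tokens

    result = {}
    for year, row in counts.items():
        total = sum(row)  # plays the role of C under the stated assumption
        result[year] = [c / total for c in row]
    return result

# Three toy documents, two from 1988 and one from 1989.
trend = topic_by_year([[0, 0, 1], [1], [1, 1]], [1988, 1988, 1989], num_topics=2)
```

Plotting `trend[year][z]` against the year for a fixed topic z gives the kind of trend curve discussed on the following slides.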
Topic selection • LDA was applied to induce 100 topics, of which 36 were judged relevant. • Seed words were hand-selected for 10 more topics to improve coverage of the field. • These 46 topics were used as priors for a new 100-topic run. • Finally, 43 topics were selected.
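The slide does not say how the 46 topics were encoded as priors. One common approach, shown here purely as an assumption, is to give each seeded topic extra pseudo-counts on its seed words in the topic-word Dirichlet prior and to leave the remaining topics with a flat prior. The function name `seeded_topic_priors` and the values of `base` and `boost` are hypothetical.

```python
def seeded_topic_priors(seed_words, vocab, num_topics, base=0.01, boost=1.0):
    """Build an asymmetric topic-word prior from seed word lists.

    seed_words: one list of seed words per seeded topic.
    vocab: list of vocabulary words; a word's position is its id.
    Topics beyond len(seed_words) keep the flat prior `base`.
    """
    if len(seed_words) > num_topics:
        raise ValueError("more seeded topics than topics in the model")
    index = {word: i for i, word in enumerate(vocab)}
    priors = [[base] * len(vocab) for _ in range(num_topics)]
    for k, words in enumerate(seed_words):
        for word in words:
            if word in index:  # seed words outside the vocabulary are ignored
                priors[k][index[word]] += boost
    return priors

vocab = ["parse", "grammar", "translation", "alignment"]
priors = seeded_topic_priors([["parse", "grammar"], ["translation"]], vocab, num_topics=3)
```

In a sampler, `priors[k][w]` would replace the single symmetric word-smoothing value for topic k and word w, which biases the seeded topics toward their seed vocabulary without fixing them.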
Trend of probabilistic models • The probabilistic model topic increases around 1988, which seems to have been an important year for this topic. • What do the papers from 1988 tell us about how probabilistic models entered the field?
Analysis • 9 of the 10 papers appeared in conference proceedings rather than journals. • New ideas appear in conferences. • 5 of the conference papers appeared in COLING compared to only 1 in ACL. • COLING is more receptive than ACL to new ideas. • 6 of the 10 papers either focus on speech or were written by authors who had published on speech recognition topics. • Speech recognition is an EE field which made early use of probabilistic and statistical methodologies.
Including lexical semantics, conceptual semantics/story understanding, computational semantics, WordNet, WSD, semantic role labeling, RTE and paraphrase, MUC information extraction, and events/temporal.
Is CL becoming more applied? Including machine translation, spelling correction, dialogue systems, information retrieval, call routing, speech recognition, and biomedical applications.
Six applied topics over time The years 1989-1994 correspond exactly to the DARPA Speech and Natural Language Workshop, held at different locations.
Differences and similarities among COLING, ACL, and EMNLP • Whether the topics of these conferences are converging or not. • Are the probabilistic and machine learning trends that are dominant in ACL becoming dominant in COLING as well? • Is EMNLP adopting some of the topics that are popular at COLING?
Divergence between the 3 conferences The Jensen-Shannon (JS) divergence between each pair of conferences is plotted.
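For reference, the JS divergence between two topic distributions can be computed as in the minimal sketch below. It assumes each conference is summarized by a topic distribution that sums to one; the function names and the choice of base-2 logarithms (which bounds the result between 0 and 1) are assumptions, not details taken from the slides.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Mixture distribution; m_i > 0 wherever p_i > 0 or q_i > 0.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical topic distributions for two conferences in one year.
acl = [0.5, 0.3, 0.2]
coling = [0.2, 0.3, 0.5]
divergence = js_divergence(acl, coling)
```

Unlike KL divergence, JS divergence is symmetric and finite even when one conference assigns zero probability to a topic, which makes it a convenient choice for pairwise comparison.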
Conclusion • The proposed method discovers a number of trends in computational linguistics. • It shows a convergence over time in the topic coverage of ACL, COLING, and EMNLP, as well as an expansion of topic diversity. • The growth and convergence of the 3 conferences, perhaps influenced by the need to increase recall, seems to be leading toward a tripartite realization of a single new “latent” conference.