Linguistic Processing in Lattice-Based Taxonomy Construction. Anastasia Novokreshchenova, Maria Shabanova, Dmitry Zaytsev and Nina Belyaeva. State University Higher School of Economics, Moscow. School of Applied Mathematics and Computer Science.
Linguistic Processing in Lattice-Based Taxonomy Construction • Anastasia Novokreshchenova, Maria Shabanova, Dmitry Zaytsev and Nina Belyaeva • State University Higher School of Economics, Moscow, School of Applied Mathematics and Computer Science • CLA 2010, Seville, Spain, October 19-21, 2010.
Outline • Motivation in social studies and the data • Building a lattice-based taxonomy over a text corpus • Natural language processing techniques for automatic attribute acquisition • Keyword extraction • Probabilistic latent modeling of text • Named entity recognition
Motivation • Represent the structure of a given domain in the form of a lattice-based taxonomy • Interdisciplinary research project "Discrete mathematical models for political analysis of democratic institutions and human rights" • Speeches of Western leaders and international organizations • The context in which Russia is addressed • The role and importance of the democracy and human rights agenda • Construct a context from the text corpus • Extract a set of attributes from the texts to describe the documents • Analyze and develop natural language processing methods
Constructing a lattice-based taxonomy over a text corpus • Preliminary text processing • Attribute extraction for describing the documents • Building and pruning the lattice
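The last step above can be illustrated with a minimal sketch, assuming that documents are the objects of a formal context and the extracted attributes are its attributes. The slides do not say how the lattice was pruned, so pruning by a minimum number of documents per concept (`min_support`) is an assumption made here for illustration, as are the function names and the toy data.

```python
def formal_concepts(context):
    """Return all formal concepts (extent, intent) of a small formal context.

    `context` maps each object (document) to the set of its attributes.
    """
    objects = set(context)
    all_attrs = set().union(*context.values()) if context else set()

    def extent_of(attrs):
        # Objects that have every attribute in attrs.
        return frozenset(o for o in objects if attrs <= context[o])

    # Every concept intent is an intersection of object rows
    # (the empty intersection being the full attribute set).
    intents = {frozenset(all_attrs)}
    for o in objects:
        row = frozenset(context[o])
        intents |= {row & i for i in intents}

    concepts = [(extent_of(i), i) for i in intents]
    return sorted(concepts, key=lambda c: (-len(c[0]), sorted(c[1])))


def prune(concepts, min_support):
    """Keep only concepts covering at least min_support objects (assumed criterion)."""
    return [c for c in concepts if len(c[0]) >= min_support]


# Toy context: three hypothetical documents and their attributes.
docs = {
    "d1": {"security", "europe"},
    "d2": {"security", "energy"},
    "d3": {"security", "europe", "energy"},
}
concepts = formal_concepts(docs)
pruned = prune(concepts, min_support=2)
for extent, intent in pruned:
    print(sorted(extent), sorted(intent))
```

The intersection-based enumeration is only practical for small contexts such as the one in this talk; larger corpora would call for a dedicated algorithm.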
Three kinds of taxonomies • Depending on the attribute type: • frequent words • latent topics • named entities
Building a taxonomy with frequent words • eliminating stop words • stemming: collapsing all morphological variants of a term to a single root form • describing each document with its N most frequent terms • building and pruning the lattice
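The first three steps can be sketched as follows. This is a rough illustration, not the authors' pipeline: the stop-word list is a tiny hypothetical sample, and `crude_stem` is a deliberately simple suffix stripper standing in for a real stemmer (the slides do not name the one used).

```python
import re
from collections import Counter

# Illustrative stop-word list; a real run would use a much fuller one.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "with", "for", "on"}


def crude_stem(word):
    """Very rough suffix stripping, a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def top_terms(text, n):
    """Describe a document by the set of its n most frequent stemmed terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = [crude_stem(t) for t in tokens if t not in STOP_WORDS]
    return {term for term, _ in Counter(stems).most_common(n)}


speech = "Security issues and the security of Europe: securing energy for Europe"
attributes = top_terms(speech, 2)
print(attributes)
```

The resulting term sets, one per document, form the rows of the context from which the lattice is built.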
31 formal concepts of the lattice based on frequent words. Figures in squares show the number of documents in each concept.
According to the word-frequency taxonomy: • security issues and the relationship between Russia and Europe are the most discussed topics, along with some global problems • democracy and human rights are not included in the presented taxonomy due to pruning • the words "democracy", "human" and "right" appear in the concepts which include speeches by Barack Obama and Hillary Clinton.
Probabilistic latent semantic analysis (pLSA) • P(z) is the distribution over topics z in a particular document • P(w | z) is the probability distribution over words w given topic z • T is the number of topics
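The formula these symbols belong to is not present in the slide text. Under the definitions above, the standard topic-mixture form would be the following; it is given here as an assumption about what the slide displayed, not as a quotation.

```latex
% Probability of word w in a particular document, as a mixture over T topics
P(w) = \sum_{z=1}^{T} P(w \mid z)\, P(z)
```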
Building a taxonomy with latent topics • probabilistic modeling of text: • documents are represented as random mixtures over latent topics • each topic is characterized by a distribution over words • 20 topics were derived from the 26 documents • these 20 topics were used as attributes for describing the documents
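A minimal sketch of how such a model can be fitted is given below, using plain EM for pLSA on a document-term count matrix. The slides do not state which estimation procedure or software was used, nor how topic weights were turned into binary attributes, so the function names, the toy counts and the `threshold` rule are all assumptions for illustration.

```python
import random


def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit pLSA by EM. `counts[d][w]` is the count of word w in document d.

    Returns (p_z_d, p_w_z): per-document topic mixtures P(z | d)
    and per-topic word distributions P(w | z).
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])

    def normalize(row):
        # Tiny smoothing keeps every probability strictly positive.
        row = [x + 1e-12 for x in row]
        total = sum(row)
        return [x / total for x in row]

    p_z_d = [normalize([rng.random() for _ in range(n_topics)]) for _ in range(n_docs)]
    p_w_z = [normalize([rng.random() for _ in range(n_words)]) for _ in range(n_topics)]

    for _ in range(n_iter):
        new_zd = [[0.0] * n_topics for _ in range(n_docs)]
        new_wz = [[0.0] * n_words for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(n_words):
                if counts[d][w] == 0:
                    continue
                # E-step: posterior P(z | d, w), proportional to P(z | d) * P(w | z).
                post = [p_z_d[d][z] * p_w_z[z][w] for z in range(n_topics)]
                norm = sum(post)
                for z in range(n_topics):
                    # Accumulate expected counts for the M-step.
                    expected = counts[d][w] * post[z] / norm
                    new_zd[d][z] += expected
                    new_wz[z][w] += expected
        # M-step: renormalize expected counts into probabilities.
        p_z_d = [normalize(row) for row in new_zd]
        p_w_z = [normalize(row) for row in new_wz]
    return p_z_d, p_w_z


def topic_attributes(p_z_d, threshold):
    """Assumed binarization: a topic describes a document if P(z | d) >= threshold."""
    return [{z for z, p in enumerate(row) if p >= threshold} for row in p_z_d]


# Toy data: 4 hypothetical documents over a 4-word vocabulary.
counts = [[4, 3, 0, 0], [5, 2, 0, 0], [0, 0, 3, 4], [0, 0, 2, 5]]
p_z_d, p_w_z = plsa(counts, n_topics=2)
attrs = topic_attributes(p_z_d, threshold=0.3)
```

EM only finds a local optimum, so different seeds can yield different topics; the number of topics (20 in the talk) has to be chosen in advance.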
6 of the 20 topics obtained from the documents: word distributions for each topic
According to the latent-topic taxonomy • The most prominent topics are those connected with: • European Union • global problems • security issues • energy resources • Russian-Georgian conflict • possible ways of solving conflicts and problems • The topic of democracy and human rights is not included in the presented taxonomy due to pruning • the concept with this topic includes speeches by Barack Obama and Nicolas Sarkozy
Building a taxonomy with named entities • 38 paragraphs dealing solely with issues concerning Russia were extracted from the 26 documents • three types of named entities were used for describing the documents: • names of persons • organizations • geographical objects
21 concepts of a lattice built from paragraphs and named entities
Concluding remarks • several techniques have been proposed for building a context over a text corpus • frequent words made it possible to identify which questions foreign leaders raise most frequently regarding Russia • latent topic modeling made it possible to specify and describe these issues more thoroughly • named entities would be more informative if used in the context of latent topics • the text corpus should be expanded