Linguistic Processing in Lattice-Based Taxonomy Construction

Anastasia Novokreshchenova, Maria Shabanova, Dmitry Zaytsev and Nina Belyaeva
State University Higher School of Economics, Moscow
School of Applied Mathematics and Computer Science
CLA 2010, Seville, Spain, October 19-21, 2010

Outline
• Motivation in Social Studies and the Data
• Building a lattice-based taxonomy over a text corpus
• Natural language processing techniques for automatic attribute acquisition
• Keyword extraction
• Probabilistic latent modeling of text
• Named entity recognition

Motivation
• Represent the structure of a given domain in the form of a lattice-based taxonomy
• Interdisciplinary research project "Discrete mathematical models for political analysis of democratic institutions and human rights"
• Speeches of Western leaders and international organizations
• The context in which Russia is addressed
• The role and importance of the democracy and human rights agenda
• Construct a context from the text corpus
• Extract the set of attributes from the texts for describing the documents
• Analyze and develop natural language processing methods

The Data: 26 full speeches of foreign leaders

Constructing a lattice-based taxonomy over a text corpus
• Preliminary text processing
• Attribute extraction for describing the documents
• Building and pruning the lattice
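A minimal sketch of the last step, for illustration only. It assumes the context is a mapping from document ids to attribute sets, obtains all concept intents by closing the documents' attribute sets under intersection, and prunes by a minimum extent size. The slides do not say which construction algorithm or pruning criterion was used; the names formal_concepts, prune and min_support and the toy documents are assumptions.

```python
def formal_concepts(context):
    """Return all formal concepts (extent, intent) of a small context.

    `context` maps each document id to the set of its attributes.
    """
    # The top intent is the full attribute set; every other intent is an
    # intersection of some documents' attribute sets.
    attributes = frozenset().union(*context.values())
    intents = {attributes}
    for doc_attrs in context.values():
        intents |= {intent & doc_attrs for intent in intents}

    concepts = []
    for intent in intents:
        # The extent is every document that carries all attributes of the intent.
        extent = frozenset(d for d, attrs in context.items() if intent <= attrs)
        concepts.append((extent, intent))
    return concepts


def prune(concepts, min_support):
    """Keep concepts covering at least `min_support` documents (assumed criterion)."""
    return [(extent, intent) for extent, intent in concepts if len(extent) >= min_support]


# Toy context with made-up documents and attributes.
context = {
    "d1": {"russia", "security", "europe"},
    "d2": {"russia", "security"},
    "d3": {"russia", "energy"},
}
concepts = formal_concepts(context)
kept = prune(concepts, min_support=2)
```

On this toy context the full lattice has five concepts; pruning at two documents keeps only the concepts with intents {russia} and {russia, security}. The closure-by-intersection loop is fine for a few dozen documents but grows with the number of concepts, so a larger corpus would call for a dedicated algorithm.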

Three kinds of taxonomies
• Three kinds of taxonomies, depending on the attribute type:
  • frequent words
  • latent topics
  • named entities

Building a taxonomy with frequent words
• eliminating stop words
• stemming: collapsing all morphological variants of a term to a single root form
• describing each document with its N most frequent terms
• building and pruning the lattice
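The first three steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: STOP_WORDS is a tiny made-up list, crude_stem is a stand-in suffix stripper rather than a real stemmer such as Porter's, and the sample text is invented.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption; a real list is much longer).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are", "with", "for", "on"}


def crude_stem(word):
    """Stand-in for a real stemmer: strip a few common English suffixes."""
    for suffix in ("ing", "ies", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def top_terms(text, n):
    """Return the set of the n most frequent stems in `text`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = [crude_stem(t) for t in tokens if t not in STOP_WORDS]
    return {term for term, _ in Counter(stems).most_common(n)}


def build_context(documents, n):
    """Map each document id to its n most frequent stems (the attribute set)."""
    return {doc_id: top_terms(text, n) for doc_id, text in documents.items()}


sample = ("Energy security matters. Energy markets and security policies "
          "link Russia with Europe; Russia supplies energy.")
context = build_context({"speech_1": sample}, n=3)
```

Here speech_1 is described by the stems energy, security and russia. Ties at the cut-off are broken by first occurrence, which is a property of Counter.most_common and not something the slides specify.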

31 formal concepts of the lattice based on frequent words. Figures in squares show the number of documents in each concept.

According to the word-frequency taxonomy:
• security issues and the relationship of Russia with Europe are the most discussed topics, along with some global problems
• democracy and human rights are not included in the presented taxonomy due to pruning
• the words "democracy", "human" and "right" appear in the concepts which include speeches by Barack Obama and Hillary Clinton

Probabilistic latent semantic analysis (pLSA)
• P(z) – the distribution over topics z in a particular document
• P(w | z) – the probability distribution over words w given topic z
• T is the number of topics
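The formula these symbols belong to is not part of the transcript. Under the standard topic-mixture formulation, which is an assumption here and not taken from the slide, the probability of a word w in a particular document combines the quantities above as:

```latex
P(w) = \sum_{z=1}^{T} P(w \mid z)\, P(z)
```

Each word is thus explained by first choosing one of the T topics according to the document's topic distribution P(z), then drawing the word from that topic's distribution P(w | z).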

Building a taxonomy with latent topics
• probabilistic modeling of text:
  • documents are represented as random mixtures over latent topics
  • each topic is characterized by a distribution over words
• 20 topics were derived from the 26 documents
• the 20 topics were used as attributes for describing the documents
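A minimal sketch of how such topics can be fitted and turned into attributes, assuming the standard EM procedure for pLSA on a document-by-word count matrix. The slides do not describe the fitting procedure or how a topic is assigned to a document; the function names, the 0.2 threshold and the toy counts are assumptions.

```python
import random


def normalize(row):
    """Scale a list of non-negative numbers so that it sums to 1."""
    total = sum(row) or 1.0
    return [x / total for x in row]


def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit pLSA by EM; returns (P(z | d) per document, P(w | z) per topic)."""
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    p_z_d = [normalize([rng.random() for _ in range(n_topics)]) for _ in range(n_docs)]
    p_w_z = [normalize([rng.random() for _ in range(n_words)]) for _ in range(n_topics)]
    for _ in range(n_iter):
        new_w_z = [[0.0] * n_words for _ in range(n_topics)]
        new_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        for d in range(n_docs):
            for w in range(n_words):
                if counts[d][w] == 0:
                    continue
                # E-step: P(z | d, w) is proportional to P(z | d) * P(w | z).
                resp = normalize([p_z_d[d][z] * p_w_z[z][w] for z in range(n_topics)])
                # Accumulate expected counts n(d, w) * P(z | d, w) for the M-step.
                for z in range(n_topics):
                    new_w_z[z][w] += counts[d][w] * resp[z]
                    new_z_d[d][z] += counts[d][w] * resp[z]
        # M-step: renormalize the expected counts into distributions.
        p_w_z = [normalize(row) for row in new_w_z]
        p_z_d = [normalize(row) for row in new_z_d]
    return p_z_d, p_w_z


def topic_attributes(p_z_d, threshold=0.2):
    """Describe each document by the topics whose weight reaches the threshold."""
    return [{f"topic_{z}" for z, p in enumerate(row) if p >= threshold} for row in p_z_d]


# Toy count matrix: 4 documents over a 4-word vocabulary.
counts = [[4, 3, 0, 0], [5, 2, 0, 0], [0, 0, 3, 4], [0, 0, 2, 5]]
p_z_d, p_w_z = plsa(counts, n_topics=2)
attributes = topic_attributes(p_z_d)
```

Thresholding the per-document topic weights is only one way to obtain binary attributes; the choice of threshold directly controls how many documents share each topic and therefore the shape of the resulting lattice.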

6 of the 20 topics obtained from the documents: word distributions over topics

17 formal concepts of the lattice based on latent topics

According to the latent-topic taxonomy
• The most topical issues are those connected with:
  • European Union
  • global problems
  • security issues
  • energy resources
  • Russian-Georgian conflict
  • possible ways of solving conflicts and problems
• The topic of democracy and human rights is not included in the presented taxonomy due to pruning
• the concept with this topic includes speeches by Barack Obama and Nicolas Sarkozy

Building a taxonomy with Named Entities
• 38 paragraphs were extracted from the 26 documents; they address solely issues concerning Russia
• three types of named entities for describing the documents:
  • names of persons
  • organizations
  • geographical objects
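To show how paragraphs and typed entities form a context, here is a small sketch that swaps real named entity recognition for a hand-made lookup table. The slides do not name the recognizer that was used; GAZETTEER, entity_attributes and the sample paragraph are hypothetical.

```python
import re

# Hypothetical gazetteer standing in for a named entity recognizer.
GAZETTEER = {
    "Barack Obama": "PERSON",
    "European Union": "ORGANIZATION",
    "Russia": "LOCATION",
    "Georgia": "LOCATION",
}


def entity_attributes(paragraph):
    """Return typed entity attributes for every gazetteer name found as a whole word."""
    found = set()
    for name, entity_type in GAZETTEER.items():
        if re.search(rf"\b{re.escape(name)}\b", paragraph):
            found.add(f"{entity_type}:{name}")
    return found


paragraph = "The European Union expects Russia to respect the ceasefire in Georgia."
attributes = entity_attributes(paragraph)
```

Each paragraph then becomes an object of the context and each typed entity an attribute. Whole-word matching keeps "Russian" from counting as a mention of "Russia", which a plain substring test would get wrong.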

21 concepts of a lattice built from paragraphs and named entities

Concluding remarks
• several techniques have been proposed to build a context over a text corpus
• frequent words made it possible to identify which questions foreign leaders raise most frequently regarding Russia
• latent topic modeling made it possible to specify and describe these issues more thoroughly
• named entities would be more informative if used in the context of latent topics
• the corpus of texts should be expanded

Thank you!
