An Unsupervised WSD Algorithm for a NLP System

An Unsupervised WSD Algorithm for a NLP System Iulia Nica, Andrés Montoyo, Sonia Vázquez and Mª Antonia Martí

INDEX • Introduction • Architecture for the NLP System • WSD Method • Evaluation • Conclusions • Future Work

Introduction • Natural Language Processing (NLP) techniques are necessary for current information systems. • One problem of natural language is ambiguity (phonological, morphological, syntactic, semantic or pragmatic). • Resolving lexical ambiguity is necessary for certain NLP applications: Machine Translation, Information Retrieval, Information Extraction, etc.

Introduction • Word Sense Disambiguation (WSD) is an intermediate task that attempts to resolve the lexical ambiguity problem by assigning each word its appropriate meaning. • WSD uses two information sources: • Context. • External Knowledge Sources. • WSD approaches: • Knowledge-driven. • Data-driven.

Introduction • WSD method characteristics: • Knowledge-driven. • Unsupervised. • Information sources: • EuroWordNet. • Untagged large corpus. • Sense assignment uses paradigmatic information. • Easily adaptable to other languages.

Architecture for the NLP System • INPUT: untagged text. • POS-analyser (MACO): extracts all possible POS-tags. • POS-tagger (RELAX): selects only one morphosyntactic category. • Shallow parser (TACAT): identifies the sentence's constituents. • WSD module: uses the corpus and the sense discriminators (sets of nouns derived from the lexical-semantic relations of EWN). • OUTPUT: text annotated with POS-tags, chunks and noun senses.

WSD method • It operates on paradigmatic information. • It extracts paradigmatic information for an ambiguous occurrence and maps it to the paradigmatic information from the lexicon. • It rests on the assumption that semantically similar words can substitute for each other in the same context and, conversely, that words that can commute in a context are likely to be semantically close.

WSD method • It searches a POS-tagged corpus for syntactic patterns (the corpus of the EFE News Agency, over 70M words). • To identify patterns, it follows a structural criterion, using a list of basic patterns and search schemes. • Each syntactic pattern is identified at both the lemma and POS levels.

WSD method • Syntactic patterns: X-R-Y • X and Y are lexical content units (nouns, adjectives, verbs and adverbs). • R is a relational element (functional words: prepositions, conjunctions). • The pattern expresses a syntactic relation between X and Y. • Examples: • grano (noun) de (preposition) azúcar (noun) • pasaje (noun) subterráneo (adjective)

WSD method • Definition of basic patterns: • N, N • N C N • N P N • N A • N V • A N • V N • Legend: N = Noun; A = Adjective; V = Participle verb; R = Adverb; D = Determiner; C* = Conjunction, with Conjunctions = {y, e, o, u}.

WSD method • Each basic pattern has discontinuous realisations in texts. • We pre-establish morphosyntactic schemes for the search of patterns; e.g.: • N(((R) R) A/V), ((D) D)(((R) R) A/V) N • N(((R) R) A/V) C* ((D) D)(((R) R) A/V) N • N(((R) R) A/V) P((D) D) (((R) R) A/V)N • N((R) R) A(C* ((R) R)A/V) • N((R) R) V(C* ((R) R)A/V) • (A/VC* ((D) D)(((R) R))A N • (A/VC* ((D) D)(((R) R))V N • The units between brackets are optional; those separated by a slash are alternatives for a position.

WSD method • For each search scheme, we define decomposition rules in order to extract the basic patterns. • Example: Coronas danesas y suecas (N A C* A) decomposes into two N A patterns: corona danesa, corona sueca. • Each unit of the sequence is also considered at the lemma level.
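The decomposition step in the "Coronas danesas y suecas" example can be sketched as follows. This is only an illustration under stated assumptions: the function name `decompose_n_a_c_a`, the `(lemma, tag)` token representation and the single hard-coded rule are hypothetical, and the slides do not say how the rules are actually encoded.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (lemma-level form, POS tag); hypothetical representation


def decompose_n_a_c_a(tokens: List[Token]) -> List[Tuple[str, str]]:
    """Split an N A C A sequence into the two basic N A patterns it contains.

    Only this one rule is sketched; any other tag sequence yields no patterns.
    """
    tags = [tag for _, tag in tokens]
    if tags != ["N", "A", "C", "A"]:
        return []
    noun = tokens[0][0]
    # The noun is shared by both coordinated adjectives.
    return [(noun, tokens[1][0]), (noun, tokens[3][0])]


# Forms as shown on the slide (corona danesa, corona sueca).
tokens = [("corona", "N"), ("danesa", "A"), ("y", "C"), ("sueca", "A")]
print(decompose_n_a_c_a(tokens))
```

A fuller version would presumably hold one such rule per search scheme; a table of rules keyed by tag sequence is one possible design.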

WSD method • Information is extracted from two sources: • Corpus (paradigmatic information). • Sentences (syntagmatic information). • Paradigmatic information is extracted by exploiting the syntactic patterns. • Example (figure): syntagmatic relations: obra para órgano; paradigmatic relations: pieza, concierto.

WSD method • Sense discriminators obtained from EWN: • Selection of all nouns related to each sense along the different lexical-semantic relations. • Elimination of the elements common to different senses. • Result: disjoint sets of nouns for the senses of a word.
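A minimal sketch of the filtering step, assuming the nouns related to each sense have already been collected from EWN into plain sets. The function name `build_sense_discriminators`, the sense labels and the toy nouns (including the shared noun "unidad") are invented for illustration and are not taken from EWN.

```python
from collections import Counter
from typing import Dict, Set


def build_sense_discriminators(related: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Keep, for each sense, only the nouns that occur under no other sense."""
    # Count in how many senses each noun appears.
    counts = Counter(noun for nouns in related.values() for noun in nouns)
    return {
        sense: {noun for noun in nouns if counts[noun] == 1}
        for sense, nouns in related.items()
    }


# Toy input: "unidad" is shared by both senses, so it is dropped from both.
related = {
    "sense_a": {"músculo", "riñón", "unidad"},
    "sense_b": {"teclado", "pedal", "unidad"},
}
discriminators = build_sense_discriminators(related)
print(discriminators)
```

Dropping a shared noun from every sense, rather than keeping it in one of them, is what makes the resulting sets pairwise disjoint.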

WSD method • Commutative test: • Hypothesis: If two words can commute in a given context, they are likely to be semantically close. • Application: If the ambiguous word can be substituted with a sense discriminator inside a syntactic pattern, then it has the sense corresponding to that discriminator. • The algorithm operates with words from a sense-untagged corpus.

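WSD method • Commutative Test Algorithm (figure): • The ambiguous word X in the pattern X – R – Y is removed, leaving __ – R – Y. • The corpus is searched for words Xk that fill the slot (Xk – R – Y). • Each Xk is compared with the elements dij of the sense discriminator sets SD1 … SDn. • YES: if Xk matches a discriminator di0j in SDi0, X receives sense i0 (X_i0 – R – Y). • NO: otherwise the sense remains unassigned (X_? – R – Y).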

WSD method • The WSD module has two heuristics: • H1: Commutative Test Algorithm applied to the paradigmatic information (the nouns obtained by substituting the ambiguous occurrence in the pattern). • H2: Commutative Test Algorithm applied to the syntagmatic information (the nouns obtained from the sentence). • The two heuristics act as voters for the sense assignment.
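The commutative test and the voting step might be sketched as below. The slides say only that the two heuristics "act as voters", so the tie-handling rule here (no sense is assigned on a tie) is an assumption, as are the function names `commutative_test` and `vote` and all the toy data.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional, Set


def commutative_test(candidates: Iterable[str],
                     discriminators: Dict[str, Set[str]]) -> List[str]:
    """Return the senses whose discriminator set shares a noun with the candidates."""
    cand = set(candidates)
    return [sense for sense, sd in discriminators.items() if cand & sd]


def vote(*proposals: List[str]) -> Optional[str]:
    """Combine heuristic proposals; assumed rule: unique most-voted sense, else None."""
    tally = Counter(sense for proposal in proposals for sense in proposal)
    ranked = tally.most_common()
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie: leave the occurrence undisambiguated
    return ranked[0][0]


# Toy discriminator sets and noun sets (illustrative only).
sds = {"órgano#3": {"músculo", "riñón", "ojo"}, "órgano#4": {"teclado", "pedal"}}
h1 = commutative_test({"cabeza", "riñón", "programa"}, sds)  # nouns from the corpus
h2 = commutative_test({"trasplante", "vida"}, sds)           # nouns from the sentence
print(h1, h2, vote(h1, h2))
```

Abstaining when no discriminator matches would be consistent with a method whose coverage is below 100%, although the slides do not describe the abstention rule.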

WSD method • Example: Los enormes y continuados progresos científicos y técnicos de la Medicina actual han logrado hacer descender espectacularmente la mortalidad infantil, erradicar multitud de enfermedades hasta hace poco mortales, sustituir mediante trasplante o implantación órganos dañados o partes del cuerpo inutilizadas y alargar las expectativas de vida. • 1. Input text POS-tagging. • 2. Syntactic patterns identification. • 2.1. Use of search schemes: NACN scheme: órganos dañados o partes. • 2.2. Use of decomposition rules: NA: órgano dañado; NCN: órgano o parte. • 3. Extraction of information. • 3.1. From corpus (S1): mediador, terreno, chófer, árbol, cabeza, planeta, parte, incremento, totalidad, guerrilla, programa, mitad, país, temporada, artículo, tercio. • 3.2. From sentence (S2): progreso, científico, mortalidad, multitud, enfermedad, mortal, trasplante, implantación, órgano, parte, cuerpo, expectativa, vida. • 4. Extraction of Sense Discriminators. Sense discriminator sets: • Sense 1: órgano vegetal, espora, flor, pera, manzana, bellota, hinojo, semilla, poro, píleo, carpóforo, ... • Sense 2: agencia, unidad administrativa, banco central, servicio secreto, seguridad social, FBI, ... • Sense 3: parte del cuerpo, trozo, músculo, riñón, oreja, ojo, glándula, lóbulo, tórax, dedo, articulación, rasgo, facción, ... • Sense 4: instrumento de viento, instrumento musical, mecanismo, aparato, teclado, pedal, corneta, ... • Sense 5: periódico, publicación, medio de comunicación, método, serie, serial, número, ejemplar, ... • 5. Commutative Test. • Heuristic 1: S1 ∩ SD1 = ∅; S1 ∩ SD2 = ∅; S1 ∩ SD3 ≠ ∅; S1 ∩ SD4 = ∅; S1 ∩ SD5 = ∅. • Heuristic 2: S2 ∩ SD1 = ∅; S2 ∩ SD2 = ∅; S2 ∩ SD3 ≠ ∅; S2 ∩ SD4 = ∅; S2 ∩ SD5 = ∅. • 6. Final sense assignment. Final result: órgano#3: A fully differentiated structural and functional unit in an animal that is specialized for some particular function.

Evaluation • The WSD method was tested with the Spanish Lexical Sample task of Senseval-2. • For the evaluation, we selected all 17 nouns of this task. • We used the two heuristics H1 & H2.

Evaluation • Results obtained: • H1: Precision 0.54, Recall 0.11, Coverage 0.21 • H2: Precision 0.59, Recall 0.04, Coverage 0.07 • H1 + H2: Precision 0.56, Recall 0.15, Coverage 0.27
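The three measures are linked if one assumes the usual Senseval-style definitions, which the slides do not state: precision = correct / attempted, coverage = attempted / total, and recall = correct / total, so that recall = precision × coverage. The short check below applies that relation to the reported figures; the helper name `recall_from` is hypothetical.

```python
def recall_from(precision: float, coverage: float) -> float:
    """Recall implied by precision and coverage under the assumed definitions."""
    return precision * coverage


# (precision, recall, coverage) as reported on the slide.
reported = {
    "H1": (0.54, 0.11, 0.21),
    "H2": (0.59, 0.04, 0.07),
    "H1 + H2": (0.56, 0.15, 0.27),
}

for name, (precision, recall, coverage) in reported.items():
    implied = round(recall_from(precision, coverage), 2)
    print(f"{name}: implied recall {implied}, reported recall {recall}")
```

Under these assumed definitions the implied and reported recall agree to two decimals in all three rows, which suggests the low recall stems mainly from low coverage rather than from low precision.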

Evaluation • In Senseval-2, the values for the individual words reached the following levels: • Precision = 51.4% - 71.2% • Recall = 50.3% - 71.2% • Coverage = 98% - 100%

Conclusions • This WSD method can be used as a module in an NLP system to prepare an input text for a real application. • It is independent of any corpus tagging at the syntactic or semantic level. • It requires only a minimal preprocessing phase (POS-tagging) of the input text and of the search corpus.

Future work • Study of different possibilities to improve the WSD process. • Application of new algorithms to the information associated with the ambiguous occurrence. • Combination with other data-driven WSD methods.

An Unsupervised WSD Algorithm for a NLP System Thank you!! Iulia Nica, Andrés Montoyo, Sonia Vázquez and Mª Antonia Martí
