An Unsupervised WSD Algorithm for a NLP System Iulia Nica, Andrés Montoyo, Sonia Vázquez and Mª Antonia Martí
INDEX • Introduction • Architecture for the NLP System • WSD Method • Evaluation • Conclusions • Future Work
Introduction • Natural Language Processing (NLP) techniques are necessary for current information systems. • One problem of natural language is ambiguity (phonological, morphological, syntactic, semantic or pragmatic). • The resolution of lexical ambiguity is necessary for certain NLP applications: Machine Translation, Information Retrieval, Information Extraction, etc.
Introduction • Word Sense Disambiguation (WSD) is an intermediate task that attempts to resolve the lexical ambiguity problem by assigning to each word its appropriate meaning. • WSD uses two information sources: • Context. • External Knowledge Sources. • WSD approaches: • Knowledge-driven. • Data-driven.
Introduction • WSD method characteristics: • Knowledge-driven. • Unsupervised. • Information sources: • EuroWordNet. • Untagged large corpus. • Sense assignment uses paradigmatic information. • Easily adaptable to other languages.
Architecture for the NLP System • INPUT: untagged text. • POS-analyser (MACO): extracts all possible POS-tags. • POS-tagger (RELAX): selects only one morphosyntactic category. • Shallow parser (TACAT): identifies the sentence's constituents. • WSD module: uses the corpus and the Sense Discriminators (sets of nouns derived from the lexical-semantic relations of EWN). • OUTPUT: text annotated with POS-tags, chunks and noun senses.
WSD method • It operates on paradigmatic information. • It extracts paradigmatic information for an ambiguous occurrence and maps it to the paradigmatic information from the lexicon. • It rests on the assumption that semantically similar words can substitute each other in the same context and, conversely, that words that can commute in a context are likely to be semantically close.
WSD method • It uses a POS-tagged corpus for searching syntactic patterns (the corpus of EFE News Agency, over 70M words). • For the identification of patterns, it follows a structural criterion, using a list of basic patterns and search schemes. • Each syntactic pattern is identified at the lemma and POS levels.
WSD method • Syntactic patterns: X-R-Y • X and Y are lexical content units (nouns, adjectives, verbs and adverbs). • R is a relational element (function words: prepositions, conjunctions, ...). • The pattern expresses a syntactic relation between X and Y. • Examples: • grano (noun) de (preposition) azúcar (noun) • pasaje (noun) subterráneo (adjective)
WSD method • Definition of basic patterns:
• N, N
• N C N
• N P N
• N A
• N V
• A N
• V N
Legend: N = Noun; R = Adverb; A = Adjective; V = Participle Verb; C* = Conjunction; D = Determinant. Conjunctions = {y, e, o, u}.
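The basic patterns listed above can be read as short tag sequences. The following is a minimal sketch, not the authors' implementation: it only matches contiguous realisations, it assumes that `P` stands for a preposition (the legend does not define it), and the names `BASIC_PATTERNS` and `find_basic_patterns` are hypothetical.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (lemma, coarse POS tag)

# Basic patterns from the slide, written as tag sequences.
# "," stands for a literal comma token; "P" is ASSUMED to mean preposition.
BASIC_PATTERNS = [
    ("N", ",", "N"),
    ("N", "C", "N"),
    ("N", "P", "N"),
    ("N", "A"),
    ("N", "V"),
    ("A", "N"),
    ("V", "N"),
]


def find_basic_patterns(tokens: List[Token]) -> List[Tuple[str, ...]]:
    """Return the lemma tuples of every contiguous match of a basic pattern."""
    tags = [tag for _, tag in tokens]
    matches = []
    for start in range(len(tokens)):
        for pattern in BASIC_PATTERNS:
            end = start + len(pattern)
            if tuple(tags[start:end]) == pattern:
                matches.append(tuple(lemma for lemma, _ in tokens[start:end]))
    return matches


# The two examples from the previous slide.
print(find_basic_patterns([("grano", "N"), ("de", "P"), ("azúcar", "N")]))
print(find_basic_patterns([("pasaje", "N"), ("subterráneo", "A")]))
```

Discontinuous realisations are not handled here; in the method they are covered by the search schemes and decomposition rules.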
WSD method • Each basic pattern has discontinuous realisations in texts. • We pre-establish morphosyntactic schemes for the search of patterns; e.g.:
• N(((R) R) A/V), ((D) D)(((R) R) A/V) N
• N(((R) R) A/V) C* ((D) D)(((R) R) A/V) N
• N(((R) R) A/V) P((D) D) (((R) R) A/V)N
• N((R) R) A(C* ((R) R)A/V)
• N((R) R) V(C* ((R) R)A/V)
• (A/VC* ((D) D)(((R) R))A N
• (A/VC* ((D) D)(((R) R))V N
• The units in parentheses are optional; those separated by a slash are alternatives for a position.
WSD method • For each search scheme, we define decomposition rules in order to extract the basic patterns. • Each unit of the sequence is also considered at the lemma level. • Example: Coronas danesas y suecas (N A C* A) → Corona danesa (N A), Corona sueca (N A)
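The decomposition of the example above can be sketched as follows. This is an illustration of a single rule (N A C* A → N A + N A) under the assumption that the head noun pairs with each coordinated adjective; the function name and the token representation are hypothetical, and the word forms are those shown on the slide.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (lemma-level form, coarse POS tag)


def decompose_n_a_c_a(tokens: List[Token]) -> List[Tuple[str, str]]:
    """Split an N A C A sequence into two N A basic patterns.

    The head noun is paired with each coordinated adjective.
    Returns an empty list when the tag sequence does not match.
    """
    tags = [tag for _, tag in tokens]
    if tags != ["N", "A", "C", "A"]:
        return []
    noun, first_adj, _, second_adj = (form for form, _ in tokens)
    return [(noun, first_adj), (noun, second_adj)]


sequence = [("corona", "N"), ("danesa", "A"), ("y", "C"), ("sueca", "A")]
print(decompose_n_a_c_a(sequence))
```

A full system would need one such rule per search scheme; this sketch covers only the scheme used in the example.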
WSD method • Information is extracted from two sources: • Corpus (paradigmatic information). • Sentences (syntagmatic information). • Paradigmatic information is extracted by exploiting the syntactic patterns. • Example: • Syntagmatic relations: obra para órgano • Paradigmatic relations: pieza, concierto
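One way to picture the extraction of paradigmatic information is to fix the rest of a pattern and collect the corpus words that fill the open position. The sketch below assumes the open position is X in X-R-Y and uses a toy, invented list of corpus patterns; `substitutes_for_x` is a hypothetical helper, not part of the described system.

```python
from typing import Iterable, Set, Tuple

Pattern = Tuple[str, str, str]  # (X, R, Y) at the lemma level


def substitutes_for_x(corpus_patterns: Iterable[Pattern], r: str, y: str) -> Set[str]:
    """Collect every X that occurs in the corpus in the slot __ - R - Y."""
    return {x for x, r2, y2 in corpus_patterns if r2 == r and y2 == y}


# Toy corpus patterns (hypothetical data, for illustration only).
corpus = [
    ("obra", "para", "órgano"),
    ("pieza", "para", "órgano"),
    ("concierto", "para", "órgano"),
    ("obra", "de", "teatro"),
]
print(sorted(substitutes_for_x(corpus, "para", "órgano")))
```

The same idea applies symmetrically when the open position is Y.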
WSD method • Sense discriminators obtained from EWN: • Selection of all nouns related to each sense along the different lexical-semantic relations. • Elimination of the elements common to different senses. • Result: disjoint sets of nouns for the senses of a word.
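The elimination step can be sketched with plain set operations. The input sets below are toy data invented for the illustration (they are not taken from EWN), and `sense_discriminators` is a hypothetical name.

```python
from typing import Dict, Set


def sense_discriminators(related: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Remove from each sense's noun set every noun shared with another sense."""
    result = {}
    for sense, nouns in related.items():
        others: Set[str] = set()
        for other_sense, other_nouns in related.items():
            if other_sense != sense:
                others |= other_nouns
        result[sense] = nouns - others
    return result


# Toy input: nouns related to two senses, with one shared element ("parte").
related = {
    "órgano#3": {"riñón", "ojo", "parte"},
    "órgano#4": {"teclado", "pedal", "parte"},
}
for sense, nouns in sense_discriminators(related).items():
    print(sense, sorted(nouns))
```

After this step no noun belongs to more than one set, so a match against a discriminator points to exactly one sense.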
WSD method • Commutative test: • Hypothesis: If two words can commute in a given context, they are likely to be semantically close. • Application: If the ambiguous word can be substituted with a sense discriminator inside a syntactic pattern, then it has the sense corresponding to that discriminator. • The algorithm operates with words from a sense-untagged corpus.
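WSD method • Commutative Test Algorithm (diagram): • The ambiguous occurrence X is removed from its pattern X - R - Y, giving __ - R - Y. • The corpus is searched for the words Xk that fill the empty position (Xk - R - Y). • Each Xk is compared with the elements dij of the sense discriminator sets SD1, ..., SDi0, ..., SDn. • YES (Xk coincides with an element di0j of SDi0): X receives sense i0 (X_i0 - R - Y). • NO: the sense of X remains undetermined (X_? - R - Y).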
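The decision step of the commutative test can be sketched as a set intersection between the substitutes found in the corpus and each sense discriminator set. This is a simplified reading of the diagram: returning no sense when several senses match is an assumption of the sketch, and the function name and sample data are hypothetical.

```python
from typing import Dict, Optional, Set


def commutative_test(substitutes: Set[str],
                     discriminators: Dict[str, Set[str]]) -> Optional[str]:
    """Return the single sense whose discriminator set contains a substitute.

    Returns None when no sense, or more than one sense, is matched
    (the tie handling is an assumption of this sketch).
    """
    matched = [sense for sense, sd in discriminators.items() if substitutes & sd]
    return matched[0] if len(matched) == 1 else None


# Toy discriminator sets (hypothetical data).
sds = {
    "órgano#3": {"riñón", "ojo"},
    "órgano#4": {"teclado", "pedal"},
}
print(commutative_test({"riñón", "mesa"}, sds))   # one sense matched
print(commutative_test({"mesa"}, sds))            # no sense matched
```

Because the discriminator sets are disjoint, a single substitute can never match two senses; ambiguity arises only when different substitutes point to different senses.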
WSD method • The WSD module has two heuristics: • H1: Commutative Test Algorithm applied to the paradigmatic information (the nouns obtained by substituting the ambiguous occurrence in the pattern). • H2: Commutative Test Algorithm applied to the syntagmatic information (the nouns obtained from the sentence). • The two heuristics act as voters for the sense assignment.
WSD method • Example:
Los enormes y continuados progresos científicos y técnicos de la Medicina actual han logrado hacer descender espectacularmente la mortalidad infantil, erradicar multitud de enfermedades hasta hace poco mortales, sustituir mediante trasplante o implantación órganos dañados o partes del cuerpo inutilizadas y alargar las expectativas de vida.
• 1. Input text POS-tagging.
• 2. Syntactic patterns identification.
• 2.1. Use of search schemes. NACN scheme: órganos dañados o partes
• 2.2. Use of decomposition rules. NA: órgano dañado; NCN: órgano o parte
• 3. Extraction of information.
• 3.1. From corpus: mediador, terreno, chófer, árbol, cabeza, planeta, parte, incremento, totalidad, guerrilla, programa, mitad, país, temporada, artículo, tercio
• 3.2. From sentence: progreso, científico, mortalidad, multitud, enfermedad, mortal, trasplante, implantación, órgano, parte, cuerpo, expectativa, vida
• 4. Extraction of Sense Discriminators. Sense Discriminator sets:
Sense 1: órgano vegetal, espora, flor, pera, manzana, bellota, hinojo, semilla, poro, píleo, carpóforo, ...
Sense 2: agencia, unidad administrativa, banco central, servicio secreto, seguridad social, FBI, ...
Sense 3: parte del cuerpo, trozo, músculo, riñón, oreja, ojo, glándula, lóbulo, tórax, dedo, articulación, rasgo, facción, ...
Sense 4: instrumento de viento, instrumento musical, mecanismo, aparato, teclado, pedal, corneta, ...
Sense 5: periódico, publicación, medio de comunicación, método, serie, serial, número, ejemplar, ...
• 5. Commutative Test. For both Heuristic 1 (set S1) and Heuristic 2 (set S2), the comparison with SD1, SD2, SD4 and SD5 yields no match; only SD3 is matched.
• 6. Final sense assignment. Final result: órgano#3: A fully differentiated structural and functional unit in an animal that is specialized for some particular function.
Evaluation • The WSD method was tested with the Spanish Lexical Sample task of Senseval-2. • For the evaluation, we selected all 17 nouns of this task. • We used the two heuristics H1 & H2.
Evaluation • Results obtained:
            Precision   Recall   Coverage
H1          0.54        0.11     0.21
H2          0.59        0.04     0.07
H1 + H2     0.56        0.15     0.27
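The three figures in each row are related. Assuming the usual Senseval-style definitions (precision = correct / attempted, recall = correct / total, coverage = attempted / total), recall equals precision times coverage. The short check below applies that identity to the reported values; the function name is hypothetical and the definitions are an assumption, since the slides do not state them.

```python
def recall_from(precision: float, coverage: float) -> float:
    """Recall implied by precision and coverage (recall = precision * coverage)."""
    return precision * coverage


# (precision, reported recall, coverage) as given in the results table.
reported = {
    "H1": (0.54, 0.11, 0.21),
    "H2": (0.59, 0.04, 0.07),
    "H1 + H2": (0.56, 0.15, 0.27),
}

for name, (precision, recall, coverage) in reported.items():
    implied = round(recall_from(precision, coverage), 2)
    print(f"{name}: implied recall {implied:.2f}, reported recall {recall:.2f}")
```

Under these assumed definitions the low recall follows directly from the low coverage: the method answers for only a fraction of the occurrences, while its precision on the answered ones stays above 0.5.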
Evaluation • In Senseval-2, the values for the individual words reached the following levels: • Precision = 51.4% - 71.2% • Recall = 50.3% - 71.2% • Coverage = 98% - 100%
Conclusions • This WSD method can be used as a module in an NLP system to prepare an input text for a real application. • It is independent of any corpus tagging at the syntactic or semantic level. • It requires only a minimal preprocessing phase (POS-tagging) of the input text and of the search corpus.
Future work • Study of different possibilities to improve the WSD process. • Application of new algorithms to the information associated with the ambiguous occurrence. • Combination with other data-driven WSD methods.
An Unsupervised WSD Algorithm for a NLP System Thank you!! Iulia Nica, Andrés Montoyo, Sonia Vázquez and Mª Antonia Martí