Definition Clustering, Sense Naming & Lexical Augmentation
Mathieu LAFOURCADE (lafourcade@lirmm.fr), Fabien JALABERT (jalabert@lirmm.fr)
Study context 1/2
• Natural Language Processing
• Lexical semantics
  - WSD (word sense disambiguation)
  - Document indexing
• Dictionary construction and vectorization
  - problem: extracting the definition meta-language
  - example: ‘cannibale’ = ‘qui mange l’Homme en parlant de l’Homme’ (‘that eats humans, said of a human’)
  - themes: homme (human), manger (to eat), rhétorique (rhetoric)
• Multi-source approach: noise reduction
  - problem: the atomic element is a definition, not a sense
• Objectives
  - clustering definitions to obtain senses
  - naming these senses
Study context 2/2
[Figure: processing chain for a term T. The multi-source base holds the definitions of T from each source (def 1 to def 3 of Source 1, def 1 and def 2 of Source 2, def 1 and def 2 of Source 3). Clustering groups these definitions into senses (Sense 1, Sense 2, Sense 3). Sense naming attaches a name to each sense (Sense 1 – Name, Sense 2 – Name, …) to build the ‘acception’ or sense base, which is re-injected as a new lexical source.]
Summary
• Model, Construction, Organization
• Definition Clustering
• Sense Naming
• Lexical Augmentation
• Results
Conceptual Vector Model 1/2
• An idea = a vector [Salton; Deerwester]
• A vector component = a primitive concept as defined in a thesaurus
• Thesaurus Larousse: 873 concepts
• Concepts are inter-related: the vectors live in a generator space
• A definition is turned into a vector [Chauché; Lafourcade]
• Most activated primitives for ‘frégate’ (frigate): (oiseau [bird] 6134) (transports maritimes et fluviaux [sea and river transport] 5644) (arme [weapon] 4891) …
[Figure: vector for ‘frégate’ plotted on the axes ‘transports maritimes et fluviaux’, ‘oiseau’ and ‘arme’.]
Conceptual Vector Model 2/2
• Thematic distance = angle between two vectors x and y
• Terms thematically close to ‘frégate’: (destroyer 0.2246) (youyou 0.2267) (voilier 0.2268) (contre-torpilleur 0.2274) (chlamydère 0.2276) (oiseau-jardinier 0.2295) (trois-mâts 0.233) …
• Terms thematically close to ‘frégate/oiseau/’ (bird sense): (oiseau-jardinier 0.1237) (plumeur 0.1319) (goglu 0.136) (travailleur 0.136) (chlamydère 0.1385) (penne 0.141) (Galliformes 0.1422) (agami 0.1428) …
• Terms thematically close to ‘frégate/bateau/’ (ship sense): (démâtage 0.1604) (dégréer 0.1676) (naval 0.1718) (bateau-piège 0.1774) (bateau-vanne 0.1821) (batelet 0.1824) …
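The thematic distance on this slide is the angle between two conceptual vectors. A minimal sketch of that computation follows, assuming vectors are plain lists of non-negative activations and the angle is expressed in radians. The function name `angular_distance` and the two one-concept vectors are illustrative assumptions. Only the three ‘frégate’ activations come from the slides, and a real vector would have 873 components.

```python
import math

def angular_distance(x, y):
    """Angle in radians between two vectors (0 means same direction)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("angular distance is undefined for a null vector")
    # Clamp so that rounding noise cannot push the cosine outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / (norm_x * norm_y)))
    return math.acos(cosine)

# Components: (oiseau, transports maritimes et fluviaux, arme).
fregate = [6134, 5644, 4891]   # activations quoted on the slide
pure_bird = [1, 0, 0]          # hypothetical vector, illustration only
pure_weapon = [0, 0, 1]        # hypothetical vector, illustration only

print(angular_distance(fregate, pure_bird))
print(angular_distance(fregate, pure_weapon))
```

With these toy inputs the ‘frégate’ vector is closer to the bird axis than to the weapon axis, simply because its ‘oiseau’ activation is the largest.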
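Definition Vector Computation
• Morpho-syntactic analysis with SYGMART [Chauché]
[Figure: SYGMART parse tree of an ambiguous sentence. The root PHAMBG dominates two competing PH analyses, each made of a GN, a GV and the final punctuation mark; the leaves include the lemmas ‘le’, ‘petit’, ‘briser’ / ‘brise’ and ‘glacer’ / ‘glace’.]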
Multi-Agent Organization
• Double loop [Lecerf; Schwab]
• Learning agents: Sygmart, computation of vectors from definitions, synonymy, antonymy, …
[Figure: an agent with its endogenous loop, connected to the other agents (society) through the exogenous loop.]
Clustering
Objective: grouping definitions into senses
Clustering 1/5 Strategy
• Deep analysis, several criteria
• No training (but enhancement through the exogenous loop)
• Frontier between senses and definitions
• Centroid approach
• Heuristics (preferences)
  - number of clusters = maximum number of definitions found in any one dictionary
  - two definitions from the same source go to two different clusters
Clustering 2/5 Difficulty ‘botte’
Clustering 3/5 Algorithm 1/2
• Source-by-source iteration
  - until a minimum-value distribution is obtained
• Assignment of each definition of a source to a cluster at minimum total distance
• From a distance matrix: Hungarian method, O(n^3) [Kuhn; Ford, Fulkerson]
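The assignment step can be illustrated on a toy distance matrix. The sketch below is not the Hungarian method named on the slide: it finds the same minimum-cost assignment by exhaustive search, which is only practical for a handful of definitions. The function name `assign_definitions` and the matrix values are assumptions made for illustration. Each definition of one source receives a distinct cluster, which matches the heuristic that two definitions from the same source go to different clusters.

```python
from itertools import permutations

def assign_definitions(distances):
    """Minimum-cost assignment of one source's definitions to clusters.

    distances[i][j] is the distance between definition i and cluster j.
    Returns (assignment, total) where assignment[i] is the cluster index
    chosen for definition i. Exhaustive search, for small inputs only.
    """
    n_defs = len(distances)
    n_clusters = len(distances[0])
    if n_defs > n_clusters:
        raise ValueError("need at least as many clusters as definitions")
    best_assignment, best_total = None, float("inf")
    # Each permutation gives every definition a different cluster.
    for clusters in permutations(range(n_clusters), n_defs):
        total = sum(distances[i][c] for i, c in enumerate(clusters))
        if total < best_total:
            best_assignment, best_total = clusters, total
    return list(best_assignment), best_total

# Hypothetical distances: 3 definitions of one source against 3 clusters.
toy_matrix = [
    [0.2, 0.9, 0.7],
    [0.8, 0.3, 0.6],
    [0.5, 0.4, 0.1],
]
assignment, total = assign_definitions(toy_matrix)
print(assignment, total)
```

For realistic sizes a polynomial solver is preferable; SciPy's `scipy.optimize.linear_sum_assignment` solves the same problem on a cost matrix.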
Clustering 4/5 Algorithm 2/2
• For each criterion
  - one evaluation
  - one distance matrix
• Criteria
  - Comparing the lexical contents of definitions (term frequency, co-occurrences, etc.)
  - Angular distance
  - Symbolic markers
    - morphology
    - etymology (‘avocat’: ‘ahuacatl’ / ‘advocatus’)
    - usage (‘vieux’ [old], ‘ancien’ [former], ‘poétique’ [poetic], …)
    - language level (‘argot’ [slang], ‘familier’ [colloquial], …)
    - domain (‘médecine’, ‘zoologie’, …)
Clustering 5/5 Results
• Correct results in many cases: 90 % for nouns, 70 % for verbs (still to be evaluated for adjectives)
• Problems with very strong polysemy
  - vagueness, continuity in meanings
  - support verbs: ‘prendre’ (to take), …
• To be studied: increasing the number of clusters (example: ‘botte’)
• Next step: we would like to designate the meanings
Sense Naming
Objective: to give the system some capacity to ‘talk about a sense’
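Sense Naming 1/10 Properties
• Dictionary independent
• Interface (man-system & system-system)
• A new lexical source: looping :-)
• Semantic annotation
  - ‘La frégate/vaisseau/ naviguait à travers les océans’ (The frigate/ship/ was sailing across the oceans)
  - ‘La frégate/oiseau/ planait à travers les nues en poussant son cri incomparable’ (The frigate/bird/ was gliding through the clouds, uttering its incomparable cry)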
Sense Naming 2/10 Procedure
• Extraction
• Validation and dispatching of polysem bags: bijection
• Evaluation of candidates: ordering them and extracting the most appropriate ones
Sense Naming 3/10 Extraction
• Extraction attached to a meaning
  - Morpho-syntactic analysis of the definition
  - Extraction of markers: ‘anc.’ (old), ‘méd.’ (medicine), …
  - Extraction from unstructured or semi-structured data (XML…)
  - example: ‘frégate’: [nf] [ancien] Au XVe s., grande barque demi-pontée gréant deux voiles latines sur antenne et assurant la liaison entre les ports et les escadres de galère. [Club Internet] (feminine noun, old: in the 15th century, a large half-decked boat rigged with two lateen sails on a yard, providing the link between ports and galley squadrons)
• Extraction from polysem bags
  - Word lists (like the synonym list of Université de Caen) [Ploux; Victorri]
  - example: ‘botte’ = chaussure (shoe), bottillon (ankle boot), coup (blow), attaque (attack), amas (heap), bouquet (bunch), …
Sense Naming 4/10 Validation
• Bijection: being able to re-associate the proper meaning
  - ƒ : (term, sense) → (term, annotation)
  - ƒ⁻¹ : (term, annotation) → (term, sense)
• A candidate associated with a sense should be closer to its own sense than to any other
• Unattached candidates are associated with the closest meaning
• A candidate should not be present in a competing definition
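The three validation rules can be read as a small filtering procedure. The sketch below is one possible reading, not the authors' implementation: the function name `validate_candidates`, the data layout and every distance value are assumptions made for illustration. The `distance` argument stands for any measure between a candidate name and a sense, for example the angular distance.

```python
def validate_candidates(attached, unattached, distance):
    """Filter and dispatch naming candidates.

    attached:   dict mapping each sense to the candidates extracted for it
    unattached: candidates not tied to a sense (e.g. from a synonym list)
    distance:   function (candidate, sense) -> float
    """
    senses = list(attached)
    validated = {sense: [] for sense in senses}
    for sense, names in attached.items():
        others = [s for s in senses if s != sense]
        for name in names:
            # Reject a candidate that also appears in a competing definition.
            if any(name in attached[other] for other in others):
                continue
            # Keep it only if it is closer to its own sense than to any other.
            own = distance(name, sense)
            if all(own < distance(name, other) for other in others):
                validated[sense].append(name)
    # Unattached candidates go to the closest sense.
    for name in unattached:
        closest = min(senses, key=lambda s: distance(name, s))
        validated[closest].append(name)
    return validated

# Hypothetical distances between candidate names and two senses of 'frégate'.
toy_distances = {
    ("oiseau", "bird"): 0.1, ("oiseau", "ship"): 0.9,
    ("navire", "bird"): 0.9, ("navire", "ship"): 0.1,
    ("voilier", "bird"): 0.3, ("voilier", "ship"): 0.3,
    ("bateau", "bird"): 0.8, ("bateau", "ship"): 0.2,
}
result = validate_candidates(
    attached={"bird": ["oiseau", "voilier"], "ship": ["navire", "voilier"]},
    unattached=["bateau"],
    distance=lambda name, sense: toy_distances[(name, sense)],
)
print(result)
```

In this toy run ‘voilier’ is dropped because it occurs under both senses, so the remaining names can be mapped back to a single sense, which is what the bijection requires.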
Sense Naming 5/10 Evaluation
• Extraction grade
• Evaluating the capacity to disambiguate (to distinguish a sense from all others)
• Evaluating the capacity to associate
• Cognitive cost reduction [Prince]
Sense Naming 6/10 Extraction grade
• ‘frégate’: [nf] [ancien] Au XVe s., grande barque demi-pontée gréant deux voiles latines sur antenne et assurant la liaison entre les ports et les escadres de galère. [Club Internet]
[Figure: syntactic analysis of this definition into Sujet (subject), GV (verb group), COD (direct object) and CC (adverbial) constituents, over the segments ‘au XVe’, ‘grande barque demi-pontée’, ‘gréant’, ‘deux voiles latines’ and ‘sur antennes’.]
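Sense Naming 7/10 Disambiguation capacity 1/2
[Figure: distances between the senses of the candidate ‘vaisseau’ (t.11 navire [ship], t.12 sanguin [blood vessel]) and the senses of ‘frégate’ (w.1 oiseau [bird], w.2 navire ancien [old ship], w.3 navire moderne [modern ship]). Labelled distances: d1 = 0.3, d2 = 0.4, d3 = 0.2. Other edge values: 0.85, 0.8, 0.95, 1.2.]
• Absolute margin: Ma = |d1 - d2| = 0.1
• Relative margin: Mr = Ma / d1 = 0.1 / 0.3 = 0.33
• Risk of ‘non-sens’: Rns = d3 / Mr = 0.2 / 0.33 = 0.6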
Sense Naming 8/10 Disambiguation capacity 2/2
[Figure: the same measures for two candidates of ‘frégate’ (w.1 oiseau, w.2 navire ancien, w.3 navire moderne), shown side by side.]
• Candidate ‘vaisseau’ (t.11 navire, t.12 sanguin)
  - d1 = 0.3, d2 = 0.4, d3 = 0.2 (other edge values: 0.85, 0.8, 0.95, 1.2)
  - Ma = |d1 - d2| = 0.1
  - Mr = Ma / d1 = 0.33
  - Rns = d3 / Mr = 0.6
• Candidate ‘voilier’ (t.11 oiseau, t.12 navire)
  - d1 = 0.25, d2 = 0.29, d3 = 0.65 (other edge values: 0.7, 0.72, 0.72, 0.3)
  - Ma = |d1 - d2| = 0.04
  - Mr = Ma / d1 = 0.16
  - Rns = d3 / Mr ≈ 4
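The arithmetic on the two ‘Disambiguation capacity’ slides can be checked with a few lines of code. The sketch below only reproduces the printed formulas; which sense pairs d1, d2 and d3 connect cannot be recovered from the flattened figure, so they are taken as given inputs. The function name `disambiguation_scores` is an assumption, and the absolute value is used so that the margins come out as the positive 0.1 and 0.04 shown on the slides.

```python
def disambiguation_scores(d1, d2, d3):
    """Return (absolute margin, relative margin, risk of 'non-sens')."""
    ma = abs(d1 - d2)   # absolute margin Ma
    mr = ma / d1        # relative margin Mr
    rns = d3 / mr       # risk of 'non-sens' Rns
    return ma, mr, rns

# Distances read from the slides for two candidate names of 'frégate'.
vaisseau = disambiguation_scores(0.3, 0.4, 0.2)
voilier = disambiguation_scores(0.25, 0.29, 0.65)

for name, (ma, mr, rns) in (("vaisseau", vaisseau), ("voilier", voilier)):
    print(f"{name}: Ma={ma:.2f} Mr={mr:.2f} Rns={rns:.2f}")
```

On these figures ‘voilier’ has a smaller relative margin and a much higher risk value than ‘vaisseau’.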
Sense Naming 9/10 Cognitive cost survey
• Done for 13 terms totalling 38 definitions; 134 answers
• Kinds of names given by respondents [Church; Daille; Véronis]
  - collocations (botte de paille, …)
  - co-occurrences (Tintin / Milou)
  - synonyms and hypernyms (manger → se nourrir, mouche → insecte → animal)
  - domain / context for technical terms (médecine, architecture, agriculture, sport, …)
Sense Naming 10/10 Results
• The multi-criteria approach seems well suited
  - easily extensible
  - strong precision
• Enhancement needed for
  - meta-language processing
  - criteria implementation (associative memory, lexical functions) [Mel’cuk; Schwab]
  - synthesis grammar (botte/secret/ vs. botte/secrète/), example: ‘botte’
• Useful for multilingual lexical databases
Lexical Augmentation
• Multilingual lexical database: some terms are not lexicalized in some languages
• Objective: lexicalize these terms
Lexical Augmentation 1/2
• Papillon project [Boitet; Mangeot-Lerebours; Sérasset; Lepage]
[Figure: acceptions linking English and French entries, in three columns.]
  - ENGLISH: giblets, offal, beef offal, pork offal, refuse, scrap
  - ACCEPTIONS: giblets, offal.1, offal.2, refuse
  - FRANÇAIS: abats de volaille, abats, abats de bœuf, abats de porc, déchet
Lexical Augmentation 2/2 Procedure
• Extraction from the definition and the sense name (glosses of dictionaries)
  - abats = {‘porc’, ‘volaille’, ‘bœuf’, …}
• Patterns
  - ‘abats de volaille’, ‘abats en volaille’, …
• Pattern validation with co-occurrences
  - relative number of hits in Google
• Difficulties
  - ‘dog meat’: ‘viande pour chien’ (meat for dogs) or ‘viande de chien’ (meat from dogs)?
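The pattern step can be sketched as follows: build one candidate phrase per preposition, then rank the candidates by their share of the total hit count. This is an illustration under stated assumptions, not the authors' code. The function name `rank_patterns` and the `hit_count` callback are hypothetical, and the counts below are invented placeholders, not real search-engine results.

```python
def rank_patterns(head, modifier, prepositions, hit_count):
    """Rank candidate phrases 'head prep modifier' by relative hit count.

    hit_count is any function returning a number of hits for a phrase
    (the slides mention Google; no real service is queried here).
    """
    candidates = [f"{head} {prep} {modifier}" for prep in prepositions]
    hits = {phrase: hit_count(phrase) for phrase in candidates}
    total = sum(hits.values())
    if total == 0:
        return []  # no evidence for any candidate
    ranked = [(phrase, count / total) for phrase, count in hits.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)

# Placeholder counts, made up for the example.
fake_counts = {"abats de volaille": 90, "abats en volaille": 10}
ranked = rank_patterns("abats", "volaille", ["de", "en"],
                       lambda phrase: fake_counts.get(phrase, 0))
print(ranked)
```

A relative score avoids favouring frequent head words, but it cannot settle cases like ‘viande pour chien’ versus ‘viande de chien’, where both phrases are attested with different meanings.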
Conclusion
• Clustering
  - promising results
  - manual evaluation on 100 difficult terms: 70 % of proper clusters, 30 % of wrong assignments (locutions)
  - open problem: increasing the number of clusters
  - maturing of the basic clusters
• Sense Naming, complementary with conceptual vectors
  - good precision
  - manual evaluation: 90 % of pertinent terms
  - automatic evaluation: 70 % (angular distance)
• Towards a synthesis grammar
  - botte/secret/ vs. botte/secrète/
• Future work
  - more criteria (associative memory, more lexical functions)
  - enhance definition analysis (meta-language)
Contribution
• Theoretical
  - formalization of the ‘disambiguation capacity’ and of the ‘risk of non-sens’
  - formalization of annotation in lexical semantics
  - proposal of a generic similarity measure between definitions
• Practical
  - implementation as agents: categorization, naming (Web services)
  - lexical augmentation (in progress)
• Dissemination
  - a poster at RECITAL’2003 (Batz-sur-Mer, 10-14 June 2003)
  - a paper at Papillon’2003 (Sapporo, 2-6 July 2003)
  - submission to RFIA’2004