UNL Lexical Selection with Conceptual Vectors LREC-2002, Las Palmas, May 2002 Mathieu Lafourcade & Christian Boitet LIRMM, Montpellier & GETA, CLIPS, IMAG, Grenoble Christian.Boitet@imag.fr http://www-clips.imag.fr/geta Mathieu.Lafourcade@lirmm.fr http://www.lirmm.fr/~lafourca
Outline • The problem: disambiguation in UNL-French deconversion • Finding the known UW nearest to an unknown UW • Finding the best French lemma for a given UW • Conceptual vectors • Nature & example on French (873 dimensions) • Building (Dec. 2001: 64,000 terms, 210,000 CVs) • CVD (CV Disambiguation) running for French • Recooking the vectors attached to a document tree • Placing each recooked vector in the word sense tree • Using CVD in UNL-French deconversion: ongoing
The UNL-FR deconversion process (pipeline diagram): UNL-L1 graph → validation & localization → UNL-FRA graph (UW) → lexical transfer → UNL-FRA graph (French LU) → graph-to-tree conversion → "UNL tree" → structural transfer → GMA structure → paraphrase choice → UMA structure → syntactic generation → UMC structure → morphological generation → French utterance; conceptual vector computations feed the lexical steps.
The problem: disambiguation in UNL-French deconversion • Find the known UW nearest to an unknown UW • known UWs: obj(open(icl>occur),door) = "a door opens" (in KB context); obj(open(icl>do),door) = "one opens a door" • input graph: obj(open(icl>occur,ins>concrete thing),door), ins(open(icl>occur,ins>concrete thing),key…) = "a key opens a door" / "a door opens with a key" • ==> choose the nearest open(icl>occur) for a correct result • Find the best French lemma for a UW in a given context • meeting(icl>event) ==> réunion [ACTION, DURATION…] or rencontre [EVENT, MOMENT…]
How to solve them? • unknown UW → best known UW • Accessing the KB in real time is impractical (web server) • The KB is not enough: still many possible candidates • known UW → best LU • Often no clear symbolic conditions for selection • Possibility to transform the UNL→LUfr dictionary into a kind of neural net (cf. MSR MindNet) • A possible unifying solution: lexical selection through DCV (Disambiguation using Conceptual Vectors), which works quite well for French in large-scale experiments
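A minimal sketch of the first DCV usage, matching an unknown UW to the nearest known UW by vector distance. The 3-dimensional vectors and their values are made up for illustration (the real space has 873 Larousse concepts):

```python
import math

def angular_distance(x, y):
    """DA(x, y): angle between two conceptual vectors, in [0, pi]."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    # Clamp against rounding slightly outside [-1, 1] before acos.
    return math.acos(max(-1.0, min(1.0, dot / (nx * ny))))

# Hypothetical CVs for the two known UWs of "open" (toy 3-d space).
known_uws = {
    "open(icl>occur)": [0.9, 0.1, 0.2],
    "open(icl>do)":    [0.2, 0.9, 0.3],
}

def nearest_uw(unknown_cv):
    """Pick the known UW whose CV makes the smallest angle with the input."""
    return min(known_uws, key=lambda uw: angular_distance(known_uws[uw], unknown_cv))

# An unknown UW whose context vector leans toward the 'occur' reading.
print(nearest_uw([0.8, 0.2, 0.25]))  # → open(icl>occur)
```

The same argmin-over-distance pattern serves for the second usage (UW → best French LU), only with a different candidate set.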
Conceptual vectors • CV = vector in concept space (4th level in Larousse) • V(to tidy up) = CHANGE [0.84], VARIATION [0.83], EVOLUTION [0.82], ORDER [0.77], SITUATION [0.76], STRUCTURE [0.76], RANK [0.76] … • V(to cut) = GAME [0.8], LIQUID [0.8], CROSS [0.79], PART [0.78], MIXTURE [0.78], FRACTION [0.75], TORTURE [0.75], WOUND [0.75], DRINK [0.74] … • Global vector of a term = normalized sum of the CVs of its meanings/senses • V(head) = HEAD [0.83], BEGINNING [0.75], ANTERIORITY [0.74], PERSON [0.74], INTELLIGENCE [0.68], HIERARCHY [0.65], …
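The "global vector = normalized sum of the sense CVs" rule can be sketched as follows; the 4-dimensional toy vectors stand in for the 873 Larousse concepts:

```python
import math

def normalize(v):
    """Scale a vector to unit length (L2 norm); leave the zero vector alone."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else v

def global_vector(sense_cvs):
    """Global CV of a term: normalized sum of the CVs of its senses."""
    dims = len(sense_cvs[0])
    total = [sum(cv[i] for cv in sense_cvs) for i in range(dims)]
    return normalize(total)

# Two made-up sense vectors of one term, in a toy 4-concept space.
cv_sense1 = [0.8, 0.1, 0.0, 0.3]
cv_sense2 = [0.2, 0.7, 0.4, 0.0]
print(global_vector([cv_sense1, cv_sense2]))
```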
Conceptual vectors and sense space • Conceptual vector model • Reminiscent of vector models (Salton et al.) & Sowa • Applied to preselected concepts (not terms) • Concepts are not independent • Set of k basic concepts • Thesaurus Larousse = 873 concepts (translation of Roget’s) • A vector = an 873-tuple of reals in [0..1] • Encoding for each dimension: 2^15 values, i.e. [0..32767] • Sense space = vector space + vector set
Thematic relatedness • Conceptual vector distance • Angular distance DA(x, y) = angle(x, y) • 0 <= DA(x, y) <= π • Interpretation • if DA(x, y) = 0: x // y (colinear), same idea • if DA(x, y) = π/2: x ⊥ y (orthogonal), nothing in common • if DA(x, y) = π: y = -x, the anti-idea of x
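The angular distance above is directly computable from the dot product; a small sketch with unit test cases for the three interpretations:

```python
import math

def angular_distance(x, y):
    """DA(x, y) = angle between x and y, in [0, pi]."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    # Clamp the cosine against floating-point drift before acos.
    return math.acos(max(-1.0, min(1.0, dot / (nx * ny))))

x = [1.0, 0.0]
print(angular_distance(x, [2.0, 0.0]))   # colinear: 0 (same idea)
print(angular_distance(x, [0.0, 1.0]))   # orthogonal: pi/2 (nothing in common)
print(angular_distance(x, [-1.0, 0.0]))  # opposite: pi (anti-idea)
```

Real CVs have components in [0..1], so the anti-idea -x never occurs as an actual vector; π is the theoretical upper bound.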
Collection process • Start from a few handcrafted terms/meanings/vectors • <do forever> // running constantly on Lafourcade’s Mac • <choose a word at random (with or without a CV)> • find NL definitions of its senses (mainly on the Web) • for each sense definition SD • analyze SD into a linguistic tree TreeDef • attach existing or null CVs to the lexical nodes of TreeDef • iterate propagation of CVs in TreeDef (linguistic rules used here) until CV(root) converges or the cycle limit is reached • CV(sense) ← CV(root(TreeDef)) • use vector distance to arrange the CVs of the senses into a binary « discrimination tree » • </choose> </do>
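The "iterate propagation until CV(root) converges" step can be caricatured as a fixed-point loop. This toy version only sums leaf CVs into the root (the real system applies linguistic rules over the whole definition tree); all names and values are illustrative:

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else v

def propagate(leaf_cvs, max_cycles=100, eps=1e-6):
    """Recompute the root CV from the leaf CVs until it stops moving,
    or until the cycle limit is reached (mirrors the slide's loop)."""
    root = [0.0] * len(leaf_cvs[0])
    for _ in range(max_cycles):
        summed = [sum(cv[i] for cv in leaf_cvs) + root[i] for i in range(len(root))]
        new_root = normalize(summed)
        if max(abs(a - b) for a, b in zip(new_root, root)) < eps:
            return new_root
        root = new_root
    return root

print(propagate([[1.0, 0.0], [1.0, 0.0]]))  # converges to [1.0, 0.0]
```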
Status on French CVs • By Dec. 2001 • 64,000 terms • 210,000 CVs • Average of 3.3 senses/term • Method • a robot accessing web lexicon servers • a large-coverage French analyzer by J. Chauché in Sygmart • See more details at • http://www.lirmm.fr/~lafourca
Disambiguation in French • Recook the vectors attached to a document tree • Take a document • Analyze it with the Sygmart analyzer into ONE possibly big tree (30 pages OK as a unit) • Use the same process as for processing definitions • Final CV(root) usable as a thematic classifier of the document • Final CV(lexemes) used as « sense in context » • Place each recooked vector in the discrimination tree • Walk down the discrimination tree, using vector distance • Stop at the nearest node: • If a leaf node, full disambiguation (relative to the available sense set) • If an internal node, partial disambiguation (subset of senses)
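The walk down the discrimination tree can be sketched as follows. The tree, sense labels, and CV values are made up; the stopping rule (halt when no child is closer than the current node) is one plausible reading of "stop at the nearest node":

```python
import math

def dist(x, y):
    """Angular distance between two conceptual vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return math.acos(max(-1.0, min(1.0, dot / (nx * ny))))

class Node:
    def __init__(self, cv, senses, left=None, right=None):
        self.cv, self.senses, self.left, self.right = cv, senses, left, right

def place(node, cv):
    """Descend toward the closer child while it improves on the current
    node; a leaf gives full disambiguation, an internal stop a subset."""
    while True:
        children = [c for c in (node.left, node.right) if c]
        if not children:
            return node.senses                     # leaf: full disambiguation
        best = min(children, key=lambda c: dist(c.cv, cv))
        if dist(best.cv, cv) >= dist(node.cv, cv):
            return node.senses                     # internal: partial disambiguation
        node = best

left = Node([1.0, 0.0], ["open/occur"])
right = Node([0.0, 1.0], ["open/do"])
root = Node([0.7, 0.7], ["open/occur", "open/do"], left, right)
print(place(root, [0.9, 0.1]))  # → ['open/occur']
```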
Example with some ambiguities • The white ants strike rapidly the trusses of the roof
Initialize: attach CVs to lexemes • The white ants strike rapidly the trusses of the roof
Result: sense selection • The white ants strike rapidly the trusses of the roof
Disambiguation in UNL-French deconversion • Our set-up • Example input UNL-graph • Outline of the process • Two usages of DCV (disambiguation with CV) • Finding the known UW nearest to an unknown UW • Finding the best French lemma for a given UW
A UNL input graph • “Ronaldo has headed the ball into the left corner of the goal”
Corresponding UNL-tree with CVs attached: localization DCV (tree diagram) • score(icl>event,agt>human,fld>sport).@entry.@past.@complete, with V = Vevent(score) + Vhuman(score) + Vsport(score) • head(pof>body): ins, with Vbody(head) • Ronaldo: agt / pos, with V(human) • corner: plt, with Vplace(corner) • goal(icl>thing): obj, with Vthing(goal) • left: mod, with V(left)
Result of first step: the « best » UWs • The vector contextualization generalizes both kinds of localization (lexical and cultural). • On each node, the selected UW is the one in the UNL-French database whose vector is closest to the contextualized vector. • Specific formulas are used for upward and downward propagation.
Second step: select the « best » LUs • Depending on the strategy of the generator, a lexical unit (LU) may be • a lemma • a whole derivational family • (pay, payment, payable…) • Dictionary: <UW, CVdict> → {<LUi, CVi>} • Input: <UW, CVcontext> • Output: the LUi with the nearest CVi
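A sketch of this second step for the meeting(icl>event) example from the problem slide. The candidate CVs are invented for illustration (réunion leaning toward ACTION/DURATION concepts, rencontre toward EVENT/MOMENT ones):

```python
import math

def angular_distance(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return math.acos(max(-1.0, min(1.0, dot / (nx * ny))))

# Hypothetical dictionary entry for meeting(icl>event): French LU candidates.
candidates = {
    "réunion":   [0.9, 0.2, 0.1],   # toward ACTION, DURATION ...
    "rencontre": [0.2, 0.9, 0.3],   # toward EVENT, MOMENT ...
}

def best_lu(cv_context):
    """Output the LUi whose CVi is nearest to the contextualized vector."""
    return min(candidates, key=lambda lu: angular_distance(candidates[lu], cv_context))

print(best_lu([0.1, 0.8, 0.35]))  # context about an event → rencontre
```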
Conclusion • Another case of fruitful integration of symbolic & numerical methods • Further work planned • integration into the running UNL-FR server • work on feedback (Pr. Su’s line of thought) • if the user corrects the choice of LU for the chosen UW • or, worse, if the user chooses a LU corresponding to another UW! ==> then recompute the vectors, giving more weight to the chosen CVs