Extending WordNet with syntagmatic information Luisa Bentivogli, Emanuele Pianta ITC-irst 2nd GWC, January 20th-23rd 2004 - Brno
Overview
• WordNet: paradigmatic vs syntagmatic information
• Recurrent Free Phrases
• Encoding RFP through Phrasets and Syntagmatic Relations
• Getting RFPs in bilingual dictionaries and corpora
• Conclusions
Paradigmatic vs Syntagmatic
• Example sentence: An international conference took place in Brno
• Paradigmatic relations (in absentia)
  • international: national
  • conference: meeting, symposium
  • Brno: Prague, Czech Republic
• Syntagmatic relations (in presentia)
  • multiword expression
  • free phrase
  • semantic restriction
Why is syntagmatic info useful?
• From a lexicographic point of view
  • See examples of usage in dictionaries (and WN itself)
  • Often a very short phrase
  • Sometimes more useful than definitions
• From a computational point of view
  • statistics-oriented, corpus-based methods
  • crucial role of co-occurrence information
  • co-occurrence of words vs meanings
Lexical units in WordNet
• Criterion for inclusion in synsets: only lexicalized concepts
• What counts as a lexical unit
  • Simple words: {tree}
  • Idioms
    • non-compositional meaning
    • {rollercoaster, big_dipper, ...}
  • Restricted collocations
    • compositional, reduced substitution, no literal translation
    • {criminal_record, record} (Italian: precedenti penali)
  • Named entities: {Praha, capital_of_the_Czech_Republic, …}
Problems with inclusion criteria - 1
• Artificial nodes: synsets with no lexical unit
  • {social_group} - {gruppo_sociale}
• Free combinations of words (Benson et al., 1986)
  • DEF: a combination of words following only the general rules of syntax
• Restricted collocations:
  • reduced substitution, no literal translation, but compositional
  • ex: circulatory system (*blood system, *circulation system)
  • are they lexical units?
  • should we include them in synsets?
• Can we “keep” the information currently contained in artificial nodes and restricted collocations without violating the criterion for inclusion in synsets?
Problems with inclusion criteria - 2
• A considerable number of expressions which are systematically used to express a concept are excluded from (Multi)WordNet because they are not lexical units
• Ex: “andare in bicicletta” [to bike]
  • andare: to move by walking or using a means of locomotion
  • in bicicletta: by bike
• Ex: “punta di freccia” [arrowhead]
Introducing Recurrent Free Phrases
• Recurrent free phrase (RFP): a free combination of words which is recurrently used to express a concept
• 1. Syntactically constrained: N|V|A|P phrases (cf. restricted collocations)
• 2. High frequency (“governo italiano”, Italian government)
• 3. High degree of association (“prima volta”, first time)
• 4. Salience:
  • intuition of the native-speaker lexicographer that a certain expression picks out a concept which is perceived as relevant and somehow unitary
  • not necessarily related to frequency and word association
  • “vertice internazionale”, international summit (high salience)
  • “coscia destra”, right thigh
The salience criterion
• Hypothesis:
  • Related to the amount of world knowledge that is attached to a certain phrase
  • Such knowledge cannot be inferred from the meanings composing the phrase
• Example:
  • right hand (more salient)
  • right thigh
Recurrent Free Phrases for NLP
• Knowledge-based word alignment of parallel corpora
  • EX: cornfield ~ campo di grano
• Word Sense Disambiguation
  • campo: 12 senses in MWN
  • grano: 9 senses
  • both unambiguous in “campo di grano”
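The Word Sense Disambiguation point can be illustrated with a minimal sketch. It assumes a hypothetical phraset index that records, for each RFP, which sense each component word takes inside the phrase. The sense counts come from the slide; the sense labels (`campo#field`, `grano#cereal`) and the function name `disambiguate` are invented for illustration and are not MultiWordNet identifiers.

```python
from typing import Dict, List, Tuple

# Sense counts reported on the slide for MultiWordNet.
SENSE_COUNTS = {"campo": 12, "grano": 9}

# Hypothetical phraset index. The sense labels are placeholders,
# not real MultiWordNet sense identifiers.
PHRASET_INDEX: Dict[Tuple[str, ...], Dict[str, str]] = {
    ("campo", "di", "grano"): {"campo": "campo#field", "grano": "grano#cereal"},
}


def disambiguate(tokens: List[str]) -> Dict[int, str]:
    """Map token positions to sense labels for tokens covered by a known RFP."""
    senses: Dict[int, str] = {}
    for phrase, component_senses in PHRASET_INDEX.items():
        n = len(phrase)
        # Slide a window of the phrase length over the sentence.
        for start in range(len(tokens) - n + 1):
            if tuple(tokens[start:start + n]) != phrase:
                continue
            for offset, word in enumerate(phrase):
                if word in component_senses:
                    senses[start + offset] = component_senses[word]
    return senses
```

Calling `disambiguate("un campo di grano maturo".split())` resolves both "campo" and "grano" in one lookup, whereas each word taken alone would leave 12 and 9 candidate senses respectively.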
Criteria for RFP selection
• RFPs expressing a concept which is not lexicalized in one language but is lexicalized in another (lexical gaps)
  • EX: andare in bicicletta [to bike]
• RFPs synonymous with a lexical unit in the same language
  • EX: strofinaccio dei piatti / canovaccio [dishcloth]
• RFPs that are frequent, cohesive and salient within a reference corpus
  • EX: vertice internazionale [international summit]
• RFPs whose components are highly polysemous
  • EX: campo di grano [cornfield]
MultiWordNet
• MultiWordNet: Italian/English lexical database
• Princeton WordNet building criteria
• Strict alignment (see expand model)
• Explicit treatment of lexical gaps
• Italian (44,000 words) and Hebrew (University of Haifa, just started)
• Cf. Spanish WordNet (EuroWordNet)
Introducing Phrasets
• Phraset: a set of synonymous recurrent free phrases
• Example 1
  • ENG-synset {cornfield}
  • ITA-synset {GAP}
  • ITA-phraset {campo_di_grano}
• Example 2
  • ENG-synset {toilet_roll}
  • ITA-synset {GAP}
  • ITA-phraset {rotolo_di_carta_igienica}
• Example 3
  • ENG-synset {dishcloth}
  • ITA-synset {canovaccio}
  • ITA-phraset {strofinaccio_dei_piatti, strofinaccio_da_cucina}
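One way to picture the three examples is as records that pair an English synset with an Italian synset and an Italian phraset. The sketch below is only an illustration under that assumption: the class name `AlignedConcept`, its fields and its methods are hypothetical and do not describe the actual MultiWordNet schema.

```python
from dataclasses import dataclass, field
from typing import List

GAP = "GAP"  # marker for a concept with no lexical unit in the language


@dataclass
class AlignedConcept:
    eng_synset: List[str]
    ita_synset: List[str]  # [GAP] when Italian has no lexical unit
    ita_phraset: List[str] = field(default_factory=list)

    def is_lexical_gap(self) -> bool:
        return self.ita_synset == [GAP]

    def italian_expressions(self) -> List[str]:
        """Lexical units first, then recurrent free phrases."""
        units = [] if self.is_lexical_gap() else list(self.ita_synset)
        return units + list(self.ita_phraset)


# The three examples from the slide.
CONCEPTS = [
    AlignedConcept(["cornfield"], [GAP], ["campo_di_grano"]),
    AlignedConcept(["toilet_roll"], [GAP], ["rotolo_di_carta_igienica"]),
    AlignedConcept(
        ["dishcloth"],
        ["canovaccio"],
        ["strofinaccio_dei_piatti", "strofinaccio_da_cucina"],
    ),
]
```

Keeping the phraset separate from the synset means a lexical gap stays marked as a gap, while the phraset still supplies the Italian expressions for the concept.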
RFPs vs definitions
• RFPs are not definitions
• Example 1
  • E-synset {tree -- a tall perennial woody plant having a main trunk …}
  • I-synset {albero -- ogni pianta perenne con fusto legnoso ramificato}
  • I-phraset {}
• Example 2
  • E-synset {paperboy}
  • I-synset {GAP -- ragazzo che recapita i giornali}
  • I-phraset {ragazzo_dei_giornali}
• Example 3
  • E-synset {straphanger}
  • I-synset {GAP -- chi viaggia in piedi su mezzi pubblici reggendosi ad un sostegno}
  • I-phraset {}
Synsets vs Phrasets
• Free combinations of words
• Phrasets: Recurrent Free Phrases
• Synsets: Restricted collocations, Named entities, Idioms, Simple words
Syntagmatic Relations in WN
• MEANING project: using the “involve” semantic relation to encode deep selectional restrictions
• Can RFPs be encoded through semantic relations?
Encoding “campagna antifumo” - 1
• Through phrasets
• Synset: {campagna}, Phraset: {} (campaign)
• hypernym link between the two nodes
• Synset: {GAP}, Phraset: {campagna_antifumo} (campaign against smoking)
Encoding “campagna antifumo” - 2
• Through a semantic relation
• Synset: {campagna} has_constraint Synset: {antifumo}
• Meaning: campaign against smoking
Pros and cons of using semantic relations for encoding RFPs
• Smart and concise, but what about
  • trigram RFPs?
  • synonymous RFPs?
  • RFPs that are translation equivalents of lexical units?
  • restrictions on word order and word morphology?
Taking the best of both encodings
• Phrasets and lexical syntagmatic relations
• appezzamento (parcel): hypernym of campo (field)
• cereale (cereal): hypernym of frumento, grano (corn)
• GAP -- campo di grano (cornfield)
  • hypernym: campo (field)
  • composed-of (campo)
  • composed-of (grano)
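The combined encoding can be sketched as a small labelled graph. The relation names `hypernym` and `composed-of` come from the slide; the direction of each edge (source, relation, target), the triple representation and the helper names `targets` and `hypernym_chain` are assumptions made for this illustration.

```python
from typing import List, Tuple

# (source, relation, target) triples. Edge directions are an assumption:
# "X hypernym Y" is read here as "Y is the hypernym of X".
RELATIONS: List[Tuple[str, str, str]] = [
    ("campo", "hypernym", "appezzamento"),
    ("grano", "hypernym", "cereale"),
    ("campo_di_grano", "hypernym", "campo"),
    ("campo_di_grano", "composed-of", "campo"),
    ("campo_di_grano", "composed-of", "grano"),
]


def targets(source: str, relation: str) -> List[str]:
    """All nodes reached from `source` through `relation`."""
    return [t for s, r, t in RELATIONS if s == source and r == relation]


def hypernym_chain(node: str) -> List[str]:
    """Follow the first hypernym link upwards until none is left."""
    chain: List[str] = []
    seen = {node}
    while True:
        parents = targets(node, "hypernym")
        if not parents or parents[0] in seen:  # stop at the top or on a cycle
            return chain
        node = parents[0]
        seen.add(node)
        chain.append(node)
```

In this reading the phraset keeps its place in the hierarchy through the hypernym link, while the composed-of links point at the specific word senses that make up the phrase.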
RFPs in Bilingual Dictionaries
• Collins bilingual dictionary (medium size)
• Italian Translation Equivalents (Bentivogli and Pianta, 2000)
  • 92.2% correspond to lexical units
  • 7.8% correspond to free combinations of words (lexical gaps)
• Manual check of 300 lexical gaps
  • 67% correspond to RFPs
• => More than half of the synsets which are gaps in Italian potentially have an associated phraset
RFPs in corpora
• Correlation between RFPs and frequency?
• Analysis of a 32M word corpus (Repubblica, 2000-2001)
• Standard n-gram analysis package (NSP)
• All bigrams containing at least one stopword were excluded
• 118,464 bigrams occurring more than 3 times
• Highest rank: 5,914 occurrences (“New York”)
• Rank 4: 31,453 bigrams
• 497 distinct ranks (frequency classes)
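The counting step can be approximated in a few lines. The slide reports using the NSP package; the sketch below is a plain Python stand-in, not NSP itself. The stopword list is a tiny illustrative sample, and the function name `count_bigrams` and its parameters are assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Tiny illustrative stopword list (an assumption, not the list used in the study).
STOPWORDS = {"di", "il", "la", "in", "a", "e", "un"}


def count_bigrams(
    tokens: List[str],
    stopwords: Iterable[str] = STOPWORDS,
    min_count: int = 4,  # "more than 3 times"
) -> Dict[Tuple[str, str], int]:
    """Count adjacent word pairs, skipping pairs that contain a stopword."""
    stop = {w.lower() for w in stopwords}
    counts = Counter(
        (first, second)
        for first, second in zip(tokens, tokens[1:])
        if first.lower() not in stop and second.lower() not in stop
    )
    # Keep only bigrams that reach the frequency threshold.
    return {bigram: n for bigram, n in counts.items() if n >= min_count}
```

With this threshold the lowest surviving frequency class is 4, which matches the lowest rank reported on the slide.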
RFPs in corpora cont.
• Lower ranks are systematically and densely populated
• Higher ranks are sparsely and poorly populated
• Rank groups (frequency range, number of bigrams)
  • A: 5,914-509 (100 bigrams)
  • B: 505-257 (257)
  • C: 256-129 (731)
  • D: 128-65 (1,965)
  • E: 64-33 (4,525)
  • F: 32-17 (10,477)
  • G: 16-9 (22,167)
  • H: 8-5 (46,798)
  • I: 4 (31,453)
• Manual check of 100 random bigrams from each rank group
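The grouping and sampling steps can be sketched as follows. The lower bounds are read off the ranges on the slide; treating any frequency of 509 or more as group A is an assumption, since the slide lists only observed frequencies. The names `rank_group` and `sample_per_group`, and the fixed random seed, are hypothetical.

```python
import random
from typing import Dict, List, Optional, Tuple

Bigram = Tuple[str, str]

# Lower frequency bound of each rank group, taken from the slide.
GROUP_LOWER_BOUNDS = [
    ("A", 509), ("B", 257), ("C", 129), ("D", 65), ("E", 33),
    ("F", 17), ("G", 9), ("H", 5), ("I", 4),
]


def rank_group(freq: int) -> Optional[str]:
    """Return the rank group for a frequency, or None below the threshold."""
    for label, lower in GROUP_LOWER_BOUNDS:
        if freq >= lower:
            return label
    return None


def sample_per_group(
    freqs: Dict[Bigram, int], k: int = 100, seed: int = 0
) -> Dict[str, List[Bigram]]:
    """Draw up to k random bigrams from each rank group for manual checking."""
    rng = random.Random(seed)
    members: Dict[str, List[Bigram]] = {}
    for bigram, freq in freqs.items():
        label = rank_group(freq)
        if label is not None:
            members.setdefault(label, []).append(bigram)
    # Sort before sampling so the result depends only on the seed.
    return {
        label: rng.sample(sorted(items), min(k, len(items)))
        for label, items in members.items()
    }
```

From group B downwards the boundaries halve at each step, so each group covers one doubling of frequency; group A collects the 100 most frequent bigrams.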
RFPs in corpora cont.
• Manual check of 100 random bigrams from each rank group
• NB: similar results on trigrams
Correlation between number of RFPs and frequency in a reference corpus
Future work
• Better characterization and classification
• Correlation with association measures
• Evaluating RFPs for WSD
Conclusions
• WordNet is poor in syntagmatic information
• We introduced Recurrent Free Phrases, Phrasets, and syntagmatic lexical relations
• RFP: a free combination of words recurrently used to express a concept
• Criteria for their selection
• Bilingual dictionaries contain many RFPs
• Corpora: no clear correlation with frequency
• Useful for:
  • lexicographic work
  • Word Sense Disambiguation