930 likes | 1.2k Views
Acquisition of Lexical Knowledge for NLP. German Rigau i Claramunt http://www.lsi.upc.es/~rigau TALP Research Center Departament de Llenguatges i Sistemes Informàtics Universitat Politècnica de Catalunya. Setting Words and Works Structured Sources MRDs, thesauri Unstructured Sources
E N D
Acquisition ofLexical Knowledge for NLP German Rigau i Claramunt http://www.lsi.upc.es/~rigau TALP Research Center Departament de Llenguatges i Sistemes Informàtics Universitat Politècnica de Catalunya
Setting Words and Works Structured Sources MRDs, thesauri Unstructured Sources corpora Acquisition of Lexical Knowledge for NLPOutline Acquisition of Lexical Knowledge for NLP
NLP and the Lexicon Theoretical: WG, GPSG, HPSG. Practical: realistic complexity and coverage Lexical Bottleneck (Briscoe 91) Even worse for languages other than English Acquisition of Lexical Knowledge for NLPSetting Acquisition of Lexical Knowledge for NLP
Which LK is needed by a concrete NLP system? Where is this LK located? Which procedures can be applied? Acquisition of Lexical Knowledge for NLPSetting Acquisition of Lexical Knowledge for NLP
Which LK is needed by a concrete NLP system? Phonology: phonemes, stress, etc. Morphology: POS, etc. Syntactic: category, subcat., etc. Semantic: class, SRs, etc. Pragmatic: usage, registers, TDs, etc. Translations: translation links Acquisition of Lexical Knowledge for NLP Setting Acquisition of Lexical Knowledge for NLP
Where is this LK located? Human brain Structured Lexical Resources: Monolingual and bilingual MRDs Thesauri Unstructured Lexical Resources: Monolingual and bilingual Corpora Mixing resources Acquisition of Lexical Knowledge for NLP Setting Acquisition of Lexical Knowledge for NLP
Which procedures can be applied? Prescriptive approach Machine-aided manual construction Descriptive approach Automatic acquisition from pre-existing Lexical Resources Mixed approach Acquisition of Lexical Knowledge for NLP Setting Acquisition of Lexical Knowledge for NLP
Setting Words and Works Structured Sources MRDs, thesauri Unstructured Sources corpora Acquisition of Lexical Knowledge for NLPOutline Acquisition of Lexical Knowledge for NLP
Human brain: Linguistic String Project (Fox et al. 88) Lexical Information for 10,000 entries WordNet (Miller et al. 90) Semantic Information v1.6 with 99,642 synsets Comlex (Grishman et al. 94) Syntactic information 38,000 English words CYC Ontology (Lenat 95) a person-century of effort to produce 100,000 terms LDOCE3-NLP dictionary with 80,000 senses Words and WorksWhere is this Lexical Knowledge located? Acquisition of Lexical Knowledge for NLP
Structured Lexical Resources Monolingual MRDs: LDOCE learner’s dictionary 35,956 entries and 76,059 definitions 86% semantic and 44% pragmatic codes controlled vocabulary of 2,000 words (Boguraev & Briscoe 89) (Vossen & Serail 90) (Bruce & Guthrie 92), (Wilks et al. 93) (Dolan et al. 93), (Richardson 97) Words and WorksWhere is this Lexical Knowledge located? Acquisition of Lexical Knowledge for NLP
Structured Lexical Resources Other Monolingual MRDs: Webster’s (Jensen & Ravin 87) LPPL (Artola 93) DGILE (Castellón 93), (Taulé 95), (Rigau 98) CIDE (Harley & Glennon 97) AHD (Richardson 97) WordNet (Harabagiu 98) Bilingual MRDs Collins Spanish/English (Knigth & Luk 94) Vox/Harrap’s Spanish/English (Rigau 98) Words and WorksWhere is this Lexical Knowledge located? Acquisition of Lexical Knowledge for NLP
Structured Lexical Resources Thesauri: Roget’s Thesaurus 60,071 words in 1,000 categories (Yarowsky 92), (Grefenstette 93), (Resnik 95) Roget’s II and The New Collins Thesaurus (Byrd 89) Macquarie’s thesaurus (Grefenstette 93) Bunrui Goi Hyou Japanese thesaurus (Utsuro et al. 93) Words and WorksWhere is this Lexical Knowledge located? Acquisition of Lexical Knowledge for NLP
Structured Lexical Resources Encyclopaedia Grolier’s Encyclopaedia (Yarowsky 92) Encarta (Richardson et al. 98) Others Telephonic Guides Mixing structured lexical resources Roget’s Thesaurus and Grolier’s (Yarowsky 92) LDOCE, WN, Collins, ONTOS, UM (Knight & Luk 94) Japanese MRD to WN (Okumura & Hovy 94) LLOCE, LDOCE (Chen & Chang 98) Words and WorksWhere is this Lexical Knowledge located? Acquisition of Lexical Knowledge for NLP
Unstructured Lexical Resources Corpora: WSJ, Brown Corpus (SemCor), Hansard Proper Nouns (Hearst & Schütze 95) Idiosyncratic Collocations (Church et al. 91) Preposition preferences (Resnik and Hearst 93) Subcategorization structures (Briscoe and Carroll 97) Selectional restrictions (Resnik 93), (Ribas 95) Thematic structure (Basili et al. 92) Word semantic classes (Dagan et al. 94) Bilingual Lexicons for MT (Fung 95) Words and WorksWhere is this Lexical Knowledge located? Acquisition of Lexical Knowledge for NLP
Mixing structured and non-structured Lexical Resources MRDs and Corpora (Liddy & Paik 92) (Klavans & Tzoukermann 96) WordNet and Corpora (Resnik 93), (Ribas 95), (Li & Abe 95), (McCarthy 01) (Mihalcea & Moldovan 99) Words and WorksWhere is this Lexical Knowledge located? Acquisition of Lexical Knowledge for NLP
Syntactic Disambiguation (Dolan et al. 93) Semantic Processing (Vanderwende 95) WSD (Lesk 86), (Wilks & Stevenson 97), (Rigau 98) IR (Krovetz & Croft 92) MT (Knight and Luk 94), (Tanaka & Umemura 94) Semantically enriching MRDs (Yarowsky 92), (Knight 93), (Chen & Chan 98) Building LKBs (Bruce & Guthrie 92) (Dolan et al. 93) (Artola 93) (Castellón 93), (Taulé 95), (Rigau 98) Words and WorksLexical Acquisition from MRDs Acquisition of Lexical Knowledge for NLP
Japanese Projects EDR (Yokoi 95) Nine years project oriented to MT Bilingual Corpora with 250,000 words Monolingual, bilingual and coocurrence dictionaries 200,000 general vocabulary 100,000 technical terminology 400,000 concepts Words and WorksInternational Projects on Lexical Acquisition Acquisition of Lexical Knowledge for NLP
American Projects Comlex (Grishman et al. 94) Syntactic information for 38,000 words WordNet (Miller 90) Semantic information more than 123,000 words organised in 99,000 synsets more than 116,000 relations between synsets Pangloss (Knight & Luk 94) PUM, ONTOS, LDOCE semantic categories, WordNet Cyc (Lenat 95) common-sense knowledge 100,000 concepts and 1,000,000 axioms Words and WorksInternational Projects on Lexical Acquisition Acquisition of Lexical Knowledge for NLP
European Projects Acquilex I and II LA from monolingual and bilingual MRDs and corpora LE-Parole Large-scale harmonised set of corpora and lexicons for all the EU languages EuroWordNet To develop a multilingual WordNet for several European Languages Words and WorksInternational Projects on Lexical Acquisition Acquisition of Lexical Knowledge for NLP
Acquilex I Acquilex II EuroWordNet Acquisition of Lexical Knowledge for NLP Setting Acquisition of Lexical Knowledge for NLP
Lexical Knowledge Acquisition Mixed approach Dictionaries (MRD -> MTD -> LDB -> LKB) Partners Cambridge University Instituto di Linguistica Computazional de Pisa Amsterdam University Dublin University 30 months Thesis (Castellón 1993) (Taulé 1995) (Rigau 1998) Words and Works Acquilex Acquisition of Lexical Knowledge for NLP
Lexical Knowledge Acquisition Mixed approach Corpora Partners Cambridge University Instituto di Linguistica Computazional de Pisa Amsterdam University 30 months Thesis [Ribas 1995] (Acquisition of Selectional Restrictions) [Ageno ...] (Robust Parsing) [Padró 1998] (Relaxation labelling) [Màrquez 1999] (Desition Trees) Words and Works Acquilex II Acquisition of Lexical Knowledge for NLP
Multilingual WordNet Partners English, Spanish, Dutch, Italian (and French, German, Txec, Estonian) 25.000 noun synsets and 5.000 verbal synsets 30 months Thesis [Farreres ...] (Mapping of Bilingual dictionaries) [Daudé ...] (Mapping of hierarchies) Words and Works EuroWordNet Acquisition of Lexical Knowledge for NLP
Setting Words and Works Structured Sources MRDs, thesauri Unstructured Sources corpora Acquisition of Lexical Knowledge for NLPOutline Acquisition of Lexical Knowledge for NLP
Focusing on: the massive acquisition of LK from MRDs (conventional, in any language) using automatic methodologies Why MRDs? Structured Sources Acquisition of LK from MRDs The conventional dictionaries for human use usually “contain spelling, pronunciation, hyphenation, capitalization, usage notes for semantic domains, geographic regions, and propiety; ethimological, syntactic and semantic information about the most basic units of the language” (Amsler 81) Acquisition of Lexical Knowledge for NLP
LDOCE (Longman Dictionary of Contemporary English) DGILE (Diccionario General Ilustrado de la Lengua Española) DGLC (Diccionari General de la Llengua Catalana) DVHE (Diccionari Vox-Harrap’s Esencial) Structured Sources Dictionaries Acquisition of Lexical Knowledge for NLP
Higly coded, restricted vocabulary 76.059 senses in 30.373 entries LDOCE id, POS, Grammatical Code, Idiom, Pragmatic Code, Semantic Code (subject-preference), object-reference, indirect-object-preference, definition. |cheese_0_1| <n> <U-C> <> <FO--> <5> <> <> <(any of many kinds of) soft or firm solid food made from pressed and sometimes ripened milk solids (CURDs)> |cheese_0_2| <n> <C> <> <FO--> <J> <> <> <usu-a large shaped and wrapped quantity of this> |cheese_0_3| <n> <C> <green cheese> <FO--> <> <> <> <newly made cheese> Structured Sources Dictionaries: LDOCE Acquisition of Lexical Knowledge for NLP
Poorly coded, no restricted vocabulary 157.843 senses in 89.043 entries 1.4 million words in definitions and examples ((queso ) (ETIM l. caseu ) (Sense 1) (CA m.) (DEF Masa que se obtiene cuajando la leche, exprimiéndola para que deje suero y echándole sal para que se conserve: ~ de Gruyre; ~ de Roquefort; ~ de bola, el de tipo holandés, de forma esférica; ~ de hierba, el que se hace cuajando la leche con hierba a propósito; ~ manchego, el de pasta compacta, algo dura, crudo, de leche de oveja.) (Sense 2) (CA m.) (DEF ~ de cerdo, manjar hecho con carne de cerdo o jabalí, picada y prensada.) (Sense 3)(CA m.)(DEF ~ helado, helado compacto hecho en molde.) (Sense 4)(CA m.)(DEF Medio ~, tablero grueso, semicircular, que usan los sastres para planchar cuellos y solapas y para sentar costuras curvas.) (Sense 5)(CA m.)(REG fam.)(DEF Pie.) (Sense 6)(CA m.)(GEO Venez.)(DEF ~ frito, estafa.) (RELA 1)(TIPOR Rel.)(TXR Del l. caseu derivan numerosos tecn. como caseína, cáseo, caseificar, caseico, caseoso.)) Structured Sources Dictionaries: DGILE Acquisition of Lexical Knowledge for NLP
Poorly coded, no restricted vocabulary 89.360 senses in 51.135 entries ((formatge)(CC m.) (NS 1 > 1 > 1 > 0 > 0 > 0)(CG m.)(DF Massa alimentosa que s’obté coagulant la llet, esprement-ne el xerigot i consolidant la part presa.) (NS 2 > 1 > 0 > 1 > 0 > 0)(CG m.)(EX Formatge de Ma.) (NS 3 > 1 > 0 > 2 > 0 > 0)(CG m.)(EX Formatge fresc, salat.) (NS 4 > 1 > 0 > 3 > 0 > 0)(CG m.)(EX Ratllar formatge.) (NS 5 > 1 > 0 > 4 > 0 > 0)(CG m.)(EX Un formatge.) (FI f4) ) Structured Sources Dictionaries: DGLC (Fabra) Acquisition of Lexical Knowledge for NLP
Morphological Information POS (n, v, adj, adv, etc.) Derivative forms Composed forms Derivative Model (verbs) Sintactic Information Idioms Implicit Knowledge barrer_1_1limpiar (el suelo) con la escoba. freír_1_1cocer (un manjar) en aceite o grasa hirviendo. comprar_1_1adquirir (una cosa) a cambio de cierta cantidad de dinero. cazar_1_1buscar o perseguir (a las aves, fieras, etc.) para cogerlas o matarlas. Structured Sources Acquilex Acquisition of Lexical Knowledge for NLP
Explicit (LDOCE: Pragmatic Codes, Semantic Code, etc.;DGILE: Tema, sinonims, antonims, sentits figutats, etc.) Implicit jardín_1_1 Terreno donde se cultivan plantas y flores ornamentales. florero_1_4 Maceta con flores. ramo_1_3 Conjunto natural o artificial de flores, ramas o hierbas. pétalo_1_1 Hoja que forma la corola de la flor. tálamo_1_3 Receptáculo de la flor. miel_1_1 Substancia viscosa y muy dulce que elaboran las abejas, en una distensión del esófago, con el jugo de las flores y luego depositan en las celdillas de sus panales. florería_1_1 Floristería; tienda o puesto donde se venden flores. florista_1_1 Persona que tiene por oficio hacer o vender flores. camelia_1_1 Arbusto cameliáceo de jardín, originario de Oriente, de hojas perennes y lustrosas, y flores grandes, blancas, rojas o rosadas (Camellia japonica). camelia_1_2Flor de este arbusto. rosa_1_1Flor del rosal. Structured Sources Semantic Information Acquisition of Lexical Knowledge for NLP
Conventional dictionaries are not systematic Dictionaries are built for human use Implicit Knowledge words are described/translated in terms of words Structured Sources Main Problems Acquisition of Lexical Knowledge for NLP
The System General Frame Methodology SEISD Application of the methodology Structured Sources SEISD (Rigau 98) Acquisition of Lexical Knowledge for NLP
Characteristics of the Lexical Resources used Lexical Knowledge to be extracted Lexical Knowledge Representation The acquisition process SEISDGeneral Frame Acquisition of Lexical Knowledge for NLP
Characteristics of the Lexical Resources used DGILE Spanish/English bilingual Dictionaries WordNet Type System of the LKB SEISDGeneral Frame Acquisition of Lexical Knowledge for NLP
Characteristics of the Lexical Resources used DGILE 89,043 entries and 157,842 senses 1.4 million words in definitions and examples neither semantic nor pragmatic codes no restricted vocabulary SEISDGeneral Frame vino (l. vinu) m. Zumo de uvas fermentado; ... 2 fig. Bautizar o cristianizar, el ~, echarle agua. 3 fig. Dormir uno el ~, dormir mientras dura la borrachera; tener uno mal ~, ser pendenciero en la embriaguez. 4 p.ext. Zumo. | HOMOF.: vino (v.) , bino (v.) . REL. Enológico, enólogo, enotecnia, derivados de enología, ciencia de la vinicultura, formada del gr. oinos. Acquisition of Lexical Knowledge for NLP
Characteristics of the Lexical Resources used Spanish/English bilingual Dictionaries EEI: 16,463 entries with 28,002 translation fields EIE: 15,352 entries with 27,033 translation fields SEISDGeneral Frame vinom wine. ~ de Jerez, sherry; ~ tinto, red wine. winen vino Acquisition of Lexical Knowledge for NLP
Characteristics of the Lexical Resources used WordNet v1.6 has 123,497 content words and 99,642 synsets SEISDGeneral Frame Sense 1 wine, vino -- (fermented juice (of grapes especially)) => sake, saki -- (Japanese beverage from fermented rice ...) => vintage -- (a season's yield of wine from a vineyard) => red wine -- (wine having a red color derived from skins ...) => Pinot noir -- (dry red California table wine ...) => claret, red Bordeaux -- (dry red Bordeaux or Bordeaux-like wine) => Saint Emilion -- (full-bodied red wine from ...) => Chianti -- (dry red Italian table wine from the Chianti ...) => Cabernet, Cabernet Sauvignon -- (superior Bordeaux-type red wine) => Rioja -- (dry red table wine from the Rioja ...) => zinfandel -- (dry fruity red wine from California) Acquisition of Lexical Knowledge for NLP
Characteristics of the Lexical Resources used Type System of the LKB (Copestake 92) 527 types with 196 features SEISDGeneral Frame Acquisition of Lexical Knowledge for NLP
Lexical Knowledge to be extracted Explicit information (POS, TD, uses, etc.) Implicit information Hypernym/hyponym relations (class/subclass) Synonymy/Antonymy relations Meronym/Holonym relation (part/whole, ...) Case role relations (agentive, telic, ...) Content relations (qualia, form, constitutive, ...) Collocational relations (compounds, idioms, ...) Selectional restrictions (typical subject, object, ...) Translation Equivalences SEISDGeneral Frame Acquisition of Lexical Knowledge for NLP
Lexical Knowledge Representation LKB (Copestake 92) represent both syntactic and semantic information Type Feature Structure formalism (Carpenter 92) default inheritance lexical and phrasal rules multilingual relations SEISDGeneral Frame Acquisition of Lexical Knowledge for NLP
The System General Frame Methodology SEISD Application of the methodology Structured Sources SEISD Acquisition of Lexical Knowledge for NLP
MRD1 LKB1 LDB1 Tax1 MLKB ... ... ... ... MRDn Taxn LKBn LDBn SEISDMethodology Acquisition of Lexical Knowledge for NLP
Problems following a pure descriptive approach Circularity Errors and inconsistencies Definitions with omitted genus Top dictionary senses do not usually represent useful knowledge for the LKB Too general Too specific SEISDMethodology Acquisition of Lexical Knowledge for NLP
Prescriptive approach Manual construction of the Type System SEISDMethodology Mixed Methodology Acquisition of Lexical Knowledge for NLP
Prescriptive approach Manual construction of the Type System Descriptive approach Acquiring implicit information from MRDs SEISDMethodology Mixed Methodology Acquisition of Lexical Knowledge for NLP
Prescriptive approach Manual construction of the Type System Descriptive approach Acquiring implicit information from MRDs SEISDMethodology Mixed Methodology Acquisition of Lexical Knowledge for NLP
Step 1: Selection of the main top beginners for a semantic primitive Step 2: Exploiting genus, construction of taxonomies Step 3: Exploiting differentia Step 4: Mapping the LK into the LKB Step 5: Tlinks Generation Step 6: Validation and exploitation of the LKB SEISDMethodology Acquisition of Lexical Knowledge for NLP
The System General Frame Methodology SEISD Application of the methodology Structured Sources SEISD Acquisition of Lexical Knowledge for NLP
SEISD: Sistema d’Extracció d’Informació Semàntica de Diccionaris (Ageno et al. 92) designed to support the main methodology taking into account the characteristics of the Lexical resources used reusability of software and lexical resources allowing modular improvements minimal effort Structured Sources SEISD Acquisition of Lexical Knowledge for NLP