MultiWord Expressions in NLP Jan Odijk LOT Summerschool Utrecht, June 2004
Overview • NLP • MWEs • MWEs in NLP • MWE Types • Treatment of MWEs in selected frameworks • MWEs and the lexicon
Natural Language Processing • Automatic processing of natural language • Generation: Semantic Representation → String • Analysis: String → Semantic Representation • Example applications • Machine Translation (MT) • Information Retrieval (IR) • Cross-language Information Retrieval (CLIR) • Question-Answering
Natural Language Processing • Based on Grammars • (Popular) frameworks • Feature structure based • Head-driven Phrase Structure Grammar (HPSG) • Lexical-Functional Grammar (LFG) • Tree-based • Tree-Adjoining Grammar (TAG) • M-Grammar • Based on grammar components or dedicated modules • Decompounding • PoS-tagging • Chunking • Named Entity Recognition • Name/Address grammars • Date / Amount grammars
Natural Language Processing • Based on Statistics • No explicit grammar • Statistics • Derived from (annotated) training corpus • Tested with test corpus • Applied to new corpora • Combinations of grammar and statistics
NLP Grammar • Defines <form, meaning> pairs and structural descriptions at various levels • Components • Semantics • Syntax • Morphology • Orthography (Phonology)
NLP Grammar • Semantics • Defines the meaning of an utterance • usually synchronized with syntax (compositionality) • HPSG: CONTENTS attribute • M-Grammar: in-tandem build up • Synchronous TAG: in-tandem build-up with derivation trees • LFG: in tandem with f-structure
NLP Grammar • Syntax • Defines the syntactic structure of an utterance • Object types: Trees, DAGs • Features: attribute-value pairs • Value: atomic or structured
NLP Grammar • Syntax • Often surface syntax and deep syntax (not necessarily on a separate level) • HPSG: surface tree v. DAG • M-Grammar: surface trees v. derivation trees • LFG: c-structure v. f-structure • TAG: derived tree v. derivation tree • Alpino: surface tree v. dependency tree
NLP Grammar • Morphology • Relates (word structure, string) • Word-internal structure build-up usually in the syntactic component • Usually a rule system (intensional definition) • Simple Inflection: sometimes list of triples <base form, morph prop, word form> (extensional definition)
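The extensional definition of simple inflection can be sketched as a plain list of triples with two lookups, one per processing direction. This is only a minimal illustration: the entries, the property labels `Sing` and `Plur`, and the function names are assumptions, not taken from any particular system.

```python
# Extensional morphology: a list of <base form, morph prop, word form> triples.
# Entries and property labels are illustrative assumptions.
TRIPLES = [
    ("agency", "Sing", "agency"),
    ("agency", "Plur", "agencies"),
    ("plaat", "Sing", "plaat"),
    ("plaat", "Plur", "platen"),
]

def generate(base, prop):
    """Generation: (base form, property) -> matching word forms."""
    return [w for b, p, w in TRIPLES if b == base and p == prop]

def analyse(word):
    """Analysis: word form -> matching (base form, property) pairs."""
    return [(b, p) for b, p, w in TRIPLES if w == word]

print(generate("agency", "Plur"))
print(analyse("platen"))
```

A rule system (intensional definition) would replace the list with rules that compute the word form; the lookup interface could stay the same.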
NLP Grammar • Orthography • Relates ([String], String) • [he, said, :, “, come, in, !, ”] • He said: “come in!” • Usually trivial in generation • Easy in analysis (tokenization) for many languages • Sometimes split (erop, opgebeld) • Very problematic for Chinese, Japanese, etc.
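The analysis direction of the orthography relation (string to list of strings) can be approximated for space-separated languages with a single regular expression. The sketch below is an assumption-laden simplification: it does not lowercase, it does not handle split forms such as erop, and it would not work for Chinese or Japanese.

```python
import re

def tokenize(text):
    """Split a string into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize('He said: "come in!"'))
# → ['He', 'said', ':', '"', 'come', 'in', '!', '"']
```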
Overview • NLP • MWEs • MWEs in NLP • MWE Types • Treatment of MWEs in selected frameworks • MWEs and the lexicon
What are MWEs? • sequence of words that has lexical, orthographic, phonological, morphological, syntactic, semantic, pragmatic or translational properties not predictable from the individual components or their normal mode of combination
What are MWEs? • sequence of • Not necessarily contiguous in a concrete utterance • ...omdat hij de plaat wilde poetsen • Not necessarily always in the same order in each utterance • Hij poetste gisteren de plaat • words • Ambiguity between type and token (intentional) • Inflected word form v. lemma • Ambiguity between • Character sequences separated from other character sequences by spaces and other separators (Narrow interpretation) • Abstract lexical units of the grammar (Broad interpretation)
What are MWEs? • that has properties not predictable from the individual components and their normal mode of combination
What are MWEs? • Lexical • De plaat poetsen • Een poging wagen / doen / *maken • Dat varkentje eens wassen • Zware / *sterke shag • Scherpe kritiek • Perdre la tête / la boule / *la cervelle • Se creuser la tête / *la boule / la cervelle
What are MWEs? • Orthographic • viz. • Bijv. • www.uilots.nl • i.v.m. • Yahoo! • Groen! • Aujourd’hui (v. l’homme) • ‘s (avonds/morgens/middags)
What are MWEs? • Phonological • Over de rooie/*rode (gaan/zijn/raken) • om de dooie/*dode donder niet • op zijn dooie akkertje/gemak • op zijn dooie eentje • De kwaaie/*kwade Piet toegespeeld krijgen • Je niet in de kouwe/*koude kleren gaan zitten • Een gouwe ouwe • (but geen rode/rooie cent/duit (hebben))
What are MWEs? • Morphological • Ten gevolge van • Ter wereld • Van goeden huize • Zonder aanzien des persoons • Het lood*(je) leggen • Dat varken*(tje) wassen • De *raap is / rapen zijn gaar
What are MWEs? • Syntactic • Ten gevolge van • In opdracht van (no article) • Iemand een oor aannaaien • Rekening houden met (obligatorily indefinite) • Het bijvoeglijk(*e) naamwoord (v. een groot/grote man)
What are MWEs? • Semantic • De plaat poetsen • Dat varkentje wassen • Een bok schieten • Een flater slaan
What are MWEs? • Pragmatic • Ladies and Gentlemen • Ik heb gezegd. • Eet smakelijk! (Bon appétit!, Enjoy!) • Sincerely yours
What are MWEs? • Translational properties • Laten zien (F. montrer, E. show) • Witte wijn (P. vinho verde) • Nuclear power plant (D. atoomcentrale, G. Kernkraftwerk) • Space probe (F. sonde spatiale) • Iemand iets laten weten • inform someone of something
Overview • NLP • MWEs • MWEs in NLP • MWE Types • Treatment of MWEs in selected frameworks • MWEs and the lexicon
MWEs in NLP • MWEs occur very often in natural language • Esp. in languages with little compounding • Especially in specialized domains • Multi-word terminology
MWEs in NLP • MT • Improves parsing and translation of the MWEs • Also improves parsing hence translation of the sentence containing the MWEs (Nivre & Nilsson LREC 2004) • CLIR • Nuclear power plant • Kern- macht plant • Kern- Macht Pflanz • v. atoomcentrale / Kernkraftwerk
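The CLIR example can be reproduced with a toy translator that either looks words up one by one or consults an MWE lexicon first. Both dictionaries below are hypothetical and deliberately naive; they only mirror the word-by-word rendering versus Kernkraftwerk contrast on the slide.

```python
# Toy English-German dictionaries (illustrative assumptions).
WORD_DICT = {"nuclear": "Kern-", "power": "Macht", "plant": "Pflanze"}
MWE_DICT = {("nuclear", "power", "plant"): "Kernkraftwerk"}

def translate(words, use_mwes=True):
    """Translate a token list, optionally trying MWE entries before single words."""
    out, i = [], 0
    while i < len(words):
        match = None
        if use_mwes:
            for key in MWE_DICT:
                if tuple(words[i:i + len(key)]) == key:
                    match = key
                    break
        if match:
            out.append(MWE_DICT[match])
            i += len(match)
        else:
            # Unknown words are passed through unchanged.
            out.append(WORD_DICT.get(words[i], words[i]))
            i += 1
    return out

query = ["nuclear", "power", "plant"]
print(translate(query, use_mwes=False))  # word-by-word rendering
print(translate(query))                  # MWE-aware rendering
```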
MWEs in NLP • Problems MWEs pose for NLP • How are MWEs to be dealt with in the grammar of an NLP system? • What lexical representation of MWEs is required for this? • How can we obtain lexicons containing MWEs with such lexical representations?
Overview • NLP • MWEs • MWEs in NLP • MWE Types • Treatment of MWEs in selected frameworks • MWEs and the lexicon
Types of MWEs (I) • Fixed • Semi-flexible • Flexible
Fixed MWEs • Fixed MWEs • Words of the MWE in a fixed order • No variation in lexical item choice • Always contiguous (no other elements in between) • No inflectional processes except at the edges
Fixed MWEs • Fixed MWEs • ad hoc, stante pede, ter plaatse • Hong Kong, Kuala Lumpur, New York, San Francisco • credit card, travel agency, real estate agency • NOT • in plaats van (cf. in plaats daarvan) (‘instead of’) • carta telefonica (cf. carte telefoniche) • de plaat poetsen (‘polish the plate’, ‘bolt’)
Semi-Flexible MWEs • Semi-Flexible MWEs • MWEs with fixed order of elements • That are impenetrable for other words • Parts can be inflected
Semi-Flexible MWEs • Examples: • Chambre des représentants • House of representatives • Patatas fritas • French fries • Mise au point automatique • Autofocus • Calculateur analogique • Analogue computer
Semi-Flexible MWEs • Examples: • Cité plus haut • Above-stated • Résistant aux acides • Acid-proof • Malade en altitude • Airsick
Flexible MWEs • Flexible MWEs • Allow or require inflection in multiple parts, and • Allow permutations of subphrases, or • Allow intrusion by other phrases, or • Have controlled variation (bound pronouns)
Flexible MWEs • de plaat poetsen (‘bolt’) • Hij heeft gisteren de plaat gepoetst • …omdat hij de plaat wilde poetsen • Hij poetste gisteren de plaat • to lose one’s temper • He lost his temper • She lost her temper
Treatment • Fixed MWEs • No inflection: Relate single string to sequence of strings (in Orthography) • ([ad_hoc] , [ad, hoc]) • Lexical entry for ad_hoc • With inflection: Relate single stem to sequence of stems in Morphology • ([real, estate, agency, Plur] -> [real_estate_agency, Plur]) • Lexical entry for real_estate_agency
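Both mappings above amount to replacing a contiguous sequence of units by one unit before lexical lookup. A minimal sketch, assuming a hand-written table of fixed MWEs and a longest-match-first strategy (the table and the function name are hypothetical):

```python
# Fixed MWEs: sequence of strings (or stems) -> single lexical unit.
FIXED_MWES = {
    ("ad", "hoc"): "ad_hoc",
    ("real", "estate", "agency"): "real_estate_agency",
}
MAX_LEN = max(len(key) for key in FIXED_MWES)

def merge_fixed(units):
    """Replace each contiguous fixed MWE by one unit, longest match first."""
    out, i = [], 0
    while i < len(units):
        for n in range(min(MAX_LEN, len(units) - i), 1, -1):
            key = tuple(units[i:i + n])
            if key in FIXED_MWES:
                out.append(FIXED_MWES[key])
                i += n
                break
        else:
            # No MWE starts here: keep the unit unchanged.
            out.append(units[i])
            i += 1
    return out

print(merge_fixed(["an", "ad", "hoc", "solution"]))
print(merge_fixed(["real", "estate", "agency", "Plur"]))
```

In the second call the inflectional property `Plur` stays outside the merged stem, so inflection applies at the edge of the MWE only.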
Treatment • Semi-flexible MWEs • Require local syntax • Chunking may be enough
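Because semi-flexible MWEs keep a fixed order and stay contiguous while their parts inflect, matching on lemmas instead of word forms may already find them. The sketch below assumes the input has been lemmatized elsewhere; the lexicon, the lemma conventions (adjectives in masculine singular), and the function name are illustrative assumptions.

```python
# Semi-flexible MWEs stored as fixed-order lemma sequences.
# Citation form "carta telefonica"; adjective lemma given as "telefonico".
SEMI_FLEXIBLE = {("carta", "telefonico")}

def find_semi_flexible(tagged):
    """tagged: list of (word form, lemma). Returns sorted (start, end, mwe) spans."""
    lemmas = [lemma for _, lemma in tagged]
    hits = []
    for mwe in SEMI_FLEXIBLE:
        n = len(mwe)
        for i in range(len(lemmas) - n + 1):
            if tuple(lemmas[i:i + n]) == mwe:
                hits.append((i, i + n, mwe))
    return sorted(hits)

# Plural "carte telefoniche" is found through its lemmas.
phrase = [("due", "due"), ("carte", "carta"), ("telefoniche", "telefonico")]
print(find_semi_flexible(phrase))
```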
Treatment • Flexible MWEs • Require sophisticated syntax
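To see why flexible MWEs need more than string or chunk matching, consider a naive detector that only checks whether all content lemmas of an MWE occur somewhere in a clause, in any order. It is a sketch under strong assumptions (pre-lemmatized input, one clause at a time, a hypothetical one-entry lexicon) and it cannot tell the idiomatic reading from the literal one, which is exactly what a syntactic analysis would have to decide.

```python
# Flexible MWEs stored as the lemmas of their content words.
FLEXIBLE_MWES = {"de plaat poetsen": ["plaat", "poetsen"]}

def find_flexible(tagged):
    """tagged: list of (word form, lemma) for one clause.
    Returns {mwe: positions of its lemmas} when all lemmas are present."""
    lemmas = [lemma for _, lemma in tagged]
    hits = {}
    for name, parts in FLEXIBLE_MWES.items():
        if all(part in lemmas for part in parts):
            hits[name] = [lemmas.index(part) for part in parts]
    return hits

main_clause = [("Hij", "hij"), ("poetste", "poetsen"), ("gisteren", "gisteren"),
               ("de", "de"), ("plaat", "plaat")]
sub_clause = [("omdat", "omdat"), ("hij", "hij"), ("de", "de"),
              ("plaat", "plaat"), ("wilde", "willen"), ("poetsen", "poetsen")]
print(find_flexible(main_clause))  # verb precedes the noun
print(find_flexible(sub_clause))   # noun precedes the verb, with 'wilde' intervening
```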
Types of MWEs (II) • Verb-particle combinations (English, German, Dutch, Hungarian) • Ik sloeg hem over • I looked the passage up
Types of MWEs (II) • Verb + prepositional complement • I looked after her • Hij heeft altijd van haar gehouden
Types of MWEs (II) • Circumpositions (Dutch, German) • Op iemand af / ?toe / *heen • Auf jemanden *ab / zu • Over de brug heen / *af / *toe
Types of MWEs (II) • Lexical item (from open or closed class) • + closed class lexical item • Finite (actually small) list • Limited variety of predictable syntactic structures • Dealt with by almost any grammar-based NLP system
Types of MWEs (II) • Multiword Names • Examples • Fifth Avenue • Koning Leopold III-laan • Krimpen aan de IJssel • Koninklijke Nederlandse Philips N.V.
Types of MWEs (II) • Multiword Names • Issues • Keys – variation • (Koning) Leopold III-laan • Fifth (Avenue) • ((Calle) Roberto) González • Many different ones, continuously new ones • Very important for correct parsing and translation • Minister Kohl → Minister Cabbage (mistranslation when the name is not recognized)
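Reading the parentheses in the key examples as marking optional parts, the variants of a name can be enumerated mechanically, for instance to build lookup keys. The function below is a sketch under that reading; it assumes balanced parentheses and space-separated parts, and its name is hypothetical.

```python
def expand(pattern):
    """Expand optional parts written in parentheses into all surface variants."""
    start = pattern.find("(")
    if start == -1:
        return [" ".join(pattern.split())]
    # Find the parenthesis that closes the first opening one.
    depth = 0
    for end in range(start, len(pattern)):
        if pattern[end] == "(":
            depth += 1
        elif pattern[end] == ")":
            depth -= 1
            if depth == 0:
                break
    before, inner, after = pattern[:start], pattern[start + 1:end], pattern[end + 1:]
    variants = []
    for rest in expand(after):
        # Variant without the optional part.
        variants.append(" ".join((before + " " + rest).split()))
        # Variants with the optional part, itself expanded recursively.
        for opt in expand(inner):
            variants.append(" ".join((before + " " + opt + " " + rest).split()))
    return variants

print(expand("(Koning) Leopold III-laan"))
print(expand("((Calle) Roberto) González"))
```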
Types of MWEs (II) • Compounds (in English) • Examples • Real estate agency • Nuclear power plant • Blue cheese • Private eye • High school
Types of MWEs (II) • Idioms • No or unpredictable meaning of the components • Fixed (or very limited) lexical item selection • Opaque • Kick the bucket • De plaat poetsen • Casser sa pipe
Types of MWEs (II) • Idioms • Semi-transparent • ‘een bok schieten’ • Bok (male goat) = blunder • Schieten (shoot) = make • ‘dat varkentje wassen’ • Varkentje (little pig) = problem • Wassen (wash) = address, take care of