Integrating Finite-state Morphologies with Deep LFG Grammars

Tracy Holloway King

FST and deep grammars • Finite state tokenizers and morphologies can be integrated into deep processing systems • Integrated tokenizers • eliminate the need for preprocessing • allow the grammar writer more control over the input • Morphologies • eliminate the need to list (multiple) surface forms in the lexicon • eliminate the need for lexical entries for words with predictable subcategorization frames

Talk outline • Basic integrated system • Integrating morphology FSTs • Interaction of tokenization and morphology

Basic Architecture • Input string (shallow markup) ==> Tokenizing FSTs ==> Morphology FSTs ==> LFG grammar and lexicons ==> Constituent-structure (tree) and Functional-structure (AVM)

Example steps through the system • Input string: Boys appeared. • Tokenizing: boys TB appeared TB . TB • Morphology: boy +Noun +Pl appear +Verb +PastBoth +123SP . +Punct • C-structure/F-structure: next slides
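
The first two steps above can be sketched in a few lines of Python. This is only a toy illustration: plain dictionaries and string operations stand in for the tokenizing and morphology FSTs, and the function names (`tokenize`, `analyze`, `pipeline`, `show_tokens`) are hypothetical, not part of XLE.

```python
# Toy stand-ins for the tokenizing and morphology FSTs. The table below only
# covers the slide's example; it is not the real English transducer.
MORPH = {
    "boys": ["boy +Noun +Pl"],
    "appeared": ["appear +Verb +PastBoth +123SP"],
    ".": [". +Punct"],
}


def tokenize(text):
    """Break off trailing punctuation and lower-case the first word."""
    tokens = []
    for i, word in enumerate(text.split()):
        punct = ""
        if len(word) > 1 and word[-1] in ".,!?":
            word, punct = word[:-1], word[-1]
        if i == 0:
            word = word[0].lower() + word[1:]
        tokens.append(word)
        if punct:
            tokens.append(punct)
    return tokens


def show_tokens(tokens):
    """Render tokens with the TB (token boundary) symbol used on the slides."""
    return " ".join(f"{tok} TB" for tok in tokens)


def analyze(token):
    """Look up all analyses (stem plus tags) for one token."""
    return MORPH.get(token, [])


def pipeline(text):
    """Tokenize, then attach the morphological analyses to each token."""
    return [(tok, analyze(tok)) for tok in tokenize(text)]


print(show_tokens(tokenize("Boys appeared.")))
for token, analyses in pipeline("Boys appeared."):
    print(token, "=>", analyses)
```

The real tokenizer is non-deterministic (see the later slides on punctuation and capitalization); this sketch returns only one tokenization.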

C-structure tree

F-structure AVM

The wider system: XLE • Handwritten grammars for various languages • Substantial for English, German, Japanese, Norwegian • Also: Arabic, Chinese, Urdu, Korean, Welsh, Malagasy, Turkish • Robustness mechanisms • Fragment grammar rules • Morphological guessers • Skimming when resource limits approached • Ambiguity management (packing) • Compute all analyses (no “aggressive pruning”) • Propagate packed ambiguities across processing modules • Stochastic disambiguation • MaxEnt models to select from packed (f-)structures • Other processing available: • generation, semantics, transfer/rewriting • Comparisons to other systems/tasks • Parsing WSJ (Riezler et al, ACL 2002) • Comparison to Collins model 3 (Riezler et al, NAACL 2004)

FST Morphologies • Associate surface form with • a lemma (stem/canonical form) • a set of tags • Process is non-deterministic • can have many analyses for one surface form • grammar has to be able to deal with multiple analyses (morphological ambiguity) • Issue: can the grammar control rampant morphological ambiguity (e.g., Arabic vowelless representations)?

Example Morphology Output • turnips <=> turnip +Noun +Pl • Mary <=> Mary +Prop +Giv +Fem +Sg • falls <=> fall +Noun +Pl fall +Verb +Pres +3sg • broken <=> break +Verb +PastPerf +123SP broken +Verb +PastPart } +Adj • New York <=> New York +Prop +Place +USAState +Prefer New York +Prop +Place +City +Prefer [ plus analyses of New and York ]

Morphologies and lexicons • Without a morphology, need to list all surface forms in the lexicon • bad for English • horrible for languages like Finnish and Arabic • With a morphology, one entry for the stem form go V XLE @(V-INTRANS go). for: go, goes, going, gone, went • With additional integration, words with predictable subcategorization frames need no entry
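
A minimal sketch of why one stem entry is enough, assuming a toy morphology table: every surface form of go is mapped to the stem, and the stem is what gets looked up in the lexicon. The tag names for the individual forms are assumptions made for illustration (only the stem and the lexical entry come from the slide), and `lexicon_entries` is a hypothetical helper.

```python
# Assumed analyses: the stem "go" is from the slide, the tags are illustrative.
MORPH = {
    "go": ["go +Verb +Non3sg"],
    "goes": ["go +Verb +Pres +3sg"],
    "going": ["go +Verb +Prog"],
    "gone": ["go +Verb +PastPart"],
    "went": ["go +Verb +Past"],
}

# A single lexicon entry, keyed by the stem, as on the slide.
LEXICON = {"go": "V XLE @(V-INTRANS go)."}


def lexicon_entries(surface):
    """Return the lexicon entries reachable from a surface form via its stems."""
    stems = []
    for analysis in MORPH.get(surface, []):
        stem = analysis.split()[0]
        if stem not in stems:
            stems.append(stem)
    return [LEXICON[stem] for stem in stems if stem in LEXICON]


for form in ["go", "goes", "going", "gone", "went"]:
    print(form, "->", lexicon_entries(form))
```

Without the morphology, each of the five forms would need its own key in `LEXICON`.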

Basic idea • Run surface forms of words through the morphology to produce stems and tags • MorphConfig file specifies which morphologies the grammar uses • Look up stems and tags in the lexicon • Sublexical phrase structure rules build syntactic nodes covering the stems and tags • Standard grammar rules build larger phrases

Lexical entries for tags
boys ==> boy +Noun +Pl
boy N XLE @(NOUN boy).
+Noun N_SFX XLE @(PERS 3) @(EXISTS NTYPE).
+Pl NNUM_SFX XLE @(NUM pl).

Sublexical rules for tags • Build up lexical nodes from stem plus tags • Rules are identical to standard phrase structure rules • Except display can hide the sublexical information • N --> N_BASE N_SFX_BASE NNUM_SFX_BASE. • Tree: N dominating N_BASE (boy), N_SFX_BASE (+Noun), NNUM_SFX_BASE (+Pl)

Resulting structures • F-structure: PRED 'boy' PERS 3 NUM pl NTYPE common • C-structure: N dominating N_BASE (boy), N_SFX_BASE (+Noun), NNUM_SFX_BASE (+Pl)
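
The way the stem and tag entries combine under the sublexical rule can be sketched as follows. This is a simplified model, not XLE's mechanism: templates are replaced by flat attribute-value dictionaries, and it is assumed here that the NOUN template supplies both PRED and NTYPE common. The function name `build_lexical_node` is hypothetical.

```python
# Lexicon entries for the stem and the tags: (category, contributed features).
# Assumption: @(NOUN boy) is modelled as contributing PRED and NTYPE.
LEXICON = {
    "boy": ("N", {"PRED": "boy", "NTYPE": "common"}),
    "+Noun": ("N_SFX", {"PERS": "3"}),
    "+Pl": ("NNUM_SFX", {"NUM": "pl"}),
}

# Symbols whose entry carries an existential constraint, as in @(EXISTS NTYPE).
EXISTS = {"+Noun": ["NTYPE"]}

# The sublexical rule from the slide; daughters carry the _BASE suffix.
RULES = {"N": ["N_BASE", "N_SFX_BASE", "NNUM_SFX_BASE"]}


def build_lexical_node(mother, analysis):
    """Return the f-structure for a stem+tags string, or None if it fails."""
    categories, fstruct, required = [], {}, []
    for symbol in analysis.split():
        if symbol not in LEXICON:
            return None
        category, features = LEXICON[symbol]
        categories.append(category + "_BASE")
        required.extend(EXISTS.get(symbol, []))
        for attr, value in features.items():
            if fstruct.get(attr, value) != value:
                return None  # feature clash
            fstruct[attr] = value
    if categories != RULES[mother]:
        return None  # daughters do not match the sublexical rule
    if any(attr not in fstruct for attr in required):
        return None  # existential constraint not satisfied
    return fstruct


print(build_lexical_node("N", "boy +Noun +Pl"))
```

An analysis that lacks a daughter the rule expects, such as `boy +Pl`, simply fails to build an N node.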

Lexical entries • Stems with unpredictable subcategorization frames need entries • verbs • adjectives with obliques (proud of her) • nouns with that complements (the idea that he laughed) • Most lexical items have predictable frames determined by part of speech • common and proper nouns • adjectives • adverbs • numbers

-unknown lexical entry • Match any stem to the entry • Provide desired functional information • %stem will pass in the appropriate form (i.e., the lemma/stem, not the inflected surface form) • Constrain application via morphological tag possibilities • -unknown N XLE @(NOUN %stem); A XLE @(ADJ %stem); ADV XLE @(ADVERB %stem).

-unknown example • The box boxes. • Lexicon entries: box V XLE @(V-INTRANS %stem). -unknown N XLE @(NOUN %stem); ADV…; A... • Morphology output: box ==> box +Noun +Sg | +Verb +Non3Sg boxes ==> box +Noun +Pl | +Verb +3Sg • Build up four effective lexical entries • 1 noun, 1 verb, 1 adverb, 1 adjective • adverb and adjective fail sublexically • noun and verb relevant for the sentence
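
The box example can be traced with a small sketch. It follows the slide in combining the explicit entry for box with the three -unknown entries, then keeps only the entries whose category fits the first morphological tag. The tag-to-category table and the function names are assumptions for illustration; in the grammar itself the filtering falls out of the sublexical rules.

```python
# Explicit lexicon entry and the -unknown entry, as (category, template) pairs.
LEXICON = {"box": [("V", "V-INTRANS")]}
UNKNOWN = [("N", "NOUN"), ("A", "ADJ"), ("ADV", "ADVERB")]

# Assumed mapping from the part-of-speech tag to the lexical category it builds.
TAG_CATEGORY = {"+Noun": "N", "+Verb": "V", "+Adj": "A", "+Adv": "ADV"}

# Morphology output from the slide.
MORPH = {
    "box": [("box", ["+Noun", "+Sg"]), ("box", ["+Verb", "+Non3Sg"])],
    "boxes": [("box", ["+Noun", "+Pl"]), ("box", ["+Verb", "+3Sg"])],
}


def effective_entries(stem):
    """Explicit entries plus -unknown entries, with %stem filled in."""
    entries = LEXICON.get(stem, []) + UNKNOWN
    return [(cat, f"@({template} {stem})") for cat, template in entries]


def surviving_entries(surface):
    """Keep entries whose category matches the analysis's part-of-speech tag."""
    survivors = []
    for stem, tags in MORPH.get(surface, []):
        wanted = TAG_CATEGORY[tags[0]]
        for cat, call in effective_entries(stem):
            if cat == wanted:
                survivors.append((cat, call, tags))
    return survivors


print(effective_entries("box"))
print(surviving_entries("boxes"))
```

For box there are four effective entries; the adjective and adverb entries never survive because the morphology produces no matching tags.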

Inflectional morphology summary • Integrating FST morphologies significantly reduces lexicon development effort • Verbs and other unpredictable items are listed only under their stem form • Predictable items such as nouns are processed via -unknown and never listed in the lexicon

Guessers • Even large industrial FST morphologies are not complete • Novel words usually have regular morphology • Build an FST guesser based on this • Words with capital letters are proper nouns (Saakashvili) • Words ending in -ed are past tense verbs or deverbal adjectives • Guessed words will go through -unknown • no difference from standard morphological output • can add +Guessed tag for further control

Guessers: controlling application • Apply guesser in the grammar only if there is no form in the regular morphology • don't guess unless you have to • Control this with the MorphConfig • use multiple FST morphologies • stop looking once an analysis is found

Sample MorphConfig
STANDARD ENGLISH MORPHOLOGY (1.0)
TOKENIZE: english.tok.parse.fst
ANALYZE USEFIRST:
  english.infl.fst      (try regular morphology first)
  english.guesser.fst   (if fail, guess)
MULTIWORD: english.standard.mwe.fst

Multiple morphology FSTs • In addition to the regular morphology and guesser, can have other morphologies • morphology for technical terms, part numbers, etc. • These can be applied in sequence or in parallel (cascaded or unioned)
ANALYZE USEALL:
  english.infl.fst           (try regular morphology)
  english.eureka.parts.fst   (and also part names)
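
The difference between USEFIRST and USEALL can be sketched with two small functions. Each morphology is modelled as a Python function from a token to a list of analyses; the toy `regular` and `guesser` functions, and the exact tags the guesser emits, are assumptions for illustration rather than the behaviour of the actual FSTs.

```python
REGULAR = {"walked": ["walk +Verb +PastBoth +123SP"]}


def regular(token):
    """Toy regular morphology: a plain table lookup."""
    return REGULAR.get(token, [])


def guesser(token):
    """Toy guesser using the two heuristics mentioned on the Guessers slide."""
    guesses = []
    if token[:1].isupper():
        guesses.append(f"{token} +Prop +Guessed")
    if token.endswith("ed") and len(token) > 2:
        guesses.append(f"{token[:-2]} +Verb +PastBoth +123SP +Guessed")
        guesses.append(f"{token} +Adj +Guessed")
    return guesses


def analyze_usefirst(token, morphologies):
    """Cascade: return the analyses of the first morphology that has any."""
    for morphology in morphologies:
        analyses = morphology(token)
        if analyses:
            return analyses
    return []


def analyze_useall(token, morphologies):
    """Union: collect the analyses of every morphology, without duplicates."""
    result = []
    for morphology in morphologies:
        for analysis in morphology(token):
            if analysis not in result:
                result.append(analysis)
    return result


print(analyze_usefirst("walked", [regular, guesser]))
print(analyze_usefirst("Saakashvili", [regular, guesser]))
print(analyze_useall("walked", [regular, guesser]))
```

With the cascade, the guesser is consulted only when the regular morphology returns nothing, which is the "don't guess unless you have to" policy; the union is the better fit for complementary resources such as a part-name morphology.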

Morphology vs. surface form • System always allows surface form through • Lexicon can match this form for • multiword expressions • override/supplement morphological analysis • Example: or as adverb (Or you could leave now.) or ADV * @(ADVERB or); CONJ XLE @(CONJ or).

Tokenizers • Tokenizers break strings (sentences) into tokens (words) • Need to (for English): • break off punctuation Mary laughs. ==> Mary TB laughs TB . TB • lower case certain letters The dog ==> the TB dog

Tokenization and morphology • Linguistic analysis may govern tokenization • Are English contracted auxiliaries: • affixes: John'll ==> no tokenization John +Noun +Proper +Fut • clitics: John'll ==> John TB 'll TB John +Noun +Proper will +Fut • Arabic determiners and conjunctions • both written with adjacent words determiner as an affix giving +Def (Albint the-girl) conjunction tokenized separately (wakutub and-books)

Non-deterministic tokenizers: Punctuation • Cannot just break off punctuation and insert a TB • Comma haplology Find the dog, a poodle. ==> find TB the TB dog TB , TB a TB poodle TB , TB . TB • Period haplology Go to Palm Dr. ==> go TB to TB Palm TB Dr. TB . TB • Resulting tokenizer is non-deterministic • System must be able to handle multiple inputs

Capitalization • Initial capitals are optionally lowercased The boy left. ==> the boy left. Mary left. ==> Mary left. • Example for both types of non-determinism Bush saw them. ==> { Bush | bush } TB saw TB them TB [, TB]* . TB • Tokenization rules vary from language to language and by choice of linguistic analysis
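
A minimal sketch of a non-deterministic tokenizer, assuming only two sources of ambiguity: optional lowercasing of the sentence-initial capital and the two readings of a final period (plain sentence end, or an abbreviation whose own period was merged with it). Optional comma insertion is left out, and the function name `tokenizations` is hypothetical.

```python
from itertools import product


def tokenizations(sentence):
    """Enumerate alternative tokenizations, rendered with TB boundaries."""
    words = sentence.split()
    per_word = []
    for i, word in enumerate(words):
        options = [[word]]
        # Final period: either it only ends the sentence, or it also belongs
        # to an abbreviation (period haplology), so both readings are kept.
        if i == len(words) - 1 and word.endswith(".") and len(word) > 1:
            options = [[word[:-1], "."], [word, "."]]
        # Sentence-initial capital: keep it and also offer a lowercased form.
        if i == 0 and word[0].isupper():
            options += [[opt[0][0].lower() + opt[0][1:]] + opt[1:] for opt in options]
        per_word.append(options)
    results = []
    for combo in product(*per_word):
        tokens = [tok for option in combo for tok in option]
        results.append(" ".join(f"{tok} TB" for tok in tokens))
    return results


for alternative in tokenizations("Bush saw them."):
    print(alternative)
```

Implausible alternatives such as a token `them.` are expected to be discarded later, since the morphology and lexicon have no analysis for them.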

Conclusions • System architecture integrates FST techniques with deep LFG parsing • tokenizers • morphologies and guessers • Allows generalizations to be factored out • properties of words • properties of strings • Allows use of existing large-scale lexical resources • avoids redundant specification • System is actively in use in ParGram grammars

Shallow Markup • Preprocessing with shallow markup can reduce ambiguity and speed processing • Tokenizer must be able to process the markup • Part of speech tagging: • I/PRP_ saw/VBD_ her/PRP_ duck/VB_. • Named entities • <person>General Mills</person> bought it.

POS tagging • POS tags are not relevant for tokenizing, but the tokenizer must skip them • She walks/VBZ_. should be treated like She walks. • The morphology must only insert compatible tags • A mapping table states allowable combinations /VBZ_ +Verb +3sg /NN_ +Noun +Sg • These are encoded into a filtering FST • Only compatible tags are passed to the grammar

POS tagging example • I saw her duck duck +Noun +Sg duck +Verb +Pres +Non3sg • both possibilities passed to the grammar • I saw her duck/VB_. • only +Verb +Pres +Non3sg possibility is compatible with /VB_ POS tag • only this possibility is passed to the grammar
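
The filtering step can be sketched as a table lookup followed by a subset test. The /VBZ_ and /NN_ rows come from the mapping table on the previous slide; the /VB_ row is an assumption inferred from the duck example, and the regular expression and function names are hypothetical. A real system compiles the table into a filtering FST instead.

```python
import re

# Allowed combinations: POS tag -> morphological tags an analysis must contain.
POS_TABLE = {
    "VBZ": {"+Verb", "+3sg"},
    "NN": {"+Noun", "+Sg"},
    "VB": {"+Verb", "+Non3sg"},  # assumed row, inferred from the duck example
}

MORPH = {"duck": ["duck +Noun +Sg", "duck +Verb +Pres +Non3sg"]}

# Markup has the shape word/TAG_ as in duck/VB_.
MARKUP = re.compile(r"^(?P<word>.+)/(?P<pos>[A-Z$]+)_$")


def split_markup(token):
    """Separate a token from its optional POS markup."""
    match = MARKUP.match(token)
    if match:
        return match.group("word"), match.group("pos")
    return token, None


def filter_analyses(token):
    """Return only the analyses compatible with the token's POS tag, if any."""
    word, pos = split_markup(token)
    analyses = MORPH.get(word, [])
    required = POS_TABLE.get(pos)
    if required is None:
        return analyses  # no markup, or a tag missing from the table
    return [a for a in analyses if required <= set(a.split()[1:])]


print(filter_analyses("duck"))
print(filter_analyses("duck/VB_"))
```

Passing all analyses through when the tag is missing from the table is a design choice of this sketch, not something the slides specify.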

Named Entities • Named entities appear in text as XML markup <person>General Mills</person> bought it. • Tokenizer • creates a special tag for these • puts literal spaces instead of TBs • allows version without markup for fallback General Mills TB +NamedEntity TB General TB +Title TB Mills +Proper TB • Lexical entry added for +NamedEntity • Sublexical N and NAME rules allow the tag

Sample Named Entity output
