Explore deep and shallow grammars, POS tagging, Chunking, Treebank grammars, Deep analysis benefits, and overcoming challenges in computational linguistics. Learn about practical applications, challenges, and solutions.
Issues in Computational Linguistics: Grammar Engineering • Dick Crouch and Tracy King
Outline • What is a deep grammar? • How to engineer them: • robustness • integrating shallow resources • ambiguity • writing efficient grammars • real world data
What is a shallow grammar? • often trained automatically from marked up corpora • part of speech tagging • chunking • trees
POS tagging and Chunking • Part of speech tagging: I/PRP saw/VBD her/PRP duck/VB ./PUNCT versus I/PRP saw/VBD her/PRP$ duck/NN ./PUNCT • Chunking: • general chunking: [I begin] [with an intuition]: [when I read] [a sentence], [I read it] [a chunk] [at a time]. (Abney) • NP chunking: [NP President Clinton] visited [NP the Hermitage] in [NP Leningrad]
Treebank grammars • Phrase structure tree (c-structure) • Annotations for heads, grammatical functions Collins parser output
Deep grammars • Provide detailed syntactic/semantic analyses • LFG (ParGram), HPSG (LinGO, Matrix) • Grammatical functions, tense, number, etc. Mary wants to leave. subj(want~1,Mary~3) comp(want~1,leave~2) subj(leave~2,Mary~3) tense(leave~2,present) • Usually manually constructed • linguistically motivated rules
Why would you want one? • Meaning sensitive applications • overkill for many NLP applications • Applications which use shallow methods for English may not be able to for "free" word order languages • can read many functions off of trees in English • SUBJ: NP sister to VP, e.g. [S [NP Mary] [VP left]] • OBJ: first NP sister to V, e.g. [S [NP Mary] [VP saw [NP John]]] • need other information in German, Japanese, etc.
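The two English rules above (SUBJ is the NP sister of VP, OBJ is the first NP inside VP) can be sketched in a few lines of Python. This is only an illustration of the idea, not part of any grammar in the talk; the function names `parse_brackets` and `grammatical_functions` are made up here, and the bracket format is assumed to be exactly the one used in the slide examples.

```python
def parse_brackets(text):
    """Parse '[S [NP Mary] [VP left]]' into (label, children); leaves are strings."""
    tokens = text.replace("[", " [ ").replace("]", " ] ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip "["
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != "]":
            if tokens[pos] == "[":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip "]"
        return (label, children)

    return node()


def words(tree):
    """Return the words covered by a tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [w for child in tree[1] for w in words(child)]


def first_child(children, label):
    """Return the first phrasal child with the given label, or None."""
    for child in children:
        if not isinstance(child, str) and child[0] == label:
            return child
    return None


def grammatical_functions(tree):
    """Read SUBJ and OBJ off an English S node using tree position only."""
    label, children = tree
    functions = {}
    vp = first_child(children, "VP")
    if label != "S" or vp is None:
        return functions
    subj = first_child(children, "NP")      # NP sister to VP
    if subj is not None:
        functions["SUBJ"] = " ".join(words(subj))
    obj = first_child(vp[1], "NP")          # first NP sister to V
    if obj is not None:
        functions["OBJ"] = " ".join(words(obj))
    return functions
```

Position-based rules like these depend on fixed word order, which is why they do not carry over to languages such as German or Japanese.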
Deep analysis matters… if you care about the answer • Example: A delegation led by Vice President Philips, head of the chemical division, flew to Chicago a week after the incident. • Question: Who flew to Chicago? • Candidate answers: • division (closest noun), head (next closest), V.P. Philips (next): shallow but wrong • delegation (furthest away, but subject of flew): deep and right
Applications of Language Engineering • [Figure: applications plotted by Functionality (Low to High) against Domain Coverage (Narrow to Broad), with Shallow, Deep, and Synthesis regions. Labels: Post-Search Sifting, Google, Autonomous Knowledge Filtering, Alta Vista, Technology Frontier, AskJeeves, Knowledge Fusion, Useful Summary, Good Translation, Natural Dialogue, Microsoft Paperclip, Restricted Dialogue, Document Base Management, Manually-tagged Keyword Search]
Traditional Problems • Time consuming and expensive to write • Not robust • want output for any input • Ambiguous • Slow • Other gating items for applications that need deep grammars
Why deep analysis is difficult • Languages are hard to describe • Meaning depends on complex properties of words and sequences • Different languages rely on different properties • Errors and disfluencies • Languages are hard to compute • Expensive to recognize complex patterns • Sentences are ambiguous • Ambiguities multiply: explosion in time and space
How to overcome this • Engineer the deep grammars • theoretical vs. practical • what is good enough • Integrate shallow techniques into deep grammars • Experience based on broad-coverage LFG grammars (ParGram project)
Robustness: Sources of Brittleness • missing vocabulary • you can't list all the proper names in the world • missing constructions • there are many constructions theoretical linguistics rarely considers (e.g. dates, company names) • easy to miss even core constructions • ungrammatical input • real world text is not always perfect • sometimes it is really horrendous
Real world Input • Other weak blue-chip issues included Chevron, which went down 2 to 64 7/8 in Big Board composite trading of 1.3 million shares; Goodyear Tire & Rubber, off 1 1/2 to 46 3/4, and American Express, down 3/4 to 37 1/4. (WSJ, section 13) • ``The croaker's done gone from the hook – (WSJ, section 13) • (SOLUTION 27000 20) Without tag P-248 the W7F3 fuse is located in the rear of the machine by the charge power supply (PL3 C14 item 15. (Eureka copier repair tip)
Missing vocabulary • Build vocabulary based on the input of shallow methods • fast • extensive • accurate • Finite-state morphologies • Part of Speech Taggers
Finite State Morphologies • Finite-state morphologies falls -> fall +Noun +Pl fall +Verb +Pres +3sg Mary -> Mary +Prop +Giv +Fem +Sg vienne -> venir +SubjP +SG {+P1|+P3} +Verb • Build lexical entry on-the-fly from the morphological information • have canonicalized stem form • have significant grammatical information • do not have subcategorization
Building lexical entries • Lexical entries -unknown N @(COMMON-NOUN %stem). +Noun N-SFX @(PERS 3). +Pl N-NUM @(NUM pl). • Rule NOUN -> N N-SFX N-NUM. • Templates • COMMON-NOUN :: (^ PRED)='%stem' (^ NTYPE)=common • PERS(3) :: (^ PERS)=3 • NUM(pl) :: (^ NUM)=pl
Building lexical entries • C-structure for falls: [Noun [N fall] [N-SFX +Noun] [N-NUM +Pl]] • F-structure for falls: [ PRED 'fall', NTYPE common, PERS 3, NUM pl ]
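As a rough illustration of how the noun reading of falls ends up with that f-structure, here is a small Python sketch. It mimics the templates with plain dictionaries; `TAG_TEMPLATES`, `common_noun`, and `build_fstructure` are hypothetical names, and only the two tags from the slide are covered.

```python
# Attribute-value contributions of each morphological tag, mirroring the
# PERS(3) and NUM(pl) templates. Only the tags shown on the slide are listed.
TAG_TEMPLATES = {
    "+Noun": {"PERS": 3},
    "+Pl": {"NUM": "pl"},
}


def common_noun(stem):
    """Counterpart of the COMMON-NOUN template applied to an unknown stem."""
    return {"PRED": stem, "NTYPE": "common"}


def build_fstructure(analysis):
    """Build a flat f-structure from a morphology output such as 'fall +Noun +Pl'."""
    stem, *tags = analysis.split()
    fstructure = common_noun(stem)
    for tag in tags:
        for attribute, value in TAG_TEMPLATES[tag].items():
            # Unification on a flat structure: conflicting values are an error.
            if fstructure.get(attribute, value) != value:
                raise ValueError(f"clash on {attribute}")
            fstructure[attribute] = value
    return fstructure
```

With this sketch, `build_fstructure("fall +Noun +Pl")` yields the four attributes listed for falls. Subcategorization is still missing, as noted for morphology-derived entries.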
Guessing words • Use FST guesser if the morphology doesn't know the word • Capitalized words can be proper nouns • Saakashvili -> Saakashvili +Noun +Proper +Guessed • Words ending in -ed can be past tense verbs or adjectives • fumped -> fump +Verb +Past +Guessed or fumped +Adj +Deverbal +Guessed • Languages with more morphology allow for better guessers
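The two guessing heuristics above can be approximated with ordinary string tests. The real guesser is a finite-state transducer; the sketch below is a plain Python stand-in, and the function name `guess` and the minimum-length check are assumptions made for the example.

```python
def guess(word):
    """Return guessed analyses in the tag format shown on the slide."""
    analyses = []
    if word[:1].isupper():
        # Capitalized words can be proper nouns.
        analyses.append(f"{word} +Noun +Proper +Guessed")
    if word.endswith("ed") and len(word) > 3:
        # -ed words can be past tense verbs or deverbal adjectives.
        analyses.append(f"{word[:-2]} +Verb +Past +Guessed")
        analyses.append(f"{word} +Adj +Deverbal +Guessed")
    return analyses
```

The guesser deliberately returns every plausible analysis; the grammar decides which one fits the sentence.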
Using the lexicons • Rank the lexical lookup • overt entry in lexicon • entry built from information from morphology • entry built from information from guesser • Use the most reliable information • Fall back only as necessary
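The ranking amounts to trying the sources in order of reliability and stopping at the first one that answers. A minimal sketch, assuming the lexicon and morphology are dictionaries and the guesser is a function (all three are toy stand-ins, and `ranked_lookup` is a hypothetical name):

```python
def ranked_lookup(word, lexicon, morphology, guesser):
    """Return (source, entries) from the most reliable source that knows the word."""
    sources = (
        ("lexicon", lexicon.get),        # overt entry in lexicon
        ("morphology", morphology.get),  # entry built from morphology
        ("guesser", guesser),            # entry built from guesser
    )
    for name, lookup in sources:
        entries = lookup(word)
        if entries:
            return name, entries
    return "none", []


# Toy resources for illustration only.
toy_lexicon = {"want": ["want V-SUBJ-XCOMP"]}
toy_morphology = {"falls": ["fall +Noun +Pl", "fall +Verb +Pres +3sg"]}


def toy_guesser(word):
    return [f"{word} +Noun +Proper +Guessed"] if word[:1].isupper() else []
```

Because lookup stops at the first hit, guessed entries never compete with entries from the hand-written lexicon.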
Missing constructions • Even large hand-written grammars are not complete • new constructions, especially with new corpora • unusual constructions • Generally longer sentences fail • one error can destroy the parse • Build up as much as you can; stitch together the pieces
Grammar engineering approach • First try to get a complete parse • If fail, build up chunks that get complete parses • Have a fall back for things without even chunk parses • Link these chunks and fall backs together in a single structure
Fragment Chunks: Sample output • the the dog appears. • Split into: • "token" the • sentence "the dog appears" • ignore the period
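One simple way to get this kind of output is a greedy left-to-right pass: take the longest span that parses, and fall back to a single token when nothing does. This is a sketch of the idea only, not how the grammar described in the talk builds fragments; `fragment_parse` and the `can_parse` callback are hypothetical, and the period is assumed to have been dropped already.

```python
def fragment_parse(tokens, can_parse):
    """Cover `tokens` with parsed chunks ('S') and single-token fall backs ('TOKEN')."""
    fragments = []
    start = 0
    while start < len(tokens):
        # Try the longest span first so that chunks are as large as possible.
        for end in range(len(tokens), start, -1):
            if can_parse(tokens[start:end]):
                fragments.append(("S", tokens[start:end]))
                start = end
                break
        else:
            # No span starting here parses: emit a bare token and move on.
            fragments.append(("TOKEN", [tokens[start]]))
            start += 1
    return fragments


# Toy "grammar": a set of token sequences that receive a complete parse.
PARSEABLE = {("the", "dog", "appears")}


def toy_can_parse(span):
    return tuple(span) in PARSEABLE
```

Calling `fragment_parse("the the dog appears".split(), toy_can_parse)` gives a token fragment for the first "the" followed by one sentence chunk, matching the split shown above.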
Ungrammatical input • Real world text contains ungrammatical input • typos • run-ons • cut and paste errors • Deep grammars tend to only cover grammatical input • Two strategies • robustness techniques: guesser/fragments • dispreferred rules for ungrammatical structures
Rules for ungrammatical structures • Common errors can be coded in the rules • want to know that error occurred (e.g., feature in f-structure) • Disprefer parses of ungrammatical structure • tools for grammar writer to rank rules • two+ pass system • standard rules • rules for known ungrammatical constructions • default fall back rules
Sample ungrammatical structures • Mismatched subject-verb agreement Verb3Sg = { SUBJ PERS = 3 SUBJ NUM = sg |BadVAgr} • Missing copula VPcop ==> { Vcop: ^=! |e: (^ PRED)='NullBe<(^ SUBJ)(^XCOMP)>' MissingCopularVerb} { NP: (^ XCOMP)=! |AP: (^ XCOMP)=! | …}
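The agreement rule says: either the subject is third person singular, or the analysis goes through anyway and carries a BadVAgr mark. A small Python sketch of that logic, plus a ranking step that keeps the least-marked analyses, is given below. The function names and the dictionary layout of an analysis are assumptions for illustration, not the grammar's actual machinery.

```python
def verb3sg(subj):
    """Return dispreference marks for a 3sg verb combined with subject `subj`."""
    agrees = subj.get("PERS") == 3 and subj.get("NUM") == "sg"
    return [] if agrees else ["BadVAgr"]


def best_analyses(analyses):
    """Keep only the analyses carrying the fewest dispreference marks."""
    fewest = min(len(a["marks"]) for a in analyses)
    return [a for a in analyses if len(a["marks"]) == fewest]
```

The mark stays in the analysis, so an application can tell that an error was detected even when the sentence still receives a parse.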
Robustness summary • Integrate shallow methods • for lexical items • morphologies • guessers • Fall back techniques • for missing constructions • fragment grammar • dispreferred rules
Ambiguity • Deep grammars are massively ambiguous • Example: 700 sentences from section 23 of WSJ • average # of words: 19.6 • average # of optimal parses: 684 • for 1-10 word sentences: 3.8 • for 11-20 word sentences: 25.2 • for 50-60 word sentences: 12,888
Managing Ambiguity • Use packing to parse and manipulate the ambiguities efficiently (more tomorrow) • Trim early with shallow markup • fewer parses to choose from • faster parse time • Choose most probable parse for applications that need a single input
Shallow markup • Part of speech marking as filter I saw her duck/VB. • accuracy of tagger (v. good for English) • can use partial tagging (verbs and nouns) • Named entities • <company>Goldman, Sachs & Co.</company> bought IBM. • good for proper names and times • hard to parse internal structure • Fall back technique if fail • slows parsing • accuracy vs. speed
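Using part of speech marks as a filter can be sketched as follows: a tagged token keeps only the lexical readings that match its tag, an untagged token keeps all of them (partial tagging), and a tag that matches nothing triggers the fall back to the full set. The `word/TAG` format follows the slide; `filter_readings` and the toy lexicon are hypothetical.

```python
def filter_readings(token, lexicon):
    """Return (word, readings) after applying an optional '/TAG' filter."""
    word, _, tag = token.partition("/")
    readings = lexicon.get(word, [])
    if not tag:
        return word, readings            # untagged token: no filtering
    kept = [r for r in readings if r == tag]
    # Fall back to all readings if the tagger's choice is not in the lexicon.
    return word, kept or readings


# Toy lexicon mapping words to their possible parts of speech.
toy_pos_lexicon = {"duck": ["NN", "VB"], "her": ["PRP", "PRP$"]}
```

The fall back keeps a tagger error from blocking the parse, at the cost of more readings, which is the accuracy vs. speed trade-off noted above.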
Example shallow markup: Named entities • Allow tokenizer to accept marked up input: parse {<person>Mr. Thejskt Thejs</person> arrived.} • tokenized string (two alternatives for the name, then the rest of the sentence): Mr. Thejskt Thejs TB +NEperson or Mr(TB). TB Thejskt TB Thejs, followed by TB arrived TB . TB • Add lexical entries and rules for NE tags
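A tokenizer that accepts this kind of markup can be approximated with a regular expression: each marked span becomes a single token carrying a tag, and everything else is tokenized as usual. The sketch below is an assumption-laden stand-in for the finite-state tokenizer; the `+NE<type>` tag spelling follows the slide, while `tokenize` and its output format are invented for the example.

```python
import re

# Matches <type>...</type> spans such as <person>Mr. Thejskt Thejs</person>.
NE_SPAN = re.compile(r"<(\w+)>(.*?)</\1>")
# Plain tokens: runs of word characters, or single punctuation marks.
PLAIN_TOKEN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Tokenize text, keeping each marked-up named entity as one tagged token."""
    tokens = []
    pos = 0
    for match in NE_SPAN.finditer(text):
        tokens += PLAIN_TOKEN.findall(text[pos:match.start()])
        tokens.append((match.group(2), "+NE" + match.group(1)))
        pos = match.end()
    tokens += PLAIN_TOKEN.findall(text[pos:])
    return tokens
```

Keeping the entity as one token means the grammar never has to parse its internal structure, which is the hard part for names like Goldman, Sachs & Co.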
Results for shallow markup Kaplan and King 2003
Choosing the most probable parse • Applications may want one input • or at least just a handful • Use stochastic methods to choose • efficient (XLE English grammar: 5% of parse time) • Need training data • partially labelled data ok [NP-SBJ They] see [NP-OBJ the girl with the telescope]
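The slide does not say which stochastic model is used, so the sketch below assumes a simple weighted-feature score of the kind used in log-linear models: each parse is described by feature counts, and the parse with the highest weighted sum wins. The feature names and weights are invented for the telescope example and carry no empirical meaning.

```python
def score(features, weights):
    """Weighted sum of feature counts; unknown features contribute nothing."""
    return sum(weights.get(name, 0.0) * count for name, count in features.items())


def most_probable(parses, weights):
    """Return the parse with the highest score under `weights`."""
    return max(parses, key=lambda parse: score(parse["features"], weights))


# Two readings of "They see the girl with the telescope" (hypothetical features).
toy_parses = [
    {"reading": "PP attached to NP", "features": {"pp_attach_np": 1}},
    {"reading": "PP attached to VP", "features": {"pp_attach_vp": 1}},
]
# Hypothetical weights, as if learned from data labelled like the bracketing above.
toy_weights = {"pp_attach_np": 0.7, "pp_attach_vp": 0.2}
```

Partially labelled data is enough for training because a bracketing such as [NP-OBJ the girl with the telescope] already rules out the parses that attach the PP elsewhere.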
Run-time performance • Many deep grammars are slow • Techniques depend on the system • LFG: exploit the context-free backbone ambiguity packing techniques • Speed vs. accuracy trade off • remove/disprefer peripheral rules • remove fall backs for shallow markup
Development expense • Grammar porting • Starter grammar • Induced grammar bootstrapping • How cheap are shallow grammars? • training data can be expensive to produce
Grammar porting • Use an existing grammar as the base for a new language • Languages must be typologically similar • Japanese-Korean • Balkan • Lexical porting via bi-lingual dictionaries • Main work in testing and evaluation
Starter grammar • Provide basic rules and templates • including for robustness techniques • Grammar writer: • chooses among them • refines them • Grammar Matrix for HPSG
Grammar induction • Induce a core grammar from a treebank • compile rule generalizations • threshold rare rules • hand augment with features and fallback techniques • Requires • induction program • existing resources (treebank)
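The first two induction steps (compile rules from the treebank, threshold the rare ones) can be sketched with a counter over phrase structure rules. Trees are assumed to be nested `(label, children)` tuples with words as plain strings; `extract_rules`, `induce_grammar`, and the `min_count` threshold are hypothetical names and values.

```python
from collections import Counter


def extract_rules(tree, counts):
    """Count one phrase structure rule per node that has phrasal children."""
    label, children = tree
    subtrees = [child for child in children if not isinstance(child, str)]
    if subtrees:
        counts[(label, tuple(child[0] for child in subtrees))] += 1
        for child in subtrees:
            extract_rules(child, counts)


def induce_grammar(treebank, min_count=2):
    """Return rules seen at least `min_count` times, with their frequencies."""
    counts = Counter()
    for tree in treebank:
        extract_rules(tree, counts)
    return {rule: n for rule, n in counts.items() if n >= min_count}


# Toy treebank for illustration only.
toy_treebank = [
    ("S", [("NP", ["Mary"]), ("VP", [("V", ["left"])])]),
    ("S", [("NP", ["John"]), ("VP", [("V", ["slept"])])]),
    ("S", [("NP", ["Mary"]), ("VP", [("V", ["saw"]), ("NP", ["John"])])]),
]
```

On the toy treebank the transitive rule VP -> V NP occurs once and is dropped by the threshold, which shows why the induced core still needs hand augmentation.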
Conclusions • Grammar engineering makes deep grammars feasible • robustness techniques • integration of shallow methods • Many current applications can use shallow grammars • Fast, accurate, broad-coverage deep grammars enable new applications