560 likes | 633 Views
CPSC 503 Computational Linguistics. Lecture 7 Giuseppe Carenini. Today 29/9. Finish POS tagging Start Syntax / Parsing (Chp 12!). Tagger. PoS Tagging. Input text. Brainpower, not physical plant, is now a firm's chief asset. ………….
E N D
CPSC 503Computational Linguistics Lecture 7 Giuseppe Carenini CPSC503 Winter 2008
Today 29/9 • Finish POS tagging • Start Syntax / Parsing (Chp 12!) CPSC503 Winter 2008
Tagger PoS Tagging Input text • Brainpower, not physical plant, is now a firm's chief asset. ………… • Brainpower_NN ,_, not_RB physical_JJ plant_NN ,_, is_VBZ now_RB a_DT firm_NN 's_POS chief_JJ asset_NN ._. ………. Dictionary wordi -> set of tags from Tagset Output CPSC503 Winter 2008
Tagger Types • Rule-based ‘95 • Stochastic • HMM tagger ~ >= ’92 • Transformation-based tagger (Brill) ~ >= ’95 • Maximum Entropy Models ~ >= ’97 CPSC503 Winter 2008
Sample Constraint Step 1: sample I/O Example: Adverbial “that” ruleGiven input: “that”If (+1 A/ADV/QUANT) (+2 SENT-LIM) (NOT -1 SVOC/A)Then eliminate non-ADV tagsElse eliminate ADV “Pavlov had show that salivation….” PavlovN SG PROPER hadHAVE V PAST SVO HAVE PCP2 SVO shownSHOW PCP2 SVOO …… thatADV PRON DEM SG CS …….. ……. Rule-Based (ENGTWOL ’95-99) • A lexicon transducer returns for each word all possible morphological parses • A set of ~3,000 constraints is applied to rule out inappropriate PoS CPSC503 Winter 2008
HMM Stochastic Tagging • Tags corresponds to an HMMstates • Words correspond to the HMM alphabet symbols Tagging: given a sequence of words (observations), find the most likely sequence of tags (states) But this is…..! We need: State transition and symbol emission probabilities 1) From hand-tagged corpus 2) No tagged corpus: parameter estimation (Baum-Welch) CPSC503 Winter 2008
Evaluating Taggers • Accuracy: percent correct (most current taggers 96-7%) *test on unseen data!* • Human Celing: agreement rate of humans on classification (96-7%) • Unigram baseline: assign each token to the class it occurred in most frequently in the training set (race -> NN). (91%) • What is causing the errors? Build a confusion matrix… CPSC503 Winter 2008
Confusion matrix • Look at a confusion matrix • Precision ? • Recall ? Speech and Language Processing - Jurafsky and Martin
Error Analysis (textbook) • Look at a confusion matrix • See what errors are causing problems • Noun (NN) vs ProperNoun (NNP) vs Adj (JJ) • Preterite (VBD) vs Participle (VBN) vs Adjective (JJ) Speech and Language Processing - Jurafsky and Martin
Today 29/9 • Finish POS tagging • English Syntax • Context-Free Grammar for English • Rules • Trees • Recursion • Problems • Start Parsing CPSC503 Winter 2008
Knowledge-Formalisms Map(next three lectures) State Machines (and prob. versions) (Finite State Automata,Finite State Transducers, Markov Models) Morphology Logical formalisms (First-Order Logics) Syntax Rule systems (and prob. versions) (e.g., (Prob.) Context-Free Grammars) Semantics Pragmatics Discourse and Dialogue AI planners CPSC503 Winter 2008
Syntax Def. The study of how sentences are formed by grouping and ordering words Example: Ming and Sue prefer morning flights * Ming Sue flights morning and prefer Groups behave as single unit wrt Substitution, Movement, Coordination CPSC503 Winter 2008
Syntactic Notions so far... • N-grams: prob. distr. for next word can be effectively approximated knowing previous n words • POS categories are based on: • distributional properties (what other words can occur nearby) • morphological properties (affixes they take) CPSC503 Winter 2008
Syntax: Useful tasks • Why should you care? • Grammar checkers • Basis for semantic interpretation • Question answering • Information extraction • Summarization • Machine translation • …… CPSC503 Winter 2008
Grammars and Constituency • Of course, there’s nothing easy or obvious about how we come up with right set of constituents and the rules that govern how they combine... • That’s why there are so many different theories of grammar and competing analyses of the same data. • The approach to grammar, and the analyses, adopted here are very generic (and don’t correspond to any modern linguistic theory of grammar). From Speech and Language Processing - Jurafsky and Martin
Key Constituents – with heads(English) (Specifier) X (Complement) • (Det) N (PP) • (Qual) V (NP) • (Deg) P (NP) • (Deg) A (PP) • (NP) (I) (VP) • Noun phrases • Verb phrases • Prepositional phrases • Adjective phrases • Sentences Some simplespecifiers Category Typical function Examples Determiner specifier of N the, a, this, no.. Qualifier specifier of V never, often.. Degree word specifier of A or P very, almost.. Complements? CPSC503 Winter 2008
Key Constituents: Examples • (Det) N (PP) • the cat on the table • (Qual) V (NP) • never eat a cat • (Deg) P (NP) • almost in the net • (Deg) A (PP) • very happy about it • (NP) (I) (VP) • a mouse -- ate it • Noun phrases • Verb phrases • Prepositional phrases • Adjective phrases • Sentences CPSC503 Winter 2008
Start-symbol Context Free Grammar (Example) • S -> NP VP • NP -> Det NOMINAL • NOMINAL -> Noun • VP -> Verb • Det -> a • Noun -> flight • Verb -> left • Non-terminal • Terminal CPSC503 Winter 2008
CFG more complex Example • Grammar with example phrases • Lexicon CPSC503 Winter 2008
CFGs • Define a Formal Language (un/grammatical sentences) • Generative Formalism • Generate strings in the language • Reject strings not in the language • Impose structures (trees) on strings in the language CPSC503 Winter 2008
CFG: Formal Definitions • 4-tuple (non-term., term., productions, start) • (N, , P, S) • P is a set of rules A; AN, (N)* • A derivation is the process of rewriting 1into m (both strings in (N)*) by applying a sequence of rules: 1* m • L G = W|w* and S * w CPSC503 Winter 2008
Nominal Nominal flight Derivations as Trees Context Free? CPSC503 Winter 2008
I prefer a morning flight Nominal Nominal Parser flight CFG Parsing • It is completely analogous to running a finite-state transducer with a tape • It’s just more powerful • Chpt. 13 CPSC503 Winter 2008
Other Options • Regular languages (FSA) A xB or A x • Too weak (e.g., cannot deal with recursion in a general way – no center-embedding) • CFGs A (also produce more understandable and “useful” structure) • Context-sensitive A ; ≠ • Can be computationally intractable • Turing equiv. ; ≠ • Too powerful / Computationally intractable CPSC503 Winter 2008
Common Sentence-Types • Declaratives: A plane left S -> NP VP • Imperatives: Leave! S -> VP • Yes-No Questions: Did the plane leave? S -> Aux NP VP • WH Questions: Which flights serve breakfast? S -> WH NP VP When did the plane leave? S -> WH Aux NP VP CPSC503 Winter 2008
NP: more details • NP -> (Predet)(Det)(Card)(Ord)(Quant) (AP) Nom • e.g., all the other cheap cars NP -> Specifiers N Complements • Nom -> Nom PP (PP) (PP) • e.g., reservation on BA456 from NY to YVR Nom -> Nom GerundVP e.g., flight arriving on Monday Nom -> Nom RelClause Nom RelClause ->(who | that) VP e.g., flight that arrives in the evening CPSC503 Winter 2008
Conjunctive Constructions • S -> S and S • John went to NY and Mary followed him • NP -> NP and NP • John went to NY and Boston • VP -> VP and VP • John went to NY and visited MOMA • … • In fact the right rule for English is • X -> X and X CPSC503 Winter 2008
Problems with CFGs • Agreement • Subcategorization CPSC503 Winter 2008
Agreement • In English, • Determiners and nouns have to agree in number • Subjects and verbs have to agree in person and number • Many languages have agreement systems that are far more complex than this (e.g., gender). CPSC503 Winter 2008
This dog Those dogs This dog eats You have it Those dogs eat *This dogs *Those dog *This dog eat *You has it *Those dogs eats Agreement CPSC503 Winter 2008
S -> NP VP NP -> Det Nom VP -> V NP … SgS -> SgNP SgVP PlS -> PlNp PlVP SgNP -> SgDet SgNom PlNP -> PlDet PlNom PlVP -> PlV NP SgVP ->SgV NP … Possible CFG Solution OLD Grammar NEW Grammar Sg = singular Pl = plural CPSC503 Winter 2008
CFG Solution for Agreement • It works and stays within the power of CFGs • But it doesn’t scale all that well (explosion in the number of rules) CPSC503 Winter 2008
Subcategorization • Def. It expresses constraints that a predicate (verb here) places on the number and type of its arguments(see first table) • *John sneezed the book • *I prefer United has a flight • *Give with a flight CPSC503 Winter 2008
Subcategorization • Sneeze: John sneezed • Find: Please find [a flight to NY]NP • Give: Give [me]NP[a cheaper fare]NP • Help: Can you help [me]NP[with a flight]PP • Prefer: I prefer [to leave earlier]TO-VP • Told: I was told [United has a flight]S • … CPSC503 Winter 2008
So? • So the various rules for VPs overgenerate. • They allow strings containing verbs and arguments that don’t go together • For example: • VP -> V NP therefore Sneezed the book • VP -> V S therefore go she will go there CPSC503 Winter 2008
Possible CFG Solution OLD Grammar NEW Grammar • VP -> V • VP -> V NP • VP -> V NP PP • … • VP -> IntransV • VP -> TransV NP • VP -> TransPPto NP PPto • … • TransPPto -> hand,give,.. This solution has the same problem as the one for agreement CPSC503 Winter 2008
CFG for NLP: summary • CFGs cover most syntactic structure in English. • But there are problems (overgeneration) • That can be dealt with adequately, although not elegantly, by staying within the CFG framework. • There are simpler, more elegant, solutions that take us out of the CFG framework: LFG, XTAGS… Chpt 15 “Features and Unification” CPSC503 Winter 2008
Dependency Grammars • Syntactic structure: binary relations between words • Links: grammatical function or very general semantic relation • Abstract away from word-order variations (simpler grammars) • Useful features in many NLP applications (for classification, summarization and NLG) CPSC503 Winter 2008
Today 29/9 • English Syntax • Context-Free Grammar for English • Rules • Trees • Recursion • Problems • Start Parsing CPSC503 Winter 2008
Nominal Nominal flight Parsing with CFGs • Valid parse trees • Sequence of words Assign valid trees: covers all and only the elements of the input and has an S at the top I prefer a morning flight Parser CFG CPSC503 Winter 2008
CFG Parsing as Search • Search space of possible parse trees • S -> NP VP • S -> Aux NP VP • NP -> Det Noun • VP -> Verb • Det -> a • Noun -> flight • Verb -> left, arrive • Aux -> do, does Parsing: find all trees that cover all and only the words in the input • defines CPSC503 Winter 2008
Nominal Nominal flight Constraints on Search • Sequence of words • Valid parse trees Search Strategies: • Top-down or goal-directed • Bottom-up or data-directed I prefer a morning flight Parser CFG (search space) CPSC503 Winter 2008
Input: flight Top-Down Parsing • Since we’re trying to find trees rooted with an S (Sentences) start with the rules that give us an S. • Then work your way down from there to the words. CPSC503 Winter 2008
…….. • …….. • …….. Next step: Top Down Space • When POS categories are reached, reject trees whose leaves fail to match all words in the input CPSC503 Winter 2008
flight flight flight Bottom-Up Parsing • Of course, we also want trees that cover the input words. So start with trees that link up with the words in the right way. • Then work your way up from there. CPSC503 Winter 2008
…….. • …….. • …….. flight flight flight flight flight flight flight Two more steps: Bottom-Up Space CPSC503 Winter 2008
Top-Down vs. Bottom-Up • Top-down • Only searches for trees that can be answers • But suggests trees that are not consistent with the words • Bottom-up • Only forms trees consistent with the words • Suggest trees that make no sense globally CPSC503 Winter 2008
So Combine Them • Top-down: control strategy to generate trees • Bottom-up: to filter out inappropriate parses • Top-down Control strategy: • Depth vs. Breadth first • Which node to try to expand next • Which grammar rule to use to expand a node • (left-most) • (textual order) CPSC503 Winter 2008
Top-Down, Depth-First, Left-to-Right Search Sample sentence: “Does this flight include a meal?” CPSC503 Winter 2008
Example “Does this flight include a meal?” CPSC503 Winter 2008