CPSC 503 Computational Linguistics, Lecture 8. Giuseppe Carenini. CPSC503 Winter 2008
Knowledge-Formalisms Map • State Machines (and prob. versions) (Finite State Automata, Finite State Transducers, Markov Models): Morphology • Rule systems (and prob. versions) (e.g., (Prob.) Context-Free Grammars): Syntax • Logical formalisms (First-Order Logic): Semantics • AI planners: Pragmatics, Discourse and Dialogue
Today 1/10 • Finish CFG for Syntax of NL (problems) • Parsing • The Earley Algorithm • Partial Parsing: Chunking • Dependency Grammars / Parsing
Problems with CFGs • Agreement • Subcategorization
Agreement • In English, • Determiners and nouns have to agree in number • Subjects and verbs have to agree in person and number • Many languages have agreement systems that are far more complex than this (e.g., gender).
Agreement • Grammatical: This dog / Those dogs / This dog eats / You have it / Those dogs eat • Ungrammatical: *This dogs / *Those dog / *This dog eat / *You has it / *Those dogs eats
Possible CFG Solution (Sg = singular, Pl = plural) • OLD Grammar: S -> NP VP, NP -> Det Nom, VP -> V NP, … • NEW Grammar: SgS -> SgNP SgVP, PlS -> PlNP PlVP, SgNP -> SgDet SgNom, PlNP -> PlDet PlNom, SgVP -> SgV NP, PlVP -> PlV NP, …
CFG Solution for Agreement • It works and stays within the power of CFGs • But it doesn’t scale all that well (explosion in the number of rules)
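To see the explosion concretely: splitting every category on a binary number feature doubles each affected rule, and each additional agreement feature (person, gender, …) multiplies the count again. The sketch below is illustrative; it naively renames every occurrence of an agreeing category, whereas the grammar above splits only the positions that actually agree.

```python
# Sketch: splitting categories by a number feature (Sg/Pl) multiplies the
# grammar. With k binary agreement features a rule can need up to 2**k variants.
# The grammar and the all-positions split are illustrative, not the slide's.

base_rules = [
    ("S", ["NP", "VP"]),
    ("NP", ["Det", "Nom"]),
    ("VP", ["V", "NP"]),
]

def split_by_number(rules, agreeing={"S", "NP", "VP", "Det", "Nom", "V"}):
    """Duplicate each rule once per number value, renaming agreeing symbols."""
    split = []
    for value in ("Sg", "Pl"):
        for lhs, rhs in rules:
            new_lhs = value + lhs if lhs in agreeing else lhs
            new_rhs = [value + s if s in agreeing else s for s in rhs]
            split.append((new_lhs, new_rhs))
    return split

new_rules = split_by_number(base_rules)
print(len(base_rules), "->", len(new_rules))  # 3 -> 6
```

Even this tiny three-rule fragment doubles; a realistic grammar with several hundred rules and more than one agreement feature grows much faster.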
Subcategorization • Def. It expresses constraints that a predicate (verb here) places on the number and type of its arguments (see the frames on the next slide) • *John sneezed the book • *I prefer United has a flight • *Give with a flight
Subcategorization • Sneeze: John sneezed • Find: Please find [a flight to NY]NP • Give: Give [me]NP [a cheaper fare]NP • Help: Can you help [me]NP [with a flight]PP • Prefer: I prefer [to leave earlier]TO-VP • Told: I was told [United has a flight]S • …
So? • So the various rules for VPs overgenerate • They allow strings containing verbs and arguments that don’t go together • For example: • VP -> V NP therefore *sneezed the book • VP -> V S therefore *go she will go there
Possible CFG Solution • OLD Grammar: VP -> V, VP -> V NP, VP -> V NP PP, … • NEW Grammar: VP -> IntransV, VP -> TransV NP, VP -> TransPPto NP PPto, …, TransPPto -> hand, give, … • This solution has the same problem as the one for agreement
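The lexicalized idea behind IntransV/TransV can also be pictured as a per-verb table of allowed argument frames, checked outside the grammar rules. The verbs and frames below are an illustrative sketch based on the earlier examples, not a real lexicon.

```python
# Sketch: a lexicalized check of subcategorization frames.
# The verb -> frames table is illustrative, loosely based on the slide's examples.

SUBCAT = {
    "sneeze": [[]],                # intransitive: no arguments
    "find":   [["NP"]],            # find [a flight]NP
    "give":   [["NP", "NP"]],      # give [me]NP [a cheaper fare]NP
    "prefer": [["TO-VP"], ["NP"]], # prefer [to leave earlier]TO-VP or [a flight]NP
    "tell":   [["NP", "S"]],       # told [me]NP [United has a flight]S
}

def licensed(verb, arg_cats):
    """True if the verb allows exactly this sequence of argument categories."""
    return list(arg_cats) in SUBCAT.get(verb, [])

print(licensed("sneeze", []))      # True:  "John sneezed"
print(licensed("sneeze", ["NP"]))  # False: "*John sneezed the book"
```

This keeps the CFG small and pushes the verb-specific constraints into the lexicon, which is roughly the direction the feature-based formalisms mentioned on the next slide take.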
CFG for NLP: summary • CFGs cover most syntactic structure in English • But there are problems (overgeneration) • They can be dealt with adequately, although not elegantly, by staying within the CFG framework • There are simpler, more elegant solutions that take us out of the CFG framework: LFG, XTAGS… see Chapter 15 “Features and Unification”
Today 1/10 • Finish CFG for Syntax of NL (problems) • Parsing • The Earley Algorithm • Partial Parsing: Chunking • Dependency Grammars / Parsing
Parsing with CFGs • A parser takes a sequence of words (e.g., “I prefer a morning flight”) and a CFG, and assigns valid parse trees • A valid tree covers all and only the elements of the input and has an S at the top
CFG Parsing as Search • The CFG defines the search space of possible parse trees • S -> NP VP • S -> Aux NP VP • NP -> Det Noun • VP -> Verb • Det -> a • Noun -> flight • Verb -> left, arrive • Aux -> do, does • Parsing: find all trees that cover all and only the words in the input
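Written as plain data, the toy grammar above might look like this; the dict-of-rules representation and the helper are our choices, not the lecture's.

```python
# The slide's toy grammar as plain Python data (the representation is ours).
GRAMMAR = {
    "S":    [["NP", "VP"], ["Aux", "NP", "VP"]],
    "NP":   [["Det", "Noun"]],
    "VP":   [["Verb"]],
    "Det":  [["a"]],
    "Noun": [["flight"]],
    "Verb": [["left"], ["arrive"]],
    "Aux":  [["do"], ["does"]],
}

def is_preterminal(cat):
    """A category whose expansions are all single lowercase words (a POS)."""
    return all(len(rhs) == 1 and rhs[0].islower() for rhs in GRAMMAR.get(cat, []))

print(is_preterminal("Det"), is_preterminal("NP"))  # True False
```

Separating part-of-speech categories from phrasal ones matters below: top-down search expands phrasal categories but checks POS categories directly against the input words.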
Constraints on Search • Input: sequence of words (“I prefer a morning flight”) • Output: valid parse trees • Search strategies: • Top-down or goal-directed • Bottom-up or data-directed
Top-Down Parsing • Since we’re trying to find trees rooted with an S (Sentence), start with the rules that give us an S • Then work your way down from there to the words
Next step: Top-Down Space • When POS categories are reached, reject trees whose leaves fail to match all words in the input
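A minimal sketch of this top-down, depth-first, left-to-right procedure, including the step that rejects a branch as soon as a predicted part-of-speech fails to match the next word. The grammar, lexicon, and names are illustrative, not the lecture's code.

```python
# Minimal top-down, depth-first, left-to-right recognizer (illustrative).
# It always expands the left-most category first and rejects a branch as soon
# as a predicted part-of-speech fails to match the next input word.

GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["Det", "Noun"]],
    "VP": [["Verb", "NP"], ["Verb"]],
}
LEXICON = {"a": "Det", "the": "Det", "flight": "Noun", "pilot": "Noun",
           "left": "Verb"}

def parse(cats, words):
    """Can the category sequence `cats` derive exactly the word list `words`?"""
    if not cats:
        return not words                      # success iff all input consumed
    first, rest = cats[0], cats[1:]
    if first in GRAMMAR:                      # phrasal category: try each rule
        return any(parse(rhs + rest, words) for rhs in GRAMMAR[first])
    # POS category: must match the next word, otherwise reject this branch
    return bool(words) and LEXICON.get(words[0]) == first and parse(rest, words[1:])

print(parse(["S"], "the pilot left".split()))  # True
print(parse(["S"], "the pilot a".split()))     # False
```

Note the grammar here deliberately has no left-recursive rules; as a later slide shows, this naive strategy loops forever on rules like NP -> NP PP.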
Bottom-Up Parsing • Of course, we also want trees that cover the input words, so start with trees that link up with the words in the right way • Then work your way up from there
Two more steps: Bottom-Up Space
Top-Down vs. Bottom-Up • Top-down: only searches for trees that can be answers, but suggests trees that are not consistent with the words • Bottom-up: only forms trees consistent with the words, but suggests trees that make no sense globally
So Combine Them • Top-down: control strategy to generate trees • Bottom-up: to filter out inappropriate parses • Top-down control strategy: • Depth-first vs. breadth-first • Which node to try to expand next (left-most) • Which grammar rule to use to expand a node (textual order)
Top-Down, Depth-First, Left-to-Right Search Sample sentence: “Does this flight include a meal?”
Example “Does this flight include a meal?”
Adding Bottom-Up Filtering • The expansion sequence just shown was a waste of time because an NP cannot generate a parse tree starting with an Aux
Bottom-Up Filtering
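One standard way to implement this filter is to precompute a left-corner table: for each category, the set of categories that can begin one of its derivations. A top-down prediction is kept only if the next word's part-of-speech is among the predicted category's left corners. A sketch with an illustrative grammar:

```python
# Sketch: a left-corner table for bottom-up filtering (grammar is illustrative).
# left_corners(A) = every category that can start some derivation from A.
# A prediction like NP is pruned immediately when the next word is an Aux,
# because Aux is not a left corner of NP.

GRAMMAR = {
    "S":  [["NP", "VP"], ["Aux", "NP", "VP"]],
    "NP": [["Det", "Nominal"]],
    "VP": [["Verb", "NP"]],
}

def left_corners(cat, seen=None):
    seen = seen or set()
    if cat in seen or cat not in GRAMMAR:
        return set()
    seen.add(cat)
    corners = set()
    for rhs in GRAMMAR[cat]:
        first = rhs[0]
        corners.add(first)
        corners |= left_corners(first, seen)
    return corners

print("Aux" in left_corners("S"))   # True: an S may start with an Aux
print("Aux" in left_corners("NP"))  # False: predicting NP before "does" is wasted
```

The table depends only on the grammar, so it can be computed once before parsing begins.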
Problems with TD-BU-filtering • Left recursion • Ambiguity • Repeated parsing • SOLUTION: Earley Algorithm (once again, dynamic programming!)
(1) Left-Recursion These rules appear in most English grammars: S -> S and S, VP -> VP PP, NP -> NP PP
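A depth-first top-down parser loops forever on such rules: it keeps expanding NP -> NP PP without ever consuming a word. The Earley algorithm handles left recursion directly, but for plain top-down parsing the textbook fix is the standard transformation of A -> A α | β into A -> β A' and A' -> α A' | ε. A sketch (the naming convention is ours):

```python
# Sketch of the standard removal of immediate left recursion:
#   A -> A α | β   becomes   A -> β A'   and   A' -> α A' | ε
# Symbol names and the grammar fragment are illustrative.

def remove_immediate_left_recursion(lhs, rules):
    """rules: list of right-hand sides for lhs. Returns (new_rules, primed_rules)."""
    recursive = [rhs[1:] for rhs in rules if rhs and rhs[0] == lhs]
    others = [rhs for rhs in rules if not rhs or rhs[0] != lhs]
    if not recursive:
        return rules, []
    primed = lhs + "'"
    new_rules = [beta + [primed] for beta in others]
    primed_rules = [alpha + [primed] for alpha in recursive] + [[]]  # [] = epsilon
    return new_rules, primed_rules

# NP -> NP PP | Det Nominal
new_np, np_prime = remove_immediate_left_recursion(
    "NP", [["NP", "PP"], ["Det", "Nominal"]])
print(new_np)    # [['Det', 'Nominal', "NP'"]]
print(np_prime)  # [['PP', "NP'"], []]
```

The transformed grammar accepts the same strings, but the resulting trees no longer match the linguistically motivated structure, which is one reason to prefer an algorithm that tolerates left recursion instead.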
(2) Structural Ambiguity Three basic kinds: Attachment / Coordination / NP-bracketing • “I shot an elephant in my pajamas”
(3) Repeated Work • Parsing is hard and slow; it is wasteful to redo the same work over and over • Consider an attempt to top-down parse the following as an NP: “A flight from Indianapolis to Houston on TWA”
starts from… • NP -> Det Nom • NP -> NP PP • Nom -> Noun • … • fails and backtracks
restarts from… • NP -> Det Nom • NP -> NP PP • Nom -> Noun • fails and backtracks
restarts from… • fails and backtracks…
restarts from… • Success!
But… the same sub-phrases are rebuilt several times along the way (the figure counts how many times each constituent is re-parsed, from 1 to 4)
Dynamic Programming Fills tables with solutions to subproblems • Parsing: sub-trees consistent with the input, once discovered, are stored and can be reused • Does not fall prey to left-recursion • Stores ambiguous parses compactly • Does not do (avoidable) repeated work
Earley Parsing O(N³) • Fills a table in a single sweep over the input words • Table is of length N+1, where N is the number of words • Table entries represent: • Predicted constituents • In-progress constituents • Completed constituents and their locations
For Next Time • Read 12.7 • Read in Chapter 13 (Parsing): 13.4.2, 13.5 • Optional: Read Chapter 16 (Features and Unification) – skip algorithms and implementation
Final Project: Decision • Two ways: select an NLP task/problem or a technique used in NLP that truly interests you • Tasks: summarization of …… , computing similarity between two terms/sentences (skim through the textbook) • Techniques: extensions / variations / combinations of what we saw in class – Max Entropy Classifiers or MM, Dirichlet Multinomial Distributions
Final Project: goals (and hopefully contributions) • Improve on a proposed solution by using a possibly more effective technique or by combining multiple techniques • Propose a novel (minimally novel is OK) solution • Apply a technique that has been used for NLP task A to a different NLP task B • Apply a technique to a different dataset or to a different language • Propose a different evaluation measure
Final Project: Examples / Ideas • Look on the course WebPage
Today 1/10 • Finish CFG for Syntax of NL • Parsing • The Earley Algorithm • Partial Parsing: Chunking
States The table entries are called states and express: • what is predicted from that point • what has been recognized up to that point • Representation: dotted rules + location • S -> · VP [0,0]: a VP is predicted at the start of the sentence • NP -> Det · Nominal [1,2]: an NP is in progress; the Det goes from 1 to 2 • VP -> V NP · [0,3]: a VP has been found starting at 0 and ending at 3
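One concrete way to represent these dotted-rule states in code; the representation below is our illustration, not the lecture's.

```python
# Sketch: one possible representation of an Earley state (ours, illustrative).
from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    lhs: str      # e.g. "NP"
    rhs: tuple    # e.g. ("Det", "Nominal")
    dot: int      # position of the dot within rhs
    start: int    # input position where this constituent begins
    end: int      # input position the recognition has reached so far

    def complete(self):
        """Dot at the end of the rule: the constituent has been found."""
        return self.dot == len(self.rhs)

    def next_category(self):
        """Category expected after the dot, or None if complete."""
        return None if self.complete() else self.rhs[self.dot]

s = State("NP", ("Det", "Nominal"), 1, 1, 2)  # NP -> Det · Nominal [1,2]
print(s.complete(), s.next_category())        # False Nominal
```

Making the state hashable (frozen) mirrors the "no duplicate states" discipline of the chart: the same dotted rule with the same span is only stored once.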
Graphically • S -> · VP [0,0] • NP -> Det · Nominal [1,2] • VP -> V NP · [0,3]
Earley: answer • The answer is found by looking in the table in the right place • The following state should be in the final column: S -> α · [0,n] • i.e., an S state with the dot at the end of its right-hand side, spanning from 0 to n: a complete S over the whole input
Earley Parsing Procedure • Sweep through the table from 0 to n in order, applying one of three operators to each state: • predictor: add top-down predictions to the chart • scanner: read input and add the corresponding state to the chart • completer: move the dot to the right when a new constituent is found • Results (new states) are added to the current or next set of states in the chart • No backtracking and no states are removed
Predictor • Intuition: new states represent top-down expectations • Applied when a non-part-of-speech non-terminal is to the right of the dot, e.g., S -> · VP [0,0] • Adds new states to the end of the current chart, one for each expansion of that non-terminal in the grammar: VP -> · V [0,0], VP -> · V NP [0,0]
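Putting predictor, scanner, and completer together gives a small Earley recognizer. This is a hedged sketch: the grammar, lexicon, tuple-based state representation, and the dummy GAMMA start rule are illustrative choices, not the lecture's code.

```python
# Sketch of an Earley recognizer built from the three operators above
# (predictor / scanner / completer). Grammar, lexicon, and names are
# illustrative. A state is (lhs, rhs, dot, start); chart[i] holds the
# states whose recognized portion ends at input position i.

GRAMMAR = {
    "S":  [("NP", "VP")],
    "NP": [("Det", "Noun")],
    "VP": [("Verb",), ("Verb", "NP")],
}
LEXICON = {"a": "Det", "flight": "Noun", "left": "Verb"}

def earley(words, root="S"):
    chart = [[] for _ in range(len(words) + 1)]

    def add(i, state):
        if state not in chart[i]:      # no duplicates, no states removed
            chart[i].append(state)

    add(0, ("GAMMA", (root,), 0, 0))   # dummy start state
    for i in range(len(words) + 1):
        for state in chart[i]:         # chart[i] grows as we iterate over it
            lhs, rhs, dot, start = state
            if dot < len(rhs) and rhs[dot] in GRAMMAR:       # predictor
                for expansion in GRAMMAR[rhs[dot]]:
                    add(i, (rhs[dot], expansion, 0, i))
            elif dot < len(rhs):                             # scanner (POS)
                if i < len(words) and LEXICON.get(words[i]) == rhs[dot]:
                    add(i + 1, (lhs, rhs, dot + 1, start))
            else:                                            # completer
                for (l2, r2, d2, s2) in chart[start]:
                    if d2 < len(r2) and r2[d2] == lhs:
                        add(i, (l2, r2, d2 + 1, s2))
    # accept iff the dummy rule is complete over the whole input
    return ("GAMMA", (root,), 1, 0) in chart[len(words)]

print(earley("a flight left".split()))  # True
print(earley("a left".split()))         # False
```

Because new states are only ever appended (never removed) and duplicates are suppressed, the single left-to-right sweep terminates even on left-recursive grammars, which is exactly the property the earlier slides motivated.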