Statistical NLP, Winter 2009 • Lecture 10: Parsing I • Roger Levy • Thanks to Jason Eisner & Dan Klein for slides
Why is natural language parsing hard? • As language structure gets more abstract, computing it gets harder • Document classification • finite number of classes • fast computation at test time • Part-of-speech tagging (recovering label sequences) • Exponentially many possible tag sequences • But exact computation possible in O(n) • Parsing (recovering labeled trees) • Exponentially many, or even infinite, possible trees • Exact inference worse than tagging, but still within reach
Why is parsing harder than tagging? • How many trees are there for a given string? • Imagine a rule VP → VP • …∞! • This is not a problem for inferring availability of structures (why?) • Nor is this a problem for inferring the most probable structure in a PCFG (why?)
Why parsing is harder than tagging II • Ingredient 1: syntactic category ambiguity • Exponentially many category sequences, like tagging • Ingredient 2: attachment ambiguity • Classic case: prepositional-phrase (PP) attachment • 1 PP: no ambiguity • 2 PPs: some ambiguity
Why parsing is harder than tagging III • 3 PPs: much more attachment ambiguity! • 5 PPs: 14 trees, 6 PPs: 42 trees, 7 PPs: 132 trees…
Why parsing is harder than tagging IV • Tree-structure ambiguity grows like the Catalan numbers (Knuth, 1975; Church & Patil, 1982) • This is exponential growth (C_n grows roughly like 4^n / n^(3/2)), compounding the exponential growth associated with sequence label ambiguity
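The counts above (14, 42, 132) are consecutive Catalan numbers; a few lines of Python (illustrative, not from the slides) make that concrete:

```python
# Catalan numbers via the closed form C_n = C(2n, n) / (n + 1).
from math import comb

def catalan(n):
    return comb(2 * n, n) // (n + 1)

print([catalan(n) for n in range(8)])
# [1, 1, 2, 5, 14, 42, 132, 429] -- 14, 42, 132 are the tree counts
# the slide reports for 5, 6, and 7 PPs.
```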
Why parsing is still tractable • This all makes parsing look really bad • But there’s still hope • Those exponentially many parses are different combinations of common subparts
How to parse tractably • Recall that we did HMM part-of-speech tagging by storing partial results in a trellis • An HMM is a special type of grammar with essentially two types of rules: • “Category Y can follow category X (with cost π)” • “Category X can be realized as word w (with cost η)” • The trellis is a graph whose structure reflects its rules • Edges between all sequentially adjacent category pairs
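As a refresher, here is a minimal Viterbi sketch over such a trellis; the tables `pi` (cost of one category following another) and `eta` (cost of a category emitting a word) mirror the two rule types above, and all names are illustrative:

```python
# Minimal Viterbi sketch: the trellis stores (cost, backpointer) per tag.
def viterbi(words, tags, pi, eta, start):
    # trellis[i][t] = (best cost of tagging words[:i+1] ending in t, backpointer)
    trellis = [{t: (start[t] + eta[t][words[0]], None) for t in tags}]
    for i in range(1, len(words)):
        col = {}
        for t in tags:
            prev = min(tags, key=lambda p: trellis[-1][p][0] + pi[p][t])
            col[t] = (trellis[-1][prev][0] + pi[prev][t] + eta[t][words[i]], prev)
        trellis.append(col)
    # Trace the lightest path back through the stored backpointers.
    best = min(tags, key=lambda t: trellis[-1][t][0])
    path = [best]
    for col in reversed(trellis[1:]):
        path.append(col[path[-1]][1])
    return list(reversed(path))
```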
How to parse tractably II • But a (weighted) CFG has more complicated rules: • “Category X can rewrite as categories α (with cost π)” • “Preterminal X can be realized as word w (with cost η)” • (the second is really a special case of the first) • A graph is not rich enough to reflect CFG/tree structure • Phrases need to be stored as partial results • We also need rule combination structure • We’ll do this with hypergraphs
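One convenient way to store these two rule types (a sketch in Python; the dict layout is my choice, not something the slides prescribe) is to index binary rules by their right-hand side, so that rule combination becomes a dictionary lookup:

```python
# Binary rules, indexed by RHS pair: (B, C) -> list of (A, cost) for A -> B C.
binary_rules = {
    ("NP", "VP"): [("S", 1.0)],
    ("Det", "N"): [("NP", 1.0)],
}
# Lexical rules: word -> list of (preterminal, cost).
lexical_rules = {
    "time": [("NP", 3.0), ("Vst", 3.0)],
    "an":   [("Det", 1.0)],
}
```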
How to parse tractably III • Hypergraphs are like graphs, but have hyper-edges instead of edges • “We observe a DT as word 1 and an NN as word 2.” • “Together, these let us infer an NP spanning words 1-2.” • (Figure: the start state allows us to infer each of the two word nodes, and both of those nodes are needed to infer the NP node)
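Concretely, a hyperedge has several source nodes but a single target node. A minimal sketch of the data structures (names are illustrative):

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Node:
    category: str   # e.g. "NP"
    start: int      # first word covered (0-based)
    end: int        # one past the last word covered

@dataclass(frozen=True)
class Hyperedge:
    sources: Tuple[Node, ...]  # all of these are needed ...
    target: Node               # ... to infer this one
    cost: float                # weight of the rule used

# The slide's example: DT over word 1 and NN over word 2 license an NP.
edge = Hyperedge((Node("DT", 0, 1), Node("NN", 1, 2)), Node("NP", 0, 2), 1.0)
```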
How to parse tractably IV • Hypergraph for Bird shot flies (only partial) • (Figure: a Goal node plus nodes spanning words 1-2, 2-3, and 1-3) • Grammar: S → NP VP, VP → V NP, VP → V, NP → N, NP → N N
How to parse tractably V • The nodes in the hypergraph can be thought of as being arranged in a triangle • For a sentence of length N, this is the upper right triangle of an N×N matrix • This matrix is called the parse chart
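As a sketch, the chart can be any mapping from spans to cells; only the upper-triangle spans (i < j) are ever populated:

```python
# Cells of the parse chart for a 5-word sentence, keyed by (start, end) spans.
N = 5
chart = {(i, j): {} for i in range(N) for j in range(i + 1, N + 1)}
# Each cell will map a category to its best (cost, backpointer) entry,
# e.g. chart[(0, 2)]["NP"] for an NP covering the first two words.
```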
How to parse tractably VI • Before we study examples of parsing, let’s linger on the hypergraph for a moment • The goal of parsing is to fully interconnect all the evidence (words) and the goal • This could be done from the bottom up… • …or from the top down & left to right • These correspond to different parse strategies • Today: bottom-up (later: top-down)
Bottom-up (CKY) parsing • Bottom-up is the most straightforward efficient parsing algorithm to implement • Known as the Cocke-Kasami-Younger (CKY) algorithm • We’ll illustrate it for the weighted-CFG case • Each rule has a weight (negative log-prob) associated with it • We’re looking for the “lightest” (lowest-weight or, equivalently, highest-probability) tree T for sentence S • Implicitly this is Bayes’ rule!
CKY parsing II • Here’s the (partial) grammar we’ll use, with each rule’s weight on the left:
1 S → NP VP      6 S → Vst NP     2 S → S PP
1 VP → V NP      2 VP → VP PP
1 NP → Det N     2 NP → NP PP     3 NP → NP NP
0 PP → P NP
3 NP → time      3 Vst → time     (Vst: imperative verb, as in “Do the dishes!”)
4 NP → flies     4 VP → flies
2 P → like       5 V → like
1 Det → an       8 N → arrow
• The sentence we’ll parse (see the ambiguity?): Time flies like an arrow
(A sequence of animation slides fills the parse chart for “Time flies like an arrow” cell by cell, bottom-up; each slide showed the chart next to the grammar above, but the chart diagrams did not survive extraction.)
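The chart-filling those slides animate is compact in code. Below is a runnable sketch for exactly the grammar above (the layout and names are mine; weights are costs, lower is better). Each cell keeps only the lightest entry per category, plus a backpointer for recovering the tree later:

```python
import math

# The slide's grammar, binary rules indexed by right-hand side:
# (B, C) -> [(A, cost), ...] encodes "A -> B C with this cost".
BINARY = {
    ("NP", "VP"): [("S", 1)],   ("Vst", "NP"): [("S", 6)],
    ("S", "PP"):  [("S", 2)],   ("V", "NP"):   [("VP", 1)],
    ("VP", "PP"): [("VP", 2)],  ("Det", "N"):  [("NP", 1)],
    ("NP", "PP"): [("NP", 2)],  ("NP", "NP"):  [("NP", 3)],
    ("P", "NP"):  [("PP", 0)],
}
# Lexical rules: word -> [(preterminal, cost), ...].
LEXICAL = {
    "time":  [("NP", 3), ("Vst", 3)],
    "flies": [("NP", 4), ("VP", 4)],
    "like":  [("P", 2), ("V", 5)],
    "an":    [("Det", 1)],
    "arrow": [("N", 8)],
}

def cky(words):
    n = len(words)
    # chart[(i, j)][A] = (best cost of an A spanning words[i:j], backpointer)
    chart = {(i, j): {} for i in range(n) for j in range(i + 1, n + 1)}
    for i, w in enumerate(words):              # width-1 spans: lexical rules
        for cat, wt in LEXICAL[w]:
            chart[(i, i + 1)][cat] = (wt, w)
    for width in range(2, n + 1):              # wider spans, narrowest first
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):          # try every split point
                for b, (wb, _) in chart[(i, k)].items():
                    for c, (wc, _) in chart[(k, j)].items():
                        for a, wr in BINARY.get((b, c), ()):
                            cost = wr + wb + wc
                            if cost < chart[(i, j)].get(a, (math.inf,))[0]:
                                chart[(i, j)][a] = (cost, (k, b, c))
    return chart

words = "time flies like an arrow".split()
chart = cky(words)
print(chart[(0, 5)]["S"][0])   # 22: weight of the lightest S over the sentence
```

(Two derivations of S happen to tie at weight 22 with these rule weights; the strict `<` keeps whichever is built first, which is the parse the next slide recovers.)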
Follow backpointers … • Start from the winning S over the whole sentence • S → NP VP • VP → VP PP • PP → P NP • NP → Det N • (each slide expanded one more node of the winning tree, with the grammar shown alongside)
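Continuing the sketch above (this assumes `chart` and `words` from the previous block), following backpointers is a short recursion:

```python
def tree(chart, cat, i, j):
    """Expand the best entry for cat over span (i, j) into a bracketing."""
    _, bp = chart[(i, j)][cat]
    if isinstance(bp, str):                      # lexical backpointer: a word
        return f"({cat} {bp})"
    k, b, c = bp                                 # split point and child categories
    return f"({cat} {tree(chart, b, i, k)} {tree(chart, c, k, j)})"

print(tree(chart, "S", 0, 5))
# (S (NP time) (VP (VP flies) (PP (P like) (NP (Det an) (N arrow)))))
```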
Which entries do we need? • An entry is not worth keeping if another entry of the same category covers the same span at lower weight … • … since the heavier entry just breeds worse options (“inferior stock”) • Keep only best-in-class! (and backpointers so you can recover the parse)
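In code, the best-in-class rule is a single comparison per candidate; a standalone sketch:

```python
import math

def relax(cell, cat, cost, backpointer):
    """Keep only the lightest (best-in-class) entry per category in a cell."""
    if cost < cell.get(cat, (math.inf, None))[0]:
        cell[cat] = (cost, backpointer)

cell = {}
relax(cell, "NP", 24, "inferior stock")   # first candidate is kept ...
relax(cell, "NP", 18, "best in class")    # ... until a lighter one arrives
assert cell["NP"] == (18, "best in class")
```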
Computational complexity of parsing • This approach has good space complexity: O(GN²), where G is the number of categories in the grammar • What is the time complexity of the algorithm? • It’s cubic in N … why? (there are O(N²) cells, and each cell considers O(N) split points) • What about time complexity in G? • First, a clarification is in order • CFG rules can have right-hand sides of arbitrary length: X → α • But CKY works only with right-hand sides of length at most 2 • So we need to convert the CFG for use with CKY
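A minimal binarization sketch (my formulation; real systems also handle unary chains and smarter intermediate naming): each over-long right-hand side becomes a chain of binary rules through fresh intermediate categories, with the original weight carried by the first rule in the chain:

```python
def binarize(rules):
    """Rules are (lhs, rhs_tuple, cost); returns rules with RHS length <= 2."""
    out = []
    for lhs, rhs, cost in rules:
        head, n = lhs, 0
        while len(rhs) > 2:
            n += 1
            new = f"@{head}_{n}"                 # fresh intermediate category
            out.append((lhs, (rhs[0], new), cost))
            lhs, rhs, cost = new, rhs[1:], 0.0   # rest of the chain costs nothing
        out.append((lhs, rhs, cost))
    return out

print(binarize([("VP", ("V", "NP", "PP", "PP"), 1.5)]))
# [('VP', ('V', '@VP_1'), 1.5), ('@VP_1', ('NP', '@VP_2'), 0.0),
#  ('@VP_2', ('PP', 'PP'), 0.0)]
```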