Statistical NLP, Winter 2009 • Lecture 10: Parsing I • Roger Levy • Thanks to Jason Eisner & Dan Klein for slides
Why is natural language parsing hard? • As language structure gets more abstract, computing it gets harder • Document classification • finite number of classes • fast computation at test time • Part-of-speech tagging (recovering label sequences) • Exponentially many possible tag sequences • But exact computation possible in O(n) • Parsing (recovering labeled trees) • Exponentially many, or even infinite, possible trees • Exact inference worse than tagging, but still within reach
Why is parsing harder than tagging? • How many trees are there for a given string? • Imagine a rule VP → VP • …∞! • This is not a problem for inferring availability of structures (why?) • Nor is this a problem for inferring the most probable structure in a PCFG (why?)
Why parsing is harder than tagging II • Ingredient 1: syntactic category ambiguity • Exponentially many category sequences, like tagging • Ingredient 2: attachment ambiguity • Classic case: prepositional-phrase (PP) attachment • 1 PP: no ambiguity • 2 PPs: some ambiguity
Why parsing is harder than tagging III • 3 PPs: much more attachment ambiguity! • 5 PPs: 14 trees, 6 PPs: 42 trees, 7 PPs: 132 trees…
Why parsing is harder than tagging IV • Tree-structure ambiguity grows like the Catalan numbers (Knuth, 1975; Church & Patil, 1982) • This is exponential growth (C_n grows roughly like 4^n / n^(3/2)), compounding the exponential growth associated with sequence label ambiguity
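The counts above (14, 42, 132) are consecutive Catalan numbers; a few lines of Python (illustrative, not from the slides) make that concrete:

```python
# Catalan numbers via the closed form C_n = C(2n, n) / (n + 1).
from math import comb

def catalan(n):
    return comb(2 * n, n) // (n + 1)

print([catalan(n) for n in range(8)])
# [1, 1, 2, 5, 14, 42, 132, 429] -- 14, 42, 132 are the tree counts
# the slide reports for 5, 6, and 7 PPs.
```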
Why parsing is still tractable • This all makes parsing look really bad • But there’s still hope • Those exponentially many parses are different combinations of common subparts
How to parse tractably • Recall that we did HMM part-of-speech tagging by storing partial results in a trellis • An HMM is a special type of grammar with essentially two types of rules: • “Category Y can follow category X (with cost π)” • “Category X can be realized as word w (with cost η)” • The trellis is a graph whose structure reflects its rules • Edges between all sequentially adjacent category pairs
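As a refresher, here is a minimal Viterbi sketch over such a trellis; the tables `pi` (cost of one category following another) and `eta` (cost of a category emitting a word) mirror the two rule types above, and all names are illustrative:

```python
# Minimal Viterbi sketch: the trellis stores (cost, backpointer) per tag.
def viterbi(words, tags, pi, eta, start):
    # trellis[i][t] = (best cost of tagging words[:i+1] ending in t, backpointer)
    trellis = [{t: (start[t] + eta[t][words[0]], None) for t in tags}]
    for i in range(1, len(words)):
        col = {}
        for t in tags:
            prev = min(tags, key=lambda p: trellis[-1][p][0] + pi[p][t])
            col[t] = (trellis[-1][prev][0] + pi[prev][t] + eta[t][words[i]], prev)
        trellis.append(col)
    # Trace the lightest path back through the stored backpointers.
    best = min(tags, key=lambda t: trellis[-1][t][0])
    path = [best]
    for col in reversed(trellis[1:]):
        path.append(col[path[-1]][1])
    return list(reversed(path))
```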
How to parse tractably II • But a (weighted) CFG has more complicated rules: • “Category X can rewrite as categories α (with cost π)” • “Preterminal X can be realized as word w (with cost η)” • (the second is really a special case of the first) • A graph is not rich enough to reflect CFG/tree structure • Phrases need to be stored as partial results • We also need rule combination structure • We’ll do this with hypergraphs
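One convenient way to store these two rule types (a sketch in Python; the dict layout is my choice, not something the slides prescribe) is to index binary rules by their right-hand side, so that rule combination becomes a dictionary lookup:

```python
# Binary rules, indexed by RHS pair: (B, C) -> list of (A, cost) for A -> B C.
binary_rules = {
    ("NP", "VP"): [("S", 1.0)],
    ("Det", "N"): [("NP", 1.0)],
}
# Lexical rules: word -> list of (preterminal, cost).
lexical_rules = {
    "time": [("NP", 3.0), ("Vst", 3.0)],
    "an":   [("Det", 1.0)],
}
```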
How to parse tractably III • Hypergraphs are like graphs, but have hyper-edges instead of edges • “We observe a DT as word 1 and an NN as word 2.” • “Together, these let us infer an NP spanning words 1-2.” • (Figure: the start state allows us to infer each of the two word nodes, and both of those nodes are needed to infer the NP node)
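Concretely, a hyperedge has several source nodes but a single target node. A minimal sketch of the data structures (names are illustrative):

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Node:
    category: str   # e.g. "NP"
    start: int      # first word covered (0-based)
    end: int        # one past the last word covered

@dataclass(frozen=True)
class Hyperedge:
    sources: Tuple[Node, ...]  # all of these are needed ...
    target: Node               # ... to infer this one
    cost: float                # weight of the rule used

# The slide's example: DT over word 1 and NN over word 2 license an NP.
edge = Hyperedge((Node("DT", 0, 1), Node("NN", 1, 2)), Node("NP", 0, 2), 1.0)
```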
How to parse tractably IV • Hypergraph for Bird shot flies (only partial) • (Figure: a Goal node plus nodes spanning words 1-2, 2-3, and 1-3) • Grammar: S → NP VP, VP → V NP, VP → V, NP → N, NP → N N
How to parse tractably V • The nodes in the hypergraph can be thought of as being arranged in a triangle • For a sentence of length N, this is the upper right triangle of an N×N matrix • This matrix is called the parse chart
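As a sketch, the chart can be any mapping from spans to cells; only the upper-triangle spans (i < j) are ever populated:

```python
# Cells of the parse chart for a 5-word sentence, keyed by (start, end) spans.
N = 5
chart = {(i, j): {} for i in range(N) for j in range(i + 1, N + 1)}
# Each cell will map a category to its best (cost, backpointer) entry,
# e.g. chart[(0, 2)]["NP"] for an NP covering the first two words.
```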
How to parse tractably VI • Before we study examples of parsing, let’s linger on the hypergraph for a moment • The goal of parsing is to fully interconnect all the evidence (words) and the goal • This could be done from the bottom up… • …or from the top down & left to right • These correspond to different parse strategies • Today: bottom-up (later: top-down)
Bottom-up (CKY) parsing • Bottom-up is the most straightforward efficient parsing algorithm to implement • Known as the Cocke-Kasami-Younger (CKY) algorithm • We’ll illustrate it for the weighted-CFG case • Each rule has a weight (negative log-prob) associated with it • We’re looking for the “lightest” (lowest-weight or, equivalently, highest-probability) tree T for sentence S • Implicitly this is Bayes’ rule!
CKY parsing II • Here’s the (partial) grammar we’ll use, with each rule’s weight on the left:
1 S → NP VP      6 S → Vst NP     2 S → S PP
1 VP → V NP      2 VP → VP PP
1 NP → Det N     2 NP → NP PP     3 NP → NP NP
0 PP → P NP
3 NP → time      3 Vst → time     (Vst: imperative verb, as in “Do the dishes!”)
4 NP → flies     4 VP → flies
2 P → like       5 V → like
1 Det → an       8 N → arrow
• The sentence we’ll parse (see the ambiguity?): Time flies like an arrow
(A sequence of animation slides fills the parse chart for “Time flies like an arrow” cell by cell, bottom-up; each slide showed the chart next to the grammar above, but the chart diagrams did not survive extraction.)
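The chart-filling those slides animate is compact in code. Below is a runnable sketch for exactly the grammar above (the layout and names are mine; weights are costs, lower is better). Each cell keeps only the lightest entry per category, plus a backpointer for recovering the tree later:

```python
import math

# The slide's grammar, binary rules indexed by right-hand side:
# (B, C) -> [(A, cost), ...] encodes "A -> B C with this cost".
BINARY = {
    ("NP", "VP"): [("S", 1)],   ("Vst", "NP"): [("S", 6)],
    ("S", "PP"):  [("S", 2)],   ("V", "NP"):   [("VP", 1)],
    ("VP", "PP"): [("VP", 2)],  ("Det", "N"):  [("NP", 1)],
    ("NP", "PP"): [("NP", 2)],  ("NP", "NP"):  [("NP", 3)],
    ("P", "NP"):  [("PP", 0)],
}
# Lexical rules: word -> [(preterminal, cost), ...].
LEXICAL = {
    "time":  [("NP", 3), ("Vst", 3)],
    "flies": [("NP", 4), ("VP", 4)],
    "like":  [("P", 2), ("V", 5)],
    "an":    [("Det", 1)],
    "arrow": [("N", 8)],
}

def cky(words):
    n = len(words)
    # chart[(i, j)][A] = (best cost of an A spanning words[i:j], backpointer)
    chart = {(i, j): {} for i in range(n) for j in range(i + 1, n + 1)}
    for i, w in enumerate(words):              # width-1 spans: lexical rules
        for cat, wt in LEXICAL[w]:
            chart[(i, i + 1)][cat] = (wt, w)
    for width in range(2, n + 1):              # wider spans, narrowest first
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):          # try every split point
                for b, (wb, _) in chart[(i, k)].items():
                    for c, (wc, _) in chart[(k, j)].items():
                        for a, wr in BINARY.get((b, c), ()):
                            cost = wr + wb + wc
                            if cost < chart[(i, j)].get(a, (math.inf,))[0]:
                                chart[(i, j)][a] = (cost, (k, b, c))
    return chart

words = "time flies like an arrow".split()
chart = cky(words)
print(chart[(0, 5)]["S"][0])   # 22: weight of the lightest S over the sentence
```

(Two derivations of S happen to tie at weight 22 with these rule weights; the strict `<` keeps whichever is built first, which is the parse the next slide recovers.)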
Follow backpointers … • Start from the winning S over the whole sentence • S → NP VP • VP → VP PP • PP → P NP • NP → Det N • (each slide expanded one more node of the winning tree, with the grammar shown alongside)
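Continuing the sketch above (this assumes `chart` and `words` from the previous block), following backpointers is a short recursion:

```python
def tree(chart, cat, i, j):
    """Expand the best entry for cat over span (i, j) into a bracketing."""
    _, bp = chart[(i, j)][cat]
    if isinstance(bp, str):                      # lexical backpointer: a word
        return f"({cat} {bp})"
    k, b, c = bp                                 # split point and child categories
    return f"({cat} {tree(chart, b, i, k)} {tree(chart, c, k, j)})"

print(tree(chart, "S", 0, 5))
# (S (NP time) (VP (VP flies) (PP (P like) (NP (Det an) (N arrow)))))
```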
Which entries do we need? • An entry is not worth keeping if another entry of the same category covers the same span at lower weight … • … since the heavier entry just breeds worse options (“inferior stock”) • Keep only best-in-class! (and backpointers so you can recover the parse)
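In code, the best-in-class rule is a single comparison per candidate; a standalone sketch:

```python
import math

def relax(cell, cat, cost, backpointer):
    """Keep only the lightest (best-in-class) entry per category in a cell."""
    if cost < cell.get(cat, (math.inf, None))[0]:
        cell[cat] = (cost, backpointer)

cell = {}
relax(cell, "NP", 24, "inferior stock")   # first candidate is kept ...
relax(cell, "NP", 18, "best in class")    # ... until a lighter one arrives
assert cell["NP"] == (18, "best in class")
```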
Computational complexity of parsing • This approach has good space complexity: O(GN²), where G is the number of categories in the grammar • What is the time complexity of the algorithm? • It’s cubic in N … why? (there are O(N²) cells, and each cell considers O(N) split points) • What about time complexity in G? • First, a clarification is in order • CFG rules can have right-hand sides of arbitrary length: X → α • But CKY works only with right-hand sides of length at most 2 • So we need to convert the CFG for use with CKY
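A minimal binarization sketch (my formulation; real systems also handle unary chains and smarter intermediate naming): each over-long right-hand side becomes a chain of binary rules through fresh intermediate categories, with the original weight carried by the first rule in the chain:

```python
def binarize(rules):
    """Rules are (lhs, rhs_tuple, cost); returns rules with RHS length <= 2."""
    out = []
    for lhs, rhs, cost in rules:
        head, n = lhs, 0
        while len(rhs) > 2:
            n += 1
            new = f"@{head}_{n}"                 # fresh intermediate category
            out.append((lhs, (rhs[0], new), cost))
            lhs, rhs, cost = new, rhs[1:], 0.0   # rest of the chain costs nothing
        out.append((lhs, rhs, cost))
    return out

print(binarize([("VP", ("V", "NP", "PP", "PP"), 1.5)]))
# [('VP', ('V', '@VP_1'), 1.5), ('@VP_1', ('NP', '@VP_2'), 0.0),
#  ('@VP_2', ('PP', 'PP'), 0.0)]
```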