Natural Language Processing: Syntactic Parsing Meeting 13, Oct 11, 2012 Rodney Nielsen Most of these slides were adapted from James Martin
Subcategorization • Many valid VP rules • But not valid for all verbs • Subcategorize verbs by sets of VP rules • Variation on transitive/intransitive • Grammars may have 100s of classes
Subcategorization • Sneeze: John sneezed • Find: Please find [a flight to NY]NP • Give: Give [me]NP [a cheaper fare]NP • Help: Can you help [me]NP [with a flight]PP • Prefer: I prefer [to leave earlier]TO-VP • Told: I was told [United has a flight]S • …
Programming Analogy • Verbs = methods • Subcat frames specify the number, position and type of arguments • Like formal parameters to a method
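The verbs-as-methods analogy can be made concrete. A minimal sketch, assuming a toy lexicon (the frame entries and the `licenses` helper are illustrative, not from a real subcategorization lexicon):

```python
# Hypothetical sketch: subcategorization frames as "method signatures".
# Each verb lists the argument sequences it licenses, like formal parameters.
SUBCAT = {
    "sneeze": [[]],                  # intransitive: John sneezed
    "find":   [["NP"]],              # find [a flight to NY]NP
    "give":   [["NP", "NP"]],        # give [me]NP [a cheaper fare]NP
    "help":   [["NP", "PP"]],        # help [me]NP [with a flight]PP
    "prefer": [["TO-VP"]],           # prefer [to leave earlier]TO-VP
    "tell":   [["NP", "S"]],         # told [me]NP [United has a flight]S
}

def licenses(verb, args):
    """Return True if the verb subcategorizes for this argument sequence,
    just as a compiler checks actual arguments against formal parameters."""
    return list(args) in SUBCAT.get(verb, [])

print(licenses("sneeze", []))      # True
print(licenses("sneeze", ["NP"]))  # False: *John sneezed the book
```

A real grammar would have hundreds of such classes, as the previous slide notes.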
Subcategorization • *John sneezed the book • *I prefer United has a flight • *Give with a flight • As with agreement phenomena, we need a way to formally express these facts
Treebanks • Treebanks • Corpora of sentence parse trees • Generally created by • First parsing automatically • Then hand-correcting • Detailed annotation guidelines • POS tagset • Grammar • Instructions per grammatical construction
Penn Treebank • The Penn Treebank is a widely used treebank. • Its best-known part is the Wall Street Journal section. • 1M words from the 1987-1989 Wall Street Journal.
Lexically Decorated Tree • Head Finding
Dependency Grammars • CFG-style phrase-structure grammars • Focus on constituents • Dependency grammar • Tree • Nodes = words • Links = dependency relations • Relations may be typed (labeled), or not.
Dependency Parse They hid the letter on the shelf
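The dependency parse on the slide can be written down as a set of head-dependent links. A sketch, assuming roughly Stanford-style relation labels and one plausible attachment of the PP (the head choices and labels are assumptions, not taken from the slides):

```python
# The dependency parse of "They hid the letter on the shelf" encoded as
# (dependent, head, relation) triples; words are 1-indexed, head 0 = root.
sentence = ["They", "hid", "the", "letter", "on", "the", "shelf"]
deps = [
    (1, 2, "nsubj"),   # They   <- hid
    (2, 0, "root"),    # hid    is the root
    (3, 4, "det"),     # the    <- letter
    (4, 2, "dobj"),    # letter <- hid
    (5, 2, "prep"),    # on     <- hid  (could also attach to "letter")
    (6, 7, "det"),     # the    <- shelf
    (7, 5, "pobj"),    # shelf  <- on
]

# Every word has exactly one head, so the links form a tree over the words.
assert sorted(d for d, h, r in deps) == list(range(1, len(sentence) + 1))
```

Note the attachment ambiguity: "on the shelf" could depend on "hid" (where they hid it) or on "letter" (which letter); the triples above pick the verb attachment.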
Treebank and Head-Finding Uses • Critical to develop statistical parsers • Chapter 14 • Valuable to Corpus Linguistics • Investigating empirical details of constructions
Summary • CFGs model syntax • Parsers are often critical application components • Constituency: key phenomena easily captured with CFG rules • Agreement and subcategorization pose significant problems • Treebanks: corpora of sentence trees
Today 10/11/2012 Syntactic Parsing • CKY
CFG Parsing • Assigning proper trees • Trees that exactly cover the input • Not necessarily the correct tree
For Now • Assume… • Words are in a buffer • No POS tags • Ignore morphology • Words are known • No out of vocabulary (OOV) terms • Poor assumptions for a real application
Top-Down Search • Start with a rule rooted at S (sentence) • Progress down from there to the words
Bottom-Up Parsing • Or … • Start with trees rooted at the words • Progress up to larger trees
Top-Down versus Bottom-Up • Top-down • Proper, feasible trees • But potentially inconsistent with the words • Bottom-up • Consistent with the words • But trees might not make sense globally
Search Strategy • How to search space and make choices • Node to expand next? • Grammar rule used for expansion • Backtracking • Make a choice • If it works, continue • If not, back up and make a different choice
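The choice points and backtracking above can be sketched as a tiny top-down recognizer. A minimal sketch, assuming a toy grammar (the grammar and function names are illustrative, not from the slides):

```python
# A minimal top-down recognizer: at each non-terminal we face a choice of
# rules; a failed choice is abandoned and the next alternative is tried.
GRAMMAR = {
    "S":   [["NP", "VP"]],
    "NP":  [["Det", "N"], ["Pro"]],
    "VP":  [["V", "NP"], ["V"]],
    "Det": [["the"]], "N": [["letter"]], "Pro": [["they"]], "V": [["hid"]],
}

def parse(symbol, words, i):
    """Yield every position j such that symbol can span words[i:j]."""
    if symbol not in GRAMMAR:            # terminal: must match the next word
        if i < len(words) and words[i] == symbol:
            yield i + 1
        return
    for rhs in GRAMMAR[symbol]:          # choice point: try each rule in turn
        positions = [i]
        for sym in rhs:                  # extend every partial match by sym
            positions = [j2 for j in positions for j2 in parse(sym, words, j)]
        yield from positions

def recognize(words):
    return len(words) in parse("S", words, 0)

print(recognize("they hid the letter".split()))  # True
```

Note how `parse` may re-derive the same constituent over the same span many times, which is exactly the duplicated work the next slides discuss.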
Problems • Even with the best filtering, backtracking methods are doomed because of two inter-related problems • Ambiguity • Shared subproblems
Shared Sub-Problems • No matter what kind of search • Don’t want to redo work already done • Naïve backtracking leads to duplicated work
Shared Sub-Problems • Consider: a flight from Indianapolis to Houston on TWA
Shared Sub-Problems • Assume a top-down parse making choices among the various Nominal rules • In particular, between these two • Nominal -> Noun • Nominal -> Nominal PP • Statically choosing the rules in this order forces the parser to re-derive the same sub-constituents again and again
Dynamic Programming • Dynamic Programming search • Fill tables with partial results • Avoid repeating work • Solve exponentially large search problems in polynomial time • Efficiently store ambiguous structures with shared sub-parts • Bottom-up approach • CKY • Top-down approach • Earley
CKY Parsing • Limit grammar to epsilon-free binary rules • Consider the rule A -> B C • If there is an A somewhere in the input generated by this rule, then there must be a B followed by a C in the input • If A spans [i to j), there must be a k s.t. i<k<j • I.e., B splits from C someplace after i and before j
Problem • What if your grammar isn't binary? • E.g., the Penn Treebank grammar • Convert it to binary • Any CFG can be rewritten into Chomsky Normal Form automatically • What does this mean? • The resulting grammar accepts (and rejects) the same set of strings • But the derivations (trees) are binary
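The core of that conversion is binarization: any rule with more than two right-hand-side symbols is split using fresh non-terminals. A sketch of just this step (the naming scheme for the new non-terminals is an assumption; full CNF conversion also removes epsilon and unit productions, omitted here):

```python
# Binarization step of CNF conversion: split long right-hand sides by
# introducing fresh non-terminals, preserving the language generated.
def binarize(rules):
    """rules: list of (lhs, rhs_tuple). Returns an equivalent binary rule set."""
    out = []
    for n, (lhs, rhs) in enumerate(rules):
        while len(rhs) > 2:
            new = f"{lhs}_{n}_{len(rhs)}"   # fresh non-terminal (assumed scheme)
            out.append((lhs, (rhs[0], new)))
            lhs, rhs = new, rhs[1:]
        out.append((lhs, rhs))
    return out

print(binarize([("VP", ("V", "NP", "PP"))]))
# e.g. VP -> V VP_0_3  and  VP_0_3 -> NP PP
```

The two output rules generate exactly the strings the original ternary rule did, which is the sense in which the grammar "accepts and rejects the same set of strings" while the trees become binary.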
CKY • Build a table so that an A spanning [i to j) in the input is placed in cell [i, j] of the table • A non-terminal spanning the entire string is in [0, n] • The parts of A must span i to k and k to j, for some k
CKY • Given A -> B C • Look for B in [i, k] and C in [k, j] • I.e., if there is an A spanning [i, j] AND A -> B C, THEN there must be a B in [i, k] and a C in [k, j] for some k such that i<k<j
CKY • Fill the table by looping over the cell values [i, j] in a systematic way • For each cell, loop over the appropriate k values to search for things to add
CKY Algorithm What’s the complexity of this?
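The algorithm on the slide can be sketched as a recognizer over a CNF grammar; the toy lexicon and rules below are illustrative, not from the slides. The three nested loops over positions (span width, start i, split k) answer the complexity question: O(n^3) in the sentence length, times a grammar-dependent constant.

```python
# Compact CKY recognizer. table[(i, j)] holds the set of non-terminals
# spanning words[i:j]. This sketch fills cells by increasing span width
# (the slide fills column by column; either order works, since both
# guarantee the sub-spans are filled before the cell that combines them).
from collections import defaultdict

def cky(words, lexical, binary, start="S"):
    n = len(words)
    table = defaultdict(set)
    for i, w in enumerate(words):          # width-1 spans from the lexicon
        table[(i, i + 1)] |= lexical.get(w, set())
    for span in range(2, n + 1):           # widths 2..n
        for i in range(0, n - span + 1):
            j = i + span
            for k in range(i + 1, j):      # every split point
                for B in table[(i, k)]:
                    for C in table[(k, j)]:
                        table[(i, j)] |= binary.get((B, C), set())
    return start in table[(0, n)]

lexical = {"they": {"NP"}, "hid": {"V"}, "the": {"Det"}, "letter": {"N"}}
binary = {("Det", "N"): {"NP"}, ("V", "NP"): {"VP"}, ("NP", "VP"): {"S"}}
print(cky("they hid the letter".split(), lexical, binary))  # True
```

Because each cell stores a *set* of non-terminals, ambiguous sub-analyses are computed once and shared, which is how dynamic programming avoids the duplicated work of backtracking.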
CKY • Fills the table one column at a time, from left to right, bottom to top • When filling a cell, the parts needed are already in the table (to the left and below) • It's somewhat natural in that it processes the input left to right, a word at a time • Known as online
Example • Filling col 5 == processing word 5 (Houston) • j is 5. • i goes from 3 to 0 (3,2,1,0)
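The cell-and-split schedule for that column can be enumerated directly (a sketch; "Houston" as word 5 follows the slide's example sentence):

```python
# Filling column j = 5 (spans ending after word 5, "Houston"): for each
# start i from 3 down to 0, try every split point k between i and j.
j = 5
for i in range(3, -1, -1):
    for k in range(i + 1, j):
        print(f"combine [{i},{k}] with [{k},{j}] into [{i},{j}]")
```

i starts at 3 because [4, 5] is the width-1 lexical cell; the widest cell filled here, [0, 5], tries all four splits k = 1..4.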