LING / C SC 439/539 Statistical Natural Language Processing • Lecture 19 • 3/27/2013
Recommended reading • Jurafsky & Martin • Ch. 12, Formal grammars of English • Ch. 13, Parsing with context-free grammars • Ch. 14, Statistical parsing • Papers • Marcus et al. 1993, Penn Treebank • Klein & Manning 2003, Accurate Unlexicalized Parsing • Petrov et al. 2006, Learning Accurate, Compact, and Interpretable Tree Annotation • Charniak & Johnson 2005, Coarse-to-Fine n-Best Parsing and MaxEnt Discriminative Reranking
Last time • Context-free grammars • Specify the possible structures of sentences • Probabilistic context-free grammars • Generative probabilistic model for sentences and phrase-structure trees • Probabilities indicate which structures are more likely • CKY parsing algorithm • Uses dynamic programming to compute an exponential number of parses in O(n³) time • Probabilistic CKY • Finds the most likely parse for a sentence • Provides a principled way of resolving ambiguity
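As a recap, a minimal probabilistic CKY sketch. The grammar encoding (dicts binary and lexical) and the function name pcky are our own illustration, and a CNF grammar is assumed:

from collections import defaultdict

def pcky(words, binary, lexical, start="S"):
    """binary maps (A, B, C) -> p for rules A -> B C;
    lexical maps (A, w) -> p for rules A -> w.
    Returns the probability of the best parse and backpointers."""
    n = len(words)
    best = defaultdict(dict)   # best[(i, j)][A] = prob of best A over words[i:j]
    back = {}                  # backpointers to recover the best tree
    for i, w in enumerate(words):                 # width-1 spans: tag the words
        for (A, word), p in lexical.items():
            if word == w and p > best[(i, i + 1)].get(A, 0.0):
                best[(i, i + 1)][A] = p
                back[(i, i + 1, A)] = w
    for width in range(2, n + 1):                 # wider spans, shortest first
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):             # every split point
                for (A, B, C), p in binary.items():
                    if B in best[(i, k)] and C in best[(k, j)]:
                        score = p * best[(i, k)][B] * best[(k, j)][C]
                        if score > best[(i, j)].get(A, 0.0):
                            best[(i, j)][A] = score
                            back[(i, j, A)] = (k, B, C)
    return best[(0, n)].get(start, 0.0), back

Because each cell keeps only the best score per nonterminal, the exponentially many parses are summarized in O(n³) table entries, which is the dynamic-programming point of the slide.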
How do we develop a parser? • Hard to write a non-trivial CFG that has broad coverage of the constructions of a language • For a PCFG, we also have to specify the probabilities • Not clear how this can be done by hand • Solution: read PCFG from an annotated corpus • Syntactically annotated corpus = treebank
Outline • Treebanks • Treebank grammars • Evaluating parsers • Improving parsing performance • Limitations of generative parsers
Penn Treebank • Corpus of syntactically annotated sentences • 1.3 million words of Wall Street Journal • 50,000 sentences • Annotated by hand for part of speech, syntactic structure, and predicate-argument relations • Standard corpus for developing and testing statistical parsers
You can download a portion of the Penn Treebank through NLTK • http://www.nltk.org/data

>>> from nltk.corpus import treebank
>>> sents = treebank.sents()            # sentences as word lists
>>> trees = treebank.parsed_sents()     # labeled trees
>>> len(trees)
3914
>>> trees[5].draw()                     # show tree in pop-up window
>>> print(trees[5])                     # display in bracket notation
Example treebank sentence and annotation

(S (PP (IN Despite)
       (NP (DT the) (JJ gloomy) (NN forecast)))
   (, ,)
   (NP-SBJ (NNP South) (NNP Korea))
   (VP (VBZ has)
       (VP (VBN recorded)
           (NP (NP (DT a) (NN trade) (NN surplus))
               (PP (IN of)
                   (NP (QP ($ $) (CD 71) (CD million))
                       (-NONE- *U*))))
           (ADVP-TMP (IN so) (IN far))
           (NP-TMP (DT this) (NN year))))
   (. .))
Treebank has long, complex sentences Without the Cray-3 research and development expenses , the company would have been able *-2 to report a profit of $ 19.3 million *U* *ICH*-3 for the first half of 1989 rather than the $ 5.9 million *U* 0 it posted *T*-1 .
Trees tend to be “flat” • They don’t indicate fine details of constituent structure • (In the study of syntax in linguistics, nodes are usually binary branching)
Treebank node labels • Phrase labels may be augmented with function tags • http://bulba.sdsu.edu/jeanette/thesis/PennTags.html • Examples: NP-SBJ: noun phrase that is a subject ADJP-PRD: predicative adjective PP-TMP: temporal prepositional phrase NP-TMP: temporal noun phrase NP-CLR: “closely related” to previous NP • Also, null elements and traces
Null elements and traces • An empty node is placed in the tree wherever an argument is implicit or realized elsewhere in the sentence • If realized elsewhere, a “trace” in the form of a numerical index is added to both the dislocated argument and its original position
Outline • Treebanks • Treebank grammars • Evaluating parsers • Improving parsing performance • Limitations of generative parsers
Obtain PCFG from a treebank • Want to build a parser automatically • Acquire CFG from a treebank • In a phrase structure tree, nodes and their children indicate constituency • Write a CFG rule for each such pattern • Extract PCFG • Count frequencies of nodes and their children in order to assign probability to rules
Ignore extra information • When developing a parser from the Penn Treebank, people typically ignore extra information such as function tags, null elements, and traces • Focus on constituency • These are treated as separate tasks: • Semantic role labeling • Label the arguments of verbs according to their semantic function • For example, label an NP as subject or object • Recovering null elements • http://www.seas.upenn.edu/~gabbard/fully_parsing_the_penn_treebank.pdf
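A minimal sketch of stripping function tags with NLTK trees; the helper name strip_function_tags is our own, and deleting -NONE- null elements entirely would need a separate pass:

import re

def strip_function_tags(tree):
    """Strip function tags and indices in place: NP-SBJ-1 -> NP, PP-TMP -> PP.
    Labels that start with '-' (-NONE-, -LRB-, -RRB-) are left untouched."""
    for sub in tree.subtrees():
        label = sub.label()
        if not label.startswith("-"):
            sub.set_label(re.sub(r"[-=].*", "", label))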
Count frequencies of rules from tree • First get rid of functional tags • Then count rules

S → NP VP .      1
NP → JJ          1
JJ → Many        1
VP → VBD NP      1
VBD → lost       1
NP → PRP$ NNS    1
PRP$ → their     1
NNS → farms      1
. → .            1
Probability of a rule • Then, given all the rules for a particular LHS nonterminal, calculate the probability of each rule • Example, rule counts:

A → b    5
A → c    10

• Probabilities:

A → b    1/3
A → c    2/3
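Putting the last two slides together, a sketch of relative-frequency PCFG estimation over the NLTK treebank sample (in practice you would strip function tags first, e.g. with the helper sketched earlier):

from collections import Counter
from nltk.corpus import treebank

rule_counts = Counter()
for tree in treebank.parsed_sents():
    rule_counts.update(tree.productions())   # one production per node/children pattern

lhs_totals = Counter()
for prod, count in rule_counts.items():
    lhs_totals[prod.lhs()] += count

# p(A -> beta) = count(A -> beta) / count(all rules with LHS A)
pcfg = {prod: count / lhs_totals[prod.lhs()] for prod, count in rule_counts.items()}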
Some treebank VP rules • VP → VBD PP • VP → VBD PP PP • VP → VBD PP PP PP • VP → VBD PP PP PP PP • VP → VB ADVP PP • VP → VB PP ADVP • VP → ADVP VB PP • VP → VBD PP PP PP PP PP ADVP PP • This mostly happens because we [go] [from football] [in the fall] [to lifting] [in the winter] [to football] [again] [in the spring]
Some treebank NP rules • NP → DT JJ NN • NP → DT JJ NNS • NP → DT JJ NN NN • NP → DT JJ JJ NN • NP → DT JJ CD NNS • NP → RB DT JJ NN NN • NP → RB DT JJ JJ NNS • NP → DT JJ JJ NNP NNS • NP → DT NNP NNP NNP NNP JJ NN • NP → DT JJ NNP CC JJ JJ NN NNS • NP → RB DT JJS NN NN SBAR • NP → DT VBG JJ NNP NNP CC NNP • NP → DT JJ NNS , NNS CC NN NNS NN • NP → DT JJ JJ VBG NN NNP NNP FW NNP • NP → NP JJ , JJ '' SBAR '' NNS
Some complicated NP rules

NP → DT JJ JJ VBG NN NNP NNP FW NNP
[The]DT [state-owned]JJ [industrial]JJ [holding]VBG [company]NN [Instituto]NNP [Nacional]NNP [de]FW [Industria]NNP

NP → NP JJ , JJ '' SBAR '' NNS
[Shearson’s]NP [easy-to-film]JJ [,], [black-and-white]JJ ['']'' [Where We Stand]SBAR ['']'' [commercials]NNS
Treebank grammar rules • Treebank rules are different from toy grammar rules • More nonterminals on the right-hand side • Large number of rules • Flat rules lead to a large number of rules for a particular LHS nonterminal • (Penn Treebank: 1.3 million tokens) • 17,500 distinct rule types • 4,500 distinct VP rules
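The flatness is easy to verify on the NLTK sample (3,914 sentences, so the counts come out smaller than the full-treebank figures above); the startswith test also catches function-tagged variants of VP:

from collections import Counter
from nltk.corpus import treebank

vp_rules = Counter(
    prod
    for tree in treebank.parsed_sents()
    for prod in tree.productions()
    if str(prod.lhs()).startswith("VP")
)
print(len(vp_rules))              # distinct VP rule types in the sample
print(vp_rules.most_common(5))    # the most frequent VP expansions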
Outline • Treebanks • Treebank grammars • Evaluating parsers • Improving parsing performance • Limitations of generative parsers
Procedure for statistical parsing • Read a grammar from a treebank • Convert CFG to appropriate form • Remove functional tags • If using CKY, convert to Chomsky Normal Form • Modify rules and labels further • Parse sentences • Convert back to original grammar • Compare your parse tree to gold standard in the treebank
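NLTK's tree transforms handle the CNF round trip from the procedure above directly; a sketch (the in-place methods introduce, and later remove, artificial binarized labels):

from nltk.corpus import treebank

t = treebank.parsed_sents()[0].copy(deep=True)
t.chomsky_normal_form()       # binarize in place for CKY
# ... train / parse with the binarized grammar ...
t.un_chomsky_normal_form()    # undo binarization before comparing to the gold tree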
Standard parsing setup • Penn Treebank • 50,000 sentences • Divided into 24 sections • Training: sections 02-21 • Development: section 22 • Test: section 23 • Sections 00 and 01 are not used because they are inconsistently annotated • These were the first sections they annotated
Different ways of assessing performance • Quantitative evaluation • Labeled precision • Labeled recall • Crossing brackets • Extensibility • Out-of-domain sentences • Other languages • Efficiency • Size of grammar • Parsing time
PARSEVAL metrics (Black et al. 1991) • Measure correct constituents, not whole sentences • Exact match on sentences is too hard, especially long ones • Constituent: • Sequence of words under a nonterminal in the parse tree • A constituent is correct when: • There exists a constituent with the same span in the gold standard • Same nonterminal label as the gold standard • Note: the constituent doesn’t have to be recursively identical
PARSEVAL metrics (Black et al. 1991) • Labeled recall = (# of correct constituents in parse) / (# of constituents in gold standard) • Labeled precision = (# of correct constituents in parse) / (# of total constituents in parse) • F-measure = 2PR / (P + R)
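A minimal sketch of these metrics, assuming each tree has already been reduced to a list of labeled spans (label, start, end); Counter intersection handles duplicate spans from unary chains like (NP (NP ...)):

from collections import Counter

def parseval(pred_spans, gold_spans):
    """Labeled precision, recall, and F-measure over (label, start, end) spans."""
    pred, gold = Counter(pred_spans), Counter(gold_spans)
    correct = sum((pred & gold).values())   # constituents matching both span and label
    p = correct / sum(pred.values())
    r = correct / sum(gold.values())
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f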
Also crossing brackets • Crossing brackets are an especially bad error: • parsed as ( A ( B C )) but the correct parse is (( A B ) C ) • Report % of sentences with crossing brackets • State-of-the-art performance on the Treebank: ~90% labeled recall, ~90% labeled precision, < 1% crossing brackets
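And a sketch of the crossing-brackets check on the same span representation: two spans cross when they overlap without either containing the other (labels are ignored):

def has_crossing(pred_spans, gold_spans):
    """True if any predicted span crosses some gold span."""
    def cross(a, b):
        (i, j), (k, l) = a, b
        return i < k < j < l or k < i < l < j
    gold = {(s, e) for _, s, e in gold_spans}
    return any(cross((s, e), g) for _, s, e in pred_spans for g in gold)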
Outline • Treebanks • Treebank grammars • Evaluating parsers • Improving parsing performance • (Some slides borrowed from D. Klein) • Limitations of generative parsers
Count frequencies of rules from tree • Get rid of functional tags • Count rules

S → NP VP .      1
NP → JJ          1
JJ → Many        1
VP → VBD NP      1
VBD → lost       1
NP → PRP$ NNS    1
PRP$ → their     1
NNS → farms      1
. → .            1
Improving parsing performance • The F-measure for a grammar read directly off the treebank is 72.6% • We can get better performance by modifying the PCFG rules to indicate richer linguistic relationships
Probability and PCFG rules • Rules: X → Y Z, Y → …, Z → … • p(tree rooted at X) = p(rule X → Y Z) * p(subtree rooted at Y) * p(subtree rooted at Z) • Expansion of Y is independent of its parent X and its sister Z (and similarly for Z) • Think in terms of generation: once X → Y Z is chosen, any Y rule and any Z rule can be chosen independently • However, this isn’t what we want linguistically [Tree diagram: X with children Y and Z]
Independence assumption is too strong • Expansion of nonterminals isn’t independent of context • Example: expansion of NP depends on its parent • NP under S is a subject NP • NP under VP is an object NP
Encode linguistic context in the CFG • Relax independence assumptions by indicating linguistic dependencies in the grammar rules • Example: p(NP → NP PP | parent = VP) • Methods • Lexicalization • Unlexicalized methods: • Horizontal Markovization • Vertical Markovization / parent annotation • Splitting tags
Lexical relationships indicated by paths in tree • How can we use lexical relationships in parsing PCFGs?
Indicate lexical head in tree • For each CFG rule, indicate the “head” child of the phrase • “Head” = the node that determines the linguistic properties of the phrase • Resulting CFG is a template for CFGs with specific words as heads • Example: CFG with head rules

S → NP VP[head]
VP → V[head] NP
NP → DT N[head]
Apply head rules to trees • Every nonterminal in tree is augmented with a head • Lexical relations are now encoded locally in the tree
Result: Lexicalized CFG • Example:

S(questioned) → NP(lawyer) VP(questioned)
VP(questioned) → V(questioned) NP(witness)
NP(lawyer) → DT(the) N(lawyer)
NP(witness) → DT(the) N(witness)

• Advantage: specifies richer linguistic dependencies • Disadvantage: sparse data in probability estimation • Encodes specific words in the grammar
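A sketch of how head annotation might be computed over NLTK trees; the HEAD_RULES table here is a toy stand-in (real parsers use Collins-style head-percolation tables with search direction and priority):

from nltk.tree import Tree

# Hypothetical, highly simplified head rules: for each parent label,
# candidate head-child labels in priority order.
HEAD_RULES = {
    "S":  ["VP"],
    "VP": ["VBD", "VBZ", "VB", "VP"],
    "NP": ["NN", "NNS", "NNP", "NP"],
    "PP": ["IN"],
}

def head_word(tree):
    """Return the lexical head by recursively following head rules,
    falling back to the rightmost child when no rule matches."""
    if not isinstance(tree, Tree):
        return tree
    for cand in HEAD_RULES.get(tree.label(), []):
        for child in tree:
            if isinstance(child, Tree) and child.label() == cand:
                return head_word(child)
    return head_word(tree[-1])

def lexicalize(tree):
    """Annotate every nonterminal with its head word: NP -> NP(lawyer)."""
    if not isinstance(tree, Tree):
        return tree
    return Tree("%s(%s)" % (tree.label(), head_word(tree)),
                [lexicalize(child) for child in tree])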
Unlexicalized parsers • Encode non-lexical information into tree nodes • Can encode multiple relationships • Methods • Parent annotation / Vertical Markovization • Condition on labels of ancestor nodes • Horizontal Markovization • Condition on labels of left/right neighbors • Splitting tags • Manual and automatic
Parent annotation (Special case of vertical Markovization)
Parent annotation takes care of this • Expansion of nonterminals isn’t independent of context • Example: expansion of NP depends on its parent • NP under S is a subject NP • NP under VP is an object NP
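A minimal parent-annotation sketch (one level of vertical history; the ^ separator follows NLTK's convention, and the function name is our own):

from nltk.tree import Tree

def parent_annotate(tree, parent=None):
    """NP under S becomes NP^S, NP under VP becomes NP^VP, etc.
    Words (leaves) are left unchanged."""
    if not isinstance(tree, Tree):
        return tree
    label = tree.label()
    new_label = label if parent is None else "%s^%s" % (label, parent)
    return Tree(new_label, [parent_annotate(child, label) for child in tree])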
[Chart: F-measure by vertical Markovization order; v = variable length: back off to lower order if freq of rule < 10]
Flat trees lead to sparse data; rewrite rules as binary branching, encoding the history of the previous constituent(s) (horizontal Markovization) [Chart: F-measure by horizontal Markovization order; v = variable length: back off to lower order if freq of rule < 10]
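NLTK's binarization supports both annotations directly; a sketch, under our reading of the parameters: horzMarkov bounds the sibling history kept in the artificial binarized labels, and vertMarkov=1 adds one level of parent annotation:

from nltk.corpus import treebank

t = treebank.parsed_sents()[0].copy(deep=True)
t.chomsky_normal_form(horzMarkov=2, vertMarkov=1)
print(t)   # labels now look like NP^S or VP|<NP-PP> (NLTK's naming scheme)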
Manually split categories • Examples • NP: subject vs. object • DT: determiners (a/the) vs. demonstratives (this/that) • IN: sentential vs. prepositional • Advantages: • Linguistically motivated • Maintain a small category set
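A sketch of one such manual split, relabeling demonstrative determiners on an NLTK tree; the DT^DEM label is our own invention:

DEMONSTRATIVES = {"this", "that", "these", "those"}

def split_dt(tree):
    """Relabel demonstrative determiners DT^DEM in place,
    leaving articles like a/the as plain DT."""
    for sub in tree.subtrees(lambda t: t.label() == "DT"):
        if sub[0].lower() in DEMONSTRATIVES:
            sub.set_label("DT^DEM")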