Probabilistic Parsing and Treebanks L545 Spring 2000
Motivation and Outline • Previously, we used CFGs to parse with, but: • Some ambiguous sentences could not be disambiguated, and we would like to know the most likely parse • We did not discuss how to obtain such grammars • Where we’re going: • Probabilistic Context-Free Grammars (PCFGs) + some discussion of treebanks • Lexicalized PCFGs • We’ll only cover PCFGs in very broad strokes - L645 offers more details
Statistical Parsing • Basic idea • Start with a treebank • a collection of sentences with syntactic annotation, i.e., already-parsed sentences • Examine which parse trees occur frequently • Extract grammar rules corresponding to those parse trees, estimating the probability of the grammar rule based on its frequency • Result: a CFG augmented with probabilities
Probabilistic Context-Free Grammars (PCFGs) • Definition of a CFG (review): • Set of non-terminals (N) • Set of terminals (T) • Set of rules/productions (P), of the form A → β • Designated start symbol (S) • Definition of a PCFG: • Same as a CFG, but with one more function, D • D assigns probabilities to each rule in P
Probabilities • The function D gives probabilities for a non-terminal A to be expanded to a sequence β. • Written as P(A → β) • or as P(A → β | A) • Idea: given A as the mother non-terminal (LHS), what is the likelihood that β is the correct RHS? • Note that Σi P(A → βi | A) = 1 • i.e., these are all the ways of expanding A • For example, we would augment a CFG with these probabilities: • P(S → NP VP | S) = .80 • P(S → Aux NP VP | S) = .15 • P(S → VP | S) = .05
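As a small illustration (not part of the original slides), here is a Python sketch that stores the three S rules above and checks the constraint that the probabilities for each left-hand side sum to 1; the data layout is my own choice.

```python
# Minimal sketch: the three S rules from the slide, plus a check that the
# probabilities of all expansions of each LHS sum to (approximately) 1.
from collections import defaultdict

rules = {
    ("S", ("NP", "VP")): 0.80,
    ("S", ("Aux", "NP", "VP")): 0.15,
    ("S", ("VP",)): 0.05,
}

totals = defaultdict(float)
for (lhs, rhs), p in rules.items():
    totals[lhs] += p

for lhs, total in totals.items():
    # No AssertionError means each LHS's expansions sum to 1 (up to rounding).
    assert abs(total - 1.0) < 1e-9, f"probabilities for {lhs} do not sum to 1"
```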
Estimating Probabilities using a Treebank • Given a corpus of sentences annotated with syntactic annotation (e.g., the Penn Treebank) • Consider all parse trees • (1) Each time you have a rule of the form A → β applied in a parse tree, increment a counter for that rule • (2) Also count the number of times A is on the left hand side of a rule • Divide (1) by (2) • P(A → β | A) = Count(A → β)/Count(A) • If you don’t have annotated data, parse the corpus (as we’ll describe next) and estimate the probabilities … which are then used to re-parse.
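To make the counting concrete, here is a hedged sketch in Python; the tuple-based tree encoding and the single toy tree are invented for illustration, not taken from the Penn Treebank.

```python
# Sketch of maximum-likelihood estimation of PCFG rule probabilities from a
# toy "treebank". A tree is a tuple (label, child1, child2, ...), where each
# child is either another tuple or a word string.
from collections import defaultdict

def count_rules(tree, rule_counts, lhs_counts):
    """Increment counts for each rule A -> beta and for each LHS A in the tree."""
    label, *children = tree
    rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
    rule_counts[(label, rhs)] += 1
    lhs_counts[label] += 1
    for c in children:
        if isinstance(c, tuple):
            count_rules(c, rule_counts, lhs_counts)

# A one-tree "treebank" for "book that flight"; a real treebank has many trees.
treebank = [
    ("S", ("VP", ("V", "book"), ("NP", ("Det", "that"), ("N", "flight")))),
]

rule_counts, lhs_counts = defaultdict(int), defaultdict(int)
for tree in treebank:
    count_rules(tree, rule_counts, lhs_counts)

# P(A -> beta | A) = Count(A -> beta) / Count(A)
probs = {rule: c / lhs_counts[rule[0]] for rule, c in rule_counts.items()}
```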
Using Probabilities to Parse • P(T): probability of a particular parse tree • P(T) = Πn∈T p(r(n)) • i.e., the product of the probabilities of the rules r used to expand each node n in the parse tree
Computing probabilities • We have the following rules and probabilities (adapted from Figure 14.1): • S → VP .05 • VP → V NP .40 • NP → Det N .20 • V → book .30 • Det → that .05 • N → flight .25 • P(T) = P(S → VP)*P(VP → V NP)*…*P(N → flight) = .05*.40*.20*.30*.05*.25 = .000015, or 1.5 × 10^-5
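Multiplying those six probabilities in Python (a trivial sketch; the rule strings are just labels) reproduces the 1.5 × 10^-5 figure.

```python
import math

# The six rule probabilities from the slide, for the parse of "book that flight".
rule_probs = {
    "S -> VP": 0.05,
    "VP -> V NP": 0.40,
    "NP -> Det N": 0.20,
    "V -> book": 0.30,
    "Det -> that": 0.05,
    "N -> flight": 0.25,
}

p_T = math.prod(rule_probs.values())
print(p_T)  # ~1.5e-05, up to floating-point rounding
```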
Using probabilities • So, the probability for that parse is 0.000015. What does this mean? • Probabilities are useful for comparing with other probabilities • Whereas we couldn’t decide between two parses using a regular CFG, we now can. • For example, TWA flights is ambiguous between being two separate NPs (cf. I gave [NP John] [NP money]) or one NP: • A: [book [TWA] [flights]] • B: [book [TWA flights]] • Probabilities allow us to choose parse B
Obtaining the best parse • Call the best parse T(S), where S is your sentence • Get the tree which has the highest probability, i.e. • T(S) = argmaxT∈parse-trees(S) P(T) • Can use the Cocke-Younger-Kasami (CYK) algorithm to calculate the best parse
The CYK algorithm • Base case • Add words to the chart • Store P(A → w_i) for every category A in the chart • Recursive case • Get the probability for A at this node by multiplying the probabilities for B and for C by P(A → BC) • P(B)*P(C)*P(A → BC) • Only calculate this once • Rules must be of the form A → BC, i.e., exactly two items on the RHS (Chomsky Normal Form (CNF)) • For a given A, only keep the maximum probability • Previously, we kept A, but without any probabilities
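Below is a compact sketch of this probabilistic (Viterbi) CYK recursion, assuming the grammar is already in CNF. The data structures (`lexical` mapping a word to {category: P(A → word)}, `binary` mapping a pair (B, C) to a list of (A, P(A → B C))) are my own choices, not something fixed by the slides.

```python
# Probabilistic CYK sketch for a grammar in Chomsky Normal Form.
def pcyk(words, lexical, binary):
    n = len(words)
    # best[i][j][A] = highest probability of category A spanning words[i:j]
    best = [[{} for _ in range(n + 1)] for _ in range(n + 1)]

    # Base case: add the words to the chart, storing P(A -> w_i)
    for i, w in enumerate(words):
        for cat, p in lexical.get(w, {}).items():
            best[i][i + 1][cat] = p

    # Recursive case: combine adjacent spans with binary rules A -> B C
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):
                for B, pB in best[i][k].items():
                    for C, pC in best[k][j].items():
                        for A, p_rule in binary.get((B, C), []):
                            p = pB * pC * p_rule
                            if p > best[i][j].get(A, 0.0):  # keep only the maximum for A
                                best[i][j][A] = p

    return best[0][n].get("S", 0.0)  # probability of the best S over the whole sentence
```

Storing backpointers alongside each maximum would let us recover the best tree itself, not just its probability.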
Problems with PCFGs • It’s still only a CFG, so dependencies on non-CFG info not captured • e.g., Pronouns are more likely to be subjects than objects: • P[(NP → Pronoun) | NP = subj] >> P[(NP → Pronoun) | NP = obj] • Ignores lexical information (statistics), which is usually crucial for disambiguation • (T1) America sent [[250,000 soldiers] [into Iraq]] • (T2) America sent [250,000 soldiers] [into Iraq] • send with into-PP is the correct analysis (T2) because they “go well” together • To handle lexical information, we’ll turn to lexicalized PCFGs
Lexicalized Grammars • Remember how head information is passed up in a syntactic analysis? • e.g., VP[head [1]] → V[head [1]] NP • If you follow this down all the way to the bottom of a tree, you wind up with a head word • In some sense, we can say that Book that flight is not just an S, but an S rooted in book • Thus, book is the headword of the whole sentence • By adding headword information to nonterminals, we wind up with a lexicalized grammar
Lexicalized Parse Trees (Lexicalized PCFGs) • Each PCFG rule in a tree is augmented to identify one RHS constituent to be the head daughter • The headword for a node is set to the headword of its head daughter • (Figure: a parse tree annotated with headwords; the visible node annotations are [book], [book], [flight], [flight])
Incorporating Head Probabilities: Wrong Way • Simply adding the headword w to each node, so that node A becomes A[w], won’t work: • e.g., P(A[w] → β | A) = Count(A[w] → β)/Count(A) • The probabilities are too small, i.e., we don’t have a big enough corpus to estimate them reliably: • VP(dumped) → VBD(dumped) NP(sacks) PP(into) 3 × 10^-10 • VP(dumped) → VBD(dumped) NP(cats) PP(into) 8 × 10^-11 • These probabilities are tiny, and many others will never occur in the data
Incorporating Head Probabilities: Right Way • Previously, we conditioned on the mother node (A): • P(A → β | A) • Now, we condition on the mother node and the headword of A (h(A)): • P(A → β | A, h(A)) • We’re no longer conditioning simply on the mother category A, but on the mother category when h(A) is the head • e.g., P(VP → VBD NP PP | VP, dumped) • The likelihood of VP expanding to VBD NP PP when dumped is the head is different than when ate is the head
Calculating rule probabilities • We’ll write the probability more generally as: • P(r(n) | n, h(n)) • where n = node, r = rule, and h = headword • We calculate this by comparing how many times the rule occurs with h(n) as the headword versus how many times the mother/headword combination appears in total: • P(VP → VBD NP PP | VP, dumped) = C(VP(dumped) → VBD NP PP) / Σβ C(VP(dumped) → β)
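Here is a small sketch of that estimate in Python; the counts below are invented purely to show the arithmetic, not drawn from any corpus.

```python
# Head-conditioned rule probability: counts are keyed by (category, headword, RHS).
from collections import defaultdict

rule_counts = defaultdict(int)
rule_counts[("VP", "dumped", ("VBD", "NP", "PP"))] += 6   # hypothetical counts
rule_counts[("VP", "dumped", ("VBD", "NP"))] += 3
rule_counts[("VP", "dumped", ("VBD",))] += 1

def p_rule(cat, head, rhs, rule_counts):
    """P(cat -> rhs | cat, head) = C(cat(head) -> rhs) / sum over beta of C(cat(head) -> beta)."""
    total = sum(c for (a, h, _), c in rule_counts.items() if a == cat and h == head)
    return rule_counts[(cat, head, rhs)] / total if total else 0.0

print(p_rule("VP", "dumped", ("VBD", "NP", "PP"), rule_counts))  # 0.6
```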
Adding info about word-word dependencies • We want to take into account one other factor: the probability of being a head word (in a given context) • P(h(n)=word | …) • We condition this probability on two things: 1. the category of the node (n), and 2. the headword of the mother (h(m(n))) • P(h(n)=word | n, h(m(n))), shortened as: P(h(n) | n, h(m(n))) • P(sacks | NP, dumped) • What we’re really doing is factoring in how words relate to each other • These are dependency relations: sacks is dependent on dumped, in this case
Putting it all together • See sec. 14.6 for an example lexicalized parse tree for workers dumped sacks into a bin • For each node n, with rule r(n), headword h(n), and mother m(n): • P(T) = Πn∈T p(r(n) | n, h(n)) * p(h(n) | n, h(m(n))) • The first factor, e.g. P(VP → VBD NP PP | VP, dumped), carries the subcategorization info • The second factor, e.g. P(sacks | NP, dumped), carries the dependency info between words
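A schematic way to write this product in code; the Node record and the two lookup functions are placeholders for models estimated from a treebank, not an existing API.

```python
from collections import namedtuple

# Hypothetical node record: its category, the rule used to expand it, its
# headword, and its mother's headword (None at the root).
Node = namedtuple("Node", "category rule headword mother_head")

def tree_probability(nodes, p_rule, p_head):
    """P(T) = product over nodes n of P(r(n) | n, h(n)) * P(h(n) | n, h(m(n)))."""
    p = 1.0
    for n in nodes:
        p *= p_rule(n.rule, n.category, n.headword)           # e.g. P(VP -> VBD NP PP | VP, dumped)
        p *= p_head(n.headword, n.category, n.mother_head)    # e.g. P(sacks | NP, dumped)
    return p
```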
Evaluating Parser Output • Traditional measures of parser accuracy: • Labeled bracketing precision: # correct constituents in the parse / # constituents in the parse • Labeled bracketing recall: # correct constituents in the parse / # constituents in the treebank (gold) parse • There are known problems with these measures, so people are trying to use dependency-based measures instead • How many dependency relations did the parse get correct?
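As a rough sketch of the bracketing measures, each constituent can be treated as a (label, start, end) triple over word positions; the two constituent sets below are invented examples.

```python
# Labeled bracketing precision and recall over sets of (label, start, end) constituents.
def precision_recall(parse_constituents, gold_constituents):
    correct = len(parse_constituents & gold_constituents)
    precision = correct / len(parse_constituents) if parse_constituents else 0.0
    recall = correct / len(gold_constituents) if gold_constituents else 0.0
    return precision, recall

parse = {("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)}
gold  = {("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("PP", 3, 5)}
print(precision_recall(parse, gold))  # (0.75, 0.75)
```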