Chapter 12: Probabilistic Parsing and Treebanks Heshaam Faili hfaili@ece.ut.ac.ir University of Tehran
Motivation and Outline • Previously, we used CFGs to parse with, but: • Some ambiguous sentences could not be disambiguated, and we would like to know the most likely parse • How do we get such grammars? Do we write them ourselves? Maybe we could use a corpus … • Where we’re going: • Probabilistic Context-Free Grammars (PCFGs) • Lexicalized PCFGs • Dependency Grammars
Statistical Parsing • Basic idea • Start with a treebank • a collection of sentences with syntactic annotation, i.e., already-parsed sentences • Examine which parse trees occur frequently • Extract grammar rules corresponding to those parse trees, estimating the probability of the grammar rule based on its frequency • That is, we’ll have a CFG augmented with probabilities
Probabilistic Context-Free Grammars (PCFGs) • Definition of a CFG: • Set of non-terminals (N) • Set of terminals (T) • Set of rules/productions (P), of the form A → β • Designated start symbol (S) • Definition of a PCFG: • Same as a CFG, but with one more function, D • D assigns probabilities to each rule in P
Probabilities • The function D gives probabilities for a non-terminal A to be expanded to a sequence β. • Written as P(A → β) • or as P(A → β | A) • The idea is that, given A as the mother non-terminal (LHS), what is the likelihood that β is the correct RHS • Note that Σi P(A → βi | A) = 1 • For example, we would augment a CFG with these probabilities: • P(S → NP VP | S) = .80 • P(S → Aux NP VP | S) = .15 • P(S → VP | S) = .05
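A minimal sketch (plain Python, with a made-up rule set) of how such a PCFG could be stored, including a check that the rule probabilities for each LHS sum to 1:

```python
from collections import defaultdict

# Toy PCFG: each rule (LHS, RHS) is mapped to its probability D(A -> beta)
pcfg = {
    ("S", ("NP", "VP")):        0.80,
    ("S", ("Aux", "NP", "VP")): 0.15,
    ("S", ("VP",)):             0.05,
}

# Sanity check: for every non-terminal A, sum_i P(A -> beta_i | A) must be 1
totals = defaultdict(float)
for (lhs, rhs), prob in pcfg.items():
    totals[lhs] += prob
for lhs, total in totals.items():
    assert abs(total - 1.0) < 1e-9, f"rules for {lhs} sum to {total}, not 1"
```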
Estimating Probabilities using a Treebank • Given a corpus of sentences annotated with syntactic annotation (e.g., the Penn Treebank) • Consider all parse trees • (1) Each time you have a rule of the form A → β applied in a parse tree, increment a counter for that rule • (2) Also count the number of times A is on the left hand side of a rule • Divide (1) by (2) • P(A → β | A) = Count(A → β) / Count(A) • If you don’t have annotated data, parse the corpus (as we’ll describe next) and estimate the probabilities … which are then used to re-parse.
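A sketch of this counting scheme, assuming a toy treebank where each tree is a nested (label, children…) tuple; the tree encoding and the two example trees are illustrative, not the Penn Treebank's actual format:

```python
from collections import defaultdict

def count_rules(tree, rule_counts, lhs_counts):
    """Count every rule A -> beta in the tree and every occurrence of A as an LHS."""
    label, children = tree[0], tree[1:]
    if isinstance(children[0], str):
        rhs = tuple(children)                 # lexical rule, e.g. N -> flight
    else:
        rhs = tuple(child[0] for child in children)
        for child in children:
            count_rules(child, rule_counts, lhs_counts)
    rule_counts[(label, rhs)] += 1            # (1) Count(A -> beta)
    lhs_counts[label] += 1                    # (2) Count(A)

# Hypothetical two-tree "treebank", each tree a nested (label, children...) tuple
treebank = [
    ("S", ("VP", ("V", "book"), ("NP", ("Det", "that"), ("N", "flight")))),
    ("S", ("NP", ("Det", "the"), ("N", "flight")), ("VP", ("V", "left"))),
]

rule_counts, lhs_counts = defaultdict(int), defaultdict(int)
for tree in treebank:
    count_rules(tree, rule_counts, lhs_counts)

# P(A -> beta | A) = Count(A -> beta) / Count(A)
probs = {rule: c / lhs_counts[rule[0]] for rule, c in rule_counts.items()}
print(probs[("S", ("VP",))])   # 0.5: S -> VP in 1 of the 2 S expansions
```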
Using Probabilities to Parse • P(T): probability of a particular parse tree • P(T,S) = ΠnєT p(r(n)) = P(T)·P(S|T), but P(S|T) = 1 (the tree already contains the words of S), so P(T) = ΠnєT p(r(n)) • i.e., the product of the probabilities of all the rules r used to expand each node n in the parse tree • Example: given the probabilities on p. 449, compute the probability of the parse tree on the right
Computing probabilities • We have the following rules and probabilities (adapted from Figure 12.1): • S → VP .05 • VP → V NP .40 • NP → Det N .20 • V → book .30 • Det → that .05 • N → flight .25 • P(T) = P(S → VP) * P(VP → V NP) * … * P(N → flight) = .05 * .40 * .20 * .30 * .05 * .25 = .000015, or 1.5 × 10⁻⁵
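The same arithmetic as a short sketch (rule probabilities copied from the list above):

```python
# Probabilities of the six rules used in the parse of "book that flight"
rule_probs = [
    0.05,   # S   -> VP
    0.40,   # VP  -> V NP
    0.20,   # NP  -> Det N
    0.30,   # V   -> book
    0.05,   # Det -> that
    0.25,   # N   -> flight
]

p_tree = 1.0
for p in rule_probs:
    p_tree *= p

print(p_tree)   # ~1.5e-05 (up to floating-point rounding)
```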
Using probabilities • So, the probability for that parse is 0.000015. What’s the big deal? • Probabilities are useful for comparing with other probabilities • Whereas we couldn’t decide between two parses using a regular CFG, we now can. • For example, TWA flights is ambiguous between being two separate NPs (cf. I gave [NP John] [NP money]) or one NP: • A: [book [TWA] [flights]] • B: [book [TWA flights]] • Probabilities allow us to choose parse B (see Figure 12.2)
Obtaining the best parse • Call the best parse T(S), where S is your sentence • Get the tree which has the highest probability, i.e. • T(S) = argmax_{T ∈ parse-trees(S)} P(T) • Can use the Cocke-Younger-Kasami (CYK) algorithm to calculate the best parse • CYK is a form of dynamic programming • CYK is a chart parser, like the Earley parser
The CYK algorithm • Base case • Add words to the chart • Store P(A → wi) for every category A in the chart • Recursive case makes this dynamic programming because we only calculate B and C once • Rules must be of the form A → B C, i.e., exactly two items on the RHS (we call this Chomsky Normal Form (CNF)) • Get the probability for A at this node by multiplying the probabilities for B and for C by P(A → B C) • P(B) * P(C) * P(A → B C) • For a given A, only keep the maximum probability (again, this is dynamic programming)
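A compact sketch of probabilistic CYK for a CNF grammar, keeping only the maximum-probability analysis per category in each cell; the grammar encoding and the probabilities are illustrative, not the textbook's pseudocode:

```python
def pcky(words, lexical, binary):
    """
    words:   list of tokens
    lexical: dict word -> list of (A, P(A -> word))
    binary:  list of (A, B, C, P(A -> B C))   (grammar in Chomsky Normal Form)
    Returns chart, where chart[i][j] maps each category A to the best
    probability of any A spanning words[i:j].
    """
    n = len(words)
    chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]

    # Base case: add the words to the chart via lexical rules A -> w_i
    for i, w in enumerate(words):
        for cat, p in lexical.get(w, []):
            chart[i][i + 1][cat] = max(chart[i][i + 1].get(cat, 0.0), p)

    # Recursive case: combine adjacent spans with rules A -> B C
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for a, b, c, p_rule in binary:
                    if b in chart[i][k] and c in chart[k][j]:
                        p = chart[i][k][b] * chart[k][j][c] * p_rule
                        if p > chart[i][j].get(a, 0.0):
                            chart[i][j][a] = p      # keep only the max for A
    return chart

# Toy grammar for "book that flight" (probabilities illustrative; the unary
# rule S -> VP would have to be folded away during CNF conversion)
lexical = {"book": [("V", 0.30)], "that": [("Det", 0.05)], "flight": [("N", 0.25)]}
binary = [("NP", "Det", "N", 0.20), ("VP", "V", "NP", 0.40)]
chart = pcky("book that flight".split(), lexical, binary)
print(chart[0][3])   # best probability for each category covering the whole input
```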
Problems with PCFGs • It’s still only a CFG, so dependencies on non-CFG info not captured • e.g., Pronouns are more likely to be subjects than objects: • P[(NP → Pronoun) | NP = subj] >> P[(NP → Pronoun) | NP = obj]
Problems with PCFGs • Ignores lexical information (statistics), which is usually crucial for disambiguation • (T1) America sent [[250,000 soldiers] [into Iraq]] • (T2) America sent [250,000 soldiers] [into Iraq] • send with into-PP always attached high (T2) in PTB! • To handle lexical information, we’ll turn to lexicalized PCFGs
Ignoring lexical information • (Figure: the two competing structures use the same rules, VP → VBD NP PP for attachment to the verb vs. VP → VBD NP with NP → NP PP for attachment to the noun)
Lexicalized Grammars • Remember how we talked about head information being passed up in a syntactic analysis? • e.g., VP[head *1] → V[head *1] NP • Well, if you follow this down all the way to the bottom of a tree, you wind up with a head word • In some sense, we can say that Book that flight is not just an S, but an S rooted in book • Thus, book is the headword of the whole sentence • By adding headword information to nonterminals, we wind up with a lexicalized grammar
Lexicalized Grammars • Best results to date: • Collins parser • Charniak parser
Lexicalized Parse Trees (Lexicalized PCFGs) • Each PCFG rule in a tree is augmented to identify one RHS constituent as the head daughter • The headword for a node is set to the head word of its head daughter • (Figure: the tree for book that flight with headword annotations [book] and [flight] propagated up to the root)
Incorporating Head Probabilities: Wrong Way • Simply adding headword w to node won’t work: • So, the node A becomes A[w] • e.g., P(A[w] → β | A) = Count(A[w] → β) / Count(A) • The probabilities are too small, i.e., we don’t have a big enough corpus to calculate these probabilities • VP(dumped) → VBD(dumped) NP(sacks) PP(into)  3 × 10⁻¹⁰ • VP(dumped) → VBD(dumped) NP(cats) PP(into)  8 × 10⁻¹¹ • These probabilities are tiny, and others will never occur
Incorporating head probabilities: Right way • Previously, we conditioned on the mother node (A): • P(A → β | A) • Now, we can condition on the mother node and the headword of A (h(A)): • P(A → β | A, h(A)) • We’re no longer conditioning on simply the mother category A, but on the mother category when h(A) is the head • e.g., P(VP → VBD NP PP | VP, dumped)
Calculating rule probabilities • We’ll write the probability more generally as: • P(r(n) | n, h(n)) • where n = node, r = rule, and h = headword • We calculate this by comparing how many times the rule occurs with h(n) as the headword versus how many times the mother/headword combination appears in total: • P(VP → VBD NP PP | VP, dumped) = C(VP(dumped) → VBD NP PP) / Σβ C(VP(dumped) → β)
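A sketch of this relative-frequency estimate, assuming we already have counts of lexicalized expansions keyed by (category, headword, RHS); the counts are invented for illustration:

```python
from collections import defaultdict

# Invented counts of lexicalized expansions:
# key = (mother category, headword, RHS categories)
counts = {
    ("VP", "dumped", ("VBD", "NP", "PP")): 6,
    ("VP", "dumped", ("VBD", "NP")):       2,
    ("VP", "dumped", ("VBD", "PP")):       1,
}

# Denominator: total expansions of VP(dumped), i.e. the sum over all beta
totals = defaultdict(int)
for (cat, head, rhs), c in counts.items():
    totals[(cat, head)] += c

def rule_prob(cat, head, rhs):
    """P(cat -> rhs | cat, head) = C(cat(head) -> rhs) / sum_beta C(cat(head) -> beta)"""
    return counts.get((cat, head, rhs), 0) / totals[(cat, head)]

print(rule_prob("VP", "dumped", ("VBD", "NP", "PP")))   # 6/9 ≈ 0.67
```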
Adding info about word-word dependencies • We want to take into account one other factor: the probability of being a head word (in a given context) • P(h(n)=word | …) • We condition this probability on two things: 1. the category of the node (n), and 2. the headword of the mother (h(m(n))) • P(h(n)=word | n, h(m(n))), shortened as: P(h(n) | n, h(m(n))) • P(sacks | NP, dumped) • What we’re really doing is factoring in how words relate to each other • We will call this a dependency relation later: sacks is dependent on dumped, in this case
Putting it all together • See p. 459 for an example lexicalized parse tree for workers dumped sacks into a bin • For rules r, node n, head h, mother m: • P(T) = ΠnєT [ p(r(n) | n, h(n)) * p(h(n) | n, h(m(n))) ] • p(r(n) | n, h(n)): subcategorization info, e.g., P(VP → VBD NP PP | VP, dumped) • p(h(n) | n, h(m(n))): dependency info between words, e.g., P(sacks | NP, dumped)
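A sketch of the per-node product, with hypothetical node records, probability tables, and numbers (a real model would estimate these from a treebank, with smoothing):

```python
def tree_prob(nodes, rule_prob, head_prob):
    """P(T) = product over nodes n of p(r(n) | n, h(n)) * p(h(n) | n, h(m(n)))."""
    p = 1.0
    for n in nodes:
        p *= rule_prob[(n["cat"], n["head"], n["rhs"])]          # subcategorization
        if n["mother_head"] is not None:                         # dependency factor
            p *= head_prob[(n["head"], n["cat"], n["mother_head"])]
    return p

# Two of the nodes from "workers dumped sacks into a bin"; mother_head None just
# means the dependency factor is omitted here for brevity
nodes = [
    {"cat": "VP", "head": "dumped", "rhs": ("VBD", "NP", "PP"), "mother_head": None},
    {"cat": "NP", "head": "sacks",  "rhs": ("NNS",),            "mother_head": "dumped"},
]
rule_prob = {("VP", "dumped", ("VBD", "NP", "PP")): 0.67,
             ("NP", "sacks", ("NNS",)): 0.80}
head_prob = {("sacks", "NP", "dumped"): 0.05}
print(tree_prob(nodes, rule_prob, head_prob))   # 0.67 * 0.80 * 0.05 = 0.0268
```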
Dependency Grammar • Capturing relations between words (e.g. dumped and sacks) is moving in the direction of dependency grammar (DG) • In DG, there is no such thing as constituency • The structure of a sentence is purely the binary relations between words • John loves Mary is represented as: • LOVE → JOHN • LOVE → MARY • where A → B means that B depends on A
Evaluating Parser Output • Dependency relations are also useful for comparing parser output to a treebank • Traditional measures of parser accuracy: • Labeled bracketing precision: # correct constituents in parse/# constituents in parse • Labeled bracketing recall: # correct constituents in parse/# (correct) constituents in treebank parse • There are known problems with these measures, so people are trying to use dependency-based measures instead • How many dependency relations did the parse get correct?
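A sketch of both kinds of scores, representing constituents as (label, start, end) spans and dependency relations as (head, dependent) pairs; the parser output and gold parse are made up:

```python
def precision_recall(predicted, gold):
    """Labeled bracketing precision/recall over sets of (label, start, end) spans."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

def dependency_accuracy(predicted, gold):
    """Fraction of gold (head, dependent) relations the parser recovered."""
    return len(predicted & gold) / len(gold) if gold else 0.0

# Made-up parser output vs. treebank parse for a three-word sentence
pred_spans = {("S", 0, 3), ("VP", 0, 3), ("NP", 1, 3)}
gold_spans = {("S", 0, 3), ("VP", 0, 3), ("NP", 2, 3)}
print(precision_recall(pred_spans, gold_spans))    # (0.667, 0.667): 2 of 3 correct

pred_deps = {("loves", "John"), ("loves", "Mary")}
gold_deps = {("loves", "John"), ("loves", "Mary")}
print(dependency_accuracy(pred_deps, gold_deps))   # 1.0
```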