Chapter 12: Probabilistic Parsing and Treebanks Heshaam Faili hfaili@ece.ut.ac.ir University of Tehran
Motivation and Outline • Previously, we used CFGs to parse with, but: • Some ambiguous sentences could not be disambiguated, and we would like to know the most likely parse • How do we get such grammars? Do we write them ourselves? Maybe we could use a corpus … • Where we’re going: • Probabilistic Context-Free Grammars (PCFGs) • Lexicalized PCFGs • Dependency Grammars
Statistical Parsing • Basic idea • Start with a treebank • a collection of sentences with syntactic annotation, i.e., already-parsed sentences • Examine which parse trees occur frequently • Extract grammar rules corresponding to those parse trees, estimating the probability of the grammar rule based on its frequency • That is, we’ll have a CFG augmented with probabilities
Probabilistic Context-Free Grammars (PCFGs) • Definition of a CFG: • Set of non-terminals (N) • Set of terminals (T) • Set of rules/productions (P), of the form A → β • Designated start symbol (S) • Definition of a PCFG: • Same as a CFG, but with one more function, D • D assigns probabilities to each rule in P
Probabilities • The function D gives probabilities for a non-terminal A to be expanded to a sequence β. • Written as P(A → β) • or as P(A → β | A) • The idea is that, given A as the mother non-terminal (LHS), what is the likelihood that β is the correct RHS • Note that Σi P(A → βi | A) = 1 • For example, we would augment a CFG with these probabilities: • P(S → NP VP | S) = .80 • P(S → Aux NP VP | S) = .15 • P(S → VP | S) = .05
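A minimal sketch (plain Python, with a made-up rule set) of how such a PCFG could be stored, including a check that the rule probabilities for each LHS sum to 1:

```python
from collections import defaultdict

# Toy PCFG: each rule (LHS, RHS) is mapped to its probability D(A -> beta)
pcfg = {
    ("S", ("NP", "VP")):        0.80,
    ("S", ("Aux", "NP", "VP")): 0.15,
    ("S", ("VP",)):             0.05,
}

# Sanity check: for every non-terminal A, sum_i P(A -> beta_i | A) must be 1
totals = defaultdict(float)
for (lhs, rhs), prob in pcfg.items():
    totals[lhs] += prob
for lhs, total in totals.items():
    assert abs(total - 1.0) < 1e-9, f"rules for {lhs} sum to {total}, not 1"
```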
Estimating Probabilities using a Treebank • Given a corpus of sentences annotated with syntactic annotation (e.g., the Penn Treebank) • Consider all parse trees • (1) Each time you have a rule of the form A → β applied in a parse tree, increment a counter for that rule • (2) Also count the number of times A is on the left hand side of a rule • Divide (1) by (2) • P(A → β | A) = Count(A → β) / Count(A) • If you don’t have annotated data, parse the corpus (as we’ll describe next) and estimate the probabilities … which are then used to re-parse.
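A sketch of this counting scheme, assuming a toy treebank where each tree is a nested (label, children…) tuple; the tree encoding and the two example trees are illustrative, not the Penn Treebank's actual format:

```python
from collections import defaultdict

def count_rules(tree, rule_counts, lhs_counts):
    """Count every rule A -> beta in the tree and every occurrence of A as an LHS."""
    label, children = tree[0], tree[1:]
    if isinstance(children[0], str):
        rhs = tuple(children)                 # lexical rule, e.g. N -> flight
    else:
        rhs = tuple(child[0] for child in children)
        for child in children:
            count_rules(child, rule_counts, lhs_counts)
    rule_counts[(label, rhs)] += 1            # (1) Count(A -> beta)
    lhs_counts[label] += 1                    # (2) Count(A)

# Hypothetical two-tree "treebank", each tree a nested (label, children...) tuple
treebank = [
    ("S", ("VP", ("V", "book"), ("NP", ("Det", "that"), ("N", "flight")))),
    ("S", ("NP", ("Det", "the"), ("N", "flight")), ("VP", ("V", "left"))),
]

rule_counts, lhs_counts = defaultdict(int), defaultdict(int)
for tree in treebank:
    count_rules(tree, rule_counts, lhs_counts)

# P(A -> beta | A) = Count(A -> beta) / Count(A)
probs = {rule: c / lhs_counts[rule[0]] for rule, c in rule_counts.items()}
print(probs[("S", ("VP",))])   # 0.5: S -> VP in 1 of the 2 S expansions
```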
Using Probabilities to Parse • P(T): probability of a particular parse tree • P(T,S) = ΠnєT p(r(n)) = P(T)·P(S|T), but P(S|T) = 1 (the tree already contains the words of S), so P(T) = ΠnєT p(r(n)) • i.e., the product of the probabilities of all the rules r used to expand each node n in the parse tree • Example: given the probabilities on p. 449, compute the probability of the parse tree on the right
Computing probabilities • We have the following rules and probabilities (adapted from Figure 12.1): • S → VP .05 • VP → V NP .40 • NP → Det N .20 • V → book .30 • Det → that .05 • N → flight .25 • P(T) = P(S → VP) * P(VP → V NP) * … * P(N → flight) = .05 * .40 * .20 * .30 * .05 * .25 = .000015, or 1.5 × 10⁻⁵
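The same arithmetic as a short sketch (rule probabilities copied from the list above):

```python
# Probabilities of the six rules used in the parse of "book that flight"
rule_probs = [
    0.05,   # S   -> VP
    0.40,   # VP  -> V NP
    0.20,   # NP  -> Det N
    0.30,   # V   -> book
    0.05,   # Det -> that
    0.25,   # N   -> flight
]

p_tree = 1.0
for p in rule_probs:
    p_tree *= p

print(p_tree)   # ~1.5e-05 (up to floating-point rounding)
```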
Using probabilities • So, the probability for that parse is 0.000015. What’s the big deal? • Probabilities are useful for comparing with other probabilities • Whereas we couldn’t decide between two parses using a regular CFG, we now can. • For example, TWA flights is ambiguous between being two separate NPs (cf. I gave [NP John] [NP money]) or one NP: • A: [book [TWA] [flights]] • B: [book [TWA flights]] • Probabilities allow us to choose parse B (see Figure 12.2)
Obtaining the best parse • Call the best parse T(S), where S is your sentence • Get the tree which has the highest probability, i.e. • T(S) = argmax_{T ∈ parse-trees(S)} P(T) • Can use the Cocke-Younger-Kasami (CYK) algorithm to calculate the best parse • CYK is a form of dynamic programming • CYK is a chart parser, like the Earley parser
The CYK algorithm • Base case • Add words to the chart • Store P(A → wi) for every category A in the chart • Recursive case makes this dynamic programming because we only calculate B and C once • Rules must be of the form A → B C, i.e., exactly two items on the RHS (we call this Chomsky Normal Form (CNF)) • Get the probability for A at this node by multiplying the probabilities for B and for C by P(A → B C) • P(B) * P(C) * P(A → B C) • For a given A, only keep the maximum probability (again, this is dynamic programming)
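A compact sketch of probabilistic CYK for a CNF grammar, keeping only the maximum-probability analysis per category in each cell; the grammar encoding and the probabilities are illustrative, not the textbook's pseudocode:

```python
def pcky(words, lexical, binary):
    """
    words:   list of tokens
    lexical: dict word -> list of (A, P(A -> word))
    binary:  list of (A, B, C, P(A -> B C))   (grammar in Chomsky Normal Form)
    Returns chart, where chart[i][j] maps each category A to the best
    probability of any A spanning words[i:j].
    """
    n = len(words)
    chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]

    # Base case: add the words to the chart via lexical rules A -> w_i
    for i, w in enumerate(words):
        for cat, p in lexical.get(w, []):
            chart[i][i + 1][cat] = max(chart[i][i + 1].get(cat, 0.0), p)

    # Recursive case: combine adjacent spans with rules A -> B C
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for a, b, c, p_rule in binary:
                    if b in chart[i][k] and c in chart[k][j]:
                        p = chart[i][k][b] * chart[k][j][c] * p_rule
                        if p > chart[i][j].get(a, 0.0):
                            chart[i][j][a] = p      # keep only the max for A
    return chart

# Toy grammar for "book that flight" (probabilities illustrative; the unary
# rule S -> VP would have to be folded away during CNF conversion)
lexical = {"book": [("V", 0.30)], "that": [("Det", 0.05)], "flight": [("N", 0.25)]}
binary = [("NP", "Det", "N", 0.20), ("VP", "V", "NP", 0.40)]
chart = pcky("book that flight".split(), lexical, binary)
print(chart[0][3])   # best probability for each category covering the whole input
```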
Problems with PCFGs • It’s still only a CFG, so dependencies on non-CFG info not captured • e.g., Pronouns are more likely to be subjects than objects: • P[(NP → Pronoun) | NP = subj] >> P[(NP → Pronoun) | NP = obj]
Problems with PCFGs • Ignores lexical information (statistics), which is usually crucial for disambiguation • (T1) America sent [[250,000 soldiers] [into Iraq]] • (T2) America sent [250,000 soldiers] [into Iraq] • send with into-PP always attached high (T2) in PTB! • To handle lexical information, we’ll turn to lexicalized PCFGs
Ignoring lexical information • (Figure: the two competing structures use the same rules, VP → VBD NP PP for attachment to the verb vs. VP → VBD NP with NP → NP PP for attachment to the noun)
Lexicalized Grammars • Remember how we talked about head information being passed up in a syntactic analysis? • e.g., VP[head *1] → V[head *1] NP • Well, if you follow this down all the way to the bottom of a tree, you wind up with a head word • In some sense, we can say that Book that flight is not just an S, but an S rooted in book • Thus, book is the headword of the whole sentence • By adding headword information to nonterminals, we wind up with a lexicalized grammar
Lexicalized Grammars • Best results to date: • Collins parser • Charniak parser
Lexicalized Parse Trees (Lexicalized PCFGs) • Each PCFG rule in a tree is augmented to identify one RHS constituent as the head daughter • The headword for a node is set to the head word of its head daughter • (Figure: the tree for book that flight with headword annotations [book] and [flight] propagated up to the root)
Incorporating Head Probabilities: Wrong Way • Simply adding headword w to node won’t work: • So, the node A becomes A[w] • e.g., P(A[w] → β | A) = Count(A[w] → β) / Count(A) • The probabilities are too small, i.e., we don’t have a big enough corpus to calculate these probabilities • VP(dumped) → VBD(dumped) NP(sacks) PP(into)  3 × 10⁻¹⁰ • VP(dumped) → VBD(dumped) NP(cats) PP(into)  8 × 10⁻¹¹ • These probabilities are tiny, and others will never occur
Incorporating head probabilities: Right way • Previously, we conditioned on the mother node (A): • P(A → β | A) • Now, we can condition on the mother node and the headword of A (h(A)): • P(A → β | A, h(A)) • We’re no longer conditioning on simply the mother category A, but on the mother category when h(A) is the head • e.g., P(VP → VBD NP PP | VP, dumped)
Calculating rule probabilities • We’ll write the probability more generally as: • P(r(n) | n, h(n)) • where n = node, r = rule, and h = headword • We calculate this by comparing how many times the rule occurs with h(n) as the headword versus how many times the mother/headword combination appears in total: • P(VP → VBD NP PP | VP, dumped) = C(VP(dumped) → VBD NP PP) / Σβ C(VP(dumped) → β)
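A sketch of this relative-frequency estimate, assuming we already have counts of lexicalized expansions keyed by (category, headword, RHS); the counts are invented for illustration:

```python
from collections import defaultdict

# Invented counts of lexicalized expansions:
# key = (mother category, headword, RHS categories)
counts = {
    ("VP", "dumped", ("VBD", "NP", "PP")): 6,
    ("VP", "dumped", ("VBD", "NP")):       2,
    ("VP", "dumped", ("VBD", "PP")):       1,
}

# Denominator: total expansions of VP(dumped), i.e. the sum over all beta
totals = defaultdict(int)
for (cat, head, rhs), c in counts.items():
    totals[(cat, head)] += c

def rule_prob(cat, head, rhs):
    """P(cat -> rhs | cat, head) = C(cat(head) -> rhs) / sum_beta C(cat(head) -> beta)"""
    return counts.get((cat, head, rhs), 0) / totals[(cat, head)]

print(rule_prob("VP", "dumped", ("VBD", "NP", "PP")))   # 6/9 ≈ 0.67
```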
Adding info about word-word dependencies • We want to take into account one other factor: the probability of being a head word (in a given context) • P(h(n)=word | …) • We condition this probability on two things: 1. the category of the node (n), and 2. the headword of the mother (h(m(n))) • P(h(n)=word | n, h(m(n))), shortened as: P(h(n) | n, h(m(n))) • P(sacks | NP, dumped) • What we’re really doing is factoring in how words relate to each other • We will call this a dependency relation later: sacks is dependent on dumped, in this case
Putting it all together • See p. 459 for an example lexicalized parse tree for workers dumped sacks into a bin • For rules r, node n, head h, mother m: • P(T) = ΠnєT [ p(r(n) | n, h(n)) * p(h(n) | n, h(m(n))) ] • p(r(n) | n, h(n)): subcategorization info, e.g., P(VP → VBD NP PP | VP, dumped) • p(h(n) | n, h(m(n))): dependency info between words, e.g., P(sacks | NP, dumped)
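A sketch of the per-node product, with hypothetical node records, probability tables, and numbers (a real model would estimate these from a treebank, with smoothing):

```python
def tree_prob(nodes, rule_prob, head_prob):
    """P(T) = product over nodes n of p(r(n) | n, h(n)) * p(h(n) | n, h(m(n)))."""
    p = 1.0
    for n in nodes:
        p *= rule_prob[(n["cat"], n["head"], n["rhs"])]          # subcategorization
        if n["mother_head"] is not None:                         # dependency factor
            p *= head_prob[(n["head"], n["cat"], n["mother_head"])]
    return p

# Two of the nodes from "workers dumped sacks into a bin"; mother_head None just
# means the dependency factor is omitted here for brevity
nodes = [
    {"cat": "VP", "head": "dumped", "rhs": ("VBD", "NP", "PP"), "mother_head": None},
    {"cat": "NP", "head": "sacks",  "rhs": ("NNS",),            "mother_head": "dumped"},
]
rule_prob = {("VP", "dumped", ("VBD", "NP", "PP")): 0.67,
             ("NP", "sacks", ("NNS",)): 0.80}
head_prob = {("sacks", "NP", "dumped"): 0.05}
print(tree_prob(nodes, rule_prob, head_prob))   # 0.67 * 0.80 * 0.05 = 0.0268
```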
Dependency Grammar • Capturing relations between words (e.g. dumped and sacks) is moving in the direction of dependency grammar (DG) • In DG, there is no such thing as constituency • The structure of a sentence is purely the binary relations between words • John loves Mary is represented as: • LOVE → JOHN • LOVE → MARY • where A → B means that B depends on A
Evaluating Parser Output • Dependency relations are also useful for comparing parser output to a treebank • Traditional measures of parser accuracy: • Labeled bracketing precision: # correct constituents in parse/# constituents in parse • Labeled bracketing recall: # correct constituents in parse/# (correct) constituents in treebank parse • There are known problems with these measures, so people are trying to use dependency-based measures instead • How many dependency relations did the parse get correct?
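A sketch of both kinds of scores, representing constituents as (label, start, end) spans and dependency relations as (head, dependent) pairs; the parser output and gold parse are made up:

```python
def precision_recall(predicted, gold):
    """Labeled bracketing precision/recall over sets of (label, start, end) spans."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

def dependency_accuracy(predicted, gold):
    """Fraction of gold (head, dependent) relations the parser recovered."""
    return len(predicted & gold) / len(gold) if gold else 0.0

# Made-up parser output vs. treebank parse for a three-word sentence
pred_spans = {("S", 0, 3), ("VP", 0, 3), ("NP", 1, 3)}
gold_spans = {("S", 0, 3), ("VP", 0, 3), ("NP", 2, 3)}
print(precision_recall(pred_spans, gold_spans))    # (0.667, 0.667): 2 of 3 correct

pred_deps = {("loves", "John"), ("loves", "Mary")}
gold_deps = {("loves", "John"), ("loves", "Mary")}
print(dependency_accuracy(pred_deps, gold_deps))   # 1.0
```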