Explore how to probabilistically parse using Context-Free Grammars, assigning rule probabilities, finding maximum probability trees, lexicalized parsing, and more. Understand the challenges and solutions in probabilistic parsing.
Probabilistic and Lexicalized Parsing COMS 4705 Fall 2004
Probabilistic CFGs
• The probabilistic model
• Assigning probabilities to parse trees
• Disambiguation, LM for ASR, faster parsing
• Getting the probabilities for the model
• Parsing with probabilities
• Slight modification to the dynamic programming approach
• Task: find the max-probability tree for an input string
Probability Model
• Attach probabilities to grammar rules
• Expansions for a given non-terminal sum to 1
  VP -> V        .55
  VP -> V NP     .40
  VP -> V NP NP  .05
• Read this as P(specific rule | LHS)
• "What's the probability that VP will expand to V, given that we have a VP?"
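As a minimal sketch in plain Python (the dictionary encoding is an assumption of this illustration, not something from the slides), the rules above can be stored keyed by (LHS, RHS), with a check that the expansions of each left-hand side sum to 1; looking up a rule then gives P(rule | LHS):

from collections import defaultdict

rules = {
    ("VP", ("V",)): 0.55,
    ("VP", ("V", "NP")): 0.40,
    ("VP", ("V", "NP", "NP")): 0.05,
}

totals = defaultdict(float)
for (lhs, rhs), p in rules.items():
    totals[lhs] += p

for lhs, total in totals.items():
    assert abs(total - 1.0) < 1e-9, f"expansions of {lhs} must sum to 1"

print(rules[("VP", ("V", "NP"))])   # P(VP -> V NP | VP) = 0.40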
Probability of a Derivation
• A derivation (tree) consists of the set of grammar rules that are in the parse tree
• The probability of a tree is just the product of the probabilities of the rules in the derivation
• Note the independence assumption – why don't we use conditional probabilities?
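A small sketch of that product, assuming trees are represented as (label, children) tuples and rule_probs maps (LHS, RHS-tuple) to a probability as in the previous sketch; the representation is hypothetical, chosen only for illustration:

def tree_prob(tree, rule_probs):
    """Product of the probabilities of the rules used in the tree."""
    label, children = tree
    if all(isinstance(c, str) for c in children):      # lexical rule, e.g. ("V", ["called"])
        return rule_probs[(label, tuple(children))]
    rhs = tuple(child[0] for child in children)        # rule LHS -> child labels
    p = rule_probs[(label, rhs)]
    for child in children:
        p *= tree_prob(child, rule_probs)
    return p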
Probability of a Sentence
• In the unambiguous case, the probability of a word sequence (sentence) is the probability of its tree
• In the ambiguous case, it is the sum of the probabilities of its possible trees
Getting the Probabilities
• From an annotated database, e.g. the Penn Treebank
• To get the probability for a particular VP rule, just count all the times the rule is used and divide by the number of VPs overall
• What if you have no treebank (e.g. for a 'new' language)?
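NLTK can compute exactly these relative-frequency estimates from its bundled 10% sample of the Penn Treebank; a sketch (assumes nltk is installed and the 'treebank' corpus has been fetched with nltk.download):

import nltk
from nltk import Nonterminal, induce_pcfg
from nltk.corpus import treebank

productions = []
for tree in treebank.parsed_sents():
    productions += tree.productions()          # every rule used in every parse

# Relative-frequency estimates: count(LHS -> RHS) / count(LHS)
grammar = induce_pcfg(Nonterminal("S"), productions)
for prod in grammar.productions(lhs=Nonterminal("VP"))[:5]:
    print(prod)                                # VP rules with their estimated probabilities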
Assumptions
• We have a grammar to parse with
• We have a large, robust dictionary with parts of speech
• We have a parser
• Given all that… we can parse probabilistically
Typical Approach
• Bottom-up (CYK) dynamic programming approach
• Assign probabilities to constituents as they are completed and placed in the table
• Use the max probability for each constituent going up the tree
How do we fill in the DP table?
• Say we're talking about a final part of a parse – finding an S, e.g., with the rule S -> NP VP, where the NP spans positions 0 to i and the VP spans positions i to j
• The probability of the S is P(S -> NP VP) * P(NP) * P(VP)
• P(NP) and P(VP) are already known, since we're doing bottom-up parsing; we don't need to recalculate the probabilities of constituents lower in the tree
Using the Maxima
• P(NP) is known
• But what if there are multiple NPs for the span of text in question (0 to i)?
• Take the max (why?)
• This does not mean that other kinds of constituents for the same span are ignored (i.e. they might be in the solution)
CYK Parsing: John called Mary from Denver
S -> NP VP
VP -> V NP
NP -> NP PP
VP -> VP PP
PP -> P NP
NP -> John, Mary, Denver
V -> called
P -> from
Example (figure): parse tree with the PP "from Denver" attached to the VP, using S -> NP VP, VP -> VP PP, VP -> V NP, PP -> P NP (John [[called Mary] [from Denver]])
Example (figure): parse tree with the PP "from Denver" attached to the NP, using S -> NP VP, VP -> V NP, NP -> NP PP, PP -> P NP (John [called [Mary [from Denver]]])
Example (figure)
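To make the example concrete, here is a sketch using NLTK's ViterbiParser, which returns the single highest-probability parse. The probabilities attached to the rules are invented for illustration; the slides give no numbers:

import nltk

grammar = nltk.PCFG.fromstring("""
    S  -> NP VP      [1.0]
    VP -> V NP       [0.6]
    VP -> VP PP      [0.4]
    NP -> NP PP      [0.25]
    NP -> 'John'     [0.25]
    NP -> 'Mary'     [0.25]
    NP -> 'Denver'   [0.25]
    PP -> P NP       [1.0]
    V  -> 'called'   [1.0]
    P  -> 'from'     [1.0]
""")

parser = nltk.ViterbiParser(grammar)
for tree in parser.parse("John called Mary from Denver".split()):
    print(tree)
    print(tree.prob())   # probability of the best parse

With these made-up numbers the VP-attachment reading ("called Mary" modified by "from Denver") comes out more probable; different rule probabilities could favor the NP attachment instead.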
Base Case: A -> w
• For each input word wi, the probability of a preterminal A covering just that word is P(A -> wi)
Recursive Cases: A -> B C
• For a rule A -> B C, the probability of an A over a span is P(A -> B C) * P(B) * P(C), maximized over all split points of the span and all rules that build an A
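A minimal probabilistic CKY sketch covering both cases. The two dictionaries (lexical and binary), the CNF restriction, and the omission of backpointers (so it returns only the probability table, not the best tree) are simplifications of this illustration, not something given in the slides:

from collections import defaultdict

def pcky(words, lexical, binary):
    # lexical[(A, w)]   = P(A -> w)
    # binary[(A, B, C)] = P(A -> B C)
    n = len(words)
    best = defaultdict(float)   # best[(i, j, A)] = max prob of an A over words[i:j]

    # Base case: A -> w_i fills the width-1 cells
    for i, w in enumerate(words):
        for (A, word), p in lexical.items():
            if word == w:
                best[(i, i + 1, A)] = p

    # Recursive case: A -> B C, maximizing over split point k and rule
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (A, B, C), p in binary.items():
                    cand = p * best[(i, k, B)] * best[(k, j, C)]
                    if cand > best[(i, j, A)]:
                        best[(i, j, A)] = cand
    return best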
Problems with PCFGs
• The probability model we're using is based only on the rules in the derivation…
• It doesn't use the words in any real way – e.g. PP attachment often depends on the verb, its object, and the preposition (I ate pickles with a fork. I ate pickles with relish.)
• It doesn't take into account where in the derivation a rule is used – e.g. pronouns are more often subjects than objects (She hates Mary. Mary hates her.)
Solution
• Add lexical dependencies to the scheme…
• Add the predilections of particular words into the probabilities in the derivation
• I.e. condition the rule probabilities on the actual words
Heads
• Make use of the notion of the head of a phrase, e.g.
  • The head of an NP is its noun
  • The head of a VP is its verb
  • The head of a PP is its preposition
• Phrasal heads
  • Can 'take the place of' whole phrases, in some sense
  • Define the most important characteristics of the phrase
  • Phrases are generally identified by their heads
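A toy sketch of percolating head words up a tree, assuming the same hypothetical (label, children) tuple format as before and a drastically simplified head table (real parsers use much fuller head-finding rules, e.g. Collins-style):

HEAD_CHILD = {"NP": "NN", "VP": "VBD", "PP": "IN", "S": "VP"}   # which child supplies the head

def head_word(tree):
    label, children = tree
    if all(isinstance(c, str) for c in children):   # preterminal, e.g. ("VBD", ["dumped"])
        return children[0]
    wanted = HEAD_CHILD.get(label)
    for child in children:
        if child[0] == wanted:
            return head_word(child)
    return head_word(children[-1])                  # crude fallback: rightmost child

# e.g. head_word(("VP", [("VBD", ["dumped"]), ("NP", [("NNS", ["sacks"])])]))  ->  "dumped"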
Example (correct parse): figure showing the lexicalized tree in which the PP headed by into attaches to the VP headed by dumped
Example (wrong): figure showing the lexicalized tree in which the PP headed by into attaches to the NP headed by sacks
How?
• We started with rule probabilities
  • VP -> V NP PP, with P(rule | VP)
  • E.g., the count of this rule divided by the number of VPs in a treebank
• Now we want lexicalized probabilities
  • VP(dumped) -> V(dumped) NP(sacks) PP(into)
  • P(r | VP ^ dumped is the verb ^ sacks is the head of the NP ^ into is the head of the PP)
  • Not likely to have significant counts in any treebank
Declare Independence
• So, exploit the independence assumption and collect the statistics you can…
• Focus on capturing two things
  • Verb subcategorization: particular verbs have affinities for particular VPs
  • Objects have affinities for their predicates (mostly their mothers and grandmothers): some objects fit better with some predicates than others
Verb Subcategorization
• Condition particular VP rules on their head… so
  r: VP -> V NP PP with P(r | VP)
  becomes
  P(r | VP ^ dumped)
• What's the count? How many times this rule was used with dump, divided by the number of VPs that dump appears in total
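A sketch of that relative-frequency estimate with invented counts (the treebank look-up that would produce the counts is not shown):

from collections import Counter

vp_rule_given_head = Counter({        # (verb head, VP expansion) -> count, made up for illustration
    ("dumped", "V NP PP"): 67,
    ("dumped", "V NP"): 20,
    ("dumped", "V PP"): 13,
})
vp_head_total = Counter()
for (verb, rule), c in vp_rule_given_head.items():
    vp_head_total[verb] += c

def p_rule_given_vp_head(rule, verb):
    return vp_rule_given_head[(verb, rule)] / vp_head_total[verb]

print(p_rule_given_vp_head("V NP PP", "dumped"))   # 0.67 with these invented counts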
Preferences
• Subcat captures the affinity between VP heads (verbs) and the VP rules they go with
• What about the affinity between VP heads and the heads of the other daughters of the VP?
Example (correct parse): figure repeated, with dumped as the head of the constituent that has the PP(into) daughter
Example (wrong): figure repeated, with sacks as the head of the constituent that has the PP(into) daughter
Preferences
• The issue here is the attachment of the PP
• So the affinities we care about are the ones between dumped and into vs. sacks and into
• Count the times dumped is the head of a constituent that has a PP daughter with into as its head, and normalize (alternatively, P(into | PP, dumped is mother's head))
• Vs. the situation where sacks is the head of a constituent with into as the head of a PP daughter (or, P(into | PP, sacks is mother's head))
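A sketch of one reasonable reading of this estimate, P(into | PP, mother's head), again with invented counts:

from collections import Counter

pp_daughter = Counter({        # (mother head, PP head) -> count, made up for illustration
    ("dumped", "into"): 9,
    ("dumped", "with"): 3,
    ("sacks", "of"): 20,
    ("sacks", "into"): 1,
})
mother_total = Counter()
for (mother, _), c in pp_daughter.items():
    mother_total[mother] += c

def p_pp_head_given_mother(pp_head, mother):
    return pp_daughter[(mother, pp_head)] / mother_total[mother]

print(p_pp_head_given_mother("into", "dumped"))   # 0.75 with these invented counts
print(p_pp_head_given_mother("into", "sacks"))    # roughly 0.05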
Another Example
• Consider the VPs
  • Ate spaghetti with gusto
  • Ate spaghetti with marinara
• The affinity of gusto for eat is much larger than its affinity for spaghetti
• On the other hand, the affinity of marinara for spaghetti is much higher than its affinity for ate
But not a Head Probability Relationship
• Note the relationship here is more distant and doesn't involve a headword, since gusto and marinara aren't the heads of the PPs (Hindle & Rooth '91)
• (Figure: two trees, one in which PP(with) attaches to NP(spag) for "ate spaghetti with marinara", and one in which PP(with) attaches to VP(ate) for "ate spaghetti with gusto")
Next Time
• Midterm
• Covers everything assigned and all lectures up through today
• Short answers (2-3 sentences), exercises, longer answers (1 paragraph)
• Closed book; calculators allowed but no laptops