
Chapter 14: Statistical Parsing



  1. Chapter 14: Statistical Parsing Heshaam Faili hfaili@ece.ut.ac.ir University of Tehran

  2. Motivation and Outline • Previously, we used CFGs to parse, but: • Some ambiguous sentences could not be disambiguated, and we would like to know the most likely parse • How do we get such grammars? Do we write them ourselves? Maybe we could use a corpus … • Where we’re going: • Probabilistic Context-Free Grammars (PCFGs) • Lexicalized PCFGs

  3. Probabilistic Context-Free Grammars (PCFGs) • We’ll have a CFG augmented with probabilities • Used for disambiguation (attachment and coordination ambiguities) • Used for language modeling, i.e., estimating the probability of a sentence (S) of the language • We used N-grams for this before … but PCFGs are likely to do better

  4. Probabilistic Context-Free Grammars (PCFGs) • Definition of a CFG: • Set of non-terminals (N) • Set of terminals (T) • Set of rules/productions (P), of the form A → β • Designated start symbol (S) • Definition of a PCFG: • Same as a CFG, but with one more function, D • D assigns a probability to each rule in P

  5. Probabilities • The function D gives the probability for a non-terminal A to be expanded to a sequence β • Written as P(A → β) or as P(A → β | A) • The idea is that, given A as the mother non-terminal (LHS), what is the likelihood that β is the correct RHS? • Note that Σi P(A → βi | A) = 1 • For example, we would augment a CFG with these probabilities: • P(S → NP VP | S) = .80 • P(S → Aux NP VP | S) = .15 • P(S → VP | S) = .05
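The slides contain no code, but a minimal Python sketch (the representation and all names are illustrative assumptions, not from the chapter) shows one way such a PCFG could be stored and how the constraint Σi P(A → βi | A) = 1 can be checked:

```python
from collections import defaultdict

# A PCFG as a mapping from (LHS, RHS) rules to probabilities.
# The rules and numbers are the slide-5 example; the dict representation
# is an assumption made for this sketch.
pcfg = {
    ("S", ("NP", "VP")):        0.80,
    ("S", ("Aux", "NP", "VP")): 0.15,
    ("S", ("VP",)):             0.05,
}

def check_normalization(rules, tol=1e-9):
    """Check that the rule probabilities for each LHS non-terminal sum to 1."""
    totals = defaultdict(float)
    for (lhs, _rhs), prob in rules.items():
        totals[lhs] += prob
    return {lhs: abs(total - 1.0) < tol for lhs, total in totals.items()}

print(check_normalization(pcfg))   # {'S': True}
```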

  6. An Example

  7. Using Probabilities to Parse • P(T): the probability of a particular parse tree • P(T,S) = P(T) · P(S|T) = P(T), since P(S|T) = 1 (the tree T already contains all the words of S) • P(T) = ΠnєT p( r(n) ), i.e., the product of the probabilities of all the rules r used to expand each node n in the parse tree
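As a concrete illustration of P(T) = ΠnєT p( r(n) ), here is a minimal sketch; the nested-tuple tree representation and the rule probabilities are invented for the example, not taken from the chapter:

```python
# A parse tree as nested tuples: (label, children...); a pre-terminal's
# single child is the word itself. All probabilities below are made up.
RULE_PROBS = {
    ("S", ("NP", "VP")): 0.80,
    ("NP", ("Pronoun",)): 0.35,
    ("NP", ("Noun",)): 0.20,
    ("VP", ("Verb", "NP")): 0.20,
    ("Pronoun", ("I",)): 0.40,
    ("Verb", ("book",)): 0.30,
    ("Noun", ("flights",)): 0.10,
}

def tree_prob(tree, rule_probs=RULE_PROBS):
    """P(T) = product over the nodes n of T of p(r(n))."""
    label, *children = tree
    if isinstance(children[0], str):                  # pre-terminal: A -> word
        return rule_probs[(label, tuple(children))]
    rhs = tuple(child[0] for child in children)
    prob = rule_probs[(label, rhs)]
    for child in children:
        prob *= tree_prob(child, rule_probs)
    return prob

tree = ("S", ("NP", ("Pronoun", "I")),
             ("VP", ("Verb", "book"), ("NP", ("Noun", "flights"))))
print(tree_prob(tree))   # the product of the seven rule probabilities used
```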

  8. Disambiguated Parse Tree

  9. Using probabilities • So, the probability for that parse is 0.000015. What’s the big deal? • Probabilities are useful for comparing with other probabilities • Whereas we couldn’t decide between two parses using a regular CFG, we now can • For example, TWA flights is ambiguous between being two separate NPs (cf. I gave [NP John] [NP money]) or one NP: • A: [book [TWA] [flights]] • B: [book [TWA flights]] • The probabilities allow us to choose parse B

  10. Language Modeling: Bigram estimation

  11. Language modeling and N-grams • N-gram problem (predicting the word after): • the contract ended with a loss of 7 cents after trading as low as 9 cents • Use the parse tree to predict words based on more useful information (words that are syntactically related, not just adjacent)

  12. Obtaining the best parse • Call the best parse T(S), where S is your sentence • Get the tree which has the highest probability, i.e. • T(S) = argmax T∈parse-trees(S) P(T) • Can use the Cocke-Younger-Kasami (CYK) algorithm to calculate the best parse • CYK is a form of dynamic programming • CYK is a chart parser, like the Earley parser

  13. The Probabilistic CYK algorithm • Base case • Add words to the chart • Store P(A → wi) for every category A in the chart • Recursive case: this is what makes it dynamic programming, because we only calculate each B and C once • Rules must be of the form A → BC, i.e., exactly two items on the RHS (we call this Chomsky Normal Form (CNF)) • Get the probability for A at this node by multiplying the probabilities for B and for C by P(A → BC) • P(B)*P(C)*P(A → BC) • For a given A, only keep the maximum probability (again, this is dynamic programming)

  14. PCYK pseudo-code
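The pseudo-code on this slide is an image; below is a minimal Python sketch of probabilistic CYK under the CNF assumption of slide 13. The grammar format (a lexical dictionary plus a list of binary rules) is an assumption made for the sketch, not the chapter's notation:

```python
import math
from collections import defaultdict

def prob_cyk(words, lexical, binary):
    """Probabilistic CYK for a CNF grammar.

    lexical: dict mapping word -> {A: P(A -> word)}
    binary:  list of (A, B, C, p) tuples with p = P(A -> B C)
    Returns (table, back), where table[(i, j)][A] is the best log-probability
    of A spanning words[i:j] and back holds backpointers for tree recovery.
    """
    n = len(words)
    table = defaultdict(lambda: defaultdict(lambda: float("-inf")))
    back = {}
    # Base case: fill in P(A -> w_i) for each word.
    for i, w in enumerate(words):
        for A, p in lexical.get(w, {}).items():
            table[(i, i + 1)][A] = math.log(p)
    # Recursive case: combine a B over [i, k) and a C over [k, j) via A -> B C,
    # keeping only the maximum probability for each A (dynamic programming).
    for span in range(2, n + 1):
        for i in range(0, n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for A, B, C, p in binary:
                    if B in table[(i, k)] and C in table[(k, j)]:
                        cand = math.log(p) + table[(i, k)][B] + table[(k, j)][C]
                        if cand > table[(i, j)][A]:
                            table[(i, j)][A] = cand
                            back[(i, j, A)] = (k, B, C)
    return table, back
```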

  15. Example: The flight includes a meal

  16. Learning Probabilities using a Treebank • Given a corpus of sentences annotated with parse trees (e.g., the Penn Treebank) • Consider all parse trees • (1) Each time a rule of the form A → β is applied in a parse tree, increment a counter for that rule • (2) Also count the number of times A is on the left-hand side of a rule • Then estimate P(A → β | A) = Count(A → β) / Count(A) • If we don’t have annotated data, but we do have a non-probabilistic CFG, then …
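A minimal sketch of this counting, reusing the nested-tuple tree representation assumed in the earlier sketch (it does not use the actual Penn Treebank file format):

```python
from collections import Counter

def estimate_rule_probs(trees):
    """MLE estimate: P(A -> beta) = Count(A -> beta) / Count(A)."""
    rule_counts, lhs_counts = Counter(), Counter()

    def visit(tree):
        label, *children = tree
        if isinstance(children[0], str):              # pre-terminal: A -> word
            rhs = tuple(children)
        else:
            rhs = tuple(child[0] for child in children)
            for child in children:
                visit(child)
        rule_counts[(label, rhs)] += 1
        lhs_counts[label] += 1

    for tree in trees:
        visit(tree)
    return {rule: count / lhs_counts[rule[0]]
            for rule, count in rule_counts.items()}
```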

  17. Learning Probabilities from a non-probabilistic CFG • Parse the corpus using the CFG, but … • Most sentences are ambiguous (have multiple parses) • Use all parses and weight them, then count rule occurrences according to the weights of their parses. But how can we estimate the weights? (chicken and egg) • Incrementally improve the estimates: start the parser with equal rule probabilities and iteratively recompute the weighted rule counts • Inside-outside algorithm (a special case of the EM algorithm, like forward-backward) • E-step: Expectation step • M-step: Maximization step

  18. Problems with PCFGs (poor independence assumptions) • It’s still only a CFG, so dependencies on non-CFG information are not captured • English, e.g.: pronouns are more likely to be subjects than objects • Persian, e.g.: pro-drop is much more likely when the dropped pronoun is a subject • P(NP → Pronoun | NP = subj) >> P(NP → Pronoun | NP = obj)

  19. Problems with PCFGs (ignore lexical information) • Ignores lexical information (statistics), which is usually crucial for disambiguation • (T1) America sent [[250,000 soldiers] [into Iraq]] • (T2) America sent [250,000 soldiers] [into Iraq] • send with an into-PP is always attached high (T2) in the PTB! • To handle lexical information, we’ll turn to lexicalized PCFGs

  20. Ignore lexical information • VP attachment: VP → VBD NP PP • NP attachment: VP → VBD NP, NP → NP PP

  21. Ignore lexical information • Based only on the probabilities of the rules (VP → VBD NP PP) versus (VP → VBD NP, NP → NP PP), one of these attachments (VP or NP) is chosen for every PP, independent of the lexical items • NP attachment is more likely to be used, but … • “fishermen caught tons of herring” • The affinity between “dumped” and “into” is greater than that between “sacks” and “into” => VP attachment • But the affinity between “tons” and “of” is greater than that between “caught” and “of” => NP attachment • So we need to model lexical dependencies

  22. Handling Coordination ambiguity

  23. Dealing with PCFG problems: SPLITTING AND MERGING • To handle the fact that NPs in subject position tend to be pronouns: split NP into NPsubject and NPobject • Parent annotation: split each non-terminal based on its parent non-terminal • NP^VP : object • NP^S : subject
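A minimal sketch of parent annotation as a tree transform, reusing the nested-tuple tree representation assumed in the earlier sketches:

```python
def parent_annotate(tree, parent=None):
    """Relabel each non-terminal A as A^parent, e.g. NP^S vs. NP^VP."""
    label, *children = tree
    if isinstance(children[0], str):          # leave pre-terminals unchanged
        return tree
    new_label = f"{label}^{parent}" if parent else label
    return (new_label, *(parent_annotate(child, label) for child in children))

tree = ("S", ("NP", ("Pronoun", "I")),
             ("VP", ("Verb", "book"), ("NP", ("Noun", "flights"))))
print(parent_annotate(tree))
# ('S', ('NP^S', ...), ('VP^S', ('Verb', 'book'), ('NP^VP', ...)))
```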

  24. Parent annotation

  25. Splitting the pre-terminals • e.g., the pre-terminal for “if” prefers sentential complements over NP complements

  26. Split and Merge • Node-splitting is not without problems: it increases the size of the grammar and hence reduces the amount of training data available for each grammar rule, leading to overfitting • Choosing the correct granularity is the key • Split and merge iteratively: • Start with a simple grammar • Alternately split and merge non-terminals, finding the set of annotated nodes that maximizes the likelihood of the training treebank

  27. Probabilistic Lexicalized PCFGs • Remember how we talked about head information being passed up in a syntactic analysis? • Well, if you follow this down all the way to the bottom of a tree, you wind up with a head word • In some sense, we can say that Book that flight is not just an S, but an S rooted in book • Thus, book is the headword of the whole sentence • By adding headword information to nonterminals, we wind up with a lexicalized grammar

  28. Probabilistic Lexicalized PCFGs • A further extension also associates the head tag (the POS tag of the head word) with each non-terminal

  29. Incorporating Head Probabilities: Wrong Way • Simply adding the headword w to a node won’t work: • The node A becomes A[w] • e.g., P(A[w] → β | A[w]) = Count(A[w] → β) / Count(A[w]) • The probabilities are too small, i.e., we don’t have a big enough corpus to estimate them • VP(dumped) → VBD(dumped) NP(sacks) PP(into): 3×10^-10 • VP(dumped) → VBD(dumped) NP(cats) PP(into): 8×10^-11 • These probabilities are tiny, and many other rules will simply never occur in the corpus

  30. Lexicalized Grammars • Best results to date: • Collins parser • Charniak parser

  31. Incorporating head probabilities: Right way (Collins Parser) • Treat the right-hand side of every (internal) CFG rule as consisting of a head non-terminal, together with the non-terminals to the left of the head and the non-terminals to the right of the head

  32. Collins Parser • Instead of a simple MLE probability for the whole rule, break the rule down with a generative story • Given the left-hand side, we first generate the head of the rule, and then generate the dependents of the head, one by one, from the inside out • Add STOP symbols at the beginning and end of the RHS of the rule
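As an illustration of this generative story, here is a hedged sketch of how one lexicalized rule might decompose into head, left-dependent, and right-dependent probabilities; the conditioning contexts are simplified relative to the full Collins model, and the rule is the dumped/sacks example mentioned on slides 21 and 29:

```latex
% A sketch of the head-outward decomposition (simplified conditioning):
P\bigl(\text{VP(dumped,VBD)} \to \text{VBD(dumped,VBD) NP(sacks,NNS) PP(into,P)}\bigr) =
    P_H(\text{VBD} \mid \text{VP, dumped})
    \times P_L(\text{STOP} \mid \text{VP, VBD, dumped})
    \times P_R(\text{NP(sacks,NNS)} \mid \text{VP, VBD, dumped})
    \times P_R(\text{PP(into,P)} \mid \text{VP, VBD, dumped})
    \times P_R(\text{STOP} \mid \text{VP, VBD, dumped})
```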

  33. Generative Story

  34. Collins Parser

  35. Advanced Features on Collins • Distance feature

  36. Evaluating Parser Output • Dependency relations are also useful for comparing parser output to a treebank • Traditional measures of parser accuracy: labeled precision, labeled recall, and crossing brackets • Crossing brackets: ((A B) C) vs. (A (B C))

  37. DISCRIMINATIVE RERANKING • Generative parsers: the probabilistic model gives us the probability of generating a particular sentence by assigning a probability to each choice the parser makes • But generative parsing models also make it hard to incorporate arbitrary kinds of information into the probability • e.g., a preference for right-branching structures in English • Two-stage discriminative reranking (a sketch of the second stage follows below): • N-best list: output the N best parses instead of one single parse • Extract features: the parse probability assigned by the first-stage statistical parser, each of the CFG rules in the tree, the number of parallel conjuncts, how heavy each constituent is, measures of how right-branching the parse tree is, how many times various tree fragments occur, bigrams of adjacent non-terminals in the tree
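A minimal sketch of the second stage as a linear reranker over the N-best list; the feature names and the existence of trained weights are assumptions made for illustration, not part of the chapter:

```python
def rerank(nbest, weights):
    """Return the parse whose linear feature score is highest.

    nbest:   list of (parse, features) pairs; features maps feature names
             (e.g. 'log_parser_prob', 'n_parallel_conjuncts') to values.
    weights: trained feature weights, assumed to be given.
    """
    def score(features):
        return sum(weights.get(name, 0.0) * value
                   for name, value in features.items())
    return max(nbest, key=lambda pair: score(pair[1]))[0]
```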

  38. PARSER-BASED LANGUAGE MODELING • Uses two-stage modeling • First stage: use a normal N-gram model to produce the N best hypotheses • Second stage: run our statistical parser and assign a parse probability to each sentence in the N-best list or lattice

  39. Practice • 14.5, 14.6
