Corpora and Statistical Methods, Lecture 11. Albert Gatt
Part 1 Probabilistic Context-Free Grammars and beyond
Context-free grammars: reminder • Many NLP parsing applications rely on the CFG formalism • Definition: • A CFG is a 4-tuple (N, Σ, P, S): • N = a set of non-terminal symbols (e.g. NP, VP) • Σ = a set of terminals (e.g. words) • N and Σ are disjoint • P = a set of productions of the form A → β • A ∈ N • β ∈ (N ∪ Σ)* (any string of terminals and non-terminals) • S = a designated start symbol (usually, “sentence”)
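To make the definition concrete, here is a minimal sketch (not from the slides; the symbols and rules are invented for illustration) of how such a 4-tuple might be represented in Python:

```python
# A toy CFG encoded as a 4-tuple (N, Sigma, P, S).
# P maps each non-terminal A to the right-hand sides beta such that A -> beta is a rule.

nonterminals = {"S", "NP", "VP", "Det", "Nom", "V"}          # N
terminals = {"that", "the", "a", "book", "flight", "read"}   # Sigma

productions = {                                              # P
    "S":   [("NP", "VP")],
    "NP":  [("Det", "Nom")],
    "Nom": [("book",), ("flight",)],
    "VP":  [("V", "NP")],
    "V":   [("read",)],
    "Det": [("that",), ("the",), ("a",)],
}

start_symbol = "S"                                           # S

cfg = (nonterminals, terminals, productions, start_symbol)
```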
CFG Example • S → NP VP • S → Aux NP VP • NP → Det Nom • NP → Proper-Noun • Det → that | the | a • …
Probabilistic CFGs • A CFG where each production has an associated probability • A PCFG is a 5-tuple (N, Σ, P, S, D): • D: P → [0,1] is a function assigning each rule in P a probability • usually, probabilities are obtained from a corpus • the most widely used corpus is the Penn Treebank
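As a hedged illustration of “probabilities are obtained from a corpus” (the counts below are invented, not Penn Treebank figures), D can be estimated by relative frequency over treebank rule counts:

```python
from collections import defaultdict

# Invented rule counts, as might be extracted from a treebank:
# keys are (LHS, RHS) pairs, values are observation counts.
rule_counts = {
    ("S",  ("NP", "VP")): 900,
    ("S",  ("Aux", "NP", "VP")): 100,
    ("NP", ("Det", "Nom")): 600,
    ("NP", ("Proper-Noun",)): 400,
}

# Relative-frequency estimate: P(A -> beta) = count(A -> beta) / count(A)
lhs_totals = defaultdict(int)
for (lhs, rhs), count in rule_counts.items():
    lhs_totals[lhs] += count

rule_probs = {
    (lhs, rhs): count / lhs_totals[lhs]
    for (lhs, rhs), count in rule_counts.items()
}

print(rule_probs[("S", ("NP", "VP"))])   # 0.9
```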
The Penn Treebank • English sentences annotated with syntax trees • built at the University of Pennsylvania • 40,000 sentences, about a million words • text from the Wall Street Journal • Other treebanks exist for other languages (e.g. NEGRA for German)
Building a tree: rules • S → NP VP • NP → NNP NNP • NNP → Mr • NNP → Vinken • … • [Figure: the parse tree for “Mr Vinken is chairman of Elsevier” built up from these rules]
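If NLTK is available, a tree like the one on this slide can be reconstructed and inspected as follows (a sketch; the exact bracketing is an assumption based on the slide, not taken from the Penn Treebank):

```python
# Requires NLTK (pip install nltk).
from nltk import Tree

t = Tree.fromstring(
    "(S (NP (NNP Mr) (NNP Vinken))"
    "   (VP (VBZ is) (NP (NN chairman) (PP (IN of) (NP (NNP Elsevier))))))"
)

t.pretty_print()                # draw the tree as ASCII art
for prod in t.productions():    # list the rules used to build the tree
    print(prod)                 # e.g. S -> NP VP, NP -> NNP NNP, NNP -> 'Mr', ...
```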
Characteristics of PCFGs • In a PCFG, the probability P(A → β) expresses the likelihood that the non-terminal A will expand as β • e.g. the likelihood that S → NP VP • (as opposed to S → VP, or S → NP VP PP, or …) • can be interpreted as a conditional probability: • probability of the expansion, given the LHS non-terminal • P(A → β) = P(A → β | A) • Therefore, for any non-terminal A, the probabilities of all rules of the form A → β must sum to 1 • If this is the case, we say the PCFG is consistent
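The consistency condition can be checked directly from this definition. A small sketch (toy probabilities, invented):

```python
from collections import defaultdict

# Toy rule probabilities: keys are (LHS, RHS) pairs.
rule_probs = {
    ("S",  ("NP", "VP")): 0.9,
    ("S",  ("Aux", "NP", "VP")): 0.1,
    ("NP", ("Det", "Nom")): 0.6,
    ("NP", ("Proper-Noun",)): 0.4,
}

def is_consistent(rule_probs, tol=1e-9):
    """Check that for every non-terminal A, the probabilities of all rules
    of the form A -> beta sum to 1."""
    totals = defaultdict(float)
    for (lhs, _rhs), p in rule_probs.items():
        totals[lhs] += p
    return all(abs(total - 1.0) < tol for total in totals.values())

print(is_consistent(rule_probs))   # True
```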
Uses of probabilities in parsing • Disambiguation: given n legal parses of a string, which is the most likely? • e.g. PP-attachment ambiguity can be resolved this way • Speed: parsing is a search problem • search through space of possible applicable derivations • search space can be pruned by focusing on the most likely sub-parses of a parse • Parser can be used as a model to determine the probability of a sentence, given a parse • typical use in speech recognition, where input utterance can be “heard” as several possible sentences
Using PCFG probabilities • PCFG assigns a probability to every parse-tree t of a string W • e.g. every possible parse (derivation) of a sentence recognised by the grammar • Notation: • G = a PCFG • s = a sentence • t = a particular tree under our grammar • t consists of several nodes n • each node is generated by applying some rule r
Probability of a tree vs. a sentence • P(t) is simply the product of the probabilities of every rule (node) that gives rise to t (i.e. the derivation of t) • this is both the joint probability of t and s, and the probability of t alone • why?
P(t,s) = P(s|t) P(t) = P(t) • because P(s|t) must be 1: the tree t is a parse of all the words of s, so given t the sentence is fully determined
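A sketch of this computation in Python (assumed representation: a tree is a nested tuple (label, children...), leaves are plain strings; the rule probabilities are invented):

```python
# Toy rule probabilities, keyed by (LHS, RHS).
rule_probs = {
    ("S",   ("NP", "VP")): 0.9,
    ("NP",  ("Det", "N")): 0.6,
    ("VP",  ("V", "NP")): 0.7,
    ("Det", ("the",)): 0.5,
    ("N",   ("dog",)): 0.1,
    ("N",   ("cat",)): 0.1,
    ("V",   ("saw",)): 0.2,
}

def child_label(child):
    return child if isinstance(child, str) else child[0]

def tree_prob(tree):
    """P(t) = product of the probabilities of every rule used in t."""
    if isinstance(tree, str):                  # terminals contribute no rule
        return 1.0
    label, *children = tree
    rhs = tuple(child_label(c) for c in children)
    p = rule_probs[(label, rhs)]
    for child in children:
        p *= tree_prob(child)
    return p

t = ("S",
     ("NP", ("Det", "the"), ("N", "dog")),
     ("VP", ("V", "saw"), ("NP", ("Det", "the"), ("N", "cat"))))

print(tree_prob(t))   # product of all rule probabilities in the derivation
```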
Picking the best parse in a PCFG • A sentence will usually have several parses • we usually want them ranked, or only want the n-best parses • we need to focus on P(t|s,G) • probability of a parse, given our sentence and our grammar • definition of the best parse for s: t* = argmax_t P(t|s,G), where t ranges over the parses of s; since P(t|s,G) = P(t,s)/P(s) and P(s) is constant for a fixed sentence, this is simply the parse with the highest P(t)
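NLTK's Viterbi parser performs exactly this kind of ranking, returning the most probable parse under a PCFG. A hedged example (the grammar, probabilities and sentence are invented; assumes NLTK is installed):

```python
import nltk

# A toy PCFG; each non-terminal's rule probabilities sum to 1.
grammar = nltk.PCFG.fromstring("""
    S   -> NP VP        [1.0]
    NP  -> Det N [0.6]  | 'Matt' [0.4]
    VP  -> V NP [0.7]   | V [0.3]
    Det -> 'the'        [1.0]
    N   -> 'dog'        [1.0]
    V   -> 'walks'      [1.0]
""")

parser = nltk.ViterbiParser(grammar)
for tree in parser.parse("Matt walks".split()):
    print(tree)          # the most probable parse
    print(tree.prob())   # its probability
```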
Picking the best parse in a PCFG • Problem: t can have multiple derivations • e.g. expand left-corner nodes first, expand right-corner nodes first etc • so P(t|s,G) should be estimated by summing over all possible derivations • Fortunately, derivation order makes no difference to the final probabilities. • can assume a “canonical derivation” d of t • P(t) =def P(d)
Probability of a sentence • Simply the sum of the probabilities of all parses of that sentence: P(s) = Σ P(t), summed over all those trees t which “yield” s • since s is only a sentence if it’s recognised by G, i.e. if there is some t for s under G
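Continuing the earlier tree_prob sketch (an assumed helper, not from the slides), the sentence probability is then just a sum over the candidate parses:

```python
def sentence_prob(candidate_parses, tree_prob):
    """P(s) = sum of P(t) over every tree t whose yield is s.

    candidate_parses: all parses of s licensed by the grammar G.
    tree_prob: a function returning P(t) for a tree, as sketched earlier.
    """
    return sum(tree_prob(t) for t in candidate_parses)
```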
Flaws I: Structural independence • The probability of a rule r expanding node n depends only on n • Independent of other non-terminals • Example: • P(NP → Pro) is independent of where the NP is in the sentence • but we know that NP → Pro is much more likely in subject position • Francis et al. (1999), using the Switchboard corpus: • 91% of subjects are pronouns; • only 34% of objects are pronouns
Flaws II: lexical independence • vanilla PCFGs ignore lexical material • e.g. P(VP → V NP PP) is independent of the head of the NP or PP, or the lexical head V • Examples: • prepositional phrase attachment preferences depend on lexical items; cf: • dump [sacks into a bin] • dump [sacks] [into a bin] (preferred parse) • coordination ambiguity: • [dogs in houses] and [cats] • [dogs] [in houses and cats]
Lexicalised PCFGs • Attempt to weaken the lexical independence assumption. • Most common technique: • mark each phrasal head (N,V, etc) with the lexical material • this is based on the idea that the most crucial lexical dependencies are between head and dependent • E.g.: Charniak 1997, Collins 1999
Lexicalised PCFGs: Matt walks • Makes probabilities partly dependent on lexical content. • P(VP → VBD | VP) becomes: P(VP → VBD | VP, h(VP) = walk) • NB: normally, we can’t assume that all heads of a phrase of category C are equally probable. • [Figure: head-annotated tree for “Matt walks”: S(walks) dominating NP(Matt) over NNP(Matt), and VP(walk) over VBD(walk)]
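A hedged sketch of how such a lexicalised probability might be estimated (invented head-annotated counts; plain relative frequency with no smoothing or back-off):

```python
from collections import defaultdict

# Invented counts of (LHS, RHS, head word) triples from a head-annotated treebank.
lex_rule_counts = {
    ("VP", ("VBD", "NP"), "dumped"): 30,
    ("VP", ("VBD", "NP", "PP"), "dumped"): 70,
    ("VP", ("VBD", "NP"), "ate"): 80,
    ("VP", ("VBD", "NP", "PP"), "ate"): 20,
}

lhs_head_totals = defaultdict(int)
for (lhs, _rhs, head), count in lex_rule_counts.items():
    lhs_head_totals[(lhs, head)] += count

def lex_rule_prob(lhs, rhs, head):
    """Estimate P(LHS -> RHS | LHS, h(LHS) = head) by relative frequency."""
    return lex_rule_counts.get((lhs, rhs, head), 0) / lhs_head_totals[(lhs, head)]

# The same VP expansion gets different probabilities for different head verbs:
print(lex_rule_prob("VP", ("VBD", "NP", "PP"), "dumped"))   # 0.7
print(lex_rule_prob("VP", ("VBD", "NP", "PP"), "ate"))      # 0.2
```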
Practical problems for lexicalised PCFGs • data sparseness: we don’t necessarily see all heads of all phrasal categories often enough in the training data • flawed assumptions: lexical dependencies occur elsewhere, not just between head and complement • I got the easier problem of the two to solve • “of the two” and “to solve” become more likely because of the pre-head modifier “easier”
Structural context • The simple way: calculate p(t|s,G) based on rules in the canonical derivation d of t • assumes that p(t) is independent of the derivation • could condition on more structural context • but then we could lose the notion of a canonical derivation, i.e. P(t) could really depend on the derivation!
Structural context: probability of a derivation history • How to calculate P(t) based on a derivation d? • Observation: P(d) = P(r1, …, rm) • (the probability that a sequence of m rewrite rules in a derivation yields s) • can use the chain rule for multiplication: P(r1, …, rm) = P(r1) · P(r2 | r1) · … · P(rm | r1, …, rm−1)
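Written out (a reconstruction consistent with the slide; the second line states the vanilla-PCFG simplification discussed above rather than anything claimed verbatim on this slide):

```latex
% Chain rule over the rewrite rules r_1, ..., r_m of a derivation d:
P(d) \;=\; P(r_1, \dots, r_m) \;=\; \prod_{i=1}^{m} P(r_i \mid r_1, \dots, r_{i-1})

% Vanilla PCFG assumption: each rewrite depends only on the non-terminal
% it expands, not on the rest of the derivation history:
P(r_i \mid r_1, \dots, r_{i-1}) \;\approx\; P(r_i \mid \mathrm{LHS}(r_i))
```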
Approach 2: parent annotation • Annotate each node with its parent in the parse tree. • E.g. if NP has parent S, then rename NP to NP^S • Can partly account for dependencies such as subject-of • (NP^S is a subject, NP^VP is an object) • [Figure: parent-annotated tree for “Matt walks”: S dominating NP^S over NNP^NP (Matt) and VP^S over VBD^VP (walks)]
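A minimal sketch of parent annotation over the nested-tuple trees used in the earlier sketches (the representation is an assumption, not a particular parser's format):

```python
def parent_annotate(tree, parent=None):
    """Rename each non-terminal X whose parent is Y to 'X^Y' (the root keeps its name)."""
    if isinstance(tree, str):                  # terminals are left untouched
        return tree
    label, *children = tree
    new_label = label if parent is None else f"{label}^{parent}"
    return (new_label, *(parent_annotate(c, parent=label) for c in children))

t = ("S", ("NP", ("NNP", "Matt")), ("VP", ("VBD", "walks")))
print(parent_annotate(t))
# ('S', ('NP^S', ('NNP^NP', 'Matt')), ('VP^S', ('VBD^VP', 'walks')))
```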
The main point • Different parsing approaches differ mainly in what they condition their probabilities on
Phrase structure vs. Dependency grammar • PCFGs are in the tradition of phrase-structure grammars • Dependency grammar describes syntax in terms of dependencies between words • no non-terminals or phrasal nodes • only lexical nodes with links between them • links are labelled, labels from a finite list
Dependency Grammar • [Figure: dependency graph for “I gave him my address”: GAVE is attached to <ROOT> with the label main; I (subj:), him (dat:) and address (obj:) depend on GAVE; MY (attr:) depends on address]
Dependency grammar • Often used now in probabilistic parsing • Advantages: • directly encode lexical dependencies • therefore, disambiguation decisions take lexical material into account directly • dependencies are a way of decomposing PSRs and their probability estimates • estimating the probability of a dependency between two words is less likely to lead to data sparseness problems (see the sketch below)
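A sketch of the representation and scoring (the triple format and all probabilities are invented for illustration; real dependency parsers condition and smooth these estimates much more carefully):

```python
import math

# The parse of "I gave him my address" as labelled (head, label, dependent)
# triples; <ROOT> is an artificial root node.
parse = [
    ("<ROOT>", "main", "gave"),
    ("gave", "subj", "I"),
    ("gave", "dat", "him"),
    ("gave", "obj", "address"),
    ("address", "attr", "my"),
]

# Invented link probabilities, e.g. P(dependent, label | head).
link_probs = {
    ("<ROOT>", "main", "gave"): 0.05,
    ("gave", "subj", "I"): 0.10,
    ("gave", "dat", "him"): 0.08,
    ("gave", "obj", "address"): 0.03,
    ("address", "attr", "my"): 0.20,
}

def parse_log_prob(parse, link_probs):
    """Score a dependency parse as the sum of log link probabilities
    (i.e. the log of the product over individual word-to-word dependencies)."""
    return sum(math.log(link_probs[link]) for link in parse)

print(parse_log_prob(parse, link_probs))
```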
Summary • We’ve taken a tour of PCFGs • crucial notion: what the probability of a rule is conditioned on • flaws in PCFGs: independence assumptions • several proposals to go beyond these flaws • dependency grammars are an alternative formalism