Introduction to Natural Language Processing (600.465) Parsing: Introduction
Context-free Grammars • Chomsky hierarchy • Type 0 Grammars/Languages • rewrite rules α → β; α, β are any strings of terminals and nonterminals • Context-sensitive Grammars/Languages • rewrite rules: αXβ → αγβ, where X is a nonterminal and α, β, γ are any strings of terminals and nonterminals (γ must not be empty) • Context-free Grammars/Languages • rewrite rules: X → γ, where X is a nonterminal and γ is any string of terminals and nonterminals • Regular Grammars/Languages • rewrite rules: X → αY, where X, Y are nonterminals and α is a string of terminal symbols; Y might be missing
Parsing Regular Grammars • Finite state automata • Grammar ↔ regular expression ↔ finite state automaton • Space needed: • constant • Time needed to parse: • linear (~ length of input string) • Cannot do e.g. a^n b^n, embedded recursion (context-free grammars can)
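As an illustration of the constant-space, linear-time claim, here is a minimal hypothetical sketch (not from the slides) of a finite-state recognizer for the regular language (ab)*: a single integer of state is the only memory used, and the input is scanned once, left to right.

```python
# Hypothetical illustration (not from the slides): a deterministic finite
# automaton for the regular language (ab)*. Parsing needs constant space
# (one integer of state) and linear time (one pass over the input).

def accepts_ab_star(s: str) -> bool:
    """Return True iff s is in (ab)*."""
    state = 0                      # 0 = expecting 'a', 1 = expecting 'b'
    for ch in s:                   # single left-to-right pass: linear time
        if state == 0 and ch == 'a':
            state = 1
        elif state == 1 and ch == 'b':
            state = 0
        else:
            return False           # no transition defined: reject
    return state == 0              # accept only if we end expecting 'a'

assert accepts_ab_star("abab")
assert not accepts_ab_star("aabb")  # a^n b^n needs counting, beyond any finite automaton
```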
Parsing Context Free Grammars • Widely used for surface syntax description (or better to say, for correct word-order specification) of natural languages • Space needed: • stack (sometimes stack of stacks) • in general: stack items ~ depth of actual (i.e., in the data) recursion • Time: in general, O(n^3) • Cannot do: e.g. a^n b^n c^n (context-sensitive grammars can)
Example Toy NL Grammar • #1 S → NP • #2 S → NP VP • #3 VP → V NP • #4 NP → N • #5 N → flies • #6 N → saw • #7 V → flies • #8 V → saw • (slide shows a parse tree of "flies saw saw": S → NP VP, with NP → N → flies and VP → V NP, V → saw, NP → N → saw)
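To make the toy grammar concrete, the hypothetical sketch below (function names invented for illustration, not part of the slides) encodes rules #1–#8 as a Python dictionary and enumerates every parse of "flies saw saw" by brute-force splitting; it finds exactly the tree drawn on the slide.

```python
# Hypothetical sketch: the toy grammar (rules #1-#8) and a brute-force
# enumeration of all parses of "flies saw saw".

GRAMMAR = {
    "S":  [["NP"], ["NP", "VP"]],          # #1, #2
    "VP": [["V", "NP"]],                   # #3
    "NP": [["N"]],                         # #4
    "N":  [["flies"], ["saw"]],            # #5, #6
    "V":  [["flies"], ["saw"]],            # #7, #8
}

def parses(sym, words):
    """Yield all trees (nested tuples) rooted in `sym` whose yield is `words`."""
    if sym not in GRAMMAR:                          # terminal symbol
        if list(words) == [sym]:
            yield sym
        return
    for rhs in GRAMMAR[sym]:
        for pieces in splits(words, len(rhs)):      # one piece per RHS symbol
            for kids in combos(rhs, pieces):
                yield (sym, *kids)

def splits(words, k):
    """All ways to cut `words` into k non-empty contiguous pieces."""
    if k == 1:
        if words:
            yield [words]
        return
    for i in range(1, len(words) - k + 2):
        for rest in splits(words[i:], k - 1):
            yield [words[:i]] + rest

def combos(rhs, pieces):
    """All combinations of sub-parses, one per (RHS symbol, piece) pair."""
    if not rhs:
        yield []
        return
    for head in parses(rhs[0], pieces[0]):
        for tail in combos(rhs[1:], pieces[1:]):
            yield [head] + tail

for tree in parses("S", ["flies", "saw", "saw"]):
    print(tree)
# ('S', ('NP', ('N', 'flies')), ('VP', ('V', 'saw'), ('NP', ('N', 'saw'))))
```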
Probabilistic Parsing and PCFGs • CS 224n / Lx 237 • Monday, May 3, 2004
Modern Probabilistic Parsers • A greatly increased ability to build accurate, robust, broad-coverage parsers (Charniak 1997, Collins 1997, Ratnaparkhi 1997, Charniak 2000) • Converts parsing into a classification task using statistical / machine learning methods • Statistical methods (fairly) accurately resolve structural and real-world ambiguities • Much faster – often linear time (by using beam search) • Provide probabilistic language models that can be integrated with speech recognition systems
Supervised parsing • Crucial resources have been treebanks such as the Penn Treebank (Marcus et al. 1993) • From these you can train classifiers. • Probabilistic models • Decision trees • Decision lists / transformation-based learning • Possible only when there are extensive resources • Uninteresting from a Cog Sci point of view
Probabilistic Models for Parsing • Conditional / Parsing Model / discriminative: • We estimate directly the probability of a parse tree: t̂ = argmax_t P(t | s, G), where Σ_t P(t | s, G) = 1 • Odd in that the probabilities are conditioned on a particular sentence • We don't learn from the distribution of specific sentences we see (nor do we assume some specific distribution for them); we would need more general classes of data
Probabilistic Models for Parsing • Generative / Joint / Language Model: • Assigns a probability to every tree generated by the grammar; the probabilities then cover the entire language L: Σ_{t: yield(t) ∈ L} P(t) = 1 – a language model over all trees (all sentences) • We turn the language model into a parsing model by dividing the probability of a tree in the language model, P(t), by the probability of the sentence, P(s); this uses the joint probability P(t, s | G): t̂ = argmax_t P(t | s) [parsing model] = argmax_t P(t, s) / P(s) = argmax_t P(t, s) [generative model] = argmax_t P(t) • The language model (restricted to a specific sentence) can thus be used as a parsing model to choose between alternative parses, with P(s) = Σ_t P(s, t) = Σ_{t: yield(t) = s} P(t)
Syntax • One big problem with HMMs and n-gram models is that they don't account for the hierarchical structure of language • They perform poorly on sentences such as "The velocity of the seismic waves rises to …" • An n-gram model doesn't expect a singular verb (rises) after a plural noun (waves) • The noun waves gets reanalyzed as a verb • We need recursive phrase structure
Syntax – recursive phrase structure • (slide shows a parse tree: S → NP_sg VP_sg, where NP_sg → DT NN PP covers "the velocity of the seismic waves", with PP → IN NP_pl covering "of the seismic waves", and VP_sg covers "rises to …")
PCFGs • The simplest method for recursive embedding is a Probabilistic Context Free Grammar (PCFG) • A PCFG is basically just a weighted CFG.
PCFGs • A PCFG G consists of: • A set of terminals, {w^k}, k = 1, …, V • A set of nonterminals, {N^i}, i = 1, …, n • A designated start symbol, N^1 • A set of rules, {N^i → ζ_j}, where ζ_j is a sequence of terminals and nonterminals • A set of probabilities on rules such that for all i: Σ_j P(N^i → ζ_j | N^i) = 1 • A convention: we'll write P(N^i → ζ_j) to mean P(N^i → ζ_j | N^i)
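One possible way to make the definition concrete: the hypothetical sketch below groups rules by their left-hand nonterminal and checks the constraint that each nonterminal's expansion probabilities sum to 1. The particular rules and numbers are the standard textbook grammar behind the "astronomers saw stars with ears" example used on later slides; the deck's own rule table is not in this extract, so treat them as assumptions.

```python
# Hypothetical sketch of a PCFG as a data structure: rules grouped by their
# left-hand nonterminal N^i; for each N^i the expansion probabilities sum to 1.
# Rules/numbers are assumed (standard textbook "astronomers" grammar).

PCFG = {
    "S":  [(("NP", "VP"), 1.0)],
    "VP": [(("V", "NP"), 0.7), (("VP", "PP"), 0.3)],
    "PP": [(("P", "NP"), 1.0)],
    "NP": [(("NP", "PP"), 0.4), (("astronomers",), 0.1), (("ears",), 0.18),
           (("saw",), 0.04), (("stars",), 0.18), (("telescopes",), 0.1)],
    "V":  [(("saw",), 1.0)],
    "P":  [(("with",), 1.0)],
}

# the defining constraint: for all i, sum_j P(N^i -> zeta_j | N^i) = 1
for lhs, expansions in PCFG.items():
    total = sum(p for _, p in expansions)
    assert abs(total - 1.0) < 1e-9, f"expansions of {lhs} sum to {total}, not 1"
```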
PCFGs - Notation • w_1n = w_1 … w_n = the sequence from 1 to n (a sentence of length n) • w_ab = the subsequence w_a … w_b • N^j_ab = the nonterminal N^j dominating w_a … w_b
Finding the most likely string • P(t) – the probability of a tree is the product of the probabilities of the rules used to generate it • P(w_1n) – the probability of the string is the sum of the probabilities of the trees which have that string as their yield: P(w_1n) = Σ_j P(w_1n, t_j) = Σ_j P(t_j), where t_j is a parse of w_1n
Tree and String Probabilities • w_15 = the string "astronomers saw stars with ears" • P(t_1) = 1.0 * 0.1 * 0.7 * 1.0 * 0.4 * 0.18 * 1.0 * 1.0 * 0.18 = 0.0009072 • P(t_2) = 1.0 * 0.1 * 0.3 * 0.7 * 1.0 * 0.18 * 1.0 * 1.0 * 0.18 = 0.0006804 • P(w_15) = P(t_1) + P(t_2) = 0.0009072 + 0.0006804 = 0.0015876
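As a quick check of the arithmetic above, a tiny hypothetical snippet (not from the slides) multiplies out the listed rule probabilities and sums the two parses:

```python
# Hypothetical sanity check of the numbers above: each tree probability is the
# product of its rule probabilities; the string probability is their sum.
from math import prod, isclose

p_t1 = prod([1.0, 0.1, 0.7, 1.0, 0.4, 0.18, 1.0, 1.0, 0.18])
p_t2 = prod([1.0, 0.1, 0.3, 0.7, 1.0, 0.18, 1.0, 1.0, 0.18])

assert isclose(p_t1, 0.0009072)
assert isclose(p_t2, 0.0006804)
assert isclose(p_t1 + p_t2, 0.0015876)   # P(w_15)
```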
Assumptions of PCFGs • Place invariance (like time invariance in HMMs): • The probability of a subtree does not depend on where in the string the words it dominates are • Context-free: • The probability of a subtree does not depend on words not dominated by the subtree • Ancestor-free: • The probability of a subtree does not depend on nodes in the derivation outside the subtree
Some Features of PCFGs • Partial solution for grammar ambiguity: a PCFG gives some idea of the plausibility of a sentence • But not a very good one, as the independence assumptions are too strong • Robustness (admit everything, but with low probability) • Gives a probabilistic language model • But in this simple form it performs worse than a trigram model • Better for grammar induction (Gold 1967 vs. Horning 1969)
Some Features of PCFGs • Encodes certain biases (e.g. shorter sentences normally have higher probability) • Could combine PCFGs with trigram models • Could lessen the independence assumptions: • Structure sensitivity • Lexicalization
Structure sensitivity • Manning and Carpenter 1997, Johnson 1998 • Expansion of nodes depends a lot on their position in the tree (independent of lexical content): subject NPs are pronouns 91% of the time and lexical NPs 9%, whereas object NPs are pronouns only 34% of the time and lexical NPs 66% • We can encode more information into the nonterminal space by enriching nodes to also record information about their parents • NP^S (an NP whose parent is S) is different from NP^VP (an NP whose parent is VP)
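A minimal hypothetical sketch of parent annotation (the function name and example tree are invented for illustration): every nonterminal label is enriched with its parent's category, so an NP under S and an NP under VP become distinct symbols that can pick up different expansion statistics.

```python
# Hypothetical sketch of parent annotation (Johnson 1998 style enrichment).

def annotate_parents(tree, parent="ROOT"):
    """tree = (label, child, ...) for nonterminals, or a word string at the leaves."""
    if isinstance(tree, str):                      # leaf word: leave unchanged
        return tree
    label, *children = tree
    return (f"{label}^{parent}",                   # e.g. NP^S vs. NP^VP
            *(annotate_parents(child, label) for child in children))

t = ("S", ("NP", ("PRP", "She")),
          ("VP", ("VBD", "read"), ("NP", ("DT", "the"), ("NN", "book"))))
print(annotate_parents(t))
# ('S^ROOT', ('NP^S', ('PRP^NP', 'She')),
#  ('VP^S', ('VBD^VP', 'read'), ('NP^VP', ('DT^NP', 'the'), ('NN^NP', 'book'))))
```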
Structure sensitivity • Another example: the dispreference for pronouns to be the second object NP of a ditransitive verb • I gave Charlie the book • I gave the book to Charlie • I gave you the book • ? I gave the book to you
(Head) Lexicalization • The head word of a phrase gives a good representation of the phrase's structure and meaning • Attachment ambiguities: The astronomer saw the moon with the telescope • Coordination: the dogs in the house and the cats • Subcategorization frames: put versus like
(Head) Lexicalization • put takes both an NP and a PP • Sue put [ the book ]NP [ on the table ]PP • * Sue put [ the book ]NP • * Sue put [ on the table ]PP • like usually takes an NP and not a PP • Sue likes [ the book ]NP • * Sue likes [ on the table ]PP
(Head) Lexicalization • Collins 1997, Charniak 1997 • Puts the properties of the word back into the PCFG • (slide shows the lexicalized parse of "Sue walked into the store", with every node annotated by its head word: S[walked], NP[Sue], VP[walked], V[walked], PP[into], P[into], NP[store], DT[the])
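A minimal hypothetical sketch of head lexicalization along these lines (the head table and helper are invented and heavily simplified, not Collins' or Charniak's actual head rules): each rule designates one child as the head, and the head word percolates up, producing labels like S[walked], VP[walked] and PP[into] as in the slide's tree.

```python
# Hypothetical, heavily simplified head lexicalization: pick a head child per
# category and percolate its head word up the tree as label[headword].

HEAD_CHILD = {"S": "VP", "VP": "V", "PP": "P", "NP": "N"}

def lexicalize(tree):
    """tree = (label, word) at preterminals, (label, child, ...) elsewhere."""
    label, *rest = tree
    if len(rest) == 1 and isinstance(rest[0], str):          # preterminal over a word
        return (f"{label}[{rest[0]}]", rest[0])
    kids = [lexicalize(child) for child in rest]
    head_cat = HEAD_CHILD.get(label)
    # head child = the one whose category matches the table; else the last child
    head = next((k for k in kids if k[0].split("[")[0] == head_cat), kids[-1])
    head_word = head[0].split("[")[1].rstrip("]")
    return (f"{label}[{head_word}]", *kids)

t = ("S", ("NP", ("N", "Sue")),
          ("VP", ("V", "walked"),
                 ("PP", ("P", "into"), ("NP", ("DT", "the"), ("N", "store")))))
print(lexicalize(t))
# ('S[walked]', ('NP[Sue]', ...), ('VP[walked]', ('V[walked]', 'walked'), ('PP[into]', ...)))
```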
Using a PCFG • As with HMMs, there are 3 basic questions we want to answer: • The probability of the string (language modeling): P(w_1n | G) • The most likely structure for the string (parsing): argmax_t P(t | w_1n, G) • Estimates of the parameters of a known PCFG from training data (learning): find G such that P(w_1n | G) is maximized • We'll assume that our PCFG is in Chomsky Normal Form (CNF)
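On the CNF assumption: any rule with three or more right-hand-side symbols can be binarized by introducing fresh intermediate nonterminals without changing any tree's probability. A hypothetical sketch (names invented; unary and lexical rules are not handled here):

```python
# Hypothetical sketch: binarize lhs -> X1 X2 ... Xk (k >= 2) into a chain of
# binary rules; the original probability sits on the first rule, the rest get 1.0.

def binarize(lhs, rhs, prob):
    rules, current = [], lhs
    for i, sym in enumerate(rhs[:-2]):
        new = f"{lhs}|{'_'.join(rhs[i + 1:])}"   # fresh intermediate symbol, e.g. VP|NP_PP
        rules.append((current, (sym, new), prob if i == 0 else 1.0))
        current = new
    rules.append((current, tuple(rhs[-2:]), prob if len(rhs) == 2 else 1.0))
    return rules

print(binarize("VP", ("V", "NP", "PP"), 0.2))
# [('VP', ('V', 'VP|NP_PP'), 0.2), ('VP|NP_PP', ('NP', 'PP'), 1.0)]
```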
HMMs and PCFGs • HMMs: probability distribution over strings of a certain length; for all n: Σ_{w_1n} P(w_1n) = 1 • Forward/Backward: forward α_i(t) = P(w_1(t-1), X_t = i), backward β_i(t) = P(w_tT | X_t = i) • PCFGs: probability distribution over the set of strings that are in the language L: Σ_{w ∈ L} P(w) = 1 • Inside/Outside: outside α_j(p,q) = P(w_1(p-1), N^j_pq, w_(q+1)m | G), inside β_j(p,q) = P(w_pq | N^j_pq, G)
PCFGs – hands on • CS 224n / Lx 237 section • Tuesday, May 4, 2004
Inside Algorithm • We are calculating the total probability of generating the words w_p … w_q given that one is starting with the nonterminal N^j • (slide figure: N^j expands as N^r N^s, with N^r spanning w_p … w_d and N^s spanning w_(d+1) … w_q)
Inside Algorithm - Base • Base case, for rules of the form N^j → w_k: β_j(k,k) = P(w_k | N^j_kk, G) = P(N^j → w_k | G) • This deals with the lexical rules
Inside Algorithm - Inductive • Inductive case, for rules of the form N^j → N^r N^s: β_j(p,q) = P(w_pq | N^j_pq, G) = Σ_{r,s} Σ_{d=p}^{q-1} P(N^r_pd, N^s_(d+1)q | N^j_pq, G) * P(w_pd | N^r_pd, G) * P(w_(d+1)q | N^s_(d+1)q, G) = Σ_{r,s} Σ_d P(N^j → N^r N^s) * β_r(p,d) * β_s(d+1, q) • (slide figure: N^j → N^r N^s, with N^r spanning w_p … w_d and N^s spanning w_(d+1) … w_q)
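Putting the base and inductive cases together, here is a hypothetical end-to-end sketch of the inside algorithm for a CNF PCFG. The grammar below is the standard textbook grammar assumed earlier for "astronomers saw stars with ears" (its rule table is not in this extract, so the numbers are an assumption); β(S, 1, 5) then reproduces P(w_15) = 0.0015876 from the earlier slide.

```python
# Hypothetical sketch of the inside algorithm for a CNF PCFG.
from collections import defaultdict

LEXICON = {   # unary rules N^j -> w_k with probability P(N^j -> w_k)
    ("NP", "astronomers"): 0.1, ("NP", "ears"): 0.18, ("NP", "saw"): 0.04,
    ("NP", "stars"): 0.18, ("NP", "telescopes"): 0.1,
    ("V", "saw"): 1.0, ("P", "with"): 1.0,
}
BINARY = {    # binary rules N^j -> N^r N^s with probability P(N^j -> N^r N^s)
    ("S", "NP", "VP"): 1.0, ("VP", "V", "NP"): 0.7, ("VP", "VP", "PP"): 0.3,
    ("PP", "P", "NP"): 1.0, ("NP", "NP", "PP"): 0.4,
}

def inside(words):
    """beta[(j, p, q)] = P(w_p ... w_q | N^j spans p..q, G), with 1-based p, q."""
    n = len(words)
    beta = defaultdict(float)
    for k, w in enumerate(words, start=1):            # base case: lexical rules
        for (j, word), prob in LEXICON.items():
            if word == w:
                beta[(j, k, k)] += prob
    for span in range(2, n + 1):                      # inductive case, small spans first
        for p in range(1, n - span + 2):
            q = p + span - 1
            for d in range(p, q):                     # split point p <= d < q
                for (j, r, s), prob in BINARY.items():
                    beta[(j, p, q)] += prob * beta[(r, p, d)] * beta[(s, d + 1, q)]
    return beta

beta = inside("astronomers saw stars with ears".split())
print(beta[("S", 1, 5)])   # ~0.0015876 = P(w_15), matching the earlier slide
```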