Learning PCFGs: Estimating Parameters, Learning Grammar Rules. Many slides are taken or adapted from slides by Dan Klein.
Treebanks An example tree from the Penn Treebank
The Penn Treebank • 1 million tokens in 50,000 sentences, each labeled with • A POS tag for each token • Labeled constituents • “Extra” information • Phrase annotations like “TMP” • “Empty” constituents for wh-movement traces and empty subjects for raising constructions
Supervised PCFG Learning • Preprocess the treebank • Remove all “extra” information (empties, extra annotations) • Convert to Chomsky Normal Form • Possibly prune some punctuation, lower-case all words, compute word “shapes”, and other processing to combat sparsity • Count the occurrences of each nonterminal, c(N), and of each observed production rule, c(N -> N_L N_R) and c(N -> w) • Set the probability for each rule to the MLE: P(N -> N_L N_R) = c(N -> N_L N_R) / c(N) and P(N -> w) = c(N -> w) / c(N) • Easy, peasy, lemon-squeezy.
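A minimal sketch of the counting and MLE steps, assuming trees have already been binarized and are encoded as nested tuples; the tree encoding and the function names (count_rules, mle_pcfg) are illustrative, not from the slides:

```python
from collections import defaultdict

def count_rules(tree, rule_counts, nt_counts):
    """Accumulate counts from one CNF tree.

    A tree is either (label, word) for a lexical rule N -> w, or
    (label, left_subtree, right_subtree) for a binary rule N -> N_L N_R.
    """
    label = tree[0]
    nt_counts[label] += 1
    if len(tree) == 2 and isinstance(tree[1], str):          # N -> w
        rule_counts[(label, tree[1])] += 1
    else:                                                     # N -> N_L N_R
        left, right = tree[1], tree[2]
        rule_counts[(label, left[0], right[0])] += 1
        count_rules(left, rule_counts, nt_counts)
        count_rules(right, rule_counts, nt_counts)

def mle_pcfg(treebank):
    """Relative-frequency (MLE) estimates: P(rule) = c(rule) / c(N)."""
    rule_counts, nt_counts = defaultdict(int), defaultdict(int)
    for tree in treebank:
        count_rules(tree, rule_counts, nt_counts)
    return {rule: c / nt_counts[rule[0]] for rule, c in rule_counts.items()}

# Toy one-tree treebank for "the dog barks"
toy_tree = ("S", ("NP", ("DT", "the"), ("NN", "dog")), ("VP", "barks"))
print(mle_pcfg([toy_tree]))
```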
Complications • Smoothing • Especially for lexicalized grammars, many test productions will never be observed during training • We don’t necessarily want to assign these productions zero probability • Instead, define backoff distributions, e.g.: P_final(VP[transmogrified] -> V[transmogrified] PP[into]) = α P(VP[transmogrified] -> V[transmogrified] PP[into]) + (1 - α) P(VP -> V PP[into])
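A small sketch of this kind of linear interpolation; the two component estimates and the weight α are placeholders, with α normally tuned on held-out data:

```python
def interpolate(p_lexicalized, p_backoff, alpha=0.7):
    """Back off from a sparse lexicalized estimate to a delexicalized one.

    p_lexicalized: MLE of the fully lexicalized rule (often 0 at test time)
    p_backoff:     MLE of the delexicalized rule (much less sparse)
    alpha:         interpolation weight, tuned on held-out data
    """
    return alpha * p_lexicalized + (1 - alpha) * p_backoff

# An unseen head word: the backoff term keeps the probability non-zero
print(interpolate(p_lexicalized=0.0, p_backoff=0.02))
```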
Problems with Supervised PCFG Learning • Coming up with labeled data is hard! • Time-consuming • Expensive • Hard to adapt to new domains, tasks, languages • Corpus availability drives research (instead of tasks driving the research) • The Penn Treebank took many person-years to annotate manually.
Unsupervised Learning • Systems take raw data and automatically detect structure • Why? • More data is available • Kids learn (some aspects of) language with no supervision • Insights into machine learning and clustering
Grammar Induction and Learnability • Some have argued that learning syntax from positive data alone is impossible • Gold, 1967: non-identifiability in the limit • Chomsky, 1980: poverty of the stimulus • Surprising result: it’s possible to get entirely unsupervised parsing to work (reasonably) well.
Learnability • Learnability: formal conditions under which a class of languages can be learned • Setup: • Class of languages Λ • Algorithm H (the learner) • H sees a sequence X of strings x1 … xn • H maps sequences X to languages L in Λ • Question is: for what classes Λ do learners H exist?
Learnability [Gold, 1967] • Criterion: identification in the limit • A presentation of L is an infinite sequence of x’s from L in which each x occurs at least once • A learner H identifies L in the limit if, for any presentation of L, from some point n onwards, H always outputs L • A class Λ is identifiable in the limit if there is some single H which correctly identifies in the limit every L in Λ • Example: Λ = {{a}, {a, b}} is identifiable in the limit • Theorem (Gold, 1967): any Λ which contains all finite languages and at least one infinite language (i.e., is superfinite) is unlearnable in this sense.
Learnability [Gold, 1967] • Proof sketch • Assume Λ is superfinite and H identifies Λ in the limit • There exists a chain L_1 ⊂ L_2 ⊂ … ⊂ L_∞ • Construct the following misleading sequence • Present strings from L_1 until H outputs L_1 • Present strings from L_2 until H outputs L_2 • … • This is a presentation of L_∞ (every string of L_∞ eventually appears, since the finite L_i grow to cover it), but H never outputs L_∞
Learnability [Horning, 1969] • Problem: identification in the limit requires that H succeed on all presentations, even the weird ones • Another criterion: measure-one identification • Assume a distribution P_L(x) for each L • Assume P_L(x) puts non-zero probability on all and only the x in L • Assume an infinite presentation of x drawn i.i.d. from P_L(x) • H measure-one identifies L if the probability of [drawing a sequence X from which H can identify L] is 1 • Theorem (Horning, 1969): PCFGs can be identified in this sense • Note: there can be misleading sequences, but they have to be (infinitely) unlikely
Learnability [Horning, 1969] • Proof sketch • Assume Λ is a recursively enumerable set of recursive languages (e.g., the set of all PCFGs) • Assume an ordering on all strings x_1 < x_2 < … • Define: two sequences A and B agree through n iff for all x < x_n, x is in A iff x is in B • Define the error set E(L, n, m): • All sequences such that the first m elements do not agree with L through n • These are the sequences which contain early strings outside of L (can’t happen), or which fail to contain all of the early strings in L (happens less as m increases) • Claim: P(E(L, n, m)) goes to 0 as m goes to ∞ • Let d_L(n) be the smallest m such that P(E) < 2^-n • Let d(n) be the largest d_L(n) over the first n languages • Learner: after d(n), pick the first L that agrees with the evidence through n • This can only fail for sequences X if X keeps showing up in E(L, n, d(n)), which happens infinitely often with probability zero.
Learnability • Gold’s results say little about real learners (the requirements are too strong) • Horning’s algorithm is completely impractical • It needs astronomical amounts of data • Even measure-one identification doesn’t say anything about tree structures • It only talks about learning grammatical sets • Strong generative vs. weak generative capacity
Unsupervised POS Tagging • Some (discouraging) experiments [Merialdo 94] • Setup: • You know the set of allowable tags for each word (but not the frequency of each tag) • Learn a supervised model on k training sentences • Learn P(w|t) and P(t_i | t_{i-1}, t_{i-2}) on these sentences • On n > k unlabeled sentences, re-estimate with EM
Grammar Induction: Unsupervised Learning of Grammars and Parameters
Right-branching Baseline • In English (but not necessarily in other languages), trees tend to be right-branching: • A simple, English-specific baseline is to choose a right-branching chain structure for each sentence.
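A minimal sketch of this baseline, producing unlabeled right-branching binary brackets; the function name is illustrative:

```python
def right_branching(tokens):
    """Return an unlabeled right-branching binary bracketing as nested tuples."""
    if len(tokens) == 1:
        return tokens[0]
    return (tokens[0], right_branching(tokens[1:]))

# e.g. "the cat sat on the mat" ->
# ('the', ('cat', ('sat', ('on', ('the', 'mat')))))
print(right_branching("the cat sat on the mat".split()))
```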
Learn PCFGs with EM [Lari and Young, 1990] • Setup: • Full binary grammar with n nonterminals {X_1, …, X_n} (that is, at the beginning, the grammar has all possible rules) • Parse uniformly/randomly at first • Re-estimate rule expectations from the parses • Repeat • Their conclusion: it doesn’t really work
EM for PCFGs: Details • Start with a “full” grammar, with all possible binary rules over our nonterminals N_1 … N_k. Designate one of them, say N_1, as the start symbol • Assign some starting distribution to the rules, such as • Random • Uniform • Some “smart” initialization techniques (see the assigned reading) • E-step: take an unannotated sentence S and compute, for all nonterminals N, N_L, N_R and all terminals w: E(N | S), E(N -> N_L N_R, N is used | S), E(N -> w, N is used | S) • M-step: reset rule probabilities to the MLE: P(N -> N_L N_R) = E(N -> N_L N_R | S) / E(N | S) and P(N -> w) = E(N -> w | S) / E(N | S) • Repeat the E-step and M-step until the rule probabilities stabilize, or “converge”
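A small sketch of the M-step only, assuming the E-step has already accumulated expected counts over the training sentences into dictionaries; the data structures here are illustrative:

```python
from collections import defaultdict

def m_step(expected_rule_counts, expected_nt_counts):
    """M-step: set each rule probability to its expected relative frequency.

    expected_rule_counts: {(N, N_L, N_R): E[count]} and {(N, w): E[count]},
                          accumulated over all training sentences in the E-step
    expected_nt_counts:   {N: E[count of N being used]}
    """
    new_probs = {}
    for rule, e_count in expected_rule_counts.items():
        parent = rule[0]
        new_probs[rule] = e_count / expected_nt_counts[parent]
    return new_probs

# Toy expected counts for one nonterminal "S" with two competing rules
rules = {("S", "NP", "VP"): 1.5, ("S", "S", "S"): 0.5}
print(m_step(rules, defaultdict(float, {"S": 2.0})))
```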
Definitions • The total probability of the sentence, P(S | G), is the sum of P(T, S | G) over all possible trees T for w_{1m} = w_1 … w_m whose root is N^1 (the start symbol); in the inside notation below, this is β_1(1, m).
E-Step • We can define the expectations we want in terms of the π, α, β quantities, e.g., the expected number of times N^j is used to span w_p … w_q is the product of its outside and inside scores, α_j(p, q) β_j(p, q), divided by the total sentence probability P(S | G).
Inside Probabilities • Base case: β_j(k, k) = P(N^j -> w_k) • Induction: β_j(p, q) = Σ_{l,r} Σ_{d=p..q-1} P(N^j -> N^l N^r) β_l(p, d) β_r(d+1, q) • (Figure: N^j rewrites as N^l spanning w_p … w_d and N^r spanning w_{d+1} … w_q)
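A minimal sketch of the inside computation as a bottom-up dynamic program over span lengths, on a toy hand-written PCFG; the grammar and names are illustrative, and the 1-based indexing follows the β_j(p, q) notation above:

```python
from collections import defaultdict

def inside_probs(words, binary_rules, lexical_rules):
    """Compute beta[(N, p, q)] = P(w_p .. w_q | N spans p..q), 1-indexed.

    binary_rules:  {(N, N_L, N_R): prob}
    lexical_rules: {(N, word): prob}
    """
    m = len(words)
    beta = defaultdict(float)
    # Base case: beta_j(k, k) = P(N^j -> w_k)
    for k, w in enumerate(words, start=1):
        for (n, word), p in lexical_rules.items():
            if word == w:
                beta[(n, k, k)] += p
    # Induction: sum over split points d and rules N^j -> N^l N^r
    for span in range(2, m + 1):
        for p in range(1, m - span + 2):
            q = p + span - 1
            for (n, l, r), rule_p in binary_rules.items():
                total = 0.0
                for d in range(p, q):
                    total += beta[(l, p, d)] * beta[(r, d + 1, q)]
                beta[(n, p, q)] += rule_p * total
    return beta

# Toy PCFG: S -> NP VP, VP -> V NP, NP -> "dogs" | "cats", V -> "chase"
binary = {("S", "NP", "VP"): 1.0, ("VP", "V", "NP"): 1.0}
lexical = {("NP", "dogs"): 0.5, ("NP", "cats"): 0.5, ("V", "chase"): 1.0}
sent = "dogs chase cats".split()
beta = inside_probs(sent, binary, lexical)
print(beta[("S", 1, len(sent))])   # P(S =>* "dogs chase cats") = 0.25
```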
Outside Probabilities • Base case: α_1(1, m) = 1 and α_j(1, m) = 0 for j ≠ 1 • Induction: α_j(p, q) = Σ_{f,g} Σ_{e=q+1..m} α_f(p, e) P(N^f -> N^j N^g) β_g(q+1, e) + Σ_{f,g} Σ_{e=1..p-1} α_f(e, q) P(N^f -> N^g N^j) β_g(e, p-1) • (The parent N^f can extend the span of N^j either to the right or to the left.)
A nested distributional model • We’d like a model that • Ties spans to linear contexts (like distributional clustering) • Considers only proper tree structures (like PCFGs) • Has no symmetries to break (like a dependency model)