LING 696B: Phonotactics wrap-up, OT, Stochastic OT
Remaining topics • 4 weeks to go (including the day before Thanksgiving): • Maximum-entropy as an alternative to OT (Jaime) • Rule induction (Mans) + decision trees • Morpho-phonological learning (Emily) and multiple generalizations (LouAnn’s lecture) • Learning and self-organization (Andy’s lecture)
Towards a parametric model of phonotactics • Last time: simple sequence models with some simple variations • Phonological generalization needs much more than this • Different levels: natural classes (Bach + -ed = ?; onsets sl/*sr, *shl/shr); also position, stress, syllable, … • Different ranges: seem to be unbounded • Hungarian (Hayes & Londe): ablak-nak / kert-nek; paller-nak / mutagen-nek • English: *sCVC, *sNVN (skok? spab? smin?)
Towards a parametric model of phonotactics • Parameter explosion seems unavoidable • Searching over all possible natural classes? • Searching over unbounded ranges? • Data sparsity problem serious • Esp. if counting type rather than token frequency • Isolate generalization at specific positions/configurations with templates • Need theory for templates (why sCVC?) • Templates for everything? • Non-parametric/parametric boundary blurred
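A toy sketch of template-based counting, to make the type-vs-token point concrete. Everything here (the lexicon, its frequencies, the segment classes, the sCVC matcher) is a made-up illustration, not data or code from the course.

```python
# Count how often an sCVC-shaped word occurs in a toy lexicon,
# by type (distinct words) and by token (frequency-weighted).
# Lexicon, frequencies, and segment classes are illustrative only.

STOPS  = set("ptkbdg")
VOWELS = set("aeiou")

def matches_sCVC(word):
    return (len(word) >= 4 and word[0] == "s"
            and word[1] in STOPS and word[2] in VOWELS and word[3] in STOPS)

lexicon = {"skip": 120, "stop": 300, "spit": 80, "slab": 50, "stun": 40}
type_count  = sum(1 for w in lexicon if matches_sCVC(w))
token_count = sum(freq for w, freq in lexicon.items() if matches_sCVC(w))
print(type_count, token_count)   # types vs. frequency-weighted tokens
```

Type counts per template are far smaller than token counts, which is why counting types makes the sparsity problem worse.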
Towards a parametric model of phonotactics • Critical survey of the literature needed • How can phonological theory constrain parametric models of phonotactics? • Homework assignment (counts as 2-3): a phonotactics literature review • E.g. V-V, C-C, V-C interactions, natural classes, positions, templates, … • Extra credit if you also present ideas about how they relate to modeling
OT and phonological acquisition • Isn’t data sparsity already a familiar issue? • Old friend: “poverty of stimulus” -- training data vastly insufficient for learning the distribution (recall: the limit sample size 0) • Maybe the view is wrong: forget distribution in a certain language, focus on universals • Standard OT: generalization hard-coded, abandon the huge parameter space • Justification: only consider the ones that are plausible/attested • Learning problem made easier?
OT learning: constraint demotion • Example: English (sibilant+liquid) onsets • Somewhat motivated constraints: *sh+C, *sr, Ident(s), Ident(sh), all starting out ranked equally • Demote constraints that prefer the wrong (losing) candidates *Example adapted from A. Albright
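A minimal sketch of error-driven constraint demotion in the Tesar & Smolensky style, using the four constraints above. The datum and its violation profiles are illustrative assumptions, not the slides' own tableau.

```python
# Sketch of error-driven constraint demotion: demote every constraint that
# prefers the loser to just below the highest-ranked constraint that
# prefers the winner. Lower stratum number = higher rank.

def demote(ranks, winner, loser):
    """ranks: constraint -> stratum (0 = highest); winner/loser: violation counts."""
    prefer_winner = [c for c in ranks if loser[c] > winner[c]]
    prefer_loser  = [c for c in ranks if winner[c] > loser[c]]
    if not prefer_winner or not prefer_loser:
        return ranks
    pivot = min(ranks[c] for c in prefer_winner)   # highest winner-preferring stratum
    for c in prefer_loser:
        if ranks[c] <= pivot:
            ranks[c] = pivot + 1                   # demote to just below the pivot
    return ranks

# All four constraints start in the same stratum.
ranks = {"*sh+C": 0, "*sr": 0, "Ident(s)": 0, "Ident(sh)": 0}

# Hypothetical datum: underlying /sr.../ surfaces as [shr...].
winner = {"*sh+C": 1, "*sr": 0, "Ident(s)": 1, "Ident(sh)": 0}   # [shr...]
loser  = {"*sh+C": 0, "*sr": 1, "Ident(s)": 0, "Ident(sh)": 0}   # *[sr...]
print(demote(ranks, winner, loser))
# -> *sh+C and Ident(s) end up demoted below *sr
```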
OT learning: constraint demotion • Now, pass shleez/sleez to the learner • No negative evidence: shl never appeared in English • Conservative strategy: underlying form same as the surface by default (richness of the base)
Biased constraint demotion (Hayes, Prince & Tesar) • Why the wrong generalization? • Faithfulness -- Ident(sh) is high, therefore allowing underlying sh to appear everywhere • In general: high faithfulness leads to “too much” generalization in OT • Cf. the Subset Principle • Recipe: keep faithfulness as low as possible, unless evidence suggests otherwise • Hope: learn the “most restrictive” language • What kind of evidence?
Remarks on OT approaches to phonotactics • The issues are never-ending • Not enough to put all F low; which F is low also matters (Hayes) • Mission accomplished? -- Are we almost getting the universal set of F and M? • Even with hard-coded generalization, it still takes considerable work to fill all the gaps (e.g. sC/shC, *tl/*dl) • Why does bwa sound better than tla? (Moreton)
Two worlds • Statistical model and OT seem to ask different questions about learning • OT/UG: what is possible/impossible? • Hard-coded generalizations • Combinatorial optimization (sorting) • Statistical: among the things that are possible, what is likely/unlikely? • Soft-coded generalizations • Numerical optimization • Marriage of the two?
OT and variation • Motivation: systematic variation that leads to conflicting generalizations • Example: Hungarian again (Hayes & Londe)
Proposals on getting OT to deal with variation • Partial order rather than total order of constraints (Anttila) • Doesn’t predict which outcomes are more likely than others • Floating constraints (historical OT people) • Can’t really tell what the range is • Stochastic OT (Boersma, Hayes) • Does produce a distribution • Moreover, a generative model • Somewhat unexpected complexity
Stochastic OT • Want to set up a distribution to learn. But a distribution over what? • GEN? -- This does not lead to conflicting generalizations from a fixed ranking • One idea: a distribution over all grammars (also see Yang’s P&P framework) • How many OT grammars? -- N! • Lots of those distributions are junk, e.g. probability 0.5 on the ranking (1,2,…,N), 0.5 on (N,N-1,…,1), zero on everything else • Idea: constrain the distribution over the N! grammars with (N-1) ranking values
Stochastic Optimality Theory: Generation • Canonical OT: a fixed total ranking, e.g. C1 >> C3 >> C2 • Stochastic OT: sample ranking values with evaluation noise, then evaluate under the sampled ordering of C1, C3, C2 [figure: Gaussian ranking-value distributions for C1, C3, C2]
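A minimal sketch of stochastic OT generation under the usual assumptions (one ranking value per constraint, i.i.d. Gaussian evaluation noise). The constraint names, candidate violation profiles, ranking values, and noise standard deviation are placeholders, not values from the slides.

```python
import random

# Stochastic OT generation: add Gaussian noise to each ranking value,
# sort constraints by the noisy values, then evaluate candidates under
# that total order in the usual OT way.

def sample_ranking(ranking_values, sigma=2.0):
    noisy = {c: mu + random.gauss(0, sigma) for c, mu in ranking_values.items()}
    return sorted(noisy, key=noisy.get, reverse=True)   # highest value = top-ranked

def evaluate(order, candidates):
    """candidates: {name: {constraint: violation count}}; standard OT evaluation."""
    pool = list(candidates)
    for c in order:
        best = min(candidates[cand][c] for cand in pool)
        pool = [cand for cand in pool if candidates[cand][c] == best]
        if len(pool) == 1:
            break
    return pool[0]

ranking_values = {"C1": 100.0, "C3": 99.0, "C2": 90.0}
candidates = {"cand-a": {"C1": 0, "C2": 1, "C3": 1},
              "cand-b": {"C1": 1, "C2": 0, "C3": 0}}

wins = {"cand-a": 0, "cand-b": 0}
for _ in range(10000):
    wins[evaluate(sample_ranking(ranking_values), candidates)] += 1
print(wins)   # close ranking values for C1 and C3 yield variable outputs
```

Because C1 and C3 have nearby ranking values, their noisy order flips on some evaluations, which is exactly how a single grammar produces systematic variation.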
What is the nature of the data? • Unlike previous generative models, here the data is relational • Candidates have been “pre-digested” as violation vectors • Candidate pairs (+ frequency) contain information about the distribution over grammars • Similar scenario: estimating numerical (0-100) grades from letter grades (A-F).
Stochastic Optimality Theory: Learning • Canonical OT: learn rankings from data, e.g. (C1 >> C3), (C2 >> C3) • Stochastic OT: learn “ranking values” G = (\mu_1, …, \mu_N) \in R^N from ordinal data D, e.g. max {C1, C2} > C3 ~ .77, max {C1, C2} < C3 ~ .23
Gradual Learning Algorithm (Boersma & Hayes) • Two goals • A robust method for learning standard OT (note: OT ranking from arbitrarily noise-polluted data is a graph-cut-style problem -- NP-hard) • A heuristic for learning Stochastic OT • Example: a mini grammar with variation
How does GLA work • Repeat many times (until forced to stop) • Pick a winner by rolling dice according to P(.) • Adjust constraint values by a small amount if the prediction doesn’t match the picked winner • Similar to training neural nets • “Propagate” error to the ranking values • Some randomness is involved in getting the error
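A minimal sketch of the GLA update loop in the Boersma & Hayes style (symmetric promote/demote steps). The candidates, observed frequencies, plasticity, and noise standard deviation are illustrative assumptions; GLA is a heuristic, not guaranteed to converge to the target distribution.

```python
import random

# Gradual Learning Algorithm sketch: generate with the current noisy
# ranking; on an error, promote constraints that penalize the wrong
# prediction and demote constraints that penalize the observed form.

CANDS = {"cand-a": {"C1": 0, "C2": 1, "C3": 1},
         "cand-b": {"C1": 1, "C2": 0, "C3": 0}}

def ot_winner(noisy, cands):
    order = sorted(noisy, key=noisy.get, reverse=True)
    pool = list(cands)
    for c in order:
        best = min(cands[x][c] for x in pool)
        pool = [x for x in pool if cands[x][c] == best]
    return pool[0]

def gla_step(values, observed, sigma=2.0, plasticity=0.1):
    noisy = {c: mu + random.gauss(0, sigma) for c, mu in values.items()}
    predicted = ot_winner(noisy, CANDS)
    if predicted != observed:
        for c in values:
            diff = CANDS[predicted][c] - CANDS[observed][c]
            if diff > 0:
                values[c] += plasticity   # constraint penalizes the error: promote
            elif diff < 0:
                values[c] -= plasticity   # constraint penalizes the datum: demote

values = {"C1": 100.0, "C2": 100.0, "C3": 100.0}
for _ in range(50000):
    datum = "cand-a" if random.random() < 0.77 else "cand-b"
    gla_step(values, datum)
print(values)   # ranking values that roughly reproduce the 77/23 variation
```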
GLA is stochastic local search • Stochastic local search: incomplete methods that often work well in practice (esp. for intractable problems), but with no guarantees • Need something that works in general
GLA as random walk • Fix the update values; then GLA behaves like a “drunken man”: • Probability of moving in each direction only depends on where you are • In general, does not “wander off” [figure: possible GLA moves in the plane of ranking values for Ident(voi) and *[+voi]]
Stationary distributions • Suppose we have a zillion GLAs running around independently, and look at their “collective answer” • If they don’t wander off, then this answer doesn’t change much after a while -- convergence to the stationary distribution • Equivalent to looking at many runs of just one program
The Bayesian approach to learning Stochastic OT grammars • Key idea: simulating a distribution with computer power • What is a meaningful stationary distribution? • The posterior distribution p(G|D) -- peaks at grammars that explain the data well • How to construct a random walk that will eventually reach p(G|D)? • Technique: Markov chain Monte Carlo (MCMC)
An example of Bayesian inference • Guessing the probability of heads for a bent coin from the outcome of coin tosses [figure: prior, and posteriors after seeing 1, 10, and 100 heads]
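The slide's figure suggests every toss came up heads. Assuming a Beta prior (the slide does not specify one), the standard conjugate update makes the posteriors concrete:

```latex
% Beta-Bernoulli updating (Beta prior is an assumption, not stated on the slide):
% prior p(\theta) = \mathrm{Beta}(\theta; a, b), data = h heads out of n tosses.
p(\theta \mid D) \;\propto\; p(D \mid \theta)\, p(\theta)
  \;=\; \theta^{h}(1-\theta)^{\,n-h}\cdot \theta^{a-1}(1-\theta)^{b-1}
  \;\propto\; \mathrm{Beta}(\theta;\, a+h,\; b+n-h).
% With a flat prior (a = b = 1) and all-heads data, the posteriors after
% 1, 10, 100 heads are Beta(2,1), Beta(11,1), Beta(101,1):
% increasingly peaked near \theta = 1, as in the figure.
```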
Why Bayesian? Maximum likelihood is difficult • Need to deal with a product of integrals! • Likelihood of a datum d: “max {C1, C2} > C3” • No hope this can be done in a tractable way • The Bayesian method gets around doing the calculus altogether
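Written out (my reconstruction, under the usual stochastic OT assumption of i.i.d. Gaussian evaluation noise with variance \sigma^2), the likelihood of one datum is a truncated-Gaussian integral, and the full likelihood is a product of such integrals:

```latex
% Likelihood of d = "max{C1,C2} > C3" under grammar G = (\mu_1,\mu_2,\mu_3):
P(d \mid G) \;=\; \int_{\{\, \max(y_1, y_2) \,>\, y_3 \,\}}
    \prod_{i=1}^{3} \mathcal{N}(y_i;\, \mu_i,\, \sigma^2)\; dy_1\, dy_2\, dy_3 .
% The full likelihood multiplies one such integral per datum, which is why
% direct maximum likelihood over the ranking values is awkward.
```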
Data Augmentation Scheme for Stochastic OT • Paradoxical aspect: “more is easier” • “Missing data” Y: the real constraint values that generate the ranking in d • Notation: G -- grammar; d: “max {C1, C2} > C3”; Y -- missing data • Idea: simulating P(G,Y|D) is easier than simulating P(G|D)
Gibbs sampling for Stochastic OT • p(G|Y,D) = p(G|Y) is easy: sampling the mean from a normal posterior • Random number generation: P(G|Y) ∝ P(Y|G)P(G) • p(Y|G,D) can also be done: fix each d, then sample Y from G so that d holds -- use rejection sampling • Another round of random number generation • Gibbs sampler: iterate, and get p(G,Y|D) -- works in general
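A minimal sketch of the two conditional steps for data of the form d = "max {C1, C2} > C3" vs. its reverse, with a flat prior and a fixed noise standard deviation. For simplicity each latent triple is resampled by whole-vector rejection here, rather than one coordinate at a time as on the later slides; the counts, starting values, and sweep budget are illustrative.

```python
import random

SIGMA = 2.0   # evaluation-noise sd (assumed fixed and known)

def sample_Y(G, data):
    """p(Y|G,D): for each datum, draw (y1,y2,y3) from the grammar and keep
    it only if it respects that datum's observed ordering (rejection)."""
    Y = []
    for d in data:          # d is True iff "max(y1,y2) > y3" was observed
        while True:
            y = [random.gauss(mu, SIGMA) for mu in G]
            if (max(y[0], y[1]) > y[2]) == d:
                Y.append(y)
                break
    return Y

def sample_G(Y):
    """p(G|Y): with a flat prior, each ranking value is normal with mean
    equal to the sample mean of its y's and variance sigma^2 / m."""
    m = len(Y)
    return [random.gauss(sum(y[i] for y in Y) / m, SIGMA / m ** 0.5)
            for i in range(3)]

data = [True] * 77 + [False] * 23        # ~.77 vs ~.23, as on the earlier slide
G = [100.0, 100.0, 100.0]
for sweep in range(2000):
    Y = sample_Y(G, data)
    G = sample_G(Y)
print(G)   # a sample from p(G|D); with ordinal data and a flat prior,
           # only the differences between ranking values are identified
```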
Bayesian simulation: No need for integration! • Once we have samples (g,y) ~ p(G,Y|D), g ~ p(G|D) is automatic: just keep the G’s (joint p(G,Y|D) → marginal p(G|D)) • Use a few starting points to monitor convergence
Result: Stringency Hierarchy • Posterior marginals of the 3 constraints: Ident(voice), *VoiceObs(coda), *VoiceObs [figure: posterior marginals, with the grammar used for generation marked]
Conditional sampling of parameters p(G|Y,D) • Given Y, G is independent of D, so p(G|Y,D) = p(G|Y) • Sampling from p(G|Y) is just regular Bayesian statistics: p(G|Y) ∝ p(Y|G)p(G) • p(Y|G), viewed as a function of G, is proportional to a normal with mean \bar{y} and variance \sigma^2/m • p(G) is chosen to have infinite variance -- an “uninformative” prior
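Written out for one constraint (my reconstruction, under the stated assumptions of i.i.d. Gaussian noise and a flat prior):

```latex
% For one constraint with ranking value \mu and augmented values y_1,\dots,y_m:
p(\mu \mid Y) \;\propto\; p(Y \mid \mu)\, p(\mu)
  \;\propto\; \prod_{j=1}^{m} \exp\!\Big(-\tfrac{(y_j - \mu)^2}{2\sigma^2}\Big)
  \;\propto\; \exp\!\Big(-\tfrac{m\,(\mu - \bar{y})^2}{2\sigma^2}\Big),
\qquad \text{i.e. } \; \mu \mid Y \;\sim\; \mathcal{N}\!\big(\bar{y},\; \sigma^2/m\big).
```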
Conditional sampling of missing data p(Y|G,d) • Idea: decompose Y into (Y_1, …, Y_N), and sample one coordinate at a time • Example: d = “max {C1, C2} > C3” • Easier than sampling all of Y at once from the region where d holds
Conditional sampling of missing data p(Y|G,d) • Sampling the coordinates one at a time forms a random walk in R^3 that approximates p(Y|G,d)
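Concretely, for d = "max {C1, C2} > C3" the one-at-a-time conditionals are either unconstrained or truncated Gaussians (my reconstruction of this step):

```latex
% Sampling Y_3 given Y_1, Y_2, G and d = "max(Y_1, Y_2) > Y_3":
Y_3 \mid Y_1, Y_2, G, d \;\sim\; \mathcal{N}(\mu_3, \sigma^2)
    \ \text{truncated to}\ \big({-\infty},\, \max(Y_1, Y_2)\big).
% Sampling Y_1 given Y_2, Y_3, G and d: if Y_2 > Y_3, the constraint already
% holds and Y_1 \sim \mathcal{N}(\mu_1, \sigma^2); otherwise Y_1 must exceed Y_3,
% i.e. a Gaussian truncated to (Y_3, \infty) -- a Gaussian tail (next slide).
```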
Sampling tails of Gaussians • Direct sampling can be very slow when we need samples from the tail • For efficiency: rejection sampling with an exponential density envelope, its shape optimized for a minimal rejection rate [figure: exponential envelope over the Gaussian tail target]
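A standard way to do this is Robert-style rejection sampling with a shifted-exponential envelope; the slide does not give its exact envelope, so the rate used below is an assumption (the usual choice that minimizes the rejection rate).

```python
import math
import random

# Sample Z ~ N(0,1) conditioned on Z > a (for a > 0) by rejection with a
# shifted-exponential envelope; for a <= 0, plain sampling is already fine.

def normal_tail_sample(a):
    lam = (a + math.sqrt(a * a + 4.0)) / 2.0      # envelope rate (standard choice)
    while True:
        z = a + random.expovariate(lam)           # proposal from the envelope
        if random.random() <= math.exp(-(z - lam) ** 2 / 2.0):
            return z

# A tail sample from N(mu, sigma^2) conditioned on exceeding a threshold t:
mu, sigma, t = 100.0, 2.0, 106.0                  # illustrative numbers
x = mu + sigma * normal_tail_sample((t - mu) / sigma)
print(x)
```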
Ilokano-like grammar • Is there a grammar that will generate p(.)? • Not obvious, since the interaction is not pairwise • GLA is always slightly off
Summary • Two perspectives on the randomized learning algorithm • A Bayesian statistics simulation • A general stochastic search scheme • Bayesian methods often provide approximate solutions to hard computational problems • The solution is exact if allowed to run forever