LING 696B: Gradient phonotactics and well-formedness
Vote on remaining topics • Topics that have been fixed: • Morpho-phonological learning (Emily) + (LouAnn’s lecture) + Bayesian learning • Rule induction (Mans) + decision tree • Learning and self-organization (Andy’s lecture)
Voting on remaining topics • Select 2-3 from the following (need a ranking): • OT and Stochastic OT • Alternatives to OT: random fields/maximum entropy • Minimal Description Length word chopping • Feature-based lexical access
Well-formedness of words (following Mike’s talk) • A word “sounds like English” if: • It is a close neighbor of words that sound really English, e.g. “pand” is a neighbor of sand, band, pad, pan, … • It agrees with what English grammar says an English word should look like, e.g. gradient phonotactics says blick > bnick • Today: relate these two ideas to the non-parametric and parametric perspectives
Many ways of calculating the probability of a sequence • Unigrams, bigrams, trigrams, syllable parts, transition probabilities … • No bound on the number of creative ways • What does it mean to talk about the “probability” of a phonological word? • Objective/frequentist vs. subjective/Bayesian: a philosophical distinction (but an important one) • Thinking “parametrically” may clarify things • “Likelihood” = “probability” calculated from a model
Parametric approach to phonotactics • Example: “bag of sounds” assumption / exchangeable distributions • p(blik) = p(lbik) = p(kbli) • Unigram model: N − 1 parameters • What is θ? How do we get the estimate θ̂? How do we assign a probability to “blick”? • [Diagram: segment chain B L I K]
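As a concrete illustration (not from the slides), here is a minimal Python sketch of the bag-of-sounds unigram model, assuming a toy lexicon and plain count-based maximum-likelihood estimation:

```python
# Hypothetical sketch of a unigram ("bag of sounds") phonotactic model.
# The toy lexicon and the MLE-by-counting step are assumptions for illustration.
from collections import Counter

lexicon = ["blik", "sand", "band", "pan", "pad"]   # toy training data

counts = Counter(seg for word in lexicon for seg in word)
total = sum(counts.values())
theta_hat = {seg: c / total for seg, c in counts.items()}  # MLE of p(segment)

def unigram_prob(word):
    """p(word) under the bag-of-sounds model: product of segment probabilities."""
    p = 1.0
    for seg in word:
        p *= theta_hat.get(seg, 0.0)   # unseen segments get probability 0
    return p

print(unigram_prob("blik"))  # same value as unigram_prob("lbik"): order is ignored
```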
Parametric approach to phonotactics • Unigram model with overlapping observations: N² − 1 parameters • What is θ? How do we get θ̂? How do we assign a probability to “blick”? • Note: the input is the pair sequence #B BL LI IK K#
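A sketch of the same idea with overlapping pair observations, assuming “#” as the word-boundary symbol and a toy lexicon:

```python
# Sketch of the "unigram over overlapping pairs" idea (assumptions: toy lexicon,
# '#' as the word-boundary symbol).
from collections import Counter

lexicon = ["blik", "sand", "band"]

def pairs(word):
    padded = "#" + word + "#"
    return [padded[i:i+2] for i in range(len(padded) - 1)]   # e.g. #b, bl, li, ik, k#

counts = Counter(p for word in lexicon for p in pairs(word))
total = sum(counts.values())
theta_hat = {p: c / total for p, c in counts.items()}

def prob(word):
    """Treat each overlapping pair as an independent draw from theta_hat."""
    p = 1.0
    for pair in pairs(word):
        p *= theta_hat.get(pair, 0.0)
    return p

print(prob("blik"))
```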
Parametric approach to phonotactics • Unigram with annotated observations (Coleman and Pierrehumbert) • Input: segments annotated with a syllable parse, e.g. BL = “Osif” (onset of a strong initial/final syllable), IK = “Rsif” (rhyme of a strong initial/final syllable)
Parametric approach to phonotactics • Bigram model: N(N − 1) parameters {p(w_n | w_{n−1})} (how many for a trigram?) • Input: segment sequence B L I K
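A hedged sketch of the segment-bigram model, again with an invented toy lexicon and “#” as the boundary symbol:

```python
# Sketch of a segment-bigram model (toy lexicon and boundary symbol assumed).
from collections import Counter, defaultdict

lexicon = ["blik", "sand", "band"]

bigram_counts = defaultdict(Counter)
for word in lexicon:
    padded = "#" + word + "#"
    for prev, nxt in zip(padded, padded[1:]):
        bigram_counts[prev][nxt] += 1

def p_next(prev, nxt):
    """MLE of p(w_n | w_{n-1})."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][nxt] / total if total else 0.0

def bigram_prob(word):
    padded = "#" + word + "#"
    p = 1.0
    for prev, nxt in zip(padded, padded[1:]):
        p *= p_next(prev, nxt)
    return p

print(bigram_prob("blik"), bigram_prob("bnik"))   # blick > bnick on these toy counts
```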
Ways that theory might help calculate probability • Probability calculation must be based on an explicit model • Need a story about what sequences are • How can phonology help with calculating sequence probability? • More delicate representations • More complex models • But: phonology is not quite about what sequences are …
More delicate representations • Would CV phonology help? • Auto-segmental tiers, features, gestures? • The chains are no longer independent: more sophisticated models are needed • Limit: a generative model of speech production (very hard) • [Diagram: segments B L I K on parallel tiers]
More complex models • Mixture of unigrams • Used in document classification • [Diagram: latent lexical stratum → unigram model → segments B L I K]
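A hypothetical illustration of a mixture of unigrams with two lexical strata; the strata, weights, and segment probabilities below are all made up:

```python
# Illustrative (hypothetical) mixture-of-unigrams model with two lexical strata,
# e.g. "native" vs. "borrowed"; the parameter values below are invented.
strata = {
    "native":   {"weight": 0.7, "p": {"b": 0.2, "l": 0.2, "i": 0.3, "k": 0.3}},
    "borrowed": {"weight": 0.3, "p": {"b": 0.1, "l": 0.1, "i": 0.4, "k": 0.4}},
}

def mixture_prob(word):
    """p(word) = sum over strata z of p(z) * prod over segments p(seg | z)."""
    total = 0.0
    for stratum in strata.values():
        p = stratum["weight"]
        for seg in word:
            p *= stratum["p"].get(seg, 0.0)
        total += p
    return total

print(mixture_prob("blik"))
```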
More complex models • More structure in the Markov chain • Can also model the length distribution with so-called semi-Markov models • [Diagram: states “onset”, “rhyme V”, “rhyme VC” emitting chunks BL, IK]
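A toy sketch of the idea of putting more structure into the chain: states correspond to syllable positions and emit variable-length chunks of segments. Every state name and number below is invented:

```python
# Minimal sketch (invented numbers) of a Markov chain over syllable positions,
# where each state emits a chunk of segments: onset -> rhyme.
transitions = {"start": {"onset": 1.0},
               "onset": {"rhyme_V": 0.4, "rhyme_VC": 0.6},
               "rhyme_V": {"end": 1.0},
               "rhyme_VC": {"end": 1.0}}
emissions = {"onset": {"bl": 0.05, "b": 0.1},
             "rhyme_V": {"i": 0.2},
             "rhyme_VC": {"ik": 0.1}}

def parse_prob(chunks_with_states):
    """Probability of one parse, e.g. [('onset','bl'), ('rhyme_VC','ik')]."""
    p, prev = 1.0, "start"
    for state, chunk in chunks_with_states:
        p *= transitions[prev][state] * emissions[state].get(chunk, 0.0)
        prev = state
    return p * transitions[prev]["end"]

print(parse_prob([("onset", "bl"), ("rhyme_VC", "ik")]))   # p("blik") via this parse
```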
More complex models • Probabilistic context free grammar • Syllable --> C + VC (0.6) • Syllable --> C + V (0.35) • Syllable --> C + C (0.05) • C --> _ (0.01) • C --> b (0.05) • … • See 439/539
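A minimal sketch of how a derivation probability would be computed under a toy PCFG like the one above; the rule set largely echoes the slide, and the VC rule probability is an assumption:

```python
# Sketch of computing a derivation probability under a toy PCFG; the Syllable and
# C rule probabilities echo the slide, the VC rule is an assumed addition.
rules = {
    ("Syllable", ("C", "VC")): 0.60,
    ("Syllable", ("C", "V")):  0.35,
    ("C", ("b",)):             0.05,
    ("C", ("",)):              0.01,   # empty onset
    ("VC", ("ik",)):           0.02,   # assumed value, not on the slide
}

def derivation_prob(derivation):
    """Multiply rule probabilities along one derivation (a list of rules)."""
    p = 1.0
    for rule in derivation:
        p *= rules[rule]
    return p

# One derivation of the syllable "bik": Syllable -> C VC, C -> b, VC -> ik
print(derivation_prob([("Syllable", ("C", "VC")),
                       ("C", ("b",)),
                       ("VC", ("ik",))]))
```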
What’s the benefit of doing more sophisticated things? • Recall: maximum likelihood needs more data to produce a better estimate • Data sparsity problem: training data are often insufficient for estimating all the parameters, e.g. zero counts • Lexicon size: we don’t have infinitely many words from which to estimate phonotactics • Smoothing: done properly, it has a Bayesian interpretation (though it often is not done properly)
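A short sketch of add-alpha (Laplace) smoothing on segment bigrams, assuming a toy lexicon; the Bayesian reading is that alpha plays the role of a symmetric Dirichlet prior:

```python
# Sketch of add-alpha (Laplace) smoothing for segment bigrams; the Bayesian reading
# is that alpha acts as a symmetric Dirichlet prior. Toy data assumed.
from collections import Counter, defaultdict

lexicon = ["blik", "sand", "band"]
alphabet = sorted(set("#" + "".join(lexicon)))
alpha = 1.0   # pseudo-count per bigram

counts = defaultdict(Counter)
for word in lexicon:
    padded = "#" + word + "#"
    for prev, nxt in zip(padded, padded[1:]):
        counts[prev][nxt] += 1

def p_smoothed(prev, nxt):
    """Posterior-mean estimate: (count + alpha) / (total + alpha * |alphabet|)."""
    total = sum(counts[prev].values())
    return (counts[prev][nxt] + alpha) / (total + alpha * len(alphabet))

print(p_smoothed("b", "n"))   # nonzero even though "bn" never occurred
```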
Probability and well-formedness • Generative modeling: characterize a distribution over strings • Why should we care about this distribution? • Hope: it may have something to do with grammaticality judgements • But: judgements are also affected by what other words “sound like” • The puzzle of mrupect/mrupation • It may be easier to model a function with input = string, output = judgements
Bailey and Hahn • Tried all kinds of ways of calculating phonotactics and neighborhood density, and checked which combination “works the best” • Typical reasoning: “metrics X and Y as factors explain 15% of the variance” • Methodology: ANOVA • Model (1-way): data = overall mean + effect + error • What can ANOVA do for us? • How do we check whether ANOVA makes sense? • What is the “explained variance”?
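To make “explained variance” concrete, a small sketch computing eta² = SS_between / SS_total for a one-way layout; the ratings below are invented:

```python
# Sketch of "explained variance" in a one-way layout: eta^2 = SS_between / SS_total.
# The ratings grouped by condition are invented for illustration.
groups = {"high_prob": [5.1, 4.8, 5.4], "low_prob": [3.2, 2.9, 3.5]}

all_vals = [v for vals in groups.values() for v in vals]
grand_mean = sum(all_vals) / len(all_vals)

ss_total = sum((v - grand_mean) ** 2 for v in all_vals)
ss_between = sum(len(vals) * ((sum(vals) / len(vals)) - grand_mean) ** 2
                 for vals in groups.values())

print("explained variance (eta^2):", ss_between / ss_total)
```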
Non-parametric approach to similarity neighborhood • A hint from B&H: the neighborhood model • d_ij is a weighted edit distance • A, B, C, D estimated by polynomial regression • Recall: radial basis functions F(x) = Σ_i a_i K(x, x_i), with K(x, x_i) = e^(−d(x, x_i)) • The quadratic weighting is ad hoc; better to do general nonlinear regression with RBFs
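A hedged sketch of RBF-style regression over strings, using plain (unweighted) edit distance rather than B&H’s weighted distance and quadratic terms; the training nonwords and ratings are made up:

```python
# Sketch of RBF-style regression over strings: F(x) = sum_i a_i * K(x, x_i),
# with K(x, x_i) = exp(-d(x, x_i)) and d = plain (unweighted) edit distance here.
import numpy as np

def edit_distance(a, b):
    d = np.zeros((len(a) + 1, len(b) + 1))
    d[:, 0] = np.arange(len(a) + 1)
    d[0, :] = np.arange(len(b) + 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i, j] = min(d[i-1, j] + 1, d[i, j-1] + 1,
                          d[i-1, j-1] + (a[i-1] != b[j-1]))
    return d[-1, -1]

def kernel(x, y):
    return np.exp(-edit_distance(x, y))

# Toy training set: nonwords with invented wordlikeness ratings.
train_words = ["blik", "bnik", "pand"]
ratings = np.array([5.0, 2.0, 4.5])

# Fit the coefficients a by solving K a = y (small ridge term for stability).
K = np.array([[kernel(w, v) for v in train_words] for w in train_words])
a = np.linalg.solve(K + 1e-6 * np.eye(len(train_words)), ratings)

def predict(word):
    return sum(ai * kernel(word, wi) for ai, wi in zip(a, train_words))

print(predict("blick"))
```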
Non-parametric approach to similarity neighborhood • Recall: RBF as a “soft” neighborhood model • Now think of strings also as data points, with neighborhood defined by some string distance (e.g. edit) • Same kind of regression with RBF
Non-parametric approach to similarity neighborhood • Key technical point: choosing the right kernel • Edit-distance kernel: K(x, x_i) = e^(−edit(x, x_i)) • Sub-string kernel: measuring the length of the common sub-sequence (mrupation) • Key experimental data: controlled stimuli, split into training and test sets (equal phonotactic probability) • No need to transform the rating scale
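A small sketch of a sub-string style similarity, using longest-common-subsequence length as a simple stand-in for a proper sub-string kernel:

```python
# Sketch of a simple sub-string style similarity: length of the longest common
# subsequence, used here as a stand-in for the sub-string kernel idea.
def lcs_length(a, b):
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            cur.append(prev[j-1] + 1 if ca == cb else max(prev[j], cur[j-1]))
        prev = cur
    return prev[-1]

# "mrupation" shares long subsequences with real words even though its
# initial cluster is phonotactically bad.
print(lcs_length("mrupation", "corruption"))
print(lcs_length("mrupect", "respect"))
```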
Non-parametric approach to similarity neighborhood • A whole range of questions opens up from the non-parametric perspective: • Would a yes/no task lead to word “anchors”, like support vectors? • Would new words interact with each other, as in transductive inference? • What type of metric is most appropriate for inferring well-formedness from neighborhoods?
Integration • Hard to integrate with a probabilistic (parametric) model • Neighborhood density has a strongly non-parametric character: it grows with the data • Possible to integrate phonotactic probability into a non-parametric model via kernel algebra • aK_1(x, y) + bK_2(x, y) and K_1(x, y)·K_2(x, y) are also kernels • p kernel: K(x_1, x_2) = Σ_h p(x_1 | h) p(x_2 | h) p(h), where p comes from a parametric model
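A toy sketch of the p kernel, marginalizing a latent lexical stratum from an invented parametric model; by kernel algebra it could then be summed or multiplied with, say, the edit-distance kernel sketched earlier:

```python
# Sketch of the "p kernel": K(x1, x2) = sum_h p(x1|h) p(x2|h) p(h), with h ranging
# over latent lexical strata of a toy parametric model (all numbers invented).
# By kernel algebra, a*K1 + b*K2 or K1*K2 with an edit-distance kernel
# would again be a valid kernel.
strata = {"native": 0.7, "borrowed": 0.3}                     # p(h)
seg_probs = {"native":   {"b": 0.2, "l": 0.2, "i": 0.3, "k": 0.3},
             "borrowed": {"b": 0.1, "n": 0.2, "i": 0.35, "k": 0.35}}

def p_word_given_h(word, h):
    """Bag-of-sounds likelihood of a word under stratum h."""
    p = 1.0
    for seg in word:
        p *= seg_probs[h].get(seg, 0.0)
    return p

def p_kernel(x1, x2):
    """Similarity of two strings via the latent stratum they might share."""
    return sum(p_word_given_h(x1, h) * p_word_given_h(x2, h) * ph
               for h, ph in strata.items())

print(p_kernel("blik", "blik"), p_kernel("blik", "bnik"))
```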