LING 696B: Mixture model and its applications in category learning
Recap from last time • G&G model: a self-organizing map (neural net) that does unsupervised learning • Non-parametric approach: encoding the stimulus distribution with a large number of connection weights
Question from last time • Scaling up to the speech that infants hear: higher dimensions? • Extending the G&G network in Problem 3 to Maye & Gerken’s data? • Speech segmentation: going beyond static vowels? • Model behavior: parameter tuning, starting points, degree of model fitting, when to stop …
Today’s agenda • Learning categories from distributions (Maye & Gerken) • Basic ideas of statistical estimation, consistency, maximum likelihood • Mixture model, learning with the Expectation-Maximization algorithm • Refinement of mixture model • Application in the speech domain
Learning categories from minimal pairs • Idea going back as early as Jakobson (1941): • knowing /bin/~/pin/ implies [voice] as a distinctive feature • [voice] differentiates /b/ and /p/ as two categories of English • Moreover, this predicts the order in which categories are learned • Completely falsified? (small project) • Obvious objection: early words don’t include many minimal pairs
Maye & Gerken, 00 • Categories can be learned from statistics, just as statistics can be learned from sequences • Choice of artificial contrast: English d and (s)t • Small difference in voicing and F0 • Main difference: F1, F2 onset
Detecting the d~(s)t contrast in Pegg and Werker, 97 • Most adults can do this, but not as well as a native contrast • 6-8-month-olds do much better than 10-12-month-olds • (Need more than distributional learning?)
Maye & Gerken, 00 • Training on monomodal vs. bimodal distributions • Both groups heard the same number of stimuli
Maye & Gerken, 00 • Results from Maye’s thesis
Maye, Gerken & Werker, 02 • Similar experiment done more carefully on infants • Preferential looking time • Alternating and non-alternating trials
Maye, Gerken & Werker, 02 • Bimodal-trained infants look longer at alternating trials than at non-alternating trials (difference significant for the bimodal group; not significant for the monomodal group)
Reflections • The dimension in which the bimodal distribution differs from the monomodal one is abstract • The shape of the distribution is also hard to characterize • Adults/infants are not told what categories there are to learn • Neither do they know how many categories to learn • Machine learning does not have satisfying answers to all these questions
Statistical estimation • Basic setup: • The world: distributions p(x; θ), where θ is a set of free parameters -- “all models may be wrong, but some are useful” • Given parameters θ, p(x; θ) tells us how to calculate the probability of x (also referred to as the “likelihood” p(x | θ)) • Observations: X = {x1, x2, …, xN} generated from some p(x; θ); N is the number of observations • Model-fitting: based on the examples X, make guesses (learning, inference) about θ
Statistical estimation • Example: • Assuming people’s height follows a normal distribution with parameters θ = (mean, variance) • p(x; θ) = the probability density function of the normal distribution • Observation: measurements of people’s height • Goal: estimate the parameters of the normal distribution
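As a concrete illustration of this setup, here is a minimal Python sketch (using numpy, with made-up height measurements and an arbitrary candidate θ) of how p(x; θ) assigns a likelihood to each observation under one choice of parameters:

```python
import numpy as np

# Hypothetical height measurements in cm (made-up numbers for illustration).
heights = np.array([158.0, 162.5, 171.0, 175.5, 180.0])

def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# p(x; theta) for one candidate setting of theta = (mean, variance).
theta = (170.0, 64.0)
print(gaussian_pdf(heights, *theta))           # likelihood of each observation
print(np.prod(gaussian_pdf(heights, *theta)))  # joint likelihood of the sample
```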
Statistical estimation: Hypothesis space matters • Example: curve fitting with polynomials
Criterion of consistency • Many model-fitting criteria: • Least squares • Minimal classification errors • Measures of divergence, etc. • Consistency: as you get more and more data x1, x2, …, xN (N → ∞), your model-fitting procedure should produce an estimate that is closer and closer to the true θ that generates X.
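A quick way to see consistency in action is to simulate it: under an assumed “true” Gaussian (the parameter values below are invented for illustration), the estimate of the mean drifts toward the true mean as N grows. A rough numpy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
true_mean, true_std = 170.0, 8.0   # assumed "true" parameters (illustrative only)

# As N grows, the estimate (here the sample mean) should approach the truth.
for n in [10, 100, 1000, 10_000, 100_000]:
    x = rng.normal(true_mean, true_std, size=n)
    print(n, abs(x.mean() - true_mean))   # estimation error shrinks with N
```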
Maximum likelihood estimate (MLE) • Likelihood function: the examples xi are independent of one another, so L(θ) = p(x1; θ) · p(x2; θ) · … · p(xN; θ) • Among all possible values of θ, choose the θ̂ that makes L(θ) the biggest: θ̂ = argmaxθ L(θ) • MLE is consistent
MLE for Gaussian distributions • Parameters: mean μ and variance σ² • Distribution function: p(x; μ, σ²) = exp(−(x − μ)² / (2σ²)) / √(2πσ²) • MLE: μ̂ = (1/N) Σi xi and σ̂² = (1/N) Σi (xi − μ̂)² • Exercise: derive this result in 2 dimensions
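The closed-form estimates above are easy to check numerically. The following sketch (simulated data; the assumed true values 170 and 8 are chosen only for illustration) computes the MLE by plugging in the formulas:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=170.0, scale=8.0, size=5000)  # simulated "heights"

# Closed-form maximum likelihood estimates for a Gaussian:
# mean = sample average, variance = average squared deviation (divide by N, not N - 1).
mu_hat = x.mean()
var_hat = np.mean((x - mu_hat) ** 2)
print(mu_hat, var_hat)   # should be close to 170 and 64
```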
Mixture of Gaussians • An extension of Gaussian distributions to handle data containing categories • Example: a mixture of 2 Gaussian distributions • More concrete example: male and female heights follow two different distributions, but we don’t know the gender from which each measurement comes
Mixture of Gaussians • More parameters • Parameters of the two Gaussians: (μ1, σ1) and (μ2, σ2) -- two categories • The “mixing” proportion ω, with 0 ≤ ω ≤ 1 • How are data generated? • Flip a coin that comes up heads with probability ω • If it lands heads, generate an example from the first Gaussian; otherwise generate one from the second
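The generative story just described can be written down almost literally. A small numpy sketch, with invented category means, standard deviations, and mixing proportion ω:

```python
import numpy as np

rng = np.random.default_rng(2)
omega = 0.5                        # mixing proportion (coin bias), illustrative value
mu = [160.0, 176.0]                # assumed category means
sigma = [6.0, 7.0]                 # assumed category standard deviations

def sample_mixture(n):
    """Generate n points: flip a coin, then draw from the chosen Gaussian."""
    heads = rng.random(n) < omega              # True -> first Gaussian
    comp = np.where(heads, 0, 1)               # component index for each point
    return rng.normal(np.take(mu, comp), np.take(sigma, comp)), comp

x, labels = sample_mixture(1000)
```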
Maximum likelihood: Supervised learning • Seeing data x1, x2, …, xN (height) as well as their category membership y1, y2, …, yN (male or female) • MLE: • For each Gaussian, estimate (μk, σk) from the members of category k • Mixing proportion: ω̂ = (number of category-1 examples) / N
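When the labels yi are observed, the estimation really does decompose as described: one Gaussian fit per category plus a simple count for the mixing proportion. A minimal sketch with simulated, labeled “heights” (all numbers made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
# Simulated heights with observed category labels (1 and 2; values are invented).
x = np.concatenate([rng.normal(160.0, 6.0, 500), rng.normal(176.0, 7.0, 500)])
y = np.concatenate([np.full(500, 1), np.full(500, 2)])

# Per-category MLE: mean and variance of that category's members.
for k in (1, 2):
    xk = x[y == k]
    print(k, xk.mean(), np.mean((xk - xk.mean()) ** 2))

# Mixing proportion: share of category-1 examples.
print("omega_hat =", np.mean(y == 1))
```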
Maximum likelihood: Unsupervised learning • Only seeing data x1, x2, …, xN; no idea about the category memberships y1, …, yN or the mixing proportion ω • Must estimate θ based on X only • Key idea: relate this problem to the supervised learning case
The K-means algorithm • Clustering algorithm for designing “codebooks” (vector quantization) • Goal: dividing data into K clusters and representing each cluster by its center • First: random guesses about cluster membership (among 1,…,K)
The K-means algorithm • Then iterate: • Update the center of each cluster to the mean of the data currently assigned to it • Re-assign each datum to the cluster with the closest center • After some iterations, the assignments no longer change
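A minimal 1-D implementation of this loop might look as follows (a sketch, not a robust clustering routine; it assumes no cluster ever ends up empty):

```python
import numpy as np

def kmeans(x, k, n_iter=50, seed=0):
    """Minimal 1-D K-means: alternate mean updates and nearest-center assignment."""
    rng = np.random.default_rng(seed)
    assign = rng.integers(0, k, size=len(x))          # random initial memberships
    for _ in range(n_iter):
        # Update each center to the mean of its current members (assumes none empty).
        centers = np.array([x[assign == j].mean() for j in range(k)])
        # Re-assign each datum to the closest center.
        new_assign = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        if np.array_equal(new_assign, assign):        # assignments stopped changing
            break
        assign = new_assign
    return centers, assign

# Toy data from a two-Gaussian mixture with mixing proportion 0.5.
rng = np.random.default_rng(4)
x = np.concatenate([rng.normal(-2.0, 1.0, 500), rng.normal(2.0, 1.0, 500)])
print(kmeans(x, k=2)[0])   # the two recovered centers
```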
K-means demo • Data generated from mixture of 2 Gaussians with mixing proportion 0.5
Why does K-means work? • In the beginning the centers are poorly chosen, so the clusters overlap a lot • But if the centers move away from each other, the clusters tend to separate better • Conversely, if the clusters are well separated, the centers will stay away from each other • Intuitively, the two steps “help each other”
Expectation-Maximization algorithm • Replacing the “hard” assignments in K-means with “soft” assignments • Hard: (0, 1) or (1, 0) • Soft: (p(/t/ | x), p(/d/ | x)), e.g. (0.5, 0.5)
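The soft assignment is just Bayes’ rule applied under the current parameter guesses. A small sketch with hypothetical /t/- and /d/-like Gaussians (all parameter values invented):

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Hypothetical current guesses for the two categories and the mixing proportion.
mu_t, var_t = 20.0, 16.0     # a /t/-like category (made-up units)
mu_d, var_d = 10.0, 16.0     # a /d/-like category
omega = 0.5

def soft_label(x):
    """Posterior (p(/t/ | x), p(/d/ | x)) under the current mixture parameters."""
    pt = omega * gaussian_pdf(x, mu_t, var_t)
    pd = (1 - omega) * gaussian_pdf(x, mu_d, var_d)
    return pt / (pt + pd), pd / (pt + pd)

print(soft_label(15.0))   # ambiguous token: exactly (0.5, 0.5) here
print(soft_label(22.0))   # clearly /t/-like: first weight near 1
```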
Expectation-Maximization algorithm • Initial guesses: parameters θ0 for the two Gaussians and a mixing proportion ω0 = 0.5; all category labels unknown
Expectation-Maximization algorithm • Expectation step: stick in “soft” labels -- a pair (wi, 1 − wi) for each example, e.g. [0.5 t, 0.5 d], [0.3 t, 0.7 d], [0.1 t, 0.9 d]
Expectation-Maximization algorithm • Maximization step: go back and update the model θ1 with maximum likelihood, weighting each example by its soft labels, e.g. ω1 = (0.5 + 0.3 + 0.1) / 3 = 0.3
Common intuition behind K-means and EM • The labels are important, yet not observable -- “hidden variables” / “missing data” • Strategy: make probability-based guesses, then iterate guess-and-update steps until convergence • K-means: hard guesses among 1, …, K • EM: soft guesses (w1, …, wK), with w1 + … + wK = 1
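Putting the E- and M-steps together for a two-component 1-D Gaussian mixture gives a loop only slightly longer than K-means. A minimal numpy sketch (not the exact formulation used in the lecture materials, and with no safeguards against degenerate variances):

```python
import numpy as np

def em_mixture2(x, n_iter=100, seed=0):
    """Minimal EM for a mixture of two 1-D Gaussians (a sketch, no safeguards)."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=2, replace=False)   # crude initial means: two random data points
    var = np.array([x.var(), x.var()])          # start both variances at the overall variance
    omega = 0.5                                 # initial mixing proportion
    for _ in range(n_iter):
        # E-step: soft labels w[i] = p(component 1 | x[i]) under the current parameters.
        p1 = omega * np.exp(-(x - mu[0]) ** 2 / (2 * var[0])) / np.sqrt(2 * np.pi * var[0])
        p2 = (1 - omega) * np.exp(-(x - mu[1]) ** 2 / (2 * var[1])) / np.sqrt(2 * np.pi * var[1])
        w = p1 / (p1 + p2)
        # M-step: maximum-likelihood updates, weighting each example by its soft labels.
        omega = w.mean()
        mu = np.array([np.average(x, weights=w), np.average(x, weights=1 - w)])
        var = np.array([np.average((x - mu[0]) ** 2, weights=w),
                        np.average((x - mu[1]) ** 2, weights=1 - w)])
    return mu, var, omega

# Toy data: two well-separated Gaussians mixed in equal proportion.
rng = np.random.default_rng(5)
x = np.concatenate([rng.normal(-2.0, 1.0, 500), rng.normal(2.0, 1.0, 500)])
print(em_mixture2(x))   # recovered means, variances, and mixing proportion
```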
Thinking of this as an exemplar-based model • Johnson’s (1997) exemplar model of categories: • When a new stimulus comes in, its membership is jointly determined by all previously memorized exemplars -- this is the E-step • After the new stimulus is memorized, the “weight” of each exemplar is updated -- this is the M-step
Convergence guarantee of EM • E-step: choosing a lower bound of L(θ) at the current parameter estimate
Convergence guarantee of EM • M-step: finding the maximum of this lower bound, which is always ≤ L(θ)
Convergence guarantee of EM • E-step again: construct a new lower bound at the updated parameters -- each E/M cycle can only increase L(θ)
Local maxima • What if you start here?
Overcoming local maxima: Multiple starting points
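One way to implement this: run EM from several random starting points and keep the fit with the highest log-likelihood. The sketch below assumes the em_mixture2 function from the earlier EM sketch is already defined; the data are simulated for illustration:

```python
import numpy as np

def log_likelihood(x, mu, var, omega):
    """Log-likelihood of the data under a two-component 1-D Gaussian mixture."""
    p1 = omega * np.exp(-(x - mu[0]) ** 2 / (2 * var[0])) / np.sqrt(2 * np.pi * var[0])
    p2 = (1 - omega) * np.exp(-(x - mu[1]) ** 2 / (2 * var[1])) / np.sqrt(2 * np.pi * var[1])
    return np.sum(np.log(p1 + p2))

rng = np.random.default_rng(6)
x = np.concatenate([rng.normal(-2.0, 1.0, 500), rng.normal(2.0, 1.0, 500)])

# Run EM from several random starting points and keep the best fit.
fits = [em_mixture2(x, seed=s) for s in range(10)]
best = max(fits, key=lambda fit: log_likelihood(x, *fit))
print(best)
```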
Overcoming local maxima: Model refinement • Guessing 6 categories at once is hard, but 2 is easy • Hill-climbing strategy: start with 2 categories, then 3, 4, … • Implementation: split the cluster that gives the maximum gain in likelihood • Intuition: discriminate within the biggest pile
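A rough sketch of this hill-climbing strategy for 1-D data: fit K components with EM, then try splitting each component in turn and keep the split that raises the log-likelihood most. The initialization choices below (starting from a single component, splitting a component at its mean ± one standard deviation, halving its mixing weight) are illustrative assumptions, not the specific implementation referred to in the slide:

```python
import numpy as np

def em_k(x, mu, var, omega, n_iter=50):
    """EM for a K-component 1-D Gaussian mixture, started from the given parameters."""
    for _ in range(n_iter):
        # E-step: responsibilities w[i, j] = p(component j | x[i]).
        dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        w = omega * dens
        w /= w.sum(axis=1, keepdims=True)
        # M-step: weighted maximum-likelihood updates.
        omega = w.mean(axis=0)
        mu = (w * x[:, None]).sum(axis=0) / w.sum(axis=0)
        var = (w * (x[:, None] - mu) ** 2).sum(axis=0) / w.sum(axis=0)
    dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    loglik = np.log((omega * dens).sum(axis=1)).sum()
    return mu, var, omega, loglik

def grow_by_splitting(x, k_max):
    """Start from one component and repeatedly split the component whose split
    gives the largest gain in log-likelihood (hill climbing over model sizes)."""
    mu, var, omega = np.array([x.mean()]), np.array([x.var()]), np.array([1.0])
    mu, var, omega, loglik = em_k(x, mu, var, omega)
    while len(mu) < k_max:
        candidates = []
        for j in range(len(mu)):   # try splitting component j into two halves
            mu_c = np.append(np.delete(mu, j),
                             [mu[j] - np.sqrt(var[j]), mu[j] + np.sqrt(var[j])])
            var_c = np.append(np.delete(var, j), [var[j], var[j]])
            om_c = np.append(np.delete(omega, j), [omega[j] / 2, omega[j] / 2])
            candidates.append(em_k(x, mu_c, var_c, om_c))
        mu, var, omega, loglik = max(candidates, key=lambda fit: fit[3])
    return mu, var, omega, loglik

# Toy data with three clear categories.
rng = np.random.default_rng(7)
x = np.concatenate([rng.normal(m, 1.0, 300) for m in (-6.0, 0.0, 6.0)])
print(grow_by_splitting(x, k_max=3))
```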