
LING 696B: Mixture model and its applications in category learning


Presentation Transcript


  1. LING 696B: Mixture model and its applications in category learning

  2. Recap from last time • G&G model: a self-organizing map (neural net) that does unsupervised learning • Non-parametric approach: encodes the stimulus distribution in a large number of connection weights

  3. Question from last time • Scaling up to the speech that infants hear: higher dimensions? • Extending the G&G network in Problem 3 to Maye&Gerken’s data? • Speech segmentation: going beyond static vowels? • Model behavior: parameter tuning, starting points, degree of model fitting, when to stop …

  4. Today’s agenda • Learning categories from distributions (Maye & Gerken) • Basic ideas of statistical estimation, consistency, maximum likelihood • Mixture model, learning with the Expectation-Maximization algorithm • Refinement of mixture model • Application in the speech domain

  5. Learning categories after minimal pairs • An idea going back as early as Jakobson (1941): • Knowing /bin/ ~ /pin/ implies that [voice] is a distinctive feature • [voice] differentiates /b/ and /p/ as two categories of English • Moreover, this predicts the order in which categories are learned • Completely falsified? (small project) • Obvious objection: early words don’t include many minimal pairs

  6. Maye & Gerken, 00 • Categories can be learned from statistics, just as statistics can be learned from sequences • Choice of artificial contrast: English d and (s)t • Small difference in voicing and F0 • Main difference: F1, F2 onset

  7. Detecting the d~(s)t contrast in Pegg and Werker, 97 • Most adults can do this, but not as well as for a native contrast • 6-8-month-olds do much better than 10-12-month-olds • (Need more than distributional learning?)

  8. Maye & Gerken, 00 • Training on monomodal vs. bimodal distributions • Both groups heard the same number of stimuli

  9. Maye & Gerken, 00 • Results from Maye thesis:

  10. Maye, Gerken & Werker, 02 • Similar experiment done more carefully on infants • Preferential looking time • Alternating and non-alternating trials • [figure: examples of alternating and non-alternating trials]

  11. Maye, Gerken & Werker, 02 • Bimodal-trained infants look longer at alternating trials than at non-alternating trials • [figure: looking-time difference significant for the bimodal group, not significant for the monomodal group]

  12. Reflections • The dimension along which the bimodal distribution differs from the monomodal one is abstract • The shape of the distribution is also hard to characterize • Adults/infants are not told what categories there are to learn • Nor do they know how many categories to learn • Machine learning does not have satisfying answers to all these questions

  13. Statistical estimation • Basic setup: • The world: distributions p(x; θ), where θ is a set of free parameters -- “all models may be wrong, but some are useful” • Given parameter θ, p(x; θ) tells us how to calculate the probability of x (also referred to as the “likelihood” p(x | θ)) • Observations: X = {x1, x2, …, xN} generated from some p(x; θ). N is the number of observations • Model-fitting: based on some examples X, make guesses (learning, inference) about θ

  14. Statistical estimation • Example: • Assuming people’s height follows a normal distribution, θ = (mean, var) • p(x; θ) = the probability density function of the normal distribution • Observation: measurements of people’s height • Goal: estimate the parameters of the normal distribution
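As a concrete illustration of this setup, here is a minimal Python sketch (the "true" values of 170 cm and 8 cm are hypothetical): the world generates height measurements from a normal distribution, and model-fitting recovers θ = (mean, var) from the observations alone.

```python
import numpy as np

rng = np.random.default_rng(0)

# "The world": heights drawn from a normal distribution with hypothetical
# true parameters theta = (mean = 170 cm, var = 8^2).
true_mean, true_sd = 170.0, 8.0
X = rng.normal(true_mean, true_sd, size=500)   # observations x1, ..., xN

# Model-fitting: guess theta from X alone.
est_mean = X.mean()
est_var = X.var()        # the MLE uses 1/N, not the unbiased 1/(N-1)
print(est_mean, est_var)
```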

  15. Statistical estimation: Hypothesis space matters • Example: curve fitting with polynomials

  16. Criterion of consistency • Many model-fitting criteria • Least squares • Minimal classification errors • Measures of divergence, etc. • Consistency: as you get more and more data x1, x2, …, xN (N → ∞), your model-fitting procedure should produce an estimate that is closer and closer to the true θ that generates X.
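A quick sketch of what consistency looks like in practice, reusing the (hypothetical) height example: as N grows, the estimate drifts toward the true θ.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mean, true_sd = 170.0, 8.0

# The estimation error of the sample mean should shrink as N grows.
for n in [10, 100, 1000, 10000]:
    X = rng.normal(true_mean, true_sd, size=n)
    print(n, abs(X.mean() - true_mean))
```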

  17. Maximum likelihood estimate (MLE) • Likelihood function: the examples xi are independent of one another, so L(θ) is the product of the individual likelihoods (see the formula below) • Among all the possible values of θ, choose the θ that makes L(θ) the biggest • The MLE is consistent! • [figure: likelihood curve L(θ) over θ]
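The likelihood formula on this slide did not survive extraction; the standard form implied by the independence assumption is:

```latex
L(\theta) \;=\; p(x_1, \dots, x_N; \theta)
         \;=\; \prod_{i=1}^{N} p(x_i; \theta),
\qquad
\hat{\theta}_{\mathrm{MLE}} \;=\; \arg\max_{\theta} L(\theta).
```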

  18. MLE for Gaussian distributions • Parameters: mean and variance • Distribution function: see below • MLE for the mean and variance: see below • Exercise: derive this result in 2 dimensions
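The slide's equations were images and are missing here; the standard Gaussian density and its maximum-likelihood estimates, presumably what the slide showed, are:

```latex
p(x;\mu,\sigma^2) \;=\; \frac{1}{\sqrt{2\pi\sigma^2}}
   \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),
\qquad
\hat{\mu} \;=\; \frac{1}{N}\sum_{i=1}^{N} x_i,
\qquad
\hat{\sigma}^2 \;=\; \frac{1}{N}\sum_{i=1}^{N}\bigl(x_i - \hat{\mu}\bigr)^2 .
```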

  19. Mixture of Gaussians • An extension of Gaussian distributions to handle data containing categories • Example: a mixture of 2 Gaussian distributions • More concrete example: the heights of males and females follow two different distributions, but we don’t know the gender from which each measurement was made

  20. Mixture of Gaussians • More parameters • Parameters of the two Gaussians: (μ1, σ1) and (μ2, σ2) -- two categories • The “mixing” proportion: 0 ≤ ω ≤ 1 • How are data generated? • Throw a coin with heads-on probability ω • If heads comes up, generate an example from the first Gaussian, otherwise generate from the second
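A minimal Python sketch of this generative procedure (all parameter values are hypothetical, continuing the height example): flip a coin with heads probability ω, then draw from the corresponding Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters for the two categories and the mixing proportion.
mu1, sd1 = 165.0, 6.0      # category 1 (e.g. female heights)
mu2, sd2 = 178.0, 7.0      # category 2 (e.g. male heights)
omega = 0.5                # mixing proportion, 0 <= omega <= 1

def sample_mixture(n):
    """Generate n examples by the coin-flip procedure described on the slide."""
    heads = rng.random(n) < omega            # coin with heads probability omega
    comp1 = rng.normal(mu1, sd1, size=n)     # draws from the first Gaussian
    comp2 = rng.normal(mu2, sd2, size=n)     # draws from the second Gaussian
    return np.where(heads, comp1, comp2)

X = sample_mixture(1000)
```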

  21. Maximum likelihood: Supervised learning • Seeing the data x1, x2, …, xN (height) as well as their category membership y1, y2, …, yN (male or female) • MLE: • For each Gaussian, estimate (mean, variance) based on the members of that category • ω = (number of examples in category 1) / N
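When the category labels are observed, the MLE decomposes into per-category estimates, as in this sketch (data and parameter values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Labeled data: y_i gives the category (1 or 2) of each measurement x_i.
y = (rng.random(1000) < 0.5).astype(int) + 1
X = np.where(y == 1, rng.normal(165, 6, 1000), rng.normal(178, 7, 1000))

# Supervised MLE: estimate each Gaussian from the members of its own category.
for k in (1, 2):
    members = X[y == k]
    print(k, members.mean(), members.var())

omega_hat = np.mean(y == 1)      # (number of category-1 examples) / N
```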

  22. Maximum likelihood: Unsupervised learning • Only seeing the data x1, x2, …, xN, no idea about category membership or ω • Must estimate θ based on X only • Key idea: relate this problem to the supervised learning case

  23. The K-means algorithm • Clustering algorithm for designing “codebooks” (vector quantization) • Goal: dividing data into K clusters and representing each cluster by its center • First: random guesses about cluster membership (among 1,…,K)

  24. The K-means algorithm • Then iterate: • Update the center of each cluster to the mean of the data belonging to that cluster • Re-assign each datum to a cluster based on the shortest distance to the cluster centers • After some iterations, the assignments no longer change (a code sketch follows below)
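The two-step loop on this slide can be written compactly; below is a minimal 1-D sketch (not the lecture's own code) that alternates the center update and the re-assignment.

```python
import numpy as np

def kmeans(X, K, n_iter=20, seed=0):
    """Minimal 1-D K-means: alternate the two steps until assignments settle."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(K, size=len(X))      # random initial cluster memberships
    centers = np.zeros(K)
    for _ in range(n_iter):
        # Update each center to the mean of the data currently assigned to it.
        for k in range(K):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean()
        # Re-assign each datum to the nearest center.
        labels = np.argmin(np.abs(X[:, None] - centers[None, :]), axis=1)
    return centers, labels
```

For instance, running kmeans on the mixture data generated above with K=2 should place the two centers near the two category means.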

  25. K-means demo • Data generated from mixture of 2 Gaussians with mixing proportion 0.5

  26. Why does K-means work? • In the beginning, the centers are poorly chosen, so the clusters overlap a lot • But if centers are moving away from each other, then clusters tend to separate better • Vice versa, if clusters are well-separated, then the centers will stay away from each other • Intuitively, these two steps “help each other”

  27. Expectation-Maximization algorithm • Replacing the “hard” assignments in K-means with “soft” assignments • Hard: (0, 1) or (1, 0) • Soft: (p(/t/ | x), p(/d/ | x)), e.g. (0.5, 0.5) • [figure: candidate categories /t/? and /d/? over unlabeled tokens; ω = ?]

  28. Expectation-Maximization algorithm • Initial guesses: /t/0, /d/0, ω0 = 0.5 • [figure: initial category guesses over still-unlabeled tokens]

  29. Expectation-Maximization algorithm • Expectation step: sticking in “soft” labels -- a pair (wi, 1 - wi) • [figure: ω0 = 0.5; one token labeled (0.5 t, 0.5 d), the others still unlabeled]

  30. Expectation-Maximization algorithm • Expectation step: label each example • [figure: ω0 = 0.5; tokens labeled (0.5 t, 0.5 d) and (0.3 t, 0.7 d), one still unlabeled]

  31. Expectation-Maximization algorithm • Expectation step: label each example • [figure: ω0 = 0.5; all tokens labeled: (0.1 t, 0.9 d), (0.5 t, 0.5 d), (0.3 t, 0.7 d)]

  32. Expectation-Maximization algorithm • Maximization step: going back to update the model with maximum likelihood, weighted by the soft labels • [figure: /t/ re-estimated as /t/1; soft labels (0.1 t, 0.9 d), (0.5 t, 0.5 d), (0.3 t, 0.7 d)]

  33. Expectation-Maximization algorithm • Maximization step: going back to update the model with maximum likelihood, weighted by the soft labels • [figure: both Gaussians re-estimated as /t/1 and /d/1; soft labels (0.1 t, 0.9 d), (0.5 t, 0.5 d), (0.3 t, 0.7 d)]

  34. Expectation-Maximization algorithm • Maximization step: going back to update the model with maximum likelihood • ω1 = (0.5 + 0.3 + 0.1) / 3 = 0.3 • [figure: updated Gaussians /t/1 and /d/1; soft labels (0.1 t, 0.9 d), (0.5 t, 0.5 d), (0.3 t, 0.7 d)]

  35. Common intuition behind K-means and EM • The labels are important, yet not observable -- “hidden variables” / “missing data” • Strategy: make probability-based guesses, and iteratively guess and update until convergence • K-means: hard guess among 1, …, K • EM: soft guess (w1, …, wK), with w1 + … + wK = 1
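Putting slides 27-35 together, here is a minimal EM sketch for a mixture of two 1-D Gaussians (the initialization and iteration count are arbitrary choices of this sketch, not the lecture's):

```python
import numpy as np
from scipy.stats import norm

def em_gmm2(X, n_iter=50):
    """Minimal EM for a 2-component 1-D Gaussian mixture with soft labels."""
    mu = np.array([X.min(), X.max()])          # crude initial guesses
    var = np.array([X.var(), X.var()])
    omega = 0.5                                # initial mixing proportion
    for _ in range(n_iter):
        # E-step: soft label w_i = p(category 1 | x_i) under current parameters.
        p1 = omega * norm.pdf(X, mu[0], np.sqrt(var[0]))
        p2 = (1 - omega) * norm.pdf(X, mu[1], np.sqrt(var[1]))
        w = p1 / (p1 + p2)
        # M-step: maximum-likelihood updates, weighted by the soft labels.
        mu[0] = np.sum(w * X) / np.sum(w)
        mu[1] = np.sum((1 - w) * X) / np.sum(1 - w)
        var[0] = np.sum(w * (X - mu[0]) ** 2) / np.sum(w)
        var[1] = np.sum((1 - w) * (X - mu[1]) ** 2) / np.sum(1 - w)
        omega = w.mean()                       # average of the soft labels
    return mu, var, omega
```

On data from the sample_mixture sketch above, the recovered means, variances, and ω should land close to the generating values when the two categories are reasonably well separated.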

  36. Thinking of this as an exemplar-based model • Johnson (1997)’s exemplar model of categories: • When a new stimulus comes in, its membership is jointly determined by all pre-memorized exemplars -- this is the E-step • After a new stimulus is memorized, the “weight” of each exemplar is updated -- this is the M-step

  37. Convergence guarantee of EM • E-step: finding a lower bound of L(θ) • [figure: L(θ) with the E-step choosing a lower-bounding curve that touches L at the current θ]

  38. Convergence guarantee of EM • M-step: finding the maximum of this lower bound, which is always ≤ L(θ) • [figure: the lower bound maximized over θ]

  39. Convergence guarantee of EM • E-step again: a new lower bound at the updated θ • [figure: L(θ) with the new lower bound]
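The lower bound sketched on slides 37-39 is the standard EM bound from Jensen's inequality (reconstructed here, since the slide formulas did not survive extraction): for any distributions q_i over the hidden labels k,

```latex
\log L(\theta)
  \;=\; \sum_{i=1}^{N} \log \sum_{k} p(x_i, k; \theta)
  \;\ge\; \sum_{i=1}^{N} \sum_{k} q_i(k)\,
          \log \frac{p(x_i, k; \theta)}{q_i(k)} .
```

The E-step sets q_i(k) = p(k | x_i; θ_current), which makes the bound touch log L(θ) at the current θ; the M-step then maximizes the bound over θ, so the likelihood can never decrease from one iteration to the next.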

  40. Local maxima • What if you start here? • [figure: likelihood surface with local and global maxima; arrow marking a bad starting point]

  41. Overcoming local maxima: Multiple starting points • [figure: several runs from multiple starting points on the likelihood surface]

  42. Overcoming local maxima: Model refinement • Guessing 6 categories at once is hard, but 2 is easy • Hill-climbing strategy: start with 2, then 3, 4, … • Implementation: split the cluster with the maximum gain in likelihood • Intuition: discriminate within the biggest pile (a simplified sketch follows below)
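A simplified Python sketch of the hill-climbing idea, using scikit-learn's GaussianMixture (my choice of tool, not the lecture's): grow the number of components one at a time and keep going while the likelihood still improves. The slide's actual refinement is finer-grained, splitting the single cluster that yields the maximum gain in likelihood rather than refitting from scratch.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def grow_mixture(X, max_k=6, min_gain=1e-3):
    """Start with 2 components and add one at a time while the fit improves."""
    X = np.asarray(X, dtype=float).reshape(-1, 1)
    best = GaussianMixture(n_components=2, random_state=0).fit(X)
    for k in range(3, max_k + 1):
        cand = GaussianMixture(n_components=k, random_state=0).fit(X)
        # score() is the average log-likelihood per example.
        if cand.score(X) - best.score(X) < min_gain:
            break                 # no real likelihood gain: stop growing
        best = cand
    return best
```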
