This part of the series focuses on solving computational problems in probabilistic models, including estimating parameters in models with latent variables and computing posterior distributions involving a large number of variables.
Estimation and inference • Actually working with probabilistic models requires solving some difficult computational problems… • Two key problems: • estimating parameters in models with latent variables • computing posterior distributions involving large numbers of variables
Part IV: Inference algorithms • The EM algorithm • for estimation in models with latent variables • Markov chain Monte Carlo • for sampling from posterior distributions involving large numbers of variables
SUPERVISED [figure: a set of example observations, each labeled “dog” or “cat”]
Supervised learning [figure: observations falling into two groups, Category A and Category B] • What characterizes the categories? • How should we categorize a new observation?
Parametric density estimation • Assume that p(x|c) has a simple form, characterized by parameters θ • Given stimuli X = x1, x2, …, xn from category c, find θ by maximum-likelihood estimation or some form of Bayesian estimation
Spatial representations • Assume a simple parametric form for p(x|c): a Gaussian • For each category, estimate parameters • mean μ • variance σ² [figure: graphical model c → x, with c ~ P(c) and x ~ p(x|c)]
The Gaussian distribution • probability density: p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) • mean μ, standard deviation σ, variance σ² [figure: bell curve of p(x) plotted against (x − μ)/σ]
Estimating a Gaussian • X = {x1, x2, …, xn} independently sampled from a Gaussian • maximum likelihood parameter estimates: \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat{\mu})^2
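These estimates are simple to compute directly. A minimal sketch in Python (the function name fit_gaussian and the use of NumPy are illustrative choices, not part of the slides):

```python
import numpy as np

def fit_gaussian(x):
    """Maximum-likelihood estimates for a one-dimensional Gaussian.

    mu_hat is the sample mean; sigma2_hat divides by n (the biased
    ML estimate), not by n - 1.
    """
    x = np.asarray(x, dtype=float)
    mu_hat = x.mean()
    sigma2_hat = ((x - mu_hat) ** 2).mean()
    return mu_hat, sigma2_hat

# Recover known parameters from simulated data
rng = np.random.default_rng(0)
samples = rng.normal(loc=2.0, scale=1.5, size=10_000)
print(fit_gaussian(samples))  # roughly (2.0, 2.25)
```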
Multivariate Gaussians • p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\top}\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right) • mean vector μ, variance/covariance matrix Σ, and the quadratic form (x − μ)ᵀΣ⁻¹(x − μ)
Estimating a Gaussian • X = {x1, x2, …, xn} independently sampled from a multivariate Gaussian • maximum likelihood parameter estimates: \hat{\boldsymbol{\mu}} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{x}_i \qquad \hat{\Sigma} = \frac{1}{n}\sum_{i=1}^{n} (\mathbf{x}_i - \hat{\boldsymbol{\mu}})(\mathbf{x}_i - \hat{\boldsymbol{\mu}})^{\top}
Bayesian inference [figure: probability density over x]
Unsupervised learning What latent structure is present? What are the properties of a new observation?
An example: Clustering Assume each observed xi is from a cluster ci, where ci is unknown What characterizes the clusters? What cluster does a new x come from?
Density estimation • We need to estimate some probability distributions • what is P(c)? • what is p(x|c)? • But… c is unknown, so we only know the value of x [figure: graphical model c → x, with c ~ P(c) and x ~ p(x|c)]
Supervised and unsupervised Supervised learning: categorization • Given x = {x1, …, xn} and c = {c1, …, cn} • Estimate parameters of p(x|c) and P(c) Unsupervised learning: clustering • Given x = {x1, …, xn} • Estimate parameters of p(x|c) and P(c)
Mixture distributions • mixture distribution: p(x) = \sum_{c} P(c)\, p(x|c) • mixture weights: P(c) • mixture components: p(x|c) [figure: mixture density over x formed as a weighted sum of component densities]
More generally… • Unsupervised learning is density estimation using distributions with latent variables • latent (unobserved) z with prior P(z); observed x with P(x|z) • Marginalize out (i.e. sum over) latent structure: P(x) = \sum_{z} P(x|z)\, P(z)
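As a concrete illustration of marginalizing out latent structure, here is a sketch of evaluating the mixture density p(x) = Σ_c P(c) p(x|c) from the previous slide; the two-component parameter values are made up for the example:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical two-component Gaussian mixture
weights = np.array([0.3, 0.7])   # mixture weights P(c)
means = np.array([-2.0, 1.0])    # component means
sds = np.array([0.5, 1.0])       # component standard deviations

def mixture_density(x):
    """p(x) = sum_c P(c) p(x|c): sum over the latent component c."""
    return sum(w * norm.pdf(x, m, s)
               for w, m, s in zip(weights, means, sds))

print(mixture_density(0.0))
```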
A chicken and egg problem • If we knew which cluster the observations were from we could find the distributions • this is just density estimation • If we knew the distributions, we could infer which cluster each observation came from • this is just categorization
Alternating optimization algorithm 0. Guess initial parameter values 1. Given parameter estimates, solve for maximum a posteriori assignments ci: c_i = \arg\max_{c} P(c \mid x_i) = \arg\max_{c} p(x_i \mid c)\, P(c) 2. Given assignments ci, solve for maximum likelihood parameter estimates 3. Go to step 1
Alternating optimization algorithm • x: observations • c: assignments to clusters • μ, σ, P(c): parameters • For simplicity, assume σ and P(c) fixed: the “k-means” algorithm
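A minimal one-dimensional sketch of this alternating scheme, assuming equal mixture weights and a fixed, shared variance so that the MAP assignment reduces to “nearest mean” (the k-means special case named above):

```python
import numpy as np

def kmeans_1d(x, k, n_iters=20, seed=0):
    """Alternate hard assignments (step 1) and mean updates (step 2)."""
    rng = np.random.default_rng(seed)
    means = x[rng.choice(len(x), size=k, replace=False)]  # step 0: init
    for _ in range(n_iters):
        # Step 1: assign each point to the nearest mean (hard assignment)
        c = np.abs(x[:, None] - means[None, :]).argmin(axis=1)
        # Step 2: maximum-likelihood mean of each cluster's points
        for j in range(k):
            if np.any(c == j):
                means[j] = x[c == j].mean()
    return means, c
```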
Alternating optimization algorithm Step 0: initial parameter values
Alternating optimization algorithm Step 1: update assignments
Alternating optimization algorithm Step 2: update parameters
Alternating optimization algorithm Step 1: update assignments
Alternating optimization algorithm Step 2: update parameters
Alternating optimization algorithm 0. Guess initial parameter values 1. Given parameter estimates, solve for maximum a posteriori assignments ci: c_i = \arg\max_{c} p(x_i \mid c)\, P(c) 2. Given assignments ci, solve for maximum likelihood parameter estimates 3. Go to step 1 • why “hard” assignments?
Estimating a Gaussian (with hard assignments) • X = {x1, x2, …, xn} independently sampled, each point assigned to a cluster ci • maximum likelihood parameter estimates for cluster c: \hat{\mu}_c = \frac{1}{n_c}\sum_{i: c_i = c} x_i \qquad \hat{\sigma}_c^2 = \frac{1}{n_c}\sum_{i: c_i = c} (x_i - \hat{\mu}_c)^2 where n_c is the number of points assigned to cluster c
Estimating a Gaussian (with soft assignments) • the “weight” of each point is the probability of being in the cluster, P(c|xi) • maximum likelihood parameter estimates: \hat{\mu}_c = \frac{\sum_i P(c \mid x_i)\, x_i}{\sum_i P(c \mid x_i)} \qquad \hat{\sigma}_c^2 = \frac{\sum_i P(c \mid x_i)\,(x_i - \hat{\mu}_c)^2}{\sum_i P(c \mid x_i)}
The Expectation-Maximization algorithm (clustering version) 0. Guess initial parameter values 1. Given parameter estimates, compute posterior distribution over assignments ci: P(c_i = c \mid x_i) \propto p(x_i \mid c)\, P(c) 2. Solve for maximum likelihood parameter estimates, weighting each observation by the probability it came from that cluster 3. Go to step 1
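A sketch of this clustering version of EM for a one-dimensional mixture of Gaussians; the initialization scheme and iteration count are arbitrary choices for illustration:

```python
import numpy as np
from scipy.stats import norm

def em_mixture_1d(x, k, n_iters=50, seed=0):
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=k, replace=False)  # step 0: initial means
    sigma = np.full(k, x.std())                # initial standard deviations
    pi = np.full(k, 1.0 / k)                   # initial mixture weights P(c)
    for _ in range(n_iters):
        # E-step: responsibilities P(c|x_i), proportional to p(x_i|c) P(c)
        r = pi * norm.pdf(x[:, None], mu, sigma)   # shape (n, k)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: ML estimates with each point weighted by P(c|x_i)
        n_c = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / n_c
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / n_c)
        pi = n_c / len(x)
    return pi, mu, sigma
```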
The Expectation-Maximization algorithm (more general version) 0. Guess initial parameter values 1. Given parameter estimates, compute posterior distribution over latent variables z: P(z \mid x, \theta) (the E-step) 2. Find the parameter estimates that maximize the expected log-likelihood: \theta_{\text{new}} = \arg\max_{\theta} \mathbb{E}_{P(z \mid x, \theta_{\text{old}})}\left[\log P(x, z \mid \theta)\right] (the M-step) 3. Go to step 1
A note on expectations • For a function f(x) and distribution P(x), the expectation of f with respect to P is \mathbb{E}_P[f] = \sum_{x} f(x)\, P(x) • The expectation is the average of f, when x is drawn from the probability distribution P
Good features of EM • Convergence • guaranteed to converge to at least a local maximum of the likelihood (or other stationary point) • likelihood is non-decreasing across iterations • Efficiency • takes big steps initially (other algorithms are better near convergence) • Generality • can be defined for many probabilistic models • can be combined with a prior for MAP estimation
Limitations of EM • Local maxima • e.g., one component poorly fits two clusters, while two components split up a single cluster • Degeneracies • e.g., two components may merge, or a component may lock onto one data point, with its variance going to zero • May be intractable for complex models • dealing with this is an active research topic
EM and cognitive science • The EM algorithm seems like it might be a good way to describe some “bootstrapping” • anywhere there’s a “chicken and egg” problem • a prime example: language learning
Probabilistic context free grammars • rules, with probabilities:
S → NP VP 1.0
NP → T N 0.7
NP → N 0.3
VP → V NP 1.0
T → the 0.8
T → a 0.2
N → man 0.5
N → ball 0.5
V → hit 0.6
V → took 0.4
[figure: parse tree for “the man hit the ball”]
P(tree) = 1.0 × 0.7 × 1.0 × 0.8 × 0.5 × 0.6 × 0.7 × 0.8 × 0.5
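Under this grammar, the probability of a parse tree is just the product of the probabilities of the rules it uses. A sketch of that computation (the tree encoding below is an arbitrary choice for illustration):

```python
# Each rule (left-hand side, right-hand side) maps to its probability
rules = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("T", "N")): 0.7, ("NP", ("N",)): 0.3,
    ("VP", ("V", "NP")): 1.0,
    ("T", ("the",)): 0.8, ("T", ("a",)): 0.2,
    ("N", ("man",)): 0.5, ("N", ("ball",)): 0.5,
    ("V", ("hit",)): 0.6, ("V", ("took",)): 0.4,
}

def tree_prob(tree):
    """P(tree) = product of the probabilities of the rules used."""
    node, children = tree
    if isinstance(children, str):  # preterminal rewriting to a word
        return rules[(node, (children,))]
    p = rules[(node, tuple(child[0] for child in children))]
    for child in children:
        p *= tree_prob(child)
    return p

# Parse tree for "the man hit the ball"
tree = ("S", [("NP", [("T", "the"), ("N", "man")]),
              ("VP", [("V", "hit"),
                      ("NP", [("T", "the"), ("N", "ball")])])])
print(tree_prob(tree))  # 0.04704
```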
EM and cognitive science • The EM algorithm seems like it might be a good way to describe some “bootstrapping” • anywhere there’s a “chicken and egg” problem • a prime example: language learning • Fried and Holyoak (1984) explicitly tested a model of human categorization that was almost exactly a version of the EM algorithm for a mixture of Gaussians
Part IV: Inference algorithms • The EM algorithm • for estimation in models with latent variables • Markov chain Monte Carlo • for sampling from posterior distributions involving large numbers of variables
The Monte Carlo principle • The expectation of f with respect to P can be approximated by \mathbb{E}_P[f] \approx \frac{1}{n}\sum_{i=1}^{n} f(x_i) where the xi are sampled from P(x) • Example: the average number of spots on a die roll
The Monte Carlo principle [figure: the law of large numbers; the average number of spots converges to 3.5 as the number of rolls increases]
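A quick simulation of the die example (sample sizes chosen arbitrarily):

```python
import numpy as np

# Monte Carlo estimate of the expected number of spots on a fair die;
# the exact expectation is (1 + 2 + ... + 6) / 6 = 3.5.
rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=100_000)  # samples from P(x) = 1/6
for n in (10, 100, 1_000, 100_000):
    print(n, rolls[:n].mean())  # the running average approaches 3.5
```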
Markov chain Monte Carlo • Sometimes it isn’t possible to sample directly from a distribution • Sometimes, you can only compute something proportional to the distribution • Markov chain Monte Carlo: construct a Markov chain that will converge to the target distribution, and draw samples from that chain • just uses something proportional to the target
Markov chains • x(1) → x(2) → … → x(t) → x(t+1) → … • Variables x(t+1) independent of all previous variables given immediate predecessor x(t) • Transition matrix T = P(x(t+1)|x(t))
An example: card shuffling • Each state x(t) is a permutation of a deck of cards (there are 52! permutations) • Transition matrix T indicates how likely one permutation is to become another • The transition probabilities are determined by the shuffling procedure • riffle shuffle • overhand • one card
Convergence of Markov chains • Why do we shuffle cards? • Convergence to a uniform distribution takes only 7 riffle shuffles… • Other Markov chains will also converge to a stationary distribution, if certain simple conditions are satisfied (called “ergodicity”) • e.g. every state can be reached in some number of steps from every other state
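Convergence to a stationary distribution can be seen numerically. A sketch with a made-up three-state transition matrix (any ergodic chain would behave similarly):

```python
import numpy as np

# Hypothetical transition matrix: rows are current states, columns are
# next states, and each row sums to 1.
T = np.array([[0.5, 0.4, 0.1],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

p = np.array([1.0, 0.0, 0.0])  # start in state 0 with certainty
for _ in range(50):
    p = p @ T                  # propagate the distribution one step
print(p)       # the stationary distribution
print(p @ T)   # unchanged: p = pT once the chain has converged
```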
Markov chain Monte Carlo • States of chain are variables of interest • Transition matrix chosen to give target distribution as stationary distribution • x(1) → x(2) → … → x(t) → x(t+1) → … • Transition matrix T = P(x(t+1)|x(t))
Metropolis-Hastings algorithm • Transitions have two parts: • proposal distribution: Q(x(t+1)|x(t)) • acceptance: take proposals with probability A(x^{(t)}, x^{(t+1)}) = \min\left(1, \frac{P(x^{(t+1)})\, Q(x^{(t)} \mid x^{(t+1)})}{P(x^{(t)})\, Q(x^{(t+1)} \mid x^{(t)})}\right)
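A minimal Metropolis-Hastings sketch. It uses a symmetric Gaussian random-walk proposal, so the Q terms cancel and the acceptance probability simplifies to min(1, P(x(t+1))/P(x(t))); note that log_p only needs to be known up to an additive constant, i.e. P only up to proportionality:

```python
import numpy as np

def metropolis_hastings(log_p, x0, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis-Hastings for a one-dimensional target."""
    rng = np.random.default_rng(seed)
    x = x0
    samples = []
    for _ in range(n_samples):
        proposal = x + step * rng.normal()    # draw from Q(x(t+1)|x(t))
        log_a = log_p(proposal) - log_p(x)    # log acceptance ratio
        if np.log(rng.random()) < log_a:      # accept with prob min(1, A)
            x = proposal
        samples.append(x)
    return np.array(samples)

# Sample from an unnormalized standard Gaussian: log P(x) = -x^2/2 + const
samples = metropolis_hastings(lambda x: -0.5 * x**2, x0=0.0, n_samples=20_000)
print(samples.mean(), samples.var())  # roughly 0.0 and 1.0
```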