2010 Winter School on Machine Learning and Vision
Sponsored by the Canadian Institute for Advanced Research and Microsoft Research India
With additional support from the Indian Institute of Science, Bangalore, and The University of Toronto, Canada
Outline
• Approximate inference: Mean field and variational methods
• Learning generative models of images
• Learning 'epitomes' of images
Part A
Approximate inference: Mean field and variational methods
Line processes for binary images (Geman and Geman 1984)
[Figure: a function f scores small binary patterns; example patterns with high f and with low f are shown. Under P, "lines" are probable.]
Part B
Learning Generative Models of Images
Brendan Frey, University of Toronto and Canadian Institute for Advanced Research
Generative models
• Generative models are trained to explain many different aspects of the input image
• Using an objective function like log P(image), a generative model benefits by accounting for all pixels in the image
• Contrast with discriminative models trained in a supervised fashion (eg, object recognition)
• Using an objective function like log P(class|image), a discriminative model benefits by accounting for pixel features that distinguish between classes
What constitutes an "image"?
• Uniform 2-D array of color pixels
• Uniform 2-D array of grey-scale pixels
• Non-uniform images (eg, retinal images, compressed sampling images)
• Features extracted from the image (eg, SIFT features)
• Subsets of image pixels selected by the model (must be careful to represent the universe)
• …
Maximum likelihood learning when all variables are visible (complete data)
• Suppose we observe N IID training cases v^(1), …, v^(N)
• Let θ be the parameters of a model P(v)
• Maximum likelihood estimate of θ:
  θ_ML = argmax_θ ∏_n P(v^(n) | θ) = argmax_θ log( ∏_n P(v^(n) | θ) ) = argmax_θ Σ_n log P(v^(n) | θ)
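To make this concrete, here is a minimal numpy sketch (not from the slides) for the simplest case, a 1-D Gaussian model, where argmax_θ Σ_n log P(v^(n) | θ) has a closed-form solution: the sample mean and the sample variance.

```python
import numpy as np

# Minimal sketch: maximum likelihood for a 1-D Gaussian from complete data.
# The parameters that maximize sum_n log N(v_n; mu, var) are the sample mean
# and the (biased) sample variance.
def gaussian_ml(v):
    mu = v.mean()
    var = ((v - mu) ** 2).mean()
    return mu, var

v = np.random.normal(loc=2.0, scale=1.5, size=1000)   # N IID training cases
mu_ml, var_ml = gaussian_ml(v)
print(mu_ml, var_ml)                                   # close to 2.0 and 1.5**2
```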
Complete data in Bayes nets
• All variables are observed, so P(v|θ) = ∏_i P(v_i | pa_i, θ_i), where pa_i = parents of v_i and θ_i parameterizes P(v_i | pa_i)
• Since argmax() = argmax log(),
  θ_i^ML = argmax_{θ_i} Σ_n log P(v^(n) | θ)
         = argmax_{θ_i} Σ_n Σ_j log P(v_j^(n) | pa_j^(n), θ_j)
         = argmax_{θ_i} Σ_n log P(v_i^(n) | pa_i^(n), θ_i)
  (only the term for variable i depends on θ_i)
• Each child-parent module can be learned separately
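As an illustration of this decoupling (a sketch of mine, assuming a hypothetical two-node binary network A → B), each conditional is estimated from counts of its own child-parent configurations, independently of the others.

```python
import numpy as np

# Minimal sketch: with complete data, P(A) and P(B|A) in the two-node net
# A -> B are estimated from counts of their own (child, parents) configurations.
data = np.array([[0, 0], [0, 1], [1, 1], [1, 1], [0, 0], [1, 0]])   # columns: A, B
A, B = data[:, 0], data[:, 1]

p_A = np.array([np.mean(A == a) for a in (0, 1)])                   # P(A)
p_B_given_A = np.array([[np.mean(B[A == a] == b) for b in (0, 1)]   # P(B|A)
                        for a in (0, 1)])
print(p_A)
print(p_B_given_A)    # each row sums to 1
```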
Example: Learning a Mixture of Gaussians from labeled data
• Recall: For cluster k, the probability density of x is N(x; μ_k, Σ_k), and the probability of cluster k is p(z_k = 1) = π_k
• Complete data: Each training case is a (z_n, x_n) pair; let N_k be the number of cases in class k
• ML estimation: π_k = N_k / N, μ_k = (1/N_k) Σ_{n: z_n = k} x_n, Σ_k = (1/N_k) Σ_{n: z_n = k} (x_n − μ_k)(x_n − μ_k)^T
• That is, just learn one Gaussian for each class of data
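A minimal numpy sketch of these ML estimates for labeled data (my own illustration, with classes labeled 0 … K−1):

```python
import numpy as np

# Minimal sketch: ML estimation of a mixture of Gaussians from labeled
# (complete) data -- one Gaussian is fit to each class separately.
def fit_labeled_mog(x, z, K):
    # x: (N, D) data, z: (N,) integer class labels in {0, ..., K-1}
    N = len(x)
    pi, mu, cov = [], [], []
    for k in range(K):
        xk = x[z == k]
        Nk = len(xk)
        pi.append(Nk / N)                      # pi_k = N_k / N
        mu.append(xk.mean(axis=0))             # mu_k = class mean
        cov.append(np.cov(xk.T, bias=True))    # Sigma_k = class covariance
    return np.array(pi), np.array(mu), np.array(cov)
```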
Example: Learning from complete data, a continuous child with continuous parents
• Estimation becomes a regression-type problem
• Eg, linear Gaussian model: P(v_i | pa_i, θ_i) = N(v_i; w_i0 + Σ_{n: v_n ∈ pa_i} w_in v_n, C_i)
  (mean = linear function of the parents)
• Estimation: linear regression
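A hedged sketch of this estimation step (function and variable names are mine): ordinary least squares gives the weights, and the residual variance gives C_i.

```python
import numpy as np

# Minimal sketch: ML estimation for a continuous child with continuous parents
# under a linear Gaussian model reduces to linear regression of the child on
# its parents (plus a bias), with C_i set to the residual variance.
def fit_linear_gaussian(child, parents):
    # child: (N,) values of v_i;  parents: (N, P) values of pa_i
    X = np.hstack([np.ones((len(child), 1)), parents])    # prepend a bias column
    w, *_ = np.linalg.lstsq(X, child, rcond=None)          # [w_i0, w_i1, ...]
    residual = child - X @ w
    C = residual.var()                                      # noise variance C_i
    return w, C
```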
Learning fully-observed MRFs
• It turns out we can NOT directly estimate each potential using only observations of its variables
• P(v|θ) = ∏_i ϕ(v_{C_i} | θ_i) / ( Σ_v ∏_i ϕ(v_{C_i} | θ_i) )
• Problem: the partition function (the denominator), which couples all of the potentials
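The difficulty is easy to see numerically: even for a tiny binary chain MRF, the denominator sums over every joint configuration, so the potentials cannot be fit one at a time. A brute-force sketch (my own, for illustration only; real models make this sum intractable):

```python
import numpy as np
from itertools import product

# Minimal sketch: partition function of a tiny binary chain MRF by brute force.
# The sum runs over all 2^n configurations, which is what couples the potentials.
def partition_function(pair_potentials, n):
    # pair_potentials: list of 2x2 arrays, one per edge (i, i+1) of a chain of n nodes
    Z = 0.0
    for v in product([0, 1], repeat=n):
        Z += np.prod([phi[v[i], v[i + 1]] for i, phi in enumerate(pair_potentials)])
    return Z

phis = [np.array([[2.0, 1.0], [1.0, 2.0]])] * 3   # 4-node chain, 3 edges favoring agreement
print(partition_function(phis, n=4))
```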
Example: Mixture of K unit-variance Gaussians
• P(x) = Σ_k π_k a exp(−(x − μ_k)²/2), where a = (2π)^(−1/2)
• The log-likelihood to be maximized is Σ_n log( Σ_k π_k a exp(−(x^(n) − μ_k)²/2) )
• The parameters {π_k, μ_k} that maximize this do not have a simple, closed-form solution
• One approach: use a nonlinear optimizer
  • This approach is intractable if the number of components is too large
• A different approach…
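For reference, here is a sketch of how this log-likelihood can be computed stably (the log-sum-exp trick avoids underflow when the exponents are very negative); this is the objective a nonlinear optimizer would climb.

```python
import numpy as np
from scipy.special import logsumexp

# Minimal sketch: log-likelihood of a mixture of K unit-variance Gaussians,
# computed stably in the log domain. There is no closed-form maximizer, but
# this function can be handed to a generic optimizer over mu and pi -- or
# maximized with EM, as described next.
def mog_loglik(x, mu, pi):
    # x: (N,) data, mu: (K,) means, pi: (K,) mixing proportions summing to 1
    a = -0.5 * np.log(2 * np.pi)                        # log of (2*pi)^(-1/2)
    log_comp = np.log(pi)[None, :] + a - 0.5 * (x[:, None] - mu[None, :]) ** 2
    return logsumexp(log_comp, axis=1).sum()            # sum_n log sum_k pi_k N(x_n; mu_k, 1)
```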
The expectation maximization (EM) algorithm (Dempster, Laird and Rubin 1977)
• Learning was more straightforward when the data was complete
• Can we use probabilistic inference (computing P(h|v,θ)) to "fill in" the missing data and then use the learning rules for complete data?
• YES: This is called the EM algorithm
Expectation maximization (EM) algorithm
• Initialize θ (randomly or cleverly)
• E-Step: Compute Q^(n)(h) = P(h | v^(n), θ) for the hidden variables h, given the visible variables v
• M-Step: Holding Q^(n)(h) constant, maximize Σ_n Σ_h Q^(n)(h) log P(v^(n), h | θ) with respect to θ
• Repeat the E and M steps until convergence
• Each iteration increases log P(v|θ) = Σ_n log( Σ_h P(v^(n), h | θ) )
• "Ensemble completion"
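Schematically, the whole algorithm is a simple loop. Below is a generic sketch (my own); the `e_step` and `m_step` arguments stand in for model-specific routines, such as the mixture-of-Gaussians versions sketched later.

```python
# Generic EM driver (a sketch). e_step(v, theta) should return Q(h) = P(h | v, theta)
# for one training case; m_step(data, Qs, theta) should return the theta that
# maximizes sum_n E_{Q_n}[ log P(v_n, h | theta) ].
def em(data, theta, e_step, m_step, n_iters=100):
    for _ in range(n_iters):
        Qs = [e_step(v, theta) for v in data]   # E step: fill in the hidden variables
        theta = m_step(data, Qs, theta)         # M step: re-fit as if data were complete
    return theta
```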
EM in Bayesian networks
• Recall P(v,h|θ) = ∏_i P(x_i | pa_i, θ_i), where x = (v,h)
• Then maximizing Σ_n Σ_h Q^(n)(h) log P(v^(n), h | θ) with respect to θ_i is equivalent to maximizing, for each x_i,
  Σ_n Σ_{x_i, pa_i} Q^(n)(x_i, pa_i) log P(x_i | pa_i, θ_i)
  where Q^(n) puts zero probability on any configuration in which an observed variable x_k differs from its observed value x_k*
• GIVEN the Q-distributions, the conditional P-distributions can be updated independently
EM in Bayesian networks
• E-Step: Compute Q^(n)(x_i, pa_i) = P(x_i, pa_i | v^(n), θ) for each variable x_i
• M-Step: For each x_i, maximize Σ_n Σ_{x_i, pa_i} Q^(n)(x_i, pa_i) log P(x_i | pa_i, θ_i) with respect to θ_i
EM for a mixture of Gaussians
• Recall: For labeled data, γ(z_nk) = z_nk
• Initialization: Pick the μ's, Σ's and π's randomly (but validly)
• E Step: For each training case, we need q(z) = p(z|x) = p(x|z) p(z) / ( Σ_z p(x|z) p(z) )
  Defining γ(z_nk) = q(z_nk = 1), we actually need to compute the responsibilities
  γ(z_nk) = π_k N(x_n; μ_k, Σ_k) / Σ_j π_j N(x_n; μ_j, Σ_j)
  (see the sketch below)
• M Step: Do it in the log-domain!
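A minimal numpy sketch of this E step (my own code), specialized to diagonal covariances Φ_k as in the demo that follows, with the responsibilities computed in the log domain to avoid underflow:

```python
import numpy as np

# Minimal sketch of the E step: responsibilities gamma(z_nk) = q(z_nk = 1) for a
# mixture of Gaussians with diagonal covariances phi_k (per-pixel variances).
def e_step(x, pi, mu, phi):
    # x: (N, D) data; pi: (K,); mu: (K, D); phi: (K, D) diagonal variances
    log_p = (np.log(pi)[None, :]
             - 0.5 * np.sum(np.log(2 * np.pi * phi), axis=1)[None, :]
             - 0.5 * np.sum((x[:, None, :] - mu[None, :, :]) ** 2 / phi[None, :, :], axis=2))
    log_p -= log_p.max(axis=1, keepdims=True)         # stabilize before exponentiating
    gamma = np.exp(log_p)
    return gamma / gamma.sum(axis=1, keepdims=True)   # (N, K) responsibilities
```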
EM for mixture of Gaussians: E step
[Figure/movie: a two-component model with means μ1, μ2, diagonal variances Φ1, Φ2 and mixing proportions π1 = 0.5, π2 = 0.5 is applied to images z from the data set. For successive training images, the E step gives responsibilities such as P(c=1|z) = 0.52 vs P(c=2|z) = 0.48, then 0.51 vs 0.49, 0.48 vs 0.52, and 0.43 vs 0.57.]
EM for mixture of Gaussians: M step
• Set μ1 to the average of z weighted by P(c=1|z); set μ2 to the average of z weighted by P(c=2|z)
• Set Φ1 to the average of diag((z−μ1)(z−μ1)^T) weighted by P(c=1|z); set Φ2 to the average of diag((z−μ2)(z−μ2)^T) weighted by P(c=2|z)
… after iterating to convergence:
[Figure: the learned means μ1, μ2 and variances Φ1, Φ2, with mixing proportions π1 = 0.6, π2 = 0.4.]
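Putting the two steps together, here is a hedged sketch of the corresponding M step and the full EM loop for a K-component, diagonal-covariance mixture. It reuses the `e_step` function sketched above; the initialization and the small variance floor `eps` are my own choices, not from the slides.

```python
import numpy as np

# Minimal sketch of the M step: each mean and variance is a responsibility-weighted
# average, and pi_k is the average responsibility.
def m_step(x, gamma, eps=1e-6):
    Nk = gamma.sum(axis=0)                                     # (K,) effective counts
    pi = Nk / len(x)
    mu = (gamma.T @ x) / Nk[:, None]                           # weighted means
    phi = np.stack([(gamma[:, k:k+1] * (x - mu[k]) ** 2).sum(axis=0) / Nk[k]
                    for k in range(len(Nk))]) + eps            # weighted diagonal variances
    return pi, mu, phi

def fit_mog(x, K, n_iters=50, seed=0):
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)
    mu = x[rng.choice(len(x), K, replace=False)]               # random init from the data
    phi = np.tile(x.var(axis=0) + 1e-6, (K, 1))
    for _ in range(n_iters):
        gamma = e_step(x, pi, mu, phi)                         # E step (sketched earlier)
        pi, mu, phi = m_step(x, gamma)                         # M step
    return pi, mu, phi
```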
Gibbs free energy
• Somehow, we need to move the log() in the expression log( Σ_h P(h,v) ) inside the summation to obtain log P(h,v), which simplifies
• We can do this using Jensen's inequality: for any distribution Q(h),
  log P(v) = log Σ_h Q(h) P(h,v)/Q(h) ≥ Σ_h Q(h) log( P(h,v)/Q(h) )
• The negative of the right-hand side is the free energy:
  F(Q) = − Σ_h Q(h) log P(h,v) + Σ_h Q(h) log Q(h)
Properties of free energy
• F ≥ − log P(v)
• The minimum of F with respect to Q is attained at Q(h) = P(h|v), where F = − log P(v)
• Equivalently, F = − log P(v) + KL( Q(h) || P(h|v) )
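These two properties are easy to verify numerically on a toy model. The sketch below (hypothetical numbers, my own illustration) clamps the visible value so the joint P(h,v) is just two numbers, and evaluates F for several choices of Q, including the exact posterior:

```python
import numpy as np

# Minimal sketch: for a binary hidden variable h with the visible value clamped,
# verify that F(Q) upper-bounds -log P(v) and touches it at Q(h) = P(h|v).
p_hv = np.array([0.3, 0.1])     # P(h=0, v) and P(h=1, v) for the observed v (made-up numbers)
p_v = p_hv.sum()                 # P(v) = sum_h P(h, v)
posterior = p_hv / p_v           # P(h|v)

def free_energy(q):
    # F(Q) = sum_h Q(h) log Q(h) - sum_h Q(h) log P(h, v)
    return np.sum(q * np.log(q)) - np.sum(q * np.log(p_hv))

for q0 in [0.1, 0.5, posterior[0], 0.9]:
    q = np.array([q0, 1 - q0])
    print(q0, free_energy(q), -np.log(p_v))   # F >= -log P(v), with equality at the posterior
```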
Proof that EM maximizes log P(v) (Neal and Hinton 1993)
• E-Step: By setting Q(h) = P(h|v), we make the bound tight, so that F = − log P(v)
• M-Step: By maximizing Σ_h Q(h) log P(h,v) with respect to the parameters of P, we are minimizing F with respect to the parameters of P
• Since − log P_new(v) ≤ F_new ≤ F_old = − log P_old(v), we have log P_new(v) ≥ log P_old(v). ☐
Generalized EM
• M-Step: Instead of minimizing F with respect to P, just decrease F with respect to P
• E-Step: Instead of minimizing F with respect to Q (ie, by setting Q(h) = P(h|v)), just decrease F with respect to Q
• Approximations:
  • Variational techniques (which decrease F with respect to Q); see the mean-field sketch below
  • Loopy belief propagation (note the phrase "loopy")
  • Markov chain Monte Carlo (stochastic …)
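As a small example of the variational option, here is a mean-field sketch for a toy fully connected binary model (my own illustration; the visible variables are assumed to have been absorbed into the biases). Q is restricted to a fully factorized distribution, and each coordinate update is the exact minimizer of F with respect to that factor, so every sweep decreases F.

```python
import numpy as np

# Minimal sketch of a mean-field (variational) E step for a tiny binary model
# P(h) proportional to exp( sum_i b_i h_i + sum_{i<j} W_ij h_i h_j ).
# Q(h) = prod_i q_i^{h_i} (1 - q_i)^{1 - h_i} is fully factorized, and each
# update q_i <- sigmoid(b_i + sum_j W_ij q_j) decreases the free energy F.
def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mean_field(W, b, n_sweeps=50):
    q = np.full(len(b), 0.5)                  # initialize the approximate marginals
    for _ in range(n_sweeps):
        for i in range(len(b)):
            q[i] = sigmoid(b[i] + W[i] @ q)   # coordinate-wise minimization of F
    return q

W = np.array([[0.0, 1.5], [1.5, 0.0]])        # symmetric coupling, zero diagonal
b = np.array([-0.5, 0.2])
print(mean_field(W, b))                        # approximate marginals Q(h_i = 1)
```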
Summary of learning Bayesian networks
• Observed variables decouple learning in different conditional PDFs
• In contrast, hidden variables couple learning in different conditional PDFs
• Learning models with hidden variables entails iteratively filling in the hidden variables using exact or approximate probabilistic inference, and updating every child-parent conditional PDF
Back to… Learning Generative Models of Images
Brendan Frey, University of Toronto and Canadian Institute for Advanced Research
What constitutes an "image"?
• Uniform 2-D array of color pixels
• Uniform 2-D array of grey-scale pixels
• Non-uniform images (eg, retinal images, compressed sampling images)
• Features extracted from the image (eg, SIFT features)
• Subsets of image pixels selected by the model (must be careful to represent the universe)
• …
Experiment: Fitting a mixture of Gaussians to pixel vectors extracted from complicated images
[Figure: results for model sizes of 1, 2, 3 and 4 classes.]
Why didn’t it work? • Is there a bug in the software? • I don’t think so, because the log-likelihood monotonically increases and the software works properly for toy data generated from a mixture of Gaussians • Is there a mistake in our mathematical derivation? • The EM algorithm for a mixture of Gaussians has been studied by many people – I think the math is ok
Why didn’t it work? • Are we missing some important hidden variables? • YES: The location of each object
Transformed mixtures of Gaussians (TMG) (Frey and Jojic, 1999-2001)
[Figure: Bayes net in which the class c is a parent of the latent image z, and z and the shift T are parents of the observed image x.]
• Class: P(c) = π_c (the example shows π_1 = 0.6, π_2 = 0.4)
• Latent image: P(z|c) = N(z; μ_c, Φ_c) (the figure shows μ_c and diag(Φ_c))
• Shift: P(T) over a discrete set of shifts
• Observed image: P(x|z,T) = N(x; Tz, Ψ), with diagonal Ψ
EM for TMG
• E step: Compute Q(T) = P(T|x), Q(c) = P(c|x), Q(c,z) = P(c,z|x) and Q(T,z) = P(T,z|x) for each x in the data
• M step: Set
  • π_c = avg of Q(c)
  • ρ_T = avg of Q(T)
  • μ_c = avg mean of z under Q(z|c)
  • Φ_c = avg variance of z under Q(z|c)
  • Ψ = avg variance of x − Tz under Q(T,z)
(A sketch of the shift posterior computation follows below.)
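A hedged sketch of the shift part of this E step (my own simplification, not the authors' code): the latent image z is integrated out, the transformation set is taken to be 1-D cyclic shifts as a stand-in for 2-D translations, and the prior ρ_T is assumed uniform. Because a shift is a permutation and Φ_c, Ψ are diagonal, the marginal covariance T Φ_c T^T + Ψ stays diagonal.

```python
import numpy as np

# Minimal sketch: joint posterior over (shift T, class c) for one image under TMG,
# with z integrated out: P(x | c, T) = N(x; T mu_c, T Phi_c T^T + Psi).
def tmg_posterior(x, pi, mu, phi, psi, shifts):
    # x: (D,) image as a vector; pi: (K,); mu, phi: (K, D); psi: (D,) noise variances
    # shifts: list of integer cyclic shifts (a 1-D stand-in for 2-D translations)
    log_p = np.zeros((len(shifts), len(pi)))
    for t, s in enumerate(shifts):
        mu_t = np.roll(mu, s, axis=1)                  # T mu_c
        var_t = np.roll(phi, s, axis=1) + psi          # T Phi_c T^T + Psi (still diagonal)
        log_p[t] = (np.log(pi)
                    - 0.5 * np.sum(np.log(2 * np.pi * var_t), axis=1)
                    - 0.5 * np.sum((x - mu_t) ** 2 / var_t, axis=1))
    log_p += -np.log(len(shifts))                      # uniform prior rho_T (assumed)
    log_p -= log_p.max()
    p = np.exp(log_p)
    return p / p.sum()                                 # (num shifts, K): P(T, c | x)
```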
Experiment: Fitting transformed mixtures of Gaussians to complicated images (random initialization)
[Figure: results for model sizes of 1, 2, 3 and 4 classes.]
Let’s peek into the Bayes net (different movie) P(c|x) margmaxcP(c|x) argmaxTP(T|x) E[z|x] E[Tz|x] x