The EM algorithm
Lecture #11
Acknowledgement: Some slides of this lecture are due to Nir Friedman.
Expectation Maximization (EM) for Bayesian networks
Intuition (as before):
• When we have access to all counts, we can find the ML estimate of all parameters in all local tables directly by counting.
• However, missing values do not allow us to perform such counts.
• So instead, we compute the expected counts using the current parameter assignment, and then use them to compute the maximum likelihood estimate.
[Slide figure: a Bayesian network over A, B, C, D with local tables P(A=a|θ), P(B=b|A=a,θ), P(C=c|A=a,θ), P(D=d|b,c,θ).]
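To make the counting intuition concrete, here is a minimal sketch (not from the slides) of one expected-counts round for a tiny two-node network A → B with some values of B missing; the data and parameter values are illustrative.

```python
# A minimal sketch of "expected counts" for a two-node network A -> B with
# binary variables. The numbers below are illustrative, not from the lecture.
from collections import defaultdict

# Current parameter guesses for P(B=1 | A=a); A is always observed here.
p_b1_given_a = {0: 0.3, 1: 0.8}

# Data: (a, b) pairs; None marks a missing value for B.
data = [(1, 1), (1, None), (0, 0), (1, 0), (0, None)]

# E-step: accumulate expected counts N(A=a, B=b).
counts = defaultdict(float)
for a, b in data:
    if b is not None:
        counts[(a, b)] += 1.0          # observed: a hard count of 1
    else:
        p1 = p_b1_given_a[a]           # posterior of the missing B given A=a
        counts[(a, 1)] += p1           # fractional (expected) counts
        counts[(a, 0)] += 1.0 - p1

# M-step: re-estimate P(B=1 | A=a) from the expected counts, as if they were real.
for a in (0, 1):
    total = counts[(a, 0)] + counts[(a, 1)]
    print(f"P(B=1 | A={a}) = {counts[(a, 1)] / total:.3f}")
```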
Expectation Maximization (EM)
[Slide figure: a small worked example. The data are five samples over variables X, Y, Z with missing values (X: H T H H T; Y: ? ? H T T; Z: T ? ? T H). From the current parameters, e.g. P(Y=H|X=H,Z=T,θ) = 0.3 and P(Y=H|X=T,θ) = 0.4, the expected-count tables N(X,Y) and N(X,Z) are filled in with fractional counts.]
EM (cont.)
[Slide figure: the EM loop. Starting from an initial network (G, θ') and the training data, the E-step computes the expected counts N(X1), N(X2), N(X3), N(Y, X1, X2, X3), N(Z1, Y), N(Z2, Y), N(Z3, Y); the M-step reparameterizes, giving an updated network (G, θ); then reiterate.]
Note: This EM iteration corresponds to the non-homogeneous HMM iteration. When parameters are shared across local probability tables or are functions of each other, changes are needed.
EM in Practice
Initial parameters:
• Random parameter setting
• "Best" guess from another source
Stopping criteria:
• Small change in likelihood of the data
• Small change in parameter values
Avoiding bad local maxima:
• Multiple restarts
• Early "pruning" of unpromising ones
Relative Entropy – a measure of difference between distributions
We define the relative entropy H(P||Q) for two probability distributions P and Q of a variable X (with x being a value of X) as follows:
H(P||Q) = Σx P(x) log2 ( P(x) / Q(x) )
This is a measure of the difference between P(x) and Q(x). It is not a symmetric function. The distribution P(x) is taken to be the "true" distribution, and the expectation of the log-ratio is taken with respect to it.
The following properties hold:
• H(P||Q) ≥ 0
• Equality holds if and only if P(x) = Q(x) for all x.
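As an illustration, a short Python sketch of the definition above; the function name and the example distributions are mine, not the lecture's.

```python
# A minimal sketch of the relative entropy H(P||Q) for two discrete
# distributions given as dictionaries over the values of X.
import math

def relative_entropy(P, Q):
    """H(P||Q) = sum over x of P(x) * log2(P(x)/Q(x)); terms with P(x)=0 contribute 0."""
    return sum(p * math.log2(p / Q[x]) for x, p in P.items() if p > 0)

P = {"H": 0.5, "T": 0.5}
Q = {"H": 0.9, "T": 0.1}
print(relative_entropy(P, Q))   # about 0.737, positive since P != Q
print(relative_entropy(Q, P))   # about 0.531, a different value: not symmetric
print(relative_entropy(P, P))   # exactly 0.0
```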
Average Score for sequence comparisons
Recall that we have defined the scoring function via
s(a,b) = log2 [ P(a,b) / (Q(a) Q(b)) ]
Note that the average score Σa,b P(a,b) s(a,b) is the relative entropy H(P||Q), where Q(a,b) = Q(a) Q(b).
Relative entropy also arises when choosing amongst competing models.
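A small numeric check of this claim, with a made-up joint distribution P(a,b) over a two-letter alphabet and background frequencies Q; it verifies that the average log-odds score equals H(P||Q) with Q(a,b) = Q(a)Q(b).

```python
# A small sketch checking that the average log-odds score equals H(P||Q).
# The joint P and background Q below are illustrative, not from the lecture.
import math

Q = {"A": 0.5, "C": 0.5}                       # background letter frequencies
P = {("A", "A"): 0.35, ("A", "C"): 0.15,       # joint frequencies of aligned pairs
     ("C", "A"): 0.15, ("C", "C"): 0.35}

def score(a, b):
    return math.log2(P[(a, b)] / (Q[a] * Q[b]))   # log-odds score s(a,b)

avg_score = sum(P[(a, b)] * score(a, b) for a, b in P)
rel_entropy = sum(p * math.log2(p / (Q[a] * Q[b])) for (a, b), p in P.items())
print(avg_score, rel_entropy)   # identical by construction
```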
The setup of the EM algorithm
We start with a likelihood function parameterized by θ.
The observed quantity is denoted X=x. It is often a vector x1,…,xL of observations (e.g., evidence for some nodes in a Bayesian network).
The hidden quantity is a vector Y=y (e.g., states of unobserved variables in a Bayesian network). The quantity y is defined such that, if it were known, the likelihood of the completed data point P(x,y|θ) would be easy to maximize.
The log-likelihood of an observation x has the form:
log P(x|θ) = log P(x,y|θ) – log P(y|x,θ)
(because P(x,y|θ) = P(x|θ) P(y|x,θ)).
The goal of the EM algorithm
The log-likelihood of ONE observation x has the form:
log P(x|θ) = log P(x,y|θ) – log P(y|x,θ)
The goal: Starting with a current parameter vector θ', EM's goal is to find a new vector θ such that P(x|θ) > P(x|θ'), with the highest possible difference.
The result: After enough iterations, EM reaches a local maximum of the likelihood P(x|θ).
The Expectation Operator
Recall that the expectation of a random variable Y with a pdf P(y) is given by E[Y] = Σy y p(y). The expectation of a function L(Y) is given by E[L(Y)] = Σy p(y) L(y).
The expectation operator E is linear: for two random variables X, Y and constants a, b, we have E[aX+bY] = a E[X] + b E[Y].
An example used by the EM algorithm:
Q(θ|θ') = Eθ'[log p(x,y|θ)] = Σy p(y|x,θ') log p(x,y|θ)
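A tiny sketch of this expectation: the posterior p(y|x,θ') and the completed-data likelihoods P(x,y|θ) below are just made-up numbers for a binary hidden variable Y.

```python
# A tiny sketch of the Q function on this slide, with made-up numbers
# for a hidden variable Y with two values.
import math

# posterior of the hidden variable given the data and current parameters theta'
p_y_given_x_old = {0: 0.3, 1: 0.7}

# completed-data likelihood P(x, y | theta) for some candidate theta
p_xy_new = {0: 0.10, 1: 0.45}

# E[L(Y)] = sum_y p(y) L(y), here with L(y) = log P(x, y | theta):
Q = sum(p_y_given_x_old[y] * math.log(p_xy_new[y]) for y in (0, 1))
print(Q)   # the value of Q(theta | theta')
```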
Improving the likelihood
Starting with log P(x|θ) = log P(x,y|θ) – log P(y|x,θ), multiplying both sides by P(y|x,θ') and summing over y yields
log P(x|θ) = Σy P(y|x,θ') log P(x,y|θ) – Σy P(y|x,θ') log P(y|x,θ)
The first sum is Eθ'[log P(x,y|θ)] = Q(θ|θ').
We now observe that
Δ = log P(x|θ) – log P(x|θ') = Q(θ|θ') – Q(θ'|θ') + Σy P(y|x,θ') log [ P(y|x,θ') / P(y|x,θ) ]
The last sum is a relative entropy, hence ≥ 0, so Δ ≥ Q(θ|θ') – Q(θ'|θ').
So choosing θ* = argmaxθ Q(θ|θ') maximizes this lower bound on the difference Δ (in particular Δ ≥ 0), and repeating this process leads to a local maximum of log P(x|θ).
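To see the identity in action, here is a numeric sanity check on a made-up model with one observed binary X and one hidden binary Y; the parameterization (π = P(Y=1), q0 = P(X=1|Y=0), q1 = P(X=1|Y=1)) is an illustrative choice, not the lecture's.

```python
# Numeric check of: Delta = Q(theta|theta') - Q(theta'|theta') + relative entropy term.
import math

def p_xy(x, y, th):
    """Completed-data likelihood P(x, y | theta) for the toy model."""
    pi, q0, q1 = th
    p_y = pi if y == 1 else 1 - pi
    q = q1 if y == 1 else q0
    return p_y * (q if x == 1 else 1 - q)

def p_x(x, th):
    return sum(p_xy(x, y, th) for y in (0, 1))

def posterior(x, th):
    """P(y | x, theta) for y = 0, 1."""
    z = p_x(x, th)
    return [p_xy(x, y, th) / z for y in (0, 1)]

def Q(th, th_old, x):
    """E_{theta'}[log P(x, y | theta)], expectation under P(y | x, theta')."""
    post = posterior(x, th_old)
    return sum(post[y] * math.log(p_xy(x, y, th)) for y in (0, 1))

x = 1
th_old = (0.5, 0.2, 0.9)     # current parameters theta'
th_new = (0.7, 0.3, 0.8)     # some candidate theta

delta = math.log(p_x(x, th_new)) - math.log(p_x(x, th_old))
post_old, post_new = posterior(x, th_old), posterior(x, th_new)
kl = sum(p * math.log(p / q) for p, q in zip(post_old, post_new))
rhs = Q(th_new, th_old, x) - Q(th_old, th_old, x) + kl
print(delta, rhs)            # the two numbers coincide, as the identity states
```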
The EM algorithm
Input: A likelihood function p(x,y|θ) parameterized by θ.
Initialization: Fix an arbitrary starting value θ'.
Repeat
  E-step: Compute Q(θ|θ') = Eθ'[log P(x,y|θ)]
  M-step: θ' ← argmaxθ Q(θ|θ')
Until Δ = log P(x|θ) – log P(x|θ') < ε
Comment: At the M-step one can actually choose any new θ as long as Δ > 0. This change yields the so-called Generalized EM algorithm. It is important when the argmax is hard to compute.
The EM algorithm (with multiple independent samples)
Recall that the log-likelihood of an observation x has the form:
log P(x|θ) = log P(x,y|θ) – log P(y|x,θ)
For independent samples (xi, yi), i=1,…,m, we can write:
Σi log P(xi|θ) = Σi log P(xi,yi|θ) – Σi log P(yi|xi,θ)
E-step: Compute Q(θ|θ') = Eθ'[Σi log P(xi,yi|θ)] = Σi Eθ'[log P(xi,yi|θ)]
Each sample is completed separately.
M-step: θ' ← argmaxθ Q(θ|θ')
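A sketch of this loop on a small made-up problem: each xi is the number of heads in 10 flips of one of two coins, and which coin was used (yi) is hidden. The E-step completes each sample separately with a posterior responsibility, the M-step has a closed form, and the stopping rule is the small-change-in-likelihood criterion from the previous slide. The data and starting values are illustrative.

```python
# EM for a two-coin mixture: x_i = number of heads in n flips of a hidden coin.
# Parameters: mix = P(coin A), pA = P(heads | coin A), pB = P(heads | coin B).
import math

def binom_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

data, n = [8, 9, 7, 2, 1, 3, 8, 2], 10
mix, pA, pB = 0.5, 0.6, 0.4            # arbitrary starting values theta'

for it in range(50):
    # E-step: complete each sample separately with P(y_i = A | x_i, theta')
    resp = []
    for x in data:
        a = mix * binom_pmf(x, n, pA)
        b = (1 - mix) * binom_pmf(x, n, pB)
        resp.append(a / (a + b))
    # M-step: maximize Q(theta | theta'), which here has a closed form
    mix = sum(resp) / len(data)
    pA = sum(r * x for r, x in zip(resp, data)) / (n * sum(resp))
    pB = sum((1 - r) * x for r, x in zip(resp, data)) / (n * sum(1 - r for r in resp))
    # stopping criterion: small change in the observed-data log-likelihood
    ll = sum(math.log(mix * binom_pmf(x, n, pA) + (1 - mix) * binom_pmf(x, n, pB))
             for x in data)
    if it > 0 and ll - prev_ll < 1e-9:
        break
    prev_ll = ll

print(mix, pA, pB, ll)
```

In practice one would run such a loop from several random starting values and keep the run with the best final log-likelihood, which is the "multiple restarts" heuristic mentioned earlier.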
MLE from Incomplete Data
• Finding MLE parameters: a nonlinear optimization problem.
• Expectation Maximization (EM): use the "current point" to construct an alternative function (which is "nice").
• Guarantee: the maximum of the new function has a higher likelihood than the current point.
[Slide figure: the log-likelihood log P(x|θ) and the surrogate Eθ'[log P(x,y|θ)] plotted as functions of θ.]
Gene Counting Revisited (as EM)
The observations: the variables X = (NA, NB, NAB, NO) with a specific assignment x = (nA, nB, nAB, nO).
The hidden quantity: the variables Y = (Na/a, Na/o, Nb/b, Nb/o) with a specific assignment y = (na/a, na/o, nb/b, nb/o).
The parameters: θ = {θa, θb, θo}.
The likelihood of the completed data of n points:
P(x,y|θ) = P(nAB, nO, na/a, na/o, nb/b, nb/o | θ) = c · (2θaθb)^nAB · (θo²)^nO · (θa²)^na/a · (2θaθo)^na/o · (θb²)^nb/b · (2θbθo)^nb/o
where c is a multinomial coefficient that does not depend on θ.
The E-step of Gene Counting
The likelihood of the hidden data given the observed data of n points:
P(y|x,θ') = P(na/a, na/o, nb/b, nb/o | nA, nB, nAB, nO, θ') = P(na/a, na/o | nA, θ'a, θ'o) · P(nb/b, nb/o | nB, θ'b, θ'o)
Each factor is binomial; for example, the expected counts under θ' are
E[Na/a] = nA · θ'a² / (θ'a² + 2θ'aθ'o),  E[Na/o] = nA – E[Na/a]
(and similarly for Nb/b, Nb/o).
This is exactly the E-step we used earlier!
The M-step of Gene Counting
The log-likelihood of the completed data of n points:
log P(x,y|θ) = nAB log(2θaθb) + 2nO log θo + 2na/a log θa + na/o log(2θaθo) + 2nb/b log θb + nb/o log(2θbθo) + const
Taking the expectation with respect to Y = (Na/a, Na/o, Nb/b, Nb/o) and using the linearity of E yields the function Q(θ|θ') which we need to maximize:
Q(θ|θ') = nAB log(2θaθb) + 2nO log θo + 2E[Na/a] log θa + E[Na/o] log(2θaθo) + 2E[Nb/b] log θb + E[Nb/o] log(2θbθo) + const
where the expectations are taken with respect to P(y|x,θ').
The M-step of Gene Counting (Cont.)
We need to maximize the function Q(θ|θ') above under the constraint θa + θb + θo = 1. The solution (obtained using Lagrange multipliers) is given by
θa = (2E[Na/a] + E[Na/o] + nAB) / (2n)
θb = (2E[Nb/b] + E[Nb/o] + nAB) / (2n)
θo = (2nO + E[Na/o] + E[Nb/o]) / (2n)
which matches the M-step we used earlier!
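Putting the E- and M-steps together, a sketch of the full gene-counting EM under Hardy-Weinberg genotype probabilities; the observed phenotype counts below are illustrative.

```python
# Gene-counting EM for ABO allele frequencies under Hardy-Weinberg proportions.
nA, nB, nAB, nO = 186, 38, 13, 284      # illustrative observed phenotype counts
n = nA + nB + nAB + nO

ta, tb, to = 1/3, 1/3, 1/3              # starting values theta' = (theta_a, theta_b, theta_o)
for _ in range(100):
    # E-step: expected genotype counts hidden inside the A and B phenotypes
    e_aa = nA * ta * ta / (ta * ta + 2 * ta * to)
    e_ao = nA - e_aa
    e_bb = nB * tb * tb / (tb * tb + 2 * tb * to)
    e_bo = nB - e_bb
    # M-step: allele-counting update (each person carries two alleles)
    ta = (2 * e_aa + e_ao + nAB) / (2 * n)
    tb = (2 * e_bb + e_bo + nAB) / (2 * n)
    to = (2 * nO + e_ao + e_bo) / (2 * n)

print(ta, tb, to, ta + tb + to)          # the frequencies sum to 1
```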
Outline for a different derivation of Gene Counting as an EM algorithm
Define a variable X with values xA, xB, xAB, xO. Define a variable Y with values ya/a, ya/o, yb/b, yb/o, ya/b, yo/o. Examine the Bayesian network Y → X.
The local probability table for Y is P(ya/a|θ) = θaθa, P(ya/o|θ) = 2θaθo, etc.
The local probability table for X given Y is P(xA|ya/a,θ) = 1, P(xA|ya/o,θ) = 1, P(xA|yb/o,θ) = 0, etc.; it contains only 0's and 1's.
Homework: write down for yourself the likelihood function for n independent points xi, yi, and check that the EM equations match the gene counting equations.