Expectation-Maximization for HMMs and Motif Discovery • Yves Moreau
Overview • The general Expectation-Maximization algorithm • EM interpretation of the Baum-Welch algorithm for the learning of HMMs • MEME for motif discovery
EM algorithm • Maximum likelihood estimation • Let us assume we have an algorithm that tries to optimize the likelihood • Let us look at the change in likelihood between two iterations of the algorithm
EM algorithm • The likelihood is sometimes difficult to compute • We use a simpler generative model based on unobserved data (data augmentation) • We try to integrate out the unobserved data • The expectation can be an integral or a sum
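In the discrete case, with $m$ denoting the unobserved data, integrating it out gives

$$ P(D \mid \theta) \;=\; \sum_{m} P(D, m \mid \theta) \;=\; \sum_{m} P(m \mid \theta)\, P(D \mid m, \theta) $$

(with the sum replaced by an integral when the unobserved data are continuous).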
EM algorithm • Without loss of generality, we work with a sum • Problem: the expression contains the logarithm of a sum • Jensen’s inequality
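For a concave function such as the logarithm, Jensen's inequality states that, for weights $\lambda_m \ge 0$ with $\sum_m \lambda_m = 1$,

$$ \ln \Big( \sum_{m} \lambda_m\, y_m \Big) \;\ge\; \sum_{m} \lambda_m\, \ln y_m $$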
EM algorithm • Application of Jensen’s inequality • This gives a lower bound for the variation of the likelihood
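Choosing the weights $\lambda_m = P(m \mid D, \theta^t)$, where $\theta^t$ is the current parameter estimate, this gives

$$ \ln P(D \mid \theta) - \ln P(D \mid \theta^t) \;\ge\; \sum_{m} P(m \mid D, \theta^t)\, \ln \frac{P(D, m \mid \theta)}{P(D, m \mid \theta^t)} \;=:\; \Delta(\theta \mid \theta^t) $$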
EM algorithm • Let us try to maximize (the bound on) the variation of the likelihood • The terms that do not depend on θ can be dropped, which leads to the function Q below
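What remains is the expected complete-data log-likelihood, which defines the two EM steps:

$$ Q(\theta \mid \theta^t) \;=\; \sum_{m} P(m \mid D, \theta^t)\, \ln P(D, m \mid \theta), \qquad \theta^{t+1} \;=\; \arg\max_{\theta} Q(\theta \mid \theta^t) $$

The Expectation step computes the expectation over the unobserved data with the current parameters $\theta^t$; the Maximization step maximizes the result over $\theta$.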
EM for independent records • If the data set consists of N independent records, we can introduce independent unobserved data • The expectation step (including the use of Jensen’s inequality) takes place “inside” the summation over all records
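For independent records $D_1,\ldots,D_N$ with unobserved data $m_1,\ldots,m_N$, the function $Q$ decomposes into a sum of per-record terms:

$$ Q(\theta \mid \theta^t) \;=\; \sum_{j=1}^{N} \sum_{m_j} P(m_j \mid D_j, \theta^t)\, \ln P(D_j, m_j \mid \theta) $$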
Convergence of the EM algorithm • The likelihood increases monotonically • At an equilibrium point θ* of EM, θ* itself maximizes the bound ln P(D|θ*) + Δ(θ|θ*) • Thus θ* must be a stationary point of the likelihood (because the bound is tangent to the log-likelihood at θ*) • No guarantee for a global optimum (the algorithm often ends up in a local optimum) • In some cases the stationary point is not even a maximum
Why EM? • EM serves to find a maximum likelihood solution • This can also be achieved by gradient ascent (or gradient descent on the negative log-likelihood) • But the gradients of the likelihood P(D|θ) are often difficult to compute • By introducing the unobserved data, the EM algorithm makes the Expectation step easy to compute
Generalized EM • It is not absolutely necessary to maximize Q(θ) exactly at the Maximization step • If Q(θ^{i+1}) ≥ Q(θ^i), convergence is still achieved • This is the generalized EM algorithm • It is applied when the result of the Expectation step is too complex to maximize directly
Hidden Markov Models • [Figure: example HMM for sequence scoring, showing states with transition probabilities and per-state emission probabilities over A, C, G, T]
Hidden Markov Model • In a hidden Markov model, we observe the symbol sequence x but we want to reconstruct the hidden state sequence (path π) • Transition probabilities a_kl (begin state α: a_0l, end state ω: a_k0) • Emission probabilities e_k(b) • Joint probability of the sequence α, x_1, ..., x_L, ω and the path π
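Writing the begin and end of the path as state 0 (so that $a_{0l}$ starts and $a_{k0}$ ends a path, matching α and ω above), the joint probability of a sequence and a path is

$$ P(x,\pi) \;=\; a_{0\pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}}, \qquad \pi_{L+1} = 0 $$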
The forward algorithm • The forward algorithm lets us compute the probability P(x) of a sequence w.r.t. an HMM • This is important for the computation of posterior probabilities and the comparison of HMMs • The sum over all paths (exponentially many) can be computed by dynamic programming • Let us define f_k(i) as the probability of the partial sequence x_1,...,x_i for the paths that end in state k with the emission of symbol x_i • Then we can compute this probability with the recursion below
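With $f_k(i) = P(x_1,\ldots,x_i,\ \pi_i = k)$, the recursion is

$$ f_l(i+1) \;=\; e_l(x_{i+1}) \sum_{k} f_k(i)\, a_{kl}, \qquad f_0(0) = 1, \qquad P(x) \;=\; \sum_{k} f_k(L)\, a_{k0} $$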
The backward algorithm • The backward algorithm lets us compute the probability of the complete sequence together with the condition that symbol x_i is emitted from state k • This is important to compute the probability of a given state at symbol x_i • P(x_1,...,x_i, π_i = k) can be computed by the forward algorithm as f_k(i) • Let us define b_k(i) as the probability of the rest of the sequence for the paths that pass through state k at symbol x_i
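With $b_k(i) = P(x_{i+1},\ldots,x_L \mid \pi_i = k)$, the recursion runs backwards through the sequence, and combining both quantities gives the posterior probability of a state at position $i$:

$$ b_k(i) \;=\; \sum_{l} a_{kl}\, e_l(x_{i+1})\, b_l(i+1), \qquad b_k(L) = a_{k0} $$

$$ P(\pi_i = k \mid x) \;=\; \frac{f_k(i)\, b_k(i)}{P(x)} $$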
EM interpretation of Baum-Welch • We want to estimate the parameters of the hidden Markov model (transition probabilities and emission probabilities) that maximize the likelihood of the sequence(s) • Unobserved data = paths π • EM algorithm (see below)
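With the paths as unobserved data, the function $Q$ of the EM algorithm becomes

$$ Q(\theta \mid \theta^t) \;=\; \sum_{\pi} P(\pi \mid x, \theta^t)\, \ln P(x, \pi \mid \theta) $$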
EM interpretation of Baum-Welch • Let us work out the function Q further • The generative model gives the joint probability of the sequence and the path • Define A_kl(π), the number of times that a given transition gets used in a given path • Define E_k(b,π), the number of times that a given emission is observed for a given sequence and a given path
EM interpretation of Baum-Welch • The joint probability of the sequence and the path can be written as a product over the counts defined above • By taking the logarithm, the function Q becomes a weighted sum of these counts (both expressions are shown below)
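In terms of the counts $A_{kl}(\pi)$ and $E_k(b,\pi)$, the joint probability and, after taking logarithms, the function $Q$ read

$$ P(x,\pi \mid \theta) \;=\; \prod_{k,l} a_{kl}^{\,A_{kl}(\pi)} \prod_{k} \prod_{b} e_k(b)^{\,E_k(b,\pi)} $$

$$ Q(\theta \mid \theta^t) \;=\; \sum_{\pi} P(\pi \mid x, \theta^t) \left[ \sum_{k,l} A_{kl}(\pi)\, \ln a_{kl} \;+\; \sum_{k} \sum_{b} E_k(b,\pi)\, \ln e_k(b) \right] $$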
EM interpretation of Baum-Welch • Define the expected number of times that a transition gets used (independently of the path), A_kl • Define the expected number of times that an emission is observed (independently of the path), E_k(b)
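These expectations are taken over the posterior distribution of the paths:

$$ A_{kl} \;=\; \sum_{\pi} P(\pi \mid x, \theta^t)\, A_{kl}(\pi), \qquad E_k(b) \;=\; \sum_{\pi} P(\pi \mid x, \theta^t)\, E_k(b,\pi) $$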
EM interpretation of Baum-Welch • For the function Q, we have the expression derived above • Since P(x,π|θ) is independent of k and b, we can reorder the sums and use the definitions of A_kl and E_k(b), giving the expression below • Let us now maximize Q w.r.t. θ, i.e., the parameters a_kl and e_k(b)
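After reordering, $Q$ separates into a transition term and an emission term:

$$ Q(\theta \mid \theta^t) \;=\; \sum_{k} \sum_{l} A_{kl}\, \ln a_{kl} \;+\; \sum_{k} \sum_{b} E_k(b)\, \ln e_k(b) $$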
EM interpretation of Baum-Welch • Let us look at the A term • Let us define the following candidate for the optimum • Compare with other parameter choices
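The candidate is the normalized expected count, and the comparison with any other valid choice of transition probabilities can be written as

$$ a^{0}_{kl} \;=\; \frac{A_{kl}}{\sum_{l'} A_{kl'}}, \qquad \sum_{k,l} A_{kl}\, \ln \frac{a^{0}_{kl}}{a_{kl}} \;=\; \sum_{k} \Big( \sum_{l'} A_{kl'} \Big) \sum_{l} a^{0}_{kl}\, \ln \frac{a^{0}_{kl}}{a_{kl}} $$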
EM interpretation of Baum-Welch • The inner sum above has the form of a relative entropy and is therefore always nonnegative • Our candidate thus maximizes the A term • The procedure for the E term is identical
EM interpretation of Baum-Welch • Baum-Welch • Expectation step • Compute the expected number of times that a transition gets used • Compute the expected number of times that an emission is observed • Use the forward and backward algorithm for this • Maximization step • Update the parameters with the normalized counts
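With the forward and backward variables, the expected counts for one training sequence are

$$ A_{kl} \;=\; \frac{1}{P(x)} \sum_{i} f_k(i)\, a_{kl}\, e_l(x_{i+1})\, b_l(i+1), \qquad E_k(b) \;=\; \frac{1}{P(x)} \sum_{i:\ x_i = b} f_k(i)\, b_k(i) $$

For several training sequences, these counts are summed over all sequences.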
EM interpretation of Baum-Welch • For the transitions: normalize the expected transition counts A_kl • For the emissions: normalize the expected emission counts E_k(b)
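In the Maximization step the updated parameters are the normalized expected counts (in practice, small pseudocounts are usually added to avoid zero probabilities):

$$ a_{kl} \;=\; \frac{A_{kl}}{\sum_{l'} A_{kl'}}, \qquad e_k(b) \;=\; \frac{E_k(b)}{\sum_{b'} E_k(b')} $$

The sketch below shows one such Baum-Welch iteration for a single sequence in Python. It is a minimal illustration under simplifying assumptions, not the exact algorithm from the slides: the names (baum_welch_step, pi0, ...) are ours, the begin/end states are replaced by an initial distribution, and no scaling or log-space arithmetic is used, so it only suits short sequences.

```python
# Minimal sketch of one Baum-Welch iteration for a single sequence.
# Assumptions: K states, row-stochastic transition matrix `a` (K x K),
# emission matrix `e` (K x alphabet size), initial distribution `pi0`;
# no pseudocounts and no numerical scaling, for brevity.
import numpy as np

def baum_welch_step(x, a, e, pi0):
    """x: sequence of symbol indices; returns updated (a, e, pi0)."""
    K, L = a.shape[0], len(x)

    # Forward: f[i, k] = P(x_1..x_{i+1}, state = k)
    f = np.zeros((L, K))
    f[0] = pi0 * e[:, x[0]]
    for i in range(1, L):
        f[i] = e[:, x[i]] * (f[i - 1] @ a)

    # Backward: b[i, k] = P(x_{i+2}..x_L | state = k)
    b = np.zeros((L, K))
    b[L - 1] = 1.0
    for i in range(L - 2, -1, -1):
        b[i] = a @ (e[:, x[i + 1]] * b[i + 1])

    px = f[L - 1].sum()  # P(x)

    # Expectation: expected transition counts A and emission counts E
    A = np.zeros((K, K))
    for i in range(L - 1):
        A += np.outer(f[i], e[:, x[i + 1]] * b[i + 1]) * a / px
    gamma = f * b / px                      # posterior state probabilities
    E = np.zeros_like(e)
    for i in range(L):
        E[:, x[i]] += gamma[i]

    # Maximization: normalize the expected counts
    a_new = A / A.sum(axis=1, keepdims=True)
    e_new = E / E.sum(axis=1, keepdims=True)
    return a_new, e_new, gamma[0]
```

Each call performs one Expectation step (forward-backward and expected counts) followed by one Maximization step (normalization); iterating until the parameters stop changing gives the full training loop.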
Combinatorial control • Complex integration of multiple cis-regulatory signals controls gene activity
Iterative motif discovery • Initialization • Sequences • Random motif matrix • Iteration • Sequence scoring • Alignment update • Motif instances • Motif matrix • Termination • Convergence of the alignment and of the motif matrix
MEME • Expectation-Maximization • Data = set of independent sequences • Likelihood = “one occurrence per sequence” model • Parameters = motif matrix (+ background model) • Missing data = alignment
MEME • Sequence scoring (per sequence) • Uniform prior over the motif start positions • Sequence scoring under the uniform prior (see below)
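Under a uniform prior over start positions, the Expectation step reduces to normalizing the likelihoods of the candidate start positions. With (notation assumed here) $X_i$ the $i$-th sequence, $W$ the motif width, and $Z_{ij}=1$ meaning that the motif occurrence starts at position $j$, the score of each position is

$$ P(Z_{ij}=1 \mid X_i, \theta) \;=\; \frac{P(X_i \mid Z_{ij}=1, \theta)}{\sum_{j'} P(X_i \mid Z_{ij'}=1, \theta)} $$

where $P(X_i \mid Z_{ij}=1,\theta)$ scores positions $j,\ldots,j+W-1$ with the motif matrix and all other positions with the background model.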
MEME • Expectation • Maximization, intuitively: • If we had only one alignment: • Background model: observed frequencies at background positions • Motif matrix: observed frequencies at aligned positions • Here: sum over all possible alignments (independently for each sequence) • Weighted sum of the counts (see below)
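Spelled out, the Maximization step fills the motif matrix with a weighted sum of letter counts, where each candidate alignment position contributes with its posterior weight $z_{ij} = P(Z_{ij}=1 \mid X_i, \theta^t)$:

$$ \theta_{w,b} \;\propto\; \sum_{i} \sum_{j} z_{ij}\, \big[\, x_{i,\, j+w-1} = b \,\big], \qquad w = 1,\ldots,W $$

where the bracket equals 1 if the letter at that position is $b$ and 0 otherwise. The Python sketch below implements one EM iteration of this "one occurrence per sequence" model. It is an illustration under simplifying assumptions (fixed background model, fixed pseudocounts, hypothetical function and variable names), not the actual MEME code.

```python
# Minimal sketch of one MEME-style EM iteration under the
# "one occurrence per sequence" model; illustrative, not the MEME code.
import numpy as np

ALPHABET = "ACGT"

def em_step(seqs, motif, background, W):
    """seqs: list of DNA strings (each longer than W); motif: W x 4 matrix
    of probabilities; background: length-4 vector; returns updated motif."""
    idx = {c: i for i, c in enumerate(ALPHABET)}
    new_counts = np.full((W, 4), 0.1)          # small pseudocounts

    for s in seqs:
        x = np.array([idx[c] for c in s])
        n_pos = len(x) - W + 1

        # Expectation: score each start position j with the motif/background
        # ratio (the background factor over the whole sequence is the same
        # for every j, so it cancels in the normalization).
        scores = np.array([
            np.prod(motif[np.arange(W), x[j:j + W]] / background[x[j:j + W]])
            for j in range(n_pos)
        ])
        z = scores / scores.sum()               # posterior over start positions

        # Maximization: accumulate expected letter counts per motif column
        for j in range(n_pos):
            for w in range(W):
                new_counts[w, x[j + w]] += z[j]

    # Normalize each motif position into a probability distribution
    return new_counts / new_counts.sum(axis=1, keepdims=True)
```

Iterating em_step until the motif matrix stops changing corresponds to the convergence criterion of the iterative motif discovery loop above.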
Summary • The abstract Expectation-Maximization algorithm • EM interpretation of Baum-Welch training for HMMs • EM for motif finding • MEME