Parameter Estimation For HMM
Background Readings: Chapter 3.3 in the book Biological Sequence Analysis, Durbin et al., 2001.
Reminder: Hidden Markov Model
[Diagram: hidden states S1, S2, …, SL-1, SL connected by transitions (T), each emitting an observed symbol x1, x2, …, xL-1, xL (M).]
Markov chain transition probabilities: p(Si+1 = t | Si = s) = ast.
Emission probabilities: p(Xi = b | Si = s) = es(b).
Reminder: Most Probable State Path
Given an output sequence x = (x1,…,xL), a most probable path s* = (s*1,…,s*L) is one which maximizes p(s|x).
Reminder: Viterbi's algorithm for the most probable path
We add a special initial state 0.
Initialization: v0(0) = 1, vk(0) = 0 for k > 0.
For i = 1 to L, for each state l:
vl(i) = el(xi) · MAXk {vk(i-1) akl}
ptri(l) = argmaxk {vk(i-1) akl}   [storing the previous state for reconstructing the path]
Termination: the probability of the most probable path is
p(s1*,…,sL*; x1,…,xL) = MAXk {vk(L)}.
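A minimal sketch of this recursion in Python (my own illustration, not code from the lecture). States and symbols are assumed to be encoded as integer indices, and a0 is a hypothetical vector of transition probabilities out of the special initial state 0; log space is used to avoid underflow.

```python
import numpy as np

def viterbi(x, a, e, a0):
    """Most probable state path for observation sequence x.
    a[k, l] = transition prob k -> l, e[k, b] = emission prob of symbol b
    from state k, a0[k] = prob of moving from the initial state 0 to k."""
    m, L = a.shape[0], len(x)
    log_a, log_e, log_a0 = np.log(a), np.log(e), np.log(a0)
    v = np.full((L, m), -np.inf)          # v[i, l] = log v_l(i+1)
    ptr = np.zeros((L, m), dtype=int)     # back-pointers

    v[0] = log_a0 + log_e[:, x[0]]        # initialization (leaving state 0)
    for i in range(1, L):
        for l in range(m):
            scores = v[i - 1] + log_a[:, l]
            ptr[i, l] = np.argmax(scores)
            v[i, l] = log_e[l, x[i]] + scores[ptr[i, l]]

    # termination: best final state, then traceback via the stored pointers
    path = [int(np.argmax(v[L - 1]))]
    for i in range(L - 1, 0, -1):
        path.append(int(ptr[i, path[-1]]))
    return list(reversed(path)), np.exp(np.max(v[L - 1]))
```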
Predicting CpG islands via the most probable path:
• Output symbols: A, C, G, T (4 letters).
• Markov chain states: 4 "−" states and 4 "+" states, two for each letter (8 states).
• The transition probabilities ast and emission probabilities ek(b) will be discussed soon.
• The most probable path found by Viterbi's algorithm predicts the CpG islands.
An experiment (Durbin et al., pp. 60-61) shows that the predicted islands are shorter than the assumed ones. In addition, quite a few "false negatives" are found.
Reminder: finding the most probable state
fl(i) = p(x1,…,xi, si=l), the probability of a path which emits (x1,…,xi) and in which si=l.
bl(i) = p(xi+1,…,xL | si=l), the probability of emitting (xi+1,…,xL) given that si=l.
• The forward algorithm finds {fk(i) = P(x1,…,xi, si=k) : k = 1,…,m}.
• The backward algorithm finds {bk(i) = P(xi+1,…,xL | si=k) : k = 1,…,m}.
• Return {p(si=k | x) = fk(i) bk(i) / p(x) : k = 1,…,m}, where p(x) = ∑k fk(L).
• To compute this for every i, simply run the forward and backward algorithms once, and compute fk(i) bk(i) for every i, k.
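A sketch of the forward/backward computation of these posteriors (again my own illustration, with the same integer encoding and initial-distribution vector a0 as above). It uses plain, unscaled probabilities, so it is only suitable for short sequences; a scaled or log-space version would be needed in practice.

```python
import numpy as np

def forward_backward(x, a, e, a0):
    """Posterior state probabilities p(s_i = k | x) via forward/backward."""
    m, L = a.shape[0], len(x)
    f = np.zeros((L, m))                  # f[i, k] = P(x_1..x_{i+1}, s_{i+1}=k)
    b = np.zeros((L, m))                  # b[i, k] = P(x_{i+2}..x_L | s_{i+1}=k)

    f[0] = a0 * e[:, x[0]]
    for i in range(1, L):
        f[i] = (f[i - 1] @ a) * e[:, x[i]]

    b[L - 1] = 1.0
    for i in range(L - 2, -1, -1):
        b[i] = a @ (e[:, x[i + 1]] * b[i + 1])

    px = f[L - 1].sum()                   # P(x) = sum_k f_k(L)
    posterior = f * b / px                # p(s_i = k | x) = f_k(i) b_k(i) / P(x)
    return posterior, px
```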
Finding the probability that a letter is in a CpG island via the algorithm for the most probable state:
• The probability that an occurrence of G is in a CpG island (a "+" state) is
∑s+ p(si = s+ | x) = ∑s+ fs+(i) bs+(i) / p(x),
where the summation is formally over the 4 "+" states, but actually only the state G+ needs to be considered (why?).
Parameter Estimation for HMM
An HMM model is defined by the parameters akl and ek(b), for all states k, l and all symbols b. Let θ denote the collection of these parameters.
Parameter Estimation for HMM
To determine the values of (the parameters in) θ, we use a training set = {x1,…,xn}, where each xj is a sequence which is assumed to fit the model. Given the parameters θ, each sequence xj has an assigned probability p(xj|θ) (or p(xj|θ, HMM)).
ML Parameter Estimation for HMM
The elements of the training set {x1,…,xn} are assumed to be independent, so p(x1,…,xn|θ) = ∏j p(xj|θ).
ML parameter estimation looks for a θ which maximizes the above. The exact method for finding or approximating this θ depends on the nature of the training set used.
Data for HMM
Possible properties of (the sequences in) the training set:
• For each xj, what is our information on the states si? (The symbols xi are usually known.)
• The size (number of sequences) of the training set.
Case 1: Sequences are fully known
We know the complete structure of each sequence in the training set {x1,…,xn}. We wish to estimate akl and ek(b) for all pairs of states k, l and symbols b.
By the ML method, we look for parameters θ* which maximize the probability of the sample set:
p(x1,…,xn|θ*) = MAXθ p(x1,…,xn|θ).
Case 1: ML method
Let mkl = |{i : si-1=k, si=l}| and mk(b) = |{i : si=k, xi=b}| (counted within xj). Then for each xj we have:
p(xj|θ) = ∏i a(si-1, si) · e(si)(xi) = ∏k,l (akl)^{mkl} · ∏k,b (ek(b))^{mk(b)}.
Case 1 (cont.)
By the independence of the xj's, p(x1,…,xn|θ) = ∏j p(xj|θ). Thus, if
Akl = #(transitions from k to l) in the training set, and
Ek(b) = #(emissions of symbol b from state k) in the training set,
we have:
p(x1,…,xn|θ) = ∏k,l (akl)^{Akl} · ∏k,b (ek(b))^{Ek(b)}.
Case 1 (cont.)
So we need to find akl's and ek(b)'s which maximize
F = ∏k,l (akl)^{Akl} · ∏k,b (ek(b))^{Ek(b)}
subject to: for each state k, ∑l akl = 1 and ∑b ek(b) = 1 (with akl ≥ 0, ek(b) ≥ 0).
Case 1 (cont.)
Rewriting, we need to maximize
F = ∏k [ ∏l (akl)^{Akl} ] · ∏k [ ∏b (ek(b))^{Ek(b)} ].
Case 1 (cont.)
If we maximize each factor ∏l (akl)^{Akl} and each factor ∏b (ek(b))^{Ek(b)} separately, then we will also maximize F. Each of these is a simpler ML problem, which is treated next.
A simpler case: ML parameter estimation for a die
Let X be a random variable with 6 values x1,…,x6 denoting the six outcomes of a die. Here the parameters are θ = {θ1, θ2, θ3, θ4, θ5, θ6}, with ∑θi = 1.
Assume that the data is one sequence:
Data = (x6, x1, x1, x3, x2, x2, x3, x4, x5, x2, x6)
So we have to maximize
P(Data|θ) = θ1^2 · θ2^3 · θ3^2 · θ4 · θ5 · θ6^2
subject to θ1+θ2+θ3+θ4+θ5+θ6 = 1 [and θi ≥ 0].
Side comment: Sufficient Statistics
• To compute the probability of the data in the die example we only need to record the number of times Ni the die fell on side i (namely, N1, N2,…,N6).
• We do not need to recall the entire sequence of outcomes.
• {Ni | i=1,…,6} is called a sufficient statistic for multinomial sampling.
Sufficient Statistics
• A sufficient statistic is a function of the data that summarizes the relevant information for the likelihood.
• Formally, s(Data) is a sufficient statistic if for any two datasets D and D':
s(D) = s(D')  ⟹  P(D|θ) = P(D'|θ).
Exercise: Define a sufficient statistic for the HMM model.
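A small illustration of the definition (my own example, not from the slides): two die datasets with the same counts Ni have the same likelihood for every θ, so the counts alone determine P(Data|θ).

```python
from collections import Counter
from math import prod

def likelihood(data, theta):
    """P(Data | theta) for iid die rolls; theta[i] is the prob of outcome i+1."""
    return prod(theta[x - 1] for x in data)

theta = [0.1, 0.1, 0.2, 0.2, 0.2, 0.2]      # an arbitrary parameter choice
d1 = [6, 1, 1, 3, 2, 2, 3, 4, 5, 2, 6]      # die data from the earlier slide
d2 = sorted(d1)                             # same multiset, different order
assert Counter(d1) == Counter(d2)           # identical sufficient statistics
assert abs(likelihood(d1, theta) - likelihood(d2, theta)) < 1e-15
```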
Maximum Likelihood Estimate
By the ML approach, we look for parameters that maximize the probability of the data (i.e., the likelihood function). Usually one maximizes the log-likelihood function, which is easier to do and gives an identical answer. A necessary condition for a maximum is:
∂ log P(Data|θ) / ∂θj = Nj/θj − N6/(1 − ∑i=1..5 θi) = 0   (for j = 1,…,5, after substituting θ6 = 1 − ∑i=1..5 θi).
Finding the Maximum
Rearranging terms: Nj/θj = N6/(1 − ∑i=1..5 θi) = N6/θ6.
Dividing the jth equation by the ith equation: θj/θi = Nj/Ni, i.e. θj = θi Nj/Ni.
Summing from j=1 to 6: 1 = ∑j θj = (θi/Ni) ∑j Nj.
Hence the MLE is given by: θi = Ni / ∑j Nj.
Generalization to a distribution with any number k of outcomes
Let X be a random variable with k values x1,…,xk denoting the k outcomes of iid experiments, with parameters θ = {θ1, θ2,…, θk} (θi is the probability of xi).
Again, the data is one sequence of length n:
Data = (xi1, xi2,…, xin)
Then we have to maximize
P(Data|θ) = θ1^{N1} · … · θk^{Nk}, where Ni is the number of occurrences of xi in the data,
subject to θ1 + θ2 + … + θk = 1.
Generalization to k outcomes (cont.)
By treatment identical to the die case, the maximum is obtained when Ni/θi takes the same value for all i (equivalently, θi ∝ Ni).
Hence the MLE is given by: θi = Ni / ∑j Nj = Ni/n.
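The resulting estimator is just normalized counts; a minimal sketch (function name is mine), applied to the die data from the earlier slide:

```python
from collections import Counter

def multinomial_mle(data, k):
    """ML estimate for a k-outcome iid sample: theta_i = N_i / n."""
    counts = Counter(data)
    n = len(data)
    return [counts[i] / n for i in range(1, k + 1)]

# The die data gives theta = (2/11, 3/11, 2/11, 1/11, 1/11, 2/11).
print(multinomial_mle([6, 1, 1, 3, 2, 2, 3, 4, 5, 2, 6], k=6))
```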
Fractional Exponents
Some models allow Ni's which are not integers (e.g., when we are uncertain of a die outcome, and consider it "6" with 20% confidence and "5" with 80%): that roll contributes 0.2 to N6 and 0.8 to N5.
We can still write P(Data|θ) = ∏i θi^{Ni}, and the same analysis yields θi = Ni / ∑j Nj.
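A sketch of the same normalization with fractional (expected) counts; the weighted-count dictionary and its values are an illustrative assumption:

```python
def mle_from_weighted_counts(weighted_counts):
    """theta_i = N_i / sum_j N_j, where the N_i may be fractional."""
    total = sum(weighted_counts.values())
    return {i: n / total for i, n in weighted_counts.items()}

# One uncertain roll observed as "6" with 0.2 confidence and "5" with 0.8,
# plus two certain rolls of "1": N = {1: 2, 5: 0.8, 6: 0.2}.
print(mle_from_weighted_counts({1: 2.0, 5: 0.8, 6: 0.2}))
```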
Apply the ML method to HMM
Let Akl = #(transitions from k to l) in the training set, and
Ek(b) = #(emissions of symbol b from state k) in the training set.
We need to maximize
∏k,l (akl)^{Akl} · ∏k,b (ek(b))^{Ek(b)}
subject to ∑l akl = 1 and ∑b ek(b) = 1 for each state k.
Apply to HMM (cont.)
We apply the previous technique to get, for each k, the parameters {akl | l=1,…,m} and {ek(b) | b ∈ Σ}:
akl = Akl / ∑l' Akl'   and   ek(b) = Ek(b) / ∑b' Ek(b'),
which gives the optimal ML parameters.
Adding pseudocounts in HMM
If the sample set is too small, we may get a biased result. In this case we modify the actual counts using our prior knowledge/belief:
rkl is our prior belief about transitions from k to l, and rk(b) is our prior belief about emissions of b from state k:
akl = (Akl + rkl) / ∑l' (Akl' + rkl')   and   ek(b) = (Ek(b) + rk(b)) / ∑b' (Ek(b') + rk(b')).
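A sketch of Case 1 estimation with optional pseudocounts (again my own illustration): states and symbols are integer indices, `seqs` is a list of (states, symbols) pairs, and the uniform pseudocount arguments are a simplifying assumption rather than anything the slides fix.

```python
import numpy as np

def ml_estimate(seqs, m, n_symbols, r_trans=0.0, r_emit=0.0):
    """ML estimates a_kl = A_kl / sum_l' A_kl', e_k(b) = E_k(b) / sum_b' E_k(b'),
    with pseudocounts r added to every count (Case 1: state paths known)."""
    A = np.full((m, m), r_trans, dtype=float)        # A_kl + r_kl
    E = np.full((m, n_symbols), r_emit, dtype=float) # E_k(b) + r_k(b)
    for states, symbols in seqs:
        for i, (k, b) in enumerate(zip(states, symbols)):
            E[k, b] += 1
            if i > 0:
                A[states[i - 1], k] += 1
    a = A / A.sum(axis=1, keepdims=True)
    e = E / E.sum(axis=1, keepdims=True)
    return a, e

# Example with 2 states and a 2-letter alphabet; pseudocount 1 avoids zeros.
a, e = ml_estimate([([0, 0, 1, 1, 0], [0, 1, 1, 0, 0])], m=2, n_symbols=2,
                   r_trans=1.0, r_emit=1.0)
```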
Summary of Case 1: Sequences are fully known
We know the complete structure of each sequence in the training set {x1,…,xn}. We wish to estimate akl and ek(b) for all pairs of states k, l and symbols b.
We just showed a method which finds the (unique) parameters θ* which maximize
p(x1,…,xn|θ*) = MAXθ p(x1,…,xn|θ).
Case 2: State paths are unknown
In this case only the values of the xi's of the input sequences are known. This is an ML problem with "missing data". We wish to find θ* so that p(x|θ*) = MAXθ {p(x|θ)}.
For each sequence x, p(x|θ) = ∑s p(x,s|θ), taken over all state paths s.
Case 2: State paths are unknown
So we need to maximize p(x|θ) = ∑s p(x,s|θ), where the summation is over all state paths s which produce the output sequence x.
Finding a θ* which maximizes ∑s p(x,s|θ) is hard. [Unlike finding a θ* which maximizes p(x,s|θ) for a single sequence (x,s).]
ML Parameter Estimation for HMM
The general process for finding θ in this case is:
1. Start with an initial value of θ.
2. Find θ' so that p(x1,…,xn|θ') > p(x1,…,xn|θ).
3. Set θ = θ'.
4. Repeat until some convergence criterion is met.
A general algorithm of this type is the Expectation Maximization (EM) algorithm, which we will meet later. For the specific case of HMM, it is the Baum-Welch training.
Baum-Welch training
We start with some values of akl and ek(b), which define prior values of θ. Baum-Welch training is an iterative algorithm which attempts to replace θ by a θ* such that p(x|θ*) > p(x|θ).
Each iteration consists of a few steps:
Baum-Welch: Step 1
Count the expected number of state transitions. For each sequence xj, for each i, and for each pair of states k, l, compute the posterior state transition probabilities
P(si-1=k, si=l | xj, θ).
Baum-Welch training
Claim:
P(si-1=k, si=l | xj, θ) = fk(i-1) · akl · el(xi) · bl(i) / p(xj).
Step 1: Computing P(si-1=k, si=l | xj, θ)
P(x1,…,xL, si-1=k, si=l) = P(x1,…,xi-1, si-1=k) · akl · el(xi) · P(xi+1,…,xL | si=l)
                          = fk(i-1) · akl · el(xi) · bl(i),
where fk(i-1) is computed via the forward algorithm and bl(i) via the backward algorithm. Hence
p(si-1=k, si=l | xj) = fk(i-1) akl el(xi) bl(i) / ∑k',l' fk'(i-1) ak'l' el'(xi) bl'(i).
Step 1 (end)
For each pair (k,l), compute the expected number of state transitions from k to l:
Akl = ∑j ∑i P(si-1=k, si=l | xj, θ) = ∑j [1/p(xj)] ∑i fk(i-1) akl el(xi) bl(i),
where the inner quantities are computed for sequence xj.
Baum-Welch: Step 2
For each state k and each symbol b, compute the expected number of emissions of b from k:
Ek(b) = ∑j ∑{i : xi=b} P(si=k | xj, θ) = ∑j [1/p(xj)] ∑{i : xi=b} fk(i) bk(i),
again with the inner quantities computed for sequence xj.
Baum-Welch: Step 3
Use the Akl's and Ek(b)'s to compute the new values of akl and ek(b), as in Case 1. These values define θ*.
It can be shown that p(x1,…,xn|θ*) > p(x1,…,xn|θ), i.e., θ* increases the probability of the data.
This procedure is iterated until some convergence criterion is met.
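A sketch of one Baum-Welch iteration built from the forward/backward recursions above (my own illustration): unscaled probabilities, integer-encoded sequences, and a fixed initial distribution a0 are simplifying assumptions.

```python
import numpy as np

def baum_welch_step(seqs, a, e, a0):
    """One Baum-Welch update: expected counts A_kl, E_k(b), then re-estimate."""
    m, n_symbols = e.shape
    A = np.zeros((m, m))
    E = np.zeros((m, n_symbols))
    for x in seqs:
        L = len(x)
        f = np.zeros((L, m)); b = np.zeros((L, m))
        f[0] = a0 * e[:, x[0]]                       # forward pass
        for i in range(1, L):
            f[i] = (f[i - 1] @ a) * e[:, x[i]]
        b[L - 1] = 1.0                               # backward pass
        for i in range(L - 2, -1, -1):
            b[i] = a @ (e[:, x[i + 1]] * b[i + 1])
        px = f[L - 1].sum()                          # P(x)
        # Step 1: expected transition counts,
        #   P(s_{i-1}=k, s_i=l | x) = f_k(i-1) a_kl e_l(x_i) b_l(i) / P(x)
        for i in range(1, L):
            A += np.outer(f[i - 1], e[:, x[i]] * b[i]) * a / px
        # Step 2: expected emission counts, P(s_i=k | x) = f_k(i) b_k(i) / P(x)
        for i in range(L):
            E[:, x[i]] += f[i] * b[i] / px
    # Step 3: re-estimate the parameters from the expected counts
    return A / A.sum(axis=1, keepdims=True), E / E.sum(axis=1, keepdims=True)
```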
Case 2: State paths are unknown: Viterbi training
We also start from given values of akl and ek(b), which define prior values of θ. Viterbi training attempts to maximize the probability of a most probable path, i.e., to maximize
p(s(x1),…,s(xn) | θ, x1,…,xn),
where s(xj) is the most probable (under θ) path for xj.
Case 2: State paths are unknown: Viterbi training
Each iteration:
1. Find a set {s(xj)} of most probable paths, which maximize p(s(x1),…,s(xn) | θ, x1,…,xn).
2. Find θ* which maximizes p(s(x1),…,s(xn) | θ*, x1,…,xn).
   Note: in step 1 the maximizing arguments are the paths; in step 2 it is θ*.
3. Set θ = θ*, and repeat. Stop when the paths do not change.
Case 2: State paths are unknown: Viterbi training
p(s(x1),…,s(xn) | θ*, x1,…,xn) can be expressed in closed form (since we are using a single path for each xj), so this time convergence is achieved when the optimal paths no longer change.
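A sketch of the Viterbi training loop, assuming the viterbi() and ml_estimate() sketches from earlier are in scope; the pseudocount values and the iteration cap are illustrative choices of mine.

```python
def viterbi_training(seqs, a, e, a0, max_iter=50):
    """Viterbi training: decode most probable paths, re-estimate from them,
    and repeat until the paths stop changing."""
    m, n_symbols = e.shape
    prev_paths = None
    for _ in range(max_iter):
        paths = [viterbi(x, a, e, a0)[0] for x in seqs]      # step 1: decode
        if paths == prev_paths:                              # stop: paths unchanged
            break
        a, e = ml_estimate(list(zip(paths, seqs)), m, n_symbols,
                           r_trans=1.0, r_emit=1.0)          # step 2: re-estimate
        prev_paths = paths                                   # step 3: iterate
    return a, e
```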