Parameter Estimation For HMM
Background Readings: Chapter 3.3 in the book Biological Sequence Analysis, Durbin et al., 2001.
Reminder: Hidden Markov Model
[Diagram: hidden states S1, S2, …, SL-1, SL connected by transitions (T), each emitting an observed symbol x1, x2, …, xL-1, xL (M).]
Markov chain transition probabilities: p(Si+1 = t | Si = s) = ast.
Emission probabilities: p(Xi = b | Si = s) = es(b).
Reminder: Most Probable State Path
Given an output sequence x = (x1,…,xL), a most probable path s* = (s*1,…,s*L) is one which maximizes p(s|x).
Reminder: Viterbi's algorithm for the most probable path
We add a special initial state 0.
Initialization: v0(0) = 1, vk(0) = 0 for k > 0.
For i = 1 to L, for each state l:
vl(i) = el(xi) · MAXk {vk(i-1) akl}
ptri(l) = argmaxk {vk(i-1) akl}   [storing the previous state for reconstructing the path]
Termination: the probability of the most probable path is
p(s1*,…,sL*; x1,…,xL) = MAXk {vk(L)}.
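A minimal sketch of this recursion in Python (my own illustration, not code from the lecture). States and symbols are assumed to be encoded as integer indices, and a0 is a hypothetical vector of transition probabilities out of the special initial state 0; log space is used to avoid underflow.

```python
import numpy as np

def viterbi(x, a, e, a0):
    """Most probable state path for observation sequence x.
    a[k, l] = transition prob k -> l, e[k, b] = emission prob of symbol b
    from state k, a0[k] = prob of moving from the initial state 0 to k."""
    m, L = a.shape[0], len(x)
    log_a, log_e, log_a0 = np.log(a), np.log(e), np.log(a0)
    v = np.full((L, m), -np.inf)          # v[i, l] = log v_l(i+1)
    ptr = np.zeros((L, m), dtype=int)     # back-pointers

    v[0] = log_a0 + log_e[:, x[0]]        # initialization (leaving state 0)
    for i in range(1, L):
        for l in range(m):
            scores = v[i - 1] + log_a[:, l]
            ptr[i, l] = np.argmax(scores)
            v[i, l] = log_e[l, x[i]] + scores[ptr[i, l]]

    # termination: best final state, then traceback via the stored pointers
    path = [int(np.argmax(v[L - 1]))]
    for i in range(L - 1, 0, -1):
        path.append(int(ptr[i, path[-1]]))
    return list(reversed(path)), np.exp(np.max(v[L - 1]))
```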
Predicting CpG islands via the most probable path:
• Output symbols: A, C, G, T (4 letters).
• Markov chain states: 4 "−" states and 4 "+" states, two for each letter (8 states).
• The transition probabilities ast and emission probabilities ek(b) will be discussed soon.
• The most probable path found by Viterbi's algorithm predicts the CpG islands.
An experiment (Durbin et al., pp. 60-61) shows that the predicted islands are shorter than the assumed ones. In addition, quite a few "false negatives" are found.
Reminder: finding the most probable state
fl(i) = p(x1,…,xi, si=l), the probability of a path which emits (x1,…,xi) and in which si=l.
bl(i) = p(xi+1,…,xL | si=l), the probability of emitting (xi+1,…,xL) given that si=l.
• The forward algorithm finds {fk(i) = P(x1,…,xi, si=k) : k = 1,…,m}.
• The backward algorithm finds {bk(i) = P(xi+1,…,xL | si=k) : k = 1,…,m}.
• Return {p(si=k | x) = fk(i) bk(i) / p(x) : k = 1,…,m}, where p(x) = ∑k fk(L).
• To compute this for every i, simply run the forward and backward algorithms once, and compute fk(i) bk(i) for every i, k.
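A sketch of the forward/backward computation of these posteriors (again my own illustration, with the same integer encoding and initial-distribution vector a0 as above). It uses plain, unscaled probabilities, so it is only suitable for short sequences; a scaled or log-space version would be needed in practice.

```python
import numpy as np

def forward_backward(x, a, e, a0):
    """Posterior state probabilities p(s_i = k | x) via forward/backward."""
    m, L = a.shape[0], len(x)
    f = np.zeros((L, m))                  # f[i, k] = P(x_1..x_{i+1}, s_{i+1}=k)
    b = np.zeros((L, m))                  # b[i, k] = P(x_{i+2}..x_L | s_{i+1}=k)

    f[0] = a0 * e[:, x[0]]
    for i in range(1, L):
        f[i] = (f[i - 1] @ a) * e[:, x[i]]

    b[L - 1] = 1.0
    for i in range(L - 2, -1, -1):
        b[i] = a @ (e[:, x[i + 1]] * b[i + 1])

    px = f[L - 1].sum()                   # P(x) = sum_k f_k(L)
    posterior = f * b / px                # p(s_i = k | x) = f_k(i) b_k(i) / P(x)
    return posterior, px
```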
Finding the probability that a letter is in a CpG island via the algorithm for the most probable state:
• The probability that an occurrence of G is in a CpG island (a "+" state) is
∑s+ p(si = s+ | x) = ∑s+ fs+(i) bs+(i) / p(x),
where the summation is formally over the 4 "+" states, but actually only the state G+ needs to be considered (why?).
Parameter Estimation for HMM
An HMM model is defined by the parameters akl and ek(b), for all states k, l and all symbols b. Let θ denote the collection of these parameters.
Parameter Estimation for HMM
To determine the values of (the parameters in) θ, we use a training set = {x1,…,xn}, where each xj is a sequence which is assumed to fit the model. Given the parameters θ, each sequence xj has an assigned probability p(xj|θ) (or p(xj|θ, HMM)).
ML Parameter Estimation for HMM
The elements of the training set {x1,…,xn} are assumed to be independent, so p(x1,…,xn|θ) = ∏j p(xj|θ).
ML parameter estimation looks for a θ which maximizes the above. The exact method for finding or approximating this θ depends on the nature of the training set used.
Data for HMM
Possible properties of (the sequences in) the training set:
• For each xj, what is our information on the states si? (The symbols xi are usually known.)
• The size (number of sequences) of the training set.
Case 1: Sequences are fully known
We know the complete structure of each sequence in the training set {x1,…,xn}. We wish to estimate akl and ek(b) for all pairs of states k, l and symbols b.
By the ML method, we look for parameters θ* which maximize the probability of the sample set:
p(x1,…,xn|θ*) = MAXθ p(x1,…,xn|θ).
Case 1: ML method
Let mkl = |{i : si-1=k, si=l}| and mk(b) = |{i : si=k, xi=b}| (counted within xj). Then for each xj we have:
p(xj|θ) = ∏i a(si-1, si) · e(si)(xi) = ∏k,l (akl)^{mkl} · ∏k,b (ek(b))^{mk(b)}.
Case 1 (cont.)
By the independence of the xj's, p(x1,…,xn|θ) = ∏j p(xj|θ). Thus, if
Akl = #(transitions from k to l) in the training set, and
Ek(b) = #(emissions of symbol b from state k) in the training set,
we have:
p(x1,…,xn|θ) = ∏k,l (akl)^{Akl} · ∏k,b (ek(b))^{Ek(b)}.
Case 1 (cont.)
So we need to find akl's and ek(b)'s which maximize
F = ∏k,l (akl)^{Akl} · ∏k,b (ek(b))^{Ek(b)}
subject to: for each state k, ∑l akl = 1 and ∑b ek(b) = 1 (with akl ≥ 0, ek(b) ≥ 0).
Case 1 (cont.)
Rewriting, we need to maximize
F = ∏k [ ∏l (akl)^{Akl} ] · ∏k [ ∏b (ek(b))^{Ek(b)} ].
Case 1 (cont.)
If we maximize each factor ∏l (akl)^{Akl} and each factor ∏b (ek(b))^{Ek(b)} separately, then we will also maximize F. Each of these is a simpler ML problem, which is treated next.
A simpler case: ML parameter estimation for a die
Let X be a random variable with 6 values x1,…,x6 denoting the six outcomes of a die. Here the parameters are θ = {θ1, θ2, θ3, θ4, θ5, θ6}, with ∑θi = 1.
Assume that the data is one sequence:
Data = (x6, x1, x1, x3, x2, x2, x3, x4, x5, x2, x6)
So we have to maximize
P(Data|θ) = θ1^2 · θ2^3 · θ3^2 · θ4 · θ5 · θ6^2
subject to θ1+θ2+θ3+θ4+θ5+θ6 = 1 [and θi ≥ 0].
Side comment: Sufficient Statistics
• To compute the probability of the data in the die example we only need to record the number of times Ni the die fell on side i (namely, N1, N2,…,N6).
• We do not need to recall the entire sequence of outcomes.
• {Ni | i=1,…,6} is called a sufficient statistic for multinomial sampling.
Sufficient Statistics
• A sufficient statistic is a function of the data that summarizes the relevant information for the likelihood.
• Formally, s(Data) is a sufficient statistic if for any two datasets D and D':
s(D) = s(D')  ⟹  P(D|θ) = P(D'|θ).
Exercise: Define a sufficient statistic for the HMM model.
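A small illustration of the definition (my own example, not from the slides): two die datasets with the same counts Ni have the same likelihood for every θ, so the counts alone determine P(Data|θ).

```python
from collections import Counter
from math import prod

def likelihood(data, theta):
    """P(Data | theta) for iid die rolls; theta[i] is the prob of outcome i+1."""
    return prod(theta[x - 1] for x in data)

theta = [0.1, 0.1, 0.2, 0.2, 0.2, 0.2]      # an arbitrary parameter choice
d1 = [6, 1, 1, 3, 2, 2, 3, 4, 5, 2, 6]      # die data from the earlier slide
d2 = sorted(d1)                             # same multiset, different order
assert Counter(d1) == Counter(d2)           # identical sufficient statistics
assert abs(likelihood(d1, theta) - likelihood(d2, theta)) < 1e-15
```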
Maximum Likelihood Estimate
By the ML approach, we look for parameters that maximize the probability of the data (i.e., the likelihood function). Usually one maximizes the log-likelihood function, which is easier to do and gives an identical answer. A necessary condition for a maximum is:
∂ log P(Data|θ) / ∂θj = Nj/θj − N6/(1 − ∑i=1..5 θi) = 0   (for j = 1,…,5, after substituting θ6 = 1 − ∑i=1..5 θi).
Finding the Maximum
Rearranging terms: Nj/θj = N6/(1 − ∑i=1..5 θi) = N6/θ6.
Dividing the jth equation by the ith equation: θj/θi = Nj/Ni, i.e. θj = θi Nj/Ni.
Summing from j=1 to 6: 1 = ∑j θj = (θi/Ni) ∑j Nj.
Hence the MLE is given by: θi = Ni / ∑j Nj.
Generalization to a distribution with any number k of outcomes
Let X be a random variable with k values x1,…,xk denoting the k outcomes of iid experiments, with parameters θ = {θ1, θ2,…, θk} (θi is the probability of xi).
Again, the data is one sequence of length n:
Data = (xi1, xi2,…, xin)
Then we have to maximize
P(Data|θ) = θ1^{N1} · … · θk^{Nk}, where Ni is the number of occurrences of xi in the data,
subject to θ1 + θ2 + … + θk = 1.
Generalization to k outcomes (cont.)
By treatment identical to the die case, the maximum is obtained when Ni/θi takes the same value for all i (equivalently, θi ∝ Ni).
Hence the MLE is given by: θi = Ni / ∑j Nj = Ni/n.
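The resulting estimator is just normalized counts; a minimal sketch (function name is mine), applied to the die data from the earlier slide:

```python
from collections import Counter

def multinomial_mle(data, k):
    """ML estimate for a k-outcome iid sample: theta_i = N_i / n."""
    counts = Counter(data)
    n = len(data)
    return [counts[i] / n for i in range(1, k + 1)]

# The die data gives theta = (2/11, 3/11, 2/11, 1/11, 1/11, 2/11).
print(multinomial_mle([6, 1, 1, 3, 2, 2, 3, 4, 5, 2, 6], k=6))
```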
Fractional Exponents
Some models allow Ni's which are not integers (e.g., when we are uncertain of a die outcome, and consider it "6" with 20% confidence and "5" with 80%): that roll contributes 0.2 to N6 and 0.8 to N5.
We can still write P(Data|θ) = ∏i θi^{Ni}, and the same analysis yields θi = Ni / ∑j Nj.
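A sketch of the same normalization with fractional (expected) counts; the weighted-count dictionary and its values are an illustrative assumption:

```python
def mle_from_weighted_counts(weighted_counts):
    """theta_i = N_i / sum_j N_j, where the N_i may be fractional."""
    total = sum(weighted_counts.values())
    return {i: n / total for i, n in weighted_counts.items()}

# One uncertain roll observed as "6" with 0.2 confidence and "5" with 0.8,
# plus two certain rolls of "1": N = {1: 2, 5: 0.8, 6: 0.2}.
print(mle_from_weighted_counts({1: 2.0, 5: 0.8, 6: 0.2}))
```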
Apply the ML method to HMM
Let Akl = #(transitions from k to l) in the training set, and
Ek(b) = #(emissions of symbol b from state k) in the training set.
We need to maximize
∏k,l (akl)^{Akl} · ∏k,b (ek(b))^{Ek(b)}
subject to ∑l akl = 1 and ∑b ek(b) = 1 for each state k.
Apply to HMM (cont.)
We apply the previous technique to get, for each k, the parameters {akl | l=1,…,m} and {ek(b) | b ∈ Σ}:
akl = Akl / ∑l' Akl'   and   ek(b) = Ek(b) / ∑b' Ek(b'),
which gives the optimal ML parameters.
Adding pseudocounts in HMM
If the sample set is too small, we may get a biased result. In this case we modify the actual counts using our prior knowledge/belief:
rkl is our prior belief about transitions from k to l, and rk(b) is our prior belief about emissions of b from state k:
akl = (Akl + rkl) / ∑l' (Akl' + rkl')   and   ek(b) = (Ek(b) + rk(b)) / ∑b' (Ek(b') + rk(b')).
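A sketch of Case 1 estimation with optional pseudocounts (again my own illustration): states and symbols are integer indices, `seqs` is a list of (states, symbols) pairs, and the uniform pseudocount arguments are a simplifying assumption rather than anything the slides fix.

```python
import numpy as np

def ml_estimate(seqs, m, n_symbols, r_trans=0.0, r_emit=0.0):
    """ML estimates a_kl = A_kl / sum_l' A_kl', e_k(b) = E_k(b) / sum_b' E_k(b'),
    with pseudocounts r added to every count (Case 1: state paths known)."""
    A = np.full((m, m), r_trans, dtype=float)        # A_kl + r_kl
    E = np.full((m, n_symbols), r_emit, dtype=float) # E_k(b) + r_k(b)
    for states, symbols in seqs:
        for i, (k, b) in enumerate(zip(states, symbols)):
            E[k, b] += 1
            if i > 0:
                A[states[i - 1], k] += 1
    a = A / A.sum(axis=1, keepdims=True)
    e = E / E.sum(axis=1, keepdims=True)
    return a, e

# Example with 2 states and a 2-letter alphabet; pseudocount 1 avoids zeros.
a, e = ml_estimate([([0, 0, 1, 1, 0], [0, 1, 1, 0, 0])], m=2, n_symbols=2,
                   r_trans=1.0, r_emit=1.0)
```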
Summary of Case 1: Sequences are fully known
We know the complete structure of each sequence in the training set {x1,…,xn}. We wish to estimate akl and ek(b) for all pairs of states k, l and symbols b.
We just showed a method which finds the (unique) parameters θ* which maximize
p(x1,…,xn|θ*) = MAXθ p(x1,…,xn|θ).
Case 2: State paths are unknown
In this case only the values of the xi's of the input sequences are known. This is an ML problem with "missing data". We wish to find θ* so that p(x|θ*) = MAXθ {p(x|θ)}.
For each sequence x, p(x|θ) = ∑s p(x,s|θ), taken over all state paths s.
Case 2: State paths are unknown
So we need to maximize p(x|θ) = ∑s p(x,s|θ), where the summation is over all state paths s which produce the output sequence x.
Finding a θ* which maximizes ∑s p(x,s|θ) is hard. [Unlike finding a θ* which maximizes p(x,s|θ) for a single sequence (x,s).]
ML Parameter Estimation for HMM
The general process for finding θ in this case is:
1. Start with an initial value of θ.
2. Find θ' so that p(x1,…,xn|θ') > p(x1,…,xn|θ).
3. Set θ = θ'.
4. Repeat until some convergence criterion is met.
A general algorithm of this type is the Expectation Maximization (EM) algorithm, which we will meet later. For the specific case of HMM, it is the Baum-Welch training.
Baum-Welch training
We start with some values of akl and ek(b), which define prior values of θ. Baum-Welch training is an iterative algorithm which attempts to replace θ by a θ* such that p(x|θ*) > p(x|θ).
Each iteration consists of a few steps:
Baum-Welch: Step 1
Count the expected number of state transitions. For each sequence xj, for each i, and for each pair of states k, l, compute the posterior state transition probabilities
P(si-1=k, si=l | xj, θ).
Baum-Welch training
Claim:
P(si-1=k, si=l | xj, θ) = fk(i-1) · akl · el(xi) · bl(i) / p(xj).
Step 1: Computing P(si-1=k, si=l | xj, θ)
P(x1,…,xL, si-1=k, si=l) = P(x1,…,xi-1, si-1=k) · akl · el(xi) · P(xi+1,…,xL | si=l)
                          = fk(i-1) · akl · el(xi) · bl(i),
where fk(i-1) is computed via the forward algorithm and bl(i) via the backward algorithm. Hence
p(si-1=k, si=l | xj) = fk(i-1) akl el(xi) bl(i) / ∑k',l' fk'(i-1) ak'l' el'(xi) bl'(i).
Step 1 (end)
For each pair (k,l), compute the expected number of state transitions from k to l:
Akl = ∑j ∑i P(si-1=k, si=l | xj, θ) = ∑j [1/p(xj)] ∑i fk(i-1) akl el(xi) bl(i),
where the inner quantities are computed for sequence xj.
Baum-Welch: Step 2
For each state k and each symbol b, compute the expected number of emissions of b from k:
Ek(b) = ∑j ∑{i : xi=b} P(si=k | xj, θ) = ∑j [1/p(xj)] ∑{i : xi=b} fk(i) bk(i),
again with the inner quantities computed for sequence xj.
Baum-Welch: Step 3
Use the Akl's and Ek(b)'s to compute the new values of akl and ek(b), as in Case 1. These values define θ*.
It can be shown that p(x1,…,xn|θ*) > p(x1,…,xn|θ), i.e., θ* increases the probability of the data.
This procedure is iterated until some convergence criterion is met.
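A sketch of one Baum-Welch iteration built from the forward/backward recursions above (my own illustration): unscaled probabilities, integer-encoded sequences, and a fixed initial distribution a0 are simplifying assumptions.

```python
import numpy as np

def baum_welch_step(seqs, a, e, a0):
    """One Baum-Welch update: expected counts A_kl, E_k(b), then re-estimate."""
    m, n_symbols = e.shape
    A = np.zeros((m, m))
    E = np.zeros((m, n_symbols))
    for x in seqs:
        L = len(x)
        f = np.zeros((L, m)); b = np.zeros((L, m))
        f[0] = a0 * e[:, x[0]]                       # forward pass
        for i in range(1, L):
            f[i] = (f[i - 1] @ a) * e[:, x[i]]
        b[L - 1] = 1.0                               # backward pass
        for i in range(L - 2, -1, -1):
            b[i] = a @ (e[:, x[i + 1]] * b[i + 1])
        px = f[L - 1].sum()                          # P(x)
        # Step 1: expected transition counts,
        #   P(s_{i-1}=k, s_i=l | x) = f_k(i-1) a_kl e_l(x_i) b_l(i) / P(x)
        for i in range(1, L):
            A += np.outer(f[i - 1], e[:, x[i]] * b[i]) * a / px
        # Step 2: expected emission counts, P(s_i=k | x) = f_k(i) b_k(i) / P(x)
        for i in range(L):
            E[:, x[i]] += f[i] * b[i] / px
    # Step 3: re-estimate the parameters from the expected counts
    return A / A.sum(axis=1, keepdims=True), E / E.sum(axis=1, keepdims=True)
```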
Case 2: State paths are unknown: Viterbi training
We also start from given values of akl and ek(b), which define prior values of θ. Viterbi training attempts to maximize the probability of a most probable path, i.e., to maximize
p(s(x1),…,s(xn) | θ, x1,…,xn),
where s(xj) is the most probable (under θ) path for xj.
Case 2: State paths are unknown: Viterbi training
Each iteration:
1. Find a set {s(xj)} of most probable paths, which maximize p(s(x1),…,s(xn) | θ, x1,…,xn).
2. Find θ* which maximizes p(s(x1),…,s(xn) | θ*, x1,…,xn).
   Note: in step 1 the maximizing arguments are the paths; in step 2 it is θ*.
3. Set θ = θ*, and repeat. Stop when the paths do not change.
Case 2: State paths are unknown: Viterbi training
p(s(x1),…,s(xn) | θ*, x1,…,xn) can be expressed in closed form (since we are using a single path for each xj), so this time convergence is achieved when the optimal paths no longer change.
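A sketch of the Viterbi training loop, assuming the viterbi() and ml_estimate() sketches from earlier are in scope; the pseudocount values and the iteration cap are illustrative choices of mine.

```python
def viterbi_training(seqs, a, e, a0, max_iter=50):
    """Viterbi training: decode most probable paths, re-estimate from them,
    and repeat until the paths stop changing."""
    m, n_symbols = e.shape
    prev_paths = None
    for _ in range(max_iter):
        paths = [viterbi(x, a, e, a0)[0] for x in seqs]      # step 1: decode
        if paths == prev_paths:                              # stop: paths unchanged
            break
        a, e = ml_estimate(list(zip(paths, seqs)), m, n_symbols,
                           r_trans=1.0, r_emit=1.0)          # step 2: re-estimate
        prev_paths = paths                                   # step 3: iterate
    return a, e
```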