Pattern Recognition and Machine Learning-Chapter 13: Sequential Data Affiliation: Kyoto University Name: Kevin Chien, Dr. Oba Shigeyuki, Dr. Ishii Shin Date: Dec. 9, 2011
Why Markov Models
• The i.i.d. assumption is not always realistic: future data (predictions) often depend on some recent observations. This sequential dependence is represented with directed acyclic graphs (DAGs), on which inference is done by the sum-product algorithm.
• State Space (Markov) Model: latent variables
• Discrete latent variables: Hidden Markov Model
• Gaussian latent variables: Linear Dynamical Systems
• Order of a Markov chain: how far back the data dependence reaches
• 1st order: the current observation depends only on the single previous observation
State Space Model
• The latent variables Zn form a Markov chain, and each Zn generates its own observation Xn.
• As the order of a Markov chain grows, the number of parameters grows with it; the state space model keeps this organized by moving the dependence into the latent chain.
• Zn-1 and Zn+1 are now independent given Zn (d-separated)
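For reference, the joint distribution encoded by this graph can be written out explicitly; the factorization below is the standard state space model form from Bishop's Chapter 13 and uses only the variables on the slide:

p(x_1,\ldots,x_N,\, z_1,\ldots,z_N) \;=\; p(z_1)\left[\prod_{n=2}^{N} p(z_n \mid z_{n-1})\right] \prod_{n=1}^{N} p(x_n \mid z_n)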
Terminologies: for understanding Markov Models
Terminologies
• Markovian property: in a stochastic process, the probability of a transition depends only on the present state and not on the manner in which the current state was reached.
• A transition diagram shows the same variable moving between its different states
Terminologies (cont.) [Big_O_notation, Wikipedia, Dec. 2011]
• f = Θ(g): f is bounded both above and below by g asymptotically
• (review) Zn+1 and Zn-1 are d-separated given Zn: observing Zn blocks the path through Zn's outgoing edge, so there is no unblocked path between Zn+1 and Zn-1 => they are conditionally independent
Markov Models: formula and motivation
Hidden Markov Models (HMM)
• Zn is a discrete multinomial variable
• Transition probability matrix A
• Each row sums to 1
• P(staying in the present state) is non-zero
• Counting the off-diagonal entries gives K(K-1) free parameters
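As a small illustration (not from the slides; the numbers are invented), a K=3 transition matrix in numpy whose rows each sum to 1, leaving K(K-1) free parameters:

import numpy as np

# Hypothetical 3-state transition matrix: A[j, k] = p(z_n = k | z_{n-1} = j)
A = np.array([
    [0.7, 0.2, 0.1],   # from state 0 (diagonal = probability of staying in the state)
    [0.1, 0.8, 0.1],   # from state 1
    [0.2, 0.3, 0.5],   # from state 2
])
assert np.allclose(A.sum(axis=1), 1.0)   # each row is a distribution over the next state
# K*K entries minus K row-normalization constraints leaves K*(K-1) free parameters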
Hidden Markov Models (cont.)
• Emission probabilities p(xn|zn, φ), with parameters φ governing the distribution of the observation given the latent state
• Homogeneous model: all latent variables share the same transition parameters A
• Sampling data is simple ancestral sampling: follow the transitions through the latent chain, emit an observation from the emission distribution at each step, and note the values along the way.
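A minimal sketch of that ancestral sampling procedure (my own code, not from the slides), assuming a discrete-emission HMM with given pi, A and B; the function and variable names are made up:

import numpy as np

def sample_hmm(pi, A, B, n, seed=0):
    """Ancestral sampling: pi[k]=p(z1=k), A[j,k]=p(zn=k|zn-1=j), B[k,v]=p(xn=v|zn=k)."""
    rng = np.random.default_rng(seed)
    K, V = B.shape
    z = np.empty(n, dtype=int)
    x = np.empty(n, dtype=int)
    z[0] = rng.choice(K, p=pi)              # draw the first latent state from p(z1)
    x[0] = rng.choice(V, p=B[z[0]])         # emit an observation from p(x1|z1)
    for t in range(1, n):
        z[t] = rng.choice(K, p=A[z[t-1]])   # transition step
        x[t] = rng.choice(V, p=B[z[t]])     # emission step
    return z, x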
HMM: Expectation Maximization for maximum likelihood
• Likelihood function: marginalize over the latent variables, p(X|θ) = ∑Z p(X,Z|θ)
• Start with initial model parameters θ_old = (π, A, φ)
• Evaluate the latent posteriors γ(zn) = p(zn|X, θ_old) and ξ(zn-1, zn) = p(zn-1, zn|X, θ_old) (E step)
• Defining Q(θ, θ_old) = ∑Z p(Z|X, θ_old) ln p(X,Z|θ)
• Maximize Q(θ, θ_old) with respect to θ (M step); the results are updated π, A and emission parameters φ
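One way to make the M step concrete is the sketch below (mine, not from the slides), which re-estimates the parameters of a discrete-emission HMM once the E-step posteriors gamma and xi are available; it follows the standard Baum-Welch updates, and all names are assumptions:

import numpy as np

def m_step(x, gamma, xi, V):
    """gamma[n, k] = p(z_n=k | X); xi[n, j, k] = p(z_n=j, z_{n+1}=k | X); V = alphabet size."""
    pi = gamma[0] / gamma[0].sum()            # initial-state distribution
    A = xi.sum(axis=0)
    A /= A.sum(axis=1, keepdims=True)         # transition matrix, rows sum to 1
    B = np.zeros((gamma.shape[1], V))         # discrete emission matrix B[k, v]
    for n, v in enumerate(x):
        B[:, v] += gamma[n]                   # expected counts of symbol v under each state
    B /= B.sum(axis=1, keepdims=True)
    return pi, A, B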
HMM: forward-backward algorithm
• Two-stage message passing on the tree-structured graph of the HMM, to find the marginals p(zk|x) for each node efficiently
• Assume p(xk|zk), p(zk|zk-1), p(z1) are known
• Notation: x = (x1,..,xn), xi:j = (xi, xi+1,..,xj)
• Goal: compute p(zk|x)
• Forward part: compute p(zk, x1:k) for every k=1,..,n
• Backward part: compute p(xk+1:n|zk) for every k=1,..,n
HMM: forward-backward algorithm (cont.)
• p(zk|x) ∝ p(zk,x) = p(xk+1:n|zk,x1:k) p(zk,x1:k)
• xk+1:n and x1:k are d-separated given zk
• so p(zk|x) ∝ p(zk,x) = p(xk+1:n|zk) p(zk,x1:k)
• With these quantities we can run the EM (Baum-Welch) algorithm to estimate parameter values
• We can also sample from the posterior of z given x, and find the most likely z with the Viterbi algorithm
HMM forward-backward algorithm: forward part
• Compute p(zk, x1:k)
• p(zk,x1:k) = ∑(all values of zk-1) p(zk,zk-1,x1:k) = ∑(all values of zk-1) p(xk|zk,zk-1,x1:k-1) p(zk|zk-1,x1:k-1) p(zk-1,x1:k-1)
• This looks like a recursive function: label p(zk,x1:k) as αk(zk), and note that
• zk-1, x1:k-1 and xk are d-separated given zk
• zk and x1:k-1 are d-separated given zk-1
So αk(zk) = ∑(all values of zk-1) p(xk|zk) p(zk|zk-1) αk-1(zk-1), for k=2,..,n
(emission prob. × transition prob. × recursive part)
HMM forward-backward algorithm: forward part (cont.)
• α1(z1) = p(z1,x1) = p(z1) p(x1|z1)
• If each z has m states then the computational complexity is
• Θ(m) for each value of zk, for one k
• Θ(m²) for each k
• Θ(nm²) in total
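A minimal sketch (my own numpy code, not from the book or the slides) of this forward recursion for a discrete-emission HMM; pi, A, B follow the same assumed conventions as the sampling sketch above:

import numpy as np

def forward(x, pi, A, B):
    """Return alpha with alpha[k, i] = p(z_k = i, x_1:k)."""
    n, m = len(x), len(pi)
    alpha = np.zeros((n, m))
    alpha[0] = pi * B[:, x[0]]                      # alpha_1(z1) = p(z1) p(x1|z1)
    for k in range(1, n):
        alpha[k] = B[:, x[k]] * (alpha[k-1] @ A)    # emission * sum over z_{k-1} of transition * alpha_{k-1}
    return alpha                                    # Θ(nm²) overall, matching the count above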
HMM forward-backward algorithm: backward part
• Compute p(xk+1:n|zk) for all zk and all k=1,..,n-1
• p(xk+1:n|zk) = ∑(all values of zk+1) p(xk+1:n,zk+1|zk) = ∑(all values of zk+1) p(xk+2:n|zk+1,zk,xk+1) p(xk+1|zk+1,zk) p(zk+1|zk)
• This also looks like a recursive function: label p(xk+1:n|zk) as βk(zk), and note that
• zk, xk+1 and xk+2:n are d-separated given zk+1
• zk and xk+1 are d-separated given zk+1
So βk(zk) = ∑(all values of zk+1) βk+1(zk+1) p(xk+1|zk+1) p(zk+1|zk), for k=1,..,n-1
(recursive part × emission prob. × transition prob.)
HMM forward-backward algorithm: backward part (cont.)
• βn(zn) = 1 for all zn
• If each z has m states then the computational complexity is the same as for the forward part
• Θ(nm²) in total
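A companion sketch (again mine, same assumed conventions) of the backward recursion, plus the combination p(zk|x) ∝ αk(zk) βk(zk) from the earlier slide:

import numpy as np

def backward(x, A, B):
    """Return beta with beta[k, i] = p(x_{k+1:n} | z_k = i)."""
    n, m = len(x), A.shape[0]
    beta = np.ones((n, m))                            # beta_n(zn) = 1
    for k in range(n - 2, -1, -1):
        beta[k] = A @ (B[:, x[k+1]] * beta[k+1])      # sum over z_{k+1} of transition * emission * beta_{k+1}
    return beta

def posterior(alpha, beta):
    gamma = alpha * beta                              # proportional to p(z_k, x)
    return gamma / gamma.sum(axis=1, keepdims=True)   # normalize to get p(z_k | x)

For long sequences these products underflow, so a practical implementation would rescale alpha and beta at each step or work in log space.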
HMM: Viterbi algorithm
• Max-sum algorithm for the HMM, to find the most probable sequence of hidden states for a given observation sequence x1:n
• Example: transforming handwriting images into text
• Assume p(xk|zk), p(zk|zk-1), p(z1) known
• Goal: compute z* = argmaxz p(z|x), where x = x1:n and z = z1:n
• Lemma: if f(a) ≥ 0 ∀a and g(a,b) ≥ 0 ∀a,b, then maxa,b f(a)g(a,b) = maxa [f(a) maxb g(a,b)]
• maxz p(z|x) ∝ maxz p(z,x)
HMM: Viterbi algorithm (cont.)
• μk(zk) = maxz1:k-1 p(z1:k,x1:k) = maxz1:k-1 [p(xk|zk) p(zk|zk-1)] × [p(z1:k-1,x1:k-1)], where p(xk|zk) p(zk|zk-1) plays the role of f(a) and p(z1:k-1,x1:k-1) plays the role of g(a,b)
• This becomes a recursive function if we can pull a max in front of p(z1:k-1,x1:k-1). Apply the lemma with a = zk-1, b = z1:k-2:
= maxzk-1 [p(xk|zk) p(zk|zk-1) maxz1:k-2 p(z1:k-1,x1:k-1)]
= maxzk-1 [p(xk|zk) p(zk|zk-1) μk-1(zk-1)], for k=2,..,n
HMM: Viterbi algorithm (finish up)
• μk(zk) = maxzk-1 p(xk|zk) p(zk|zk-1) μk-1(zk-1), for k=2,..,n, with μ1(z1) = p(x1,z1) = p(z1) p(x1|z1)
• The same method gives maxzn μn(zn) = maxz p(x,z)
• This yields the maximum value; to recover the maximizing sequence, compute the recursion bottom-up while remembering which zk-1 achieved each maximum, then backtrack from the best final state (μk(zk) looks at all paths through μk-1(zk-1))
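A minimal sketch (mine, same assumed pi/A/B conventions as the forward sketch) of this recursion together with the backtracking step:

import numpy as np

def viterbi(x, pi, A, B):
    """Return the most probable state sequence z* = argmax_z p(z, x)."""
    n, m = len(x), len(pi)
    mu = np.zeros((n, m))
    back = np.zeros((n, m), dtype=int)            # remembers the maximizing z_{k-1} for each z_k
    mu[0] = pi * B[:, x[0]]                       # mu_1(z1) = p(z1) p(x1|z1)
    for k in range(1, n):
        scores = mu[k-1][:, None] * A             # mu_{k-1}(z_{k-1}) * p(z_k | z_{k-1})
        back[k] = scores.argmax(axis=0)
        mu[k] = B[:, x[k]] * scores.max(axis=0)   # times the emission p(x_k | z_k)
    z = np.zeros(n, dtype=int)
    z[-1] = mu[-1].argmax()                       # best final state
    for k in range(n - 1, 0, -1):
        z[k-1] = back[k, z[k]]                    # backtrack along the remembered maxima
    return z

As with the forward-backward recursions, a practical implementation would use log probabilities (products become sums) to avoid numerical underflow on long sequences.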
Additional Information
• Equations and diagrams excerpted from Bishop, C. M., Pattern Recognition and Machine Learning, pp. 605-646
• Equations excerpted from mathematicalmonk (YouTube), videos ML 14.6 and 14.7, various titles, July 2011