
Hidden Markov Models (HMMs)


Presentation Transcript


  1. Hidden Markov Models (HMMs) • Probabilistic Automata • Ubiquitous in Speech/Speaker Recognition/Verification • Suitable for modelling phenomena which are dynamic in nature • Can be used for handwriting, keystroke biometrics

  2. Classification with Static Features • Simpler than the dynamic problem • Can use, for example, MLPs • E.g. in two-dimensional space: a scatter of ‘x’ and ‘o’ samples belonging to two classes

  3. Hidden Markov Models (HMMs) • First: Visible Markov Models (VMMs) • Formal Definition • Recognition • Training • HMMs • Formal Definition • Recognition • Training • Trellis Algorithms • Forward-Backward • Viterbi

  4. Visible Markov Models • Probabilistic Automaton • N distinct states S = {s1, …, sN} • M-element output alphabet K = {k1, …, kM} • Initial state probabilities Π = {πi}, i ∈ S • State transitions at t = 1, 2, … • State transition probabilities A = {aij}, i, j ∈ S • State sequence X = {X1, …, XT}, Xt ∈ S • Output sequence O = {o1, …, oT}, ot ∈ K

  5. VMM: Weather Example

  6. Generative VMM • We choose the state sequence probabilistically… • We could try this using: • the numbers 1-10 • drawing from a hat • an ad-hoc assignment scheme

  7. 2 Questions • Training Problem • Given an observation sequence O and a “space” of possible models which spans possible values for model parameters w = {A, Π}, how do we find the model that best explains the observed data? • Recognition (decoding) problem • Given a model wi = {A, Π}, how do we compute how likely a certain observation is, i.e. P(O | wi)?

  8. Training VMMs • Given observation sequences O, we want to find model parameters w = {A, Π} which best explain the observations • I.e. we want to find values for w = {A, Π} that maximise P(O | w) • {A, Π}chosen = argmax{A, Π} P(O | {A, Π})

  9. Training VMMs • Straightforward for VMMs • πi = frequency in state i at time t = 1 • aij = (number of transitions from state i to state j) / (number of transitions from state i) = (number of transitions from state i to state j) / (number of times in state i)
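
A minimal sketch of this counting estimate in Python/NumPy (the function name train_vmm and the list-of-integer-sequences input format are assumptions for illustration, not from the slides):

```python
import numpy as np

def train_vmm(state_sequences, n_states):
    """Estimate VMM parameters {A, Pi} by counting, as on slide 9.
    state_sequences: list of observed state sequences, each a list of
    integer state indices 0..n_states-1 (an assumed input format)."""
    pi_counts = np.zeros(n_states)
    a_counts = np.zeros((n_states, n_states))
    for seq in state_sequences:
        pi_counts[seq[0]] += 1                       # frequency in state i at t = 1
        for i, j in zip(seq[:-1], seq[1:]):
            a_counts[i, j] += 1                      # transitions from state i to state j
    pi = pi_counts / len(state_sequences)
    row_sums = a_counts.sum(axis=1, keepdims=True)   # number of transitions from state i
    A = np.divide(a_counts, row_sums, out=np.zeros_like(a_counts), where=row_sums > 0)
    return pi, A
```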

  10. Recognition • We need to calculate P(O | wi) • P(O | wi) is handy for calculating P(wi | O) • If we have a set of models L = {w1, w2, …, wV} then if we can calculate P(wi | O) we can choose the model which returns the highest probability, i.e. • wchosen = argmaxwi ∈ L P(wi | O)

  11. Recognition • Why is P(O | wi) of use? • Let’s revisit speech for a moment • In speech we are given a sequence of observations, e.g. a series of MFCC vectors • E.g. MFCCs taken from frames of length 20-40 ms, every 10-20 ms • If we have a set of models L = {w1, w2, …, wV} and if we can calculate P(wi | O), we can choose the model which returns the highest probability, i.e. wchosen = argmaxwi ∈ L P(wi | O)

  12. wchosen = argmaxwi ∈ L P(wi | O) • P(wi | O) is difficult to calculate directly, as we would need a model for every possible observation sequence O • Use Bayes’ rule: P(x | y) = P(y | x) P(x) / P(y) • So now we have wchosen = argmaxwi ∈ L P(O | wi) P(wi) / P(O) • P(wi) can be easily calculated • P(O) is the same for each calculation and so can be ignored • So P(O | wi) is the key!!!
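
A minimal sketch of this Bayes-rule selection, assuming a hypothetical likelihood_fn(model, obs) that returns P(O | wi) (for example the forward algorithm sketched further below); all names here are illustrative:

```python
import numpy as np

def choose_model(models, priors, obs, likelihood_fn):
    """Return the index of argmax_i P(O | w_i) P(w_i); P(O) is constant and dropped.
    likelihood_fn(model, obs) is assumed to return P(O | w_i)."""
    scores = [likelihood_fn(w, obs) * p for w, p in zip(models, priors)]
    return int(np.argmax(scores))
```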

  13. Hidden Markov Models • Probabilistic Automaton • N distinct states S = {s1, …, sN} • M-element output alphabet K = {k1, …, kM} • Initial state probabilities Π = {πi}, i ∈ S • State transitions at t = 1, 2, … • State transition probabilities A = {aij}, i, j ∈ S • Symbol emission probabilities B = {bik}, i ∈ S, k ∈ K • State sequence X = {X1, …, XT}, Xt ∈ S • Output sequence O = {o1, …, oT}, ot ∈ K
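
To make the notation concrete, here is a toy parameter set in Python/NumPy; the numbers are invented for illustration (they are not the weather example) and the variable names simply mirror the slide notation:

```python
import numpy as np

# Toy HMM with N = 2 states and an M = 3 symbol output alphabet.
pi = np.array([0.6, 0.4])           # initial state probabilities, pi_i
A = np.array([[0.7, 0.3],           # a_ij = P(X_{t+1} = s_j | X_t = s_i)
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],      # b_ik = P(o_t = k | X_t = s_i)
              [0.1, 0.3, 0.6]])
obs = [0, 2, 1]                     # an example output sequence O, as symbol indices
```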

  14. HMM: Weather Example

  15. State Emission Distributions • Discrete probability distribution

  16. State Emission Distributions • Continuous probability distribution

  17. Generative HMM • Now we not only choose the state sequence probabilistically… • …but also the state emissions • Try this yourself using the numbers 1-10 and drawing from a hat...

  18. 3 Questions • Recognition (decoding) problem • Given a model wi = {A, B, Π}, how do we compute how likely a certain observation is, i.e. P(O | wi)? • State sequence? • Given the observation sequence and a model, how do we choose a state sequence X = {X1, …, XT} that best explains the observations? • Training Problem • Given an observation sequence O and a “space” of possible models which spans possible values for model parameters w = {A, B, Π}, how do we find the model that best explains the observed data?

  19. Computing P(O | w) • For any particular state sequence X = {X1, …, XT} we have P(O | X, w) = bX1(o1) bX2(o2) … bXT(oT) and P(X | w) = πX1 aX1X2 aX2X3 … aXT-1XT • Summing over all possible state sequences gives P(O | w) = ΣX P(O | X, w) P(X | w)

  20. Computing P(O | w) • This requires (2T) · N^T multiplications (about 2T of them for each of the N^T possible state sequences) • Very inefficient!
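
For intuition only, here is a brute-force Python sketch that computes P(O | w) by enumerating every state sequence; it is exactly this exponential enumeration that the trellis algorithms below avoid (function and variable names are illustrative):

```python
import itertools
import numpy as np

def brute_force_likelihood(pi, A, B, obs):
    """P(O | w) by summing P(O | X, w) P(X | w) over every state sequence X.
    Costs on the order of 2T * N**T multiplications -- only for tiny examples."""
    N, T = len(pi), len(obs)
    total = 0.0
    for X in itertools.product(range(N), repeat=T):      # all N**T state sequences
        p = pi[X[0]] * B[X[0], obs[0]]
        for t in range(1, T):
            p *= A[X[t - 1], X[t]] * B[X[t], obs[t]]
        total += p
    return total
```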

  21. Trellis Algorithms • Array of states vs. time

  22. Trellis Algorithms • Overlap in paths implies repetition of the same calculations • Harness the overlap to make calculations efficient • A node at (si , t) stores info about state sequences that contain Xt = si

  23. Trellis Algorithms • Consider 2 states and 3 time points:

  24. Forward Algorithm • A node at (si, t) stores info about state sequences up to time t that arrive at si

  25. Forward Algorithm • Define αt(i) = P(o1 … ot, Xt = si | w) • Initialisation: α1(i) = πi bi(o1) • Induction: αt+1(j) = [Σi αt(i) aij] bj(ot+1) • Termination: P(O | w) = Σi αT(i)
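
A minimal Python/NumPy sketch of the forward pass, following the recursion above but without the scaling normally added to avoid numerical underflow (function name and array layout are assumptions):

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward algorithm: alpha[t, i] = P(o_1..o_t, X_t = s_i | w)."""
    N, T = len(pi), len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                      # initialisation
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]  # induction over the trellis
    return alpha, alpha[-1].sum()                     # trellis and P(O | w)
```

With the toy parameters above, forward(pi, A, B, obs)[1] returns the same value as the brute-force enumeration, but with far fewer multiplications.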

  26. Backward Algorithm • A node at (si, t) stores info about state sequences from time t that evolve from si

  27. Backward Algorithm • Define βt(i) = P(ot+1 … oT | Xt = si, w) • Initialisation: βT(i) = 1 • Induction: βt(i) = Σj aij bj(ot+1) βt+1(j) • Termination: P(O | w) = Σi πi bi(o1) β1(i)
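
The matching backward pass, again as an unscaled sketch with assumed names:

```python
import numpy as np

def backward(pi, A, B, obs):
    """Backward algorithm: beta[t, i] = P(o_{t+1}..o_T | X_t = s_i, w)."""
    N, T = len(pi), len(obs)
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                      # initialisation
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])  # induction, backwards in time
    likelihood = (pi * B[:, obs[0]] * beta[0]).sum()    # P(O | w), should match forward()
    return beta, likelihood
```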

  28. Forward & Backward Algorithms • P(O | w) as calculated from the forward and backward algorithms should be the same • FB algorithm usually used in training • FB algorithm not suited to recognition as it considers all possible state sequences • In reality, we would like to only consider the “best” state sequence (HMM problem 2)

  29. “Best” State Sequence • How is “best” defined? • We could choose the most likely individual state at each time t:

  30. Define γt(i) = P(Xt = si | O, w) = αt(i) βt(i) / P(O | w) • Choosing Xt = argmaxi γt(i) individually at each time t…
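
In code, γ can be read off the forward and backward passes; this hypothetical helper simply reuses the forward() and backward() sketches above:

```python
def posterior_states(pi, A, B, obs):
    """gamma[t, i] = P(X_t = s_i | O, w) = alpha[t, i] * beta[t, i] / P(O | w).
    Builds on the forward()/backward() sketches above; names are illustrative."""
    alpha, likelihood = forward(pi, A, B, obs)
    beta, _ = backward(pi, A, B, obs)
    return alpha * beta / likelihood

# Per-time most likely individual states (the choice discussed on slide 29):
# best_states = posterior_states(pi, A, B, obs).argmax(axis=1)
```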

  31. Viterbi Algorithm • …may produce an unlikely or even invalid state sequence (e.g. one containing a transition with zero probability) • One solution is to choose the single most likely state sequence:

  32. Viterbi Algorithm • Define δt(i) = maxX1,…,Xt-1 P(X1 … Xt-1, Xt = si, o1 … ot | w) • δt(i) is the best score along a single path, at time t, which accounts for the first t observations and ends in state si • By induction we have: δt+1(j) = [maxi δt(i) aij] bj(ot+1) (see the code sketch after slide 34)

  33. Viterbi Algorithm

  34. Viterbi Algorithm
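
Putting the recursion from slide 32 together with back-pointers, a minimal Python/NumPy sketch of Viterbi decoding (using raw probabilities; log-probabilities are preferable in practice, and the names are illustrative):

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most likely state sequence via delta[t, i] and back-pointers psi[t, i]."""
    N, T = len(pi), len(obs)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]                      # initialisation
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A            # scores[i, j] = delta_{t-1}(i) * a_ij
        psi[t] = scores.argmax(axis=0)                # best predecessor of each state j
        delta[t] = scores.max(axis=0) * B[:, obs[t]]  # induction
    path = np.zeros(T, dtype=int)                     # backtrack from the best final state
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path, delta[-1].max()                      # best state sequence and its score
```

Replacing the max/argmax pair with a sum over predecessors turns this into the forward algorithm, which is the comparison made on slide 35.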

  35. Viterbi vs. Forward Algorithm • Similar in implementation • Forward sums over all incoming paths • Viterbi maximises • Viterbi probability ≤ Forward probability • Both efficiently implemented using a trellis structure
