
Fast State Discovery for HMM Model Selection and Learning


Presentation Transcript


  1. Fast State Discovery for HMM Model Selection and Learning. Sajid M. Siddiqi, Geoffrey J. Gordon, Andrew W. Moore (CMU)

  2. Consider a sequence of real-valued observations (speech, sensor readings, stock prices …) [plot of Ot against time t]

  3. Consider a sequence of real-valued observations (speech, sensor readings, stock prices …). We can model it purely based on contextual properties

  4. Consider a sequence of real-valued observations (speech, sensor readings, stock prices …). We can model it purely based on contextual properties

  5. Consider a sequence of real-valued observations (speech, sensor readings, stock prices …). We can model it purely based on contextual properties. However, we would miss important temporal structure

  6. Consider a sequence of real-valued observations (speech, sensor readings, stock prices …). We can model it purely based on contextual properties. However, we would miss important temporal structure

  7. Current efficient approaches learn the wrong model

  8. Current efficient approaches learn the wrong model. Our method successfully discovers the overlapping states

  9. Our goal: Efficiently discover states in sequential data while learning a Hidden Markov Model

  10. Motion Capture

  11. Definitions and Notation. An HMM is λ = {A, B, π}, where A : N×N transition matrix; B : observation model {μs, Σs} for each of the N states; π : N×1 prior probability vector. T : length of the observation sequence O1,…,OT. qt : the state the HMM is in at time t, with qt ∈ {s1,…,sN}
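
To make the notation concrete, here is a minimal container for these parameters, assuming Gaussian observation densities B = {μs, Σs}; the class and field names are illustrative, not from the paper.

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class HMM:
        A: np.ndarray      # N x N transition matrix; each row sums to 1
        mu: np.ndarray     # N x d observation means (part of B)
        Sigma: np.ndarray  # N x d x d observation covariances (part of B)
        pi: np.ndarray     # N x 1 prior probability vector

        @property
        def N(self) -> int:
            return self.A.shape[0]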

  12. Operations on HMMs

  13. Operations on HMMs
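
Among the standard operations on HMMs, the one used most heavily in the following slides is Viterbi decoding, i.e. computing the most likely state path Q*. A generic log-space sketch for the container above (not the authors' code):

    import numpy as np
    from scipy.stats import multivariate_normal

    def viterbi(hmm, O):
        """Most likely state path Q* for an observation sequence O (T x d array)."""
        T, N = len(O), hmm.N
        # T x N matrix of per-state emission log-likelihoods
        logB = np.column_stack([multivariate_normal.logpdf(O, hmm.mu[s], hmm.Sigma[s])
                                for s in range(N)])
        logA = np.log(hmm.A)
        delta = np.full((T, N), -np.inf)        # best log-probability ending in each state
        psi = np.zeros((T, N), dtype=int)       # backpointers
        delta[0] = np.log(hmm.pi).ravel() + logB[0]
        for t in range(1, T):
            scores = delta[t - 1][:, None] + logA   # scores[i, j]: come from i, go to j
            psi[t] = scores.argmax(axis=0)
            delta[t] = scores.max(axis=0) + logB[t]
        Q = np.empty(T, dtype=int)
        Q[-1] = delta[-1].argmax()
        for t in range(T - 2, -1, -1):
            Q[t] = psi[t + 1, Q[t + 1]]
        return Q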

  14. Previous Approaches • Multi-restart Baum-Welch (for fixed N) is inefficient and highly prone to local minima

  15. Previous Approaches • Multi-restart Baum-Welch (for fixed N) is inefficient and highly prone to local minima

  16. Previous Approaches • Multi-restart Baum-Welch (for fixed N) is inefficient and highly prone to local minima

  17. Previous Approaches • Multi-restart Baum-Welch (for fixed N) is inefficient and highly prone to local minima. We propose Simultaneous Temporal and Contextual Splitting (STACS), a top-down approach that is much better at state discovery while being at least as efficient, and a variant V-STACS that is much faster.
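
For contrast, the multi-restart baseline can be sketched in a few lines; random_hmm, learn_em, and loglik are hypothetical helpers standing in for random initialization, Baum-Welch (EM), and sequence log-likelihood.

    def multi_restart_baum_welch(O, N, d, restarts=10):
        # Run Baum-Welch from several random initializations of a fixed-size model
        # and keep the highest-likelihood result; every restart pays the full cost
        # of EM on the whole sequence.
        models = [learn_em(random_hmm(N, d), O) for _ in range(restarts)]
        return max(models, key=lambda m: loglik(m, O))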

  18. Bayesian Information Criterion (BIC) for Model Selection
  - Would like to compute the posterior probability over model size for model selection:
    P(model size | data) ∝ P(data | model size) P(model size)
    log P(model size | data) ∝ log P(data | model size) + log P(model size)

  19. Bayesian Information Criterion (BIC) for Model Selection
  - Would like to compute the posterior probability over model size for model selection:
    P(model size | data) ∝ P(data | model size) P(model size)
    log P(model size | data) ∝ log P(data | model size) + log P(model size)
  - BIC assumes a prior that penalizes complexity (favors smaller models):
    log P(model size | data) ≈ log P(data | model size, MLE) - (#FP/2) log T
    where #FP = number of free parameters, T = length of the data sequence, and MLE is the maximum-likelihood parameter estimate

  20. Bayesian Information Criterion (BIC) for Model Selection
  - Would like to compute the posterior probability over model size for model selection:
    P(model size | data) ∝ P(data | model size) P(model size)
    log P(model size | data) ∝ log P(data | model size) + log P(model size)
  - BIC assumes a prior that penalizes complexity (favors smaller models):
    log P(model size | data) ≈ log P(data | model size, MLE) - (#FP/2) log T
    where #FP = number of free parameters, T = length of the data sequence, and MLE is the maximum-likelihood parameter estimate
  - BIC is an asymptotic approximation to the true posterior
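
A sketch of this BIC score; the free-parameter count assumes the full-covariance Gaussian emission model from the earlier sketch and is only illustrative bookkeeping, not the paper's exact accounting.

    import numpy as np

    def num_free_params(N, d):
        transitions = N * (N - 1)                 # each row of A sums to 1
        prior = N - 1                             # pi sums to 1
        emissions = N * (d + d * (d + 1) // 2)    # mean + symmetric covariance per state
        return transitions + prior + emissions

    def bic_score(loglik_at_mle, N, d, T):
        # log P(data | model size, MLE) - (#FP / 2) * log T
        return loglik_at_mle - 0.5 * num_free_params(N, d) * np.log(T)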

  21. Algorithm Summary (STACS/V-STACS)
  • Initialize an n0-state HMM randomly
  • for n = n0 … Nmax
    • Learn model parameters
    • for i = 1 … n
      • Split state i, optimize by constrained EM (STACS) or constrained Viterbi training (V-STACS)
      • Calculate the approximate BIC score of the split model
    • Choose the best split based on approximate BIC
    • Compare to the original model with exact BIC (STACS) or approximate BIC (V-STACS)
    • if the larger model is not chosen, stop
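
A compact rendering of this outer loop, reusing bic_score from above; random_hmm, learn_em, loglik, and best_split_candidate are hypothetical helpers standing in for the routines detailed on the following slides, so this is a sketch rather than the authors' implementation.

    def stacs(O, n0, N_max, d):
        hmm = random_hmm(n0, d)
        for n in range(n0, N_max + 1):
            hmm = learn_em(hmm, O)                     # optimize current model parameters
            candidate = best_split_candidate(hmm, O)   # best of the n split candidates
            T = len(O)
            if bic_score(loglik(candidate, O), candidate.N, d, T) <= \
               bic_score(loglik(hmm, O), hmm.N, d, T):
                break                                  # larger model rejected: stop
            hmm = candidate
        return hmm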

  22. STACS
      input: n0, data sequence O = {O1,…,OT}
      output: HMM of appropriate size
      λ ← n0-state initial HMM
      repeat
        optimize λ over sequence O
        choose a subset of states Σ
        for each s ∈ Σ
          design a candidate model λs:
            choose a relevant subset of sequence O
            split state s, optimize λs over the subset
            score λs
        end for
        if max over s∈Σ of score(λs) > score(λ)
          λ ← best-scoring candidate from {λs}
        else
          terminate, return current λ
        end if
      end repeat

  23. STACS • Learn parameters using EM, calculate the Viterbi path Q*

  24. STACS • Learn parameters using EM, calculate the Viterbi path Q* • Consider splits on all states, e.g. for state s2

  25. STACS • Learn parameters using EM, calculate the Viterbi path Q* • Consider splits on all states, e.g. for state s2 • Choose a subset D = {Ot : Q*(t) = s2}

  26. STACS • Learn parameters using EM, calculate the Viterbi path Q* • Consider splits on all states, e.g. for state s2 • Choose a subset D = {Ot : Q*(t) = s2} • Note that |D| = O(T/N)
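
Selecting D for a candidate split of state s is just an index selection on the Viterbi path; a small sketch using the viterbi() routine above.

    import numpy as np

    def split_subset(hmm, O, s):
        Q = viterbi(hmm, O)
        D = np.where(Q == s)[0]      # timesteps whose Viterbi state is s
        return Q, D                  # |D| is about T/N on average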

  27. STACS • Split the state

  28. STACS • Split the state • Constrain λs to λ except for the offspring states' observation densities and all their transition probabilities, both in and out

  29. STACS • Split the state • Constrain λs to λ except for the offspring states' observation densities and all their transition probabilities, both in and out • Learn the free parameters using two-state EM over D. This optimizes the partially observed likelihood P(O, Q*\D | λs)

  30. STACS • Split the state • Constrain λs to λ except for the offspring states' observation densities and all their transition probabilities, both in and out • Learn the free parameters using two-state EM over D. This optimizes the partially observed likelihood P(O, Q*\D | λs) • Update Q* over D to get R*
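
A rough stand-in for constructing one split candidate. The paper optimizes the free parameters with a constrained two-state EM over D; the sketch below instead takes a single hard pass (a 2-component Gaussian mixture on the observations in D, then transition re-estimation from the updated path R*), which is closer in spirit to the hard-update V-STACS variant described later. Helper and field names follow the earlier sketches and are illustrative.

    import numpy as np
    from copy import deepcopy
    from sklearn.mixture import GaussianMixture

    def split_candidate(hmm, O, Q, s):
        D = np.where(Q == s)[0]                          # timesteps on state s
        gmm = GaussianMixture(n_components=2, covariance_type='full').fit(O[D])
        assign = gmm.predict(O[D])                       # hard offspring assignment

        new = hmm.N                                      # index of the new offspring state
        cand = deepcopy(hmm)
        cand.A = np.pad(cand.A, ((0, 1), (0, 1)))        # grow A to (N+1) x (N+1)
        cand.mu = np.vstack([cand.mu, gmm.means_[1:2]])
        cand.mu[s] = gmm.means_[0]
        cand.Sigma = np.concatenate([cand.Sigma, gmm.covariances_[1:2]], axis=0)
        cand.Sigma[s] = gmm.covariances_[0]
        cand.pi = np.vstack([cand.pi, [[1e-6]]])
        cand.pi = cand.pi / cand.pi.sum()

        R = Q.copy()
        R[D[assign == 1]] = new                          # updated path R* over D only
        counts = np.full_like(cand.A, 1e-6)              # smoothed transition counts along R
        for t in range(len(R) - 1):
            counts[R[t], R[t + 1]] += 1
        # incoming transitions: split each A[i, s] between the offspring by counts
        for i in range(hmm.N):
            if i == s:
                continue
            into = counts[i, s] + counts[i, new]
            frac = counts[i, new] / into
            cand.A[i, new] = hmm.A[i, s] * frac
            cand.A[i, s] = hmm.A[i, s] * (1 - frac)
        # outgoing transitions of both offspring re-estimated from counts
        for i in (s, new):
            cand.A[i] = counts[i] / counts[i].sum()
        return cand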

  31. STACS • Scoring is of two types:

  32. STACS • Scoring is of two types: • The candidates are compared to each other according to their Viterbi path likelihoods

  33. STACS • Scoring is of two types: • The candidates are compared to each other according to their Viterbi path likelihoods • The best candidate in this ranking is compared to the un-split model λ using BIC, i.e. log P(model | data) ≈ log P(data | model) - complexity penalty
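
The two-stage scoring can be sketched as follows, with bic_score from the earlier sketch; viterbi_loglik and loglik are hypothetical helpers for the Viterbi-path and full-sequence log-likelihoods.

    def choose_next_model(hmm, candidates, O, d):
        T = len(O)
        # stage 1: rank split candidates against each other by Viterbi path likelihood
        best = max(candidates, key=lambda c: viterbi_loglik(c, O))
        # stage 2: accept the best candidate only if it beats the un-split model's BIC
        if bic_score(loglik(best, O), best.N, d, T) > bic_score(loglik(hmm, O), hmm.N, d, T):
            return best
        return hmm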

  34. Viterbi STACS (V-STACS)

  35. Viterbi STACS (V-STACS) • Recall that STACS learns the free parameters using two-state EM over D. However, EM also has "winner-take-all" variants

  36. Viterbi STACS (V-STACS) • Recall that STACS learns the free parameters using two-state EM over D. However, EM also has "winner-take-all" variants • V-STACS uses two-state Viterbi training over D to learn the free parameters, which uses hard updates vs. STACS' soft updates

  37. Viterbi STACS (V-STACS) • Recall that STACS learns the free parameters using two-state EM over D. However, EM also has "winner-take-all" variants • V-STACS uses two-state Viterbi training over D to learn the free parameters, which uses hard updates vs. STACS' soft updates • In V-STACS, the Viterbi path likelihood is also used to approximate the BIC when comparing against the un-split model
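
To make the hard-update idea concrete, here is one generic Viterbi-training step for a whole model (not the two-state restricted version used inside V-STACS): assign each timestep to its single best state and re-estimate parameters from those hard assignments, instead of EM's soft posteriors.

    import numpy as np

    def viterbi_training_step(hmm, O):
        Q = viterbi(hmm, O)                              # hard state assignments
        d = O.shape[1]
        for s in range(hmm.N):
            Os = O[Q == s]
            if len(Os) > 1:
                hmm.mu[s] = Os.mean(axis=0)
                hmm.Sigma[s] = np.atleast_2d(np.cov(Os, rowvar=False)) + 1e-6 * np.eye(d)
        counts = np.full_like(hmm.A, 1e-6)               # smoothed transition counts along Q
        for t in range(len(Q) - 1):
            counts[Q[t], Q[t + 1]] += 1
        hmm.A = counts / counts.sum(axis=1, keepdims=True)
        return hmm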

  38. Time Complexity • Optimizing N candidates takes • N × O(T) time for STACS • N × O(T/N) time for V-STACS • Scoring N candidates takes N × O(T) time • Candidate search and scoring is O(TN) • Best-candidate evaluation is • O(TN²) for BIC in STACS • O(TN) for approximate BIC in V-STACS

  39. Other Methods • Li-Biswas • Generates two candidates • splits the state with highest variance • merges the pair of closest states (rarely chosen)

  40. Other Methods • Li-Biswas • Generates two candidates • splits the state with highest variance • merges the pair of closest states (rarely chosen) • Optimizes all candidate parameters over the entire sequence

  41. Other Methods • Li-Biswas • Generates two candidates • splits the state with highest variance • merges the pair of closest states (rarely chosen) • Optimizes all candidate parameters over the entire sequence • ML-SSS • Generates 2N candidates, splitting each state in two ways

  42. Other Methods • Li-Biswas • Generates two candidates • splits the state with highest variance • merges the pair of closest states (rarely chosen) • Optimizes all candidate parameters over the entire sequence • ML-SSS • Generates 2N candidates, splitting each state in two ways • Contextual split: optimizes offspring states' observation densities with 2-Gaussian mixture EM, assumes offspring connected "in parallel"

  43. Other Methods • Li-Biswas • Generates two candidates • splits the state with highest variance • merges the pair of closest states (rarely chosen) • Optimizes all candidate parameters over the entire sequence • ML-SSS • Generates 2N candidates, splitting each state in two ways • Contextual split: optimizes offspring states' observation densities with 2-Gaussian mixture EM, assumes offspring connected "in parallel" • Temporal split: optimizes offspring states' observation densities, self-transitions and mutual transitions with EM, assumes offspring "in series"

  44. Other Methods • Li-Biswas • Generates two candidates • splits the state with highest variance • merges the pair of closest states (rarely chosen) • Optimizes all candidate parameters over the entire sequence • ML-SSS • Generates 2N candidates, splitting each state in two ways • Contextual split: optimizes offspring states' observation densities with 2-Gaussian mixture EM, assumes offspring connected "in parallel" • Temporal split: optimizes offspring states' observation densities, self-transitions and mutual transitions with EM, assumes offspring "in series" • Optimizes the split of state s over all timesteps with nonzero posterior probability of being in state s [i.e. O(T) data points]

  45. Results

  46. Data sets • Australian Sign-Language data collected from two 5DT instrumented gloves and an Ascension Flock-of-Birds tracker [Kadous 2002; available in the UCI KDD Archive] • Other data sets obtained from the literature • Robot, MoCap, MLog, Vowel

  47. Learning HMMs of Predetermined Size: Scalability. Robot data (others similar)

  48. Learning HMMs of Predetermined Size: Log-Likelihood. Learning a 40-state HMM on Robot data (others similar)

  49. Learning HMMs of Predetermined Size. Learning 40-state HMMs

  50. Model Selection: Synthetic Data • Generalize from (4 states, T = 1,000) to (10 states, T = 10,000)
