Fast State Discovery for HMM Model Selection and Learning
Sajid M. Siddiqi, Geoffrey J. Gordon, Andrew W. Moore (CMU)
Consider a sequence of real-valued observations Ot over time t (speech, sensor readings, stock prices …). We can model it purely based on contextual properties. However, we would miss important temporal structure.
Current efficient approaches learn the wrong model. Our method successfully discovers the overlapping states.
Our goal: efficiently discover states in sequential data while learning a Hidden Markov Model.
Definitions and Notation
An HMM is λ = {A, B, π}, where
• A : N×N transition matrix
• B : observation model {μs, σs} for each of the N states
• π : N×1 prior probability vector
T : length of the observation sequence O1,…,OT
qt : the state the HMM is in at time t, qt ∈ {s1,…,sN}
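A minimal sketch of this parameterization (not from the slides), assuming 1-D Gaussian observation models; the class and field names are illustrative:

```python
import numpy as np

class GaussianHMM:
    """Illustrative container for lambda = {A, B, pi}; names are assumptions."""
    def __init__(self, n_states):
        self.N = n_states
        # A: N x N transition matrix (initialized uniform)
        self.A = np.full((n_states, n_states), 1.0 / n_states)
        # B: per-state Gaussian observation model (mean, variance)
        self.mu = np.zeros(n_states)
        self.var = np.ones(n_states)
        # pi: N x 1 prior probability vector
        self.pi = np.full(n_states, 1.0 / n_states)
```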
Previous Approaches
• Multi-restart Baum-Welch (for each candidate N) is inefficient and highly prone to local minima
We propose Simultaneous Temporal and Contextual Splitting (STACS), a top-down approach that is much better at state discovery while being at least as efficient, and a variant, V-STACS, that is much faster.
Bayesian Information Criterion (BIC) for Model Selection
• We would like to compute the posterior probability over model size for model selection:
  P(model size | data) ∝ P(data | model size) P(model size)
  log P(model size | data) = log P(data | model size) + log P(model size) + const.
• BIC assumes a prior that penalizes complexity (favors smaller models):
  log P(model size | data) ≈ log P(data | model size, λMLE) - (#FP/2) log T
  where #FP = number of free parameters, T = length of the data sequence, and λMLE is the ML parameter estimate
• BIC is an asymptotic approximation to the true posterior
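As a quick illustration of the penalized score above (a sketch, not the authors' code):

```python
import numpy as np

def bic_score(log_likelihood, n_free_params, T):
    """BIC: log P(data | model size, MLE) - (#FP / 2) * log T."""
    return log_likelihood - 0.5 * n_free_params * np.log(T)

# For an N-state HMM with 1-D Gaussian outputs, #FP is roughly
# N*(N-1) transition + (N-1) prior + 2*N observation parameters.
```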
Algorithm Summary (STACS / V-STACS)
• Initialize an n0-state HMM randomly
• for n = n0 … Nmax
  • Learn model parameters
  • for i = 1 … n
    • Split state i, optimize by constrained EM (STACS) or constrained Viterbi training (V-STACS)
    • Calculate the approximate BIC score of the split model
  • Choose the best split based on approximate BIC
  • Compare to the original model with exact BIC (STACS) or approximate BIC (V-STACS)
  • if the larger model is not chosen, stop
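A compact sketch of this outer loop; the callables passed in (init_hmm, learn_params, split_and_optimize, approx_bic, exact_bic) are hypothetical stand-ins for the EM, constrained-split, and scoring steps detailed on the following slides:

```python
def stacs(observations, n0, n_max,
          init_hmm, learn_params, split_and_optimize,
          approx_bic, exact_bic, viterbi_variant=False):
    """Sketch of the STACS / V-STACS outer loop; all helpers are hypothetical."""
    model = init_hmm(n0)                               # random n0-state initialization
    for n in range(n0, n_max + 1):
        model = learn_params(model, observations)      # full EM (or Viterbi training)
        candidates = [split_and_optimize(model, s, observations,
                                         hard_updates=viterbi_variant)
                      for s in range(n)]               # one constrained split per state
        best = max(candidates, key=lambda c: approx_bic(c, observations))
        score = approx_bic if viterbi_variant else exact_bic
        if score(best, observations) <= score(model, observations):
            return model                               # larger model not chosen: stop
        model = best
    return model
```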
STACS
input: n0, data sequence O = {O1,…,OT}
output: HMM of appropriate size
λ ← n0-state initial HMM
repeat
  optimize λ over sequence O
  choose a subset of states
  for each chosen state s: design a candidate model λs:
    choose a relevant subset of sequence O
    split state s, optimize λs over the subset
    score λs
  end for
  if maxs score(λs) > score(λ)
    λ ← best-scoring candidate from {λs}
  else
    terminate, return current λ
  end if
end repeat
• Learn parameters using EM, calculate the Viterbi path Q*
• Consider splits on all states (e.g. for state s2)
• Choose a subset D = {Ot : Q*(t) = s2} (see the sketch below)
• Note that |D| = O(T/N)
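A sketch of this subset selection, assuming `viterbi_path` is an integer array of length T holding the Viterbi state assignments:

```python
import numpy as np

def split_subset(observations, viterbi_path, s):
    """Timesteps that Q* assigns to state s; |D| is about T/N on average."""
    mask = (viterbi_path == s)
    return observations[mask], mask
```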
STACS
• Split the state (e.g. s2 into offspring s2 and s3)
• Constrain λs to equal λ except for the offspring states' observation densities and all their transition probabilities, both in and out
• Learn the free parameters using two-state EM over D. This optimizes the partially observed likelihood P(O, Q*\D | λs)
• Update Q* over D to get R*
STACS
Scoring is of two types:
• The candidates are compared to each other according to their Viterbi path likelihoods
• The best candidate in this ranking is compared to the un-split model using BIC, i.e.
  log P(model | data) ≈ log P(data | model) - complexity penalty
Viterbi STACS (V-STACS)
• Recall that STACS learns the free parameters using two-state EM over D. However, EM also has "winner-take-all" variants
• V-STACS uses two-state Viterbi training over D to learn the free parameters, using hard updates instead of STACS' soft updates (see the sketch below)
• In V-STACS, the Viterbi path likelihood is used to approximate the BIC when comparing against the un-split model
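A toy contrast between the two update rules, shown for the offspring states' means (illustrative only; `resp` is assumed to hold the two-column posterior responsibilities over the subset D):

```python
import numpy as np

def soft_mean_update(D, resp):
    """STACS-style EM update: responsibility-weighted means."""
    return (resp * D[:, None]).sum(axis=0) / resp.sum(axis=0)

def hard_mean_update(D, resp):
    """V-STACS-style Viterbi-training update: winner-take-all assignment
    (assumes each offspring state claims at least one timestep)."""
    assign = resp.argmax(axis=1)
    return np.array([D[assign == k].mean() for k in (0, 1)])
```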
Time Complexity
• Optimizing N candidates takes
  • N × O(T) time for STACS
  • N × O(T/N) time for V-STACS
• Scoring N candidates takes N × O(T) time
• Candidate search and scoring is O(TN)
• Best-candidate evaluation is
  • O(TN²) for BIC in STACS
  • O(TN) for approximate BIC in V-STACS
Other Methods
• Li-Biswas
  • Generates two candidates:
    • splits the state with the highest variance
    • merges the pair of closest states (rarely chosen)
  • Optimizes all candidate parameters over the entire sequence
• ML-SSS
  • Generates 2N candidates, splitting each state in two ways
  • Contextual split: optimizes the offspring states' observation densities with 2-Gaussian mixture EM, assumes the offspring are connected "in parallel"
  • Temporal split: optimizes the offspring states' observation densities, self-transitions and mutual transitions with EM, assumes the offspring are connected "in series"
  • Optimizes the split of state s over all timesteps with nonzero posterior probability of being in state s [i.e., O(T) data points]
Data sets
• Australian Sign-Language data collected from two 5DT instrumented gloves and Ascension Flock-of-Birds trackers [Kadous 2002] (available in the UCI KDD Archive)
• Other data sets obtained from the literature: Robot, MoCap, MLog, Vowel
Learning HMMs of Predetermined Size: Scalability. Robot data (others similar)
Learning HMMs of Predetermined Size: Log-Likelihood Learning a 40-state HMM on Robot data (others similar)
Learning HMMs of Predetermined Size Learning 40-state HMMs
Model Selection: Synthetic Data • Generalize (4 states, T = 1000) to (10 states, T = 10,000)