Fast State Discovery for HMM Model Selection and Learning
Sajid M. Siddiqi, Geoffrey J. Gordon, Andrew W. Moore (CMU)
Consider a sequence of real-valued observations Ot over time t (speech, sensor readings, stock prices …). We can model it purely based on contextual properties. However, we would miss important temporal structure.
Current efficient approaches learn the wrong model. Our method successfully discovers the overlapping states.
Our goal: efficiently discover states in sequential data while learning a Hidden Markov Model.
Definitions and Notation
An HMM is λ = {A, B, π}, where
• A : N×N transition matrix
• B : observation model {μs, σs} for each of the N states
• π : N×1 prior probability vector
T : length of the observation sequence O1,…,OT
qt : the state the HMM is in at time t, qt ∈ {s1,…,sN}
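A minimal sketch of this parameterization (not from the slides), assuming 1-D Gaussian observation models; the class and field names are illustrative:

```python
import numpy as np

class GaussianHMM:
    """Illustrative container for lambda = {A, B, pi}; names are assumptions."""
    def __init__(self, n_states):
        self.N = n_states
        # A: N x N transition matrix (initialized uniform)
        self.A = np.full((n_states, n_states), 1.0 / n_states)
        # B: per-state Gaussian observation model (mean, variance)
        self.mu = np.zeros(n_states)
        self.var = np.ones(n_states)
        # pi: N x 1 prior probability vector
        self.pi = np.full(n_states, 1.0 / n_states)
```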
Previous Approaches
• Multi-restart Baum-Welch (for each candidate N) is inefficient and highly prone to local minima
We propose Simultaneous Temporal and Contextual Splitting (STACS), a top-down approach that is much better at state discovery while being at least as efficient, and a variant, V-STACS, that is much faster.
Bayesian Information Criterion (BIC) for Model Selection
• We would like to compute the posterior probability over model size for model selection:
  P(model size | data) ∝ P(data | model size) P(model size)
  log P(model size | data) = log P(data | model size) + log P(model size) + const.
• BIC assumes a prior that penalizes complexity (favors smaller models):
  log P(model size | data) ≈ log P(data | model size, λMLE) - (#FP/2) log T
  where #FP = number of free parameters, T = length of the data sequence, and λMLE is the ML parameter estimate
• BIC is an asymptotic approximation to the true posterior
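As a quick illustration of the penalized score above (a sketch, not the authors' code):

```python
import numpy as np

def bic_score(log_likelihood, n_free_params, T):
    """BIC: log P(data | model size, MLE) - (#FP / 2) * log T."""
    return log_likelihood - 0.5 * n_free_params * np.log(T)

# For an N-state HMM with 1-D Gaussian outputs, #FP is roughly
# N*(N-1) transition + (N-1) prior + 2*N observation parameters.
```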
Algorithm Summary (STACS / V-STACS)
• Initialize an n0-state HMM randomly
• for n = n0 … Nmax
  • Learn model parameters
  • for i = 1 … n
    • Split state i, optimize by constrained EM (STACS) or constrained Viterbi training (V-STACS)
    • Calculate the approximate BIC score of the split model
  • Choose the best split based on approximate BIC
  • Compare to the original model with exact BIC (STACS) or approximate BIC (V-STACS)
  • if the larger model is not chosen, stop
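A compact sketch of this outer loop; the callables passed in (init_hmm, learn_params, split_and_optimize, approx_bic, exact_bic) are hypothetical stand-ins for the EM, constrained-split, and scoring steps detailed on the following slides:

```python
def stacs(observations, n0, n_max,
          init_hmm, learn_params, split_and_optimize,
          approx_bic, exact_bic, viterbi_variant=False):
    """Sketch of the STACS / V-STACS outer loop; all helpers are hypothetical."""
    model = init_hmm(n0)                               # random n0-state initialization
    for n in range(n0, n_max + 1):
        model = learn_params(model, observations)      # full EM (or Viterbi training)
        candidates = [split_and_optimize(model, s, observations,
                                         hard_updates=viterbi_variant)
                      for s in range(n)]               # one constrained split per state
        best = max(candidates, key=lambda c: approx_bic(c, observations))
        score = approx_bic if viterbi_variant else exact_bic
        if score(best, observations) <= score(model, observations):
            return model                               # larger model not chosen: stop
        model = best
    return model
```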
STACS
input: n0, data sequence O = {O1,…,OT}
output: HMM of appropriate size
λ ← n0-state initial HMM
repeat
  optimize λ over sequence O
  choose a subset of states
  for each chosen state s: design a candidate model λs:
    choose a relevant subset of sequence O
    split state s, optimize λs over the subset
    score λs
  end for
  if maxs score(λs) > score(λ)
    λ ← best-scoring candidate from {λs}
  else
    terminate, return current λ
  end if
end repeat
• Learn parameters using EM, calculate the Viterbi path Q*
• Consider splits on all states (e.g. for state s2)
• Choose a subset D = {Ot : Q*(t) = s2} (see the sketch below)
• Note that |D| = O(T/N)
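A sketch of this subset selection, assuming `viterbi_path` is an integer array of length T holding the Viterbi state assignments:

```python
import numpy as np

def split_subset(observations, viterbi_path, s):
    """Timesteps that Q* assigns to state s; |D| is about T/N on average."""
    mask = (viterbi_path == s)
    return observations[mask], mask
```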
STACS
• Split the state (e.g. s2 into offspring s2 and s3)
• Constrain λs to equal λ except for the offspring states' observation densities and all their transition probabilities, both in and out
• Learn the free parameters using two-state EM over D. This optimizes the partially observed likelihood P(O, Q*\D | λs)
• Update Q* over D to get R*
STACS
Scoring is of two types:
• The candidates are compared to each other according to their Viterbi path likelihoods
• The best candidate in this ranking is compared to the un-split model using BIC, i.e.
  log P(model | data) ≈ log P(data | model) - complexity penalty
Viterbi STACS (V-STACS)
• Recall that STACS learns the free parameters using two-state EM over D. However, EM also has "winner-take-all" variants
• V-STACS uses two-state Viterbi training over D to learn the free parameters, using hard updates instead of STACS' soft updates (see the sketch below)
• In V-STACS, the Viterbi path likelihood is used to approximate the BIC when comparing against the un-split model
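A toy contrast between the two update rules, shown for the offspring states' means (illustrative only; `resp` is assumed to hold the two-column posterior responsibilities over the subset D):

```python
import numpy as np

def soft_mean_update(D, resp):
    """STACS-style EM update: responsibility-weighted means."""
    return (resp * D[:, None]).sum(axis=0) / resp.sum(axis=0)

def hard_mean_update(D, resp):
    """V-STACS-style Viterbi-training update: winner-take-all assignment
    (assumes each offspring state claims at least one timestep)."""
    assign = resp.argmax(axis=1)
    return np.array([D[assign == k].mean() for k in (0, 1)])
```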
Time Complexity
• Optimizing N candidates takes
  • N × O(T) time for STACS
  • N × O(T/N) time for V-STACS
• Scoring N candidates takes N × O(T) time
• Candidate search and scoring is O(TN)
• Best-candidate evaluation is
  • O(TN²) for BIC in STACS
  • O(TN) for approximate BIC in V-STACS
Other Methods
• Li-Biswas
  • Generates two candidates:
    • splits the state with the highest variance
    • merges the pair of closest states (rarely chosen)
  • Optimizes all candidate parameters over the entire sequence
• ML-SSS
  • Generates 2N candidates, splitting each state in two ways
  • Contextual split: optimizes the offspring states' observation densities with 2-Gaussian mixture EM, assumes the offspring are connected "in parallel"
  • Temporal split: optimizes the offspring states' observation densities, self-transitions and mutual transitions with EM, assumes the offspring are connected "in series"
  • Optimizes the split of state s over all timesteps with nonzero posterior probability of being in state s [i.e., O(T) data points]
Data sets
• Australian Sign-Language data collected from two 5DT instrumented gloves and Ascension Flock-of-Birds trackers [Kadous 2002] (available in the UCI KDD Archive)
• Other data sets obtained from the literature: Robot, MoCap, MLog, Vowel
Learning HMMs of Predetermined Size: Scalability. Robot data (others similar)
Learning HMMs of Predetermined Size: Log-Likelihood Learning a 40-state HMM on Robot data (others similar)
Learning HMMs of Predetermined Size Learning 40-state HMMs
Model Selection: Synthetic Data • Generalize (4 states, T = 1000) to (10 states, T = 10,000)