CS b553: Algorithms for Optimization and Learning

Temporal sequences: Hidden Markov Models and Dynamic Bayesian Networks

Motivation • Observing a stream of data • Monitoring (of people, computer systems, etc) • Surveillance, tracking • Finance & economics • Science • Questions: • Modeling & forecasting • Unobserved variables

Time Series Modeling • Time occurs in steps $t=0,1,2,\ldots$ • Time step can be seconds, days, years, etc. • State variable $X_t$, $t=0,1,2,\ldots$ • For partially observed problems, we see observations $O_t$, $t=1,2,\ldots$, and do not see the $X$'s • The $X$'s are hidden variables (aka latent variables)

Modeling Time • Arrow of time • Causality => Bayesian networks are natural models of time series [Figure: Causes → Effects]

Probabilistic Modeling • For now, assume the fully observable case • Which parents should each $X_t$ have? [Figure: nodes X0, X1, X2, X3]

Markov Assumption • Assume $X_{t+k}$ is independent of all $X_i$ for $i<t$, given $X_t,\ldots,X_{t+k-1}$: $P(X_{t+k} \mid X_0,\ldots,X_{t+k-1}) = P(X_{t+k} \mid X_t,\ldots,X_{t+k-1})$ • $k$-th order Markov chain [Figure: chains over X0, X1, X2, X3 of order 0, order 1, order 2, and order 3]

1st order Markov Chain • MCs of order $k>1$ can be converted into a 1st order MC on the variable $Y_t = \{X_t,\ldots,X_{t+k-1}\}$ • So w.l.o.g., “MC” refers to a 1st order MC [Figure: chain X0…X3 grouped into overlapping variables Y0…Y3]
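The conversion above can be sketched for $k=2$. This is a minimal illustration, not the lecture's code: the binary state space, the table `p2`, and the function name `lift_to_first_order` are all assumptions made for the example. The new state is the pair $Y_t = (X_t, X_{t+1})$, and a transition is only possible when consecutive pairs overlap consistently.

```python
from itertools import product

def lift_to_first_order(p2, states):
    """Build a first-order transition table over pairs Y_t = (X_t, X_{t+1}).

    p2[(a, b)][c] is a hypothetical second-order model
    P(X_{t+2} = c | X_t = a, X_{t+1} = b).
    """
    pairs = list(product(states, repeat=2))
    trans = {}
    for (a, b) in pairs:
        row = {}
        for (b2, c) in pairs:
            # Y_{t+1} = (X_{t+1}, X_{t+2}) must start with the same X_{t+1} = b.
            row[(b2, c)] = p2[(a, b)][c] if b2 == b else 0.0
        trans[(a, b)] = row
    return trans

# Illustrative numbers only (not from the slides).
states = [0, 1]
p2 = {
    (0, 0): {0: 0.9, 1: 0.1},
    (0, 1): {0: 0.5, 1: 0.5},
    (1, 0): {0: 0.4, 1: 0.6},
    (1, 1): {0: 0.2, 1: 0.8},
}
first_order = lift_to_first_order(p2, states)
```

Each row of `first_order` is a distribution over the next pair, so the lifted chain is an ordinary 1st order MC with $|Val(X)|^2$ states.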

Inference in MC • What independence relationships can we read from the BN? • Observing X1 makes X0 independent of X2, X3, … • $P(X_t \mid X_{t-1})$ is known as the transition model [Figure: chain X0 → X1 → X2 → X3 with X1 observed]

Inference in MC • Prediction: what is the probability of a future state? • $P(X_t) = \sum_{x_0,\ldots,x_{t-1}} P(X_0,\ldots,X_t) = \sum_{x_0,\ldots,x_{t-1}} P(X_0) \prod_{i=1}^{t} P(X_i \mid X_{i-1}) = \sum_{x_{t-1}} P(X_t \mid X_{t-1}) P(X_{t-1})$ • Approach: maintain a belief state $b_t(X)=P(X_t)$ and use the equation above to advance to $b_{t+1}(X)$ • Equivalent to the VE algorithm in sequential order [Recursive approach]
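The recursive belief update can be sketched in a few lines. This is an illustrative sketch under assumptions: the two-state matrix `T` and the helper names `advance` and `predict` are made up for the example, and `T[i][j]` is taken to mean $P(X_t=i \mid X_{t-1}=j)$, so each column sums to 1.

```python
def advance(belief, T):
    """One step of b_{t+1}[i] = sum_j T[i][j] * b_t[j]."""
    n = len(belief)
    return [sum(T[i][j] * belief[j] for j in range(n)) for i in range(n)]

def predict(b0, T, steps):
    """Advance the belief state `steps` times, starting from b0."""
    b = list(b0)
    for _ in range(steps):
        b = advance(b, T)
    return b

# Illustrative two-state chain (numbers are assumptions, not from the slides).
T = [[0.9, 0.5],
     [0.1, 0.5]]
b0 = [1.0, 0.0]          # start in state 0 with certainty
b1 = predict(b0, T, 1)   # one step ahead
b2 = predict(b0, T, 2)   # two steps ahead
```

Only the current belief vector is stored at any time, which is what makes the sequential elimination order cheap.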

Belief state evolution • $P(X_t) = \sum_{x_{t-1}} P(X_t \mid X_{t-1}) P(X_{t-1})$ • “Blurs” over time, and (typically) approaches a stationary distribution as $t$ grows • Limited prediction power • Rate of blurring known as mixing time

Stationary distributions • For discrete variables $Val(X)=\{1,\ldots,n\}$: • Transition matrix $T_{ij} = P(X_t=i \mid X_{t-1}=j)$ • Belief $b_t(X)$ is just a vector with $b_{t,i}=P(X_t=i)$ • Belief update equation: $b_{t+1} = T b_t$ • A stationary distribution $b$ is one in which $b = Tb$ • => $b$ is an eigenvector of $T$ with eigenvalue 1 • => $b$ is in the null space of $(T-I)$
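One simple way to find a stationary distribution numerically is to repeat the belief update until it stops changing (power iteration). The sketch below assumes a chain that actually mixes; for periodic or reducible chains this iteration need not converge, and solving $(T-I)b=0$ directly would be the alternative. The matrix and the function name `stationary` are illustrative assumptions.

```python
def stationary(T, iters=200):
    """Approximate b = T b by repeatedly applying b <- T b from a uniform start."""
    n = len(T)
    b = [1.0 / n] * n
    for _ in range(iters):
        b = [sum(T[i][j] * b[j] for j in range(n)) for i in range(n)]
    return b

# Illustrative column-stochastic matrix, T[i][j] = P(X_t = i | X_{t-1} = j).
T = [[0.9, 0.5],
     [0.1, 0.5]]
b = stationary(T)

# Residual of the fixed-point equation b = T b.
residual = [sum(T[i][j] * b[j] for j in range(2)) - b[i] for i in range(2)]
```

For this matrix the fixed point can be checked by hand: $0.1\,b_1 = 0.5\,b_2$ gives $b = (5/6, 1/6)$.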

History Dependence • In Markov models, the state must be chosen so that the future is independent of history given the current state • Often this requires adding variables that cannot be directly observed • Example: are these people walking toward you or away from you? [Figure: image of people walking] • Example: what comes next after “the bare”? Candidates shown: “minimum essentials”, “market”, “wipes himself with the rabbit”

Partial Observability • Hidden Markov Model (HMM) • Hidden state variables: X0, X1, X2, X3 • Observed variables: O1, O2, O3 • $P(O_t \mid X_t)$ is called the observation model (or sensor model)

Inference in HMMs • Filtering • Prediction • Smoothing, aka hindsight • Most likely explanation [Figure: hidden states X0, X1, X2, X3 with observations O1, O2, O3]

Inference in HMMs: Filtering [Figure: states X0, X1, X2 with observations O1, O2; the query variable is marked]

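Predict-Update interpretation • Given old belief state $b_{t-1}(X)$ • Predict: first compute the MC update $b_t'(X_t)=P(X_t \mid o_{1:t-1}) = \sum_x b_{t-1}(x) P(X_t \mid X_{t-1}=x)$ • Update: re-weight to account for observation probabilities and renormalize: $b_t(x) = \alpha\, b_t'(x) P(o_t \mid X_t=x)$, where $\alpha$ is the constant that makes $b_t$ sum to 1 [Figure: states X0, X1, X2 with observations O1, O2; the query variable is marked]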
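The predict and update steps can be sketched as two small functions. This is a minimal sketch under assumptions: the two-state model, the observation labels, and the names `predict_step`, `update_step`, and `filter_hmm` are illustrative and not from the slides. `T[i][j]` is $P(X_t=i \mid X_{t-1}=j)$ and `O[o][x]` is $P(O_t=o \mid X_t=x)$; the normalization happens at the end of the update step.

```python
def predict_step(b_prev, T):
    """b'_t[i] = sum_j T[i][j] * b_{t-1}[j]."""
    n = len(b_prev)
    return [sum(T[i][j] * b_prev[j] for j in range(n)) for i in range(n)]

def update_step(b_pred, obs_likelihood):
    """Re-weight by P(o_t | X_t = x) and renormalize so the belief sums to 1."""
    unnorm = [p * l for p, l in zip(b_pred, obs_likelihood)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def filter_hmm(b0, T, O, observations):
    """Return P(X_t | o_{1:t}) after processing all observations in order."""
    b = list(b0)
    for o in observations:
        b = update_step(predict_step(b, T), O[o])
    return b

# Illustrative two-state model (state 0 = "rain", state 1 = "no rain").
T = [[0.7, 0.3],
     [0.3, 0.7]]
O = {"umbrella": [0.9, 0.2], "none": [0.1, 0.8]}
b0 = [0.5, 0.5]
after_one = filter_hmm(b0, T, O, ["umbrella"])
after_two = filter_hmm(b0, T, O, ["umbrella", "umbrella"])
```

With these numbers, a second consistent observation pushes the belief in state 0 further up than the first one did, since the prediction step carries over most of the earlier evidence.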

Inference in HMMs: Prediction [Figure: states X0, X1, X2, X3 with observations O1, O2, O3; the query variable is marked]

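Prediction • $P(X_{t+k} \mid o_{1:t})$ • 2 steps: $P(X_t \mid o_{1:t})$, then $P(X_{t+k} \mid X_t)$ • Filter to time $t$, then predict as with a standard MC [Figure: states X0, X1, X2, X3 with observations O1, O2, O3; the query variable is marked]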

Inference in HMMs: Smoothing, aka hindsight [Figure: states X0, X1, X2, X3 with observations O1, O2, O3; the query variable is marked]

Interpretation • Filtering/prediction: • Equivalent to forward variable elimination / belief propagation • Smoothing: • Equivalent to forward VE/BP up to query variable, then backward VE/BP from last observation back to query variable • Running BP to completion gives the smoothed estimates for all variables (forward-backward algorithm)
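The forward-backward idea can be sketched as a forward filtering pass followed by a backward pass that folds in later observations. This is an illustrative sketch, not the lecture's implementation: the model numbers and function names are assumptions, `T[i][j]` is $P(X_t=i \mid X_{t-1}=j)$, and `O[o][x]` is $P(O_t=o \mid X_t=x)$. The backward messages are left unnormalized, which is fine for short sequences but would underflow on long ones.

```python
def normalize(v):
    z = sum(v)
    return [x / z for x in v]

def forward_backward(b0, T, O, observations):
    """Return smoothed beliefs P(X_t | o_{1:N}) for t = 0..N."""
    n = len(b0)
    # Forward pass: f[t] = P(X_t | o_{1:t}), with f[0] the prior.
    f = [list(b0)]
    for o in observations:
        pred = [sum(T[i][j] * f[-1][j] for j in range(n)) for i in range(n)]
        f.append(normalize([pred[i] * O[o][i] for i in range(n)]))
    # Backward pass: beta[j] is proportional to P(o_{t+1:N} | X_t = j).
    beta = [1.0] * n
    smoothed = [None] * len(f)
    for t in range(len(f) - 1, -1, -1):
        smoothed[t] = normalize([f[t][i] * beta[i] for i in range(n)])
        if t > 0:
            o = observations[t - 1]  # observation made at time t
            beta = [sum(O[o][i] * beta[i] * T[i][j] for i in range(n))
                    for j in range(n)]
    return smoothed

# Illustrative two-state model (numbers are assumptions).
T = [[0.7, 0.3],
     [0.3, 0.7]]
O = {"umbrella": [0.9, 0.2], "none": [0.1, 0.8]}
smoothed = forward_backward([0.5, 0.5], T, O, ["umbrella", "umbrella"])
```

The last smoothed belief equals the filtered belief, since there are no later observations to fold in; earlier beliefs shift once hindsight is available.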

Inference in HMMs: Most likely explanation • Subject of next lecture • Query returns a path through state space $x_0,\ldots,x_3$ [Figure: states X0, X1, X2, X3 with observations O1, O2, O3]

Applications of HMMs in NLP • Speech recognition • Hidden phones (e.g., ah eh ee th r) • Observed, noisy acoustic features (produced by signal processing)

Phone Observation Models • Model defined to be robust over variations in accent, speed, pitch, noise [Figure: $\mathrm{Phone}_t$ → $\mathrm{Features}_t$; signal processing produces a feature vector, e.g. (24,13,3,59)]

Phone Transition Models • Good models will capture (among other things): • Pronunciation of words • Subphone structure • Coarticulation effects • Triphone models = order 3 Markov chain [Figure: $\mathrm{Phone}_t$ → $\mathrm{Phone}_{t+1}$, with $\mathrm{Phone}_t$ → $\mathrm{Features}_t$]

Word Segmentation • Words run together when pronounced • Unigrams $P(w_i)$ • Bigrams $P(w_i \mid w_{i-1})$ • Trigrams $P(w_i \mid w_{i-1}, w_{i-2})$ • Random 20-word samples from R&N using N-gram models: • Unigram: “Logical are as confusion a may right tries agent goal the was diesel more object then information-gathering search is” • Bigram: “Planning purely diagnostic expert systems are very similar computational approach would be represented compactly using tic tac toe a predicate” • Trigram: “Planning and scheduling are integrated the success of naïve bayes model is just a possible prior source by that time”
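A bigram model $P(w_i \mid w_{i-1})$ can be estimated from counts. The sketch below uses plain maximum-likelihood estimates with no smoothing, which is an assumption of this example (the slides do not say how the models were trained), and the toy corpus and the name `train_bigram` are made up for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Estimate P(cur | prev) as count(prev, cur) / count(prev, anything)."""
    counts = defaultdict(Counter)
    for prev, cur in zip(tokens, tokens[1:]):
        counts[prev][cur] += 1
    return {
        prev: {w: c / sum(ctr.values()) for w, c in ctr.items()}
        for prev, ctr in counts.items()
    }

# Toy corpus, purely illustrative.
tokens = "the agent sees the goal and the agent moves".split()
model = train_bigram(tokens)
p_agent_given_the = model["the"]["agent"]
p_goal_given_the = model["the"]["goal"]
```

In this corpus "the" is followed by "agent" twice and "goal" once, so the estimates are 2/3 and 1/3. A word pair that never occurs gets probability zero, which is why practical systems add some form of smoothing.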

What about models with many variables? • Say X has $n$ binary variables, O has $m$ binary variables • Naively, a distribution over $X_t$ may be intractable to represent ($2^n$ entries) • Transition models $P(X_t \mid X_{t-1})$ require $2^{2n}$ entries • Observation models $P(O_t \mid X_t)$ require $2^{n+m}$ entries • Is there a better way?
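To get a feel for these counts, the short sketch below evaluates them for one choice of sizes. The values `n = 20` and `m = 5` and the function name `table_sizes` are arbitrary assumptions for illustration.

```python
def table_sizes(n, m):
    """Number of table entries for n binary state and m binary observation variables."""
    return {
        "belief": 2 ** n,              # distribution over X_t
        "transition": 2 ** (2 * n),    # P(X_t | X_{t-1})
        "observation": 2 ** (n + m),   # P(O_t | X_t)
    }

sizes = table_sizes(20, 5)
```

With only 20 binary state variables the belief table already has about a million entries, and the full transition table has about $10^{12}$.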

Example: Failure detection • Consider a battery meter sensor • Battery = true level of battery • BMeter = sensor reading • Transient failures: send garbage at time t • Persistent failures: send garbage forever

Example: Failure detection • Consider a battery meter sensor • Battery = true level of battery • BMeter = sensor reading • Transient failures: send garbage at time t • 5555500555… • Persistent failures: sensor is broken • 5555500000…

Dynamic Bayesian Network • Template model relates variables on the prior time step to the next time step (2-TBN) • “Unrolling” the template for all $t$ gives the ground Bayesian network • $\mathrm{BMeter}_t \sim N(\mathrm{Battery}_t, \sigma)$ [Figure: 2-TBN with nodes $\mathrm{Battery}_{t-1}$, $\mathrm{Battery}_t$, $\mathrm{BMeter}_t$]

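Dynamic Bayesian Network • $\mathrm{BMeter}_t \sim N(\mathrm{Battery}_t, \sigma)$ • Transient failure model: $P(\mathrm{BMeter}_t=0 \mid \mathrm{Battery}_t=5) = 0.03$ [Figure: 2-TBN with nodes $\mathrm{Battery}_{t-1}$, $\mathrm{Battery}_t$, $\mathrm{BMeter}_t$]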
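The effect of a transient failure model on filtering can be sketched with a heavily simplified, fully discrete version of the battery example. Everything beyond the 0.03 failure probability is an assumption made for this sketch: battery levels are the integers 0 to 5, the battery drains by one level with probability 0.1 per step, and the Gaussian sensor noise is dropped, so the meter either reads the true level or reads 0.

```python
LEVELS = range(6)  # assumed discrete battery levels 0..5

def transition(prev, cur, drain=0.1):
    """Assumed dynamics: stay at the same level, or drop one level w.p. `drain`."""
    if prev == 0:
        return 1.0 if cur == 0 else 0.0
    if cur == prev:
        return 1.0 - drain
    if cur == prev - 1:
        return drain
    return 0.0

def sensor(reading, level, p_fail=0.03):
    """Transient failure model: the meter reads 0 w.p. p_fail, else the true level."""
    if level == 0:
        return 1.0 if reading == 0 else 0.0
    if reading == level:
        return 1.0 - p_fail
    if reading == 0:
        return p_fail
    return 0.0

def filter_battery(readings, start_level=5):
    """Return E(Battery_t | readings so far) after each reading."""
    b = [1.0 if l == start_level else 0.0 for l in LEVELS]
    expectations = []
    for r in readings:
        pred = [sum(transition(p, c) * b[p] for p in LEVELS) for c in LEVELS]
        post = [sensor(r, c) * pred[c] for c in LEVELS]
        z = sum(post)
        b = [x / z for x in post]
        expectations.append(sum(l * b[l] for l in LEVELS))
    return expectations

readings = [5, 5, 5, 5, 5, 0, 0, 5, 5, 5]
expected = filter_battery(readings)
```

Under these assumptions the two zero readings are explained as sensor failures, so the expected battery level dips only slightly and returns to 5 once the meter recovers.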

Results on Transient Failure • Meter reads 55555005555… [Plot: $E(\mathrm{Battery}_t)$ over time, with and without the transient failure model; the time at which the transient failure occurs is marked]

Results on Persistent Failure • Meter reads 5555500000… [Plot: $E(\mathrm{Battery}_t)$ over time with the transient model; the time at which the persistent failure occurs is marked]

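Persistent Failure Model • $\mathrm{BMeter}_t \sim N(\mathrm{Battery}_t, \sigma)$ • $P(\mathrm{BMeter}_t=0 \mid \mathrm{Battery}_t=5) = 0.03$ • $P(\mathrm{BMeter}_t=0 \mid \mathrm{Broken}_t) = 1$ [Figure: 2-TBN with nodes $\mathrm{Broken}_{t-1}$, $\mathrm{Broken}_t$, $\mathrm{Battery}_{t-1}$, $\mathrm{Battery}_t$, $\mathrm{BMeter}_t$]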

Results on Persistent Failure • Meter reads 5555500000… [Plot: $E(\mathrm{Battery}_t)$ over time, with the persistent failure model and with the transient model; the time at which the persistent failure occurs is marked]

How to perform inference on DBN? • Exact inference on “unrolled” BN • E.g. Variable Elimination • Typical order: eliminate sequential time steps so that the network isn’t actually constructed • Unrolling is done only implicitly [Figure: unrolled network with Br0…Br4 (Broken), Ba0…Ba4 (Battery), and BM1…BM4 (BMeter)]

Entanglement Problem • After $n$ time steps, all $n$ variables in the belief state become dependent! • Unless the 2-TBN can be partitioned into disjoint subsets (rare) • The sparsity structure is lost

Approximate inference in DBNs • Limited history updates • Assumed factorization of belief state • Particle filtering

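Independent Factorization • Idea: assume belief state $P(X_t)$ factors across individual attributes: $P(X_t) = P(X_{1,t}) \cdots P(X_{n,t})$ • Filtering: only maintain factored distributions $P(X_{1,t} \mid O_{1:t}),\ldots,P(X_{n,t} \mid O_{1:t})$ • Filtering update: $P(X_{k,t} \mid O_{1:t}) = \sum_{x_{t-1}} P(X_{k,t} \mid O_t, X_{t-1}) P(X_{t-1} \mid O_{1:t-1})$ = marginal probability query over 2-TBN [Figure: 2-TBN with $X_{1,t-1},\ldots,X_{n,t-1}$, $X_{1,t},\ldots,X_{n,t}$, and observations $O_{1,t},\ldots,O_{m,t}$]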
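What the factored assumption keeps and what it discards can be seen on a tiny example. The sketch below projects a joint belief over two binary attributes onto the product of its marginals; the joint table and the function names are assumptions made for illustration, and this shows only the projection step, not the full filtering update.

```python
from itertools import product

def marginals(joint, n_vars):
    """Return P(X_k = 1) for each attribute k, given a joint over 0/1 tuples."""
    return [sum(p for x, p in joint.items() if x[k] == 1) for k in range(n_vars)]

def product_of_marginals(margs):
    """Rebuild a joint that treats all attributes as independent."""
    approx = {}
    for x in product([0, 1], repeat=len(margs)):
        p = 1.0
        for k, v in enumerate(x):
            p *= margs[k] if v == 1 else 1.0 - margs[k]
        approx[x] = p
    return approx

# Illustrative joint belief in which the two attributes are strongly correlated.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
margs = marginals(joint, 2)
approx = product_of_marginals(margs)
```

The factored belief stores $n$ numbers instead of $2^n$, but here it assigns 0.25 to every joint state, so the correlation between the attributes is gone. That loss is the price of the approximation.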

Next time • Viterbi algorithm • Read K&F 13.2 for some context • Kalman and particle filtering • Read K&F 15.3-4
