HMMs Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart Russell, Andrew Moore, Percy Liang, Luke Zettlemoyer, Rob Pless, Killian Weinberger, Deva Ramanan
2048 Game • Game AI (minimax with alpha-beta pruning)
Announcements • HW2 graded • mean = 14.40, median = 16 • Grades to date after 2 HWs, 1 midterm (60% / 40%) • mean = 77.42, median = 81.36
Announcements • HW4 will be online tonight (due April 3) • Midterm2 next Thursday • Review will be Monday, 5pm in FB009
Thursday March 20, SN011, 5:30-6:30 • Information Session: BS/MS Program
Bayes Nets • Directed graph G = (X, E): nodes X, edges E • Each node is associated with a random variable
Example • Joint probability factorizes according to the graph: P(X1, …, Xn) = ∏i P(Xi | Parents(Xi))
Independence in a BN • Important question about a BN: • Are two nodes independent given certain evidence? • If yes, can prove using algebra (tedious in general) • If no, can prove with a counter example • Example: • Question: are X and Z necessarily independent? • Answer: no. Example: low pressure causes rain, which causes traffic. • X can influence Z, Z can influence X (via Y) • Addendum: they could be independent: how? X Y Z
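The pressure → rain → traffic chain above can be checked numerically. The probability tables below are made-up illustration values (they are not from the slides); only the chain structure X → Y → Z matters:

```python
# A minimal numeric check of the chain X -> Y -> Z (pressure -> rain -> traffic).
# All probabilities below are invented illustration values.

p_y_given_x = {0: {0: 0.8, 1: 0.2},   # P(Y | X): rain given pressure
               1: {0: 0.3, 1: 0.7}}
p_z_given_y = {0: {0: 0.9, 1: 0.1},   # P(Z | Y): traffic given rain
               1: {0: 0.4, 1: 0.6}}

def p_z_given_x(z, x):
    """P(Z=z | X=x) by summing out Y along the chain."""
    return sum(p_y_given_x[x][y] * p_z_given_y[y][z] for y in (0, 1))

# X and Z are dependent: conditioning on X changes the distribution of Z.
print(round(p_z_given_x(1, 0), 2))   # 0.8*0.1 + 0.2*0.6 = 0.20
print(round(p_z_given_x(1, 1), 2))   # 0.3*0.1 + 0.7*0.6 = 0.45

# Given Y, however, Z no longer depends on X: P(Z | X, Y) = P(Z | Y)
# by construction -- exactly the conditional independence the BN encodes.
```

The two printed values differ, so X does influence Z through Y; fixing Y cuts that influence off.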
Bayes Ball • Shade all observed nodes. Place balls at the starting node, let them bounce around according to some rules, and ask if any of the balls reach any of the goal nodes. • We need to know what happens when a ball arrives at a node on its way to a goal node.
Inference • Inference: calculating some useful quantity from a joint probability distribution • Examples: • Posterior probability: P(Q | E1 = e1, …, Ek = ek) • Most likely explanation: argmax_q P(Q = q | E1 = e1, …, Ek = ek) (Network shown: the burglary example, with nodes B, E, A, J, M)
Inference Algorithms • Exact algorithms • Inference by Enumeration • Variable Elimination algorithm • Approximate (sampling) algorithms • Forward sampling • Rejection sampling • Likelihood weighting sampling
Example BN – Naïve Bayes • C – class, F – features • We only specify (parameters): • the prior over class labels, P(C) • how each feature depends on the class, P(Fi | C)
Percentage of documents in training set labeled as spam/ham Slide from Dan Klein
In the documents labeled as spam, occurrence percentage of each word (e.g. # times “the” occurred/# total words). Slide from Dan Klein
In the documents labeled as ham, occurrence percentage of each word (e.g. # times “the” occurred/# total words). Slide from Dan Klein
Parameter estimation • Parameter estimate: P(word | spam) = (# of occurrences of word in spam messages) / (total # of words in spam messages) • Parameter smoothing: dealing with words that were never seen or were seen too few times • Laplacian smoothing: pretend you have seen every vocabulary word one more time (or m more times) than you actually did: P(word | spam) = (# of occurrences of word in spam messages + 1) / (total # of words in spam messages + V), where V is the total number of unique words
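A minimal sketch of the smoothed estimate, assuming a bag-of-words representation where each document is a list of tokens (the tiny corpus and vocabulary below are invented for illustration):

```python
from collections import Counter

def estimate_word_probs(spam_docs, vocab, m=1):
    """Laplace (add-m) smoothed estimates of P(word | spam).
    spam_docs: list of token lists; vocab: set of all known words."""
    counts = Counter(w for doc in spam_docs for w in doc)
    total = sum(counts.values())
    V = len(vocab)
    return {w: (counts[w] + m) / (total + m * V) for w in vocab}

# Invented toy data: 2 spam messages, 6 tokens total, vocabulary of 5 words.
spam = [["win", "money", "now"], ["money", "money", "free"]]
vocab = {"win", "money", "now", "free", "hello"}
probs = estimate_word_probs(spam, vocab)
print(round(probs["money"], 3))   # (3+1)/(6+5) = 0.364
print(round(probs["hello"], 3))   # (0+1)/(6+5) = 0.091 -- never seen, but nonzero
```

Without smoothing, "hello" would get probability 0 and zero out any message containing it; add-1 smoothing keeps every estimate strictly positive while the estimates still sum to 1 over the vocabulary.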
Classification • Assign the class that maximizes the posterior: c* = argmax_c P(c) ∏i P(wi | c)
Summary of model and parameters • Naïve Bayes model: P(spam | w1, …, wn) ∝ P(spam) ∏i P(wi | spam) • Model parameters: • prior: P(spam), P(¬spam) • likelihood of spam: P(w1 | spam), P(w2 | spam), …, P(wn | spam) • likelihood of ¬spam: P(w1 | ¬spam), P(w2 | ¬spam), …, P(wn | ¬spam)
Classification • In practice • Multiplying lots of small probabilities can result in floating point underflow • Since log(xy) = log(x) + log(y), we can sum log probabilities instead of multiplying probabilities. • Since log is a monotonic function, the class with the highest score does not change. • So, what we usually compute in practice is: c* = argmax_c [log P(c) + Σi log P(wi | c)]
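The log-space classifier can be sketched in a few lines; the priors and word likelihoods below are invented illustration values, not estimates from any real corpus:

```python
import math

def classify(doc, priors, likelihoods):
    """Return argmax_c [ log P(c) + sum_i log P(w_i | c) ]."""
    def score(c):
        return math.log(priors[c]) + sum(math.log(likelihoods[c][w]) for w in doc)
    return max(priors, key=score)

# Invented toy parameters for a two-class spam/ham model.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"win": 0.30, "money": 0.50, "meeting": 0.05},
    "ham":  {"win": 0.05, "money": 0.10, "meeting": 0.40},
}
print(classify(["win", "money"], priors, likelihoods))   # spam
print(classify(["meeting"], priors, likelihoods))        # ham
```

Summing logs gives the same argmax as multiplying the raw probabilities, but a 1000-word message produces a sum of ~1000 moderate negative numbers rather than a product that underflows to 0.0.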
Reasoning over Time • Often, we want to reason about a sequence of observations • Speech recognition • Robot localization • Medical monitoring • DNA sequence recognition • Need to introduce time into our models • Basic approach: hidden Markov models (HMMs) • More general: dynamic Bayes’ nets
Reasoning over time • Basic idea: • Agent maintains a belief state representing probabilities over possible states of the world • From the belief state and a transition model, the agent can predict how the world might evolve in the next time step • From observations and an emission model, the agent can update the belief state
Markov Models • A Markov model is a chain-structured BN: X1 → X2 → X3 → X4 → … • The value of X at a given time is called the state • As a BN: each state depends only on the previous one, through P(Xt | Xt-1) • Parameters: initial probabilities and transition probabilities (or dynamics), which specify how the state evolves over time
Conditional Independence • Basic conditional independence: • Past and future are independent given the present • Each time step only depends on the previous • This is called the (first order) Markov property • Note that the chain is just a (growing) BN • We can always use generic BN reasoning on it if we truncate the chain at a fixed length
Example: Markov Chain • Weather: • States: X = {rain, sun} • Transitions: P(sun | sun) = 0.9, P(rain | sun) = 0.1, P(rain | rain) = 0.9, P(sun | rain) = 0.1 • Initial distribution: 1.0 sun • What's the probability distribution after one step?
Mini-Forward Algorithm • Question: What's P(X) on some day t? • Forward simulation: P(xt) = Σxt-1 P(xt | xt-1) P(xt-1)
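The mini-forward update can be sketched directly, using the sun/rain transition numbers from the weather example above:

```python
def forward_step(belief, transitions):
    """One step of the mini-forward algorithm:
    P(X_t = x) = sum_{x'} P(x | x') * P(X_{t-1} = x')."""
    return {x: sum(transitions[x_prev][x] * p for x_prev, p in belief.items())
            for x in transitions}

# Weather chain from the example: stay in the same state with prob 0.9.
T = {"sun":  {"sun": 0.9, "rain": 0.1},
     "rain": {"sun": 0.1, "rain": 0.9}}

belief = {"sun": 1.0, "rain": 0.0}   # initial distribution: 1.0 sun
for t in range(1, 4):
    belief = forward_step(belief, T)
    print(t, round(belief["sun"], 3))
# After one step P(sun) = 0.9; iterating further (0.82, 0.756, ...)
# drifts toward the stationary distribution (0.5, 0.5) of this symmetric chain.
```

Running the loop longer shows the "uncertainty accumulates" point from the next slide: whatever the start state, the belief approaches the same stationary distribution.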
Example • Forward simulation from an initial observation of sun, and from an initial observation of rain: in both cases the sequence P(X1), P(X2), P(X3), … converges to the same distribution P(X∞).
Stationary Distributions • If we simulate the chain long enough: • What happens? • Uncertainty accumulates • Eventually, we have no idea what the state is! • Stationary distributions: • For most chains, the distribution we end up in is independent of the initial distribution • Called the stationary distribution of the chain • Usually, can only predict a short time out
Hidden Markov Models • Markov chains not so useful for most agents • Eventually you don't know anything anymore • Need observations to update your beliefs • Hidden Markov models (HMMs) • Underlying Markov chain over states S • You observe outputs (effects) at each time step • As a Bayes' net: a chain of hidden states X1, …, X5, each emitting an observation E1, …, E5
Example • An HMM is defined by: • Initial distribution: P(X1) • Transitions: P(Xt | Xt-1) • Emissions: P(Et | Xt)
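As a sketch, the three pieces of an HMM can be written down as plain probability tables. The rain/sun/umbrella numbers below are textbook-style illustration values, not values from the slides:

```python
# A tiny HMM spec: initial distribution, transition model, emission model.
# States and probabilities are assumed illustration values.
hmm = {
    "initial":    {"rain": 0.5, "sun": 0.5},               # P(X1)
    "transition": {"rain": {"rain": 0.7, "sun": 0.3},      # P(X_t | X_{t-1})
                   "sun":  {"rain": 0.3, "sun": 0.7}},
    "emission":   {"rain": {"umbrella": 0.9, "none": 0.1},  # P(E_t | X_t)
                   "sun":  {"umbrella": 0.2, "none": 0.8}},
}

# Sanity check: every distribution must sum to 1 over its outcomes.
for table in [hmm["initial"], *hmm["transition"].values(), *hmm["emission"].values()]:
    assert abs(sum(table.values()) - 1.0) < 1e-9
print("valid HMM spec")
```

Note the shape: one row per current/previous state, one probability per next state or observation; every inference algorithm over the HMM only ever reads these three tables.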
Conditional Independence • HMMs have two important independence properties: • Markov hidden process: the future depends on the past via the present • Current observation is independent of all else given the current state • Quiz: does this mean that observations are independent given no evidence? • [No, correlated by the hidden state]
Real HMM Examples • Speech recognition HMMs: • Observations are acoustic signals (continuous valued) • States are specific positions in specific words (so, tens of thousands) • Machine translation HMMs: • Observations are words (tens of thousands) • States are translation options • Robot tracking: • Observations are range readings (continuous) • States are positions on a map (continuous)
Filtering / Monitoring • Filtering, or monitoring, is the task of tracking the distribution B(X) (the belief state) over time • We start with B(X) in an initial setting, usually uniform • As time passes, or we get observations, we update B(X) • The Kalman filter was invented in the 1960s and first implemented as a method of trajectory estimation for the Apollo program
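A minimal sketch of one filtering step: predict with the transition model, then reweight by the emission probability of the new observation and normalize. The rain/umbrella probabilities are assumed illustration values:

```python
def filter_step(belief, evidence, transition, emission):
    """One belief-state update B(X): predict, then observe, then normalize."""
    # Predict: push the belief through the transition model.
    predicted = {x: sum(transition[xp][x] * p for xp, p in belief.items())
                 for x in transition}
    # Observe: weight each state by how well it explains the evidence.
    unnorm = {x: emission[x][evidence] * p for x, p in predicted.items()}
    z = sum(unnorm.values())
    return {x: p / z for x, p in unnorm.items()}

# Assumed illustration parameters (rain/umbrella style example).
transition = {"rain": {"rain": 0.7, "sun": 0.3},
              "sun":  {"rain": 0.3, "sun": 0.7}}
emission = {"rain": {"umbrella": 0.9, "none": 0.1},
            "sun":  {"umbrella": 0.2, "none": 0.8}}

B = {"rain": 0.5, "sun": 0.5}              # start uniform
for e in ["umbrella", "umbrella"]:          # two days of umbrella sightings
    B = filter_step(B, e, transition, emission)
print(round(B["rain"], 3))                  # 0.883: belief shifts toward rain
```

Each step is exactly the two updates from the "Reasoning over time" slide: the transition model moves the belief forward, the emission model folds in the new observation.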
Example: Robot Localization Example from Michael Pfeiffer • Sensor model: never more than 1 mistake • Motion model: may not execute action with small prob. • The slides show the belief over grid cells, shaded from probability 0 to 1, at t = 0 through t = 5: as observations accumulate, the probability mass concentrates on the cells consistent with the evidence.
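The localization example can be sketched as histogram filtering on a tiny 1-D cyclic corridor. The map, sensor accuracy, and motion-failure probability below are all assumed illustration values:

```python
# Histogram-filter localization sketch: a robot on a 5-cell cyclic corridor
# moves right one cell per step and senses whether its cell has a "door".
# Map and noise values are invented for illustration.

doors = [1, 0, 0, 1, 0]                 # map: 1 = door at that cell
n = len(doors)
belief = [1.0 / n] * n                   # t=0: uniform belief

def move_right(b, p_move=0.8):
    """Motion update: the move succeeds with prob p_move, else the robot stays."""
    return [p_move * b[(i - 1) % n] + (1 - p_move) * b[i] for i in range(n)]

def sense(b, saw_door, p_correct=0.9):
    """Sensor update: weight each cell by how well it explains the reading."""
    w = [p_correct if doors[i] == saw_door else 1 - p_correct for i in range(n)]
    unnorm = [wi * bi for wi, bi in zip(w, b)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

belief = sense(belief, saw_door=1)            # t=1: saw a door
belief = sense(move_right(belief), saw_door=0)  # t=2: moved right, saw no door
print([round(p, 2) for p in belief])          # mass concentrates on consistent cells
```

After two observations, only the cells that are one step right of a door and themselves door-free keep significant probability, which is the same sharpening shown across the t = 0 … t = 5 slides.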