Probabilistic inference • Suppose the agent has to make a decision about the value of an unobserved query variable X given some observed evidence E = e • Partially observable, stochastic, episodic environment • Examples: X = {spam, not spam}, e = email message; X = {zebra, giraffe, hippo}, e = image features • Bayes decision theory: • The agent has a loss function, which is 0 if the value of X is guessed correctly and 1 otherwise • The estimate of X that minimizes expected loss is the one that has the greatest posterior probability P(X = x | e) • This is the Maximum a Posteriori (MAP) decision
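To make the 0-1 loss argument concrete, here is a minimal Python sketch with made-up posterior values: under 0-1 loss the expected loss of guessing a value x is 1 - P(x | e), so the guess that minimizes expected loss is exactly the one with the highest posterior.

```python
# A minimal sketch (hypothetical numbers): under 0-1 loss, the expected loss of
# guessing x_hat is 1 - P(x_hat | e), so minimizing expected loss is the same
# as picking the class with the highest posterior (the MAP decision).

posterior = {"spam": 0.8, "not spam": 0.2}   # assumed example values of P(X = x | e)

expected_loss = {x: 1.0 - p for x, p in posterior.items()}
map_decision = max(posterior, key=posterior.get)
min_loss_decision = min(expected_loss, key=expected_loss.get)

print(expected_loss)                     # {'spam': 0.2, 'not spam': 0.8}
print(map_decision, min_loss_decision)   # both 'spam'
```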
MAP decision • Value x of the query variable that has the highest posterior probability given the evidence e: x_MAP = argmax_x P(x | e) = argmax_x P(e | x) P(x) / P(e) ∝ argmax_x P(e | x) P(x) (posterior ∝ likelihood × prior) • Maximum likelihood (ML) decision: x_ML = argmax_x P(e | x), i.e., the prior is ignored
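A small illustrative sketch (the prior and likelihood values below are invented, not from the slides) showing how the MAP and ML decisions can differ when the prior is skewed:

```python
# A minimal sketch with made-up numbers: MAP weighs the likelihood P(e | x) by
# the prior P(x), while ML ignores the prior. The two can disagree when the
# prior is strongly skewed.

prior      = {"zebra": 0.05, "giraffe": 0.15, "hippo": 0.80}   # assumed P(x)
likelihood = {"zebra": 0.40, "giraffe": 0.35, "hippo": 0.25}   # assumed P(e | x)

ml_decision  = max(likelihood, key=likelihood.get)
map_decision = max(prior, key=lambda x: likelihood[x] * prior[x])

print(ml_decision)   # zebra  (highest likelihood)
print(map_decision)  # hippo  (the prior outweighs the likelihood)
```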
Example: Naïve Bayes model • Suppose we have many different types of observations (symptoms, features) F1, …, Fn that we want to use to obtain evidence about an underlying hypothesis H • MAP decision involves estimating P(H | F1, …, Fn) ∝ P(F1, …, Fn | H) P(H) • If each feature can take on k values, how many entries are in the joint probability table? • We can make the simplifying assumption that the different features are conditionally independent given the hypothesis: P(F1, …, Fn | H) = P(F1 | H) · … · P(Fn | H) • If each feature can take on k values, what is the complexity of storing the resulting distributions? (see the sketch below)
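A minimal sketch of the storage comparison posed on this slide: the full joint table needs on the order of k^n entries per hypothesis, while the conditionally independent (naïve Bayes) factorization needs only n tables of k entries each. The n = 20, k = 2 numbers below are just an example.

```python
# A minimal sketch of the storage argument: the full joint P(F1, ..., Fn | H)
# needs on the order of k**n entries per hypothesis, while the naive Bayes
# factorization needs only n tables of k entries each.

def joint_table_size(n_features: int, k_values: int) -> int:
    return k_values ** n_features

def naive_bayes_size(n_features: int, k_values: int) -> int:
    return n_features * k_values

print(joint_table_size(20, 2))   # 1048576 entries
print(naive_bayes_size(20, 2))   # 40 entries
```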
Naïve Bayes Spam Filter • MAP decision: to minimize the probability of error, we should classify a message as spam if P(spam | message) > P(¬spam | message) • We have P(spam | message) ∝ P(message | spam) P(spam) and P(¬spam | message) ∝ P(message | ¬spam) P(¬spam)
Naïve Bayes Spam Filter • We need to find P(message | spam) P(spam) and P(message | ¬spam) P(¬spam) • The message is a sequence of words (w1, …, wn) • Bag of words representation: the order of the words in the message is not important, and each word is conditionally independent of the others given the message class (spam or not spam), so P(message | class) = P(w1 | class) · … · P(wn | class) • Our filter will classify the message as spam if P(spam) P(w1 | spam) … P(wn | spam) > P(¬spam) P(w1 | ¬spam) … P(wn | ¬spam) (see the sketch below)
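A minimal sketch of this decision rule, assuming the prior and per-word likelihoods are already available as dictionaries (the toy parameter values below are invented for illustration); log-probabilities are used so that products over many words do not underflow:

```python
import math

# A minimal sketch of the naive Bayes decision rule: classify as spam if
# log P(spam) + sum_i log P(w_i | spam) exceeds the corresponding score
# for 'not spam'.

def classify(words, prior, likelihood):
    scores = {}
    for label in ("spam", "not spam"):
        score = math.log(prior[label])
        for w in words:
            score += math.log(likelihood[label].get(w, 1e-9))  # tiny floor for unseen words
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy parameters:
prior = {"spam": 0.33, "not spam": 0.67}
likelihood = {
    "spam":     {"win": 0.05,  "money": 0.06,  "meeting": 0.001},
    "not spam": {"win": 0.002, "money": 0.004, "meeting": 0.03},
}
print(classify(["win", "money"], prior, likelihood))      # spam
print(classify(["meeting", "money"], prior, likelihood))  # not spam
```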
Bag of words illustration: US Presidential Speeches Tag Cloud (http://chir.ag/projects/preztags/) [tag cloud figures]
Naïve Bayes Spam Filter • Posterior ∝ prior × likelihood: P(spam | w1, …, wn) ∝ P(spam) P(w1 | spam) … P(wn | spam) and P(¬spam | w1, …, wn) ∝ P(¬spam) P(w1 | ¬spam) … P(wn | ¬spam)
Parameter estimation • In order to classify a message, we need to know the prior P(spam) and the likelihoods P(word | spam) and P(word | ¬spam) • These are the parameters of the probabilistic model • How do we obtain the values of these parameters? [figure: tables of the model parameters: the prior (spam: 0.33, ¬spam: 0.67) and the likelihoods P(word | spam), P(word | ¬spam)]
Parameter estimation • How do we obtain the prior P(spam) and the likelihoods P(word | spam) and P(word | ¬spam)? • Empirically: use training data • This is the maximum likelihood (ML) estimate, or the estimate that maximizes the likelihood of the training data: P(word | spam) = (# of occurrences of the word in spam messages) / (total # of words in spam messages), where the counts range over training documents d and word positions i (see the sketch below)
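A minimal sketch of the ML estimates on a tiny invented training set (the messages and counts below are hypothetical): the prior is the fraction of spam messages, and P(word | spam) is the word's share of all word occurrences in spam messages.

```python
from collections import Counter

# A minimal sketch of maximum-likelihood parameter estimation from a toy,
# made-up training set of labeled messages.

training = [
    ("spam",     "win money now".split()),
    ("spam",     "win a prize".split()),
    ("not spam", "meeting at noon".split()),
]

n_spam = sum(1 for label, _ in training if label == "spam")
prior_spam = n_spam / len(training)

spam_counts = Counter(w for label, words in training if label == "spam" for w in words)
total_spam_words = sum(spam_counts.values())

p_win_given_spam = spam_counts["win"] / total_spam_words
print(prior_spam)         # 2 / 3 ≈ 0.667
print(p_win_given_spam)   # 2 / 6 ≈ 0.333
```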
Parameter estimation • How do we obtain the prior P(spam) and the likelihoods P(word | spam) and P(word | ¬spam)? • Empirically: use training data • Parameter smoothing: dealing with words that were never seen or seen too few times • Laplacian smoothing: pretend you have seen every vocabulary word one more time than you actually did, so P(word | spam) = (# of occurrences of the word in spam messages + 1) / (total # of words in spam messages + V), where V is the vocabulary size (see the sketch below)
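A minimal sketch of Laplacian (add-one) smoothing, continuing the toy counts from the previous sketch: each vocabulary word gets one extra pretend occurrence, so words never seen in spam no longer receive zero probability.

```python
# A minimal sketch of Laplacian (add-one) smoothing on the toy counts above.

vocabulary = {"win", "money", "now", "a", "prize", "meeting", "at", "noon"}
spam_counts = {"win": 2, "money": 1, "now": 1, "a": 1, "prize": 1}
total_spam_words = sum(spam_counts.values())

def p_word_given_spam(word):
    # Every vocabulary word is counted one extra time.
    return (spam_counts.get(word, 0) + 1) / (total_spam_words + len(vocabulary))

print(p_word_given_spam("win"))      # (2 + 1) / (6 + 8) ≈ 0.214
print(p_word_given_spam("meeting"))  # (0 + 1) / (6 + 8) ≈ 0.071, not zero
```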
Summary of model and parameters • Naïve Bayes model: P(spam | w1, …, wn) ∝ P(spam) P(w1 | spam) … P(wn | spam), and P(¬spam | w1, …, wn) ∝ P(¬spam) P(w1 | ¬spam) … P(wn | ¬spam) • Model parameters: prior P(spam), P(¬spam); likelihood of spam P(w1 | spam), P(w2 | spam), …, P(wn | spam); likelihood of ¬spam P(w1 | ¬spam), P(w2 | ¬spam), …, P(wn | ¬spam)
Bag-of-word models for images Csurka et al. (2004), Willamowski et al. (2005), Grauman & Darrell (2005), Sivic et al. (2003, 2005)
Bag-of-word models for images • Extract image features • Learn “visual vocabulary” • Map image features to visual words (see the sketch below)
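A minimal sketch of this pipeline using k-means clustering (scikit-learn's KMeans) on random stand-in descriptors; in a real system the descriptors would come from a local feature extractor such as SIFT, and the vocabulary would be much larger than 10 words.

```python
import numpy as np
from sklearn.cluster import KMeans
from collections import Counter

# A minimal sketch of the bag-of-visual-words pipeline on random stand-in data:
# cluster training descriptors into a "visual vocabulary", then represent a new
# image by the histogram of the visual words its descriptors map to.

rng = np.random.default_rng(0)
training_descriptors = rng.normal(size=(500, 128))   # stand-in for extracted image features

vocabulary = KMeans(n_clusters=10, n_init=10, random_state=0).fit(training_descriptors)

image_descriptors = rng.normal(size=(40, 128))       # features from one new image
words = vocabulary.predict(image_descriptors)        # map each feature to a visual word
histogram = Counter(words)                           # bag-of-visual-words representation
print(histogram)
```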
Bayesian decision making: Summary • Suppose the agent has to make decisions about the value of an unobserved query variable X based on the values of an observed evidence variable E • Inference problem: given some evidence E = e, what is P(X | e)? • Learning problem: estimate the parameters of the probabilistic model P(X | E) given a training sample {(x1, e1), …, (xn, en)}