Bayesian Networks Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart Russell, Andrew Moore, Percy Liang, Luke Zettlemoyer, Rob Pless, Killian Weinberger, Deva Ramanan
Announcements • Office hours today canceled (due to conference deadline) • Rohit has office hours tomorrow 4-5pm
Bayesian decision making • Suppose the agent has to make decisions about the value of an unobserved query variable X based on the values of an observed evidence variable E • Inference problem: given some evidence E = e, what is P(X | e)? • Learning problem: estimate the parameters of the probabilistic model P(X | E) given training samples {(x1, e1), …, (xn, en)}
Learning in Bayes Nets • Task 1: Given the network structure and given data, where a data point is an observed setting for the variables, learn the CPTs for the Bayes Net. Might also start with priors for CPT probabilities. • Task 2: Given only the data (and possibly a prior over Bayes Nets), learn the entire Bayes Net (both Net structure and CPTs).
Parameter learning • Suppose we know the network structure (but not the parameters), and have a training set of complete observations [Figure: a Bayes net whose CPT entries are unknown (shown as ?), alongside a training set of complete observations]
Parameter learning • Suppose we know the network structure (but not the parameters), and have a training set of complete observations • P(X | Parents(X)) is given by the observed frequencies of the different values of X for each combination of parent values
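As a concrete illustration of "observed frequencies", here is a minimal Python sketch (not from the slides; the variables and data are invented) that estimates the CPT entry P(W = 1 | S, R) for a node W with parents S and R from a fully observed training set:

```python
# Illustrative sketch (not from the slides): estimating P(W | S, R) by counting
# observed frequencies in a fully observed training set.
from collections import Counter

# Each record fully specifies (S, R, W) as 0/1 values.
data = [(1, 0, 1), (0, 1, 1), (0, 0, 0), (1, 1, 1), (0, 1, 1), (1, 0, 0), (0, 0, 0)]

parent_counts = Counter()   # counts of each (S, R) parent configuration
joint_counts = Counter()    # counts of each (S, R, W) configuration

for s, r, w in data:
    parent_counts[(s, r)] += 1
    joint_counts[(s, r, w)] += 1

# CPT entry P(W=1 | S=s, R=r) = count(S=s, R=r, W=1) / count(S=s, R=r)
cpt = {
    (s, r): joint_counts[(s, r, 1)] / parent_counts[(s, r)]
    for (s, r) in parent_counts
}
print(cpt)   # {(1, 0): 0.5, (0, 1): 1.0, (0, 0): 0.0, (1, 1): 1.0}
```

Each CPT row is simply the fraction of training records with that parent configuration in which W = 1.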
Parameter learning • Incomplete observations • Expectation maximization (EM) algorithm for dealing with missing data [Figure: the same Bayes net with unknown CPT entries (?), now with a training set containing missing values]
Learning from incomplete data • Observe real-world data • Some variables are unobserved • Trick: Fill them in • EM algorithm • E=Expectation • M=Maximization
EM-Algorithm • Initialize CPTs (e.g. randomly, uniformly) • Repeat until convergence • E-Step: Compute P(Xi = x | observed values, current CPTs) for all i and x (inference) • M-Step: Update CPT tables from the expected counts computed in the E-step (a minimal sketch follows below)
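To make the two steps concrete, here is a minimal sketch (not from the slides; the two-node network X → Y, the data, and the initialization are all invented for illustration) of EM for a tiny Bayes net in which X is sometimes unobserved:

```python
# Illustrative sketch (not from the slides): EM for a tiny Bayes net X -> Y with
# binary variables, where X is missing in some training records.

# Training data: (x, y) pairs; x is None when unobserved.
data = [(1, 1), (1, 1), (0, 0), (None, 1), (None, 0), (1, 0), (0, 0), (None, 1)]

# Initialize CPTs (e.g. uniformly): p_x = P(X=1), p_y1[x] = P(Y=1 | X=x)
p_x = 0.5
p_y1 = {0: 0.5, 1: 0.5}

for _ in range(100):   # repeat until convergence (fixed iteration count here)
    # E-step: for each record, compute P(X=1 | y) under the current CPTs (inference).
    w = []
    for x, y in data:
        if x is not None:
            w.append(float(x))   # X observed: no uncertainty
        else:
            lik1 = p_y1[1] if y == 1 else 1 - p_y1[1]
            lik0 = p_y1[0] if y == 1 else 1 - p_y1[0]
            w.append(p_x * lik1 / (p_x * lik1 + (1 - p_x) * lik0))

    # M-step: re-estimate CPTs from the expected counts.
    n = len(data)
    exp_x1 = sum(w)
    p_x = exp_x1 / n
    p_y1[1] = sum(wi for wi, (_, y) in zip(w, data) if y == 1) / exp_x1
    p_y1[0] = sum(1 - wi for wi, (_, y) in zip(w, data) if y == 1) / (n - exp_x1)

print(p_x, p_y1)
```

The E-step fills in each missing X with a posterior probability rather than a hard value; the M-step then re-estimates the CPTs exactly as in the fully observed case, but using expected counts.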
Parameter learning • What if the network structure is unknown? • Structure learning algorithms exist, but they are pretty complicated… [Figure: a training set over the variables C, S, R, W with the network structure left unspecified]
Summary: Bayesian networks • Structure • Independence • Inference • Parameters • Learning
Bayesian inference, Naïve Bayes model http://xkcd.com/1236/
Bayes Rule Rev. Thomas Bayes (1702-1761) • The product rule gives us two ways to factor a joint probability: P(A, B) = P(A | B) P(B) = P(B | A) P(A) • Therefore, P(A | B) = P(B | A) P(A) / P(B) • Why is this useful? • Key tool for probabilistic inference: can get diagnostic probability from causal probability • E.g., P(Cavity | Toothache) from P(Toothache | Cavity)
Bayes Rule example • Marie is getting married tomorrow, at an outdoor ceremony in the desert. In recent years, it has rained only 5 days each year (5/365 = 0.014). Unfortunately, the weatherman has predicted rain for tomorrow. When it actually rains, the weatherman correctly forecasts rain 90% of the time. When it doesn't rain, he incorrectly forecasts rain 10% of the time. What is the probability that it will rain on Marie's wedding?
Bayes rule: Another example • 1% of women at age forty who participate in routine screening have breast cancer. 80% of women with breast cancer will get positive mammographies. 9.6% of women without breast cancer will also get positive mammographies. A woman in this age group had a positive mammography in a routine screening. What is the probability that she actually has breast cancer?
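Both examples can be worked through directly with Bayes' rule; the following sketch (the function name and structure are mine) computes the two posteriors:

```python
# Worked versions of the two examples above, applying Bayes' rule directly.

def posterior(prior, p_evidence_given_h, p_evidence_given_not_h):
    """P(H | evidence) via Bayes' rule for a binary hypothesis H."""
    num = p_evidence_given_h * prior
    return num / (num + p_evidence_given_not_h * (1 - prior))

# Marie's wedding: P(rain | forecast of rain)
print(posterior(prior=5 / 365, p_evidence_given_h=0.9, p_evidence_given_not_h=0.1))
# ~0.11: even with a rain forecast, rain is still unlikely in the desert.

# Mammography: P(cancer | positive test)
print(posterior(prior=0.01, p_evidence_given_h=0.8, p_evidence_given_not_h=0.096))
# ~0.078: a positive test raises the probability to only about 8%.
```

In both cases the low prior keeps the posterior small despite seemingly strong evidence.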
Probabilistic inference • Suppose the agent has to make a decision about the value of an unobserved query variable X given some observed evidence variable(s) E = e • Partially observable, stochastic, episodic environment • Examples: X = {spam, not spam}, e = email message; X = {zebra, giraffe, hippo}, e = image features
Bayesian decision theory • Let x be the value predicted by the agent and x* be the true value of X. • The agent has a loss function, which is 0 if x = x* and 1 otherwise • Expected loss of predicting x: sum over x* of L(x, x*) P(x* | e) = 1 − P(x | e) • What is the estimate of X that minimizes the expected loss? • The one that has the greatest posterior probability P(x | e) • This is called the Maximum a Posteriori (MAP) decision
MAP decision • Value x of X that has the highest posterior probability given the evidence E = e: x_MAP = argmax_x P(x | e) = argmax_x P(e | x) P(x) / P(e), i.e. posterior ∝ likelihood × prior • Maximum likelihood (ML) decision: x_ML = argmax_x P(e | x)
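A toy illustration (with invented numbers) of how the MAP and ML decisions can disagree when the prior is far from uniform:

```python
# Toy illustration (numbers invented): MAP vs. ML decisions can differ
# when the prior is strongly skewed.
prior = {"spam": 0.1, "ham": 0.9}
likelihood = {"spam": 0.30, "ham": 0.05}   # P(e | x) for the observed evidence e

ml = max(likelihood, key=likelihood.get)                    # argmax_x P(e | x)
map_ = max(prior, key=lambda x: likelihood[x] * prior[x])   # argmax_x P(e | x) P(x)

print(ml)    # "spam": the evidence alone favors spam
print(map_)  # "ham": the strong prior toward ham outweighs the likelihood
```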
Naïve Bayes model • Suppose we have many different types of observations (symptoms, features) E1, …, En that we want to use to obtain evidence about an underlying hypothesis X • MAP decision involves estimating P(X = x | E1, …, En) ∝ P(X = x) P(E1, …, En | X = x) • If each feature Ei can take on k values, how many entries are in the (conditional) joint probability table P(E1, …, En | X = x)? (k^n entries for each value x — exponential in the number of features)
Naïve Bayes model • Suppose we have many different types of observations (symptoms, features) E1, …, En that we want to use to obtain evidence about an underlying hypothesis X • MAP decision involves estimating P(X = x | E1, …, En) ∝ P(X = x) P(E1, …, En | X = x) • We can make the simplifying assumption that the different features are conditionally independent given the hypothesis: P(E1, …, En | X = x) = P(E1 | X = x) ⋯ P(En | X = x) • If each feature can take on k values, what is the complexity of storing the resulting distributions? (only n tables of k values per class, instead of one table with k^n entries)
Exchangeability • De Finetti Theorem of exchangeability (bag of words theorem): the joint probability distribution underlying the data is invariant to permutation.
Naïve Bayes model • Posterior: P(X | E1, …, En) ∝ P(X) P(E1 | X) ⋯ P(En | X), i.e. posterior ∝ prior × likelihood • MAP decision: x_MAP = argmax_x P(x) P(e1 | x) ⋯ P(en | x)
A Simple Example – Naïve Bayes • C – class, F1, …, Fn – features • We only specify (parameters): the prior over class labels, P(C), and how each feature depends on the class, P(Fi | C)
Spam filter • MAP decision: to minimize the probability of error, we should classify a message as spam if P(spam | message) > P(¬spam | message)
Spam filter • MAP decision: to minimize the probability of error, we should classify a message as spam if P(spam | message) > P(¬spam | message) • We have P(spam | message) ∝ P(message | spam) P(spam) and P(¬spam | message) ∝ P(message | ¬spam) P(¬spam) • To enable classification, we need to be able to estimate the likelihoods P(message | spam) and P(message | ¬spam) and priors P(spam) and P(¬spam)
Naïve Bayes Representation • Goal: estimate likelihoods P(message | spam) and P(message | ¬spam) and priors P(spam) and P(¬spam) • Likelihood: bag of words representation • The message is a sequence of words (w1, …, wn) • The order of the words in the message is not important • Each word is conditionally independent of the others given message class (spam or not spam)
Bag of words illustration US Presidential Speeches Tag Cloud http://chir.ag/projects/preztags/
Naïve Bayes Representation • Goal: estimate likelihoods P(message | spam) and P(message | ¬spam) and priors P(spam) and P(¬spam) • Likelihood: bag of words representation • The message is a sequence of words (w1, …, wn) • The order of the words in the message is not important • Each word is conditionally independent of the others given message class (spam or not spam) • Thus, the problem is reduced to estimating marginal likelihoods of individual words P(wi | spam) and P(wi | ¬spam)
Summary: Decision rule • General MAP rule for Naïve Bayes: x_MAP = argmax_x P(x) P(e1 | x) ⋯ P(en | x), i.e. maximize prior × likelihood to maximize the posterior • Thus, the filter should classify the message as spam if P(spam) P(w1 | spam) ⋯ P(wn | spam) > P(¬spam) P(w1 | ¬spam) ⋯ P(wn | ¬spam)
Percentage of documents in training set labeled as spam/ham Slide from Dan Klein
In the documents labeled as spam, occurrence percentage of each word (e.g. # times “the” occurred/# total words). Slide from Dan Klein
In the documents labeled as ham, occurrence percentage of each word (e.g. # times “the” occurred/# total words). Slide from Dan Klein
Parameter estimation • Model parameters: feature likelihoods P(word | spam) and P(word | ¬spam) and priors P(spam) and P(¬spam) • How do we obtain the values of these parameters? • Need training set of labeled samples from both classes • This is the maximum likelihood (ML) estimate, i.e. the estimate that maximizes the likelihood of the training data, the product over documents d and word positions i of P(wd,i | class of document d): P(word | spam) = (# of word occurrences in spam messages) / (total # of words in spam messages)
Parameter estimation • Parameter estimate: P(word | spam) = (# of word occurrences in spam messages) / (total # of words in spam messages) • Parameter smoothing: dealing with words that were never seen or seen too few times • Laplacian smoothing: pretend you have seen every vocabulary word one more time than you actually did: P(word | spam) = (# of word occurrences in spam messages + 1) / (total # of words in spam messages + V), where V is the total number of unique words
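A minimal sketch of this estimation step, assuming a tiny invented corpus of tokenized spam and ham messages (the documents and word choices are illustrative, not from the assignment):

```python
# Illustrative sketch: word likelihoods with Laplace (add-one) smoothing,
# estimated from a tiny labeled corpus of tokenized messages.
from collections import Counter

spam_docs = [["win", "cash", "now"], ["cash", "prize", "now"]]
ham_docs = [["meeting", "at", "noon"], ["lunch", "at", "noon", "now"]]

vocab = {w for doc in spam_docs + ham_docs for w in doc}
V = len(vocab)

def smoothed_likelihoods(docs):
    counts = Counter(w for doc in docs for w in doc)
    total = sum(counts.values())
    # P(word | class) = (count(word, class) + 1) / (total words in class + V)
    return {w: (counts[w] + 1) / (total + V) for w in vocab}

p_word_spam = smoothed_likelihoods(spam_docs)
p_word_ham = smoothed_likelihoods(ham_docs)
p_spam = len(spam_docs) / (len(spam_docs) + len(ham_docs))  # prior from label frequencies

print(p_word_spam["cash"], p_word_ham["cash"])  # "cash" is far more likely under spam
```

Note that "cash" never appears in the ham documents, yet smoothing keeps P(cash | ham) above zero, so a single unseen word cannot zero out an entire message's score.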
Summary of model and parameters • Naïve Bayes model: P(spam, w1, …, wn) = P(spam) P(w1 | spam) ⋯ P(wn | spam) • Model parameters: prior P(spam), P(¬spam); likelihood of spam P(w1 | spam), P(w2 | spam), …, P(wn | spam); likelihood of ¬spam P(w1 | ¬spam), P(w2 | ¬spam), …, P(wn | ¬spam)
Classification • The class that maximizes: P(class) P(w1 | class) ⋯ P(wn | class)
Classification • In practice • Multiplying lots of small probabilities can result in floating point underflow
Classification • In practice • Multiplying lots of small probabilities can result in floating point underflow • Since log(xy) = log(x) + log(y), we can sum log probabilities instead of multiplying probabilities.
Classification • In practice • Multiplying lots of small probabilities can result in floating point underflow • Since log(xy) = log(x) + log(y), we can sum log probabilities instead of multiplying probabilities. • Since log is a monotonic function, the class with the highest score does not change.
Classification • In practice • Multiplying lots of small probabilities can result in floating point underflow • Since log(xy) = log(x) + log(y), we can sum log probabilities instead of multiplying probabilities. • Since log is a monotonic function, the class with the highest score does not change. • So, what we usually compute in practice is: c = argmax_class [log P(class) + sum over i of log P(wi | class)]
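A minimal sketch of the resulting log-space decision rule, using invented toy parameters in place of likelihoods estimated from real data:

```python
# Illustrative sketch: log-space Naive Bayes classification with toy parameters.
import math

p_class = {"spam": 0.4, "ham": 0.6}
p_word = {
    "spam": {"win": 0.20, "cash": 0.25, "meeting": 0.05, "noon": 0.05},
    "ham":  {"win": 0.02, "cash": 0.03, "meeting": 0.30, "noon": 0.25},
}

def classify(words):
    # argmax over classes of log P(class) + sum_i log P(w_i | class)
    scores = {}
    for c in p_class:
        score = math.log(p_class[c])
        for w in words:
            if w in p_word[c]:          # ignore out-of-vocabulary words in this toy
                score += math.log(p_word[c][w])
        scores[c] = score
    return max(scores, key=scores.get)

print(classify(["win", "cash", "cash"]))   # -> "spam"
print(classify(["meeting", "noon"]))       # -> "ham"
```

Summing logs avoids underflow on long messages while leaving the argmax unchanged, since log is monotonic.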
Visual Categorization with Bags of Keypoints Gabriella Csurka, Christopher R. Dance, Lixin Fan, Jutta Willamowski, Cédric Bray