Data Mining: naïve Bayes

Naïve Bayes Classifier. Thomas Bayes (1702 - 1761). We will get to the mathematical background shortly, but first we start with some visual intuition.

[Figure: scatter plot of Grasshoppers and Katydids; axes are Antenna Length and Abdomen Length, each from 1 to 10] Remember this example? Let’s get lots more data…

[Figure: scatter plot of Katydids and Grasshoppers with many more data points, each axis from 1 to 10; histograms of Antenna Length] With a lot of data, we can build a histogram. Let us just build one for “Antenna Length” for now…

We can leave the histograms as they are, or we can summarize them with two normal distributions. Let us use two normal distributions for ease of visualization in the following slides…

We want to classify an insect we have found. Its antennae are 3 units long. • How can we classify it? • We can just ask ourselves: given the distributions of antennae lengths we have seen, is it more probable that our insect is a Grasshopper or a Katydid? • There is a formal way to discuss the most probable classification… P(cj|d) = probability of cj given that we have observed d [Figure: the two distributions with a marker at antennae length 3]

P(cj|d) = probability of cj given that we have observed d • Antennae length is 3; at that length the figure shows a count of 10 for Grasshoppers and 2 for Katydids • P(Grasshopper | 3) = 10 / (10 + 2) = 0.833 • P(Katydid | 3) = 2 / (10 + 2) = 0.167
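The arithmetic on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `class_posteriors` and the dictionary layout are assumptions, and the counts 10 and 2 are the ones read off the slide at antennae length 3.

```python
def class_posteriors(counts):
    """Turn per-class counts at one observed value into P(class | value)."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

# Counts taken from the slide for antennae length 3.
counts_at_3 = {"Grasshopper": 10, "Katydid": 2}

posteriors = class_posteriors(counts_at_3)
best = max(posteriors, key=posteriors.get)  # most probable class

print(posteriors)
print(best)
```

Picking the class with the largest value is the whole decision rule; the rest of the deck is about estimating those values when there are many features.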

Bayes Classifier • A probabilistic framework for classification problems • Often appropriate because the world is noisy and also some relationships are probabilistic in nature • Is predicting who will win a baseball game probabilistic in nature? • Before getting to the heart of the matter, we will go over some basic probability. • We will review the concept of reasoning with uncertainty, which is based on probability theory • Should be review for many of you

Discrete Random Variables • A is a Boolean-valued random variable if A denotes an event, and there is some degree of uncertainty as to whether A occurs. • Examples • A = The next patient you examine is suffering from inhalational anthrax • A = The next patient you examine has a cough • A = There is an active terrorist cell in your city • We view P(A) as “the fraction of possible worlds in which A is true”

Visualizing A • The event space of all possible worlds has area 1 • Worlds in which A is true lie inside the reddish oval; worlds in which A is false lie outside it • P(A) = area of the reddish oval

The Axioms Of Probability • 0 <= P(A) <= 1 • P(True) = 1 • P(False) = 0 • P(A or B) = P(A) + P(B) - P(A and B) • The area of A can’t get any smaller than 0, and a zero area would mean no world could ever have A true

Interpreting the axioms • 0 <= P(A) <= 1 • P(True) = 1 • P(False) = 0 • P(A or B) = P(A) + P(B) - P(A and B) • The area of A can’t get any bigger than 1, and an area of 1 would mean all worlds will have A true

Interpreting the axioms • 0 <= P(A) <= 1 • P(True) = 1 • P(False) = 0 • P(A or B) = P(A) + P(B) - P(A and B) • Simple addition and subtraction [Figure: Venn diagram of A and B showing the regions P(A or B) and P(A and B)]

Another Important Theorem • 0 <= P(A) <= 1, P(True) = 1, P(False) = 0 • P(A or B) = P(A) + P(B) - P(A and B) • From these we can prove: P(A) = P(A and B) + P(A and not B) [Figure: Venn diagram of A and B]
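The "fraction of possible worlds" picture can be checked numerically. The sketch below assumes a made-up toy model with four kinds of worlds (every combination of A and B) and hypothetical weights that sum to 1; none of these numbers come from the slides.

```python
from fractions import Fraction

# Hypothetical weights for each (A, B) truth assignment; they sum to 1.
weights = {
    (True, True): Fraction(1, 10),
    (True, False): Fraction(3, 10),
    (False, True): Fraction(2, 10),
    (False, False): Fraction(4, 10),
}

def prob(event):
    """P(event) = total weight of the worlds in which the event holds."""
    return sum(w for (a, b), w in weights.items() if event(a, b))

p_a = prob(lambda a, b: a)
p_b = prob(lambda a, b: b)
p_a_and_b = prob(lambda a, b: a and b)
p_a_or_b = prob(lambda a, b: a or b)
p_a_and_not_b = prob(lambda a, b: a and not b)

# Addition rule and the theorem from the slide.
print(p_a_or_b == p_a + p_b - p_a_and_b)
print(p_a == p_a_and_b + p_a_and_not_b)
```

Using `Fraction` keeps the comparisons exact, so the equalities are not disturbed by floating-point rounding.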

Conditional Probability • P(A|B) = fraction of worlds in which B is true that also have A true • H = “Have a headache”, F = “Coming down with Flu” • P(H) = 1/10, P(F) = 1/40, P(H|F) = 1/2 • “Headaches are rare and flu is rarer, but if you’re coming down with flu there’s a 50-50 chance you’ll have a headache.” [Figure: Venn diagram of F and H]

Conditional Probability • H = “Have a headache”, F = “Coming down with Flu” • P(H) = 1/10, P(F) = 1/40, P(H|F) = 1/2 • P(H|F) = fraction of flu-inflicted worlds in which you have a headache = (#worlds with flu and headache) / (#worlds with flu) = (area of “H and F” region) / (area of “F” region) = P(H and F) / P(F) [Figure: Venn diagram of F and H]

Definition of Conditional Probability • P(A|B) = P(A and B) / P(B) • Corollary: The Chain Rule • P(A and B) = P(A|B) P(B)

Probabilistic Inference • H = “Have a headache”, F = “Coming down with Flu” • P(H) = 1/10, P(F) = 1/40, P(H|F) = 1/2 • One day you wake up with a headache. You think: “Drat! 50% of flus are associated with headaches so I must have a 50-50 chance of coming down with flu.” Is this reasoning good? [Figure: Venn diagram of F and H]

Probabilistic Inference • H = “Have a headache”, F = “Coming down with Flu” • P(H) = 1/10, P(F) = 1/40, P(H|F) = 1/2 • P(F and H) = … • P(F|H) = … [Figure: Venn diagram of F and H]

Probabilistic Inference • H = “Have a headache”, F = “Coming down with Flu” • P(H) = 1/10, P(F) = 1/40, P(H|F) = 1/2 [Figure: Venn diagram of F and H]
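One way to work through the headache question is to apply the chain rule and then the definition of conditional probability to the three numbers given on the slide. The variable names below are illustrative only.

```python
from fractions import Fraction

p_h = Fraction(1, 10)          # P(H): have a headache
p_f = Fraction(1, 40)          # P(F): coming down with flu
p_h_given_f = Fraction(1, 2)   # P(H|F)

# Chain rule: P(F and H) = P(H|F) P(F)
p_f_and_h = p_h_given_f * p_f

# Definition of conditional probability: P(F|H) = P(F and H) / P(H)
p_f_given_h = p_f_and_h / p_h

print(p_f_and_h)    # 1/80
print(p_f_given_h)  # 1/8
```

Under these numbers the chance of flu given a headache is 1/8, not 1/2, so the "50-50" reasoning confuses P(H|F) with P(F|H).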

What we just did… • P(B|A) = P(A and B) / P(A) = P(A|B) P(B) / P(A) • This is Bayes Rule • Bayes, Thomas (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370-418.

More Terminology • The Prior Probability is the probability assuming no specific information. • Thus we would refer to P(A) as the prior probability of event A occurring • We would not say that P(A|C) is the prior probability of A occurring • The Posterior probability is the probability given that we know something • We would say that P(A|C) is the posterior probability of A (given that C occurs)

Example of Bayes Theorem • Given: • A doctor knows that meningitis causes stiff neck 50% of the time • Prior probability of any patient having meningitis is 1/50,000 • Prior probability of any patient having stiff neck is 1/20 • If a patient has stiff neck, what’s the probability he/she has meningitis?
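The meningitis question can be answered by plugging the three given numbers into Bayes rule. The sketch below uses M for meningitis and S for stiff neck; those letters and the variable names are assumptions made for the example.

```python
from fractions import Fraction

p_s_given_m = Fraction(1, 2)   # P(S|M): meningitis causes stiff neck 50% of the time
p_m = Fraction(1, 50_000)      # P(M): prior probability of meningitis
p_s = Fraction(1, 20)          # P(S): prior probability of stiff neck

# Bayes rule: P(M|S) = P(S|M) P(M) / P(S)
p_m_given_s = p_s_given_m * p_m / p_s

print(p_m_given_s)         # 1/5000
print(float(p_m_given_s))  # 0.0002
```

Even with a stiff neck, meningitis remains very unlikely because its prior probability is so small.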

Why Bayes Theorem at All? • Why model P(C|A) via P(A|C)? • We will see it is easier, but only with significant assumptions • In classification, what is C and what is A? • C is the class and A is the example, a vector of attribute values • Why not model P(C|A) directly? How would we compute it? • We would need to observe A at least once and probably many times in order to come up with reasonable probability estimates. If we observe it once, we would have a probability of 1 for some C and 0 for the rest. • We cannot expect to see every attribute vector even once!

Bayes Classifiers • That was a visual intuition for a simple case of the Bayes classifier, also called: • Idiot Bayes • Naïve Bayes • Simple Bayes • We are about to see some of the mathematical formalisms, and more examples, but keep in mind the basic idea. • Find out the probability of the previously unseen instance belonging to each class, then simply pick the most probable class.

Bayesian Classifiers • Bayesian classifiers use Bayes theorem, which says p(cj | d) = p(d | cj) p(cj) / p(d) • p(cj | d) = probability of instance d being in class cj. This is what we are trying to compute • p(d | cj) = probability of generating instance d given class cj. We can imagine that being in class cj causes you to have feature d with some probability • p(cj) = probability of occurrence of class cj. This is just how frequent the class cj is in our database • p(d) = probability of instance d occurring. This can actually be ignored, since it is the same for all classes

Bayesian Classifiers • Given a record with attributes (A1, A2,…,An) • The goal is to predict class C • Actually, we want to find the value of C that maximizes P(C| A1, A2,…,An ) • Can we estimate P(C| A1, A2,…,An ) directly (w/o Bayes)? • Yes, we simply need to count up the number of times we see A1, A2,…,An and then see what fraction belongs to each class • For example, if n=3 and the feature vector “4,3,2” occurs 10 times and 4 of these belong to C1 and 6 to C2, then: • What is P(C1|”4,3,2”)? • What is P(C2|”4,3,2”)? • Unfortunately, this is generally not feasible since not every feature vector will be found in the training set (as we just said)
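The direct counting approach, and the reason it breaks down, can be sketched as follows. The function `direct_posterior` is a hypothetical helper; the training list reproduces the slide's example (the vector "4,3,2" seen 10 times, 4 in C1 and 6 in C2), and the unseen vector (4, 3, 1) is made up to show the failure case.

```python
from collections import Counter

def direct_posterior(training, vector):
    """Estimate P(C | vector) by counting exact matches of the whole vector."""
    matches = Counter(label for features, label in training if features == vector)
    total = sum(matches.values())
    if total == 0:
        return None  # vector never seen: no estimate is possible
    return {label: n / total for label, n in matches.items()}

# The slide's example: "4,3,2" occurs 10 times, 4 in C1 and 6 in C2.
training = [((4, 3, 2), "C1")] * 4 + [((4, 3, 2), "C2")] * 6

seen = direct_posterior(training, (4, 3, 2))
unseen = direct_posterior(training, (4, 3, 1))  # hypothetical unseen vector

print(seen)    # {'C1': 0.4, 'C2': 0.6}
print(unseen)  # None
```

The `None` result is the practical problem: with many attributes, most test vectors will never have appeared in the training set.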

Bayesian Classifiers • Indirect Approach: Use Bayes Theorem • compute the posterior probability P(C | A1, A2, …, An) for all values of C using the Bayes theorem • Choose value of C that maximizes P(C | A1, A2, …, An) • Equivalent to choosing value of C that maximizes P(A1, A2, …, An|C) P(C) • Since the denominator is the same for all values of C

Naïve Bayes Classifier • How can we estimate P(A1, A2, …, An | Cj)? • We can measure it directly, but only if the training set samples every feature vector. Not practical! Not easier than measuring P(Cj | A1, A2, …, An) • So, we must assume independence among attributes Ai when the class is given: • P(A1, A2, …, An | Cj) = P(A1 | Cj) P(A2 | Cj) … P(An | Cj) • Then can we directly estimate P(Ai | Cj) for all Ai and Cj? • Yes, because we are looking at only one feature at a time. We can expect each feature value to appear many times in the training data. • A new point is classified to Cj if P(Cj) ∏ P(Ai | Cj) is maximal, where the product runs over all attributes Ai.
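The decision rule, pick the class that maximizes P(Cj) ∏ P(Ai | Cj), can be sketched as a table lookup. Everything concrete here is hypothetical: the function name `naive_bayes_classify`, the table layout, and all the probability values are made up for illustration and are not estimated from any data in the slides.

```python
from math import prod

def naive_bayes_classify(priors, conditionals, example):
    """Score each class by P(C) * product of P(A_i = value_i | C); return the best."""
    scores = {
        c: priors[c] * prod(conditionals[c][i][v] for i, v in enumerate(example))
        for c in priors
    }
    return max(scores, key=scores.get), scores

# Hypothetical tables: two classes, two attributes.
priors = {"C1": 0.5, "C2": 0.5}
conditionals = {
    "C1": [{"a": 0.8, "b": 0.2}, {"x": 0.5, "y": 0.5}],
    "C2": [{"a": 0.1, "b": 0.9}, {"x": 0.5, "y": 0.5}],
}

label, scores = naive_bayes_classify(priors, conditionals, ("a", "x"))
print(label)
print(scores)
```

The scores are not normalized probabilities, which is fine for classification because the shared denominator does not change which class is largest.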

(Note: “Drew” can be a male or female name, e.g. Drew Barrymore and Drew Carey.) Assume that we have two classes, c1 = male and c2 = female. We have a person whose sex we do not know, say “drew” or d. Classifying drew as male or female is equivalent to asking whether it is more probable that drew is male or female, i.e., which is greater: p(male | drew) or p(female | drew)? • p(male | drew) = p(drew | male) p(male) / p(drew) • p(drew | male): What is the probability of being called “drew” given that you are a male? • p(male): What is the probability of being a male? • p(drew): What is the probability of being named “drew”? (actually irrelevant, since it is the same for all classes)

This is Officer Drew (who arrested me in 1997). Is Officer Drew a Male or Female? Luckily, we have a small database with names and sex. We can use it to apply Bayes rule… p(cj | d) = p(d | cj) p(cj) / p(d)

p(cj | d) = p(d | cj) p(cj) / p(d) • p(male | drew) = (1/3 * 3/8) / (3/8) = 0.125 / 0.375 = 0.333 • p(female | drew) = (2/5 * 5/8) / (3/8) = 0.250 / 0.375 = 0.667 • Officer Drew is more likely to be a Female.
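The Officer Drew calculation can be reproduced exactly from the four fractions on the slide. The dictionary names below are illustrative; the evidence term p(drew) is obtained by summing the numerators over both classes, which gives the 3/8 used on the slide.

```python
from fractions import Fraction

# Values from the slide.
likelihood = {"male": Fraction(1, 3), "female": Fraction(2, 5)}  # p(drew | class)
prior = {"male": Fraction(3, 8), "female": Fraction(5, 8)}       # p(class)

# Numerators of Bayes rule, p(drew | class) p(class).
joint = {c: likelihood[c] * prior[c] for c in prior}

# p(drew) is the sum of the numerators over all classes.
evidence = sum(joint.values())

posterior = {c: joint[c] / evidence for c in joint}
print(evidence)   # 3/8
print(posterior)  # male 1/3, female 2/3
```

Because the evidence is shared, comparing the numerators 1/8 and 1/4 already tells us that female is the more probable class.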

Officer Drew IS a female! So far we have only considered Bayes Classification when we have one attribute (the “antennae length”, or the “name”). In this case there is no real benefit in using Naïve Bayes. But in classification we usually have many features. How do we use all the features?

p(cj | d) = p(d | cj) p(cj) / p(d)

To simplify the task, naïve Bayesian classifiers assume that, given the class, attributes have independent distributions, and thereby estimate p(d|cj) = p(d1|cj) * p(d2|cj) * … * p(dn|cj) • The probability of class cj generating instance d equals… • the probability of class cj generating the observed value for feature 1, multiplied by… • the probability of class cj generating the observed value for feature 2, multiplied by…

Naïve Bayes is fast and space efficient • We can look up all the probabilities with a single scan of the database and store them in a (small) table…
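A single scan that fills the small table might look like the sketch below. The function names and data layout are assumptions. The records are hypothetical stand-ins chosen only to be consistent with the Officer Drew fractions (3 males of whom 1 is named drew, 5 females of whom 2 are named drew); the actual database from the slide is not reproduced here, and "other" is a placeholder for any other name.

```python
from collections import Counter, defaultdict

def train_counts(records):
    """One pass over the data: count classes and (class, attribute, value) triples."""
    class_counts = Counter()
    value_counts = defaultdict(Counter)  # (class, attribute index) -> value counts
    for features, label in records:
        class_counts[label] += 1
        for i, value in enumerate(features):
            value_counts[(label, i)][value] += 1
    return class_counts, value_counts

def conditional(class_counts, value_counts, label, i, value):
    """Estimate P(attribute i = value | class = label) from the stored counts."""
    return value_counts[(label, i)][value] / class_counts[label]

# Hypothetical records with a single attribute (the name).
records = (
    [(("drew",), "male")] + [(("other",), "male")] * 2
    + [(("drew",), "female")] * 2 + [(("other",), "female")] * 3
)

class_counts, value_counts = train_counts(records)
p_drew_male = conditional(class_counts, value_counts, "male", 0, "drew")
p_drew_female = conditional(class_counts, value_counts, "female", 0, "drew")
print(p_drew_male, p_drew_female)
```

The table size depends on the number of classes, attributes, and distinct values, not on the number of records, which is why it stays small.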

Naïve Bayes is NOT sensitive to irrelevant features... Suppose we are trying to classify a person’s sex based on several features, including eye color. (Eye color is irrelevant to a person’s sex.) • p(Jessica | cj) = p(eye = brown | cj) * p(wears_dress = yes | cj) * … • p(Jessica | Female) = 9,000/10,000 * 9,975/10,000 * … • p(Jessica | Male) = 9,001/10,000 * 2/10,000 * … • The eye color terms are almost the same! However, this assumes that we have good enough estimates of the probabilities, so the more data the better.
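To see why the eye color term barely matters, compare the female-to-male ratio of each factor using the numbers on the slide. The variable names are illustrative.

```python
# Per-class estimates from the slide.
eye_female, eye_male = 9_000 / 10_000, 9_001 / 10_000
dress_female, dress_male = 9_975 / 10_000, 2 / 10_000

# Ratio of the female term to the male term for each feature.
eye_ratio = eye_female / eye_male        # very close to 1: nearly no effect
dress_ratio = dress_female / dress_male  # in the thousands: dominates the decision

print(round(eye_ratio, 4))
print(round(dress_ratio, 1))
```

A factor whose ratio is close to 1 multiplies both class scores by almost the same amount, so it cannot change which class wins.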

An obvious point. I have used a simple two-class problem, and two possible values for each feature, in my previous examples. However, we can have an arbitrary number of classes or feature values.

Naïve Bayesian Classifier • Problem! Naïve Bayes assumes independence of features… Are height and weight independent? • Naïve Bayes tends to work well anyway and is competitive with other methods

[Figure: scatter plot with each axis from 1 to 10] The Naïve Bayesian Classifier has a quadratic decision boundary

How to Estimate Probabilities from Data? • For continuous attributes: • Discretize the range into bins • Two-way split: (A < v) or (A > v) • choose only one of the two splits as new attribute • Creates a binary feature • Probability density estimation: • Assume attribute follows a normal distribution and use the data to fit this distribution • Once probability distribution is known, can use it to estimate the conditional probability P(Ai|c) • We will not deal with continuous values on HW or exam • Just understand the general ideas above
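The density estimation option can be sketched as: fit a normal distribution to one attribute's values within one class, then evaluate its density at the observed value. The function names and the three sample values are hypothetical, and the sample standard deviation is one possible choice of estimator. Note that the result is a density, used in place of P(Ai|c) in the product, not a probability.

```python
import math
from statistics import mean, stdev

def fit_gaussian(values):
    """Fit a normal distribution using the sample mean and sample standard deviation."""
    return mean(values), stdev(values)

def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical attribute values observed for one class.
values = [2.0, 3.0, 4.0]

mu, sigma = fit_gaussian(values)
density_at_3 = gaussian_pdf(3.0, mu, sigma)
print(mu, sigma, density_at_3)
```

Repeating the fit for each class and each continuous attribute gives the per-class densities that replace the counted frequencies.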

Example of Naïve Bayes • We start with a test example and want to know its class. Does this individual evade their taxes: Yes or No? • Here is the feature vector: • Refund = No, Marital Status = Married, Income = 120K • Now what do we do? • First try writing out the thing we want to measure • P(Evade | [Refund=No, Married, Income=120K]) • Next, what do we need to maximize? • P(Cj) ∏ P(Ai | Cj), where the product runs over all attributes Ai

Example of Naïve Bayes • Since we want to maximize P(Cj) ∏ P(Ai | Cj) • What quantities do we need to calculate in order to use this equation? • Someone come up to the board and write them out, without calculating them • Recall that we have three attributes: • Refund: Yes, No • Marital Status: Single, Married, Divorced • Taxable Income: 10 different “discrete” values • While we could compute every P(Ai | Cj) for all Ai, we only need to do it for the attribute values in the test example

Values to Compute • Given we need to compute P(Cj) ∏ P(Ai | Cj) • We need to compute the class probabilities • P(Evade=No) • P(Evade=Yes) • We need to compute the conditional probabilities • P(Refund=No|Evade=No) • P(Refund=No|Evade=Yes) • P(Marital Status=Married|Evade=No) • P(Marital Status=Married|Evade=Yes) • P(Income=120K|Evade=No) • P(Income=120K|Evade=Yes)
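The list of values to compute follows mechanically from the test example and the class values. The sketch below only enumerates the names of the required terms; it does not calculate them, since the training table is not shown here. The helper `required_terms` is hypothetical.

```python
def required_terms(class_name, class_values, example):
    """List the priors and conditionals needed to score one test example."""
    priors = [f"P({class_name}={c})" for c in class_values]
    conditionals = [
        f"P({attr}={val}|{class_name}={c})"
        for attr, val in example.items()
        for c in class_values
    ]
    return priors, conditionals

# The test example from the slides.
example = {"Refund": "No", "Marital Status": "Married", "Income": "120K"}

priors, conditionals = required_terms("Evade", ["No", "Yes"], example)
for term in priors + conditionals:
    print(term)
```

For this example that is 2 class probabilities and 6 conditional probabilities, far fewer than the full table over every attribute value.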
