NAÏVE BAYES CLASSIFIER ACM Student Chapter, Heritage Institute of Technology 10th February, 2012 SIGKDD Presentation by Anirban Ghose, Parami Roy, Sourav Dutta
CLASSIFICATION • What is it? • Assigning a given piece of input data to one of a given number of categories. • e.g.: Classifying kitchen items: separating cups from saucers.
CLASSIFICATION • Why do we need it? • Separating like things from unlike things, e.g. categorizing different kinds of livestock such as cows, goats, etc.
CLASSIFICATION • Looking for identifiable patterns. Predicting whether an e-mail is spam or non-spam from patterns observed in previous mails. Automatic categorization of online articles.
Classification • Allowing extrapolation. Given the red dots, predicting the value at the blue box.
Classification Techniques • Decision Tree based methods • Rule-based methods • Memory based methods • Neural Networks • Naïve Bayes Classifier • Support Vector Machines
Problem Statement • Domain space: the set of values an attribute can take. • Domain spaces for the Play Tennis example: • Outlook – {Sunny, Overcast, Rain} • Temperature – {Hot, Mild, Cool} • Humidity – {High, Normal} • Wind – {Strong, Weak} • Play Tennis – {Yes, No}
Problem Statement • Instances X : A set of items over which the concept is defined. Set of all possible days with attributes Outlook, Temperature, Humidity, Wind. • Target concept (c): concept or function to be learned. c : X → {0,1} c(x) = 1 : Play Tennis = Yes c(x) = 0 : Play Tennis = No
Problem Statement • Hypothesis (h): a statement assumed to be true for the sake of argument; here, a conjunction of constraints on the attributes. h : X → {0,1} • For each attribute the constraint can be: ? – any value is acceptable; <value> – a single required value; Ø – no value is acceptable
Problem Statement • Training examples – prior knowledge: a set of input vectors (instances), each with a label (outcome). • Input vector: Outlook – Sunny, Temperature – Hot, Humidity – High, Wind – Weak • Label: Play Tennis – No
Problem Statement Training examples can be: • Positive example: the instance satisfies all the constraints of the hypothesis, h(x) = 1 • Negative example: the instance violates one or more constraints of the hypothesis, h(x) = 0
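A minimal sketch (not from the slides) of this hypothesis representation in Python, using "?" for "any value is acceptable", a concrete value for a required value, and None standing in for Ø:

```python
# Hypothetical sketch of the <?, value, Ø> hypothesis representation.
ANY = "?"              # any value is acceptable
NONE_ACCEPTED = None   # stands in for Ø: no value is acceptable

def h(instance, constraints):
    """Return 1 if the instance satisfies every attribute constraint, else 0."""
    for attribute, required in constraints.items():
        if required is NONE_ACCEPTED:          # Ø: nothing satisfies this attribute
            return 0
        if required != ANY and instance[attribute] != required:
            return 0
    return 1

# Example hypothesis: Outlook must be Sunny, any Temperature/Humidity/Wind.
hypothesis = {"Outlook": "Sunny", "Temperature": "?", "Humidity": "?", "Wind": "?"}
day = {"Outlook": "Sunny", "Temperature": "Hot", "Humidity": "High", "Wind": "Weak"}
print(h(day, hypothesis))  # -> 1 (this day is a positive example for the hypothesis)
```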
Learning Algorithm • Naïve Bayes Classifier – Supervised Learning • Supervised Learning: the machine learning task of inferring a function from supervised (labelled) training data, g : X → Y, where X is the input space and Y is the output space
A quick Recap • Conditional Probability: P(A|B) = P(A∩B) / P(B) • Multiplication Rule: P(A∩B) = P(A|B)·P(B) = P(B|A)·P(A) • Independent Events: P(A∩B) = P(A)·P(B) • Total Probability: P(A) = Σi P(A|Bi)·P(Bi), where B1, …, Bn partition the sample space
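As a quick illustration (not part of the original slides), these identities can be checked numerically on a small made-up joint distribution:

```python
# Made-up joint distribution over A in {a0, a1} and B in {b0, b1}.
joint = {("a0", "b0"): 0.10, ("a0", "b1"): 0.30,
         ("a1", "b0"): 0.20, ("a1", "b1"): 0.40}

def p_B(b):
    """Marginal P(B = b)."""
    return sum(p for (_, bb), p in joint.items() if bb == b)

def p_A_given_B(a, b):
    """Conditional probability P(A = a | B = b) = P(A ∩ B) / P(B)."""
    return joint[(a, b)] / p_B(b)

# Multiplication rule: P(A ∩ B) = P(A|B) · P(B)
assert abs(p_A_given_B("a1", "b0") * p_B("b0") - joint[("a1", "b0")]) < 1e-12

# Total probability: P(A = a1) = Σ_b P(A = a1 | B = b) · P(B = b)
p_a1 = sum(p_A_given_B("a1", b) * p_B(b) for b in ("b0", "b1"))
print(round(p_a1, 2))  # -> 0.6 (= 0.20 + 0.40 straight from the joint table)
```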
Few Important Definitions • Prior Probability: Let p be an uncertain quantity. The prior probability is the probability distribution expressing one's uncertainty about p before the "data" is taken into account. • Posterior Probability: The posterior probability of a random event or an uncertain proposition is the conditional probability assigned after the relevant evidence is taken into account.
Bayes' Theorem • P(h): prior probability of hypothesis h • P(D): prior probability that the training data D will be observed • P(D|h): probability of observing data D given some world in which hypothesis h holds • P(h|D): posterior probability of h (to be found) • Then as per Bayes' Theorem: P(h|D) = P(D|h)·P(h) / P(D)
MAP HYPOTHESIS • The maximum a posteriori (MAP) hypothesis is the most probable hypothesis given the data: hMAP = argmax over h in H of P(h|D) = argmax of P(D|h)·P(h)/P(D) = argmax of P(D|h)·P(h), since P(D) is the same for every hypothesis • If all hypotheses are equally probable a priori, i.e. P(hi) = P(hj) for all i, j, this reduces to the maximum likelihood hypothesis hML = argmax of P(D|h)
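A minimal sketch (assumed, not from the slides) of picking the MAP hypothesis from unnormalised scores P(D|h)·P(h):

```python
# Hypothetical priors and likelihoods for a small hypothesis space.
prior = {"h1": 0.3, "h2": 0.5, "h3": 0.2}            # P(h)
likelihood = {"h1": 0.10, "h2": 0.05, "h3": 0.40}    # P(D|h) for the observed data D

def map_hypothesis(prior, likelihood):
    """Return the hypothesis maximising P(D|h)·P(h); P(D) is common to all and can be dropped."""
    return max(prior, key=lambda h: likelihood[h] * prior[h])

print(map_hypothesis(prior, likelihood))  # -> 'h3' (0.40*0.2 = 0.08 beats 0.03 and 0.025)
```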
Example • A medical diagnosis problem with two alternative hypotheses: 1) The patient has a particular form of cancer 2) The patient does not have that particular form of cancer
Example - Bayes' Theorem TEST OUTCOMES a) + (positive: the test indicates the rare disease) b) - (negative: the test indicates no disease) Prior Knowledge: P(cancer) = 0.008 P(~cancer) = 0.992 P(+|cancer) = 0.98 P(-|cancer) = 0.02 P(+|~cancer) = 0.03 P(-|~cancer) = 0.97
Example – Bayes' Theorem Suppose we now observe a new patient for whom the lab test returns a positive value. Should we diagnose the patient as having cancer or not?
Solution Applying Bayes' Theorem and dropping the common denominator P(+), we get: P(cancer|+) ∝ P(+|cancer)·P(cancer) = 0.98 × 0.008 = 0.0078 P(~cancer|+) ∝ P(+|~cancer)·P(~cancer) = 0.03 × 0.992 = 0.0298 Since 0.0298 > 0.0078, the MAP hypothesis is ~cancer: despite the positive test, the patient most probably does not have the disease.
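The same numbers worked in a short Python sketch (the figures are the slide's; the layout and normalisation step are added for illustration):

```python
# Prior knowledge from the slide.
p_cancer, p_not_cancer = 0.008, 0.992
p_pos_given_cancer, p_pos_given_not = 0.98, 0.03

# Unnormalised posteriors: P(h|+) is proportional to P(+|h) * P(h).
score_cancer = p_pos_given_cancer * p_cancer    # 0.0078
score_not = p_pos_given_not * p_not_cancer      # 0.0298

# Normalising gives the actual posterior probability of cancer given a positive test.
posterior_cancer = score_cancer / (score_cancer + score_not)
print(round(score_cancer, 4), round(score_not, 4), round(posterior_cancer, 3))
# -> 0.0078 0.0298 0.208  (the MAP hypothesis is still "no cancer")
```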
Naïve Bayes Classifier • Supervised Learning Technique • Bayes Theorem • MAP Hypothesis
Naïve Bayes Classifier • Prior Knowledge • Training data set • A new instance of data. • Objective • Classify the new instance of data: <a1,a2,..,an> • Find P(vj|a1,a2,….,an) • Find the required probability for all possible classifications. • Find the maximum probability among them.
Naïve Bayes Classifier • Find P(vj|a1,a2,…,an) for all vj in V • Using Bayes' Theorem: vMAP = argmax over vj in V of P(vj|a1,a2,…,an) = argmax of P(a1,a2,…,an|vj)·P(vj) / P(a1,a2,…,an) = argmax of P(a1,a2,…,an|vj)·P(vj)
Naïve Bayes Classifier • Why Naïve? • Assume all attributes are conditionally independent given the class. • P(a1,a2,…,an|vj) = ∏ P(ai|vj) for i = 1 to n • vNB = argmax over vj in V of P(vj) · ∏ P(ai|vj)
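A minimal, generic sketch of this decision rule (assuming the class priors and conditional probabilities are already available as lookup tables; function and parameter names are illustrative):

```python
import math

def naive_bayes_classify(instance, class_prior, cond_prob):
    """
    instance    : dict attribute -> value, e.g. {"Outlook": "Sunny", ...}
    class_prior : dict class -> P(vj)
    cond_prob   : dict (class, attribute, value) -> P(ai | vj)
    Returns the class maximising P(vj) * prod_i P(ai | vj).
    Sums logs instead of multiplying, for numerical stability; assumes all
    probabilities are non-zero (see the m-estimate below for the zero case).
    """
    def score(vj):
        s = math.log(class_prior[vj])
        for attribute, value in instance.items():
            s += math.log(cond_prob[(vj, attribute, value)])
        return s
    return max(class_prior, key=score)
```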
Probability Estimate • We define our probability estimate to be the frequency of data combinations within the training examples • P(vj) =Fraction of times vj occurs in the training set. • P(ai|vj) = Fraction of times ai occurs in those examples which are classified as vj
Example • Let’s calculate P(Overcast | Yes) • Number of training examples classified as Yes = 9 • Number of times Outlook = Overcast given the classification is Yes = 4 • Hence, P(Overcast | Yes) = 4/9
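In code, this frequency estimate is just counting. The list below is a stand-in for the Outlook column of the Play Tennis training table (not shown in these notes), chosen so that it reproduces the counts quoted above:

```python
# Hypothetical stand-in for the training set: (Outlook value, Play Tennis label) pairs.
outlook_and_label = [
    ("Sunny", "No"), ("Sunny", "No"), ("Overcast", "Yes"), ("Rain", "Yes"),
    ("Rain", "Yes"), ("Rain", "No"), ("Overcast", "Yes"), ("Sunny", "No"),
    ("Sunny", "Yes"), ("Rain", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
    ("Overcast", "Yes"), ("Rain", "No"),
]

n_yes = sum(1 for _, label in outlook_and_label if label == "Yes")   # 9 examples classified Yes
n_overcast_yes = sum(1 for value, label in outlook_and_label
                     if value == "Overcast" and label == "Yes")       # 4 of them have Outlook = Overcast
print(f"P(Overcast|Yes) = {n_overcast_yes}/{n_yes} =", round(n_overcast_yes / n_yes, 3))  # -> 4/9 = 0.444
```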
Prior Probability • P(Yes) = 9/14, i.e. P(playing tennis) • P(No) = 5/14, i.e. P(not playing tennis) • Look-up tables of the conditional probabilities P(ai|vj) are built the same way for every attribute-value pair.
New instance: <Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong> • P(Yes)·P(Sunny|Yes)·P(Cool|Yes)·P(High|Yes)·P(Strong|Yes) = 9/14 * 2/9 * 3/9 * 3/9 * 3/9 = 0.0053 • P(No)·P(Sunny|No)·P(Cool|No)·P(High|No)·P(Strong|No) = 5/14 * 3/5 * 1/5 * 4/5 * 3/5 = 0.0206 • Since 0.0206 > 0.0053, vNB = No: we can't play tennis given these weather conditions.
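Feeding the slide's look-up values into the earlier decision-rule sketch (the numbers are the slide's; `naive_bayes_classify` is the illustrative function defined above):

```python
class_prior = {"Yes": 9/14, "No": 5/14}
cond_prob = {
    ("Yes", "Outlook", "Sunny"): 2/9, ("Yes", "Temperature", "Cool"): 3/9,
    ("Yes", "Humidity", "High"): 3/9, ("Yes", "Wind", "Strong"): 3/9,
    ("No", "Outlook", "Sunny"): 3/5, ("No", "Temperature", "Cool"): 1/5,
    ("No", "Humidity", "High"): 4/5, ("No", "Wind", "Strong"): 3/5,
}
instance = {"Outlook": "Sunny", "Temperature": "Cool", "Humidity": "High", "Wind": "Strong"}

print(naive_bayes_classify(instance, class_prior, cond_prob))  # -> 'No' (0.0206 > 0.0053)
```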
Drawback of the estimate • What happens if the probability estimate is zero? • The estimate is zero when a particular attribute value never occurs in the training data given the classification. • A single zero factor drives the whole product, and hence vNB for that classification, to zero, no matter how strong the other evidence is.
Example • For a new training set, the attribute outlook does not have the value overcast when the example is labeled yes. • P(Overcast | Yes) = 0 • VNB = P(Yes) * P(Overcast | Yes)*P(Cool | Yes)…. = 0
Solution • m-estimate of probability: P(ai|vj) = (nc + m·p) / (n + m), where n is the number of training examples with class vj, nc is the number of those examples having attribute value ai, p is the prior estimate of the attribute value (e.g. uniform, 1/k for k possible values), and m is the equivalent sample size, which controls how heavily the prior is weighted.
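A short sketch of the m-estimate (formula as above; taking a uniform prior p = 1/k is an illustrative assumption):

```python
def m_estimate(n_c, n, m, p):
    """m-estimate of P(ai|vj): (nc + m*p) / (n + m)."""
    return (n_c + m * p) / (n + m)

# Overcast never occurs among the 9 "Yes" examples of the new training set,
# so the plain frequency estimate would be 0/9 = 0 and wipe out the whole product.
k = 3                                  # Outlook has 3 possible values: Sunny, Overcast, Rain
print(m_estimate(0, 9, m=3, p=1/k))    # -> 0.0833..., small but non-zero
```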
Disadvantages of Naïve Bayes Classifier • Requires initial knowledge of many probabilities. • Significant computational cost is needed to determine the Bayes optimal hypothesis.
Conclusion • Naïve Bayes is based on the conditional independence assumption • Training is very easy and fast • Testing is straightforward: just looking up tables or calculating conditional probabilities (e.g. with normal distributions for continuous attributes) • A popular generative model • Performance is competitive with most state-of-the-art classifiers even when the independence assumption is violated • Many successful applications, e.g. spam mail filtering • A good candidate as a base learner in ensemble learning