Chapter 8 – Naïve Bayes

DM for Business Intelligence

Naïve Bayes Characteristics
• Theorem named after Reverend Thomas Bayes (1702 – 1761)
• Data-driven, not model-driven (i.e., lets the data “create” the training model)
• Supervised learning (i.e., has a defined output variable)
• Naïve because it assumes that its inputs are independent of one another (an assumption that rarely holds in real life)
• Example: an illness may be diagnosed as flu if a patient has a virus, a fever, and congestion (even though the virus may actually be causing the other two symptoms)
• Also naïve because it ignores the existence of other symptoms

Naïve Bayes: The Basic Idea
• For a given new record to be classified, find other records like it (i.e., with the same values for the input variables)
• What is the prevalent class (i.e., output variable value) among those records?
• Assign that class to the new record

Usage
• Requires categorical input variables
• Numerical input variables must be binned and converted to categorical variables
• Can be used with very large data sets
• “Exact” Bayes requires records in your dataset that exactly match the record you are trying to classify
• Bayes can be “modified” to calculate the probability of a match in your dataset
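The binning step can be sketched as follows. This is a minimal illustration using only the standard library; the function name `bin_value`, the cut points, and the labels are hypothetical and do not come from the slides.

```python
from bisect import bisect_right

def bin_value(value, edges, labels):
    """Map a numeric value to a categorical label.

    edges are ascending cut points; labels must have len(edges) + 1 entries.
    A value equal to a cut point falls into the higher bin.
    """
    if len(labels) != len(edges) + 1:
        raise ValueError("need exactly one more label than cut points")
    return labels[bisect_right(edges, value)]

# Hypothetical cut points and labels, chosen only for illustration.
AGE_EDGES = [30, 50]
AGE_LABELS = ["young", "middle", "senior"]

ages = [22, 30, 47, 65]
binned = [bin_value(a, AGE_EDGES, AGE_LABELS) for a in ages]
print(binned)
```

How the cut points are chosen is a modeling decision; the sketch only shows the conversion from a number to a category.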

Exact Bayes Classifier
• Relies on finding other records that share the same input variable values as the record to be classified
• Want to find the “probability of belonging to class C, given specified values of predictors/input variables”
• Even with large data sets, it may be hard to find other records that exactly match your record in terms of predictor values

Solution – Modified Naïve Bayes
• Assume independence of predictor variables (within each class)
• Use the multiplication rule
• Estimate the same probability (that the record belongs to class C, given its predictor values) without limiting the calculation to records that share all those same values
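Written out, the independence assumption and the multiplication rule give the following approximation. The notation is an assumption made here for illustration: x_1, …, x_p are the predictor values of the record, and C_1, …, C_m are the classes.

```latex
% Independence of predictors within class C_i (multiplication rule):
P(x_1, \dots, x_p \mid C_i) \approx \prod_{j=1}^{p} P(x_j \mid C_i)

% Substituting into Bayes' theorem:
P(C_i \mid x_1, \dots, x_p) \approx
  \frac{P(C_i) \prod_{j=1}^{p} P(x_j \mid C_i)}
       {\sum_{k=1}^{m} P(C_k) \prod_{j=1}^{p} P(x_j \mid C_k)}
```

Each factor P(x_j | C_i) can be estimated from all records in class C_i, so no record needs to match the new record on every predictor.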

Modified Bayes Calculations
1. Take a record, and note its predictor values
2. For each predictor value, find the proportion of records in class C1 that have that value
3. Multiply those proportions together, then multiply by the proportion of all records belonging to C1
4. Do the same for C2, C3, etc.
5. The probability of belonging to C1 is the value from step (3) divided by the sum of all such values for C1 … Cn
6. Establish and adjust a “cutoff” probability for the class of interest
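The steps above can be sketched in a few lines of Python. This is an illustrative sketch, not the slides' own implementation: the function names, the toy records, and the class labels "C1" and "C2" are all made up for the example.

```python
from collections import Counter, defaultdict

def naive_bayes_probs(records, labels, new_record):
    """Return {class: estimated P(class | new_record)}.

    records: list of dicts mapping predictor name -> categorical value
    labels: list of class labels, aligned with records
    new_record: dict with the same predictor names
    """
    class_counts = Counter(labels)
    n = len(labels)
    # value_counts[c][(name, value)] = number of class-c records with that value
    value_counts = defaultdict(Counter)
    for rec, c in zip(records, labels):
        for name, value in rec.items():
            value_counts[c][(name, value)] += 1

    scores = {}
    for c, count_c in class_counts.items():
        score = count_c / n  # proportion of all records in class c
        for name, value in new_record.items():
            # proportion of class-c records sharing this predictor value
            score *= value_counts[c][(name, value)] / count_c
        scores[c] = score

    total = sum(scores.values())
    if total == 0:
        raise ValueError("every class scored 0 for this record")
    return {c: s / total for c, s in scores.items()}

def classify(probs, class_of_interest, cutoff=0.5):
    """Apply a cutoff probability for the class of interest."""
    return probs[class_of_interest] >= cutoff

# Made-up toy data, for illustration only.
records = [
    {"color": "red", "shape": "round"},
    {"color": "red", "shape": "square"},
    {"color": "red", "shape": "round"},
    {"color": "blue", "shape": "round"},
    {"color": "red", "shape": "round"},
]
labels = ["C1", "C1", "C1", "C2", "C2"]

probs = naive_bayes_probs(records, labels, {"color": "red", "shape": "round"})
print(probs, classify(probs, "C1", cutoff=0.5))
```

The default cutoff of 0.5 is only a starting point; as the last step says, it would be adjusted for the class of interest.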

Example: Financial Fraud
• Target variable: audit finds “Fraud” or “No fraud”
• Predictors: prior pending legal charges (yes/no); size of firm (small/large)
• Goal: classify a firm as fraudulent or truthful, given Size = “Small” and Prior Charges = “Yes”

Exact Bayes Calculations
• Goal: classify (as “fraudulent” or “truthful”) a small firm with prior charges filed
• There are 2 firms like that, one fraudulent and the other truthful
• There is no “majority” class, so P(fraud | prior charges = y, size = small) = 1/2 = 0.50
• Note: the calculation is limited to the two firms matching those exact characteristics
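The exact calculation can be sketched as below. The transcript does not reproduce the underlying data table, so the 10 records here are a hypothetical reconstruction, built only to agree with the counts stated in the slides (4 fraudulent and 6 truthful firms, 2 small firms with prior charges); the function name `exact_bayes` is also an assumption.

```python
from collections import Counter

# Hypothetical records (prior_charges, size, status), constructed to match
# the counts quoted in the slides; not the original table.
FIRMS = [
    ("yes", "small", "truthful"),
    ("no", "small", "truthful"),
    ("no", "large", "truthful"),
    ("no", "large", "truthful"),
    ("no", "small", "truthful"),
    ("no", "small", "truthful"),
    ("yes", "small", "fraud"),
    ("yes", "large", "fraud"),
    ("no", "large", "fraud"),
    ("yes", "large", "fraud"),
]

def exact_bayes(firms, prior_charges, size):
    """Class proportions among records matching ALL predictor values."""
    matches = [status for p, s, status in firms
               if p == prior_charges and s == size]
    if not matches:
        raise ValueError("no records match these predictor values")
    counts = Counter(matches)
    return {status: k / len(matches) for status, k in counts.items()}

probs = exact_bayes(FIRMS, "yes", "small")
print(probs)
```

The `ValueError` branch is the practical weakness of the exact approach: with more predictors, it becomes common for no record to match at all.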

Modified Naïve Bayes Calculations
Page 172 formula:
$$P(C_i \mid x_1, \dots, x_p) = \frac{P(x_1, \dots, x_p \mid C_i)\,P(C_i)}{P(x_1, \dots, x_p \mid C_1)\,P(C_1) + \cdots + P(x_1, \dots, x_p \mid C_m)\,P(C_m)}$$
Applied to the example:
P(Fraud | Prior Charges & Small Firm) = proportion of “Prior Charges,” “Small,” & “Fraud” to total records, divided by {same} + proportion of “Prior Charges,” “Small,” & “Truthful” to total records
• Fraud term: proportion of each input variable within frauds, times proportion of frauds (output) within total records
• Truthful term: proportion of each input variable within truthful firms, times proportion of truthful firms (output) within total records

Class counts: 4 fraudulent firms and 6 truthful firms (out of 10 total records)

Modified Naïve Bayes Calculations
Same goal as before. Compute 2 quantities:
• Fraud: proportion of “prior charges = y” among frauds, times proportion of “small” among frauds, times proportion of frauds = 3/4 * 1/4 * 4/10 = 0.075
• Truthful: proportion of “prior charges = y” among truthfuls, times proportion of “small” among truthfuls, times proportion of truthfuls = 1/6 * 4/6 * 6/10 = 0.067
P(fraud | prior charges = y, size = small) = 0.075 / (0.075 + 0.067) = 0.53
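The arithmetic can be checked with exact fractions. The sketch below uses only the proportions quoted in the slide; the variable names are chosen here for illustration.

```python
from fractions import Fraction as F

# P(prior charges = y | class) * P(small | class) * P(class)
fraud_score = F(3, 4) * F(1, 4) * F(4, 10)
truthful_score = F(1, 6) * F(4, 6) * F(6, 10)

# Normalize so the two class probabilities sum to 1.
p_fraud = fraud_score / (fraud_score + truthful_score)

print(float(fraud_score))        # 0.075
print(round(float(truthful_score), 3))  # 0.067
print(p_fraud, round(float(p_fraud), 2))  # 9/17 0.53
```

Working in fractions avoids rounding the intermediate 0.067 before the final division.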

Modified Naïve Bayes, cont.
• Note that the probability estimate does not differ greatly from the exact result of 0.50
• But ALL records are used in the calculations, not just those matching the predictor values
• This makes the calculations practical in most circumstances
• Relies on the assumption of independence between predictor variables within each class

Independence Assumption
• Not strictly justified (variables are often correlated with one another)
• Often “good enough”

Advantages
• Handles purely categorical data well
• Works well with very large data sets
• Simple & computationally efficient

Shortcomings
• Requires a large number of records
• Problematic when a predictor category is not present in the training data: assigns 0 probability of response, ignoring information in the other variables
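The zero-probability problem can be seen with a small variation on the fraud example. The sketch below is hypothetical: it keeps the proportions from the slides but supposes that no fraudulent firm in the training data is small, so that category is absent within the fraud class. The function name `class_score` is an assumption.

```python
def class_score(conditional_props, class_prop):
    """Multiply per-predictor proportions by the class proportion."""
    score = class_prop
    for p in conditional_props:
        score *= p
    return score

# Hypothetical variation: 0 of the 4 frauds are small firms.
fraud = class_score([3 / 4, 0 / 4], 4 / 10)
truthful = class_score([1 / 6, 4 / 6], 6 / 10)

p_fraud = fraud / (fraud + truthful)
print(p_fraud)  # 0.0
```

A single zero factor wipes out the fraud score, even though 3 of the 4 frauds have prior charges, which would otherwise point toward fraud.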

On the other hand…
• Probability rankings are more accurate than the actual probability estimates
• Good for applications using lift (e.g., response to a mailing), less so for applications requiring probabilities (e.g., credit scoring)

Summary
• No statistical models involved
• Naïve Bayes (like k-NN) pays attention to complex interactions and local structure
• Computational challenges remain
