Learn about the Naïve Bayes classifier, a system that categorizes instances based on their feature values. Discover how to use this classifier for business intelligence, and explore its benefits and limitations.
Naïve-Bayes Classifiers
Business Intelligence for Managers
Classifier definition, revisited
• A classifier is a system that categorizes instances
• Inputs to a classifier: the feature/attribute values of a given instance
• Output of a classifier: the predicted category for that instance
• The classifier algorithm is often based on a training data set of instances with known categories
Classifiers
[Diagram: feature values X1, X2, X3, …, Xn flow into the classifier, which consults a DB (a collection of instances with known categories) and outputs a category Y.]
Example: X1 (motility) = "flies", X2 (number of legs) = 2, X3 (height) = 6 in; Y = "bird"
Classifier Algorithms
• K Nearest Neighbors (kNN)
• Naïve-Bayes
• Decision trees
• Many others (support vector machines, neural networks, genetic algorithms, etc.)
Classifier algorithm (approach 1)
• Select all instances in the dataset that match the input tuple (X1, X2, …, Xn) of feature values
• Determine the distribution of Y-values over all the matches
• Output the Y-value representing the most instances
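As a concrete illustration, here is a minimal Python sketch of approach 1. The dataset format (a list of (features, category) pairs) and the function name are illustrative assumptions, not something specified in the slides.

```python
from collections import Counter

def classify_by_matching(dataset, query):
    """Approach 1: scan the whole dataset, keep exact feature matches,
    and return the most common category among them."""
    # dataset: list of (feature_tuple, category) pairs (assumed format)
    counts = Counter(cat for features, cat in dataset if features == query)
    if not counts:
        return None  # no instance matches the query exactly
    return counts.most_common(1)[0][0]

# Hypothetical usage with organism-style instances
data = [(("flies", 2, "small", "feathers"), "bird"),
        (("walks", 4, "large", "fur"), "mammal"),
        (("flies", 2, "small", "feathers"), "bird")]
print(classify_by_matching(data, ("flies", 2, "small", "feathers")))  # -> "bird"
```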
Problems with this approach
• Classification time is proportional to the size of the dataset
• Not practical if the dataset is huge
Pre-computing distributions (approach 2)
• What if we pre-compute the distributions for all possible tuples?
• The classification process is then a simple matter of looking up the pre-computed distribution
• The time complexity burden falls on the pre-computation stage, which is done only once
• Still not practical unless the number of features is small
• Suppose there are only two possible values per feature and there are n features -> 2^n possible combinations!
What we need
• Typically, n (the number of features) will be in the hundreds and m (the number of instances in the dataset) will be in the tens of thousands
• We want a classifier that pre-computes enough so that it does not need to scan through the instances during a query, but we do not want to pre-compute too many values
Probability Notation
• What we want to estimate from our dataset is a conditional probability
• P( Y=c | X1=v1, X2=v2, …, Xn=vn ) represents the probability that the category of the instance is c, given that the feature values are v1, v2, …, vn (the input)
• Our classifier outputs the c with maximum probability
Bayes Theorem
• Bayes theorem allows us to invert a conditional probability
• P( A=a | B=b ) = P( B=b | A=a ) P( A=a ) / P( B=b )
• Why and how will this help?
• The answer will come later
[Diagram: two overlapping regions inside a universe of size U; event A=a covers regions X and Z, event B=b covers regions Y and Z, and W lies outside both. Suppose U = W + X + Y + Z.]
P( A=a | B=b ) = Z / (Z+Y)
P( B=b | A=a ) = Z / (Z+X)
P( A=a ) = (Z+X) / U, P( B=b ) = (Z+Y) / U
P( A=a ) / P( B=b ) = (Z+X) / (Z+Y)
P( B=b | A=a ) P( A=a ) / P( B=b ) = [ Z/(Z+X) ] * (Z+X)/(Z+Y) = Z/(Z+Y) = P( A=a | B=b )
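To make the diagram concrete, here is a quick numeric check of the identity; the region sizes W, X, Y, Z are made-up values for illustration only.

```python
# Hypothetical region sizes from the diagram above
W, X, Y, Z = 40, 30, 20, 10
U = W + X + Y + Z

p_a = (Z + X) / U            # P(A=a)
p_b = (Z + Y) / U            # P(B=b)
p_a_given_b = Z / (Z + Y)    # P(A=a | B=b)
p_b_given_a = Z / (Z + X)    # P(B=b | A=a)

# Bayes theorem: P(A=a | B=b) = P(B=b | A=a) P(A=a) / P(B=b)
assert abs(p_a_given_b - p_b_given_a * p_a / p_b) < 1e-12
```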
Another helpful equivalence
• Assuming that two events are independent, the probability that both events occur is equal to the product of their individual probabilities
• P( X1=v1, X2=v2 ) = P( X1=v1 ) P( X2=v2 )
The critical step
Goal: maximize this quantity over all possible Y-values
P( Y=c | X1=v1, X2=v2, …, Xn=vn )
 = P( X1=v1, X2=v2, …, Xn=vn | Y=c ) P( Y=c ) / P( X1=v1, X2=v2, …, Xn=vn )   [by Bayes theorem]
 ≈ P( X1=v1 | Y=c ) P( X2=v2 | Y=c ) … P( Xn=vn | Y=c ) P( Y=c ) / P( X1=v1, X2=v2, …, Xn=vn )   [by the independence assumption]
We can ignore the divisor since it remains the same regardless of the Y-value
And here it is…
• We want a classifier to compute max over c of P( Y=c | X1=v1, X2=v2, …, Xn=vn )
• We get the same c if we instead compute max over c of P( X1=v1 | Y=c ) P( X2=v2 | Y=c ) … P( Xn=vn | Y=c ) P( Y=c )
• These values can be pre-computed, and the number of computations is not combinatorially explosive
Building a classifier (approach 3)
• For each category c, estimate P( Y=c ) = (number of c-instances) / (total number of instances)
• For each category c and each feature Xi, determine the distribution P( Xi | Y=c )
• For each possible value v of Xi, estimate P( Xi=v | Y=c ) = (number of c-instances where Xi=v) / (number of c-instances)
Using the classifier (approach 3)
• For a given input tuple (v1, v2, …, vn), determine the category c that yields max P( X1=v1 | Y=c ) P( X2=v2 | Y=c ) … P( Xn=vn | Y=c ) P( Y=c ) by looking up the terms from the pre-computed values
• Output category c
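The following is a minimal Python sketch of approach 3, covering both the preparation stage and the query stage described in the two slides above. The dataset format, dictionary layout, and function names are illustrative assumptions, not part of the original slides.

```python
from collections import Counter, defaultdict

def train_naive_bayes(dataset):
    """Preparation stage: estimate P(Y=c) and P(Xi=v | Y=c) from
    a list of (feature_tuple, category) pairs (assumed format)."""
    category_counts = Counter(cat for _, cat in dataset)
    value_counts = defaultdict(Counter)            # (category, i) -> Counter of values for Xi
    for features, cat in dataset:
        for i, v in enumerate(features):
            value_counts[(cat, i)][v] += 1
    total = len(dataset)
    priors = {c: n / total for c, n in category_counts.items()}          # P(Y=c)
    conditionals = {key: {v: n / category_counts[key[0]] for v, n in counter.items()}
                    for key, counter in value_counts.items()}            # P(Xi=v | Y=c)
    return priors, conditionals

def classify_naive_bayes(priors, conditionals, query):
    """Query stage: return the category c maximizing
    P(X1=v1|Y=c) ... P(Xn=vn|Y=c) P(Y=c)."""
    best_cat, best_score = None, -1.0
    for c, p_c in priors.items():
        score = p_c
        for i, v in enumerate(query):
            score *= conditionals.get((c, i), {}).get(v, 0.0)   # unseen value -> probability 0
        if score > best_score:
            best_cat, best_score = c, score
    return best_cat

# Hypothetical usage with organism-style instances
data = [(("flies", 2, "small", "feathers"), "bird"),
        (("walks", 4, "large", "fur"), "mammal"),
        (("swims", 0, "small", "scales"), "fish"),
        (("flies", 2, "small", "feathers"), "bird")]
priors, conditionals = train_naive_bayes(data)
print(classify_naive_bayes(priors, conditionals, ("flies", 2, "small", "feathers")))  # -> "bird"
```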
Example
• Suppose we wanted a classifier that categorizes organisms according to certain characteristics
• Organism categories (Y) are: mammal, bird, fish, insect, spider
• Characteristics (X1, X2, X3, X4): motility (walks, flies, swims), number of legs (2, 4, 6, 8), size (small, large), body-covering (fur, scales, feathers)
• The dataset contains 1000 organism samples
• m = 1000, n = 4, number of categories = 5
Comparing approaches
• Approach 1: requires scanning all tuples for matching feature values
• Entails 1000*4 = 4000 comparisons per query, plus counting the occurrences of each category
• Approach 2: pre-compute probabilities
• Preparation: for each of the 3*4*2*3 = 72 feature-value combinations, determine the probability of each category (72*5 = 360 computations)
• Query: straightforward lookup of the answer
Comparing approaches
• Approach 3: Naïve Bayes classifier
• Preparation: compute the P( Y=c ) probabilities, 5 of them; compute the P( Xi=v | Y=c ) probabilities, 5*(3+4+2+3) = 60 of them
• Query: straightforward computation of 5 probabilities; determine the maximum and return the category that yields it
About the Naïve Bayes Classifier
• The computations and resources required are reasonable, both for the preparatory stage and the actual query stage
• This holds even if the number n of features is in the thousands!
• The classifier is naïve because it assumes independence of the features (which is likely not the case)
• It turns out that the classifier works well in practice despite this limitation
• Logarithms of probabilities are often used instead of the actual probabilities to avoid underflow when computing the probability products
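To illustrate the last point, here is a small sketch of the log-probability trick. The scoring function mirrors the hypothetical data structures from the earlier approach 3 sketch and is likewise an illustrative assumption.

```python
import math

def log_score(priors, conditionals, query, category):
    """Sum log-probabilities instead of multiplying probabilities,
    so long products of small numbers do not underflow to 0.0."""
    score = math.log(priors[category])
    for i, v in enumerate(query):
        p = conditionals.get((category, i), {}).get(v, 0.0)
        if p == 0.0:
            return float("-inf")   # an unseen value rules this category out
        score += math.log(p)
    return score

# With thousands of features, multiplying raw probabilities underflows:
print(0.1 ** 400)            # 0.0 (underflow in double precision)
print(400 * math.log(0.1))   # about -921.03, still representable
```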