On the Optimality of the Simple Bayesian Classifier under Zero-One Loss. Pedro Domingos, Michael Pazzani. Presented by Lu Ren, Oct. 1, 2007.
• Introduction to the simple Bayesian classifier and its optimality.
• The simple Bayesian classifier in machine learning and the empirical evidence.
• Optimality without independence and its conditions.
• When will the Bayesian classifier outperform other learners?
• How is the Bayesian classifier best extended?
• Conclusions.
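1. Introduction
Many classifiers can be viewed as computing a set of discriminant functions of the example, one for each class. If $f_k(E) > f_i(E)$ for all $i \neq k$ (Equation 1), we choose class $C_k$ for the example $E$, where $E$ is a vector of $a$ attributes and $v_{jk}$ is the value that attribute $A_j$ takes in $E$. Zero-one loss is minimized if and only if $E$ is assigned to the class $C_k$ for which $P(C_k \mid E)$ is maximum. If the attributes are independent given the class, this is achieved by the discriminant functions $f_i(E) = P(C_i) \prod_{j=1}^{a} P(A_j = v_{jk} \mid C_i)$ (Equation 2).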
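The decision rule can be sketched in a few lines of Python. This is a minimal illustration, assuming the discriminant of the simple Bayesian classifier is the class prior times the product of the per-attribute conditional probabilities. The function names and the toy probability tables are hypothetical, not taken from the paper.

```python
from math import prod

def discriminant(prior, cond_tables, example):
    """f_i(E): class prior times the product of P(A_j = value | C_i)."""
    return prior * prod(cond_tables[j][value] for j, value in enumerate(example))

def classify(priors, conds, example):
    """Return the class with the largest discriminant, plus all scores."""
    scores = {c: discriminant(priors[c], conds[c], example) for c in priors}
    return max(scores, key=scores.get), scores

# Hypothetical two-class problem with two Boolean attributes.
priors = {"+": 0.5, "-": 0.5}
conds = {
    "+": [{0: 0.2, 1: 0.8}, {0: 0.4, 1: 0.6}],
    "-": [{0: 0.7, 1: 0.3}, {0: 0.5, 1: 0.5}],
}
label, scores = classify(priors, conds, (1, 0))
print(label, scores)
```

Only the ranking of the scores matters for zero-one loss, so the common factor $P(E)$ is left out.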
Question: can the simple Bayesian classifier (BC) be optimal even when the assumption of attribute independence does not hold? The tacit answer has been “No”. However, the BC can perform very well even in domains where attribute dependences exist. This article derives the most general conditions for the BC’s optimality and gives the following corollary: the BC’s true region of optimal performance is far greater than that implied by the attribute independence assumption.
2. The simple BC in machine learning
• In reported comparisons with more sophisticated algorithms, the simple BC was the most accurate one overall.
• The BC’s limited performance in many domains was not in fact intrinsic to it, but due to unwarranted Gaussian assumptions.
• Dealing with attribute dependences does not always improve accuracy.
• The simple Bayesian classifier is more robust in accuracy, even when compared with Bayesian networks.
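Empirical evidence:
• Numeric attributes were discretized to evaluate the Bayesian classifier's performance.
• The zero-counts problem is handled with the Laplace correction:
  • The uncorrected estimate of $P(A_j = v_{jk} \mid C_i)$ is $n_{ijk} / n_i$.
  • The corrected estimate is $(n_{ijk} + f) / (n_i + f n_j)$.
  Here $n_{ijk}$ is the number of training examples of class $C_i$ in which $A_j = v_{jk}$, $n_i$ is the number of training examples of class $C_i$, $n_j$ is the number of values of attribute $A_j$, and $f$ is the correction factor.
• Missing values were ignored.
• Experiments: twenty-eight data sets, with part of the data used for training. Twenty runs were conducted to obtain the average accuracy and confidence levels.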
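The effect of the Laplace correction on a zero count can be illustrated as follows. This sketch assumes the corrected estimate has the form $(n_{ijk} + f) / (n_i + f n_j)$. The counts and the value `f = 1.0` are made up for illustration and are not the settings used in the experiments.

```python
def laplace_estimate(n_ijk, n_i, n_j, f):
    """Corrected estimate (n_ijk + f) / (n_i + f * n_j) of P(A_j = v_jk | C_i)."""
    return (n_ijk + f) / (n_i + f * n_j)

# Hypothetical counts of one attribute's values within a single class.
counts = {"red": 7, "green": 3, "blue": 0}
n_i = sum(counts.values())   # examples of this class
n_j = len(counts)            # number of values of the attribute
f = 1.0                      # illustrative correction factor (an assumption)

uncorrected = {v: k / n_i for v, k in counts.items()}
corrected = {v: laplace_estimate(k, n_i, n_j, f) for v, k in counts.items()}
print(uncorrected)
print(corrected)
```

Without the correction, the unseen value "blue" gets probability zero and would zero out the whole product for any example containing it. The corrected estimates stay positive and still sum to one.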
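Dependencies between pairs of attributes given the class were measured as $D(A_m, A_n \mid C) = H(A_m \mid C) + H(A_n \mid C) - H(A_m A_n \mid C)$ (Equation 4), where $H$ denotes entropy and $A_m A_n$ is the Cartesian product of the two attributes:
• The BC achieves higher accuracy than more sophisticated approaches in many domains.
• The degree of attribute dependence is not a good predictor of the BC’s differential performance relative to those approaches.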
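A pairwise dependence score of this kind can be estimated from data as sketched below. The sketch assumes the measure is $D(A_m, A_n \mid C) = H(A_m \mid C) + H(A_n \mid C) - H(A_m A_n \mid C)$ and uses the standard conditional entropy in bits, which may differ in detail from the entropy definition used in the paper. Function names and data are hypothetical.

```python
from collections import Counter
from math import log2

def cond_entropy(xs, cs):
    """Standard conditional entropy H(X | C) in bits, from paired samples."""
    n = len(xs)
    joint = Counter(zip(cs, xs))
    class_counts = Counter(cs)
    return -sum((k / n) * log2(k / class_counts[c]) for (c, _), k in joint.items())

def dependence(a_m, a_n, cs):
    """H(A_m | C) + H(A_n | C) - H(A_m A_n | C); zero when independent given C."""
    joined = list(zip(a_m, a_n))   # Cartesian-product attribute
    return cond_entropy(a_m, cs) + cond_entropy(a_n, cs) - cond_entropy(joined, cs)

# Two attributes that are independent within each class.
a = [0, 0, 1, 1, 0, 0, 1, 1]
b = [0, 1, 0, 1, 0, 1, 0, 1]
c = [0, 0, 0, 0, 1, 1, 1, 1]
print(dependence(a, b, c))   # independent pair
print(dependence(a, a, c))   # an attribute paired with its own copy
```

With this entropy definition the score equals the conditional mutual information between the two attributes, so a fully redundant pair scores as high as the attribute's own conditional entropy.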
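3. Optimality without independence
• Consider three attributes A, B and C, and two equiprobable classes “+” and “-” ($P(+) = P(-) = 1/2$).
• Assume A = B (complete dependence) and that A is independent of C.
• The optimal classification procedure ignores B: assign to class “+” if $P(A \mid +) P(C \mid +) - P(A \mid -) P(C \mid -) > 0$.
• The BC counts A twice: it assigns to “+” if $P(A \mid +)^2 P(C \mid +) - P(A \mid -)^2 P(C \mid -) > 0$.
Let $p = P(+ \mid A)$ and $q = P(+ \mid C)$. Applying Bayes' theorem, the two classification procedures can be written as:
$p q - (1 - p)(1 - q) > 0$ (optimal)
$p^2 q - (1 - p)^2 (1 - q) > 0$ (simple Bayes)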
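How often the two procedures agree can be checked numerically. The sketch below assumes the rules reduce to $pq - (1-p)(1-q) > 0$ for the optimal procedure and $p^2 q - (1-p)^2 (1-q) > 0$ for the BC, with $p = P(+ \mid A)$ and $q = P(+ \mid C)$. It evaluates both on a regular grid over the unit square. The grid resolution is an arbitrary choice, and exact ties count as the "-" decision.

```python
def optimal_says_plus(p, q):
    """Optimal rule: ignore the duplicated attribute B."""
    return p * q - (1 - p) * (1 - q) > 0

def bc_says_plus(p, q):
    """Simple Bayes rule: A is effectively counted twice."""
    return p ** 2 * q - (1 - p) ** 2 * (1 - q) > 0

def agreement(n=200):
    """Fraction of grid points in (0, 1)^2 where both rules decide alike."""
    same = 0
    for i in range(n):
        for j in range(n):
            p, q = (i + 0.5) / n, (j + 0.5) / n
            same += optimal_says_plus(p, q) == bc_says_plus(p, q)
    return same / (n * n)

print(agreement())
# A point where the rules disagree: p = 0.8, q = 0.1.
print(optimal_says_plus(0.8, 0.1), bc_says_plus(0.8, 0.1))
```

The two rules agree on about 90% of the square, and disagree only in the two narrow regions between the curves $q = 1 - p$ and $q = (1-p)^2 / (p^2 + (1-p)^2)$, even though A and B are completely dependent.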
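Local optimality
Definition 1 (zero-one loss): Let $C(E)$ be the actual class of example $E$ and $C_X(E)$ the class assigned to it by classifier $X$. The zero-one loss of $X$ on $E$ is $L_X(E) = 0$ if $C_X(E) = C(E)$, and $1$ otherwise.
A more general definition: $L_X(E) = 1 - P(C_X(E) \mid E)$, where $P(C_X(E) \mid E)$ is the probability that the class assigned by $X$ is the correct one.
Definition 2 (Bayes rate): the lowest zero-one loss achievable by any classifier on the example.
Definition 3 (locally optimal): a classifier is locally optimal for a given example iff its zero-one loss on that example is equal to the Bayes rate.
Definition 4 (globally optimal): a classifier is globally optimal for a given sample (data set) iff it is locally optimal for every example in that sample. A classifier is globally optimal for a given problem iff it is globally optimal for all possible samples of that problem.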
Zero-one loss for classification vs. squared-error loss for probability estimation:
• Equation 2 yields minimal squared-error estimates of the class probabilities only when the estimates are equal to the true values (i.e., when the independence assumption holds).
• With Equation 1, however, it can still yield minimal zero-one loss as long as the class with the highest estimated probability is the class with the highest true probability.
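Consider the two-class case in general, with classes “+” and “-”. Let $p = P(+ \mid E)$, $r = P(+) \prod_{j=1}^{a} P(A_j = v_{jk} \mid +)$ and $s = P(-) \prod_{j=1}^{a} P(A_j = v_{jk} \mid -)$, where $v_{jk}$ is the value of attribute $A_j$ in $E$. A necessary and sufficient condition for the local optimality of the BC is as follows:
Theorem 1: The Bayesian classifier is locally optimal under zero-one loss for an example E iff $(p \geq 1/2 \wedge r \geq s) \vee (p \leq 1/2 \wedge r \leq s)$ for E.
Corollary 1: The Bayesian classifier is locally optimal under zero-one loss in half the volume of the space of possible values of $(p, r, s)$.
This is not an asymptotic result: it is also valid for finite samples.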
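The "half the volume" statement can be sanity-checked by simulation. The sketch below takes the local optimality condition to be $(p \geq 1/2 \wedge r \geq s) \vee (p \leq 1/2 \wedge r \leq s)$, with $p$ the true probability of class "+" and $r$, $s$ the BC's discriminants for "+" and "-". Drawing $(p, r, s)$ uniformly from the unit cube is an assumption about how "volume" is measured. The function names are hypothetical.

```python
import random

def bc_locally_optimal(p, r, s):
    """True when the BC's preferred class matches the most probable class."""
    return (p >= 0.5 and r >= s) or (p <= 0.5 and r <= s)

def optimal_fraction(trials=100_000, seed=0):
    """Monte Carlo estimate of the volume fraction where the BC is optimal."""
    rng = random.Random(seed)
    hits = sum(
        bc_locally_optimal(rng.random(), rng.random(), rng.random())
        for _ in range(trials)
    )
    return hits / trials

print(optimal_fraction())
# The estimates r and s may be far from p and 1 - p and still give the right class.
print(bc_locally_optimal(0.9, 0.3, 0.1))
```

The estimate comes out close to one half. The last line shows a case where $r$ and $s$ are poor probability estimates, yet the classification is still optimal.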
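Under squared-error loss, Equation 2 is optimal only when the independence assumption holds, that is, on the line where the planes $r = p$ and $s = 1 - p$ intersect. Earlier notions of the BC’s limitations result from incorrectly applying intuitions based on squared-error loss to the BC’s performance under zero-one loss.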
Global optimality:
Theorem 2: The Bayesian classifier is globally optimal under zero-one loss for a sample (data set) $\Sigma$ iff $\forall E \in \Sigma: (p \geq 1/2 \wedge r \geq s) \vee (p \leq 1/2 \wedge r \leq s)$.
Necessary conditions:
Theorem 3: The Bayesian classifier cannot be globally optimal for more than $d^{c(av+1)}$ different problems, where $a$ is the number of attributes, $c$ the number of classes, $v$ the maximum number of values per attribute, and $d$ the number of different numbers representable on the machine implementing the Bayesian classifier. For example, with 16 bits, $d = 2^{16} = 65536$.
Theorem 4: When all attributes are nominal, the Bayesian classifier is not globally optimal for classes that are not discriminable by linear functions of the corresponding features.
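The Bayesian classifier is equivalent to a linear machine, whose discriminant function for class $C_i$ is $\log P(C_i) + \sum_{j,k} \log P(A_j = v_{jk} \mid C_i) \, b_{jk}$, where $b_{jk}$ is a Boolean feature that is 1 if $A_j = v_{jk}$ in the example and 0 otherwise. However, it fails for some concepts even though they are linearly separable.
An m-of-n concept is true if m or more of the n attributes defining the example space are true.
Theorem 5: The Bayesian classifier is not globally optimal for m-of-n concepts.
$P(A \mid C)$: the probability that an attribute $A$ is true given that the concept $C$ is true.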
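If the Bayesian classifier is trained with all $2^n$ examples of an m-of-n concept, and a test example has exactly $j$ true-valued attributes, then the BC will make a false positive error if $\mathrm{Diff}(j)$ is positive and $j < m$, and a false negative error if $\mathrm{Diff}(j)$ is negative and $j \geq m$. Here $\mathrm{Diff}(j)$ is the BC's discriminant for the concept minus its discriminant for the concept's negation.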
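The failure on m-of-n concepts can be reproduced exactly for small n. The sketch below assumes the BC is trained on all $2^n$ Boolean examples with exact (uncorrected) probability estimates. By symmetry every attribute then has the same conditional probabilities, so they are computed from binomial counts instead of enumerating examples. The function name is hypothetical and ties are resolved in favour of the negative class.

```python
from fractions import Fraction
from math import comb

def bc_errors_m_of_n(m, n):
    """List (j, error kind) for each count j of true attributes the BC gets wrong.

    Assumes 1 <= m <= n and exact probabilities from all 2**n examples.
    """
    total = 2 ** n
    pos = sum(comb(n, j) for j in range(m, n + 1))   # examples where the concept holds
    neg = total - pos
    # Examples of each class in which one fixed attribute is true.
    pos_true = sum(comb(n - 1, j - 1) for j in range(m, n + 1))
    neg_true = sum(comb(n - 1, j - 1) for j in range(1, m))
    p_c, p_not_c = Fraction(pos, total), Fraction(neg, total)
    p_a_c, p_a_not_c = Fraction(pos_true, pos), Fraction(neg_true, neg)

    errors = []
    for j in range(n + 1):
        score_c = p_c * p_a_c ** j * (1 - p_a_c) ** (n - j)
        score_not_c = p_not_c * p_a_not_c ** j * (1 - p_a_not_c) ** (n - j)
        predicts_true = score_c > score_not_c
        if predicts_true and j < m:
            errors.append((j, "false positive"))
        elif not predicts_true and j >= m:
            errors.append((j, "false negative"))
    return errors

print(bc_errors_m_of_n(2, 6))   # [(1, 'false positive')]
print(bc_errors_m_of_n(6, 6))   # [] (conjunction of all attributes)
print(bc_errors_m_of_n(1, 6))   # [] (disjunction of all attributes)
```

For the 2-of-6 concept the BC labels examples with a single true attribute as positive, although the concept is linearly separable. The special cases m = n and m = 1 produce no errors.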
Sufficient conditions:
• Theorem 6: The Bayesian classifier is globally optimal if, for all classes $C_i$ and examples $E$, $P(E \mid C_i) = \prod_{j=1}^{a} P(A_j = v_{jk} \mid C_i)$, where $v_{jk}$ is the value of attribute $A_j$ in $E$.
• Theorem 7: The Bayesian classifier is globally optimal for learning conjunctions of literals.
• Theorem 8: The Bayesian classifier is globally optimal for learning disjunctions of literals.
4. When will the BC outperform other learners?
Squared-error loss = noise + statistical bias + variance.
The BC is often a more accurate classifier than C4.5 because a classifier with high bias and low variance will tend to produce lower zero-one loss than one with low bias and high variance.
[Figure: three panels, for 16 attributes, 32 attributes, and 64 attributes]
When the sample size is the dominant limiting factor, the BC may be the better choice. However, as the sample size increases, the BC’s capacity to store information is exhausted sooner than that of more powerful classifiers, and the more powerful classifiers become better.
5. How is the BC best extended?
• Detecting attribute dependences is not necessarily the best way to improve performance.
• Two measures for determining the best pair of attributes to join were compared:
  • Leave-one-out cross-validation accuracy on the training set.
  • Equation 4, used to find the attributes that had the largest violation of the conditional independence assumption.
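Scoring a candidate attribute join by leave-one-out accuracy can be sketched as follows. This is only an illustration under stated assumptions, not the procedure used in the experiments. The classifier is a small simple Bayesian classifier with a Laplace-style correction `f = 1.0`, the data are a made-up XOR-like set, and all function names are hypothetical.

```python
def nb_predict(train_X, train_y, x, f=1.0):
    """Predict the class of x with a simple Bayesian classifier (f is illustrative)."""
    classes = sorted(set(train_y))
    # Number of distinct values per attribute, counting the test value too.
    n_values = [len({row[j] for row in train_X} | {x[j]}) for j in range(len(x))]
    best_class, best_score = None, -1.0
    for c in classes:
        rows = [row for row, y in zip(train_X, train_y) if y == c]
        score = len(rows) / len(train_y)
        for j, value in enumerate(x):
            n_ijk = sum(1 for row in rows if row[j] == value)
            score *= (n_ijk + f) / (len(rows) + f * n_values[j])
        if score > best_score:   # ties keep the first class in sorted order
            best_class, best_score = c, score
    return best_class

def loo_accuracy(X, y):
    """Leave-one-out accuracy: train on all examples but one, test on that one."""
    hits = 0
    for i in range(len(X)):
        hits += nb_predict(X[:i] + X[i + 1:], y[:i] + y[i + 1:], X[i]) == y[i]
    return hits / len(X)

def join_attributes(X, m, n):
    """Replace attributes m and n by a single attribute holding the pair of values."""
    return [
        tuple(v for j, v in enumerate(row) if j not in (m, n)) + ((row[m], row[n]),)
        for row in X
    ]

# Hypothetical data: the class is the XOR of two Boolean attributes.
X = [(a, b) for a in (0, 1) for b in (0, 1)] * 5
y = [a ^ b for a, b in X]
print(loo_accuracy(X, y))                          # separate attributes
print(loo_accuracy(join_attributes(X, 0, 1), y))   # joined attributes
```

The join is judged directly by its effect on estimated zero-one loss, rather than by how strongly the two attributes depend on each other.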
[Figure labels: accuracy on the test set vs. accuracy estimated on the training set; entropy representing the degree of dependence between features]
Cross-validation accuracy is a better predictor of the effect of an attribute join than the degree of dependence given the class.
Because the Bayesian classifier can tolerate some significant violations of the independence assumption under zero-one loss, an approach that directly estimates the effect of the possible changes on this loss measure resulted in a more substantial improvement.
6. Conclusions
• Verified that the BC performs quite well even when strong attribute dependences are present.
• Derived some necessary and some sufficient conditions for the BC’s optimality.
• Hypothesized that the BC may often be a better classifier than more powerful alternatives when the sample size is small.
• Verified that searching for attribute dependences is not necessarily the best approach to improving the BC’s performance.