Using Error-Correcting Codes For Text Classification Rayid Ghani rayid@cs.cmu.edu This presentation can be accessed at http://www.cs.cmu.edu/~rayid/talks/
Outline • Introduction to ECOC • Intuition & Motivation • Some Questions? • Experimental Results • Semi-Theoretical Model • Types of Codes • Drawbacks • Conclusions
Introduction • Decompose a multiclass classification problem into multiple binary problems • One-Per-Class Approach: m binary problems for m classes (moderately expensive) • All-Pairs: m(m-1)/2 binary problems (very expensive) • Distributed Output Code (efficient, but what about performance?) • Error-Correcting Output Codes (?)
Is it a good idea? • Larger margin for error since errors can now be “corrected” • One-per-class is a code with minimum Hamming distance (HD) = 2 • Distributed codes have low HD • The individual binary problems can be harder than before • Useless unless the number of classes > 5
Training ECOC • Given m distinct classes • Create an m x n binary matrix M. • Each class is assigned ONE row of M. • Each column of the matrix divides the classes into TWO groups. • Train n base classifiers to learn the n binary problems (see the sketch below).
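A minimal sketch of the training step, assuming a scikit-learn-style base learner; `train_ecoc`, the logistic-regression stand-in, and the variable names are illustrative, not from the talk:

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

def train_ecoc(X, y, M, base=LogisticRegression(max_iter=1000)):
    """Train one binary classifier per column of the m x n code matrix M.
    X: (num_examples, num_features) data; y: integer labels in [0, m); M: (m, n) 0/1 matrix."""
    classifiers = []
    for j in range(M.shape[1]):
        # Column j relabels every training example with the j-th bit of its class codeword
        classifiers.append(clone(base).fit(X, M[y, j]))
    return classifiers
```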
Testing ECOC • To test a new instance • Apply each of the n classifiers to the new instance • Combine the predictions to obtain a binary string (codeword) for the new point • Classify to the class with the nearest codeword (usually Hamming distance is used as the distance measure; see the decoding sketch below)
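A matching decoding sketch, continuing the illustrative helpers above: each classifier predicts one bit of the codeword, and the class whose row of M is nearest in Hamming distance wins.

```python
def predict_ecoc(X, classifiers, M):
    """Decode by nearest codeword: Hamming distance between the predicted bit
    string and each row of the code matrix M."""
    bits = np.column_stack([clf.predict(X) for clf in classifiers])   # (num_test, n)
    distances = np.array([[int(np.sum(b != codeword)) for codeword in M] for b in bits])
    return distances.argmin(axis=1)   # index of the nearest class codeword
```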
ECOC - Picture • Code matrix with rows as class codewords and columns as binary problems f1–f5: A = 00110, B = 10100, C = 01110, D = 01001 • A new point X whose classifiers predict the bit string 11110 is assigned to class C, the nearest codeword (Hamming distance 1)
Questions? • How well does it work? • How long should the code be? • Do we need a lot of training data? • What kind of codes can we use? • Are there intelligent ways of creating the code?
Previous Work • Combine with Boosting – AdaBoost.OC (Schapire, 1997), (Guruswami & Sahai, 1999) • Local Learners (Ricci & Aha, 1997) • Text Classification (Berger, 1999)
Experimental Setup • Generate the code • BCH Codes • Choose a Base Learner • Naive Bayes Classifier as used in text classification tasks (McCallum & Nigam 1998)
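For a rough, off-the-shelf version of this setup, scikit-learn's OutputCodeClassifier can wrap a multinomial Naive Bayes text classifier; note it draws a random code rather than a BCH code, and the toy documents and labels below are purely illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OutputCodeClassifier
from sklearn.naive_bayes import MultinomialNB

# Toy stand-ins for company web pages labelled with economic sectors
docs = ["oil drilling rig equipment", "offshore oil exploration",
        "retail clothing store chain", "discount apparel retailer",
        "bank loan interest rates", "commercial banking services"]
labels = [0, 0, 1, 1, 2, 2]

vec = CountVectorizer()
X = vec.fit_transform(docs)
# code_size * n_classes random binary columns (a random code, not BCH as in the talk)
ecoc = OutputCodeClassifier(MultinomialNB(), code_size=4.0, random_state=0).fit(X, labels)
print(ecoc.predict(vec.transform(["bank loan application"])))
```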
Dataset • Industry Sector Dataset • Consists of company web pages classified into 105 economic sectors • Standard stoplist • No Stemming • Skip all MIME headers and HTML tags • Experimental approach similar to McCallum et al. (1998) for comparison purposes.
Results • Industry Sector Dataset • ECOC reduces the error of the Naïve Bayes classifier by 66% • Baseline results from (McCallum et al. 1998) and (Nigam et al. 1999)
The Longer the Better! • Longer codes mean larger codeword separation • The minimum Hamming distance of a code C is the smallest distance between any pair of distinct codewords in C • If the minimum Hamming distance is h, then the code can correct up to ⌊(h-1)/2⌋ bit errors
Table 2: Average classification accuracy on 5 random 50-50 train-test splits of the Industry Sector dataset with a vocabulary of 10,000 words selected using Information Gain.
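A small illustrative check of this property (the helper and the example code matrix are not from the talk): compute the minimum Hamming distance h of a code matrix and the number of bit errors, ⌊(h-1)/2⌋, that nearest-codeword decoding can always correct.

```python
import numpy as np
from itertools import combinations

def min_hamming_distance(M):
    """Smallest Hamming distance between any pair of distinct rows (codewords) of M."""
    return min(int(np.sum(a != b)) for a, b in combinations(M, 2))

# 4 classes, 6-bit codewords, pairwise distance 4
M = np.array([[0, 0, 0, 0, 0, 0],
              [0, 0, 1, 1, 1, 1],
              [1, 1, 0, 0, 1, 1],
              [1, 1, 1, 1, 0, 0]])
h = min_hamming_distance(M)
print(h, (h - 1) // 2)   # prints: 4 1  -> corrects any single bit error
```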
Semi-Theoretical Model • Model ECOC by a Binomial Distribution B(n, p) • n = length of the code • p = probability of each bit being classified incorrectly
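A sketch of the model (the function and example values are illustrative): the final prediction is counted as correct whenever at most ⌊(h-1)/2⌋ of the n bits are wrong, so the estimated accuracy is a binomial tail sum.

```python
from math import comb

def ecoc_accuracy_estimate(n, p, h):
    """Estimated accuracy under B(n, p): probability that the number of incorrectly
    classified bits stays within the correction capability floor((h-1)/2)."""
    correctable = (h - 1) // 2
    return sum(comb(n, i) * (p ** i) * ((1 - p) ** (n - i)) for i in range(correctable + 1))

# Illustrative values: a 63-bit code with minimum distance 31 and a 20% per-bit error rate
print(round(ecoc_accuracy_estimate(n=63, p=0.2, h=31), 3))
```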
Types of Codes • Data-Independent (e.g., Algebraic, Random, Hand-Constructed) • Data-Dependent (e.g., Adaptive)
What is a Good Code? • Row Separation • Column Separation (Independence of errors for each binary classifier) • Efficiency (for long codes)
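Row separation is just the minimum Hamming distance sketched earlier; column separation can be screened with a helper like the one below (illustrative, not from the talk). Columns that are identical or exact complements define the same binary problem, so their errors are perfectly correlated and add no correcting power.

```python
import numpy as np
from itertools import combinations

def column_separation(M):
    """Minimum Hamming distance between any column of the 0/1 code matrix M and
    every other column or that column's complement."""
    return min(min(int(np.sum(a != b)), int(np.sum(a != 1 - b)))
               for a, b in combinations(M.T, 2))
```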
Drawbacks • Can be computationally expensive • Random Codes throw away the real-world nature of the data by picking random partitions to create artificial binary problems
Future Work • Combine ECOC with Co-Training • Automatically construct optimal / adaptive codes
Conclusion • Improves classification accuracy considerably! • Can be used when training data is sparse • Algebraic codes perform better than random codes for a given code length • Hand-constructed codes are not the answer