Using Error-Correcting Codes For Text Classification Rayid Ghani rayid@cs.cmu.edu This presentation can be accessed at http://www.cs.cmu.edu/~rayid/talks/
Outline • Introduction to ECOC • Intuition & Motivation • Some Questions? • Experimental Results • Semi-Theoretical Model • Types of Codes • Drawbacks • Conclusions
Introduction • Decompose a multiclass classification problem into multiple binary problems • One-Per-Class Approach: m binary problems for m classes (moderately expensive) • All-Pairs: m(m-1)/2 binary problems (very expensive) • Distributed Output Code (efficient, but what about performance?) • Error-Correcting Output Codes (?)
Is it a good idea? • Larger margin for error since errors can now be “corrected” • One-per-class is a code with minimum Hamming distance (HD) = 2 • Distributed codes have low HD • The individual binary problems can be harder than before • Useless unless the number of classes > 5
Training ECOC • Given m distinct classes • Create an m x n binary matrix M. • Each class is assigned ONE row of M. • Each column of the matrix divides the classes into TWO groups. • Train n base classifiers to learn the n binary problems (see the sketch below).
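A minimal sketch of the training step, assuming a scikit-learn-style base learner; `train_ecoc`, the logistic-regression stand-in, and the variable names are illustrative, not from the talk:

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

def train_ecoc(X, y, M, base=LogisticRegression(max_iter=1000)):
    """Train one binary classifier per column of the m x n code matrix M.
    X: (num_examples, num_features) data; y: integer labels in [0, m); M: (m, n) 0/1 matrix."""
    classifiers = []
    for j in range(M.shape[1]):
        # Column j relabels every training example with the j-th bit of its class codeword
        classifiers.append(clone(base).fit(X, M[y, j]))
    return classifiers
```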
Testing ECOC • To test a new instance • Apply each of the n classifiers to the new instance • Combine the predictions to obtain a binary string (codeword) for the new point • Classify to the class with the nearest codeword (usually Hamming distance is used as the distance measure; see the decoding sketch below)
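A matching decoding sketch, continuing the illustrative helpers above: each classifier predicts one bit of the codeword, and the class whose row of M is nearest in Hamming distance wins.

```python
def predict_ecoc(X, classifiers, M):
    """Decode by nearest codeword: Hamming distance between the predicted bit
    string and each row of the code matrix M."""
    bits = np.column_stack([clf.predict(X) for clf in classifiers])   # (num_test, n)
    distances = np.array([[int(np.sum(b != codeword)) for codeword in M] for b in bits])
    return distances.argmin(axis=1)   # index of the nearest class codeword
```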
ECOC - Picture • Code matrix with rows as class codewords and columns as binary problems f1–f5: A = 00110, B = 10100, C = 01110, D = 01001 • A new point X whose classifiers predict the bit string 11110 is assigned to class C, the nearest codeword (Hamming distance 1)
Questions? • How well does it work? • How long should the code be? • Do we need a lot of training data? • What kind of codes can we use? • Are there intelligent ways of creating the code?
Previous Work • Combine with Boosting – AdaBoost.OC (Schapire, 1997), (Guruswami & Sahai, 1999) • Local Learners (Ricci & Aha, 1997) • Text Classification (Berger, 1999)
Experimental Setup • Generate the code • BCH Codes • Choose a Base Learner • Naive Bayes Classifier as used in text classification tasks (McCallum & Nigam 1998)
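For a rough, off-the-shelf version of this setup, scikit-learn's OutputCodeClassifier can wrap a multinomial Naive Bayes text classifier; note it draws a random code rather than a BCH code, and the toy documents and labels below are purely illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OutputCodeClassifier
from sklearn.naive_bayes import MultinomialNB

# Toy stand-ins for company web pages labelled with economic sectors
docs = ["oil drilling rig equipment", "offshore oil exploration",
        "retail clothing store chain", "discount apparel retailer",
        "bank loan interest rates", "commercial banking services"]
labels = [0, 0, 1, 1, 2, 2]

vec = CountVectorizer()
X = vec.fit_transform(docs)
# code_size * n_classes random binary columns (a random code, not BCH as in the talk)
ecoc = OutputCodeClassifier(MultinomialNB(), code_size=4.0, random_state=0).fit(X, labels)
print(ecoc.predict(vec.transform(["bank loan application"])))
```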
Dataset • Industry Sector Dataset • Consists of company web pages classified into 105 economic sectors • Standard stoplist • No Stemming • Skip all MIME headers and HTML tags • Experimental approach similar to McCallum et al. (1998) for comparison purposes.
Results • Industry Sector Dataset • ECOC reduces the error of the Naïve Bayes classifier by 66% • Baseline results from (McCallum et al. 1998) and (Nigam et al. 1999)
The Longer the Better! • Longer codes mean larger codeword separation • The minimum Hamming distance of a code C is the smallest distance between any pair of distinct codewords in C • If the minimum Hamming distance is h, then the code can correct up to ⌊(h-1)/2⌋ bit errors
Table 2: Average classification accuracy on 5 random 50-50 train-test splits of the Industry Sector dataset with a vocabulary of 10,000 words selected using Information Gain.
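A small illustrative check of this property (the helper and the example code matrix are not from the talk): compute the minimum Hamming distance h of a code matrix and the number of bit errors, ⌊(h-1)/2⌋, that nearest-codeword decoding can always correct.

```python
import numpy as np
from itertools import combinations

def min_hamming_distance(M):
    """Smallest Hamming distance between any pair of distinct rows (codewords) of M."""
    return min(int(np.sum(a != b)) for a, b in combinations(M, 2))

# 4 classes, 6-bit codewords, pairwise distance 4
M = np.array([[0, 0, 0, 0, 0, 0],
              [0, 0, 1, 1, 1, 1],
              [1, 1, 0, 0, 1, 1],
              [1, 1, 1, 1, 0, 0]])
h = min_hamming_distance(M)
print(h, (h - 1) // 2)   # prints: 4 1  -> corrects any single bit error
```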
Semi-Theoretical Model • Model ECOC by a Binomial Distribution B(n, p) • n = length of the code • p = probability of each bit being classified incorrectly
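A sketch of the model (the function and example values are illustrative): the final prediction is counted as correct whenever at most ⌊(h-1)/2⌋ of the n bits are wrong, so the estimated accuracy is a binomial tail sum.

```python
from math import comb

def ecoc_accuracy_estimate(n, p, h):
    """Estimated accuracy under B(n, p): probability that the number of incorrectly
    classified bits stays within the correction capability floor((h-1)/2)."""
    correctable = (h - 1) // 2
    return sum(comb(n, i) * (p ** i) * ((1 - p) ** (n - i)) for i in range(correctable + 1))

# Illustrative values: a 63-bit code with minimum distance 31 and a 20% per-bit error rate
print(round(ecoc_accuracy_estimate(n=63, p=0.2, h=31), 3))
```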
Types of Codes • Data-Independent (e.g., Algebraic, Random, Hand-Constructed) • Data-Dependent (e.g., Adaptive)
What is a Good Code? • Row Separation • Column Separation (Independence of errors for each binary classifier) • Efficiency (for long codes)
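Row separation is just the minimum Hamming distance sketched earlier; column separation can be screened with a helper like the one below (illustrative, not from the talk). Columns that are identical or exact complements define the same binary problem, so their errors are perfectly correlated and add no correcting power.

```python
import numpy as np
from itertools import combinations

def column_separation(M):
    """Minimum Hamming distance between any column of the 0/1 code matrix M and
    every other column or that column's complement."""
    return min(min(int(np.sum(a != b)), int(np.sum(a != 1 - b)))
               for a, b in combinations(M.T, 2))
```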
Drawbacks • Can be computationally expensive • Random Codes throw away the real-world nature of the data by picking random partitions to create artificial binary problems
Future Work • Combine ECOC with Co-Training • Automatically construct optimal / adaptive codes
Conclusion • Improves classification accuracy considerably! • Can be used when training data is sparse • Algebraic codes perform better than random codes for a given code length • Hand-constructed codes are not the answer