Using Error-Correcting Codes for Efficient Text Categorization with a Large Number of Categories Rayid Ghani Center for Automated Learning & Discovery Carnegie Mellon University
Some Recent Work • Learning from Sequences of fMRI Brain Images (with Tom Mitchell) • Learning to automatically build language-specific corpora from the web (with Rosie Jones & Dunja Mladenic) • Effect of Smoothing on Naive Bayes for Text Classification (with Tong Zhang @ IBM Research) • Hypertext Categorization using links and extracted information (with Sean Slattery & Yiming Yang) • Hybrids of EM & Co-Training for semi-supervised learning (with Kamal Nigam) • Error-Correcting Output Codes for Text Classification
Text Categorization • Domains: Topics, Genres, Languages • Numerous $$$-making applications: Search Engines/Portals, Customer Service, Email Routing, …
Problems • Practical applications such as web portals deal with a large number of categories • A lot of labeled examples are needed to train the system
How do people deal with a large number of classes? • Use fast multiclass algorithms (Naïve Bayes), which build one model per class • Use binary classification algorithms (SVMs) and break an n-class problem into n binary problems • What happens with a 1000-class problem? • Can we do better?
ECOC to the Rescue! • An n-class problem can be solved by solving log₂(n) binary problems • More efficient than one-per-class • Does it actually perform better?
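As a quick illustration of the efficiency claim (the 1000-class figure is the hypothetical from the previous slide, not a real dataset):

```python
import math

n_classes = 1000                                # hypothetical 1000-class problem from the slide above
one_per_class = n_classes                       # one binary model per class
ecoc_minimum = math.ceil(math.log2(n_classes))  # shortest code giving every class a distinct codeword

print(one_per_class, ecoc_minimum)              # 1000 binary problems vs. 10
```

In practice the code is made longer than this minimum so that codewords are far apart and individual binary-classifier errors can be corrected, which is the trade-off discussed on the next slides.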
What is ECOC? • Solve multiclass problems by decomposing them into multiple binary problems (Dietterich & Bakiri 1995) • Use a learner to learn the binary problems
ECOC - Picture (Training and Testing)

          f1  f2  f3  f4
    A      0   0   1   1
    B      1   0   1   0
    C      0   1   1   1
    D      0   1   0   0

• Training: learn one binary classifier per column (f1–f4), using each class's bit in that column as the binary label for its training documents • Testing: run f1–f4 on a new document X to get a predicted bit string (here 1 1 1 1) and assign the class whose codeword is nearest in Hamming distance (C in this example)
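To make the picture concrete, here is a minimal sketch of ECOC training and decoding, assuming a scikit-learn style Naïve Bayes base learner and the toy 4-bit code above (the experiments later in the talk use BCH codes over 105 classes); the function names are illustrative, not from the paper.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy 4-bit code from the picture above; rows are class codewords, columns are binary problems.
codes = {"A": [0, 0, 1, 1], "B": [1, 0, 1, 0],
         "C": [0, 1, 1, 1], "D": [0, 1, 0, 0]}
classes = list(codes)
code_matrix = np.array([codes[c] for c in classes])      # shape: (n_classes, n_bits)

def train_ecoc(X, y):
    """Train one binary classifier per code bit (one per column of code_matrix)."""
    classifiers = []
    for bit in range(code_matrix.shape[1]):
        binary_labels = [codes[label][bit] for label in y]   # relabel each document with its class's bit
        classifiers.append(MultinomialNB().fit(X, binary_labels))
    return classifiers

def predict_ecoc(classifiers, X):
    """Predict every bit, then decode to the nearest codeword by Hamming distance."""
    bits = np.column_stack([clf.predict(X) for clf in classifiers])
    distances = (bits[:, None, :] != code_matrix[None, :, :]).sum(axis=2)
    return [classes[i] for i in distances.argmin(axis=1)]
```

Decoding by nearest codeword is what lets the multiclass prediction survive a few wrong bits, provided the codewords are far enough apart.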
ECOC works but… • Increased code length = Increased Accuracy • Increased code length = Increased Computational Cost
[Figure: Efficiency vs. Classification Performance trade-off — Naïve Bayes is efficient, ECOC (as used in Berger 99) classifies better but costs more, and the GOAL is to get both]
Choosing the codewords • Random? [Berger 1999, James 1999] • Asymptotically good (the longer the better) • Computational Cost is very high • Use Coding Theory for Good Error-Correcting Codes? [Dietterich & Bakiri 1995] • Guaranteed properties for a fixed-length code
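A rough sketch of why the code choice matters: random codes are trivial to generate but give no fixed-length guarantee on how far apart the codewords are, whereas codes from coding theory (e.g., BCH) fix that distance by construction. The snippet below is a minimal illustration assuming nothing beyond NumPy (the function names and code lengths are made up for this example); it samples random codes and measures their minimum pairwise Hamming distance.

```python
import numpy as np

def random_code(n_classes, n_bits, seed=0):
    """Sample one random codeword (row) per class."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, 2, size=(n_classes, n_bits))

def min_hamming_distance(code_matrix):
    """Smallest pairwise Hamming distance between codewords;
    a larger distance means more binary-classifier errors can be corrected."""
    n = code_matrix.shape[0]
    return min((code_matrix[i] != code_matrix[j]).sum()
               for i in range(n) for j in range(i + 1, n))

# Longer random codes tend to have larger minimum distance (asymptotically good),
# but every extra bit is another binary classifier to train.
print(min_hamming_distance(random_code(105, 15)))
print(min_hamming_distance(random_code(105, 63)))
```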
Experimental Setup • Generate the code • BCH Codes • Choose a Base Learner • Naive Bayes Classifier as used in text classification tasks (McCallum & Nigam 1998)
Text Classification with Naïve Bayes
• "Bag of Words" document representation
• Estimate parameters of the generative model (multinomial, with Laplace smoothing, as in McCallum & Nigam 1998):
  P(w_t | c_j) = (1 + Σ_i N(w_t, d_i) P(c_j | d_i)) / (|V| + Σ_s Σ_i N(w_s, d_i) P(c_j | d_i))
• Naïve Bayes classification:
  c(d) = argmax_j P(c_j) Π_k P(w_{d,k} | c_j)
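A minimal sketch of this bag-of-words Naïve Bayes setup, assuming scikit-learn; the two toy documents and labels are illustrative, not from the Industry Sector dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs   = ["quarterly earnings rose sharply", "new graphics chip announced"]
labels = ["finance", "technology"]

vectorizer = CountVectorizer()          # "bag of words" representation
X = vectorizer.fit_transform(docs)      # word-count matrix

model = MultinomialNB(alpha=1.0)        # alpha=1.0 is the Laplace smoothing in the estimate above
model.fit(X, labels)

print(model.predict(vectorizer.transform(["chip earnings report"])))
```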
Industry Sector Dataset [McCallum et al. 1998, Ghani 2000] • Consists of company web pages classified into 105 economic sectors
Results — Industry Sector Data Set
ECOC reduces the error of the Naïve Bayes classifier by 66% with no increase in computational cost
• Comparison results cited from (McCallum et al. 1998) and (Nigam et al. 1999)
New Goal [Figure: the same Efficiency vs. Classification Performance plot — the new GOAL is to keep Naïve Bayes' efficiency while reaching the classification performance of ECOC (as used in Berger 99)]
Solutions • Design codewords that minimize cost and maximize “performance” • Investigate the assignment of codewords to classes • Learn the decoding function • Incorporate unlabeled data into ECOC
Use unlabeled data with a large number of classes • How? • Use EM • Mixed Results • Think Again! • Use Co-Training • Disastrous Results • Think one more time
How to use unlabeled data? • Current learning algorithms using unlabeled data (EM, Co-Training) don’t work well with a large number of categories • ECOC works great with a large number of classes but there is no framework for using unlabeled data
ECOC + CoTraining = ECoTrain • ECOC decomposes multiclass problems into binary problems • Co-Training works great with binary problems • ECOC + Co-Train = Learn each binary problem in ECOC with Co-Training
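A minimal sketch of the inner loop this suggests: one binary ECOC problem learned with co-training over two feature views. The view split, round counts, and confidence-based selection below are generic co-training assumptions for illustration, not the exact settings used in the experiments; inputs are assumed to be dense count matrices.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def cotrain_binary(X1_lab, X2_lab, y_lab, X1_unlab, X2_unlab,
                   rounds=10, per_round=5):
    """Co-train one binary ECOC problem from two feature views (dense count matrices)."""
    X1, X2, y = X1_lab.copy(), X2_lab.copy(), list(y_lab)
    unlabeled = list(range(X1_unlab.shape[0]))
    clf1, clf2 = MultinomialNB(), MultinomialNB()
    for _ in range(rounds):
        clf1.fit(X1, y)
        clf2.fit(X2, y)
        # Each view labels the unlabeled documents it is most confident about,
        # and those documents join the shared labeled pool for the next round.
        for clf, X_view in ((clf1, X1_unlab), (clf2, X2_unlab)):
            if not unlabeled:
                break
            confidence = clf.predict_proba(X_view[unlabeled]).max(axis=1)
            picks = [unlabeled[i] for i in np.argsort(confidence)[-per_round:]]
            for idx in picks:
                X1 = np.vstack([X1, X1_unlab[idx:idx + 1]])
                X2 = np.vstack([X2, X2_unlab[idx:idx + 1]])
                y.append(clf.predict(X_view[idx:idx + 1])[0])
                unlabeled.remove(idx)
    clf1.fit(X1, y)
    clf2.fit(X2, y)
    return clf1, clf2
```

ECoTrain would run a loop like this once per bit of the code, then decode the predicted bits to the nearest codeword exactly as in the earlier ECOC sketch.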
ECOC + Co-Training - Results (105-class problem; L = labeled, U = unlabeled documents per class)

Algorithm                                    300L + 0U    50L + 250U    5L + 295U
Naïve Bayes (no unlabeled data)              76           67            40.3
ECOC 15-bit (no unlabeled data)              76.5         68.5          49.2
EM (uses unlabeled data)                     -            68.2          51.4
Co-Training (uses unlabeled data)            -            67.6          50.1
ECoTrain (ECOC + Co-Training, unlabeled)     -            72.0          56.1
What Next? • Use improved version of co-training (gradient descent) • Less prone to random fluctuations • Uses all unlabeled data at every iteration • Use Co-EM (Nigam & Ghani 2000) - hybrid of EM and Co-Training
Potential Drawbacks • Random Codes throw away the real-world nature of the data by picking random partitions to create artificial binary problems
Summary • Use ECOC for efficient text classification with a large number of categories • Increase Accuracy & Efficiency • Use Unlabeled data by combining ECOC and Co-Training • Generalize to domain-independent classification tasks involving a large number of categories