Using Error-Correcting Codes for Efficient Text Categorization with a Large Number of Categories Rayid Ghani Center for Automated Learning & Discovery Carnegie Mellon University
Some Recent Work • Learning from Sequences of fMRI Brain Images (with Tom Mitchell) • Learning to automatically build language-specific corpora from the web (with Rosie Jones & Dunja Mladenic) • Effect of Smoothing on Naive Bayes for Text Classification (with Tong Zhang @ IBM Research) • Hypertext Categorization using links and extracted information (with Sean Slattery & Yiming Yang) • Hybrids of EM & Co-Training for semi-supervised learning (with Kamal Nigam) • Error-Correcting Output Codes for Text Classification
Text Categorization • Domains: Topics, Genres, Languages • $$$ Making • Numerous Applications: Search Engines/Portals, Customer Service, Email Routing, …
Problems • Practical applications such as web portals deal with a large number of categories • A lot of labeled examples are needed to train the system
How do people deal with a large number of classes? • Use fast multiclass algorithms (Naïve Bayes), which build one model per class • Use binary classification algorithms (SVMs) and break an n-class problem into n binary problems • What happens with a 1000-class problem? • Can we do better?
ECOC to the Rescue! • An n-class problem can be solved with as few as ⌈log2(n)⌉ binary problems • More efficient than one-per-class • Does it actually perform better?
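To make the log2(n) claim concrete, here is a small sketch comparing the number of binary problems under one-per-class with the shortest code that still gives every class a distinct codeword. The function names are illustrative, not from the talk.

```python
def one_per_class_problems(n_classes: int) -> int:
    """One binary problem (class vs. rest) for every class."""
    return n_classes


def min_ecoc_bits(n_classes: int) -> int:
    """Smallest code length with a distinct codeword per class: ceil(log2(n))."""
    # (n - 1).bit_length() computes ceil(log2(n)) exactly for n >= 1.
    return (n_classes - 1).bit_length()


for n in (4, 105, 1000):
    print(n, one_per_class_problems(n), min_ecoc_bits(n))
# 105 classes need at least 7 bits; 1000 classes need at least 10.
```

A code of this minimum length has no redundancy, so a single wrong bit can change the decoded class. Longer codes trade extra binary problems for error correction.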
What is ECOC? • Solve multiclass problems by decomposing them into multiple binary problems (Dietterich & Bakiri 1995) • Use a learner to learn the binary problems
Training ECOC / ECOC - Picture (rows: classes A-D; columns: binary functions f1-f4)

      f1  f2  f3  f4
A     0   0   1   1
B     1   0   1   0
C     0   1   1   1
D     0   1   0   0

Testing ECOC (predicted bits for test document X)

      f1  f2  f3  f4
X     1   1   1   1
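The picture can be sketched in a few lines of code. This reads the slide's bit matrix as four rows of four bits (an assumption about the original layout) and uses standard nearest-codeword decoding by Hamming distance. The helper names are illustrative.

```python
# Codewords as read from the slide (assumed row layout: one row per class).
CODE_MATRIX = {
    "A": (0, 0, 1, 1),
    "B": (1, 0, 1, 0),
    "C": (0, 1, 1, 1),
    "D": (0, 1, 0, 0),
}


def hamming(u, v):
    """Number of positions where two bit vectors differ."""
    return sum(a != b for a, b in zip(u, v))


def binary_labels(bit_index, class_labels):
    """Training: relabel multiclass examples for binary problem f_(bit_index+1)."""
    return [CODE_MATRIX[c][bit_index] for c in class_labels]


def decode(predicted_bits):
    """Testing: pick the class whose codeword is nearest to the predicted bits."""
    return min(CODE_MATRIX, key=lambda c: hamming(CODE_MATRIX[c], predicted_bits))


x_bits = (1, 1, 1, 1)  # outputs of f1..f4 on test document X
distances = {c: hamming(w, x_bits) for c, w in CODE_MATRIX.items()}
print(distances)       # {'A': 2, 'B': 2, 'C': 1, 'D': 3}
print(decode(x_bits))  # C
```

Under this reading, X is assigned to class C, whose codeword differs from the predicted bits in only one position.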
ECOC works but… • Increased code length = Increased Accuracy • Increased code length = Increased Computational Cost
[Figure: plot with axes Efficiency and Classification Performance; marked points: Naïve Bayes, ECOC (as used in Berger 99), GOAL]
Choosing the codewords • Random? [Berger 1999, James 1999] • Asymptotically good (the longer the better) • Computational Cost is very high • Use Coding Theory for Good Error-Correcting Codes? [Dietterich & Bakiri 1995] • Guaranteed properties for a fixed-length code
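One property that coding theory can guarantee for a fixed-length code is the minimum Hamming distance d between codewords: nearest-codeword decoding then tolerates up to ⌊(d − 1)/2⌋ wrong bits. The sketch below measures this for any code matrix. It draws random codes only and does not construct BCH codes. The function names and the toy code are illustrative assumptions.

```python
import random
from itertools import combinations


def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))


def random_code(n_classes, n_bits, seed=0):
    """Random codewords, one per class (duplicates are possible for short codes)."""
    rng = random.Random(seed)
    return [tuple(rng.randint(0, 1) for _ in range(n_bits)) for _ in range(n_classes)]


def min_row_distance(code):
    """Smallest Hamming distance between any two codewords."""
    return min(hamming(u, v) for u, v in combinations(code, 2))


def correctable_errors(code):
    """Wrong bits that nearest-codeword decoding is guaranteed to tolerate."""
    return max(0, (min_row_distance(code) - 1) // 2)


# Hand-built toy code with minimum distance 3: it corrects any single bit error.
toy_code = [(0, 0, 0, 0, 0), (1, 1, 1, 0, 0), (0, 0, 1, 1, 1), (1, 1, 0, 1, 1)]
print(min_row_distance(toy_code), correctable_errors(toy_code))  # 3 1

# For random codes the separation is not guaranteed; it has to be measured.
for n_bits in (15, 31, 63):
    code = random_code(105, n_bits)
    print(n_bits, min_row_distance(code), correctable_errors(code))
```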
Experimental Setup • Generate the code • BCH Codes • Choose a Base Learner • Naive Bayes Classifier as used in text classification tasks (McCallum & Nigam 1998)
Text Classification with Naïve Bayes • “Bag of Words” document representation • Estimate parameters of generative model: • Naïve Bayes classification:
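The slide's equations did not survive extraction. As a stand-in, here is the standard multinomial Naïve Bayes formulation with Laplace smoothing, which is likely close to what was shown but is an assumption. The notation is also assumed: V is the vocabulary, D the labeled training set, N(w_t, d_i) the count of word w_t in document d_i, and c_j a class.

```latex
% Parameter estimates from labeled documents (Laplace-smoothed word probabilities)
\hat{P}(w_t \mid c_j) =
  \frac{1 + \sum_{d_i \in c_j} N(w_t, d_i)}
       {|V| + \sum_{s=1}^{|V|} \sum_{d_i \in c_j} N(w_s, d_i)},
\qquad
\hat{P}(c_j) = \frac{|\{ d_i \in D : d_i \in c_j \}|}{|D|}

% Classification of a new document d
\hat{c}(d) = \arg\max_{c_j}
  \Big[ \log \hat{P}(c_j) + \sum_{t=1}^{|V|} N(w_t, d) \, \log \hat{P}(w_t \mid c_j) \Big]
```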
Industry Sector Dataset [McCallum et al. 1998, Ghani 2000] • Consists of company web pages classified into 105 economic sectors
Results: Industry Sector Data Set • ECOC reduces the error of the Naïve Bayes Classifier by 66% with no increase in computational cost • Sources for compared results: (McCallum et al. 1998); 2, 3. (Nigam et al. 1999)
New Goal [Figure: plot with axes Efficiency and Classification Performance; marked points: NB, ECOC (as used in Berger 99), GOAL]
Solutions • Design codewords that minimize cost and maximize “performance” • Investigate the assignment of codewords to classes • Learn the decoding function • Incorporate unlabeled data into ECOC
Use unlabeled data with a large number of classes • How? • Use EM • Mixed Results • Think Again! • Use Co-Training • Disastrous Results • Think one more time
How to use unlabeled data? • Current learning algorithms using unlabeled data (EM, Co-Training) don’t work well with a large number of categories • ECOC works great with a large number of classes but there is no framework for using unlabeled data
ECOC + Co-Training = ECoTrain • ECOC decomposes multiclass problems into binary problems • Co-Training works great with binary problems • ECoTrain = learn each binary problem in ECOC with Co-Training
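The combination can be sketched as follows: one co-training run per code bit, then nearest-codeword decoding. Everything concrete here is an assumption for illustration: the synthetic two-view data, the 1-nearest-neighbour stand-in learner (the talk uses Naïve Bayes), the simplified co-training loop that adds one example per view per round, and all names.

```python
def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


class NearestNeighborLearner:
    """Stand-in binary learner for a single view (hypothetical, not Naive Bayes)."""

    def fit(self, xs, ys):
        self.examples = list(zip(xs, ys))
        return self

    def score(self, x):
        # > 0 means closer to a 1-labeled example; the magnitude acts as confidence.
        near = {b: min(sq_dist(x, e) for e, y in self.examples if y == b) for b in (0, 1)}
        return near[0] - near[1]


def co_train(lab1, lab2, ys, unl1, unl2, rounds=3):
    """Simplified co-training on one binary problem with two views."""
    lab1, lab2, ys = list(lab1), list(lab2), list(ys)
    pool = list(zip(unl1, unl2))
    for _ in range(rounds):
        for view in (0, 1):
            if not pool:
                break
            h = NearestNeighborLearner().fit(lab1 if view == 0 else lab2, ys)
            # The learner for this view labels its most confident unlabeled example,
            # which is then added to the shared labeled set for both views.
            best = max(pool, key=lambda u: abs(h.score(u[view])))
            pool.remove(best)
            lab1.append(best[0])
            lab2.append(best[1])
            ys.append(1 if h.score(best[view]) > 0 else 0)
    return NearestNeighborLearner().fit(lab1, ys), NearestNeighborLearner().fit(lab2, ys)


def ecotrain(code, lab1, lab2, classes, unl1, unl2):
    """Learn each ECOC bit as a separate binary problem with co-training."""
    n_bits = len(next(iter(code.values())))
    return [co_train(lab1, lab2, [code[c][b] for c in classes], unl1, unl2)
            for b in range(n_bits)]


def predict(code, learners, x1, x2):
    bits = [1 if h1.score(x1) + h2.score(x2) > 0 else 0 for h1, h2 in learners]
    return min(code, key=lambda c: sum(a != b for a, b in zip(code[c], bits)))


# Synthetic demo: four well-separated classes, one labeled example each per view.
CODE = {"A": (0, 0, 1, 1), "B": (1, 0, 1, 0), "C": (0, 1, 1, 1), "D": (0, 1, 0, 0)}
CENTERS1 = {"A": (0, 0), "B": (10, 0), "C": (0, 10), "D": (10, 10)}
CENTERS2 = {"A": (0, 0), "B": (0, 10), "C": (10, 10), "D": (10, 0)}
classes = list(CODE)
lab1 = [CENTERS1[c] for c in classes]
lab2 = [CENTERS2[c] for c in classes]
unl1 = [(x + 1, y + 1) for x, y in lab1]  # unlabeled documents, view 1
unl2 = [(x + 1, y + 1) for x, y in lab2]  # the same documents, view 2

learners = ecotrain(CODE, lab1, lab2, classes, unl1, unl2)
print(predict(CODE, learners, (0.5, 9.5), (9.5, 10.5)))  # C
```

The point of the sketch is the structure: the multiclass labels are only ever seen through the code matrix, so the co-training routine itself only handles binary problems.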
ECOC+CoTrain - Results (L = labeled, U = unlabeled examples)

Algorithm                                  300L + 0U    50L + 250U   5L + 295U
                                           per class    per class    per class
Uses no unlabeled data
  Naïve Bayes                              76           67           40.3
  ECOC 15-bit                              76.5         68.5         49.2
Uses unlabeled data (105-class problem)
  EM                                       -            68.2         51.4
  Co-Train                                 -            67.6         50.1
Uses unlabeled data
  ECoTrain (ECOC + Co-Training)            -            72.0         56.1
What Next? • Use improved version of co-training (gradient descent) • Less prone to random fluctuations • Uses all unlabeled data at every iteration • Use Co-EM (Nigam & Ghani 2000) - hybrid of EM and Co-Training
Potential Drawbacks • Random Codes throw away the real-world nature of the data by picking random partitions to create artificial binary problems
Summary • Use ECOC for efficient text classification with a large number of categories • Increase Accuracy & Efficiency • Use Unlabeled data by combining ECOC and Co-Training • Generalize to domain-independent classification tasks involving a large number of categories