Efficient Text Categorization with a Large Number of Categories Rayid Ghani KDD Project Proposal
Text Categorization
• Domains: Topics, Genres, Languages
• Numerous applications ($$$ making): Search Engines/Portals, Customer Service, Email Routing, …
How do people deal with a large number of classes?
• Use fast multiclass algorithms (Naïve Bayes) – builds one model per class
• Use binary classification algorithms (SVMs) and break an n-class problem into n binary problems
• What happens with a 1000-class problem?
• Can we do better?
ECOC to the Rescue!
• An n-class problem can be solved by solving only log2(n) binary problems
• More efficient than one-per-class: a 1000-class problem needs just ⌈log2 1000⌉ = 10 binary classifiers instead of 1000
• But does it actually perform better?
What is ECOC?
• Solve multiclass problems by decomposing them into multiple binary problems (Dietterich & Bakiri 1995)
• Use a base learner to learn each of the binary problems
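To make the decomposition concrete, here is a minimal Python sketch of the ECOC training and prediction loops. It assumes a generic scikit-learn-style binary learner with fit/predict; the code matrix, the make_learner factory, and the variable names are illustrative, not taken from the proposal:

```python
import numpy as np

def train_ecoc(X, y, code, make_learner):
    """Train one binary classifier per column of the code matrix.

    code: dict mapping class label -> tuple of bits (its codeword).
    make_learner: factory returning a fresh binary classifier with
                  fit/predict (e.g., Naive Bayes or a linear SVM).
    """
    n_bits = len(next(iter(code.values())))
    models = []
    for j in range(n_bits):
        # Relabel every example with the j-th bit of its class's codeword.
        y_bin = np.array([code[c][j] for c in y])
        model = make_learner()
        model.fit(X, y_bin)
        models.append(model)
    return models

def predict_ecoc(X, models, code):
    """Predict one bit per model, then decode to the nearest codeword."""
    bits = np.column_stack([m.predict(X) for m in models])
    classes = list(code)
    rows = np.array([code[c] for c in classes])
    # Hamming distance from each predicted bit vector to every codeword.
    dists = (bits[:, None, :] != rows[None, :, :]).sum(axis=2)
    return [classes[i] for i in dists.argmin(axis=1)]
```

With a well-separated code, a few wrong bits from individual classifiers can still decode to the correct class, which is the "error-correcting" part of the name.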
Training and Testing ECOC

[Figure: a 4-class, 5-bit code matrix; each column fj defines one binary learning problem]

Class  f1 f2 f3 f4 f5
A       0  0  1  1  0
B       1  0  1  0  0
C       0  1  1  1  0
D       0  1  0  0  1

Training: relabel the data according to each column and learn one binary classifier per column. Testing: run all five classifiers on a new document X to get a bit vector (here 1 1 1 1 0) and assign the class whose codeword is nearest in Hamming distance (here C, at distance 1).
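The decoding step in the figure can be verified in a few lines of Python. The codewords and the test bit vector for X come straight from the slide; nearest-codeword (Hamming) decoding is the standard rule from Dietterich & Bakiri (1995):

```python
# Codewords from the figure above: 4 classes, 5 binary functions f1..f5.
CODEWORDS = {
    "A": (0, 0, 1, 1, 0),
    "B": (1, 0, 1, 0, 0),
    "C": (0, 1, 1, 1, 0),
    "D": (0, 1, 0, 0, 1),
}

def hamming(u, v):
    """Number of bit positions where u and v disagree."""
    return sum(a != b for a, b in zip(u, v))

def decode(bits):
    """Assign the class whose codeword is closest to the predicted bits."""
    return min(CODEWORDS, key=lambda c: hamming(CODEWORDS[c], bits))

# The five classifiers output 1 1 1 1 0 for test document X.
print(decode((1, 1, 1, 1, 0)))  # -> "C" (distance 1; all others are >= 2)
```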
This Proposal

[Figure: Naïve Bayes and ECOC (as used in Berger 99) plotted on axes of Efficiency vs. Classification Performance]

Preliminary results: ECOC reduces the error of the Naïve Bayes classifier by 66% with NO increase in computational cost.
Proposed Solutions • Design codewords that minimize cost and maximize “performance” • Investigate the assignment of codewords to classes • Learn the decoding function • Incorporate unlabeled data into ECOC
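As one concrete illustration of the codeword-design bullet above, a simple baseline from the ECOC literature is to sample random code matrices and keep the one whose rows are farthest apart: a code with minimum pairwise Hamming distance d can correct up to ⌊(d−1)/2⌋ bit errors from the binary classifiers. This sketch is illustrative only; the cost/performance criterion the proposal will actually optimize is still open:

```python
import itertools
import random

def min_row_distance(code):
    """Smallest pairwise Hamming distance between any two codewords."""
    return min(
        sum(a != b for a, b in zip(r1, r2))
        for r1, r2 in itertools.combinations(code, 2)
    )

def random_code(n_classes, n_bits, n_trials=1000, seed=0):
    """Sample random code matrices; keep the most separated one."""
    rng = random.Random(seed)
    best, best_d = None, -1
    for _ in range(n_trials):
        code = [tuple(rng.randint(0, 1) for _ in range(n_bits))
                for _ in range(n_classes)]
        d = min_row_distance(code)
        if d > best_d:
            best, best_d = code, d
    return best, best_d

code, d = random_code(n_classes=4, n_bits=5)
print(d, code)  # larger minimum distance -> more correctable bit errors
```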
Use unlabeled data with a large number of classes
• How?
• Use EM → mixed results. Think again!
• Use Co-Training → disastrous results. Think one more time…
Use Unlabeled Data
• Current learning algorithms that use unlabeled data (EM, Co-Training) don't work well with a large number of categories
• ECOC works great with a large number of classes, but there is no framework for using unlabeled data with it
Use Unlabeled Data
• ECOC decomposes multiclass problems into binary problems
• Co-Training works great with binary problems
• ECOC + Co-Training = learn each binary problem in ECOC with Co-Training (sketched below)
• Preliminary results: not so great! (very sensitive to the initial labeled documents)
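A rough sketch of that combination, assuming each document has two feature "views" (as Co-Training requires) and a probabilistic binary base learner with fit/predict/predict_proba (e.g., Naïve Bayes). The helper names and the confidence-based selection heuristic are illustrative, not the proposal's exact procedure; ECOC would run one such loop per column of the code matrix:

```python
import numpy as np

def cotrain_binary(Xa, Xb, y, Ua, Ub, make_learner, rounds=10, k=2):
    """Co-train two classifiers, one per view, on a single binary
    problem (one column of the ECOC code matrix).

    Xa, Xb: labeled examples in views A and B; y: their 0/1 bit labels.
    Ua, Ub: the same unlabeled pool, represented in each view.
    """
    Xa, Xb, y = list(Xa), list(Xb), list(y)
    pool = list(range(len(Ua)))  # indices of still-unlabeled documents
    for _ in range(rounds):
        ma, mb = make_learner(), make_learner()
        ma.fit(np.array(Xa), np.array(y))
        mb.fit(np.array(Xb), np.array(y))
        if not pool:
            break
        # Each view self-labels the k unlabeled documents it is most
        # confident about; those move, with their guessed bit labels,
        # into the shared labeled set for the next round.
        for model, U in ((ma, Ua), (mb, Ub)):
            if not pool:
                break
            cand = np.array([U[i] for i in pool])
            probs = model.predict_proba(cand)
            preds = model.predict(cand)
            best = np.argsort(probs.max(axis=1))[-k:]
            for j in sorted(best.tolist(), reverse=True):
                i = pool.pop(j)
                Xa.append(Ua[i]); Xb.append(Ub[i])
                y.append(int(preds[j]))
    return ma, mb
```

The sensitivity noted above shows up here directly: the few initial labeled documents drive the first round of self-labeling, and any early mistakes get amplified in later rounds.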
What Next?
• Use an improved version of Co-Training (gradient descent)
  – Less prone to random fluctuations
  – Uses all unlabeled data at every iteration
• Use Co-EM (Nigam & Ghani 2000) – a hybrid of EM and Co-Training
Work Plan
• Collect datasets
• Codeword assignment – 2 weeks
• Learning the decoding function – 1–2 weeks
• Using unlabeled data – 2 weeks
• Designing codes – 2 weeks
• Project write-up – 1 week
Summary
• Use ECOC for efficient text classification with a large number of categories
• Reduce code length without sacrificing performance
• Fix code length and increase performance
• Generalize to domain-independent classification tasks involving a large number of categories