Efficient Text Categorization with a Large Number of Categories

Rayid Ghani, KDD Project Proposal. Text categorization across domains such as topics, genres, and languages has numerous money-making applications, including search engines and portals, customer service, and email routing.

Presentation Transcript


  1. Efficient Text Categorization with a Large Number of Categories Rayid Ghani KDD Project Proposal

  2. Text Categorization • Domains: Topics, Genres, Languages • $$$ Making • Numerous Applications: Search Engines/Portals, Customer Service, Email Routing, …

  3. How do people deal with a large number of classes? • Use fast multiclass algorithms (Naïve Bayes) • Builds one model per class • Use binary classification algorithms (SVMs) and break an n-class problem into n binary problems • What happens with a 1000-class problem? • Can we do better?
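
The one-per-class decomposition mentioned on this slide can be sketched as follows. This is a minimal illustration, not code from the proposal; the function name `one_vs_rest_labels` and the toy label list are assumptions made for the example.

```python
def one_vs_rest_labels(labels, classes):
    """Build one binary target vector per class (1 = in the class, 0 = not)."""
    return {c: [1 if y == c else 0 for y in labels] for c in classes}


# Toy labels, invented for illustration only.
labels = ["A", "B", "C", "A", "D"]
classes = sorted(set(labels))
binary_problems = one_vs_rest_labels(labels, classes)

# One binary problem per class: with 1000 classes this would mean 1000 classifiers.
print(len(binary_problems))
print(binary_problems["A"])
```

The number of binary problems grows linearly with the number of classes, which is the cost the following slides try to reduce.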

  4. ECOC to the Rescue! • An n-class problem can be solved with as few as ⌈log2 n⌉ binary problems • More efficient than one-per-class • Does it actually perform better?
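
As a rough check of the code-length claim, the sketch below computes the smallest number of bits that gives every class a distinct codeword and lists those codewords. The helper names `min_code_length` and `minimal_codewords` are hypothetical, chosen only for this example.

```python
def min_code_length(n_classes):
    """Smallest number of bits that gives each class a distinct codeword."""
    return (n_classes - 1).bit_length()


def minimal_codewords(n_classes):
    """Assign class i the binary representation of i, padded to the minimal length."""
    width = min_code_length(n_classes)
    return [format(i, "0{}b".format(width)) for i in range(n_classes)]


print(min_code_length(1000))   # bits needed for the 1000-class case
print(minimal_codewords(4))    # distinct 2-bit codewords for 4 classes
```

A minimal-length code leaves no redundancy: two codewords can differ in a single bit, so one wrong binary prediction can change the decoded class. That is one reason the slide's question about performance is worth asking.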

  5. What is ECOC? • Error-Correcting Output Codes: solve multiclass problems by decomposing them into multiple binary problems (Dietterich & Bakiri 1995) • Use a learner to learn the binary problems
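
The decomposition can be sketched as below, assuming the 4-class, 5-bit code matrix read off the ECOC picture slides in this deck (A=00110, B=10100, C=01110, D=01001). The names `column_targets`, `train_ecoc`, and `majority_learner` are hypothetical; `majority_learner` is only a placeholder for a real binary learner such as Naïve Bayes.

```python
# Code matrix as interpreted from the deck's ECOC picture (an assumption).
CODE = {"A": "00110", "B": "10100", "C": "01110", "D": "01001"}


def column_targets(labels, code, bit):
    """Relabel multiclass labels as the binary targets of one code column."""
    return [int(code[y][bit]) for y in labels]


def train_ecoc(examples, labels, code, fit_binary):
    """Train one binary model per column of the code matrix."""
    n_bits = len(next(iter(code.values())))
    return [fit_binary(examples, column_targets(labels, code, b))
            for b in range(n_bits)]


def majority_learner(examples, targets):
    """Placeholder learner: always predicts the majority target (ignores examples)."""
    return 1 if 2 * sum(targets) >= len(targets) else 0


labels = ["A", "B", "C", "D", "A"]
examples = ["doc1", "doc2", "doc3", "doc4", "doc5"]  # stand-ins for documents

print(column_targets(labels, CODE, 0))  # binary problem f1
models = train_ecoc(examples, labels, CODE, majority_learner)
print(len(models))                      # one model per bit
```

Each column defines a different two-way split of the classes, so the same training documents are reused with different binary targets for every bit.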

  6. Training ECOC / Testing ECOC
     Training (code matrix):
            f1 f2 f3 f4 f5
         A   0  0  1  1  0
         B   1  0  1  0  0
         C   0  1  1  1  0
         D   0  1  0  0  1
     Testing:
         X   1  1  1  1  0

  7-9. ECOC - Picture (slides 7, 8, and 9 show the same code matrix)
            f1 f2 f3 f4 f5
         A   0  0  1  1  0
         B   1  0  1  0  0
         C   0  1  1  1  0
         D   0  1  0  0  1

  10. ECOC - Picture
            f1 f2 f3 f4 f5
         A   0  0  1  1  0
         B   1  0  1  0  0
         C   0  1  1  1  0
         D   0  1  0  0  1
         X   1  1  1  1  0
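
The slide does not state the decoding rule. A common choice for ECOC is to pick the class whose codeword is nearest in Hamming distance to the predicted bits, and the sketch below assumes that rule. The code matrix and the test vector X = 11110 are read off the slide; the function names are hypothetical.

```python
# Code matrix and test vector as interpreted from the slide (an assumption).
CODE = {"A": "00110", "B": "10100", "C": "01110", "D": "01001"}


def hamming(a, b):
    """Number of positions where two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


def decode(bits, code):
    """Return the class whose codeword is closest to the predicted bits.

    Ties are broken by the order of classes in `code`.
    """
    return min(code, key=lambda c: hamming(bits, code[c]))


x_bits = "11110"  # outputs of the five binary classifiers for test example X
for cls, word in CODE.items():
    print(cls, word, hamming(x_bits, word))
print(decode(x_bits, CODE))
```

Under this rule X is assigned to C, whose codeword differs from 11110 in one bit only.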

  11. This Proposal • [Figure: plot with axes labeled Efficiency and Classification Performance; points labeled NB, ECOC (as used in Berger 99), and Preliminary Results] • Preliminary Results: ECOC reduces the error of the Naïve Bayes classifier by 66% with NO increase in computational cost

  12. Proposed Solutions • Design codewords that minimize cost and maximize “performance” • Investigate the assignment of codewords to classes • Learn the decoding function • Incorporate unlabeled data into ECOC
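
One way to compare candidate codes, offered here as an assumption and not as the proposal's method, is the minimum Hamming distance between codewords: a code with minimum distance d can absorb up to ⌊(d-1)/2⌋ wrong binary predictions and still decode correctly. The sketch compares the 5-bit code from the ECOC picture slides with a hypothetical 7-bit code; all names are illustrative.

```python
from itertools import combinations

# 5-bit code as interpreted from the deck's ECOC picture (an assumption).
SLIDE_CODE = {"A": "00110", "B": "10100", "C": "01110", "D": "01001"}
# Hypothetical longer code for comparison: more classifiers, more separation.
LONG_CODE = {"A": "1111111", "B": "0000111", "C": "0011001", "D": "0101010"}


def hamming(a, b):
    """Number of positions where two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


def min_row_distance(code):
    """Smallest Hamming distance between any two codewords."""
    return min(hamming(a, b) for a, b in combinations(code.values(), 2))


def guaranteed_correctable(code):
    """Number of wrong bits the code is guaranteed to tolerate."""
    return (min_row_distance(code) - 1) // 2


for name, code in (("slide", SLIDE_CODE), ("long", LONG_CODE)):
    n_bits = len(next(iter(code.values())))
    print(name, n_bits, min_row_distance(code), guaranteed_correctable(code))
```

The trade-off is visible directly: the longer code needs seven binary classifiers instead of five but separates every pair of classes by four bits, while the shorter code has a pair of codewords (A and C) that differ in a single bit.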

  13. Use unlabeled data with a large number of classes • How? • Use EM • Mixed Results • Think Again! • Use Co-Training • Disastrous Results • Think one more time

  14. Use Unlabeled data • Current learning algorithms using unlabeled data (EM, Co-Training) don’t work well with a large number of categories • ECOC works great with a large number of classes but there is no framework for using unlabeled data

  15. Use Unlabeled Data • ECOC decomposes multiclass problems into binary problems • Co-Training works great with binary problems • ECOC + Co-Train = Learn each binary problem in ECOC with Co-Training • Preliminary Results: Not so great! (very sensitive to initial labeled documents)
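
The combination can be sketched as below, under strong simplifying assumptions: each example has two numeric "views" (one number per view), and the binary learner is a toy nearest-centroid rule standing in for a real text classifier. None of this is the author's implementation; `fit_centroid`, `co_train`, and `co_train_ecoc` are hypothetical names.

```python
def fit_centroid(xs, ys):
    """Toy binary learner on one numeric feature; needs both classes present."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    mp, mn = sum(pos) / len(pos), sum(neg) / len(neg)

    def predict(x):
        margin = abs(x - mn) - abs(x - mp)  # > 0: closer to the positive centroid
        return (1 if margin > 0 else 0), abs(margin)

    return predict


def co_train(labeled, unlabeled, rounds=3, per_round=1):
    """labeled: [((view1, view2), y)], unlabeled: [(view1, view2)].

    In each round, a classifier trained on one view labels the unlabeled
    examples it is most confident about and adds them to the labeled set.
    """
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        for view in (0, 1):
            if not pool:
                break
            clf = fit_centroid([x[view] for x, _ in labeled],
                               [y for _, y in labeled])
            pool.sort(key=lambda x: clf(x[view])[1], reverse=True)
            for x in pool[:per_round]:
                labeled.append((x, clf(x[view])[0]))
            pool = pool[per_round:]
    return labeled


def co_train_ecoc(labeled, unlabeled, code, **kw):
    """Run co-training separately on each binary problem of the code matrix."""
    n_bits = len(next(iter(code.values())))
    return [co_train([(x, int(code[y][b])) for x, y in labeled], unlabeled, **kw)
            for b in range(n_bits)]


CODE = {"A": "00110", "B": "10100", "C": "01110", "D": "01001"}
labeled = [((0.0, 0.1), "A"), ((1.0, 0.9), "B"),
           ((0.2, 0.0), "C"), ((0.9, 1.0), "D")]
unlabeled = [(0.1, 0.2), (0.8, 0.9)]
per_bit = co_train_ecoc(labeled, unlabeled, CODE)
print([len(data) for data in per_bit])
```

Because each round trusts labels produced from whatever is currently labeled, a poor initial labeled set can propagate errors, which is consistent with the sensitivity reported on this slide.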

  16. What Next? • Use improved version of co-training (gradient descent) • Less prone to random fluctuations • Uses all unlabeled data at every iteration • Use Co-EM (Nigam & Ghani 2000) - hybrid of EM and Co-Training

  17. Work Plan • Collect Datasets • Codeword Assignment - 2 weeks • Learning Decoding - 1-2 weeks • Using Unlabeled Data - 2 weeks • Design Codes - 2 weeks • Project Write-up - 1 week

  18. Summary • Use ECOC for efficient text classification with a large number of categories • Reduce code length without sacrificing performance • Fix code length and Increase Performance • Generalize to domain-independent classification tasks involving a large number of categories
