Efficient Text Categorization with a Large Number of Categories Rayid Ghani KDD Project Proposal
Text Categorization
• Domains: Topics, Genres, Languages
• Numerous applications ($$$ making): Search Engines/Portals, Customer Service, Email Routing, …
How do people deal with a large number of classes?
• Use fast multiclass algorithms (Naïve Bayes) – builds one model per class
• Use binary classification algorithms (SVMs) and break an n-class problem into n binary problems
• What happens with a 1000-class problem?
• Can we do better?
ECOC to the Rescue!
• An n-class problem can be solved by solving only log2(n) binary problems
• More efficient than one-per-class: a 1000-class problem needs just ⌈log2 1000⌉ = 10 binary classifiers instead of 1000
• But does it actually perform better?
What is ECOC?
• Solve multiclass problems by decomposing them into multiple binary problems (Dietterich & Bakiri 1995)
• Use a base learner to learn each of the binary problems
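To make the decomposition concrete, here is a minimal Python sketch of the ECOC training and prediction loops. It assumes a generic scikit-learn-style binary learner with fit/predict; the code matrix, the make_learner factory, and the variable names are illustrative, not taken from the proposal:

```python
import numpy as np

def train_ecoc(X, y, code, make_learner):
    """Train one binary classifier per column of the code matrix.

    code: dict mapping class label -> tuple of bits (its codeword).
    make_learner: factory returning a fresh binary classifier with
                  fit/predict (e.g., Naive Bayes or a linear SVM).
    """
    n_bits = len(next(iter(code.values())))
    models = []
    for j in range(n_bits):
        # Relabel every example with the j-th bit of its class's codeword.
        y_bin = np.array([code[c][j] for c in y])
        model = make_learner()
        model.fit(X, y_bin)
        models.append(model)
    return models

def predict_ecoc(X, models, code):
    """Predict one bit per model, then decode to the nearest codeword."""
    bits = np.column_stack([m.predict(X) for m in models])
    classes = list(code)
    rows = np.array([code[c] for c in classes])
    # Hamming distance from each predicted bit vector to every codeword.
    dists = (bits[:, None, :] != rows[None, :, :]).sum(axis=2)
    return [classes[i] for i in dists.argmin(axis=1)]
```

With a well-separated code, a few wrong bits from individual classifiers can still decode to the correct class, which is the "error-correcting" part of the name.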
Training and Testing ECOC

[Figure: a 4-class, 5-bit code matrix; each column fj defines one binary learning problem]

Class  f1 f2 f3 f4 f5
A       0  0  1  1  0
B       1  0  1  0  0
C       0  1  1  1  0
D       0  1  0  0  1

Training: relabel the data according to each column and learn one binary classifier per column. Testing: run all five classifiers on a new document X to get a bit vector (here 1 1 1 1 0) and assign the class whose codeword is nearest in Hamming distance (here C, at distance 1).
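The decoding step in the figure can be verified in a few lines of Python. The codewords and the test bit vector for X come straight from the slide; nearest-codeword (Hamming) decoding is the standard rule from Dietterich & Bakiri (1995):

```python
# Codewords from the figure above: 4 classes, 5 binary functions f1..f5.
CODEWORDS = {
    "A": (0, 0, 1, 1, 0),
    "B": (1, 0, 1, 0, 0),
    "C": (0, 1, 1, 1, 0),
    "D": (0, 1, 0, 0, 1),
}

def hamming(u, v):
    """Number of bit positions where u and v disagree."""
    return sum(a != b for a, b in zip(u, v))

def decode(bits):
    """Assign the class whose codeword is closest to the predicted bits."""
    return min(CODEWORDS, key=lambda c: hamming(CODEWORDS[c], bits))

# The five classifiers output 1 1 1 1 0 for test document X.
print(decode((1, 1, 1, 1, 0)))  # -> "C" (distance 1; all others are >= 2)
```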
This Proposal

[Figure: Naïve Bayes and ECOC (as used in Berger 99) plotted on axes of Efficiency vs. Classification Performance]

Preliminary results: ECOC reduces the error of the Naïve Bayes classifier by 66% with NO increase in computational cost.
Proposed Solutions • Design codewords that minimize cost and maximize “performance” • Investigate the assignment of codewords to classes • Learn the decoding function • Incorporate unlabeled data into ECOC
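As one concrete illustration of the codeword-design bullet above, a simple baseline from the ECOC literature is to sample random code matrices and keep the one whose rows are farthest apart: a code with minimum pairwise Hamming distance d can correct up to ⌊(d−1)/2⌋ bit errors from the binary classifiers. This sketch is illustrative only; the cost/performance criterion the proposal will actually optimize is still open:

```python
import itertools
import random

def min_row_distance(code):
    """Smallest pairwise Hamming distance between any two codewords."""
    return min(
        sum(a != b for a, b in zip(r1, r2))
        for r1, r2 in itertools.combinations(code, 2)
    )

def random_code(n_classes, n_bits, n_trials=1000, seed=0):
    """Sample random code matrices; keep the most separated one."""
    rng = random.Random(seed)
    best, best_d = None, -1
    for _ in range(n_trials):
        code = [tuple(rng.randint(0, 1) for _ in range(n_bits))
                for _ in range(n_classes)]
        d = min_row_distance(code)
        if d > best_d:
            best, best_d = code, d
    return best, best_d

code, d = random_code(n_classes=4, n_bits=5)
print(d, code)  # larger minimum distance -> more correctable bit errors
```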
Use unlabeled data with a large number of classes
• How?
• Use EM → mixed results. Think again!
• Use Co-Training → disastrous results. Think one more time…
Use Unlabeled Data
• Current learning algorithms that use unlabeled data (EM, Co-Training) don't work well with a large number of categories
• ECOC works great with a large number of classes, but there is no framework for using unlabeled data with it
Use Unlabeled Data
• ECOC decomposes multiclass problems into binary problems
• Co-Training works great with binary problems
• ECOC + Co-Training = learn each binary problem in ECOC with Co-Training (sketched below)
• Preliminary results: not so great! (very sensitive to the initial labeled documents)
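A rough sketch of that combination, assuming each document has two feature "views" (as Co-Training requires) and a probabilistic binary base learner with fit/predict/predict_proba (e.g., Naïve Bayes). The helper names and the confidence-based selection heuristic are illustrative, not the proposal's exact procedure; ECOC would run one such loop per column of the code matrix:

```python
import numpy as np

def cotrain_binary(Xa, Xb, y, Ua, Ub, make_learner, rounds=10, k=2):
    """Co-train two classifiers, one per view, on a single binary
    problem (one column of the ECOC code matrix).

    Xa, Xb: labeled examples in views A and B; y: their 0/1 bit labels.
    Ua, Ub: the same unlabeled pool, represented in each view.
    """
    Xa, Xb, y = list(Xa), list(Xb), list(y)
    pool = list(range(len(Ua)))  # indices of still-unlabeled documents
    for _ in range(rounds):
        ma, mb = make_learner(), make_learner()
        ma.fit(np.array(Xa), np.array(y))
        mb.fit(np.array(Xb), np.array(y))
        if not pool:
            break
        # Each view self-labels the k unlabeled documents it is most
        # confident about; those move, with their guessed bit labels,
        # into the shared labeled set for the next round.
        for model, U in ((ma, Ua), (mb, Ub)):
            if not pool:
                break
            cand = np.array([U[i] for i in pool])
            probs = model.predict_proba(cand)
            preds = model.predict(cand)
            best = np.argsort(probs.max(axis=1))[-k:]
            for j in sorted(best.tolist(), reverse=True):
                i = pool.pop(j)
                Xa.append(Ua[i]); Xb.append(Ub[i])
                y.append(int(preds[j]))
    return ma, mb
```

The sensitivity noted above shows up here directly: the few initial labeled documents drive the first round of self-labeling, and any early mistakes get amplified in later rounds.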
What Next?
• Use an improved version of Co-Training (gradient descent)
  – Less prone to random fluctuations
  – Uses all unlabeled data at every iteration
• Use Co-EM (Nigam & Ghani 2000) – a hybrid of EM and Co-Training
Work Plan
• Collect datasets
• Codeword assignment – 2 weeks
• Learning the decoding function – 1–2 weeks
• Using unlabeled data – 2 weeks
• Designing codes – 2 weeks
• Project write-up – 1 week
Summary
• Use ECOC for efficient text classification with a large number of categories
• Reduce code length without sacrificing performance
• Fix code length and increase performance
• Generalize to domain-independent classification tasks involving a large number of categories