
Combining labeled and unlabeled data for text categorization with a large number of categories

Presentation Transcript


  1. Combining labeled and unlabeled data for text categorization with a large number of categories Rayid Ghani KDD Lab Project

  2. Supervised Learning with Labeled Data Labeled data is required in large quantities and can be very expensive to collect.

  3. Why use Unlabeled data? • Very Cheap in the case of text • Web Pages • Newsgroups • Email Messages • May not be equally useful as labeled data but is available in enormous quantities

  4. Goal • Make learning more efficient and easy by reducing the amount of labeled data required for text classification with a large number of categories

  5. • ECOC is very accurate and efficient for text categorization with a large number of classes • Co-Training is useful for combining labeled and unlabeled data with a small number of classes

  6. Related research with unlabeled data • Using EM in a generative model (Nigam et al. 1999) • Transductive SVMs (Joachims 1999) • Co-Training type algorithms (Blum & Mitchell 1998, Collins & Singer 1999, Nigam & Ghani 2000)

  7. What is ECOC? • Solve multiclass problems by decomposing them into multiple binary problems (Dietterich & Bakiri 1995) • Use a learner to learn the binary problems

  8. Training and Testing ECOC [Diagram: during training, each class A, B, C, D is assigned a binary codeword over the binary functions f1 to f5 and one classifier is learned per bit; during testing, a new example X is classified by every fi and the resulting bit string is matched to the nearest class codeword.]
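
To make the decomposition concrete, here is a minimal sketch of ECOC training and decoding in Python, assuming scikit-learn's MultinomialNB as the binary learner and dense document-term matrices; the 4-class, 5-bit code matrix, the class names, and the function names are illustrative, not the codes used in the presentation.

```python
# Sketch of ECOC for multiclass text classification, as on the slides above.
# Assumes dense document-term matrices and scikit-learn's MultinomialNB;
# the 4-class, 5-bit code matrix below is only an example.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def train_ecoc(X, y, classes, code_matrix, make_learner=MultinomialNB):
    """Train one binary classifier per column (bit) of the code matrix."""
    class_index = {c: i for i, c in enumerate(classes)}
    learners = []
    for bit in range(code_matrix.shape[1]):
        # Relabel every training example with the bit its class has in this column.
        binary_y = np.array([code_matrix[class_index[label], bit] for label in y])
        learners.append(make_learner().fit(X, binary_y))
    return learners

def predict_ecoc(X, classes, code_matrix, learners):
    """Predict a bit string per example; return the class with the nearest codeword."""
    bits = np.column_stack([clf.predict(X) for clf in learners])
    # Hamming distance from each predicted bit string to every class codeword.
    distances = (bits[:, None, :] != code_matrix[None, :, :]).sum(axis=2)
    return [classes[i] for i in distances.argmin(axis=1)]

# Example: 4 classes (A-D) encoded over 5 binary problems f1..f5.
classes = ["A", "B", "C", "D"]
code_matrix = np.array([[0, 0, 1, 1, 0],
                        [1, 0, 0, 1, 1],
                        [0, 1, 1, 0, 1],
                        [1, 1, 0, 0, 0]])
```

Longer codewords add redundancy, which is what lets the decoding step correct a few mistaken binary classifiers.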

  9. The Co-training algorithm [Blum & Mitchell 1998] • Loop (while unlabeled documents remain): • Build classifiers A and B • Use Naïve Bayes • Classify unlabeled documents with A & B • Use Naïve Bayes • Add most confident A predictions and most confident B predictions as labeled training examples

  10. The Co-training Algorithm [Blum & Mitchell, 1998] [Diagram: learn a Naïve Bayes classifier on feature set A and another on feature set B from the labeled data; each estimates labels for the unlabeled documents; the most confident predictions are selected and added to the labeled data, and the loop repeats.]
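
The loop on these two slides can be sketched roughly as follows, assuming two dense feature views (numpy arrays) and scikit-learn's MultinomialNB; the per-round pool handling, round count, and confidence selection are simplifications of Blum & Mitchell's class-balanced growth rates, and the names are illustrative.

```python
# Rough sketch of the co-training loop [Blum & Mitchell 1998] described above.
# Assumes the features are already split into two dense views (e.g. job title
# vs. job description); pool sizes and round counts are illustrative.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_train(Xa_lab, Xb_lab, y_lab, Xa_unlab, Xb_unlab, per_round=5, rounds=30):
    Xa_lab, Xb_lab, y_lab = list(Xa_lab), list(Xb_lab), list(y_lab)
    unlabeled = list(range(len(Xa_unlab)))
    clf_a, clf_b = MultinomialNB(), MultinomialNB()
    for _ in range(rounds):
        if not unlabeled:
            break
        # Build classifiers A and B (Naive Bayes) from the current labeled set.
        clf_a.fit(np.array(Xa_lab), y_lab)
        clf_b.fit(np.array(Xb_lab), y_lab)
        # Each classifier labels the remaining unlabeled documents; its most
        # confident predictions become new labeled training examples.
        for clf, X_un in ((clf_a, Xa_unlab), (clf_b, Xb_unlab)):
            if not unlabeled:
                break
            probs = clf.predict_proba(X_un[unlabeled])
            best = np.argsort(probs.max(axis=1))[-per_round:]
            for j in best:
                i = unlabeled[j]
                Xa_lab.append(Xa_unlab[i])
                Xb_lab.append(Xb_unlab[i])
                y_lab.append(clf.classes_[probs[j].argmax()])
            unlabeled = [u for k, u in enumerate(unlabeled) if k not in set(best)]
    return clf_a, clf_b
```

In the original algorithm the added examples are balanced across classes and drawn from a smaller random pool of unlabeled documents; the version above simply takes the globally most confident predictions from each view.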

  11. One Intuition behind co-training • A and B are redundant • A features independent of B features • Co-training like learning with random classification noise • Most confident A prediction gives random B • Small misclassification error for A

  12. ECOC + CoTraining = ECoTrain • ECOC decomposes multiclass problems into binary problems • Co-Training works great with binary problems • ECOC + Co-Train = Learn each binary problem in ECOC with Co-Training
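
A hedged sketch of that combination: each binary problem induced by a column of the ECOC code matrix is learned with co-training over the two feature views instead of with a single supervised learner. It reuses the code-matrix convention from the ECOC sketch above, and `co_train_fn` can be the co_train helper sketched after slide 10; the prediction-time averaging of the two views is an illustrative choice, not the presentation's implementation.

```python
# Sketch of ECOC + Co-Training ("ECoTrain"): learn each ECOC binary problem
# with co-training over two feature views. `co_train_fn` can be the co_train
# sketch shown earlier; all names here are illustrative.
import numpy as np

def train_ecotrain(Xa_lab, Xb_lab, y_lab, Xa_unlab, Xb_unlab,
                   classes, code_matrix, co_train_fn):
    class_index = {c: i for i, c in enumerate(classes)}
    column_models = []
    for bit in range(code_matrix.shape[1]):
        # Relabel the labeled data with this column's bit, then co-train on it,
        # using the same pool of unlabeled documents for every column.
        binary_y = [code_matrix[class_index[label], bit] for label in y_lab]
        clf_a, clf_b = co_train_fn(Xa_lab, Xb_lab, binary_y, Xa_unlab, Xb_unlab)
        column_models.append((clf_a, clf_b))
    return column_models

def predict_ecotrain(Xa, Xb, classes, code_matrix, column_models):
    # Average the two views' P(bit = 1), threshold to get a bit string, and
    # assign the class whose codeword is nearest in Hamming distance.
    bits = np.column_stack([
        np.round((clf_a.predict_proba(Xa)[:, 1] + clf_b.predict_proba(Xb)[:, 1]) / 2)
        for clf_a, clf_b in column_models])
    distances = (bits[:, None, :] != code_matrix[None, :, :]).sum(axis=2)
    return [classes[i] for i in distances.argmin(axis=1)]
```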

  13. [Figure: example categories for a multiclass problem: SPORTS, SCIENCE, ARTS, POLITICS, HEALTH, LAW]

  14. What happens with sparse data?

  15. Datasets • Hoovers-255 • Collection of 4285 corporate websites • Each company is classified into one of 255 categories • Baseline 2% • Jobs-65 (from WhizBang) • Job Postings (Two feature sets – Title, Description) • 65 categories • Baseline 11%

  16. Results

      Dataset       NB 10%   NB 100%   ECOC 10%   ECOC 100%   EM 10%   Co-Train 10%   ECOC+Co-Train 10%
      Jobs-65       50.1     68.2      59.3       71.2        58.2     54.1           64.5
      Hoovers-255   15.2     32.0      24.8       36.5         9.1     10.2           27.6

      (NB = Naïve Bayes with no unlabeled data; ECOC = ECOC with no unlabeled data;
       the percentage is the fraction of the training data that is labeled.)

  17. Results

  18. What Next? • Use improved version of co-training (gradient descent) • Less prone to random fluctuations • Uses all unlabeled data at every iteration • Use Co-EM (Nigam & Ghani 2000) - hybrid of EM and Co-Training

  19. Summary

  20. The Co-training setting [Diagram: example web pages (Tom Mitchell: "Fredkin Professor of AI…", Avrim Blum: "My research interests are…", Johnny: "I like horsies!") together with the anchor text of hyperlinks pointing to them ("…My advisor…", "…Professor Blum…", "…My grandson…"); one view is given to Classifier A and the other to Classifier B.]

  21. Learning from Labeled and Unlabeled Data: Using Feature Splits • Co-training [Blum & Mitchell 98] • Meta-bootstrapping [Riloff & Jones 99] • coBoost [Collins & Singer 99] • Unsupervised WSD [Yarowsky 95] • Consider this the co-training setting

  22. Learning from Labeled and Unlabeled Data: Extending supervised learning • MaxEnt Discrimination [Jaakkola et al. 99] • Expectation Maximization [Nigam et al. 98] • Transductive SVMs [Joachims 99]

  23. Using Unlabeled Data with EM [Diagram: a Naïve Bayes classifier estimates labels for the unlabeled documents, then all documents are used to build a new Naïve Bayes classifier, and the cycle repeats.]
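
A compact sketch of this E-step / M-step cycle with scikit-learn's MultinomialNB, assuming dense feature matrices; representing the probabilistic labels through sample weights is a simplification of the full generative-model formulation in Nigam et al., and the variable names are hypothetical.

```python
# Sketch of semi-supervised EM with Naive Bayes, as depicted on the slide above.
# Assumes dense document-term matrices; soft labels are passed as sample weights.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def em_naive_bayes(X_lab, y_lab, X_unlab, iterations=10):
    classes = np.unique(y_lab)
    clf = MultinomialNB().fit(X_lab, y_lab)   # start from the labeled data only
    for _ in range(iterations):
        # E-step: estimate probabilistic labels for the unlabeled documents.
        probs = clf.predict_proba(X_unlab)
        # M-step: rebuild the classifier from all documents, giving each
        # unlabeled document one weighted copy per class.
        X_all = np.vstack([X_lab] + [X_unlab] * len(classes))
        y_all = np.concatenate([np.asarray(y_lab)] +
                               [np.full(X_unlab.shape[0], c) for c in classes])
        w_all = np.concatenate([np.ones(len(y_lab))] +
                               [probs[:, k] for k in range(len(classes))])
        clf = MultinomialNB().fit(X_all, y_all, sample_weight=w_all)
    return clf
```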

  24. Co-training vs. EM: which differences matter? Co-training uses the feature split, labels incrementally, and assigns hard labels; EM ignores the feature split, labels iteratively, and assigns probabilistic labels.

  25. Hybrids of Co-training and EM [Diagram: the hybrids are laid out by whether they use the feature split (separate Naïve Bayes classifiers on A and on B vs. a single Naïve Bayes on A & B) and by their labeling strategy (label all documents and learn from all vs. add only the best predictions).]

  26. Learning from Unlabeled Data using Feature Splits • coBoost [Collins & Singer 99] • Meta-bootstrapping [Riloff & Jones 99] • Unsupervised WSD [Yarowsky 95] • Co-training [Blum & Mitchell 98]

  27. Intuition behind Co-training • A and B are redundant • A features independent of B features • Co-training like learning with random classification noise • Most confident A prediction gives random B • Small misclassification error for A

  28. Using Unlabeled Data with EM [Nigam, McCallum, Thrun & Mitchell, 1998] [Diagram: initially learn a Naïve Bayes classifier from the labeled data only; estimate labels for the unlabeled documents; then use all documents to rebuild the Naïve Bayes classifier, and repeat.]

  29. Co-EM [Diagram: with a feature split, labeling a few documents per iteration gives co-training and labeling all of them gives co-EM; without a feature split, labeling all gives EM. Co-EM initializes with the labeled data and then alternates between the views: build a Naïve Bayes classifier on A with all data and estimate labels, then build a Naïve Bayes classifier on B with all data and estimate labels.]
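
A loose sketch of that alternation, again assuming two dense feature views and MultinomialNB; using hard predicted labels in the relabeling step is a simplification, since co-EM proper propagates probabilistic labels, and the round count and names are illustrative.

```python
# Sketch of co-EM (cf. Nigam & Ghani 2000): like co-training it respects the
# feature split, like EM it relabels all unlabeled documents every round.
# Hard labels are used here for brevity instead of probabilistic ones.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_em(Xa_lab, Xb_lab, y_lab, Xa_unlab, Xb_unlab, rounds=10):
    # Initialize classifier A from the labeled data only.
    clf_a = MultinomialNB().fit(Xa_lab, y_lab)
    clf_b = MultinomialNB()
    for _ in range(rounds):
        # A labels all unlabeled documents; B learns from labeled + A-labeled data.
        y_hat = clf_a.predict(Xa_unlab)
        clf_b.fit(np.vstack([Xb_lab, Xb_unlab]),
                  np.concatenate([np.asarray(y_lab), y_hat]))
        # B labels all unlabeled documents; A learns from labeled + B-labeled data.
        y_hat = clf_b.predict(Xb_unlab)
        clf_a.fit(np.vstack([Xa_lab, Xa_unlab]),
                  np.concatenate([np.asarray(y_lab), y_hat]))
    return clf_a, clf_b
```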
