Cross-training: Learning probabilistic relations between taxonomies


  Cross-training: Learning probabilistic relations between taxonomies. Sunita Sarawagi, Soumen Chakrabarti, Shantanu Godbole, IIT Bombay

  2. Document classification
     • Set of labels A
       • Bookmark folders, Yahoo topics
     • Training documents, each with one A label
     • Supervised approach
       • Use training docs to induce classifier
       • Invoke classifier on each unlabeled document in isolation
     • Semi-supervised approach
       • Unlabeled documents available during training
       • Nigam et al. show how to exploit them collectively
     KDD2003

  3. Cross-training from another taxonomy
     • Another set of labels B, partly related but not identical to A
       • A = Dmoz topics, B = Yahoo topics
       • A = Personal bookmark topics, B = Yahoo topics
     • Training docs come in two flavors now
       • Fully labeled with A and B labels (rare)
       • Half-labeled with either an A or a B label
     • Can B make classification for A more accurate (and vice versa)?
       • Inductive transfer, multi-task learning
     [Figure: document sets DA and DB]

  4. Motivation
     • Symmetric taxonomy mapping
       • Ecommerce catalogs: A = distributor, B = retailer
       • Web directories: A = Dmoz, B = Yahoo
     • Incomplete taxonomies, small training sets
       • Bookmark taxonomy vs. Yahoo
     • Cartesian label spaces
       • Label-pair-conditioned term distribution
     [Figure: Cartesian label space over Topic and Region; node labels shown: Top, Sports, Regional, Baseball, Cricket, UK, USA, …]

  5. Obvious approach: Labels as features
     • A-label known, estimate B-label
     • Suppose we have A+B labeled training set
       • Discrete valued “label column”
       • Multinomial naïve Bayes too biased, cannot balance heterogeneous features
     • Do not have fully-labeled data
       • Must guess the label column (use soft scores instead of 0/1)
     [Figure: augmented feature vector = term feature values + label column; target label]

  6. SVM-CT: Cross-trained SVM
     [Figure: SVM-CT training and test pipeline, summarized below]
     • Train S(A,0), a one-vs-rest SVM ensemble for A, on DA–DB (docs having only A-labels)
     • S(A,0) returns |A| scores for each test doc (signed distance from separator)
     • Test S(A,0) on DB–DA (docs having only B-labels); each doc's text features t are extended with its |A| scores
     • Train S(B,1), a one-vs-rest SVM ensemble for B (target label set), on the extended vectors
     • Test case with A-label known: label coded using a vector of +1 and –1 (–1,…,–1,+1,–1,…)
     • Continue with S(A,1), S(B,2), S(A,2), …
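The feature augmentation on this slide can be sketched in a few lines. This is a minimal illustration only, not the authors' implementation: the label names, term values and SVM scores are made-up placeholders, and the helper names `encode_known_label` and `augment` are hypothetical. It shows the two ways the |A| extra features are filled in: signed SVM distances when the A-label is unknown, and a +1/–1 coding when it is known.

```python
from typing import List


def encode_known_label(label: str, labels_a: List[str]) -> List[float]:
    """Code a known A-label as +1 for that label and -1 for every other A-label."""
    return [1.0 if a == label else -1.0 for a in labels_a]


def augment(term_features: List[float], a_block: List[float]) -> List[float]:
    """Append the |A| label-derived features to the term features."""
    return list(term_features) + list(a_block)


labels_a = ["Sports", "Arts", "Science"]  # hypothetical A label set
terms = [0.2, 0.0, 0.7]                   # hypothetical term feature values

# Doc with only a B-label: its A-label is unknown, so use the |A| signed
# distances that an A-ensemble such as S(A,0) would return (made-up numbers).
svm_a_scores = [-1.3, 0.4, -0.8]
train_vec = augment(terms, svm_a_scores)

# Test doc whose A-label is known: use the +1/-1 coding instead of scores.
test_vec = augment(terms, encode_known_label("Arts", labels_a))

print(train_vec)
print(test_vec)
```

Both vectors have the same length, so a B-classifier trained on score-augmented vectors can be applied to label-coded test vectors.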

  7. SVM-CT anecdotes
     • Discriminant reveals relations between A and B
       • One-to-one, many-to-one, related, antagonistic
     • However, accuracy gains are meager
     [Figure panels: Positive, Negative]

  8. EM1D: Info from unlabeled docs
     • Use training docs to induce initial classifier for taxonomy B, say
     • Repeat until classifier satisfactory
       • Estimate Pr(β|d) for unlabeled doc d, β ∈ B
       • Reweight d by factor Pr(β|d) and add to training set for label β
       • Retrain classifier
     • EM1D: Expectation maximization with one label set B (Nigam et al.)
     • Ignores labels from another taxonomy A
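The loop above can be sketched as follows. This is a minimal sketch that assumes a multinomial naïve Bayes base classifier with Laplace smoothing and a fixed number of iterations; the function names, the `unlabeled_weight` damping parameter and the toy documents are illustrative assumptions, not taken from the talk.

```python
import math
from typing import Dict, List, Tuple

Doc = Dict[str, int]        # term -> count
Weights = Dict[str, float]  # label -> fractional membership of a doc


def train_nb(weighted_docs: List[Tuple[Doc, Weights]], labels: List[str], vocab: List[str]):
    """Fit log priors and log term probabilities from fractionally labeled docs."""
    class_mass = {b: 1.0 for b in labels}                      # Laplace smoothing
    term_mass = {b: {t: 1.0 for t in vocab} for b in labels}   # Laplace smoothing
    for doc, weights in weighted_docs:
        for b, w in weights.items():
            class_mass[b] += w
            for t, n in doc.items():
                term_mass[b][t] += w * n
    total = sum(class_mass.values())
    log_prior = {b: math.log(class_mass[b] / total) for b in labels}
    log_theta = {}
    for b in labels:
        z = sum(term_mass[b].values())
        log_theta[b] = {t: math.log(m / z) for t, m in term_mass[b].items()}
    return log_prior, log_theta


def posterior(doc: Doc, model, labels: List[str]) -> Dict[str, float]:
    """Return Pr(label | doc); terms outside the training vocabulary are skipped."""
    log_prior, log_theta = model
    scores = {
        b: log_prior[b] + sum(n * log_theta[b][t] for t, n in doc.items() if t in log_theta[b])
        for b in labels
    }
    top = max(scores.values())
    unnorm = {b: math.exp(s - top) for b, s in scores.items()}
    z = sum(unnorm.values())
    return {b: v / z for b, v in unnorm.items()}


def em1d(labeled, unlabeled, labels, iterations=10, unlabeled_weight=1.0):
    """Semi-supervised EM over a single label set (illustrative sketch)."""
    vocab = sorted({t for d, _ in labeled for t in d} | {t for d in unlabeled for t in d})
    hard = [(d, {b: 1.0}) for d, b in labeled]
    model = train_nb(hard, labels, vocab)            # initial classifier
    for _ in range(iterations):
        # E-step: estimate Pr(label | d) for every unlabeled doc.
        soft = [
            (d, {b: unlabeled_weight * p for b, p in posterior(d, model, labels).items()})
            for d in unlabeled
        ]
        # M-step: retrain on labeled docs plus reweighted unlabeled docs.
        model = train_nb(hard + soft, labels, vocab)
    return model


labels = ["sports", "arts"]
labeled = [({"ball": 3, "bat": 1}, "sports"), ({"paint": 2, "canvas": 2}, "arts")]
unlabeled = [{"ball": 1, "goal": 2}, {"canvas": 1, "brush": 3}]
model = em1d(labeled, unlabeled, labels)
print(posterior({"goal": 2}, model, labels))
```

In this toy run the term "goal" never appears in a labeled document, yet it ends up associated with "sports" because it co-occurs with "ball" in an unlabeled document.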

  9. Stratified EM1D
     • Target labels = B
     • B-labeled docs are labeled training instances
     • Consider A-labeled docs labeled α
       • These are unlabeled for taxonomy B
     • Run EM1D for each row α
     • Test instance has α known
       • Invoke semi-supervised model for row α to classify
     • EM2D minus 2D model interaction
     [Figure: grid of A topics by B-topics; DB–DA: docs with B-labels; docs in DA–DB labeled α; docs in DA–DB labeled α’]

  10. EM2D: Cartesian product EM
      • Initialize with fully labeled docs which go to a specific (α,β) cell
      • Smear training doc across label row or column
        • Uniform smear could be bad
        • Use a naïve Bayes classifier to seed
      • Parameters extended from EM1D
        • π_{α,β}, prior probability for label pair (α,β)
        • θ_{α,β,t}, multinomial term probability for (α,β)
      [Figure: grid of labels in A by labels in B, showing an A-labeled doc and a B-labeled doc]

  11. EM2D updates
      • E-step for an A-labeled document
      • M-step
        • Updated class-pair priors
        • Updated class-pair-conditioned term stats
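The update equations themselves did not survive in this transcript. As a hedged reconstruction, the standard EM updates for a multinomial mixture over label pairs would take roughly the following form. Every symbol here is an assumption: α_d is the known A-label of document d, n(d,t) is the count of term t in d, π and θ are the class-pair prior and term parameters, D is the training set, T is the vocabulary, and the Laplace smoothing terms are an illustrative choice rather than something stated on the slides.

```latex
% E-step for an A-labeled document d with known label \alpha_d:
% distribute d over the cells of row \alpha_d.
\Pr(\beta \mid d, \alpha_d)
  = \frac{\pi_{\alpha_d,\beta} \prod_{t} \theta_{\alpha_d,\beta,t}^{\,n(d,t)}}
         {\sum_{\beta'} \pi_{\alpha_d,\beta'} \prod_{t} \theta_{\alpha_d,\beta',t}^{\,n(d,t)}}

% Cell membership used in the M-step (zero outside row \alpha_d):
\Pr(\alpha,\beta \mid d) =
  \begin{cases}
    \Pr(\beta \mid d, \alpha_d) & \text{if } \alpha = \alpha_d \\
    0 & \text{otherwise}
  \end{cases}

% M-step: updated class-pair priors.
\pi_{\alpha,\beta}
  = \frac{1 + \sum_{d \in D} \Pr(\alpha,\beta \mid d)}{|A|\,|B| + |D|}

% M-step: updated class-pair-conditioned term statistics.
\theta_{\alpha,\beta,t}
  = \frac{1 + \sum_{d \in D} \Pr(\alpha,\beta \mid d)\, n(d,t)}
         {|T| + \sum_{\tau \in T} \sum_{d \in D} \Pr(\alpha,\beta \mid d)\, n(d,\tau)}
```

The E-step for a B-labeled document is symmetric: the known label fixes a column, and the document is distributed over the A-labels in that column.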

  12. Applying EM2D to a test doc
      • Mapping a B-labeled test doc d to an A label (e-commerce catalogs)
        • Given β, find argmax_α Pr(α,β|d)
      • Classifying a document d with no labels to an A label
        • Aggregation
          • For each α compute Σ_β Pr(α,β|d), pick best α
        • Guessing (EM2D-G)
          • Guess the best β* using a B-classifier
          • Find argmax_α Pr(α,β*|d)
      • EM pitfalls: damping factor, early stopping
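The three decision rules on this slide differ only in how they read the joint posterior over label pairs. The sketch below assumes that posterior is already available as a dictionary keyed by (A-label, B-label); the function names and the toy numbers are hypothetical and chosen so that aggregation and guessing disagree.

```python
from typing import Dict, Tuple

Joint = Dict[Tuple[str, str], float]  # (a_label, b_label) -> Pr(a, b | d)


def map_given_b(joint: Joint, b_label: str) -> str:
    """B-label known: pick the A-label maximizing Pr(a, b | d) within that column."""
    column = {a: p for (a, b), p in joint.items() if b == b_label}
    return max(column, key=column.get)


def classify_by_aggregation(joint: Joint) -> str:
    """No labels known: sum Pr(a, b | d) over all B-labels, pick the best A-label."""
    totals: Dict[str, float] = {}
    for (a, _b), p in joint.items():
        totals[a] = totals.get(a, 0.0) + p
    return max(totals, key=totals.get)


def classify_by_guessing(joint: Joint, b_scores: Dict[str, float]) -> str:
    """No labels known: guess the best B-label first, then map within its column."""
    b_star = max(b_scores, key=b_scores.get)
    return map_given_b(joint, b_star)


# Toy joint posterior for one document (made-up numbers that sum to 1).
joint = {
    ("a1", "b1"): 0.30, ("a1", "b2"): 0.10,
    ("a2", "b1"): 0.15, ("a2", "b2"): 0.45,
}
b_scores = {"b1": 0.7, "b2": 0.3}  # output of a separate, hypothetical B-classifier

print(map_given_b(joint, "b2"))
print(classify_by_aggregation(joint))
print(classify_by_guessing(joint, b_scores))
```

Here aggregation favors a2 (0.60 against 0.40), while guessing commits to b1 and then picks a1, which illustrates why the two strategies are evaluated separately.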

  13. Experiments
      • Selected 5 Dmoz and Yahoo subtree pairs
      • Compare EM2D against
        • Naïve Bayes, best #features and smoothing
        • EM1D: ignore labels from other taxonomy, consider as unlabeled docs
        • Stratified EM1D
      • Mapping test doc with A-label to B-label or vice versa
      • Classifying zero-labeled test doc
      • Accuracy = fraction with correct labels

  14. Accuracy benefits in mapping
      • Improvement over NB: 30% best, 10% average
      • EM1D and NB are close, because training set sizes for each taxonomy are not too small
      • EM2D > Stratified EM1D > NB
      • 2D transfer of model info seems important

  15. Asymmetric setting
      • Few (only 300) bookmarked URLs (taxonomy B, target)
      • Many Yahoo URLs, larger number of classes (taxonomy A)
      • Need to control damping factor (= importance of labeled : unlabeled) to tackle population skew

  16. Zero-labeled test documents
      • EM1D improves accuracy only for 12 train docs
      • EM2D with guessing improves beyond EM1D
        • In fact, better than aggregating scores to 1D
      • Choice of unlabeled:labeled damping ratio L may be important to get benefits

  17. Robustness to initialization
      • Seeding choices: hard (best class), NB scores, uniform
      • Smear a fraction uniformly, rest by NB scores
      • EM2D is robust to wide range of smear fractions
      • Fully uniform smearing can fail (local optima)
      [Figure labels: naïve Bayes smear, uniform smear]
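One way to read "smear a fraction uniformly, rest by NB scores" is as a convex mix of a uniform distribution and normalized naïve Bayes scores over the cells of a half-labeled document's row or column. The sketch below works under that assumption; the function name `smear`, the parameter `uniform_fraction` and the example scores are hypothetical.

```python
from typing import Dict


def smear(nb_scores: Dict[str, float], uniform_fraction: float) -> Dict[str, float]:
    """Spread one unit of document mass over the candidate labels.

    A share `uniform_fraction` is spread evenly; the rest follows the
    (normalized) naive Bayes scores. 0.0 is a pure NB smear, 1.0 is fully uniform.
    """
    if not 0.0 <= uniform_fraction <= 1.0:
        raise ValueError("uniform_fraction must lie in [0, 1]")
    k = len(nb_scores)
    z = sum(nb_scores.values())
    return {
        label: uniform_fraction / k + (1.0 - uniform_fraction) * score / z
        for label, score in nb_scores.items()
    }


# Made-up NB scores for the unknown label of a half-labeled doc.
nb_scores = {"b1": 0.8, "b2": 0.2}
for fraction in (0.0, 0.5, 1.0):
    print(fraction, smear(nb_scores, fraction))
```

Intermediate fractions keep some of the classifier's preference while leaving every cell a nonzero share, which fits the slide's observation that only the fully uniform extreme is risky.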

  18. Related work
      • Multi-task learning, “life-long learning”, inductive transfer (Thrun, Caruana)
        • Find earlier learning tasks similar to current
        • Reuse models, features, parameters
      • Co-training (Blum, Mitchell)
        • Two learners over a single label set
        • Partitioned feature set
      • Catalog mapping (Agrawal, Srikant)
        • Two-label docs to estimate priors
        • Raise prior to exponent, tune by validation
        • EM2D: generative model, slightly better accuracy

  19. Summary and future work
      • Two algorithms for cross-training
        • EM-based semi-supervised algorithm EM2D
        • SVM-based algorithm SVM-CT
      • Benefits
        • Improved accuracy
        • Interpretable mappings between label sets
      • General issue: how best to deal with a large number of heterogeneous attributes?
      • Future work
        • Brittle naïve Bayes scores in EM2D
        • Small relative gains in SVM-CT: better kernels? feature selection?

