
Cross-training: Learning probabilistic relations between taxonomies

Presentation Transcript


  1. Cross-training: Learning probabilistic relations between taxonomies. Sunita Sarawagi, Soumen Chakrabarti, Shantanu Godbole. IIT Bombay

  2. Document classification
  • Set of labels A
    • Bookmark folders, Yahoo topics
  • Training documents, each with one A label
  • Supervised approach
    • Use training docs to induce a classifier
    • Invoke the classifier on each unlabeled document in isolation
  • Semi-supervised approach
    • Unlabeled documents available during training
    • Nigam et al. show how to exploit them collectively
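To ground the supervised baseline, here is a minimal sketch using scikit-learn's multinomial naïve Bayes (the base learner the deck relies on later); the toy documents, labels, and variable names are illustrative, not from the paper.

    # Minimal supervised baseline: induce a classifier from A-labeled
    # training docs, then classify each unlabeled doc in isolation.
    # Toy data and names are illustrative, not from the paper.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    train_docs = ["the batsman scored a century",
                  "new browser release announced"]
    train_labels = ["Sports", "Computers"]   # one A-label per training doc

    vec = CountVectorizer()
    clf = MultinomialNB().fit(vec.fit_transform(train_docs), train_labels)

    # Each unlabeled document is classified on its own, ignoring the others.
    print(clf.predict(vec.transform(["the bowler took five wickets"])))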

  3. Cross-training from another taxonomy
  • Another set of labels B, partly related but not identical to A
    • A = Dmoz topics, B = Yahoo topics
    • A = personal bookmark topics, B = Yahoo topics
  • Training docs come in two flavors now
    • Fully labeled with both A and B labels (rare)
    • Half-labeled with either an A or a B label (the document pools DA and DB)
  • Can B make classification for A more accurate (and vice versa)?
  • This is a form of inductive transfer, or multi-task learning

  4. Motivation
  • Symmetric taxonomy mapping
    • E-commerce catalogs: A = distributor, B = retailer
    • Web directories: A = Dmoz, B = Yahoo
  • Incomplete taxonomies, small training sets
    • Bookmark taxonomy vs. Yahoo
  • Cartesian label spaces
  [Figure: a Cartesian label space crossing a Topic taxonomy (Sports: Baseball, Cricket, …) with a Region taxonomy (UK, USA, …); each label pair gets its own label-pair-conditioned term distribution]

  5. Obvious approach: labels as features
  • A-label known, estimate the B-label
  • Suppose we have an A+B labeled training set
    • Append a discrete-valued "label column" carrying the known label to the term feature values, giving an augmented feature vector from which the target label is predicted
    • Multinomial naïve Bayes is too biased here: it cannot balance such heterogeneous features
  • But we do not have fully-labeled data
    • Must guess the label column (use soft scores instead of 0/1)
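A sketch of the soft-score variant, under the assumption that a B-classifier exposing scikit-learn's predict_proba interface is available; augment is a hypothetical helper and dense feature matrices are assumed.

    # "Labels as features", soft version: instead of a hard 0/1 label
    # column, append the B-classifier's soft scores Pr(beta|d) to each
    # document's term-feature vector.
    import numpy as np

    def augment(X_terms, b_classifier):
        """Term features plus |B| soft-score columns (X_terms assumed dense)."""
        soft = b_classifier.predict_proba(X_terms)   # shape (n_docs, |B|)
        return np.hstack([X_terms, soft])

    # An A-classifier is then trained on augment(X_train, b_classifier).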

  6. SVM-CT: cross-trained SVM
  • DA − DB: docs having only A-labels; DB − DA: docs having only B-labels
  • Train S(A,0), a one-vs-rest SVM ensemble for A, on text features; it returns |A| scores for each test doc (signed distance from the separator)
  • Train S(B,1), the one-vs-rest ensemble for B (the target label set), on term features augmented with those |A| scores; a test case whose A-label is known is coded with a vector of +1 and −1 (−1, …, −1, +1, −1, …)
  • The alternation continues: S(A,1), S(B,2), S(A,2), …
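A sketch of one round of the SVM-CT alternation, assuming dense feature matrices X_A, X_B with matching term columns and label vectors y_A, y_B; the helper names are ours, not the authors'.

    # Each one-vs-rest ensemble's signed distances from its separators
    # become extra feature columns for the next ensemble on the other
    # taxonomy.
    import numpy as np
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import LinearSVC

    def ovr(X, y):
        return OneVsRestClassifier(LinearSVC()).fit(X, y)

    def svm_ct_step(X_A, y_A, X_B, y_B):
        S_A0 = ovr(X_A, y_A)                  # S(A,0): term features only
        # S(B,1): B-ensemble sees term features plus |A| signed distances.
        X_B_aug = np.hstack([X_B, S_A0.decision_function(X_B)])
        S_B1 = ovr(X_B_aug, y_B)
        return S_A0, S_B1   # continuing symmetrically yields S(A,1), S(B,2), ...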

  7. SVM-CT anecdotes
  • The discriminant reveals relations between A and B labels
    • One-to-one, many-to-one, related, antagonistic (positive and negative weights)
  • However, accuracy gains are meager

  8. EM1D: info from unlabeled docs
  • EM1D: expectation maximization with one label set, say B (Nigam et al.)
  • Use training docs to induce an initial classifier for taxonomy B
  • Repeat until the classifier is satisfactory
    • Estimate Pr(β|d) for each unlabeled doc d and each β ∈ B
    • Reweigh d by the factor Pr(β|d) and add it to the training set for label β
    • Retrain the classifier
  • Ignores labels from the other taxonomy A
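A compact sketch of the EM1D loop with multinomial naïve Bayes: each unlabeled doc is added to every label's training set, weighted by the current posterior Pr(β|d). Dense count matrices are assumed, and the fixed iteration count stands in for the "until satisfactory" test.

    import numpy as np
    from sklearn.naive_bayes import MultinomialNB

    def em1d(X_lab, y_lab, X_unlab, n_iter=10):
        clf = MultinomialNB().fit(X_lab, y_lab)
        for _ in range(n_iter):
            post = clf.predict_proba(X_unlab)          # E-step: Pr(beta|d)
            X_parts = [X_lab]
            y_parts = [np.asarray(y_lab)]
            w_parts = [np.ones(X_lab.shape[0])]
            for j, beta in enumerate(clf.classes_):    # reweigh and re-add
                X_parts.append(X_unlab)
                y_parts.append(np.full(X_unlab.shape[0], beta))
                w_parts.append(post[:, j])
            clf = MultinomialNB().fit(np.vstack(X_parts),       # retrain
                                      np.concatenate(y_parts),
                                      sample_weight=np.concatenate(w_parts))
        return clf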

  9. Stratified EM1D
  • Target labels = B
  • The B-labeled docs (DB − DA) are labeled training instances
  • Consider the docs in DA − DB labeled α
    • These are unlabeled as far as taxonomy B is concerned
    • Run EM1D separately for each row α
  • A test instance arrives with its α known
    • Invoke the semi-supervised model for row α to classify it
  • In effect, EM2D minus the 2D model interaction
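Stratified EM1D can be written as a thin wrapper over the em1d sketch above: one semi-supervised B-model per A-label α. Here docs_by_alpha is an assumed mapping from each α to the feature matrix of its A-labeled (hence unlabeled-for-B) docs.

    def stratified_em1d(X_B, y_B, docs_by_alpha):
        models = {alpha: em1d(X_B, y_B, X_alpha)
                  for alpha, X_alpha in docs_by_alpha.items()}
        # At test time the instance's alpha is known, so dispatch on it:
        #   models[alpha].predict(x_test)
        return models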

  10. EM2D: Cartesian-product EM
  • Initialize with fully labeled docs, which go to a specific (α, β) cell
  • Smear each half-labeled training doc across its label row or column
    • A uniform smear could be bad
    • Use a naïve Bayes classifier to seed the smear
  • Parameters extended from EM1D
    • π(α,β): prior probability for label pair (α, β)
    • θ(α,β,t): multinomial probability of term t for pair (α, β)
  [Figure: the grid of labels in A × labels in B; an A-labeled doc smears across a row, a B-labeled doc down a column]
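A sketch of the seeding step for an A-labeled doc: its unit of probability mass is smeared across its row of the (α, β) grid, partly uniformly and partly by a naïve Bayes B-classifier's scores (nb_B is assumed trained; the helper name and smear fraction are ours).

    import numpy as np

    def seed_a_labeled(x_doc, alpha_idx, nb_B, n_A, n_B, uniform_frac=0.1):
        grid = np.zeros((n_A, n_B))
        nb_scores = nb_B.predict_proba(x_doc.reshape(1, -1))[0]  # Pr(beta|d)
        # A fraction uniformly, the rest by NB scores; mass stays in row alpha.
        grid[alpha_idx] = uniform_frac / n_B + (1 - uniform_frac) * nb_scores
        return grid   # rows index labels in A, columns labels in B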

  11. EM2D updates
  • E-step for an A-labeled document: recompute its posterior over label pairs
  • M-step: updated class-pair priors and updated class-pair-conditioned term stats
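The update equations on this slide were images and did not survive extraction. The following is a plausible reconstruction, not verbatim from the slide, of the standard multinomial EM updates using the parameters π and θ defined on slide 10, with n(d,t) the count of term t in document d.

    % E-step for an A-labeled document d: posterior confined to row alpha,
    \[
      \Pr(\alpha, \beta \mid d) \;\propto\;
      \pi_{\alpha,\beta} \prod_{t} \theta_{\alpha,\beta,t}^{\,n(d,t)}
    \]
    % M-step: class-pair priors and class-pair-conditioned term stats,
    \[
      \pi_{\alpha,\beta} \;\propto\; \sum_{d} \Pr(\alpha, \beta \mid d),
      \qquad
      \theta_{\alpha,\beta,t} \;\propto\; \sum_{d} \Pr(\alpha, \beta \mid d)\, n(d,t)
    \]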

  12. Applying EM2D to a test doc
  • Mapping a B-labeled test doc d to an A label (e-commerce catalogs)
    • Given β, find argmax_α Pr(α, β | d)
  • Classifying a document d with no labels to an A label
    • Aggregation: for each α compute Σ_β Pr(α, β | d), pick the best α
    • Guessing (EM2D-G): guess the best β* using a B-classifier, then find argmax_α Pr(α, β* | d)
  • EM pitfalls: damping factor, early stopping
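The two argmax rules reduce to a few lines once the joint posterior matrix is computed; a sketch with our own helper names, where P[i, j] = Pr(α_i, β_j | d).

    import numpy as np

    def best_alpha_given_beta(P, beta_idx):
        """Mapping (beta known) or guessing (beta* from a B-classifier):
        argmax over alpha within one column of the joint posterior."""
        return int(np.argmax(P[:, beta_idx]))

    def best_alpha_aggregated(P):
        """Zero-label aggregation: sum out beta, then argmax over alpha."""
        return int(np.argmax(P.sum(axis=1)))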

  13. Experiments
  • Selected 5 Dmoz and Yahoo subtree pairs
  • Compare EM2D against
    • Naïve Bayes, with the best number of features and smoothing
    • EM1D: ignore labels from the other taxonomy, treat those docs as unlabeled
    • Stratified EM1D
  • Two tasks
    • Mapping a test doc with an A-label to a B-label, or vice versa
    • Classifying a zero-labeled test doc
  • Accuracy = fraction of docs assigned correct labels

  14. Accuracy benefits in mapping
  • Improvement over NB: 30% best, 10% average
  • EM1D and NB are close, because the training set for each taxonomy is not too small
  • EM2D > Stratified EM1D > NB
    • 2D transfer of model information seems important

  15. Asymmetric setting
  • Few (only 300) bookmarked URLs (taxonomy B, the target)
  • Many Yahoo URLs, and a larger number of classes (taxonomy A)
  • Need to control the damping factor (the relative importance of labeled vs. unlabeled docs) to tackle the population skew

  16. Zero-labeled test documents
  • EM1D improves accuracy only for 12 training docs
  • EM2D with guessing improves beyond EM1D
    • In fact, better than aggregating scores down to 1D
  • The choice of the unlabeled:labeled damping ratio L may be important to get benefits

  17. Robustness to initialization
  • Seeding choices: hard (best class), NB scores, uniform
    • Smear a fraction uniformly, the rest by NB scores
  • EM2D is robust to a wide range of smear fractions
  • Fully uniform smearing can fail (local optima)
  [Figure: accuracy across smear fractions, from naïve Bayes smear to uniform smear]

  18. Related work
  • Multi-task learning, "life-long learning", inductive transfer (Thrun; Caruana)
    • Find earlier learning tasks similar to the current one
    • Reuse models, features, parameters
  • Co-training (Blum & Mitchell)
    • Two learners over a single label set
    • Partitioned feature set
  • Catalog mapping (Agrawal & Srikant)
    • Uses two-label docs to estimate priors
    • Raises the prior to an exponent, tuned by validation
    • EM2D: a generative model, slightly better accuracy

  19. Summary and future work
  • Two algorithms for cross-training
    • EM-based semi-supervised algorithm EM2D
    • SVM-based algorithm SVM-CT
  • Benefits
    • Improved accuracy
    • Interpretable mappings between label sets
  • General issue: how best to deal with a large number of heterogeneous attributes?
  • Future work
    • Brittle naïve Bayes scores in EM2D
    • Small relative gains in SVM-CT: better kernels? feature selection?
