
In Defense of One-Vs-All Classification


Presentation Transcript


  1. In Defense of One-Vs-All Classification Ryan Rifkin and Aldebaro Klautau Journal of Machine Learning Research, Volume 5 (December 2004), Pages: 101–141. Presented by Shuiwang Ji Machine Learning Lab at CSE Center for Evolutionary Functional Genomics The Biodesign Institute Part of these slides is taken from: http://www.mit.edu/~9.520/Classes/class08.html

  2. Main thesis • “The one-against-rest scheme is extremely powerful, producing results that are often at least as accurate as other methods.” • “The experimental evidence offered for the superiority of the proposed methods over a simple one-against-rest scheme is improperly controlled or measured.”

  3. Outline • Single machine approaches; • Error correcting code approaches; • Tree-structured approaches (NOT in the paper); • Experiments.

  4. Weston & Watkins (WW) (1998) • Binary SVM: learn one function, penalizing margin violations; one-against-rest trains and penalizes each machine separately; • WW multi-class: train the machines jointly, paying a penalty based on the relative values the machines output on each point.

  5. Weston & Watkins (WW) (1998) • Learn k functions f_1, …, f_k jointly. If point x_i is in class y_i, require f_{y_i}(x_i) ≥ f_j(x_i) + 2 − ξ_{ij} for every j ≠ y_i, introducing n(k−1) slack variables (see the sketch below).
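A minimal sketch of the WW primal, reconstructed from the constraint above (the exact margin constant varies across write-ups):

    \[
    \min_{w,b,\xi}\;\; \frac{1}{2}\sum_{m=1}^{k}\lVert w_m\rVert^2
    \;+\; C\sum_{i=1}^{n}\sum_{j\neq y_i}\xi_{ij}
    \]
    \[
    \text{s.t.}\;\; (w_{y_i}\!\cdot\! x_i + b_{y_i}) \;\ge\; (w_j\!\cdot\! x_i + b_j) + 2 - \xi_{ij},
    \qquad \xi_{ij}\ge 0,\;\; j\neq y_i .
    \]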

  6. Weston & Watkins (WW) (1998) • Too many constraints and slack variables: n(k−1); • Not easy to decompose (not scalable); • The experimental setup is problematic.

  7. Crammer & Singer (2001) • Weston & Watkins: pay a penalty for every class j whose output comes within the margin of the correct class; • Crammer & Singer: penalize only the largest such violation.

  8. Crammer & Singer (2001) • Weston & Watkins: n(k−1) slack variables; • Crammer & Singer: n slack variables, one per training point (see the sketch below).
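A minimal sketch of the CS primal for comparison (bias terms omitted, as in the original paper):

    \[
    \min_{w,\xi}\;\; \frac{1}{2}\sum_{m=1}^{k}\lVert w_m\rVert^2 \;+\; C\sum_{i=1}^{n}\xi_i
    \]
    \[
    \text{s.t.}\;\; w_{y_i}\!\cdot\! x_i \;-\; w_j\!\cdot\! x_i \;\ge\; 1 - \xi_i
    \quad \forall j\neq y_i,\qquad \xi_i\ge 0 .
    \]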

  9. Crammer & Singer (2001) • Fewer slack variables (compared to Weston & Watkins); • Can be decomposed (more scalable); • Many tricks are developed and implemented for efficient training; • C source code available: http://www.cis.upenn.edu/~crammer/code/MCSVM/MCSVM_1_0.tar.gz • R implementation: the kernlab package (http://www.r-project.org/).

  10. Lee, Lin, and Wahba (2001)

  11. Lee, Lin, Wahba, Analysis • Like the WW formulation, this formulation is big, and no decomposition method is provided; • The analysis is asymptotic: it requires the sample size to go to infinity (with the regularization parameter tending to zero), and no convergence rates are provided. But asymptotically, plain density estimation would also allow us to recover the optimal Bayes rule. (The LLW objective is sketched below for reference.)
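A sketch of the LLW multicategory SVM in the form it is usually quoted (this slide's original equations did not survive the transcript): the functions are constrained to sum to zero, \(\sum_j f_j(x) = 0\), and one minimizes

    \[
    \frac{1}{n}\sum_{i=1}^{n}\sum_{j\neq y_i}\Bigl(f_j(x_i) + \tfrac{1}{k-1}\Bigr)_{+}
    \;+\; \lambda\sum_{j=1}^{k}\lVert h_j\rVert_{H_K}^{2},
    \qquad f_j = h_j + b_j .
    \]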

  12. Outline • Single machine approaches; • Error correcting code approaches; • Tree-structured approaches; • Experiments.

  13. Error-Correcting Code (ECC), Dietterich & Bakiri (1995) • Each class is assigned a binary codeword, and one binary classifier is trained per codeword bit; • A meta-classifier assigns a test point to the class whose codeword is closest, in Hamming distance, to the vector of classifier outputs (see the sketch below). [Figure: example code matrix and decoding; source: Dietterich and Bakiri (1995)]
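A minimal decoding sketch; the 4-class, 6-bit code matrix below is hypothetical, chosen only to illustrate Hamming decoding:

    import numpy as np

    # Hypothetical 4-class, 6-bit code matrix: one row per class,
    # one column per binary classifier.
    CODE = np.array([
        [0, 0, 1, 1, 0, 1],
        [0, 1, 0, 1, 1, 0],
        [1, 0, 0, 0, 1, 1],
        [1, 1, 1, 0, 0, 0],
    ])

    def ecoc_decode(bits):
        """Return the class whose codeword is nearest, in Hamming
        distance, to the vector of binary classifier outputs."""
        distances = np.abs(CODE - np.asarray(bits)).sum(axis=1)
        return int(np.argmin(distances))

    # One flipped bit (position 3) still decodes to class 2:
    print(ecoc_decode([1, 0, 0, 1, 1, 1]))  # -> 2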

  14. One-against-rest • Train k binary classifiers, the i-th separating class i from the remaining k−1 classes; • Classify a test point by winner-take-all: pick the class whose machine outputs the largest real value (a sketch follows).
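A minimal linear sketch of the scheme, using regularized least squares as the binary learner (Rifkin & Klautau's point is that any well-tuned regularized binary classifier works; the function names below are mine):

    import numpy as np

    def train_ova_rls(X, y, k, lam=1e-2):
        """Train k one-vs-rest machines with linear regularized
        least squares: machine i is fit to +1/-1 targets for
        'class i vs the rest'."""
        d = X.shape[1]
        A = X.T @ X + lam * np.eye(d)
        W = np.zeros((k, d))
        for i in range(k):
            t = np.where(y == i, 1.0, -1.0)
            W[i] = np.linalg.solve(A, X.T @ t)
        return W

    def predict_ova(W, X):
        # Winner-take-all over the machines' real-valued outputs.
        return np.argmax(X @ W.T, axis=1)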

  15. One-against-one • Train one classifier per pair of classes, k(k−1)/2 in total; • Classify by max-wins voting over the pairwise machines, as sketched below.
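A minimal voting sketch, assuming a dictionary of trained pairwise machines:

    import numpy as np
    from itertools import combinations

    def predict_ovo(classifiers, x, k):
        """Max-wins voting over the k(k-1)/2 pairwise machines.
        classifiers[(i, j)] (i < j) is assumed to return a score
        that is positive for class i and negative for class j."""
        votes = np.zeros(k, dtype=int)
        for i, j in combinations(range(k), 2):
            winner = i if classifiers[(i, j)](x) > 0 else j
            votes[winner] += 1
        return int(np.argmax(votes))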

  16. Special cases of ECC • Once the code alphabet is extended to {−1, 0, +1} (Allwein, Schapire & Singer, 2000), both one-against-rest and one-against-one become special cases of the ECC framework, as the matrices below illustrate. Source: http://www-cse.ucsd.edu/users/elkan/254spring01/aldebaro1.pdf
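For k = 3 classes (rows = classes, columns = binary machines; 0 means the class is left out of that machine's training set), the two coding matrices are:

    \[
    M_{\text{OvA}} =
    \begin{pmatrix} +1 & -1 & -1 \\ -1 & +1 & -1 \\ -1 & -1 & +1 \end{pmatrix},
    \qquad
    M_{\text{OvO}} =
    \begin{pmatrix} +1 & +1 & 0 \\ -1 & 0 & +1 \\ 0 & -1 & -1 \end{pmatrix}.
    \]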

  17. Outline • Single machine approaches; • Error correcting code approaches; • Tree-structured approaches; • Experiments.

  18. Large Margin Directed Acyclic Graph (DAG) • Identical to one-against-one at training time; • At test time, the DAG determines which classifiers to evaluate on a given point; • Classes i and j are compared, and whichever class achieves the lower score is removed from further consideration; • Repeat k−1 times until only one class remains (see the sketch below).
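A minimal sketch of the elimination loop, assuming the same pairwise-classifier dictionary as in the one-against-one sketch:

    def predict_dag(classifiers, x, k):
        """DAG-style elimination: each pairwise test removes one
        class from the candidate list, so exactly k-1 classifiers
        are evaluated per test point. classifiers[(i, j)] (i < j)
        returns a score that is positive for class i."""
        alive = list(range(k))
        while len(alive) > 1:
            i, j = alive[0], alive[-1]       # compare first vs last
            if classifiers[(i, j)](x) > 0:
                alive.pop()                  # j loses, remove it
            else:
                alive.pop(0)                 # i loses, remove it
        return alive[0]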

  19. Large Margin DAGs for Multiclass Classification Source: Platt et al. (2000)

  20. Margin tree (Tibshirani and Hastie 2006) • An SVM is constructed for each pair of classes to compute the pair-wise margins; • Agglomerative clustering uses the pair-wise margins as distances to build the hierarchy bottom-up (a clustering sketch follows); • Three approaches: greedy, single linkage, and complete linkage.
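A minimal sketch of the clustering step; the margin values below are made up for illustration:

    import numpy as np
    from scipy.cluster.hierarchy import linkage
    from scipy.spatial.distance import squareform

    # Hypothetical symmetric matrix of pairwise SVM margins for
    # 4 classes (zero diagonal); small margin = hard to separate.
    margins = np.array([
        [0.0, 0.4, 2.1, 1.8],
        [0.4, 0.0, 1.9, 2.2],
        [2.1, 1.9, 0.0, 0.5],
        [1.8, 2.2, 0.5, 0.0],
    ])

    # Complete-linkage clustering with margins as distances:
    # the most confusable classes are merged first, so the top
    # split separates the best-separated groups.
    Z = linkage(squareform(margins), method='complete')
    print(Z)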

  21. Margin tree Source: Tibshirani and Hastie (2006)

  22. Outline • Single machine approaches; • Error correcting code approaches; • Tree-structured approaches; • Experiments (comparing five ECC approaches).

  23. Observations • In nearly all cases, the results of the compared methods are very close; • In the majority of experiments, 0 lies in the confidence interval, meaning the differences between the classifiers are not statistically significant.

  24. Implementations in R • e1071: one-against-one (LIBSVM) • kernlab: one-against-one, Crammer & Singer, Weston & Watkins • klaR: one-against-rest (SVMlight) • marginTree: http://www-stat.stanford.edu/~tibs/marginTree_1.00.zip

  25. Q & A Thank you!
