Ensemble Classification Methods


Presentation Transcript


  1. Ensemble Classification Methods Rayid Ghani IR Seminar – 9/26/00

  2. What is Ensemble Classification? • Set of classifiers • Decisions combined in "some" way • Often more accurate than the individual classifiers • What properties should the base learners have?

  3. Why should it work? • More accurate ONLY if the individual classifiers disagree with each other • Each base error rate < 0.5 and the errors are independent • The ensemble's error rate is highly correlated with how correlated the errors of the different learners are (Ali & Pazzani)
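
A minimal sketch of this independence argument (illustrative Python, not from the slides): if each of n base classifiers errs independently with probability p < 0.5, the majority vote errs only when more than half of them do, so the ensemble error is a binomial tail that shrinks as n grows.

```python
from math import comb

def majority_vote_error(p: float, n: int) -> float:
    """Error rate of a majority vote over n independent base classifiers,
    each with individual error rate p (n is assumed odd to avoid ties)."""
    # The ensemble errs only when more than half of the base classifiers err.
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

# With p = 0.3 the ensemble error falls quickly as classifiers are added;
# with p > 0.5 the same formula shows that voting makes things worse.
for n in (1, 5, 11, 21):
    print(n, round(majority_vote_error(0.3, n), 4))
```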

  4. Averaging Fails! • Use delta-functions as classifiers (predict +1 at a single point and –1 everywhere else) • For training sample size m, construct a set of at most 2m classifiers s.t. the majority vote is always correct • Associate 1 delta function with every example • Add M+ (# of +ve examples) copies of the function that predicts +1 everywhere and M- (# of -ve examples) copies of the function that predicts -1 everywhere • Applying boosting to this results in zero training error but poor generalization • Applying the margin analysis results in zero training error, but the margin is small: O(1/m)

  5. Ideas? • Subsampling training examples • Bagging , Cross-Validated Committees, Boosting • Manipulating input features • Choose different features • Manipulating output targets • ECOC and variants • Injecting randomness • NN(different initial weights), DT(pick different splits), injecting noise, MCMC
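
A hedged sketch of the first idea, subsampling the training examples (bagging). scikit-learn's DecisionTreeClassifier is used only as a convenient base learner, and the function names below are illustrative, not part of any fixed API.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bag_classifiers(X, y, n_estimators=25, seed=0):
    """Train one tree per bootstrap sample of (X, y); X and y are NumPy arrays."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_estimators):
        idx = rng.integers(0, len(X), size=len(X))       # sample with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def majority_vote(models, X):
    """Unweighted vote over the ensemble (labels assumed to be 0, 1, ...)."""
    preds = np.stack([m.predict(X) for m in models])     # (n_models, n_samples)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, preds)
```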

  6. Combining Classifiers • Unweighted voting • Bagging, ECOC, etc. • Weighted voting • Weight ∝ accuracy (on the training or a holdout set), LSR (weights ∝ 1/variance) • Bayesian model averaging
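
A sketch of the weighted-voting option, with each classifier's weight proportional to its accuracy on a holdout set (the holdout variant named on the slide); the helper below is illustrative.

```python
import numpy as np

def accuracy_weighted_vote(models, X_holdout, y_holdout, X_new):
    """Vote with each model's weight proportional to its holdout accuracy."""
    weights = np.array([np.mean(m.predict(X_holdout) == y_holdout) for m in models])
    weights = weights / weights.sum()                    # normalise the weights
    classes = np.unique(y_holdout)
    scores = np.zeros((len(X_new), len(classes)))
    for model, w in zip(models, weights):
        pred = model.predict(X_new)
        for j, c in enumerate(classes):
            scores[:, j] += w * (pred == c)              # this model's weighted vote
    return classes[scores.argmax(axis=1)]
```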

  7. BMA • All models in the model space are used, weighted by their probability of being the “correct” model • Optimal given the correct model space and priors • Not widely used, even though it has been claimed not to overfit (Buntine, 1990)

  8. BMA – Equations • P(y | x, D) = Σh P(y | x, h) P(h | D)  (sum over models h) • P(h | D) ∝ P(h) × P(D | h), i.e. posterior ∝ prior × likelihood • P(y | x, h) is the noise model

  9. Equations • Posterior • Uniform noise model • Pure classification model • The model space is too large – an approximation is required • Use the model with the highest posterior, or sample from the posterior
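
A small, hedged sketch of BMA over an enumerable set of already-fitted classifiers, using the uniform noise model sketched above: each training label is assumed to agree with a model's prediction with probability 1 − eps, so the posterior is proportional to prior × likelihood. The value of eps and the uniform prior are illustrative choices, not fixed by the slides.

```python
import numpy as np

def bma_predict(models, X_train, y_train, X_new, eps=0.1):
    """Bayesian model averaging with a uniform prior and uniform noise model."""
    n_models = len(models)
    log_prior = np.full(n_models, -np.log(n_models))     # uniform prior over models

    # Log-likelihood of the training labels under each model's predictions,
    # assuming each label matches the model with probability 1 - eps.
    log_like = np.array([
        np.sum(np.where(m.predict(X_train) == y_train, np.log(1 - eps), np.log(eps)))
        for m in models
    ])
    log_post = log_prior + log_like
    post = np.exp(log_post - log_post.max())             # stabilise before exponentiating
    post /= post.sum()

    # Posterior-weighted vote; in practice the posterior is often dominated by a
    # single model, which is the skew reported on the following slides.
    classes = np.unique(y_train)
    scores = np.zeros((len(X_new), len(classes)))
    for model, w in zip(models, post):
        pred = model.predict(X_new)
        for j, c in enumerate(classes):
            scores[:, j] += w * (pred == c)
    return classes[scores.argmax(axis=1)]
```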

  10. BMA of Bagged C4.5 Rules • Bagging as a form of importance sampling where all samples are weighted equally • Experimental results • Every version of BMA performed worse than bagging on 19 of the 26 datasets • Posteriors skewed – dominated by a single rule model – model selection rather than averaging

  11. BMA of various learners • RISE Rule sets with partitioning • 8 databases from UCI • BMA worse than RISE in every domain • Trading Rules • Intuition (there is no single right rule so BMA should help) • BMA similar to choosing the single best rule

  12. Overfitting in BMA • The issue of overfitting is usually ignored (Freund et al. 2000) • Is overfitting the explanation for the poor performance of BMA? • Overfitting: preferring a hypothesis that does not truly have the lowest error of any hypothesis considered, but that by chance has the lowest error on the training data • Overfitting is the result of the likelihood’s exponential sensitivity to random fluctuations in the sample, and it increases with the number of models considered

  13. To BMA or not to BMA? • The net effect depends on which of two effects prevails: • Increased overfitting (small if few models are considered) • Reduction in error from giving some weight to alternative models (skewed weights => small effect) • Ali & Pazzani (1996) report good results, but bagging wasn’t tried • Domingos (2000) used bootstrapping before BMA, so the models were built from less data

  14. Why do they work? • Bias / variance decomposition • The training data may be insufficient for choosing a single best classifier • The learning algorithm may not be “smart” enough • The hypothesis space may not contain the true function

  15. Definitions • Bias is the persistent/systematic error of a learner independent of the training set. Zero for a learner that always makes the optimal prediction • Variance is the error incurred by fluctuations in response to different training sets. Independent of the true value of the predicted variable and zero for a learner that always predicts the same class regardless of the training set
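
These definitions can be written compactly; the block below is a reconstruction in the notation of Domingos (2000), consistent with the prose on this slide, where y is the learner's prediction (which varies with the training set D), t the true label, and y* the optimal prediction.

```latex
% Reconstructed definitions behind the unified decomposition (Domingos 2000).
\begin{align*}
  y_m  &= \arg\min_{y'} \mathrm{E}_D\!\left[L(y, y')\right]  && \text{main prediction}\\
  B(x) &= L(y_*, y_m)                                        && \text{bias}\\
  V(x) &= \mathrm{E}_D\!\left[L(y_m, y)\right]               && \text{variance}\\
  N(x) &= \mathrm{E}_t\!\left[L(t, y_*)\right]               && \text{noise}
\end{align*}
```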

  16. Bias–Variance Decomposition • Kong & Dietterich (1995) – variance can be negative and noise is ignored • Breiman (1996) – undefined for any given example, and variance can be zero even when the learner's predictions fluctuate • Tibshirani (1996) • Hastie (1997) • Kohavi & Wolpert (1996) – allows the bias of the Bayes-optimal classifier to be non-zero • Friedman (1997) – leaves bias and variance for zero-one loss undefined

  17. Domingos (2000) • Single definition of bias and variance • Applicable to “any” loss function • Explains the margin effect (Schapire et al. 1997) using the decomposition • Incorporates variable misclassification costs • Experimental study

  18. Unified Decomposition • Loss functions • Squared: L(t,y) = (t − y)² • Absolute: L(t,y) = |t − y| • Zero-one: L(t,y) = 0 if y = t, else 1 • Goal: minimize the average L(t,y) over all weighted examples • Decomposition: expected loss = c1·N(x) + B(x) + c2·V(x)
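
A hedged sketch of how the zero-one-loss version of this decomposition can be estimated from the predictions of models trained on many resampled training sets (as in the experimental setup a few slides below); treating the observed labels as the optimal predictions, i.e. assuming zero noise, is a simplification.

```python
import numpy as np

def zero_one_bias_variance(preds, y_true):
    """Average B(x) and V(x) for zero-one loss.

    preds  : (n_models, n_samples) class predictions, one row per model trained
             on a different resampled training set
    y_true : (n_samples,) labels, treated here as the optimal predictions
             (i.e. noise is assumed to be zero, a simplification)
    """
    preds = np.asarray(preds)
    bias, variance = [], []
    for i in range(preds.shape[1]):
        values, counts = np.unique(preds[:, i], return_counts=True)
        main = values[counts.argmax()]                   # main prediction = modal class
        bias.append(float(main != y_true[i]))            # B(x): is the main prediction wrong?
        variance.append(np.mean(preds[:, i] != main))    # V(x): disagreement with the mode
    return float(np.mean(bias)), float(np.mean(variance))
```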

  19. Properties of the unified decomposition • Relation to Order-correct learner • Relation to Margin of a learner • Maximizing margins is a combination of reducing the number of biased examples, decreasing variance on unbiased examples, and increasing it on biased ones.

  20. Experimental Study • 30 UCI datasets • Methodology • 100 bootstrap samples – averaged over the test set with uniform weights • Estimate bias, variance, zero-one loss • DT, kNN, boosting

  21. Boosting C4.5 - Results • Decreases both bias and variance • The bulk of the bias reduction happens in the first few rounds • Variance reduction is more gradual, but it is the dominant effect

  22. kNN Results • As k increases, the increase in bias dominates the reduction in variance • However, increasing k reduces variance on unbiased examples while increasing it on biased ones

  23. Issues • Does not work with “any” loss function, e.g. absolute loss • The decomposition is not purely additive, unlike the original one for squared loss

  24. Spectrum of Ensembles • [Diagram: Bagging, Boosting, and BMA placed on a spectrum, with asymmetry of weights and overfitting increasing from Bagging through Boosting to BMA]

  25. Open Issues concerning ensembles • Best way to construct ensembles? • No extensive comparison done • Computationally expensive • Not easily comprehensible

  26. Bibliography • Overview • T. Dietterich • Bauer & Kohavi • Averaging • Domingos • Freund, Mansour, Schapire • Ali, Pazzani • Bias – Variance Decomposition • Kohavi & Wolpert • Domingos • Friedman • Kong & Dietterich
