Ensemble Classification Methods Rayid Ghani IR Seminar – 9/26/00
What is Ensemble Classification? • Set of classifiers • Decisions combined in "some" way • Often more accurate than the individual classifiers • What properties should the base learners have?
Why should it work? • The ensemble is more accurate ONLY if the individual classifiers disagree • Each classifier should have error rate < 0.5, with errors independent of one another (illustrated below) • The ensemble's error rate is highly correlated with the correlation among the errors made by the different learners (Ali & Pazzani)
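To make the independence argument concrete, here is a minimal sketch (plain Python; the numbers are illustrative, not from the talk). The majority vote of n independent classifiers is wrong only when more than half of them err, a binomial tail that shrinks rapidly when each error rate is below 0.5:

```python
from math import comb

def majority_vote_error(n, p):
    """Probability that a majority vote of n independent classifiers,
    each with individual error rate p, is wrong (a binomial tail)."""
    k = n // 2 + 1  # number of wrong votes needed for the vote to fail
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# 21 independent classifiers, each wrong 30% of the time:
print(majority_vote_error(21, 0.3))  # ~0.026, far below the individual 0.3
```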
Averaging Fails! • Use delta functions as classifiers (each predicts +1 at a single point and –1 everywhere else) • For a training sample of size m, construct a set of at most 2m classifiers such that the majority vote is always correct • Associate one delta function with every example • Add M+ (# of +ve examples) copies of the function that predicts +1 everywhere and M– (# of –ve examples) copies of the function that predicts –1 everywhere • Applying boosting to this yields zero training error but bad generalization • Applying the margin analysis also yields zero training error, but the margin is small: O(1/m)
Ideas? • Subsampling training examples • Bagging, cross-validated committees, boosting (a bagging sketch follows below) • Manipulating input features • Choose different feature subsets • Manipulating output targets • ECOC and variants • Injecting randomness • NN (different initial weights), DT (pick different splits), injecting noise, MCMC
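As a concrete illustration of the subsampling idea, a minimal bagging sketch (Python with scikit-learn; the choice of decision trees and all names here are illustrative, not the talk's own code):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_trees(X, y, n_estimators=25, seed=0):
    """Train each tree on a bootstrap sample drawn with replacement."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_estimators):
        idx = rng.integers(0, len(X), size=len(X))  # bootstrap indices
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def majority_predict(models, X):
    """Unweighted majority vote; assumes integer class labels."""
    votes = np.stack([m.predict(X) for m in models]).astype(int)
    return np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, votes)
```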
Combining Classifiers • Unweighted voting • Bagging, ECOC, etc. • Weighted voting • Weight by accuracy (on the training or a holdout set), LSR (weights proportional to 1/variance) • Bayesian model averaging
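A hedged sketch of weighted voting (my own illustration; the holdout-accuracy weighting is one common choice named above, and the helper assumes the `models` list from the bagging sketch):

```python
import numpy as np

def weighted_vote(models, weights, X, n_classes):
    """Each model's predicted class receives a score equal to the
    model's weight; the class with the highest total score wins."""
    scores = np.zeros((len(X), n_classes))
    for m, w in zip(models, weights):
        scores[np.arange(len(X)), m.predict(X).astype(int)] += w
    return scores.argmax(axis=1)

# One weighting named on the slide: accuracy on a holdout set.
# weights = [(m.predict(X_hold) == y_hold).mean() for m in models]
```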
BMA • All models in the model space are used, weighted by their probability of being the "correct" model • Optimal given the correct model space and priors • Not widely used, even though it was claimed not to overfit (Buntine, 1990)
BMA - Equations • Prediction: P(c | x, D) = Σm P(c | x, m) P(m | D) • Posterior = prior × likelihood: P(m | D) ∝ P(m) P(D | m) • Likelihood: P(D | m) = Πi P(yi | xi, m), with P(yi | xi, m) supplied by a noise model
Equations • Posterior: P(m | D) ∝ P(m) P(D | m) • Uniform noise model: P(yi | xi, m) = 1 – ε if m predicts yi correctly, ε otherwise • Pure classification model: each model predicts a single class with probability 1 • Model space too large – exact averaging is intractable, so approximation is required • Approximations: use the model with the highest posterior, or sample models
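A minimal sketch of the posterior computation under the uniform noise model (my own illustration; ε, the uniform prior over models, and all names are assumptions; computed in log space to avoid underflow):

```python
import numpy as np

def bma_posteriors(models, X, y, eps=0.1):
    """Posterior P(m | D) ∝ P(m) * prod_i P(y_i | x_i, m), with the
    uniform noise model P(y_i | x_i, m) = 1 - eps if m predicts y_i,
    eps otherwise, and a uniform prior over models."""
    log_post = np.array([
        np.where(m.predict(X) == y, np.log(1 - eps), np.log(eps)).sum()
        for m in models
    ])
    log_post -= log_post.max()    # stabilize before exponentiating
    post = np.exp(log_post)
    return post / post.sum()      # normalize to sum to 1

# BMA prediction: average each model's vote, weighted by its posterior
# (under the pure classification model, each model votes for one class).
```

In practice these posteriors tend to collapse onto one or two models, which is exactly the skew reported on the next slides.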
BMA of Bagged C4.5 Rules • Bagging can be viewed as a form of importance sampling in which all samples are weighted equally • Experimental results: every version of BMA performed worse than bagging on 19 out of 26 datasets • Posteriors were heavily skewed – dominated by a single rule model – so BMA effectively did model selection rather than averaging
BMA of various learners • RISE: rule sets with partitioning • 8 databases from UCI • BMA worse than RISE in every domain • Trading rules • Intuition: there is no single right rule, so BMA should help • In practice, BMA was similar to choosing the single best rule
Overfitting in BMA • The issue of overfitting in BMA is usually ignored (Freund et al. 2000) • Is overfitting the explanation for BMA's poor performance? • Overfitting: preferring a hypothesis that does not truly have the lowest error of any hypothesis considered, but by chance has the lowest error on the training data • Overfitting results from the likelihood's exponential sensitivity to random fluctuations in the sample, and increases with the number of models considered (see below)
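One way to make the exponential sensitivity concrete (a back-of-the-envelope calculation assuming the uniform noise model from the earlier slide): if two models get s₁ and s₂ training examples right out of n, their likelihood ratio is

```latex
\frac{P(D \mid m_1)}{P(D \mid m_2)}
  = \frac{(1-\epsilon)^{s_1}\,\epsilon^{\,n-s_1}}{(1-\epsilon)^{s_2}\,\epsilon^{\,n-s_2}}
  = \left(\frac{1-\epsilon}{\epsilon}\right)^{s_1 - s_2}
```

For ε = 0.1 and a gap of just 10 correct examples, this ratio is 9¹⁰ ≈ 3.5 × 10⁹, so the posterior lands almost entirely on whichever model happened to do best on the sample.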
To BMA or not to BMA? • The net effect depends on which effect prevails: • Increased overfitting (small if few models are considered) • Reduction in error obtained by giving some weight to alternative models (skewed weights => small effect) • Ali & Pazzani (1996) report good results, but bagging wasn't tried • Domingos (2000) used bootstrapping before BMA, so the models were built from less data
Why do they work? • Bias/variance decomposition • The training data may be insufficient to choose a single best classifier • Learning algorithms may not be "smart" enough • The hypothesis space may not contain the true function
Definitions • Bias is the persistent/systematic error of a learner, independent of the training set; it is zero for a learner that always makes the optimal prediction • Variance is the error incurred by fluctuations in response to different training sets; it is independent of the true value of the predicted variable, and zero for a learner that always predicts the same class regardless of the training set
Bias–Variance Decomposition • Kong & Dietterich (1995): variance can be negative, and noise is ignored • Breiman (1996): undefined for any given example, and variance can be zero even when the learner's predictions fluctuate • Tibshirani (1996) • Hastie (1997) • Kohavi & Wolpert (1996): allows the bias of the Bayes-optimal classifier to be non-zero • Friedman (1997): leaves bias and variance for zero-one loss undefined
Domingos (2000) • A single definition of bias and variance • Applicable to "any" loss function • Explains the margin effect (Schapire et al. 1997) using the decomposition • Incorporates variable misclassification costs • Experimental study
Unified Decomposition • Loss functions • Squared: L(t,y) = (t – y)² • Absolute: L(t,y) = |t – y| • Zero-one: L(t,y) = 0 if y = t, else 1 • Goal: minimize average L(t,y) over all weighted examples • Decomposition: expected loss = c1 N(x) + B(x) + c2 V(x), where N(x) is noise, B(x) bias, V(x) variance, and c1, c2 are loss-dependent multipliers
Properties of the unified decomposition • Relation to an order-correct learner • Relation to the margin of a learner • Maximizing margins is a combination of reducing the number of biased examples, decreasing variance on unbiased examples, and increasing it on biased ones
Experimental Study • 30 UCI datasets • Methodology: 100 bootstrap samples; results averaged over the test set with uniform weights • Estimate bias, variance, and zero-one loss (a sketch of the estimator follows below) • Learners: DT, kNN, boosting
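A minimal sketch of how such estimates can be computed (my reading of the methodology; assumes two classes with integer labels, and `preds[j]` holding the predictions of the model trained on the j-th bootstrap sample):

```python
import numpy as np

def bias_variance_01(preds, y_true):
    """Estimate Domingos-style bias and variance for zero-one loss.
    preds: (n_training_sets, n_test) integer predictions;
    y_true: test labels (noise is ignored, so the observed label
    stands in for the optimal prediction)."""
    # Main prediction: the most frequent prediction for each test point.
    main = np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, preds)
    bias = (main != y_true).astype(float)     # B(x): 0 or 1 per example
    variance = (preds != main).mean(axis=0)   # V(x): disagreement with main
    loss = (preds != y_true).mean(axis=0)     # average zero-one loss
    return bias.mean(), variance.mean(), loss.mean()
```

For two classes, the per-example loss works out to V(x) on unbiased examples and B(x) – V(x) on biased ones, which is the c2 = ±1 case of the decomposition above.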
Boosting C4.5 - Results • Decreases both bias and variance • The bulk of the bias reduction happens in the first few rounds • Variance reduction is more gradual and is the dominant effect
kNN results • Bias increases with k and dominates the reduction in variance • However, increasing k reduces variance on unbiased examples while increasing it on biased ones
Issues • Does not work with every loss function, e.g., absolute loss • The decomposition is not purely additive, unlike the original one for squared loss
Spectrum of ensembles • Bagging → Boosting → BMA: both overfitting and the asymmetry of the model weights increase along the spectrum
Open Issues concerning ensembles • What is the best way to construct ensembles? • No extensive comparison has been done • Computationally expensive • Not easily comprehensible
Bibliography • Overview • T. Dietterich • Bauer & Kohavi • Averaging • Domingos • Freund, Mansour & Schapire • Ali & Pazzani • Bias–Variance Decomposition • Kohavi & Wolpert • Domingos • Friedman • Kong & Dietterich