
Chapter 7: Ensemble Methods



  1. Chapter 7: Ensemble Methods

  2. Ensemble Methods • Rationale • Combining classifiers • Bagging • Boosting • Ada-Boosting

  3. Rationale • In any application we can choose among several learning algorithms, and hyperparameters affect the final learner • The No Free Lunch Theorem: no single learning algorithm always induces the most accurate learner in every domain • Try many and choose the one with the best cross-validation results
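
A minimal sketch of "try many and keep the one with the best cross-validation results", assuming scikit-learn and its built-in breast-cancer dataset; the candidate learners and their hyperparameters are illustrative choices, not prescribed by the slides:

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.linear_model import LogisticRegression

    X, y = load_breast_cancer(return_X_y=True)

    # Candidate learners; hyperparameters here are arbitrary examples.
    candidates = {
        "decision tree": DecisionTreeClassifier(max_depth=5),
        "k-NN": KNeighborsClassifier(n_neighbors=5),
        "logistic regression": LogisticRegression(max_iter=5000),
    }

    # Keep the learner with the best mean 5-fold cross-validation accuracy.
    scores = {name: cross_val_score(clf, X, y, cv=5).mean()
              for name, clf in candidates.items()}
    print(scores, "-> best:", max(scores, key=scores.get))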

  4. Rationale • On the other hand … • Each learning model comes with a set of assumptions and hence a bias • Learning is an ill-posed problem (finite data): each model converges to a different solution and fails under different circumstances • Why not combine multiple learners intelligently? Doing so may lead to improved results

  5. Rationale • How about combining learners that always make similar decisions? • Advantages? • Disadvantages? • Complementary? • To build an ensemble: your suggestions? • Different learning algorithms • Same algorithm, different hyperparameters • Different training data • Different feature subsets

  6. Rationale • Why does it work? • Suppose there are 25 base classifiers • Each classifier has error rate ε = 0.35 • If the base classifiers are identical, then the ensemble will misclassify the same examples predicted incorrectly by the base classifiers • Assume instead that the classifiers are independent, i.e., their errors are uncorrelated; then the ensemble makes a wrong prediction only if more than half of the base classifiers predict incorrectly • Probability that the ensemble classifier makes a wrong prediction: P(error) = Σ_{i=13}^{25} C(25, i) ε^i (1 − ε)^{25−i} ≈ 0.06
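
A quick numerical check of the claim above, assuming 25 independent base classifiers each with error rate ε = 0.35 (plain Python, no external libraries):

    from math import comb

    n, eps = 25, 0.35
    # The majority-vote ensemble errs only if 13 or more of the 25 members err.
    p_error = sum(comb(n, i) * eps**i * (1 - eps)**(n - i)
                  for i in range(n // 2 + 1, n + 1))
    print(round(p_error, 3))  # ~0.06, far below the individual error of 0.35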

  7. Works if … • The base classifiers are independent of one another • The base classifiers do better than random guessing (error < 0.5) • In practice it is hard to make base classifiers perfectly independent; nevertheless, improvements have been observed even when they are slightly correlated

  8. Rationale • One important note: when we generate multiple base-learners, we want them to be reasonably accurate but do not require them to be very accurate individually, so they are not, and need not be, optimized separately for best accuracy • The base learners are chosen for their simplicity, not for their accuracy

  9. Ensemble Methods • Rationale • Combining classifiers • Bagging • Boosting • Ada-Boosting

  10. Combining classifiers • Examples: classification trees and neural networks, several neural networks, several classification trees, etc. • Average results from different models • Why? • Better classification performance than individual classifiers • More resilience to noise • Why not? • Time consuming • Overfitting

  11. Why • Why? • Better classification performance than individual classifiers • More resilience to noise • Besides avoiding the selection of the worst classifier under particular hypotheses, fusion of multiple classifiers can improve the performance of the best individual classifier • This is possible if the individual classifiers make “different” errors • For linear combiners, Tumer and Ghosh (1996) showed that averaging the outputs of individual classifiers with unbiased and uncorrelated errors can improve on the best individual classifier and, for an infinite number of classifiers, yield the optimal Bayes classifier
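
A minimal sketch of such a linear combiner, averaging the class-probability outputs of a few differently biased classifiers; the dataset, member models, and split below are illustrative assumptions (scikit-learn and NumPy assumed):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neural_network import MLPClassifier

    X, y = make_classification(n_samples=500, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # Members of different types, so their errors are (hopefully) different.
    members = [DecisionTreeClassifier(max_depth=3),
               GaussianNB(),
               MLPClassifier(max_iter=2000, random_state=0)]
    for m in members:
        m.fit(X_tr, y_tr)

    # Fuse by averaging the predicted probabilities, then take the argmax.
    avg_proba = np.mean([m.predict_proba(X_te) for m in members], axis=0)
    y_pred = avg_proba.argmax(axis=1)
    print("fused accuracy:", (y_pred == y_te).mean())
    print("member accuracies:", [m.score(X_te, y_te) for m in members])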

  12. Different classifier

  13. Architecture: serial, parallel, or hybrid

  14. Architecture

  15. Architecture

  16. Classifier Fusion • Fusion is useful only if the combined classifiers are mutually complementary • Majority-vote fuser: the majority should always be correct

  17. Complementary classifiers • Several approaches have been proposed to construct ensembles of complementary classifiers, among them: • Using problem and designer knowledge • Injecting randomness • Varying the classifier type, architecture, or parameters • Manipulating the training data • Manipulating the features

  18. If you are interested … • L. Xu, A. Krzyzak, C. Y. Suen, “Methods of Combining Multiple Classifiers and Their Applications to Handwriting Recognition”, IEEE Transactions on Systems, Man, and Cybernetics, 22(3), 1992, pp. 418-435. • J. Kittler, M. Hatef, R. P. W. Duin, J. Matas, “On Combining Classifiers”, IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3), March 1998, pp. 226-239. • D. M. J. Tax, M. van Breukelen, R. P. W. Duin, J. Kittler, “Combining Multiple Classifiers by Averaging or by Multiplying?”, Pattern Recognition, 33(9), 2000, pp. 1475-1485. • L. I. Kuncheva, “A Theoretical Study on Six Classifier Fusion Strategies”, IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(2), 2002, pp. 281-286.

  19. Alternatively … • Instead of designing multiple classifiers with the same dataset, we can manipulate the training set: multiple training sets are created by resampling the original data according to some distribution. E.g., bagging and boosting

  20. Ensemble Methods • Rationale • Combining classifiers • Bagging • Boosting • Ada-Boosting

  21. Bagging • Breiman, 1996 • Derived from bootstrap (Efron, 1993) • Create classifiers using training sets that are bootstrapped (drawn with replacement) • Average results for each case

  22. Bagging Example

  23. Bagging • Sampling (with replacement) according to a uniform probability distribution • Each bootstrap sample D has the same size n as the original data • Some instances may appear several times in the same training set, while others may be omitted • Build a classifier on each bootstrap sample D • D will contain approximately 63% of the original data • Each data object has probability 1 − (1 − 1/n)^n ≈ 0.632 of being selected in D
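
A minimal bagging sketch under these assumptions: decision-tree base learners, a synthetic dataset, 25 bootstrap replicates, and majority voting; it also checks the ≈63% coverage figure empirically (scikit-learn and NumPy assumed):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X, y = make_classification(n_samples=400, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    n = len(X_tr)

    trees, coverage = [], []
    for _ in range(25):
        idx = rng.integers(0, n, size=n)          # draw n indices with replacement
        coverage.append(len(np.unique(idx)) / n)  # fraction of distinct originals in D
        trees.append(DecisionTreeClassifier().fit(X_tr[idx], y_tr[idx]))

    # Each bootstrap sample D covers about 1 - (1 - 1/n)^n ~ 63% of the data.
    print("mean coverage:", round(float(np.mean(coverage)), 3))

    # Aggregate the 25 trees by majority vote (labels are 0/1 here).
    votes = np.array([t.predict(X_te) for t in trees])
    y_pred = (votes.mean(axis=0) > 0.5).astype(int)
    print("bagged accuracy:", (y_pred == y_te).mean())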

  24. Bagging • Bagging improves generalization performance by reducing the variance of the base classifiers; its performance depends on the stability of the base classifier • If the base classifier is unstable, bagging helps to reduce the errors associated with random fluctuations in the training data • If the base classifier is stable, bagging may not improve and could even degrade performance • Bagging is less susceptible to model overfitting when applied to noisy data

  25. Boosting • Sequential production of classifiers • Each classifier is dependent on the previous one, and focuses on the previous one’s errors • Examples that are incorrectly predicted in previous classifiers are chosen more often or weighted more heavily

  26. Boosting • Records that are wrongly classified have their weights increased • Records that are classified correctly have their weights decreased • Boosting algorithms differ in (1) how the weights of the training examples are updated at the end of each round, and (2) how the predictions made by the individual classifiers are combined • In the accompanying example, record 4 is hard to classify • Its weight is increased, so it is more likely to be chosen again in subsequent rounds

  27. Ada-Boosting • Freund and Schapire, 1997 • Ideas • Complex hypotheses tend to overfit • Simple hypotheses may not explain the data well • Combine many simple hypotheses into a complex one • Open issues: how to design the simple hypotheses and how to combine them

  28. Ada-Boosting • Two approaches • Select examples according to the error of the previous classifier (more representatives of misclassified cases are selected) – more common • Weight the errors of misclassified cases more heavily (all cases are used, but with different weights) – does not work for some algorithms

  29. Boosting Example

  30. Ada-Boosting • Input: • Training samples S = {(xi, yi)}, i = 1, 2, …, N • Weak learner h • Initialization • Each sample has equal weight wi = 1/N • For k = 1 … T • Train weak learner hk on the weighted sample set • Compute the weighted classification error • Update the sample weights wi • Output • Final model: a linear combination of the hk (see the sketch below)
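
A minimal sketch of this loop, assuming decision stumps as the weak learner hk, labels coded as −1/+1, a synthetic dataset, and the error/weight updates of the standard AdaBoost.M1 formulation given on slide 39:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=400, random_state=1)
    y = np.where(y == 1, 1, -1)          # code the two classes as -1/+1
    N, T = len(X), 20

    w = np.full(N, 1.0 / N)              # initialization: equal weights w_i = 1/N
    learners, alphas = [], []
    for k in range(T):
        # Train the weak learner h_k on the weighted sample set.
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        miss = (h.predict(X) != y).astype(float)
        err = np.sum(w * miss) / np.sum(w)            # weighted error eps_k
        alpha = np.log((1 - err) / max(err, 1e-10))   # alpha_k = log((1 - eps_k)/eps_k)
        w = w * np.exp(alpha * miss)                  # up-weight misclassified cases
        w = w / w.sum()
        learners.append(h)
        alphas.append(alpha)

    # Output: the final model is a weighted (linear) combination of the h_k.
    H = np.sign(sum(a * h.predict(X) for a, h in zip(alphas, learners)))
    print("training accuracy:", (H == y).mean())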

  31.–36. Ada-Boosting [these slides contain only figures and equations, not reproduced in the transcript]

  37. Some Details • Weak learner: its error rate is only slightly better than random guessing • Boosting: sequentially apply the weak learner to repeatedly modified versions of the data, thereby producing a sequence of weak classifiers h(x); the predictions from all of the weak classifiers are combined through a weighted majority vote • H(x) = sign[Σ_k α_k h_k(x)]

  38. Schematic of AdaBoost [figure]: training samples → h1(x); weighted samples → h2(x); weighted samples → h3(x); … ; weighted samples → hT(x); outputs combined by sign[Σ]

  39. AdaBoost • For k = 1 to T • Fit a learner hk to the training data using weights wi • Compute the weighted error εk = Σ_i wi · I(yi ≠ hk(xi)) / Σ_i wi and αk = log[(1 − εk) / εk] • Set wi ← wi · exp[αk · I(yi ≠ hk(xi))], then renormalize the weights

  40. AdaBoost • It penalizes models that have poor accuracy • If any intermediate round produces an error rate higher than 50%, the weights are reset to 1/n and the resampling procedure is repeated • Because of its tendency to focus on training examples that are wrongly classified, boosting can be quite susceptible to overfitting

  41. AdaBoost • Classification • AdaBoost.M1 (two-class problems) • AdaBoost.M2 (multi-class problems) • Regression • AdaBoost.R

  42. Who is doing better? • “Popular Ensemble Methods: An Empirical Study” by David Opitz and Richard Maclin • Presents a comprehensive evaluation of both bagging and boosting on 23 datasets using decision trees and neural networks

  43. Classifier Ensemble • Neural networks are the basic classification method • An effective combining scheme is to simply average the predictions of the networks • An ideal ensemble consists of highly correct classifiers that disagree as much as possible

  44. Bagging vs. Boosting

  45. Methods compared: 1. single NN; 2. simple NN ensemble; 3. bagging of NNs; 4. arcing of NNs; 5. Ada-boosting of NNs; 6. single decision tree; 7. bagging of decision trees; 8. arcing of decision trees; 9. Ada-boosting of decision trees

  46. Neural Networks [figure]: reduction in error for Ada-boosting, arcing, and bagging of NNs as a percentage of the original error rate; the white bar represents one standard deviation

  47. Decision Trees

  48. Composite Error Rates

  49. Neural Networks: Bagging vs Simple

  50. Ada-Boost: Neural Networks vs. Decision Trees [figure]: boxes represent the reduction in error for NNs and DTs
