
Ensemble Learning



Presentation Transcript


  1. Ensemble Learning Reading: R. Schapire, A brief introduction to boosting

  2. Ensemble learning [Diagram: N training sets S1, ..., SN are each used to train a hypothesis h1, ..., hN; the individual hypotheses are then combined into a single ensemble hypothesis H.]

  3. Advantages of ensemble learning • Can be very effective at reducing generalization error! (E.g., by voting.) • Ideal case: the hi have independent errors

  4. Example Given three hypotheses h1, h2, h3, with hi(x) ∈ {−1, 1}. Suppose each hi has 60% generalization accuracy, and assume errors are independent. Now suppose H(x) is the majority vote of h1, h2, and h3. What is the probability that H is correct?
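Worked answer (the slide leaves this as an exercise; it follows directly from the independence assumption): H is correct exactly when at least two of the three hypotheses are correct:

$P(H \text{ correct}) = \binom{3}{2}(0.6)^2(0.4) + (0.6)^3 = 0.432 + 0.216 = 0.648$

Three 60%-accurate voters already give roughly 65% accuracy.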

  5. Another Example Again, given three hypotheses h1, h2, h3. Suppose each hi has 40% generalization accuracy, and assume errors are independent. Now suppose we classify x as the majority vote of h1, h2, and h3. What is the probability that the classification is correct?
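Worked answer, by the same binomial reasoning:

$P(\text{correct}) = \binom{3}{2}(0.4)^2(0.6) + (0.4)^3 = 0.288 + 0.064 = 0.352$

Voting amplifies the base accuracy in both directions: with members below 50%, the majority vote is worse than any individual member.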

  6. General case In general, if hypotheses h1, ..., hM all have generalization accuracy A, what is the probability that a majority vote will be correct?
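Generalizing the two examples (assuming M is odd, so there are no ties, and errors are independent), the majority vote is correct when more than half of the hypotheses are correct:

$P(\text{majority correct}) = \sum_{k=(M+1)/2}^{M} \binom{M}{k} A^k (1-A)^{M-k}$

For A > 0.5 this probability tends to 1 as M grows; for A < 0.5 it tends to 0.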

  7. Possible problems with ensemble learning • Errors are typically not independent. • Training time and classification time are increased by a factor of M. • It is hard to explain how the ensemble hypothesis performs its classification. • How do we get enough data to create M separate data sets S1, ..., SM?

  8. Three popular methods: • Voting: • Train a classifier on M different training sets Si to obtain M different classifiers hi. • For a new instance x, define H(x) as $H(x) = \mathrm{sgn}\left(\sum_{i=1}^{M} \alpha_i h_i(x)\right)$, where αi is a confidence measure for classifier hi.

  9. Bagging (Breiman, 1990s): • To create Si, create “bootstrap replicates” of the original training set S, i.e., sample |S| examples from S uniformly with replacement. • Boosting (Schapire & Freund, 1990s): • To create Si, reweight the examples in the original training set S according to whether or not they were misclassified on the previous round. (Both sampling schemes are sketched below.)
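A minimal Python sketch of the two sampling schemes (the helper names are illustrative, not from the reading):

    import random

    def bagging_replicate(S):
        # Bagging: draw |S| examples from S uniformly, with replacement
        return [random.choice(S) for _ in range(len(S))]

    def boosting_replicate(S, w):
        # Boosting: draw |S| examples from S according to the current weights w
        return random.choices(S, weights=w, k=len(S))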

  10. Adaptive Boosting (Adaboost) A method for combining different weak hypotheses (training error close to but less than 50%) to produce a strong hypothesis (training error close to 0%)

  11. Sketch of algorithm Given examples S and learning algorithm L, with |S| = N: • Initialize a probability distribution over the examples: w1(i) = 1/N. • Repeatedly run L on training sets St ⊆ S to produce h1, h2, ..., hK. • At each step, derive St from S by choosing examples probabilistically according to the probability distribution wt. Use St to learn ht. • At each step, derive wt+1 by giving more probability to examples that were misclassified at step t. • The final ensemble classifier H is a weighted sum of the ht’s, with each weight being a function of the corresponding ht’s error on its training set.

  12. Adaboost algorithm • Given S = {(x1, y1), ..., (xN, yN)}, where xi ∈ X and yi ∈ {+1, −1} • Initialize w1(i) = 1/N (uniform distribution over the data)

  13. For t = 1, ..., K: • Select new training set St from S with replacement, according to wt • Train L on St to obtain hypothesis ht • Compute the training error εt of ht on S: $\varepsilon_t = \sum_{i\,:\,h_t(x_i) \neq y_i} w_t(i)$ • Compute the coefficient $\alpha_t = \frac{1}{2}\ln\frac{1-\varepsilon_t}{\varepsilon_t}$

  14. Compute new weights on the data: for i = 1 to N, $w_{t+1}(i) = \frac{w_t(i)\,e^{-\alpha_t y_i h_t(x_i)}}{Z_t}$, where Zt is a normalization factor chosen so that wt+1 will be a probability distribution: $Z_t = \sum_{i=1}^{N} w_t(i)\,e^{-\alpha_t y_i h_t(x_i)}$

  15. At the end of K iterations of this algorithm, we have h1, h2, ..., hK. We also have α1, α2, ..., αK, where $\alpha_t = \frac{1}{2}\ln\frac{1-\varepsilon_t}{\varepsilon_t}$ • Ensemble classifier: $H(x) = \mathrm{sgn}\left(\sum_{t=1}^{K} \alpha_t h_t(x)\right)$ • Note that hypotheses with higher accuracy on their training sets are weighted more strongly.
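Putting slides 12–15 together, here is a minimal runnable sketch in Python (the weak learner L is whatever algorithm you plug in; the early exit when εt is 0 or ≥ 0.5 is a common safeguard, not something the slides specify):

    import math
    import random

    def adaboost(S, L, K):
        # S: list of (x, y) pairs with y in {+1, -1}
        # L: weak learner; L(St) returns a hypothesis h, a function x -> {+1, -1}
        N = len(S)
        w = [1.0 / N] * N                              # w1(i) = 1/N
        hypotheses, alphas = [], []
        for t in range(K):
            St = random.choices(S, weights=w, k=N)     # sample St from S by wt
            h = L(St)
            # weighted training error of ht on S
            eps = sum(wi for wi, (x, y) in zip(w, S) if h(x) != y)
            if eps == 0.0 or eps >= 0.5:               # guard: not a useful weak hypothesis
                break
            alpha = 0.5 * math.log((1 - eps) / eps)
            # raise weight of misclassified examples, lower the rest, renormalize
            w = [wi * math.exp(-alpha * y * h(x)) for wi, (x, y) in zip(w, S)]
            Z = sum(w)
            w = [wi / Z for wi in w]
            hypotheses.append(h)
            alphas.append(alpha)
        def H(x):                                      # weighted-vote ensemble classifier
            return 1 if sum(a * h(x) for a, h in zip(alphas, hypotheses)) >= 0 else -1
        return H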

  16. A Hypothetical Example where {x1, x2, x3, x4} are class +1 and {x5, x6, x7, x8} are class −1. t = 1: w1 = {1/8, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8} S1 = {x1, x2, x2, x5, x5, x6, x7, x8} (notice some repeats) Train classifier on S1 to get h1. Run h1 on S. Suppose the classifications are: {1, −1, −1, −1, −1, −1, −1, −1} • Calculate error: x2, x3, and x4 are misclassified, so ε1 = 1/8 + 1/8 + 1/8 = 0.375

  17. Calculate α’s: α1 = ½ ln((1 − 0.375)/0.375) = ½ ln(5/3) ≈ 0.255 Calculate new w’s: each correctly classified example’s weight is multiplied by e^−α1, each misclassified example’s weight by e^α1, and the results are normalized by Z1, giving the w2 shown on the next slide.

  18. t = 2 • w2 = {0.102, 0.163, 0.163, 0.163, 0.102, 0.102, 0.102, 0.102} • S2 = {x1, x2, x2, x3, x4, x4, x7, x8} • Train classifier on S2 to get h2 • Run h2 on S. Suppose the classifications are: {1, 1, 1, 1, 1, 1, 1, 1} • Calculate error: x5, x6, x7, and x8 are misclassified, so ε2 = 4 × 0.102 = 0.408

  19. Calculate α’s: α2 = ½ ln((1 − 0.408)/0.408) ≈ 0.186 Calculate new w’s: reweight and renormalize as before, giving the w3 shown on the next slide.

  20. t = 3 • w3 = {0.082, 0.139, 0.139, 0.139, 0.125, 0.125, 0.125, 0.125} • S3 = {x2, x3, x3, x3, x5, x6, x7, x8} • Train classifier on S3 to get h3 • Run h3 on S. Suppose the classifications are: {1, 1, −1, 1, −1, −1, 1, −1} • Calculate error: x3 and x7 are misclassified, so ε3 = 0.139 + 0.125 = 0.264

  21. Calculate α’s: α3 = ½ ln((1 − 0.264)/0.264) ≈ 0.513 • Ensemble classifier: H(x) = sgn(α1 h1(x) + α2 h2(x) + α3 h3(x)) ≈ sgn(0.255 h1(x) + 0.186 h2(x) + 0.513 h3(x))

  22. Recall the training set: {x1, x2, x3, x4} are class +1 and {x5, x6, x7, x8} are class −1. What is the accuracy of H on the training data?
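Worked out with the (rounded) α values from the previous slides, H(xi) = sgn(0.255 h1(xi) + 0.186 h2(xi) + 0.513 h3(xi)):

x1: +0.255 + 0.186 + 0.513 = +0.954 → +1 ✓
x2: −0.255 + 0.186 + 0.513 = +0.444 → +1 ✓
x3: −0.255 + 0.186 − 0.513 = −0.582 → −1 ✗
x4: −0.255 + 0.186 + 0.513 = +0.444 → +1 ✓
x5: −0.255 + 0.186 − 0.513 = −0.582 → −1 ✓
x6: −0.255 + 0.186 − 0.513 = −0.582 → −1 ✓
x7: −0.255 + 0.186 + 0.513 = +0.444 → +1 ✗
x8: −0.255 + 0.186 − 0.513 = −0.582 → −1 ✓

H misclassifies only x3 and x7, so its training accuracy is 6/8 = 75%.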

  23. HW 2: SVMs and Boosting Caveat: I don’t know if boosting will help. But worth a try!

  24. Boosting implementation in HW 2 Use arrays and vectors. Let S be the training set. You can set S to DogsVsCats.train or some subset of examples. Let M = |S| be the number of training examples. Let K be the number of boosting iterations.
; Initialize:
Read in S as an array of M rows and 65 columns.
Create the weight vector w(1) = (1/M, 1/M, 1/M, ..., 1/M)

  25. ; Run Boosting Iterations:
For T = 1 to K {
  ; Create training set S<T>:
  For i = 1 to M {
    Select (with replacement) an example from S using the weights in w. (Use the “roulette wheel” selection algorithm; a sketch follows slide 26.)
    Print that example to S<T> (i.e., “S1”)
  }
  ; Learn classifier h<T>:
  svm_learn <kernel parameters> S<T> h<T>
  ; Calculate Error<T>:
  svm_classify S h<T> <T>.predictions (e.g., “svm_classify S h1 1.predictions”)
  Use <T>.predictions to determine which examples in S were misclassified.
  Error<T> = sum of the weights of the examples that were misclassified.

  26. ; Calculate Alpha<T>
  ; Calculate the new weight vector, using Alpha<T> and the <T>.predictions file
} ; T ← T + 1
At this point, you have h1, h2, ..., hK [the K different classifiers] and Alpha1, Alpha2, ..., AlphaK
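A minimal Python sketch of the roulette-wheel step from slide 25 (illustrative only; the assignment itself writes the selected examples out to the S<T> files):

    import random

    def roulette_select(S, w):
        # Spin a wheel whose slot sizes are the weights; return the chosen example
        r = random.random() * sum(w)
        total = 0.0
        for example, weight in zip(S, w):
            total += weight
            if r <= total:
                return example
        return S[-1]   # guard against floating-point round-off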

  27. ; Run Ensemble Classifier on test data (DogsVsCats.test):
For T = 1 to K {
  svm_classify DogsVsCats.test h<T>.model Test<T>.predictions
}
At this point you have Test1.predictions, Test2.predictions, ..., TestK.predictions.
To classify test example x1:
The sign of row 1 of Test1.predictions gives h1(x1)
The sign of row 1 of Test2.predictions gives h2(x1)
etc.
H(x1) = sgn[Alpha1 * h1(x1) + Alpha2 * h2(x1) + ... + AlphaK * hK(x1)]
Compute H(x) for every x in the test set.
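A minimal sketch of this combination step in Python, assuming each Test<T>.predictions file holds one real-valued score per line (the sign of the score is h<T>’s vote) and that the Alpha values were saved from the boosting loop:

    def ensemble_predict(prediction_files, alphas):
        # Read one column of {+1, -1} votes per classifier
        columns = []
        for path in prediction_files:
            with open(path) as f:
                columns.append([1 if float(line) >= 0 else -1 for line in f])
        # For each test example (row), take the alpha-weighted vote
        return [1 if sum(a * h for a, h in zip(alphas, row)) >= 0 else -1
                for row in zip(*columns)]

    # e.g., ensemble_predict(["Test1.predictions", "Test2.predictions"], [0.42, 0.31])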

  28. Sources of Classifier Error: Bias, Variance, and Noise • Bias: • The classifier cannot learn the correct hypothesis no matter what training data it is given, so an incorrect hypothesis h is learned. The bias is the average error of h over all possible training sets. • Variance: • The training data is not representative enough of all the data, so the learned classifier h varies from one training set to another. • Noise: • The training data contains errors, so an incorrect hypothesis h is learned. Give examples of these.

  29. Adaboost seems to reduce both bias and variance, and it does not seem to overfit as K increases. Optional: read about the “margin theory” explanation of the success of boosting.
