Ensemble Learning

Ensemble Learning Reading: R. Schapire, A brief introduction to boosting

Ensemble learning Training sets Hypotheses Ensemble hypothesis S1h1 S2h2 . . . SN hN H

Advantages of ensemble learning • Can be very effective at reducing generalization error! (E.g., by voting.) • Ideal case: the hi have independent errors

Example Given three hypotheses, h1, h2, h3, with hi(x)  {−1,1} Suppose each hi has 60% generalization accuracy, and assume errors are independent. Now suppose H(x) is the majority vote of h1, h2, and h3 . What is probability that H is correct?

Another Example Again,given three hypotheses, h1, h2, h3. Suppose each hi has 40% generalization accuracy, and assume errors are independent. Now suppose we classify x as the majority vote of h1, h2, and h3 . What is probability that the classification is correct?

General case In general, if hypotheses h1, ..., hMall have generalization accuracy A, what is probability that a majority vote will be correct?

Possible problems with ensemble learning • Errors are typically not independent • Training time and classification time are increased by a factor of M. • Hard to explain how ensemble hypothesis does classification. • How to get enough data to create M separate data sets, S1, ..., SM?

Three popular methods: • Voting: • Train classifier on M different training sets Si to obtain M different classifiers hi. • For a new instance x, define H(x) as: where αi is a confidence measure for classifier hi

Bagging (Breiman, 1990s): • To create Si, create “bootstrap replicates” of original training set S • Boosting (Schapire & Freund, 1990s) • To create Si, reweight examples in original training set S as a function of whether or not they were misclassified on the previous round.

Adaptive Boosting (Adaboost) A method for combining different weak hypotheses (training error close to but less than 50%) to produce a strong hypothesis (training error close to 0%)

Sketch of algorithm Given examples Sand learning algorithm L, with | S | = N • Initialize probability distribution over examples w1(i) = 1/N . • Repeatedly run L on training sets St Sto produce h1, h2, ... , hK. • At each step, derive St from S by choosing examples probabilistically according to probability distribution wt . Use St to learn ht. • At each step, derive wt+1 by giving more probability to examples that were misclassified at step t. • The final ensemble classifier H is a weighted sum of the ht’s, with each weight being a function of the corresponding ht’s error on its training set.

Adaboost algorithm • Given S= {(x1, y1), ..., (xN, yN)} where x X,yi {+1, −1} • Initialize w1(i) = 1/N. (Uniform distribution over data)

For t = 1, ..., K: • Select new training set St from S with replacement, according to wt • Train L on St to obtain hypothesis ht • Compute the training error tof ht on S : • Compute coefficient

Compute new weights on data: For i = 1 to N where Zt is a normalization factor chosen so that wt+1 will be a probability distribution:

At the end of K iterations of this algorithm, we have h1, h2, . . . , hK We also have 1, 2, . . . ,K, where • Ensemble classifier: • Note that hypotheses with higher accuracy on their training sets are weighted more strongly.

A Hypothetical Example where { x1, x2, x3, x4 } are class +1 {x5, x6, x7, x8 } are class −1 t = 1 : w1 = {1/8, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8} S1 = {x1, x2, x2, x5,x5,x6, x7, x8} (notice some repeats) Train classifier on S1 to get h1 Run h1 on S. Suppose classifications are: {1, −1, −1, −1, −1, −1, −1, −1} • Calculate error:

Calculate ’s: Calculate new w’s:

t = 2 • w2 = {0.102, 0.163, 0.163, 0.163, 0.102, 0.102, 0.102, 0.102} • S2 = {x1, x2, x2, x3,x4,x4,x7, x8} • Run classifier on S2 to get h2 • Run h2 on S. Suppose classifications are: {1, 1, 1, 1, 1, 1, 1, 1} • Calculate error:

Calculate ’s: Calculate w’s:

t =3 • w3 = {0.082, 0.139, 0.139, 0.139, 0.125, 0.125, 0.125, 0.125} • S3 = {x2, x3, x3,x3,x5,x6,x7, x8} • Run classifier on S3 to get h3 • Run h3 on S. Suppose classifications are: {1, 1, −1, 1, −1, −1, 1, −1} • Calculate error:

Calculate ’s: • Ensemble classifier:

Recall the training set: What is the accuracy of H on the training data? • where { x1, x2, x3, x4 } are class +1 • {x5, x6, x7, x8 } are class −1

HW 2: SVMs and Boosting Caveat: I don’t know if boosting will help. But worth a try!

Boosting implementation in HW 2 Use arrays and vectors. Let S be the training set. You can set S to DogsVsCats.train or some subset of examples. Let M = |S| be the number of training examples. Let K be the number of boosting iterations. ; Initialize: Read in S as an array of M rows and 65 columns. Create the weight vector w(1) = (1/M, 1/M, 1/M, ..., 1/M)

; Run Boosting Iterations: For T = 1 to K { ; Create training set T: For i = 1 to M { Select (with replacement) an example from S using weights in w. (Use “roulette wheel” selection algorithm) Print that example to S<T> (i.e., “S1”) ; Learn classifier T svm_learn <kernel parameters>S<T> h<T> ; Calculate Error<T> : svm_classify S h<T> <T>.predictions (e.g., “svm_classify S h1 1.predictions”) Use <T>.predictionsto determine which examples in S were misclassified. Error<T> = Sum of weights of examples that were misclassified.

; Calculate Alpha<T> ; Calculate new weight vector, using Alpha<T> and <T>.predictions file } ; T  T + 1 At this point, you have h1, h2, ..., hK [the K different classifiers] Alpha1, Alpha2, ..., AlphaK

; Run Ensemble Classifier on test data (DogsVsCats.test): For T = 1 to K: svm_classifyDogsVsCats.test h<T>.model Test(T).predictions } At this point you have: Test1.predictions, Test2.predictions,...,TestK.predictions To classify test example x1: The sign of row 1 of Test1.predictions gives h1(x1) The sign of row 1 of Test2.predictions gives h2(x1) etc. H(x1) = sgn [Alpha1 * h1(x1) + Alpha2 * h2(x1) + ... + AlphaK* hK(x1)] Compute H(x) for every x in the test set.

Sources of Classifier Error: Bias, Variance, and Noise • Bias: • Classifier cannot learn the correct hypothesis (no matter what training data is given), and so incorrect hypothesis h is learned. The bias is the average error of h over all possible training sets. • Variance: • Training data is not representative enough of all data, so the learned classifier h varies from one training set to another. • Noise: • Training data contains errors, so incorrect hypothesis h is learned. Give examples of these.

Adaboost seems to reduce both bias and variance. Adaboost does not seem to overfit for increasing K. Optional: Read about “Margin-theory” explanation of success of Boosting

Ensemble Learning