Boosting and Additive Trees (Part 1) Ch. 10 Presented by Tal Blum
Overview • Ensemble methods and motivations • Describing the Adaboost.M1 algorithm • Show that Adaboost minimizes the exponential loss • Other loss functions for classification and regression
Ensemble Learning – Additive Models • INTUITION: Combining the predictions of an ensemble is more accurate than using a single classifier. • Justification (several reasons): • It is easy to find many moderately accurate "rules of thumb", but hard to find a single highly accurate prediction rule. • If the training examples are few and the hypothesis space is large, then there are several equally accurate classifiers (model uncertainty). • The hypothesis space may not contain the true function, but a linear combination of hypotheses might. • Exhaustive global search in the hypothesis space is expensive, so we can combine the predictions of several locally accurate classifiers. • Examples: Bagging, HME, Splines
Example learning curve for the simulated data: Y = 1 if ∑j Xj² > χ²10(0.5), and Y = −1 otherwise
Adaboost.M1 Algorithm • W(x) is the distribution of weights over the N training points, ∑ W(xi) = 1 • Initially assign uniform weights W0(xi) = 1/N for all i. • At each iteration k: • Find the best weak classifier Ck(x) using the weights Wk(x) • Compute the weighted error rate εk = [ ∑ Wk(xi) ∙ I(yi ≠ Ck(xi)) ] / [ ∑ Wk(xi) ] • Set αk = log((1 − εk)/εk), the weight of classifier Ck in the final hypothesis • For each xi, update Wk+1(xi) = Wk(xi) ∙ exp[αk ∙ I(yi ≠ Ck(xi))] • CFINAL(x) = sign[ ∑ αk Ck(x) ]
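A minimal Python sketch of the algorithm above, assuming depth-1 decision trees (stumps) from scikit-learn as the weak classifiers; the stump choice and the function names are illustrative assumptions, not part of the slides.

import numpy as np
from sklearn.tree import DecisionTreeClassifier  # depth-1 trees as weak learners (an assumption)

def adaboost_m1(X, y, n_rounds=50):
    """Train Adaboost.M1 on labels y in {-1, +1}; returns (classifiers, alphas)."""
    n = len(y)
    w = np.full(n, 1.0 / n)                  # uniform initial weights W0
    classifiers, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)     # best weak classifier Ck under the current weights
        miss = (stump.predict(X) != y)       # indicator I(yi != Ck(xi))
        err = np.sum(w * miss) / np.sum(w)   # weighted error rate eps_k
        if err == 0 or err >= 0.5:           # stop if the weak learner is perfect or no better than chance
            break
        alpha = np.log((1 - err) / err)      # classifier weight alpha_k
        w *= np.exp(alpha * miss)            # up-weight the misclassified points
        classifiers.append(stump)
        alphas.append(alpha)
    return classifiers, alphas

def adaboost_predict(classifiers, alphas, X):
    """Weighted majority vote: sign of the alpha-weighted sum of weak predictions."""
    votes = sum(a * c.predict(X) for c, a in zip(classifiers, alphas))
    return np.sign(votes)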
Boosting as an Additive Model • The final prediction in boosting, f(x), can be expressed as an additive expansion of the individual classifiers: f(x) = ∑m=1..M βm b(x; γm), where the βm are expansion coefficients and the b(x; γm) are simple basis functions (here, the weak classifiers). • The process is iterative: at each step one new basis function is added to the expansion. • Typically we would try to minimize a loss function on the training examples: min over {βm, γm} of ∑i=1..N L(yi, ∑m=1..M βm b(xi; γm))
Forward Stagewise Additive Modeling – algorithm • Initialize f0(x) = 0 • For m = 1 to M: • Compute (βm, γm) = argmin over (β, γ) of ∑i=1..N L(yi, fm−1(xi) + β b(xi; γ)) • Set fm(x) = fm−1(x) + βm b(x; γm)
Forward Stagewise Additive Modeling • Sequentially add new basis functions without adjusting the parameters and coefficients of the previously added functions. • Simple case: squared-error loss L(y, f(x)) = (y − f(x))². • With squared-error loss, forward stagewise modeling amounts to fitting the residuals yi − fm−1(xi) from the previous iteration (see the sketch below). • Squared-error loss is not robust for classification.
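A minimal sketch of forward stagewise fitting under squared-error loss, where each new basis function is a regression stump fit to the current residuals; the stump learner and helper names are illustrative assumptions.

import numpy as np
from sklearn.tree import DecisionTreeRegressor  # regression stumps as basis functions (an assumption)

def forward_stagewise_squared_loss(X, y, n_steps=100):
    """Fit f(x) = sum_m b(x; gamma_m) by repeatedly fitting stumps to the residuals."""
    f = np.zeros(len(y))               # f_0(x) = 0
    basis = []
    for _ in range(n_steps):
        residual = y - f               # under squared loss, the new basis fits yi - f_{m-1}(xi)
        stump = DecisionTreeRegressor(max_depth=1)
        stump.fit(X, residual)
        f += stump.predict(X)          # f_m = f_{m-1} + b(x; gamma_m)
        basis.append(stump)
    return basis

def predict_additive(basis, X):
    return sum(b.predict(X) for b in basis)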
Exponential Loss and Adaboost • For classification, Adaboost uses the exponential loss function L(y, f(x)) = exp(−y ∙ f(x)). • Adaboost.M1 can be shown to be equivalent to forward stagewise additive modeling under this loss (next slides).
Exponential Loss and Adaboost • At step m we must solve (βm, Gm) = argmin over (β, G) of ∑i=1..N exp[−yi (fm−1(xi) + β G(xi))] = argmin over (β, G) of ∑i=1..N wi(m) ∙ exp[−β yi G(xi)], where wi(m) = exp[−yi fm−1(xi)] depends on neither β nor G. • Assuming β > 0, the best G is the classifier that minimizes the weighted misclassification error.
Finding the best β • Gm = argmin over G of ∑i=1..N wi(m) ∙ I(yi ≠ G(xi)) • With Gm fixed, the objective as a function of β is H(β) = (e^β − e^−β) ∙ errm + e^−β, where errm = [ ∑i=1..N wi(m) ∙ I(yi ≠ Gm(xi)) ] / [ ∑i=1..N wi(m) ] • βm = argmin over β of H(β) = ½ log((1 − errm)/errm) • Note: the classifier weight in Adaboost.M1 is αm = 2βm; the factor 2 does not change the sign of the final weighted vote.
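A small numerical check (an illustrative addition, not from the slides) that the closed form β = ½ log((1 − err)/err) minimizes H(β) = (e^β − e^−β) ∙ err + e^−β:

import numpy as np

def H(beta, err):
    """Objective in beta once the best weak classifier G_m is fixed."""
    return (np.exp(beta) - np.exp(-beta)) * err + np.exp(-beta)

err = 0.3                                     # an example weighted error rate
betas = np.linspace(0.01, 3.0, 10000)         # grid search over beta > 0
beta_grid = betas[np.argmin(H(betas, err))]
beta_closed = 0.5 * np.log((1 - err) / err)   # closed-form minimizer
print(beta_grid, beta_closed)                 # the two values agree up to grid resolution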
Historical Notes • Adaboost was first presented in machine learning theory as a way to boost a weak classifier into a strong one. • At first people thought it defied the "no free lunch" theorem and did not overfit. • The connection between Adaboost and forward stagewise additive modeling was only recently discovered (Friedman, Hastie & Tibshirani, 2000).
Why Exponential Loss? • Mainly computational: • Derivatives are easy to compute. • The optimal weak classifier at each step minimizes a weighted sample error. • Under mild assumptions the instance weights decrease exponentially fast. • Statistical: • Exponential loss is not necessary for the success of boosting – "On Boosting and the Exponential Loss" (Wyner). • We will see what the exponential loss estimates in the next slides.
Why Exponential Loss? • Population minimizer (Friedman 2000): f*(x) = argmin over f of E[e^(−Y f(x)) | x] = ½ log [ Pr(Y = 1 | x) / Pr(Y = −1 | x) ] • This justifies using its sign as a classification rule.
Why Exponential Loss? • For exponential loss, f can be interpreted as half a logit transform: p(x) = Pr(Y = 1 | x) = 1 / (1 + e^(−2 f(x))). • The binomial deviance (negative binomial log-likelihood) −l(Y, f(x)) = log(1 + e^(−2 Y f(x))) has the same population minimizer as the exponential loss.
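A short numerical illustration (not from the slides) that, for a fixed p = Pr(Y = 1 | x), the expected exponential loss and the expected binomial deviance are both minimized at the same f = ½ log(p/(1 − p)):

import numpy as np

p = 0.7                                   # an example conditional probability Pr(Y = 1 | x)
fs = np.linspace(-4, 4, 160001)

exp_loss = p * np.exp(-fs) + (1 - p) * np.exp(fs)                              # E[exp(-Y f)]
deviance = p * np.log1p(np.exp(-2 * fs)) + (1 - p) * np.log1p(np.exp(2 * fs))  # E[log(1 + exp(-2 Y f))]

print(fs[np.argmin(exp_loss)])            # ~ 0.4236
print(fs[np.argmin(deviance)])            # ~ 0.4236
print(0.5 * np.log(p / (1 - p)))          # half the log-odds, ~ 0.4236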
Loss Functions and Robustness • For a finite dataset, exponential loss and binomial deviance are not the same. • Both criteria are monotonically decreasing functions of the margin y ∙ f(x). • Examples with negative margin y ∙ f(x) < 0 are classified incorrectly; examples with positive margin are classified correctly.
Loss Functions and Robustness • The problem: misclassification loss is not differentiable everywhere, and its derivative is 0 wherever it is differentiable, so it cannot be minimized directly by gradient methods. • We want a criterion that is efficient to optimize and as close as possible to the true classification loss. • Any loss criterion used for classification should penalize examples with negative margins more heavily than those with positive margins. • Squared-error loss also penalizes examples whose margin exceeds 1 (confidently correct classifications), so it is not appropriate for classification.
Loss Functions and Robustness • Both functions can be thought of as continuous approximations to the misclassification loss. • Exponential loss grows exponentially fast for instances with large negative margin. • The weights of such instances increase exponentially. • This makes Adaboost very sensitive to mislabeled examples. • The deviance generalizes to K classes; the exponential loss does not. • A comparison of the losses as functions of the margin is sketched below.
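A small plotting sketch (an illustrative addition) comparing misclassification, exponential, binomial deviance, and squared-error losses as functions of the margin y ∙ f(x):

import numpy as np
import matplotlib.pyplot as plt

margin = np.linspace(-2, 2, 400)                   # y * f(x)

losses = {
    "misclassification": (margin < 0).astype(float),
    "exponential": np.exp(-margin),
    "binomial deviance": np.log1p(np.exp(-2 * margin)) / np.log(2),  # rescaled to pass through (0, 1)
    "squared error": (1 - margin) ** 2,            # grows again for margins > 1
}

for name, loss in losses.items():
    plt.plot(margin, loss, label=name)
plt.xlabel("margin y*f(x)")
plt.ylabel("loss")
plt.ylim(0, 4)
plt.legend()
plt.show()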
Robust Loss Functions for Regression • The relationship between squared-error loss and absolute loss is analogous to that between exponential loss and deviance. • Their population solutions are the conditional mean and the conditional median, respectively. • Absolute loss is more robust to outliers. • For regression, squared-error loss leads to a boosting procedure that simply fits the residuals at each step (as in forward stagewise modeling above). • The Huber loss combines the efficiency of squared-error loss under Gaussian errors with robustness to outliers. • Huber loss: L(y, f(x)) = (y − f(x))² if |y − f(x)| ≤ δ, and 2δ|y − f(x)| − δ² otherwise.
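A minimal sketch of the Huber loss above; the function name and the default δ are illustrative assumptions.

import numpy as np

def huber_loss(y, f, delta=1.0):
    """Quadratic for small residuals, linear for large ones (robust to outliers)."""
    resid = np.abs(y - f)
    quadratic = resid ** 2                      # squared-error regime, |y - f| <= delta
    linear = 2 * delta * resid - delta ** 2     # absolute-error regime, |y - f| > delta
    return np.where(resid <= delta, quadratic, linear)

# Example: the linear regime grows far more slowly than squared error for a large outlier.
print(huber_loss(np.array([0.0, 0.0]), np.array([0.5, 10.0])))   # [ 0.25 19.  ]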