Boosting and Additive Trees (Part 1) Ch. 10 Presented by Tal Blum
Overview • Ensemble methods and motivations • Describing the Adaboost.M1 algorithm • Show that Adaboost minimizes the exponential loss • Other loss functions for classification and regression
Ensemble Learning – Additive Models • INTUITION: Combining the predictions of an ensemble is more accurate than using a single classifier. • Justification (several reasons): • It is easy to find many moderately accurate "rules of thumb", but hard to find a single highly accurate prediction rule. • If the training examples are few and the hypothesis space is large, then there are several equally accurate classifiers (model uncertainty). • The hypothesis space may not contain the true function, but a linear combination of hypotheses might. • Exhaustive global search in the hypothesis space is expensive, so we can combine the predictions of several locally accurate classifiers. • Examples: Bagging, HME, Splines
Example learning curve for the simulated data: Y = 1 if ∑j Xj² > χ²10(0.5), and Y = −1 otherwise
Adaboost.M1 Algorithm • W(x) is the distribution of weights over the N training points, ∑ W(xi) = 1 • Initially assign uniform weights W0(xi) = 1/N for all i. • At each iteration k: • Find the best weak classifier Ck(x) using the weights Wk(x) • Compute the weighted error rate εk = [ ∑ Wk(xi) ∙ I(yi ≠ Ck(xi)) ] / [ ∑ Wk(xi) ] • Set αk = log((1 − εk)/εk), the weight of classifier Ck in the final hypothesis • For each xi, update Wk+1(xi) = Wk(xi) ∙ exp[αk ∙ I(yi ≠ Ck(xi))] • CFINAL(x) = sign[ ∑ αk Ck(x) ]
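A minimal Python sketch of the algorithm above, assuming depth-1 decision trees (stumps) from scikit-learn as the weak classifiers; the stump choice and the function names are illustrative assumptions, not part of the slides.

import numpy as np
from sklearn.tree import DecisionTreeClassifier  # depth-1 trees as weak learners (an assumption)

def adaboost_m1(X, y, n_rounds=50):
    """Train Adaboost.M1 on labels y in {-1, +1}; returns (classifiers, alphas)."""
    n = len(y)
    w = np.full(n, 1.0 / n)                  # uniform initial weights W0
    classifiers, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)     # best weak classifier Ck under the current weights
        miss = (stump.predict(X) != y)       # indicator I(yi != Ck(xi))
        err = np.sum(w * miss) / np.sum(w)   # weighted error rate eps_k
        if err == 0 or err >= 0.5:           # stop if the weak learner is perfect or no better than chance
            break
        alpha = np.log((1 - err) / err)      # classifier weight alpha_k
        w *= np.exp(alpha * miss)            # up-weight the misclassified points
        classifiers.append(stump)
        alphas.append(alpha)
    return classifiers, alphas

def adaboost_predict(classifiers, alphas, X):
    """Weighted majority vote: sign of the alpha-weighted sum of weak predictions."""
    votes = sum(a * c.predict(X) for c, a in zip(classifiers, alphas))
    return np.sign(votes)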
Boosting as an Additive Model • The final prediction in boosting, f(x), can be expressed as an additive expansion of the individual classifiers: f(x) = ∑m=1..M βm b(x; γm), where the βm are expansion coefficients and the b(x; γm) are simple basis functions (here, the weak classifiers). • The process is iterative: at each step one new basis function is added to the expansion. • Typically we would try to minimize a loss function on the training examples: min over {βm, γm} of ∑i=1..N L(yi, ∑m=1..M βm b(xi; γm))
Forward Stagewise Additive Modeling – algorithm • Initialize f0(x) = 0 • For m = 1 to M: • Compute (βm, γm) = argmin over (β, γ) of ∑i=1..N L(yi, fm−1(xi) + β b(xi; γ)) • Set fm(x) = fm−1(x) + βm b(x; γm)
Forward Stagewise Additive Modeling • Sequentially add new basis functions without adjusting the parameters and coefficients of the previously added functions. • Simple case: squared-error loss L(y, f(x)) = (y − f(x))². • With squared-error loss, forward stagewise modeling amounts to fitting the residuals yi − fm−1(xi) from the previous iteration (see the sketch below). • Squared-error loss is not robust for classification.
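A minimal sketch of forward stagewise fitting under squared-error loss, where each new basis function is a regression stump fit to the current residuals; the stump learner and helper names are illustrative assumptions.

import numpy as np
from sklearn.tree import DecisionTreeRegressor  # regression stumps as basis functions (an assumption)

def forward_stagewise_squared_loss(X, y, n_steps=100):
    """Fit f(x) = sum_m b(x; gamma_m) by repeatedly fitting stumps to the residuals."""
    f = np.zeros(len(y))               # f_0(x) = 0
    basis = []
    for _ in range(n_steps):
        residual = y - f               # under squared loss, the new basis fits yi - f_{m-1}(xi)
        stump = DecisionTreeRegressor(max_depth=1)
        stump.fit(X, residual)
        f += stump.predict(X)          # f_m = f_{m-1} + b(x; gamma_m)
        basis.append(stump)
    return basis

def predict_additive(basis, X):
    return sum(b.predict(X) for b in basis)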
Exponential Loss and Adaboost • For classification, Adaboost uses the exponential loss function L(y, f(x)) = exp(−y ∙ f(x)). • Adaboost.M1 can be shown to be equivalent to forward stagewise additive modeling under this loss (next slides).
Exponential Loss and Adaboost • At step m we must solve (βm, Gm) = argmin over (β, G) of ∑i=1..N exp[−yi (fm−1(xi) + β G(xi))] = argmin over (β, G) of ∑i=1..N wi(m) ∙ exp[−β yi G(xi)], where wi(m) = exp[−yi fm−1(xi)] depends on neither β nor G. • Assuming β > 0, the best G is the classifier that minimizes the weighted misclassification error.
Finding the best β • Gm = argmin over G of ∑i=1..N wi(m) ∙ I(yi ≠ G(xi)) • With Gm fixed, the objective as a function of β is H(β) = (e^β − e^−β) ∙ errm + e^−β, where errm = [ ∑i=1..N wi(m) ∙ I(yi ≠ Gm(xi)) ] / [ ∑i=1..N wi(m) ] • βm = argmin over β of H(β) = ½ log((1 − errm)/errm) • Note: the classifier weight in Adaboost.M1 is αm = 2βm; the factor 2 does not change the sign of the final weighted vote.
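A small numerical check (an illustrative addition, not from the slides) that the closed form β = ½ log((1 − err)/err) minimizes H(β) = (e^β − e^−β) ∙ err + e^−β:

import numpy as np

def H(beta, err):
    """Objective in beta once the best weak classifier G_m is fixed."""
    return (np.exp(beta) - np.exp(-beta)) * err + np.exp(-beta)

err = 0.3                                     # an example weighted error rate
betas = np.linspace(0.01, 3.0, 10000)         # grid search over beta > 0
beta_grid = betas[np.argmin(H(betas, err))]
beta_closed = 0.5 * np.log((1 - err) / err)   # closed-form minimizer
print(beta_grid, beta_closed)                 # the two values agree up to grid resolution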
Historical Notes • Adaboost was first presented in machine learning theory as a way to boost a weak classifier into a strong one. • At first people thought it defied the "no free lunch" theorem and did not overfit. • The connection between Adaboost and forward stagewise additive modeling was only recently discovered (Friedman, Hastie & Tibshirani, 2000).
Why Exponential Loss? • Mainly computational: • Derivatives are easy to compute. • The optimal weak classifier at each step minimizes a weighted sample error. • Under mild assumptions the instance weights decrease exponentially fast. • Statistical: • Exponential loss is not necessary for the success of boosting – "On Boosting and the Exponential Loss" (Wyner). • We will see what the exponential loss estimates in the next slides.
Why Exponential Loss? • Population minimizer (Friedman 2000): f*(x) = argmin over f of E[e^(−Y f(x)) | x] = ½ log [ Pr(Y = 1 | x) / Pr(Y = −1 | x) ] • This justifies using its sign as a classification rule.
Why Exponential Loss? • For exponential loss, f can be interpreted as half a logit transform: p(x) = Pr(Y = 1 | x) = 1 / (1 + e^(−2 f(x))). • The binomial deviance (negative binomial log-likelihood) −l(Y, f(x)) = log(1 + e^(−2 Y f(x))) has the same population minimizer as the exponential loss.
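A short numerical illustration (not from the slides) that, for a fixed p = Pr(Y = 1 | x), the expected exponential loss and the expected binomial deviance are both minimized at the same f = ½ log(p/(1 − p)):

import numpy as np

p = 0.7                                   # an example conditional probability Pr(Y = 1 | x)
fs = np.linspace(-4, 4, 160001)

exp_loss = p * np.exp(-fs) + (1 - p) * np.exp(fs)                              # E[exp(-Y f)]
deviance = p * np.log1p(np.exp(-2 * fs)) + (1 - p) * np.log1p(np.exp(2 * fs))  # E[log(1 + exp(-2 Y f))]

print(fs[np.argmin(exp_loss)])            # ~ 0.4236
print(fs[np.argmin(deviance)])            # ~ 0.4236
print(0.5 * np.log(p / (1 - p)))          # half the log-odds, ~ 0.4236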
Loss Functions and Robustness • For a finite dataset, exponential loss and binomial deviance are not the same. • Both criteria are monotonically decreasing functions of the margin y ∙ f(x). • Examples with negative margin y ∙ f(x) < 0 are classified incorrectly; examples with positive margin are classified correctly.
Loss Functions and Robustness • The problem: misclassification loss is not differentiable everywhere, and its derivative is 0 wherever it is differentiable, so it cannot be minimized directly by gradient methods. • We want a criterion that is efficient to optimize and as close as possible to the true classification loss. • Any loss criterion used for classification should penalize examples with negative margins more heavily than those with positive margins. • Squared-error loss also penalizes examples whose margin exceeds 1 (confidently correct classifications), so it is not appropriate for classification.
Loss Functions and Robustness • Both functions can be thought of as continuous approximations to the misclassification loss. • Exponential loss grows exponentially fast for instances with large negative margin. • The weights of such instances increase exponentially. • This makes Adaboost very sensitive to mislabeled examples. • The deviance generalizes to K classes; the exponential loss does not. • A comparison of the losses as functions of the margin is sketched below.
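A small plotting sketch (an illustrative addition) comparing misclassification, exponential, binomial deviance, and squared-error losses as functions of the margin y ∙ f(x):

import numpy as np
import matplotlib.pyplot as plt

margin = np.linspace(-2, 2, 400)                   # y * f(x)

losses = {
    "misclassification": (margin < 0).astype(float),
    "exponential": np.exp(-margin),
    "binomial deviance": np.log1p(np.exp(-2 * margin)) / np.log(2),  # rescaled to pass through (0, 1)
    "squared error": (1 - margin) ** 2,            # grows again for margins > 1
}

for name, loss in losses.items():
    plt.plot(margin, loss, label=name)
plt.xlabel("margin y*f(x)")
plt.ylabel("loss")
plt.ylim(0, 4)
plt.legend()
plt.show()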
Robust Loss Functions for Regression • The relationship between squared-error loss and absolute loss is analogous to that between exponential loss and deviance. • Their population solutions are the conditional mean and the conditional median, respectively. • Absolute loss is more robust to outliers. • For regression, squared-error loss leads to a boosting procedure that simply fits the residuals at each step (as in forward stagewise modeling above). • The Huber loss combines the efficiency of squared-error loss under Gaussian errors with robustness to outliers. • Huber loss: L(y, f(x)) = (y − f(x))² if |y − f(x)| ≤ δ, and 2δ|y − f(x)| − δ² otherwise.
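A minimal sketch of the Huber loss above; the function name and the default δ are illustrative assumptions.

import numpy as np

def huber_loss(y, f, delta=1.0):
    """Quadratic for small residuals, linear for large ones (robust to outliers)."""
    resid = np.abs(y - f)
    quadratic = resid ** 2                      # squared-error regime, |y - f| <= delta
    linear = 2 * delta * resid - delta ** 2     # absolute-error regime, |y - f| > delta
    return np.where(resid <= delta, quadratic, linear)

# Example: the linear regime grows far more slowly than squared error for a large outlier.
print(huber_loss(np.array([0.0, 0.0]), np.array([0.5, 10.0])))   # [ 0.25 19.  ]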