Advanced Machine Learning & Perception • Instructor: Tony Jebara
Boosting • Combining Multiple Classifiers • Voting • Boosting • AdaBoost • Based on material by Y. Freund, P. Long & R. Schapire
Combining Multiple Learners • Have many simple learners, also called base learners or weak learners, which have a classification error of <0.5 • Combine or vote them to get higher accuracy • No free lunch: there is no guaranteed best approach here • Different approaches: • Voting: combine learners with fixed weights • Mixture of Experts: adjust the learners and a variable weight/gating function • Boosting: actively search for the next base learner and vote • Cascading, Stacking, Bagging, etc.
Voting • Have T classifiers • Average their predictions with weights • Like mixture of experts, but the weights are constant with respect to the input • [Figure: experts applied to input x are combined with fixed weights a into the output y]
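A minimal sketch of fixed-weight voting in Python; the threshold classifiers, the weights, and the function names are illustrative assumptions, not from the slides:

import numpy as np

def vote(classifiers, weights, x):
    # Fixed-weight vote: each classifier returns +1 or -1 and the
    # ensemble predicts the sign of the weighted sum.
    total = sum(a * h(x) for a, h in zip(weights, classifiers))
    return np.sign(total)

# Illustrative base classifiers and constant weights (independent of x).
classifiers = [lambda x: 1 if x[0] > 3 else -1,
               lambda x: 1 if x[1] > 5 else -1,
               lambda x: 1 if x[0] + x[1] > 0 else -1]
weights = [0.5, 0.3, 0.2]
print(vote(classifiers, weights, np.array([4.0, 6.0])))  # prints 1.0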
Mixture of Experts • Have T classifiers or experts and a gating function • Average their predictions with variable weights • But adapt the parameters of the gating function and of the experts (fixed total number T of experts) • [Figure: a gating network weights the outputs of the experts as a function of the input]
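For contrast with fixed-weight voting, a minimal sketch of input-dependent gating; the softmax gating network and the matrix W are illustrative assumptions (a real mixture of experts would train W and the experts jointly):

import numpy as np

def gate(W, x):
    # Gating network: softmax of linear scores gives per-expert
    # weights that vary with the input x.
    scores = W @ x
    e = np.exp(scores - scores.max())
    return e / e.sum()

def mixture_predict(experts, W, x):
    # Weighted average of expert outputs; the weights depend on x.
    g = gate(W, x)
    return sum(gi * h(x) for gi, h in zip(g, experts))

experts = [lambda x: 1.0 if x[0] > 3 else -1.0,
           lambda x: 1.0 if x[1] > 5 else -1.0]
W = np.array([[1.0, 0.0],   # expert 1 favored when x[0] is large
              [0.0, 1.0]])  # expert 2 favored when x[1] is large
print(mixture_predict(experts, W, np.array([4.0, 2.0])))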
Boosting • Actively find complementary or synergistic weak learners • Train the next learner based on the mistakes of the previous ones • Average predictions with fixed weights • Find the next learner by training on weighted versions of the data • [Figure: boosting loop — the training data and its weights feed a weak learner, which outputs a weak rule that joins the ensemble of learners, which makes the prediction]
AdaBoost • Most popular weighting scheme • Define the margin for point i • Find an h_t and a weight a_t to minimize the cost function: the sum of exponentiated negative margins • [Figure: the same boosting loop of training data, weights, weak learner, weak rule, ensemble of learners, and prediction]
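In the usual AdaBoost notation, with the ensemble after t rounds written $f_t(x)=\sum_{s\le t}\alpha_s h_s(x)$, the margin and the cost being minimized are:

\[
\text{margin}_i \;=\; y_i\, f_t(x_i),
\qquad
J(h_t,\alpha_t) \;=\; \sum_{i=1}^{N} \exp\!\big(-y_i f_t(x_i)\big)
\;=\; \sum_{i=1}^{N} \exp\!\Big(-y_i \sum_{s=1}^{t} \alpha_s h_s(x_i)\Big).
\]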
AdaBoost • Choose the base learner h_t and the weight a_t to minimize the cost • Recall that the error of the base classifier h_t must be <0.5 • For binary h, AdaBoost puts a specific closed-form weight on each weak learner (instead of the more general rule) • AdaBoost picks the data weights for the next round by exponential reweighting (here Z is the normalizer so the weights sum to 1)
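The closed-form weight and the data-reweighting rule referred to here are, in the standard binary AdaBoost formulation (with $\epsilon_t$ the weighted error of $h_t$):

\[
\alpha_t \;=\; \tfrac{1}{2}\,\ln\frac{1-\epsilon_t}{\epsilon_t},
\qquad
w_i^{(t+1)} \;=\; \frac{w_i^{(t)}\,\exp\!\big(-\alpha_t\, y_i\, h_t(x_i)\big)}{Z_t},
\qquad
Z_t \;=\; \sum_{i=1}^{N} w_i^{(t)}\,\exp\!\big(-\alpha_t\, y_i\, h_t(x_i)\big).
\]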
Decision Trees • [Figure: a small decision tree over two features, testing X>3 and then Y>5, with leaf predictions of -1 or +1; alongside, the corresponding axis-parallel partition of the (X,Y) plane into +1 and -1 regions with thresholds at X=3 and Y=5]
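A compact sketch of AdaBoost using axis-parallel decision stumps like the X>3 and Y>5 tests on this slide as weak learners; the brute-force stump search and all names are illustrative (in practice a library implementation such as scikit-learn's AdaBoostClassifier would be used):

import numpy as np

def best_stump(X, y, w):
    # Pick the (feature, threshold, sign) stump with the smallest
    # weighted error under the current data weights w.
    best = (0, 0.0, 1, np.inf)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for s in (+1, -1):
                pred = np.where(X[:, j] > t, s, -s)
                err = np.sum(w * (pred != y))
                if err < best[3]:
                    best = (j, t, s, err)
    return best

def adaboost(X, y, T=20):
    # y must be +/-1; returns the stumps and their weights alpha_t.
    n = len(y)
    w = np.full(n, 1.0 / n)                 # start with uniform weights
    stumps, alphas = [], []
    for _ in range(T):
        j, t, s, err = best_stump(X, y, w)
        err = np.clip(err, 1e-12, 0.5)      # guard against err = 0
        alpha = 0.5 * np.log((1 - err) / err)
        pred = np.where(X[:, j] > t, s, -s)
        w = w * np.exp(-alpha * y * pred)   # upweight the mistakes
        w /= w.sum()                        # Z_t normalization
        stumps.append((j, t, s))
        alphas.append(alpha)
    return stumps, alphas

def predict(stumps, alphas, X):
    # Prediction is the sign of the weighted vote of the stumps.
    F = np.zeros(len(X))
    for (j, t, s), a in zip(stumps, alphas):
        F += a * np.where(X[:, j] > t, s, -s)
    return np.sign(F)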
Decision tree as a sum • [Figure: the same tree rewritten as a sum of real-valued contributions (values such as +0.2, -0.1, +0.1, -0.3, -0.2) attached to the root and to the branches; the prediction is the sign of the total]
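The figure's exact node values are hard to read off, but the idea is that the tree's output can be written as a root contribution plus per-branch contributions, with the sign taken only at the end; schematically (the contributions c are placeholders):

\[
F(x) \;=\; c_0
\;+\; \mathbf{1}[X\le 3]\,c_{\text{no}}
\;+\; \mathbf{1}[X> 3]\Big(c_{\text{yes}} + \mathbf{1}[Y\le 5]\,c'_{\text{no}} + \mathbf{1}[Y>5]\,c'_{\text{yes}}\Big),
\qquad
\hat{y} \;=\; \operatorname{sign}\big(F(x)\big).
\]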
An alternating decision tree • [Figure: an alternating decision tree that adds a further splitter node Y<1 (with contributions +0.7 and 0.0) alongside the X>3 and Y>5 tests; the prediction is again the sign of the summed contributions along all active paths]
Example: Medical Diagnostics • Cleve dataset from the UC Irvine database • Heart disease diagnostics (+1 = healthy, -1 = sick) • 13 features from tests (real-valued and discrete) • 303 instances
Ad-Tree Example
Cross-validated accuracy
AdaBoost Convergence • Rationale? • Consider a bound on the training error • AdaBoost is essentially doing gradient descent on this bound • Convergence? The derivation uses an exponential bound on the step function, the definition of f(x), and the recursive use of Z • [Figure: loss versus margin, comparing AdaBoost's exponential loss with the LogitBoost loss, the BrownBoost loss, and the 0-1 loss; mistakes correspond to negative margins, correct classifications to positive margins]
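In the usual notation, the bound and its three steps (the exponential upper bound on the 0-1 step, the definition of $f(x)=\sum_t \alpha_t h_t(x)$, and the recursive use of the normalizers $Z_t$) are:

\[
\frac{1}{N}\sum_{i=1}^{N}\mathbf{1}\big[y_i \ne \operatorname{sign}(f(x_i))\big]
\;\le\;
\frac{1}{N}\sum_{i=1}^{N} e^{-y_i f(x_i)}
\;=\;
\prod_{t=1}^{T} Z_t ,
\]

since unrolling $w_i^{(t+1)} = w_i^{(t)} e^{-\alpha_t y_i h_t(x_i)}/Z_t$ from the uniform start $w_i^{(1)} = 1/N$ gives $w_i^{(T+1)} = e^{-y_i f(x_i)}\big/\big(N\prod_t Z_t\big)$, and these weights sum to one.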
AdaBoost Convergence • Convergence? Consider the binary h_t case • So, the final learner's training error converges to zero exponentially fast in T if each weak learner is better than random guessing by at least gamma!
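With the weighted error written as $\epsilon_t = \tfrac{1}{2} - \gamma_t$, the per-round normalizer for the optimal $\alpha_t$ and the resulting rate are:

\[
Z_t \;=\; 2\sqrt{\epsilon_t(1-\epsilon_t)} \;=\; \sqrt{1-4\gamma_t^{2}} \;\le\; e^{-2\gamma_t^{2}},
\qquad
\text{training error} \;\le\; \prod_{t=1}^{T} Z_t \;\le\; \exp\!\Big(-2\sum_{t=1}^{T}\gamma_t^{2}\Big) \;\le\; e^{-2\gamma^{2}T}
\quad\text{if every } \gamma_t \ge \gamma .
\]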
Curious Phenomenon • Boosting decision trees • Using <10,000 training examples we fit >2,000,000 parameters, yet the boosted ensemble does not overfit
Explanation using margins • [Figure: 0-1 loss plotted against the margin]
Explanation using margins • No examples with small margins!! • [Figure: 0-1 loss plotted against the margin, with no training examples left at small margins]
Experimental Evidence
AdaBoost Generalization • Also, a VC analysis gives a generalization bound (where d is the VC dimension of the base classifier) • But more iterations → overfitting! • A margin analysis is possible; redefine the margin as the normalized margin • Then we have a bound that does not grow with T (both bounds are sketched below)
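Up to constants and logarithmic factors, the two bounds referred to here have the standard forms below ($d$ is the VC dimension of the base class, $N$ the number of training examples, $\theta>0$ any margin threshold, and the margin is normalized by $\sum_t \alpha_t$):

\[
\Pr_{\text{test}}\!\big[\text{error}\big] \;\le\; \Pr_{\text{train}}\!\big[\text{error}\big] \;+\; \tilde{O}\!\left(\sqrt{\frac{T\,d}{N}}\right),
\]
\[
\Pr_{\text{test}}\!\big[y f(x) \le 0\big] \;\le\; \Pr_{\text{train}}\!\big[\text{margin}(x,y) \le \theta\big] \;+\; \tilde{O}\!\left(\sqrt{\frac{d}{N\,\theta^{2}}}\right),
\qquad
\text{margin}(x,y) \;=\; \frac{y\sum_t \alpha_t h_t(x)}{\sum_t \alpha_t},
\]

and, unlike the first bound, the margin bound does not grow with the number of rounds $T$.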
AdaBoost Generalization • Suggests the margin-based optimization problem sketched below • [Figure: margin illustration]
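One standard way to write the margin-maximization problem this suggests (this particular formulation is an assumption, not necessarily the one on the slide):

\[
\max_{\alpha \ge 0,\;\sum_t \alpha_t = 1}\;\; \min_{i=1,\dots,N}\;\; y_i \sum_{t} \alpha_t\, h_t(x_i).
\]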
AdaBoost Generalization • Proof Sketch
UCI Results • [Table: % test error rates on UCI benchmark datasets]