A New Boosting Algorithm Using Input-Dependent Regularizer. Rong Jin¹, Yan Liu², Luo Si², Jaime Carbonell², Alex G. Hauptmann². 1. Michigan State University, 2. Carnegie Mellon University.
Outline • Introduction of AdaBoost algorithm • Problems with AdaBoost • New boosting algorithm: input-dependent regularizer • Experiment • Conclusion and future work
AdaBoost Algorithm (I) • Boost a weak classifier into a strong classifier by linearly combining an ensemble of weak classifiers • AdaBoost • Given: a weak classifier $h(x)$ with a large classification error $E_{(x,y)\sim P(x,y)}[h(x) \neq y]$ • Output: $H_T(x) = \alpha_1 h_1(x) + \alpha_2 h_2(x) + \dots + \alpha_T h_T(x)$ with a low classification error $E_{(x,y)\sim P(x,y)}[H_T(x) \neq y]$
AdaBoost Algorithm (II) • Sampling distribution • Focus only on the examples that are misclassified or weakly classified by the previous weak classifiers • Combining weak classifiers • Combination constants are computed to minimize the training error • Choice of $\alpha_t$: [equation not preserved in this transcript]
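The two steps on this slide (reweighting toward hard examples, then choosing a combination constant) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: since the slide's formula for the combination constant is not preserved, the textbook AdaBoost choice $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$ is assumed, and the 1-D threshold stumps and the names `best_stump`, `stump`, `adaboost`, and `predict` are hypothetical.

```python
import math


def stump(x, thr, pol):
    """Weak classifier: predicts `pol` when x >= thr, otherwise `-pol`."""
    return pol if x >= thr else -pol


def best_stump(xs, ys, weights):
    """Return (weighted_error, thr, pol) of the best 1-D threshold stump."""
    best = None
    for thr in sorted(set(xs)):
        for pol in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump(x, thr, pol) != y)
            if best is None or err < best[0]:
                best = (err, thr, pol)
    return best


def adaboost(xs, ys, rounds):
    """Textbook AdaBoost (assumed form) over threshold stumps."""
    n = len(xs)
    weights = [1.0 / n] * n  # sampling distribution D_t, initially uniform
    ensemble = []
    for _ in range(rounds):
        err, thr, pol = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)  # assumed textbook choice
        ensemble.append((alpha, thr, pol))
        # Misclassified examples gain weight, correct ones lose weight.
        weights = [w * math.exp(-alpha * y * stump(x, thr, pol))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Sign of the linear combination H_T(x) = sum_t alpha_t * h_t(x)."""
    score = sum(alpha * stump(x, thr, pol) for alpha, thr, pol in ensemble)
    return 1 if score >= 0 else -1
```

A call such as `adaboost(list(range(10)), [1, 1, 1, -1, -1, -1, 1, 1, 1, -1], rounds=3)` returns a list of `(alpha, thr, pol)` triples that `predict` then combines linearly.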
Problem 1: Overfitting • AdaBoost seldom overfits • It not only minimizes the training error but also tends to maximize the classification margin (Ondar & Muller, 1998; Friedman et al., 1998) • AdaBoost does overfit when the data are noisy (Dietterich, 2000; Ratsch & Muller, 2000; Grove & Schuurmans, 1998) • The sampling distribution $D_t(x)$ can overemphasize noisy patterns • This is due to the “hard margin” criterion (Ratsch et al., 2000)
Problem 1: Overfitting • Introduce regularization • Do not just minimize the training error • Typical solutions • Smooth the combination constants (Schapire & Singer, 1998) • Epsilon boosting: equivalent to L1 regularization (Friedman & Tibshirani, 1998) • Boosting with soft margin (Ratsch et al., 2000) • BrownBoost: a non-monotonic cost function (Freund, 2001)
Problem 2: Why Linear Combination? • Each weak classifier $h_t(x)$ is trained on a different sampling distribution $D_t(x)$ • Each is therefore only good for particular types of input patterns • $\{h_t(x)\}$ is a diverse ensemble • A linear combination with constant weights cannot take full advantage of the diverse ensemble $\{h_t(x)\}$ • Solution: combination constants should be input dependent
Input Dependent Regularizer • Solve the two problems • overfitting and constant combination • Input dependent regularizer • Main idea: different combination form
Roles of the Input-Dependent Regularizer • Regularizer • Prevents $|H_T(x)|$ from growing too fast • Theorem: if all $\alpha_t$ are bounded by $\alpha_{\max}$, then $|H_T(x)| \le a \ln(bT + c)$ • For the linear combination in AdaBoost, $|H_T(x)| \sim O(T)$ • Router • Input-dependent combination constant • The prediction of $h_t(x)$ is used only when $H_{t-1}(x)$ is small • Consistent with the training procedure • $h_t(x)$ is trained on the examples for which $H_{t-1}(x)$ is uncertain
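The contrast between logarithmic and linear growth can be checked numerically. The sketch below is an illustration only: the transcript does not preserve the regularizer's formula, so the damping factor $e^{-|\beta H_{t-1}(x)|}$, the constant $\beta$, and the function name `worst_case_score` are all assumptions. It takes the worst case in which every weak classifier votes +1 with the same constant `alpha`.

```python
import math


def worst_case_score(rounds, alpha=1.0, beta=0.5, regularized=True):
    """Accumulate H_T(x) when every weak classifier votes +1.

    With regularized=True each vote is damped by exp(-|beta * H_{t-1}|),
    an ASSUMED form of the input-dependent regularizer; with
    regularized=False this is the plain linear combination of AdaBoost.
    """
    score = 0.0
    for _ in range(rounds):
        damping = math.exp(-abs(beta * score)) if regularized else 1.0
        score += alpha * damping
    return score


# Compare how the two combination forms grow with the number of rounds T.
for rounds in (10, 100, 1000):
    linear = worst_case_score(rounds, regularized=False)
    damped = worst_case_score(rounds, regularized=True)
    print(f"T={rounds:5d}  linear={linear:8.1f}  damped={damped:6.2f}")
```

Under these assumptions the undamped score grows in proportion to T, while the damped score grows roughly like a logarithm of T, which is the behaviour the theorem describes. The same factor also acts as the router: once $|H_{t-1}(x)|$ is large, later classifiers contribute almost nothing for that input.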
WeightBoost Algorithm (1) • Similar to AdaBoost: minimize the exponential cost function • Training setup • $h_i(x): x \to \{1, -1\}$; a basis (weak) classifier • $H_T(x)$: a linear combination of basis classifiers • Goal: minimize the training error
WeightBoost Algorithm (2) • Emphasize misclassified data patterns • Avoid overemphasis on noisy data patterns • As simple as AdaBoost! • Choice of $\alpha_t$: [equation not preserved in this transcript]
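One plausible reading of the WeightBoost slides is sketched below. The exact formulas are not preserved in this transcript, so every formula here is an assumption rather than the authors' definition: the combination $H_t(x) = H_{t-1}(x) + \alpha_t e^{-|\beta H_{t-1}(x)|} h_t(x)$, the example weights $\propto e^{-y H_{t-1}(x) - |\beta H_{t-1}(x)|}$ (the first term emphasizes misclassified patterns, the second limits emphasis on patterns that are already far from the boundary), and the AdaBoost-style $\alpha_t$. The 1-D stumps and all function names are hypothetical.

```python
import math


def stump(x, thr, pol):
    """Weak classifier: predicts `pol` when x >= thr, otherwise `-pol`."""
    return pol if x >= thr else -pol


def best_stump(xs, ys, weights):
    """Return (weighted_error, thr, pol) of the best 1-D threshold stump."""
    best = None
    for thr in sorted(set(xs)):
        for pol in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump(x, thr, pol) != y)
            if best is None or err < best[0]:
                best = (err, thr, pol)
    return best


def weightboost(xs, ys, rounds, beta=0.5):
    """Boosting with an ASSUMED input-dependent damping exp(-|beta * H|)."""
    scores = [0.0] * len(xs)  # H_{t-1}(x_i) for each training example
    ensemble = []
    for _ in range(rounds):
        # Assumed weights: emphasize mistakes, damp large-|H| examples.
        raw = [math.exp(-y * s - abs(beta * s)) for y, s in zip(ys, scores)]
        total = sum(raw)
        weights = [r / total for r in raw]
        err, thr, pol = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)  # assumed AdaBoost-style
        ensemble.append((alpha, thr, pol))
        scores = [s + alpha * math.exp(-abs(beta * s)) * stump(x, thr, pol)
                  for x, s in zip(xs, scores)]
    return ensemble


def score(ensemble, x, beta=0.5):
    """Replay the damped combination for a new input x."""
    s = 0.0
    for alpha, thr, pol in ensemble:
        s += alpha * math.exp(-abs(beta * s)) * stump(x, thr, pol)
    return s
```

The only changes relative to a plain AdaBoost loop are the extra `abs(beta * s)` term in the weights and the damping factor in the score update, which is in the spirit of the slide's "as simple as AdaBoost" remark. `beta` must be the same at training and prediction time.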
Empirical Studies • Datasets: eight UCI datasets with binary classes • Methods compared against • AdaBoost • WeightDecay Boost: close to L2 regularization • Epsilon Boosting: related to L1 regularization
Experiment 1: Effectiveness • Comparison with AdaBoost • WeightBoost performs better than AdaBoost • In many cases, WeightBoost performs substantially better than AdaBoost
Experiment 2: Beyond Regularization • Comparison with other regularized boosting algorithms • WeightDecay Boost and Epsilon Boost • Overall, WeightBoost performs slightly better than the other regularized boosting algorithms • In several cases, WeightBoost performs clearly better than both of them
Experiment 3: Resistance to Noise • Randomly select 10%, 20%, and 30% of the training data and set their labels to random values • WeightBoost is more resistant to training noise than AdaBoost • In several cases where AdaBoost overfits the training noise, WeightBoost still performs well • [Figure: results for 10% noise]
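The noise-injection procedure can be sketched as follows. This is an illustration, not the authors' script: the function name `add_label_noise`, the fixed seed, and the ±1 label encoding are assumptions.

```python
import random


def add_label_noise(labels, fraction, seed=0):
    """Return a copy of `labels` with a random `fraction` of entries
    replaced by a random label drawn from {+1, -1}.

    Because the replacement is random rather than a forced flip, about
    half of the selected labels end up unchanged.
    """
    rng = random.Random(seed)
    noisy = list(labels)
    count = int(round(fraction * len(labels)))
    for index in rng.sample(range(len(labels)), count):
        noisy[index] = rng.choice((1, -1))
    return noisy


# The three noise levels used in the experiment.
clean = [1, -1] * 50
for fraction in (0.1, 0.2, 0.3):
    noisy = add_label_noise(clean, fraction)
    changed = sum(a != b for a, b in zip(clean, noisy))
    print(f"{fraction:.0%} selected, {changed} labels actually changed")
```

Only the training labels are perturbed; test labels would stay clean so that the measured error reflects resistance to training noise.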
Experiments with Text Categorization • Reuters-21578 corpus with the 10 most popular categories: WeightBoost yields improvements on 7 of the 10 categories
Conclusion and Future Work • Introduce an input-dependent regularizer into the combination form • Prevents $|H(x)|$ from increasing too fast, which makes it more resistant to training noise • ‘Routes’ a test data pattern to its appropriate classifier, which improves classification accuracy beyond standard regularization • Future research issues • How to determine the constant? • Other input-dependent regularizers?