Boosting of classifiers Ata Kaban
Motivation & beginnings • Suppose we have a learning algorithm that is guaranteed with high probability to be slightly better than random guessing – we call this a weak learner • E.g. if an email contains the word “money” then classify it as spam, otherwise as non-spam • Is it possible to use this weak learning algorithm to create a strong classifier with error rate close to 0? • Ensemble learning – the wisdom of crowds • More heads are better than one
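As a concrete illustration of the “money” rule above, here is a minimal Python sketch of such a one-word weak classifier (the function name and the plain-string email representation are hypothetical, chosen only for this example):

```python
def weak_spam_classifier(email_text: str) -> int:
    """Toy weak learner: predict +1 (spam) if the word 'money' occurs, else -1 (non-spam).
    On realistic data this is only slightly better than random guessing."""
    return 1 if "money" in email_text.lower() else -1
```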
Motivation & beginnings • The answer is YES • Rob Schapire and Yoav Freund developed the Adaboost algorithm • Given: • Examples $(x_1,y_1),\dots,(x_N,y_N)$ where $y_i \in \{-1,+1\}$ • A weak learning algorithm A, that produces weak classifiers $h \in \mathcal{H}$ • Goal: Produce a new classifier $H$ with error close to 0. Note, $H$ is not required to be in $\mathcal{H}$
Idea • Use the weak learning algorithm to produce a collection of weak classifiers • Modify the input each time a new weak classifier is requested • Weight the training points differently • Find a good way to combine them
Main idea behind Adaboost • Iterative algorithm • Maintains a distribution of weights over the training examples • Initially weights are equal • At successive iterations the weight of misclassified examples is increased • This forces the algorithm to focus on the examples that have not been classified correctly in previous rounds • Take a linear combination of the predictions of the weak learners, with coefficients proportional to the performance of the weak learner.
Pseudo-code • For t=1,…,T • Construct a discrete probability distribution over indices of training points {1,2,…,N}, denote it as $D_t$ • Run algorithm A on $D_t$ to produce weak classifier $h_t$ • Calculate $\epsilon_t = \sum_{i:\, h_t(x_i)\neq y_i} D_t(i)$, where by the weak learning assumption this is slightly smaller than ½ (random guessing) • Output $H(x) = \operatorname{sign}\big(\sum_{t=1}^{T}\alpha_t h_t(x)\big)$, where the $\alpha_t$ are combination coefficients
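A compact Python sketch of this loop, assuming y is a NumPy array of ±1 labels and weak_learn(X, y, D) is a hypothetical routine that returns a callable weak classifier fitted to the D-weighted sample; the distribution update and the coefficients used here are the standard Adaboost choices spelled out on the next slides:

```python
import numpy as np

def adaboost(X, y, weak_learn, T):
    """Train T rounds of Adaboost on examples (X, y) with labels y in {-1, +1}."""
    N = len(y)
    D = np.full(N, 1.0 / N)                 # D_1: uniform distribution over the N points
    classifiers, alphas = [], []
    for t in range(T):
        h = weak_learn(X, y, D)              # weak classifier fitted to the D_t-weighted sample
        pred = h(X)                          # +1/-1 predictions on the training set
        eps = D[pred != y].sum()             # weighted error eps_t, assumed to lie in (0, 1/2)
        alpha = 0.5 * np.log((1 - eps) / eps)
        D = D * np.exp(-alpha * y * pred)    # up-weight misclassified, down-weight correct points
        D = D / D.sum()                      # renormalise so D_{t+1} is a distribution
        classifiers.append(h)
        alphas.append(alpha)

    def H(X_new):
        """Final classifier: sign of the alpha-weighted vote of the weak classifiers."""
        votes = sum(a * h(X_new) for a, h in zip(alphas, classifiers))
        return np.sign(votes)
    return H
```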
Details for pseudo-code • How to construct $D_t$ • How to determine the combination coefficients $\alpha_t$ Adaboost does these in the following way:
The weights of training points • Initially all weights are equal: $D_1(i) = 1/N$. • Weights of examples go up or down depending on how easy the example was to classify: $D_{t+1}(i) = \frac{D_t(i)\,\exp(-\alpha_t y_i h_t(x_i))}{Z_t}$, where $Z_t$ is a normalising constant. If an example is easy it will get a small weight, hard ones get large weights
The combination coefficients • Weighted vote, where the coefficient for weak-learner $h_t$ is related to how well the weak classifier performed on the weighted training set: $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$
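A quick worked example of these two formulas (numbers chosen only for illustration): if a weak classifier has weighted error $\epsilon_t = 0.1$, then $\alpha_t = \frac{1}{2}\ln\frac{1-0.1}{0.1} = \frac{1}{2}\ln 9 \approx 1.10$, so before renormalisation each misclassified point has its weight multiplied by $e^{\alpha_t} = 3$ and each correctly classified point by $e^{-\alpha_t} \approx 0.33$. A weak classifier with $\epsilon_t$ close to ½ would instead receive $\alpha_t$ close to 0, so its vote counts little and it barely changes the weights.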
Comments • One can show that the training error of Adaboost drops exponentially fast as the rounds progress • The more rounds the more complex the final classifier is, so overfitting can happen • In practice overfitting is rarely observed and Adaboost tends to have excellent generalisation performance
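The precise statement behind “drops exponentially fast” is the standard Adaboost training-error bound: writing $\gamma_t = \frac{1}{2} - \epsilon_t$ for the edge of the $t$-th weak classifier over random guessing,
$$\frac{1}{N}\sum_{i=1}^{N}\mathbf{1}[H(x_i)\neq y_i] \;\le\; \prod_{t=1}^{T} 2\sqrt{\epsilon_t(1-\epsilon_t)} \;=\; \prod_{t=1}^{T}\sqrt{1-4\gamma_t^2} \;\le\; \exp\Big(-2\sum_{t=1}^{T}\gamma_t^2\Big),$$
so if every round has edge at least $\gamma$, the training error falls below $e^{-2\gamma^2 T}$.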
Practical advantages of Adaboost • Can construct arbitrarily complex decision regions • Generic: Can use any classifier as weak learner, we only need it to be slightly better than random guessing • Simple to implement • Fast to run • Adaboost is one of the ‘top 10’ algorithms in data mining
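For example, a ready-made implementation is available in scikit-learn; a minimal usage sketch (assuming a recent scikit-learn version, whose default weak learner is a depth-1 decision tree):

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy binary classification data, just to show the API
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = AdaBoostClassifier(n_estimators=100, random_state=0)  # 100 boosting rounds
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```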
Caveats • Adaboost can fail if there is noise in the class labels (wrong labels) • It can fail if the weak-learners are too complex • It can fail if the weak-learners are no better than random guessing
Topics not covered • Other combination schemes for classifiers • E.g. Bagging • Combinations for unsupervised learning
Further readings • Robert E. Schapire. The boosting approach to machine learning. In D. D. Denison, M. H. Hansen, C. Holmes, B. Mallick, B. Yu, editors, Nonlinear Estimation and Classification. Springer, 2003. • Robi Polikar. Ensemble Based Systems in Decision Making, IEEE Circuits and Systems Magazine, 6(3), pp. 21-45, 2006. http://users.rowan.edu/~polikar/RESEARCH/PUBLICATIONS/csm06.pdf • Thomas G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Machine Learning, 40(2): 139-158, 2000. • Collection of papers: http://www.cs.gmu.edu/~carlotta/teaching/CS-795-s09/info.html