Learn about ensemble methods, bias-variance decomposition, bagging, boosting, and algorithms like AdaBoost. Understand how combining classifiers can improve prediction accuracy.
Ensemble Methods • An ensemble method constructs a set of base classifiers from the training data • Also known as ensemble or classifier combination • Predicts the class label of previously unseen records by aggregating the predictions made by multiple classifiers
Why does it work? • Suppose there are 25 base classifiers • Each classifier has an error rate ε = 0.35 • Assume the classifiers are independent and their predictions are combined by majority vote • Probability that the ensemble classifier makes a wrong prediction (i.e., 13 or more of the 25 classifiers are wrong): Σ_{i=13..25} C(25, i) ε^i (1 − ε)^(25−i) ≈ 0.06
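To make the arithmetic concrete, here is a minimal sketch (assuming majority voting over the 25 independent base classifiers) that computes this probability:

```python
# Ensemble error under majority voting: the ensemble is wrong only when
# 13 or more of the 25 independent base classifiers are wrong.
from math import comb

eps, n = 0.35, 25
ensemble_error = sum(comb(n, i) * eps**i * (1 - eps)**(n - i)
                     for i in range(n // 2 + 1, n + 1))
print(f"ensemble error rate: {ensemble_error:.3f}")  # about 0.06, vs. 0.35 per classifier
```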
Methods • By manipulating the training set: a classifier is built on a sampled subset of the training data. • Two such ensemble methods: bagging (bootstrap aggregating) and boosting.
Characteristics • Ensemble methods work better with unstable classifiers. • Base classifiers that are sensitive to minor perturbations in the training set. • For example, decision trees and ANNs. • The variability among training examples is one of the primary sources of errors in a classifier.
Bias-Variance Decomposition • Consider the trajectory of a projectile launched with force f at angle θ. • Suppose the target is t, but the projectile hits x at a distance d away from t. • The observed distance d can be decomposed into three components: bias, variance, and intrinsic noise.
Two Decision Trees (2) • Bias: the stronger the assumptions a classifier makes about the nature of its decision boundary, the larger its bias will be. • A smaller tree makes stronger assumptions; with too large a bias, the algorithm cannot learn the target concept. • Variance: variability in the training data affects the expected error, because different compositions of the training set may lead to different decision boundaries. • Noise: intrinsic noise in the target class; for some domains the target class is non-deterministic, so identical attribute values may appear with different class labels.
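A small simulation can make the variance component tangible. The sketch below is an illustration only (the synthetic dataset, tree depths, and number of resamples are arbitrary choices): it trains a shallow tree and a fully grown tree on many bootstrap resamples and measures how often their predictions on a fixed set of points change.

```python
# Shallow trees make stronger assumptions (higher bias) but are more stable;
# fully grown trees are flexible but their predictions fluctuate more with
# the composition of the training set (higher variance).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_test = X[:100]                       # fixed evaluation points
rng = np.random.default_rng(0)

for depth in (1, None):                # 1 = decision stump, None = fully grown
    preds = []
    for _ in range(50):
        idx = rng.integers(0, len(X), len(X))          # bootstrap resample
        tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
        tree.fit(X[idx], y[idx])
        preds.append(tree.predict(X_test))
    preds = np.array(preds)
    # fraction of evaluation points on which the resampled trees disagree
    instability = np.mean(preds.std(axis=0) > 0)
    print(f"max_depth={depth}: fraction of unstable predictions = {instability:.2f}")
```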
Bagging • Sampling with replacement • Build a classifier on each bootstrap sample • Each training record has probability 1 − (1 − 1/n)^n of being selected in a given bootstrap sample; when n is large, a bootstrap sample Di therefore contains about 63.2% of the original training data.
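A minimal bagging sketch along these lines, assuming scikit-learn decision trees as the unstable base classifier (the dataset and the 25 rounds are illustrative choices):

```python
# Bagging: draw bootstrap samples, train one base classifier per sample,
# and aggregate their predictions by majority vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=1)
rng = np.random.default_rng(1)
n, rounds = len(X), 25
classifiers = []

for _ in range(rounds):
    idx = rng.integers(0, n, n)                      # sampling with replacement
    classifiers.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

print("distinct records in last bootstrap sample:",
      len(np.unique(idx)) / n)                       # roughly 0.632

# Majority vote: labels are 0/1, so average the votes and threshold at 0.5
votes = np.array([clf.predict(X) for clf in classifiers])
y_pred = (votes.mean(axis=0) >= 0.5).astype(int)
print("training accuracy of the bagged ensemble:", np.mean(y_pred == y))
```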
A Bagging Example (1) • Consider a one-level binary decision tree (decision stump) with test condition x <= k, where the split point k is chosen to minimize entropy. • Without bagging, the best decision stump splits at • x <= 0.35 or x >= 0.75
Summary on Bagging • Bagging improves generalization error by reducing the variance of the base classifiers. • Bagging depends on the stability of the base classifier. • If a base classifier is unstable, bagging helps to reduce the errors associated with random fluctuations in the training data. • If a base classifier is stable, the error of the ensemble is primarily caused by bias in the base classifier; bagging may even increase the error, because each bootstrap sample contains only about 63% of the distinct training records (an effective sample size roughly 37% smaller).
Boosting • An iterative procedure that adaptively changes the distribution of the training data by focusing more on previously misclassified records • Initially, all N records are assigned equal weights, 1/N • Unlike bagging, the weights may change at the end of each boosting round • The weights can be used by the base classifier to learn a model that is biased toward higher-weight examples.
Boosting • Records that are wrongly classified will have their weights increased • Records that are classified correctly will have their weights decreased • Example 4 is hard to classify • Its weight is increased, therefore it is more likely to be chosen again in subsequent rounds
Example: AdaBoost • Base classifiers: C1, C2, …, CT • Error rate of classifier C_i (with record weights w_j summing to 1): ε_i = Σ_j w_j δ(C_i(x_j) ≠ y_j), where δ(·) = 1 if its argument is true and 0 otherwise • Importance of a classifier: α_i = ½ ln((1 − ε_i) / ε_i)
Example: AdaBoost • Weight update: w_j^(i+1) = (w_j^(i) / Z_i) · exp(−α_i) if C_i(x_j) = y_j, and w_j^(i+1) = (w_j^(i) / Z_i) · exp(α_i) if C_i(x_j) ≠ y_j, where Z_i is a normalization factor ensuring the updated weights sum to 1 • If any intermediate round produces an error rate higher than 50%, the weights are reverted to 1/N and the resampling procedure is repeated • Classification: C*(x) = argmax_y Σ_{i=1..T} α_i δ(C_i(x) = y)
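Putting the pieces together, here is a compact AdaBoost sketch following the update rules above. It assumes decision stumps trained with instance weights rather than explicit resampling; the dataset and the 10 rounds are illustrative choices.

```python
# AdaBoost: reweight records after every round, give each classifier an
# importance alpha_i, and classify by the importance-weighted vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=300, random_state=2)
y = 2 * y01 - 1                                  # map labels to {-1, +1}
N, T = len(X), 10
w = np.full(N, 1.0 / N)                          # equal initial weights
stumps, alphas = [], []

for _ in range(T):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    eps = np.sum(w * (pred != y))                # weighted error rate
    if eps >= 0.5:                               # too weak: reset the weights
        w = np.full(N, 1.0 / N)
        continue
    alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))   # classifier importance
    w *= np.exp(-alpha * y * pred)               # raise weights of mistakes only
    w /= w.sum()                                 # normalization factor Z_i
    stumps.append(stump)
    alphas.append(alpha)

# Final classification: sign of the importance-weighted vote
scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
print("training accuracy:", np.mean(np.sign(scores) == y))
```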
A Boosting Example (1) • Consider a one-level binary decision tree x <= k, where the split point k is chosen to minimize entropy. • Without boosting, the best decision stump splits at • x <= 0.35 or x >= 0.75