
Boosting and other Expert Fusion Strategies

This resource explores boosting and other expert fusion strategies for improving predictions by combining the outputs of multiple experts. It covers types of multiple experts, online expert selection, the Hedge algorithm, Occam's razor, weak and strong learning, the majority algorithm, bagging, and boosting.


Presentation Transcript


  1. Boosting and other Expert Fusion Strategies

  2. References • http://www.boosting.org • Chapter 9.5, Duda, Hart & Stork • Leo Breiman: Bagging, Boosting, and Arcing • Presentation adapted from: Rishi Sinha, Robin Dhamankar

  3. Types of Multiple Experts • Single expert on full observation space • Single expert for sub regions of observation space (Trees) • Multiple experts on full observation space • Multiple experts on sub regions of observation space

  4. Types of Multiple Experts Training • Use full observation space for each expert • Use different observation features for each expert • Use different observations for each expert • Combine the above

  5. Online Experts Selection • N strategies (experts) • At time t, learner A chooses a distribution over the N experts • Let pt(i) be the probability of the i-th expert, with ∑i pt(i) = 1 • For a loss vector lt, the loss at time t is ∑i pt(i) lt(i) • Assume bounded losses: lt(i) in [0,1]

  6. Experts Algorithm: Greedy • For each expert i define its cumulative loss up to time t: Lit = ∑s≤t ls(i) • Greedy: at time t choose the expert with minimum cumulative loss, namely arg mini Lit

  7. Greedy Analysis • Theorem: Let LGT be the loss of Greedy at time T; then LGT is bounded in terms of the loss of the best single expert (proof in the notes). • Weakness: Greedy relies on a single expert for every observation

  8. Better Multiple Experts Algorithms • Would like to bound the learner's loss relative to the loss of the best single expert • Better bound: the Hedge algorithm, which utilizes all experts for each observation

  9. Multiple Experts Algorithm: Hedge • Maintain a weight vector at time t: wt • Probabilities: pt(k) = wt(k) / ∑j wt(j) • Initialization: w1(i) = 1/N • Update: wt+1(k) = wt(k) · Ub(lt(k)), where b is in [0,1] and b^r ≤ Ub(r) ≤ 1 − (1 − b)r (e.g., Ub(r) = b^r)
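
A minimal sketch of the Hedge loop with the common choice Ub(r) = b^r; the function name `hedge` and the toy loss sequence are illustrative, not from the slides.

```python
import numpy as np

def hedge(loss_rows, beta=0.5):
    """Hedge sketch: maintain weights over N experts, update with U_b(r) = b**r."""
    loss_rows = [np.asarray(l, dtype=float) for l in loss_rows]
    n = len(loss_rows[0])
    w = np.ones(n) / n                   # w_1(i) = 1/N
    total_loss = 0.0
    for l_t in loss_rows:
        p = w / w.sum()                  # p_t(k) = w_t(k) / sum_j w_t(j)
        total_loss += float(p @ l_t)     # loss at time t: sum_i p_t(i) l_t(i)
        w = w * beta ** l_t              # w_{t+1}(k) = w_t(k) * b**l_t(k)
    return total_loss, w

# Toy usage: two experts, the first consistently better.
losses = [[0.0, 1.0], [0.2, 0.9], [0.1, 1.0]]
print(hedge(losses, beta=0.5))
```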

  10. Hedge Analysis • Lemma: For any sequence of losses • Proof: see Mansour's scribe notes • Corollary:

  11. Hedge: Properties • Bounding the weights • Similarly for a subset of experts.

  12. Hedge: Performance • Let k be the expert with minimal loss • Therefore

  13. Hedge: Optimizing b • For b=1/2 we have • Better selection of b:

  14. Occam's Razor • Finding the shortest consistent hypothesis. • Definition: (a,b)-Occam algorithm, with a > 0 and b < 1 • Input: a sample S of size m • Output: a hypothesis h such that • for every (x,b) in S: h(x) = b (consistency) • size(h) < size(ct)^a · m^b • and the algorithm runs efficiently.

  15. Occam's Razor Theorem • A: an (a,b)-Occam algorithm for C using H • D: a distribution over the inputs X • ct in C: the target function • Sample size: • With probability 1 − δ, A(S) = h has error(h) < ε

  16. Occam's Razor Theorem • Use the bound for a finite hypothesis class • Effective hypothesis class size: 2^size(h) • size(h) < n^a · m^b • Sample size:
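
As a sketch of the sample-size argument the slide alludes to (standard Occam's Razor reasoning; the exact constants here are an assumption, not taken from the slide):

```latex
% With probability 1 - \delta, a hypothesis consistent with the sample has error < \epsilon if
\[
  m \;\ge\; \frac{1}{\epsilon}\Bigl(\ln|H| + \ln\tfrac{1}{\delta}\Bigr),
  \qquad\text{and here}\qquad
  |H| \;\le\; 2^{\mathrm{size}(h)} \;\le\; 2^{\,n^a m^b}.
\]
% Solving m >= (1/\epsilon)(n^a m^b \ln 2 + \ln(1/\delta)) for m gives a sample size of
\[
  m \;=\; O\!\Bigl(\tfrac{1}{\epsilon}\ln\tfrac{1}{\delta}
        \;+\; \bigl(\tfrac{n^a \ln 2}{\epsilon}\bigr)^{\frac{1}{1-b}}\Bigr).
\]
```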

  17. Weak and Strong Learning

  18. PAC Learning Model (Strong Learning) • There exists a distribution D over the domain X • Examples: <x, c(x)> • use c for the target function (rather than ct) • Goal: with high probability (1 − δ) find h in H such that error(h, c) < ε • ε can be made arbitrarily small, thus STRONG learning

  19. Weak Learning Model • Goal: error(h, c) < 1/2 − γ (accuracy only slightly better than chance) • The parameter γ is a small constant • Intuitively: a much easier task • Question: if C is weakly learnable, is C also PAC (strong) learnable?

  20. Majority Algorithm • Hypothesis: hM(x) = MAJ[ h1(x), ..., hT(x) ] • size(hM) < T · size(ht) • Using Occam's Razor

  21. Majority: Outline • Sample m examples • Start with a distribution of 1/m per example • Modify the distribution and get ht • The hypothesis is the majority vote • Terminate when the sample is classified perfectly

  22. Majority: Algorithm • Use the Hedge algorithm • The "experts" are associated with the sample points • The loss is 1 for a correct classification, so correctly classified points lose weight: lt(i) = 1 − | ht(xi) − c(xi) | • Setting b = 1 − γ • hM(x) = MAJORITY( h1(x), ..., hT(x) ) • Q: How do we set T? (A code sketch of this booster follows below.)
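
A sketch of this Hedge-over-examples booster under illustrative assumptions (binary labels in {0, 1}; weak_learner(X, y, dist) is assumed to return a hypothesis h, callable on X, with weighted error at most 1/2 − γ under dist; none of these names come from the slides):

```python
import numpy as np

def majority_boost(X, y, weak_learner, T, gamma=0.1):
    """Boost-by-majority via Hedge over the training examples (illustrative interface)."""
    m = len(y)
    beta = 1.0 - gamma                        # b = 1 - gamma
    w = np.ones(m) / m                        # one Hedge "expert" (weight) per example
    hyps = []
    for _ in range(T):
        dist = w / w.sum()                    # p_t(i) over the sample points
        h = weak_learner(X, y, dist)          # weak hypothesis for this round
        hyps.append(h)
        correct = (h(X) == y).astype(float)   # l_t(i) = 1 - |h_t(x_i) - c(x_i)|
        w = w * beta ** correct               # Hedge update: correct points lose weight
    def h_majority(Xq):
        votes = np.mean([h(Xq) for h in hyps], axis=0)
        return (votes >= 0.5).astype(int)     # unweighted majority vote
    return h_majority
```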

  23. Majority: Analysis • Consider the set of errors: S = { i | hM(xi) ≠ c(xi) } • For every i in S: Li / T ≤ 1/2, since at most half of the ht classify xi correctly when the majority vote is wrong • From the Hedge properties:

  24. MAJORITY: Correctness • Error Probability: • Number of Rounds: • Terminate when error less than 1/m

  25. Bagging • Generate a random sample from training set by selecting elements with replacement. • Repeat this sampling procedure, getting a sequence of k “independent” training sets • A corresponding sequence of classifiers C1,C2,…,Ck is constructed for each of these training sets, by using the same classification algorithm • To classify an unknown sample X, let each classifier predict. • The Bagged Classifier C* then combines the predictions of the individual classifiers to generate the final outcome. (sometimes combination is simple voting) Taken from Lecture slides for Data Mining Concepts and Techniques by Jiawei Han and M Kamber
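
A minimal bagging sketch under illustrative assumptions (X and y are numpy arrays; base_learner(X, y) returns a fitted model with a .predict() method; labels are non-negative integers). The names are placeholders, not from the slide:

```python
import numpy as np

def bagged_classifier(X, y, base_learner, k=25, rng=None):
    """Bagging: k bootstrap samples, one classifier each, combined by majority vote."""
    rng = np.random.default_rng(rng)
    n = len(y)
    models = []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)      # sample n elements with replacement
        models.append(base_learner(X[idx], y[idx]))
    def predict(Xq):
        preds = np.stack([m.predict(Xq) for m in models]).astype(int)  # k x len(Xq) votes
        # majority vote over the k classifiers, column by column
        return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, preds)
    return predict
```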

  26. Boosting • Also an ensemble method => the final prediction is a combination of the predictions of several predictors. • What is different? • It is iterative. • Boosting: successive classifiers depend on their predecessors; in the previous methods the individual classifiers were "independent". • Training examples may have unequal weights. • Look at the errors from the previous classifier to decide how to focus the next iteration over the data. • Set weights to focus more on 'hard' examples (the ones on which we made mistakes in the previous iterations).

  27. Boosting • W(x) is the distribution of weights over the N training observations, ∑ W(xi) = 1 • Initially assign uniform weights W0(x) = 1/N for all x; set step k = 0 • At each iteration k: • find the best weak classifier Ck(x) using weights Wk(x) • from its error rate εk and the chosen loss function, compute αk, the classifier Ck's weight in the final hypothesis • for each xi, update the weights based on εk to get Wk+1(xi) • CFINAL(x) = sign[ ∑ αi Ci(x) ]

  28. Boosting (Algorithm)

  29. Boosting As Additive Model • The final prediction in boosting, f(x), can be expressed as an additive expansion of individual classifiers. • The process is iterative and can be expressed as a forward stage-wise fit (sketched below). • Typically we try to minimize a loss function on the training examples.
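
A sketch of the formulas the slide refers to, in standard forward stage-wise notation (the symbols b(x; γ) for the base classifier and βm for its coefficient are the usual convention, assumed here since the slide's equations are images):

```latex
\[
  f(x) \;=\; \sum_{m=1}^{M} \beta_m\, b(x;\gamma_m),
  \qquad
  \min_{\{\beta_m,\gamma_m\}} \;\sum_{i=1}^{N} L\!\Bigl(y_i,\; \sum_{m=1}^{M} \beta_m\, b(x_i;\gamma_m)\Bigr).
\]
\[
  (\beta_m,\gamma_m) \;=\; \arg\min_{\beta,\gamma} \sum_{i=1}^{N}
      L\bigl(y_i,\; f_{m-1}(x_i) + \beta\, b(x_i;\gamma)\bigr),
  \qquad
  f_m(x) \;=\; f_{m-1}(x) + \beta_m\, b(x;\gamma_m).
\]
```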

  30. Boosting As Additive Model • Simple case: squared-error loss • Forward stage-wise modeling then amounts to just fitting the residuals from the previous iteration (see below). • Squared-error loss is not robust for classification.
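
With squared-error loss the stage-wise objective reduces to a residual fit, a standard identity sketched here:

```latex
\[
  L\bigl(y_i,\; f_{m-1}(x_i) + \beta\, b(x_i;\gamma)\bigr)
  \;=\; \bigl(y_i - f_{m-1}(x_i) - \beta\, b(x_i;\gamma)\bigr)^2
  \;=\; \bigl(r_{im} - \beta\, b(x_i;\gamma)\bigr)^2,
\]
where \( r_{im} = y_i - f_{m-1}(x_i) \) is the residual of the current model on the \(i\)-th training example.
```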

  31. Boosting As Additive Model • AdaBoost for classification uses the exponential loss function: L(y, f(x)) = exp(−y · f(x))

  32. Boosting As Additive Model First assume that β is constant, and minimize w.r.t. G:

  33. Boosting As Additive Model errm is the training error on the weighted samples. The last equation tells us that in each iteration we must find a classifier that minimizes the training error on the weighted samples.

  34. Boosting As Additive Model Now that we have found G, we minimize w.r.t. β:
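
The resulting closed-form solutions, reproduced here as a sketch of the standard AdaBoost derivation since the slide's equations are images:

```latex
\[
  G_m \;=\; \arg\min_{G} \sum_{i=1}^{N} w_i^{(m)}\, I\bigl(y_i \neq G(x_i)\bigr),
  \qquad
  \mathrm{err}_m \;=\; \frac{\sum_i w_i^{(m)}\, I\bigl(y_i \neq G_m(x_i)\bigr)}{\sum_i w_i^{(m)}},
\]
\[
  \beta_m \;=\; \tfrac{1}{2}\,\ln\!\frac{1-\mathrm{err}_m}{\mathrm{err}_m},
  \qquad
  w_i^{(m+1)} \;=\; w_i^{(m)}\, e^{-\beta_m\, y_i\, G_m(x_i)}.
\]
```

Note that the αk = log((1 − εk)/εk) used on slide 36 equals 2βm; the constant factor does not change the sign of the final weighted vote.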

  35. Boosting (Recall) • W(x) is the distribution of weights over the N training observations, ∑ W(xi) = 1 • Initially assign uniform weights W0(x) = 1/N for all x; set step k = 0 • At each iteration k: • find the best weak classifier Ck(x) using weights Wk(x) • from its error rate εk and the chosen loss function, compute αk, the classifier Ck's weight in the final hypothesis • for each xi, update the weights based on εk to get Wk+1(xi) • CFINAL(x) = sign[ ∑ αi Ci(x) ]

  36. AdaBoost • W(x) is the distribution of weights over the N training points, ∑ W(xi) = 1 • Initially assign uniform weights W0(x) = 1/N for all x. • At each iteration k: • find the best weak classifier Ck(x) using weights Wk(x) • compute the error rate as εk = [ ∑ W(xi) · I(yi ≠ Ck(xi)) ] / [ ∑ W(xi) ] • set the classifier Ck's weight in the final hypothesis: αk = log( (1 − εk) / εk ) • for each xi, Wk+1(xi) = Wk(xi) · exp[ αk · I(yi ≠ Ck(xi)) ] • CFINAL(x) = sign[ ∑ αi Ci(x) ] • A code sketch of this loop follows below.
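
A runnable sketch of exactly this loop, under illustrative assumptions (labels y in {−1, +1}; weak_learner(X, y, w) is assumed to return a fitted classifier with a .predict() method; the names are not from the slides):

```python
import numpy as np

def adaboost(X, y, weak_learner, K):
    """AdaBoost sketch following slide 36 (labels y in {-1, +1})."""
    n = len(y)
    w = np.ones(n) / n                           # W_0(x_i) = 1/N
    classifiers, alphas = [], []
    for _ in range(K):
        clf = weak_learner(X, y, w)              # best weak classifier under weights w
        pred = clf.predict(X)
        miss = (pred != y).astype(float)         # I(y_i != C_k(x_i))
        eps = np.sum(w * miss) / np.sum(w)       # weighted error rate eps_k
        eps = np.clip(eps, 1e-12, 1 - 1e-12)     # guard against log(0) in the sketch
        alpha = np.log((1.0 - eps) / eps)        # classifier weight alpha_k
        w = w * np.exp(alpha * miss)             # up-weight the misclassified points
        classifiers.append(clf)
        alphas.append(alpha)
    def predict(Xq):                             # C_FINAL(x) = sign(sum_k alpha_k C_k(x))
        scores = sum(a * c.predict(Xq) for a, c in zip(alphas, classifiers))
        return np.sign(scores)
    return predict
```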

  37. AdaBoost(Example) Original Training set : Equal Weights to all training samples Taken from “A Tutorial on Boosting” by Yoav Freund and Rob Schapire

  38. AdaBoost (Example) ROUND 1

  39. AdaBoost (Example) ROUND 2

  40. AdaBoost (Example) ROUND 3

  41. AdaBoost (Example)

  42. AdaBoost (Characteristics) • Why the exponential loss function? • Computational • simple, modular re-weighting • the derivative is simple, so determining the optimal parameters is relatively easy • Statistical • in the two-label case the minimizer is one half the log-odds of P(Y=1|x) => we can use the sign as the classification rule • Accuracy depends upon the number of iterations (how sensitive it is, we will see soon).

  43. Boosting performance Decision stumps are very simple rules of thumb that test a condition on a single attribute. Decision stumps formed the individual classifiers whose predictions were combined to generate the final prediction. The misclassification rate of the boosting algorithm was plotted against the number of iterations performed.

  44. Boosting performance Steep decrease in error

  45. Boosting performance • Pondering how many iterations would be sufficient… • Observations • The first few (about 50) iterations increase the accuracy substantially, as seen by the steep decrease in the misclassification rate. • As iterations increase, does the training error keep decreasing? Does the generalization error keep decreasing?

  46. Can Boosting do well if? • Limited training data? • Probably not. • Many missing values? • Noise in the data? • Individual classifiers not very accurate? • It could, if the individual classifiers have considerable mutual disagreement.

  47. Adaboost • “Probably one of the three most influential ideas in machine learning in the last decade, along with Kernel methods and Variational approximations.” • Original idea came from Valiant • Motivation: We want to improve the performance of a weak learning algorithm

  48. Adaboost • Algorithm:

  49. Boosting Trees Outline • Basics of boosting trees. • A numerical optimization problem • Control the model complexity, generalization • Size of trees • Number of Iterations • Regularization • Interpret the final model • Single variable • Correlation of variables

  50. Boosting Trees: Basics • Formally, a tree is defined by regions Rj and constants γj (see the sketch below). • The parameters are found by minimizing the empirical risk. • Finding: • γj given Rj: typically the mean of the yi in Rj • Rj: is tough, but approximate solutions exist.
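
A sketch of the formal tree model and the empirical-risk objective the slide points to (standard notation, assumed here since the slide's formula is an image):

```latex
\[
  T(x;\Theta) \;=\; \sum_{j=1}^{J} \gamma_j\, I(x \in R_j),
  \qquad \Theta = \{R_j, \gamma_j\}_{j=1}^{J},
\]
\[
  \hat{\Theta} \;=\; \arg\min_{\Theta}\; \sum_{j=1}^{J} \sum_{x_i \in R_j} L(y_i, \gamma_j).
\]
```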
