
An Introduction to Ensemble Methods Boosting, Bagging, Random Forests and More


Presentation Transcript


  1. An Introduction to Ensemble Methods: Boosting, Bagging, Random Forests and More (Yisong Yue)

  2. Supervised Learning
     • Goal: learn predictor $h(x)$
     • High accuracy (low error)
     • Using training data $\{(x_1,y_1),\ldots,(x_n,y_n)\}$

  3. $h(x) = \mathrm{sign}(w^T x - b)$

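  4. (Figure: decision tree. The root node "Male?" has Yes/No branches leading to the nodes "Age>9?" and "Age>10?"; each of those has Yes/No branches ending in leaves labeled 1 and 0.)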

  5. Decision trees
     • Splitting criterion: entropy reduction (info gain)
     • Complexity: number of leaves, size of leaves
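The entropy-reduction criterion on this slide can be illustrated with a small sketch. This is only one way to compute information gain for a binary split; the function names and the toy `ages`/`labels` data are hypothetical and not taken from the slides.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, mask):
    """Entropy reduction obtained by splitting `labels` with a boolean `mask`."""
    left = [y for y, m in zip(labels, mask) if m]
    right = [y for y, m in zip(labels, mask) if not m]
    if not left or not right:
        return 0.0  # a split that sends everything one way gains nothing
    n = len(labels)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - children

# Hypothetical toy data: the label is 1 exactly when age > 9.
ages = [8, 12, 9, 11]
labels = [0, 1, 0, 1]
gain_age = information_gain(labels, [a > 9 for a in ages])
```

A tree learner would evaluate this gain for every candidate split at a node and keep the split with the largest value.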

  6. Outline
     • Bias/Variance Tradeoff
     • Ensemble methods that minimize variance
       • Bagging
       • Random Forests
     • Ensemble methods that minimize bias
       • Functional Gradient Descent
       • Boosting
       • Ensemble Selection

  7. Generalization Error
     • “True” distribution: $P(x,y)$
     • Unknown to us
     • Train: $h(x) = y$
     • Using training data $S = \{(x_1,y_1),\ldots,(x_n,y_n)\}$
     • Sampled from $P(x,y)$
     • Generalization Error: $L(h) = E_{(x,y)\sim P(x,y)}[\, f(h(x),y) \,]$
     • E.g., $f(a,b) = (a-b)^2$

  8. Generalization Error: $L(h) = E_{(x,y)\sim P(x,y)}[\, f(h(x),y) \,]$ (figure with columns $y$ and $h(x)$)

  9. Bias/Variance Tradeoff
     • Treat $h(x|S)$ as a random function
     • Depends on training data $S$
     • $L = E_S[\, E_{(x,y)\sim P(x,y)}[\, f(h(x|S),y) \,] \,]$
     • Expected generalization error
     • Over the randomness of $S$

  10. Bias/Variance Tradeoff
      • Squared loss: $f(a,b) = (a-b)^2$
      • Consider one data point $(x,y)$
      • Notation:
        • $Z = h(x|S) - y$
        • $\bar{Z} = E_S[Z]$
        • $Z - \bar{Z} = h(x|S) - E_S[h(x|S)]$
      • $E_S[(Z-\bar{Z})^2] = E_S[Z^2 - 2Z\bar{Z} + \bar{Z}^2] = E_S[Z^2] - 2E_S[Z]\bar{Z} + \bar{Z}^2 = E_S[Z^2] - \bar{Z}^2$
      • Expected Error: $E_S[f(h(x|S),y)] = E_S[Z^2] = E_S[(Z-\bar{Z})^2] + \bar{Z}^2$, where the first term is labeled Variance and the second Bias
      • Bias/Variance for all $(x,y)$ is expectation over $P(x,y)$.
      • Can also incorporate measurement noise. (Similar flavor of analysis for other loss functions.)
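The decomposition on this slide can be checked numerically. The sketch below is an illustration under assumptions that are not in the slides: a single fixed point with target `y = 1.0`, and a made-up learner `h_given_S` whose prediction is a shrunken mean of five noisy observations (so it has both bias and variance).

```python
import random

random.seed(0)
y = 1.0  # target value at one fixed data point (hypothetical)

def h_given_S():
    """Stand-in for training on a random S: shrunken mean of 5 noisy samples."""
    S = [y + random.gauss(0, 1) for _ in range(5)]
    return 0.8 * sum(S) / len(S)

# Z = h(x|S) - y, sampled over many independent training sets S.
Z = [h_given_S() - y for _ in range(20000)]
Z_bar = sum(Z) / len(Z)

expected_error = sum(z * z for z in Z) / len(Z)        # E_S[Z^2]
variance = sum((z - Z_bar) ** 2 for z in Z) / len(Z)   # E_S[(Z - Z_bar)^2]
bias_term = Z_bar ** 2                                 # Z_bar^2

print(expected_error, variance + bias_term)
```

The two printed numbers agree up to floating-point rounding, because the identity holds for empirical averages as well as for exact expectations.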

  11. Example (figure: plot with axes $x$ and $y$)

  12–14. $h(x|S)$ (figures: plots of the learned function)

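  15. (Figure: three panels, each annotated with Bias and Variance)
      • Expected Error: $E_S[(h(x|S) - y)^2] = E_S[(Z-\bar{Z})^2] + \bar{Z}^2$, where the first term is labeled Variance and the second Bias
      • $Z = h(x|S) - y$
      • $\bar{Z} = E_S[Z]$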

  17. Bagging
      • Goal: reduce variance
      • Ideal setting: many training sets S’, sampled independently from $P(x,y)$
      • Train model using each S’
      • Average predictions
      • Variance reduces linearly
      • Bias unchanged
      • Expected Error: $E_S[(h(x|S) - y)^2] = E_S[(Z-\bar{Z})^2] + \bar{Z}^2$ (Variance + Bias), with $Z = h(x|S) - y$ and $\bar{Z} = E_S[Z]$
      “Bagging Predictors” [Leo Breiman, 1994] http://statistics.berkeley.edu/sites/default/files/tech-reports/421.pdf

  18. Bagging
      • Goal: reduce variance
      • In practice: resample S’ with replacement from S
      • Train model using each S’
      • Average predictions
      • Variance reduces sub-linearly (because the S’ are correlated)
      • Bias often increases slightly
      • Expected Error: $E_S[(h(x|S) - y)^2] = E_S[(Z-\bar{Z})^2] + \bar{Z}^2$ (Variance + Bias), with $Z = h(x|S) - y$ and $\bar{Z} = E_S[Z]$
      • Bagging = Bootstrap Aggregation
      “Bagging Predictors” [Leo Breiman, 1994] http://statistics.berkeley.edu/sites/default/files/tech-reports/421.pdf
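The practical procedure on this slide (resample with replacement, train on each resample, average) might look like the following sketch. The base learner here is a 1-nearest-neighbour regressor, chosen only because it is short and high-variance; the slides use decision trees. All names and the toy data are hypothetical.

```python
import random

def bootstrap(S, rng):
    """Resample S with replacement to get an S' of the same size."""
    return [rng.choice(S) for _ in S]

def train_1nn(S):
    """Hypothetical high-variance base learner: 1-nearest-neighbour regression."""
    def h(x):
        return min(S, key=lambda point: abs(point[0] - x))[1]
    return h

def bagged_predictor(S, n_models=25, seed=0):
    """Train one model per bootstrap sample and average their predictions."""
    rng = random.Random(seed)
    models = [train_1nn(bootstrap(S, rng)) for _ in range(n_models)]
    def h(x):
        return sum(m(x) for m in models) / len(models)
    return h

# Toy 1-D regression data: y = x^2 plus noise (made up for illustration).
data_rng = random.Random(1)
S = [(i / 10, (i / 10) ** 2 + data_rng.gauss(0, 0.1)) for i in range(20)]
h = bagged_predictor(S)
pred = h(0.55)
```

Because every bootstrap sample is drawn from the same S, the individual models are correlated, which is why the slide describes the variance reduction as sub-linear.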

  19. (Figure: bias and variance of a decision tree (DT) vs. a bagged DT; lower is better) “An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants” Eric Bauer & Ron Kohavi, Machine Learning 36, 105–139 (1999)

  20. Random Forests
      • Goal: reduce variance
      • Bagging can only do so much
      • Resampling training data asymptotes
      • Random Forests: sample data & features!
        • Sample S’
        • Train DT
        • At each node, sample features (sqrt)
        • Average predictions
      • Further de-correlates trees
      “Random Forests – Random Features” [Leo Breiman, 1997] http://oz.berkeley.edu/~breiman/random-forests.pdf
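The per-node feature sampling that distinguishes Random Forests from plain bagging can be sketched for a single node as follows. This is a simplified illustration, not a full forest: it assumes binary 0/1 labels, reads "sqrt" as sampling about the square root of the number of features, and uses hypothetical function names and toy data.

```python
import math
import random

def entropy(labels):
    """Binary entropy (in bits) of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def best_split(X, y, feature_ids):
    """Lowest weighted-entropy (score, feature, threshold), searching only feature_ids."""
    best = None
    for j in feature_ids:
        for t in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            if not left or not right:
                continue
            score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, j, t)
    return best

def random_forest_split(X, y, rng):
    """Split one node using a random subset of about sqrt(d) features."""
    d = len(X[0])
    k = max(1, math.isqrt(d))
    feature_ids = rng.sample(range(d), k)
    return feature_ids, best_split(X, y, feature_ids)

# Toy data with 4 features (made up for illustration).
X = [[0, 5, 1, 7], [1, 5, 0, 3], [0, 6, 1, 2], [1, 6, 0, 9]]
y = [0, 1, 0, 1]
feature_ids, split = random_forest_split(X, y, random.Random(0))
```

In a full forest this split routine would be applied recursively inside each tree, with each tree trained on its own bootstrap sample S’.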

  21. (Figure: average performance over many datasets; Random Forests perform the best) “An Empirical Evaluation of Supervised Learning in High Dimensions” Caruana, Karampatziakis & Yessenalina, ICML 2008

  22. Structured Random Forests
      • DTs normally train on unary labels y=0/1
      • What about structured labels?
      • Must define information gain of structured labels
      • Edge detection:
        • E.g., structured label is a 16x16 image patch
        • Map structured labels to another space where entropy is well defined
      “Structured Random Forests for Fast Edge Detection” Dollár & Zitnick, ICCV 2013

  23. (Figure panels: Image, Ground Truth, RF Prediction)
      • Comparable accuracy vs. state-of-the-art
      • Orders of magnitude faster
      “Structured Random Forests for Fast Edge Detection” Dollár & Zitnick, ICCV 2013

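  25. Functional Gradient Descent
      • $h(x) = h_1(x) + h_2(x) + \ldots + h_n(x)$
      • $h_1(x)$ is trained on $S' = \{(x,y)\}$
      • $h_2(x)$ is trained on $S' = \{(x,\, y - h_1(x))\}$
      • …
      • $h_n(x)$ is trained on $S' = \{(x,\, y - h_1(x) - \ldots - h_{n-1}(x))\}$
      http://statweb.stanford.edu/~jhf/ftp/trebst.pdf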
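The residual-fitting scheme on this slide can be sketched with regression stumps as the base models. This is a minimal illustration for squared loss on 1-D inputs; `fit_stump`, the toy data, and the choice of stumps are assumptions for the example, not the exact procedure from the cited paper.

```python
def mean(values):
    return sum(values) / len(values)

def fit_stump(S):
    """Least-squares regression stump on 1-D data S = [(x, target), ...]."""
    xs = sorted({x for x, _ in S})
    best = None
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        left = [r for x, r in S if x <= t]
        right = [r for x, r in S if x > t]
        a, b = mean(left), mean(right)
        err = sum((r - a) ** 2 for r in left) + sum((r - b) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, a, b)
    _, t, a, b = best
    return lambda x: a if x <= t else b

def functional_gradient_descent(S, n_rounds):
    """Fit each new model to the residuals left by the models so far."""
    models = []
    residual = list(S)  # round 1 trains on S' = {(x, y)}
    for _ in range(n_rounds):
        h_i = fit_stump(residual)
        models.append(h_i)
        # next round trains on S' = {(x, y - h_1(x) - ... - h_i(x))}
        residual = [(x, r - h_i(x)) for x, r in residual]
    return lambda x: sum(h_i(x) for h_i in models)

def squared_error(h, S):
    return sum((h(x) - y) ** 2 for x, y in S)

# Toy data: y = x^2 on x = 0..9 (made up for illustration).
S = [(float(x), float(x * x)) for x in range(10)]
err_1 = squared_error(functional_gradient_descent(S, 1), S)
err_10 = squared_error(functional_gradient_descent(S, 10), S)
```

For squared loss the residual is the negative gradient of the loss with respect to the current prediction, which is where the name comes from.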

  26. Coordinate Gradient Descent
      • Learn $w$ so that $h(x) = w^T x$
      • Coordinate descent
        • Init $w = 0$
        • Choose dimension with highest gain
        • Set component of $w$
        • Repeat
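The loop on this slide can be sketched for squared loss. The slide does not define "gain"; the example below assumes it means the reduction in summed squared error from optimally updating a single coordinate, and the function name and toy data are hypothetical.

```python
def coordinate_descent(X, y, n_steps):
    """Greedy coordinate descent for least squares, h(x) = w^T x."""
    d = len(X[0])
    w = [0.0] * d          # Init w = 0
    residual = list(y)     # y - Xw, which equals y when w = 0
    for _ in range(n_steps):
        best_j, best_gain, best_delta = None, 0.0, 0.0
        for j in range(d):
            col = [row[j] for row in X]
            norm = sum(c * c for c in col)
            if norm == 0:
                continue
            corr = sum(c * r for c, r in zip(col, residual))
            delta = corr / norm        # optimal change to w[j]
            gain = corr * corr / norm  # resulting drop in squared error
            if gain > best_gain:
                best_j, best_gain, best_delta = j, gain, delta
        if best_j is None:
            break  # no coordinate improves the fit
        w[best_j] += best_delta        # Set component of w
        residual = [r - best_delta * row[best_j] for row, r in zip(X, residual)]
    return w

# Toy data where the exact solution is w = [2, -3] (made up for illustration).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
y = [2.0, -3.0, 2.0]
w = coordinate_descent(X, y, n_steps=2)
```

Each step touches one coordinate only, which is the property the later slide carries over to function space, where each "coordinate" is a candidate model.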

  32. Functional Gradient Descent
      • $h(x) = h_1(x) + h_2(x) + \ldots + h_n(x)$
      • Coordinate descent in function space
      • Restrict weights to be 0,1,2,…
      • “Function Space” (all possible DTs)

  33. Boosting (AdaBoost)
      • $h(x) = a_1 h_1(x) + a_2 h_2(x) + \ldots + a_n h_n(x)$
      • $h_1(x)$ is trained on $S' = \{(x,y,u_1)\}$, $h_2(x)$ on $S' = \{(x,y,u_2)\}$, …, $h_n(x)$ on $S' = \{(x,y,u_n)\}$
      • $u$: weighting on data points
      • $a$: weight of linear combination
      • Stop when validation performance plateaus (will discuss later)
      https://www.cs.princeton.edu/~schapire/papers/explaining-adaboost.pdf

  34. AdaBoost algorithm (figure; annotated steps)
      • Initial distribution of data
      • Train model
      • Error of model
      • Coefficient of model
      • Update distribution
      • Final average
      Theorem: training error drops exponentially fast
      https://www.cs.princeton.edu/~schapire/papers/explaining-adaboost.pdf
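The annotated steps on this slide can be followed in a compact sketch. The formulas themselves did not survive in the transcript, so the example assumes the standard AdaBoost forms: labels in {-1, +1}, coefficient a = ½·ln((1 − err)/err), and weights updated in proportion to exp(−a·y·h(x)). The threshold-stump learner and the toy data are hypothetical.

```python
import math

def train_stump(xs, ys, u):
    """Threshold classifier on 1-D inputs minimising the u-weighted error."""
    best = None
    for t in sorted(set(xs)):
        for sign in (1, -1):
            preds = [sign if x > t else -sign for x in xs]
            err = sum(ui for ui, p, y in zip(u, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, t, sign)
    err, t, sign = best
    return err, (lambda x: sign if x > t else -sign)

def adaboost(xs, ys, n_rounds):
    n = len(xs)
    u = [1.0 / n] * n                              # initial distribution of data
    ensemble = []
    for _ in range(n_rounds):
        err, h = train_stump(xs, ys, u)            # train model, error of model
        err = min(max(err, 1e-12), 1 - 1e-12)      # guard against log(0)
        a = 0.5 * math.log((1 - err) / err)        # coefficient of model
        ensemble.append((a, h))
        u = [ui * math.exp(-a * y * h(x)) for ui, x, y in zip(u, xs, ys)]
        total = sum(u)
        u = [ui / total for ui in u]               # update distribution
    # final weighted combination, thresholded at zero
    return lambda x: 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1

# Toy data that no single stump can fit (made up for illustration).
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
H1 = adaboost(xs, ys, n_rounds=1)
H3 = adaboost(xs, ys, n_rounds=3)
errors_1 = sum(H1(x) != y for x, y in zip(xs, ys))
errors_3 = sum(H3(x) != y for x, y in zip(xs, ys))
```

On this toy set one stump misclassifies two points, while the weighted combination of three stumps fits all six.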

  35. (Figure: bias and variance of DT, Bagging, and AdaBoost; lower is better)
      • Boosting often uses weak models
      • E.g., “shallow” decision trees
      • Weak models have lower variance
      “An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants” Eric Bauer & Ron Kohavi, Machine Learning 36, 105–139 (1999)

  36. Ensemble Selection
      • Split S into training set S’ and validation set V’
      • H = {2000 models trained using S’}
      • Maintain ensemble model as combination of H: $h(x) = h_1(x) + h_2(x) + \ldots + h_n(x) + h_{n+1}(x)$
      • Add model from H that maximizes performance on V’ (denote as $h_{n+1}$)
      • Repeat
      • Models are trained on S’
      • Ensemble built to optimize V’
      “Ensemble Selection from Libraries of Models” Caruana, Niculescu-Mizil, Crew & Ksikes, ICML 2004
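The greedy loop on this slide can be sketched as forward selection over a library of precomputed validation predictions. The example makes several assumptions not stated on the slide: the ensemble averages its members, "performance" is mean squared error on V’ (lower is better), and a model may be picked more than once. The library contents are made up.

```python
def mse(preds, targets):
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)

def ensemble_selection(library_preds, y_val, n_rounds):
    """Greedily add the model whose inclusion gives the best validation score.

    library_preds maps a model name to its predictions on the validation set V'.
    """
    chosen = []
    total = [0.0] * len(y_val)  # running sum of the chosen models' predictions

    def score_if_added(name):
        size = len(chosen) + 1
        averaged = [(t + p) / size for t, p in zip(total, library_preds[name])]
        return mse(averaged, y_val)

    for _ in range(n_rounds):
        best = min(library_preds, key=score_if_added)  # candidate h_{n+1}
        chosen.append(best)
        total = [t + p for t, p in zip(total, library_preds[best])]
    return chosen

# Hypothetical library: three models' predictions on a two-example V'.
library_preds = {"a": [1.0, 1.0], "b": [3.0, 3.0], "c": [10.0, 10.0]}
y_val = [2.0, 2.0]
chosen = ensemble_selection(library_preds, y_val, n_rounds=2)
```

Here neither "a" nor "b" is accurate alone, but their average matches the validation targets exactly, so the second round adds "b" to "a".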

  37. Ensemble Selection
      • Ensemble Selection often outperforms more homogeneous sets of models.
      • Reduces overfitting by building model using validation set.
      • Ensemble Selection won KDD Cup 2009 http://www.niculescu-mizil.org/papers/KDDCup09.pdf
      “Ensemble Selection from Libraries of Models” Caruana, Niculescu-Mizil, Crew & Ksikes, ICML 2004

  38. State-of-the-art prediction performance • Won Netflix Challenge • Won numerous KDD Cups • Industry standard …and many other ensemble methods as well.

  39. References & Further Reading
      • “An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants” Bauer & Kohavi, Machine Learning, 36, 105–139 (1999)
      • “Bagging Predictors” Leo Breiman, Tech Report #421, UC Berkeley, 1994, http://statistics.berkeley.edu/sites/default/files/tech-reports/421.pdf
      • “An Empirical Comparison of Supervised Learning Algorithms” Caruana & Niculescu-Mizil, ICML 2006
      • “An Empirical Evaluation of Supervised Learning in High Dimensions” Caruana, Karampatziakis & Yessenalina, ICML 2008
      • “Ensemble Methods in Machine Learning” Thomas Dietterich, Multiple Classifier Systems, 2000
      • “Ensemble Selection from Libraries of Models” Caruana, Niculescu-Mizil, Crew & Ksikes, ICML 2004
      • “Getting the Most Out of Ensemble Selection” Caruana, Munson & Niculescu-Mizil, ICDM 2006
      • “Explaining AdaBoost” Rob Schapire, https://www.cs.princeton.edu/~schapire/papers/explaining-adaboost.pdf
      • “Greedy Function Approximation: A Gradient Boosting Machine” Jerome Friedman, 2001, http://statweb.stanford.edu/~jhf/ftp/trebst.pdf
      • “Random Forests – Random Features” Leo Breiman, Tech Report #567, UC Berkeley, 1999
      • “Structured Random Forests for Fast Edge Detection” Dollár & Zitnick, ICCV 2013
      • “ABC-Boost: Adaptive Base Class Boost for Multi-class Classification” Ping Li, ICML 2009
      • “Additive Groves of Regression Trees” Sorokina, Caruana & Riedewald, ECML 2007, http://additivegroves.net/
      • “Winning the KDD Cup Orange Challenge with Ensemble Selection” Niculescu-Mizil et al., KDD 2009
      • “Lessons from the Netflix Prize Challenge” Bell & Koren, SIGKDD Explorations 9(2), 75–79, 2007
