
Combining models


Presentation Transcript


  1. Combining models Henrik Boström Stockholm University • Condorcet’s jury theorem • Bagging • Randomization • Random forests • Boosting • Stacking

  2. Condorcet’s jury theorem If each member of a jury is more likely to be right than wrong, then the majority of the jury, too, is more likely to be right than wrong; and the probability that the right outcome is supported by a majority of the jury is a (swiftly) increasing function of the size of the jury, converging to 1 as the size of the jury tends to infinity. Condorcet, 1785
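A quick numeric illustration (an addition, not part of the slides) of how fast the majority probability grows; jury sizes are kept odd to avoid ties, and the per-juror competence p = 0.6 is an arbitrary assumption:

    # Probability that a majority of n independent jurors, each correct
    # with probability p, reaches the right verdict.
    from math import comb

    def majority_correct(p, n):
        # Sum the binomial probabilities over all majority counts k.
        return sum(comb(n, k) * p**k * (1 - p)**(n - k)
                   for k in range(n // 2 + 1, n + 1))

    for n in (1, 11, 101, 1001):
        print(n, round(majority_correct(0.6, n), 4))
    # 1 -> 0.6, 11 -> ~0.75, 101 -> ~0.98, 1001 -> ~1.0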

  3. Condorcet’s jury theorem

  4. Single trees vs. forests

  5. Bagging A bootstrap replicate E′ of a set of examples E is created by randomly selecting n = |E| examples from E with replacement. The probability of an example in E appearing in E′ is 1 − (1 − 1/n)^n, which approaches 1 − 1/e ≈ 0.632 as n grows. L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996
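A minimal simulation (an addition, not from the slides) checking that inclusion probability empirically:

    import random

    n, trials = 1000, 2000
    hits = 0
    for _ in range(trials):
        # Bootstrap replicate: n indices drawn with replacement from {0, ..., n-1}.
        replicate = {random.randrange(n) for _ in range(n)}
        hits += 0 in replicate  # did example 0 end up in E'?
    print("empirical:", hits / trials)
    print("analytic :", 1 - (1 - 1/n) ** n)  # ~0.632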

  6. Bootstrap replicate

  7. Bagging Input: examples E, base learner BL, iterations n Output: combined model M
  i := 0
  Repeat
    i := i + 1
    Generate bootstrap replicate E′ of E
    Mi := BL(E′)
  Until i = n
  M := Average({M1, …, Mn}) *
  * For classification: majority vote, or the mean class probability distribution. For regression: mean predicted value
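A runnable Python sketch of the loop above, under the assumption that the base learner is a function taking a list of examples and returning a callable model (the slides do not fix an interface):

    import random

    def bagging(examples, base_learner, n_models):
        models = []
        for _ in range(n_models):
            # Bootstrap replicate E': |E| examples drawn with replacement.
            replicate = [random.choice(examples) for _ in examples]
            models.append(base_learner(replicate))
        # Regression: mean predicted value; for classification one would
        # instead take a majority vote or mean class probability distribution.
        return lambda x: sum(m(x) for m in models) / n_models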

  8. Accuracy as a function of ensemble size heart-disease dataset from UCI repository, stratified 10-fold cross-validation, bagged decision trees (RDS)

  9. Using bagging • Works in the same way for classification and regression • Any type of base learner may be used, but better results can often be obtained with more brittle methods, e.g. decision trees or neural networks • For decision trees, pruning often has a detrimental effect, i.e., the more variance each individual model has, the better

  10. Randomization • Repeated application of a base learner that contains some stochastic element (or that can be adjusted to contain stochastic elements), such as: the setting of initial weights in a neural network, the sampling of grow and prune data, or the choice of attribute to split on in decision trees • Randomization may be combined with other techniques

  11. Random forests • Random forests (Breiman 2001) are generated by combining two techniques: bagging (Breiman 1996) and the random subspace method (Ho 1998). L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001. L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996. T. K. Ho. The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8):832–844, 1998

  12. The random subspace method [Figure: decision trees built from the same examples (e2, e3, e4, e6) using different random subsets of the features, e.g. splitting on Fri/Sat, and yielding different Wait=yes/Wait=no predictions]

  13. Random forests Random forests = a set of classification trees generated with bagging, where only a randomly chosen subset of all available features (F) is considered at each node when generating the trees. A common choice is to let the number of considered features be equal to √F. L. Breiman. Random Forests. Machine Learning, 45(1):5–32, 2001
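As a usage sketch (not part of the slides), the √F rule corresponds to max_features="sqrt" in scikit-learn's random forest implementation; the dataset here is an arbitrary stand-in:

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = load_breast_cancer(return_X_y=True)
    # 100 bagged trees, each node choosing among sqrt(F) random features.
    rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=0)
    print(cross_val_score(rf, X, y, cv=10).mean())  # 10-fold CV accuracy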

  14. Bagged trees vs. random forests heart-disease dataset from UCI repository, stratified 10-fold cross-validation, base learner: decision trees (RDS)

  15. Random forests

  16. Averaging predictions

  17. How do we evaluate the predictive performance? • Accuracy: probability of correctly classifying an example • Area under ROC curve (AUC): probability of ranking a positive example ahead of a negative example
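Both measures can be computed directly from their definitions; a small self-contained sketch (an addition, with made-up labels and scores):

    def accuracy(labels, predictions):
        return sum(l == p for l, p in zip(labels, predictions)) / len(labels)

    def auc(labels, scores):
        # Probability that a random positive is ranked ahead of a random
        # negative; ties count as one half.
        pos = [s for l, s in zip(labels, scores) if l == 1]
        neg = [s for l, s in zip(labels, scores) if l == 0]
        wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
        return wins / (len(pos) * len(neg))

    labels = [1, 1, 0, 1, 0]
    scores = [0.9, 0.7, 0.6, 0.4, 0.2]
    print(accuracy(labels, [int(s >= 0.5) for s in scores]))  # 0.6
    print(auc(labels, scores))                                # 0.8333...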

  18. Predicting probabilities • Only the relative, and not the absolute, values of the predicted probabilities matter for the accuracy and AUC • Having accurate estimates of the probabilities is, however, important in many cases (e.g., deciding whether to stop at a yellow light)

  19. Evaluating predicted probabilities Brier score: the mean squared distance between the predicted class probability distribution and the true (0/1) class indicators, i.e., Brier = (1/n) Σᵢ Σ_c (P̂(c | xᵢ) − I(yᵢ = c))²
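A minimal sketch (not from the slides) of this score, with hypothetical two-class predicted distributions:

    def brier_score(prob_dists, true_classes, n_classes):
        # Mean squared distance between each predicted distribution and
        # the one-hot encoding of the true class.
        total = 0.0
        for probs, true in zip(prob_dists, true_classes):
            total += sum((probs[c] - (1.0 if c == true else 0.0)) ** 2
                         for c in range(n_classes))
        return total / len(true_classes)

    print(brier_score([[0.8, 0.2], [0.3, 0.7]], [0, 0], 2))  # 0.53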

  20. Decomposing the mean squared error For an ensemble, the mean squared error of the members decomposes as MSE = Bias + Variance, where the bias term, the squared error of the averaged prediction, is the Brier score

  21. Minimizing the Brier score Since Brier = MSE − Variance, the ensemble's Brier score is reduced by keeping the members' MSE low while keeping their variance high
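A numeric check (an addition) of this squared-loss decomposition, using hypothetical member probabilities for the true class (target 1):

    y = 1.0
    preds = [0.9, 0.6, 0.75, 0.8]           # ensemble members' estimates
    mean = sum(preds) / len(preds)
    mse = sum((p - y) ** 2 for p in preds) / len(preds)   # members' MSE
    brier = (mean - y) ** 2                               # error of the average
    var = sum((p - mean) ** 2 for p in preds) / len(preds)
    print(mse, brier + var)  # both 0.068125: MSE = Brier + Variance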

  22. Probability estimation trees (PETs) IF LogP > 4.56 AND Number_of_aromatic_bonds ≤ 9 THEN P(Solubility=Good) = 0.50, P(Solubility=Medium) = 0.40, P(Solubility=Poor) = 0.10

  23. Estimating class probabilities [Figure: a leaf node covering eight positive and one negative training examples]
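A sketch (not part of the slides) of the three leaf estimates compared later in the lecture, applied to the leaf in the figure (8 positives, 1 negative, k = 2 classes):

    def relative_frequency(n_c, n):
        return n_c / n

    def laplace(n_c, n, k):
        return (n_c + 1) / (n + k)

    def m_estimate(n_c, n, m, prior):
        return (n_c + m * prior) / (n + m)

    n_pos, n, k = 8, 9, 2
    print(relative_frequency(n_pos, n))          # 0.889
    print(laplace(n_pos, n, k))                  # 0.818
    # With m = k and a uniform prior the m-estimate equals Laplace;
    # other priors shrink the estimate toward the class distribution.
    print(m_estimate(n_pos, n, m=k, prior=0.5))  # 0.818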

  24. Classification trees vs. probability-estimation trees

  25. Empirical investigation • 34 datasets from the UCI repository • 10-fold cross-validation • Accuracy, area under ROC curve (AUC), Brier score • Random forests with 10, 25, 50, 100, 250 and 500 trees, built from: classification trees (CT), or probability-estimation trees using relative frequency (RF), the Laplace estimate (L), or the m-estimate with m = no. of classes (MK) H. Boström. Forests of Probability Estimation Trees. International Journal of Pattern Recognition and Artificial Intelligence, vol. 26, no. 2 (2012)

  26. Accuracy

  27. AUC

  28. Brier score

  29. MSE

  30. Variance

  31. Boosting (AdaBoost) Input: examples e1, …, em, base learner BL, iterations n Output: a set of model-weight pairs M
  w1 := 1, …, wm := 1
  i := 0
  M := ∅
  repeat
    i := i + 1
    Mi := BL({(e1,w1), …, (em,wm)})
    Err := (w1·error(Mi,e1) + … + wm·error(Mi,em)) / (w1 + … + wm)
    if 0 < Err < 0.5 then
      for j := 1 to m
        if error(Mi,ej) = 0 then wj := wj·Err/(1−Err)
      M := M ∪ {(Mi, −log(Err/(1−Err)))}
  until i = n or Err ≥ 0.5 or Err = 0
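A runnable sketch of this loop, assuming scikit-learn decision stumps as the weighted base learner (an assumption; the slides leave BL open) and binary labels in {0, 1}:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost(X, y, n_iter=10):
        w = np.ones(len(y))
        M = []                                # model-weight pairs
        for _ in range(n_iter):
            stump = DecisionTreeClassifier(max_depth=1)
            stump.fit(X, y, sample_weight=w)
            wrong = stump.predict(X) != y
            err = w[wrong].sum() / w.sum()
            if err == 0 or err >= 0.5:
                break
            w[~wrong] *= err / (1 - err)      # down-weight correct examples
            M.append((stump, -np.log(err / (1 - err))))
        return M

    def predict(M, X):
        # Weighted vote over the ensemble members.
        score = sum(alpha * (2 * model.predict(X) - 1) for model, alpha in M)
        return (score > 0).astype(int)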

  32. Boosting (example) [Figure: training examples (+/−) and the decision boundary of the first model M1]

  33. Boosting (example) [Figure: the same examples with the decision boundary of the second model M2, trained on the re-weighted examples]

  34. Boosting (example) [Figure: the decision boundary of a third model M3; the values 0.27, 0.11 and 0.43 indicate weights]

  35. Boosting (example) [Figure: the combined decision boundary of the weighted ensemble of M1, M2 and M3]

  36. Using boosting • For multi-class problems, it may be difficult to reduce the error below 50% with weak base learners (e.g., decision stumps): more powerful base learners may be employed, the problem may be transformed into multiple binary classification problems, or specific versions of AdaBoost developed for multi-class problems may be used • Boosting can be sensitive to noise (i.e., erroneously labeled training examples); more robust versions have been developed

  37. Stacking Input: examples E, learner L, base learners BL1, …, BLn Output: a model M, a set of base models M1, …, Mn
  Divide E into one or more training and test set pairs
  For each base learner BLi, generate a model Mi for each training set and apply it to the corresponding test set
  E′ := {(y1, x11, …, x1n), …, (ym, xm1, …, xmn)}, where yi is the true label for the i-th test example, and xij is the prediction of base learner BLj for that example
  M := L(E′)
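A minimal stacking sketch under scikit-learn assumptions (the slides do not prescribe particular learners or datasets); out-of-fold predictions play the role of the test-set predictions xij:

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_predict
    from sklearn.naive_bayes import GaussianNB
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    base_learners = [DecisionTreeClassifier(random_state=0), GaussianNB()]

    # E' = (true label, base-model predictions on held-out folds).
    meta_features = np.column_stack(
        [cross_val_predict(bl, X, y, cv=10) for bl in base_learners])
    meta_model = LogisticRegression().fit(meta_features, y)  # M := L(E')

    # Refit the base models on all of E for use at prediction time.
    base_models = [bl.fit(X, y) for bl in base_learners]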

  38. - - + - - - + - - + + ? - - - + - + + - - + + + - + - - + - + + + - - - - - - - - - - - - - + - - - - + - + + ? - - - + - - - + ? - - + - - - - - Stacking IFM2 = good & M3 = good THENClass = good IFM2 = poor THEN Class = poor ...

  39. Using stacking • Predictive performance can be improved by combining class probability estimates rather than class labels generated by the base models • Linear models have been shown to be effective when learning the combination function
