Combining models
Henrik Boström, Stockholm University
• Condorcet's jury theorem
• Bagging
• Randomization
• Random forests
• Boosting
• Stacking
Condorcet's jury theorem
If each member of a jury is more likely to be right than wrong, then the majority of the jury, too, is more likely to be right than wrong; and the probability that the right outcome is supported by a majority of the jury is a (swiftly) increasing function of the size of the jury, converging to 1 as the size of the jury tends to infinity.
(Condorcet, 1785)
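As an illustration (an addition, not from the original slides), the probability that a majority of n independent jurors is right can be computed directly from the binomial distribution; the individual competence p = 0.6 below is an arbitrary choice:

  # Probability that a majority of n independent jurors, each correct with
  # probability p, reaches the right decision (n assumed odd).
  from math import comb

  def majority_correct(p, n):
      # Sum the binomial probabilities of more than half the jurors being correct.
      return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n // 2 + 1, n + 1))

  for n in (1, 5, 25, 101):
      print(n, round(majority_correct(0.6, n), 3))   # increases towards 1 with n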
Bagging
A bootstrap replicate E' of a set of examples E is created by randomly selecting n = |E| examples from E with replacement. The probability of an example in E appearing in E' is 1 - (1 - 1/n)^n, which approaches 1 - 1/e ≈ 0.632 as n grows.
L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996
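A quick numerical check of this fraction (an illustrative addition):

  # Probability that a given example appears at least once in a bootstrap
  # replicate of size n, drawn with replacement.
  for n in (10, 100, 1000, 10000):
      print(n, round(1 - (1 - 1/n)**n, 4))   # approaches 1 - 1/e ≈ 0.632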
Bagging
Input: examples E, base learner BL, iterations n
Output: combined model M
i := 0
repeat
  i := i + 1
  generate bootstrap replicate E' of E
  Mi := BL(E')
until i = n
M := Average({M1, …, Mn}) *
* For classification: majority vote, or the mean class probability distribution
  For regression: mean predicted value
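Below is a minimal runnable sketch of this procedure (an illustrative addition; the scikit-learn decision-tree base learner, the function names, and the majority vote over integer class labels are my assumptions, not part of the original slides):

  import numpy as np
  from sklearn.tree import DecisionTreeClassifier

  def bagging_fit(X, y, n_models=25, random_state=0):
      # Train one unpruned decision tree per bootstrap replicate of the training set.
      rng = np.random.default_rng(random_state)
      models = []
      for _ in range(n_models):
          idx = rng.integers(0, len(X), size=len(X))   # sample |E| examples with replacement
          models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
      return models

  def bagging_predict(models, X):
      # Combine the base models by majority vote (assumes integer class labels 0..k-1).
      votes = np.array([m.predict(X) for m in models])
      return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])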
Accuracy as a function of ensemble size
[Figure: accuracy vs. number of bagged decision trees (RDS) on the heart-disease dataset from the UCI repository, estimated with stratified 10-fold cross-validation]
Using bagging
• Works in the same way for classification and regression
• Any type of base learner may be used, but better results can often be obtained with more brittle (i.e., high-variance) methods, e.g., decision trees and neural networks
• For decision trees, pruning often has a detrimental effect, i.e., the higher the variance of each individual model, the better
Randomization
Repeated application of a base learner that contains some stochastic element (or that can be adjusted to contain stochastic elements), such as:
• the choice of attribute to split on in decision trees
• the setting of initial weights in a neural network
• the sampling of grow and prune data
Randomization may be combined with other techniques
Random forests
Random forests (Breiman 2001) are generated by combining two techniques:
• bagging (Breiman 1996)
• the random subspace method (Ho 1998)
L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001
L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996
T. K. Ho. The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8):832–844, 1998
The random subspace method
[Figure: decision trees grown on randomly chosen subsets of the attributes (e.g., one tree splitting on Fri/Sat, another on a different attribute), each partitioning the examples (e2, e3, e4, e6, …) into Wait=yes and Wait=no leaves]
Random forests
Random forests = a set of classification trees generated with bagging, where only a randomly chosen subset of all available features (F) is considered at each node when generating the trees.
A common choice is to let the number of considered features be equal to √|F|.
L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001
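An illustrative sketch of the per-node feature subsampling (an addition, not from the original slides; the helper name is hypothetical):

  import numpy as np

  def candidate_features(n_features, rng):
      # At each node, consider only a random subset of about sqrt(|F|) of the features.
      k = max(1, int(np.sqrt(n_features)))
      return rng.choice(n_features, size=k, replace=False)

  rng = np.random.default_rng(0)
  print(candidate_features(25, rng))   # e.g., 5 randomly chosen feature indices

In practice, the same effect can be obtained with, for example, scikit-learn's RandomForestClassifier(max_features="sqrt"), which combines bagging with this per-node feature subsampling.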
Bagged trees vs. random forests
[Figure: accuracy of bagged trees vs. random forests as a function of ensemble size on the heart-disease dataset from the UCI repository, estimated with stratified 10-fold cross-validation; base learner: decision trees (RDS)]
How do we evaluate the predictive performance?
• Accuracy: the probability of correctly classifying an example
• Area under ROC curve (AUC): the probability of ranking a positive example ahead of a negative example
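As a concrete illustration (an addition, not in the original slides; the labels and scores are made up), both measures are available in scikit-learn:

  from sklearn.metrics import accuracy_score, roc_auc_score

  y_true = [1, 0, 1, 1, 0]             # hypothetical true labels
  y_pred = [1, 0, 0, 1, 0]             # predicted labels -> accuracy
  p_pos  = [0.9, 0.2, 0.4, 0.8, 0.3]   # predicted P(class = 1) -> AUC
  print(accuracy_score(y_true, y_pred))   # fraction of correctly classified examples
  print(roc_auc_score(y_true, p_pos))     # probability a positive is ranked above a negative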
Predicting probabilities
• Only the relative, and not the absolute, values of the predicted probabilities matter for the accuracy and AUC
• Having accurate estimates of the probabilities is, however, important in many cases, e.g., when the prediction feeds into a decision (stop at a yellow light or not?)
Evaluating predicted probabilities
Brier score: the mean squared error of the predicted class probabilities,
Brier score = (1/n) Σ_i Σ_c (P̂(c | e_i) - I(class(e_i) = c))²
where the outer sum runs over the n examples, the inner sum over the classes, and I(·) is 1 if its argument is true and 0 otherwise.
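A small sketch computing the multi-class Brier score in this form (an illustrative addition; the function and variable names are mine):

  import numpy as np

  def brier_score(proba, y_true):
      # proba: (n_examples, n_classes) predicted probabilities; y_true: integer class labels.
      onehot = np.eye(proba.shape[1])[y_true]          # 1 for the true class, 0 otherwise
      return np.mean(np.sum((proba - onehot) ** 2, axis=1))

  proba = np.array([[0.8, 0.2], [0.3, 0.7]])
  print(brier_score(proba, np.array([0, 1])))          # lower is better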
Decomposing the mean squared error
The mean squared error of the predicted probabilities (i.e., the Brier score) can be decomposed into a variance term and a (squared) bias term:
MSE = Variance + Bias²
Minimizing the Brier score
[Figure: illustration relating the MSE, its variance component, and the Brier score]
Probability estimation trees (PETs)
IF LogP > 4.56 AND Number_of_aromatic_bonds ≤ 9 THEN
  P(Solubility=Good) = 0.50
  P(Solubility=Medium) = 0.40
  P(Solubility=Poor) = 0.10
Estimating class probabilities
Example: a leaf containing 8 positive and 1 negative training examples
• relative frequency: P(+) = 8/9 ≈ 0.89
• Laplace estimate: P(+) = (8+1)/(9+2) ≈ 0.82
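A sketch of commonly used leaf-level estimators, including the m-estimate used below (an illustrative addition; the uniform prior in the m-estimate example is an assumption):

  def relative_frequency(n_c, n_total):
      return n_c / n_total

  def laplace(n_c, n_total, n_classes):
      return (n_c + 1) / (n_total + n_classes)

  def m_estimate(n_c, n_total, m, prior_c):
      # Shrinks the leaf frequency towards the prior probability of class c.
      return (n_c + m * prior_c) / (n_total + m)

  # Leaf with 8 positive and 1 negative examples, two classes:
  print(relative_frequency(8, 9))              # 0.889
  print(laplace(8, 9, 2))                      # 0.818
  print(m_estimate(8, 9, m=2, prior_c=0.5))    # 0.818 with a uniform prior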
Empirical investigation
• 34 datasets from the UCI repository
• 10-fold cross-validation
• Accuracy, area under ROC curve (AUC), Brier score
• Random forests with 10, 25, 50, 100, 250 and 500 trees, using
  - classification trees (CT)
  - probability-estimation trees using
    • relative frequency (RF)
    • Laplace estimate (L)
    • m-estimate, where m = no. of classes (MK)
H. Boström. Forests of Probability Estimation Trees. International Journal of Pattern Recognition and Artificial Intelligence, vol. 26, no. 2 (2012)
Boosting (AdaBoost)
Input: examples e1, …, em, base learner BL, iterations n
Output: a set of model-weight pairs M
w1 := 1, …, wm := 1
i := 0
M := ∅
repeat
  i := i + 1
  Mi := BL({(e1,w1), …, (em,wm)})
  Err := (w1*error(Mi,e1) + … + wm*error(Mi,em)) / (w1 + … + wm)
  if 0 < Err < 0.5 then
    for j := 1 to m
      if error(Mi,ej) = 0 then wj := wj*Err/(1-Err)
    M := M ∪ {(Mi, -log(Err/(1-Err)))}
until i = n or Err ≥ 0.5 or Err = 0
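A compact runnable sketch of this loop (an illustrative addition; the depth-1 decision-tree base learner, the use of scikit-learn's sample_weight mechanism, and the function names are my assumptions):

  import numpy as np
  from sklearn.tree import DecisionTreeClassifier

  def adaboost_fit(X, y, n_rounds=10):
      # Reweight the examples each round and collect (model, weight) pairs.
      w = np.ones(len(X))
      ensemble = []
      for _ in range(n_rounds):
          model = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
          miss = model.predict(X) != y
          err = np.sum(w * miss) / np.sum(w)
          if err >= 0.5 or err == 0:
              break
          w[~miss] *= err / (1 - err)              # down-weight correctly classified examples
          ensemble.append((model, np.log((1 - err) / err)))
      return ensemble

  def adaboost_predict(ensemble, X, classes):
      # Weighted vote over the base models.
      scores = {c: np.zeros(len(X)) for c in classes}
      for model, alpha in ensemble:
          pred = model.predict(X)
          for c in classes:
              scores[c] += alpha * (pred == c)
      return np.array([max(classes, key=lambda c: scores[c][i]) for i in range(len(X))])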
Boosting (example)
[Figures: ten training examples (+/−) reweighted over three boosting iterations (weights such as 0.43, 0.27 and 0.11 shown after reweighting), with the successive models M1, M2 and M3 and their combination]
Using boosting
• For multi-class problems, it may be difficult to reduce the error below 50% with weak base learners (e.g., decision stumps)
  - More powerful base learners may be employed
  - The problem may be transformed into multiple binary classification problems
  - Specific versions of AdaBoost have been developed for multi-class problems
• Boosting can be sensitive to noise (i.e., erroneously labeled training examples)
  - More robust versions have been developed
Stacking
Input: examples E, learner L, base learners BL1, …, BLn
Output: a model M, a set of base models M1, …, Mn
Divide E into one or more training and test set pairs
For each base learner BLi, generate a model Mi for each training set and apply it to the corresponding test set
E' := {(y1,x11,…,x1n), …, (ym,xm1,…,xmn)}, where yi is the true label for the ith test example, and xij is the prediction of base learner BLj for that example
M := L(E')
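A minimal sketch of this procedure using a single train/test split to generate the meta-level data (an illustrative addition; the particular base learners and meta-learner in the usage note are my assumptions):

  import numpy as np
  from sklearn.model_selection import train_test_split
  from sklearn.tree import DecisionTreeClassifier
  from sklearn.naive_bayes import GaussianNB
  from sklearn.linear_model import LogisticRegression

  def stacking_fit(X, y, base_learners, meta_learner):
      # Hold out part of E to obtain out-of-sample base-model predictions for the meta-learner.
      X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
      base_models = [bl.fit(X_tr, y_tr) for bl in base_learners]
      meta_features = np.column_stack([m.predict(X_te) for m in base_models])   # the xij
      meta_model = meta_learner.fit(meta_features, y_te)                        # M := L(E')
      return base_models, meta_model

  def stacking_predict(base_models, meta_model, X):
      meta_features = np.column_stack([m.predict(X) for m in base_models])
      return meta_model.predict(meta_features)

  # Example usage with assumed learners:
  # base_models, meta = stacking_fit(X, y, [DecisionTreeClassifier(), GaussianNB()], LogisticRegression())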
Stacking
[Figure: base-model predictions on held-out examples (+/−) form the meta-level training set, from which the combination model is learned, e.g.:]
IF M2 = good AND M3 = good THEN Class = good
IF M2 = poor THEN Class = poor
...
Using stacking
• Predictive performance can be improved by combining class probability estimates rather than class labels generated by the base models
• Linear models have been shown to be effective when learning the combination function