
Ch 14. Combining Models. Pattern Recognition and Machine Learning, C. M. Bishop, 2006.


  1. Ch 14. Combining Models. Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Summarized by J.-H. Eom, Biointelligence Laboratory, Seoul National University, http://bi.snu.ac.kr/

  2. Contents • 14.1 Bayesian Model Averaging • 14.2 Committees • 14.3 Boosting • 14.3.1 Minimizing exponential error • 14.3.2 Error functions for boosting • 14.4 Tree-based Models • 14.5 Conditional Mixture Models • 14.5.1 Mixtures of linear regression models • 14.5.2 Mixtures of logistic models • 14.5.3 Mixtures of experts (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/

  3. 14. Combining Models • Model combination • a. Averaging the predictions of a set of models • b. Selecting one of the models to make the prediction • Committees • Combining multiple models in some way can improve performance over any single model • Boosting – a variant of the committee method • Trains multiple models in sequence • The error function used to train a particular model depends on the performance of the previous models • Can achieve substantial improvements in performance over a single model • Decision trees as model combination • Different models become responsible for different regions of input space (example: decision tree) • Hard splits: only one model is responsible for making predictions for any given value of the input variables • Mixture distributions and mixtures of experts: soft combination with input-dependent mixing coefficients p(k|x) (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/

  4. 14.1 Bayesian Model Averaging • Model combination, illustrated with a Gaussian mixture: the model is defined in terms of a joint distribution over the observed variable x and a latent variable z, and the corresponding density over x is obtained by marginalizing over z • For i.i.d. data, each observed data point xn has a corresponding latent variable zn • Bayesian model averaging (BMA): with several different models indexed by h = 1, …, H and prior probabilities p(h), the marginal distribution over the data set is a sum over models of p(X|h)p(h) • In BMA, one model is responsible for generating the whole data set • The probability distribution over h reflects uncertainty about which model that is; as the size of the data set increases, this uncertainty reduces and the posterior p(h|X) becomes increasingly focused on just one of the models • Example: density estimation using a mixture of Gaussians (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
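The distributions referred to on this slide were embedded as images in the original; the following is a reconstruction from the surrounding text, following the notation of PRML Section 14.1:

```latex
% Gaussian mixture written via a latent component label z
p(\mathbf{x}) = \sum_{\mathbf{z}} p(\mathbf{x}, \mathbf{z})
              = \sum_{k=1}^{K} \pi_k \,\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)

% i.i.d. data: each observed point x_n has its own latent variable z_n
p(\mathbf{X}) = \prod_{n=1}^{N} \sum_{\mathbf{z}_n} p(\mathbf{x}_n, \mathbf{z}_n)

% Bayesian model averaging: a single model h generates the whole data set
p(\mathbf{X}) = \sum_{h=1}^{H} p(\mathbf{X} \mid h)\, p(h)
```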

  5. 14.1 Bayesian Model Averaging (cont’d) • BMA: the whole data set is generated by a single model • Model combination: different data points in the data set can potentially be generated from different values of the latent variable z, i.e. by different components (see Section 14.5) (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/

  6. 14.2 Committees • Ways to construct a committee • Simplest: average the predictions of a set of individual models • Motivation from the frequentist perspective: the trade-off between bias and variance • In practice we have only a single data set, so we need a way to introduce variability between the different models within the committee → bootstrap data sets • Regression example: predicting the value of a single continuous variable • Generate M bootstrap data sets and use each one to train a separate copy ym(x) of a predictive model (m = 1, …, M) • This is 'bootstrap aggregation', or 'bagging'; the committee prediction is the average of the M individual predictions (see the sketch below) • If h(x) denotes the true regression function, the output of each model can be written as the true value plus an error (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
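As a concrete illustration of the bagging procedure described on this slide, here is a minimal sketch (not from the slides): draw M bootstrap resamples, fit one model per resample, and average the predictions. The choice of DecisionTreeRegressor as the base model is an assumption for illustration; any regressor with fit/predict would do.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagging_fit(X, t, M=10, seed=None):
    """Train M copies of a base regressor, each on a bootstrap resample of (X, t)."""
    rng = np.random.default_rng(seed)
    N = len(X)
    models = []
    for _ in range(M):
        idx = rng.integers(0, N, size=N)          # sample N points with replacement
        model = DecisionTreeRegressor(max_depth=3)
        model.fit(X[idx], t[idx])
        models.append(model)
    return models

def bagging_predict(models, X):
    """Committee prediction y_COM(x) = (1/M) * sum_m y_m(x)."""
    return np.mean([m.predict(X) for m in models], axis=0)

# Usage: noisy sinusoid
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 1))
t = np.sin(2 * np.pi * X[:, 0]) + rng.normal(scale=0.2, size=100)
models = bagging_fit(X, t, M=25, seed=1)
y_com = bagging_predict(models, X)
```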

  7. 14.2 Committees (cont’d) • Compare the sum-of-squares error of the committee with the average error of the individual models acting alone • If the individual errors have zero mean and are uncorrelated, the expected error of the committee is the average error of a single model reduced by a factor of M, simply by averaging M versions of the model (key assumption: uncorrelated errors of the individual models) • In practice the errors are typically highly correlated, and the general error reduction is small • However, the expected committee error will not exceed the expected error of the constituent models (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
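A short reconstruction of the argument sketched above, following PRML Section 14.2 (the equations on the slide were images):

```latex
% Each model's output = true regression function + error
y_m(\mathbf{x}) = h(\mathbf{x}) + \epsilon_m(\mathbf{x})

% Average sum-of-squares error of the individual models
E_{\mathrm{AV}} = \frac{1}{M} \sum_{m=1}^{M} \mathbb{E}_{\mathbf{x}}\!\left[ \epsilon_m(\mathbf{x})^2 \right]

% Expected error of the committee y_{COM}(x) = (1/M) \sum_m y_m(x)
E_{\mathrm{COM}} = \mathbb{E}_{\mathbf{x}}\!\left[ \left( \frac{1}{M} \sum_{m=1}^{M} \epsilon_m(\mathbf{x}) \right)^{\!2} \right]

% If E[eps_m] = 0 and E[eps_m eps_l] = 0 for m != l, the cross terms vanish:
E_{\mathrm{COM}} = \frac{1}{M}\, E_{\mathrm{AV}}

% In general, only the weaker result holds (via Jensen's inequality):
E_{\mathrm{COM}} \le E_{\mathrm{AV}}
```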

  8. 14.3 Boosting • Boosting • Combines multiple 'base' classifiers to produce a powerful committee • AdaBoost – 'adaptive boosting' (Freund & Schapire, 1996) • Most widely used form of boosting • Gives good results even with 'weak' base learners • Characteristics • Base classifiers are trained in sequence • Each base classifier is trained using a weighted form of the data set • The weighting coefficient associated with each data point depends on the performance of the previous classifiers • Greater weight is given to data points misclassified by the previous classifiers • The final prediction uses a weighted majority voting scheme • Figure 14.1: schematic illustration of the boosting framework (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/

  9. 14.3 Boosting – AdaBoost algorithm 1. Initialize the data weighting coefficients {wn} by setting wn(1) = 1/N for n = 1, …, N. 2. For m = 1, …, M: (a) Fit a classifier ym(x) to the training data by minimizing the weighted error function Jm = Σn wn(m) I(ym(xn) ≠ tn) (Eq. 14.15), where I(·) is the indicator function. (b) Evaluate εm = Σn wn(m) I(ym(xn) ≠ tn) / Σn wn(m) (Eq. 14.16), the weighted error rate of the base classifier on the data set, and αm = ln{(1 − εm)/εm} (Eq. 14.17). (c) Update the data weighting coefficients: wn(m+1) = wn(m) exp{αm I(ym(xn) ≠ tn)} (Eq. 14.18). 3. Make predictions using the final model, given by YM(x) = sign(Σm αm ym(x)) (Eq. 14.19). (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
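The following is a minimal sketch of this algorithm (not taken from the slides), using a decision stump, i.e. a threshold on a single input variable as in Figure 14.2, as the base learner. Targets are assumed to take values in {−1, +1}; all names are illustrative.

```python
import numpy as np

def fit_stump(X, t, w):
    """Weighted decision stump: threshold on one input variable.
    Returns (feature, threshold, sign) minimizing the weighted error (14.15)."""
    N, D = X.shape
    best, best_err = (0, 0.0, 1), np.inf
    for d in range(D):
        for thr in np.unique(X[:, d]):
            for s in (1, -1):
                pred = np.where(X[:, d] > thr, s, -s)
                err = np.sum(w * (pred != t))
                if err < best_err:
                    best, best_err = (d, thr, s), err
    return best

def stump_predict(stump, X):
    d, thr, s = stump
    return np.where(X[:, d] > thr, s, -s)

def adaboost_fit(X, t, M=10):
    """AdaBoost with decision stumps; t must take values in {-1, +1}."""
    N = X.shape[0]
    w = np.full(N, 1.0 / N)                      # step 1: w_n^(1) = 1/N
    stumps, alphas = [], []
    for _ in range(M):
        stump = fit_stump(X, t, w)               # step 2(a): minimize weighted error
        miss = (stump_predict(stump, X) != t).astype(float)
        eps = np.sum(w * miss) / np.sum(w)       # step 2(b): epsilon_m   (14.16)
        eps = np.clip(eps, 1e-10, 1 - 1e-10)     # guard against eps = 0 or 1
        alpha = np.log((1 - eps) / eps)          #            alpha_m     (14.17)
        w = w * np.exp(alpha * miss)             # step 2(c): weight update (14.18)
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """Step 3: Y_M(x) = sign( sum_m alpha_m y_m(x) )   (14.19)."""
    F = sum(a * stump_predict(s, X) for s, a in zip(stumps, alphas))
    return np.sign(F)
```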

  10. 14.3 Boosting – AdaBoost (cont’d) • Figure 14.2 – illustration of AdaBoost on 30 data points from a two-class problem; each panel shows the decision boundary of the most recent base learner together with the combined decision boundary of the base learners trained so far • Base learners each consist of a threshold on one of the input variables • 'Decision stump' – a decision tree with a single node (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/

  11. 14.3 Boosting 14.3.1 Minimizing exponential error • Boosting was originally motivated by bounds from statistical learning theory, but its actual performance is much better than those bounds suggest • A different and simple interpretation: boosting as the sequential minimization of an exponential error function (Friedman et al., 2000) • Exponential error function (Eq. 14.22): E = Σn exp{−tn fm(xn)}, where fm(x) = ½ Σl αl yl(x) is a classifier defined as a linear combination of the base classifiers • Alternative approach: instead of a global minimization of the error function, fix the first m − 1 classifiers and their coefficients and minimize only with respect to the m-th classifier ym(x) and its coefficient αm (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/

  12. 14.3 Boosting 14.3.1 Minimizing exponential error (cont’d) • Rewriting the error function (14.22) with the first m − 1 classifiers and their coefficients held fixed, minimization with respect to ym(x) is equivalent to minimizing the weighted error function (14.15), and minimization with respect to αm yields (14.16) and (14.17) • The resulting update of the data-point weights is exactly the AdaBoost update (14.18) • Classification of new data: evaluate the sign of the combined function (14.21); omitting the factor of ½ does not affect the sign, which recovers (14.19) (see the derivation sketch below) (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
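A sketch of the derivation summarized above, following PRML Section 14.3.1, writing w_n^(m) = exp{−t_n f_{m−1}(x_n)} for the part held fixed:

```latex
% Error with the first m-1 classifiers and coefficients held fixed
E = \sum_{n=1}^{N} \exp\!\left\{ -t_n f_{m-1}(\mathbf{x}_n)
        - \tfrac{1}{2}\, t_n \alpha_m y_m(\mathbf{x}_n) \right\}
  = \sum_{n=1}^{N} w_n^{(m)} \exp\!\left\{ -\tfrac{1}{2}\, t_n \alpha_m y_m(\mathbf{x}_n) \right\}

% Splitting the sum into points classified correctly and incorrectly by y_m:
E = \left( e^{\alpha_m/2} - e^{-\alpha_m/2} \right)
      \sum_{n=1}^{N} w_n^{(m)}\, I\!\left( y_m(\mathbf{x}_n) \neq t_n \right)
    + e^{-\alpha_m/2} \sum_{n=1}^{N} w_n^{(m)}

% => minimizing over y_m is equivalent to minimizing (14.15);
%    setting dE/d(alpha_m) = 0 gives alpha_m = ln{(1 - eps_m)/eps_m}   (14.17)

% Weight update: using t_n y_m(x_n) = 1 - 2 I(y_m(x_n) != t_n) and dropping
% the factor exp(-alpha_m/2), which is the same for every n:
w_n^{(m+1)} = w_n^{(m)} \exp\!\left\{ \alpha_m\, I\!\left( y_m(\mathbf{x}_n) \neq t_n \right) \right\}
```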

  13. 14.3 Boosting 14.3.2 Error functions for boosting • Exponential error function: the expected error is E_{x,t}[exp{−t y(x)}]; variational minimization over all possible functions y(x) gives y(x) = ½ ln{p(t=1|x)/p(t=−1|x)}, i.e. half the log-odds • Comparison with the cross-entropy (C.E.) error • +: sequential minimization leads to the simple AdaBoost scheme • −: penalizes large negative values of t·y(x) much more strongly than cross-entropy, so it is less robust to outliers or misclassified data points, and it cannot be interpreted as the log-likelihood function of any well-defined probabilistic model • Figure 14.3: exponential error, cross-entropy error and misclassification error plotted against t·y(x) • Figure 14.4: absolute error versus squared error (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
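A one-step reconstruction of the variational argument mentioned above (the expression on the slide was an image):

```latex
% Expected exponential error
\mathbb{E}_{\mathbf{x},t}\!\left[ e^{-t\, y(\mathbf{x})} \right]
  = \int \Big\{ p(t{=}1 \mid \mathbf{x})\, e^{-y(\mathbf{x})}
             + p(t{=}{-}1 \mid \mathbf{x})\, e^{+y(\mathbf{x})} \Big\}\, p(\mathbf{x})\, d\mathbf{x}

% Minimizing pointwise with respect to y(x): set the derivative of the braces to zero
-\,p(t{=}1 \mid \mathbf{x})\, e^{-y(\mathbf{x})} + p(t{=}{-}1 \mid \mathbf{x})\, e^{+y(\mathbf{x})} = 0
\quad\Longrightarrow\quad
y(\mathbf{x}) = \frac{1}{2} \ln \frac{p(t{=}1 \mid \mathbf{x})}{p(t{=}{-}1 \mid \mathbf{x})}
```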

  14. 14.4 Tree-based Models • Partition the input space into cuboid regions with axis-aligned edges, and assign a simple model to each region • A model combination method in which only one model is responsible for making predictions at any given point in the input space • CART (related: ID3, C4.5, etc.) • Finding the optimal tree structure that minimizes the sum-of-squares error is infeasible → use a greedy optimization (see the sketch below) • Tree growing followed by pruning, or growing with a stopping criterion • Alternative measures for classification: cross-entropy and the Gini index • Readily interpretable; popular in medical diagnosis • Drawbacks: the learned splits are sensitive to the data, and the greedy procedure is suboptimal • Figures 14.5, 14.6 (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
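To make the greedy growing step concrete, here is an illustrative sketch (not from the slides) of choosing a single axis-aligned split for a regression tree by exhaustively minimizing the sum-of-squares error; repeating this recursively on the two resulting regions grows the tree. Function names are assumptions for illustration.

```python
import numpy as np

def sum_of_squares(t):
    """Sum-of-squares error when a region is predicted by its mean target."""
    return 0.0 if len(t) == 0 else np.sum((t - t.mean()) ** 2)

def best_split(X, t):
    """Greedy step of CART-style growing: try every axis-aligned split
    (feature d, threshold thr) and keep the one minimizing the total
    sum-of-squares error of the two resulting cuboid regions."""
    N, D = X.shape
    best_d, best_thr, best_err = None, None, np.inf
    for d in range(D):
        for thr in np.unique(X[:, d])[:-1]:       # candidate thresholds
            left = X[:, d] <= thr
            err = sum_of_squares(t[left]) + sum_of_squares(t[~left])
            if err < best_err:
                best_d, best_thr, best_err = d, thr, err
    return best_d, best_thr, best_err

# Usage
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
t = np.where(X[:, 0] > 0.5, 1.0, -1.0) + rng.normal(scale=0.1, size=200)
print(best_split(X, t))   # should recover a split near x_0 = 0.5
```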

  15. 14.5 Conditional Mixture Models • Relax the constraint of hard, axis-aligned splits, at the cost of some interpretability • Allow soft, probabilistic splits that can be functions of all of the input variables • Hierarchical mixture of experts – a fully probabilistic tree-based model • Alternative way to motivate the HME: start with a standard probabilistic mixture of unconditional density models and replace the component densities with conditional distributions • Mixing coefficients of the experts • Independent of the input variables (the simplest case) • Dependent on the input variables → mixture of experts model • Hierarchical mixture of experts: each component in the mixture model is itself a mixture of experts (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/

  16. 14.5 Conditional Mixture Models 14.5.1 Mixtures of linear regression models • Consider K linear regression models, each governed by its own weight parameter wk, sharing a common noise variance governed by a precision parameter β, with a single target variable t • Mixture distribution: Eq. 14.34 • Log likelihood: Eq. 14.35 • Use EM to maximize the likelihood: introduce a binary latent variable znk indicating which component generated data point n, and write down the joint distribution over latent and observed variables and the complete-data log likelihood • E-step: evaluate the responsibilities γnk and take the expectation of the complete-data log likelihood to obtain Q(θ, θ^old) • Figure 14.7 (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
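The equations cited on this slide did not survive extraction; a reconstruction following the notation of PRML Section 14.5.1:

```latex
% Mixture of K linear regression models (Eq. 14.34)
p(t \mid \boldsymbol{\theta})
  = \sum_{k=1}^{K} \pi_k\, \mathcal{N}\!\left( t \,\middle|\, \mathbf{w}_k^{\mathrm{T}} \boldsymbol{\phi},\; \beta^{-1} \right)

% Log likelihood for i.i.d. data (Eq. 14.35)
\ln p(\mathbf{t} \mid \boldsymbol{\theta})
  = \sum_{n=1}^{N} \ln \left( \sum_{k=1}^{K} \pi_k\,
      \mathcal{N}\!\left( t_n \,\middle|\, \mathbf{w}_k^{\mathrm{T}} \boldsymbol{\phi}_n,\; \beta^{-1} \right) \right)

% E-step: responsibility of component k for data point n
\gamma_{nk} = \mathbb{E}[z_{nk}]
  = \frac{ \pi_k\, \mathcal{N}\!\left( t_n \mid \mathbf{w}_k^{\mathrm{T}} \boldsymbol{\phi}_n,\, \beta^{-1} \right) }
         { \sum_{j} \pi_j\, \mathcal{N}\!\left( t_n \mid \mathbf{w}_j^{\mathrm{T}} \boldsymbol{\phi}_n,\, \beta^{-1} \right) }
```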

  17. 14.5 Conditional Mixture Models 14.5.1 Mixtures of linear regression models (cont’d) • M-step: keeping the responsibilities fixed, maximize the function Q(θ, θ^old) with respect to the parameters • Maximization with respect to πk gives πk = (1/N) Σn γnk • Maximization with respect to wk decouples into K independent weighted least-squares problems, one per component, with weights given by the responsibilities • Maximization with respect to β gives an update based on the responsibility-weighted squared residuals (see the sketch below) (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
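A compact sketch of the full EM loop for a mixture of linear regression models, combining the E-step above with the M-step updates described on this slide; the function name, the synthetic data and the fixed iteration count are assumptions for illustration.

```python
import numpy as np
from scipy.stats import norm

def em_mixture_linreg(Phi, t, K=2, n_iter=50, seed=0):
    """EM for a mixture of K linear regression models with shared precision beta."""
    rng = np.random.default_rng(seed)
    N, D = Phi.shape
    W = rng.normal(scale=0.1, size=(K, D))      # weight vector w_k per component
    pi = np.full(K, 1.0 / K)                    # mixing coefficients
    beta = 1.0                                  # shared noise precision

    for _ in range(n_iter):
        # E-step: responsibilities gamma_nk
        means = Phi @ W.T                                        # shape (N, K)
        dens = norm.pdf(t[:, None], loc=means, scale=beta ** -0.5)
        gamma = pi * dens
        gamma /= gamma.sum(axis=1, keepdims=True)

        # M-step
        pi = gamma.sum(axis=0) / N                               # update mixing coefficients
        for k in range(K):                                       # weighted least squares per component
            Rk = gamma[:, k]
            A = Phi.T @ (Rk[:, None] * Phi)
            b = Phi.T @ (Rk * t)
            W[k] = np.linalg.solve(A, b)
        resid2 = (t[:, None] - Phi @ W.T) ** 2
        beta = N / np.sum(gamma * resid2)                        # update shared precision

    return pi, W, beta

# Usage: two crossing lines
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=200)
Phi = np.column_stack([np.ones_like(x), x])                      # basis [1, x]
t = np.where(rng.random(200) < 0.5, 1.0 + 2.0 * x, -1.0 - 2.0 * x)
t += rng.normal(scale=0.1, size=200)
print(em_mixture_linreg(Phi, t, K=2))
```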

  18. Figures 14.8 and 14.9: illustration of the mixture of linear regression models fitted by EM (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/

  19. 14.5 Conditional Mixture Models 14.5.2 Mixtures of logistic models • The analogous construction for binary targets: a mixture of K logistic regression models, again fitted with EM; unlike the linear regression case, the M-step has no closed-form solution and requires an iterative procedure such as IRLS (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/

  20. Figure 14.10: illustration of the mixture of logistic regression models (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/

  21. 14.5 Conditional Mixture Models 14.5.3 Mixtures of experts • Mixture of experts model • Further increase the capability of the model by allowing the mixing coefficients themselves to be functions of the input variable, so that p(t|x) = Σk πk(x) pk(t|x) • Different components can model the distribution in different regions of input space (they are 'experts' at making predictions in their own regions) • The gating functions πk(x) determine which components are dominant in which region of input space (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
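An illustrative forward-pass sketch (not from the slides) of a mixture of experts with softmax gating over linear-Gaussian experts; this particular parameterization of the gates and experts is an assumption for illustration.

```python
import numpy as np
from scipy.stats import norm

def softmax(a):
    a = a - a.max(axis=1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)

def mixture_of_experts_density(Phi, t, V, W, beta):
    """p(t|x) = sum_k pi_k(x) N(t | w_k^T phi(x), beta^{-1}),
    with input-dependent gates pi_k(x) = softmax_k(v_k^T phi(x)).

    Phi : (N, D) design matrix, V : (K, D) gating weights,
    W   : (K, D) expert weights, beta : shared noise precision."""
    gates = softmax(Phi @ V.T)                        # pi_k(x_n), shape (N, K)
    means = Phi @ W.T                                 # expert means, shape (N, K)
    dens = norm.pdf(t[:, None], loc=means, scale=beta ** -0.5)
    return np.sum(gates * dens, axis=1)               # p(t_n | x_n)

# Usage: two linear experts, each dominant on one side of x = 0
x = np.linspace(-1, 1, 5)
Phi = np.column_stack([np.ones_like(x), x])
V = np.array([[0.0, 8.0], [0.0, -8.0]])               # gate 0 favours x > 0
W = np.array([[1.0, 2.0], [-1.0, -2.0]])              # expert means w_k^T phi(x)
print(mixture_of_experts_density(Phi, 2.0 * x, V, W, beta=25.0))
```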

  22. 14.5 Conditional Mixture Models 14.5.3 Mixtures of experts (cont’d) • Limitation: the use of linear models for the gating and expert functions • For a more flexible model, use a multilevel gating function → the 'hierarchical mixture of experts' (HME) model • The HME can be viewed as a probabilistic version of decision trees (Section 14.4) • Advantage of the mixture of experts: it can be optimized by EM, in which the M-step for each mixture component and for the gating model involves a convex optimization (although the overall optimization is non-convex) • A more flexible alternative: the mixture density network (Section 5.6) (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/

  23. (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
