Ch 3. Linear Models for Regression (2/2)
Pattern Recognition and Machine Learning, C. M. Bishop, 2006.
Summarized by Yung-Kyun Noh
Biointelligence Laboratory, Seoul National University
http://bi.snu.ac.kr/
Contents
• 3.4 Bayesian Model Comparison
• 3.5 The Evidence Approximation
  • 3.5.1 Evaluation of the evidence function
  • 3.5.2 Maximizing the evidence function
  • 3.5.3 Effective number of parameters
• 3.6 Limitations of Fixed Basis Functions
Bayesian Model Comparison (1/3)
• The problem of model selection from a Bayesian perspective.
• Over-fitting associated with maximum likelihood can be avoided by marginalizing over the model parameters instead of making point estimates of their values.
• It also allows multiple complexity parameters to be determined simultaneously as part of the training process (as in the relevance vector machine).
• The Bayesian view of model comparison simply involves the use of probabilities to represent uncertainty in the choice of model.
• Posterior: $p(\mathcal{M}_i \mid \mathcal{D}) \propto p(\mathcal{M}_i)\, p(\mathcal{D} \mid \mathcal{M}_i)$
  • $p(\mathcal{M}_i)$: prior, a preference for different models.
  • $p(\mathcal{D} \mid \mathcal{M}_i)$: model evidence (marginal likelihood), the preference shown by the data for different models. Parameters have been marginalized out.
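Below is a minimal sketch (not from the slides) of how these quantities combine in practice: given per-model log evidences $\ln p(\mathcal{D}\mid\mathcal{M}_i)$ and priors $p(\mathcal{M}_i)$, it normalizes their products into posterior model probabilities. The function name and the numbers are invented for illustration.

```python
import numpy as np

def model_posterior(log_evidences, priors):
    """Posterior p(M_i|D) ∝ p(M_i) p(D|M_i), computed stably in log space."""
    log_post = np.log(priors) + np.asarray(log_evidences)
    log_post -= log_post.max()        # subtract the max to avoid underflow in exp
    post = np.exp(log_post)
    return post / post.sum()          # normalize so the posteriors sum to one

# Invented numbers: the second model has the largest evidence and dominates.
print(model_posterior([-105.3, -98.7, -101.2], [1/3, 1/3, 1/3]))
```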
Bayesian Model Comparison (2/3)
• Bayes factor: the ratio of model evidences for two models, $p(\mathcal{D}\mid\mathcal{M}_i)\,/\,p(\mathcal{D}\mid\mathcal{M}_j)$.
• Predictive distribution: a mixture distribution, averaging the per-model predictive distributions weighted by the posterior model probabilities:
  $p(t \mid \mathbf{x}, \mathcal{D}) = \sum_i p(t \mid \mathbf{x}, \mathcal{M}_i, \mathcal{D})\, p(\mathcal{M}_i \mid \mathcal{D})$
• Model evidence: $p(\mathcal{D}\mid\mathcal{M}_i) = \int p(\mathcal{D}\mid \mathbf{w}, \mathcal{M}_i)\, p(\mathbf{w}\mid\mathcal{M}_i)\, d\mathbf{w}$
  • Sampling perspective: the marginal likelihood can be viewed as the probability of generating the data set $\mathcal{D}$ from a model whose parameters are sampled at random from the prior.
  • The evidence is also the normalizing term that appears in the denominator when evaluating the posterior distribution over parameters:
  $p(\mathbf{w}\mid\mathcal{D},\mathcal{M}_i) = \dfrac{p(\mathcal{D}\mid\mathbf{w},\mathcal{M}_i)\, p(\mathbf{w}\mid\mathcal{M}_i)}{p(\mathcal{D}\mid\mathcal{M}_i)}$
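The sampling perspective suggests a simple (if high-variance) Monte Carlo estimator of the evidence: draw parameters from the prior and average the resulting likelihoods. The sketch below assumes user-supplied `log_likelihood` and `prior_sample` callables; everything here is illustrative, not Bishop's.

```python
import numpy as np
from scipy.special import logsumexp

def mc_log_evidence(log_likelihood, prior_sample, n_samples=10_000, seed=0):
    """Estimate ln p(D) ≈ ln[(1/S) Σ_s p(D|w_s)] with w_s drawn from the prior."""
    rng = np.random.default_rng(seed)
    log_liks = np.array([log_likelihood(prior_sample(rng)) for _ in range(n_samples)])
    return logsumexp(log_liks) - np.log(n_samples)   # log of the Monte Carlo average

# Toy usage (invented data): scalar model t_n ~ N(w, 1) with prior w ~ N(0, 1).
data = np.array([0.5, 1.2, 0.3])
log_lik = lambda w: -0.5 * np.sum((data - w) ** 2) - 0.5 * len(data) * np.log(2 * np.pi)
print(mc_log_evidence(log_lik, lambda rng: rng.normal(0.0, 1.0)))
```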
Bayesian Model Comparison (3/3)
• Assume the posterior distribution is sharply peaked around the most probable value $\mathbf{w}_{\text{MAP}}$, with width $\Delta w_{\text{posterior}}$, and the prior is flat with width $\Delta w_{\text{prior}}$:
  $p(\mathcal{D}) \simeq p(\mathcal{D}\mid \mathbf{w}_{\text{MAP}})\, \dfrac{\Delta w_{\text{posterior}}}{\Delta w_{\text{prior}}}$
• For a model having a set of $M$ parameters:
  $\ln p(\mathcal{D}) \simeq \ln p(\mathcal{D}\mid \mathbf{w}_{\text{MAP}}) + M \ln\!\left(\dfrac{\Delta w_{\text{posterior}}}{\Delta w_{\text{prior}}}\right)$
  Since $\Delta w_{\text{posterior}} < \Delta w_{\text{prior}}$, the second term penalizes model complexity.
• A simple model has little variability and so will generate data sets that are fairly similar to each other.
• A complex model spreads its predictive probability over too broad a range of data sets and so assigns relatively small probability to any one of them.
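As a quick worked illustration (the numbers are invented): with $M = 10$ parameters and a posterior-to-prior width ratio of $0.1$, the complexity penalty amounts to

```latex
M \ln\!\left(\frac{\Delta w_{\text{posterior}}}{\Delta w_{\text{prior}}}\right)
  = 10 \ln(0.1) \approx -23 \ \text{nats},
```

so the more complex model must raise its best-fit log likelihood $\ln p(\mathcal{D}\mid\mathbf{w}_{\text{MAP}})$ by about 23 nats before its evidence overtakes that of a model without the extra well-determined parameters.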
The Evidence Approximation (1/2)
• Fully Bayesian treatment of the linear basis function model.
  • Hyperparameters: α, β.
  • Prediction: marginalize w.r.t. the hyperparameters as well as w.
• Predictive distribution:
  $p(t \mid \mathbf{t}) = \iiint p(t \mid \mathbf{w}, \beta)\, p(\mathbf{w} \mid \mathbf{t}, \alpha, \beta)\, p(\alpha, \beta \mid \mathbf{t})\, d\mathbf{w}\, d\alpha\, d\beta$
• If the posterior distribution $p(\alpha, \beta \mid \mathbf{t})$ is sharply peaked around values $\hat{\alpha}, \hat{\beta}$, the predictive distribution is obtained simply by marginalizing over $\mathbf{w}$ with $\alpha, \beta$ fixed to $\hat{\alpha}, \hat{\beta}$:
  $p(t \mid \mathbf{t}) \simeq p(t \mid \mathbf{t}, \hat{\alpha}, \hat{\beta}) = \int p(t \mid \mathbf{w}, \hat{\beta})\, p(\mathbf{w} \mid \mathbf{t}, \hat{\alpha}, \hat{\beta})\, d\mathbf{w}$
The Evidence Approximation (2/2)
• If the prior $p(\alpha, \beta)$ is relatively flat, $\hat{\alpha}$ and $\hat{\beta}$ are obtained by maximizing the marginal likelihood function $p(\mathbf{t} \mid \alpha, \beta)$; this is the evidence framework.
• The hyperparameters can thus be determined from the training data alone, without recourse to cross-validation.
• Recall that the ratio α/β is analogous to a regularization parameter.
• Maximizing the evidence:
  • Set the evidence function's derivatives to zero and iterate the resulting re-estimation equations for α and β, or
  • use the expectation maximization (EM) algorithm.
Evaluation of the Evidence Function
• Marginal likelihood (model evidence):
  $p(\mathbf{t} \mid \alpha, \beta) = \int p(\mathbf{t} \mid \mathbf{w}, \beta)\, p(\mathbf{w} \mid \alpha)\, d\mathbf{w}$
• Its log has the closed form
  $\ln p(\mathbf{t} \mid \alpha, \beta) = \frac{M}{2}\ln\alpha + \frac{N}{2}\ln\beta - E(\mathbf{m}_N) - \frac{1}{2}\ln|\mathbf{A}| - \frac{N}{2}\ln(2\pi)$
  where $\mathbf{A} = \alpha\mathbf{I} + \beta\boldsymbol{\Phi}^{\mathsf T}\boldsymbol{\Phi}$, $\mathbf{m}_N = \beta\mathbf{A}^{-1}\boldsymbol{\Phi}^{\mathsf T}\mathbf{t}$, and
  $E(\mathbf{m}_N) = \frac{\beta}{2}\|\mathbf{t} - \boldsymbol{\Phi}\mathbf{m}_N\|^2 + \frac{\alpha}{2}\mathbf{m}_N^{\mathsf T}\mathbf{m}_N$.
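A minimal NumPy sketch of this closed form, assuming a precomputed N×M design matrix `Phi` and target vector `t` (the variable names are mine, not the slides'):

```python
import numpy as np

def log_evidence(Phi, t, alpha, beta):
    """ln p(t|α,β) for the linear basis-function model (PRML Eq. 3.86)."""
    N, M = Phi.shape
    A = alpha * np.eye(M) + beta * Phi.T @ Phi        # A = αI + βΦᵀΦ
    m_N = beta * np.linalg.solve(A, Phi.T @ t)        # posterior mean m_N
    E = 0.5 * beta * np.sum((t - Phi @ m_N) ** 2) + 0.5 * alpha * m_N @ m_N
    _, logdet_A = np.linalg.slogdet(A)                # ln|A|, computed stably
    return (0.5 * M * np.log(alpha) + 0.5 * N * np.log(beta)
            - E - 0.5 * logdet_A - 0.5 * N * np.log(2 * np.pi))
```

Scanning this function over a grid of (α, β) values is a quick way to visualize the evidence surface for a given basis.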
Maximizing the Evidence Function
• Maximization of $\ln p(\mathbf{t} \mid \alpha, \beta)$: set the derivatives w.r.t. α and β to zero.
• w.r.t. α:
  $\gamma = \sum_i \frac{\lambda_i}{\alpha + \lambda_i}, \qquad \alpha = \frac{\gamma}{\mathbf{m}_N^{\mathsf T}\mathbf{m}_N}$
  where $\mathbf{u}_i$ and $\lambda_i$ are the eigenvectors and eigenvalues defined by $\left(\beta\boldsymbol{\Phi}^{\mathsf T}\boldsymbol{\Phi}\right)\mathbf{u}_i = \lambda_i \mathbf{u}_i$.
• w.r.t. β:
  $\frac{1}{\beta} = \frac{1}{N - \gamma} \sum_{n=1}^{N} \left\{ t_n - \mathbf{m}_N^{\mathsf T}\boldsymbol{\phi}(\mathbf{x}_n) \right\}^2$
• These are implicit equations, solved by iterated re-estimation of the hyperparameters α and β (see the sketch below).
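Since γ depends on α and β, and the updates in turn involve γ, the usual fix is simple fixed-point iteration. A minimal sketch, reusing `Phi` and `t` from the previous snippet (the iteration count and initial values are arbitrary choices, not prescribed by the slides):

```python
import numpy as np

def reestimate(Phi, t, alpha=1.0, beta=1.0, n_iters=100):
    """Fixed-point iteration of the evidence-maximizing updates for α and β."""
    N, M = Phi.shape
    eig0 = np.linalg.eigvalsh(Phi.T @ Phi)   # eigenvalues of ΦᵀΦ; those of βΦᵀΦ are β times these
    for _ in range(n_iters):
        lam = beta * eig0
        gamma = np.sum(lam / (alpha + lam))                 # effective number of parameters
        A = alpha * np.eye(M) + beta * Phi.T @ Phi
        m_N = beta * np.linalg.solve(A, Phi.T @ t)
        alpha = gamma / (m_N @ m_N)                         # α = γ / m_Nᵀm_N
        beta = (N - gamma) / np.sum((t - Phi @ m_N) ** 2)   # inverse of the 1/β update
    return alpha, beta, gamma
```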
Effective Number of Parameters (1/2)
• $\gamma = \sum_i \frac{\lambda_i}{\alpha + \lambda_i}$: the effective total number of well-determined parameters.
• Directions with $\lambda_i \gg \alpha$ are pinned down by the data and contribute nearly 1 to γ; directions with $\lambda_i \ll \alpha$ are set by the prior and contribute nearly 0.
Effective Number of Parameters (2/2)
[Figure: log evidence and test error plotted against ln α; both curves indicate the same optimal α.]
Limitations of Fixed Basis Functions
• Models comprising a linear combination of fixed, nonlinear basis functions
  • have closed-form solutions to the least-squares problem, and
  • have a tractable Bayesian treatment.
• The difficulty: the basis functions are fixed before the training data set is observed, which is a manifestation of the curse of dimensionality.
• Properties of real data sets that alleviate this problem:
  • The data vectors {xn} typically lie close to a nonlinear manifold whose intrinsic dimensionality is smaller than that of the input space.
  • The target variables may have significant dependence on only a small number of possible directions within the data manifold.
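For concreteness, a minimal sketch of such a model: a fixed Gaussian basis and the closed-form regularized least-squares solution it admits. The basis centers, width, toy data, and regularizer below are invented for illustration.

```python
import numpy as np

def gaussian_design(x, centers, width):
    """Design matrix with a bias column and Gaussian basis functions."""
    Phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * width ** 2))
    return np.hstack([np.ones((len(x), 1)), Phi])

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
t = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=50)     # noisy sinusoid
Phi = gaussian_design(x, centers=np.linspace(0, 1, 9), width=0.1)
lam = 1e-3                                                # regularizer, plays the role of α/β
w = np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ t)
```

Note that the centers are laid out uniformly over the input space, independent of the data: exactly the fixedness that becomes untenable as the input dimensionality grows.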