Ch 3. Linear Models for Regression (2/2)
Pattern Recognition and Machine Learning, C. M. Bishop, 2006.
Previously summarized by Yung-Kyun Noh
Updated and presented by Rhee, Je-Keun
Biointelligence Laboratory, Seoul National University
http://bi.snu.ac.kr/
Contents
• 3.4 Bayesian Model Comparison
• 3.5 The Evidence Approximation
  • 3.5.1 Evaluation of the evidence function
  • 3.5.2 Maximizing the evidence function
  • 3.5.3 Effective number of parameters
• 3.6 Limitations of Fixed Basis Functions
(C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
Bayesian Model Comparison (1/4)
• Model selection from a Bayesian perspective
  • The over-fitting associated with maximum likelihood can be avoided by marginalizing over the model parameters instead of making point estimates of their values.
  • It also allows multiple complexity parameters to be determined simultaneously as part of the training process (e.g., the relevance vector machine).
• The Bayesian view of model comparison simply involves the use of probabilities to represent uncertainty in the choice of model.
• Posterior over models: $p(\mathcal{M}_i \mid \mathcal{D}) \propto p(\mathcal{M}_i)\, p(\mathcal{D} \mid \mathcal{M}_i)$
  • $p(\mathcal{M}_i)$: prior, a preference for different models.
  • $p(\mathcal{D} \mid \mathcal{M}_i)$: model evidence (marginal likelihood), the preference shown by the data for different models. The parameters have been marginalized out.
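Bayesian Model Comparison (2/4)
• Bayes factor: the ratio of model evidences for two models, $p(\mathcal{D} \mid \mathcal{M}_i) / p(\mathcal{D} \mid \mathcal{M}_j)$
• Predictive distribution: a mixture distribution, obtained by averaging the predictive distributions of the individual models weighted by their posterior probabilities.
  $p(t \mid \mathbf{x}, \mathcal{D}) = \sum_{i=1}^{L} p(t \mid \mathbf{x}, \mathcal{M}_i, \mathcal{D})\, p(\mathcal{M}_i \mid \mathcal{D})$
• Model evidence: $p(\mathcal{D} \mid \mathcal{M}_i) = \int p(\mathcal{D} \mid \mathbf{w}, \mathcal{M}_i)\, p(\mathbf{w} \mid \mathcal{M}_i)\, d\mathbf{w}$
  • Sampling perspective: the marginal likelihood can be viewed as the probability of generating the data set $\mathcal{D}$ from a model whose parameters are sampled at random from the prior.
• Posterior distribution over parameters: $p(\mathbf{w} \mid \mathcal{D}, \mathcal{M}_i) = \dfrac{p(\mathcal{D} \mid \mathbf{w}, \mathcal{M}_i)\, p(\mathbf{w} \mid \mathcal{M}_i)}{p(\mathcal{D} \mid \mathcal{M}_i)}$
  • The evidence is the normalizing term that appears in the denominator when evaluating the posterior distribution over parameters.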
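The quantities on this slide can be sketched numerically. The snippet below is a minimal illustration, assuming the log evidences $\ln p(\mathcal{D} \mid \mathcal{M}_i)$ have already been computed by some other means; the numbers, the function names `model_posteriors` and `bayes_factor`, and the per-model predictive means are all hypothetical and chosen only for illustration.

```python
import math

def model_posteriors(log_evidences, log_priors=None):
    """Posterior p(M_i | D) from ln p(D | M_i) and ln p(M_i).

    A uniform prior over models is assumed when log_priors is None.
    """
    n_models = len(log_evidences)
    if log_priors is None:
        log_priors = [-math.log(n_models)] * n_models
    log_joint = [e + p for e, p in zip(log_evidences, log_priors)]
    peak = max(log_joint)  # shift before exponentiating for numerical stability
    weights = [math.exp(v - peak) for v in log_joint]
    total = sum(weights)
    return [w / total for w in weights]

def bayes_factor(log_evidence_i, log_evidence_j):
    """Ratio of evidences p(D | M_i) / p(D | M_j), computed from logs."""
    return math.exp(log_evidence_i - log_evidence_j)

# Hypothetical log evidences for three candidate models.
log_ev = [-25.0, -22.0, -23.5]
posteriors = model_posteriors(log_ev)

# Mean of the mixture predictive distribution: per-model predictive means
# (hypothetical values) averaged with the posterior model probabilities.
predictive_means = [0.8, 1.1, 1.0]
mixture_mean = sum(p * m for p, m in zip(posteriors, predictive_means))

print("posterior model probabilities:", posteriors)
print("Bayes factor, model 2 vs model 1:", bayes_factor(log_ev[1], log_ev[0]))
print("mixture predictive mean:", mixture_mean)
```

Working in log space avoids underflow, since evidences for realistic data sets are usually far too small to represent directly.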
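Bayesian Model Comparison (3/4)
• Consider the case of a model having a single parameter $w$.
• Assume that the posterior distribution is sharply peaked around the most probable value $w_{\mathrm{MAP}}$, with width $\Delta w_{\mathrm{posterior}}$.
• Assume that the prior is flat with width $\Delta w_{\mathrm{prior}}$, so that $p(w) = 1/\Delta w_{\mathrm{prior}}$. Then
  $p(\mathcal{D}) = \int p(\mathcal{D} \mid w)\, p(w)\, dw \simeq p(\mathcal{D} \mid w_{\mathrm{MAP}})\, \dfrac{\Delta w_{\mathrm{posterior}}}{\Delta w_{\mathrm{prior}}}$
  $\ln p(\mathcal{D}) \simeq \ln p(\mathcal{D} \mid w_{\mathrm{MAP}}) + \ln\!\left(\dfrac{\Delta w_{\mathrm{posterior}}}{\Delta w_{\mathrm{prior}}}\right)$
• The first term represents the fit to the data given by the most probable parameter value; for a flat prior this corresponds to the log likelihood.
• The second term penalizes the model according to its complexity: because $\Delta w_{\mathrm{posterior}} < \Delta w_{\mathrm{prior}}$, this term is negative.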
Bayesian Model Comparison (4/4)
• For a model having a set of $M$ parameters, assuming all parameters have the same ratio $\Delta w_{\mathrm{posterior}} / \Delta w_{\mathrm{prior}}$,
  $\ln p(\mathcal{D}) \simeq \ln p(\mathcal{D} \mid \mathbf{w}_{\mathrm{MAP}}) + M \ln\!\left(\dfrac{\Delta w_{\mathrm{posterior}}}{\Delta w_{\mathrm{prior}}}\right)$
• As we increase the complexity of the model, the first term will typically increase, because a more complex model is better able to fit the data.
• The second term, in contrast, will decrease (become more negative), because the complexity penalty grows linearly with $M$.
• The optimal model complexity, as determined by the maximum evidence, will be given by a trade-off between these two competing terms.
• A simple model has little variability and so will generate data sets that are fairly similar to each other.
• A complex model spreads its predictive probability over too broad a range of data sets and so assigns relatively small probability to any one of them.
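The trade-off between data fit and complexity penalty can be illustrated with a few lines of arithmetic. This is only a sketch of the rough approximation $\ln p(\mathcal{D}) \simeq \ln p(\mathcal{D} \mid \mathbf{w}_{\mathrm{MAP}}) + M \ln(\Delta w_{\mathrm{posterior}} / \Delta w_{\mathrm{prior}})$; the log-likelihood values and the width ratio below are hypothetical numbers, not results from any data set.

```python
import math

def approx_log_evidence(log_lik_map, n_params, width_posterior, width_prior):
    """Fit term plus the complexity penalty M * ln(width_posterior / width_prior)."""
    return log_lik_map + n_params * math.log(width_posterior / width_prior)

# Hypothetical candidates: number of parameters M -> log likelihood at w_MAP.
# The fit improves with M, but with diminishing returns.
candidates = {1: -40.0, 3: -20.0, 6: -17.0, 9: -16.5}

# Assumed widths: each parameter is pinned down to a tenth of its prior range,
# so every extra parameter costs ln(0.1), roughly -2.3.
width_posterior, width_prior = 0.1, 1.0

scores = {
    m: approx_log_evidence(ll, m, width_posterior, width_prior)
    for m, ll in candidates.items()
}
best = max(scores, key=scores.get)

for m in sorted(scores):
    print(f"M={m}: fit={candidates[m]:7.2f}  approx ln evidence={scores[m]:7.2f}")
print("model favoured by the approximate evidence: M =", best)
```

With these made-up numbers the largest model has the best fit, yet an intermediate model has the highest approximate evidence, which is the trade-off described on the slide.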
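The Evidence Approximation (1/2)
• Fully Bayesian treatment of the linear basis function model
  • Hyperparameters: $\alpha$ (precision of the prior over $\mathbf{w}$) and $\beta$ (noise precision).
  • Prediction: marginalize with respect to the hyperparameters as well as $\mathbf{w}$.
• Predictive distribution:
  $p(t \mid \mathbf{t}) = \iiint p(t \mid \mathbf{w}, \beta)\, p(\mathbf{w} \mid \mathbf{t}, \alpha, \beta)\, p(\alpha, \beta \mid \mathbf{t})\, d\mathbf{w}\, d\alpha\, d\beta$
• If the posterior distribution $p(\alpha, \beta \mid \mathbf{t})$ is sharply peaked around values $\hat{\alpha}, \hat{\beta}$, the predictive distribution is obtained simply by marginalizing over $\mathbf{w}$ with $\alpha, \beta$ fixed to the values $\hat{\alpha}, \hat{\beta}$:
  $p(t \mid \mathbf{t}) \simeq p(t \mid \mathbf{t}, \hat{\alpha}, \hat{\beta}) = \int p(t \mid \mathbf{w}, \hat{\beta})\, p(\mathbf{w} \mid \mathbf{t}, \hat{\alpha}, \hat{\beta})\, d\mathbf{w}$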
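For reference, with the hyperparameters fixed the integral over $\mathbf{w}$ is Gaussian and can be written in closed form. The block below restates the standard result for the linear basis function model, assuming a zero-mean isotropic Gaussian prior $p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}\mathbf{I})$ and Gaussian noise with precision $\beta$; the symbols $\mathbf{m}_N$, $\mathbf{S}_N$, $\boldsymbol{\Phi}$ and $\boldsymbol{\phi}(\mathbf{x})$ follow the usual notation of the chapter and do not appear on this slide.

```latex
\begin{align}
p(t \mid \mathbf{x}, \mathbf{t}, \hat{\alpha}, \hat{\beta})
  &= \mathcal{N}\!\left(t \mid \mathbf{m}_N^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}),\ \sigma_N^2(\mathbf{x})\right), \\
\sigma_N^2(\mathbf{x})
  &= \frac{1}{\hat{\beta}} + \boldsymbol{\phi}(\mathbf{x})^{\mathrm{T}} \mathbf{S}_N \boldsymbol{\phi}(\mathbf{x}), \\
\mathbf{m}_N &= \hat{\beta}\, \mathbf{S}_N \boldsymbol{\Phi}^{\mathrm{T}} \mathbf{t},
\qquad
\mathbf{S}_N^{-1} = \hat{\alpha}\, \mathbf{I} + \hat{\beta}\, \boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi}.
\end{align}
```

The first term of the predictive variance is the noise on the targets, and the second reflects the remaining uncertainty in $\mathbf{w}$.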
The Evidence Approximation (2/2)
• Posterior over the hyperparameters: $p(\alpha, \beta \mid \mathbf{t}) \propto p(\mathbf{t} \mid \alpha, \beta)\, p(\alpha, \beta)$
• If the prior $p(\alpha, \beta)$ is relatively flat, then in the evidence framework the values of $\hat{\alpha}, \hat{\beta}$ are obtained by maximizing the marginal likelihood function $p(\mathbf{t} \mid \alpha, \beta)$.
• With this method the hyperparameters can be determined from the training data alone, without recourse to cross-validation.
• Recall that the ratio $\alpha/\beta$ is analogous to a regularization parameter.
• Maximizing the evidence: two approaches
  • Set the derivatives of the evidence function to zero and obtain re-estimation equations for $\alpha$ and $\beta$.
  • Use the expectation-maximization (EM) algorithm.
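Evaluation of the Evidence Function
• Marginal likelihood
  $p(\mathbf{t} \mid \alpha, \beta) = \int p(\mathbf{t} \mid \mathbf{w}, \beta)\, p(\mathbf{w} \mid \alpha)\, d\mathbf{w} = \left(\dfrac{\beta}{2\pi}\right)^{N/2} \left(\dfrac{\alpha}{2\pi}\right)^{M/2} \int \exp\{-E(\mathbf{w})\}\, d\mathbf{w}$
  $E(\mathbf{w}) = \beta E_D(\mathbf{w}) + \alpha E_W(\mathbf{w}) = \dfrac{\beta}{2} \|\mathbf{t} - \boldsymbol{\Phi}\mathbf{w}\|^2 + \dfrac{\alpha}{2} \mathbf{w}^{\mathrm{T}}\mathbf{w}$
• Completing the square over $\mathbf{w}$:
  $E(\mathbf{w}) = E(\mathbf{m}_N) + \dfrac{1}{2} (\mathbf{w} - \mathbf{m}_N)^{\mathrm{T}} \mathbf{A} (\mathbf{w} - \mathbf{m}_N)$
  $\mathbf{A} = \alpha \mathbf{I} + \beta \boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{\Phi}, \quad \mathbf{m}_N = \beta \mathbf{A}^{-1} \boldsymbol{\Phi}^{\mathrm{T}} \mathbf{t}, \quad E(\mathbf{m}_N) = \dfrac{\beta}{2} \|\mathbf{t} - \boldsymbol{\Phi}\mathbf{m}_N\|^2 + \dfrac{\alpha}{2} \mathbf{m}_N^{\mathrm{T}}\mathbf{m}_N$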
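Evaluation of the Evidence Function
• Model evidence (log marginal likelihood), where $\mathbf{A} = \alpha \mathbf{I} + \beta \boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{\Phi}$ and $\mathbf{m}_N = \beta \mathbf{A}^{-1} \boldsymbol{\Phi}^{\mathrm{T}} \mathbf{t}$:
  $\ln p(\mathbf{t} \mid \alpha, \beta) = \dfrac{M}{2} \ln \alpha + \dfrac{N}{2} \ln \beta - E(\mathbf{m}_N) - \dfrac{1}{2} \ln |\mathbf{A}| - \dfrac{N}{2} \ln(2\pi)$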
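As a sketch of how the evidence function might be evaluated in practice, the NumPy code below computes the log marginal likelihood of a linear basis function model with a zero-mean isotropic Gaussian prior (precision $\alpha$) and Gaussian noise (precision $\beta$). The function name `log_evidence`, the synthetic sinusoidal data, the polynomial basis and the particular values of $\alpha$ and $\beta$ are assumptions made for illustration only.

```python
import numpy as np

def log_evidence(Phi, t, alpha, beta):
    """ln p(t | alpha, beta) for design matrix Phi (N x M) and targets t (N,)."""
    N, M = Phi.shape
    A = alpha * np.eye(M) + beta * Phi.T @ Phi          # posterior precision
    m_N = beta * np.linalg.solve(A, Phi.T @ t)          # posterior mean
    # E(m_N) = beta/2 * ||t - Phi m_N||^2 + alpha/2 * m_N^T m_N
    E_mN = 0.5 * beta * np.sum((t - Phi @ m_N) ** 2) + 0.5 * alpha * (m_N @ m_N)
    _, logdet_A = np.linalg.slogdet(A)                  # stable ln|A|
    return (0.5 * M * np.log(alpha) + 0.5 * N * np.log(beta)
            - E_mN - 0.5 * logdet_A - 0.5 * N * np.log(2 * np.pi))

# Synthetic data (assumed): noisy samples of sin(2*pi*x) on [0, 1].
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 30)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, 30)

# Compare polynomial models of increasing order at fixed, assumed hyperparameters.
alpha, beta = 5e-3, 1.0 / 0.3 ** 2
for order in range(9):
    Phi = np.vander(x, order + 1, increasing=True)      # columns 1, x, ..., x^order
    print(f"order {order}: ln evidence = {log_evidence(Phi, t, alpha, beta):.2f}")
```

Using `slogdet` and `solve` rather than forming the determinant or the inverse explicitly keeps the computation stable when the design matrix is poorly conditioned.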
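Maximizing the Evidence Function
• Maximization of $p(\mathbf{t} \mid \alpha, \beta)$
  • Set the derivatives with respect to $\alpha$ and $\beta$ to zero.
• With respect to $\alpha$
  • $\mathbf{u}_i$ and $\lambda_i$ are the eigenvectors and eigenvalues defined by $\left(\beta \boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{\Phi}\right) \mathbf{u}_i = \lambda_i \mathbf{u}_i$
  • Maximizing hyperparameter: $\alpha = \dfrac{\gamma}{\mathbf{m}_N^{\mathrm{T}} \mathbf{m}_N}$, where $\gamma = \sum_i \dfrac{\lambda_i}{\alpha + \lambda_i}$
• With respect to $\beta$
  • $\dfrac{1}{\beta} = \dfrac{1}{N - \gamma} \sum_{n=1}^{N} \left\{ t_n - \mathbf{m}_N^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}_n) \right\}^2$
• Both solutions are implicit, since $\gamma$ and $\mathbf{m}_N$ depend on $\alpha$ and $\beta$; they are therefore applied iteratively until convergence.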
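The re-estimation scheme can be sketched as a simple fixed-point iteration: compute $\mathbf{m}_N$ and $\gamma$ for the current hyperparameters, then update $\alpha = \gamma / \mathbf{m}_N^{\mathrm{T}}\mathbf{m}_N$ and $1/\beta = \frac{1}{N-\gamma}\sum_n \{t_n - \mathbf{m}_N^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}_n)\}^2$. The code below is a minimal illustration, not a reference implementation; the function names, the Gaussian basis functions, the synthetic data, the starting values and the stopping rule are all assumptions.

```python
import numpy as np

def design_matrix(x, centres, width):
    """Bias column plus Gaussian basis functions (an illustrative choice)."""
    gauss = np.exp(-0.5 * ((x[:, None] - centres[None, :]) / width) ** 2)
    return np.hstack([np.ones((x.size, 1)), gauss])

def maximize_evidence(Phi, t, alpha=1.0, beta=1.0, max_iter=200, tol=1e-6):
    """Iterate the re-estimation equations for alpha and beta.

    Returns alpha, beta, and the gamma and m_N computed in the final pass.
    """
    N, M = Phi.shape
    gram = Phi.T @ Phi
    # Eigenvalues of Phi^T Phi; clip tiny negative values caused by round-off.
    eig = np.clip(np.linalg.eigvalsh(gram), 0.0, None)
    for _ in range(max_iter):
        lam = beta * eig                                   # eigenvalues of beta * Phi^T Phi
        A = alpha * np.eye(M) + beta * gram
        m_N = beta * np.linalg.solve(A, Phi.T @ t)         # posterior mean
        gamma = np.sum(lam / (alpha + lam))                # effective number of parameters
        alpha_new = gamma / (m_N @ m_N)
        beta_new = (N - gamma) / np.sum((t - Phi @ m_N) ** 2)
        converged = (abs(alpha_new - alpha) < tol * alpha
                     and abs(beta_new - beta) < tol * beta)
        alpha, beta = alpha_new, beta_new
        if converged:
            break
    return alpha, beta, gamma, m_N

# Synthetic data (assumed): noisy samples of sin(2*pi*x) on [0, 1].
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 40)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, 40)

Phi = design_matrix(x, centres=np.linspace(0.0, 1.0, 9), width=0.15)
alpha, beta, gamma, m_N = maximize_evidence(Phi, t)
print(f"alpha = {alpha:.4g}, beta = {beta:.4g}, gamma = {gamma:.3f} of M = {Phi.shape[1]}")
```

Because the eigenvalues of $\beta\boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{\Phi}$ are just $\beta$ times those of $\boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{\Phi}$, the eigendecomposition is done once outside the loop.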
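Effective Number of Parameters (1/2)
• $\gamma = \sum_i \dfrac{\lambda_i}{\alpha + \lambda_i}$: the effective total number of well-determined parameters, with $0 \le \gamma \le M$.
  • Directions with $\lambda_i \gg \alpha$ are well determined by the data and contribute a term close to 1.
  • Directions with $\lambda_i \ll \alpha$ are set mainly by the prior and contribute a term close to 0.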
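A short numerical sketch shows how $\gamma$ moves between $M$ and 0 as $\alpha$ varies. The random design matrix, the value of $\beta$ and the grid of $\alpha$ values below are hypothetical and serve only to illustrate the formula $\gamma = \sum_i \lambda_i / (\alpha + \lambda_i)$, where $\lambda_i$ are the eigenvalues of $\beta\boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{\Phi}$.

```python
import numpy as np

def effective_parameters(Phi, alpha, beta):
    """gamma = sum_i lambda_i / (alpha + lambda_i), lambda_i eigenvalues of beta * Phi^T Phi."""
    lam = beta * np.linalg.eigvalsh(Phi.T @ Phi)
    lam = np.clip(lam, 0.0, None)   # guard against tiny negative round-off values
    return float(np.sum(lam / (alpha + lam)))

# Hypothetical design matrix with N = 30 data points and M = 5 basis functions.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(30, 5))
beta = 2.0

alphas = [1e-6, 1e-2, 1.0, 1e2, 1e6]
gammas = [effective_parameters(Phi, a, beta) for a in alphas]
for a, g in zip(alphas, gammas):
    print(f"alpha = {a:8.0e}  ->  gamma = {g:.4f}")
```

For very small $\alpha$ the prior is broad and all $M$ parameters are determined by the data, while for very large $\alpha$ the prior dominates and $\gamma$ approaches 0.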
Effective Number of Parameters (2/2)
[Figure: log evidence and test error curves, with the optimal α marked on each.]
Limitations of Fixed Basis Functions
• Models comprising a linear combination of fixed, nonlinear basis functions
  • Have closed-form solutions to the least-squares problem.
  • Have a tractable Bayesian treatment.
• The difficulty
  • The basis functions are fixed before the training data set is observed, so the number of basis functions needed grows rapidly, often exponentially, with the dimensionality of the input space. This is a manifestation of the curse of dimensionality.
• Properties of real data sets that alleviate this problem
  • The data vectors $\{\mathbf{x}_n\}$ typically lie close to a nonlinear manifold whose intrinsic dimensionality is smaller than that of the input space.
  • Target variables may have significant dependence on only a small number of possible directions within the data manifold.
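To make the growth concrete, the following sketch counts the basis functions needed when they are placed on a regular grid with a fixed number per input dimension. The grid layout and the choice of 10 functions per dimension are illustrative assumptions; the slide does not specify how the basis functions are arranged.

```python
def grid_basis_count(per_dimension, input_dim):
    """Number of basis functions on a regular grid: per_dimension ** input_dim."""
    return per_dimension ** input_dim

# Assumed: 10 basis functions along each input dimension.
for input_dim in (1, 2, 3, 5, 10):
    print(f"D = {input_dim:2d}: {grid_basis_count(10, input_dim):,} basis functions")
```

Even at this modest resolution, the count quickly exceeds any realistic number of training points, which is why adapting the basis functions to the data becomes attractive.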