
Bayesian Inference


Presentation Transcript


  1. Bayesian Inference The Reverend Thomas Bayes (1702-1761) Sudhir Shankar Raman Translational Neuromodeling Unit, UZH & ETH With many thanks for materials to: Klaas Enno Stephan & Kay H. Brodersen

  2. Why do I need to learn about Bayesian stats? Because SPM is getting more and more Bayesian: • Segmentation & spatial normalisation • Posterior probability maps (PPMs) • Dynamic Causal Modelling (DCM) • Bayesian Model Selection (BMS) • EEG: source reconstruction

  3. Classical and Bayesian statistics p-value: the probability of observing data at least as extreme as the data actually observed, given that there is no effect. If small, reject the null hypothesis that there is no effect. • One can never accept the null hypothesis • Given enough data, one can always demonstrate a significant effect • Correction for multiple comparisons is necessary Probability of observing the data y given no effect (θ = 0): p(y | θ = 0). Bayesian inference • Flexibility in modelling • Incorporating prior information • Posterior probability of the effect • Options for model comparison Statistical analysis and the illusion of objectivity. James O. Berger, Donald A. Berry

  4. Bayes' Theorem Prior Beliefs + Observed Data → Posterior Beliefs Reverend Thomas Bayes (1702 - 1761) "Bayes' theorem describes how an ideally rational person processes information." (Wikipedia)

  5. Bayes' Theorem Given data y and parameters θ, the joint probability can be factorized in two ways: p(y, θ) = p(y|θ) p(θ) = p(θ|y) p(y). Eliminating p(y, θ) gives Bayes' rule: p(θ|y) = p(y|θ) p(θ) / p(y), i.e. Posterior = Likelihood × Prior / Evidence.
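
As a concrete illustration of the rule above (not part of the original slides), here is a minimal Python sketch that applies Bayes' rule to a hypothetical diagnostic test; the prevalence, sensitivity, and specificity values are assumptions chosen for illustration.

```python
# Hypothetical diagnostic-test example (numbers chosen for illustration only).
prior = 0.01          # p(disease): assumed prevalence
sensitivity = 0.95    # p(positive test | disease)
specificity = 0.90    # p(negative test | no disease)

# Evidence: p(positive test), marginalizing over the two hypotheses.
evidence = sensitivity * prior + (1 - specificity) * (1 - prior)

# Bayes' rule: posterior = likelihood * prior / evidence.
posterior = sensitivity * prior / evidence
print(f"p(disease | positive test) = {posterior:.3f}")  # ~0.088
```

Even with a highly sensitive test, the low prior prevalence keeps the posterior probability of disease below ten percent, which is exactly the prior-weighting the theorem formalizes.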

  6. Bayesian statistics new data + prior knowledge posterior ∝ likelihood ∙ prior Bayes' theorem allows one to formally incorporate prior knowledge into computing statistical probabilities. Priors can be of different sorts: empirical, principled, shrinkage, or uninformative. The "posterior" probability of the parameters given the data is an optimal combination of prior knowledge and new data, weighted by their relative precision.

  7. Bayes in motion - an animation

  8. Principles of Bayesian inference • Formulation of a generative model: likelihood function p(y|θ) and prior distribution p(θ) • Observation of data y • Model inversion: update of beliefs based upon observations, given a prior state of knowledge. Point estimates: maximum a posteriori (MAP) estimate θ_MAP = argmax_θ p(θ|y); maximum likelihood (ML) estimate θ_ML = argmax_θ p(y|θ)

  9. Conjugate Priors • Prior and posterior have the same functional form • The posterior is available as an analytical expression • Conjugate priors exist for all members of the exponential family • Example: Gaussian likelihood with a Gaussian prior over the mean

  10. Gaussian Model Likelihood and prior: p(y|μ) = N(y; μ, λ_e^-1) and p(μ) = N(μ; μ_p, λ_p^-1). Posterior: p(μ|y) = N(μ; μ_post, λ_post^-1) with λ_post = λ_e + λ_p and μ_post = (λ_e y + λ_p μ_p) / λ_post, i.e. the posterior mean is a relative precision weighting of the data and the prior mean.
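
A minimal sketch of this relative precision weighting for a single observation; all numerical values below are assumptions chosen for illustration, not from the slides.

```python
# Precision-weighted Gaussian update for a single observation.
mu_prior, lambda_prior = 0.0, 1.0   # prior mean and precision (1/variance)
y, lambda_e = 2.0, 4.0              # observed data point and noise precision

# Posterior precision is the sum of prior and data precisions;
# the posterior mean is a precision-weighted average of prior mean and data.
lambda_post = lambda_prior + lambda_e
mu_post = (lambda_prior * mu_prior + lambda_e * y) / lambda_post

print(f"posterior mean = {mu_post:.3f}, posterior precision = {lambda_post:.3f}")
# posterior mean = 1.600, posterior precision = 5.000
```

Because the data are four times more precise than the prior here, the posterior mean sits much closer to the observation than to the prior mean.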

  11. Bayesian regression: univariate case Normal densities, univariate linear model y = xθ + ε with noise precision λ_e and prior θ ~ N(θ_p, λ_p^-1). Posterior: θ|y ~ N(θ_post, λ_post^-1) with λ_post = λ_p + λ_e x^T x and θ_post = (λ_p θ_p + λ_e x^T y) / λ_post, again a relative precision weighting.

  12. Bayesian GLM: multivariate case Normal densities, General Linear Model y = Xθ + ε with ε ~ N(0, C_e) and prior θ ~ N(θ_p, C_p). Posterior: θ|y ~ N(θ_post, C_post) with C_post^-1 = X^T C_e^-1 X + C_p^-1 and θ_post = C_post (X^T C_e^-1 y + C_p^-1 θ_p). • One step if C_e is known. • Otherwise define a conjugate prior or perform iterative estimation with EM.
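
A minimal sketch of the one-step multivariate posterior when the noise covariance C_e is known; the design matrix, prior, and noise level are assumptions used only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # design matrix (assumed)
theta_true = np.array([1.0, 0.5])
C_e = 0.25 * np.eye(n)                                   # known noise covariance
y = X @ theta_true + rng.multivariate_normal(np.zeros(n), C_e)

theta_p = np.zeros(p)        # prior mean
C_p = 10.0 * np.eye(p)       # prior covariance

# Posterior precision and mean (one-step solution when C_e is known).
post_prec = X.T @ np.linalg.inv(C_e) @ X + np.linalg.inv(C_p)
C_post = np.linalg.inv(post_prec)
theta_post = C_post @ (X.T @ np.linalg.inv(C_e) @ y + np.linalg.inv(C_p) @ theta_p)
print(theta_post)   # close to theta_true, pulled slightly toward the prior mean
```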

  13. An intuitive example

  14. Bayesian model selection (BMS) Given competing hypotheses on the structure and functional mechanisms of a system, which model is the best? Which model represents the best balance between model fit and model complexity? For which model m does p(y|m) become maximal? Pitt & Myung (2002), TICS

  15. Bayesian model selection (BMS) Model evidence: p(y|m) = ∫ p(y|θ, m) p(θ|m) dθ; it accounts for both accuracy and complexity of the model and allows for inference about the structure (generalizability) of the model. Bayes' rule at the model level: p(m|y) ∝ p(y|m) p(m). Model comparison via the Bayes factor: BF_12 = p(y|m_1) / p(y|m_2). Model averaging: p(θ|y) = Σ_m p(θ|y, m) p(m|y). Kass and Raftery (1995), Penny et al. (2004) NeuroImage
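
A minimal sketch of model comparison from (approximate) log evidences, e.g. negative free energies; the two log-evidence values are assumptions for illustration.

```python
import numpy as np

# Assumed (approximate) log evidences for two competing models.
log_evidence = {"m1": -1230.4, "m2": -1235.1}

# Bayes factor in favour of m1 over m2.
log_bf_12 = log_evidence["m1"] - log_evidence["m2"]
bf_12 = np.exp(log_bf_12)

# Posterior model probabilities under a flat prior over models.
logs = np.array(list(log_evidence.values()))
post = np.exp(logs - logs.max())
post /= post.sum()
print(f"BF_12 = {bf_12:.1f}", dict(zip(log_evidence, post.round(3))))
```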

  16. Bayesian model selection (BMS) Various approximations to the log model evidence: • Akaike Information Criterion (AIC) – Akaike, 1974 • Bayesian Information Criterion (BIC) – Schwarz, 1978 • Negative free energy (F), a by-product of variational Bayes
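
For reference, a minimal sketch computing AIC and BIC in their standard form (2k − 2 ln L and k ln n − 2 ln L); the maximized log-likelihood, parameter count, and sample size are assumptions for illustration.

```python
import numpy as np

log_likelihood = -610.2   # maximized log-likelihood ln p(y | theta_ML, m) (assumed)
k, n = 5, 200             # number of parameters, number of data points (assumed)

aic = 2 * k - 2 * log_likelihood
bic = k * np.log(n) - 2 * log_likelihood
print(f"AIC = {aic:.1f}, BIC = {bic:.1f}")  # lower values indicate the preferred model
```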

  17. Approximate Bayesian inference Bayesian inference formalizes model inversion, the process of passing from a prior to a posterior in light of data: p(θ|y) = p(y|θ) p(θ) / p(y), with likelihood, prior, posterior, and marginal likelihood (model evidence) p(y) = ∫ p(y|θ) p(θ) dθ. In practice, evaluating the posterior is usually difficult because we cannot easily evaluate p(y), especially when: • the parameter space is high-dimensional or the densities have a complex form • analytical solutions are not available • numerical integration is too expensive Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf

  18. Approximate Bayesian inference There are two approaches to approximate inference. They have complementary strengths and weaknesses. Deterministic approximate inference, in particular variational Bayes: • find an analytical proxy q(θ) that is maximally similar to the true posterior p(θ|y) • inspect distribution statistics of q(θ) (e.g., mean, quantiles, intervals, …) • often insightful and fast • often hard work to derive • converges to local minima Stochastic approximate inference, in particular sampling: • design an algorithm that draws samples from p(θ|y) • inspect sample statistics (e.g., histogram, sample quantiles, …) • asymptotically exact • computationally expensive • tricky engineering concerns Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf

  19. The Laplace approximation The Laplace approximation provides a way of approximating a density whose normalization constant we cannot evaluate, by fitting a Normal distribution to its mode: p(θ) = f(θ) / Z, with the normalization constant Z unknown and the main part of the density f(θ) easy to evaluate. This is exactly the situation we face in Bayesian inference: p(θ|y) = p(y, θ) / p(y), with the model evidence p(y) unknown and the joint density p(y, θ) easy to evaluate. Pierre-Simon Laplace (1749 – 1827), French mathematician and astronomer. Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf

  20. Applying the Laplace approximation Given a model with parameters θ, the Laplace approximation reduces to a simple three-step procedure: (1) Find the mode of the log-joint: θ* = argmax_θ ln p(y, θ). (2) Evaluate the curvature of the log-joint at the mode: Λ = −∇∇ ln p(y, θ) evaluated at θ*. (3) We obtain a Gaussian approximation: q(θ) = N(θ; θ*, Λ^-1). Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf
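
A minimal sketch of this three-step procedure for a one-dimensional parameter; the unnormalized log-joint (shaped like a Gamma(3, 1) density) is an assumption chosen for illustration, and the curvature is evaluated numerically.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def log_joint(theta):
    # Assumed unnormalized log density, defined for theta > 0.
    return 2.0 * np.log(theta) - theta

# Step 1: find the mode of the log-joint.
res = minimize_scalar(lambda t: -log_joint(t), bounds=(1e-6, 50.0), method="bounded")
mode = res.x

# Step 2: curvature (negative second derivative) at the mode, by finite differences.
h = 1e-4
curv = -(log_joint(mode + h) - 2 * log_joint(mode) + log_joint(mode - h)) / h**2

# Step 3: Gaussian approximation N(mode, 1/curvature).
print(f"q(theta) = N({mode:.3f}, {1.0 / curv:.3f})")   # mode ~ 2, variance ~ 2
```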

  21. Limitations of the Laplace approximation The Laplace approximation is often too strong a simplification: • it becomes brittle when the posterior is multimodal • it ignores global properties of the posterior • it is only directly applicable to real-valued parameters Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf

  22. Variational Bayesian inference Variational Bayesian (VB) inference generalizes the idea behind the Laplace approximation. In VB, we wish to find an approximate density q(θ) from a hypothesis class Q that is maximally similar to the true posterior p(θ|y), i.e. the best proxy q* = argmin over q in Q of the divergence KL[q || p(θ|y)]. Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf

  23. Variational calculus Variational Bayesian inference is based on variational calculus. • Standard calculus (Newton, Leibniz, and others): functions and their derivatives; example: maximize the likelihood p(y|θ) w.r.t. the parameter θ. • Variational calculus (Euler, Lagrange, and others): functionals and functional derivatives; example: maximize the entropy H[p] w.r.t. a probability distribution p(x). Leonhard Euler (1707 – 1783), Swiss mathematician, 'Elementa Calculi Variationum'. Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf

  24. Variational calculus and the free energy Variational calculus lends itself nicely to approximate Bayesian inference: the log model evidence decomposes into the divergence between q(θ) and p(θ|y) plus the free energy, ln p(y) = KL[q || p(θ|y)] + F(q, y), where F(q, y) = ∫ q(θ) ln [p(y, θ) / q(θ)] dθ. Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf

  25. Variational calculus and the free energy In summary, the log model evidence can be expressed as ln p(y) = KL[q || p(θ|y)] + F(q, y): a divergence term (unknown) plus the free energy (easy to evaluate for a given q). Maximizing F(q, y) is therefore equivalent to: • minimizing KL[q || p(θ|y)] • tightening F(q, y) as a lower bound to the log model evidence. Starting from an initialization, q is updated iteratively until convergence. Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf
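
The decomposition used on this and the previous slide follows from Bayes' rule in a few lines; a sketch of the derivation:

```latex
\begin{align}
\ln p(y) &= \int q(\theta)\,\ln p(y)\, d\theta
          = \int q(\theta)\,\ln \frac{p(y,\theta)}{p(\theta \mid y)}\, d\theta \\
         &= \underbrace{\int q(\theta)\,\ln \frac{p(y,\theta)}{q(\theta)}\, d\theta}_{F(q,\,y)\ \text{(free energy)}}
          \;+\; \underbrace{\int q(\theta)\,\ln \frac{q(\theta)}{p(\theta \mid y)}\, d\theta}_{KL[\,q \,\|\, p(\theta \mid y)\,]\ \ge\ 0}
\end{align}
```

Since the KL term is non-negative, F(q, y) is a lower bound on ln p(y), which is why maximizing F both tightens the bound and drives q toward the posterior.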

  26. Computing the free energy We can decompose the free energy as follows: F(q, y) = ⟨ln p(y, θ)⟩_q + H[q], i.e. the expected log-joint plus the Shannon entropy of q. Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf

  27. The mean-field assumption When inverting models with several parameters, a common way of restricting the class of approximate posteriors is to consider those posteriors that factorize into independent partitions, q(θ) = ∏_i q_i(θ_i), where q_i(θ_i) is the approximate posterior for the i-th subset of parameters. Jean Daunizeau, www.fil.ion.ucl.ac.uk/~jdaunize/presentations/Bayes2.pdf Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf

  28. Variational algorithm under the mean-field assumption In summary: suppose the densities q_{j≠i}(θ_j) are kept fixed. Then the approximate posterior q_i(θ_i) that maximizes F is given by ln q_i*(θ_i) = ⟨ln p(y, θ)⟩_{q_{j≠i}} + const., and therefore q_i*(θ_i) ∝ exp ⟨ln p(y, θ)⟩_{q_{j≠i}}. This implies a straightforward algorithm for variational inference: (1) Initialize all approximate posteriors q_i(θ_i), e.g., by setting them to their priors. (2) Cycle over the parameters, revising each q_i given the current estimates of the others. (3) Loop until convergence. A worked sketch of this scheme follows the density-estimation example below. Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf

  29. Typical strategies in variational inference Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf

  30. Example: variational density estimation We are given a univariate dataset y = {y_1, …, y_n} which we model by a simple univariate Gaussian distribution. We wish to infer on its mean μ and precision λ. Although in this case a closed-form solution exists (a conjugate Normal-Gamma posterior), we shall pretend it does not. Instead, we consider approximations that satisfy the mean-field assumption: q(μ, λ) = q(μ) q(λ). Section 10.1.3, Bishop (2006) PRML. Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf

  31. Recurring expressions in Bayesian inference Univariate normal distribution: N(x; μ, σ²) = (2πσ²)^(-1/2) exp[−(x − μ)² / (2σ²)]. Multivariate normal distribution: N(x; μ, Σ) = (2π)^(-d/2) |Σ|^(-1/2) exp[−½ (x − μ)^T Σ^-1 (x − μ)]. Gamma distribution: Gam(λ; a, b) = (b^a / Γ(a)) λ^(a−1) exp(−bλ). Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf

  32. Variational density estimation: mean With the priors p(μ|λ) = N(μ; μ_0, (β_0 λ)^-1) and p(λ) = Gam(λ; a_0, b_0), the optimal factor follows from ln q*(μ) = ⟨ln p(y, μ, λ)⟩_{q(λ)} + const.; completing the square and reinstating the normalization constant by inspection gives q*(μ) = N(μ; μ_n, λ_n^-1) with μ_n = (β_0 μ_0 + n ȳ) / (β_0 + n) and λ_n = (β_0 + n) ⟨λ⟩_{q(λ)}. Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf

  33. Variational density estimation: precision Similarly, ln q*(λ) = ⟨ln p(y, μ, λ)⟩_{q(μ)} + const. gives a Gamma density q*(λ) = Gam(λ; a_n, b_n) with a_n = a_0 + (n + 1)/2 and b_n = b_0 + ½ ⟨Σ_i (y_i − μ)² + β_0 (μ − μ_0)²⟩_{q(μ)}. Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf

  34. Variational density estimation: illustration Bishop (2006) PRML, p. 472 Source: Kay H. Brodersen, 2013, http://people.inf.ethz.ch/bkay/talks/Brodersen_2013_03_22.pdf
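
Putting slides 28 and 30-33 together, a minimal Python sketch of the coordinate-ascent updates for this example; the prior hyperparameters and the synthetic data set are assumptions chosen for illustration, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(loc=1.0, scale=0.5, size=50)   # synthetic data (assumed)
n, ybar = len(y), y.mean()

mu0, beta0, a0, b0 = 0.0, 1.0, 1e-3, 1e-3     # prior hyperparameters (assumed)
E_lambda = 1.0                                 # initialize <lambda> under q(lambda)

for _ in range(100):                           # cycle the updates until convergence
    # q(mu) = N(mu_n, 1/lambda_n), given the current <lambda>
    mu_n = (beta0 * mu0 + n * ybar) / (beta0 + n)
    lambda_n = (beta0 + n) * E_lambda

    # q(lambda) = Gamma(a_n, b_n), given the current q(mu)
    E_mu, E_mu2 = mu_n, mu_n**2 + 1.0 / lambda_n
    a_n = a0 + (n + 1) / 2
    b_n = b0 + 0.5 * (np.sum(y**2) - 2 * E_mu * np.sum(y) + n * E_mu2
                      + beta0 * (E_mu2 - 2 * E_mu * mu0 + mu0**2))
    E_lambda = a_n / b_n

print(f"q(mu): mean {mu_n:.3f}, precision {lambda_n:.1f}; E[lambda] = {E_lambda:.2f}")
```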

  35. Markov Chain Monte Carlo (MCMC) sampling Construct a Markov chain θ^0 → θ^1 → … → θ^N whose stationary distribution is the posterior distribution, using a proposal distribution q(θ). • A general framework for sampling from a large class of distributions • Scales well with the dimensionality of the sample space • Asymptotically convergent

  36. Reversible Markov chain properties • Transition probabilities are homogeneous (time-invariant) • Invariance: the target distribution is a stationary distribution of the chain, which is guaranteed by detailed balance, p*(θ) T(θ → θ') = p*(θ') T(θ' → θ) • Ergodicity: the chain converges to the invariant distribution irrespective of the initial state

  37. Metropolis-Hastings Algorithm • Initialize θ^0 at step 1, for example by sampling from the prior • At step t, sample a candidate θ* from the proposal distribution q(θ* | θ^t) • Accept θ* with probability A(θ*, θ^t) = min{1, [p(θ*) q(θ^t|θ*)] / [p(θ^t) q(θ*|θ^t)]}; otherwise keep θ^t • Metropolis: for a symmetric proposal distribution, the proposal terms cancel Bishop (2006) PRML, p. 539
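
A minimal sketch of a random-walk Metropolis sampler (symmetric proposal, so the proposal ratio cancels); the target density (a standard normal) and step size are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

def log_target(theta):
    return -0.5 * theta**2          # unnormalized log density of N(0, 1) (assumed target)

n_samples, step = 5000, 1.0
theta = 0.0                          # initialization (e.g., a draw from the prior)
samples = np.empty(n_samples)

for t in range(n_samples):
    proposal = theta + step * rng.normal()          # symmetric (Metropolis) proposal
    log_alpha = log_target(proposal) - log_target(theta)
    if np.log(rng.uniform()) < log_alpha:           # accept with probability min(1, ratio)
        theta = proposal
    samples[t] = theta                               # on rejection, the old state is repeated

print(f"sample mean {samples.mean():.2f}, sample std {samples.std():.2f}")
```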

  38. Gibbs Sampling Algorithm • A special case of Metropolis-Hastings • At step t, sample each parameter from its full conditional distribution, e.g. θ_i^(t+1) ~ p(θ_i | θ_1^(t+1), …, θ_(i-1)^(t+1), θ_(i+1)^t, …, θ_K^t, y) • Acceptance probability = 1 • Blocked sampling: update groups of parameters jointly from their joint conditional distribution
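
A minimal sketch of Gibbs sampling for a standard bivariate Gaussian with correlation ρ, where both full conditionals are known in closed form; the target distribution and its parameters are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
rho, n_samples = 0.8, 5000        # assumed correlation of the target
x = np.zeros(2)
samples = np.empty((n_samples, 2))

for t in range(n_samples):
    # Full conditionals of a standard bivariate Gaussian: x_i | x_j ~ N(rho * x_j, 1 - rho^2)
    x[0] = rng.normal(rho * x[1], np.sqrt(1 - rho**2))
    x[1] = rng.normal(rho * x[0], np.sqrt(1 - rho**2))
    samples[t] = x

print(f"sample correlation ~ {np.corrcoef(samples.T)[0, 1]:.2f}")  # close to 0.8
```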

  39. Posterior analysis from MCMC Obtain (approximately) independent samples from the posterior: • Generate samples based on MCMC sampling. • Discard initial "burn-in" samples to remove dependence on the initialization. • Thinning: select every m-th sample to reduce autocorrelation. • Inspect sample statistics (e.g., histogram, sample quantiles, …)
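
A minimal sketch of this post-processing for a one-dimensional chain; the burn-in length and thinning factor are illustrative choices, and a stand-in array replaces real sampler output (such as `samples` from the Metropolis-Hastings sketch above) so the snippet runs on its own.

```python
import numpy as np

# Stand-in for MCMC output; in practice this comes from a sampler like the one above.
samples = np.random.default_rng(3).normal(size=5000)

burn_in, thin = 500, 10          # illustrative choices; tune per problem
kept = samples[burn_in::thin]    # discard burn-in, then keep every 10th sample

print(f"{kept.size} retained samples")
print(f"posterior mean estimate: {kept.mean():.3f}")
print(f"95% interval: {np.quantile(kept, [0.025, 0.975]).round(3)}")
```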

  40. Summary

  41. References • Pattern Recognition and Machine Learning (2006). Christopher M. Bishop. • Bayesian Reasoning and Machine Learning. David Barber. • Information Theory, Inference, and Learning Algorithms. David J.C. MacKay. • Conjugate Bayesian analysis of the Gaussian distribution. Kevin P. Murphy. • Videolectures.net – Bayesian inference, MCMC, variational Bayes. • Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control 19, 716-723. • Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics 6, 461-464. • Kass, R.E. and Raftery, A.E. (1995). Bayes factors. Journal of the American Statistical Association 90, 773-795. • Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference, Second Edition (Chapman & Hall/CRC Texts in Statistical Science). Dani Gamerman, Hedibert F. Lopes. • Statistical Parametric Mapping: The Analysis of Functional Brain Images.

  42. Thank You http://www.translationalneuromodeling.org/tapas/
