
Bayesian Reasoning: Tempering & Sampling


Presentation Transcript


  1. Bayesian Reasoning: Tempering & Sampling A/Prof Geraint F. Lewis, Rm 560: gfl@physics.usyd.edu.au

  2. Why does it work? With the Metropolis-Hastings algorithm, the desired posterior distribution (the stationary distribution of the Markov Chain) is recovered for a wide range of proposal distributions. For this, the chain must have three properties: Irreducibility: Given any starting point, the chain must be able to (eventually) jump to all states in the posterior distribution. Aperiodicity: The chain must not oscillate between different states with a regular periodic motion (i.e. it must not get stuck in an oscillation forever).

  3. Why does it work? Positive Recurrence: This basically means that the stationary (posterior) distribution π(X) exists, such that if an initial value X0 samples π(X), then all subsequent iterations will also sample π(X). These properties can be shown to hold for Metropolis-Hastings; for example, the transition probabilities satisfy π(X) p(X→Y) = π(Y) p(Y→X). This is the Detailed Balance equation.
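
To make the mechanics concrete, here is a minimal Metropolis-Hastings sketch in Python. The bimodal target density, the Gaussian proposal width and the step count are illustrative assumptions, not taken from the lecture.

```python
import numpy as np

def log_target(x):
    # Illustrative bimodal target: a mixture of two unit-width Gaussians.
    return np.logaddexp(-0.5 * (x - 2.0) ** 2, -0.5 * (x + 2.0) ** 2)

def metropolis_hastings(log_p, x0, n_steps=10000, step=1.0, seed=0):
    rng = np.random.default_rng(seed)
    chain = np.empty(n_steps)
    x, logp_x = x0, log_p(x0)
    for t in range(n_steps):
        y = x + step * rng.standard_normal()          # symmetric Gaussian proposal
        logp_y = log_p(y)
        if np.log(rng.uniform()) < logp_y - logp_x:   # accept with prob min(1, p(y)/p(x))
            x, logp_x = y, logp_y
        chain[t] = x                                  # rejected moves repeat the old state
    return chain

samples = metropolis_hastings(log_target, x0=0.0)
```

Because the proposal is symmetric, the Hastings ratio reduces to the simple ratio of target densities used above.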

  4. Where can we go wrong? Our posterior distribution may be multi-modal, with several significant peaks. Given enough time, our MCMC walk through the probability space will eventually cover the entire volume. However, the walk may stay on one peak for a significant period before moving to the next. If we have only a certain amount of time (i.e. a three year PhD), how can we ensure that we have appropriately sampled the space and that the MCMC chain truly reflects the underlying posterior distribution? If it does not, properties you draw from the sample will be biased.

  5. Simulated Tempering The problem is similar to ensuring you find a global minimum in optimization problems; one approach, simulated annealing, allows a solution to “cool” into the global minimum. We can take a similar approach with our MCMC, heating up the posterior distribution (to make it flatter) and then cooling it down. When hotter, the MCMC can hop out of local regions of significant probability and explore more of the volume, then cool down again into regions of interest. We start with Bayes’ theorem, p(θ|D,I) ∝ p(θ|I) p(D|θ,I), such that

  6. Simulated Tempering We can construct a flatter distribution through p(θ|D,β,I) ∝ p(θ|I) p(D|θ,I)^β. Typically, a discrete set of tempering parameters, β, are used, with β=1 (the “cold sampler”) being the target distribution. We can “random walk” through the temperature, and consider only those steps taken when β=1 to represent our target distribution. However, parallel tempering provides a similar, but more efficient, approach to exploring the posterior distribution.
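
As a sketch of the flattening step (assuming log_prior and log_likelihood functions for the problem at hand; the names are placeholders), raising the likelihood to the power β simply scales its contribution to the log-posterior:

```python
def log_tempered_posterior(theta, beta, log_prior, log_likelihood):
    # beta = 1 recovers the target ("cold") posterior; beta < 1 flattens the
    # likelihood so the walker can escape isolated regions of high probability.
    return log_prior(theta) + beta * log_likelihood(theta)
```

A simulated-tempering run would then treat β itself as an extra random-walk variable over a discrete ladder, keeping only the β=1 samples.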

  7. Parallel Tempering Parallel tempering uses a series of MCMC explorations of the posterior distribution, each at a different tempering parameter, βi; those at high temperature will hop all over the space, while those at colder temperature will take a more sedate walk. Typically, the temperatures are distributed over a ladder βi = {β1=1, β2, …, βn}. The goal of parallel tempering is to take the parallel chains and consider swapping their states. Suppose we choose a swap to take place, on average, once every ns steps in the chain; the proposal to make a swap can be undertaken by drawing a uniform random number U1 and considering a swap if U1 ≤ 1/ns. If we choose to swap, two adjacent chains are chosen, one at βi and in state Xt,i, and the other at βi+1 and in state Xt,i+1.

  8. Parallel Tempering We can then choose to swap with a probability r = [p(D|Xt,i+1,I)^βi p(D|Xt,i,I)^βi+1] / [p(D|Xt,i,I)^βi p(D|Xt,i+1,I)^βi+1], by again selecting a uniform random number between 0 & 1 and choosing to swap if U(0,1) ≤ r. The swaps move information between the parallel chains at different temperatures. As ever, the choice of βi depends on experimentation and experience.
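
A condensed parallel-tempering sketch under the same assumptions (the β ladder, step size and swap interval are illustrative; the within-chain update is the Metropolis step from the earlier sketch, and log_prior / log_like are placeholder names):

```python
import numpy as np

def parallel_tempering(log_prior, log_like, betas, n_steps=5000,
                       step=1.0, n_s=50, seed=0):
    """betas: tempering ladder, sorted ascending so betas[-1] = 1 (the cold chain)."""
    rng = np.random.default_rng(seed)
    n_chains = len(betas)
    x = np.zeros(n_chains)                 # current state of each tempered chain
    cold = np.empty(n_steps)               # record of the beta = 1 chain
    for t in range(n_steps):
        # One Metropolis update per chain, targeting prior * likelihood**beta.
        for i, beta in enumerate(betas):
            y = x[i] + step * rng.standard_normal()
            dlogp = (log_prior(y) + beta * log_like(y)
                     - log_prior(x[i]) - beta * log_like(x[i]))
            if np.log(rng.uniform()) < dlogp:
                x[i] = y
        # Propose a swap, on average, once every n_s steps.
        if rng.uniform() <= 1.0 / n_s:
            i = rng.integers(n_chains - 1)             # adjacent pair (i, i+1)
            log_r = (betas[i] - betas[i + 1]) * (log_like(x[i + 1]) - log_like(x[i]))
            if np.log(rng.uniform()) < log_r:          # accept swap with prob min(1, r)
                x[i], x[i + 1] = x[i + 1], x[i]
        cold[t] = x[-1]
    return cold
```

The prior terms cancel in the swap ratio because only the likelihood is tempered; a ladder such as np.linspace(0.01, 1.0, 5), matching the example later in the lecture, would be a typical input.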

  9. An example Earlier, we examined the comparison between two models for some spectral data. Here, we look at the results of a Metropolis-Hastings and a parallel tempering analysis of this problem. To match the earlier analysis: A Jeffreys prior was used for T between 0.1 mK and 100 mK. A uniform prior was used for ν between channel 1 and 44. The proposal for both parameters was Gaussian with σ=1.
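
A hedged sketch of those priors in Python (the parameter names T and ν and the bounds are from the slide; the spectral line model itself is not reproduced here, so the likelihood is assumed to be defined elsewhere, and the Jeffreys normalization is a standard choice rather than something stated on the slide):

```python
import numpy as np

T_MIN, T_MAX = 0.1, 100.0     # line temperature bounds in mK (Jeffreys prior)
NU_MIN, NU_MAX = 1.0, 44.0    # channel-number bounds (uniform prior)

def log_prior(T, nu):
    if not (T_MIN <= T <= T_MAX and NU_MIN <= nu <= NU_MAX):
        return -np.inf
    # Jeffreys (scale-invariant) prior on T and a uniform prior on nu, both normalized.
    return -np.log(T) - np.log(np.log(T_MAX / T_MIN)) - np.log(NU_MAX - NU_MIN)
```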

  10. An example After a distinct burn-in, the chain wanders through the parameter space, but it clearly prefers T ~ 1 and ν ~ 38, although significant departures are apparent.

  11. An example However, it is interesting to examine the marginalized distributions compared to the numerical integration results obtained earlier. While the M-H approach has nicely recovered the distribution in T, and has captured the strong peak in ν, the chain has clearly failed to characterize the structure in the posterior at low channel numbers, not spending enough time in regions with ν < 30.

  12. An example Here is the β=1 chain for the parallel tempering run (with five evenly-spaced β between 0.01 and 1, and swaps considered every 50 steps, on average).

  13. An example The difference is quite apparent in the marginalized distributions. Again, T and the strong peak in ν are well characterized, but the application of parallel tempering has also well sampled channel numbers with ν < 30, better recovering the underlying distribution.

  14. Model Comparison Remember, to compare models and to deduce which is more probable, we calculate the odds ratio O12 = p(M1|D,I) / p(M2|D,I) = [p(M1|I) / p(M2|I)] × [p(D|M1,I) / p(D|M2,I)], where the final term, B12 = p(D|M1,I) / p(D|M2,I), is the Bayes factor. Suppose we have the same two competing models for the spectral line data, one with no parameters (so the Bayes factor can be calculated analytically), and the other which we have analyzed with parallel tempering. How do we calculate the Bayes factor for the latter?

  15. Model Comparison What we want to calculate is the global likelihood, p(D|M1,I) = ∫ p(θ|M1,I) p(D|θ,M1,I) dθ. We can combine the information in the parallel tempering chains through the relation (read Chap 12.7) ln p(D|M1,I) = ∫₀¹ ⟨ln p(D|θ,M1,I)⟩β dβ, where ⟨ln p(D|θ,M1,I)⟩β is the average of the log-likelihood over the samples of the chain run at tempering parameter β.

  16. Model Comparison Here are the results for the analysis of the spectral line model. There are only five points in β, and so we need to interpolate between the points (this is a Matlab spline). Of course, we would prefer more samples in β. The result of the integral yields ln[p(D|M1,I)] = -87.3369, with a resultant Bayes factor of B12 = 1.04 (similar to the result obtained earlier from the analytic calculation).
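
A sketch of that integration step, assuming each tempered chain's samples are available (the function and variable names are placeholders; the SciPy spline stands in for the Matlab spline mentioned above, and integrating from the smallest sampled β neglects the β→0 tail):

```python
import numpy as np
from scipy.interpolate import CubicSpline

def ln_global_likelihood(betas, chains, log_like):
    # <ln p(D|theta,M,I)>_beta : mean log-likelihood over each tempered chain's samples.
    mean_ln_like = np.array([np.mean([log_like(x) for x in chain]) for chain in chains])
    # With only a few betas, interpolate with a spline before integrating over beta.
    spline = CubicSpline(betas, mean_ln_like)
    return spline.integrate(betas[0], 1.0)   # approximates the integral from 0 to 1
```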

  17. Nested Sampling There are other ways to analyze the posterior and the likelihood space (with more efficient and faster approaches). One of these, nested sampling, iteratively re-samples the space, slicing it into nested regions of likelihood; Brendon will discuss this in more detail in his final lecture. Figures from www.inference.phy.cam.ac.uk/bayesys/box/nested.ps
