Ch 11. Sampling Models Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Summarized by I.-H. Lee Biointelligence Laboratory, Seoul National University http://bi.snu.ac.kr/
Contents • 11.0 Introduction • 11.1 Basic Sampling Algorithms • 11.2 Markov Chain Monte Carlo • 11.3 Gibbs Sampling • 11.4 Slice Sampling • 11.5 The Hybrid Monte Carlo Algorithm • 11.6 Estimating the Partition Function (C) 2006, SNU Biointelligence Lab, http://bi.snu.ac.kr/
0. Introduction • The problem: evaluating the expectation of some function f(z) with respect to a probability distribution p(z). • The expectation can be approximated by drawing independent samples z(l), l = 1,…,L, from p(z) and averaging: E[f] ≈ (1/L) Σl f(z(l)).
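The averaging estimator above can be sketched as follows; the choice of f(z) = z² and p = N(0, 1) is an illustrative assumption, not from the slides.

```python
import random

def mc_expectation(f, sampler, n=100_000):
    """Approximate E[f(z)] under p by averaging f over independent draws from p."""
    return sum(f(sampler()) for _ in range(n)) / n

random.seed(0)
# For z ~ N(0, 1), E[z^2] is the variance, i.e. 1
est = mc_expectation(lambda z: z * z, lambda: random.gauss(0.0, 1.0))
```

The estimator is unbiased, and its standard error shrinks as 1/√L regardless of the dimensionality of z.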
1. Basic Sampling Algorithms • Transformation method • Use a uniform random generator and transform its output: if u ~ U(0,1) and h is the target CDF, then z = h⁻¹(u) is distributed according to the target. • Caveat: can we always obtain h⁻¹ in closed form?
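A minimal sketch of the transformation (inverse-CDF) method, using the exponential distribution as an example where h⁻¹ is available in closed form (the choice of distribution is illustrative):

```python
import random, math

def sample_exponential(lam):
    """Transformation method: z = h^{-1}(u) with u ~ U(0,1).
    For Exp(lam), h(z) = 1 - exp(-lam * z), so h^{-1}(u) = -ln(1 - u) / lam."""
    u = random.random()
    return -math.log(1.0 - u) / lam

random.seed(1)
draws = [sample_exponential(2.0) for _ in range(200_000)]
mean = sum(draws) / len(draws)  # should be close to 1/lam = 0.5
```

For many distributions of interest (e.g. a general Gaussian mixture) h⁻¹ has no closed form, which motivates the rejection and importance sampling methods that follow.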
Rejection sampling • Assumptions • Sampling directly from the target distribution p(z) is difficult. • Evaluating p(z), up to a normalizing constant, is easy for any value of z. • Sample from a proposal q(z) scaled so that k q(z) ≥ p(z) everywhere, and accept with probability p(z) / (k q(z)). How should q be chosen?
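A sketch of rejection sampling under illustrative assumptions: the unnormalized target is a standard Gaussian and the proposal q is Uniform(−5, 5), for which k = 10 gives a valid envelope (q(z) = 0.1 and the target peaks at 1):

```python
import random, math

def p_tilde(z):
    """Unnormalized target: a standard Gaussian without its constant."""
    return math.exp(-0.5 * z * z)

def rejection_sample(n):
    """Proposal q = Uniform(-5, 5); envelope k*q(z) >= p_tilde(z) with k = 10."""
    k, samples = 10.0, []
    while len(samples) < n:
        z = random.uniform(-5.0, 5.0)
        u = random.uniform(0.0, k * 0.1)   # uniform height under the envelope
        if u <= p_tilde(z):                # keep z if the point falls under p_tilde
            samples.append(z)
    return samples

random.seed(2)
zs = rejection_sample(50_000)
mean = sum(zs) / len(zs)
```

The acceptance rate equals the ratio of the area under p to the area under the envelope, which is why a tight q matters.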
Adaptive rejection sampling • Construct q on the fly based on measured values of p. • If a sample is rejected, it is added to the set of grid points and q is refined. • Limitation of rejection methods in general: the acceptance rate decreases exponentially with dimensionality.
Importance sampling • Approximates the expectation directly, without drawing samples from p. • Motivation • The expectation could be approximated by a finite sum over a grid of points, • but the number of grid points grows exponentially with dimensionality, • and not all regions of z space carry significant probability mass under p.
• Sample from an approximating distribution q and weight each sample by the importance ratio p(z)/q(z). • Accuracy depends strongly on the choice of q. • Can produce errors that are arbitrarily large, with no diagnostic indication, when q is small in regions where p·f is large.
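A self-normalized importance sampling sketch; the target N(0, 1) and the wider proposal N(0, 2²) are illustrative assumptions. Using the ratio of sums means only the unnormalized target is needed:

```python
import random, math

def importance_expectation(f, n=200_000):
    """Estimate E_p[f] for target p = N(0,1) using proposal q = N(0, 2^2).
    Each draw gets weight r = p(z)/q(z); the self-normalized estimator
    sum(w*f)/sum(w) tolerates unnormalized p and q."""
    num = den = 0.0
    for _ in range(n):
        z = random.gauss(0.0, 2.0)
        p = math.exp(-0.5 * z * z)                   # unnormalized N(0, 1)
        q = math.exp(-0.5 * (z / 2.0) ** 2) / 2.0    # unnormalized N(0, 4)
        w = p / q
        num += w * f(z)
        den += w
    return num / den

random.seed(3)
est = importance_expectation(lambda z: z * z)  # E[z^2] = 1 under N(0, 1)
```

Had the proposal been narrower than the target, a few rare samples would have carried enormous weights, the failure mode with "no diagnostic indication" noted above.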
Sampling-importance-resampling • Avoids the difficulty of choosing the envelope constant k in rejection sampling. 1. Sample from q. 2. Assign each sample a weight, as in importance sampling. 3. Resample from the weighted samples. • The final samples approximate p as the sample size increases. • Moments of p can be evaluated directly from the weighted samples at step 2. • Accuracy again depends on the choice of q.
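The three steps above can be sketched as follows, reusing the illustrative target N(0, 1) and proposal N(0, 2²) (these distributions are assumptions, not from the slides):

```python
import random, math

def sir(n_draw, n_keep):
    """Sampling-importance-resampling:
    1) draw from proposal q = N(0, 2^2),
    2) weight each draw by p(z)/q(z) for target p = N(0, 1),
    3) resample with replacement according to the normalized weights."""
    zs = [random.gauss(0.0, 2.0) for _ in range(n_draw)]
    ws = [math.exp(-0.5 * z * z) / (math.exp(-0.5 * (z / 2.0) ** 2) / 2.0)
          for z in zs]
    total = sum(ws)
    probs = [w / total for w in ws]
    return random.choices(zs, weights=probs, k=n_keep)

random.seed(4)
samples = sir(100_000, 20_000)
mean = sum(samples) / len(samples)
```

Unlike rejection sampling, no envelope constant k is needed; the price is that the resampled set only approximates p, with the approximation improving as n_draw grows.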
• Sampling and the EM algorithm • Sampling can be used to approximate the E-step of the EM algorithm: the Monte Carlo EM algorithm. • IP algorithm • I-step (Imputation step, analogous to the E-step): sample from the joint posterior over the hidden variables. • P-step (Posterior step, analogous to the M-step): compute a revised estimate of the parameter posterior using the samples from the I-step.
2. Markov Chain Monte Carlo • Allows sampling from a large class of distributions. • Scales well with the dimensionality of the sample space. • Basic Metropolis algorithm • Maintain a record of the current state z(t). • A candidate state z* is sampled from a proposal q(z|z(t)), which must be symmetric. • The candidate is accepted with probability min(1, p(z*)/p(z(t))), where p may be unnormalized. • If rejected, the current state is added to the record and becomes the next state. • The distribution of z(t) tends to p as t → ∞. • The sequence is autocorrelated; retain every Mth sample to obtain nearly independent samples. For large M, the retained samples are effectively independent.
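The basic Metropolis loop above can be sketched as follows; the standard Gaussian target, the step size, and the thinning interval M = 10 are illustrative assumptions:

```python
import random, math

def metropolis(p_tilde, z0, steps, step_size=1.0):
    """Basic Metropolis: symmetric Gaussian proposal around the current state,
    accept with probability min(1, p_tilde(z*) / p_tilde(z))."""
    z, chain = z0, []
    for _ in range(steps):
        z_new = random.gauss(z, step_size)           # symmetric proposal
        if random.random() < min(1.0, p_tilde(z_new) / p_tilde(z)):
            z = z_new                                # accept candidate
        chain.append(z)                              # on rejection, repeat current state
    return chain

random.seed(5)
chain = metropolis(lambda z: math.exp(-0.5 * z * z), 0.0, 100_000)
thinned = chain[::10]  # retain every Mth sample to reduce autocorrelation
mean = sum(thinned) / len(thinned)
```

Note the rejected-candidate case: unlike rejection sampling, a rejection still contributes a (repeated) sample to the chain.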
Random walk behavior • After t steps, the average distance covered by a random walk is proportional to √t. • This makes random walks very inefficient at exploring the state space. • Avoiding random walk behavior is central to the design of MCMC methods.
Markov chains • Homogeneous: the transition probabilities are the same at every step. • Invariant distribution: a distribution left unchanged by each step of the chain. • Detailed balance: p(z)T(z, z') = p(z')T(z', z), a sufficient condition for p to be invariant. • Ergodicity: the chain converges to the invariant distribution from any initial state. • Equilibrium distribution: the (unique) invariant distribution of an ergodic chain.
Metropolis-Hastings algorithm • Generalization of the Metropolis algorithm: q may be non-symmetric. • Acceptance probability: min(1, p(z*) q(z(t)|z*) / (p(z(t)) q(z*|z(t)))). • p is an invariant distribution of the Markov chain defined by the Metropolis-Hastings algorithm (detailed balance holds). • The common choice for q is a Gaussian centered on the current state. • Trade-off between step size and convergence time: small steps are accepted often but explore by slow random walk; large steps are frequently rejected.
3. Gibbs Sampling • Simple and widely applicable. • A special case of the Metropolis-Hastings algorithm. • Each step replaces the value of one of the variables by a value drawn from the distribution of that variable conditioned on the values of the remaining variables. • The procedure • Initialize {zi}. • For t = 1,…,T: sample each zi in turn from p(zi | z\i).
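The procedure can be sketched for a case where the conditionals are easy: a zero-mean bivariate Gaussian with correlation ρ, for which p(zi | zj) = N(ρ zj, 1 − ρ²). The target and ρ = 0.8 are illustrative assumptions:

```python
import random, math

def gibbs_bivariate_gaussian(rho, steps, burn_in=1000):
    """Gibbs sampling for a zero-mean bivariate Gaussian with correlation rho.
    Each full sweep resamples z1 | z2 and then z2 | z1 from N(rho * other, 1 - rho^2)."""
    z1 = z2 = 0.0
    sd = math.sqrt(1.0 - rho * rho)
    samples = []
    for t in range(steps):
        z1 = random.gauss(rho * z2, sd)   # sample z1 | z2
        z2 = random.gauss(rho * z1, sd)   # sample z2 | z1
        if t >= burn_in:
            samples.append((z1, z2))
    return samples

random.seed(6)
samples = gibbs_bivariate_gaussian(0.8, 60_000)
corr = sum(a * b for a, b in samples) / len(samples)  # E[z1 * z2] = rho
```

There is no accept/reject step: every conditional draw is kept, which is exactly the acceptance-probability-1 property discussed below.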
• p is invariant under each Gibbs sampling step, and hence under the whole Markov chain: • at each step, the marginal distribution p(z\i) is unchanged, • and each step samples correctly from the conditional p(zi | z\i). • The Markov chain is ergodic provided the conditional distributions are nowhere zero. • Therefore Gibbs sampling samples correctly from p. • Gibbs sampling as an instance of the Metropolis-Hastings algorithm: • a step involving zk, with z\k held fixed, • uses the proposal qk(z*|z) = p(zk*|z\k), for which the acceptance probability is always 1.
Random walk behavior • The number of steps needed to obtain independent samples is of order (L/l)², where L and l are the largest and smallest length scales of the distribution. • Over-relaxation can reduce this random walk behavior. • Practical applicability depends on the ease of sampling from the conditional distributions.
4. Slice Sampling • Provides an adaptive step size that is automatically adjusted to match the characteristics of the distribution. • Augment z with an auxiliary variable u and sample uniformly from the region under the unnormalized density.
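A one-dimensional slice sampler with the standard stepping-out and shrinkage scheme can be sketched as follows; the Gaussian target and initial bracket width w are illustrative assumptions:

```python
import random, math

def slice_sample(p_tilde, z0, steps, w=1.0):
    """1-D slice sampling: draw a height u ~ U(0, p_tilde(z)), step out an
    interval to cover the slice {z : p_tilde(z) > u}, then shrink the
    interval until a proposal lands inside the slice."""
    z, samples = z0, []
    for _ in range(steps):
        u = random.uniform(0.0, p_tilde(z))
        # stepping out: place a width-w bracket around z, then expand each
        # end until it leaves the slice
        left = z - w * random.random()
        right = left + w
        while p_tilde(left) > u:
            left -= w
        while p_tilde(right) > u:
            right += w
        # shrinkage: rejected proposals tighten the bracket, so the step
        # size adapts to the local scale of the distribution
        while True:
            z_new = random.uniform(left, right)
            if p_tilde(z_new) > u:
                z = z_new
                break
            if z_new < z:
                left = z_new
            else:
                right = z_new
        samples.append(z)
    return samples

random.seed(7)
chain = slice_sample(lambda z: math.exp(-0.5 * z * z), 0.0, 30_000)
mean = sum(chain) / len(chain)
```

The bracket expansion and shrinkage are what give slice sampling its self-tuning step size, in contrast to a fixed Metropolis proposal width.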
5. The Hybrid Monte Carlo Algorithm • Hamiltonian dynamics • Consider the joint distribution over phase space (z, r), with the total energy as the Hamiltonian H(z, r). • H is invariant under the dynamics; to explore different energy levels, replace r by a draw from its conditional distribution given z. • Hamiltonian dynamics + Metropolis algorithm • Update the momentum r by the stochastic resampling step above. • Update (z, r) by simulating the Hamiltonian dynamics with the leapfrog algorithm. • Accept the new state with probability min(1, exp{H(z, r) − H(z*, r*)}), which corrects for numerical integration error.
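The leapfrog update can be sketched as below, for a Hamiltonian of the form H = E(z) + r²/2 (the quadratic energy E(z) = z²/2, step size, and step count are illustrative assumptions). On exact dynamics H is conserved; leapfrog conserves it up to a small discretization error, which is what the Metropolis acceptance step corrects:

```python
def leapfrog(z, r, grad_E, eps, n_steps):
    """Leapfrog integration of Hamiltonian dynamics for H = E(z) + r^2/2:
    half-step the momentum, alternate full steps in z and r, then a
    final momentum half-step."""
    r = r - 0.5 * eps * grad_E(z)
    for _ in range(n_steps - 1):
        z = z + eps * r
        r = r - eps * grad_E(z)
    z = z + eps * r
    r = r - 0.5 * eps * grad_E(z)
    return z, r

# Illustrative energy E(z) = z^2 / 2, so grad_E(z) = z and H = (z^2 + r^2) / 2
H = lambda z, r: 0.5 * z * z + 0.5 * r * r
z0, r0 = 1.0, 0.5
z1, r1 = leapfrog(z0, r0, lambda z: z, 0.01, 100)
drift = abs(H(z1, r1) - H(z0, r0))  # small: H is nearly conserved
```

Because the leapfrog scheme is time-reversible and volume-preserving, the Metropolis correction makes the combined update leave the joint distribution over (z, r) invariant.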
6. Estimating the Partition Function • The normalization constant (partition function) of a density is needed for tasks such as model comparison and model averaging. • The ratio of the partition functions of two distributions can be estimated by importance sampling from a proposal with energy function G. • The absolute value of the partition function for a complex distribution can then be obtained by chaining a sequence of such ratios, starting from a simple distribution whose partition function is known.
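The ratio estimate can be sketched as follows: for p(z) ∝ exp(−E(z)) and q(z) ∝ exp(−G(z)), we have Z_p / Z_q = E_q[exp(−E(z) + G(z))]. The particular energies E(z) = z²/2 and G(z) = z²/8 (so q is samplable as N(0, 2²) and the true ratio is √(2π)/√(8π) = 0.5) are illustrative assumptions:

```python
import random, math

def partition_ratio(E, G, sample_q, n=200_000):
    """Estimate Z_p / Z_q = E_q[exp(-E(z) + G(z))] by importance sampling,
    where p(z) ∝ exp(-E(z)) and q(z) ∝ exp(-G(z)) is easy to sample."""
    total = 0.0
    for _ in range(n):
        z = sample_q()
        total += math.exp(-E(z) + G(z))
    return total / n

random.seed(8)
ratio = partition_ratio(lambda z: 0.5 * z * z,        # target energy E
                        lambda z: z * z / 8.0,        # proposal energy G
                        lambda: random.gauss(0.0, 2.0))  # draws from q = N(0, 4)
```

In chaining, a sequence of intermediate distributions is introduced so that each adjacent pair overlaps well, keeping the variance of every ratio estimate small.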