110 likes | 221 Views
Remember that our objective is for some density f( y | ) for observations where y and are vectors of data and parameters, being sampled from a prior () then the posterior is.
E N D
Remember that our objective is for some density f(y|) for observations where y and are vectors of data and parameters, being sampled from a prior () then the posterior is If we can get a formula for this (e.g. have this as an explicit function of y and ) then we are done. But usually we can’t compute this particularly when y and are multidimensional. So instead what we will do is estimate the distribution of the posterior by taking lots of samples from it. From this we can then do Bayesian estimation.
Idea of MCMC: • We are going to set up a Markov Chain for which we can compute samples and which has the posterior density as it’s stationary distribution. There are several different ways to do this, but in all cases there are two main issues: • Once we can be sure the MC indeed does have the posterior as its stationary distribution, we need a stopping criteria to ensure we have indeed reached stationarity. (convergence diagnosis) • The samples from the MC are correlated, rather than being iid draws from the posterior, so we need some way to relate how the variance in the estimators (e.g. of moments of the posterior) compares to that of the actual posterior (variance estimation)
How do we set up an appropriate MC? Gibbs Sampling: This is useful in the situation which we can sample from conditional distributions but we don’t have the full or marginal distribution for the posterior. The assumption is (usually justified) that the full joint posterior distribution is determined uniquely if we know all conditional distributions, and from this we can find all marginal distributions as well. So for a posterior p(|y) where = (1, …,k) the conditional distributions are p(i| j≠i,y) i = 1, …,k and if we know all these conditional distributions, e.g. we can sample from them, then we can use this to obtain samples from the full joint prior p(|y) and sample from the marginal posteriors p(i|y)
The Gibbs sampler: to find this we need a way to get a MC, meaning finding an update of the current state of the MC to the next state – this will not be deterministic since we are sampling from distributions, so what we are doing is specifying the transition function for the MC using the conditional distributions of the posterior. • Procedure: • Start with initial values of (2(0), …,k(0)) chosen from the prior density • Draw 1(1) from p(1| 2(0),…, k(0),y) • Draw 2(1) from p(2| 1(1), 3(0),…, k(0),y) • And proceed in similar way to have a sample • (1(1),2(1), …,k(1))
So the above procedure provides a way to get a new update of the parameters from an initial set – the “state” of the MC is the vector (1(1),2(1), …,k(1)) and we continue this same procedure to update the state for t = 2, …, T Then we have a stopping criteria to say that when we have t = t0, …, T then we are near stationary distribution so these vectors ((t) , t = t0, …, T ) are a sample (they are correlated though since they are taken as a particular sample trajectory of the MC) of the stationary distribution of the MC and thus are a sample from the posterior. So for example, we could use (1 (t) , t = t0, …, T ) to get a histogram of values that are a sample of the marginal posterior density for 1 , p(1 | y ) and from this get estimate of the posterior mean
There’s no reason to limit this to a single chain though – it would be typical to start several trajectories for the MC from different initial values. This is a typical approach to try to determine and appropriate burn-in period (e.g. the value t0 ). You look at some estimate from the MC trajectory (such as the above mean) and then compare the variation within a single trajectory of this parameter varies as compared to the variation across the trajectories. Due to ergodicity of the MC, if we are near stationary distribution, variation across trajectories should be identical to variation within trajectories (see the book for scale reduction factor)
How many calculations does the above need? We need k samples from various conditional distributions for each transition step of the MC. We do T of these, so we need a total of kT samples from the conditional distributions and if we do a total of m trajectories, and sample for each k values from the prior then we have mk(T+1) total samples to take, each of which might require several pseudo-random number draws.
Bivariate example – case with k=2 parameters a,b • Procedure: • Start with initial values of (a,b) chosen from the prior density • Draw a’ from p(a|b,y) • Draw b’ from p(b|a’,y) • Then the new state of the MC is (a’,b’) and note that as we continue this, if we are at stationary distribution, the probability of drawing a particular b’ given a’ will be the same across iterations.
The Gibbs sampler has a variety of variations (see the handout) but all assume that we have a reasonably feasible way to sample from all conditional distributions. When this is not feasible, there are other approaches an the Metropolis-Hastings sampler is one: The idea here is to add a rejection criteria in which we reject some samples under certain criteria. This is a standard approach in using pseudo-random number generators to develop a sample from a probability distribution where we cannot easily sample from the distribution. This goes back to von-Neumann. A simple example is if you want to sample points uniformly in the unit circle, then you independently choose two pseudo-random numbers x, y uniform on [-1,1], and accept them if x2+y2 ≤1 and reject them otherwise.
The Metropolis-Hasting algorithm: we wish to generate a MC for the posterior p() and suppose there is the transition kernel q(,φ) which we know how to sample from. • Procedure: • Start with initial value of (0) chosen from the prior density • Generate proposed value φ from q((0),φ) • Evaluate acceptance probability α((0),φ) • Put (1)=φ with probability α((0),φ) and otherwise put (1)=(0) • Repeat
The above procedure provides a MC and the handout shows that it has p() as a stationary distribution. The trick is to choose the transition kernel –there are several simplified cases: • If we choose q(,φ)=q(φ, ) then this is Metropolis sampler and acceptance depends only on ratio – accept if step is to higher density, accept step to lower density depending upon ratio of densities. • Random walks – φ = (j) + wj where wj are iid with some density. • Independent chains - q(,φ)=f(φ) for some density f( ) so is independent of state of chain and if choose prior for proposal density then get ratio of likelihoods in acceptance criteria