Sampling Bayesian Networks ICS 295 2008
Sampling Fundamentals Given a set of variables X = {X1, X2, …, Xn}, a joint probability distribution P(X), and some function g(X), we can compute the expected value of g(X): E[g(X)] = Σx g(x) P(x)
Sampling From P(X) A sample S^t is an instantiation of all the variables: S^t = {x1^t, x2^t, …, xn^t}. Given independent, identically distributed (iid) samples S^1, S^2, …, S^T from P(X), it follows from the Strong Law of Large Numbers that: ĝ = (1/T) Σt g(S^t) → E[g(X)] as T → ∞. A minimal sketch of this estimator follows.
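To make the estimator concrete, here is a minimal Python sketch. The toy one-variable P(X) and the helper names are illustrative assumptions, not from the slides:

```python
import random

def monte_carlo_expectation(sample_p, g, T):
    """Estimate E[g(X)] by (1/T) * sum_t g(S_t), with S_t ~ P(X) iid."""
    return sum(g(sample_p()) for _ in range(T)) / T

estimate = monte_carlo_expectation(
    lambda: 1 if random.random() < 0.3 else 0,  # toy P(X): P(X=1) = 0.3
    lambda x: x,                                # g(x) = x, so E[g(X)] = 0.3
    T=100_000)
print(estimate)  # close to 0.3, by the Strong Law of Large Numbers
```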
Sampling Challenge • It is hard to generate samples from P(X) • Trade-offs: • Generate independent samples from an easier distribution Q(X) • Forward Sampling, Likelihood Weighting, Importance Sampling (IS) • Try to find Q(X) close to P(X) • Generate dependent samples forming a Markov chain whose distribution P'(X) converges to P(X) • Metropolis, Metropolis-Hastings, Gibbs • Try to reduce dependence between samples
Markov Chain • A sequence of random values x0, x1, …, defined on a finite state space D(X), is called a Markov chain if it satisfies the Markov property: P(xt+1 = y | xt, xt-1, …, x0) = P(xt+1 = y | xt) • If P(xt+1 = y | xt) does not change with t (time homogeneous), then it is often expressed as a transition function, A(x,y) Liu, Ch. 12, p. 245
Markov Chain Monte Carlo • First, define a transition probability P(xt+1 = y | xt) • Pick initial state x0; the choice is usually not important because x0 becomes "forgotten" • Generate samples x1, x2, …, sampling each next value from P(X | xt) • If we choose a proper P(xt+1 | xt), we can guarantee that the distribution represented by the samples x0, x1, … converges to P(X), as in the sketch below
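A minimal sketch of the simulation loop, assuming a hypothetical 3-state transition matrix A (not from the slides):

```python
import random

A = [[0.5, 0.25, 0.25],   # A[x][y] = P(x_{t+1} = y | x_t = x)
     [0.2, 0.6,  0.2 ],
     [0.3, 0.3,  0.4 ]]

def run_chain(A, x0, T):
    x, samples = x0, []
    for _ in range(T):
        x = random.choices(range(len(A)), weights=A[x])[0]  # x_{t+1} ~ A(x, .)
        samples.append(x)
    return samples

samples = run_chain(A, x0=0, T=50_000)
# Empirical state frequencies approximate the stationary distribution pi.
print([samples.count(s) / len(samples) for s in range(3)])
```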
Markov Chain Properties • Irreducibility • Periodicity • Recurrence • Reversibility • Ergodicity • Stationary Distribution
Irreducible • A state x is said to be irreducible if under the transition rule one has nonzero probability of moving from x to any other state and then coming back in a finite number of steps. • If one state is irreducible, then all the states must be irreducible. Liu, Ch. 12, p. 249, Def. 12.1.1
Aperiodic • A state x is aperiodic if the greatest common divisor of {n : A^n(x,x) > 0} is 1. • If state x is aperiodic and the chain is irreducible, then every state must be aperiodic. Liu, Ch. 12, pp. 249-250, Def. 12.1.1
Recurrence • A state x is recurrent if the chain returns to x with probability 1 • State x is recurrent if and only if: Σn A^n(x,x) = ∞ • Let M(x) be the expected number of steps to return to state x • State x is positive recurrent if M(x) is finite • The recurrent states in a finite-state chain are positive recurrent.
Ergodicity • A state x is ergodic if it is aperiodic and positive recurrent. • If all states in a Markov chain are ergodic then the chain is ergodic.
Reversibility • A Markov chain is reversible if there is a π satisfying the detailed balance condition: π(x) A(x,y) = π(y) A(y,x) • For a reversible Markov chain, π is always a stationary distribution.
Stationary Distribution • If the Markov chain is time-homogeneous, then the vector π(X) is a stationary distribution (aka invariant or equilibrium distribution, aka "fixed point") if its entries sum up to 1 and satisfy: π(y) = Σx π(x) A(x,y) • An irreducible chain has a stationary distribution if and only if all of its states are positive recurrent. The distribution is then unique.
Stationary Distribution In Finite State Space • A stationary distribution always exists, but may not be unique • If a finite-state Markov chain is irreducible and aperiodic, the stationary distribution is guaranteed to be unique, and A^n(x,y) = P(xn = y | x0 = x) converges to a rank-one matrix in which each row is the stationary distribution π • Thus, the initial state x0 is not important for convergence: it gets forgotten, and we start sampling from the target distribution • However, it is important how long it takes to forget it!
Convergence Theorem • Given a finite-state Markov chain whose transition function is irreducible and aperiodic, A^n(x0, y) converges to its invariant distribution π(y) geometrically in variation distance: there exist 0 < r < 1 and c > 0 s.t.: ||A^n(x0, ·) − π||_var ≤ c · r^n
Eigenvalue Condition • Convergence to the stationary distribution is driven by the eigenvalues of the matrix A(x,y). • "The chain will converge to its unique invariant distribution if and only if matrix A's second-largest eigenvalue in modulus is strictly less than 1." • Many proofs of convergence center on analyzing the second eigenvalue, as in the sketch below. Liu, Ch. 12, p. 249
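A sketch illustrating both points with NumPy, reusing the same hypothetical 3-state transition matrix as before: the stationary π is the left eigenvector of A for eigenvalue 1, and the second-largest eigenvalue modulus governs the geometric rate:

```python
import numpy as np

A = np.array([[0.5, 0.25, 0.25],
              [0.2, 0.6,  0.2 ],
              [0.3, 0.3,  0.4 ]])

eigvals, eigvecs = np.linalg.eig(A.T)     # left eigenvectors of A
i = np.argmin(np.abs(eigvals - 1.0))      # eigenvalue 1 -> stationary pi
pi = np.real(eigvecs[:, i])
pi /= pi.sum()                            # normalize entries to sum to 1
print("pi =", pi, "check pi = pi A:", np.allclose(pi @ A, pi))

moduli = np.sort(np.abs(eigvals))[::-1]
print("second-largest eigenvalue modulus:", moduli[1])  # < 1 => converges
```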
Convergence In Finite State Space • Assume a finite-state Markov chain is irreducible and aperiodic • The initial state x0 is not important for convergence: it gets forgotten, and we start sampling from the target distribution • However, it is important how long it takes to forget it! This is known as the burn-in time • Since the first k states are not drawn exactly from π, they are often thrown away. Open question: how big a k?
Sampling in BN • Same idea: generate a set of T samples • Estimate P(Xi | E) from the samples • Challenge: X is a vector and P(X) is a huge distribution represented by the BN • Need to know: • How to generate a new sample? • How many samples T do we need? • How to estimate P(E = e) and P(Xi | e)?
Sampling Algorithms • Forward Sampling • Gibbs Sampling (MCMC) • Blocking • Rao-Blackwellised • Likelihood Weighting • Importance Sampling • Sequential Monte-Carlo (Particle Filtering) in Dynamic Bayesian Networks
Gibbs Sampling • A Markov Chain Monte Carlo method (Gelfand and Smith, 1990; Smith and Roberts, 1993; Tierney, 1994) • The transition probability equals the conditional distribution • Example: for (X,Y), A(xt+1 | yt) = P(x | y) and A(yt+1 | xt) = P(y | x), as in the sketch below
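A minimal sketch of this two-variable Gibbs sampler. The 2x2 joint table is a made-up example, used only so the conditionals are concrete:

```python
import random

P = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}  # hypothetical P(X,Y)

def sample_x_given_y(y):
    w0, w1 = P[(0, y)], P[(1, y)]
    return 0 if random.random() < w0 / (w0 + w1) else 1

def sample_y_given_x(x):
    w0, w1 = P[(x, 0)], P[(x, 1)]
    return 0 if random.random() < w0 / (w0 + w1) else 1

x, y = 0, 0                   # arbitrary starting state (x0, y0)
for t in range(10_000):
    x = sample_x_given_y(y)   # x_{t+1} ~ A(x_{t+1} | y_t) = P(x | y_t)
    y = sample_y_given_x(x)   # y_{t+1} ~ P(y | x_{t+1})
```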
Gibbs Sampling for BN • Samples are dependent and form a Markov chain • Sample from P'(X | e), which converges to P(X | e) • Guaranteed to converge when all P > 0 • Methods to improve convergence: • Blocking • Rao-Blackwellisation • Error bounds: • Lag-t autocovariance • Multiple chains, Chebyshev's inequality
Gibbs Sampling (Pearl, 1988) • A sample x^t, t ∈ [1, 2, …], is an instantiation of all variables in the network: x^t = {x1^t, x2^t, …, xN^t} • Sampling process: • Fix values of observed variables e • Instantiate node values in sample x^0 at random • Generate samples x^1, x^2, …, x^T from P(x | e) • Compute posteriors from the samples
Ordered Gibbs Sampler Generate sample x^{t+1} from x^t by processing all variables in some order. In short, for i = 1 to N: xi^{t+1} ← sampled from P(Xi | x1^{t+1}, …, x_{i-1}^{t+1}, x_{i+1}^t, …, xN^t, e)
Gibbs Sampling (cont'd) (Pearl, 1988) Conditioning on the Markov blanket: P(xi | x \ xi) = P(xi | markov_i), where the Markov blanket markov_i of Xi consists of its parents, children, and the children's parents, so that: P(xi | markov_i) ∝ P(xi | pa_i) · Π_{Xj ∈ ch_i} P(xj | pa_j)
Ordered Gibbs Sampling Algorithm Input: X, E. Output: T samples {x^t} • Fix evidence E • Generate samples from P(X | E): • For t = 1 to T (compute samples) • For i = 1 to N (loop through variables) • Xi ← sample xi^t from P(Xi | markov_i^t) A sketch of this loop follows.
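A Python sketch of the algorithm above. The `bn` object (with `bn.variables`, `bn.domain`, `bn.children`, and `bn.cpt(i, assign)` returning P(xi | pa_i)) is an assumed interface for illustration, not an actual library API:

```python
import random

def markov_blanket_dist(bn, i, assign):
    """P(Xi | markov_i) as a list over bn.domain[i]: proportional to
    P(xi | pa_i) * prod over children j of P(xj | pa_j).
    Note: mutates assign[i]; callers overwrite it or pass a copy."""
    weights = []
    for xi in bn.domain[i]:
        assign[i] = xi
        w = bn.cpt(i, assign)               # P(xi | pa_i)
        for j in bn.children[i]:
            w *= bn.cpt(j, assign)          # P(xj | pa_j); xi is a parent of Xj
        weights.append(w)
    total = sum(weights)
    return [w / total for w in weights]

def ordered_gibbs(bn, evidence, T):
    assign = dict(evidence)                 # 1. fix evidence E
    for i in bn.variables:                  # 2. random initial values x^0
        if i not in assign:
            assign[i] = random.choice(bn.domain[i])
    samples = []
    for t in range(T):                      # 3. T ordered Gibbs sweeps
        for i in bn.variables:
            if i in evidence:
                continue
            dist = markov_blanket_dist(bn, i, assign)
            assign[i] = random.choices(bn.domain[i], weights=dist)[0]
        samples.append(dict(assign))
    return samples
```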
Answering Queries • Query: P(xi | e) = ? • Method 1 (histogram estimator): count the number of samples where Xi = xi: P̂(xi | e) = (1/T) Σt δ(xi^t, xi) • Method 2 (mixture estimator): average the conditional probability: P̂(xi | e) = (1/T) Σt P(xi | markov_i^t) Both estimators are sketched in code below.
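A sketch of both estimators, continuing the hypothetical `bn` interface and `markov_blanket_dist` helper from the Gibbs sketch above:

```python
def histogram_estimate(samples, i, xi):
    """Method 1: fraction of samples in which Xi = xi."""
    return sum(1 for s in samples if s[i] == xi) / len(samples)

def mixture_estimate(bn, samples, i, xi):
    """Method 2: average P(xi | markov_i^t) over the samples.
    Typically lower variance than the histogram estimator."""
    total = 0.0
    for s in samples:
        dist = markov_blanket_dist(bn, i, dict(s))  # pass a copy: it mutates
        total += dist[bn.domain[i].index(xi)]
    return total / len(samples)
```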
Gibbs Sampling Example - BN X = {X1, X2, …, X9}, E = {X9} (Figure: a Bayesian network over X1, …, X9)
Gibbs Sampling Example - BN Initial assignment: X1 = x1^0, X2 = x2^0, X3 = x3^0, X4 = x4^0, X5 = x5^0, X6 = x6^0, X7 = x7^0, X8 = x8^0
Gibbs Sampling Example - BN X1 ← sampled from P(X1 | x2^0, …, x8^0, x9), E = {X9}
Gibbs Sampling Example - BN X2 ← sampled from P(X2 | x1^1, x3^0, …, x8^0, x9), E = {X9}
Gibbs Sampling Example – Init Initialize nodes with random values: X1 = x1^0, X2 = x2^0, X3 = x3^0, X4 = x4^0, X5 = x5^0, X6 = x6^0, X7 = x7^0, X8 = x8^0 • Initialize running sums: SUM1 = 0, SUM2 = 0, SUM3 = 0, SUM4 = 0, SUM5 = 0, SUM6 = 0, SUM7 = 0, SUM8 = 0
Gibbs Sampling Example – Step 1 • Generate Sample 1: • compute SUM1 += P(x1 | x2^0, x3^0, x4^0, x5^0, x6^0, x7^0, x8^0, x9) • select and assign new value X1 = x1^1 • compute SUM2 += P(x2 | x1^1, x3^0, x4^0, x5^0, x6^0, x7^0, x8^0, x9) • select and assign new value X2 = x2^1 • compute SUM3 += P(x3 | x1^1, x2^1, x4^0, x5^0, x6^0, x7^0, x8^0, x9) • select and assign new value X3 = x3^1 • … • At the end, we have the new sample: S1 = {x1^1, x2^1, x3^1, x4^1, x5^1, x6^1, x7^1, x8^1, x9}
Gibbs Sampling Example – Step 2 • Generate Sample 2: • compute SUM1 += P(x1 | x2^1, x3^1, x4^1, x5^1, x6^1, x7^1, x8^1, x9) • select and assign new value X1 = x1^2 • compute SUM2 += P(x2 | x1^2, x3^1, x4^1, x5^1, x6^1, x7^1, x8^1, x9) • select and assign new value X2 = x2^2 • compute SUM3 += P(x3 | x1^2, x2^2, x4^1, x5^1, x6^1, x7^1, x8^1, x9) • select and assign new value X3 = x3^2 • … • New sample: S2 = {x1^2, x2^2, x3^2, x4^2, x5^2, x6^2, x7^2, x8^2, x9}
Gibbs Sampling Example – Answering Queries With T = 2 samples, the mixture estimates are: P̂(x1 | x9) = SUM1 / 2, P̂(x2 | x9) = SUM2 / 2, P̂(x3 | x9) = SUM3 / 2, P̂(x4 | x9) = SUM4 / 2, P̂(x5 | x9) = SUM5 / 2, P̂(x6 | x9) = SUM6 / 2, P̂(x7 | x9) = SUM7 / 2, P̂(x8 | x9) = SUM8 / 2
Gibbs Convergence • The stationary distribution equals the target sampling distribution • MCMC converges to the stationary distribution if the network is ergodic • The chain is ergodic if all probabilities are positive • If there exist states Si, Sj such that the transition probability pij = 0, then we may not be able to explore the full sampling space!
Gibbs Sampling: Burn-In • We want to sample from P(X | E) • But… the starting point is random • Solution: throw away the first K samples (sketched below) • Known as "burn-in" • What is K? Hard to tell. Use intuition. • Alternative: draw the first sample's values from an approximation of P(x | e) (for example, run IBP first)
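A minimal sketch of burn-in, continuing the `samples` list from the Gibbs sketch above. The value of K is a hypothetical tuning choice; as the slide notes, there is no general rule:

```python
K = 500              # hypothetical burn-in length
kept = samples[K:]   # keep only samples drawn after the chain has "forgotten" x^0
```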
Gibbs Sampling: Performance + Advantage: guaranteed to converge to P(X | E) - Disadvantage: convergence may be slow Problems: • Samples are dependent! • Statistical variance is too big in high-dimensional problems
Gibbs: Speeding Convergence Objectives: • Reduce dependence between samples (autocorrelation) • Skip samples • Randomize Variable Sampling Order • Reduce variance • Blocking Gibbs Sampling • Rao-Blackwellisation
Skipping Samples • Pick only every k-th sample (Geyer, 1992) • Can reduce dependence between samples! • Increases variance! • Wastes samples! A minimal sketch follows.
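A minimal sketch of thinning, continuing the post-burn-in `kept` list from the previous sketch. The lag k is a hypothetical choice:

```python
k = 10               # hypothetical lag; larger k means less autocorrelation
thinned = kept[::k]  # keep every k-th post-burn-in sample, discard the rest
```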
Randomized Variable Order Random-scan Gibbs sampler: pick each next variable Xi for update at random with probability pi, where Σi pi = 1. (In the simplest case, the pi are distributed uniformly.) In some instances this reduces variance (MacEachern, Peruggia, 1999, "Subsampling the Gibbs Sampler: Variance Reduction"). A sketch follows.
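A sketch of one random-scan step in the uniform case, reusing the assumed `bn` interface and `markov_blanket_dist` helper from the ordered-Gibbs sketch:

```python
import random

def random_scan_step(bn, evidence, assign):
    """Pick a single unobserved Xi uniformly (p_i = 1/#unobserved) and
    resample it given its Markov blanket, instead of sweeping in order."""
    unobserved = [i for i in bn.variables if i not in evidence]
    i = random.choice(unobserved)
    dist = markov_blanket_dist(bn, i, assign)
    assign[i] = random.choices(bn.domain[i], weights=dist)[0]
```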
Blocking • Sample several variables together, as a block • Example: Given three variables X, Y, Z, with domains of size 2, group Y and Z together to form a variable W = {Y, Z} with domain size 4. Then, given sample (x^t, y^t, z^t), compute the next sample as sketched below: x^{t+1} ← sampled from P(x | y^t, z^t) = P(x | w^t) w^{t+1} = (y^{t+1}, z^{t+1}) ← sampled from P(w | x^{t+1}) + Can improve convergence greatly when two variables are strongly correlated! - The domain of the block variable grows exponentially with the number of variables in a block!
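A sketch of one blocked update for this X, Y, Z example, assuming a hypothetical function `joint(x, y, z)` that returns the joint probability P(x, y, z):

```python
import random

def blocked_gibbs_step(joint, x, y, z):
    """One blocked update: x ~ P(x | w), then w = (y, z) sampled jointly."""
    x_w = [joint(0, y, z), joint(1, y, z)]                   # prop. to P(x | w^t)
    x = random.choices([0, 1], weights=x_w)[0]               # x^{t+1}
    w_domain = [(yy, zz) for yy in (0, 1) for zz in (0, 1)]  # |D(W)| = 4
    w_w = [joint(x, yy, zz) for (yy, zz) in w_domain]        # prop. to P(w | x^{t+1})
    y, z = random.choices(w_domain, weights=w_w)[0]          # w^{t+1} jointly
    return x, y, z
```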
Blocking Gibbs Sampling Jensen, Kong, Kjaerulff, 1993, "Blocking Gibbs Sampling in Very Large Probabilistic Expert Systems" • Select a set of subsets E1, E2, E3, …, Ek s.t.: Ei ⊆ X, ∪i Ei = X, Ai = X \ Ei • Sample from P(Ei | Ai)
Rao-Blackwellisation • Do not sample all variables! • Sample a subset! • Example: Given three variables X, Y, Z, sample only X and Y, summing out Z. Given sample (x^t, y^t), compute the next sample as sketched below: x^{t+1} ← sampled from P(x | y^t) y^{t+1} ← sampled from P(y | x^{t+1})
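A sketch of one Rao-Blackwellised update for the same example, using the same assumed `joint(x, y, z)` function as in the blocking sketch; Z is summed out of every conditional, so it is never sampled:

```python
import random

def rb_gibbs_step(joint, x, y):
    """One Rao-Blackwellised update over (X, Y) only, with Z summed out."""
    x_w = [sum(joint(xx, y, z) for z in (0, 1)) for xx in (0, 1)]
    x = random.choices([0, 1], weights=x_w)[0]   # x^{t+1} ~ P(x | y^t)
    y_w = [sum(joint(x, yy, z) for z in (0, 1)) for yy in (0, 1)]
    y = random.choices([0, 1], weights=y_w)[0]   # y^{t+1} ~ P(y | x^{t+1})
    return x, y
```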
Rao-Blackwell Theorem In essence: Var[E[g(X) | Y]] ≤ Var[g(X)], i.e., conditioning never increases variance. Bottom line: reducing the number of variables in a sample reduces variance!
Blocking vs. Rao-Blackwellisation For variables X, Y, Z: • Standard Gibbs: P(x | y,z), P(y | x,z), P(z | x,y) (1) • Blocking: P(x | y,z), P(y,z | x) (2) • Rao-Blackwellised: P(x | y), P(y | x) (3) Var3 < Var2 < Var1 [Liu, Wong, Kong, 1994, "Covariance structure of the Gibbs sampler…"]
Rao-Blackwellised Gibbs: Cutset Sampling • Select C ⊆ X (possibly a cycle-cutset), |C| = m • Fix evidence E • Initialize nodes with random values: for i = 1 to m, assign Ci = ci^0 • For t = 1 to n, generate samples: for i = 1 to m: Ci = ci^{t+1} ← sampled from P(ci | c1^{t+1}, …, c_{i-1}^{t+1}, c_{i+1}^t, …, cm^t, e)
Cutset Sampling • Select a subset C = {C1, …, CK} ⊆ X • A sample c^t, t ∈ [1, 2, …], is an instantiation of C: c^t = {c1^t, …, cK^t} • Sampling process: • Fix values of observed variables e • Generate sample c^0 at random • Generate samples c^1, c^2, …, c^T from P(c | e) • Compute posteriors from the samples
Cutset Sampling: Generating Samples Generate sample c^{t+1} from c^t. In short, for i = 1 to K: ci^{t+1} ← sampled from P(ci | c1^{t+1}, …, c_{i-1}^{t+1}, c_{i+1}^t, …, cK^t, e), where each conditional is computed by exact inference over the remaining variables X \ C. A sketch follows.
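A sketch of one cutset-sampling sweep. The key difference from ordered Gibbs is that each conditional P(ci | c_-i, e) is computed by exact inference over X \ C, which is cheap when C is a cycle-cutset (the conditioned network is then a polytree). `exact_marginal` stands for an assumed exact-inference routine, e.g., belief propagation on the conditioned network; the `bn` interface is the same assumption as before:

```python
import random

def cutset_sweep(bn, cutset, c, evidence, exact_marginal):
    """One sweep over C; `c` maps each cutset variable Ci to its current value.
    exact_marginal(bn, i, conditioning) returns P(Ci | conditioning) exactly."""
    for i in cutset:
        rest = {j: c[j] for j in cutset if j != i}  # c_{-i}
        rest.update(evidence)                       # plus evidence e
        dist = exact_marginal(bn, i, rest)          # P(Ci | c_{-i}, e)
        c[i] = random.choices(bn.domain[i], weights=dist)[0]
    return c
```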