Sampling Bayesian Networks ICS 275b 2005
Approximation Algorithms
Structural Approximations
• Eliminate some dependencies
• Remove edges
• Mini-Bucket Approach
Search
• Approach for optimization tasks: MPE, MAP
Sampling
• Generate random samples and compute values of interest from the samples, not the original network
Sampling
• Input: Bayesian network with set of nodes X
• Sample = a tuple with assigned values s = (X1=x1, X2=x2, …, Xk=xk)
• A tuple may include all variables (except evidence) or a subset
• A sampling scheme dictates how to generate samples (tuples)
• Ideally, samples are distributed according to P(X|E)
Sampling
• Idea: generate a set of T samples
• Estimate P(Xi|E) from the samples
• Need to know:
  • How to generate a new sample?
  • How many samples T do we need?
  • How to estimate P(Xi|E)?
Sampling Algorithms
• Forward Sampling
• Likelihood Weighting
• Gibbs Sampling (MCMC)
  • Blocking
  • Rao-Blackwellised
• Importance Sampling
• Sequential Monte Carlo (Particle Filtering) in Dynamic Bayesian Networks
Forward Sampling
• Case with no evidence
• Case with evidence
• Number of samples and error bounds
Forward Sampling - No Evidence (Henrion 1988)
Input: Bayesian network X = {X1,…,XN}, N - #nodes, T - # samples
Output: T samples
Process nodes in topological order - first process the ancestors of a node, then the node itself:
1. For t = 1 to T
2. For i = 1 to N
3. Xi ← sample xit from P(xi | pai)
Sampling a Value
What does it mean to sample xit from P(Xi | pai)?
• Assume D(Xi) = {0,1}
• Assume P(Xi | pai) = (0.3, 0.7)
• Draw a random number r from [0,1]:
  If r falls in [0, 0.3], set Xi = 0
  If r falls in [0.3, 1], set Xi = 1
Sampling a Value
• When we sample xit from P(Xi | pai), most of the time we will pick the most likely value of Xi; occasionally we will pick an unlikely value of Xi
• We want to find high-probability tuples
• But occasionally choosing an unlikely value lets us "cross" low-probability tuples to reach other high-probability tuples!
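The last few slides translate directly into code. Below is a minimal Python sketch (not from the original slides); the dict-based network representation (order, parents, cpt) is a hypothetical illustration.

```python
import random

def sample_value(dist):
    """Draw one value from a discrete distribution {value: probability},
    as on the previous slide: split [0,1] by cumulative probability."""
    r = random.random()
    cum = 0.0
    for value, p in dist.items():
        cum += p
        if r <= cum:
            return value
    return value  # guard against floating-point round-off

def forward_sample(order, parents, cpt):
    """Generate one sample, processing nodes in topological order.
    order   : nodes listed ancestors-first
    parents : {node: tuple of parent nodes}
    cpt     : {node: {parent_values_tuple: {value: probability}}}"""
    sample = {}
    for x in order:
        pa = tuple(sample[p] for p in parents[x])
        sample[x] = sample_value(cpt[x][pa])
    return sample

# Example: a two-node network X1 -> X2 with P(X1 = 0) = 0.3
parents = {"X1": (), "X2": ("X1",)}
cpt = {
    "X1": {(): {0: 0.3, 1: 0.7}},
    "X2": {(0,): {0: 0.8, 1: 0.2}, (1,): {0: 0.1, 1: 0.9}},
}
samples = [forward_sample(["X1", "X2"], parents, cpt) for _ in range(1000)]
```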
Forward Sampling - Answering Queries
Task: given n samples {S1, S2, …, Sn}, estimate P(Xi = xi):
P̂(Xi = xi) = #{t : Xi = xi in St} / n
i.e., count the proportion of samples where Xi = xi.
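A sketch of this counting estimator, reusing the samples from the previous sketch:

```python
def estimate_marginal(samples, var, value):
    """Fraction of samples in which var takes the given value."""
    return sum(1 for s in samples if s[var] == value) / len(samples)

# estimate_marginal(samples, "X1", 0) should be close to 0.3
```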
Forward Sampling w/ Evidence
Input: Bayesian network X = {X1,…,XN}, N - #nodes, E - evidence, T - # samples
Output: T samples consistent with E
1. For t = 1 to T
2. For i = 1 to N
3. Xi ← sample xit from P(xi | pai)
4. If Xi ∈ E and xit ≠ ei, reject the sample: set i = 1 and go to step 2
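A sketch of the rejection step, reusing forward_sample from above; here the whole sample is simply regenerated on rejection rather than restarting at i = 1, which yields the same distribution.

```python
def rejection_sample(order, parents, cpt, evidence, T):
    """Forward sampling with evidence: keep only samples consistent with E."""
    accepted = []
    while len(accepted) < T:
        s = forward_sample(order, parents, cpt)
        if all(s[e] == v for e, v in evidence.items()):
            accepted.append(s)
    return accepted

# e.g. rejection_sample(["X1", "X2"], parents, cpt, {"X2": 1}, 500)
```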
Forward Sampling: Illustration
Let Y be a subset of the evidence nodes s.t. Y = u
Forward Sampling - How many samples?
Theorem: Let s(y) be the estimate of P(y) resulting from a randomly chosen sample set S with T samples. Then, to guarantee relative error at most ε with probability at least 1 - δ, it is enough to have:
T ≥ (1 - P(y)) / (P(y) · ε² · δ)
Derived from Chebyshev's Bound.
Forward Sampling - How many samples?
Theorem: Let s(y) be the estimate of P(y) resulting from a randomly chosen sample set S with T samples. Then, to guarantee relative error at most ε with probability at least 1 - δ, it is enough to have:
T ≥ ln(2/δ) / (2 · ε² · P(y)²)
Derived from Hoeffding's Bound (full proof is given in Koller).
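A sketch that evaluates the two bounds as reconstructed above (the exact constants depend on the derivation, so treat them as an assumption); p plays the role of P(y), which in practice must itself be estimated from samples.

```python
import math

def chebyshev_samples(p, eps, delta):
    """T >= (1 - p) / (p * eps^2 * delta), from Chebyshev's bound."""
    return math.ceil((1 - p) / (p * eps**2 * delta))

def hoeffding_samples(p, eps, delta):
    """T >= ln(2/delta) / (2 * eps^2 * p^2), from Hoeffding's bound."""
    return math.ceil(math.log(2 / delta) / (2 * eps**2 * p**2))

# For p = 0.1, eps = 0.1, delta = 0.05:
# chebyshev_samples -> 18000, hoeffding_samples -> 18445
```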
Forward Sampling: Performance
Advantages:
• P(xi | pa(xi)) is readily available
• Samples are independent!
Drawbacks:
• If evidence E is rare (P(e) is low), then we will reject most of the samples!
• Since P(y) appears in the bound on T but is unknown, we must estimate it from the samples themselves!
• If P(e) is small, T becomes very big!
Problem: Evidence
• Forward Sampling: high rejection rate
• Fix evidence values:
  • Gibbs sampling (MCMC)
  • Likelihood Weighting
  • Importance Sampling
Forward Sampling Bibliography
[henrion88] M. Henrion, "Propagating uncertainty in Bayesian networks by probabilistic logic sampling", Uncertainty in AI, pp. 149-163, 1988.
Likelihood Weighting (Fung and Chang, 1990; Shachter and Peot, 1990)
"Clamping" evidence + forward sampling + weighing samples by evidence likelihood.
Works well for likely evidence!
Likelihood Weighting
Sample non-evidence nodes in topological order from their prior conditionals; clamp evidence nodes to e. Each sample t receives the weight
w(t) = Π_{Xi ∈ E} P(ei | pai)
and the posterior is estimated by
P̂(xi | e) = Σ_t w(t) · 1{xi^t = xi} / Σ_t w(t)
where 1{·} is the indicator function.
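A minimal sketch of likelihood weighting under the same hypothetical network representation: evidence nodes are clamped, non-evidence nodes are sampled forward, and each sample carries the likelihood of the evidence as its weight.

```python
def likelihood_weighting(order, parents, cpt, evidence, T):
    """Return T (sample, weight) pairs; evidence stays clamped."""
    weighted = []
    for _ in range(T):
        sample, w = dict(evidence), 1.0
        for x in order:
            pa = tuple(sample[p] for p in parents[x])
            if x in evidence:
                w *= cpt[x][pa][evidence[x]]   # multiply in P(e_i | pa_i)
            else:
                sample[x] = sample_value(cpt[x][pa])
        weighted.append((sample, w))
    return weighted

def lw_posterior(weighted, var, value):
    """Normalized weighted count: estimate of P(var = value | e)."""
    num = sum(w for s, w in weighted if s[var] == value)
    den = sum(w for _, w in weighted)
    return num / den
```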
Likelihood Convergence (Chebyshev's Inequality)
• Assume P(X=x|e) has mean μ and variance σ²
• Chebyshev: P(|ŝ - μ| ≥ εμ) ≤ σ² / (T ε² μ²)
• μ = P(x|e) is unknown ⇒ obtain it from samples!
Error Bound Derivation
K is a Bernoulli random variable with P(K = 1) = μ:
ŝ = (1/T) Σ_t K_t,  E[ŝ] = μ,  Var(ŝ) = μ(1 - μ)/T
Plugging Var(ŝ) into Chebyshev's inequality yields the sample-size bound.
Likelihood Convergence 2
• Assume P(X=x|e) has mean μ and variance σ²
• Zero-One Estimation Theory (Karp et al., 1989):
T ≥ 4 ln(2/δ) / (ε² μ)
• μ = P(x|e) is unknown ⇒ obtain it from samples!
Local Variance Bound (LVB) (Dagum & Luby, 1994)
• Let Δ be the LVB of a binary-valued network
LVB Estimate (Pradhan & Dagum, 1996)
• Using the LVB, the Zero-One Estimator can be re-written:
Importance Sampling Idea
• In general, it is hard to sample from the target distribution P(X|E)
• Generate samples from a sampling (proposal) distribution Q(X)
• Weigh each sample against P(X|E): w(x) = P(x, e) / Q(x)
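A generic sketch of the idea; sample_q, p_joint, and q_prob are hypothetical hooks standing in for "draw from Q", "score under P(x, e)", and "score under Q".

```python
def importance_estimate(sample_q, p_joint, q_prob, indicator, T):
    """Normalized importance-sampling estimate of P(query | e)."""
    num = den = 0.0
    for _ in range(T):
        x = sample_q()                 # draw x ~ Q(X)
        w = p_joint(x) / q_prob(x)     # weight w(x) = P(x, e) / Q(x)
        num += w * indicator(x)        # accumulate where the query holds
        den += w
    return num / den
```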
Importance Sampling Variants
Importance sampling: forward, non-adaptive
• Nodes sampled in topological order
• Sampling distribution (for non-instantiated nodes) equal to the prior conditionals
Importance sampling: forward, adaptive
• Nodes sampled in topological order
• Sampling distribution adapted according to average importance weights obtained in previous samples [Cheng & Druzdzel, 2000]
AIS-BN
• The most efficient variant of importance sampling to date is AIS-BN - Adaptive Importance Sampling for Bayesian networks.
• Jian Cheng and Marek J. Druzdzel. AIS-BN: An adaptive importance sampling algorithm for evidential reasoning in large Bayesian networks. Journal of Artificial Intelligence Research (JAIR), 13:155-188, 2000.
Gibbs Sampling
• Markov Chain Monte Carlo method (Gelfand and Smith, 1990; Smith and Roberts, 1993; Tierney, 1994)
• Samples are dependent and form a Markov chain
• Samples are drawn directly from P(X|e)
• Guaranteed to converge when all P > 0
• Methods to improve convergence:
  • Blocking
  • Rao-Blackwellised
• Error bounds:
  • Lag-t autocovariance
  • Multiple chains, Chebyshev's inequality
MCMC Sampling Fundamentals
Given a set of variables X = {X1, X2, …, Xn} with joint probability distribution Π(X) and some function g(X), we can compute the expected value of g(X):
E[g(X)] = Σ_x g(x) · Π(x)
MCMC Sampling From Π(X)
A sample St is an instantiation:
St = (x1^t, x2^t, …, xn^t)
Given independent, identically distributed (iid) samples S1, S2, …, ST from Π(X), it follows from the Strong Law of Large Numbers that:
ĝ = (1/T) Σ_t g(St) → E[g(X)] as T → ∞
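As a one-function sketch, the SLLN estimate is just the sample mean of g over the drawn samples:

```python
def monte_carlo_expectation(samples, g):
    """Sample mean (1/T) * sum g(S_t), which converges to E[g(X)]."""
    return sum(g(s) for s in samples) / len(samples)
```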
Gibbs Sampling (Pearl, 1988)
• A sample xt, t ∈ {1, 2, …}, is an instantiation of all variables in the network:
xt = (x1^t, x2^t, …, xN^t)
• Sampling process:
  • Fix values of observed variables e
  • Instantiate node values in sample x0 at random
  • Generate samples x1, x2, …, xT from P(x|e)
  • Compute posteriors from samples
Ordered Gibbs Sampler
Generate sample xt+1 from xt by resampling each variable conditioned on all the others:
X1 ← x1^{t+1} sampled from P(x1 | x2^t, …, xN^t, e)
X2 ← x2^{t+1} sampled from P(x2 | x1^{t+1}, x3^t, …, xN^t, e)
…
XN ← xN^{t+1} sampled from P(xN | x1^{t+1}, …, x_{N-1}^{t+1}, e)
In short, for i = 1 to N: process all variables in some order.
Gibbs Sampling (cont'd) (Pearl, 1988)
Only the Markov blanket of Xi (its parents, children, and children's parents) matters:
P(xi | markov_i) ∝ P(xi | pai) · Π_{Xj ∈ ch(Xi)} P(xj | paj)
Ordered Gibbs Sampling Algorithm
Input: X, E
Output: T samples {xt}
1. Fix evidence E
2. Generate samples from P(X | E):
3. For t = 1 to T (compute samples)
4. For i = 1 to N (loop through variables)
5. Xi ← sample xit from P(Xi | markov_i^t)
Answering Queries
• Query: P(xi | e) = ?
• Method 1: count the # of samples where Xi = xi (histogram estimator):
P̂(xi | e) = (1/T) Σ_t 1{xi^t = xi}
• Method 2: average conditional probability (mixture estimator):
P̂(xi | e) = (1/T) Σ_t P(xi | markov_i^t)
(see the code sketch below)
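A minimal sketch of the ordered Gibbs sampler and both estimators, under the same hypothetical representation plus two added structures: children (each node's child list) and domain (each node's value set).

```python
def markov_blanket_dist(x, sample, parents, children, cpt, domain):
    """P(x | markov blanket) ∝ P(x | pa_x) * prod over children c of P(c | pa_c)."""
    scores = {}
    for v in domain[x]:
        s = dict(sample)
        s[x] = v
        score = cpt[x][tuple(s[p] for p in parents[x])][v]
        for c in children[x]:
            score *= cpt[c][tuple(s[p] for p in parents[c])][s[c]]
        scores[v] = score
    z = sum(scores.values())
    return {v: sc / z for v, sc in scores.items()}

def gibbs(order, parents, children, cpt, domain, evidence, T):
    """Ordered Gibbs sampler: returns the chain of T samples."""
    sample = dict(evidence)
    for x in order:                       # random initial instantiation x^0
        if x not in evidence:
            sample[x] = random.choice(domain[x])
    chain = []
    for _ in range(T):
        for x in order:                   # resample each non-evidence node
            if x in evidence:
                continue                  # evidence stays clamped
            sample[x] = sample_value(
                markov_blanket_dist(x, sample, parents, children, cpt, domain))
        chain.append(dict(sample))
    return chain

def histogram_estimate(chain, var, value):
    """Method 1: fraction of samples with var = value."""
    return sum(1 for s in chain if s[var] == value) / len(chain)

def mixture_estimate(chain, var, value, parents, children, cpt, domain):
    """Method 2: average P(var = value | markov blanket) across samples."""
    return sum(markov_blanket_dist(var, s, parents, children, cpt, domain)[value]
               for s in chain) / len(chain)
```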
Gibbs Sampling Example - BN
X = {X1, X2, …, X9}, E = {X9}
(network diagram over X1…X9)
Gibbs Sampling Example - BN
Initial instantiation:
X1 = x1^0, X2 = x2^0, X3 = x3^0, X4 = x4^0, X5 = x5^0, X6 = x6^0, X7 = x7^0, X8 = x8^0
(network diagram over X1…X9)
Gibbs Sampling Example - BN
Sample X1: x1^1 ← P(X1 | x2^0, …, x8^0, x9), E = {X9}
(network diagram over X1…X9)
Gibbs Sampling Example - BN
Sample X2: x2^1 ← P(X2 | x1^1, x3^0, …, x8^0, x9), E = {X9}
(network diagram over X1…X9)
Gibbs Sampling: Burn-In
• We want to sample from P(X | E)
• But… the starting point is random
• Solution: throw away the first K samples
• Known as "burn-in"
• What is K? Hard to tell. Use intuition.
• Alternative: sample initial values from an approximation of P(x|e) (for example, run IBP first)
Gibbs Sampling: Convergence
• Converges to the stationary distribution π*:
π* = π* P
where P is the transition kernel, pij = P(Xi → Xj)
• Guaranteed to converge if the chain is:
  • irreducible
  • aperiodic
  • ergodic (∀ i,j: pij > 0)
Irreducible
• A Markov chain (or its probability transition matrix) is said to be irreducible if it is possible to reach every state from every other state (not necessarily in one step).
• In other words, ∀ i,j ∃ k : P^(k)ij > 0, where k is the number of steps taken to get from state i to state j.
Aperiodic
• Define d(i) = g.c.d.{n > 0 | it is possible to go from i to i in n steps}. Here, g.c.d. means the greatest common divisor of the integers in the set. If d(i) = 1 for all i, then the chain is aperiodic.
Ergodicity
• A recurrent state is a state to which the chain returns with probability 1; equivalently, Σn P^(n)ii = ∞
• Recurrent, aperiodic states are ergodic. Note: an extra condition for ergodicity is that the expected recurrence time is finite. This holds for recurrent states in a finite-state chain.
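These definitions can be checked mechanically on a small finite chain. A sketch (the transition matrix P is assumed to be a row-stochastic list of lists; not from the original slides):

```python
from math import gcd

def is_irreducible(P):
    """Every state reaches every other: ∀ i,j ∃ k with P^(k)_ij > 0."""
    n = len(P)
    reach = [[P[i][j] > 0 for j in range(n)] for i in range(n)]
    for k in range(n):                    # Floyd-Warshall-style reachability
        for i in range(n):
            for j in range(n):
                reach[i][j] = reach[i][j] or (reach[i][k] and reach[k][j])
    return all(all(row) for row in reach)

def period(P, i, max_n=50):
    """d(i) = g.c.d. of return times to state i (aperiodic iff this is 1)."""
    n_states, d = len(P), 0
    cur = {i}                             # states reachable in exactly n steps
    for n in range(1, max_n + 1):
        cur = {j for s in cur for j in range(n_states) if P[s][j] > 0}
        if i in cur:
            d = gcd(d, n)
    return d
```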
Gibbs Convergence
• Gibbs convergence is generally guaranteed as long as all probabilities are positive!
• Intuition for the ergodicity requirement: if nodes X and Y are correlated s.t. X = 0 ⇔ Y = 0, then:
  • once we sample and assign X = 0, we are forced to assign Y = 0;
  • once we sample and assign Y = 0, we are forced to assign X = 0;
  we will never be able to change their values again!
• Another problem: it can take a very long time to converge!