1 / 94

Sampling Bayesian Networks

Sampling Bayesian Networks. ICS 276 2007. Answering BN Queries. Probability of Evidence P(e) ? NP-hard Conditional Prob. P(x i |e) ? NP-hard MPE x = arg max P(x|e) ? NP-hard MAP y = arg max P(y|e), y  x ? NP PP -hard Approximating P(e) or P(x i |e) within : NP-hard.

turi
Download Presentation

Sampling Bayesian Networks

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Sampling Bayesian Networks ICS 276 2007

  2. Answering BN Queries • Probability of Evidence P(e) ? NP-hard • Conditional Prob. P(xi|e) ? NP-hard • MPE x = arg max P(x|e) ? NP-hard • MAP y = arg max P(y|e), y  x ? NPPP-hard • Approximating P(e) or P(xi|e) within : NP-hard

  3. Approximation Algorithms Structural Approximations • Eliminate some dependencies • Remove edges • Mini-Bucket Approach Search Approach for optimization tasks: MPE, MAP Sampling Generate random samples and compute values of interest from samples, not original network

  4. Algorithm Tree

  5. Sampling • Input: Bayesian network with set of nodes X • Sample = a tuple with assigned values s=(X1=x1,X2=x2,… ,Xk=xk) • Tuple may include all variables (except evidence) or a subset • Sampling schemas dictate how to generate samples (tuples) • Ideally, samples are distributed according to P(X|E)

  6. Sampling Fundamentals Given a set of variables X = {X1, X2, … Xn} that represent joint probability distribution (X) and some function g(X), we can compute expected value of g(X) :

  7. Sampling From (X) A sample St is an instantiation: Given independent, identically distributed samples (iid) S1, S2, …ST from (X), it follows from Strong Law of Large Numbers:

  8. Sampling Basics • Given random variable X, D(X)={0, 1} • Given P(X) = {0.3, 0.7} • Generate k samples: 0,1,1,1,0,1,1,0,1 • Approximate P’(X):

  9. How to draw a sample ? • Given random variable X, D(X)={0, 1} • Given P(X) = {0.3, 0.7} • Sample X  P (X) • draw random number r  [0, 1] • If (r < 0.3) then set X=0 • Else set X=1 • Can generalize for any domain size

  10. Sampling in BN • Same Idea: generate a set of samples T • Estimate P(Xi|E) from samples • Challenge: X is a vector and P(X) is a huge distribution represented by BN • Need to know: • How to generate a new sample ? • How many samples T do we need ? • How to estimate P(E=e) and P(Xi|e) ?

  11. Sampling Algorithms • Forward Sampling • Gibbs Sampling (MCMC) • Blocking • Rao-Blackwellised • Likelihood Weighting • Importance Sampling • Sequential Monte-Carlo (Particle Filtering) in Dynamic Bayesian Networks

  12. Forward Sampling • Forward Sampling • Case with No evidence E={} • Case with Evidence E=e • # samples N and Error Bounds

  13. Forward Sampling No Evidence(Henrion 1988) Input: Bayesian network X= {X1,…,XN}, N- #nodes, T - # samples Output: T samples Process nodes in topological order – first process the ancestors of a node, then the node itself: • For t = 0 to T • For i = 0 to N • Xi sample xit from P(xi | pai)

  14. r 0 0.3 1 Sampling A Value What does it mean to sample xit from P(Xi | pai) ? • Assume D(Xi)={0,1} • Assume P(Xi | pai) = (0.3, 0.7) • Draw a random number r from [0,1] If r falls in [0,0.3], set Xi = 0 If r falls in [0.3,1], set Xi=1

  15. Forward sampling (example) X1 X3 X2 X4

  16. Forward Sampling-Answering Queries Task: given T samples {S1,S2,…,Sn} estimate P(Xi = xi) : Basically, count the proportion of samples where Xi = xi

  17. Forward Sampling w/ Evidence Input: Bayesian network X= {X1,…,XN}, N- #nodes E – evidence, T - # samples Output: T samples consistent with E • For t=1 to T • For i=1 to N • Xi sample xit from P(xi | pai) • If Xi in E and Xi xi, reject sample: • i = 1 and go to step 2

  18. Forward sampling (example) X1 X3 X2 X4

  19. Forward Sampling: Illustration Let Y be a subset of evidence nodes s.t. Y=u

  20. Forward Sampling –How many samples? Theorem: Let s(y) be the estimate of P(y) resulting from a randomly chosen sample set S with T samples. Then, to guarantee relative error at most with probability at least 1- it is enough to have: Derived from Chebychev’s Bound.

  21. Forward Sampling - How many samples? Theorem: Let s(y) be the estimate of P(y) resulting from a randomly chosen sample set S with T samples. Then, to guarantee relative error at most with probability at least 1- it is enough to have: Derived from Hoeffding’s Bound (full proof is given in Koller).

  22. Forward Sampling:Performance Advantages: • P(xi | pa(xi)) is readily available • Samples are independent ! Drawbacks: • If evidence E is rare (P(e) is low), then we will reject most of the samples! • Since P(y) in estimate of T is unknown, must estimate P(y) from samples themselves! • If P(e) is small, T will become very big!

  23. Problem: Evidence • Forward Sampling • High Rejection Rate • Fix evidence values • Gibbs sampling (MCMC) • Likelihood Weighting • Importance Sampling

  24. Forward Sampling Bibliography • {henrion88} M. Henrion, "Propagating uncertainty in Bayesian networks by probabilistic logic sampling”, Uncertainty in AI, pp. = 149-163,1988

  25. Gibbs Sampling • Markov Chain Monte Carlo method (Gelfand and Smith, 1990, Smith and Roberts, 1993, Tierney, 1994) • Samples are dependent, form Markov Chain • Sample from P’(X|e)which converges toP(X|e) • Guaranteed to converge when all P > 0 • Methods to improve convergence: • Blocking • Rao-Blackwellised • Error Bounds • Lag-t autocovariance • Multiple Chains, Chebyshev’s Inequality

  26. Gibbs Sampling (Pearl, 1988) • A sample t[1,2,…],is an instantiation of all variables in the network: • Sampling process • Fix values of observed variables e • Instantiate node values in sample x0 at random • Generate samples x1,x2,…xT from P(x|e) • Compute posteriors from samples

  27. Ordered Gibbs Sampler Generate sample xt+1 from xt : In short, for i=1 to N: Process All Variables In Some Order

  28. Gibbs Sampling (cont’d)(Pearl, 1988) Markov blanket:

  29. Ordered Gibbs Sampling Algorithm Input: X, E Output: T samples {xt } • Fix evidence E • Generate samples from P(X | E) • For t = 1 to T (compute samples) • For i = 1 to N (loop through variables) • Xi sample xit from P(Xi | markovt \ Xi)

  30. Answering Queries • Query: P(xi |e) = ? • Method 1: count #of samples where Xi=xi: Method 2: average probability (mixture estimator):

  31. Gibbs Sampling Example - BN X = {X1,X2,…,X9} E = {X9} X1 X3 X6 X2 X5 X8 X9 X4 X7

  32. Gibbs Sampling Example - BN X1 = x10X6 = x60 X2 = x20X7 = x70 X3 = x30X8 = x80 X4 = x40 X5 = x50 X1 X3 X6 X2 X5 X8 X9 X4 X7

  33. Gibbs Sampling Example - BN X1 P (X1 |X02,…,X08 ,X9} E = {X9} X1 X3 X6 X2 X5 X8 X9 X4 X7

  34. Gibbs Sampling Example - BN X2 P(X2 |X11,…,X08 ,X9} E = {X9} X1 X3 X6 X2 X5 X8 X9 X4 X7

  35. Gibbs Sampling: Illustration

  36. Gibbs Sampling: Burn-In • We want to sample from P(X | E) • But…starting point is random • Solution: throw away first K samples • Known As “Burn-In” • What is K ? Hard to tell. Use intuition. • Alternatives: sample first sample valkues from approximate P(x|e) (for example, run IBP first)

  37. Gibbs Sampling: Convergence • Converge to stationary distribution * : * = * P where P is a transition kernel pij = P(Xi Xj) • Guaranteed to converge iff chain is : • irreducible • aperiodic • ergodic ( i,j pij > 0)

  38. Irreducible • A Markov chain (or its probability transition matrix) is said to be irreducible if it is possible to reach every state from every other state (not necessarily in one step). • In other words, i,j k : P(k)ij > 0 where k is the number of steps taken to get to state j from state i.

  39. Aperiodic • Define d(i) = g.c.d.{n > 0 | it is possible to go from i to i in n steps}. Here, g.c.d. means the greatest common divisor of the integers in the set. If d(i)=1 for i, then chain is aperiodic.

  40. Ergodicity • A recurrent state is a state to which the chain returns with probability 1: nP(n)ij =  • Recurrent, aperiodic states are ergodic. Note: an extra condition for ergodicity is that expected recurrence time is finite. This holds for recurrent states in a finite state chain.

  41. Gibbs Convergence • Gibbs convergence is generally guaranteed as long as all probabilities are positive! • Intuition for ergodicity requirement: if nodes X and Y are correlated s.t. X=0 Y=0, then: • once we sample and assign X=0, then we are forced to assign Y=0; • once we sample and assign Y=0, then we are forced to assign X=0;  we will never be able to change their values again! • Another problem: it can take a very long time to converge!

  42. Gibbs Sampling: Performance +Advantage: guaranteed to converge to P(X|E) -Disadvantage: convergence may be slow Problems: • Samples are dependent ! • Statistical variance is too big in high-dimensional problems

  43. Gibbs: Speeding Convergence Objectives: • Reduce dependence between samples (autocorrelation) • Skip samples • Randomize Variable Sampling Order • Reduce variance • Blocking Gibbs Sampling • Rao-Blackwellisation

  44. Skipping Samples • Pick only every k-th sample (Gayer, 1992) Can reduce dependence between samples ! Increases variance ! Waists samples !

  45. Randomized Variable Order Random Scan Gibbs Sampler Pick each next variable Xi for update at random with probability pi , i pi = 1. (In the simplest case, pi are distributed uniformly.) In some instances, reduces variance (MacEachern, Peruggia, 1999 “Subsampling the Gibbs Sampler: Variance Reduction”)

  46. Blocking • Sample several variables together, as a block • Example: Given three variables X,Y,Z, with domains of size 2, group Y and Z together to form a variable W={Y,Z} with domain size 4. Then, given sample (xt,yt,zt), compute next sample: Xt+1 P(yt,zt)=P(wt) (yt+1,zt+1)=Wt+1 P(xt+1) + Can improve convergence greatly when two variables are strongly correlated! - Domain of the block variable grows exponentially with the #variables in a block!

  47. Blocking Gibbs Sampling Jensen, Kong, Kjaerulff, 1993 “Blocking Gibbs Sampling Very Large Probabilistic Expert Systems” • Select a set of subsets: E1, E2, E3, …, Ek s.t. Ei X Ui Ei = X Ai = X \ Ei • Sample P(Ei | Ai)

  48. Rao-Blackwellisation • Do not sample all variables! • Sample a subset! • Example: Given three variables X,Y,Z, sample only X and Y, sum out Z. Given sample (xt,yt), compute next sample: Xt+1 P(yt) yt+1  P(xt+1)

  49. Rao-Blackwell Theorem Bottom line: reducing number of variables in a sample reduce variance!

  50. Blocking vs. Rao-Blackwellisation • Standard Gibbs: P(x|y,z),P(y|x,z),P(z|x,y) (1) • Blocking: P(x|y,z), P(y,z|x) (2) • Rao-Blackwellised: P(x|y), P(y|x) (3) Var3 < Var2 < Var1 [Liu, Wong, Kong, 1994 Covariance structure of the Gibbs sampler…] X Y Z

More Related