Exploring the computational complexity of computing probabilities in Bayesian networks, with a focus on NP-hardness and approximation methods.
Approximate Inference (edited from slides by Nir Friedman)
Complexity of Inference • Theorem: Computing P(X = x) in a Bayesian network is NP-hard • Not surprising, since we can simulate Boolean gates
Proof We reduce 3-SAT to Bayesian network computation. Assume we are given a 3-SAT problem: • q1,…,qn are propositions • φ1,…,φk are clauses, such that φi = li1 ∨ li2 ∨ li3, where each lij is a literal over q1,…,qn • φ = φ1 ∧ … ∧ φk We will construct a Bayesian network s.t. P(X = t) > 0 iff φ is satisfiable
(figure: the reduction network, in which propositions Q1, Q2, …, Qn feed the clause nodes Φ1, …, Φk, which in turn feed a chain of binary AND gates A1, A2, … ending in X) • P(Qi = true) = 0.5 • P(Φi = true | Qi, Qj, Ql) = 1 iff Qi, Qj, Ql satisfy the clause Φi • A1, A2, … are simple binary AND gates
It is easy to check: • Polynomial number of variables • Each local probability table can be described by a small table (at most 8 parameters) • P(X = true) > 0 if and only if there exists a satisfying assignment to Q1,…,Qn • Conclusion: a polynomial reduction from 3-SAT
Note: this construction also shows that computing P(X = t) is harder than NP • 2^n · P(X = t) is the number of satisfying assignments of φ • Thus, computing P(X = t) is #P-hard
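As a sanity check on this identity (not part of the original slides), 2^n · P(X = t) can be verified by brute force on small instances, since each Qi is an independent fair coin and X is the deterministic AND of the clause nodes. The clause encoding below is a hypothetical example.

from itertools import product

# A hypothetical 3-SAT instance over q1..q4: each clause is a list of
# (variable index, polarity) literals, e.g. (0, True) means q1, (2, False) means not q3.
clauses = [[(0, True), (1, False), (2, True)],
           [(1, True), (2, False), (3, True)]]
n = 4

def satisfies(assignment, clause):
    # A clause is satisfied if at least one literal agrees with the assignment.
    return any(assignment[var] == polarity for var, polarity in clause)

# P(X = t) is the fraction of the 2^n equally likely assignments satisfying every clause.
num_sat = sum(all(satisfies(a, c) for c in clauses)
              for a in product([False, True], repeat=n))
p_x_true = num_sat / 2 ** n
print(num_sat, p_x_true)   # 2^n * P(X = t) equals the count of satisfying assignments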
Hardness - Notes • We used deterministic relations in our construction • The same construction works if we use (1-ε, ε) instead of (1, 0) in each gate, for any ε < 0.5 • Hardness does not mean we cannot solve inference • It implies that we cannot find a general procedure that works efficiently for all networks • For particular families of networks, we can have provably efficient procedures • We have seen such families in the course: HMMs, evolutionary trees
Approximation • Until now, we examined exact computation • In many applications, an approximation is sufficient • Example: P(X = x|e) = 0.3183098861838 • Maybe P(X = x|e) ≈ 0.3 is a good enough approximation • e.g., we take action only if P(X = x|e) > 0.5 • Can we find good approximation algorithms?
Types of Approximations Absolute error • An estimate q of P(X = x | e) has absolute error ε if P(X = x|e) - ε ≤ q ≤ P(X = x|e) + ε, equivalently q - ε ≤ P(X = x|e) ≤ q + ε • Absolute error is not always what we want: • If P(X = x | e) = 0.0001, then an absolute error of 0.001 is unacceptable • If P(X = x | e) = 0.3, then an absolute error of 0.001 is overly precise
Types of Approximations Relative error • An estimate q of P(X = x | e) has relative error ε if P(X = x|e)(1 - ε) ≤ q ≤ P(X = x|e)(1 + ε), equivalently q/(1 + ε) ≤ P(X = x|e) ≤ q/(1 - ε) • The sensitivity of the approximation depends on the actual value of the desired result
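A minimal sketch (with made-up numbers) contrasting the two notions of error:

def within_absolute(q, p, eps):
    # q approximates p with absolute error eps if |q - p| <= eps
    return abs(q - p) <= eps

def within_relative(q, p, eps):
    # q approximates p with relative error eps if p(1 - eps) <= q <= p(1 + eps)
    return p * (1 - eps) <= q <= p * (1 + eps)

p, q = 0.0001, 0.0009                  # hypothetical true value and estimate
print(within_absolute(q, p, 0.001))    # True: the absolute error 0.0008 is within 0.001
print(within_relative(q, p, 0.001))    # False: the estimate is 9x the true value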
Complexity • Exact inference is NP-hard • Is approximate inference any easier? • Recall the construction for exact inference: • Input: a 3-SAT problem φ • Output: a BN such that P(X = t) > 0 iff φ is satisfiable
Complexity: Relative Error • Suppose that q is an ε-relative error estimate of P(X = t) = 0 • Then 0 = P(X = t)(1 - ε) ≤ q ≤ P(X = t)(1 + ε) = 0, namely q = 0 • Thus, ε-relative error approximation and exact computation coincide for the value 0 • Theorem: Given ε, finding an ε-relative error approximation is NP-hard
Complexity: Absolute Error Theorem • If ε < 0.5, then finding an estimate of P(X = x|e) with absolute error ε is NP-hard
Proof • Recall our construction (figure: the reduction network with propositions Q1, Q2, …, Qn, clause nodes Φ1, …, Φk, and AND gates A1, A2, … ending in X)
Proof (cont.) • Suppose we can estimate with absolute error ε • Let p1 be an estimate of P(Q1 = t | X = t) • Assign q1 = t if p1 > 0.5, else q1 = f • Let p2 be an estimate of P(Q2 = t | X = t, Q1 = q1) • Assign q2 = t if p2 > 0.5, else q2 = f • … • Let pn be an estimate of P(Qn = t | X = t, Q1 = q1, …, Qn-1 = qn-1) • Assign qn = t if pn > 0.5, else qn = f
Proof (cont.) Claim: if φ is satisfiable, then q1,…,qn is a satisfying assignment • Suppose φ is satisfiable • By induction on i, there is a satisfying assignment with Q1 = q1, …, Qi = qi • Base case: • If Q1 = t in all satisfying assignments, then P(Q1 = t | X = t) = 1, so p1 ≥ 1 - ε > 0.5 and q1 = t • If Q1 = f in all satisfying assignments, then q1 = f • Otherwise, the statement holds for either choice of q1
Proof (cont.) • Induction step: • If Qi+1 = t in all satisfying assignments s.t. Q1 = q1, …, Qi = qi, then P(Qi+1 = t | X = t, Q1 = q1, …, Qi = qi) = 1, so pi+1 ≥ 1 - ε > 0.5 and qi+1 = t • If Qi+1 = f in all such satisfying assignments, then qi+1 = f • Otherwise, the statement holds for either choice of qi+1
Proof (cont.) • We can check whether q1,…,qn is a satisfying assignment efficiently (in linear time) • If it is, then φ is satisfiable • If it is not, then φ is not satisfiable • So suppose we have an approximation procedure with absolute error ε < 0.5 • We can then decide 3-SAT with n procedure calls: generate an assignment as in the proof, and check satisfiability of the resulting assignment in linear time • If a satisfying assignment exists, this procedure finds one; if none exists, it cannot find one • Thus, ε-absolute error approximation is NP-hard
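A sketch of the decoding loop from this proof, assuming a hypothetical oracle approx_prob(var, evidence) that returns an estimate of P(var = t | X = t, evidence) with absolute error ε < 0.5 (this oracle and its name are not from the original slides):

def decode_assignment(variables, approx_prob):
    # variables: the propositions Q1,…,Qn from the construction.
    # approx_prob: hypothetical approximate-inference oracle with absolute error < 0.5.
    evidence = {}
    for q in variables:
        p = approx_prob(q, evidence)     # estimate P(q = t | X = t, evidence so far)
        evidence[q] = (p > 0.5)          # round the estimate to choose a value for q
    return evidence                      # then check in linear time whether it satisfies phi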
When can we hope to approximate? Two situations: • “Peaked” distributions: improbable values are ignored • Highly stochastic distributions: “far” evidence is discarded (e.g., far markers in genetic linkage analysis)
Stochastic Simulation • Suppose we can sample instances <x1,…,xn> according to P(X1,…,Xn) • What is the probability that a random sample <x1,…,xn> satisfies e? • This is exactly P(e) • We can view each sample as tossing a biased coin with probability P(e) of “Heads”
Stochastic Sampling • Intuition: given a sufficient number of samples x[1],…,x[N], we can estimate P(e) by the fraction of samples that satisfy e, i.e., P(e) ≈ (1/N) · Σi 1{x[i] satisfies e}, where each indicator is 1 or 0 • The law of large numbers implies that as N grows, our estimate converges to P(e) with high probability
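A minimal sketch of the biased-coin view: each sample either satisfies e (heads) or not, and the fraction of heads estimates the unknown probability. The probability value below is made up for illustration.

import random

def estimate_p(p_true, num_samples):
    # Each sample is a biased coin flip: "does this sample satisfy e?"
    hits = sum(random.random() < p_true for _ in range(num_samples))
    return hits / num_samples

random.seed(0)
print(estimate_p(0.3, 100))      # a rough estimate from few samples
print(estimate_p(0.3, 100000))   # converges toward 0.3 as N grows (law of large numbers)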
Sampling a Bayesian Network • If P(X1,…,Xn) is represented by a Bayesian network, can we efficiently sample from it? • Yes: sample according to the structure of the network, sampling each variable given its already-sampled parents
Logic sampling: example (figure: an animation over the Burglary/Earthquake/Alarm/Radio/Call network with CPTs P(b) = 0.03, P(e) = 0.001, P(a | B, E) ∈ {0.98, 0.7, 0.4, 0.01}, P(c | A) ∈ {0.8, 0.05}, P(r | E) ∈ {0.3, 0.001}; each frame highlights the CPT entry used next, sampling B, then E, then A given the sampled B and E, then C given A, then R given E, yielding one complete sample)
Logic Sampling • Let X1, …, Xn be an ordering of the variables consistent with arc direction • for i = 1, …, n do • sample xi from P(Xi | pai) • (Note: since Pai ⊆ {X1,…,Xi-1}, we have already assigned values to the parents) • return x1, …, xn
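A minimal sketch of this procedure in Python for binary variables. The structure follows the Burglary/Earthquake/Alarm example above; the CPT numbers are taken from the slides where recoverable, with an assumed ordering of the entries for P(A | B, E).

import random

# Variables listed in an order consistent with arc direction; each entry maps an
# assignment to the variable's parents to P(variable = True | parents).
network = [
    ("B", [],         {(): 0.03}),
    ("E", [],         {(): 0.001}),
    ("A", ["B", "E"], {(True, True): 0.98, (True, False): 0.7,
                       (False, True): 0.4, (False, False): 0.01}),
    ("C", ["A"],      {(True,): 0.8, (False,): 0.05}),
    ("R", ["E"],      {(True,): 0.3, (False,): 0.001}),
]

def logic_sample(network):
    # Ancestral sampling: sample each variable given its already-sampled parents.
    sample = {}
    for var, parents, cpt in network:
        p_true = cpt[tuple(sample[p] for p in parents)]
        sample[var] = random.random() < p_true
    return sample

random.seed(0)
print(logic_sample(network))   # one complete instance x1,…,xn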
Logic Sampling • Sampling a complete instance is linear in the number of variables • Regardless of the structure of the network • However, if P(e) is small, we need many samples to get a decent estimate
Can we sample from P(X1,…,Xn | e)? • If the evidence is at the roots of the network, we can fix the roots to their observed values and sample as before • If the evidence is at the leaves of the network, we have a problem: our sampling method proceeds according to the order of the nodes in the graph, so we must retain only those samples that match e, and matching e might be a rare event
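A sketch of this rejection approach, reusing the network and logic_sample from the sketch above; the query and evidence below are chosen for illustration.

def rejection_estimate(network, query_var, evidence, num_samples):
    # Estimate P(query_var = True | evidence) by discarding samples that do not match e.
    kept = hits = 0
    for _ in range(num_samples):
        sample = logic_sample(network)
        if all(sample[var] == val for var, val in evidence.items()):
            kept += 1
            hits += sample[query_var]
    # When P(evidence) is small, almost every sample is thrown away.
    return hits / kept if kept else None

print(rejection_estimate(network, "B", {"C": True}, 10000))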
Likelihood Weighting (example: a two-node network X → Y) • Can we ensure that all of our samples are used? • One wrong (but fixable) approach: • When we need to sample a variable that is assigned a value by e, use that observed value • For example: we know Y = 1 • Sample X from P(X) • Then set Y = 1 • This is NOT a sample from P(X, Y | Y = 1)!
Likelihood Weighting • Problem: these samples of X are from P(X) • Solution: penalize samples in which P(Y = 1 | X) is small • We now sample as follows: • Let x[i] be a sample from P(X) • Let w[i] be P(Y = 1 | X = x[i])
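A minimal sketch of this weighting for the two-node network X → Y with Y = 1 observed; the prior and CPT values below are made up for illustration.

import random

p_x = 0.2                              # hypothetical prior P(X = 1)
p_y_given_x = {1: 0.9, 0: 0.1}         # hypothetical CPT P(Y = 1 | X)

def weighted_samples(num_samples):
    samples = []
    for _ in range(num_samples):
        x = int(random.random() < p_x)   # sample X from its prior P(X)
        w = p_y_given_x[x]               # weight w[i] = P(Y = 1 | X = x[i]); Y is clamped to 1
        samples.append((x, w))
    return samples

random.seed(0)
samples = weighted_samples(100000)
# Weighted estimate of P(X = 1 | Y = 1); the exact value here is 0.18 / 0.26 ≈ 0.692.
print(sum(w for x, w in samples if x == 1) / sum(w for x, w in samples))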
Likelihood weighting: example (figure: an animation over the same Burglary/Earthquake/Alarm/Radio/Call network, now with evidence on Alarm and Radio; unobserved variables are sampled from their CPTs as before, while the observed variables are clamped to their evidence values and the sample's weight is multiplied by the corresponding CPT entries, here accumulating a weight of 0.6 · 0.3)
Likelihood Weighting • Let X1, …, Xn be an ordering of the variables consistent with arc direction • w = 1 • for i = 1, …, n do • if Xi = xi has been observed • w ← w · P(Xi = xi | pai) • else • sample xi from P(Xi | pai) • return x1, …, xn and w
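A sketch of the full procedure, reusing the network representation from the logic-sampling sketch above; the query and evidence in the last line are illustrative.

import random

def likelihood_weighted_sample(network, evidence):
    # Returns one sample together with its weight.
    sample, weight = {}, 1.0
    for var, parents, cpt in network:
        p_true = cpt[tuple(sample[p] for p in parents)]
        if var in evidence:
            sample[var] = evidence[var]                         # clamp observed variables
            weight *= p_true if evidence[var] else 1 - p_true   # w <- w * P(xi | pai)
        else:
            sample[var] = random.random() < p_true              # sample unobserved variables
    return sample, weight

def lw_estimate(network, query_var, evidence, num_samples):
    # Weighted fraction of samples in which the query variable is True.
    num = den = 0.0
    for _ in range(num_samples):
        sample, w = likelihood_weighted_sample(network, evidence)
        den += w
        if sample[query_var]:
            num += w
    return num / den

random.seed(0)
print(lw_estimate(network, "B", {"A": True, "R": True}, 10000))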
Likelihood Weighting • Why does this make sense? • When N is large, we expect about N · P(X = x) samples with x[i] = x • Thus, the total weight of these samples is about N · P(X = x) · P(Y = 1 | X = x) = N · P(X = x, Y = 1), while the total weight of all samples is about N · P(Y = 1) • So the normalized weighted count Σi w[i] · 1{x[i] = x} / Σi w[i] converges to P(X = x | Y = 1)
Likelihood Weighting What can we say about the quality of the answer? • Intuitively, the weight of a sample reflects its probability given the evidence; we need to collect enough weight for the samples to provide an accurate answer • Another factor is the “extremeness” of the CPDs • Theorem (Dagum & Luby, AIJ 1993): If P(Xi | Pai) ∈ [l, u] for all local probability tables, and the number of samples N is sufficiently large (the required N grows with 1/ε², ln(1/δ), and the extremeness of the bounds l and u), then with probability 1 - δ the estimate is an ε-relative error approximation