Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4). Lecture Overview. Exact Inference: Variable Elimination Factors Algorithm Approximate Inference: sampling methods Forward (prior) sampling Rejection sampling Likelihood weighting. Bayesian Networks: Types of Inference.
Computer Science CPSC 502 Lecture 9 (up-to Ch. 6.4.2.4)
Lecture Overview • Exact Inference: Variable Elimination • Factors • Algorithm • Approximate Inference: sampling methods • Forward (prior) sampling • Rejection sampling • Likelihood weighting
Bayesian Networks: Types of Inference
Network: Fire → Alarm → Leaving, with Smoking at Sensor → Alarm in the intercausal case.
• Diagnostic: people are leaving (L=t). P(F|L=t)=?
• Predictive: fire happens (F=t). P(L|F=t)=?
• Intercausal: alarm goes off (P(a) = 1.0) and a person smokes next to the sensor (S=t). P(F|A=t,T=t)=?
• Mixed: there is no fire (F=f) and people are leaving (L=t). P(A|F=f,L=t)=?
We will use the same reasoning procedure for all of these types.
Inference
• Y is the query variable
• E1=e1, …, Ej=ej are the observed variables (with their values)
• Z1, …, Zk are the remaining variables
• By the definition of conditional probability:
  P(Y=yi | E1=e1, …, Ej=ej) = P(Y=yi, E1=e1, …, Ej=ej) / P(E1=e1, …, Ej=ej)
• We need to compute the numerator for each value yi of Y. To do so, we marginalize over all the variables Z1, …, Zk not involved in the query:
  P(Y=yi, E1=e1, …, Ej=ej) = ∑_{Z1} … ∑_{Zk} P(Y=yi, Z1, …, Zk, E1=e1, …, Ej=ej)
• To compute the denominator, marginalize over Y:
  P(E1=e1, …, Ej=ej) = ∑_{y'} P(Y=y', E1=e1, …, Ej=ej)
  It has the same value for every yi: it is a normalization constant ensuring that ∑_{yi} P(Y=yi | E1=e1, …, Ej=ej) = 1
• Variable Elimination is an algorithm that efficiently performs these operations by casting them as operations between factors
Factors
• A factor is a function from a tuple of random variables to the real numbers R
• We write a factor on variables X1, …, Xj as f(X1, …, Xj)
• A factor denotes one or more (possibly partial) distributions over the given tuple of variables, e.g.,
  • P(X1, X2) is a factor f(X1, X2): a distribution
  • P(Z | X,Y) is a factor f(Z,X,Y): a set of distributions, one for each combination of values for X and Y
  • P(Z=f | X,Y) is a factor f(X,Y): a set of partial distributions
• Note: factors do not have to sum to one
Operation 1: assigning a variable
• We can make new factors out of an existing factor
• Our first operation: we can assign some or all of the variables of a factor
• What is the result of assigning X=t? f(X=t,Y,Z) = f(X,Y,Z)_{X=t}, which is a factor of Y,Z
More examples of assignment
• f(X=t,Y,Z): factor of Y,Z
• f(X=t,Y,Z=f): factor of Y
• When every variable of a factor is assigned, the result is a single number
Operation 2: Summing out a variable
• Our second operation on factors: we can marginalize out (or sum out) a variable
• Exactly as before. The only difference: factors don't have to sum to 1
• Marginalizing out a variable X from a factor f(X1, …, Xn) yields a new factor defined on {X1, …, Xn} \ {X}
• Example: summing B out of f3(A,B,C) gives (∑_B f3)(A,C)
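Operation 3: multiplying factors
• f1(A,B) × f2(B,C) is a factor on A,B,C: (f1 × f2)(A,B,C) = f1(A,B) × f2(B,C)
• Each entry of the product is the product of the entries of f1 and f2 that agree with it on the shared variable B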
Recap: Factors and Operations on Them
• If we assign variable A=a in factor f7(A,B), what is the correct form for the resulting factor?
  • f(B). When we assign variable A we remove it from the factor's domain
• If we marginalize variable A out from factor f7(A,B), what is the correct form for the resulting factor?
  • f(B). When we marginalize out variable A we remove it from the factor's domain
• If we multiply factors f4(X,Y) and f6(Z,Y), what is the correct form for the resulting factor?
  • f(X,Y,Z). When multiplying factors, the resulting factor's domain is the union of the multiplicands' domains
• What is the correct form for ∑_B f5(A,B) × f6(B,C)?
  • As usual, product before sum: ∑_B ( f5(A,B) × f6(B,C) )
  • Result of multiplication: f(A,B,C). Then marginalize out B: f'(A,C)
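The three operations recapped above can be sketched in a few lines of Python. This is only an illustrative sketch, not the course's implementation: the `Factor` class, its method names, the restriction to True/False variables, and the numbers in `f5` and `f6` are all assumptions made for the example.

```python
from itertools import product

class Factor:
    """A factor: a table mapping each assignment of its variables to a number.
    Assumption: every variable is binary (True/False) and the table is complete."""

    def __init__(self, variables, table):
        self.variables = tuple(variables)   # e.g. ("A", "B")
        self.table = dict(table)            # e.g. {(True, False): 0.9, ...}

    def assign(self, var, value):
        """Operation 1: fix var=value; var leaves the factor's domain."""
        i = self.variables.index(var)
        rest = self.variables[:i] + self.variables[i + 1:]
        table = {k[:i] + k[i + 1:]: v
                 for k, v in self.table.items() if k[i] == value}
        return Factor(rest, table)

    def sum_out(self, var):
        """Operation 2: marginalize var out; var leaves the factor's domain."""
        i = self.variables.index(var)
        rest = self.variables[:i] + self.variables[i + 1:]
        table = {}
        for k, v in self.table.items():
            key = k[:i] + k[i + 1:]
            table[key] = table.get(key, 0.0) + v
        return Factor(rest, table)

    def multiply(self, other):
        """Operation 3: the product's domain is the union of both domains."""
        variables = self.variables + tuple(
            v for v in other.variables if v not in self.variables)
        table = {}
        for values in product([True, False], repeat=len(variables)):
            row = dict(zip(variables, values))
            k1 = tuple(row[v] for v in self.variables)
            k2 = tuple(row[v] for v in other.variables)
            table[values] = self.table[k1] * other.table[k2]
        return Factor(variables, table)

# Hypothetical numbers, chosen only to exercise the operations.
f5 = Factor(("A", "B"), {(True, True): 0.1, (True, False): 0.9,
                         (False, True): 0.2, (False, False): 0.8})
f6 = Factor(("B", "C"), {(True, True): 0.3, (True, False): 0.7,
                         (False, True): 0.6, (False, False): 0.4})

# Product before sum: multiply first, then marginalize out B.
result = f5.multiply(f6).sum_out("B")
print(result.variables)   # the domain is (A, C), as in the recap
```

Here `f5.assign("A", True)` yields a factor on B only, matching the first recap answer.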
Remember our goal
• Y: subset of variables that is queried
• E: subset of variables that are observed, E = e
• Z1, …, Zk: remaining variables in the JPD
• By the definition of conditional probability: P(Y=yi | E=e) = P(Y=yi, E=e) / P(E=e)
• We need to compute the numerator for each value yi of Y, marginalizing over all the variables Z1, …, Zk not involved in the query:
  P(Y=yi, E=e) = ∑_{Z1} … ∑_{Zk} P(Y=yi, Z1, …, Zk, E=e)
• To compute the denominator, marginalize over Y: P(E=e) = ∑_{y'} P(Y=y', E=e)
  It has the same value for every yi: a normalization constant ensuring that ∑_{yi} P(Y=yi | E=e) = 1
• All we need to compute is the numerator: the joint probability of the query variable(s) and the evidence!
• Variable Elimination is an algorithm that efficiently performs this operation by casting it as operations between factors
Lecture Overview • Exact Inference: Variable Elimination • Factors • Algorithm • Approximate Inference: sampling methods • Forward (prior) sampling • Rejection sampling • Likelihood weighting
Variable Elimination: Intro (1)
• We can express the joint probability as a factor f(Y, E1, …, Ej, Z1, …, Zk), where E1, …, Ej are observed and Z1, …, Zk are the other variables not involved in the query
• We can compute P(Y, E1=e1, …, Ej=ej) by
  • Assigning E1=e1, …, Ej=ej
  • Marginalizing out variables Z1, …, Zk, one at a time
    • the order in which we do this is called our elimination ordering
• Are we done? No, this still represents the whole JPD (as a single factor)! We need to exploit the compactness of Bayesian networks
Variable Elimination: Intro (2)
• Recall the JPD of a Bayesian network: P(X1, …, Xn) = ∏_{i=1..n} P(Xi | pa(Xi)), where pa(Xi) are the parents of Xi
• So we can express the joint factor as a product of factors, one for each conditional probability
Computing sums of products
• Inference in Bayesian networks thus reduces to computing sums of products
• To compute them efficiently, factor out of each sum the terms that don't involve the summed variable Zk, e.g.: ∑_{Zk} f1(Y) × f2(Zk,Y) = f1(Y) × ∑_{Zk} f2(Zk,Y)
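Decompose sum of products
General case:
• For Z1: split the product into the factors that do not contain Z1 and the factors that contain Z1. Move the first group outside ∑_{Z1}, so the sum only ranges over the product of the factors that contain Z1
• For Z2: among what remains, split again into the factors that contain Z2 (including, if applicable, the result of the sum over Z1) and the factors that contain neither Z2 nor Z1
• Etc.: continue given a predefined simplification ordering of the variables, the variable elimination ordering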
The variable elimination algorithm
To compute P(Y=yi | E = e):
1. Construct a factor for each conditional probability.
2. For each factor, assign the observed variables E to their observed values.
3. Given an elimination ordering, decompose the sum of products.
4. Sum out all variables Zi not involved in the query.
5. Multiply the remaining factors (which only involve Y).
6. Normalize by dividing the resulting factor f(Y) by ∑_Y f(Y).
See the algorithm VE_BN in the P&M text, Section 6.4.1, Figure 6.8, p. 254.
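As a rough illustration of the six steps, here is a minimal Python sketch. It is not the VE_BN pseudocode from the text: the tuple-based factor representation, the function names, the restriction to True/False variables, and the three-node chain A → B → C with its numbers are all assumptions made for this example.

```python
from itertools import product

def make_factor(variables, fn):
    """Tabulate fn over all True/False assignments of the given variables."""
    return (tuple(variables),
            {vals: fn(*vals)
             for vals in product([True, False], repeat=len(variables))})

def assign(factor, var, value):
    variables, table = factor
    if var not in variables:
        return factor
    i = variables.index(var)
    return (variables[:i] + variables[i + 1:],
            {k[:i] + k[i + 1:]: v for k, v in table.items() if k[i] == value})

def multiply(f, g):
    (fv, ft), (gv, gt) = f, g
    variables = fv + tuple(v for v in gv if v not in fv)
    table = {}
    for vals in product([True, False], repeat=len(variables)):
        row = dict(zip(variables, vals))
        table[vals] = (ft[tuple(row[v] for v in fv)]
                       * gt[tuple(row[v] for v in gv)])
    return (variables, table)

def sum_out(factor, var):
    variables, table = factor
    i = variables.index(var)
    out = {}
    for k, v in table.items():
        key = k[:i] + k[i + 1:]
        out[key] = out.get(key, 0.0) + v
    return (variables[:i] + variables[i + 1:], out)

def variable_elimination(factors, evidence, ordering):
    # Step 2: assign the observed variables their observed values.
    for var, value in evidence.items():
        factors = [assign(f, var, value) for f in factors]
    # Steps 3-4: for each variable, multiply only the factors that
    # mention it, then sum it out of that product.
    for var in ordering:
        touching = [f for f in factors if var in f[0]]
        if not touching:
            continue
        rest = [f for f in factors if var not in f[0]]
        prod = touching[0]
        for f in touching[1:]:
            prod = multiply(prod, f)
        factors = rest + [sum_out(prod, var)]
    # Step 5: multiply the remaining factors (they only involve the query).
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f)
    # Step 6: normalize.
    total = sum(result[1].values())
    return (result[0], {k: v / total for k, v in result[1].items()})

def bern(p, x):
    """Probability of x when P(x=True) = p."""
    return p if x else 1.0 - p

# Step 1: one factor per conditional probability (hypothetical numbers).
factors = [
    make_factor(("A",), lambda a: bern(0.3, a)),                          # P(A)
    make_factor(("B", "A"), lambda b, a: bern(0.7 if a else 0.1, b)),     # P(B|A)
    make_factor(("C", "B"), lambda c, b: bern(0.9 if b else 0.2, c)),     # P(C|B)
]
posterior = variable_elimination(factors, evidence={"C": True}, ordering=["B"])
print(posterior)   # P(A | C=True) as a normalized factor on A
```

With these made-up numbers, the unnormalized values are 0.3 × 0.69 = 0.207 for A=True and 0.7 × 0.27 = 0.189 for A=False, so the posterior for A=True is 0.207 / 0.396.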
Variable elimination example
Compute P(G | H=h1).
P(G,H) = ∑_{A,B,C,D,E,F,I} P(A,B,C,D,E,F,G,H,I)
       = ∑_{A,B,C,D,E,F,I} P(A) P(B|A) P(C) P(D|B,C) P(E|C) P(F|D) P(G|F,E) P(H|G) P(I|G)
Step 1: Construct a factor for each conditional probability
Compute P(G | H=h1).
P(G,H) = ∑_{A,B,C,D,E,F,I} P(A) P(B|A) P(C) P(D|B,C) P(E|C) P(F|D) P(G|F,E) P(H|G) P(I|G)
P(G,H) = ∑_{A,B,C,D,E,F,I} f0(A) f1(B,A) f2(C) f3(D,B,C) f4(E,C) f5(F,D) f6(G,F,E) f7(H,G) f8(I,G)
Factors: f0(A), f1(B,A), f2(C), f3(D,B,C), f4(E,C), f5(F,D), f6(G,F,E), f7(H,G), f8(I,G)
Step 2: Assign to observed variables their observed values
Compute P(G | H=h1).
Previous state: P(G,H) = ∑_{A,B,C,D,E,F,I} f0(A) f1(B,A) f2(C) f3(D,B,C) f4(E,C) f5(F,D) f6(G,F,E) f7(H,G) f8(I,G)
Observe H=h1: P(G,H=h1) = ∑_{A,B,C,D,E,F,I} f0(A) f1(B,A) f2(C) f3(D,B,C) f4(E,C) f5(F,D) f6(G,F,E) f9(G) f8(I,G)
Factors: f0(A), f1(B,A), f2(C), f3(D,B,C), f4(E,C), f5(F,D), f6(G,F,E), f7(H,G), f8(I,G), f9(G), where f9(G) is f7(H,G) with H=h1
Step 3: Decompose sum of products
Compute P(G | H=h1).
Previous state: P(G,H=h1) = ∑_{A,B,C,D,E,F,I} f0(A) f1(B,A) f2(C) f3(D,B,C) f4(E,C) f5(F,D) f6(G,F,E) f9(G) f8(I,G)
Elimination ordering A, C, E, I, B, D, F:
P(G,H=h1) = f9(G) ∑_F ∑_D f5(F,D) ∑_B ∑_I f8(I,G) ∑_E f6(G,F,E) ∑_C f2(C) f3(D,B,C) f4(E,C) ∑_A f0(A) f1(B,A)
Factors: f0(A), f1(B,A), f2(C), f3(D,B,C), f4(E,C), f5(F,D), f6(G,F,E), f7(H,G), f8(I,G), f9(G)
Step 4: Sum out non-query variables (one at a time)
Compute P(G | H=h1). Elimination order: A, C, E, I, B, D, F
Previous state: P(G,H=h1) = f9(G) ∑_F ∑_D f5(F,D) ∑_B ∑_I f8(I,G) ∑_E f6(G,F,E) ∑_C f2(C) f3(D,B,C) f4(E,C) ∑_A f0(A) f1(B,A)
Eliminate A: perform the product and sum out A in f0(A) f1(B,A), giving f10(B):
P(G,H=h1) = f9(G) ∑_F ∑_D f5(F,D) ∑_B f10(B) ∑_I f8(I,G) ∑_E f6(G,F,E) ∑_C f2(C) f3(D,B,C) f4(E,C)
• f10(B) does not depend on C, E, or I, so we can push it outside of those sums.
Factors: f0(A), f1(B,A), f2(C), f3(D,B,C), f4(E,C), f5(F,D), f6(G,F,E), f7(H,G), f8(I,G), f9(G), f10(B)
Step 4: Sum out non-query variables (one at a time)
Compute P(G | H=h1). Elimination order: A, C, E, I, B, D, F
Previous state: P(G,H=h1) = f9(G) ∑_F ∑_D f5(F,D) ∑_B f10(B) ∑_I f8(I,G) ∑_E f6(G,F,E) ∑_C f2(C) f3(D,B,C) f4(E,C)
Eliminate C: perform the product and sum out C in f2(C) f3(D,B,C) f4(E,C), giving f11(B,D,E):
P(G,H=h1) = f9(G) ∑_F ∑_D f5(F,D) ∑_B f10(B) ∑_I f8(I,G) ∑_E f6(G,F,E) f11(B,D,E)
Factors: f0(A), f1(B,A), f2(C), f3(D,B,C), f4(E,C), f5(F,D), f6(G,F,E), f7(H,G), f8(I,G), f9(G), f10(B), f11(B,D,E)
Step 4: Sum out non-query variables (one at a time)
Compute P(G | H=h1). Elimination order: A, C, E, I, B, D, F
Previous state: P(G,H=h1) = f9(G) ∑_F ∑_D f5(F,D) ∑_B f10(B) ∑_I f8(I,G) ∑_E f6(G,F,E) f11(B,D,E)
Eliminate E: perform the product and sum out E in f6(G,F,E) f11(B,D,E), giving f12(B,D,F,G):
P(G,H=h1) = f9(G) ∑_F ∑_D f5(F,D) ∑_B f10(B) f12(B,D,F,G) ∑_I f8(I,G)
Factors: f0(A), f1(B,A), f2(C), f3(D,B,C), f4(E,C), f5(F,D), f6(G,F,E), f7(H,G), f8(I,G), f9(G), f10(B), f11(B,D,E), f12(B,D,F,G)
Step 4: Sum out non-query variables (one at a time)
Compute P(G | H=h1). Elimination order: A, C, E, I, B, D, F
Previous state: P(G,H=h1) = f9(G) ∑_F ∑_D f5(F,D) ∑_B f10(B) f12(B,D,F,G) ∑_I f8(I,G)
Eliminate I: sum out I in f8(I,G), giving f13(G):
P(G,H=h1) = f9(G) f13(G) ∑_F ∑_D f5(F,D) ∑_B f10(B) f12(B,D,F,G)
Factors: f0(A), f1(B,A), f2(C), f3(D,B,C), f4(E,C), f5(F,D), f6(G,F,E), f7(H,G), f8(I,G), f9(G), f10(B), f11(B,D,E), f12(B,D,F,G), f13(G)
Step 4: Sum out non-query variables (one at a time)
Compute P(G | H=h1). Elimination order: A, C, E, I, B, D, F
Previous state: P(G,H=h1) = f9(G) f13(G) ∑_F ∑_D f5(F,D) ∑_B f10(B) f12(B,D,F,G)
Eliminate B: perform the product and sum out B in f10(B) f12(B,D,F,G), giving f14(D,F,G):
P(G,H=h1) = f9(G) f13(G) ∑_F ∑_D f5(F,D) f14(D,F,G)
Factors: f0(A), f1(B,A), f2(C), f3(D,B,C), f4(E,C), f5(F,D), f6(G,F,E), f7(H,G), f8(I,G), f9(G), f10(B), f11(B,D,E), f12(B,D,F,G), f13(G), f14(D,F,G)
Step 4: Sum out non-query variables (one at a time)
Compute P(G | H=h1). Elimination order: A, C, E, I, B, D, F
Previous state: P(G,H=h1) = f9(G) f13(G) ∑_F ∑_D f5(F,D) f14(D,F,G)
Eliminate D: perform the product and sum out D in f5(F,D) f14(D,F,G), giving f15(F,G):
P(G,H=h1) = f9(G) f13(G) ∑_F f15(F,G)
Factors: f0(A), f1(B,A), f2(C), f3(D,B,C), f4(E,C), f5(F,D), f6(G,F,E), f7(H,G), f8(I,G), f9(G), f10(B), f11(B,D,E), f12(B,D,F,G), f13(G), f14(D,F,G), f15(F,G)
Step 4: Sum out non-query variables (one at a time)
Compute P(G | H=h1). Elimination order: A, C, E, I, B, D, F
Previous state: P(G,H=h1) = f9(G) f13(G) ∑_F f15(F,G)
Eliminate F: sum out F in f15(F,G), giving f16(G):
P(G,H=h1) = f9(G) f13(G) f16(G)
Factors: f0(A), f1(B,A), f2(C), f3(D,B,C), f4(E,C), f5(F,D), f6(G,F,E), f7(H,G), f8(I,G), f9(G), f10(B), f11(B,D,E), f12(B,D,F,G), f13(G), f14(D,F,G), f15(F,G), f16(G)
Step 5: Multiply remaining factors
Compute P(G | H=h1). Elimination order: A, C, E, I, B, D, F
Previous state: P(G,H=h1) = f9(G) f13(G) f16(G)
Multiply remaining factors (all in G): P(G,H=h1) = f17(G)
Factors: f0(A), f1(B,A), f2(C), f3(D,B,C), f4(E,C), f5(F,D), f6(G,F,E), f7(H,G), f8(I,G), f9(G), f10(B), f11(B,D,E), f12(B,D,F,G), f13(G), f14(D,F,G), f15(F,G), f16(G), f17(G)
Step 6: Normalize
Compute P(G | H=h1).
P(G=g | H=h1) = f17(g) / ∑_{g'} f17(g')
Factors: f0(A), f1(B,A), f2(C), f3(D,B,C), f4(E,C), f5(F,D), f6(G,F,E), f7(H,G), f8(I,G), f9(G), f10(B), f11(B,D,E), f12(B,D,F,G), f13(G), f14(D,F,G), f15(F,G), f16(G), f17(G)
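The bookkeeping in this example (which variables each new factor ends up with) can be checked mechanically. The sketch below tracks only factor scopes, not numbers. The function name `trace_scopes` and the use of one-letter strings as scopes are choices made for this illustration.

```python
def trace_scopes(scopes, ordering):
    """Return, for each eliminated variable, the scope of the new factor created."""
    scopes = [frozenset(s) for s in scopes]
    created = []
    for var in ordering:
        # Factors that mention var are multiplied together, then var is summed out.
        touching = [s for s in scopes if var in s]
        new_scope = frozenset().union(*touching) - {var}
        scopes = [s for s in scopes if var not in s] + [new_scope]
        created.append((var, tuple(sorted(new_scope))))
    return created

# Scopes after Step 2: H is observed, so f7(H,G) has become f9(G).
scopes = ["A", "BA", "C", "DBC", "EC", "FD", "GFE", "G", "IG"]
for var, scope in trace_scopes(scopes, "ACEIBDF"):
    print(f"eliminate {var}: new factor over {scope}")
```

The printed scopes correspond to f10(B), f11(B,D,E), f12(B,D,F,G), f13(G), f14(D,F,G), f15(F,G) and f16(G) in the slides.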
VE and conditional independence
• So far, we haven't used conditional independence!
• Before running VE, we can prune all variables Z that are conditionally independent of the query Y given evidence E: Z ╨ Y | E
  • They cannot change the belief over Y given E!
• Example: which variables can we prune for the query P(G=g | C=c1, F=f1, H=h1)?
  • A, B, and D. Both paths from these nodes to G are blocked:
    • F is an observed node in a chain structure
    • C is an observed common parent
  • Thus, we only need to consider the subnetwork without A, B, and D
Variable elimination: pruning
• We can also prune unobserved leaf nodes
  • Since they are unobserved and not predecessors of the query nodes, they cannot influence the posterior probability of the query nodes
• Thus, if the query is P(G=g | C=c1, F=f1, H=h1), we only need to consider the subnetwork that remains after this pruning
One last trick • We can also prune unobserved leaf nodes • And we can do so recursively • E.g., which nodes can we prune if the query is P(A)? • Recursively prune unobserved leaf nodes: • we can prune all nodes other than A !
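Recursive leaf pruning is easy to sketch in code. The example below encodes the parent structure of the nine-variable network from the variable elimination example (read off its conditional probabilities P(B|A), P(D|B,C), and so on). The function name `prune_leaves` and the dictionary encoding are assumptions made for this illustration, and the sketch covers leaf pruning only, not pruning by conditional independence.

```python
def prune_leaves(parents, keep):
    """Repeatedly remove leaf nodes that are neither queried nor observed.

    parents: dict mapping each node to the list of its parents
    keep:    set of query and evidence nodes, which are never pruned
    """
    nodes = set(parents)
    while True:
        # A node is a leaf if no remaining node has it as a parent.
        has_child = {p for n in nodes for p in parents[n] if p in nodes}
        removable = {n for n in nodes if n not in has_child and n not in keep}
        if not removable:
            return nodes
        nodes -= removable

parents = {"A": [], "B": ["A"], "C": [], "D": ["B", "C"], "E": ["C"],
           "F": ["D"], "G": ["F", "E"], "H": ["G"], "I": ["G"]}

print(prune_leaves(parents, keep={"A"}))                  # query P(A)
print(prune_leaves(parents, keep={"G", "C", "F", "H"}))   # query P(G | C, F, H)
```

For the query P(A), everything except A is pruned. For P(G | C, F, H), only the unobserved leaf I is removed by this step.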
Complexity of Variable Elimination (VE)
• A factor over n binary variables has to store 2^n numbers
  • The initial factors are typically quite small (variables typically have only a few parents in Bayesian networks)
  • But variable elimination constructs larger factors by multiplying factors together
• The complexity of VE is exponential in the maximum number of variables in any factor during its execution
  • This number is called the treewidth of a graph (along an ordering)
  • Elimination ordering influences treewidth
• Finding the best ordering, i.e., the one that generates the minimum treewidth, is NP-hard
  • Heuristics work well in practice (e.g., least connected variables first)
• Even with the best ordering, inference is sometimes infeasible
  • In those cases, we need approximate inference
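To see how much the ordering matters, the following sketch counts the largest number of variables in any product formed during elimination. The network is hypothetical (a single parent X with four children Y1..Y4), and `max_width` is a name introduced here for illustration.

```python
def max_width(scopes, ordering):
    """Largest number of variables in any product formed while eliminating in order."""
    scopes = [frozenset(s) for s in scopes]
    widest = max(len(s) for s in scopes)
    for var in ordering:
        touching = [s for s in scopes if var in s]
        merged = frozenset().union(*touching)      # scope of the product
        widest = max(widest, len(merged))
        scopes = [s for s in scopes if var not in s] + [merged - {var}]
    return widest

# Hypothetical network: factors P(X) and P(Yi | X) for i = 1..4.
scopes = [{"X"}] + [{"X", f"Y{i}"} for i in range(1, 5)]
leaves_first = ["Y1", "Y2", "Y3", "Y4", "X"]
root_first = ["X", "Y1", "Y2", "Y3", "Y4"]

print(max_width(scopes, leaves_first))   # children first: products stay small
print(max_width(scopes, root_first))     # X first: one product over all five variables
```

With binary variables, the second ordering needs a table of 2^5 = 32 numbers where the first never needs more than 2^2 = 4.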
VE in AIspace
• To see how variable elimination works in the AIspace applet
  • Select "Network options -> Query Models > verbose"
  • Compare what happens when you select "Prune Irrelevant variables" or not in the VE window that pops up when you query a node
  • Try different heuristics for elimination ordering
Lecture Overview • Exact Inference: Variable Elimination • Factors • Algorithm • Approximate Inference: sampling methods • Forward (prior) sampling • Rejection sampling • Likelihood weighting
Sampling: What is it?
• Problem: how to estimate probability distributions that are hard to compute via exact methods
• Idea: estimate probabilities from sample data (samples) of the (unknown) probability distribution
  • Use the frequency of each event in the sample data to approximate its probability
  • Frequencies are good approximations only if based on large samples
  • But such samples are often not easy to obtain from real-world observations
• How do we get the samples?
We use Sampling
• Sampling is a process to obtain samples adequate to estimate an unknown probability
• The samples are generated from a known probability distribution P(x1), …, P(xn)
Generating Samples from a Distribution
• For a random variable X with
  • values {x1, …, xk}
  • probability distribution P(X) = {P(x1), …, P(xk)}
• Partition the interval (0, 1] into k intervals pi, one for each xi, with length P(xi)
• To generate one sample
  • Randomly generate a value y in (0, 1] (i.e., generate a value from a uniform distribution over (0, 1])
  • Select the value of the sample based on the interval pi that includes y
• From probability theory: P(y ∈ pi) = Length(pi) = P(xi)
Example
• Consider a random variable Lecture with
  • 3 values <good, bad, soso>
  • with probabilities 0.7, 0.1 and 0.2 respectively
• We can have a sampler for this distribution by:
  • Using a random number generator that outputs numbers over (0, 1]
  • Partitioning (0, 1] into 3 intervals corresponding to the probabilities of the three Lecture values: (0, 0.7], (0.7, 0.8] and (0.8, 1]
  • To obtain a sample, generating a random number n and picking the value for Lecture based on which interval n falls into:
    • P(0 < n ≤ 0.7) = 0.7 = P(Lecture = good)
    • P(0.7 < n ≤ 0.8) = 0.1 = P(Lecture = bad)
    • P(0.8 < n ≤ 1) = 0.2 = P(Lecture = soso)
Example
• P(0 < n ≤ 0.7) = 0.7 = P(Lecture = good)
• P(0.7 < n ≤ 0.8) = 0.1 = P(Lecture = bad)
• P(0.8 < n ≤ 1) = 0.2 = P(Lecture = soso)

Random n   Sample
0.3        good
0.1        good
0.73       bad
0.87       soso
0.2        good
0.5        good
0.9        soso

• If we generate enough samples, the frequencies of the three values will get close to their probabilities
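The interval lookup in this example can be sketched as follows. The function name `pick` is an assumption for illustration, and the random numbers are the seven values from the table rather than fresh draws.

```python
from bisect import bisect_left
from itertools import accumulate

def pick(values, probs, y):
    """Return the value whose interval (lower, upper] contains y, for y in (0, 1]."""
    uppers = list(accumulate(probs))   # upper ends of the intervals: about 0.7, 0.8, 1.0
    # bisect_left finds the first upper end that is >= y.
    # The min() guards against floating-point rounding in the last upper end.
    return values[min(bisect_left(uppers, y), len(values) - 1)]

values = ["good", "bad", "soso"]
probs = [0.7, 0.1, 0.2]

samples = [pick(values, probs, n) for n in [0.3, 0.1, 0.73, 0.87, 0.2, 0.5, 0.9]]
print(samples)
```

Because the upper ends are accumulated in floating point, a draw that lands exactly on a boundary such as 0.8 may fall into the neighbouring interval. For uniformly random draws this has negligible effect.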
Samples as Probabilities
• Count the total number of samples m
• Count the number ni of samples with value xi
• Compute the frequency of xi as ni / m
• This frequency is your estimated probability of xi
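A short sketch of this counting procedure, reusing the Lecture distribution from the earlier example. To keep it brief it draws samples with the standard library's `random.choices` rather than the interval method described above. The seed and the sample size m are arbitrary choices.

```python
import random
from collections import Counter

values = ["good", "bad", "soso"]
probs = [0.7, 0.1, 0.2]

rng = random.Random(0)          # fixed seed so the run is repeatable
m = 100_000                     # total number of samples
samples = rng.choices(values, weights=probs, k=m)

counts = Counter(samples)                       # n_i for each value x_i
freq = {x: counts[x] / m for x in values}       # n_i / m estimates P(x_i)
print(freq)
```

With this many samples the three frequencies should land close to 0.7, 0.1 and 0.2. With only a handful of samples they can be far off.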
Sampling for Bayesian Networks • OK, but how can we use all this for probabilistic inference in Bayesian networks? • As we said earlier, if we can’t use exact algorithms to update the network, we need to resort to samples and frequencies to compute the probabilities we are interested in • We generate these samples by relying on the mechanism we just described
Sampling for Bayesian Networks
• Suppose we have the following BN with two binary variables: A → B

P(A=1) = 0.3

A   P(B=1|A)
1   0.7
0   0.1

• It corresponds to the joint probability distribution
  • P(A,B) = P(B|A) P(A)
• To sample from this distribution
  • we first sample from P(A). Suppose we get A = 0.
  • In this case, we then sample from P(B|A = 0).
  • If we had sampled A = 1, then in the second step we would have sampled from P(B|A = 1).
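The two-step procedure for this network can be sketched directly. The function name `sample_ab`, the seed, and the number of samples are choices made for this illustration. Python's `random()` returns values in [0, 1) rather than (0, 1], which makes no practical difference here.

```python
import random

P_A = 0.3                          # P(A=1)
P_B_GIVEN_A = {1: 0.7, 0: 0.1}     # P(B=1 | A)

def sample_ab(rng):
    """Sample A from P(A), then B from P(B | A = sampled value)."""
    a = 1 if rng.random() < P_A else 0
    b = 1 if rng.random() < P_B_GIVEN_A[a] else 0
    return a, b

rng = random.Random(0)
n = 100_000
samples = [sample_ab(rng) for _ in range(n)]

# Frequency of B=1 estimates P(B=1) = 0.3 * 0.7 + 0.7 * 0.1 = 0.28.
p_b = sum(b for _, b in samples) / n
print(p_b)
```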
Forward (or Prior) Sampling • In a BN • we can order parents before children (topological order) and • we have CPTs available. • If no variables are instantiated (i.e., there is no evidence), this allows a simple algorithm: forward sampling. • Just sample variables in some fixed topological order, using the previously sampled values of the parents to select the correct distribution to sample from.
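Example
Network: Cloudy → Sprinkler, Cloudy → Rain, Sprinkler → Wet Grass, Rain → Wet Grass

P(C=T) = 0.5

C   P(S=T|C)
T   0.1
F   0.5

C   P(R=T|C)
T   0.8
F   0.2

S   R   P(W=T|S,R)
T   T   0.99
T   F   0.9
F   T   0.9
F   F   0.1

Random => 0.4   Sample => Cloudy = T (0.4 ≤ 0.5)
Random => 0.8   Sample => Sprinkler = F (0.8 > P(S=T|C=T) = 0.1)
Random => 0.4   Sample => Rain = T (0.4 ≤ P(R=T|C=T) = 0.8)
Random => 0.7   Sample => Wet Grass = T (0.7 ≤ P(W=T|S=F,R=T) = 0.9)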
Example

sample #   Cloudy   Sprinkler   Rain   Wet Grass
1          T        F           T      T
2
3
…
n

• We generate as many samples as we can afford
• Then we can use them to compute the probability of any partially specified event,
  • e.g., the probability of Rain = T in my Sprinkler network
  • by computing the frequency of that event in the sample set
• So, if we generate 1000 samples from the Sprinkler network, and 511 of them have Rain = T, then the estimated probability of rain is 511/1000 = 0.511
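Forward sampling on the Sprinkler network can be sketched as below. The CPT values are those of the preceding example slide. The function name `forward_sample`, the `draw` callback, the seed and the sample count are assumptions made for this illustration.

```python
import random

# Probability that each variable is True, given its parents.
P_C = 0.5
P_S = {True: 0.1, False: 0.5}                        # P(S=T | C)
P_R = {True: 0.8, False: 0.2}                        # P(R=T | C)
P_W = {(True, True): 0.99, (True, False): 0.9,
       (False, True): 0.9, (False, False): 0.1}      # P(W=T | S, R)

def forward_sample(draw):
    """Sample all variables in topological order; draw() returns a number in (0, 1]."""
    cloudy = draw() <= P_C
    sprinkler = draw() <= P_S[cloudy]
    rain = draw() <= P_R[cloudy]
    wet = draw() <= P_W[(sprinkler, rain)]
    return cloudy, sprinkler, rain, wet

# Replay the four random numbers from the example: 0.4, 0.8, 0.4, 0.7.
fixed = iter([0.4, 0.8, 0.4, 0.7])
first = forward_sample(lambda: next(fixed))
print(first)   # matches sample 1 in the table: T, F, T, T

# Estimate P(Rain = T) as a frequency over many samples.
rng = random.Random(0)
n = 20_000
samples = [forward_sample(lambda: 1.0 - rng.random()) for _ in range(n)]
p_rain = sum(rain for _, _, rain, _ in samples) / n
print(p_rain)
```

The expression `1.0 - rng.random()` maps Python's [0, 1) output onto (0, 1], matching the interval convention used in the slides. The exact value being estimated is 0.5 × 0.8 + 0.5 × 0.2 = 0.5.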