Inference in Gaussian and Hybrid Bayesian Networks. ICS 275B.

Gaussian Distribution

[Figures: univariate Gaussian densities N(μ, σ) on x ∈ [−3, 3]. First plot: gaussian(x,0,1) vs. gaussian(x,1,1), the same bell shifted to a different mean. Second plot: gaussian(x,0,1) vs. gaussian(x,0,2), the same mean with a larger spread.]
Multivariate Gaussian. Definition: Let X1,…,Xn be a set of random variables. A multivariate Gaussian distribution over X1,…,Xn is parameterized by an n-dimensional mean vector μ and an n × n positive definite covariance matrix Σ. It defines a joint density via:

p(x) = (2π)^{−n/2} |Σ|^{−1/2} exp(−½ (x−μ)ᵀ Σ⁻¹ (x−μ))
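As a concrete check of the definition, here is a minimal numpy sketch (my own illustration; the parameter values are arbitrary) that evaluates this density directly:

```python
import numpy as np

def mvn_density(x, mu, Sigma):
    """Density of N(mu, Sigma) at x; Sigma must be positive definite."""
    n = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
print(mvn_density(np.array([0.0, 0.0]), mu, Sigma))
```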
Linear Gaussian Distribution. Definition: Let Y be a continuous node with continuous parents X1,…,Xk. We say that Y has a linear Gaussian model if it can be described using parameters μ_y, w_1,…,w_k and σ² such that:

P(y | x_1,…,x_k) = N(μ_y + w_1 x_1 + … + w_k x_k ; σ²) = N([μ_y, w_1,…,w_k] ; σ²)
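A minimal sketch of such a CPD in numpy (my own illustration; the function name and parameter values are not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_linear_gaussian(mu_y, w, x, sigma, size=1):
    """Sample Y ~ N(mu_y + w^T x, sigma^2) given continuous parent values x."""
    return rng.normal(mu_y + np.dot(w, x), sigma, size=size)

# Y with two continuous parents X1, X2
y = sample_linear_gaussian(mu_y=1.0, w=np.array([0.5, -2.0]),
                           x=np.array([0.3, 1.1]), sigma=0.1, size=5)
```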
[Figure: example two-node networks over A and B.]
Linear Gaussian Network. Definition: A linear Gaussian Bayesian network is a Bayesian network all of whose variables are continuous and all of whose CPTs are linear Gaussians.

Linear Gaussian BN ⇔ Multivariate Gaussian => a linear Gaussian BN is a compact representation of a multivariate Gaussian.
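To illustrate the equivalence, a sketch (my own construction, for a hypothetical two-node chain A → B) that recovers the joint multivariate Gaussian from the linear Gaussian CPDs:

```python
import numpy as np

# Linear Gaussian BN over a chain A -> B:
#   A ~ N(mu_a, sa^2),   B | A ~ N(mu_b + w*A, sb^2)
mu_a, sa = 0.0, 1.0
mu_b, w, sb = 1.0, 2.0, 0.5

# The joint P(A, B) is the multivariate Gaussian N(mu, Sigma):
mu = np.array([mu_a, mu_b + w * mu_a])
Sigma = np.array([[sa**2,     w * sa**2],
                  [w * sa**2, sb**2 + w**2 * sa**2]])

# Monte Carlo check: the sample covariance approximates Sigma
rng = np.random.default_rng(0)
a = rng.normal(mu_a, sa, 100_000)
b = rng.normal(mu_b + w * a, sb)
print(np.cov(a, b))
```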
Problems: when we multiply two arbitrary Gaussians! The inverses of K and M themselves are always well defined. However, the inverse of (K⁻¹ + M⁻¹), which is needed to put the product back in moment form, is not!
Theoretical explanation: why is this the case?
• The inverse of an n × n matrix exists only when the matrix has full rank n.
• If all σ's and w's are assumed to be 1, (K⁻¹ + M⁻¹) has rank 2, and so it is not invertible.
Density vs. conditional
• Theorem: If the product of the Gaussians represents a multivariate Gaussian density, then the inverse always exists.
• For example, if P(A|B)·P(B) = P(A,B) = N(c,C), then the inverse of C always exists: P(A,B) is a multivariate Gaussian (density).
• But if P(A|B)·P(B|X) = P(A,B|X) = N(c,C), then the inverse of C may not exist: P(A,B|X) is a conditional Gaussian.
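A small numeric check of the rank argument (my own illustration, with all σ's and w's set to 1 as above; the quadratic-form matrices of two conditionals are extended with zeros to the common domain (a, b, c)):

```python
import numpy as np

# P(b|a) = N(a, 1): exp(-0.5*(b - a)^2) contributes [[1,-1],[-1,1]] on (a, b)
K1 = np.array([[ 1., -1.,  0.],
               [-1.,  1.,  0.],
               [ 0.,  0.,  0.]])
# P(c|b) = N(b, 1): the same pattern on (b, c), extended to (a, b, c)
K2 = np.array([[ 0.,  0.,  0.],
               [ 0.,  1., -1.],
               [ 0., -1.,  1.]])

S = K1 + K2
print(np.linalg.matrix_rank(S))   # 2 -- the 3x3 sum is singular
# np.linalg.inv(S) would raise LinAlgError here: P(b|a)*P(c|b) = P(b,c|a)
# is a conditional Gaussian, not a density, so it has no moment form N(mu, Sigma).
```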
Inference: A general algorithm. Computing the marginal of a given variable, say Z. Step 1: Convert all conditional Gaussians to canonical form, C(x; g, h, K) = exp(g + hᵀx − ½ xᵀKx).
Inference: A general algorithm. Computing the marginal of a given variable, say Z.
• Step 2: Extend all g's, h's and k's to the same domain by adding 0's (the scalar g is unchanged; h gets zero entries and K gets zero rows and columns for the new variables).
Inference: A general algorithm. Computing the marginal of a given variable, say Z.
• Step 3: Add all g's, all h's and all k's (multiplying canonical forms adds their parameters).
• Step 4: Let the variables involved in the computation be X1,X2,…,Xk,Z. Convert back to moment form: P(X1,X2,…,Xk,Z) = N(μ, Σ), with Σ = K⁻¹ and μ = K⁻¹h.
Inference: A general algorithm. Computing the marginal of a given variable, say Z. Step 5: Extract the marginal: for a joint Gaussian this just reads off the entries of μ and Σ corresponding to Z, giving P(Z) = N(μ_Z, Σ_ZZ).
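The five steps fit in a short numpy sketch (a minimal illustration of the standard canonical-form operations, not the course's reference code):

```python
import numpy as np

class Canonical:
    """C(x; g, h, K) = exp(g + h^T x - 0.5 x^T K x) over named variables."""
    def __init__(self, scope, g, h, K):
        self.scope, self.g = list(scope), g
        self.h, self.K = np.asarray(h, float), np.asarray(K, float)

    def extend(self, scope):
        """Step 2: pad h and K with zeros up to a larger scope."""
        idx = [scope.index(v) for v in self.scope]
        h = np.zeros(len(scope)); h[idx] = self.h
        K = np.zeros((len(scope),) * 2); K[np.ix_(idx, idx)] = self.K
        return Canonical(scope, self.g, h, K)

    def __mul__(self, other):
        """Step 3: multiplication adds g's, h's and K's on a common scope."""
        scope = self.scope + [v for v in other.scope if v not in self.scope]
        a, b = self.extend(scope), other.extend(scope)
        return Canonical(scope, a.g + b.g, a.h + b.h, a.K + b.K)

    def to_moment(self):
        """Step 4: recover N(mu, Sigma); valid only when K is invertible,
        i.e. when the canonical form represents a density."""
        Sigma = np.linalg.inv(self.K)
        return Sigma @ self.h, Sigma
```

Step 5 then reads off the row of μ and the diagonal entry of Σ that correspond to Z.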
Inference: Computing the marginal of a given variable
• For a continuous Gaussian Bayesian network, inference is polynomial, O(N³): the complexity of matrix inversion.
• So algorithms like belief propagation are not generally used when all variables are Gaussian.
• Can we do better than O(N³)? Use bucket elimination.
Bucket elimination: Algorithm elim-bel (Dechter 1996), with a multiplication operator and a marginalization operator applied in each bucket:

bucket B: P(b|a), P(d|b,a), P(e|b,c)
bucket C: P(c|a)
bucket D:
bucket E: e=0
bucket A: P(a)
→ P(a|e=0)

W* = 4, the "induced width" (max clique size).
Multiplication Operator
• Convert all functions to canonical form if necessary.
• Extend all functions to the same variables.
• (g1, h1, k1) * (g2, h2, k2) = (g1+g2, h1+h2, k1+k2)
The same schedule on a Gaussian network:

bucket B: P(b|a), P(d|b,a), P(e|b,c)
bucket C: P(c|a)
bucket D:
bucket E: P(e)
bucket A: P(a)
→ P(a)

W* = 4, the "induced width" (max clique size). Again our problem! The intermediate function h(a,d,c,e) generated in bucket B does not represent a density, and so cannot be stored in our usual moment form N(μ, Σ).
Solution: Marginalize in canonical form
• Although the intermediate functions computed in bucket elimination are conditional, we can marginalize directly in canonical form, which eliminates the problem of the non-existent inverse completely.
Algorithm
• In each bucket, convert all functions to canonical form if necessary, multiply them, and marginalize out the bucket's variable in canonical form, as shown on the previous slide (and in the sketch below).
• Theorem: the resulting P(A) is a density and is correct.
• Complexity: time and space O((w+1)³), where w is the induced width of the ordering used.
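Marginalizing in canonical form only ever inverts the sub-block of K over the eliminated variables, never the full matrix. A sketch of this step (the standard canonical-form integral; the code itself is my own):

```python
import numpy as np

def marginalize(g, h, K, scope, elim):
    """Integrate the variables in `elim` out of C(x; g, h, K) over `scope`.
    Only the sub-block K_YY over the eliminated variables is inverted."""
    h, K = np.asarray(h, float), np.asarray(K, float)
    Y = [scope.index(v) for v in elim]
    X = [i for i in range(len(scope)) if i not in Y]
    Kxx, Kxy, Kyy = K[np.ix_(X, X)], K[np.ix_(X, Y)], K[np.ix_(Y, Y)]
    hx, hy = h[X], h[Y]
    Kyy_inv = np.linalg.inv(Kyy)
    K_new = Kxx - Kxy @ Kyy_inv @ Kxy.T
    h_new = hx - Kxy @ Kyy_inv @ hy
    g_new = g + 0.5 * (len(Y) * np.log(2 * np.pi)
                       - np.log(np.linalg.det(Kyy))
                       + hy @ Kyy_inv @ hy)
    return g_new, h_new, K_new, [v for v in scope if v not in elim]
```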
Continuous Node, Discrete Parents. Definition: Let X be a continuous node, and let U = {U1,U2,…,Un} be its discrete parents and Y = {Y1,Y2,…,Yk} be its continuous parents. We say that X has a conditional linear Gaussian (CLG) CPT if, for every value u ∈ D(U), we have a set of (k+1) coefficients a_{u,0}, a_{u,1}, …, a_{u,k} and a variance σ_u² such that:

P(x | u, y_1,…,y_k) = N(a_{u,0} + a_{u,1} y_1 + … + a_{u,k} y_k ; σ_u²)
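A sketch of a CLG CPT with one binary discrete parent U and one continuous parent Y (the parameter values are my own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# For each value u of the discrete parent:
#   X | u, y ~ N(a[u][0] + a[u][1]*y, sigma[u]^2)
a     = {0: (0.0, 1.0), 1: (5.0, -2.0)}
sigma = {0: 0.5,        1: 2.0}

def sample_x(u, y):
    a0, a1 = a[u]
    return rng.normal(a0 + a1 * y, sigma[u])

print(sample_x(0, 1.3), sample_x(1, 1.3))
```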
CLG Network Definition: A Bayesian network is called a CLG network if every discrete node has only discrete parents, and every continuous node has a CLG CPT.
Inference in CLGs
• Can we use the same algorithm?
• Yes, but the algorithm is unbounded if we are not careful.
• Reason: marginalizing out discrete variables from an arbitrary function in a CLG is not bounded.
• Example: let x and y be continuous variables and i and k be discrete binary variables. If we marginalize out y and k from f(x,y,i,k), the result is a mixture of 4 Gaussians instead of 2.
Solution: Approximate the mixture of Gaussians by a single Gaussian.
Multiplication and Marginalization
Multiplication:
• Convert all functions to canonical form if necessary.
• Extend all functions to the same variables.
• (g1, h1, k1) * (g2, h2, k2) = (g1+g2, h1+h2, k1+k2)
Marginalization:
• Strong marginal when marginalizing continuous variables.
• Weak marginal when marginalizing discrete variables (moment matching; see the sketch below).
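The weak marginal replaces a mixture of Gaussians by the single Gaussian with the same overall mean and covariance. A minimal sketch (the standard moment-matching formulas; the code is my own):

```python
import numpy as np

def weak_marginal(weights, means, covs):
    """Collapse sum_i w_i N(mu_i, Sigma_i) into one moment-matched Gaussian."""
    w = np.asarray(weights, float); w = w / w.sum()
    means = [np.asarray(m, float) for m in means]
    mu = sum(wi * mi for wi, mi in zip(w, means))
    Sigma = sum(wi * (np.asarray(C, float) + np.outer(mi - mu, mi - mu))
                for wi, mi, C in zip(w, means, covs))
    return mu, Sigma

# Two 1-D components, e.g. the result of summing out a binary discrete variable
mu, Sigma = weak_marginal([0.3, 0.7],
                          [np.array([0.0]), np.array([4.0])],
                          [np.eye(1), 2 * np.eye(1)])
```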
Problem while using this marginalization in bucket elimination
• It requires computing Σ and μ, which is not possible due to the non-existence of the inverse.
• Solution: use an ordering such that you never have to marginalize out discrete variables from a function that has both discrete and continuous Gaussian variables.
• Special case: computing a marginal at a discrete node.
• Homework: derive a bucket elimination algorithm for computing the marginal of a continuous variable.
Special case: a marginal on a discrete variable in a CLG is to be computed. B, C and D are continuous variables; A and E are discrete.

bucket B: P(b|a,e), P(d|b,a), P(d|b,c)
bucket C: P(c|a)
bucket D:
bucket E: P(e)
bucket A: P(a)
→ P(a)

W* = 4, the "induced width" (max clique size). Each bucket again applies the multiplication and marginalization operators.
Complexity of the special case
• Discrete width (w_d): maximum number of discrete variables in a clique.
• Continuous width (w_c): maximum number of continuous variables in a clique.
• Time: O(exp(w_d) + w_c³)
• Space: O(exp(w_d) + w_c³)
Algorithm for the general case: Computing belief at a continuous node of a CLG
1. Convert all functions to canonical form.
2. Create a special tree-decomposition.
3. Assign functions to appropriate cliques (same as assigning functions to buckets).
4. Select a strong root.
5. Perform message passing.
Creating a special tree-decomposition
• Moralize the Bayesian network.
• Select an ordering such that all continuous variables are ordered before discrete variables (this may increase the induced width).
Elimination order
• Strong elimination order: first eliminate continuous variables; eliminate a discrete variable only when no continuous variables are available.
• Example: W and X are discrete variables and Y and Z are continuous; the moralized graph has an extra edge added by moralization (a code sketch of this ordering follows).
[Figure: moralized graph over W, X, Y and Z.]
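A strong elimination order can be produced by any greedy triangulation heuristic restricted to pick continuous variables first. A minimal sketch (min-degree is my own choice of heuristic, and the adjacency below is only a guess at the slide's four-node example):

```python
def strong_elimination_order(adj, continuous):
    """Greedy min-degree order that eliminates every continuous variable
    before any discrete one. adj: dict mapping var -> set of neighbors."""
    adj = {v: set(ns) for v, ns in adj.items()}
    remaining, order = set(adj), []
    while remaining:
        cont = [v for v in remaining if v in continuous]
        pool = cont if cont else sorted(remaining)      # continuous first
        v = min(pool, key=lambda u: len(adj[u] & remaining))
        nbrs = adj[v] & remaining
        for u in nbrs:                                  # connect v's neighbors
            adj[u] |= nbrs - {u}
        order.append(v)
        remaining.remove(v)
    return order

# W, X discrete; Y, Z continuous (moralized graph, edges assumed)
adj = {'w': {'x', 'y'}, 'x': {'w', 'y'}, 'y': {'w', 'x', 'z'}, 'z': {'y'}}
print(strong_elimination_order(adj, {'y', 'z'}))   # ['z', 'y', 'w', 'x']
```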
[Figures: eliminating in the strong order. Step 1 eliminates z, step 2 eliminates y, steps 3 and 4 eliminate the discrete variables w and x (each marked dim: 2). The elimination creates Clique 1 = {y, z} and Clique 2 = {w, x, y}, with separator {y}.]
Bucket tree or Junction tree
[Figure: the resulting junction tree. Clique 2 = {w, x, y} is the root, Clique 1 = {y, z} is its child, and the separator between them is {y}.]
Assigning Functions to cliques • Select a function and place it in an arbitrary clique that mentions all variables in the function.
Strong Root
• We define a strong root as any node R in the bucket tree which satisfies the following property: for any pair (V,W) which are neighbors on the tree with W closer to R than V, we have that V ∖ W contains only continuous variables, or V ∩ W contains only discrete variables.
Example
[Figure: a junction tree with the strong root highlighted.]
Message passing at a typical node
[Figure: node a with neighbor b and incoming messages from x1 and x2.]
• Node "a" contains the functions assigned to it according to the tree-decomposition scheme, denoted p_j(a).
Message Passing
Two-pass algorithm: bucket-tree propagation. A collect pass sends messages from the leaves toward the root, then a distribute pass sends them from the root back out.
[Figure from P. Green: collect and distribute phases on a tree.]
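A schematic of the two-pass schedule (structure only; `send` is a hypothetical callback that would multiply the source clique's functions and marginalize onto the separator, with strong marginals on the collect pass):

```python
def collect(tree, send, node, parent=None):
    """Pass 1 (collect): messages flow from the leaves toward the strong root."""
    for nbr in tree[node]:
        if nbr != parent:
            collect(tree, send, nbr, node)
            send(nbr, node)    # strong marginal onto the separator

def distribute(tree, send, node, parent=None):
    """Pass 2 (distribute): messages flow from the root back to the leaves."""
    for nbr in tree[node]:
        if nbr != parent:
            send(node, nbr)    # weak marginals may appear on this pass
            distribute(tree, send, nbr, node)

# The two-clique tree from the slides: Clique 2 = {w,x,y} is the strong root
tree = {'C2': ['C1'], 'C1': ['C2']}
send = lambda src, dst: print(f"message {src} -> {dst}")
collect(tree, send, 'C2')      # C1 -> C2
distribute(tree, send, 'C2')   # C2 -> C1
```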
Let's look at the messages: Collect Evidence
[Figure: messages flowing toward the strong root, each edge labeled with the integral over the variables marginalized out along it.]