Global Approximate Inference Eran Segal Weizmann Institute
General Approximate Inference • Strategy • Define a class of simpler distributions Q • Search for a particular instance in Q that is “close” to P • Answer queries using inference in Q
Cluster Graph
• A cluster graph K for factors F is an undirected graph
  • Nodes are associated with a subset of variables Ci ⊆ U
  • The graph is family preserving: each factor φ ∈ F is associated with one node Ci such that Scope[φ] ⊆ Ci
  • Each edge Ci–Cj is associated with a sepset Si,j = Ci ∩ Cj
• A cluster tree over factors F that satisfies the running intersection property is called a clique tree
Clique Tree Inference
• (Figure: a clique tree for the student network over C, I, D, G, S, L, J, H)
  • Cliques: 1: {C,D}, 2: {G,I,D}, 3: {G,S,I}, 4: {G,J,S,L}, 5: {H,G,J}
  • Sepsets: {D}, {G,I}, {G,S}, {G,J}
  • Factors P(C), P(D|C), P(G|I,D), P(I), P(S|I), P(L|G), P(J|L,S), P(H|G,J), each assigned to a clique containing its scope
• Verify:
  • Tree and family preserving
  • Running intersection property
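As a concrete illustration, here is a minimal sketch (not from the slides) of how the clique tree above could be represented in Python, with checks of the sepsets and the family-preservation property; the clique numbering and factor list follow the figure.

```python
# A minimal sketch of the clique tree on this slide, using plain Python sets;
# clique ids follow the numbering 1..5 above.
cliques = {
    1: {"C", "D"},
    2: {"G", "I", "D"},
    3: {"G", "S", "I"},
    4: {"G", "J", "S", "L"},
    5: {"H", "G", "J"},
}
tree_edges = [(1, 2), (2, 3), (3, 4), (4, 5)]

# Sepsets are the intersections of adjacent cliques.
sepsets = {(i, j): cliques[i] & cliques[j] for i, j in tree_edges}
print(sepsets)  # sepsets: {D}, {G,I}, {G,S}, {G,J}

# Family preservation: every factor's scope fits inside some clique.
factor_scopes = [{"C"}, {"D", "C"}, {"G", "I", "D"}, {"I"}, {"S", "I"},
                 {"L", "G"}, {"J", "L", "S"}, {"H", "G", "J"}]
assert all(any(s <= c for c in cliques.values()) for s in factor_scopes)
```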
Message Passing: Belief Propagation
• Initialize the clique tree
  • For each clique Ci set π_i ← ∏_{φ: α(φ)=i} φ
  • For each edge Ci—Cj set μ_{i,j} ← 1
• While unset cliques exist
  • Select Ci—Cj
  • Send message from Ci to Cj
    • Marginalize the clique over the sepset: δ_{i→j} ← Σ_{Ci \ Si,j} π_i
    • Update the belief at Cj: π_j ← π_j · δ_{i→j} / μ_{i,j}
    • Update the sepset at Ci–Cj: μ_{i,j} ← δ_{i→j}
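The belief-update step above can be made concrete with a tiny sketch: one message over a single binary sepset variable between two cliques, with made-up numbers chosen only to illustrate the three update equations.

```python
import numpy as np

# Minimal sketch: one belief-update message on a two-clique tree
# C1 = {A, B}, C2 = {B, C}, sepset S12 = {B}; all variables binary.
# The factor values are made up and serve only to illustrate the update.
pi1 = np.array([[0.5, 1.0], [2.0, 0.5]])     # belief over (A, B)
pi2 = np.array([[1.0, 3.0], [2.0, 1.0]])     # belief over (B, C)
mu12 = np.ones(2)                            # sepset belief over B

# Send a message from C1 to C2:
delta = pi1.sum(axis=0)                      # marginalize C1 onto the sepset {B}
pi2 = pi2 * (delta / mu12)[:, None]          # update the belief at C2
mu12 = delta                                 # update the sepset belief

# After passing the reverse message as well, the two cliques agree on B.
```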
Clique Tree Invariant
• Belief propagation can be viewed as reparameterizing the joint distribution
• Upon calibration we showed: PF(U) = ∏_i π_i[Ci] / ∏_{(i–j)} μ_{i,j}[Si,j]
• Initially this invariant holds, since π_i = ∏_{φ: α(φ)=i} φ and μ_{i,j} = 1
• At each update step the invariant is also maintained
  • A message over Ci–Cj only changes π_j and μ_{i,j}, so most terms remain unchanged
  • We need to show: π_j^new / μ_{i,j}^new = π_j^old / μ_{i,j}^old
  • But this is exactly the message passing step: π_j^new = π_j^old · δ_{i→j} / μ_{i,j}^old and μ_{i,j}^new = δ_{i→j}
• Belief propagation reparameterizes PF at each step
Global Approximate Inference • Inference as optimization • Generalized Belief Propagation • Define algorithm • Constructing cluster graphs • Analyze approximation guarantees • Propagation with approximate messages • Factorized messages • Approximate message propagation • Structured variational approximations
The Energy Functional
• Suppose we want to approximate PF with Q
• Represent PF by factors: PF(U) = (1/Z) ∏_{φ∈F} φ
• Define the energy functional: F[PF,Q] = Σ_{φ∈F} E_Q[ln φ] + H_Q(U)
• Then: ln Z = F[PF,Q] + D(Q||PF)
  • ln Z ≥ F[PF,Q] (since D(Q||PF) ≥ 0)
  • Minimizing D(Q||PF) is equivalent to maximizing F[PF,Q]
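A small numeric check of the identity ln Z = F[PF,Q] + D(Q||PF) on two binary variables; the factor values and the choice of Q below are arbitrary and serve only to verify the decomposition.

```python
import numpy as np

# Tiny numeric check of  ln Z = F[PF, Q] + D(Q || PF)
# for two binary variables with made-up factors (an illustration only).
phi = np.array([[1.0, 2.0], [3.0, 1.0]])        # unnormalized PF over (X, Y)
Z = phi.sum()
P = phi / Z

rng = np.random.default_rng(0)
Q = rng.random((2, 2)) + 0.1
Q /= Q.sum()                                    # an arbitrary distribution Q

energy = (Q * np.log(phi)).sum()                # E_Q[ln unnormalized PF]
entropy = -(Q * np.log(Q)).sum()                # H_Q
F = energy + entropy                            # the energy functional
D = (Q * np.log(Q / P)).sum()                   # KL(Q || PF)

print(np.isclose(F + D, np.log(Z)))             # True: F is a lower bound on ln Z
```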
Inference as Optimization
• We show that inference can be viewed as maximizing the energy functional F[PF,Q]
  • Define a distribution Q over clique potentials
  • Transform F[PF,Q] to an equivalent factored form F'[PF,Q]
  • Show that if Q maximizes F'[PF,Q] subject to constraints under which Q represents calibrated potentials, then there exist factors that satisfy the inference message passing equations
Defining Q
• Recall that throughout BP: PF(U) = ∏_i π_i[Ci] / ∏_{(i–j)} μ_{i,j}[Si,j]
• Define Q as a reparameterization of PF given by the clique and sepset beliefs: Q = {π_i} ∪ {μ_{i,j}}
• Since D(Q||PF) = 0, we show that calibrating Q is equivalent to maximizing F[PF,Q]
Factored Energy Functional
• Define the factored energy functional as: F'[PF,Q] = Σ_i E_{π_i}[ln ψ_i] + Σ_i H_{π_i}(Ci) − Σ_{(i–j)∈T} H_{μ_{i,j}}(Si,j)
  • where ψ_i = ∏_{φ: α(φ)=i} φ is the initial potential of clique Ci
• Theorem: if Q is a set of calibrated potentials for T, then F[PF,Q] = F'[PF,Q]
Inference as Optimization
• Optimization task
  • Find Q that maximizes F'[PF,Q] subject to
    • Σ_{Ci \ Si,j} π_i = μ_{i,j} for each edge Ci–Cj (calibration)
    • Σ_{Ci} π_i = 1 for each clique Ci (normalization)
• Theorem: fixed points of Q for the above optimization satisfy π_i ∝ ψ_i · ∏_{j∈Nb(i)} δ_{j→i}, where δ_{i→j} = Σ_{Ci \ Si,j} ψ_i ∏_{k∈Nb(i)\{j}} δ_{k→i}
• Suggests an iterative optimization procedure
  • Identical to belief propagation!
Global Approximate Inference • Inference as optimization • Generalized Belief Propagation • Define algorithm • Constructing cluster graphs • Analyze approximation guarantees • Propagation with approximate messages • Factorized messages • Approximate message propagation • Structured variational approximations
Generalized Belief Propagation
• Strategy: perform belief propagation in a cluster graph with loops
• (Figure: a Bayesian network over A, B, C, D; its cluster tree with cliques {A,B,D} and {B,C,D} and sepset {B,D}; and a loopy cluster graph with clusters {A,B}, {A,D}, {B,C}, {C,D})
Generalized Belief Propagation
• Strategy: perform belief propagation in a cluster graph with loops
• Inference may be incorrect: double counting of evidence
• Unlike in BP on trees:
  • Convergence is not guaranteed
  • Potentials in the calibrated graph are not guaranteed to be marginals of PF
• (Figure: loopy cluster graph over A, B, C, D with clusters {A,B}, {A,D}, {B,C}, {C,D})
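A minimal sketch of loopy belief propagation on the four-cluster loop from the figure, assuming made-up pairwise potentials; it illustrates the synchronous message updates and why the calibrated beliefs are only pseudo-marginals.

```python
import numpy as np

# A minimal sketch of loopy BP on the four-cluster loop above: clusters
# AB, BC, CD, DA with singleton sepsets B, C, D, A. The potentials are
# arbitrary 2x2 tables (rows = first variable, cols = second variable).
rng = np.random.default_rng(1)
psi = {c: rng.random((2, 2)) + 0.5 for c in ["AB", "BC", "CD", "DA"]}

# edges[(i, j)] = the axis of cluster i that carries the variable shared with j.
edges = {("AB", "BC"): 1, ("BC", "AB"): 0, ("BC", "CD"): 1, ("CD", "BC"): 0,
         ("CD", "DA"): 1, ("DA", "CD"): 0, ("DA", "AB"): 1, ("AB", "DA"): 0}
msgs = {e: np.ones(2) for e in edges}            # messages live on the sepsets

for _ in range(50):                              # synchronous message updates
    new = {}
    for (i, j), out_axis in edges.items():
        belief = psi[i].copy()
        for (k, l) in edges:                     # absorb messages into i,
            if l == i and k != j:                # except the one coming from j
                in_axis = edges[(i, k)]
                m = msgs[(k, i)]
                belief = belief * (m if in_axis == 1 else m[:, None])
        delta = belief.sum(axis=1 - out_axis)    # marginalize onto the sepset
        new[(i, j)] = delta / delta.sum()        # normalize for stability
    msgs = new

# At convergence, neighboring clusters agree on their shared variables, but
# these calibrated beliefs are pseudo-marginals: they need not equal the
# true marginals of PF (evidence is double counted around the loop).
```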
Generalized Cluster Graph
• Recall: a cluster graph K for factors F is an undirected graph
  • Nodes are associated with a subset of variables Ci ⊆ U
  • The graph is family preserving: each factor φ ∈ F is associated with one node Ci such that Scope[φ] ⊆ Ci
  • Each edge Ci–Cj is associated with a sepset Si,j = Ci ∩ Cj
• A generalized cluster graph K for factors F is an undirected graph
  • Nodes are associated with a subset of variables Ci ⊆ U
  • The graph is family preserving: each factor φ ∈ F is associated with one node Ci such that Scope[φ] ⊆ Ci
  • Each edge Ci–Cj is associated with a subset Si,j ⊆ Ci ∩ Cj
Generalized Cluster Graph
• A generalized cluster graph obeys the running intersection property if for each X ∈ Ci and X ∈ Cj, there is exactly one path between Ci and Cj for which X ∈ S for every sepset S along the path
• Equivalently: all edges associated with X form a tree that spans all the clusters that contain X
• Note: some of these clusters may be connected by more than one path
• (Figure: loopy cluster graph over A, B, C, D with clusters {A,B}, {A,D}, {B,C}, {C,D})
Calibrated Cluster Graph
• A generalized cluster graph is calibrated if for each edge Ci–Cj we have: Σ_{Ci \ Si,j} π_i = Σ_{Cj \ Si,j} π_j
• Weaker than in clique trees, since Si,j may be only a subset of the intersection between Ci and Cj
• If a calibrated cluster graph satisfies the running intersection property, then the marginal on any variable X is the same in every cluster that contains X
GBP is Efficient
• (Figure: a 3×3 Markov grid network over X11...X33 and its cluster graph, with a cluster for each edge potential and a singleton cluster for each node)
• Note: a clique tree in an n×n grid is exponential in n
• A round of GBP is O(n)
Global Approximate Inference • Inference as optimization • Generalized Belief Propagation • Define algorithm • Constructing cluster graphs • Analyze approximation guarantees • Propagation with approximate messages • Factorized messages • Approximate message propagation • Structured variational approximations
Constructing Cluster Graphs
• When constructing clique trees, all valid constructions yield exact answers and differ only in computational complexity
• In GBP, different cluster graphs can vary in both computational complexity and approximation quality
Transforming Pairwise MNs
• A pairwise Markov network over a graph H has:
  • A set of node potentials {φ[Xi] : i = 1,...,n}
  • A set of edge potentials {φ[Xi,Xj] : (Xi–Xj) ∈ H}
• Example (figure): the 3×3 grid network over X11...X33 and its cluster graph, with a cluster for each edge potential and a singleton cluster for each node
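A short sketch of this transformation, illustrated on the 3×3 grid from the figure: one cluster per edge potential, one singleton cluster per node, connected through the shared variable (the variable layout follows the slide; the code itself is only illustrative).

```python
from itertools import product

# Sketch of the pairwise-MN transformation: one cluster per edge potential,
# one singleton cluster per node, connected through the shared variable.
n = 3
nodes = [(r, c) for r, c in product(range(n), range(n))]
grid_edges = [((r, c), (r, c + 1)) for r in range(n) for c in range(n - 1)] + \
             [((r, c), (r + 1, c)) for r in range(n - 1) for c in range(n)]

node_clusters = {v: {v} for v in nodes}
edge_clusters = {e: set(e) for e in grid_edges}

# Each edge cluster is linked to the singleton clusters of its two endpoints;
# the sepset is the single shared variable.
cluster_graph = [(e, v) for e in grid_edges for v in e]
print(len(edge_clusters), len(node_clusters), len(cluster_graph))  # 12 9 24
```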
Transforming Bayesian Networks
• Example:
  • A "large" cluster for each CPD
  • A singleton cluster for each variable
  • Connect a singleton cluster and a large cluster if the variable appears in the CPD
  • The resulting graph obeys the running intersection property
• (Figure: a Bayesian network over A, B, C, D, F and its cluster graph with large clusters {A,B,C}, {A,B,D}, {B,D,F} and singleton clusters A, B, C, D, F)
• This construction is known as the Bethe approximation
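A similarly small sketch of the Bethe construction for this example, assuming the CPD scopes shown in the figure: one large cluster per CPD scope, one singleton cluster per variable, and an edge whenever the variable appears in the scope.

```python
# Sketch of the Bethe construction on this slide: one large cluster per CPD
# scope, one singleton cluster per variable, connected whenever the variable
# appears in the large cluster (the sepset is that single variable).
cpd_scopes = [{"A", "B", "C"}, {"A", "B", "D"}, {"B", "D", "F"}]
variables = sorted(set().union(*cpd_scopes))

bethe_edges = [(frozenset(scope), v) for scope in cpd_scopes for v in scope]
for scope, v in bethe_edges:
    print(set(scope), "--", v)
print(len(cpd_scopes) + len(variables), "clusters and",
      len(bethe_edges), "edges")   # 8 clusters and 9 edges
```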
Global Approximate Inference • Inference as optimization • Generalized Belief Propagation • Define algorithm • Constructing cluster graphs • Analyze approximation guarantees • Propagation with approximate messages • Factorized messages • Approximate message propagation • Structured variational approximations
Generalized Belief Propagation
• GBP maintains the distribution invariant: PF(U) = ∏_i π_i[Ci] / ∏_{(i–j)} μ_{i,j}[Si,j]
  • (since message passing maintains the invariant)
Generalized Belief Propagation
• If GBP converges (K is calibrated)
  • Each subtree T of K is calibrated, with edge potentials corresponding to marginals of the distribution P_T(U) defined by the subtree
  • (since P_T(U) is represented by a calibrated tree)
Generalized Belief Propagation
• Calibrated cluster-graph potentials are not marginals of PF(U)
• (Figure: the loopy cluster graph with clusters {A,B}, {A,D}, {B,C}, {C,D} numbered 1–4, alongside a cluster tree with clusters {A,B}, {B,C}, {C,D} numbered 1–3)
Inference as Optimization
• Optimization task
  • Find Q that maximizes F'[PF,Q] subject to
    • Σ_{Ci \ Si,j} π_i = μ_{i,j} for each edge Ci–Cj (calibration)
    • Σ_{Ci} π_i = 1 for each clique Ci (normalization)
• Theorem: fixed points of Q for the above optimization satisfy π_i ∝ ψ_i · ∏_{j∈Nb(i)} δ_{j→i}, where δ_{i→j} = Σ_{Ci \ Si,j} ψ_i ∏_{k∈Nb(i)\{j}} δ_{k→i}
• Suggests an iterative optimization procedure
  • Identical to belief propagation!
GBP as Optimization
• Optimization task
  • Find Q that maximizes F'[PF,Q] subject to
    • Σ_{Ci \ Si,j} π_i = μ_{i,j} for each edge Ci–Cj (calibration over Si,j)
    • Σ_{Ci} π_i = 1 for each cluster Ci (normalization)
• Theorem: fixed points of Q for the above optimization satisfy the GBP message passing equations
  • Note: Si,j is only a subset of the intersection between Ci and Cj
• The iterative optimization procedure is GBP
GBP as Optimization • Clique trees • F[PF,Q]=F’[PF,Q] • Iterative procedure (BP) guaranteed to converge • Convergence point represents marginal distributions of PF • Cluster graphs • F[PF,Q]=F’[PF,Q] does not hold! • Iterative procedure (GBP) not guaranteed to converge • Convergence point does not represent marginal distributions of PF
GBP in Practice
• Dealing with non-convergence
  • Often small portions of the network do not converge
    • Stop inference and use the current beliefs
• Use intelligent message passing scheduling (see the sketch below)
  • Tree reparameterization (TRP) selects entire trees and calibrates them while keeping all other beliefs fixed
  • Focus attention on uncalibrated regions of the graph
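One way such scheduling might look is a residual-style priority queue: always pass the message that would change the most. This is only a hedged sketch, not TRP itself; compute_message and init_message are hypothetical stand-ins for the actual GBP message computations.

```python
import heapq

# A hedged sketch of residual-style message scheduling, as a simple way to
# focus effort on uncalibrated regions. compute_message(i, j, msgs) is a
# hypothetical stand-in for the GBP message computation and is assumed to
# return a list of numbers; edge ids are assumed to be comparable tuples.
def schedule(edges, compute_message, init_message, max_updates=1000, tol=1e-6):
    msgs = {e: init_message(*e) for e in edges}        # e.g. uniform messages
    heap = [(-float("inf"), e) for e in edges]          # every edge starts pending
    heapq.heapify(heap)
    for _ in range(max_updates):
        if not heap:
            break
        _, (i, j) = heapq.heappop(heap)
        new = compute_message(i, j, msgs)
        residual = max(abs(a - b) for a, b in zip(new, msgs[(i, j)]))
        msgs[(i, j)] = new
        if residual < tol:
            continue                                    # this edge barely changed
        # Messages leaving cluster j depend on the one just updated; re-enqueue them.
        for (k, l) in edges:
            if k == j and l != i:
                heapq.heappush(heap, (-residual, (k, l)))
    return msgs
```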
Global Approximate Inference • Inference as optimization • Generalized Belief Propagation • Define algorithm • Constructing cluster graphs • Analyze approximation guarantees • Propagation with approximate messages • Factorized messages • Approximate message propagation • Structured variational approximations
Propagation w. Approximate Msgs • General idea • Perform BP (or GBP) as before, but propagate messages that are only approximate • Modular approach • General inference scheme remains the same • Can plug in many different approximate message computations
Factorized Messages
• Keep the internal structure within the clique tree cliques
• Calibration involves sending messages that are joint over three variables
• Idea: simplify messages using a factored representation
• Example (figure): a 3×3 Markov grid network over X11...X33 and a corresponding clique tree with clusters 1, 2, 3
Computational Savings
• Answering queries in cluster 2
  • Exact inference: exponential in the joint space of cluster 2
  • Approximate inference with factored messages:
    • Notice that the subnetwork with factored messages is a tree
    • Perform efficient exact inference on the subtree to answer queries
• (Figure: cluster 2 with factored incoming messages from clusters 1 and 3, each a product of single-variable factors)
Factor Sets
• A factor set φ = {φ1,...,φk} provides a compact representation for the high-dimensional factor φ1 · ... · φk
• Belief propagation with factor sets
  • Multiplication of factor sets is easy: simply the union of the factors in the two sets being multiplied
  • Marginalization of a factor set: inference in the simplified network
  • Example: compute δ2→3
• (Figure: the clique tree with clusters 1, 2, 3 and factored messages over the individual variables)
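A minimal sketch of factor-set operations, assuming factors are (scope, table) pairs: multiplying factor sets is a union, and marginalizing a variable out is one variable-elimination step in the simplified network. The example factors are made up and only mirror the figure's variable names.

```python
import numpy as np
from functools import reduce

# A factor is a pair (scope, table), with the table's axes ordered like the
# scope; a factor set is simply a list of factors.
def multiply(f, g):
    (sf, tf), (sg, tg) = f, g
    scope = sf + tuple(v for v in sg if v not in sf)
    def expand(s, t):                       # broadcast a table to the joint scope
        extras = [v for v in scope if v not in s]
        t = t.reshape(t.shape + (1,) * len(extras))
        axes = {v: i for i, v in enumerate(list(s) + extras)}
        return np.transpose(t, [axes[v] for v in scope])
    return scope, expand(sf, tf) * expand(sg, tg)

def marginalize_out(factor_set, var):
    """One variable-elimination step: multiply only the factors that mention
    var, sum var out, and leave every other factor untouched."""
    touching = [f for f in factor_set if var in f[0]]
    rest = [f for f in factor_set if var not in f[0]]
    scope, table = reduce(multiply, touching)
    new_scope = tuple(v for v in scope if v != var)
    return rest + [(new_scope, table.sum(axis=scope.index(var)))]

# Multiplying two factor sets is just the union (concatenation) of their members.
delta_1_2 = [(("X11",), np.array([0.4, 0.6])), (("X21",), np.array([0.7, 0.3]))]
psi_2 = [(("X11", "X12"), np.ones((2, 2))), (("X21", "X22"), np.ones((2, 2)))]
combined = delta_1_2 + psi_2
print(marginalize_out(combined, "X11")[-1][0])   # ('X12',)
```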
Global Approximate Inference • Inference as optimization • Generalized Belief Propagation • Define algorithm • Constructing cluster graphs • Analyze approximation guarantees • Propagation with approximate messages • Factorized messages • Approximate message propagation • Structured variational approximations
Approximate Message Propagation • Input • Clique tree (or cluster graph) • Assignments of original factors to clusters/cliques • The factorized form of each cluster/clique • Can be represented by a network for each edge Ci—Cj that specifies the factorization (in previous examples we assumed empty network) • Two strategies for approximate message propagation • Sum-product message passing scheme • Belief update messages
Sum-Product Propagation
• Same propagation scheme as in exact inference
  • Select a root
  • Propagate messages towards the root
    • Each cluster collects messages from its neighbors and sends outgoing messages when possible
  • Propagate messages from the root
• Each message passing step performs inference within a cluster
• Terminates in a fixed number of iterations
• Note: the final marginals at each variable are not exact
Message Passing: Belief Propagation
• Same as BP, but with approximate messages
• Initialize the clique tree
  • For each clique Ci set π_i ← ∏_{φ: α(φ)=i} φ
  • For each edge Ci—Cj set μ_{i,j} ← 1
• While unset cliques exist
  • Select Ci—Cj
  • Send message from Ci to Cj
    • Marginalize the clique over the sepset: δ_{i→j} ← Σ_{Ci \ Si,j} π_i (this is where the approximation enters)
    • Update the belief at Cj: π_j ← π_j · δ_{i→j} / μ_{i,j}
    • Update the sepset at Ci–Cj: μ_{i,j} ← δ_{i→j}
• The two message passing schemes differ in where the approximate inference is performed
Global Approximate Inference • Inference as optimization • Generalized Belief Propagation • Define algorithm • Constructing cluster graphs • Analyze approximation guarantees • Propagation with approximate messages • Factorized messages • Approximate message propagation • Structured variational approximations
Structured Variational Approx.
• Select a simple family of distributions Q
• Find Q ∈ Q that maximizes F[PF,Q]
Mean Field Approximation
• Q(X) = ∏_i Q(Xi)
• Q loses much of the information in PF
• The approximation is computationally attractive
  • Every query in Q is simple to compute
  • Q is easy to represent
• (Figure: PF, a 3×3 Markov grid network over X11...X33, and Q, the mean field network with no edges between the variables)
Mean Field Approximation
• The energy functional is easy to compute, even for networks where inference is complex: F[PF,Q] = Σ_{φ∈F} E_Q[ln φ] + Σ_i H_Q(Xi)
  • Since Q is fully factored, each term E_Q[ln φ] involves only the (small) scope of φ, and the entropy decomposes over the individual variables
Mean Field Maximization
• Maximizing the energy functional for mean field
  • Find Q(X) = ∏_i Q(Xi) that maximizes F[PF,Q]
  • Subject to, for all i: Σ_{xi} Q(xi) = 1
Mean Field Maximization
• Theorem: Q(Xi) is a stationary point of the mean field optimization given Q(X1),...,Q(Xi-1),Q(Xi+1),...,Q(Xn) if and only if
  Q(xi) = (1/Zi) exp{ Σ_{φ∈F} E_Q[ln φ | xi] }, where Zi is a local normalizing constant
• Proof:
  • To optimize Q(Xi), define the Lagrangian Li = F[PF,Q] + λ (Σ_{xi} Q(xi) − 1)
  • λ corresponds to the constraint that Q(Xi) is a distribution
  • We now compute the derivative of Li with respect to Q(xi):
    ∂Li/∂Q(xi) = Σ_{φ∈F} E_Q[ln φ | xi] − ln Q(xi) − 1 + λ
Mean Field Maximization
• Setting the derivative to zero and rearranging terms, we get:
  ln Q(xi) = λ − 1 + Σ_{φ∈F} E_Q[ln φ | xi]
• Taking exponents of both sides, we get:
  Q(xi) = (1/Zi) exp{ Σ_{φ∈F} E_Q[ln φ | xi] }, where Zi = exp{1 − λ} ensures that Q(Xi) sums to one
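Putting the fixed-point equation to work, here is a minimal sketch of mean-field coordinate ascent on a three-variable chain A–B–C with made-up binary pairwise potentials; each update is exactly Q(xi) ∝ exp{Σ_φ E_Q[ln φ | xi]}.

```python
import numpy as np

# Sketch of the mean-field coordinate update derived above,
#   Q(xi) ∝ exp{ sum_phi E_Q[ln phi | xi] },
# on a chain A - B - C with made-up binary pairwise potentials.
rng = np.random.default_rng(2)
phi_AB = rng.random((2, 2)) + 0.5
phi_BC = rng.random((2, 2)) + 0.5
Q = {v: np.full(2, 0.5) for v in "ABC"}         # initial fully factored Q

def normalize(p):
    return p / p.sum()

for _ in range(100):                             # coordinate ascent sweeps
    Q["A"] = normalize(np.exp(np.log(phi_AB) @ Q["B"]))
    Q["B"] = normalize(np.exp(Q["A"] @ np.log(phi_AB) + np.log(phi_BC) @ Q["C"]))
    Q["C"] = normalize(np.exp(Q["B"] @ np.log(phi_BC)))

print({v: q.round(3) for v, q in Q.items()})     # a fixed point of the updates
```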