Exact Inference on Graphical Models Samson Cheung
Outline • What is inference? • Overview • Preliminaries • Three general algorithms for inference • Elimination Algorithm • Belief Propagation • Junction Tree
What is inference? Given a fully specified joint distribution (the "database"), inference is the task of querying information about some random variables XF, given knowledge about other random variables (the evidence xE).
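The "database" view of inference can be sketched with a small joint table; the distribution and variable names below are illustrative stand-ins, not values from the lecture.

```python
import numpy as np

# Hypothetical 3-variable joint P(A, B, C) over binary variables,
# a stand-in for the fully specified joint "database" of the slide.
rng = np.random.default_rng(0)
joint = rng.random((2, 2, 2))
joint /= joint.sum()  # normalize to a valid distribution

# Marginal query: P(A) = sum over B and C.
p_a = joint.sum(axis=(1, 2))

# Conditional query with evidence C = 1: clamp C, sum out B, renormalize.
slice_c1 = joint[:, :, 1]
p_a_given_c1 = slice_c1.sum(axis=1) / slice_c1.sum()
```

Both query types reduce to summing entries of the joint and, for conditionals, renormalizing by the probability of the evidence.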
Conditional/Marginal Probability • Example: visual tracking – you compute the conditional of XF given the evidence xE to quantify the uncertainty in your tracking estimate.
Maximum A Posteriori Estimate • Example: error control coding – we care about the decoded symbol, and computing the error probability exactly is impractical at high bandwidth, so we ask only for the most likely value of XF given the evidence xE.
Inference is not easy • Example: a grid model over pixels with pairwise potential ψ(p,q) = exp(−|p−q|). • Computing marginals or MAP requires global communication! Given the evidence, the pairwise marginal is P(p,q) = Σ over G\{p,q} of p(G) – a sum over every other variable in the graph.
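The "global communication" point can be made concrete by brute-forcing a pairwise marginal on a tiny grid; the 2x2 grid and ternary states below are illustrative assumptions, with the slide's potential ψ(p,q) = exp(−|p−q|).

```python
import itertools
import numpy as np

# Tiny 2x2 grid MRF: nodes a-b on the top row, c-d on the bottom,
# with grid edges a-b, a-c, b-d, c-d; each variable takes values {0,1,2}.
nodes = ['a', 'b', 'c', 'd']
edges = [('a', 'b'), ('a', 'c'), ('b', 'd'), ('c', 'd')]
vals = [0, 1, 2]

def psi(p, q):
    """The slide's pairwise potential psi(p, q) = exp(-|p - q|)."""
    return np.exp(-abs(p - q))

# The pairwise marginal P(a, b) requires summing over ALL other
# variables -- for a real image this sum is astronomically large.
marg = np.zeros((3, 3))
for x in itertools.product(vals, repeat=4):
    assign = dict(zip(nodes, x))
    w = np.prod([psi(assign[u], assign[v]) for u, v in edges])
    marg[assign['a'], assign['b']] += w
marg /= marg.sum()
```

Even for four ternary variables the loop visits 81 joint configurations; the count grows exponentially with the number of pixels.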
Outline • What is inference? • Overview • Preliminaries • Three general algorithms for inference • Elimination Algorithm • Belief Propagation • Junction Tree
Inference Algorithms • General inference is NP-hard. • Exact (typically 10–100 nodes: expert systems, diagnostics, simulation): • Elimination Algorithm (general graphs) • Junction Tree (general graphs) • Belief Propagation (polytrees) • Approximate (typically >1000 nodes: image processing, vision, physics): • Iterated Conditional Modes • EM • Mean field • Variational techniques • Structured variational techniques • Monte Carlo • Expectation Propagation • Loopy belief propagation
Outline • What is inference? • Overview • Preliminaries • Three general algorithms for inference • Elimination Algorithm • Belief Propagation • Junction Tree
Calculating Marginals: Introducing Evidence • Inference means summing or maxing over "part" of the joint distribution. • In order not to be sidetracked by the evidence nodes, we roll them into the joint by multiplying in indicator functions δ(xE = x̄E). • Hence we can always sum or max over the entire joint distribution.
Moralization • Every directed graph can be converted to an undirected graph by linking up ("marrying") parents that share a common child. • From then on we deal only with undirected graphs. • Example: P(X1) P(X2|X1) P(X3|X1) P(X4|X1) P(X5|X2,X3) P(X6|X3,X4) becomes a product of potentials ψ(X1,X2,X3) ψ(X1,X3,X4) ψ(X2,X3,X5) ψ(X3,X4,X6).
Adding edges is "okay" • The pdf of an undirected graph can ALWAYS be expressed on the same graph with extra edges added: e.g. ψ(X1,X2,X3) ψ(X1,X3,X4) ψ(X2,X3,X5) ψ(X3,X4,X6) can be rewritten as ψ(X1,X2,X3,X4) ψ(X2,X3,X5) ψ(X3,X4,X6). • But a graph with more edges: • loses conditional-independence information (okay for inference, bad for parameter estimation); • uses more storage (a clique of k variables needs a table of N^k entries).
Undirected graphs and clique graphs • Example cliques: C1(X1,X2,X3), C2(X1,X3,X4), C3(X2,X3,X5), C4(X3,X4,X6), C5(X7,X8,X9), C6(X1,X7). • Clique graph: • each node is a clique from the parametrization; • an edge joins two nodes (cliques) that share common variables, labeled by their separator, e.g. C1 ∩ C3 = {X2, X3}.
Outline • What is inference? • Overview • Preliminaries • Three general algorithms for inference • Elimination Algorithm • Belief Propagation • Junction Tree
Computing Marginals • To compute P(X1) we need to marginalize out x2, x3, x4, x5. • Done naively, this sums N^5 terms (N is the number of symbols for each r.v.). • Can we do better?
Elimination (Marginalization) Order • Try to marginalize in the order x5, x4, x3, x2. • Each step costs at most O(N^3) computation and O(N^2) storage. • Overall complexity: O(KN^3), storage: O(N^2), where K = # of r.v.s.
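The elimination computation on this example can be sketched as follows. The five pairwise potentials match the edges used in the graphical-interpretation slide (1-2, 1-3, 2-5, 3-5, 2-4); their table entries are random stand-ins, not values from the lecture.

```python
import numpy as np

N = 4  # symbols per random variable
rng = np.random.default_rng(1)
edges = [(1, 2), (1, 3), (2, 5), (3, 5), (2, 4)]
factors = [((i, j), rng.random((N, N))) for i, j in edges]
orig = list(factors)  # keep the original potentials for a brute-force check

def eliminate(factors, var):
    """Multiply all factors that mention `var`, then sum `var` out."""
    touching = [f for f in factors if var in f[0]]
    rest = [f for f in factors if var not in f[0]]
    scope = sorted({v for vs, _ in touching for v in vs})
    letters = {v: chr(ord('a') + k) for k, v in enumerate(scope)}
    out_vars = [v for v in scope if v != var]
    spec = ','.join(''.join(letters[v] for v in vs) for vs, _ in touching)
    spec += '->' + ''.join(letters[v] for v in out_vars)
    msg = np.einsum(spec, *[t for _, t in touching])  # the new potential m(...)
    return rest + [(tuple(out_vars), msg)]

# Eliminate in the slides' order x5, x4, x3, x2; only a factor over X1 remains.
for var in [5, 4, 3, 2]:
    factors = eliminate(factors, var)
(final_vars, table), = factors
p_x1 = table / table.sum()

# Brute-force check: sum the full joint over x2..x5 directly (O(N^5) work),
# using letters a..e for variables 1..5.
brute = np.einsum('ab,ac,be,ce,bd->a', *[t for _, t in orig])
brute /= brute.sum()
```

Each `eliminate` call only ever touches factors whose scopes include the eliminated variable, which is exactly why the cost is bounded by the largest intermediate clique rather than by N^5.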
MAP is the same • Just replace summation with max. • Note: • the intermediate messages m differ from those of the marginal computation; • you need to remember the best configuration (the argmax) as you go.
Graphical Interpretation • Track the list of active potential functions as nodes are killed: • Start: C1(X1,X2), C2(X1,X3), C3(X2,X5), C4(X3,X5), C5(X2,X4) • Kill X5: C1(X1,X2), C2(X1,X3), C5(X2,X4), m5(X2,X3) • Kill X4: C1(X1,X2), C2(X1,X3), m4(X2), m5(X2,X3) • Kill X3: C1(X1,X2), m4(X2), m3(X1,X2) • Kill X2: m2(X1)
First real link to graph theory • Reconstituted graph = the graph containing all the extra edges added during elimination. • It depends on the elimination order! • The complexity of graph elimination is O(N^W), where W is the size of the largest clique in the reconstituted graph. Proof: exercise.
Finding the optimal order • Minimizing the largest clique size turns out to be NP-hard [1]. • Greedy algorithm [2]: 1. Find the node v in G that connects to the fewest neighbors. 2. Eliminate v and connect all its neighbors. 3. Go back to step 1 until G becomes a clique. • Current best techniques use simulated annealing [3] or approximation algorithms [4]. [1] S. Arnborg, D.G. Corneil, A. Proskurowski, "Complexity of finding embeddings in a k-tree," SIAM J. Algebraic and Discrete Methods 8 (1987) 277–284. [2] D. Rose, "Triangulated graphs and the elimination process," J. Math. Anal. Appl. 32 (1974) 597–609. [3] U. Kjærulff, "Triangulation of graphs – algorithms giving small total state space," Technical Report R 90-09, Department of Mathematics and Computer Science, Aalborg University, Denmark, 1990. [4] A. Becker, D. Geiger, "A sufficiently fast algorithm for finding close to optimal clique trees," Artificial Intelligence 125 (2001) 3–17.
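The greedy heuristic from this slide can be sketched directly; the adjacency structure below is the 5-node example graph from the elimination slides, and the function name is illustrative.

```python
# Greedy min-degree elimination ordering: repeatedly eliminate the vertex
# with the fewest neighbors, connecting its neighbors into a clique.
def greedy_order(adj):
    """adj: dict vertex -> set of neighbors (copied, not mutated)."""
    adj = {v: set(ns) for v, ns in adj.items()}
    order, max_clique = [], 0
    while adj:
        v = min(adj, key=lambda u: len(adj[u]))   # fewest neighbors
        nbrs = adj[v]
        max_clique = max(max_clique, len(nbrs) + 1)  # clique formed by eliminating v
        for a in nbrs:                 # fill-in: connect v's neighbors pairwise
            for b in nbrs:
                if a != b:
                    adj[a].add(b)
        for a in nbrs:
            adj[a].discard(v)
        del adj[v]
        order.append(v)
    return order, max_clique

# Usage on the 5-node example graph (edges 1-2, 1-3, 2-5, 3-5, 2-4).
adj = {1: {2, 3}, 2: {1, 4, 5}, 3: {1, 5}, 4: {2}, 5: {2, 3}}
order, width = greedy_order(adj)
```

This heuristic is not optimal in general, which is why the slide points to annealing and approximation algorithms for harder graphs.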
This is serious • One of the most commonly used graphical models in vision is the Markov random field over pixels I(x,y), with pairwise potentials ψ(p,q) = exp(−|p−q|). • Try to find an elimination order for this model: even for a small grid the largest clique is already 4, and the largest clique in the reconstituted graph grows linearly with the grid dimension.
Outline • What is inference? • Overview • Preliminaries • Three general algorithms for inference • Elimination Algorithm • Belief Propagation • Junction Tree
What about other marginals? • We have just computed P(X1). • What if we also need P(X2) or P(X5)? • Definitely, some part of the calculation can be reused – e.g. the message m5(X2,X3) is the same for both!
Focus on trees • Focus on tree-like structures: undirected trees, and directed trees, which remain trees after moralization. • Why trees?
Why trees? • No moralization is necessary (no node has two parents to marry). • There is a natural elimination ordering with the query node as root: depth-first, eliminating all children before their parent. • All sub-trees with no evidence nodes can be ignored. (Why? Exercise, for the undirected case.)
Elimination on trees • When we eliminate node j, the new potential function must be a function of xi. • Can it involve any other nodes? • Nothing in the sub-tree below j (already eliminated). • Nothing from other sub-trees, since the graph is a tree. • Only i, through the edge potential ψij which relates i and j. • Think of the new potential function as a message mji(xi) from node j to node i.
What is in the message? mji(xi) = Σxj E(xj) ψij(xi,xj) Πk∈c(j) mkj(xj): the message is created by summing over xj the product of all earlier messages mkj(xj) sent to j, together with the evidence term E(xj). • c(j) = children of node j • E(xj) = δ(xj = x̄j) if j is an evidence node; 1 otherwise
Elimination = passing messages upward • After passing the messages up to the query (root) node, we compute the conditional from the product of the messages arriving at the root. • What about answering other queries? A different query node becomes the root and needs its own set of messages (in the example, the query node needs 3).
Messages are reused! • We can compute all possible messages in only double the amount of work it takes to do one query. • Then we take the product of the relevant messages to get each marginal. • Even though the naive approach (rerunning Elimination per query) computes N(N−1) messages to find marginals for all N query nodes, there are only 2(N−1) distinct messages: one per direction on each of the N−1 edges.
Computing all possible messages • Idea: respect the following Message-Passing-Protocol: A node can send a message to a neighbour only when it has received messages from all its other neighbours. • Protocol is realizable: designate one node (arbitrarily) as the root. • Collect messages inward to root then distribute back out to leaves.
Belief Propagation • Every edge carries one message in each direction: between neighbors i and j, the messages mij and mji; likewise mjk/mkj and mjl/mlj for j's other neighbors k and l.
Belief Propagation (sum-product) • Choose a root node (arbitrarily, or as the first query node). • If j is an evidence node, E(xj) = δ(xj = x̄j); else E(xj) = 1. • Pass messages from leaves up to the root and then back down using: mji(xi) = Σxj E(xj) ψij(xi,xj) Πk∈N(j)\i mkj(xj). • Given the messages, compute marginals using: p(xi | x̄E) ∝ E(xi) Πk∈N(i) mki(xi).
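The two-pass sum-product schedule can be sketched as follows. The 4-node star tree, potential tables, and function names are illustrative assumptions (no evidence, so E(x) = 1 everywhere); the check at the end compares every BP marginal against brute-force summation.

```python
import numpy as np

N = 3
rng = np.random.default_rng(2)
psi = {(1, 2): rng.random((N, N)), (2, 3): rng.random((N, N)),
       (2, 4): rng.random((N, N))}
nbrs = {1: [2], 2: [1, 3, 4], 3: [2], 4: [2]}
msgs = {}

def pot(i, j):
    """psi table oriented so axis 0 indexes x_i and axis 1 indexes x_j."""
    return psi[(i, j)] if (i, j) in psi else psi[(j, i)].T

def send(j, i):
    """m_ji(x_i) = sum_{x_j} psi(x_i, x_j) * prod of messages into j (except from i)."""
    incoming = np.ones(N)
    for k in nbrs[j]:
        if k != i:
            incoming *= msgs[(k, j)]
    msgs[(j, i)] = pot(i, j) @ incoming

def collect(j, parent):        # pass 1: leaves -> root
    for k in nbrs[j]:
        if k != parent:
            collect(k, j)
    if parent is not None:
        send(j, parent)

def distribute(j, parent):     # pass 2: root -> leaves
    for k in nbrs[j]:
        if k != parent:
            send(j, k)
            distribute(k, j)

root = 1
collect(root, None)
distribute(root, None)

def marginal(i):
    """Product of all incoming messages, renormalized."""
    b = np.ones(N)
    for k in nbrs[i]:
        b *= msgs[(k, i)]
    return b / b.sum()

# Brute-force joint for verification (letters a..d index X1..X4).
joint = np.einsum('ab,bc,bd->abcd', psi[(1, 2)], psi[(2, 3)], psi[(2, 4)])
joint /= joint.sum()
```

The schedule obeys the message-passing protocol from the previous slide: `collect` guarantees all child messages exist before a node sends upward, and `distribute` sends downward only after the parent's message has arrived.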
MAP is the same (max-product) • Choose a root node arbitrarily. • If j is an evidence node, E(xj) = δ(xj = x̄j); else E(xj) = 1. • Pass messages from leaves up to the root using: mji(xi) = maxxj E(xj) ψij(xi,xj) Πk∈N(j)\i mkj(xj), remembering which choice xj = xj* yielded the maximum. • Given the messages, compute the max value at any node i. • Retrace steps from the root back to the leaves, recalling the best xj* at each step, to recover the maximizing configuration x*.
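The max-product upward pass with backpointers and the downward retrace can be sketched on the same kind of small tree; the graph, tables, and names are illustrative assumptions (no evidence).

```python
import numpy as np

N = 3
rng = np.random.default_rng(3)
psi = {(1, 2): rng.random((N, N)), (2, 3): rng.random((N, N)),
       (2, 4): rng.random((N, N))}
nbrs = {1: [2], 2: [1, 3, 4], 3: [2], 4: [2]}
msgs, back = {}, {}

def pot(i, j):
    """psi table oriented so axis 0 indexes x_i and axis 1 indexes x_j."""
    return psi[(i, j)] if (i, j) in psi else psi[(j, i)].T

def collect(j, parent):
    """Upward pass: max-product messages plus argmax backpointers."""
    for k in nbrs[j]:
        if k != parent:
            collect(k, j)
    if parent is not None:
        incoming = np.ones(N)
        for k in nbrs[j]:
            if k != parent:
                incoming *= msgs[(k, j)]
        scores = pot(parent, j) * incoming          # broadcasts over axis 1 (x_j)
        msgs[(j, parent)] = scores.max(axis=1)
        back[(j, parent)] = scores.argmax(axis=1)   # best x_j for each x_parent

root = 1
collect(root, None)
belief = np.ones(N)
for k in nbrs[root]:
    belief *= msgs[(k, root)]
assign = {root: int(belief.argmax())}

def backtrack(j, parent):
    """Downward retrace: recall the best child value given the parent's choice."""
    for k in nbrs[j]:
        if k != parent:
            assign[k] = int(back[(k, j)][assign[j]])
            backtrack(k, j)

backtrack(root, None)
```

Only the upward pass is needed before backtracking, since the argmax tables already encode the downward information.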
"Tree"-like graphs work too • A polytree is not a directed tree, and it is not an undirected tree after moralization either – but its corresponding factor graph IS a tree. • Pearl (1988) showed that BP works on factor trees. • See Jordan, Chapter 4, for more details.
Outline • What is inference? • Overview • Preliminaries • Three general algorithms for inference • Elimination Algorithm • Belief Propagation • Junction Tree
What about arbitrary graphs? • BP only works on tree-like graphs. • Question: is there an algorithm for general graphs? • Also, after BP we get the marginal of each INDIVIDUAL random variable, but the graph is characterized by cliques. • Question: can we get the marginal for every clique?
Mini-outline • Back to Reconstituted Graph • Three equivalent concepts • Triangulated graph – easy to validate • Decomposable graph – link to probability • Junction Tree – computational inference • Junction Tree Algorithm • Example
Back to the reconstituted graph • The reconstituted graph is an example of a very important class: triangulated (chordal) graphs. • Definition: a graph is triangulated if every loop with 4 or more nodes has a chord. • All trees are triangulated; all cliques are triangulated; a 4-cycle without a chord is not.
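Checking triangulation is easy in practice. Below is a sketch of the standard test (maximum cardinality search followed by a perfect-elimination-order check); the function names and example graphs are illustrative, not from the lecture.

```python
def mcs_order(adj):
    """Maximum cardinality search: repeatedly pick the vertex with the
    most already-selected neighbors; adj maps vertex -> set of neighbors."""
    weight = {v: 0 for v in adj}
    order = []
    while weight:
        v = max(weight, key=lambda u: weight[u])
        order.append(v)
        del weight[v]
        for u in adj[v]:
            if u in weight:
                weight[u] += 1
    return order[::-1]  # reversed MCS order is a PEO iff the graph is chordal

def is_chordal(adj):
    """Check that the MCS ordering is a perfect elimination ordering:
    each vertex's later neighbors must form a clique (it suffices to
    check them against the earliest later neighbor)."""
    order = mcs_order(adj)
    pos = {v: k for k, v in enumerate(order)}
    for v in order:
        later = [u for u in adj[v] if pos[u] > pos[v]]
        if later:
            u = min(later, key=lambda w: pos[w])
            if not all(w in adj[u] for w in later if w != u):
                return False
    return True

cycle4 = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}         # chordless 4-cycle
chorded = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2, 4}, 4: {1, 3}}  # 4-cycle + chord 1-3
```

This matches the slide's examples: the bare 4-cycle fails the test, while adding a single chord (or taking any tree) passes.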
Proof • Claim: for any N-node graph, the reconstituted graph after elimination is triangulated. • Proof by induction: • N = 1: trivial. • Assume the claim holds for N = k. • N = k+1: let v be the first node eliminated. Eliminating v connects all of v's neighbors into a clique, so every cycle through v has a chord; the reconstituted graph on the remaining k nodes is triangulated by the induction hypothesis.
Lessons from graph theory • Graph coloring problem: find the smallest number of vertex colors such that adjacent vertices receive different colors (= the chromatic number). • Sample application 1: scheduling • Node = task • Edge = two tasks are incompatible • Each color class = a set of tasks that can run in parallel • Sample application 2: communication • Node = symbol • Edge = two symbols may produce the same output due to transmission error • Largest set of vertices with the same color = number of symbols that can be reliably sent
Lessons from graph theory • Determining the chromatic number is NP-hard in general – but not for the class of Perfect Graphs. • Definition: a graph is perfect if, for every induced subgraph, the chromatic number equals the size of the largest clique. • Triangulated graphs are an important family of perfect graphs. • The Strong Perfect Graph Conjecture was proved in 2002 (a 148-page proof!). • Bottom line: triangulated graphs are "algorithmically friendly" – it is very easy to check whether a graph is triangulated and to compute properties of such graphs.
Link to Probability: Graph Decomposition • Definition: given a graph G, a triple (A,B,S) with Vertex(G) = A ∪ B ∪ S is a decomposition of G if • S separates A and B (i.e. every path from a ∈ A to b ∈ B must pass through S), and • S is a clique. • Definition: G is decomposable if • G is complete, or • there exists a decomposition (A,B,S) of G such that A ∪ S and B ∪ S are decomposable.
What’s the big deal? Decomposable graph can be parametrized by marginals! If G is decomposable, then where C1,C2, …,CNare cliques in G, and S1,S2, …,SN-1 are (special) separators between cliques. Notice there are one less separators than cliques. Equivalently, we can say that G can parameterized by marginals p(xC) and ratios of marginals, p(xC)/p(xS)
This is not true in general • For a non-decomposable graph such as the 4-cycle on A, B, C, D, the claim fails: if the joint could be expressed as a product of marginals or ratios of marginals, at least one of the potentials would itself be a marginal. • However, the residual factor f(XA, XB) is not a constant, so no potential can be a marginal.
Proof (decomposable ⇒ parametrized by marginals), by induction: G can be decomposed into (A, B, S), where A ∪ S and B ∪ S are decomposable, S separates A and B, and S is complete. All cliques of G are subsets of either A ∪ S or B ∪ S.
Continued: p(x) = p(xA∪S) p(xB∪S) / p(xS); then recursively apply the induction hypothesis to A ∪ S and B ∪ S.
So what? It turns out that: Triangulated Graph ⇔ Decomposable Graph. Decomposable graphs can be parametrized by marginals; triangulated graphs are nice algorithmically.
Decomposable ⇒ Triangulated • Prove by induction: if G is complete, it is triangulated. Otherwise, decompose G into (A, B, S). By the induction hypothesis, G(A∪S) and G(B∪S) are triangulated, and thus all cycles inside them have a chord. The remaining case is a cycle that spans A, B and S: such a cycle must pass through S at least twice, and since S is complete, those two vertices of S are adjacent – a chord! QED
Triangulated ⇒ Decomposable • Prove by induction. Let G be a triangulated graph with N nodes; show that G can be decomposed into (A, B, S). If G is complete, we are done. If not, choose non-adjacent nodes a and b, and let: • S = a smallest set that intersects every path between a and b • A = all nodes in G \ S reachable from a • B = all nodes in G \ S reachable from b • Clearly A and B are separated by S.
Triangulated ⇒ Decomposable (continued) • It remains to prove that S is complete. Consider arbitrary c, d ∈ S. There is a path a…c…b such that c is the only node in S (if not, S would not be minimal, as c could be moved into either A or B). Similarly, there is a path a…d…b. Together these form a cycle through c and d. Since G is triangulated, this cycle must have a chord, and since S separates A and B, the chord must lie entirely within A ∪ S or B ∪ S. Keep shrinking the cycle with such chords; eventually there must be a chord between c and d, hence S must be complete.