Exact and approximate inference in probabilistic graphical models. Kevin Murphy (MIT CSAIL, UBC CS/Stats). www.ai.mit.edu/~murphyk/AAAI04. AAAI 2004 tutorial. SP2-1
Recommended reading • Cowell, Dawid, Lauritzen, Spiegelhalter, "Probabilistic Networks and Expert Systems", 1999 • Jensen, "Bayesian Networks and Decision Graphs", 2001 • Jordan (due 2005), "Probabilistic Graphical Models" • Koller & Friedman (due 2005), "Bayes Nets and Beyond" • "Learning in Graphical Models", edited by M. Jordan SP2-2
Outline • Introduction • Exact inference • Approximate inference • Deterministic • Stochastic (sampling) • Hybrid deterministic/ stochastic SP2-3
2 reasons for approximate inference • Low treewidth BUT non-linear/non-Gaussian: chains (e.g., a non-linear dynamical system) and trees with no loops (e.g., (Bayesian) parameter estimation) • High treewidth: loopy graphs, e.g., an N = n x n grid [figure: chain X1-X2-X3 with observations Y1, Y2, Y3, and a grid-structured MRF] SP2-4
Complexity of approximate inference • Approximating P(Xq|Xe) to within a constant factor for all discrete BNs is NP-hard [Dagum93]. In practice, many models exhibit "weak coupling", so we may safely ignore certain dependencies. • Computing P(Xq|Xe) for all polytrees with discrete and Gaussian nodes is NP-hard [Lerner01]. In practice, some of the modes of the posterior will have negligible mass. SP2-5
2 objective functions • Approximate the true posterior P(h|v) by Q(h) • Variational: globally optimize all terms wrt a simpler Q by minimizing D(Q||P); this forces P=0 => Q=0, so Q tends to lock onto one mode of P • Expectation propagation (EP): sequentially optimize each term, approximately minimizing D(P||Q); this forces Q=0 => P=0, so Q tends to cover all the mass of P SP2-6
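For reference, the two divergences (written for discrete h, in the notation of the slides) are

    D(Q||P) = sum_h Q(h) log [ Q(h) / P(h|v) ]
    D(P||Q) = sum_h P(h|v) log [ P(h|v) / Q(h) ]

D(Q||P) blows up if Q puts mass where P has none, which is why its minimizer is "zero-forcing" and mode-seeking; D(P||Q) blows up if Q misses mass that P has, which is why its minimizer is "zero-avoiding" and tends to over-cover P.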
Outline • Introduction • Exact inference • Approximate inference • Deterministic • Variational • Loopy belief propagation • Expectation propagation • Graph cuts • Stochastic (sampling) • Hybrid deterministic/ stochastic SP2-7
Free energy • Variational goal: minimize D(Q||P) wrt Q, where Q has a simpler form than P • P(h,v) is simpler (easier to evaluate) than P(h|v), so use the free energy F(Q,P) = sum_h Q(h) log [ Q(h) / P(h,v) ] • The free energy is an upper bound on the negative log-likelihood: F(Q,P) = D(Q || P(h|v)) - log P(v) >= -log P(v), with equality iff Q(h) = P(h|v) SP2-8
Point estimation • Use a degenerate posterior Q(h) = delta(h - h*) (and similarly for the parameters) • Minimize F by coordinate descent on the components of h* • Iterative Conditional Modes (ICM): for each iteration, for each hi, set hi := argmax_{hi} P(hi | h_{-i}, v), which only involves the factors in the Markov blanket of hi • Example: K-means clustering • Ignores uncertainty in P(h|v) and P(theta|v) • Tends to get stuck in local minima (see the sketch below) SP2-9
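A minimal sketch of ICM on a concrete model, a binary (+/-1) Ising-style grid MRF; the energy function, the coupling strength, and the function name are illustrative assumptions, not part of the tutorial.

    import numpy as np

    def icm_ising(unary, coupling=1.0, n_iters=10):
        """Iterative Conditional Modes for a binary (+/-1) grid MRF (illustrative model).
        unary: (H, W) array of local evidence favouring spin +1 at each pixel."""
        H, W = unary.shape
        x = np.where(unary > 0, 1, -1)          # initialize at the local evidence
        for _ in range(n_iters):
            for i in range(H):
                for j in range(W):
                    # sum of neighbouring spins (the Markov blanket of pixel (i, j))
                    nb = 0.0
                    if i > 0:     nb += x[i - 1, j]
                    if i < H - 1: nb += x[i + 1, j]
                    if j > 0:     nb += x[i, j - 1]
                    if j < W - 1: nb += x[i, j + 1]
                    # choose the spin minimizing the local energy -unary*s - coupling*nb*s
                    x[i, j] = 1 if unary[i, j] + coupling * nb > 0 else -1
        return x

    # usage: denoise a noisy binary image encoded as +/-1 evidence
    noisy = np.sign(np.random.randn(20, 20) + 1.0)
    clean = icm_ising(noisy, coupling=0.5)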
Expectation Maximization (EM) • Point estimates for the parameters theta (ML or MAP), full posterior for the hidden vars. • E-step: minimize F(Q,P) wrt Q(h), i.e., exact inference of P(h|v,theta) • M-step: minimize F(Q,P) wrt theta, i.e., maximize the expected complete-data log-likelihood (plus the parameter prior, for MAP) SP2-10
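Written out as coordinate descent on F(Q, theta) (standard EM, with theta denoting the parameters):

    E-step:  Q(h) := P(h | v, theta_old)                                   (exact inference)
    M-step:  theta_new := argmax_theta  E_Q[ log P(h, v | theta) ] + log P(theta)

The E-step makes the bound tight at theta_old; the M-step maximizes the expected complete-data log-likelihood (the log P(theta) term is only present for MAP estimation).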
EM: tricks of the trade • Generalized EM [Neal98] • Partial M-step: reduce F(Q,P) wrt theta [e.g., gradient method] • Partial E-step: reduce F(Q,P) wrt Q(h) [approximate inference] • Avoiding local optima • Deterministic annealing [Rose98] • Data resampling [Elidan02] • Speedup tricks • Combine with conjugate gradient [Salakhutdinov03] • Online/incremental updates [Bauer97, Neal98] SP2-11
Variational Bayes (VB) [Ghahramani00, Beal02] • Use a factorized posterior over hidden variables and parameters, Q(h, theta) = Q(h) Q(theta) • For exponential family models with conjugate priors, this results in a generalized version of EM • E-step: modified inference that takes into account the uncertainty in the parameters • M-step: optimize Q(theta) using expected sufficient statistics • Variational Message Passing [Winn04] automates this, assuming a fully factorized (mean field) Q; see variational-Bayes.org SP2-12
Variational inference for discrete-state models with high treewidth • We assume the parameters are fixed. • We assume Q(h) has a simple form, so we can easily compute the marginals (and hence the expectations) needed to minimize F. • Mean field: fully factorized, Q(h) = prod_i Q_i(h_i) • Structured variational: Q keeps some tractable structure, e.g., a product of chains approximating a grid MRF [Xing04] SP2-13
Variational inference for MRFs • Probability is exp(-energy): P(x) = (1/Z) exp(-E(x)) • Free energy = average energy - entropy: F(Q) = E_Q[E(x)] - H(Q) = -log Z + D(Q||P) >= -log Z SP2-14
Mean field for MRFs • Fully factorized approximation: Q(x) = prod_i Q_i(x_i) • Normalization constraint: sum_{x_i} Q_i(x_i) = 1 for each i • Average energy: E_Q[E(x)] = sum_i sum_{x_i} Q_i(x_i) E_i(x_i) + sum_{<ij>} sum_{x_i, x_j} Q_i(x_i) Q_j(x_j) E_ij(x_i, x_j) • Entropy: H(Q) = -sum_i sum_{x_i} Q_i(x_i) log Q_i(x_i) • Local minima satisfy the fixed-point equations Q_i(x_i) proportional to exp( -E_i(x_i) - sum_{j in N(i)} sum_{x_j} Q_j(x_j) E_ij(x_i, x_j) ), iterated until convergence (see the sketch below) SP2-15
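A minimal sketch of these mean-field updates for a concrete model, a binary (+/-1) Ising grid; the function name, the coupling strength, and the tanh form of the update (which is what the fixed-point equation above reduces to for this particular energy) are all specific to this illustrative choice.

    import numpy as np

    def mean_field_ising(unary, coupling=1.0, n_iters=50):
        """Fully factorized Q(x) = prod_i Q_i(x_i) for an illustrative Ising grid.
        Returns mu[i, j] = E_Q[x_ij] in [-1, 1]."""
        H, W = unary.shape
        mu = np.zeros((H, W))                     # mean-field parameters
        for _ in range(n_iters):
            for i in range(H):
                for j in range(W):
                    nb = 0.0                      # expected neighbouring spins
                    if i > 0:     nb += mu[i - 1, j]
                    if i < H - 1: nb += mu[i + 1, j]
                    if j > 0:     nb += mu[i, j - 1]
                    if j < W - 1: nb += mu[i, j + 1]
                    # fixed point: mu_ij = tanh(unary_ij + coupling * sum of neighbour means)
                    mu[i, j] = np.tanh(unary[i, j] + coupling * nb)
        return mu

    # usage: noisy +/-1 image as local evidence
    mu = mean_field_ising(0.8 * np.sign(np.random.randn(20, 20)), coupling=0.5)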
Outline • Introduction • Exact inference • Approximate inference • Deterministic • Variational • Loopy belief propagation • Expectation propagation • Graph cuts • Stochastic (sampling) • Hybrid deterministic/ stochastic SP2-16
BP vs mean field for MRFs • Mean field updates: Q_i(x_i) proportional to exp( -E_i(x_i) - sum_{j in N(i)} sum_{x_j} Q_j(x_j) E_ij(x_i, x_j) ), so node i sends the same quantity (its marginal Q_i) to all neighbors • BP updates: m_ij(x_j) proportional to sum_{x_i} exp( -E_i(x_i) - E_ij(x_i, x_j) ) prod_{k in N(i), k != j} m_ki(x_i), with beliefs b_i(x_i) proportional to exp(-E_i(x_i)) prod_{k in N(i)} m_ki(x_i) • Every node i sends a different message to each neighbor j (see the code sketch below) • Empirically, BP is much better than MF (e.g., MF is not exact even for trees) [Weiss01] • BP is (attempting to) minimize the Bethe free energy [Yedidia01] SP2-17
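For concreteness, a compact sum-product loopy BP sketch for a pairwise MRF; the dictionary-based representation, the function name, and the toy 3-node cycle in the usage lines are illustrative choices, not anything prescribed by the tutorial.

    import numpy as np

    def loopy_bp(node_pot, edge_pot, n_iters=30):
        """Sum-product loopy BP for a pairwise MRF (illustrative sketch).
        node_pot: dict  i -> array of positive reals, shape (K_i,)
        edge_pot: dict (i, j) -> array of shape (K_i, K_j), one per undirected edge
        Returns approximate marginals (beliefs) for each node."""
        nbrs = {i: [] for i in node_pot}
        for (i, j) in edge_pot:
            nbrs[i].append(j)
            nbrs[j].append(i)
        msg = {}                                   # messages on directed edges
        for (i, j) in edge_pot:
            msg[(i, j)] = np.ones(len(node_pot[j]))
            msg[(j, i)] = np.ones(len(node_pot[i]))
        for _ in range(n_iters):
            new_msg = {}
            for (i, j) in list(msg):
                pot = edge_pot[(i, j)] if (i, j) in edge_pot else edge_pot[(j, i)].T
                # product of node potential and all incoming messages except the one from j
                incoming = node_pot[i].astype(float)
                for k in nbrs[i]:
                    if k != j:
                        incoming = incoming * msg[(k, i)]
                m = pot.T @ incoming               # sum over x_i
                new_msg[(i, j)] = m / m.sum()      # normalize for numerical stability
            msg = new_msg
        beliefs = {}
        for i in node_pot:
            b = node_pot[i].astype(float)
            for k in nbrs[i]:
                b = b * msg[(k, i)]
            beliefs[i] = b / b.sum()
        return beliefs

    # usage: a 3-node cycle of binary variables with attractive pairwise potentials
    phi = np.array([[2.0, 1.0], [1.0, 2.0]])
    print(loopy_bp({0: np.array([1.0, 2.0]), 1: np.ones(2), 2: np.ones(2)},
                   {(0, 1): phi, (1, 2): phi, (0, 2): phi}))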
Bethe free energy • We assume the graph is a tree, in which case the following is exact: P(x) = prod_{<ij>} b_ij(x_i, x_j) / prod_i b_i(x_i)^(d_i - 1), where d_i = #neighbors of node i • Constraints • Normalization: sum_{x_i} b_i(x_i) = 1 and sum_{x_i, x_j} b_ij(x_i, x_j) = 1 • Marginalization: sum_{x_i} b_ij(x_i, x_j) = b_j(x_j) • Average energy: sum_{<ij>} sum_{x_i, x_j} b_ij(x_i, x_j) E_ij(x_i, x_j) + sum_i sum_{x_i} b_i(x_i) E_i(x_i) • Entropy: -sum_{<ij>} sum_{x_i, x_j} b_ij(x_i, x_j) log b_ij(x_i, x_j) + sum_i (d_i - 1) sum_{x_i} b_i(x_i) log b_i(x_i) SP2-18
BP minimizes Bethe free energy [Yedidia01] • Theorem [Yedidia, Freeman, Weiss]: fixed points of BP are stationary points of the Bethe free energy • BP may not converge; other algorithms can directly minimize F_Bethe, but are slower. • If BP does not converge, this often means F_Bethe is a poor approximation SP2-19
Kikuchi free energy [figure: the same 2x3 grid (nodes 1-6) covered by its edges in the Bethe case, and by larger overlapping clusters in the Kikuchi case] • Cluster groups of nodes together into regions r • Energy per region: E_r(x_r) = sum of the potentials contained in region r • Free energy per region: F_r = sum_{x_r} b_r(x_r) [ E_r(x_r) + log b_r(x_r) ] • Kikuchi free energy: F_Kikuchi = sum_r c_r F_r, where the c_r are counting numbers SP2-20
Counting numbers [figure: region graphs for the 2x3 grid (nodes 1-6). Bethe: top-level regions are the edges 12, 23, 14, 25, 36, 45, 56; the intersections are the single nodes 1..6, with counting numbers C = -1, -2, -1, -1, -2, -1 (each node gets c_i = 1 - d_i). Kikuchi: top-level regions 1245 and 2356; their intersection 25 gets C = 1 - (1 + 1) = -1.] F_Kikuchi is exact if the region graph contains 2 levels (regions and intersections) and has no cycles -- equivalent to a junction tree! SP2-21
Generalized BP [figure: 3x3 grid (nodes 1-9) with overlapping 2x2 regions (e.g., 2356, 4578, 5689), their intersections 25, 45, 56, 58, and the central node 5] • F_Kikuchi is no longer exact, but it is more accurate than F_Bethe • Generalized BP can be used to minimize F_Kikuchi • This method of choosing regions is called the "cluster variational method" [Welling04] • In the limit (of ever larger regions), we recover the junction tree algorithm. SP2-22
Outline • Introduction • Exact inference • Approximate inference • Deterministic • Variational • Loopy belief propagation • Expectation propagation • Graph cuts • Stochastic (sampling) • Hybrid deterministic/ stochastic SP2-23
Expectation Propagation (EP) Minka01 • EP = iterated assumed density filtering • ADF = recursive Bayesian estimation interleaved with projection step • Examples of ADF: • Extended Kalman filtering • Moment-matching (weak marginalization) • Boyen-Koller algorithm • Some online learning algorithms SP2-24
Assumed Density Filtering (ADF) x Recursive Bayesian estimation(sequential updating of posterior) Y1 Yn • If p(yi|x) not conjugate to p(x), then p(x|y1:i) may not be tractably representable • So project posterior back to representable family • And repeat update project Projection becomes moment matching SP2-25
Expectation Propagation • The exact posterior p(x) proportional to prod_i f_i(x) is intractable • ADF: approximate each term f_i once, in sequence -- simple and non-iterative, but inaccurate: it is sensitive to the order of updates and approximates each posterior myopically • EP: revisit and re-approximate each term iteratively until convergence -- simple, iterative, accurate (after Ghahramani) SP2-26
Expectation Propagation • Input: factors f_0(x), ..., f_N(x), with p(x) proportional to prod_i f_i(x) • Initialize: term approximations ~f_i(x) (e.g., to 1) and q(x) proportional to prod_i ~f_i(x), where ~f_i denotes the approximation to term f_i • Repeat • For i = 0..N • Deletion: q_{-i}(x) proportional to q(x) / ~f_i(x) • Projection: q_new = argmin over the approximating family of D( p_hat || q ), where p_hat(x) proportional to f_i(x) q_{-i}(x) • Inclusion: ~f_i(x) proportional to q_new(x) / q_{-i}(x) • Until convergence • Output: q(x) (after Ghahramani) SP2-27
BP is a special case of EP • BP assumes a fully factorized approximation q(x) = prod_k q_k(x_k) • At each iteration, for each factor f_i, for each node x_k in that factor, the KL projection matches moments, i.e., it computes the marginal of x_k by absorbing messages from the neighboring factors [figure: factor graph fragment with factors f_i, f_j and variables x_k, x_n1, x_n2] SP2-28
TreeEP [Minka03] • TreeEP assumes q(x) is represented by a tree (regardless of the "true" model topology). • We can use the junction tree (Jtree) algorithm to do the moment matching at each iteration. • Faster and more accurate than LBP. • Faster than GBP, with comparable accuracy. SP2-29
Outline • Introduction • Exact inference • Approximate inference • Deterministic • Variational • Loopy belief propagation • Expectation propagation • Graph cuts • Stochastic (sampling) • Hybrid deterministic/ stochastic SP2-30
MPE in MRFs • MAP estimation = energy minimization: argmax_x P(x) = argmin_x E(x), where P(x) = (1/Z) exp(-E(x)) • Simplifications: • Only pairwise potentials: E(x) = sum_i E_i(x_i) + sum_{<ij>} E_ij(x_i, x_j) (i.e., E_ijk = 0, etc.) • Special form for the potentials • Binary variables: x_i in {0,1} SP2-31
Kinds of potential • Metric: V(a,b) = 0 iff a = b; V(a,b) = V(b,a) >= 0; V(a,b) <= V(a,c) + V(c,b) (triangle inequality) • Semi-metric: satisfies the first two conditions, but not necessarily the triangle inequality • Piecewise constant, e.g., the Potts model V(a,b) = K [a != b] (a metric) • Piecewise smooth, e.g., truncated quadratic V(a,b) = min((a-b)^2, K) (a semi-metric) or truncated absolute difference V(a,b) = min(|a-b|, K) (a metric) • Discontinuity-preserving potentials avoid oversmoothing SP2-32
GraphCuts [Kolmogorov04] • Thm: for binary variables and pairwise potentials, we can find argmin_x E(x) in at most O(N^3) time using a maxflow/mincut algorithm on the graph below, iff the potentials are submodular, i.e., E_ij(0,0) + E_ij(1,1) <= E_ij(0,1) + E_ij(1,0) • Metric potentials (e.g., Potts) are always submodular. • Thm: the general case (e.g., non-binary or non-submodular potentials) is NP-hard. [figure: graph with source s, sink t and nodes x_i, x_j; writing A = E_ij(0,0), B = E_ij(0,1), C = E_ij(1,0), D = E_ij(1,1), the edge capacities shown are C-A (s to x_i), B+C-A-D (x_i to x_j), and C-D (x_j to t)] SP2-33
Finding a strong local minimum • For the non-binary case, we can find the optimum wrt some large space of moves by iteratively solving binary subproblems. • α-expansion: any pixel can change its current label to α • α-β swap: any pixel labeled α can switch to β, and vice versa (picture from Zabih) SP2-34
Finding a strong local minimum • Start with an arbitrary assignment f • done := false • While not done • done := true • For each label α • Find f' = argmin E over all labelings within one α-expansion of f -- a binary subproblem, solved exactly by graph cuts • If E(f') < E(f) then done := false; f := f' SP2-35
Properties of the 2 algorithms • α-expansion • Requires V to be submodular (e.g., a metric) • O(L) graph cuts per cycle, for L labels • Result is within a factor 2c(V) of the optimal energy; c = 1 for the Potts model • α-β swap • Requires V to be a semi-metric • O(L^2) graph cuts per cycle • No comparable theoretical guarantee, but works well in practice SP2-36
Summary of inference methods for pairwise MRFs • Marginals • Mean field • Loopy/ generalized BP (sum-product) • EP • Gibbs sampling • Swendsen-Wang • MPE/ Viterbi • Iterative conditional modes (ICM) • Loopy/generalized BP (max-product) • Graph cuts • Simulated annealing See Boykov01, Weiss01 and Tappen03 for some empirical comparisons SP2-37
Outline • Introduction • Exact inference • Approximate inference • Deterministic • Stochastic (sampling) • Hybrid deterministic/ stochastic SP2-38
Monte Carlo (sampling) methods • Goal: estimate expectations E_P[f(X)] = sum_x f(x) P(x); e.g., f(x) = I(x_q = j) gives the marginal P(X_q = j | x_e) • Draw N independent samples x^r ~ P and use E_P[f] ~= (1/N) sum_r f(x^r) • The accuracy is independent of the dimensionality of X • But it is hard to draw (independent) samples from P (see the sketch below) SP2-39
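A tiny illustration of the idea; the target distribution and test function below are arbitrary choices, used only because the true answer is known.

    import numpy as np

    # Plain Monte Carlo: estimate E_P[f(X)] by averaging f over samples from P.
    # Here P = N(0, 1) and f(x) = I(x > 1), so the true value is about 0.159.
    N = 100_000
    samples = np.random.randn(N)              # x^r ~ P
    estimate = np.mean(samples > 1.0)         # (1/N) * sum_r f(x^r)
    print(estimate)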
Importance Sampling • We sample from Q(x) and reweight: E_P[f] ~= sum_r w_r f(x^r) / sum_r w_r, with x^r ~ Q and weights w_r = P*(x^r) / Q*(x^r), where P* and Q* are the unnormalized versions of P and Q • Requires Q(x) > 0 wherever P(x) > 0 SP2-40
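A small self-normalized importance-sampling illustration; the target, proposal, and test function are arbitrary choices made so that the true answer (1.0) is known.

    import numpy as np

    rng = np.random.default_rng(0)

    # Estimate E_P[x^2] for P = N(0, 1) using a broader proposal Q = N(0, 2^2).
    N = 100_000
    x = rng.normal(0.0, 2.0, N)                                   # x^r ~ Q
    log_w = (-0.5 * x**2) - (-0.5 * (x / 2.0)**2 - np.log(2.0))   # log P*(x) - log Q*(x)
    w = np.exp(log_w - log_w.max())                               # stabilize before normalizing
    w /= w.sum()
    estimate = np.sum(w * x**2)                                   # approx 1.0
    print(estimate)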
Importance Sampling for BNs (likelihood weighting) • Input: CPDs P(Xi | Pa(Xi)), evidence xE • Output: P(Xq | xE) ~= sum_r w^r I(x_q^r = x_q) / sum_r w^r • For each sample r • w^r = 1 • For each node i in topological order • If Xi is observed, then x_i^r := x_i^E and w^r := w^r * P(Xi = x_i^E | Pa(Xi) = pa_i^r) • Else x_i^r ~ P(Xi | Pa(Xi) = pa_i^r) (see the sketch below) SP2-41
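A sketch of this procedure on the classic water-sprinkler network (Cloudy, Sprinkler, Rain, WetGrass; the network drawn on the next slide); the CPT numbers are the usual textbook values and, like the function name, are included only to make the example runnable.

    import numpy as np

    rng = np.random.default_rng(0)

    # CPTs for the water-sprinkler network (textbook numbers, for illustration only)
    P_C = 0.5
    P_S_given_C = {0: 0.5, 1: 0.1}                      # P(S=1 | C)
    P_R_given_C = {0: 0.2, 1: 0.8}                      # P(R=1 | C)
    P_W_given_SR = {(0, 0): 0.0, (0, 1): 0.9,
                    (1, 0): 0.9, (1, 1): 0.99}          # P(W=1 | S, R)

    def likelihood_weighting(n_samples=100_000, evidence_W=1):
        """Estimate P(R=1 | W=evidence_W): sample unobserved nodes in topological
        order, weight each sample by the likelihood of the evidence."""
        total_w, rain_w = 0.0, 0.0
        for _ in range(n_samples):
            c = int(rng.random() < P_C)                  # sample C ~ P(C)
            s = int(rng.random() < P_S_given_C[c])       # sample S ~ P(S | C)
            r = int(rng.random() < P_R_given_C[c])       # sample R ~ P(R | C)
            p_w1 = P_W_given_SR[(s, r)]
            w = p_w1 if evidence_W == 1 else 1.0 - p_w1  # weight by P(W = w_E | S, R)
            total_w += w
            rain_w += w * r
        return rain_w / total_w

    print(likelihood_weighting())   # roughly 0.7 for these CPTs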
Drawbacks of importance sampling [figure: the water-sprinkler network (Cloudy, Sprinkler, Rain, WetGrass) before and after evidence reversal] • We sample given the upstream evidence, but only weight by the downstream evidence. • Evidence reversal = modify the model so that all observed nodes become parents -- can be expensive. • Does not scale to high-dimensional spaces, even if Q is similar to P, since the variance of the weights is too high. SP2-42
Sequential importance sampling (particle filtering) [Arulampalam02, Doucet01] [figure: state-space model X1 -> X2 -> X3 with observations Y1, Y2, Y3] • Apply importance sampling to a (nonlinear, non-Gaussian) dynamical system. • Resample particles with probability proportional to their weights w_t • Unlikely hypotheses get replaced (see the sketch below) SP2-43
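A bootstrap particle filter sketch; the particular nonlinear dynamics, noise levels, and function name are illustrative assumptions (a simplified version of a standard benchmark model), not part of the tutorial.

    import numpy as np

    rng = np.random.default_rng(0)

    def bootstrap_pf(ys, n_particles=1000, q_std=1.0, r_std=1.0):
        """Sequential importance sampling with resampling (illustrative model):
        x_t = 0.5*x_{t-1} + 25*x_{t-1}/(1+x_{t-1}**2) + q_t,  y_t = x_t**2/20 + r_t."""
        x = rng.normal(0.0, 1.0, n_particles)          # initial particle cloud
        means = []
        for y in ys:
            # propagate: sample from the transition prior (the "proposal")
            x = 0.5 * x + 25.0 * x / (1.0 + x ** 2) + rng.normal(0.0, q_std, n_particles)
            # weight by the likelihood p(y_t | x_t)
            w = np.exp(-0.5 * ((y - x ** 2 / 20.0) / r_std) ** 2)
            w /= w.sum()
            means.append(np.sum(w * x))                # posterior-mean estimate
            # resample: particles survive in proportion to their weights,
            # so unlikely hypotheses get replaced
            idx = rng.choice(n_particles, size=n_particles, p=w)
            x = x[idx]
        return np.array(means)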
Markov Chain Monte Carlo (MCMC) [Neal93, Mackay98] • Draw dependent samples x_t from a chain with transition kernel T(x' | x), s.t. • P(x) is the stationary distribution • The chain is ergodic (every state can eventually be reached from every other state) • If T satisfies detailed balance, P(x) T(x' | x) = P(x') T(x | x'), then P is a stationary distribution of T SP2-44
Metropolis Hastings • Propose x' ~ Q(x' | x_{t-1}) • Accept the new state (x_t := x') with probability a = min( 1, [P*(x') Q(x_{t-1} | x')] / [P*(x_{t-1}) Q(x' | x_{t-1})] ); otherwise x_t := x_{t-1} • Satisfies detailed balance (see the sketch below) SP2-45
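A minimal Metropolis-Hastings sketch with a symmetric Gaussian random-walk proposal (so the Q ratio cancels); the target density, step size, and function name are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    def p_star(x):
        """Unnormalized target P*(x): a mixture of two Gaussians (illustrative)."""
        return np.exp(-0.5 * (x - 2.0) ** 2) + np.exp(-0.5 * (x + 2.0) ** 2)

    def metropolis_hastings(n_steps=50_000, step=1.0):
        x = 0.0
        samples = []
        for _ in range(n_steps):
            x_prop = x + rng.normal(0.0, step)          # symmetric proposal Q(x' | x)
            # acceptance probability; the Q terms cancel because Q is symmetric
            a = min(1.0, p_star(x_prop) / p_star(x))
            if rng.random() < a:
                x = x_prop                              # accept; otherwise keep x
            samples.append(x)
        return np.array(samples)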
Gibbs sampling • A Metropolis method where the proposal Q is defined in terms of the full conditionals P(Xi | X_{-i}) • Acceptance rate = 1 • For a graphical model, we only need to condition on the Markov blanket of Xi (see the BUGS software, and the sketch below) SP2-46
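A Gibbs-sampling sketch for the same illustrative binary (+/-1) Ising grid used above; each variable is resampled from its full conditional, which depends only on its Markov blanket (the four grid neighbours). The model, parameters, and function name are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    def gibbs_ising(unary, coupling=1.0, n_sweeps=100):
        """Gibbs sampling for a binary (+/-1) Ising grid MRF (illustrative model).
        Resamples each x_ij from P(x_ij | Markov blanket)."""
        H, W = unary.shape
        x = rng.choice([-1, 1], size=(H, W))
        for _ in range(n_sweeps):
            for i in range(H):
                for j in range(W):
                    nb = 0.0                              # sum of neighbouring spins
                    if i > 0:     nb += x[i - 1, j]
                    if i < H - 1: nb += x[i + 1, j]
                    if j > 0:     nb += x[i, j - 1]
                    if j < W - 1: nb += x[i, j + 1]
                    # P(x_ij = +1 | rest) = sigmoid(2 * (unary_ij + coupling * nb))
                    p_plus = 1.0 / (1.0 + np.exp(-2.0 * (unary[i, j] + coupling * nb)))
                    x[i, j] = 1 if rng.random() < p_plus else -1
        return x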
Difficulties with MCMC • May take a long time to "mix" (converge to the stationary distribution). • Hard to know when the chain has mixed. • Simple proposals exhibit random-walk behavior; better proposals include: • Hybrid Monte Carlo (uses gradient information) • Swendsen-Wang (large moves for the Ising model) • Heuristic proposals SP2-47
Outline • Introduction • Exact inference • Approximate inference • Deterministic • Stochastic (sampling) • Hybrid deterministic/ stochastic SP2-48
Comparison of deterministic and stochastic methods • Deterministic • fast but inaccurate • Stochastic • slow but accurate • Can handle arbitrary hypothesis space • Combine best of both worlds (hybrid) • Use smart deterministic proposals • Integrate out some of the states, sample the rest (Rao-Blackwellization) • Non-parametric BP (particle filtering for graphs) SP2-49
Examples of deterministic proposals • State estimation: unscented particle filter [Merwe00] • Machine learning: variational MCMC [deFreitas01] • Computer vision: data-driven MCMC [Tu02] SP2-50