Exact and approximate inference in probabilistic graphical models. Kevin Murphy (MIT CSAIL, UBC CS/Stats). www.ai.mit.edu/~murphyk/AAAI04. AAAI 2004 tutorial. SP2-1
Recommended reading • Cowell, Dawid, Lauritzen, Spiegelhalter, "Probabilistic Networks and Expert Systems", 1999 • Jensen, "Bayesian Networks and Decision Graphs", 2001 • Jordan (due 2005), "Probabilistic Graphical Models" • Koller & Friedman (due 2005), "Bayes Nets and Beyond" • "Learning in Graphical Models", edited by M. Jordan SP2-2
Outline • Introduction • Exact inference • Approximate inference • Deterministic • Stochastic (sampling) • Hybrid deterministic/ stochastic SP2-3
2 reasons for approximate inference • Low treewidth BUT non-linear/non-Gaussian: chains (e.g., a non-linear dynamical system) and trees with no loops (e.g., (Bayesian) parameter estimation) • High treewidth: loopy graphs, e.g., an N = n x n grid [figure: chain X1-X2-X3 with observations Y1, Y2, Y3, and a grid-structured MRF] SP2-4
Complexity of approximate inference • Approximating P(Xq|Xe) to within a constant factor for all discrete BNs is NP-hard [Dagum93]. In practice, many models exhibit "weak coupling", so we may safely ignore certain dependencies. • Computing P(Xq|Xe) for all polytrees with discrete and Gaussian nodes is NP-hard [Lerner01]. In practice, some of the modes of the posterior will have negligible mass. SP2-5
2 objective functions • Approximate the true posterior P(h|v) by Q(h) • Variational: globally optimize all terms wrt a simpler Q by minimizing D(Q||P); this forces P=0 => Q=0, so Q tends to lock onto one mode of P • Expectation propagation (EP): sequentially optimize each term, approximately minimizing D(P||Q); this forces Q=0 => P=0, so Q tends to cover all the mass of P SP2-6
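For reference, the two divergences (written for discrete h, in the notation of the slides) are

    D(Q||P) = sum_h Q(h) log [ Q(h) / P(h|v) ]
    D(P||Q) = sum_h P(h|v) log [ P(h|v) / Q(h) ]

D(Q||P) blows up if Q puts mass where P has none, which is why its minimizer is "zero-forcing" and mode-seeking; D(P||Q) blows up if Q misses mass that P has, which is why its minimizer is "zero-avoiding" and tends to over-cover P.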
Outline • Introduction • Exact inference • Approximate inference • Deterministic • Variational • Loopy belief propagation • Expectation propagation • Graph cuts • Stochastic (sampling) • Hybrid deterministic/ stochastic SP2-7
Free energy • Variational goal: minimize D(Q||P) wrt Q, where Q has a simpler form than P • P(h,v) is simpler (easier to evaluate) than P(h|v), so use the free energy F(Q,P) = sum_h Q(h) log [ Q(h) / P(h,v) ] • The free energy is an upper bound on the negative log-likelihood: F(Q,P) = D(Q || P(h|v)) - log P(v) >= -log P(v), with equality iff Q(h) = P(h|v) SP2-8
Point estimation • Use a degenerate posterior Q(h) = delta(h - h*) (and similarly for the parameters) • Minimize F by coordinate descent on the components of h* • Iterative Conditional Modes (ICM): for each iteration, for each hi, set hi := argmax_{hi} P(hi | h_{-i}, v), which only involves the factors in the Markov blanket of hi • Example: K-means clustering • Ignores uncertainty in P(h|v) and P(theta|v) • Tends to get stuck in local minima (see the sketch below) SP2-9
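A minimal sketch of ICM on a concrete model, a binary (+/-1) Ising-style grid MRF; the energy function, the coupling strength, and the function name are illustrative assumptions, not part of the tutorial.

    import numpy as np

    def icm_ising(unary, coupling=1.0, n_iters=10):
        """Iterative Conditional Modes for a binary (+/-1) grid MRF (illustrative model).
        unary: (H, W) array of local evidence favouring spin +1 at each pixel."""
        H, W = unary.shape
        x = np.where(unary > 0, 1, -1)          # initialize at the local evidence
        for _ in range(n_iters):
            for i in range(H):
                for j in range(W):
                    # sum of neighbouring spins (the Markov blanket of pixel (i, j))
                    nb = 0.0
                    if i > 0:     nb += x[i - 1, j]
                    if i < H - 1: nb += x[i + 1, j]
                    if j > 0:     nb += x[i, j - 1]
                    if j < W - 1: nb += x[i, j + 1]
                    # choose the spin minimizing the local energy -unary*s - coupling*nb*s
                    x[i, j] = 1 if unary[i, j] + coupling * nb > 0 else -1
        return x

    # usage: denoise a noisy binary image encoded as +/-1 evidence
    noisy = np.sign(np.random.randn(20, 20) + 1.0)
    clean = icm_ising(noisy, coupling=0.5)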
Expectation Maximization (EM) • Point estimates for the parameters theta (ML or MAP), full posterior for the hidden vars. • E-step: minimize F(Q,P) wrt Q(h), i.e., exact inference of P(h|v,theta) • M-step: minimize F(Q,P) wrt theta, i.e., maximize the expected complete-data log-likelihood (plus the parameter prior, for MAP) SP2-10
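Written out as coordinate descent on F(Q, theta) (standard EM, with theta denoting the parameters):

    E-step:  Q(h) := P(h | v, theta_old)                                   (exact inference)
    M-step:  theta_new := argmax_theta  E_Q[ log P(h, v | theta) ] + log P(theta)

The E-step makes the bound tight at theta_old; the M-step maximizes the expected complete-data log-likelihood (the log P(theta) term is only present for MAP estimation).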
EM: tricks of the trade • Generalized EM [Neal98] • Partial M-step: reduce F(Q,P) wrt theta [e.g., gradient method] • Partial E-step: reduce F(Q,P) wrt Q(h) [approximate inference] • Avoiding local optima • Deterministic annealing [Rose98] • Data resampling [Elidan02] • Speedup tricks • Combine with conjugate gradient [Salakhutdinov03] • Online/incremental updates [Bauer97, Neal98] SP2-11
Variational Bayes (VB) [Ghahramani00, Beal02] • Use a factorized posterior over hidden variables and parameters, Q(h, theta) = Q(h) Q(theta) • For exponential family models with conjugate priors, this results in a generalized version of EM • E-step: modified inference that takes into account the uncertainty in the parameters • M-step: optimize Q(theta) using expected sufficient statistics • Variational Message Passing [Winn04] automates this, assuming a fully factorized (mean field) Q; see variational-Bayes.org SP2-12
Variational inference for discrete-state models with high treewidth • We assume the parameters are fixed. • We assume Q(h) has a simple form, so we can easily compute the marginals (and hence the expectations) needed to minimize F. • Mean field: fully factorized, Q(h) = prod_i Q_i(h_i) • Structured variational: Q keeps some tractable structure, e.g., a product of chains approximating a grid MRF [Xing04] SP2-13
Variational inference for MRFs • Probability is exp(-energy): P(x) = (1/Z) exp(-E(x)) • Free energy = average energy - entropy: F(Q) = E_Q[E(x)] - H(Q) = -log Z + D(Q||P) >= -log Z SP2-14
Mean field for MRFs • Fully factorized approximation: Q(x) = prod_i Q_i(x_i) • Normalization constraint: sum_{x_i} Q_i(x_i) = 1 for each i • Average energy: E_Q[E(x)] = sum_i sum_{x_i} Q_i(x_i) E_i(x_i) + sum_{<ij>} sum_{x_i, x_j} Q_i(x_i) Q_j(x_j) E_ij(x_i, x_j) • Entropy: H(Q) = -sum_i sum_{x_i} Q_i(x_i) log Q_i(x_i) • Local minima satisfy the fixed-point equations Q_i(x_i) proportional to exp( -E_i(x_i) - sum_{j in N(i)} sum_{x_j} Q_j(x_j) E_ij(x_i, x_j) ), iterated until convergence (see the sketch below) SP2-15
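A minimal sketch of these mean-field updates for a concrete model, a binary (+/-1) Ising grid; the function name, the coupling strength, and the tanh form of the update (which is what the fixed-point equation above reduces to for this particular energy) are all specific to this illustrative choice.

    import numpy as np

    def mean_field_ising(unary, coupling=1.0, n_iters=50):
        """Fully factorized Q(x) = prod_i Q_i(x_i) for an illustrative Ising grid.
        Returns mu[i, j] = E_Q[x_ij] in [-1, 1]."""
        H, W = unary.shape
        mu = np.zeros((H, W))                     # mean-field parameters
        for _ in range(n_iters):
            for i in range(H):
                for j in range(W):
                    nb = 0.0                      # expected neighbouring spins
                    if i > 0:     nb += mu[i - 1, j]
                    if i < H - 1: nb += mu[i + 1, j]
                    if j > 0:     nb += mu[i, j - 1]
                    if j < W - 1: nb += mu[i, j + 1]
                    # fixed point: mu_ij = tanh(unary_ij + coupling * sum of neighbour means)
                    mu[i, j] = np.tanh(unary[i, j] + coupling * nb)
        return mu

    # usage: noisy +/-1 image as local evidence
    mu = mean_field_ising(0.8 * np.sign(np.random.randn(20, 20)), coupling=0.5)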
Outline • Introduction • Exact inference • Approximate inference • Deterministic • Variational • Loopy belief propagation • Expectation propagation • Graph cuts • Stochastic (sampling) • Hybrid deterministic/ stochastic SP2-16
BP vs mean field for MRFs • Mean field updates: Q_i(x_i) proportional to exp( -E_i(x_i) - sum_{j in N(i)} sum_{x_j} Q_j(x_j) E_ij(x_i, x_j) ), so node i sends the same quantity (its marginal Q_i) to all neighbors • BP updates: m_ij(x_j) proportional to sum_{x_i} exp( -E_i(x_i) - E_ij(x_i, x_j) ) prod_{k in N(i), k != j} m_ki(x_i), with beliefs b_i(x_i) proportional to exp(-E_i(x_i)) prod_{k in N(i)} m_ki(x_i) • Every node i sends a different message to each neighbor j (see the code sketch below) • Empirically, BP is much better than MF (e.g., MF is not exact even for trees) [Weiss01] • BP is (attempting to) minimize the Bethe free energy [Yedidia01] SP2-17
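For concreteness, a compact sum-product loopy BP sketch for a pairwise MRF; the dictionary-based representation, the function name, and the toy 3-node cycle in the usage lines are illustrative choices, not anything prescribed by the tutorial.

    import numpy as np

    def loopy_bp(node_pot, edge_pot, n_iters=30):
        """Sum-product loopy BP for a pairwise MRF (illustrative sketch).
        node_pot: dict  i -> array of positive reals, shape (K_i,)
        edge_pot: dict (i, j) -> array of shape (K_i, K_j), one per undirected edge
        Returns approximate marginals (beliefs) for each node."""
        nbrs = {i: [] for i in node_pot}
        for (i, j) in edge_pot:
            nbrs[i].append(j)
            nbrs[j].append(i)
        msg = {}                                   # messages on directed edges
        for (i, j) in edge_pot:
            msg[(i, j)] = np.ones(len(node_pot[j]))
            msg[(j, i)] = np.ones(len(node_pot[i]))
        for _ in range(n_iters):
            new_msg = {}
            for (i, j) in list(msg):
                pot = edge_pot[(i, j)] if (i, j) in edge_pot else edge_pot[(j, i)].T
                # product of node potential and all incoming messages except the one from j
                incoming = node_pot[i].astype(float)
                for k in nbrs[i]:
                    if k != j:
                        incoming = incoming * msg[(k, i)]
                m = pot.T @ incoming               # sum over x_i
                new_msg[(i, j)] = m / m.sum()      # normalize for numerical stability
            msg = new_msg
        beliefs = {}
        for i in node_pot:
            b = node_pot[i].astype(float)
            for k in nbrs[i]:
                b = b * msg[(k, i)]
            beliefs[i] = b / b.sum()
        return beliefs

    # usage: a 3-node cycle of binary variables with attractive pairwise potentials
    phi = np.array([[2.0, 1.0], [1.0, 2.0]])
    print(loopy_bp({0: np.array([1.0, 2.0]), 1: np.ones(2), 2: np.ones(2)},
                   {(0, 1): phi, (1, 2): phi, (0, 2): phi}))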
Bethe free energy • We assume the graph is a tree, in which case the following is exact: P(x) = prod_{<ij>} b_ij(x_i, x_j) / prod_i b_i(x_i)^(d_i - 1), where d_i = #neighbors of node i • Constraints • Normalization: sum_{x_i} b_i(x_i) = 1 and sum_{x_i, x_j} b_ij(x_i, x_j) = 1 • Marginalization: sum_{x_i} b_ij(x_i, x_j) = b_j(x_j) • Average energy: sum_{<ij>} sum_{x_i, x_j} b_ij(x_i, x_j) E_ij(x_i, x_j) + sum_i sum_{x_i} b_i(x_i) E_i(x_i) • Entropy: -sum_{<ij>} sum_{x_i, x_j} b_ij(x_i, x_j) log b_ij(x_i, x_j) + sum_i (d_i - 1) sum_{x_i} b_i(x_i) log b_i(x_i) SP2-18
BP minimizes Bethe free energy [Yedidia01] • Theorem [Yedidia, Freeman, Weiss]: fixed points of BP are stationary points of the Bethe free energy • BP may not converge; other algorithms can directly minimize F_Bethe, but are slower. • If BP does not converge, this often means F_Bethe is a poor approximation SP2-19
Kikuchi free energy [figure: the same 2x3 grid (nodes 1-6) covered by its edges in the Bethe case, and by larger overlapping clusters in the Kikuchi case] • Cluster groups of nodes together into regions r • Energy per region: E_r(x_r) = sum of the potentials contained in region r • Free energy per region: F_r = sum_{x_r} b_r(x_r) [ E_r(x_r) + log b_r(x_r) ] • Kikuchi free energy: F_Kikuchi = sum_r c_r F_r, where the c_r are counting numbers SP2-20
Counting numbers [figure: region graphs for the 2x3 grid (nodes 1-6). Bethe: top-level regions are the edges 12, 23, 14, 25, 36, 45, 56; the intersections are the single nodes 1..6, with counting numbers C = -1, -2, -1, -1, -2, -1 (each node gets c_i = 1 - d_i). Kikuchi: top-level regions 1245 and 2356; their intersection 25 gets C = 1 - (1 + 1) = -1.] F_Kikuchi is exact if the region graph contains 2 levels (regions and intersections) and has no cycles -- equivalent to a junction tree! SP2-21
Generalized BP [figure: 3x3 grid (nodes 1-9) with overlapping 2x2 regions (e.g., 2356, 4578, 5689), their intersections 25, 45, 56, 58, and the central node 5] • F_Kikuchi is no longer exact, but it is more accurate than F_Bethe • Generalized BP can be used to minimize F_Kikuchi • This method of choosing regions is called the "cluster variational method" [Welling04] • In the limit (of ever larger regions), we recover the junction tree algorithm. SP2-22
Outline • Introduction • Exact inference • Approximate inference • Deterministic • Variational • Loopy belief propagation • Expectation propagation • Graph cuts • Stochastic (sampling) • Hybrid deterministic/ stochastic SP2-23
Expectation Propagation (EP) Minka01 • EP = iterated assumed density filtering • ADF = recursive Bayesian estimation interleaved with projection step • Examples of ADF: • Extended Kalman filtering • Moment-matching (weak marginalization) • Boyen-Koller algorithm • Some online learning algorithms SP2-24
Assumed Density Filtering (ADF) x Recursive Bayesian estimation(sequential updating of posterior) Y1 Yn • If p(yi|x) not conjugate to p(x), then p(x|y1:i) may not be tractably representable • So project posterior back to representable family • And repeat update project Projection becomes moment matching SP2-25
Expectation Propagation • The exact posterior p(x) proportional to prod_i f_i(x) is intractable • ADF: approximate each term f_i once, in sequence -- simple and non-iterative, but inaccurate: it is sensitive to the order of updates and approximates each posterior myopically • EP: revisit and re-approximate each term iteratively until convergence -- simple, iterative, accurate (after Ghahramani) SP2-26
Expectation Propagation • Input: factors f_0(x), ..., f_N(x), with p(x) proportional to prod_i f_i(x) • Initialize: term approximations ~f_i(x) (e.g., to 1) and q(x) proportional to prod_i ~f_i(x), where ~f_i denotes the approximation to term f_i • Repeat • For i = 0..N • Deletion: q_{-i}(x) proportional to q(x) / ~f_i(x) • Projection: q_new = argmin over the approximating family of D( p_hat || q ), where p_hat(x) proportional to f_i(x) q_{-i}(x) • Inclusion: ~f_i(x) proportional to q_new(x) / q_{-i}(x) • Until convergence • Output: q(x) (after Ghahramani) SP2-27
BP is a special case of EP • BP assumes a fully factorized approximation q(x) = prod_k q_k(x_k) • At each iteration, for each factor f_i, for each node x_k in that factor, the KL projection matches moments, i.e., it computes the marginal of x_k by absorbing messages from the neighboring factors [figure: factor graph fragment with factors f_i, f_j and variables x_k, x_n1, x_n2] SP2-28
TreeEP [Minka03] • TreeEP assumes q(x) is represented by a tree (regardless of the "true" model topology). • We can use the junction tree (Jtree) algorithm to do the moment matching at each iteration. • Faster and more accurate than LBP. • Faster than GBP, with comparable accuracy. SP2-29
Outline • Introduction • Exact inference • Approximate inference • Deterministic • Variational • Loopy belief propagation • Expectation propagation • Graph cuts • Stochastic (sampling) • Hybrid deterministic/ stochastic SP2-30
MPE in MRFs • MAP estimation = energy minimization: argmax_x P(x) = argmin_x E(x), where P(x) = (1/Z) exp(-E(x)) • Simplifications: • Only pairwise potentials: E(x) = sum_i E_i(x_i) + sum_{<ij>} E_ij(x_i, x_j) (i.e., E_ijk = 0, etc.) • Special form for the potentials • Binary variables: x_i in {0,1} SP2-31
Kinds of potential • Metric: V(a,b) = 0 iff a = b; V(a,b) = V(b,a) >= 0; V(a,b) <= V(a,c) + V(c,b) (triangle inequality) • Semi-metric: satisfies the first two conditions, but not necessarily the triangle inequality • Piecewise constant, e.g., the Potts model V(a,b) = K [a != b] (a metric) • Piecewise smooth, e.g., truncated quadratic V(a,b) = min((a-b)^2, K) (a semi-metric) or truncated absolute difference V(a,b) = min(|a-b|, K) (a metric) • Discontinuity-preserving potentials avoid oversmoothing SP2-32
GraphCuts [Kolmogorov04] • Thm: for binary variables and pairwise potentials, we can find argmin_x E(x) in at most O(N^3) time using a maxflow/mincut algorithm on the graph below, iff the potentials are submodular, i.e., E_ij(0,0) + E_ij(1,1) <= E_ij(0,1) + E_ij(1,0) • Metric potentials (e.g., Potts) are always submodular. • Thm: the general case (e.g., non-binary or non-submodular potentials) is NP-hard. [figure: graph with source s, sink t and nodes x_i, x_j; writing A = E_ij(0,0), B = E_ij(0,1), C = E_ij(1,0), D = E_ij(1,1), the edge capacities shown are C-A (s to x_i), B+C-A-D (x_i to x_j), and C-D (x_j to t)] SP2-33
Finding a strong local minimum • For the non-binary case, we can find the optimum wrt some large space of moves by iteratively solving binary subproblems. • α-expansion: any pixel can change its current label to α • α-β swap: any pixel labeled α can switch to β, and vice versa (picture from Zabih) SP2-34
Finding a strong local minimum • Start with an arbitrary assignment f • done := false • While not done • done := true • For each label α • Find f' = argmin E over all labelings within one α-expansion of f -- a binary subproblem, solved exactly by graph cuts • If E(f') < E(f) then done := false; f := f' SP2-35
Properties of the 2 algorithms • α-expansion • Requires V to be submodular (e.g., a metric) • O(L) graph cuts per cycle, for L labels • Result is within a factor 2c(V) of the optimal energy; c = 1 for the Potts model • α-β swap • Requires V to be a semi-metric • O(L^2) graph cuts per cycle • No comparable theoretical guarantee, but works well in practice SP2-36
Summary of inference methods for pairwise MRFs • Marginals • Mean field • Loopy/ generalized BP (sum-product) • EP • Gibbs sampling • Swendsen-Wang • MPE/ Viterbi • Iterative conditional modes (ICM) • Loopy/generalized BP (max-product) • Graph cuts • Simulated annealing See Boykov01, Weiss01 and Tappen03 for some empirical comparisons SP2-37
Outline • Introduction • Exact inference • Approximate inference • Deterministic • Stochastic (sampling) • Hybrid deterministic/ stochastic SP2-38
Monte Carlo (sampling) methods • Goal: estimate expectations E_P[f(X)] = sum_x f(x) P(x); e.g., f(x) = I(x_q = j) gives the marginal P(X_q = j | x_e) • Draw N independent samples x^r ~ P and use E_P[f] ~= (1/N) sum_r f(x^r) • The accuracy is independent of the dimensionality of X • But it is hard to draw (independent) samples from P (see the sketch below) SP2-39
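A tiny illustration of the idea; the target distribution and test function below are arbitrary choices, used only because the true answer is known.

    import numpy as np

    # Plain Monte Carlo: estimate E_P[f(X)] by averaging f over samples from P.
    # Here P = N(0, 1) and f(x) = I(x > 1), so the true value is about 0.159.
    N = 100_000
    samples = np.random.randn(N)              # x^r ~ P
    estimate = np.mean(samples > 1.0)         # (1/N) * sum_r f(x^r)
    print(estimate)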
Importance Sampling • We sample from Q(x) and reweight: E_P[f] ~= sum_r w_r f(x^r) / sum_r w_r, with x^r ~ Q and weights w_r = P*(x^r) / Q*(x^r), where P* and Q* are the unnormalized versions of P and Q • Requires Q(x) > 0 wherever P(x) > 0 SP2-40
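A small self-normalized importance-sampling illustration; the target, proposal, and test function are arbitrary choices made so that the true answer (1.0) is known.

    import numpy as np

    rng = np.random.default_rng(0)

    # Estimate E_P[x^2] for P = N(0, 1) using a broader proposal Q = N(0, 2^2).
    N = 100_000
    x = rng.normal(0.0, 2.0, N)                                   # x^r ~ Q
    log_w = (-0.5 * x**2) - (-0.5 * (x / 2.0)**2 - np.log(2.0))   # log P*(x) - log Q*(x)
    w = np.exp(log_w - log_w.max())                               # stabilize before normalizing
    w /= w.sum()
    estimate = np.sum(w * x**2)                                   # approx 1.0
    print(estimate)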
Importance Sampling for BNs (likelihood weighting) • Input: CPDs P(Xi | Pa(Xi)), evidence xE • Output: P(Xq | xE) ~= sum_r w^r I(x_q^r = x_q) / sum_r w^r • For each sample r • w^r = 1 • For each node i in topological order • If Xi is observed, then x_i^r := x_i^E and w^r := w^r * P(Xi = x_i^E | Pa(Xi) = pa_i^r) • Else x_i^r ~ P(Xi | Pa(Xi) = pa_i^r) (see the sketch below) SP2-41
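A sketch of this procedure on the classic water-sprinkler network (Cloudy, Sprinkler, Rain, WetGrass; the network drawn on the next slide); the CPT numbers are the usual textbook values and, like the function name, are included only to make the example runnable.

    import numpy as np

    rng = np.random.default_rng(0)

    # CPTs for the water-sprinkler network (textbook numbers, for illustration only)
    P_C = 0.5
    P_S_given_C = {0: 0.5, 1: 0.1}                      # P(S=1 | C)
    P_R_given_C = {0: 0.2, 1: 0.8}                      # P(R=1 | C)
    P_W_given_SR = {(0, 0): 0.0, (0, 1): 0.9,
                    (1, 0): 0.9, (1, 1): 0.99}          # P(W=1 | S, R)

    def likelihood_weighting(n_samples=100_000, evidence_W=1):
        """Estimate P(R=1 | W=evidence_W): sample unobserved nodes in topological
        order, weight each sample by the likelihood of the evidence."""
        total_w, rain_w = 0.0, 0.0
        for _ in range(n_samples):
            c = int(rng.random() < P_C)                  # sample C ~ P(C)
            s = int(rng.random() < P_S_given_C[c])       # sample S ~ P(S | C)
            r = int(rng.random() < P_R_given_C[c])       # sample R ~ P(R | C)
            p_w1 = P_W_given_SR[(s, r)]
            w = p_w1 if evidence_W == 1 else 1.0 - p_w1  # weight by P(W = w_E | S, R)
            total_w += w
            rain_w += w * r
        return rain_w / total_w

    print(likelihood_weighting())   # roughly 0.7 for these CPTs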
Drawbacks of importance sampling [figure: the water-sprinkler network (Cloudy, Sprinkler, Rain, WetGrass) before and after evidence reversal] • We sample given the upstream evidence, but only weight by the downstream evidence. • Evidence reversal = modify the model so that all observed nodes become parents -- can be expensive. • Does not scale to high-dimensional spaces, even if Q is similar to P, since the variance of the weights is too high. SP2-42
Sequential importance sampling (particle filtering) [Arulampalam02, Doucet01] [figure: state-space model X1 -> X2 -> X3 with observations Y1, Y2, Y3] • Apply importance sampling to a (nonlinear, non-Gaussian) dynamical system. • Resample particles with probability proportional to their weights w_t • Unlikely hypotheses get replaced (see the sketch below) SP2-43
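A bootstrap particle filter sketch; the particular nonlinear dynamics, noise levels, and function name are illustrative assumptions (a simplified version of a standard benchmark model), not part of the tutorial.

    import numpy as np

    rng = np.random.default_rng(0)

    def bootstrap_pf(ys, n_particles=1000, q_std=1.0, r_std=1.0):
        """Sequential importance sampling with resampling (illustrative model):
        x_t = 0.5*x_{t-1} + 25*x_{t-1}/(1+x_{t-1}**2) + q_t,  y_t = x_t**2/20 + r_t."""
        x = rng.normal(0.0, 1.0, n_particles)          # initial particle cloud
        means = []
        for y in ys:
            # propagate: sample from the transition prior (the "proposal")
            x = 0.5 * x + 25.0 * x / (1.0 + x ** 2) + rng.normal(0.0, q_std, n_particles)
            # weight by the likelihood p(y_t | x_t)
            w = np.exp(-0.5 * ((y - x ** 2 / 20.0) / r_std) ** 2)
            w /= w.sum()
            means.append(np.sum(w * x))                # posterior-mean estimate
            # resample: particles survive in proportion to their weights,
            # so unlikely hypotheses get replaced
            idx = rng.choice(n_particles, size=n_particles, p=w)
            x = x[idx]
        return np.array(means)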
Markov Chain Monte Carlo (MCMC) [Neal93, Mackay98] • Draw dependent samples x_t from a chain with transition kernel T(x' | x), s.t. • P(x) is the stationary distribution • The chain is ergodic (every state can eventually be reached from every other state) • If T satisfies detailed balance, P(x) T(x' | x) = P(x') T(x | x'), then P is a stationary distribution of T SP2-44
Metropolis Hastings • Propose x' ~ Q(x' | x_{t-1}) • Accept the new state (x_t := x') with probability a = min( 1, [P*(x') Q(x_{t-1} | x')] / [P*(x_{t-1}) Q(x' | x_{t-1})] ); otherwise x_t := x_{t-1} • Satisfies detailed balance (see the sketch below) SP2-45
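A minimal Metropolis-Hastings sketch with a symmetric Gaussian random-walk proposal (so the Q ratio cancels); the target density, step size, and function name are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    def p_star(x):
        """Unnormalized target P*(x): a mixture of two Gaussians (illustrative)."""
        return np.exp(-0.5 * (x - 2.0) ** 2) + np.exp(-0.5 * (x + 2.0) ** 2)

    def metropolis_hastings(n_steps=50_000, step=1.0):
        x = 0.0
        samples = []
        for _ in range(n_steps):
            x_prop = x + rng.normal(0.0, step)          # symmetric proposal Q(x' | x)
            # acceptance probability; the Q terms cancel because Q is symmetric
            a = min(1.0, p_star(x_prop) / p_star(x))
            if rng.random() < a:
                x = x_prop                              # accept; otherwise keep x
            samples.append(x)
        return np.array(samples)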
Gibbs sampling • A Metropolis method where the proposal Q is defined in terms of the full conditionals P(Xi | X_{-i}) • Acceptance rate = 1 • For a graphical model, we only need to condition on the Markov blanket of Xi (see the BUGS software, and the sketch below) SP2-46
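A Gibbs-sampling sketch for the same illustrative binary (+/-1) Ising grid used above; each variable is resampled from its full conditional, which depends only on its Markov blanket (the four grid neighbours). The model, parameters, and function name are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    def gibbs_ising(unary, coupling=1.0, n_sweeps=100):
        """Gibbs sampling for a binary (+/-1) Ising grid MRF (illustrative model).
        Resamples each x_ij from P(x_ij | Markov blanket)."""
        H, W = unary.shape
        x = rng.choice([-1, 1], size=(H, W))
        for _ in range(n_sweeps):
            for i in range(H):
                for j in range(W):
                    nb = 0.0                              # sum of neighbouring spins
                    if i > 0:     nb += x[i - 1, j]
                    if i < H - 1: nb += x[i + 1, j]
                    if j > 0:     nb += x[i, j - 1]
                    if j < W - 1: nb += x[i, j + 1]
                    # P(x_ij = +1 | rest) = sigmoid(2 * (unary_ij + coupling * nb))
                    p_plus = 1.0 / (1.0 + np.exp(-2.0 * (unary[i, j] + coupling * nb)))
                    x[i, j] = 1 if rng.random() < p_plus else -1
        return x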
Difficulties with MCMC • May take a long time to "mix" (converge to the stationary distribution). • Hard to know when the chain has mixed. • Simple proposals exhibit random-walk behavior; better proposals include: • Hybrid Monte Carlo (uses gradient information) • Swendsen-Wang (large moves for the Ising model) • Heuristic proposals SP2-47
Outline • Introduction • Exact inference • Approximate inference • Deterministic • Stochastic (sampling) • Hybrid deterministic/ stochastic SP2-48
Comparison of deterministic and stochastic methods • Deterministic • fast but inaccurate • Stochastic • slow but accurate • Can handle arbitrary hypothesis space • Combine best of both worlds (hybrid) • Use smart deterministic proposals • Integrate out some of the states, sample the rest (Rao-Blackwellization) • Non-parametric BP (particle filtering for graphs) SP2-49
Examples of deterministic proposals • State estimation: unscented particle filter [Merwe00] • Machine learning: variational MCMC [deFreitas01] • Computer vision: data-driven MCMC [Tu02] SP2-50