
Divergence measures and message passing



  1. Divergence measures and message passing Tom Minka Microsoft Research Cambridge, UK with thanks to the Machine Learning and Perception Group

  2. Message-Passing Algorithms: mean-field (MF), loopy belief propagation (BP), expectation propagation (EP), tree-reweighted message passing (TRW), fractional belief propagation (FBP), power EP (PEP)

  3. Outline • Example of message passing • Interpreting message passing • Divergence measures • Message passing from a divergence measure • Big picture

  4. Outline • Example of message passing • Interpreting message passing • Divergence measures • Message passing from a divergence measure • Big picture

  5. Estimation Problem (figure: a factor graph over variables x, y, z with factors a, b, c, d, e, f)

  6. Estimation Problem (figure: the same factor graph; each variable is binary, taking values 0 or 1, with x, y, z unknown)

  7. Estimation Problem (figure: the model reduced to the three variables x, y, z)

  8. Estimation Problem Queries: marginals, the normalizing constant, and the argmax. Want to do these quickly.

  9. Belief Propagation (figure: messages passed along the graph over x, y, z)

  10. Belief Propagation (figure: final messages and beliefs on the graph over x, y, z)

  11. Belief Propagation Results • Marginals: exact vs. BP (table of per-variable values not transcribed) • Normalizing constant: 0.45 (exact), 0.44 (BP) • Argmax: (0,0,0) (exact), (0,0,0) (BP)
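
To make the three queries from slide 8 concrete, here is a minimal Python sketch (mine, not the talk's; the model and potentials are invented) that answers all of them by brute-force enumeration on a toy three-variable model:

```python
# Illustrative only: exact inference by enumeration on a toy pairwise
# model over three binary variables x, y, z. The potentials are made up.
import itertools
import numpy as np

def phi(a, b, w=0.7):
    # pairwise potential exp(w * a * b), with a, b in {-1, +1}
    return np.exp(w * a * b)

def joint(x, y, z):
    # unnormalized p(x, y, z): pairwise factors plus a singleton bias on x
    return np.exp(0.3 * x) * phi(x, y) * phi(y, z) * phi(x, z)

states = list(itertools.product([-1, 1], repeat=3))
Z = sum(joint(*s) for s in states)                       # normalizing constant
marg_x = {v: sum(joint(*s) for s in states if s[0] == v) / Z for v in (-1, 1)}
mode = max(states, key=lambda s: joint(*s))              # argmax state

print(Z, marg_x, mode)
```

Enumeration is exponential in the number of variables, which is exactly why the talk wants message passing to answer these queries quickly.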

  12. Outline • Example of message passing • Interpreting message passing • Divergence measures • Message passing from a divergence measure • Big picture

  13. Message Passing = Distributed Optimization • Messages represent a simpler distribution q(x) that approximates p(x) • A distributed representation • Message passing = optimizing q to fit p • q stands in for p when answering queries • Parameters: • What type of distribution to construct (approximating family) • What cost to minimize (divergence measure)

  14. How to make a message-passing algorithm • Pick an approximating family • fully-factorized, Gaussian, etc. • Pick a divergence measure • Construct an optimizer for that measure • usually fixed-point iteration • Distribute the optimization across factors

  15. Outline • Example of message passing • Interpreting message passing • Divergence measures • Message passing from a divergence measure • Big picture

  16. Let p, q be unnormalized distributions • Kullback-Leibler (KL) divergence: KL(p||q) = ∫ p(x) log(p(x)/q(x)) dx + ∫ (q(x) - p(x)) dx • Alpha-divergence (α is any real number): D_α(p||q) = ∫ [α p(x) + (1-α) q(x) - p(x)^α q(x)^(1-α)] dx / (α(1-α)) • Both are asymmetric and convex
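
As a reference implementation, here is a small Python sketch (my own, for discrete and possibly unnormalized distributions given as arrays) of the alpha-divergence just defined, with the two KL limits handled explicitly:

```python
# Alpha-divergence for discrete, possibly unnormalized distributions:
# D_alpha(p||q) = sum(alpha*p + (1-alpha)*q - p**alpha * q**(1-alpha)) / (alpha*(1-alpha))
import numpy as np

def alpha_divergence(p, q, alpha, eps=1e-12):
    p = np.asarray(p, dtype=float) + eps   # eps guards log(0) and 0**alpha
    q = np.asarray(q, dtype=float) + eps
    if np.isclose(alpha, 0.0):             # limiting case: KL(q || p)
        return float(np.sum(q * np.log(q / p) + p - q))
    if np.isclose(alpha, 1.0):             # limiting case: KL(p || q)
        return float(np.sum(p * np.log(p / q) + q - p))
    integrand = alpha * p + (1 - alpha) * q - p**alpha * q**(1 - alpha)
    return float(np.sum(integrand)) / (alpha * (1 - alpha))
```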

  17. Examples of alpha-divergence • α → 0: KL(q||p) • α → 1: KL(p||q) • α = 0.5: 2∫ (√p(x) - √q(x))² dx (Hellinger distance) • α = 2: (1/2)∫ (p(x) - q(x))²/q(x) dx (χ² distance) • α = -1: (1/2)∫ (p(x) - q(x))²/p(x) dx
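
A quick numerical check of these special cases, reusing alpha_divergence from the sketch above (the distributions are arbitrary):

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.5, 0.3])

# alpha near 0 and near 1 approach the two KL divergences:
print(alpha_divergence(p, q, 1e-4), alpha_divergence(p, q, 0.0))      # both ~ KL(q||p)
print(alpha_divergence(p, q, 1 - 1e-4), alpha_divergence(p, q, 1.0))  # both ~ KL(p||q)
# alpha = 0.5 gives the Hellinger form:
print(alpha_divergence(p, q, 0.5), 2 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))
```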

  18. Minimum alpha-divergence q is Gaussian, minimizes D_α(p||q) with α = -∞ (figure)

  19. Minimum alpha-divergence q is Gaussian, minimizes D_α(p||q) with α = 0 (figure)

  20. Minimum alpha-divergence q is Gaussian, minimizes D_α(p||q) with α = 0.5 (figure)

  21. Minimum alpha-divergence q is Gaussian, minimizes D_α(p||q) with α = 1 (figure)

  22. Minimum alpha-divergence q is Gaussian, minimizes D_α(p||q) with α = ∞ (figure)
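
The pictures on slides 18–22 can be reproduced with a crude grid search (an illustrative sketch reusing alpha_divergence from above; the bimodal target and the grid ranges are my choices):

```python
# Fit a Gaussian q to a bimodal p by minimizing D_alpha(p||q) over a grid.
import numpy as np

x = np.linspace(-10, 10, 2001)
dx = x[1] - x[0]

def gauss(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

p = 0.7 * gauss(x, -2.0, 1.0) + 0.3 * gauss(x, 3.0, 0.8)   # bimodal target

def best_gaussian(alpha):
    grid = [(m, s) for m in np.linspace(-4, 4, 81)
                   for s in np.linspace(0.3, 5.0, 48)]
    # multiply by dx so the density arrays act as discrete masses
    return min(grid, key=lambda ms: alpha_divergence(p * dx, gauss(x, *ms) * dx, alpha))

for a in (0.0, 0.5, 1.0):
    print(a, best_gaussian(a))   # the fit moves from one mode toward covering both
```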

  23. Properties of alpha-divergence • α ≤ 0 seeks the mode with the largest mass (not the tallest) • zero-forcing: p(x) = 0 forces q(x) = 0 • underestimates the support of p • α ≥ 1 stretches to cover everything • inclusive: p(x) > 0 forces q(x) > 0 • overestimates the support of p [Frey, Patrascu, Jaakkola, Moran 00]

  24. Structure of alpha space (figure: the α axis) • α ≤ 0: zero-forcing, with MF at α = 0 • α ≥ 1: inclusive (zero-avoiding), with BP and EP at α = 1 and TRW at α > 1 • FBP and PEP cover arbitrary α

  25. Other properties • If q is an exact minimum of alpha-divergence: • Normalizing constant: ∫ q(x) dx = ∫ p(x)^α q(x)^(1-α) dx • If α = 1: a Gaussian q matches the mean and variance of p • A fully factorized q matches the marginals of p
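
The α = 1 moment-matching property is easy to check numerically. A self-contained sketch (my construction; the bimodal target is arbitrary):

```python
# The Gaussian q minimizing KL(p||q) matches the mean and variance of p.
import numpy as np
from scipy.optimize import minimize

x = np.linspace(-10, 10, 4001)
dx = x[1] - x[0]
p = 0.7 * np.exp(-0.5 * (x + 2.0) ** 2) + 0.3 * np.exp(-0.5 * ((x - 3.0) / 0.8) ** 2)
p /= p.sum() * dx                        # normalize the target numerically

def cross_entropy(params):               # equals KL(p||q) up to p's constant entropy
    m, log_s = params
    s = np.exp(log_s)
    log_q = -0.5 * ((x - m) / s) ** 2 - np.log(s * np.sqrt(2 * np.pi))
    return -np.sum(p * log_q) * dx

m_opt, log_s_opt = minimize(cross_entropy, [0.0, 0.0]).x
mean_p = np.sum(x * p) * dx
std_p = np.sqrt(np.sum((x - mean_p) ** 2 * p) * dx)
print((m_opt, np.exp(log_s_opt)), (mean_p, std_p))   # the two pairs should agree
```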

  26. Two-node example (figure: two connected variables x and y) • q is fully factorized, minimizes α-divergence to p • q has correct marginals only for α = 1 (BP)
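
That claim can be verified directly on a tiny joint (an illustrative sketch reusing alpha_divergence from above; the 2x2 table is made up to be strongly correlated):

```python
# Fully factorized q(x)q(y) minimizing D_alpha to a 2x2 joint p(x, y).
import itertools
import numpy as np

p_joint = np.array([[0.50, 0.05],
                    [0.05, 0.40]])       # correlated binary pair, marginals [0.55, 0.45]

def best_factorized(alpha, n=199):
    grid = np.linspace(0.005, 0.995, n)  # candidate values for q(x=0) and q(y=0)
    def div(ab):
        qx, qy = ab
        q = np.outer([qx, 1 - qx], [qy, 1 - qy])
        return alpha_divergence(p_joint.ravel(), q.ravel(), alpha)
    return min(itertools.product(grid, grid), key=div)

print(p_joint.sum(axis=1))               # true marginal of x: [0.55, 0.45]
for a in (0.0, 0.5, 1.0):
    print(a, best_factorized(a))         # reproduces the true marginals only at alpha = 1
```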

  27. Two-node example Bimodal distribution (figure comparing the fits for α = 1 (BP), α = 0 (MF), and α = 0.5)

  28. Two-node example Bimodal distribution with α = 1 (figure)

  29. Lessons • Neither method is inherently superior; it depends on what you care about • A factorized approximation does not imply matching marginals (that holds only for α = 1) • Adding y to the problem can change the estimated marginal for x (even though the true marginal is unchanged)

  30. Outline • Example of message passing • Interpreting message passing • Divergence measures • Message passing from a divergence measure • Big picture

  31. Distributed divergence minimization

  32. Distributed divergence minimization • Write p as a product of factors: p(x) = ∏_a f_a(x) • Approximate the factors one by one: f_a(x) ≈ f̃_a(x) • Multiply to get the approximation: q(x) = ∏_a f̃_a(x)

  33. Global divergence to local divergence • Global divergence: minimize over q the divergence D(p || q) = D(∏_a f_a(x) || ∏_a f̃_a(x)) • Local divergence: minimize over f̃_a the divergence D(f_a(x) ∏_{b≠a} f̃_b(x) || f̃_a(x) ∏_{b≠a} f̃_b(x))

  34. Message passing • Messages are passed between factors • Messages are the factor approximations f̃_a • Factor a receives the product of the other factors' messages, ∏_{b≠a} f̃_b(x) • Minimize the local divergence to get a new f̃_a(x) • Send f̃_a to the other factors • Repeat until convergence • Produces all six algorithms
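
As a concrete instance of this recipe (a minimal sketch of my own, not code from the talk): for α = 1, "minimize the local divergence" means matching marginals, and the loop below reduces to BP on a two-factor chain x - f_a - y - f_b - z with invented potentials.

```python
# Slide 34's loop at alpha = 1: each factor's approximation is a product of
# per-variable messages, refit by matching the marginals of
# (factor * received context). On this tree it converges to exact marginals.
import numpy as np

f_a = np.array([[4.0, 1.0], [1.0, 4.0]])   # f_a(x, y), binary variables
f_b = np.array([[2.0, 1.0], [1.0, 3.0]])   # f_b(y, z)

m_ax = np.ones(2); m_ay = np.ones(2)        # messages from factor a
m_by = np.ones(2); m_bz = np.ones(2)        # messages from factor b

for _ in range(10):                         # "repeat until convergence"
    tilted = f_a * m_by[None, :]            # factor a times its received context on y
    m_ax = tilted.sum(axis=1)               # new message to x (x has no other factors)
    m_ay = tilted.sum(axis=0) / m_by        # marginal-match, divide the context back out
    tilted = f_b * m_ay[:, None]            # factor b times its received context on y
    m_by = tilted.sum(axis=1) / m_ay
    m_bz = tilted.sum(axis=0)

q_x, q_y = m_ax, m_ay * m_by                # beliefs: products of incoming messages
print(q_x / q_x.sum(), q_y / q_y.sum())     # exact marginals on this tree
```

On a loopy graph the identical loop gives loopy BP; swapping the marginal-matching step for a different local divergence gives the other algorithms in the family.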

  35. Global divergence vs. local divergence • In general, local ≠ global, but the results are similar • BP doesn't minimize the global KL, but comes close • At α = 0 (MF), local = global: no loss from message passing

  36. Experiment • Which message-passing algorithm is best at minimizing the global D_α(p||q)? • Procedure: • Run FBP with various α_L • Compute the global divergence for various α_G • Find the best α_L (the best algorithm) for each α_G

  37. Results • Average over 20 graphs, random singleton and pairwise potentials exp(w_ij x_i x_j) • Mixed potentials (w ~ U(-1,1)): • best α_L = α_G (the local divergence should match the global one) • FBP with the same α is best at minimizing D_α • BP is best at minimizing KL(p||q)

  38. Outline • Example of message passing • Interpreting message passing • Divergence measures • Message passing from a divergence measure • Big picture

  39. Hierarchy of algorithms (each listed as: approximating family, divergence) • Power EP: exp family, D_α(p||q) • EP: exp family, KL(p||q) • Structured MF: exp family, KL(q||p) • FBP: fully factorized, D_α(p||q) • BP: fully factorized, KL(p||q) • MF: fully factorized, KL(q||p) • TRW: fully factorized, D_α(p||q) with α > 1

  40. Matrix of algorithms (rows: divergence measure; columns: approximating family) • KL(q||p): MF (fully factorized), Structured MF (exp family) • KL(p||q): BP (fully factorized), EP (exp family) • D_α(p||q): FBP (fully factorized), Power EP (exp family) • D_α(p||q), α > 1: TRW (fully factorized) • Open cells: other families? (mixtures) other divergences?

  41. Other Message Passing Algorithms Do they correspond to divergence measures? • Generalized belief propagation [Yedidia,Freeman,Weiss 00] • Iterated conditional modes [Besag 86] • Max-product belief revision • TRW-max-product [Wainwright,Jaakkola,Willsky 02] • Laplace propagation [Smola,Vishwanathan,Eskin 03] • Penniless propagation [Cano,Moral,Salmerón 00] • Bound propagation [Leisink,Kappen 03]

  42. Future work • Understand existing message passing algorithms • Understand local vs. global divergence • New message passing algorithms: • Specialized divergence measures • Richer approximating families • Other ways to minimize divergence
