
MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

Michael I. Jordan, University of California, Berkeley / INRIA. June 12, 2013. Co-authors: Tamara Broderick, Brian Kulis.


Presentation Transcript


  1. MAD-Bayes: MAP-based Asymptotic Derivations from Bayes • Michael I. Jordan, University of California, Berkeley / INRIA • June 12, 2013 • Co-authors: Tamara Broderick, Brian Kulis

  2–6. A Typical Conversation in Silicon Valley • Me: “I have an exciting new statistical methodology that provides a new capability you might be interested in” • Them: “I have a petabyte of data. Will your exciting new statistical methodology run on my data?” • Me: “Well…” • Them: “Also, I need a decent answer within an hour/minute/millisecond, no matter how much data I have; what kind of guarantees can you provide me?” • Me: “Sorry, I can’t provide any such guarantees…”

  7–11. Statistical Inference and Big Data • So I’ve been working on computation-meets-inference issues • Sad to say, I haven’t found Bayesian perspectives very helpful; all of my progress to date on the topic has been frequentist • Even sadder to say, Bayesian nonparametrics hasn’t been helpful • isn’t Bayesian nonparametrics supposed to aim precisely at revealing new phenomena as data accrue? • why should it fail to help when too much data accrue? • Why in general should statisticians not know what to do when they are given too much of their prized resource?

  12–13. What Do We Retreat From? • Probability models? • Hierarchical models? • Loss functions and point estimates? • Credible sets? • Full Bayesian posterior inference? • Asymptotic guarantees for inference algorithms? • Identifiability? • Coherence? • Real-world relevance? • Any retreat should of course be a strategic retreat

  14. Part I: Computation/Statistics Tradeoffs via Convex Optimization with Venkat Chandrasekaran Caltech

  15. Whither the Theory of Inference? Given an inferential goal and a fixed computational budget, provide a guarantee (supported by an algorithm and an analysis) that the quality of inference will increase monotonically as data accrue (without bound)

  16. Computation/Statistics Tradeoffs • More data generally means more computation in our current state of understanding • but statistically, more data generally means less risk (i.e., error) • and statistical inferences are often simplified as the amount of data grows • somehow these facts should have algorithmic consequences • i.e., somehow we should be able to get by with less computation as the amount of data grows • need a new notion of controlled algorithm weakening

  17–20. Time-Data Tradeoffs • Consider an inference problem with fixed risk • Inference procedures viewed as points in a plot of runtime vs. number of samples n • Vertical lines: classical estimation theory – well understood • Horizontal lines: complexity-theory lower bounds – poorly understood; depends on the computational model • Trading off the upper bounds: more data means a smaller runtime upper bound – need “weaker” algorithms for larger datasets

  21. A Denoising Problem • Signal from a known (bounded) set • Noise • Observation model • Observe n i.i.d. data points
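The slide’s formulas did not survive the transcript; a minimal reconstruction of the setup, with all symbols ($x^*$, $S$, $\sigma$, $z_i$, $y_i$) being my notation rather than the slide’s, is:

```latex
% Hedged reconstruction of the denoising setup: a fixed signal x* from a known
% bounded set S, observed n times in i.i.d. Gaussian noise of level sigma.
\[
  y_i \;=\; x^* + \sigma z_i, \qquad
  z_i \sim \mathcal{N}(0, I)\ \text{i.i.d.}, \qquad
  i = 1, \dots, n, \qquad
  x^* \in S .
\]
```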

  22. Convex Programming Estimator • An M-estimator • Convex relaxation • C is a convex set containing the signal set (a sketch of the estimator follows)
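Assuming the relaxation works as the next slide’s “outer convex approximations” suggests, the estimator is plausibly of the following form (a sketch in my notation, not a quote from the slides):

```latex
% Shrink the empirical mean onto a convex outer relaxation C of the signal set S.
\[
  \hat{x}_n \;=\; \arg\min_{x \in C}\; \bigl\|\bar{y}_n - x\bigr\|_2^2,
  \qquad \bar{y}_n = \frac{1}{n}\sum_{i=1}^{n} y_i,
  \qquad C \supseteq \mathrm{conv}(S)\ \text{convex}.
\]
```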

  23. Hierarchy of Convex Relaxations • If the signal set is “algebraic”, then one can obtain a family of outer convex approximations • polyhedral, semidefinite, hyperbolic relaxations (Sherali-Adams, Parrilo, Lasserre, Gårding, Renegar) • Sets ordered by computational complexity • Central role played by lift-and-project

  24. Statistical Performance of Estimator • Consider cone of feasible directions into C

  25. Statistical Performance of Estimator • Theorem: the risk of the estimator is controlled by the geometry of this cone (a reconstruction of the bound is sketched below) • Intuition: only consider error in the feasible cone • Can be refined for better bias-variance tradeoffs
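The bound itself was lost in the transcript; a hedged sketch of its likely shape, writing $T_C(x^*)$ for the cone of feasible directions into $C$ at $x^*$ and $g(\cdot)$ for a Gaussian-width-type complexity measure (both symbols are my assumptions):

```latex
% Risk scales like the squared Gaussian width of the feasible cone, divided by n.
\[
  \mathbb{E}\,\bigl\|\hat{x}_n - x^*\bigr\|_2^2
  \;\lesssim\;
  \frac{\sigma^2 \, g\bigl(T_C(x^*) \cap B_2\bigr)^2}{n}.
\]
```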

  26. Hierarchy of Convex Relaxations • Corollary: to obtain risk of at most 1, the number of samples n must scale with the complexity of C (see the sketch below) • Key point: if we have access to larger n, can use larger C
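Read through the bound sketched above, the corollary presumably amounts to a sample-size requirement of roughly this shape (again a hedged reconstruction, not a quote):

```latex
% To drive the risk down to 1 it suffices, up to constants, that
\[
  n \;\gtrsim\; \sigma^2 \, g\bigl(T_C(x^*) \cap B_2\bigr)^2 ,
\]
% so a looser relaxation C (a larger cone, hence a larger width) becomes
% affordable exactly when more samples are available.
```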

  27. Hierarchy of Convex Relaxations • If we have access to larger n, can use larger C ⇒ obtain “weaker” estimation algorithm

  28. Example 1 • Signal set consists of cut matrices • E.g., collaborative filtering, clustering

  29. Example 2 • Signal set consists of all perfect matchings in complete graph • E.g., network inference

  30. Example 3 • Signal set consists of all adjacency matrices of graphs with only a clique on square-root-many nodes • E.g., sparse PCA, gene expression patterns • Kolar et al. (2010)

  31. Example 4 • Banding estimators for covariance matrices • Bickel-Levina (2007), many others • assume a known variable ordering • Stylized problem: let M be a known tridiagonal matrix • Signal set

  32. Part II: MAP-based Asymptotic Derivations from Bayes with Tamara Broderick and Brian Kulis

  33–35. Clustering, K-means and Optimization • After 40 years, K-means is still the method of choice for solving clustering problems in many applied domains • particularly when data sets are large • this isn’t just due to ignorance • K-means can be orders of magnitude faster than Bayesian alternatives, even when doing hundreds of restarts • and speed matters • K-means is an optimization procedure • optimization is a key tool for scalable frequentist inference • even if we can’t embrace it fully, let’s not avoid it

  36–38. MAD-Bayes • Bayesian nonparametrics assists the optimization-based inference community • Bayes delivers the flexible, modular modeling framework that the optimization-based community has been lacking • Bayesian nonparametrics gives rise to new loss functions and regularizers that are naturally nonparametric • no recourse to MCMC, SMC, etc (which make eyes glaze over) • Inspiration: K-means can be derived as the limit of an EM algorithm for fitting a mixture model • We do something similar in spirit, taking limits of various Bayesian nonparametric models: • Dirichlet process mixtures • hierarchical Dirichlet process mixtures • beta processes and hierarchical beta processes
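To make “taking limits” concrete: the small-variance limit of MAP inference in a Dirichlet process Gaussian mixture yields a K-means-style objective with a penalty on the number of clusters, along the lines of Kulis and Jordan’s DP-means; the notation below ($\lambda$, cluster sets $\ell_k$, means $\mu_k$) and the exact parameterization of the penalty are my assumptions:

```latex
% DP-means-style objective: K is not fixed in advance; each extra cluster costs lambda^2.
\[
  \min_{K,\ \{\ell_k\},\ \{\mu_k\}} \;\;
  \sum_{k=1}^{K} \sum_{x_i \in \ell_k} \bigl\|x_i - \mu_k\bigr\|_2^2 \;+\; \lambda^2 K .
\]
```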

  39–41. MAD-Bayes • In short, we apply an inverse beautification operator to our lovely Bayesian nonparametrics framework to make it appeal to a bunch of heathens • who probably still won’t appreciate it • An alternative perspective: full Bayesian inference needs fast initialization

  42. K-means Clustering • Represent the data set in terms of K clusters, each of which is summarized by a prototype • Each data point is assigned to one of the K clusters • Represented by allocations such that, for all data indices i, each point belongs to exactly one cluster (the constraint is written out below) • Example: 4 data points and 3 clusters
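In symbols, the one-of-K allocation constraint alluded to here is presumably the standard one (the indicator notation $r_{ik}$ is my assumption):

```latex
% r_ik = 1 iff data point i is assigned to cluster k; each point lies in exactly one cluster.
\[
  r_{ik} \in \{0, 1\}, \qquad \sum_{k=1}^{K} r_{ik} = 1 \quad \text{for all data indices } i .
\]
```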

  43. K-means Clustering • Cost function: the sum of squared distances from each data point to its assigned prototype (written out below) • The K-means algorithm is coordinate descent on this cost function
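In the indicator notation introduced above, the cost function reads (a standard reconstruction, since the formula was dropped from the transcript):

```latex
% Sum of squared distances from each data point x_i to its assigned prototype mu_k.
\[
  J(r, \mu) \;=\; \sum_{i=1}^{N} \sum_{k=1}^{K} r_{ik}\, \bigl\|x_i - \mu_k\bigr\|_2^2 .
\]
```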

  44. Coordinate Descent • Step 1: fix values for the prototypes and minimize w.r.t. the allocations • assign each data point to the nearest prototype • Step 2: fix values for the allocations and minimize w.r.t. the prototypes • this gives each prototype as the mean of the data points assigned to it • Iterate these two steps • Convergence is guaranteed since there are only a finite number of possible settings for the allocations • It can only find local minima, so we should start the algorithm with many different initial settings
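As a concrete illustration of the two coordinate-descent steps described on this slide, here is a minimal NumPy sketch (function and variable names are mine, not from the talk); as the slide advises, in practice one would run it from many different initial settings and keep the best local minimum:

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal K-means coordinate descent. X has shape (N, D); K is the number of clusters."""
    rng = np.random.default_rng(seed)
    # Initialize prototypes at K distinct, randomly chosen data points.
    mu = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(n_iters):
        # Step 1: fix the prototypes, minimize over the allocations:
        # assign each data point to its nearest prototype.
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, K) squared distances
        z = d2.argmin(axis=1)                                     # cluster index for each point
        # Step 2: fix the allocations, minimize over the prototypes:
        # each prototype becomes the mean of its assigned points (kept as-is if empty).
        new_mu = np.array([X[z == k].mean(axis=0) if np.any(z == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):  # no prototype moved: a local minimum has been reached
            break
        mu = new_mu
    return z, mu

# Usage sketch: the slide's toy setting of 4 data points and 3 clusters.
if __name__ == "__main__":
    X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [9.0, 0.0]])
    assignments, prototypes = kmeans(X, K=3)
    print(assignments, prototypes)
```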
