MAD-Bayes: MAP-based Asymptotic Derivations from Bayes • Michael I. Jordan, University of California, Berkeley / INRIA • June 12, 2013 • Co-authors: Tamara Broderick, Brian Kulis
A Typical Conversation in Silicon Valley • Me: “I have an exciting new statistical methodology that provides a new capability you might be interested in” • Them: “I have a petabyte of data. Will your exciting new statistical methodology run on my data?” • Me: “Well…” • Them: “Also, I need a decent answer within an hour/minute/millisecond, no matter how much data I have; what kind of guarantees can you provide me?” • Me: “Sorry, I can’t provide any such guarantees…”
Statistical Inference and Big Data • So I’ve been working on computation-meets-inference issues • Sad to say, I haven’t found Bayesian perspectives very helpful; all of my progress to date on the topic has been frequentist • Even sadder to say, Bayesian nonparametrics hasn’t been helpful • isn’t Bayesian nonparametrics supposed to aim precisely at revealing new phenomena as data accrue? • why should it fail to help when too much data accrue? • Why in general should statisticians not know what to do when they are given too much of their prized resource?
What Do We Retreat From? • Probability models? • Hierarchical models? • Loss functions and point estimates? • Credible sets? • Full Bayesian posterior inference? • Asymptotic guarantees for inference algorithms? • Identifiability? • Coherence? • Real-world relevance? • Any retreat should of course be a strategic retreat
Part I: Computation/Statistics Tradeoffs via Convex Optimization • with Venkat Chandrasekaran (Caltech)
Whither the Theory of Inference? Given an inferential goal and a fixed computational budget, provide a guarantee (supported by an algorithm and an analysis) that the quality of inference will increase monotonically as data accrue (without bound)
Computation/Statistics Tradeoffs • More data generally means more computation in our current state of understanding • but statistically more data generally means less risk (i.e., error) • and statistical inferences are often simplified as the amount of data grows • somehow these facts should have algorithmic consequences • i.e., somehow we should be able to get by with less computation as the amount of data grows • need a new notion of controlled algorithm weakening
Time-Data Tradeoffs • Consider an inference problem with fixed risk; inference procedures are viewed as points in a plot of runtime vs. number of samples n • Vertical lines: classical estimation theory – well understood • Horizontal lines: complexity-theoretic lower bounds – poorly understood; depend on the computational model • Trading off upper bounds: more data means a smaller runtime upper bound • Need “weaker” algorithms for larger datasets
A Denoising Problem • Signal x* from a known (bounded) set S • Noise σz, with z standard Gaussian • Observation model: y = x* + σz • Observe n i.i.d. data points y_1, …, y_n
Convex Programming Estimator • An M-estimator: project the sample mean onto C, i.e., x̂_n = argmin_{x ∈ C} ||ȳ_n − x||², where ȳ_n = (1/n) Σ_i y_i • Convex relaxation: C is a convex set such that S ⊆ C
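For concreteness, here is a minimal numerical sketch of an estimator of this form. It assumes, purely for illustration, that the signal is sparse and that the relaxation C is an l1 ball of a hypothetical radius; the projection routine and all parameter values below are stand-ins, not the relaxation used in the actual work.

```python
import numpy as np

def project_l1_ball(v, radius):
    """Euclidean projection of v onto the l1 ball of the given radius
    (standard sort-based algorithm)."""
    if np.abs(v).sum() <= radius:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    ks = np.arange(1, v.size + 1)
    rho = ks[u - (css - radius) / ks > 0].max()
    theta = (css[rho - 1] - radius) / rho
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def denoise(y, radius):
    """Convex-programming estimator: project the sample mean of the
    observations onto the relaxation C (here an l1 ball)."""
    y_bar = y.mean(axis=0)            # sample mean of the n observations
    return project_l1_ball(y_bar, radius)

# Illustration: sparse signal, Gaussian noise, n i.i.d. observations.
rng = np.random.default_rng(0)
p, k, sigma, n = 200, 5, 1.0, 50
x_star = np.zeros(p)
x_star[:k] = 1.0
y = x_star + sigma * rng.standard_normal((n, p))
x_hat = denoise(y, radius=np.abs(x_star).sum())
print("squared error:", np.sum((x_hat - x_star) ** 2))
```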
Hierarchy of Convex Relaxations • If the signal set is “algebraic”, then one can obtain a family of outer convex approximations • polyhedral, semidefinite, hyperbolic relaxations (Sherali-Adams, Parrilo, Lasserre, Gårding, Renegar) • Sets ordered by computational complexity • Central role played by lift-and-project
Statistical Performance of Estimator • Consider the cone of feasible directions from x* into C
Statistical Performance of Estimator • Theorem: the risk of the estimator scales as (σ²/n) · w(T_C(x*))², where w is the Gaussian width of the cone of feasible directions T_C(x*) (intersected with the unit sphere) • Intuition: only the error along the feasible cone matters • Can be refined for better bias-variance tradeoffs
Hierarchy of Convex Relaxations • Corollary: to obtain risk of at most 1, it suffices to have n on the order of σ² w(T_C(x*))² samples • Key point: if we have access to larger n, we can use a larger C
Hierarchy of Convex Relaxations • If we have access to larger n, we can use a larger C • Obtain a “weaker” estimation algorithm
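The point can be seen in a small simulation. This is only a hedged illustration in which the "tight" signal set is the (nonconvex) set of k-sparse vectors and the "larger C" is an l1 ball, with all parameter values chosen arbitrarily: both risks shrink as n grows, so for any fixed target risk the weaker l1 relaxation reaches it too, just at a larger n.

```python
import numpy as np

def project_l1_ball(v, radius):
    """Euclidean projection onto the l1 ball (repeated from the sketch
    above so this block runs on its own)."""
    if np.abs(v).sum() <= radius:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    ks = np.arange(1, v.size + 1)
    rho = ks[u - (css - radius) / ks > 0].max()
    theta = (css[rho - 1] - radius) / rho
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def tight_estimate(y_bar, k):
    """Project the sample mean onto the nonconvex set of k-sparse vectors
    (a 'tight' signal set; cheap only in this toy case)."""
    x = np.zeros_like(y_bar)
    idx = np.argsort(np.abs(y_bar))[-k:]
    x[idx] = y_bar[idx]
    return x

def risk(estimator, x_star, sigma, n, rng, reps=200):
    """Monte Carlo estimate of E||x_hat - x*||^2; the sample mean of n
    observations is simulated directly as x* + (sigma/sqrt(n)) z."""
    errs = []
    for _ in range(reps):
        y_bar = x_star + sigma / np.sqrt(n) * rng.standard_normal(x_star.size)
        errs.append(np.sum((estimator(y_bar) - x_star) ** 2))
    return float(np.mean(errs))

rng = np.random.default_rng(1)
p, k, sigma = 500, 5, 1.0
x_star = np.zeros(p)
x_star[:k] = 1.0
for n in (10, 40, 160, 640):
    tight = risk(lambda yb: tight_estimate(yb, k), x_star, sigma, n, rng)
    loose = risk(lambda yb: project_l1_ball(yb, np.abs(x_star).sum()),
                 x_star, sigma, n, rng)
    print(f"n = {n:4d}   tight-set risk = {tight:6.3f}   l1-relaxation risk = {loose:6.3f}")
```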
Example 1 • Signal set consists of cut matrices • E.g., collaborative filtering, clustering
Example 2 • Signal set consists of all perfect matchings in the complete graph • E.g., network inference
Example 3 • Signal set consists of all adjacency matrices of graphs containing only a clique on square-root many nodes • E.g., sparse PCA, gene expression patterns • Kolar et al. (2010)
Example 4 • Banding estimators for covariance matrices • Bickel-Levina (2007), many others • assume known variable ordering • Stylized problem: let M be a known tridiagonal matrix • Signal set: all matrices obtained by permuting the rows and columns of M (the variable ordering is unknown)
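A minimal sketch of a banding estimator of the sort cited above, assuming the variable ordering is known; the bandwidth k and the simulated tridiagonal covariance are illustrative choices, not taken from the slides.

```python
import numpy as np

def band_covariance(X, k):
    """Banding estimator: compute the sample covariance and zero out all
    entries more than k positions away from the diagonal (assumes the
    variable ordering is known and meaningful)."""
    S = np.cov(X, rowvar=False)
    p = S.shape[0]
    i, j = np.indices((p, p))
    return np.where(np.abs(i - j) <= k, S, 0.0)

# Illustration: data whose true covariance is tridiagonal (bandwidth 1).
rng = np.random.default_rng(2)
p, n = 30, 200
true_cov = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))
X = rng.multivariate_normal(np.zeros(p), true_cov, size=n)
S_banded = band_covariance(X, k=1)
print("Frobenius error, sample cov :", np.linalg.norm(np.cov(X, rowvar=False) - true_cov))
print("Frobenius error, banded cov :", np.linalg.norm(S_banded - true_cov))
```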
Part II: MAP-based Asymptotic Derivations from Bayes • with Tamara Broderick and Brian Kulis
Clustering, K-means and Optimization • After 40 years, K-means is still the method of choice for solving clustering problems in many applied domains • particularly when data sets are large • this isn’t just due to ignorance • K-means can be orders of magnitude faster than Bayesian alternatives, even when doing hundreds of restarts • and speed matters • K-means is an optimization procedure • optimization is a key tool for scalable frequentist inference • even if we can’t embrace it fully, let’s not avoid it
MAD-Bayes • Bayesian nonparametrics assists the optimization-based inference community • Bayes delivers the flexible, modular modeling framework that the optimization-based community has been lacking • Bayesian nonparametrics gives rise to new loss functions and regularizers that are naturally nonparametric • no recourse to MCMC, SMC, etc. (which make eyes glaze over) • Inspiration: K-means can be derived as the limit of an EM algorithm for fitting a mixture model • We do something similar in spirit, taking limits of various Bayesian nonparametric models: • Dirichlet process mixtures • hierarchical Dirichlet process mixtures • beta processes and hierarchical beta processes
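As a concrete illustration of the Dirichlet-process case, here is a minimal sketch of the K-means-like procedure that emerges from the small-variance limit of a DP mixture (DP-means, in the spirit of Kulis and Jordan): a point opens a new cluster whenever its squared distance to every existing center exceeds a penalty λ. The initialization, stopping rule, and parameter values below are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def dp_means(X, lam, max_iters=100):
    """DP-means-style clustering: K-means plus a per-cluster penalty lam.
    In the assignment pass a point opens a new cluster when its squared
    distance to every existing center exceeds lam."""
    centers = [X.mean(axis=0)]                      # start with one cluster
    assignments = np.zeros(len(X), dtype=int)
    for _ in range(max_iters):
        # Assignment step (new clusters may be created on the fly).
        for i, x in enumerate(X):
            d2 = [np.sum((x - c) ** 2) for c in centers]
            j = int(np.argmin(d2))
            if d2[j] > lam:
                centers.append(x.copy())            # open a new cluster at x
                j = len(centers) - 1
            assignments[i] = j
        # Update step: each center becomes the mean of its assigned points
        # (centers that lost all their points are left where they are).
        new_centers = [X[assignments == j].mean(axis=0) if np.any(assignments == j)
                       else centers[j] for j in range(len(centers))]
        if all(np.allclose(a, b) for a, b in zip(centers, new_centers)):
            break
        centers = new_centers
    return np.array(centers), assignments

# Illustration on three well-separated blobs; lam controls how many clusters appear.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(loc, 0.3, size=(50, 2))
               for loc in ([0.0, 0.0], [5.0, 5.0], [0.0, 5.0])])
centers, z = dp_means(X, lam=4.0)
print("number of non-empty clusters found:", np.unique(z).size)
```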
MAD-Bayes • In short, we apply an inverse beautification operator to our lovely Bayesian nonparametrics framework to make it appeal to a bunch of heathens • who probably still won’t appreciate it • An alternative perspective: full Bayesian inference needs fast initialization
K-means Clustering • Represent the data set in terms of K clusters, each of which is summarized by a prototype μ_k • Each data point is assigned to one of the K clusters • Represented by allocations r_ik ∈ {0, 1} such that for all data indices i we have Σ_k r_ik = 1 • Example: 4 data points and 3 clusters
K-means Clustering • Cost function: the sum of squared distances from each data point to its assigned prototype: J = Σ_i Σ_k r_ik ||x_i − μ_k||² • The K-means algorithm is coordinate descent on this cost function
Coordinate Descent • Step 1: Fix values for the prototypes μ_k and minimize w.r.t. the allocations r_ik • assign each data point to the nearest prototype • Step 2: Fix values for the allocations r_ik and minimize w.r.t. the prototypes μ_k • this gives μ_k = Σ_i r_ik x_i / Σ_i r_ik • Iterate these two steps • Convergence guaranteed since there are a finite number of possible settings for the allocations • It can only find local minima, so we should start the algorithm with many different initial settings
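A minimal sketch of these two steps (with random restarts, as suggested above); the initialization scheme and the number of restarts are illustrative choices.

```python
import numpy as np

def kmeans(X, K, n_iters=100, n_restarts=10, seed=0):
    """Coordinate descent on the K-means cost J = sum_i sum_k r_ik ||x_i - mu_k||^2:
    alternate Step 1 (assign each point to its nearest prototype) and
    Step 2 (reset each prototype to the mean of its assigned points),
    keeping the best of several random restarts."""
    rng = np.random.default_rng(seed)
    best_cost, best = np.inf, None
    for _ in range(n_restarts):
        mu = X[rng.choice(len(X), size=K, replace=False)]   # random initial prototypes
        for _ in range(n_iters):
            # Step 1: fix mu, minimize over the allocations r.
            d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
            r = d2.argmin(axis=1)
            # Step 2: fix r, minimize over mu (mean of each cluster's points;
            # a prototype that loses all its points is left unchanged).
            new_mu = np.array([X[r == k].mean(axis=0) if np.any(r == k) else mu[k]
                               for k in range(K)])
            if np.allclose(new_mu, mu):
                break
            mu = new_mu
        cost = ((X - mu[r]) ** 2).sum()                      # sum of squared distances
        if cost < best_cost:
            best_cost, best = cost, (mu, r)
    return best[0], best[1], best_cost

# Illustration: 4 data points and 3 clusters, as in the example above.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [0.0, 5.0]])
mu, r, cost = kmeans(X, K=3, n_restarts=5)
print("prototypes:\n", mu)
print("allocations:", r, " cost:", round(cost, 4))
```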