
MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

Michael I. Jordan, University of California, Berkeley / INRIA. June 12, 2013. Co-authors: Tamara Broderick, Brian Kulis.


Presentation Transcript


  1. MAD-Bayes: MAP-based Asymptotic Derivations from Bayes • Michael I. Jordan, University of California, Berkeley / INRIA • June 12, 2013 • Co-authors: Tamara Broderick, Brian Kulis

  2–6. A Typical Conversation in Silicon Valley • Me: “I have an exciting new statistical methodology that provides a new capability you might be interested in” • Them: “I have a petabyte of data. Will your exciting new statistical methodology run on my data?” • Me: “Well…” • Them: “Also, I need a decent answer within an hour/minute/millisecond, no matter how much data I have; what kind of guarantees can you provide me?” • Me: “Sorry, I can’t provide any such guarantees…”

  7–11. Statistical Inference and Big Data • So I’ve been working on computation-meets-inference issues • Sad to say, I haven’t found Bayesian perspectives very helpful; all of my progress to date on the topic has been frequentist • Even sadder to say, Bayesian nonparametrics hasn’t been helpful • isn’t Bayesian nonparametrics supposed to aim precisely at revealing new phenomena as data accrue? • why should it fail to help when too much data accrue? • Why in general should statisticians not know what to do when they are given too much of their prized resource?

  12–13. What Do We Retreat From? • Probability models? • Hierarchical models? • Loss functions and point estimates? • Credible sets? • Full Bayesian posterior inference? • Asymptotic guarantees for inference algorithms? • Identifiability? • Coherence? • Real-world relevance? • Any retreat should of course be a strategic retreat

  14. Part I: Computation/Statistics Tradeoffs via Convex Optimization with Venkat Chandrasekaran Caltech

  15. Whither the Theory of Inference? Given an inferential goal and a fixed computational budget, provide a guarantee (supported by an algorithm and an analysis) that the quality of inference will increase monotonically as data accrue (without bound)

  16. Computation/Statistics Tradeoffs • More data generally means more computation in our current state of understanding • but statistically, more data generally means less risk (i.e., error) • and statistical inferences are often simplified as the amount of data grows • somehow these facts should have algorithmic consequences • i.e., somehow we should be able to get by with less computation as the amount of data grows • need a new notion of controlled algorithm weakening

  17–20. Time-Data Tradeoffs • Consider an inference problem with fixed risk • Inference procedures viewed as points in a plot of runtime vs. number of samples n • Vertical lines: classical estimation theory – well understood • Horizontal lines: complexity-theory lower bounds – poorly understood; depends on the computational model • Trading off the upper bounds: more data means a smaller runtime upper bound – need “weaker” algorithms for larger datasets

  21. A Denoising Problem • Signal from a known (bounded) set • Noise • Observation model • Observe n i.i.d. data points
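The slide’s formulas did not survive the transcript; a minimal reconstruction of the setup, with all symbols ($x^*$, $S$, $\sigma$, $z_i$, $y_i$) being my notation rather than the slide’s, is:

```latex
% Hedged reconstruction of the denoising setup: a fixed signal x* from a known
% bounded set S, observed n times in i.i.d. Gaussian noise of level sigma.
\[
  y_i \;=\; x^* + \sigma z_i, \qquad
  z_i \sim \mathcal{N}(0, I)\ \text{i.i.d.}, \qquad
  i = 1, \dots, n, \qquad
  x^* \in S .
\]
```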

  22. Convex Programming Estimator • An M-estimator • Convex relaxation • C is a convex set containing the signal set (a sketch of the estimator follows)
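Assuming the relaxation works as the next slide’s “outer convex approximations” suggests, the estimator is plausibly of the following form (a sketch in my notation, not a quote from the slides):

```latex
% Shrink the empirical mean onto a convex outer relaxation C of the signal set S.
\[
  \hat{x}_n \;=\; \arg\min_{x \in C}\; \bigl\|\bar{y}_n - x\bigr\|_2^2,
  \qquad \bar{y}_n = \frac{1}{n}\sum_{i=1}^{n} y_i,
  \qquad C \supseteq \mathrm{conv}(S)\ \text{convex}.
\]
```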

  23. Hierarchy of Convex Relaxations • If the signal set is “algebraic”, then one can obtain a family of outer convex approximations • polyhedral, semidefinite, hyperbolic relaxations (Sherali-Adams, Parrilo, Lasserre, Gårding, Renegar) • Sets ordered by computational complexity • Central role played by lift-and-project

  24. Statistical Performance of Estimator • Consider cone of feasible directions into C

  25. Statistical Performance of Estimator • Theorem: the risk of the estimator is controlled by the geometry of this cone (a reconstruction of the bound is sketched below) • Intuition: only consider error in the feasible cone • Can be refined for better bias-variance tradeoffs
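The bound itself was lost in the transcript; a hedged sketch of its likely shape, writing $T_C(x^*)$ for the cone of feasible directions into $C$ at $x^*$ and $g(\cdot)$ for a Gaussian-width-type complexity measure (both symbols are my assumptions):

```latex
% Risk scales like the squared Gaussian width of the feasible cone, divided by n.
\[
  \mathbb{E}\,\bigl\|\hat{x}_n - x^*\bigr\|_2^2
  \;\lesssim\;
  \frac{\sigma^2 \, g\bigl(T_C(x^*) \cap B_2\bigr)^2}{n}.
\]
```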

  26. Hierarchy of Convex Relaxations • Corollary: to obtain risk of at most 1, the number of samples n must scale with the complexity of C (see the sketch below) • Key point: if we have access to larger n, can use larger C
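Read through the bound sketched above, the corollary presumably amounts to a sample-size requirement of roughly this shape (again a hedged reconstruction, not a quote):

```latex
% To drive the risk down to 1 it suffices, up to constants, that
\[
  n \;\gtrsim\; \sigma^2 \, g\bigl(T_C(x^*) \cap B_2\bigr)^2 ,
\]
% so a looser relaxation C (a larger cone, hence a larger width) becomes
% affordable exactly when more samples are available.
```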

  27. Hierarchy of Convex Relaxations • If we have access to larger n, can use larger C ⇒ obtain “weaker” estimation algorithm

  28. Example 1 • Signal set consists of cut matrices • E.g., collaborative filtering, clustering

  29. Example 2 • Signal set consists of all perfect matchings in complete graph • E.g., network inference

  30. Example 3 • Signal set consists of all adjacency matrices of graphs with only a clique on square-root-many nodes • E.g., sparse PCA, gene expression patterns • Kolar et al. (2010)

  31. Example 4 • Banding estimators for covariance matrices • Bickel-Levina (2007), many others • assume a known variable ordering • Stylized problem: let M be a known tridiagonal matrix • Signal set

  32. Part II: MAP-based Asymptotic Derivations from Bayes with Tamara Broderick and Brian Kulis

  33–35. Clustering, K-means and Optimization • After 40 years, K-means is still the method of choice for solving clustering problems in many applied domains • particularly when data sets are large • this isn’t just due to ignorance • K-means can be orders of magnitude faster than Bayesian alternatives, even when doing hundreds of restarts • and speed matters • K-means is an optimization procedure • optimization is a key tool for scalable frequentist inference • even if we can’t embrace it fully, let’s not avoid it

  36–38. MAD-Bayes • Bayesian nonparametrics assists the optimization-based inference community • Bayes delivers the flexible, modular modeling framework that the optimization-based community has been lacking • Bayesian nonparametrics gives rise to new loss functions and regularizers that are naturally nonparametric • no recourse to MCMC, SMC, etc (which make eyes glaze over) • Inspiration: K-means can be derived as the limit of an EM algorithm for fitting a mixture model • We do something similar in spirit, taking limits of various Bayesian nonparametric models: • Dirichlet process mixtures • hierarchical Dirichlet process mixtures • beta processes and hierarchical beta processes
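To make “taking limits” concrete: the small-variance limit of MAP inference in a Dirichlet process Gaussian mixture yields a K-means-style objective with a penalty on the number of clusters, along the lines of Kulis and Jordan’s DP-means; the notation below ($\lambda$, cluster sets $\ell_k$, means $\mu_k$) and the exact parameterization of the penalty are my assumptions:

```latex
% DP-means-style objective: K is not fixed in advance; each extra cluster costs lambda^2.
\[
  \min_{K,\ \{\ell_k\},\ \{\mu_k\}} \;\;
  \sum_{k=1}^{K} \sum_{x_i \in \ell_k} \bigl\|x_i - \mu_k\bigr\|_2^2 \;+\; \lambda^2 K .
\]
```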

  39–41. MAD-Bayes • In short, we apply an inverse beautification operator to our lovely Bayesian nonparametrics framework to make it appeal to a bunch of heathens • who probably still won’t appreciate it • An alternative perspective: full Bayesian inference needs fast initialization

  42. K-means Clustering • Represent the data set in terms of K clusters, each of which is summarized by a prototype • Each data point is assigned to one of the K clusters • Represented by allocations such that, for all data indices i, each point belongs to exactly one cluster (the constraint is written out below) • Example: 4 data points and 3 clusters
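In symbols, the one-of-K allocation constraint alluded to here is presumably the standard one (the indicator notation $r_{ik}$ is my assumption):

```latex
% r_ik = 1 iff data point i is assigned to cluster k; each point lies in exactly one cluster.
\[
  r_{ik} \in \{0, 1\}, \qquad \sum_{k=1}^{K} r_{ik} = 1 \quad \text{for all data indices } i .
\]
```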

  43. K-means Clustering • Cost function: the sum of squared distances from each data point to its assigned prototype (written out below) • The K-means algorithm is coordinate descent on this cost function
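In the indicator notation introduced above, the cost function reads (a standard reconstruction, since the formula was dropped from the transcript):

```latex
% Sum of squared distances from each data point x_i to its assigned prototype mu_k.
\[
  J(r, \mu) \;=\; \sum_{i=1}^{N} \sum_{k=1}^{K} r_{ik}\, \bigl\|x_i - \mu_k\bigr\|_2^2 .
\]
```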

  44. Coordinate Descent • Step 1: fix values for the prototypes and minimize w.r.t. the allocations • assign each data point to the nearest prototype • Step 2: fix values for the allocations and minimize w.r.t. the prototypes • this gives each prototype as the mean of the data points assigned to it • Iterate these two steps • Convergence is guaranteed since there are only a finite number of possible settings for the allocations • It can only find local minima, so we should start the algorithm with many different initial settings
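As a concrete illustration of the two coordinate-descent steps described on this slide, here is a minimal NumPy sketch (function and variable names are mine, not from the talk); as the slide advises, in practice one would run it from many different initial settings and keep the best local minimum:

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal K-means coordinate descent. X has shape (N, D); K is the number of clusters."""
    rng = np.random.default_rng(seed)
    # Initialize prototypes at K distinct, randomly chosen data points.
    mu = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(n_iters):
        # Step 1: fix the prototypes, minimize over the allocations:
        # assign each data point to its nearest prototype.
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, K) squared distances
        z = d2.argmin(axis=1)                                     # cluster index for each point
        # Step 2: fix the allocations, minimize over the prototypes:
        # each prototype becomes the mean of its assigned points (kept as-is if empty).
        new_mu = np.array([X[z == k].mean(axis=0) if np.any(z == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):  # no prototype moved: a local minimum has been reached
            break
        mu = new_mu
    return z, mu

# Usage sketch: the slide's toy setting of 4 data points and 3 clusters.
if __name__ == "__main__":
    X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [9.0, 0.0]])
    assignments, prototypes = kmeans(X, K=3)
    print(assignments, prototypes)
```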
