MAD-Bayes: MAP-based Asymptotic Derivations from Bayes Michael I. Jordan INRIA University of California, Berkeley May 11, 2013 Acknowledgments: Brian Kulis, Tamara Broderick
Statistical Inference and Big Data • Two major needs: models with open-ended complexity and scalable algorithms that allow those models to be fit to data • In Bayesian inference the focus is on the models • burgeoning literature on Bayesian nonparametrics provides stochastic processes for representing flexible data structures • but the algorithmic choices are limited • So Big Data research hasn’t made much use of Bayes, and is instead optimization-based • but the model choices tend to be limited
Bayesian Nonparametric Modeling • Examples of stochastic processes used in Bayesian nonparametrics include distributions on: • directed trees of unbounded depth and unbounded fan-out • partitions • grammars • Markov processes with unbounded state spaces • infinite-dimensional matrices • functions (smooth and non-smooth) • copulae • distributions • Power laws arise naturally in these distributions • Hierarchical modeling uses these stochastic processes as building blocks
The Optimization Perspective • Write down a loss function and a regularizer • Find scalable algorithms that minimize the sum of these terms • Prove something about these algorithms • The connection to model-based inference is often in the analysis, but is sometimes in the design • e.g., Bayesian ideas are sometimes used to inspire the design of the regularizer • Where does the loss function come from? • Gauss, Huber, Fisher, … • It’s all very parametric, and the transition to nonparametrics is a separate step
This Talk • Bayesian nonparametrics meets optimization • flexible, scalable modeling framework • gives rise to new loss functions and regularizers that are naturally nonparametric • no recourse to MCMC, SMC, etc. • Inspiration: the venerable, scalable K-means algorithm can be derived as the limit of an EM algorithm for fitting a mixture model • We do something similar in spirit, taking limits of various Bayesian nonparametric models: • Dirichlet process mixtures • hierarchical Dirichlet process mixtures • beta processes and hierarchical beta processes
K-means Clustering • Represent the data set in terms of K clusters, each of which is summarized by a prototype $\mu_k$ • Each data point is assigned to one of the K clusters • Represented by allocations $r_{ik} \in \{0, 1\}$ such that for all data indices $i$ we have $\sum_{k=1}^{K} r_{ik} = 1$ • Example: 4 data points and 3 clusters
K-means Clustering • Cost function: the sum-of-squared distances from each data point to its assigned prototype: $J = \sum_{i} \sum_{k} r_{ik} \, \|x_i - \mu_k\|^2$ • The K-means algorithm is coordinate descent on this cost function
Coordinate Descent • Step 1: Fix values for $\mu_k$ and minimize w.r.t. $r_{ik}$ • assign each data point to the nearest prototype • Step 2: Fix values for $r_{ik}$ and minimize w.r.t. $\mu_k$ • this gives $\mu_k = \sum_i r_{ik} x_i \,/\, \sum_i r_{ik}$, the mean of the points assigned to cluster $k$ • Iterate these two steps • Convergence is guaranteed since there are a finite number of possible settings for the allocations • It can only find local minima, so we should start the algorithm from many different initial settings
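As a concrete illustration, here is a minimal Python sketch of the two coordinate descent steps, assuming a data matrix `X` of shape `(n, d)`; the function and variable names are illustrative rather than taken from the talk:

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal K-means coordinate descent (Lloyd's algorithm) sketch."""
    rng = np.random.default_rng(seed)
    # Initialize the prototypes at K distinct data points chosen at random.
    mu = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Step 1: assign each data point to the nearest prototype.
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        z = dists.argmin(axis=1)
        # Step 2: recompute each prototype as the mean of its assigned points.
        new_mu = np.array([X[z == k].mean(axis=0) if np.any(z == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return z, mu
```

Because the allocations can only take finitely many values and the cost never increases, the loop terminates; in practice one reruns it from several random initializations and keeps the best local minimum.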
From Gaussian Mixtures to K-means • A Gaussian mixture model: $p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \sigma^2 I)$ • Set the mixing proportions to $\pi_k = 1/K$ • Write down the EM algorithm for fitting this model • Take $\sigma^2$ to zero and recover the K-means algorithm • the E step of EM is Step 1 of K-means • the M step of EM is Step 2 of K-means
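To make the limit concrete, here is the standard small-variance argument in one line (a sketch, not copied from the slides): with equal mixing proportions and a shared isotropic variance, the E-step responsibilities harden into nearest-prototype indicators as $\sigma^2 \to 0$,

```latex
r_{ik}
  = \frac{\exp\!\bigl(-\|x_i-\mu_k\|^{2}/2\sigma^{2}\bigr)}
         {\sum_{j=1}^{K}\exp\!\bigl(-\|x_i-\mu_j\|^{2}/2\sigma^{2}\bigr)}
  \;\xrightarrow{\;\sigma^{2}\to 0\;}\;
  \begin{cases}
    1 & \text{if } k=\arg\min_{j}\|x_i-\mu_j\|^{2},\\
    0 & \text{otherwise,}
  \end{cases}
```

so the E step reduces to Step 1 of K-means and the M step (a responsibility-weighted average) reduces to Step 2.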
The K in K-means • What if K is not known? • a challenging model selection problem • the algorithm itself is silent on the problem • The Gaussian mixture model perspective brings the tools of Bayesian model selection to bear in principle, but not in the limit • How about starting with Dirichlet process mixtures and Chinese restaurant processes instead of finite mixture models?
Chinese Restaurant Process (CRP) • A random process in which customers sit down in a Chinese restaurant with an infinite number of tables • the first customer sits at the first table • the $(n+1)$th subsequent customer sits at a table drawn from the following distribution: $P(\text{occupied table } k \mid \mathcal{F}_n) \propto m_k$ and $P(\text{next unoccupied table} \mid \mathcal{F}_n) \propto \alpha$ • where $m_k$ is the number of customers currently at table $k$ and where $\mathcal{F}_n$ denotes the state of the restaurant after $n$ customers have been seated
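Here is a minimal Python sketch that simulates these seating probabilities (names are illustrative; `alpha` is the CRP concentration parameter):

```python
import numpy as np

def sample_crp(n_customers, alpha, seed=0):
    """Simulate CRP table assignments; returns one table index per customer."""
    rng = np.random.default_rng(seed)
    tables = []   # tables[i] = table index chosen by customer i
    counts = []   # counts[k] = number of customers currently at table k
    for n in range(n_customers):
        if n == 0:
            tables.append(0)
            counts.append(1)
            continue
        # Occupied table k is chosen with probability m_k / (alpha + n),
        # a new table with probability alpha / (alpha + n).
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        k = int(rng.choice(len(probs), p=probs))
        if k == len(counts):
            counts.append(1)   # open a new table
        else:
            counts[k] += 1
        tables.append(k)
    return tables

print(sample_crp(10, alpha=1.0))
```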
The CRP and Clustering • Data points are customers; tables are mixture components • the CRP defines a prior distribution on the partitioning of the data and on the number of tables • This prior can be completed with: • a likelihood---e.g., associate a parameterized probability distribution with each table • a prior for the parameters---the first customer to sit at table $k$ chooses the parameter vector, $\theta_k$, for that table from a prior • We want to write out all of these probabilities and then take a scale parameter to zero
The CRP Prior • Let $\pi$ denote a partition of the integers 1 through $N$, and let $K = |\pi|$ denote its number of blocks • Then, under the CRP, we have: $P(\pi) = \dfrac{\alpha^{K} \prod_{c \in \pi} (|c| - 1)!}{\alpha (\alpha + 1) \cdots (\alpha + N - 1)}$ • This function (the EPPF) is a function only of the cardinalities of the blocks of the partition; this implies exchangeability
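A small sketch that evaluates the log of this EPPF directly from a list of block cardinalities (the function name and interface are mine, not the talk's):

```python
import math

def log_crp_prior(block_sizes, alpha):
    """Log probability of a partition under the CRP, using the EPPF above.

    block_sizes: the cardinalities |c| of the blocks, summing to N."""
    N = sum(block_sizes)
    K = len(block_sizes)
    log_p = K * math.log(alpha)
    log_p += sum(math.lgamma(size) for size in block_sizes)    # log (|c| - 1)!
    log_p -= sum(math.log(alpha + n) for n in range(N))        # log alpha (alpha+1) ... (alpha+N-1)
    return log_p

# Example: probability of splitting 5 items into blocks of sizes 3 and 2.
print(math.exp(log_crp_prior([3, 2], alpha=1.0)))
```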
The Joint Probability • Encode the partition with allocation variables $z_1, \ldots, z_N$ • The joint probability of the allocations and the data is the product of the EPPF and the usual mixture model likelihood, where the number of clusters $K$ is now random • I.e., we obtain a joint probability $p(z, x) = p_{\text{EPPF}}(z)\, p(x \mid z)$ • And we can find the MAP estimate of a clustering via: $\hat{z} = \arg\max_{z} \, p(z \mid x) = \arg\max_{z} \, p(z, x)$
Small Variance Asymptotics • Now let the likelihood be Gaussian and take the variance $\sigma^2$ to zero • We do this analytically by picking a rate constant $\lambda$ and reparameterizing the concentration parameter $\alpha$ as a function of $\sigma^2$ • And letting $\sigma^2$ go to zero we get: $\min_{z, \mu} \; \sum_{k=1}^{K} \sum_{i : z_i = k} \|x_i - \mu_k\|^2 \; + \; \lambda K$ • I.e., a penalized form of the K-means objective
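A rough sketch of where the penalty term comes from (the specific scaling of $\alpha$ below is an assumption stated only up to $\sigma$-dependent constants, and lower-order terms are suppressed): each occupied table contributes a factor of $\alpha$ to the EPPF, so scaling $\alpha$ like $\exp(-\lambda/2\sigma^{2})$ turns each cluster into a price of $\lambda$ in the scaled negative log joint probability,

```latex
-2\sigma^{2}\,\log p(z,\mu,x)
  \;\xrightarrow{\;\sigma^{2}\to 0\;}\;
  \sum_{k=1}^{K}\,\sum_{i\,:\,z_i=k}\|x_i-\mu_k\|^{2} \;+\; \lambda K \;+\; \text{const},
  \qquad \text{with } \alpha \propto \exp\!\bigl(-\lambda/2\sigma^{2}\bigr).
```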
Coordinate Descent: DP Means • Reassign a point to the cluster corresponding to the closest mean, unless the closest cluster has squared Euclidean distance greater than $\lambda$. In this case, start a new cluster. • Given the cluster assignments, perform Gibbs moves on all the means, which amounts to sampling from the posterior based on the prior and all observations in a cluster; in the $\sigma^2 \to 0$ limit this reduces to setting each mean to the average of the observations in its cluster.
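A minimal Python sketch of these updates, using the limiting form of the mean update (the cluster average); `lam` plays the role of the threshold $\lambda$, and all names are illustrative:

```python
import numpy as np

def dp_means(X, lam, n_iters=100):
    """DP-means sketch: K-means-style updates plus a new-cluster penalty lam."""
    mu = [X.mean(axis=0)]                  # start with one global cluster
    z = np.zeros(len(X), dtype=int)
    for _ in range(n_iters):
        changed = False
        # Assignment step: nearest mean, unless every squared distance exceeds lam.
        for i, x in enumerate(X):
            d = np.array([np.sum((x - m) ** 2) for m in mu])
            k = int(d.argmin())
            if d[k] > lam:
                mu.append(x.copy())        # start a new cluster at this point
                k = len(mu) - 1
            if z[i] != k:
                z[i] = k
                changed = True
        # Mean step: each surviving cluster's mean is the average of its points.
        keep = [k for k in range(len(mu)) if np.any(z == k)]
        mu = [X[z == k].mean(axis=0) for k in keep]
        relabel = {old: new for new, old in enumerate(keep)}
        z = np.array([relabel[zi] for zi in z])
        if not changed:
            break
    return z, np.array(mu)
```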
The CRP and Exchangeability • The CRP is a distribution on partitions; it is an exchangeable distribution on partitions • By De Finetti’s theorem, there must exist an underlying random measure such that the CRP is obtained by integrating out that random measure • That random measure turns out to be the Dirichlet process (e.g., Blackwell & MacQueen, 1973)
The De Finetti Theorem • An infinite sequence of random variables $(X_1, X_2, \ldots)$ is called infinitely exchangeable if the distribution of any finite subsequence is invariant to permutation • Theorem: infinite exchangeability holds if and only if $p(x_1, \ldots, x_N) = \int \prod_{i=1}^{N} p(x_i \mid G) \, P(dG)$ for all $N$, for some random measure $G$ with distribution $P$
Random Measures and Their Marginals • The De Finetti random measure is known for a number of interesting combinatorial stochastic processes: • Dirichlet process => Chinese restaurant process (Polya urn) • Beta process => Indian buffet process • Hierarchical Dirichlet process => Chinese restaurant franchise • HDP-HMM => infinite HMM • Nested Dirichlet process => nested Chinese restaurant process
Completely Random Measures (Kingman, 1967) • Completely random measures are measures on a set $\Omega$ that assign independent mass to nonintersecting subsets of $\Omega$ • e.g., Poisson processes, gamma processes, beta processes, compound Poisson processes and limits thereof • The Dirichlet process is not a completely random measure • but it’s a normalized gamma process • Completely random measures are discrete with probability one (up to a possible deterministic continuous component) • Completely random measures are random measures, not necessarily random probability measures
Completely Random Measures (Kingman, 1967) • Assigns independent mass to nonintersecting subsets of $\Omega$ • [figure: atoms of a completely random measure on $\Omega$]
Completely Random Measures (Kingman, 1967) • Consider a Poisson random measure on $\Omega \times \mathbb{R}^{+}$ with rate function specified as a product measure • Sample from this Poisson process and connect the samples vertically to their coordinates in $\Omega$
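A small sketch of this construction for a product rate measure with finite total mass, so that the number of atoms is an ordinary Poisson draw (the particular Uniform-times-Exponential rate here is an illustration, not the gamma or beta rate used later in the talk):

```python
import numpy as np

def sample_finite_rate_crm(total_mass, seed=0):
    """Sample a completely random measure sum_i pi_i * delta_{omega_i} from a
    Poisson process whose rate measure is total_mass * Uniform(0,1) x Exp(1)."""
    rng = np.random.default_rng(seed)
    n_atoms = rng.poisson(total_mass)               # Poisson number of atoms
    omegas = rng.uniform(0.0, 1.0, size=n_atoms)    # locations in Omega = [0, 1]
    weights = rng.exponential(1.0, size=n_atoms)    # atom heights pi_i
    return omegas, weights

omegas, weights = sample_finite_rate_crm(total_mass=20.0)
# Masses assigned to disjoint subsets of Omega are independent by construction.
print(weights[omegas < 0.5].sum(), weights[omegas >= 0.5].sum())
```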
Gamma Process • The gamma process is a CRM for which the rate function is given as follows (on $\Omega \times \mathbb{R}^{+}$): $\nu(d\omega, d\pi) = c \, \pi^{-1} e^{-c\pi} \, d\pi \, B_0(d\omega)$ • Draw a sample $\{(\omega_i, \pi_i)\}$ from a Poisson random measure with this rate measure • And the resulting random measure can be written simply as: $G = \sum_{i} \pi_i \, \delta_{\omega_i}$
Dirichlet Process • The Dirichlet process is a normalized gamma process
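A small sketch illustrating the normalization statement on a finite partition of $\Omega$ (a standard property; the cell masses and concentration below are illustrative): independent gamma masses on the cells, once normalized, have exactly the Dirichlet distribution that the DP assigns to that partition.

```python
import numpy as np

def normalized_gamma_on_partition(base_masses, c, n_samples=5, seed=0):
    """Draw a gamma process on a finite partition of Omega and normalize it.

    base_masses: B_0(A_1), ..., B_0(A_m) for cells A_1, ..., A_m of the partition.
    Each G(A_j) ~ Gamma(shape = c * B_0(A_j), rate = c) independently, and the
    normalized vector G(A_j) / G(Omega) is Dirichlet(c * B_0(A_1), ..., c * B_0(A_m)).
    """
    rng = np.random.default_rng(seed)
    shapes = c * np.asarray(base_masses, dtype=float)
    # NumPy parameterizes the gamma by shape and scale = 1 / rate.
    G = rng.gamma(shape=shapes, scale=1.0 / c, size=(n_samples, len(shapes)))
    return G / G.sum(axis=1, keepdims=True)

# Example: a three-cell partition with total base mass 1 and concentration c = 2.
print(normalized_gamma_on_partition([0.5, 0.3, 0.2], c=2.0))
```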
Dirichlet Process Marginals • Consider the following hierarchy: $G \sim \mathrm{DP}(\alpha, G_0)$ and $\theta_1, \theta_2, \ldots \mid G \overset{\text{iid}}{\sim} G$ • The variables $\theta_i$ are clearly exchangeable • The partition structure that they induce (through ties among the $\theta_i$) is exactly the Chinese restaurant process
Multiple Estimation Problems • We often face multiple, related estimation problems • E.g., multiple Gaussian means: $x_{ij} \sim \mathcal{N}(\theta_i, \sigma^2)$ for groups $i = 1, \ldots, m$ • Maximum likelihood: $\hat{\theta}_i = \bar{x}_i$, the sample mean of group $i$ • Maximum likelihood often doesn't work very well • want to “share statistical strength”
Hierarchical Bayesian Approach • The Bayesian or empirical Bayesian solution is to view the parameters $\theta_i$ as random variables, related via an underlying shared variable • Given this overall model, posterior inference yields shrinkage---the posterior mean for each $\theta_i$ combines data from all of the groups
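A minimal sketch of the resulting shrinkage under one standard instantiation of this hierarchy (Gaussian group means $\theta_i \sim \mathcal{N}(\mu_0, \tau^2)$ with known variances; these specific modeling choices are assumptions for illustration, not details from the talk):

```python
import numpy as np

def shrunken_means(group_means, group_sizes, sigma2, mu0, tau2):
    """Posterior means of per-group Gaussian means under a normal hierarchy.

    Assumed model: x_ij ~ N(theta_i, sigma2) and theta_i ~ N(mu0, tau2).
    The posterior mean of theta_i is a precision-weighted combination of the
    group sample mean and the shared prior mean mu0."""
    group_means = np.asarray(group_means, dtype=float)
    group_sizes = np.asarray(group_sizes, dtype=float)
    data_precision = group_sizes / sigma2      # precision of each group's sample mean
    prior_precision = 1.0 / tau2
    weight = data_precision / (data_precision + prior_precision)
    return weight * group_means + (1.0 - weight) * mu0

# Small groups are pulled more strongly toward mu0 than large ones.
print(shrunken_means(group_means=[2.0, -1.0, 0.5],
                     group_sizes=[3, 50, 10],
                     sigma2=1.0, mu0=0.0, tau2=0.5))
```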
Hierarchical Modeling • The plate notation gives a compact graphical representation of the replicated structure of such a model • It is equivalent to the fully unrolled graphical model with one node per group
Application: Protein Modeling • A protein is a folded chain of amino acids • The backbone of the chain has two degrees of freedom per amino acid (phi and psi angles) • Empirical plots of phi and psi angles are called Ramachandran diagrams