Particle Filtered MCMC-MLE with Connections to Contrastive Divergence
Arthur Asuncion, Qiang Liu, Alexander Ihler, Padhraic Smyth
Department of Computer Science, University of California, Irvine

1. Motivation
• Undirected models are useful in many settings. Consider models in exponential family form:
  p(x|\theta) = \exp(\theta^\top S(x)) / Z(\theta), where the partition function Z(\theta) = \sum_x \exp(\theta^\top S(x)) is usually intractable.
• Task: given i.i.d. data x_1, ..., x_M, estimate the parameters \theta accurately and quickly.
• Maximum likelihood estimation (MLE): \hat{\theta} = \arg\max_\theta \sum_{i=1}^M \log p(x_i|\theta).
• Because Z(\theta) is intractable, we need to resort to approximate techniques:
  • Pseudolikelihood / composite likelihoods
  • Sampling-based techniques (e.g. MCMC-MLE)
  • Contrastive divergence (CD) learning
• We propose particle filtered MCMC-MLE.

2. MCMC-MLE
• Widely used in statistics [Geyer, 1991].
• Idea: draw samples x_1, ..., x_N from an alternate distribution p(x|\theta_0) using MCMC, then approximate the likelihood via a Monte Carlo approximation of the log partition function ratio:
  \log (Z(\theta) / Z(\theta_0)) \approx \log \frac{1}{N} \sum_{s=1}^N w_s, with importance weights w_s = \exp((\theta - \theta_0)^\top S(x_s)).
• To optimize the approximate likelihood, use its gradient; MCMC-MLE uses importance sampling to estimate the gradient:
  \nabla \ell(\theta) \approx \hat{E}_{data}[S(x)] - \sum_s w_s S(x_s) / \sum_s w_s.
• Degeneracy problems arise if \theta moves far from the initial \theta_0: the importance weights collapse onto a few samples. (A minimal Python sketch of this estimator appears after Section 6.)
[Flowchart: run MCMC under p(x|\theta_0) until equilibrium; calculate new weights; update \theta using the approximate gradient.]

3. Contrastive Divergence (CD)
• A widely used machine learning algorithm for learning undirected models [Hinton, 2002].
• CD can be motivated by taking the gradient of the log-likelihood directly:
  \nabla \ell(\theta) = \hat{E}_{data}[S(x)] - E_{p(x|\theta)}[S(x)].
• CD-n samples from the current model approximately (sketched in code below):
  • Initialize the chains at the empirical data distribution.
  • Run only n MCMC steps under \theta.
• Persistent CD: initialize the chains at the samples from the previous iteration [Tieleman, 2008].
[Flowchart: run MCMC under \theta for n steps; update \theta using the approximate gradient.]

4. Particle Filtered MCMC-MLE (PF)
• Use sampling-importance-resampling (SIR) with MCMC rejuvenation to estimate the gradient (a sketch of the full loop follows the conclusions).
• Monitor the effective sample size (ESS): ESS = (\sum_s w_s)^2 / \sum_s w_s^2.
• If the ESS (the "health" of the particles) is low:
  • Resample the particles in proportion to their weights w.
  • Rejuvenate them with n MCMC steps under the current \theta.
• PF can avoid MCMC-MLE's degeneracy issues.
• PF can potentially be faster than CD, since it only rejuvenates when the ESS is low.
• As the number of particles approaches infinity, PF recovers the exact log-likelihood gradient.
[Flowchart: calculate the weights and check the ESS; if the ESS is low, resample and rejuvenate.]

5. Experimental Analysis
• Visible Boltzmann machines
• Exponential random graph models (ERGMs), with network statistics such as the number of edges, 2-stars, and triangles
• Conditional random fields (CRFs)
• Restricted Boltzmann machines (RBMs): experiments on MNIST data, with 500 hidden units

6. Conclusions
• Particle filtered MCMC-MLE can avoid the degeneracy issues of MCMC-MLE by performing resampling and rejuvenation.
• Particle filtered MCMC-MLE is sometimes faster than CD, since it only rejuvenates when needed.
• There is a unified view of all these algorithms: PF can be viewed as a "hybrid" between MCMC-MLE and CD.
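
A minimal Python sketch of the MCMC-MLE gradient estimator from Section 2, assuming a toy fully visible Boltzmann machine over d binary variables with pairwise sufficient statistics; the helper names (suff_stats, gibbs_sweep, mcmc_mle_gradient) and all constants are our illustration choices, not from the poster.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5  # number of binary variables in the toy model

def suff_stats(x):
    """Pairwise sufficient statistics S(x): products x_i * x_j for i < j."""
    return np.outer(x, x)[np.triu_indices(d, k=1)]

def gibbs_sweep(x, theta):
    """One systematic Gibbs sweep under p(x|theta) proportional to exp(theta^T S(x))."""
    W = np.zeros((d, d))
    W[np.triu_indices(d, k=1)] = theta
    W = W + W.T
    for i in range(d):
        p1 = 1.0 / (1.0 + np.exp(-W[i] @ x))   # P(x_i = 1 | rest)
        x[i] = float(rng.random() < p1)
    return x

def mcmc_mle_gradient(data, theta, theta0, samples):
    """Gradient estimate E_data[S] - sum_s w_s S(x_s), where the samples
    x_s ~ p(x|theta0) carry weights w_s prop. to exp((theta - theta0)^T S(x_s))."""
    S_data = np.mean([suff_stats(x) for x in data], axis=0)
    S_samp = np.array([suff_stats(x) for x in samples])
    logw = S_samp @ (theta - theta0)
    w = np.exp(logw - logw.max())              # stabilize the exponentials
    w /= w.sum()
    ess = 1.0 / np.sum(w ** 2)                 # effective sample size
    return S_data - w @ S_samp, ess

# Usage: sample from p(x|theta0) once "until equilibrium", then ascend.
data = [rng.integers(0, 2, d).astype(float) for _ in range(100)]
theta0 = np.zeros(d * (d - 1) // 2)
samples = []
for _ in range(200):
    x = rng.integers(0, 2, d).astype(float)
    for _ in range(50):                        # long MCMC run under theta0
        x = gibbs_sweep(x, theta0)
    samples.append(x)
theta = theta0.copy()
for _ in range(100):
    grad, ess = mcmc_mle_gradient(data, theta, theta0, samples)
    theta += 0.05 * grad   # as theta drifts from theta0, ess collapses
```

The degeneracy the poster warns about is visible here: the samples are drawn once under \theta_0, so as \theta moves away, the ESS returned by mcmc_mle_gradient drops and the gradient estimate rests on a handful of particles.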
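
A matching sketch of a CD-n update for Section 3, continuing the code above (it reuses rng, suff_stats, and gibbs_sweep); the function name and step size are ours.

```python
def cd_n_update(data, theta, n=1, lr=0.05):
    """One CD-n parameter update: chains start at the empirical data, run
    only n Gibbs sweeps under the current theta, and the chain average of S
    approximates the intractable model expectation in the gradient."""
    S_data = np.mean([suff_stats(x) for x in data], axis=0)
    chains = [x.copy() for x in data]          # initialize at the data
    for _ in range(n):                         # only n MCMC steps
        chains = [gibbs_sweep(x, theta) for x in chains]
    S_model = np.mean([suff_stats(x) for x in chains], axis=0)
    return theta + lr * (S_data - S_model)     # approximate gradient ascent
```

Persistent CD [Tieleman, 2008] would instead keep `chains` alive across calls, initializing each update's MCMC at the previous update's samples rather than at the data.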
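
Finally, a sketch of the particle filtered MCMC-MLE loop from Section 4, again reusing the helpers above. The poster specifies reweighting, an ESS check, and resample-plus-rejuvenate when the ESS is low; the concrete threshold (N/2), step size, and bookkeeping via theta_ref are our assumptions for illustration.

```python
def pf_mcmc_mle(data, theta, particles, iters=100, lr=0.05, n_rejuv=1):
    """Particle filtered MCMC-MLE: reweight the particles as theta moves,
    and resample + rejuvenate with MCMC only when the ESS drops below N/2
    (the threshold is our choice, not taken from the poster)."""
    S_data = np.mean([suff_stats(x) for x in data], axis=0)
    N = len(particles)
    theta_ref = theta.copy()   # distribution the particles currently target
    for _ in range(iters):
        S_part = np.array([suff_stats(x) for x in particles])
        logw = S_part @ (theta - theta_ref)    # importance weights vs theta_ref
        w = np.exp(logw - logw.max())
        w /= w.sum()
        if 1.0 / np.sum(w ** 2) < N / 2:       # ESS low: particle "health" is poor
            idx = rng.choice(N, size=N, p=w)   # resample in proportion to w
            particles = [particles[i].copy() for i in idx]
            for _ in range(n_rejuv):           # rejuvenate under current theta
                particles = [gibbs_sweep(x, theta) for x in particles]
            theta_ref = theta.copy()           # particles now target theta
            w = np.full(N, 1.0 / N)
            S_part = np.array([suff_stats(x) for x in particles])
        theta = theta + lr * (S_data - w @ S_part)   # weighted gradient step
    return theta, particles
```

This makes the "hybrid" view from Section 6 concrete: when the ESS never drops, the loop behaves like MCMC-MLE with a fixed sample; when it drops every iteration, each step resamples and runs n MCMC sweeps from the current particles, much like persistent CD.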