Using Fast Weights to Improve Persistent Contrastive Divergence

Tijmen Tieleman, Geoffrey Hinton • Department of Computer Science, University of Toronto • ICML 2009 • Presented by Jorge Silva, Department of Electrical and Computer Engineering, Duke University

Problems of interest: Density Estimation and Classification using RBMs • RBM = Restricted Boltzmann Machine: a stochastic version of a Hopfield network (i.e., a recurrent neural network); often used as an associative memory • Can also be seen as a particular case of a Deep Belief Network (DBN) • Why “restricted”? Because we restrict connectivity: there are no intra-layer connections (Hinton, 2002; Smolensky, 1986) • Figure (adapted from www.iro.montreal.ca): a layer of hidden units, holding the internal, or hidden, representations, connected to a layer of visible units, holding the data pattern (a binary vector)

Notation • Define an energy function E(v,h) over the visible state v and the hidden state h, written in terms of the state v_i of the i-th visible unit, the state h_j of the j-th hidden unit, the weight w_ij of the i-j connection, and the biases • The joint probability P(v,h) and the marginal P(v) are both defined from this energy
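
The slide's equations did not survive in this transcript. As a hedged reconstruction, the standard RBM definitions that match the labels above are given below; the bias symbols a_i (visible) and b_j (hidden) and the partition function Z are assumed names, not taken from the slide.

```latex
\begin{align*}
E(v,h) &= -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i w_{ij} h_j \\
P(v,h) &= \frac{e^{-E(v,h)}}{Z}, \qquad Z = \sum_{v',h'} e^{-E(v',h')} \\
P(v)   &= \sum_h P(v,h) = \frac{1}{Z} \sum_h e^{-E(v,h)}
\end{align*}
```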

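Training with gradient descent • The objective is the likelihood of the training data (using just one datum for simplicity) • The positive gradient, which only involves the hidden units given the datum, is easy to compute • But the negative gradient, an expectation under the model distribution, is intractable • We can’t even sample exactly from the model, so there is no direct Monte Carlo (MC) approximation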
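
The gradient expressions are missing from the transcript. A sketch of the standard result for a single weight w_ij, assuming the usual RBM energy E(v,h) with an interaction term -v_i w_ij h_j and partition function Z (assumed notation), is:

```latex
\begin{align*}
\log P(v) &= \log \sum_h e^{-E(v,h)} - \log Z \\
\frac{\partial \log P(v)}{\partial w_{ij}}
  &= \underbrace{\mathbb{E}_{P(h \mid v)}\!\left[ v_i h_j \right]}_{\text{positive gradient}}
   - \underbrace{\mathbb{E}_{P(v',h')}\!\left[ v'_i h'_j \right]}_{\text{negative gradient}}
\end{align*}
```

The first expectation factorizes over hidden units because there are no intra-layer connections; the second sums over all joint configurations, which is what makes it intractable.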

Contrastive Divergence (CD) • However, we can approximately sample from the model. The existing Contrastive Divergence (CD) algorithm (Hinton, 2002) is one way to do it • CD gets the direction of the gradient approximately right, though not the magnitude • The rough idea behind CD is to: • start a Markov chain at one of the training points used to estimate the positive gradient • perform one Gibbs update, i.e., resample the hidden units and then the visible units • treat the resulting configuration (h,v) as a sample from the model • What about “Persistent” CD?
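
The three CD steps above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code: the function names, the bias vectors `a` and `b`, the toy data, and the use of hidden probabilities (rather than sampled states) in the gradient statistics are all assumptions.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sample_bernoulli(p, rng):
    # Draw binary states with the given activation probabilities.
    return (rng.random(p.shape) < p).astype(np.float64)


def cd1_weight_gradient(v_data, W, a, b, rng):
    """CD-1 estimate of the log-likelihood gradient w.r.t. W for a batch."""
    # Positive phase: hidden probabilities given the training data.
    ph_data = sigmoid(v_data @ W + b)
    # One Gibbs update started at the data: sample h, then sample v.
    h_sample = sample_bernoulli(ph_data, rng)
    v_model = sample_bernoulli(sigmoid(h_sample @ W.T + a), rng)
    # Treat the result as a model sample for the negative phase.
    ph_model = sigmoid(v_model @ W + b)
    n = v_data.shape[0]
    positive = v_data.T @ ph_data / n
    negative = v_model.T @ ph_model / n
    return positive - negative


rng = np.random.default_rng(0)
v_batch = sample_bernoulli(np.full((8, 6), 0.5), rng)  # toy binary data
W = 0.01 * rng.standard_normal((6, 4))                 # 6 visible, 4 hidden units
grad_W = cd1_weight_gradient(v_batch, W, np.zeros(6), np.zeros(4), rng)
```

Because the chain restarts at the data on every call, the negative sample never moves far from the training points, which is one way to see why CD gets only the rough direction of the gradient.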

Persistent Contrastive Divergence (PCD) • Use a persistent Markov chain that is not reinitialized each time the parameters are changed • The learning rate should be small compared to the mixing rate of the Markov chain, so that the chain stays close to the slowly changing model distribution • Many persistent chains can be run in parallel; the corresponding (h,v) pairs are called “fantasy particles” • For a fixed amount of computation, RBMs can learn better models using PCD than using CD • Again, PCD is a previously existing algorithm (Neal, 1992; Tieleman, 2008)
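
A minimal NumPy sketch of one PCD update is given below, under stated assumptions: only the weight matrix is learned (biases `a`, `b` stay fixed for brevity), only the visible part of each fantasy particle is stored, and all names and toy sizes are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gibbs_update(v, W, a, b, rng):
    # One full Gibbs update: sample h given v, then v given h.
    h = (rng.random((v.shape[0], W.shape[1])) < sigmoid(v @ W + b)).astype(float)
    return (rng.random(v.shape) < sigmoid(h @ W.T + a)).astype(float)


def pcd_step(v_data, v_fantasy, W, a, b, lr, rng):
    """One PCD weight update; returns the new weights and fantasy particles."""
    positive = v_data.T @ sigmoid(v_data @ W + b) / v_data.shape[0]
    negative = v_fantasy.T @ sigmoid(v_fantasy @ W + b) / v_fantasy.shape[0]
    W_new = W + lr * (positive - negative)
    # The chains continue from their previous states; they are never reset to the data.
    v_fantasy_new = gibbs_update(v_fantasy, W_new, a, b, rng)
    return W_new, v_fantasy_new


rng = np.random.default_rng(0)
n_visible, n_hidden, n_chains = 6, 4, 10
W = 0.01 * rng.standard_normal((n_visible, n_hidden))
a, b = np.zeros(n_visible), np.zeros(n_hidden)
v_fantasy = (rng.random((n_chains, n_visible)) < 0.5).astype(float)
for _ in range(5):
    v_data = (rng.random((8, n_visible)) < 0.5).astype(float)  # toy minibatch
    W, v_fantasy = pcd_step(v_data, v_fantasy, W, a, b, lr=0.01, rng=rng)
```

The only difference from a CD-style update is where the negative-phase chain starts: here `v_fantasy` is carried across calls instead of being re-seeded from `v_data`.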

Contributions and outline • Theoretical: show the interaction between the mixing rates and the weight updates in PCD • Practical: introduce fast weights, in addition to the regular weights. This improves the performance/speed tradeoff • Outline for the rest of the talk: • Mixing rates vs weight updates • Fast weights • PCD algorithm with fast weights (FPCD) • Experiments

Mixing rates vs weight updates • Consider M persistent chains • The states (v,h) of the chains define a distribution R consisting of M point masses • Assume M is large enough that we can ignore sampling noise • The weights are updated in the direction of the negative gradient of KL(P || Q_\theta) - KL(R || Q_\theta) • P is the data distribution and Q_\theta is the intractable model distribution (being approximated by R) • \theta is the vector of parameters (weights)

Mixing rates vs weight updates • Terms in the objective function: • KL(P || Q_\theta): this term is the negative log-likelihood (minus the fixed entropy of P) • KL(R || Q_\theta): this term is being maximized w.r.t. \theta • The weight updates increase KL(R || Q_\theta) (which is bad), but • This is compensated by an increase in the mixing rates, making KL(R || Q_\theta) decrease rapidly (which is good) • Essentially, the fantasy particles quickly “rule out” large portions of the search space where Q is negligible
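
To make the two annotations concrete, here is a short derivation, assuming the objective is KL(P || Q_\theta) - KL(R || Q_\theta) with R held fixed during a weight update. The symbol H(P) for the entropy of P and the generic state x are assumed notation.

```latex
\begin{align*}
\mathrm{KL}(P \,\|\, Q_\theta)
  &= \sum_x P(x) \log \frac{P(x)}{Q_\theta(x)}
   = -H(P) - \mathbb{E}_{P}\!\left[ \log Q_\theta(x) \right] \\
-\nabla_\theta \left[ \mathrm{KL}(P \,\|\, Q_\theta) - \mathrm{KL}(R \,\|\, Q_\theta) \right]
  &= \mathbb{E}_{P}\!\left[ \nabla_\theta \log Q_\theta(x) \right]
   - \mathbb{E}_{R}\!\left[ \nabla_\theta \log Q_\theta(x) \right]
\end{align*}
```

The first line shows why the first term is the negative log-likelihood minus the fixed entropy of P. The second line is the familiar positive-minus-negative update, with the fantasy particles R standing in for the model distribution.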

Fast weights • In addition to the regular weights \theta, the paper introduces a second set of fast weights • Fast weights are only used for the fantasy particles; their learning rate is larger and their weight decay is much stronger (weight decay is an L2 penalty on the weights, as in ridge regression) • The role of the fast weights is to make the (combined) energy increase faster in the vicinity of the fantasy particles, making them mix faster • This way, the fantasy particles can escape low-energy local modes; this counteracts the progressive reduction in learning rates, which is otherwise desirable as learning progresses • The learning rate of the fast weights stays constant, but the weights themselves decay fast, so their effect is temporary (Bharath & Borkar, 1999)

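PCD algorithm with fast weights (FPCD) • Algorithm listing (not preserved in this transcript), with the weight-decay step on the fast weights annotated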
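
Since the algorithm listing is not available here, the following NumPy sketch shows one plausible FPCD update based only on the description in this talk. Several details are assumptions rather than the authors' specification: the decay factor `fast_decay = 0.95`, computing gradient statistics with the regular weights, keeping the biases fixed, and every function and variable name.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def fpcd_step(v_data, v_fantasy, W, W_fast, a, b, lr, lr_fast, fast_decay, rng):
    """One FPCD update of the regular weights, fast weights, and fantasy particles."""
    positive = v_data.T @ sigmoid(v_data @ W + b) / v_data.shape[0]
    negative = v_fantasy.T @ sigmoid(v_fantasy @ W + b) / v_fantasy.shape[0]
    grad = positive - negative
    # Fantasy particles move under the combined (regular + fast) weights.
    W_comb = W + W_fast
    h = (rng.random((v_fantasy.shape[0], W.shape[1]))
         < sigmoid(v_fantasy @ W_comb + b)).astype(float)
    v_fantasy_new = (rng.random(v_fantasy.shape)
                     < sigmoid(h @ W_comb.T + a)).astype(float)
    # Regular weights: ordinary gradient step.
    W_new = W + lr * grad
    # Fast weights: same gradient, larger learning rate, strong decay (assumed factor).
    W_fast_new = fast_decay * W_fast + lr_fast * grad
    return W_new, W_fast_new, v_fantasy_new


rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 4
W = 0.01 * rng.standard_normal((n_visible, n_hidden))
W_fast = np.zeros_like(W)
a, b = np.zeros(n_visible), np.zeros(n_hidden)
v_fantasy = (rng.random((10, n_visible)) < 0.5).astype(float)
v_data = (rng.random((8, n_visible)) < 0.5).astype(float)  # toy minibatch
W, W_fast, v_fantasy = fpcd_step(
    v_data, v_fantasy, W, W_fast, a, b,
    lr=0.01, lr_fast=np.exp(-1.0), fast_decay=0.95, rng=rng,
)
```

With `lr_fast = 0` the fast weights simply shrink geometrically toward zero, which is the sense in which their effect is temporary.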

Experiments: MNIST dataset • Small-scale task: density estimation using an RBM with 25 hidden units • Larger task: classification using an RBM with 500 hidden units • In classification RBMs, there are two types of visible units: image units and label units. The RBM learns a joint density over both types (Hinton et al., 2006; Larochelle & Bengio, 2008) • In the plots, each point corresponds to 10 runs; in each run, the network was trained for a predetermined amount of time • Performance is measured on a held-out test set • The learning rate for the regular weights decays linearly to zero over the computation time; for the fast weights it is constant and equal to 1/e
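
The two learning-rate schedules described above can be sketched as follows. The function name, the initial learning rate of 0.1, and the time budget of 100 units are hypothetical values chosen for illustration; only the linear-to-zero shape and the constant 1/e come from the slide.

```python
import math


def regular_lr(elapsed, total, initial_lr):
    """Linear decay from initial_lr to zero over the training budget."""
    fraction_left = max(0.0, 1.0 - elapsed / total)
    return initial_lr * fraction_left


# Constant fast-weight learning rate, 1/e as stated on the slide.
FAST_LR = math.exp(-1)

# Regular learning rate at the start, midpoint, and end of a 100-unit budget.
schedule = [regular_lr(t, total=100.0, initial_lr=0.1) for t in (0.0, 50.0, 100.0)]
```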

Experiments: MNIST dataset (fixed RBM size)

Experiments: MNIST dataset (optimized RBM size) • FPCD: 1200 hidden units • PCD: 700 hidden units

Experiments: Micro-NORB dataset • Classification task on 96x96 images, downsampled to 32x32 (LeCun et al., 2004) • MNORB dimensionality (before downsampling) is 18432, i.e., two 96x96 images per example, while MNIST is 784 • The learning rate for the regular weights decays as 1/t
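
A small sketch of the arithmetic and the 1/t schedule mentioned above. The exact schedule form `initial_lr / t` and the value of `initial_lr` are assumptions, as is the reading of 18432 as two 96x96 images per example and 784 as one 28x28 image.

```python
def inverse_time_lr(step, initial_lr):
    """Learning rate decaying as 1/t, with steps counted from t = 1."""
    if step < 1:
        raise ValueError("step must be >= 1")
    return initial_lr / step


# Input dimensionalities quoted on the slide.
norb_dim = 2 * 96 * 96   # two 96x96 images per example
mnist_dim = 28 * 28      # one 28x28 image per example

rates = [inverse_time_lr(t, initial_lr=0.1) for t in (1, 2, 10)]
```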

Experiments: Micro-NORB dataset • Non-monotonicity indicates overfitting problems

Conclusion • FPCD outperforms PCD, especially when the number of weight updates is small • FPCD allows more flexible learning rate schedules than PCD • Results on the MNORB data indicate that FPCD also outperforms PCD on datasets where overfitting is a concern • Logistic regression on the full 18432-dimensional MNORB dataset had 23% misclassification; the RBM with FPCD achieved 26% on the reduced dataset • Future work: run FPCD for a longer time on an established dataset
