Convolutional Restricted Boltzmann Machines for Feature Learning (CRBMs for Feature Learning)
Mohammad Norouzi. Advisor: Dr. Greg Mori. CS @ Simon Fraser University. 27 Nov 2009
Problems • Human detection • Handwritten digit classification
Sliding Window Approach (Cont'd) • [Figure: decision boundary between person and background windows; INRIA Person Dataset]
Success or failure of an object recognition algorithm hinges on the features used • [Diagram: Input → Feature representation → Classifier → Label (Human / Background, or 0 / 1 / 2 / 3 / …); our focus is learning the feature representation]
Local Feature Detector Hierarchies • Moving up the hierarchy, feature detectors become larger, more complicated, and less frequent
Generative & Layerwise Learning • [Diagram: unknown hidden units learned one layer at a time with a generative CRBM]
Visual Features: Filtering • [Figure: an image is filtered with a kernel (the feature) to produce a filter response map]
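To make the filtering step concrete, here is a minimal sketch in Python/NumPy (not from the slides; the image and kernel are random placeholders, and the 7x7 size echoes the INRIA filters described later):

```python
import numpy as np
from scipy.signal import correlate2d

image = np.random.rand(64, 64)   # stand-in for a gray-scale image
kernel = np.random.rand(7, 7)    # stand-in for a learned feature (filter kernel)

# 'valid' keeps only positions where the kernel fits entirely inside the
# image, so the response map is (64-7+1) x (64-7+1) = 58 x 58.
response = correlate2d(image, kernel, mode='valid')
print(response.shape)            # (58, 58)
```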
Our approach to feature learning is generative • [Figure: binary hidden variables placed on top of the image (CRBM model)]
Related Work • Convolutional Neural Network (CNN) [Lecun et al. 98] [Ranzato et al. CVPR'07]: discriminative; filtering layers are bundled with a classifier, and all the layers are learned together using error backpropagation; does not perform well on natural images • Biologically plausible models [Serre et al., PAMI'07] [Mutch and Lowe, CVPR'06]: no learning; a hand-crafted first layer and randomly selected prototypes for the second layer
Related Work (cont'd) • Deep Belief Net [Hinton et al., NC'2006]: generative and unsupervised; a two-layer partially observed MRF, called an RBM, is the building block; learning is performed unsupervised and layer-by-layer, from the bottom layer upwards • Our contributions: we incorporate spatial locality into RBMs and adapt the learning algorithm accordingly; we add more complicated components such as pooling and sparsity into deep belief nets
Why Generative & Unsupervised • Discriminative learning of deep and large neural networks has not been successful: it requires large training sets, easily over-fits for large models, and the first-layer gradients are relatively small • Alternative hybrid approach: learn a large set of first-layer features generatively, then switch to a discriminative model to select the discriminative features from those learned • Discriminative fine-tuning is helpful
CRBM • The image is the visible layer, and the hidden layer is related to filter responses • An energy-based probabilistic model, with energy of the standard CRBM form E(v, h) = - Σ_k h^k • (w^k ∗ v) - Σ_k b_k Σ_{i,j} h^k_{i,j} - c Σ_{i,j} v_{i,j}, where ∗ is valid convolution and • is the dot product of vectorized matrices
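A minimal sketch of this energy, assuming the standard CRBM form written above (array shapes and names are illustrative):

```python
import numpy as np
from scipy.signal import correlate2d

def crbm_energy(v, h, W, b, c):
    """v: (H, W) image; h: (K, H-fh+1, W-fw+1) binary hidden maps;
    W: (K, fh, fw) filters; b: (K,) hidden biases; c: visible bias."""
    e = -c * v.sum()
    for k in range(W.shape[0]):
        resp = correlate2d(v, W[k], mode='valid')  # (w^k * v)
        e -= (h[k] * resp).sum()                   # dot product of vectorized matrices
        e -= b[k] * h[k].sum()
    return e
```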
Training CRBMs • Maximum likelihood learning of CRBMs is difficult • Contrastive Divergence (CD) learning is applicable • For CD learning we need to compute the conditionals P(h | v) and P(v | h), starting the Gibbs chain at a data sample
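A minimal sketch of one CD-1 update, assuming the standard CRBM conditionals given on the next slide and binary visible units (the learning rate and all names are illustrative, not from the slides):

```python
import numpy as np
from scipy.signal import correlate2d, convolve2d

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b, c, lr=0.01, rng=np.random.default_rng(0)):
    K = W.shape[0]
    # Positive phase: hidden probabilities given the data.
    p_h0 = np.stack([sigmoid(correlate2d(v0, W[k], 'valid') + b[k])
                     for k in range(K)])
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)  # sample h ~ P(h|v0)
    # Negative phase: reconstruct v, then recompute hidden probabilities.
    v1 = sigmoid(sum(convolve2d(h0[k], W[k], 'full') for k in range(K)) + c)
    p_h1 = np.stack([sigmoid(correlate2d(v1, W[k], 'valid') + b[k])
                     for k in range(K)])
    # Gradient: difference of data- and reconstruction-phase correlations.
    for k in range(K):
        W[k] += lr * (correlate2d(v0, p_h0[k], 'valid')
                      - correlate2d(v1, p_h1[k], 'valid'))
        b[k] += lr * (p_h0[k].mean() - p_h1[k].mean())
    c += lr * (v0.mean() - v1.mean())
    return W, b, c
```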
CRBM (Backward) • Nearby hidden variables cooperate in reconstruction • The conditional probabilities take the form P(h^k_{ij} = 1 | v) = σ((w^k ∗ v)_{ij} + b_k) and P(v_{ij} = 1 | h) = σ(Σ_k (h^k ⊛ w^k)_{ij} + c), where ∗ is valid convolution, ⊛ is full convolution, and σ is the logistic sigmoid
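A minimal sketch of this backward (reconstruction) pass, assuming the conditional above; the 'full' convolution is what lets nearby hidden units cooperate:

```python
import numpy as np
from scipy.signal import convolve2d

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reconstruct(h, W, c):
    """h: (K, Hh, Wh) hidden maps; W: (K, fh, fw) filters; c: visible bias."""
    # Overlapping 'full' convolutions sum the contributions of nearby
    # hidden units to each visible unit.
    total = sum(convolve2d(h[k], W[k], mode='full') for k in range(W.shape[0]))
    return sigmoid(total + c)    # P(v_ij = 1 | h)
```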
Learning the Hierarchy • The structure is trained bottom-up and layerwise • The CRBM model is used for training the filtering layers • Filtering layers are followed by down-sampling (pooling) layers, which reduce the dimensionality • [Diagram: Filtering (CRBM) → Non-linearity → Pooling → Filtering (CRBM) → Non-linearity → Pooling → Classifier]; a sketch of one such stage follows
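A minimal sketch of one filtering + non-linearity + pooling stage (single-channel input; max pooling stands in for the down-sampling layer; all names are illustrative):

```python
import numpy as np
from scipy.signal import correlate2d

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def max_pool(resp, size=2):
    H, W = resp.shape
    H, W = H - H % size, W - W % size                 # crop to a multiple of size
    blocks = resp[:H, :W].reshape(H // size, size, W // size, size)
    return blocks.max(axis=(1, 3))                    # down-sample by `size`

def forward_stage(image, filters, biases):
    # One stage of the hierarchy: filter with each learned kernel,
    # squash with a non-linearity, then pool to reduce dimensionality.
    return [max_pool(sigmoid(correlate2d(image, f, mode='valid') + b))
            for f, b in zip(filters, biases)]
```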
[Figure: an input image, the 1st-layer filters and their responses, and the 2nd-layer filters and their responses]
Evaluation • MNIST digit dataset: training set of 60,000 digit images of size 28x28; test set of 10,000 images • INRIA person dataset: training set of 2416 person windows of size 128 x 64 pixels and 4.5x10^6 negative windows; test set of 1132 positive and 2x10^6 negative windows
First-layer filters • INRIA: 15 filters of size 7x7, learned from gray-scale images of the INRIA positive set • MNIST: 15 filters of size 5x5, learned from unlabeled MNIST digits
Second-Layer Features (MNIST) • The filters are hard to visualize directly • Instead, we show patches that respond strongly to each filter
MNIST Results • [Table: MNIST error rate when the model is trained on the full training set]
Results • [Figure: example detections, with a false positive highlighted]
INRIA Results • Adding our large-scale features significantly improves the performance of the baseline (HOG)
Conclusion • We extended the RBM model to the Convolutional RBM, which is useful for domains with spatial locality • We exploited CRBMs to train local hierarchical feature detectors, one layer at a time and generatively • This method obtained results comparable to the state of the art in digit classification and human detection
Training CRBMs (Cont'd) • The problem of reconstructing the border region becomes severe when the number of Gibbs sampling steps is greater than 1 • We partition the visible units into middle and border regions • Instead of maximizing the likelihood, we (approximately) maximize the conditional likelihood of the middle region given the border, P(v^M | v^B)
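A minimal sketch of one reading of this trick (my interpretation, not code from the slides): during Gibbs sampling, the border visible units are clamped to the data, so only the middle region is ever reconstructed.

```python
import numpy as np

def clamp_border(v_recon, v_data, margin):
    """Keep the border of the reconstruction fixed at the data values."""
    v = v_recon.copy()
    middle = np.zeros_like(v, dtype=bool)
    middle[margin:-margin, margin:-margin] = True   # True = middle region
    v[~middle] = v_data[~middle]                    # clamp the border to data
    return v
```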
Enforcing Feature Sparsity • The CRBM representation is K (the number of filters) times overcomplete • After a few CD learning iterations, v is perfectly reconstructed • We enforce sparsity to tackle this problem • Hidden bias terms were frozen at large negative values • Having a single non-sparse hidden unit improves the learned features • This might be related to the ergodicity condition
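A minimal sketch of the bias-freezing scheme; the value -4.0 is an illustrative choice, not taken from the slides:

```python
import numpy as np

K = 15                    # number of filters, as in the experiments
b = np.full(K, -4.0)      # hidden biases frozen at a large negative value

# With b fixed, sigmoid((w^k * v)_ij + b_k) stays small unless the filter
# response is strong, so the hidden maps remain sparse. The CD update for
# b is simply skipped, keeping the biases frozen.
```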
Probabilistic Meaning of Max • [Figure: a max-pooling example; each pooled unit takes the maximum over its block of hidden responses]
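A concrete 2x2 max-pooling example (values illustrative, loosely following the figure): each pooled unit is the maximum over its block.

```python
import numpy as np

resp = np.array([[1., 2., 3., 4.],
                 [2., 1., 2., 1.],
                 [3., 4., 5., 6.],
                 [2., 1., 3., 4.]])

# Group the responses into 2x2 blocks and take the max of each block.
pooled = resp.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)   # [[2. 4.]
                #  [4. 6.]]
```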
The Classifier Layer • We used an SVM as our final classifier: an RBF kernel for MNIST and a linear kernel for INRIA • For INRIA we combined our 4th-layer outputs with HOG features • We experimentally observed that relaxing the sparsity of the CRBM's hidden units yields better results • This lets the discriminative model set the thresholds itself
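A minimal sketch of this final stage using scikit-learn; the features and labels here are random placeholders standing in for the learned 4th-layer outputs:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
features = rng.random((100, 256))    # stand-in for learned feature vectors
labels = rng.integers(0, 10, 100)    # stand-in for digit labels

clf = SVC(kernel='rbf')              # RBF kernel for MNIST; 'linear' for INRIA
clf.fit(features, labels)
print(clf.predict(features[:5]))
```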
Why are HOG features added? • Because part-like features are very sparse • Having a template of the human figure helps a lot
RBM • A two-layer pairwise MRF with a full set of hidden-visible connections • The RBM is an energy-based model • Hidden random variables are binary; visible variables can be binary or continuous • Inference is straightforward: P(h_j = 1 | v) = σ(b_j + Σ_i w_ij v_i) and P(v_i = 1 | h) = σ(c_i + Σ_j w_ij h_j) • Contrastive Divergence learning is used for training
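A minimal sketch of these RBM conditionals (W has one row per visible unit and one column per hidden unit; names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v, W, b):
    return sigmoid(b + v @ W)    # one probability per hidden unit

def p_v_given_h(h, W, c):
    return sigmoid(c + W @ h)    # one probability per visible unit
```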
Why Unsupervised Bottom-Up • Discriminative learning of deep structure has not been successful: it requires large training sets, easily over-fits for large models, and the first-layer gradients are relatively small • Alternative hybrid approach: learn a large set of first-layer features generatively; later, switch to a discriminative model to select the discriminative features from those learned • Fine-tune the features using error backpropagation
INRIA Results (Cont'd) • Miss rate at different FPPW (false positives per window) operating points • FPPI (false positives per image) is a better indicator of performance • More experiments on feature size and the number of layers are desired