ACM Multimedia Dec. 1, 2011 Bilinear Deep Learning for Image Classification Shenghua ZHONG, Yan LIU, Yang LIU Department of Computing The Hong Kong Polytechnic University www.comp.polyu.edu.hk/~csshzhong
Outline • Problem • Image classification, a well-known challenge • Idea • Provide human-like judgment by referencing the human visual system • Methodology • Deep models, consistent with the human visual cortex • Proposed technique • Bilinear deep learning • Experiments • Caltech101, Urban and Natural Scene, CMU PIE • Conclusion • Explore deep techniques for more multimedia analysis tasks
Image Classification • Image classification is a classical problem • Aim to understand the semantic meaning of visual information • And determine the category of the images according to some predefined criteria • Image classification remains a well-known challenge • After more than fifteen years of extensive research • Humans do not have difficulty classifying images • Before the age of 25 months, children can recognize novel three-dimensional objects • Aim of this paper • Provide human-like judgment by referencing the architecture of the human visual system and the procedure of intelligent perception • Researchers from cognitive science and neuroscience have conducted pioneering work on modeling the human brain using computational architectures • Deep architecture is a representative paradigm that has achieved notable success in modeling the human visual system
Outline • Problem • Image classification, a well-known challenge • Idea • Provide human-like judgment by referencing the human visual system • Methodology • Deep models, consistent with the human visual cortex • Proposed technique • Bilinear deep learning • Experiments • Caltech101, Urban and Natural Scene, CMU PIE • Conclusion • Explore deep techniques for more multimedia analysis tasks
Deep Learning • Physical structure • Human: dozens of cortical layers are involved in generating even the simplest vision • Deep: model the problem using multiple layers of parameterized nonlinear modules • Evolution of intelligence • Human: the multi-layer structure began to appear in the neocortex, starting with Old World monkeys about 40 million years ago • Deep: the development of intelligence follows the multi-layer structure • Propagation of information • Human: several reasons for believing that our visual systems contain multi-layer generative models • Deep: layer-wise reconstruction learns multiple levels of representation and abstraction that help to make sense of the data
Human Visual Cortex and Deep Architectures (a) Human visual cortex (b) An example of deep architectures
Outline • Problem • Image classification, a well-known challenge • Idea • Provide human-like judgment by referencing the human visual system • Methodology • Deep models, consistent with the human visual cortex • Proposed technique • Bilinear deep learning • Experiments • Caltech101, Urban and Natural Scene, CMU PIE • Conclusion • Explore deep techniques for more multimedia analysis tasks
Bilinear Deep Learning • Deep architecture • Human: visual information is transmitted through the optic tract to a nerve position as second-order data • Proposed technique: a novel Bilinear Deep Belief Network • Three-stage learning • Human: two peaks of activation in the visual cortex areas, corresponding to the "initial guess" and the "post-recognition" • Proposed technique: a new Bilinear Discriminant Initialization • Semi-supervised framework • Human: more practical for long-term daily learning of the visual world • Proposed technique: a more flexible learning framework
Bilinear Deep Belief Network • A fully interconnected directed belief network is constructed from a set of second-order planes • The number of units in the input layer is equal to the resolution of the images • The number of units in the label layer is equal to the number of classes of images • The number of units in each hidden layer is determined by bilinear discriminant initialization • The search for the mapping is transformed into the problem of finding the optimal parameter space for the deep architecture
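To make the second-order structure concrete, here is a minimal sketch, not the authors' code, of propagating an image plane through layer pairs connected by projection matrices U and V. The 28x28 input resolution and the random initialization are illustrative assumptions; the hidden plane sizes (24x24, 21x21, 19x20) follow the BDBN setting reported later in these slides.

```python
import numpy as np

# Illustrative sketch only: each layer is a second-order plane, and each pair
# of adjacent layers is connected through bilinear projection matrices U and V
# (H = U^T X V), so the tensor structure of the image is never vectorized away.

class BilinearLayerPair:
    def __init__(self, in_shape, out_shape, rng):
        self.U = 0.01 * rng.standard_normal((in_shape[0], out_shape[0]))
        self.V = 0.01 * rng.standard_normal((in_shape[1], out_shape[1]))

    def project(self, X):
        # Propagate a second-order plane to the next layer's plane.
        return self.U.T @ X @ self.V

rng = np.random.default_rng(0)
shapes = [(28, 28), (24, 24), (21, 21), (19, 20)]   # input plane + hidden planes
pairs = [BilinearLayerPair(a, b, rng) for a, b in zip(shapes, shapes[1:])]

H = rng.standard_normal(shapes[0])   # stand-in for an input image plane
for p in pairs:
    H = p.project(H)                 # structure preserved layer by layer
print(H.shape)                       # (19, 20) -> fed to the label layer
```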
Three-stage Learning • Bilinear discriminant initialization • Human: early peak related to the activation of “initial guess” • Deep: determine the initial parameters and sizes of the upper layer • Greedy layer-wise reconstruction • Human: information propagation between adjacent layers • Deep: determine the parameters of each pair of layers • Global fine-tuning • Human: late peak related to the activation of “post-recognition” • Deep: refine the parameter space for better classification performance
Bilinear Discriminant Initialization • Latent representation with projection matrices U and V • Preserve discriminant information in the projected feature space by optimizing an objective function that balances between-class weights against within-class weights • Obtain the discriminant initial connections in each layer pair and utilize the optimal dimension to define the structure of the next layer
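The exact formula is not reproduced on this slide; as a hedged reconstruction, a 2D-LDA-style bilinear discriminant criterion of the following form would trade between-class scatter against within-class scatter in the projected space (the paper's formulation may differ; here $X_i$ is an image plane, $\bar{X}_c$ a class mean, $\bar{X}$ the global mean, and $n_c$ a class size).

```latex
% Assumed 2D-LDA-style form of the bilinear discriminant criterion,
% not the slide's exact objective.
\begin{equation*}
(U^{*}, V^{*}) \;=\; \arg\max_{U,V}\;
\frac{\sum_{c} n_c \,\bigl\| U^{\top}(\bar{X}_c - \bar{X})\,V \bigr\|_F^{2}}
     {\sum_{c} \sum_{X_i \in c} \bigl\| U^{\top}(X_i - \bar{X}_c)\,V \bigr\|_F^{2}}
\end{equation*}
```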
Greedy Layer-Wise Reconstruction • A joint configuration (v, h) of the input layer v and the first hidden layer h has an energy E(v, h) • Utilize the Contrastive Divergence algorithm to update the parameter space
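For reference, below is a minimal sketch of one Contrastive Divergence (CD-1) update for a standard binary restricted Boltzmann machine with energy E(v, h) = -b·v - c·h - v·W·h. This is the classical form only; the bilinear variant in the paper factors the weights and is not shown here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b, c, lr=0.1, rng=np.random.default_rng(0)):
    """One CD-1 step for a binary RBM: v0 is a visible vector,
    W the visible-hidden weights, b/c the visible/hidden biases."""
    # Positive phase: sample hidden units given the data.
    p_h0 = sigmoid(v0 @ W + c)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # Negative phase: one step of Gibbs sampling (reconstruction).
    p_v1 = sigmoid(h0 @ W.T + b)
    p_h1 = sigmoid(p_v1 @ W + c)
    # Update parameters from the difference of correlations.
    W += lr * (np.outer(v0, p_h0) - np.outer(p_v1, p_h1))
    b += lr * (v0 - p_v1)
    c += lr * (p_h0 - p_h1)
    return W, b, c
```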
Global Fine-Tuning • Backpropagation adjusts the entire deep network to find good local optimum parameters • Before backpropagation, a good region of the whole parameter space has already been found • The convergence of backpropagation learning is therefore not slow • The result generally converges to a good local minimum on the error surface
Proposed Algorithm • Bilinear Discriminant Initialization → Greedy Layer-Wise Reconstruction → Global Fine-Tuning
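As a high-level illustration of the three-stage flow, here is a small runnable toy in which every stage is a deliberately simplified stand-in: an SVD-based projection instead of the bilinear discriminant criterion, the reconstruction stage omitted (see the CD-1 sketch above), and a softmax readout for fine-tuning. It shows only the control flow, not the paper's procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
n, shape, n_classes = 300, (16, 16), 3
y = rng.integers(0, n_classes, n)
X = rng.standard_normal((n, *shape)) + 2.0 * y[:, None, None]   # class-shifted 2-D planes

# Stage 1: "initial guess" -- choose projection matrices U, V and the hidden size.
mean_plane = X.mean(axis=0)
U = np.linalg.svd(mean_plane)[0][:, :8]        # 16 x 8
V = np.linalg.svd(mean_plane)[2][:8].T         # 16 x 8

# Stage 2 (omitted here): greedy layer-wise reconstruction would refine U and V.

# Stage 3: global fine-tuning -- gradient descent on a softmax readout.
H = np.stack([(U.T @ x @ V).ravel() for x in X])   # n x 64 projected features
W = np.zeros((H.shape[1], n_classes))
Y = np.eye(n_classes)[y]
for _ in range(200):
    Z = H @ W
    P = np.exp(Z - Z.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)
    W -= (0.01 / n) * H.T @ (P - Y)                # cross-entropy gradient step

print("training accuracy:", ((H @ W).argmax(axis=1) == y).mean())
```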
Outline • Problem • Image classification, a well-known challenge • Idea • Provide human-like judgment by referencing the human visual system • Methodology • Deep models, consistent with the human visual cortex • Proposed technique • Bilinear deep learning • Experiments • Caltech101, Urban and Natural Scene, CMU PIE • Conclusion • Explore deep techniques for more multimedia analysis tasks
Experiment Setting • Database • Subset of Caltech101 • Standard dataset for image classification with images of 101 object categories • Frequently used subset including 2,935 images from the first 5 categories • Urban and Natural Scene • 2,688 natural color images in 8 categories • CMU PIE dataset • 11,560 face images of 68 subjects with varying pose, illumination, and expression • Compared algorithms • K-nearest neighbor (KNN) • Support vector machines (SVM) • Transductive SVM (TSVM) [Collobert et al., JMLR, 2006] • Neural network (NN) • EmbedNN [Weston et al., ICML, 2008] • Semi-DBN [Bengio et al., NIPS, 2006] • DBN-rNCA [Salakhutdinov et al., AISTATS, 2007] • DDBN [Liu et al., PR, 2011] • DCNN [Jarrett et al., ICCV, 2009]
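For the shallow baselines, a setup along the following lines could be used on vectorized images; the slides do not specify implementations or hyperparameters, so the scikit-learn classifiers and their default settings here are assumptions.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def run_shallow_baselines(X_train, y_train, X_test, y_test):
    """Fit KNN and SVM on flattened images and report test accuracy."""
    results = {}
    for name, clf in [("KNN", KNeighborsClassifier(n_neighbors=5)),
                      ("SVM", SVC(kernel="rbf", C=1.0))]:
        clf.fit(X_train.reshape(len(X_train), -1), y_train)
        results[name] = clf.score(X_test.reshape(len(X_test), -1), y_test)
    return results
```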
Experiments on Caltech101 • Sample images from the dataset: Faces_easy, Faces, Motorbikes, Airplanes, Back_google • Two experiments • Classification accuracy comparison • Converging time comparison • Experiment setting • 50 images for each category to form the test set • The rest to form the training set
Classification Accuracy Comparison on Caltech101 • Deep techniques achieve much better performance than shallow techniques • The bilinear deep belief network performs best across different numbers of labeled data
Converging Time Comparison on Caltech101 • Iterations in the fine-tuning stage • BDBN converges much more quickly because of a better "initial guess"
Experiments on Urban and Natural Scene • Sample images from the dataset: Forest, Street, Mountain, City Center, Coast & beach, Highway, Open country, Tall building • Two experiments • Classification accuracy and real running time comparison • Limitation discussion • Experiment setting • 50 images for each category to form the test set • The rest to form the training set
Performance Comparison on Urban and Natural Scene • Classical setting of the number of neurons in hidden layers • 500, 500, and 2000 • BDBN setting by bilinear discriminant initialization • 24*24, 21*21, 19*20 • For compared models • "_d" means the same sizes as the BDBN setting • "_c" means the same sizes as the classical setting
Limitation of Image Classification Based Only on Visual Similarity • Calculating visual similarity alone is limited • Humans can give the correct judgment by referencing the buildings and cars along the street • Contextual cueing • The knowledge about spatial invariants learned from past experiences • Example: an image with ground-truth label "Street" is misclassified as "Highway"
Experiments on CMU PIE • Sample images from the dataset • Two experiments • Classification accuracy comparison on noisy data • Parameter space visualization • Experiment setting • 50 images for each category to form the test set • The rest to form the training set
Robustness to Noise on CMU PIE • The reconstruction of BDBN at every layer
Visualization of Parameter Space on CMU PIE • The emphasized regions coincide with the facial feature regions (facial feature points)
Outline • Problem • Image classification, a well-known challenge • Idea • Provide human-like judgment by referencing the human visual system • Methodology • Deep models, consistent with the human visual cortex • Proposed technique • Bilinear deep learning • Experiments • Caltech101, Urban and Natural Scene, CMU PIE • Conclusion • Explore deep techniques for more multimedia analysis tasks
Attractive Characteristics of the Proposed BDBN • Novel deep architecture • Simulate the multi-layer physical structure of the visual cortex • Preserve the natural tensor structure of the input image during information propagation • Three-stage learning • Realize the procedure of object recognition by human beings • Robust to noise • Bilinear discriminant initialization • Prevent information propagation from falling into a bad local optimum • Provide a more meaningful setting for the deep architecture • Much lower training time • Semi-supervised learning ability • Reduce the training requirements
Conclusion and Future Work • Referencing the human visual system and its perception procedure is promising • Deep learning is a good model for visual data analysis • The bilinear deep belief network shows impressive performance in image classification • Future work • Provide more semantic understanding of images by integrating contextual cueing into deep modeling • Utilize deep learning for multimedia content analysis on large-scale datasets with noisy tags • More information • www.comp.polyu.edu.hk/~csshzhong • Paper, presentation slides, demo, source code, and datasets
Q & A Thanks! • Shenghua ZHONG, Yan LIU, Yang LIU • Department of Computing • The Hong Kong Polytechnic University • www.comp.polyu.edu.hk/~csshzhong