*A presentation for the article: 'M. D. Zeiler and R. Fergus, Visualizing and Understanding Convolutional Networks, ECCV, 2014.'
'Visualizing and Understanding Convolutional Networks'
Reporter: Xi Mo, Ph.D. Candidate, ITTC, University of Kansas
Date: Feb 13, 2017
Part 1: What will be expected in this article?
1.1 ImageNet
• About ImageNet (http://www.image-net.org/) - It is the largest image database in the world, established by computer scientists at Stanford University as a trial simulation of the human visual perception and recognition system. Scholars worldwide upload their classification and recognition algorithms to compete for the best results.
• 'Please click on lizards in the pictures', 'Please click on radios in the pictures', 'Please click on cotton swabs and chalks in the pictures' - three examples of the login verification code on http://www.12306.com, the only official website of the Chinese Ministry of Railways for selling train tickets, which receives poor ratings and complaints all the time.
(Figure: ImageNet 2015 ranking)
Part 1: What will be expected in this article?
1.2 Recognition using Artificial Neural Networks
• A video of how a simple 4-layer neural network works is given by Geoffrey Hinton, professor at the University of Toronto, at https://www.coursera.org/course/neuralnets. The structure of a Convolutional Neural Network (CNN) is shown as well: Input Layer -> Conv Layer (6 maps) -> Pooling Layer (6 maps) -> Conv Layer (12 maps) -> Pooling Layer (12 maps) -> fully connected -> Output Layer. This is the CNN of the DeepLearn Toolbox in MATLAB with a 5x5 convolution kernel; a code sketch follows below.
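For readers who want to reproduce the diagram, here is a minimal sketch of that DeepLearn-Toolbox-style CNN, written in PyTorch (my choice of library). The 28x28 grayscale input and the 10-class output are assumptions chosen so the 5x5-kernel and 2x2-pooling arithmetic works out; the slide does not state them.

```python
import torch
import torch.nn as nn

# A minimal sketch of the small CNN described above. Input size (28x28,
# 1 channel) and class count (10) are assumptions, not from the slide.
model = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5),   # input -> 6 feature maps (24x24)
    nn.AvgPool2d(2),                  # subsample to 12x12
    nn.Conv2d(6, 12, kernel_size=5),  # 6 -> 12 feature maps (8x8)
    nn.AvgPool2d(2),                  # subsample to 4x4
    nn.Flatten(),                     # 12 * 4 * 4 = 192 features
    nn.Linear(192, 10),               # fully connected output layer
)

x = torch.randn(1, 1, 28, 28)         # one dummy grayscale image
print(model(x).shape)                 # torch.Size([1, 10])
```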
Part 1: What will be expected in this article?
1.3 Main Question and Structure
• The article mainly focuses on the solution to the problem: 'there is no clear understanding of why they (Large Convolutional Network models) perform so well, or how they might be improved'.
• The main idea of the article is shown below.
The Authors' Main Contribution - A Tool
• A novel way to visualize the activity within the model reveals the features to be far from random, uninterpretable patterns.
No.1 Theory Verification
• Start with the architecture of [1] and explore different architectures, discovering ones that outperform the results of [1] on ImageNet.
What Else May the Theory Perform
• Debug problems with the model to obtain better results.
No.2 Theory Verification
• The improvement on [1] yields an impressive ImageNet 2012 result, which shows how the trained model can generalize well to other datasets.
Part 2: The Authors' Main Contribution - A Tool
2.1 Deconvolutional Network (Deconvnet) [2]
• The novel way to understand the operation of a convolutional network (convnet): a deconvnet maps feature activity in intermediate layers back to the input pixel space (the input image), showing which input pattern originally caused a given activation in the feature maps. Here the deconvnet is not used for unsupervised learning; it serves only as a probe of an already trained convnet.
• To examine an already trained convnet, a deconvnet is attached to each of its layers, as illustrated below, providing a continuous path back to image pixels. To start, an input image is presented to the convnet and features are computed throughout the layers. To examine a given convnet activation, all other activations in the layer are set to zero and the feature maps are passed as input to the attached deconvnet layer.
• Unpooling: Although the max pooling operation is non-invertible, an approximate inverse can be obtained by recording the locations of the maxima within each pooling region in a set of switch variables.
• Rectification: The convnet uses relu non-linearities (relu(x) = max(x, 0)), which rectify the feature maps, ensuring the feature maps are always positive.
• Filtering: The convnet uses learned filters to convolve the feature maps from the previous layer. To invert this, the deconvnet uses transposed versions of the same filters, applied to the rectified maps rather than to the output of the layer beneath.
(Figure: procedure of using the deconvnet to examine a convnet - a convnet layer paired with its attached deconvnet layer. A sketch of the three inverse operations follows below.)
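To make the three inverse operations concrete, here is a minimal sketch for a single convnet/deconvnet layer pair, written in PyTorch (my choice of library; the filter and image shapes are assumptions loosely modeled on the paper's first layer). The return_indices option plays the role of the switch variables.

```python
import torch
import torch.nn.functional as F

filters = torch.randn(96, 3, 7, 7)   # learned 1st-layer filters (assumed)
image = torch.randn(1, 3, 224, 224)  # dummy input image

# Forward convnet layer: filter -> relu -> max pool, recording the
# "switch" locations of each maximum so pooling can be approximately undone.
feat = F.relu(F.conv2d(image, filters, stride=2, padding=3))
pooled, switches = F.max_pool2d(feat, 3, stride=2, return_indices=True)

# Backward deconvnet layer: unpool with the recorded switches, rectify,
# then filter with the transposed version of the same learned filters.
unpooled = F.max_unpool2d(pooled, switches, 3, stride=2,
                          output_size=feat.shape[-2:])
rectified = F.relu(unpooled)
reconstruction = F.conv_transpose2d(rectified, filters, stride=2,
                                    padding=3, output_padding=1)
print(reconstruction.shape)          # torch.Size([1, 3, 224, 224])
```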
Part 2: The Authors' Main Contribution - A Tool
2.2 8-Layer Convolutional Neural Network Model [1][6][7]
• The model is a standard fully supervised convnet, used by the authors throughout the article and in many experiments.
• A 224 by 224 crop of an image (with 3 color planes) is presented as the input. This is convolved with 96 different 1st-layer filters (red), each of size 7 by 7, using a stride of 2 in both x and y. The resulting feature maps are then:
a) Passed through a rectified linear function (not shown);
b) Pooled (max within 3x3 regions, using stride 2); and
c) Contrast normalized across feature maps to give 96 different 55 by 55 element feature maps. Similar operations are repeated in layers 2, 3, 4, 5.
d) The last two layers are fully connected, taking features from the top convolutional layer as input in vector form (6 · 6 · 256 = 9216 dimensions).
e) The final layer is a C-way softmax function, C being the number of classes. All filters and feature maps are square in shape.
• Each layer consists of:
a) Convolution of the previous layer output (or, in the case of the 1st layer, the input image) with a set of learned filters;
b) Passing the responses through a rectified linear function;
c) [optionally] Max pooling over local neighborhoods; and
d) [optionally] A local contrast operation that normalizes the responses across feature maps.
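The description above pins down the layer shapes, so the whole model can be sketched directly. The following PyTorch sketch is an assumption-laden reconstruction: the padding values and the layer 2-5 filter counts are my choices, picked so that the stated intermediate sizes (96 maps of 55x55 after layer 1, and 6 · 6 · 256 = 9216 features at the top) come out exactly.

```python
import torch.nn as nn

def zfnet(num_classes):
    """Sketch of the 8-layer model; num_classes is C, the softmax width."""
    return nn.Sequential(
        nn.Conv2d(3, 96, 7, stride=2, padding=1), nn.ReLU(),  # layer 1: 110x110
        nn.MaxPool2d(3, stride=2, padding=1),                 # -> 96 x 55 x 55
        nn.LocalResponseNorm(5),                              # contrast normalization
        nn.Conv2d(96, 256, 5, stride=2), nn.ReLU(),           # layer 2: 26x26
        nn.MaxPool2d(3, stride=2, padding=1),                 # -> 256 x 13 x 13
        nn.LocalResponseNorm(5),
        nn.Conv2d(256, 384, 3, padding=1), nn.ReLU(),         # layer 3
        nn.Conv2d(384, 384, 3, padding=1), nn.ReLU(),         # layer 4
        nn.Conv2d(384, 256, 3, padding=1), nn.ReLU(),         # layer 5
        nn.MaxPool2d(3, stride=2),                            # -> 256 x 6 x 6
        nn.Flatten(),                                         # 6 * 6 * 256 = 9216
        nn.Linear(9216, 4096), nn.ReLU(),                     # layer 6 (fully connected)
        nn.Linear(4096, 4096), nn.ReLU(),                     # layer 7 (fully connected)
        nn.Linear(4096, num_classes),                         # layer 8: C-way logits;
    )                                                         # softmax applied by the loss
```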
Part 3: No.1 Theory Verification
3.1 Training Details
• Here are some (not all) selected training details of the convnet model that are necessary for illustrating its performance:
a) In [1] the model was split across 2 GPUs, which required sparse connections in layers 3, 4 and 5; the structure of the network is modified here by replacing them with dense connections.
b) The model was trained on the ImageNet 2012 training set (1.3 million images, spread over 1000 different classes).
c) Each RGB image was preprocessed (resizing the smallest dimension, cropping the center region, subtracting the per-pixel mean, etc.); a sketch follows below.
d) The weights of the network were adjusted by stochastic gradient descent, with the learning rate annealed manually when the validation error plateaued; all weights were initialized to 0.01 and biases set to 0.
e) Training stopped after 70 epochs, which took around 12 days on a single GTX580 GPU, using an implementation based on [1].
http://image-net.org/challenges/LSVRC/2012/analysis/
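Here is a minimal sketch of the preprocessing in c), assuming the 256-pixel resize and center crop used in the paper; per_pixel_mean is a precomputed (256, 256, 3) array averaged over the whole training set, assumed to be available.

```python
import numpy as np
from PIL import Image

def preprocess(path, per_pixel_mean):
    """Resize smallest side to 256, center-crop 256x256, subtract mean."""
    img = Image.open(path).convert("RGB")
    w, h = img.size
    scale = 256 / min(w, h)                     # resize smallest dimension
    img = img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)
    w, h = img.size
    left, top = (w - 256) // 2, (h - 256) // 2  # center 256x256 region
    img = img.crop((left, top, left + 256, top + 256))
    return np.asarray(img, dtype=np.float32) - per_pixel_mean
```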
Part 3: No.1 Theory Verification
3.2 Feature Visualization
• Instead of showing the single strongest activation for a given feature map, the article shows the top 9 activations. Projecting each separately down to pixel space reveals the different structures that excite a given feature map, hence showing its invariance to input deformations (a sketch of the projection step follows below).
• Alongside these visualizations are the corresponding image patches.
• The following can be seen from the results, a) to e); further visual feature analyses, such as feature evolution during training and feature invariance, are not listed here.
a) The projections from each layer show the hierarchical nature of the features in the network.
b) Layer 2 responds to corners and other edge/color conjunctions.
c) Layer 3 has more complex invariances, capturing similar textures (e.g. mesh patterns).
d) Layer 4 shows significant variation, but is more class-specific.
e) Layer 5 shows entire objects with significant pose variation.
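The projection step can be sketched as follows: keep only the single strongest activation in the examined feature map, zero everything else, and hand the result to the deconvnet. Both feature_maps and the deconvnet callable are assumed to come from the machinery sketched in part 2.1; repeating this over the 9 highest-scoring validation images yields the 3x3 grids shown in the paper.

```python
import torch

def project_strongest(feature_maps, channel, deconvnet):
    """Zero all activations except the examined map's maximum, then
    project that lone activation back down to input pixel space."""
    probe = torch.zeros_like(feature_maps)
    fmap = feature_maps[0, channel]          # the feature map being examined
    idx = torch.argmax(fmap).item()          # flattened index of its maximum
    r, c = divmod(idx, fmap.shape[1])
    probe[0, channel, r, c] = fmap[r, c]     # keep only that one activation
    return deconvnet(probe)                  # assumed deconvnet from part 2.1
```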
Part 4: What Else May the Theory Perform?
Debugging
• By visualizing the first and second layers of the architecture of [1], various problems become apparent:
a) The first-layer filters are a mix of extremely high and low frequency information, with little coverage of the mid frequencies.
b) The 2nd-layer visualization shows aliasing artifacts caused by the large stride of 4 used in the 1st-layer convolutions.
• To remedy a) and b) above, the authors:
For a), reduced the 1st-layer filter size from 11x11 to 7x7.
For b), made the stride of the convolution 2, rather than 4.
Improvement
• Two other key points the article discusses are:
a) Occlusion sensitivity: is the model truly identifying the location of the object in the image, or just using the surrounding context? (A sketch of this experiment follows below.)
b) Correspondence analysis: deep models differ from many existing recognition approaches in that there is no explicit mechanism for establishing correspondence between specific object parts in different images; an intriguing possibility, however, is that deep models might be computing them implicitly.
(Figures: 1st-layer features from [1] vs. rectified features; 1st-layer features without feature scale clipping; 2nd-layer features from [1] vs. rectified features.)
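A minimal sketch of the occlusion-sensitivity experiment in a): slide a grey square across the input and record the correct-class probability at each position. The patch size, stride, and grey value here are assumptions, and model stands for any trained classifier returning logits.

```python
import torch

def occlusion_map(model, image, true_class, patch=32, stride=8, grey=0.5):
    """Probability of the true class as a grey patch sweeps the image."""
    _, _, H, W = image.shape
    rows = (H - patch) // stride + 1
    cols = (W - patch) // stride + 1
    heat = torch.zeros(rows, cols)
    with torch.no_grad():
        for i in range(rows):
            for j in range(cols):
                occluded = image.clone()
                y, x = i * stride, j * stride
                occluded[:, :, y:y + patch, x:x + patch] = grey  # mask region
                probs = torch.softmax(model(occluded), dim=1)
                heat[i, j] = probs[0, true_class]
    return heat  # low values mark regions the classifier truly depends on
```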
Part 5: No.2 Theory Verification
• 'Using the exact architecture specified in [1], ..., we achieve an error rate within 0.1% of their reported value on the ImageNet 2012 validation set.'
• 'Next we analyze the performance of our model with the architectural changes': it 'significantly outperforms the architecture of [1], beating their single model result by 1.7% (test top-5). When we combine multiple models, we obtain a test error of 14.8%, the best published performance on this dataset.'
(Table: comparison with various model architectures of [1] on the ImageNet dataset.)
• For Caltech-101: 'We follow the procedure of [3] and randomly select 15 or 30 images per class for training and test on up to 50 images per class, reporting the average of the per-class accuracies, using 5 train/test folds. Training took 17 minutes for 30 images/class. The pre-trained model beats the best reported result for 30 images/class from [4] by 2.2%. The convnet model trained from scratch however does terribly, only achieving 46.5%.'
• For Caltech-256: 'Our ImageNet-pretrained model beats the current state-of-the-art results obtained by [4][5] by a significant margin: 74.2% vs 55.2% for 60 training images/class. However, as with Caltech-101, the model trained from scratch does poorly. ... With our pre-trained model, just 6 Caltech-256 training images are needed to beat the leading method using 10 times as many images. This shows the power of the ImageNet feature extractor.'
(Table: comparison with other methods on the Caltech-101 and Caltech-256 datasets.)
Part 6: Conclusion
• In this article, a dedicatedly designed Convolutional Neural Network, along with an inspection tool for its intermediate features (a Deconvolutional Neural Network), is introduced. This presentation has listed part of their performance in experiments on the famous ImageNet dataset and on other datasets such as Caltech-101 and Caltech-256, leaving aside the PASCAL 2012 dataset also mentioned in the article.
• Though this work from ECCV 2014 achieves great success in understanding the inner features of a sophisticated neural network, I have some personal opinions on the proposed structure. For instance, the proposed model is a supervised one; in my view, the unsupervised case still remains to be solved. Also, as a tool for discovering the features of hidden layers, the article offers little mathematical representation of the observed features.
• It is a pity that the article has no 'Future work' section. However, judging from part 4 of this presentation discussing 'Improvement', I strongly believe that inspecting deep-learning models will be a prosperous direction in the field of machine learning.
THANKS Thank you for your attention!
Reference For The Presentation
[1] Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[2] Zeiler, M., Taylor, G., and Fergus, R. Adaptive deconvolutional networks for mid and high level feature learning. In ICCV, 2011.
[3] Fei-Fei, L., Fergus, R., and Perona, P. One-shot learning of object categories. IEEE Trans. PAMI, 2006.
[4] Bo, L., Ren, X., and Fox, D. Multipath sparse coding using hierarchical matching pursuit. In CVPR, 2013.
[5] Griffin, G., Holub, A., and Perona, P. The Caltech 256. Caltech Technical Report, 2006.
[6] LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. Backpropagation applied to handwritten zip code recognition. Neural Comput., 1(4):541-551, 1989.
[7] Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. What is the best multi-stage architecture for object recognition? In ICCV, 2009.
Others
[8] Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. Greedy layer-wise training of deep networks. In NIPS, pp. 153-160, 2007.
[9] Berkes, P. and Wiskott, L. On the analysis and interpretation of inhomogeneous quadratic forms as receptive fields. Neural Computation, 2006.
[10] Ciresan, D. C., Meier, U., and Schmidhuber, J. Multi-column deep neural networks for image classification. In CVPR, 2012.
[11] Dalal, N. and Triggs, B. Histograms of oriented gradients for pedestrian detection. In CVPR, 2005.
[12] Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., and Darrell, T. DeCAF: A deep convolutional activation feature for generic visual recognition. arXiv:1310.1531, 2013.
[13] Erhan, D., Bengio, Y., Courville, A., and Vincent, P. Visualizing higher-layer features of a deep network. Technical report, University of Montreal, 2009.
[14] Gunji, N., Higuchi, T., Yasumoto, K., Muraoka, H., Ushiku, Y., Harada, T., and Kuniyoshi, Y. Classification entry. In ImageNet Competition, 2012.
[15] Hinton, G. E., Osindero, S., and Teh, Y. W. A fast learning algorithm for deep belief nets. Neural Computation, 18:1527-1554, 2006.
[16] Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. R. Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580, 2012.
[17] Sohn, K., Jung, D., Lee, H., and Hero III, A. Efficient learning of sparse, distributed, convolutional feature representations for object recognition. In ICCV, 2011.
[18] Torralba, A. and Efros, A. A. Unbiased look at dataset bias. In CVPR, 2011.
[19] Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P. A. Extracting and composing robust features with denoising autoencoders. In ICML, pp. 1096-1103, 2008.
[20] Yan, S., Dong, J., Chen, Q., Song, Z., Pan, Y., Xia, W., Huang, Z., Hua, Y., and Shen, S. Generalized hierarchical matching for sub-category aware object classification. In PASCAL VOC Classification Challenge 2012, 2012.
[21] Jianchao, Y., Kai, Y., Yihong, G., and Thomas, H. Linear spatial pyramid matching using sparse coding for image classification. In CVPR, 2009.
[22] Le, Q. V., Ngiam, J., Chen, Z., Chia, D., Koh, P., and Ng, A. Y. Tiled convolutional neural networks. In NIPS, 2010.