Understanding Convolutional Networks: A Comprehensive Overview

Goodfellow: Chapter 9, Convolutional Networks. Dr. Charles Tappert. The information here, although greatly condensed, comes almost entirely from the chapter content.

Chapter 9 Sections • Introduction • 1 The Convolution Operation • 2 Motivation • 3 Pooling • 4 Convolution and Pooling as an Infinitely Strong Prior • 5 Variants of the Basic Convolution Function • 6 Structured Outputs • 7 Data Types • 8 Efficient Convolution Algorithms • 9 Random or Unsupervised Features • 10 The Neuroscientific Basis for Convolutional Networks • 11 Convolutional Networks and the History of Deep Learning

Introduction • Convolutional networks are also known as convolutional neural networks (CNNs) • Specialized for data having grid-like topology • 1D grid – time series data • 2D grid – image data • Definition • Convolutional networks use convolution in place of general matrix multiplication in at least one layer • Neural network convolution usually does not correspond precisely to the convolution used in engineering and mathematics

1. The Convolution Operation • Convolution is an operation on two functions • Section begins with general convolution example • Signal smoothing in locating spaceship with a laser sensor • CNN convolutions (not general convolution) • First function is network input x, second is kernel w • Tensors are multidimensional arrays • E.g., input data and parameter arrays, thus TensorFlow • Viewed as a matrix multiplication, convolution usually corresponds to a very sparse matrix, in contrast to the usual fully-connected weight matrix

2D Convolution (Figure 9.1, Goodfellow 2016)
Input (3 × 4):
a b c d
e f g h
i j k l
Kernel (2 × 2):
w x
y z
Output (2 × 3):
aw + bx + ey + fz | bw + cx + fy + gz | cw + dx + gy + hz
ew + fx + iy + jz | fw + gx + jy + kz | gw + hx + ky + lz
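The output pattern in the figure (each output is the kernel laid over a 2 × 2 patch of the input, with no kernel flipping) can be checked numerically. The following is a minimal sketch, assuming NumPy; the function name `cross_correlate_2d` and the numeric values substituted for a..l and w..z are illustrative choices, not part of the slides.

```python
import numpy as np

def cross_correlate_2d(image, kernel):
    """Slide the kernel over the image without flipping it ("valid" positions only)."""
    image = np.asarray(image, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Elementwise product of the kernel with the patch under it, then sum.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Hypothetical values: a..l = 1..12 and (w, x, y, z) = (1, 2, 3, 4).
image = np.arange(1, 13).reshape(3, 4)
kernel = np.array([[1, 2], [3, 4]])
output = cross_correlate_2d(image, kernel)
print(output)  # top-left entry is a*w + b*x + e*y + f*z = 1 + 4 + 15 + 24 = 44
```

A 3 × 4 input and a 2 × 2 kernel give a 2 × 3 output, matching the six expressions in the figure.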

2. Motivation • Convolution leverages three important ideas that help improve machine learning systems • Sparse interactions • Parameter sharing • Equivariant representations

2. Motivation: Sparse Interactions • Fully connected traditional networks • m inputs in a layer and n outputs in next layer • requires O(m × n) runtime (per example) • Sparse interactions • Also called sparse connectivity or sparse weights • Accomplished by making kernel smaller than input • k << m requires O(k × n) runtime (per example) • k is typically several orders of magnitude smaller than m

Sparse Connectivity, viewed from below • Inputs x1 to x5, outputs s1 to s5 • Sparse connections due to small convolution kernel • Dense connections when fully connected (Goodfellow 2016)

Sparse Connectivity, viewed from above (receptive fields) • Inputs x1 to x5, outputs s1 to s5 • Sparse connections due to small convolution kernel • Dense connections when fully connected (Goodfellow 2016)

Growing Receptive Fields • Layers from bottom to top: x1 to x5, h1 to h5, g1 to g5 • Deeper layer units have larger receptive fields (Goodfellow 2016)
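How quickly the receptive field grows with depth can be worked out with a short calculation. This is a sketch under stated assumptions: the helper name `receptive_field` is hypothetical, and the kernel width of 3 is an illustrative choice rather than something stated on the slide.

```python
def receptive_field(kernel_widths, strides=None):
    """Number of input units that can influence one unit in the top layer.

    kernel_widths[l] and strides[l] describe layer l, listed bottom to top.
    """
    if strides is None:
        strides = [1] * len(kernel_widths)
    size = 1   # a unit sees itself
    jump = 1   # spacing, in input units, between neighboring units of the current layer
    for k, s in zip(kernel_widths, strides):
        size += (k - 1) * jump
        jump *= s
    return size

one_layer = receptive_field([3])                     # 3 inputs
two_layers = receptive_field([3, 3])                 # 5 inputs
strided = receptive_field([3, 3], strides=[2, 1])    # 7 inputs
print(one_layer, two_layers, strided)
```

With width-3 kernels and stride 1, each extra layer adds two inputs to the receptive field, so a unit two layers up is indirectly connected to five inputs. Striding makes the field grow faster.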

2. Motivation: Parameter Sharing • In traditional neural networks • Each element of the weight matrix is unique • Parameter sharing means using the same parameters for more than one function in the model • The network has tied weights • Reduces storage requirements to k parameters • Does not change the forward-propagation runtime, which remains O(k × n)

Parameter Sharing • Black arrows = connections that use a particular parameter • Convolution shares the same parameters across all spatial locations • Traditional matrix multiplication does not share any parameters • Inputs x1 to x5, outputs s1 to s5 (Goodfellow 2016)

Edge Detection by Convolution • Right image = each original pixel minus its left neighbor, which detects edges • Kernel: [-1, 1], so k = 2 • Panels: input (left), output (right) (Goodfellow 2016)
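The pixel-minus-left-neighbor operation can be tried on a toy image. This is a minimal sketch assuming NumPy; the 2 × 6 image with a single brightness step is made up for illustration.

```python
import numpy as np

# Hypothetical image: dark on the left, bright on the right.
image = np.array([[0, 0, 0, 9, 9, 9],
                  [0, 0, 0, 9, 9, 9]], dtype=float)

# Each pixel minus its left neighbor; the output is one column narrower.
edges = image[:, 1:] - image[:, :-1]

# The same result, written as sliding the kernel [-1, 1] along every row.
kernel = np.array([-1.0, 1.0])
edges_via_kernel = np.array([np.correlate(row, kernel, mode="valid") for row in image])

print(edges)  # nonzero only where the brightness changes
```

The response is zero in flat regions and large at the step, which is why this two-element kernel acts as an edge detector.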

Efficiency of Convolution • Input size: 320 by 280 • Kernel size: 2 by 1 • Output size: 319 by 280 (Goodfellow 2016)
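The sizes on this slide make the cost gap between convolution and a dense matrix multiplication easy to compute. The sketch below uses only the slide's dimensions; counting three floating-point operations per output pixel (two multiplications and one addition) is an assumption about how the cost is tallied.

```python
width, height = 320, 280          # input size from the slide
kernel_elems = 2                  # kernel is 2 by 1

out_width = width - kernel_elems + 1   # 319
out_height = height                    # 280
num_outputs = out_width * out_height

# Convolution: two multiplications and one addition per output pixel (assumed tally).
conv_ops = num_outputs * 3

# The same linear map as a dense matrix: one entry per (input pixel, output pixel) pair.
dense_entries = (width * height) * num_outputs

# The same map as a sparse matrix: only kernel_elems nonzero entries per output.
sparse_entries = kernel_elems * num_outputs

print(conv_ops, dense_entries, sparse_entries)
```

The dense matrix needs over eight billion entries, while the convolution stores two parameters and performs a few hundred thousand operations.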

2. Motivation: Equivariant Representations • For an equivariant function, if the input changes, the output changes in the same way • For convolution, a particular form of parameter sharing causes equivariance to translation • For example, as the dog moves in the input image, the detected edges move in the same way • In image processing, detecting edges is useful in the first layer, and edges appear more or less everywhere in the image
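Translation equivariance can be verified directly: shifting the input and then convolving gives the same result as convolving and then shifting. This is a minimal 1D sketch assuming NumPy; the signal values are made up, and the zeros at both ends keep the shift from wrapping real content around.

```python
import numpy as np

kernel = np.array([-1.0, 1.0])
x = np.array([0.0, 0.0, 1.0, 3.0, 2.0, 0.0, 0.0, 0.0])

# np.roll is a true right shift here because the last element of x is zero.
x_shifted = np.roll(x, 1)

y = np.correlate(x, kernel, mode="valid")
y_of_shifted = np.correlate(x_shifted, kernel, mode="valid")

# Equivariance: the output of the shifted input is the shifted output.
print(y)
print(y_of_shifted)
```

The second printed array is the first one moved one position to the right.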

3. Pooling • The pooling function replaces the output of the net at a certain location with a summary statistic of the nearby outputs • Max pooling reports the maximum output within a rectangular neighborhood • Average pooling reports the average output • Pooling helps make the representation approximately invariant to small input translations

Convolutional Network Components • Two common terminologies: complex layer terminology and simple layer terminology • Labels recovered from the figure: convolution layer (affine transform), pooling layer, next layer (Goodfellow 2016)

Max Pooling and Invariance to Translation
Pooling stage: 1. 1. 1. 0.2
Detector stage: 0.1 1. 0.2 0.1
Same network with input shifted one pixel to right; little change in pooling stage:
Pooling stage: 0.3 1. 1. 1.
Detector stage: 0.3 0.1 1. 0.2
(Goodfellow 2016)
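The detector values on this slide can be run through a small max-pooling routine to see the approximate invariance. This is a sketch assuming NumPy, a pooling width of three, and a stride of one; the function name `max_pool_1d` is hypothetical, and clipping the window at the array edges stands in for the neighbors the figure elides with "...".

```python
import numpy as np

def max_pool_1d(detector, width=3):
    """Max over a centered window at every position (stride 1), clipped at the edges."""
    detector = np.asarray(detector, dtype=float)
    half = width // 2
    return np.array([detector[max(0, i - half): i + half + 1].max()
                     for i in range(len(detector))])

original = np.array([0.1, 1.0, 0.2, 0.1])
shifted = np.array([0.3, 0.1, 1.0, 0.2])   # same pattern moved one pixel to the right

pooled_original = max_pool_1d(original)
pooled_shifted = max_pool_1d(shifted)

detector_changed = int(np.sum(original != shifted))
pooled_changed = int(np.sum(pooled_original != pooled_shifted))
print(pooled_original, pooled_shifted, detector_changed, pooled_changed)
```

Every detector value changes after the shift, but only half of the pooled values do, because each pooling unit only reports the maximum in its neighborhood and not its exact location.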

Cross-Channel Pooling and Invariance to Learned Transformations • A large response in detector unit 1 gives a large response in the pooling unit • A large response in detector unit 3 gives a large response in the pooling unit • Basically handles rotations • Cross-channel pooling is max pooling over different feature maps (Goodfellow 2016)

Pooling with Downsampling • Pooled output: 1. 0.2 0.1 • Detector input: 0.1 1. 0.2 0.1 0.0 0.1 • Max pooling gives a downsized input to the next layer (Goodfellow 2016)
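The numbers on this slide are consistent with max pooling over windows of width three taken every two positions, with a shorter final window. The sketch below reproduces them under that assumption; the function name `max_pool_strided` and the width and stride values are illustrative, since the slide text does not state them.

```python
import math
import numpy as np

def max_pool_strided(detector, width=3, stride=2):
    """Max pooling with a stride; the last window may be shorter than `width`."""
    detector = np.asarray(detector, dtype=float)
    n = len(detector)
    # Enough windows that every detector unit is covered at least once.
    num_windows = max(0, math.ceil((n - width) / stride)) + 1
    return np.array([detector[i * stride: i * stride + width].max()
                     for i in range(num_windows)])

detector = [0.1, 1.0, 0.2, 0.1, 0.0, 0.1]
pooled = max_pool_strided(detector)
print(pooled)  # six detector units summarized by three pooling units
```

Halving the number of units this way reduces the computation and memory needed by the next layer.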

Example Classification Architectures • Each of the example architectures ends with: output of softmax, 1,000 class probabilities Figure 9.11 (Goodfellow 2016)

4. Convolution and Pooling as an Infinitely Strong Prior • Prior probabilities (beliefs before we see actual data) can be strong or weak • A weak prior (e.g., Gaussian distribution with high variance) allows the data to move the parameters • A strong prior (e.g., Gaussian distribution with low variance) strongly determines the parameters • An infinitely strong prior forbids some parameter values entirely, regardless of the data • A convolutional net is like a fully connected net with an infinitely strong prior over its weights • Weights are zero except in small receptive fields • Weights identical for neighboring hidden units
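The two constraints in the last bullets (zeros outside a small receptive field, identical weights for neighboring units) become visible when a convolution is written out as the weight matrix of a fully connected layer. This is a minimal 1D sketch assuming NumPy; `conv_as_matrix` is a hypothetical helper and the kernel and input values are made up.

```python
import numpy as np

def conv_as_matrix(kernel, input_len):
    """Dense weight matrix equivalent to sliding `kernel` over an input of length input_len."""
    kernel = np.asarray(kernel, dtype=float)
    k = len(kernel)
    out_len = input_len - k + 1
    W = np.zeros((out_len, input_len))
    for i in range(out_len):
        # Row i holds the same kernel values, shifted by one position per row.
        W[i, i:i + k] = kernel
    return W

kernel = np.array([-1.0, 1.0])
x = np.array([3.0, 1.0, 4.0, 1.0, 5.0])

W = conv_as_matrix(kernel, len(x))
via_matrix = W @ x
via_sliding = np.correlate(x, kernel, mode="valid")
print(W)
```

Every row of `W` contains the same two values and is zero elsewhere. A fully connected layer could learn any 4 × 5 matrix; the convolutional prior rules out all matrices that do not have this structure.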

4. Convolution and Pooling as an Infinitely Strong Prior • Convolution and pooling can cause underfitting • The prior is useful only when the assumptions made by the prior are reasonably accurate • If a task relies on preserving precise spatial information, then pooling can increase training error • The prior imposed by convolution must be appropriate

5. Variants of the Basic Convolution Function • Stride is the amount of downsampling • Can have separate strides in different directions • Zero padding avoids layer-to-layer shrinking • Unshared convolution • Like convolution but without sharing • Partial connectivity between channels • Tiled convolution • Cycle between shared parameter groups

Convolution with Stride • Strided convolution with a stride of two: inputs x1 to x5, outputs s1 to s3 • Equivalent to convolution followed by downsampling, but the latter is computationally wasteful (Goodfellow 2016)
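The equivalence between strided convolution and convolution followed by downsampling can be checked on a small example. This is a sketch assuming NumPy and no zero padding; the function name `strided_correlate` and the input values are illustrative.

```python
import numpy as np

def strided_correlate(x, w, stride):
    """Evaluate the kernel only at every `stride`-th valid position."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    k = len(w)
    return np.array([np.dot(x[i:i + k], w)
                     for i in range(0, len(x) - k + 1, stride)])

x = np.array([2.0, 7.0, 1.0, 8.0, 2.0, 8.0, 1.0])
w = np.array([1.0, 0.0, -1.0])

direct = strided_correlate(x, w, stride=2)

# Wasteful route: compute every output, then throw away every other one.
full = np.correlate(x, w, mode="valid")
downsampled = full[::2]
print(direct, downsampled)
```

Both routes give the same values, but the second computes five outputs to keep three.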

Zero Padding Controls Size • Without zero padding (kernel width of six), the representation shrinks at each layer • With zero padding, shrinking is prevented Figure 9.13 (Goodfellow 2016)
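The shrinking effect follows from the output-width formula m - k + 1 for an input of width m and a kernel of width k. The sketch below tracks the width through a few layers; the kernel width of six comes from the slide, while the input width of 16, the layer count, and the function names are assumptions chosen for illustration.

```python
def output_width(input_width, kernel_width, total_padding=0):
    """Width after one convolution; total_padding counts zeros added on both sides combined."""
    return input_width + total_padding - kernel_width + 1

def widths_through_layers(input_width, kernel_width, num_layers, total_padding=0):
    widths = [input_width]
    for _ in range(num_layers):
        widths.append(output_width(widths[-1], kernel_width, total_padding))
    return widths

no_padding = widths_through_layers(16, 6, num_layers=3)
with_padding = widths_through_layers(16, 6, num_layers=3, total_padding=5)
print(no_padding)    # shrinks by five units per layer
print(with_padding)  # width preserved
```

Without padding, three layers reduce 16 units to a single unit, which caps the depth of the network. Adding k - 1 zeros per layer keeps the width constant.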

Kinds of Connectivity (inputs x1 to x5, outputs s1 to s5) • Local connection (unshared convolution): like convolution, but no sharing; weights a b c d e f g h i • Convolution: weights a b a b a b a b a • Fully connected Figure 9.14 (Goodfellow 2016)

Partial Connectivity Between Channels • Each output channel is a function of only a subset of the input channels • Figure shows the input tensor and output tensor, with channel coordinates and spatial coordinates as axes Figure 9.15 (Goodfellow 2016)

Tiled Convolution (outputs s1 to s5) • Local connection (no sharing) • Tiled convolution (cycle between shared parameter groups) • Standard convolution (one group shared everywhere) Figure 9.16 (Goodfellow 2016)
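The three schemes differ only in how many distinct kernels are cycled through as the window moves. This is a minimal 1D sketch assuming NumPy; the function name `tiled_correlate` and all kernel values are made up for illustration.

```python
import numpy as np

def tiled_correlate(x, kernels):
    """Output position i uses kernels[i % t], where t is the number of kernels supplied."""
    x = np.asarray(x, dtype=float)
    kernels = np.atleast_2d(np.asarray(kernels, dtype=float))
    t, k = kernels.shape
    out_len = len(x) - k + 1
    return np.array([np.dot(x[i:i + k], kernels[i % t]) for i in range(out_len)])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# One kernel: standard convolution (one group shared everywhere).
standard = tiled_correlate(x, [[1.0, -1.0]])

# Two kernels: tiled convolution, alternating between the two groups.
tiled = tiled_correlate(x, [[1.0, -1.0], [2.0, 0.0]])

# As many kernels as outputs: local connection (no sharing).
local = tiled_correlate(x, [[1.0, -1.0], [2.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(standard, tiled, local)
```

Storage grows with the number of kernels t, so tiled convolution sits between standard convolution (t = 1) and a locally connected layer (t equal to the number of outputs).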

6. Structured Outputs • Convolutional networks are usually used for classification • They can also be used to output a high-dimensional, structured object • The object is typically a tensor

7. Data Types • Single channel examples: • 1D audio waveform • 2D audio data after Fourier transform • Frequency versus time • Multi-channel examples: • 2D color image data • Three channels: red pixels, green pixels, blue pixels • Each channel is 2D for the image

8. Efficient Convolution Algorithms • Devising faster ways of performing convolution or approximate convolution without harming the accuracy of the model is an active area of research • However, most dissertation work concerns feasibility and not efficiency

9. Random or Unsupervised Features • One way to reduce the cost of convolutional network training is to use features that are not trained in a supervised fashion • Three methods (Rosenblatt used first two) • Simply initialize convolutional kernels randomly • Design them by hand • Learn the kernels using unsupervised methods

10. The Neuroscientific Basis for Convolutional Networks • Convolutional networks may be the greatest success story of biologically inspired AI • Some of the key design principles of neural networks were drawn from neuroscience • Hubel and Wiesel won the Nobel prize in 1981 for their work on the cat’s visual system in the 1960s and 1970s

10. The Neuroscientific Basis for Convolutional Networks • Neurons in the retina perform simple processing but do not substantially change the image representation • Image passes through the optic nerve to a brain region called the lateral geniculate body • Signal then reaches visual cortex area V1 • V1 also called the primary visual cortex • The first area of the brain that performs advanced processing of visual input • Located at the back of the head

10. The Neuroscientific Basis for Convolutional Networks • V1 properties captured in convolutional nets • V1 has a 2D structure mirroring the retina image • V1 contains many simple cells • Each characterized by a linear function of the image in a small, spatially localized receptive field • V1 contains many complex cells • These cells respond to features similar to the simple cells • But invariant to small shifts in the position of the feature • Inspired pooling strategies such as maxout units

10. The Neuroscientific Basis for Convolutional Networks • Although we know most about area V1, we believe similar principles apply to other areas • Basic strategy of detection followed by pooling • Passing through deeper layers, we find cells responding to specific concepts • These cells are nicknamed “grandmother cells” • The idea being that a person could have a neuron that activates upon seeing their grandmother anywhere in the image

10. The Neuroscientific Basis for Convolutional Networks • Reverse correlation • In biological networks we don’t have access to the weights themselves • However, we can put an electrode in a neuron, display images in front of the animal’s retina, and record the activation of the neuron • We can then fit a linear model to these responses to approximate the neuron’s weights • Most V1 cells have weights that are described by Gabor functions

Gabor Functions • White = positive weight, black = negative, gray = zero weight • (Left) detectors in different orientations, (Center) detectors of increasing width and height, (Right) different sinusoidal params Figure 9.18 (Goodfellow 2016)
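A Gabor function is a Gaussian envelope multiplied by a cosine wave along a rotated axis, which gives the three families of variation listed on this slide (orientation, width and height, sinusoid parameters). The sketch below samples one on a small grid, assuming NumPy and a kernel centered at the origin; the function name `gabor_kernel` and every default parameter value are illustrative choices.

```python
import numpy as np

def gabor_kernel(size=9, alpha=1.0, beta_x=0.1, beta_y=0.1, f=1.0, phi=0.0, tau=0.0):
    """Sample alpha * exp(-beta_x x'^2 - beta_y y'^2) * cos(f x' + phi) on a size x size grid.

    (x', y') are the grid coordinates rotated by the angle tau (radians).
    """
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    x_rot = x * np.cos(tau) + y * np.sin(tau)
    y_rot = -x * np.sin(tau) + y * np.cos(tau)
    envelope = np.exp(-beta_x * x_rot**2 - beta_y * y_rot**2)   # controls width and height
    wave = np.cos(f * x_rot + phi)                               # controls frequency and phase
    return alpha * envelope * wave

vertical = gabor_kernel()                      # stripes vary along x
rotated = gabor_kernel(tau=np.pi / 4)          # same detector at 45 degrees
wider = gabor_kernel(beta_x=0.02, beta_y=0.02) # slower decay, larger receptive field
print(vertical.shape, vertical[4, 4])
```

Positive entries correspond to the white regions in the figure and negative entries to the black regions; entries far from the center decay toward zero (gray).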

Gabor-like Learned Kernels • (Left) Weights learned by unsupervised learning • (Right) Convolutional kernels learned by first layer of fully supervised convolutional maxout network Figure 9.19 (Goodfellow 2016)

11. Convolutional Networks and the History of Deep Learning • Convolutional networks have played an important role in the history of deep learning • Application of neuroscience to machine learning • First deep models to perform well • First important commercial applications • Used to win many contests • Some of first deep networks trained with back-prop • Performed well decades ago to pave the way toward acceptance of neural networks in general

11. Convolutional Networks and the History of Deep Learning • Convolutional networks allow specialized neural networks for grid-structured topology • Most successful on 2D image topology • For 1D sequential data we use recurrent networks
