540 likes | 635 Views
Spatial Transformer Networks. Authors : Max Jaederberg, Karen Simonyan,Andrew Zisserman and Koray Kavukcuoglu Google DeepMind Neural Information Processing Systems (NIPS) 2015. Presented by: Asher Fredman. Motivation.
E N D
Spatial Transformer Networks Authors: Max Jaederberg, Karen Simonyan,Andrew Zisserman and Koray Kavukcuoglu Google DeepMind Neural Information Processing Systems (NIPS) 2015 Presented by: Asher Fredman
Motivation In order to successfully classify objects, Neural networks should be invariant to spatial transformations. But are they? • Scale? • Rotation? • Translation?
Spatial transformations aspect rotation translation perspective cylindrical affine
No. CNN’s and RNN’s are not really invariant to most spatial transformation. • Scale? • Rotation? • Translation? • RNN’s – no. • CNN’s – sort of… No.
Max-pooling layer The figure above shows the 8 possible translations of an image by one pixel. The 4x4 matrix represents 16 pixel values of an example image. When applying 2x2 max-pooling, the maximum of each colored rectangle is the new entry of the respective bold rectangle. As one can see, the output stays identical for 3 out of 8 cases, making the output to the next layer somewhat translation invariant. When performing 3x3 max-pooling, 5 out of 8 translation directions give an identical result, which means that the translation invariance rises with the size of the max-pooling. As described before, translation is only the most simple scenario of a geometric transformation. Other transformations listed in the table above can only be handled by the spatial transformer module.
Visualizing CNNs Harley, Adam W. "An interactive node-link visualization of convolutional neural networks." International Symposium on Visual Computing. Springer International Publishing, 2015.
So we would like our NN’s to be implemented in a way, that the output of the network is invariant to the position, size and rotation of an object in the image. How can we achieve this? Spatial Transformer Module!
SpatialTransformer • A dynamic mechanism that actively spatially transforms an image or feature map by learning appropriate transformation matrix. • Transformation matrix is capable of including translation, rotation, scaling, cropping. • Allows for end to end trainable models using standard back-propagation. • Further they allow the localization of objects in an image and a sub-classification of an object, such as distinguishing between the body and the head of a bird in an unsupervised manner.
Architecture Architecture of a spatial transformer module
Anything strange? Why not just transform image directly??
Image Warping quick overview In order to understand the different components of the spatial transformer module, we need to first understand the idea behind image inverse warping. Image warping is the process of digitally manipulating an image such that any shapes portrayed in the image have been significantly distorted. Warping may be used for correcting image distortion.
Imagewarping T(x,y) y’ y x x’ f(x,y) g(x’,y’) Given a coordinate transform (x’,y’) = T(x,y) and a source image f(x,y), how do we compute a transformed image g(x’,y’) = f(T(x,y))?
Forwardwarping T(x,y) y’ y x f(x,y) x’ g(x’,y’) Send each pixel f(x,y) to its corresponding location (x’,y’) = T(x,y) in the second image Q: what if pixel lands “between” two pixels?
Forwardwarping T(x,y) y’ y x f(x,y) x’ g(x’,y’) Send each pixel f(x,y) to its corresponding location (x’,y’) = T(x,y) in the second image Q: what if pixel lands “between” two pixels? A: distribute color among neighboring pixels (x’,y’) Known as “splatting”
Inversewarping T-1(x,y) y’ y x f(x,y) x’ g(x’,y’) Get each pixel g(x’,y’) from its corresponding location (x,y) = T-1(x’,y’) in the first image Q: what if pixel comes from “between” two pixels?
Inversewarping T-1(x,y) y’ y x f(x,y) x’ g(x’,y’) Get each pixel g(x’,y’) from its corresponding location (x,y) = T-1(x’,y’) in the first image Q: what if pixel comes from “between” two pixels? A: Interpolate color value from neighbors • nearest neighbor, bilinear, Gaussian, bicubic
Q: which is better? • A: usually inverse—eliminates holes • however, it requires an invertible warp function—not always possible... Forward vs. inversewarping
Image resampling is the process of geometrically transforming digital images Resampling x u y v This is a 2D signal reconstruction problem!
Resampling Compute weighted sum of pixel neighborhood - Weights are normalized values of kernel function - Equivalent to convolution at samples with kernel - Find good filters using ideas of previous lectures x u v y
Point Sampling • Copies the color of the pixel with the closest integer coordinate • A fast and efficient way to process textures if the size of the target is similar to the size of the reference • Otherwise, the result can be a chunky, aliased, or blurred image. Nearest neighbor x u y v
Bilinear Filter x u Weighted sum of four neighboring pixels y v
Unit square - If we choose a coordinate system in which the four points where f is known are (0, 0), (0, 1), (1, 0), and (1, 1), then the interpolation formula simplifies to: f(x,y) f(0,0) (1-x)(1-y) + f(1,0) x(1-y) + f(0,1) (1-x)y + f(1,1) xy Bilinear Filter
y Bilinear Filter (i,j) (i,j+1) x (i+1,j+1) (i+1,j)
Back to the spatial transformer .. Architecture of a spatial transformer module
LocalisationNetwork • Takes in feature map U ∈ RH×W×C and outputs parameters of the transformation. • Can be realized using fully-connected or convolutional networks regressing the transformation parameters.
Parameterised Sampling Grid (GridGenerator) • Generates sampling grid by using the transformation predicted by the localization network.
Parameterised Sampling Grid (GridGenerator) • AttentionModel: Source transformedgrid Target regulargrid Identity Transform (s=1, tx=0, ty = 0)
Parameterised Sampling Grid (GridGenerator) • Affinetransform: Source transformedgrid Target regulargrid
Differentiable Image Sampling(Sampler) Samples the input feature map by using the sampling grid and produces the output map.
Mathematical Formulation ofSampling Kernels • Integer samplingkernel • Bilinear samplingkernel
Backpropagation through SamplingMechanism • Gradient with bilinear samplingkernel
Experiments • Distorted versions of the MNIST handwriting dataset for classification • A challenging real-world dataset, Street View House Numbers for number recognition • CUB-200-2011 birds dataset for fine-grained classification by using multiple parallel spatial transformers
Experiments • MNIST data that has been distorted in various ways: rotation (R), rotation, scale and translation (RTS), projective transformation (P), and elastic warping (E). • Baseline fully-connected (FCN) and convolutional (CNN) neural networks are trained, as well as networks with spatial transformers acting on the input before the classification network (ST-FCN and ST-CNN).
Experiments • The spatial transformer networks all use different transformation functions: an affine (Aff), projective (Proj), and a 16-point thin plate spline transformations (TPS)
Experiments • Affine Transform (error%) 1.6 1.4 1.2 1 0.8 0.6 0.4 0.2 0 1.5 1.4 1.2 1.2 0.8 0.8 CNN ST-CNN 0.7 0.5 R RTS P E
Experiments • Projective Transform (error%) 1.6 1.4 1.2 1 0.8 0.6 0.4 0.2 0 1.5 1.4 1.3 1.2 0.8 0.8 0.8 CNN ST-CNN 0.6 R RTS P E
Experiments • Thin Plate Spline (error%) 1.6 1.4 1.2 1 0.8 0.6 0.4 0.2 0 1.5 1.4 1.2 1.1 0.8 0.8 CNN ST-CNN 0.7 0.5 R RTS P E
DistortedMNIST Results R: Rotation RTS : Rotation, Translation andScaling • : Inputs to thenetwork • : Transformation applied by the Spatial TransformerNetwork • : Output of the Spatial TransformerNetwork P :Projective Distortion E: Elastic Distortion
Experiments • Street View House Numbers (SVHN) • This dataset contains around 200k real world images of house numbers, with the task to recognise the sequence of numbers in each image
Experiments • Data is preprocessed by taking 64 × 64 crops and more loosely 128×128 crops around each digit sequence
Experiments • Comperative results (error %) • 6 5.6 5 4.5 4 4 3.9 3.73.9 3.9 3.6 4 3 64 128 2 1 0 Maxout CNN Ours DRAM ST-CNN Single ST-CNN Multi
Experiments • Fine-Grained Classification • CUB-200-2011 birds dataset contains 6k training images and 5.8k test images, covering 200 species of birds. • The birds appear at a range of scales and orientations, are not tightly cropped. • Only image class labels are used for training.
Experiments • Baseline CNN model is an Inception architecture with batch normalisation pretrained on ImageNet and fine-tuned on CUB. • It achieved the state-of-theart accuracy of 82.3% (previous best result is 81.0%). • Then, spatial transformer network, ST-CNN, which contains 2 or 4 parallel spatial transformers are trained.
Experiments • The transformation predicted by 2×ST-CNN (top row) and 4×ST-CNN (bottom row)
Experiments • One of the transformers learns to detect heads, while the other detects the body.