Investigating the performance of deep CNNs as if GPU memory were unlimited, using a memory-agnostic bijective module that enables efficient training without any additional memory for storing activations.
Recursive Convolutions of Arbitrary Depth
Venkat Santhanam, Vlad Morariu, Larry Davis — University of Maryland
Outline
• The JANUS pipeline (like almost every other vision problem) benefits from very deep CNNs.
• Major DCNN bottleneck: GPU memory.
• Question: if we had unlimited GPU memory, is there any depth beyond which performance truly saturates?
• We investigate this question using a novel class of memory-agnostic bijective DCNN modules that can be plugged between any two layers of an arbitrary neural network architecture.
GPU Memory Usage in DCNNs
• In convolutional neural networks, the memory required for the weights is negligible compared to the memory required for the activations.
• DCNNs must store very large intermediate activations, and this becomes the major limiting factor for further increasing depth.
• However, if some layer in the network is bijective (invertible) and its inverse can be computed efficiently, we can avoid caching its intermediate activation and instead recompute it when needed (a minimal sketch follows).
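The sketch below is not from the original slides; it assumes PyTorch and a hypothetical MemorylessLeakyReLU name, and it illustrates the recompute-instead-of-cache idea on the simplest invertible layer, a leaky ReLU: the backward pass reconstructs the input from the output via the inverse map, so the input activation never has to be stored.

```python
# Minimal sketch (assumed PyTorch, not the authors' code): an invertible
# leaky ReLU whose backward pass recomputes its input from its output,
# so no input activation is cached.
import torch

class MemorylessLeakyReLU(torch.autograd.Function):
    NEG_SLOPE = 0.1  # any slope > 0 keeps the function bijective

    @staticmethod
    def forward(ctx, x):
        y = torch.where(x >= 0, x, MemorylessLeakyReLU.NEG_SLOPE * x)
        ctx.save_for_backward(y)  # only the output (needed downstream anyway)
        return y

    @staticmethod
    def backward(ctx, grad_y):
        (y,) = ctx.saved_tensors
        # Recompute the input on the fly via the inverse map, then apply the
        # local derivative: 1 for x >= 0, NEG_SLOPE otherwise.
        x = torch.where(y >= 0, y, y / MemorylessLeakyReLU.NEG_SLOPE)
        return torch.where(x >= 0, grad_y, MemorylessLeakyReLU.NEG_SLOPE * grad_y)

if __name__ == "__main__":
    x = torch.randn(4, 8, requires_grad=True)
    MemorylessLeakyReLU.apply(x).sum().backward()
    print(x.grad.shape)  # gradients flow without a cached input activation
```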
Recursive Convolutions of Arbitrary Depth
• Consider any neural network ℵ with K layers.
• Consider a linear bijective module of arbitrary depth N consisting of: a convolution with orthogonal weights, an invertible activation function, and batch normalization.
Recursive Convolutions of Arbitrary Depth
• Sandwich the linear module between any two connected layers.
• The augmented network can be efficiently trained without requiring any additional memory for storing activations!
Analysis of the Linear Module
• 1x1 convolution with orthogonal weights: orthogonal weights have been used with great success in recurrent NNs; they preserve norms, so there are no vanishing or exploding gradients.
• Activation function: ReLU with a negative slope (leaky ReLU); the negative slope ensures that the module is invertible.
• Batch normalization: further improves gradient stability and speeds up convergence.
• All three layers are trivially invertible, i.e. during the backward pass the intermediate activations are cheaply recomputed on the fly (see the sketch below).
• Both forward and backward passes are computed efficiently "in place" without using any extra buffers.
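As a concrete illustration that all three layers are invertible, here is a minimal sketch (assumed PyTorch; the names forward_step and inverse_step are mine, not the authors') that applies one step of the module and then recovers its input: the orthogonal 1x1 convolution is undone by its transpose, the leaky ReLU by dividing negative values by the slope, and batch normalization by un-normalizing with the statistics used in the forward pass.

```python
# Conceptual sketch (assumed PyTorch, illustrative names): one step of the
# linear bijective module and its exact inverse.
import torch
import torch.nn.functional as F

def orthogonal_1x1(channels):
    """A 1x1 convolution kernel initialised to an orthogonal matrix."""
    w = torch.empty(channels, channels)
    torch.nn.init.orthogonal_(w)
    return w.view(channels, channels, 1, 1)

def forward_step(x, w, neg_slope, gamma, beta, eps=1e-5):
    y = F.conv2d(x, w)                                     # orthogonal 1x1 conv
    y = F.leaky_relu(y, neg_slope)                         # invertible activation
    mean = y.mean(dim=(0, 2, 3), keepdim=True)
    var = y.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
    z = gamma * (y - mean) / torch.sqrt(var + eps) + beta  # batch norm
    return z, (mean, var)                                  # stats kept for the inverse

def inverse_step(z, w, neg_slope, gamma, beta, stats, eps=1e-5):
    mean, var = stats
    y = (z - beta) / gamma * torch.sqrt(var + eps) + mean  # undo batch norm
    y = torch.where(y >= 0, y, y / neg_slope)              # undo leaky ReLU
    # Undo the orthogonal 1x1 convolution: its inverse is the transposed kernel.
    w_inv = w.squeeze(-1).squeeze(-1).t().contiguous().view_as(w)
    return F.conv2d(y, w_inv)

if __name__ == "__main__":
    C = 16
    x = torch.randn(2, C, 8, 8)
    w = orthogonal_1x1(C)
    gamma, beta = torch.ones(1, C, 1, 1), torch.zeros(1, C, 1, 1)
    z, stats = forward_step(x, w, 0.1, gamma, beta)
    x_rec = inverse_step(z, w, 0.1, gamma, beta, stats)
    print(torch.allclose(x, x_rec, atol=1e-4))             # reconstruction check
```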
Experiments – CIFAR100 classification
• CIFAR100: a standard benchmark dataset for image classification.
• 32 x 32 color images, 100 mutually exclusive classes, 50k training images, 10k test images.
• Nearly every DCNN-based classification architecture published to date has been trained and tested on it.
Experiments – CIFAR100 classification
• The WRN-28-10 model* gives the current state-of-the-art accuracy on CIFAR100 without any pre-processing or data augmentation.
*Sergey Zagoruyko, Nikos Komodakis, Wide Residual Networks, BMVC 2016.
• The WRN architecture is almost identical to standard ResNets, but with much lower depth and a substantially larger number of channels per activation.
• WRN-D-K: D convolutional layers organized into four groups with 16, 16*K, 32*K and 64*K channels respectively; dropout is added after every alternate convolution.
Experiments – CIFAR100 classification
• Initial experiment: add the proposed bijective module (with N = 10) between the convolution layers LA and LB of each residual block of WRN-28-10-dropout.
• This effectively increases the network depth by ~10x without any increase in memory.
• Resulting architecture: WRN-28-10-orth-10 (sketched below).
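A rough structural sketch of the insertion point, assuming PyTorch and a simplified residual block written with plain nn.Modules (the memory-free recomputation from the earlier slides is omitted here); the class names and exact block layout are illustrative, not the authors' WRN-28-10-orth-10 implementation.

```python
# Illustrative sketch only: N bijective steps inserted between the two 3x3
# convolutions of a WRN-style residual block.
import torch
import torch.nn as nn

class BijectiveStack(nn.Module):
    def __init__(self, channels, depth=10, neg_slope=0.1):
        super().__init__()
        def step():
            conv = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
            nn.init.orthogonal_(conv.weight)  # orthogonal 1x1 weights
            return nn.Sequential(conv, nn.LeakyReLU(neg_slope), nn.BatchNorm2d(channels))
        self.steps = nn.Sequential(*[step() for _ in range(depth)])

    def forward(self, x):
        return self.steps(x)

class AugmentedResidualBlock(nn.Module):
    def __init__(self, channels, depth=10):
        super().__init__()
        self.conv_a = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bijective = BijectiveStack(channels, depth)  # inserted module
        self.conv_b = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn(self.conv_a(x)))
        out = self.bijective(out)                         # ~10x extra depth
        return x + self.conv_b(out)
```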
Experiments – CIFAR100 classification
• All networks are trained using SGD with Nesterov momentum.
• Mini-batch size: 128, base LR: 0.1, weight decay: 0.0005.
• Trained for 240 epochs, dropping the LR by a factor of 0.2 at epochs 60, 120, 160 and 200 (see the configuration sketch below).
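These hyper-parameters translate directly into a standard PyTorch optimizer and LR schedule; the sketch below assumes a momentum value of 0.9, which is not stated on the slide.

```python
# Training-schedule sketch matching the stated hyper-parameters (assumed
# PyTorch; the momentum value is my assumption).
import torch

def make_optimizer(model):
    optimizer = torch.optim.SGD(
        model.parameters(),
        lr=0.1,            # base learning rate
        momentum=0.9,      # assumed; the slide only says "Nesterov Momentum"
        nesterov=True,
        weight_decay=5e-4,
    )
    # Multiply the LR by 0.2 at epochs 60, 120, 160 and 200 (240 epochs total).
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[60, 120, 160, 200], gamma=0.2
    )
    return optimizer, scheduler
```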
Results – CIFAR100 classification
• WRN-28-10-orth-10 outperforms WRN-28-10 by 0.42%, giving the current state-of-the-art results on CIFAR100 (without pre-processing or data augmentation).
• It outperforms very deep 1000+ layer ResNet architectures.
• The bijective module enables increased depth at very high width, which would otherwise be unfeasibly expensive with standard convolutions.
Future Work
• The initial CIFAR100 experiment is a proof of concept; there is plenty of scope for improvement.
• Regularization is challenging for orthogonal weights, since standard weight decay cannot be used.
• A higher dropout rate or additional dropout layers may help with regularization.
Future Work
• Other architectural variations include:
• Changing the location where the module is inserted within the network.
• Changing the relative ordering of batch norm, dropout and the activation function.
• Inserting the module into other competitive yet diverse architectures such as DenseNets, DiracNets, FractalNets, etc.
Future Work
• The module should prove very useful for very large datasets like ImageNet, where increased depth almost always improves results and larger training image sizes make the memory requirements more stringent.
• It should also help tasks like image-to-image regression, where downsampling is highly detrimental and large activation sizes make deep architectures infeasible.
• Finally, given a good DCNN architecture for JANUS data, we can simply augment it with the bijective module without any memory increase.