Parallel Systems to Compute Deep Neural Networks. Carlos Ordonez
Authorities in the field • Geoffrey Hinton (U Toronto, Google): non-linearity, backpropagation, ReLU, Boltzmann machine • Y. LeCun (NYU, Facebook, USA): first deep net that recognized digits, learning rate, backpropagation • A. Ng (Stanford, USA): multicore, parallel deep nets • M. Jordan (UC Berkeley, USA): LDA, clustering • J. Dean (Google, USA): parallel processing • Z. Ghahramani (Cambridge, UK): linear Gaussian models • Y. Li (Alibaba, China): computer vision
Acknowledgments • E. Omiecinski, advisor (my 1st paper, on image mining based on Blobworld, UC Berkeley) • J. Dean, Google (inspiring talk) • G. Hinton and Z. Ghahramani: early contact with ML • M. Stonebraker, MIT (large arrays) • V. Baladandayuthapani (Bayesian stats) • My PhD student: Sikder Tahsin Al-Amin • My colleagues at UH (50% of them are working on deep learning)
Success of deep nets in AI problems • Signal: speech recognition (voice) • Image: computer vision (digits, image classification) • Language: beyond IR, natural language
Popular libraries • PyTorch (Facebook, USA) • TensorFlow (Google, USA): C++, distributed memory • Keras • Caffe (UC Berkeley, USA)
Deep Neural net • Input: data set • Output: weights or probabilities • Neuron activation f(): sigmoid, tanh, ReLU • Weights + biases • Loss function: quadratic in regression; classification error • Optional: filters (convolution, most common) • Deep nets can be stacked
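A minimal NumPy sketch (layer sizes and names such as forward and quadratic_loss are illustrative assumptions, not from the slides) of how the activation f(), the weights + biases, and a quadratic loss fit together:

```python
import numpy as np

# Sketch only: a neuron computes f(w.x + b) for some activation f(),
# and the net is scored with a loss (quadratic for regression here).
def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2, f=relu):
    h = f(X @ W1 + b1)        # hidden layer: weights + biases, then f()
    return h @ W2 + b2        # linear output layer

def quadratic_loss(y_hat, y):
    return 0.5 * np.mean((y_hat - y) ** 2)
```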
Classification of NNs • Shallow: 1 or 2 layers • Deep: 3-10, 10-100, 100-1000 • Convolutional or recurrent
Computation • Input: data set • Iterations • f() evaluation • Loss (fitness) function • Forward propagation • Backward propagation • Convolution (filters) • Dropping neurons
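A toy end-to-end sketch (network shape, learning rate, and random data are assumptions) showing these pieces in one training loop: forward propagation, loss evaluation, backward propagation, and the weight update:

```python
import numpy as np

# Illustrative sketch of the iterative computation for a 1-hidden-layer net.
rng = np.random.default_rng(0)
n, d, h = 100, 4, 8
X = rng.normal(size=(n, d))
y = rng.normal(size=(n, 1))
W1, b1 = rng.normal(size=(d, h)) * 0.1, np.zeros(h)
W2, b2 = rng.normal(size=(h, 1)) * 0.1, np.zeros(1)
lr = 0.01

for it in range(200):                       # iterations
    Z = X @ W1 + b1                         # forward propagation
    H = np.maximum(0.0, Z)                  # f() evaluation (ReLU)
    y_hat = H @ W2 + b2
    loss = 0.5 * np.mean((y_hat - y) ** 2)  # loss (fitness) function

    dy = (y_hat - y) / n                    # backward propagation
    dW2, db2 = H.T @ dy, dy.sum(axis=0)
    dH = dy @ W2.T
    dZ = dH * (Z > 0)                       # ReLU gradient
    dW1, db1 = X.T @ dZ, dZ.sum(axis=0)

    W1 -= lr * dW1; b1 -= lr * db1          # gradient descent update
    W2 -= lr * dW2; b2 -= lr * db2
```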
Data Set • A matrix: n vectors of d dimensions (not features!) • each vector xi possibly labeled • feature engineering (variable creation) • automated feature creation (in contrast to manual feature creation) • Domain knowledge absolutely necessary • Benchmark data sets: MNIST (used with LeNet), CIFAR
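As a small illustration (sizes and random values are assumptions), the data set is just an n x d matrix plus optional labels:

```python
import numpy as np

# Sketch: one row per vector x_i, d dimensions, optional label per row.
n, d = 1000, 20
X = np.random.rand(n, d)          # n vectors of d dimensions
y = np.random.randint(0, 2, n)    # optional labels, one per x_i
print(X.shape, y.shape)           # (1000, 20) (1000,)
```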
Typical convolution • Convolutional layer (figure not included in this transcript)
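A plain sketch of the sliding-filter computation behind a convolutional layer (the image size and the edge filter are assumptions; real libraries use optimized kernels):

```python
import numpy as np

# Sketch of a 2D convolution (valid mode) with a k x k filter.
def conv2d(image, kernel):
    k = kernel.shape[0]
    H, W = image.shape
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+k, j:j+k] * kernel)
    return out

image  = np.random.rand(28, 28)          # e.g., a digit-sized image
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])          # simple vertical-edge filter
feature_map = conv2d(image, kernel)      # shape (26, 26)
```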
Aspects that impact computation time • Raw input, but pre-processing can reduce size: images/signal (sampling, compression) or text (stemming, IR) • Big data • f() is non-linear (linear algebra optimizations not feasible) • Large # of matrix multiplications • Large # of iterations needed to fit (even overfit) the training data • Connectivity: dense vs. sparsely connected layers, but dynamic • Convolution: depends on filter size
Big Data Aspects • Signal: large time-series databases of spoken words • Image: 1000s of images, where each image is a large matrix; digit recognition is considered solved • Language: 1000s of documents
Layers • Fully/sparsely connected • Filters: convolution, FFT
Fully connected layers • Every neuron in one layer is connected to every neuron in the next layer
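A minimal sketch of a fully connected (dense) layer, y = f(Wx + b), where every input dimension contributes to every output neuron (sizes are assumptions):

```python
import numpy as np

# Sketch: a dense layer applied to a batch X of shape (batch, d_in).
def dense(X, W, b, f=np.tanh):
    return f(X @ W + b)

X = np.random.rand(32, 100)            # batch of 32 vectors, d = 100
W = np.random.rand(100, 50) * 0.1      # every input connected to every output
b = np.zeros(50)
out = dense(X, W, b)                   # shape (32, 50)
```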
Optimizations and acceleration • Gradient descent • MAC: matrix multiplication • More compact network • Sparsely connected layers (dropping) • Threshold on # of weights that contribute to yi • Early stopping • Weight sharing • Parallel processing • Filters (convolution): FFT to reduce the O() of the matrix multiplications
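For the last point, a small sketch showing that an FFT yields the same linear convolution as the direct sliding product at lower asymptotic cost (signal and filter lengths are assumptions):

```python
import numpy as np

# Sketch: direct convolution is O(n*k); the FFT route is O(n log n).
x = np.random.rand(1024)
h = np.random.rand(128)

direct = np.convolve(x, h)                       # direct sliding product

m = len(x) + len(h) - 1                          # output length
fft_conv = np.fft.irfft(np.fft.rfft(x, m) * np.fft.rfft(h, m), m)

assert np.allclose(direct, fft_conv)             # same result, cheaper for large filters
```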
Floating point bottlenecks • Matrix multiplication • Basic operation MAC: multiply and accumulate, similar to dgemm() in BLAS/LAPACK
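An illustrative sketch of the MAC kernel inside matrix multiplication; the optimized equivalent is what dgemm() provides (matrix sizes are assumptions):

```python
import numpy as np

# Sketch: naive matrix multiply written as repeated multiply-accumulate.
def matmul_mac(A, B):
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                acc += A[i, p] * B[p, j]   # multiply and accumulate
            C[i, j] = acc
    return C

A, B = np.random.rand(8, 4), np.random.rand(4, 6)
assert np.allclose(matmul_mac(A, B), A @ B)      # A @ B calls optimized BLAS
```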
Parallel Computation • CPU: multiple threads in cores, sharing L1 or L2 cache • GPU: many cores, attached processor + memory • TPU: purpose-specific • Distributed: multiple CPUs, each CPU with its own RAM • Shared-nothing: not common; network communication with data in RAM, I/O cost generally ignored • In short, it looks more like a traditional MPI cluster
Parallel data systems: architecture • Shared-nothing, message-passing • P machines (nodes) • Data partitioned before computation: load time • Examples: Parallel DBMSs, Hadoop HDFS, MapReduce, Spark
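A simplified single-process sketch of the shared-nothing idea (P, the data, and the per-node aggregate are assumptions): partition the matrix once at load time, let each node work on its own partition, then merge the partial results:

```python
import numpy as np

# Sketch: n x d matrix split across P "nodes"; each computes a local
# aggregate and a coordinator merges them (message passing simulated).
P = 4
n, d = 1000, 10
X = np.random.rand(n, d)

partitions = np.array_split(X, P)                       # done once, at load time
partials = [part.sum(axis=0) for part in partitions]    # local work per node
global_sum = np.sum(partials, axis=0)                   # merge at coordinator

assert np.allclose(global_sum, X.sum(axis=0))
```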
Hardware acceleration • Modifying floating point computations • DRAM • SRAM: basic ALU ops in RAM • LSTM • Non-volatile memory: in-place computation; reduce precision and # of writes
Modifying floating point computations • Reduce floating point precision • Reduce # of matrix multiplications
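A small sketch of the precision idea (matrix size is an assumption): casting weights from float32 to float16 halves memory and bandwidth at some loss of accuracy:

```python
import numpy as np

# Sketch: reduced floating point precision for a weight matrix.
W32 = np.random.rand(1024, 1024).astype(np.float32)
W16 = W32.astype(np.float16)

print(W32.nbytes // 2**20, "MB vs", W16.nbytes // 2**20, "MB")          # 4 MB vs 2 MB
print("max abs error:", np.max(np.abs(W32 - W16.astype(np.float32))))   # rounding cost
```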
Conclusions • Data set and neural net must fit in RAM (single machine or distributed memory) • Raw data preferred since net learns features • Large # of matrix operations: matrix-vector, matrix-matrix, filters, sums • Many iterations needed to decrease loss • Parallel processing: essential
Future work • Bigger deep nets, beyond RAM • TPUs beyond GPUs • Big data: not images, not language • Interpreting weights and biases via traditional statistics • Bayesian methods • Generative linear models have more solid theory