Parallel Systems to Compute Deep Neural Networks. Carlos Ordonez
Authorities in the field • Geoffrey Hinton (U Toronto, Google): non-linearity, backpropagation, ReLU, Boltzmann machine • Y. LeCun (NYU, Facebook, USA): first deep net that recognized digits, learning rate, backpropagation • A. Ng (Stanford, USA): multicore, parallel deep nets • M. Jordan (UC Berkeley, USA): LDA, clustering • J. Dean (Google, USA): parallel processing • Z. Ghahramani (Cambridge, UK): linear Gaussian models • Y. Li (Alibaba, China): computer vision
Acknowledgments • E. Omiecinski, advisor (my 1st paper, on image mining based on Blobworld, UC Berkeley) • J. Dean, Google (inspiring talk) • G. Hinton and Z. Ghahramani: early contact with ML • M. Stonebraker, MIT (large arrays) • V. Baladandayuthapani (Bayesian stats) • My PhD student: Sikder Tahsin Al-Amin • My colleagues at UH (50% of them are working on deep learning)
Success of deep nets in AI problems • Signal: speech recognition (voice) • Image: computer vision (digits, image classification) • Language: beyond IR, natural language
Popular libraries • PyTorch (Facebook, USA) • TensorFlow (Google, USA): C++, distributed memory • Keras • Caffe (UC Berkeley, USA)
Deep Neural net • Input: data set • Output: weights or probabilities • Neuron activation f(): sigmoid, tanh, ReLU • Weights + biases • Loss function: quadratic in regression; classification error • Optional: filters (convolution, most common) • Deep nets can be stacked
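A minimal NumPy sketch (layer sizes and names such as forward and quadratic_loss are illustrative assumptions, not from the slides) of how the activation f(), the weights + biases, and a quadratic loss fit together:

```python
import numpy as np

# Sketch only: a neuron computes f(w.x + b) for some activation f(),
# and the net is scored with a loss (quadratic for regression here).
def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2, f=relu):
    h = f(X @ W1 + b1)        # hidden layer: weights + biases, then f()
    return h @ W2 + b2        # linear output layer

def quadratic_loss(y_hat, y):
    return 0.5 * np.mean((y_hat - y) ** 2)
```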
Classification of NNs • Shallow: 1 or 2 layers • Deep: 3-10, 10-100, 100-1000 • Convolutional or recurrent
Computation • Input: data set • Iterations • f() evaluation • Loss (fitness) function • Forward propagation • Backward propagation • Convolution (filters) • Dropping neurons
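A toy end-to-end sketch (network shape, learning rate, and random data are assumptions) showing these pieces in one training loop: forward propagation, loss evaluation, backward propagation, and the weight update:

```python
import numpy as np

# Illustrative sketch of the iterative computation for a 1-hidden-layer net.
rng = np.random.default_rng(0)
n, d, h = 100, 4, 8
X = rng.normal(size=(n, d))
y = rng.normal(size=(n, 1))
W1, b1 = rng.normal(size=(d, h)) * 0.1, np.zeros(h)
W2, b2 = rng.normal(size=(h, 1)) * 0.1, np.zeros(1)
lr = 0.01

for it in range(200):                       # iterations
    Z = X @ W1 + b1                         # forward propagation
    H = np.maximum(0.0, Z)                  # f() evaluation (ReLU)
    y_hat = H @ W2 + b2
    loss = 0.5 * np.mean((y_hat - y) ** 2)  # loss (fitness) function

    dy = (y_hat - y) / n                    # backward propagation
    dW2, db2 = H.T @ dy, dy.sum(axis=0)
    dH = dy @ W2.T
    dZ = dH * (Z > 0)                       # ReLU gradient
    dW1, db1 = X.T @ dZ, dZ.sum(axis=0)

    W1 -= lr * dW1; b1 -= lr * db1          # gradient descent update
    W2 -= lr * dW2; b2 -= lr * db2
```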
Data Set • A matrix: n vectors of d dimensions (not features!) • each vector xi possibly labeled • feature engineering (variable creation) • automated feature creation (in contrast to manual feature creation) • Domain knowledge absolutely necessary • Benchmark data sets: MNIST (used with LeNet), CIFAR
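As a small illustration (sizes and random values are assumptions), the data set is just an n x d matrix plus optional labels:

```python
import numpy as np

# Sketch: one row per vector x_i, d dimensions, optional label per row.
n, d = 1000, 20
X = np.random.rand(n, d)          # n vectors of d dimensions
y = np.random.randint(0, 2, n)    # optional labels, one per x_i
print(X.shape, y.shape)           # (1000, 20) (1000,)
```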
Typical convolution • Convolutional layer (figure not included in this transcript)
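A plain sketch of the sliding-filter computation behind a convolutional layer (the image size and the edge filter are assumptions; real libraries use optimized kernels):

```python
import numpy as np

# Sketch of a 2D convolution (valid mode) with a k x k filter.
def conv2d(image, kernel):
    k = kernel.shape[0]
    H, W = image.shape
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+k, j:j+k] * kernel)
    return out

image  = np.random.rand(28, 28)          # e.g., a digit-sized image
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])          # simple vertical-edge filter
feature_map = conv2d(image, kernel)      # shape (26, 26)
```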
Aspects that impact computation time • Raw input, but pre-processing can reduce size: images/signal (sampling, compression) or text (stemming, IR) • Big data • f() is non-linear (linear algebra optimizations not feasible) • Large # of matrix multiplications • Large # of iterations needed to fit (even overfit) the training data • Connectivity: dense vs. sparsely connected layers, but dynamic • Convolution: depends on filter size
Big Data Aspects • Signal: large time-series databases of spoken words • Image: 1000s of images, where each image is a large matrix; digit recognition is considered solved • Language: 1000s of documents
Layers • Fully/sparsely connected • Filters: convolution, FFT
Fully connected layers • Every neuron in one layer is connected to every neuron in the next layer
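A minimal sketch of a fully connected (dense) layer, y = f(Wx + b), where every input dimension contributes to every output neuron (sizes are assumptions):

```python
import numpy as np

# Sketch: a dense layer applied to a batch X of shape (batch, d_in).
def dense(X, W, b, f=np.tanh):
    return f(X @ W + b)

X = np.random.rand(32, 100)            # batch of 32 vectors, d = 100
W = np.random.rand(100, 50) * 0.1      # every input connected to every output
b = np.zeros(50)
out = dense(X, W, b)                   # shape (32, 50)
```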
Optimizations and acceleration • Gradient descent • MAC: matrix multiplication • More compact network • Sparsely connected layers (dropping) • Threshold on # of weights that contribute to yi • Early stopping • Weight sharing • Parallel processing • Filters (convolution): FFT to reduce the O() of the matrix multiplications
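For the last point, a small sketch showing that an FFT yields the same linear convolution as the direct sliding product at lower asymptotic cost (signal and filter lengths are assumptions):

```python
import numpy as np

# Sketch: direct convolution is O(n*k); the FFT route is O(n log n).
x = np.random.rand(1024)
h = np.random.rand(128)

direct = np.convolve(x, h)                       # direct sliding product

m = len(x) + len(h) - 1                          # output length
fft_conv = np.fft.irfft(np.fft.rfft(x, m) * np.fft.rfft(h, m), m)

assert np.allclose(direct, fft_conv)             # same result, cheaper for large filters
```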
Floating point bottlenecks • Matrix multiplication • Basic operation MAC: multiply and accumulate, similar to dgemm() in BLAS/LAPACK
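An illustrative sketch of the MAC kernel inside matrix multiplication; the optimized equivalent is what dgemm() provides (matrix sizes are assumptions):

```python
import numpy as np

# Sketch: naive matrix multiply written as repeated multiply-accumulate.
def matmul_mac(A, B):
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                acc += A[i, p] * B[p, j]   # multiply and accumulate
            C[i, j] = acc
    return C

A, B = np.random.rand(8, 4), np.random.rand(4, 6)
assert np.allclose(matmul_mac(A, B), A @ B)      # A @ B calls optimized BLAS
```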
Parallel Computation • CPU: multiple threads in cores, sharing L1 or L2 cache • GPU: many cores, attached processor + memory • TPU: purpose-specific • Distributed: multiple CPUs, each CPU with its own RAM • Shared-nothing: not common; network communication with data in RAM, I/O cost generally ignored • In short, it looks more like a traditional MPI cluster
Parallel data systems: architecture • Shared-nothing, message-passing • P machines (nodes) • Data partitioned before computation: load time • Examples: Parallel DBMSs, Hadoop HDFS, MapReduce, Spark
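A simplified single-process sketch of the shared-nothing idea (P, the data, and the per-node aggregate are assumptions): partition the matrix once at load time, let each node work on its own partition, then merge the partial results:

```python
import numpy as np

# Sketch: n x d matrix split across P "nodes"; each computes a local
# aggregate and a coordinator merges them (message passing simulated).
P = 4
n, d = 1000, 10
X = np.random.rand(n, d)

partitions = np.array_split(X, P)                       # done once, at load time
partials = [part.sum(axis=0) for part in partitions]    # local work per node
global_sum = np.sum(partials, axis=0)                   # merge at coordinator

assert np.allclose(global_sum, X.sum(axis=0))
```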
Hardware acceleration • Modifying floating point computations • DRAM • SRAM: basic ALU ops in RAM • LSTM • Non-volatile memory: in-place computation; reduce precision and # of writes
Modifying floating point computations • Reduce floating point precision • Reduce # of matrix multiplications
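A small sketch of the precision idea (matrix size is an assumption): casting weights from float32 to float16 halves memory and bandwidth at some loss of accuracy:

```python
import numpy as np

# Sketch: reduced floating point precision for a weight matrix.
W32 = np.random.rand(1024, 1024).astype(np.float32)
W16 = W32.astype(np.float16)

print(W32.nbytes // 2**20, "MB vs", W16.nbytes // 2**20, "MB")          # 4 MB vs 2 MB
print("max abs error:", np.max(np.abs(W32 - W16.astype(np.float32))))   # rounding cost
```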
Conclusions • Data set and neural net must fit in RAM (single machine or distributed memory) • Raw data preferred since net learns features • Large # of matrix operations: matrix-vector, matrix-matrix, filters, sums • Many iterations needed to decrease loss • Parallel processing: essential
Future work • Bigger deep nets, beyond RAM • TPUs beyond GPUs • Big data: not images, not language • Interpreting weights and biases via traditional statistics • Bayesian methods • Generative linear models have more solid theory