Deep learning with TensorFlow Machine learning background and TensorFlow implementation Ido Shamay
Agenda • Machine Learning and Deep learning in general • And their relationship. • Deep learning in more details • Neural networks • Training Neural networks • Gradient descent • Distributed deep learning • Model parallelism • Data parallelism • TensorFlow framework. • TensorFlow structure. • Training Abstractions. • Example. • Distributed TensorFlow.
Agenda • What will not be covered • Applications of deep learning / machine learning • There are countless. • HW implementations of deep learning (TPU, etc..) • Will try to cover some of the GPU relation to the subject • Advanced mathematical approaches in convolutional networks • State of the art network implementations. • Recurrent neural networks – LSTM • Systems being used (DGX-1, etc..)
What Is Machine Learning? Machine learning is the subfield of computer science that, according to Arthur Samuel in 1959, gives "computers the ability to learn without being explicitly programmed." Source: https://en.wikipedia.org/wiki/Machine_learning
Deep Learning • Also known as Deep Neural Network (DNN) • Subset of Artificial Neural Network (ANN) • Deep Learning is a subfield of machine learning concerned with algorithms inspired by the structure and function of the brain called artificial neural networks. Source: http://machinelearningmastery.com/what-is-deep-learning/
Artificial Intelligence vs Machine Learning vs Deep Learning Artificial Intelligence (AI) “Human Intelligence Exhibited by Machines” • General AI: Machines that have all senses, all our reasons and think just like humans do • Narrow AI: Machines that perform specific tasks as well as, or better than, humans can Machine Learning (ML) “An Approach to Achieve Narrow AI” - Example: Spam Filtering Deep Learning (DL) “An Approach to Achieve Machine Learning using Deep Neural Network” - Example: Image/Video/Speech Recognition, Translation etc.
So why now? • Infrastructure • Recent developments in GPU and network technology make it practical to train machine learning models • Data • More data is generated than ever. Critical for the training process • Software • Wave of open source machine learning frameworks (TensorFlow, Cognitive Toolkit, etc.)
How Software Writes Software? • Learning algorithm: "millions of trillions of FLOPS" • Stanford example of interpreting photos: "Little girl is eating piece of cake" • Computational workload replaces hand-written code
Deep learning in more details • A single neuron • A neuron receives multiple inputs, and has weights, a "bias" and an activation function. • Output depends on the activation function. • Historically, the perceptron function (step) was used. From there comes the expression of a neuron "firing". • However, it does not go well with the gradient descent algorithm. • Will be explained shortly. • Examples: sigmoid, tanh, ReLU, perceptron (step).
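To make the computation concrete, here is a minimal NumPy sketch of a single neuron with a sigmoid activation; the input, weight and bias values are arbitrary, chosen only for illustration:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# a neuron: weighted sum of its inputs plus a bias, passed through the activation
x = np.array([0.5, -1.2, 3.0])   # inputs
w = np.array([0.1, 0.4, -0.2])   # weights
b = 0.3                          # bias
output = sigmoid(np.dot(w, x) + b)
print(output)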
Neural networks • Composed of multiple neurons connected to each other in a certain topology. • A neuron has one output, which may be split as multiple inputs to next-layer neurons. • The first layer is the input layer; its units are represented as neurons, but they are not real neurons – they just pass the input to the next layer. • The output layer has real neurons without an activation function; instead we have some kind of distribution function on top of the whole output layer. • For example: Softmax regression. • Fully connected networks – full mesh between each pair of adjacent layers. • Very general model, usually doesn't work well. • Will talk about it soon. • A small sketch of the softmax distribution function follows.
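A minimal NumPy sketch of softmax, the distribution function typically placed on top of the output layer; the logit values are arbitrary, for illustration only:

import numpy as np

def softmax(z):
    # subtract the max for numerical stability before exponentiating
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])   # raw outputs of the last layer
probs = softmax(logits)              # sums to 1.0, read as class probabilities
print(probs)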
Convolutional Neural networks • Very natural approach for images. • Input layer consists of the image pixels. • Unlike fully connected layers, where the network makes no difference between two pixels that are miles apart and two that are adjacent. • Convolutional networks take spatial advantage of the pixels. • Characteristics: • Local receptive fields – creating feature maps. • Shared weights and biases. • Pooling layers (max/avg/L2 pooling). • Usually involves fully connected layers at the end. • Example: 3 feature maps with a max pool layer. • Can reduce the input image drastically before the full mesh. • A sketch of one convolutional layer with pooling follows.
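A sketch of one convolutional layer followed by max pooling in TensorFlow 1.x; the shapes (28x28 grayscale input, 5x5 local receptive fields, 32 feature maps, 2x2 pooling) are illustrative assumptions, not taken from the slides:

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 28, 28, 1])                # batch of grayscale images
W = tf.Variable(tf.truncated_normal([5, 5, 1, 32], stddev=0.1))  # shared weights: 32 feature maps of 5x5
b = tf.Variable(tf.zeros([32]))                                  # shared biases, one per feature map

conv = tf.nn.relu(tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME') + b)
pool = tf.nn.max_pool(conv, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')  # 2x2 max pooling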
Example Convolutional Neural Networks
Learning of object parts • Examples of learned object parts from object categories: faces, cars, elephants, chairs.
Example – AlexNet • More importantly – first to use GPUs as a computation engine!
Examples of popular training sets • MNIST – written digits 28x28 data set (60K training, 10K test) • The "Hello World" of deep learning. • HWDB1.0 dataset – Chinese characters. • CIFAR10 – 32x32 images across 10 categories • 60K images (6000 per class) – 50K training, 10K test. • CIFAR100 – 32x32 images across 100 categories (20 super classes) • 60K images (600 per class) – 50K training, 10K test • Each image comes with 2 labels (exact class and super class). • IRIS – flower data set containing 3 Iris classes. • Those irises have (only) 4 different features (length/width of sepal etc..) • Very difficult even for humans. • 150 samples • ImageNet – most common deep learning data set for evaluating models. • Data set contains up to 15M images (usually models use a 1.5M training set). • Models are evaluated with top-1 and top-5 accuracy. • May take up to weeks to train. • COCO – Microsoft's new training set
Training a neural network • Cost function of supervised networks. • The cost function reflects the current "distance" between the network output and the supplied labels. • The variables of the cost function are the weights and biases (of all neurons). • The cost is computed on the whole training set (for now, will get back to it soon) • The goal is to minimize the cost function by optimizing the variables for the given training set. • This is done through an algorithm called Gradient Descent. • A small sketch of one common cost function follows.
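For illustration, a NumPy sketch of one common cost function, cross-entropy between one-hot labels and predicted probabilities (this particular choice is an assumption here; it matches the MNIST example later in the deck):

import numpy as np

def cross_entropy(y_true, y_pred):
    # average "distance" between predicted probabilities and one-hot labels
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

y_true = np.array([[0, 0, 1]])          # one-hot label
y_pred = np.array([[0.1, 0.2, 0.7]])    # network output (probabilities)
print(cross_entropy(y_true, y_pred))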
Gradient Descent • Gradient descent algorithm • The number of variables can reach 10^10 and more. • Finding the minimum analytically is not feasible. • Computing the gradient of the function is more feasible. • This is the heart of the computation phase in deep learning. • Update the variables with a small step in the direction of a local minimum. • Learning rate – the step size in each variable's direction. • Most optimizers have a per-variable adaptive learning rate. • The algorithm usually uses backpropagation for the gradient computation. • Reusing some of the computation from the feed-forward pass.
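The update itself is simple; a minimal NumPy sketch (the default learning rate value is arbitrary):

import numpy as np

def gradient_descent_step(theta, grad, learning_rate=0.5):
    # move every variable a small step against its gradient, toward a local minimum
    return theta - learning_rate * grad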
Back propagation • The gradient computation of a neural network implementation is called backpropagation.
Gradient Descent modes – mini batch • Batch gradient descent (vanilla) – going over the whole training set each step. • Usually performs very badly – accuracy doesn't converge well • The reason is an increased probability of reaching a "local" minimum (which is not global). • Also not so realistic because the whole set doesn't fit in the memory of the compute device. • Latest changes – will cover soon. • Stochastic gradient descent – update variables on each training example! • Much faster and can be used online. • Redundant computations for large data sets. • Does not converge smoothly – the location on the loss surface changes rapidly. • Mini batch gradient descent – takes the best from both approaches. • Choose a random mini batch from the training set each training step and compute the gradient on it • Going over the whole training set (training set size / mini batch size training steps) is called 1 epoch. • Reduces the variance of the parameter updates • Can make use of highly optimized matrix operations (thus utilizing compute resources) • Very efficient computation. • A sketch of the mini batch loop follows.
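A minimal NumPy sketch of the mini batch loop; compute_gradient is a hypothetical callback standing in for the backpropagation step, and the batch size, epoch count and learning rate are illustrative:

import numpy as np

def train(X, y, theta, compute_gradient, batch_size=100, epochs=10, lr=0.1):
    n = X.shape[0]
    for _ in range(epochs):                        # one epoch = one pass over the training set
        perm = np.random.permutation(n)            # shuffle so mini batches are random
        for i in range(0, n, batch_size):
            idx = perm[i:i + batch_size]           # indices of one mini batch
            theta = theta - lr * compute_gradient(theta, X[idx], y[idx])
    return theta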
Distributed training of deep learning • Needed for training large, powerful models quickly. • Two kinds of parallelism (there are more, but these are the popular approaches) • Model parallelism: • Each node gets a piece of the model. • Usually (best practice) each node will get a layer of its own. • Good for very large networks that won't fit in GPU memory. • Applicable for inference as well. • Data parallelism: • Same model replicated to different nodes. • Nodes work with the same weights/biases for all replicas. • Different mini-batches – the mini batch is split among workers. • Gradients computed by all workers are averaged. • Variables are updated after gradient averaging. • Effective mini batch size depends on the paradigm. • Parameter server – all model variables reside in one place, accessible from all workers. • May be distributed as well.
Model and Data Parallelism • Diagram: in data parallelism, local model replicas each work on their own mini batch and send updates to a main model / parameter server; in model parallelism, the model itself is split across nodes.
Data parallelism – Parameter Server • Synchronous: • Workers fetch model variables (synchronized). • Each computes gradients and sends them to the parameter server. • The PS waits for all workers before averaging the gradients. • When all workers have sent, it computes the new variables and updates them. • Asynchronous: • Workers fetch model variables independently. • Each worker computes a gradient and sends it to the parameter server • The PS updates the variables on each worker's update. • Pros/Cons: • Synchronous: • Not fault tolerant; the slowest worker determines the pace. • Converges like a non-distributed execution. • The mini batch is the aggregated mini batch of the workers. • Asynchronous: • A worker may work with stale parameters, so its gradient update is not optimal. • Can diverge a lot. • Fault tolerant. Each worker is independent and works on the effective mini batch. • A sketch of one synchronous update follows.
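A minimal NumPy sketch of one synchronous parameter-server update; compute_gradient and worker_batches are hypothetical placeholders (a gradient callback and a list of per-worker mini batches):

import numpy as np

def synchronous_step(theta, worker_batches, compute_gradient, lr=0.1):
    # the parameter server waits for every worker's gradient before updating
    grads = [compute_gradient(theta, X, y) for (X, y) in worker_batches]
    avg_grad = np.mean(grads, axis=0)    # average the gradients across workers
    return theta - lr * avg_grad         # single update with the averaged gradient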
TensorFlow – open source framework by Google • Second generation deep learning system from Google (based on the DistBelief framework) • Very popular framework at the moment. • Active researcher community and mailing list. • Contains thousands of features with an excellent software design. • Front-end in Python/C++ for model definition. • Runtime backend in C++, linear algebra packages and GPU support. • Model design transparent to infrastructure/scale: • Abstracts away the underlying hardware. • Same programs for different infrastructure.
TensorFlow – API • Extensive API and library with good abstractions: • Mathematical operations • High level operations (convolutions, pooling, etc..) • Standard losses (CrossEntropy, L1, etc..) • Various optimizers (SGD, AdaGrad, etc..) • Auto differentiation. • Easy to experiment. • Built-in flexible distribution engine – to be explained shortly.
TensorFlow Graph • Based on a data flow graph. • Very abstract – does not care about runtime. • Operations: • Nodes in the graph – represent the processing/computation. • Edges represent the tensors (data). • Construction phase – just building the network (no training yet) • Tensors: • This is the only data object in TensorFlow. • Tensors flow through the operations in the graph. • N-dimensional arrays. • Each tensor has a rank (its number of dimensions) and a shape. • Example – the MNIST training set is a [55000, 784] shaped tensor. • Example – the first convolutional layer's weights are a [5, 5, 1, 32] shaped tensor.
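A small sketch of the construction phase with constant tensors (the values are arbitrary); nothing is computed yet, only the graph and static shapes exist:

import tensorflow as tf

a = tf.constant([[1.0, 2.0], [3.0, 4.0]])   # rank-2 tensor, shape [2, 2]
b = tf.constant([[1.0], [1.0]])             # rank-2 tensor, shape [2, 1]
c = tf.matmul(a, b)                         # an operation node; c is its output tensor
print(c.shape)                              # static shape (2, 1) – no values computed yet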
TensorFlow Graph • Variables – persistent memory tensors. • Maintain state for tensors across executions of the graph. • Mostly used for model parameters in the training phase. • Updated with the "assign" operation – usually abstracted by the training API of TensorFlow. • These are, in fact, the model parameters. • This is how optimizers know what to optimize.
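A minimal sketch of a variable being updated with an explicit assign (a toy counter, not from the slides; optimizers do the equivalent for the model parameters):

import tensorflow as tf

counter = tf.Variable(0, name="counter")        # persistent state across session.run calls
increment = tf.assign(counter, counter + 1)     # explicit assign op

with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    for _ in range(3):
        print(sess.run(increment))              # prints 1, 2, 3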
TensorFlow training abstraction • Training is very easy in TensorFlow. • No need to implement anything. • Supply an SGD optimizer and a cost function and it will perform one train step automatically. • Support for all popular optimizers and implementations. • Support for all popular cost functions. • No need to update variables – done automatically. • The minimize function takes care of both computing and applying the gradients. • This can be split to insert some logic. • This is where I use MPI Allreduce today. • Built-in support for name scopes. • How is it done? • TensorFlow adds nodes for the gradient computation. • Implements backpropagation by extending the graph. • A sketch of the split follows.
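A sketch of the compute/apply split mentioned above, in TensorFlow 1.x; the tiny squared-error model is a made-up stand-in just to have a loss to differentiate:

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 2])
y_ = tf.placeholder(tf.float32, [None, 1])
w = tf.Variable(tf.zeros([2, 1]))
loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - y_))   # simple squared-error cost

opt = tf.train.GradientDescentOptimizer(0.5)
# minimize(loss) is equivalent to the two steps below; splitting them is where
# custom logic (e.g. averaging gradients across workers) can be inserted
grads_and_vars = opt.compute_gradients(loss)
train_step = opt.apply_gradients(grads_and_vars)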
TensorFlow Session • Sessions – actual runtime lifting of a graph. • Translates the graph definition to executable operations distributed across devices (compute resources). • Devices – CPUs/GPUs. • Nodes/operations are placed on devices. • Queues – each device is assigned a queue. • An operation which has received all its incoming tensors is placed in its device's queue. • Its output tensor is passed as an input tensor to the relevant operations. • May add SEND/RECEIVE nodes – to be explained in the distributed part. • Fetches and feeds. • Parameters of the session's run method. • Fetches – instruct which tensors to fetch – may result in partial execution! • Feeds – input variables of a model. • So this gives the flexibility to implement every distributed paradigm. • A small fetches/feeds example follows.
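A small sketch of fetches and feeds (toy values); fetching only part of the graph runs only the operations that fetch depends on:

import tensorflow as tf

x = tf.placeholder(tf.float32)     # fed at run time
double = x * 2.0
quadruple = double * 2.0

with tf.Session() as sess:
    # fetching only `double` executes just that part of the graph (partial execution)
    print(sess.run(double, feed_dict={x: 3.0}))                  # 6.0
    print(sess.run([double, quadruple], feed_dict={x: 3.0}))     # [6.0, 12.0]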
Example – MNIST training, 0 hidden layers

import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

# model: a single softmax layer (no hidden layers)
x = tf.placeholder(tf.float32, [None, 784], name="x")
y_ = tf.placeholder(tf.float32, [None, 10])
b = tf.Variable(tf.zeros([10]))
W = tf.Variable(tf.random_uniform([784, 10], -1, 1))
y = tf.nn.softmax(tf.matmul(x, W) + b)

# cross-entropy cost and one SGD train step
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)

init = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init)
for _ in range(1000):
    batch_xs, batch_ys = mnist.train.next_batch(100)   # mini batch of 100 examples
    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
Under the hood • TensorFlow uses protobuf everywhere • Single node execution explanation.
Under the hood • Multi device execution.
Distributed TensorFlow • TensorFlow has built-in support for distributed execution. • Supervisor API. • Session manager – for storing checkpoints in a distributed environment. • Also, its flexible architecture allows us to implement any paradigm we want. • In-graph replication • A single TensorFlow graph that contains one set of parameters and multiple copies of the compute-intensive part of the model, each pinned to a different task. • Between-graph replication • A separate client/graph for each task, typically in the same process as the worker task. Each client builds a similar graph containing the parameters (pinned to the PS job) and a single copy of the compute-intensive part of the model, pinned to the local task. • Building a TensorFlow cluster: • Run a TensorFlow server program, with one or more workers, on each node. • Usually between-graph replication implies using the distribution engine, and in-graph replication is for flexible model design. • In both we can implement asynchronous/synchronous paradigms. • A small cluster sketch follows.
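A sketch of defining a cluster and a server, and pinning variables to the PS job via a device setter (between-graph style); the two-task cluster on localhost ports is a made-up example:

import tensorflow as tf

# hypothetical cluster: one parameter server task and one worker task
cluster = tf.train.ClusterSpec({
    "ps":     ["localhost:2222"],
    "worker": ["localhost:2223"],
})
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# between-graph replication: this worker builds its own graph; variables are
# pinned to the ps job, compute ops stay on the local worker
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:0", cluster=cluster)):
    W = tf.Variable(tf.zeros([784, 10]))   # placed on the parameter server
    # ... the rest of the model and the training op would be built here ...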
Distributed TensorFlow Runtime • The SEND/RECV nodes communicate with each other through an abstract Rendezvous interface, supplied to the SEND/RECV kernels at their construction by the session. • Rendezvous interface – the abstract Rendezvous interface defines: • A non-blocking Send operation, which receives the Tensor it needs to send and its device information (which device holds this tensor, the memory allocator for this tensor's memory). • An asynchronous Receive operation, which is supplied with a callback function (to be called when the Tensor is ready) and device information for where to place this Tensor when calling the callback. • The SEND/RECV nodes use the above functionality of the abstract Rendezvous interface without really knowing which Rendezvous implementation is being used (the implementation is chosen by the session).