Large-Scale Deep Learning With TensorFlow
Jeff Dean, Google Brain team
g.co/brain
In collaboration with many other people at Google
What is the Google Brain Team?
• Research team focused on long-term artificial intelligence research
• Mix of computer systems and machine learning research expertise
• Pure ML research, and research in the context of emerging ML application areas:
  • robotics, language understanding, healthcare, ...
• g.co/brain
We Disseminate Our Work in Many Ways
• By publishing our work
  • See papers at research.google.com/pubs/BrainTeam.html
• By releasing TensorFlow, our core machine learning research system, as an open-source project
• By releasing implementations of our research models in TensorFlow
• By collaborating with product teams at Google to get our research into real products
What Do We Really Want?
• Build artificial intelligence algorithms and systems that learn from experience
• Use those to solve difficult problems that benefit humanity
What do I mean by understanding?
Query: [ car parts for sale ]
Document 1: ... car parking available for a small fee. ... parts of our floor model inventory for sale.
Document 2: Selling all kinds of automobile and pickup truck parts, engines, and transmissions.
Example Needs of the Future
• Which of these eye images shows symptoms of diabetic retinopathy?
• Find me all rooftops in North America
• Describe this video in Spanish
• Find me all documents relevant to reinforcement learning for robotics and summarize them in German
• Find a free time for everyone in the Smart Calendar project to meet and set up a videoconference
• Robot, please fetch me a cup of tea from the snack kitchen
Growing Use of Deep Learning at Google
[Chart: # of directories containing model description files, growing over time]
Across many products/areas: Android, Apps, drug discovery, Gmail, image understanding, Maps, natural language understanding, Photos, robotics research, Speech, Translation, YouTube, ... many others ...
Important Property of Neural Networks
Results get better with more data + bigger models + more computation
(Better algorithms, new insights, and improved techniques always help, too!)
Aside
Many of the techniques that are successful now were developed 20-30 years ago.
What changed? We now have:
• sufficient computational resources
• large enough interesting datasets
Use of large-scale parallelism lets us look ahead many generations of hardware improvements, as well.
What do you want in a machine learning system?
• Ease of expression: for lots of crazy ML ideas/algorithms
• Scalability: can run experiments quickly
• Portability: can run on a wide variety of platforms
• Reproducibility: easy to share and reproduce research
• Production readiness: go from research to real products
Open, standard software for general machine learning
Great for Deep Learning in particular
First released Nov 2015 with an Apache 2.0 license
http://tensorflow.org/ and https://github.com/tensorflow/tensorflow
Preprint: arxiv.org/abs/1605.08695
Updated version will appear in OSDI 2016
Strong External Adoption
[Chart comparing GitHub adoption of projects launched Nov. 2015, Sep. 2013, Jan. 2012, and Jan. 2008]
50,000+ binary installs in 72 hours, 500,000+ since November 2015
Most forked new repo on GitHub in 2015 (despite only being available in Nov '15)
Motivations
• DistBelief (our 1st system) was the first scalable deep learning system, but not as flexible as we wanted for research purposes
• Better understanding of the problem space allowed us to make some dramatic simplifications
• Define the industrial standard for machine learning
• Short-circuit the MapReduce/Hadoop inefficiency
TensorFlow: Expressing High-Level ML Computations
• Core in C++
  • Very low overhead
• Different front ends for specifying/driving the computation
  • Python and C++ today, easy to add more
[Diagram: Python and C++ front ends on top of the core TensorFlow execution system, which runs on CPU, GPU, Android, iOS, ...]
Computation is a dataflow graph
Graph of Nodes, also called Operations or ops.
[Diagram: graph with nodes MatMul, Add, Relu, Xent over inputs examples and labels, and parameters weights and biases]
Computation is a dataflow graph ... with tensors
Edges are N-dimensional arrays: Tensors
[Same diagram: tensors flowing along the edges between weights, biases, examples, labels, MatMul, Add, Relu, and Xent]
Example TensorFlow fragment
• Build a graph computing a neural net inference.

import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets('MNIST_data', one_hot=True)

x = tf.placeholder("float", shape=[None, 784])
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
y = tf.nn.softmax(tf.matmul(x, W) + b)
Computation is a dataflow graph ... with state
'Biases' is a variable. Some ops compute gradients; a '−=' op updates the biases.
[Diagram: gradient ops feed a Mul with the learning rate, whose output drives a '−=' update of the biases variable]
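As a minimal sketch (not from the talk; toy names and values), the same pattern can be written out directly: a variable, a gradient op, and an in-place '−=' update op, all as nodes in one graph:

import tensorflow as tf

# Hypothetical toy example: a variable, a loss, a gradient op, and a "-=" update op.
biases = tf.Variable(tf.zeros([10]))
target = tf.ones([10])
loss = tf.reduce_sum(tf.square(biases - target))

grad = tf.gradients(loss, [biases])[0]                # op computing d(loss)/d(biases)
learning_rate = 0.1
update = tf.assign_sub(biases, learning_rate * grad)  # the "-=" update op on 'biases'

with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    for _ in range(100):
        sess.run(update)                              # each run applies one update step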
Symbolic Differentiation
• Automatically add ops to calculate symbolic gradients of variables w.r.t. the loss function.
• Apply these gradients with an optimization algorithm

y_ = tf.placeholder(tf.float32, [None, 10])
cross_entropy = -tf.reduce_sum(y_ * tf.log(y))
opt = tf.train.GradientDescentOptimizer(0.01)
train_op = opt.minimize(cross_entropy)
Define graph and then execute it repeatedly
• Launch the graph and run the training ops in a loop

init = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init)
for i in range(1000):
    batch_xs, batch_ys = mnist.train.next_batch(100)
    sess.run(train_op, feed_dict={x: batch_xs, y_: batch_ys})
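A short follow-on sketch (not on the slide, but following the same MNIST tutorial conventions) of evaluating the trained model on the test set:

# Fraction of test examples where the predicted class matches the label.
correct = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
print(sess.run(accuracy, feed_dict={x: mnist.test.images, y_: mnist.test.labels}))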
Computation is a dataflow graph ... distributed
[Diagram: the update subgraph partitioned across devices (CPU and GPU 0), with the biases variable, Assign/Sub update, Add, Mul, and the learning rate]
Assign Devices to Ops
• TensorFlow inserts Send/Recv ops to transport tensors across devices
• Recv ops pull data from Send ops
[Diagram: the partitioned graph with a Send/Recv pair inserted on every edge that crosses the CPU/GPU 0 boundary]
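As a small sketch (not from the talk; it assumes a machine with one GPU), explicit device placement in the Python API looks like this, and the Send/Recv pair for the cross-device edge is inserted by TensorFlow automatically:

import tensorflow as tf

# Pin the variable to the CPU and the compute to the GPU.
with tf.device("/cpu:0"):
    biases = tf.Variable(tf.zeros([10]))

with tf.device("/gpu:0"):
    # 'biases' lives on the CPU, so TensorFlow adds a Send op on the CPU and a
    # Recv op on the GPU to move its value across the device boundary.
    activations = tf.nn.relu(tf.random_normal([32, 10]) + biases)

with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    sess.run(tf.initialize_all_variables())
    sess.run(activations)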
Experiment Turnaround Time and Research Productivity
• Minutes, hours:
  • Interactive research! Instant gratification!
• 1-4 days:
  • Tolerable
  • Interactivity replaced by running many experiments in parallel
• 1-4 weeks:
  • High value experiments only
  • Progress stalls
• >1 month:
  • Don't even try
Data Parallelism
[Diagram sequence: parameter servers hold the model parameters p; many model replicas each process a shard of the data. Each replica fetches the current parameters p, computes a gradient ∆p, and sends it to the parameter servers, which apply the update p' = p + ∆p. The replicas then fetch p', compute ∆p', the servers apply p'' = p' + ∆p', and so on.]
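A minimal sketch (hypothetical host names and ports; not code from the talk) of how this parameter-server pattern is commonly expressed in TensorFlow, with variables placed on "ps" tasks and each "worker" task building a replica that trains against them:

import tensorflow as tf

# Hypothetical cluster: one parameter-server task and two worker tasks.
cluster = tf.train.ClusterSpec({
    "ps":     ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})
# Each process starts a server for its own task; this process is worker 0.
server = tf.train.Server(cluster, job_name="worker", task_index=0)

with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    # Variables are placed on /job:ps; the remaining ops stay on this worker.
    x = tf.placeholder(tf.float32, [None, 784])
    y_ = tf.placeholder(tf.float32, [None, 10])
    W = tf.Variable(tf.zeros([784, 10]))
    b = tf.Variable(tf.zeros([10]))
    y = tf.nn.softmax(tf.matmul(x, W) + b)
    cross_entropy = -tf.reduce_sum(y_ * tf.log(y))
    # Asynchronous data parallelism: each worker applies its gradient updates
    # to the shared parameters as soon as they are computed.
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(cross_entropy)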
Distributed training mechanisms
Graph structure and low-level graph primitives (queues) allow us to play with synchronous vs. asynchronous update algorithms.
Cross-process communication is the same!
• Communication across machines over the network is abstracted identically to cross-device communication.
[Diagram: the same partitioned graph, now split between /job:worker/cpu:0 and /job:ps/gpu:0, with Send/Recv pairs on the cross-process edges]
No specialized parameter server subsystem!
Image Model Training Time
[Chart: training time in hours for 1 GPU, 10 GPUs, and 50 GPUs]
2.6 hours vs. 79.3 hours (30.5x)
Sync converges faster (time to accuracy)
[Chart: 40 hours vs. 50 hours]
• Synchronous updates (with backup workers) train to higher accuracy faster
• Better scaling to more workers (less loss of accuracy)
Revisiting Distributed Synchronous SGD, Jianmin Chen, Rajat Monga, Samy Bengio, Rafal Jozefowicz, ICLR Workshop 2016, arxiv.org/abs/1604.00981
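A sketch (not code from the paper; the replica counts are illustrative, and it assumes the cross_entropy loss from the earlier fragment) of expressing synchronous updates with backup workers by wrapping the plain optimizer:

# Aggregate gradients from 40 replicas per step even though 44 replicas are
# running, so the slowest 4 act as backup workers and stragglers are dropped.
opt = tf.train.GradientDescentOptimizer(0.01)
sync_opt = tf.train.SyncReplicasOptimizer(
    opt, replicas_to_aggregate=40, total_num_replicas=44)
train_op = sync_opt.minimize(cross_entropy)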
General Computations
Although we originally built TensorFlow for our uses around deep neural networks, it's actually quite flexible
Wide variety of machine learning and other kinds of numeric computations easily expressible in the computation graph model
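As an illustrative sketch (not from the talk), even a plain numeric computation with no neural net, such as a Monte Carlo estimate of pi, fits the same graph-and-session model:

import tensorflow as tf

# Sample points in the unit square and count the fraction inside the unit circle.
n = 1000000
points = tf.random_uniform([n, 2])
inside = tf.less(tf.reduce_sum(tf.square(points), 1), 1.0)
pi_estimate = 4.0 * tf.reduce_mean(tf.cast(inside, tf.float32))

with tf.Session() as sess:
    print(sess.run(pi_estimate))   # prints roughly 3.14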
Runs on a Variety of Platforms
• phones
• single machines (CPU and/or GPUs)
• distributed systems of 100s of machines and/or GPU cards
• custom ML hardware
Trend: Much More Heterogeneous Hardware
General purpose CPU performance scaling has slowed significantly
Specialization of hardware for certain workloads will be more important
Tensor Processing Unit
Custom machine learning ASIC
In production use for >16 months: used on every search query, used for the AlphaGo match, ...
See Google Cloud Platform blog: "Google supercharges machine learning tasks with TPU custom chip", by Norm Jouppi, May 2016