Large-Scale Deep Learning With TensorFlow
Jeff Dean, Google Brain team
g.co/brain
In collaboration with many other people at Google
What is the Google Brain Team?
• Research team focused on long-term artificial intelligence research
• Mix of computer systems and machine learning research expertise
• Pure ML research, and research in the context of emerging ML application areas: robotics, language understanding, healthcare, ...
• g.co/brain
We Disseminate Our Work in Many Ways
• By publishing our work: see papers at research.google.com/pubs/BrainTeam.html
• By releasing TensorFlow, our core machine learning research system, as an open-source project
• By releasing implementations of our research models in TensorFlow
• By collaborating with product teams at Google to get our research into real products
What Do We Really Want?
• Build artificial intelligence algorithms and systems that learn from experience
• Use those to solve difficult problems that benefit humanity
What do I mean by understanding?
Query: [car parts for sale]
Document 1: ... car parking available for a small fee. ... parts of our floor model inventory for sale.
Document 2: Selling all kinds of automobile and pickup truck parts, engines, and transmissions.
Example Needs of the Future
• Which of these eye images shows symptoms of diabetic retinopathy?
• Find me all rooftops in North America
• Describe this video in Spanish
• Find me all documents relevant to reinforcement learning for robotics and summarize them in German
• Find a free time for everyone in the Smart Calendar project to meet and set up a videoconference
• Robot, please fetch me a cup of tea from the snack kitchen
Growing Use of Deep Learning at Google
[Chart: number of directories containing model description files, growing rapidly over time]
Across many products/areas: Android, Apps, drug discovery, Gmail, image understanding, Maps, natural language understanding, Photos, robotics research, Speech, Translation, YouTube, ... many others ...
Important Property of Neural Networks
Results get better with more data + bigger models + more computation.
(Better algorithms, new insights, and improved techniques always help, too!)
Aside
Many of the techniques that are successful now were developed 20-30 years ago. What changed? We now have:
• sufficient computational resources
• large enough interesting datasets
Use of large-scale parallelism lets us look ahead many generations of hardware improvements, as well.
What do you want in a machine learning system?
• Ease of expression: for lots of crazy ML ideas/algorithms
• Scalability: can run experiments quickly
• Portability: can run on a wide variety of platforms
• Reproducibility: easy to share and reproduce research
• Production readiness: go from research to real products
Open, standard software for general machine learning
Great for deep learning in particular
First released Nov 2015, Apache 2.0 license
http://tensorflow.org/ and https://github.com/tensorflow/tensorflow
Preprint: arxiv.org/abs/1605.08695 (updated version will appear in OSDI 2016)
Strong External Adoption
[Chart: GitHub adoption over time for TensorFlow (launched Nov. 2015) compared with projects launched Sep. 2013, Jan. 2012, and Jan. 2008]
50,000+ binary installs in 72 hours, 500,000+ since November 2015
Most forked new repo on GitHub in 2015 (despite only being available in Nov. '15)
Motivations
• DistBelief (our 1st system) was the first scalable deep learning system, but not as flexible as we wanted for research purposes
• Better understanding of the problem space allowed us to make some dramatic simplifications
• Define the industrial standard for machine learning
• Short-circuit the MapReduce/Hadoop inefficiency
TensorFlow: Expressing High-Level ML Computations
• Core in C++: very low overhead
• Different front ends for specifying/driving the computation: Python and C++ today, easy to add more
[Diagram: Python and C++ front ends sit on top of the Core TensorFlow Execution System, which runs on CPU, GPU, Android, iOS, ...]
Computation is a dataflow graph
Graph of nodes, also called operations or ops.
[Graph diagram: examples and labels feed a graph of MatMul, Add (with weights and biases), Relu, and Xent nodes]
Computation is a dataflow graph: with tensors
Edges are N-dimensional arrays: tensors.
[Same graph diagram: examples, weights, biases, and labels flow as tensor-carrying edges into the MatMul, Add, Relu, and Xent nodes]
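A minimal sketch (not from the slides) of building such a graph from the Python front end; the layer size of 100 is an assumption, but the ops mirror the MatMul, Add, and Relu nodes in the diagram:

import tensorflow as tf
examples = tf.placeholder(tf.float32, shape=[None, 784])    # edge carrying a 2-D tensor (batch x features)
weights = tf.Variable(tf.zeros([784, 100]))                 # 2-D parameter tensor
biases = tf.Variable(tf.zeros([100]))                       # 1-D parameter tensor
hidden = tf.nn.relu(tf.matmul(examples, weights) + biases)  # MatMul -> Add -> Relu; output edge is [None, 100]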
Example TensorFlow fragment
• Build a graph computing a neural net inference.

import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)

x = tf.placeholder("float", shape=[None, 784])   # batch of flattened 28x28 images
W = tf.Variable(tf.zeros([784, 10]))             # weights and biases are stateful variables
b = tf.Variable(tf.zeros([10]))
y = tf.nn.softmax(tf.matmul(x, W) + b)           # predicted class probabilities
Computation is a dataflow graph: with state
• 'Biases' is a variable
• Some ops compute gradients
• −= updates biases
[Graph diagram: the learning rate and the gradient feed a Mul op whose result is subtracted (−=) from the biases variable]
Symbolic Differentiation
• Automatically add ops to calculate symbolic gradients of variables w.r.t. the loss function.
• Apply these gradients with an optimization algorithm.

y_ = tf.placeholder(tf.float32, [None, 10])
cross_entropy = -tf.reduce_sum(y_ * tf.log(y))
opt = tf.train.GradientDescentOptimizer(0.01)
train_op = opt.minimize(cross_entropy)
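The optimizer hides the gradient ops it adds; a more explicit sketch of the same thing (names like learning_rate and manual_train_op are mine), using tf.gradients plus in-place variable updates, matches the "−= updates biases" picture from the graph-with-state slide:

learning_rate = 0.01
grad_W, grad_b = tf.gradients(cross_entropy, [W, b])              # ops that compute symbolic gradients
manual_train_op = tf.group(W.assign_sub(learning_rate * grad_W),  # the "−=" update applied to each
                           b.assign_sub(learning_rate * grad_b))  # variable, grouped into one train op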
Define graph and then execute it repeatedly
• Launch the graph and run the training ops in a loop

init = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init)
for i in range(1000):
    batch_xs, batch_ys = mnist.train.next_batch(100)
    sess.run(train_op, feed_dict={x: batch_xs, y_: batch_ys})
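Not on the slide, but the usual next step in this MNIST example is to evaluate the trained graph in the same session:

correct = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))      # compare predicted vs. true class per example
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))    # fraction of correct predictions
print(sess.run(accuracy, feed_dict={x: mnist.test.images, y_: mnist.test.labels}))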
Computation is a dataflow graph: distributed
[Graph diagram: the same training graph split across devices, with the biases variable and its Assign/Sub update on the CPU, and the Add/Mul compute ops and learning rate on GPU 0]
Assign Devices to Ops
• TensorFlow inserts Send/Recv ops to transport tensors across devices
• Recv ops pull data from Send ops
[Diagram: a Send/Recv pair is inserted on every edge of the graph that crosses the GPU 0 / CPU boundary]
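Placement can also be requested explicitly from the front end; a small sketch (the op choice here is an assumption), after which TensorFlow adds the Send/Recv pairs shown above automatically:

with tf.device("/cpu:0"):
    biases = tf.Variable(tf.zeros([10]))     # keep the variable on the CPU
with tf.device("/gpu:0"):
    logits = tf.matmul(x, W) + biases        # compute on GPU 0; biases are Recv'd from the CPU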
Experiment Turnaround Time and Research Productivity
• Minutes, hours: interactive research! Instant gratification!
• 1-4 days: tolerable; interactivity replaced by running many experiments in parallel
• 1-4 weeks: high-value experiments only; progress stalls
• >1 month: don't even try
Data Parallelism
[Diagram, built up over several slides: many model replicas process different shards of the data in parallel against a shared set of parameter servers]
• Each replica fetches the current parameters p from the parameter servers
• The replica computes an update ∆p on its shard of data
• The parameter servers apply the update: p' = p + ∆p
• Replicas then fetch the updated parameters p', compute ∆p', the servers apply p'' = p' + ∆p', and so on
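A minimal sketch (not from the slides) of setting this up in TensorFlow with between-graph replication: each worker builds the same graph, and tf.train.replica_device_setter pins the variables to the parameter-server job. The cluster addresses, sizes, and the reuse of the x and y_ placeholders from the earlier fragment are assumptions.

cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],                  # parameter servers hold p
    "worker": ["worker0.example.com:2222",
               "worker1.example.com:2222"]})         # replicas compute ∆p on their data shards
server = tf.train.Server(cluster, job_name="worker", task_index=0)
with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    W = tf.Variable(tf.zeros([784, 10]))             # variables placed on /job:ps
    b = tf.Variable(tf.zeros([10]))
    y = tf.nn.softmax(tf.matmul(x, W) + b)           # compute ops placed on this worker
    cross_entropy = -tf.reduce_sum(y_ * tf.log(y))
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(cross_entropy)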
Distributed training mechanisms
Graph structure and low-level graph primitives (queues) allow us to play with synchronous vs. asynchronous update algorithms.
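One way to express the synchronous variant in TensorFlow (a sketch, not from the slides; the replica counts are assumptions) is the tf.train.SyncReplicasOptimizer wrapper, which aggregates gradients from the fastest replicas each step, so slow or backup workers don't block progress:

opt = tf.train.GradientDescentOptimizer(0.01)
sync_opt = tf.train.SyncReplicasOptimizer(opt,
                                          replicas_to_aggregate=4,   # gradients averaged per step
                                          total_num_replicas=5)      # 5 workers, so 1 acts as backup
global_step = tf.Variable(0, trainable=False, name="global_step")
train_op = sync_opt.minimize(cross_entropy, global_step=global_step)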
Cross-process communication is the same!
• Communication across machines over the network is abstracted identically to cross-device communication.
[Diagram: the same graph split across /job:worker/cpu:0 and /job:ps/gpu:0, with Send/Recv pairs on every cross-process edge]
No specialized parameter server subsystem!
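In the front end this just means device names gain a job (and task) prefix; a sketch, with the task indices and op choice being assumptions:

with tf.device("/job:ps/task:0/cpu:0"):
    biases = tf.Variable(tf.zeros([10]))      # lives in the parameter-server process
with tf.device("/job:worker/task:0/gpu:0"):
    logits = tf.matmul(x, W) + biases         # Send/Recv now cross the network, via the same mechanism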
Image Model Training Time
[Chart: training time in hours for 1, 10, and 50 GPUs]
50 GPUs vs. 1 GPU: 2.6 hours vs. 79.3 hours (30.5X)
Sync converges faster (time to accuracy)
[Chart: test accuracy vs. training time for synchronous and asynchronous updates: 40 hours vs. 50 hours]
• Synchronous updates (with backup workers) train to higher accuracy faster
• Better scaling to more workers (less loss of accuracy)
Revisiting Distributed Synchronous SGD, Jianmin Chen, Rajat Monga, Samy Bengio, Rafal Jozefowicz, ICLR Workshop 2016, arxiv.org/abs/1604.00981
General Computations
Although we originally built TensorFlow for our uses around deep neural networks, it's actually quite flexible.
A wide variety of machine learning and other kinds of numeric computations are easily expressible in the computation graph model.
Runs on a Variety of Platforms
• phones
• single machines (CPU and/or GPUs)
• distributed systems of 100s of machines and/or GPU cards
• custom ML hardware
Trend: Much More Heterogeneous Hardware
General-purpose CPU performance scaling has slowed significantly.
Specialization of hardware for certain workloads will become more important.
Tensor Processing Unit
• Custom machine learning ASIC
• In production use for >16 months: used on every search query, used for the AlphaGo match, ...
• See the Google Cloud Platform blog post "Google supercharges machine learning tasks with TPU custom chip", by Norm Jouppi, May 2016