Parallax: Sparsity-aware Data Parallel Training of Deep Neural Networks EuroSys 2019 Soojeong Kim, Gyeong-In Yu, Hojin Park, Sungwoo Cho, Eunji Jeong, Hyeonmin Ha, Sanha Lee, Joo Seong Jeong, Byung-Gon Chun Seoul National University, Software Platform Lab Presented by Youhui Bai, ADSL, USTC, July 3rd, 2019
Outline • Background • Sparsity identification • Sparsity-aware architecture • Evaluation • Conclusion ADSL
How does deep learning work? Deep Neural Networks (DNNs) ADSL
Why Distributed DNN Training? • Limited computing power on a single node • Dataset: ImageNet, 1.2 million training samples • Oper = total FLOPs needed to process the dataset • NVIDIA GeForce GTX 1080 Ti peak speed = 11.34 TFLOPS (float32) • Estimated time = Oper / speed Source: https://github.com/albanie/convnet-burden ADSL
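To make the Time = Oper / speed estimate concrete, here is a rough back-of-the-envelope sketch in Python; the per-image FLOP count, epoch count, and utilization factor are illustrative assumptions, not numbers from the slide:

```python
# Rough single-GPU training-time estimate: Time = total FLOPs / sustained speed.
# Per-image FLOPs, epochs, and utilization are assumed values for illustration.
SAMPLES = 1_200_000            # ImageNet training images
FWD_FLOPS_PER_IMAGE = 4e9      # assumed ~4 GFLOPs forward pass (ResNet-50-class model)
FWD_BWD_FACTOR = 3             # forward + backward roughly 3x the forward cost
EPOCHS = 90                    # assumed training schedule
PEAK_FLOPS = 11.34e12          # GTX 1080 Ti, float32
UTILIZATION = 0.5              # assume half of peak is sustained in practice

total_flops = SAMPLES * FWD_FLOPS_PER_IMAGE * FWD_BWD_FACTOR * EPOCHS
hours = total_flops / (PEAK_FLOPS * UTILIZATION) / 3600
print(f"~{hours:.0f} GPU-hours on a single card")
```

Even under these optimistic assumptions, a single card needs days for one full training run, which motivates distributed training.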
Model vs. Data parallelism • Data parallelism is the primary choice. • But how do we efficiently exchange updates to the global model? ADSL
Ring allreduce vs. parameter server • Ring-allreduce has better network bandwidth utilization Parameter server [1] Ring-allreduce [2] [1] Mu Li, David G. Andersen, et al. Scaling distributed machine learning with the parameter server. In OSDI, volume 14, pages 583–598, 2014. [2] https://developer.nvidia.com/nccl, NVIDIA Collective Communications Library Research status @ ADSL
Ring-allreduce – Concept
(Diagram: three nodes each start with gradient chunks a, b, c; scatter-reduce circulates and accumulates chunks until each node owns one fully reduced chunk, then allgather circulates the reduced chunks until every node holds a1+a2+a3, b1+b2+b3, c1+c2+c3.)
Let N be the number of nodes and M the message size. Send cost per node:
scatter-reduce: M(N-1)/N, allgather: M(N-1)/N, allreduce total: 2M(N-1)/N ADSL
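As a concrete illustration of the scatter-reduce and allgather phases above, here is a minimal single-process simulation in Python (a toy model of the algorithm, not an actual NCCL or network implementation):

```python
import numpy as np

# Toy in-process ring allreduce, mirroring the slide's a/b/c chunks.
# Per-node bytes sent over both phases: 2 * M * (N - 1) / N.
def ring_allreduce(node_data):
    n = len(node_data)
    # chunks[r][c] = chunk c currently held by node r
    chunks = [list(np.array_split(d.astype(float), n)) for d in node_data]

    # Scatter-reduce: after n-1 steps, node r owns the fully reduced chunk (r+1) % n.
    for step in range(n - 1):
        sends = [(r, (r - step) % n, chunks[r][(r - step) % n].copy()) for r in range(n)]
        for r, c, payload in sends:
            chunks[(r + 1) % n][c] += payload

    # Allgather: circulate the reduced chunks so every node ends with all of them.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n, chunks[r][(r + 1 - step) % n].copy()) for r in range(n)]
        for r, c, payload in sends:
            chunks[(r + 1) % n][c] = payload

    return [np.concatenate(chunks[r]) for r in range(n)]

grads = [np.ones(6) * (r + 1) for r in range(3)]   # three nodes with gradients 1, 2, 3
print(ring_allreduce(grads))                        # every node ends with [6. 6. 6. 6. 6. 6.]
```

Each node only ever talks to its ring neighbor, which is what gives the bandwidth-friendly 2M(N-1)/N cost.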
Outline • Background • Sparsity identification and impact • Sparsity-aware architecture • Evaluation • Conclusion ADSL
Identify the sparsity – BERT [1] as an example
(Diagram: BERT architecture — word, token, and position embeddings feed a stack of encoders (encoder 0 … encoder 11), each with self-attention and feed-forward layers, followed by the pooler and a classification head.)
[1] Devlin J., Chang M. W., Lee K., et al. BERT: Pre-training of deep bidirectional transformers for language understanding. 2018. ADSL
Word embedding of BERT
Let B be the batch size, L the length of the input sentence, and |V| the vocabulary size. Only the rows looked up in a step receive gradients, so the utilization of the vocabulary is at most:
(B × L) / |V| ADSL
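A tiny numeric sketch of this vocabulary-utilization bound; the batch size and sequence length are assumed values, while 30,522 is BERT's WordPiece vocabulary size:

```python
# Upper bound on embedding rows touched per step: (B * L) / |V|.
B = 32        # batch size (assumed)
L = 128       # input sequence length (assumed)
V = 30_522    # BERT WordPiece vocabulary size

utilization = min(1.0, B * L / V)
print(f"at most {utilization:.1%} of embedding rows get non-zero gradients")  # ~13.4%
```

In practice the fraction is even smaller because tokens repeat within a batch, so the embedding gradient is highly sparse.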
Sparse and dense models Taking advantage of sparsity to reduce the communication overhead ADSL
Outline • Background • Sparsity identification • Sparsity-aware architecture • Evaluation • Conclusion ADSL
Communication cost when using PS • For a single-byte message, N machines, parameters sharded evenly across servers colocated with workers • Per worker: pull (N-1)/N, push (N-1)/N, total 2(N-1)/N • Per server: serve pulls (N-1)/N, receive pushes (N-1)/N, total 2(N-1)/N ADSL
Communication cost when using PS • For an m-byte message, traffic per machine • As worker: • send: m(N-1)/N (gradients to remote servers) • recv: m(N-1)/N (parameters from remote servers) • As server: • send: m(N-1)/N (parameters to the N-1 remote workers) • recv: m(N-1)/N (gradients from the N-1 remote workers) • Total: 2m(N-1)/N sent and 2m(N-1)/N received • Sparse message of size s (s << m): the same formulas with m replaced by s, roughly 2s per machine, independent of N (Diagram: three machines, each running one server process and one worker process) ADSL
Communication cost when using AR • Dense variables use AllReduce, sparse variables use AllGather • For an m-byte dense message, ring allreduce sends 2m(N-1)/N per machine, the same as PS • For an s-byte sparse message per worker, allgather sends s(N-1) and receives s(N-1) per machine, growing linearly with N ADSL
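A small calculator following the per-machine formulas on the two slides above; the message sizes plugged in at the bottom are illustrative assumptions:

```python
# Per-machine bytes sent per iteration, using the formulas above.
def ps_dense(m, n):         return 2 * m * (n - 1) / n   # worker + server roles combined
def ar_dense(m, n):         return 2 * m * (n - 1) / n   # ring allreduce, same as PS
def ps_sparse(s, n):        return 2 * s * (n - 1) / n   # only the touched rows move
def allgather_sparse(s, n): return s * (n - 1)           # bytes sent (same amount received)

sparse_mb = 10.0   # assumed sparse update size in MB
for n in (2, 4, 8, 16):
    print(f"N={n:2d}  PS sparse: {ps_sparse(sparse_mb, n):6.1f} MB   "
          f"AllGather sparse: {allgather_sparse(sparse_mb, n):6.1f} MB")
```

The PS cost for sparse updates stays roughly constant as machines are added, while the allgather cost grows linearly with N, which is the core of the hybrid argument on the next slide.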
Communication cost • PS is better for sparse variables • AR is better for dense variables • Hybrid? ADSL
Hybrid architecture of Parallax • one server process per machine • one worker process per GPU ADSL
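To illustrate the hybrid idea, here is a hedged sketch of the dispatch decision: dense gradients go through allreduce while sparse (indexed) updates go through a parameter-server push. The class and function names are illustrative placeholders, not the Parallax API:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SparseGrad:
    """Like TensorFlow's IndexedSlices: only the listed rows carry gradients."""
    indices: np.ndarray
    values: np.ndarray

def exchange_gradients(named_grads, allreduce, ps_push):
    """Route each gradient to the cheaper communication path (hypothetical helpers)."""
    for name, grad in named_grads.items():
        if isinstance(grad, SparseGrad):
            ps_push(name, grad.indices, grad.values)  # PS: ship only the touched rows
        else:
            allreduce(name, grad)                     # AR: full dense tensor via allreduce

# Toy usage with stand-in communication callbacks.
exchange_gradients(
    {"conv1/kernel": np.zeros((3, 3)),
     "embed": SparseGrad(np.array([5]), np.zeros((1, 8)))},
    allreduce=lambda name, g: print("allreduce", name),
    ps_push=lambda name, idx, vals: print("ps_push", name, idx),
)
```

The variable-level split is what lets one training job use the PS path and the AR path side by side.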
Partition sparse variables in PS • Partition sparse variables for load balance across servers and to fit in server memory • Pros: the partitions can be computed on in parallel • Cons: overhead for stitching the partitions back together and managing them ADSL
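A minimal sketch of row-wise partitioning of a sparse update across server shards, together with the stitching step when the rows are read back (this is the overhead the slide's "cons" refers to); modulo placement is an assumption for illustration:

```python
import numpy as np

def shard_rows(indices, values, num_shards):
    """Scatter sparse rows across shards; placement rule: row index mod num_shards."""
    shards = {s: ([], []) for s in range(num_shards)}
    for idx, row in zip(indices, values):
        s = int(idx) % num_shards
        shards[s][0].append(int(idx))
        shards[s][1].append(row)
    return shards

def stitch(shards, indices):
    """Reassemble rows from all shards in the originally requested order."""
    table = {}
    for idxs, rows in shards.values():
        table.update(zip(idxs, rows))
    return np.stack([table[int(i)] for i in indices])

idx = np.array([3, 7, 42, 100])
vals = np.random.randn(4, 8)
print(stitch(shard_rows(idx, vals, num_shards=3), idx).shape)   # (4, 8)
```

Each shard can apply its rows independently (the "pros"), while reads pay the extra bookkeeping of gathering and reordering rows (the "cons").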
Outline • Background • Sparsity identification • Sparsity-aware architecture • Evaluation • Conclusion ADSL
Experiment setups • Hardware: 8 machines • Two 18-core Intel Xeon E5-2695 @ 2.10 GHz processors, 256 GB DRAM • 6 NVIDIA GeForce TITAN Xp GPU cards • Mellanox ConnectX-4 cards with 100 Gbps InfiniBand • Software: • TensorFlow v1.6, Horovod v0.11.2 • Ubuntu 16.04, CUDA 9.0, cuDNN 7, OpenMPI 3.0.0, NCCL 2.1 • Model and datasets • Image classification: ResNet-50, Inception-v3, training on ImageNet (ILSVRC 2012) • NLP: LM on the One Billion Word Benchmark, NMT on WMT English-German ADSL
Model Convergence ADSL
Training throughput (chart: up to 2.8x speedup vs. PS) ADSL
Scalability of Parallax (scaling chart; annotations: 38.33%, 19.58%) ADSL
Outline • Background • Sparsity identification • Sparsity-aware architecture • Evaluation • Conclusion ADSL
Conclusion • Sparsity is common in popular DNN models • Different communication architectures suit different data types • Hybrid of PS and AR: PS for sparse variables, AR for dense variables ADSL
Parallax: Sparsity-aware Data Parallel Training of Deep Neural Networks Thanks & QA! July 3rd, 2019 ADSL