
Sparsity-Aware Data Parallelism in Deep Neural Networks

Learn about the impact of sparsity in DNNs and how to optimize communication costs through sparsity-aware architectures. Discover the hybrid approach of utilizing parameter servers and allreduce methods for efficient distributed training.



Presentation Transcript


  1. Parallax: Sparsity-aware Data Parallel Training of Deep Neural Networks EuroSys 2019 Soojeong Kim, Gyeong-In Yu, Hojin Park, Sungwoo Cho, Eunji Jeong, Hyeonmin Ha, Sanha Lee, Joo Seong Jeong, Byung-Gon Chun Seoul National University, Software Platform Lab Presented by Youhui Bai, USTC, ADSL July 3rd, 2019 ADSL

  2. Outline • Background • Sparsity identification • Sparsity-aware architecture • Evaluation • Conclusion ADSL

  3. How does deep learning work? Deep Neural Networks (DNNs) ADSL

  4. Train your models first! ADSL

  5. Why Distributed DNN Training? • Limited computing power on a single node • Dataset: ImageNet, 1.2 million samples • Oper = total FLOPs required for training • NVIDIA GeForce GTX 1080 Ti peak = 11.34 TFLOPS (float32) • Time = Oper / speed (see the back-of-envelope sketch below) Source: https://github.com/albanie/convnet-burden ADSL
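
  The estimate on this slide can be made concrete with a back-of-envelope calculation. The sketch below is illustrative only: the ~4 GFLOPs-per-image figure for a ResNet-50 forward pass comes from convnet-burden, while the 3x forward+backward factor, 30% sustained utilization, and 90-epoch schedule are assumptions, not numbers from the slides.

    # Back-of-envelope single-GPU training-time estimate (hedged sketch).
    DATASET_SIZE = 1_200_000        # ImageNet training images
    FLOPS_PER_FORWARD = 4e9         # ~4 GFLOPs per ResNet-50 forward pass (assumed)
    FWD_BWD_FACTOR = 3              # backward pass costs roughly 2x the forward (assumed)
    PEAK_FLOPS = 11.34e12           # GTX 1080 Ti float32 peak
    UTILIZATION = 0.3               # assumed sustained fraction of peak
    EPOCHS = 90                     # typical ImageNet schedule (assumed)

    total_flops = DATASET_SIZE * FLOPS_PER_FORWARD * FWD_BWD_FACTOR * EPOCHS
    seconds = total_flops / (PEAK_FLOPS * UTILIZATION)
    print(f"estimated single-GPU training time: {seconds / 86400:.1f} days")

  Even under these rough assumptions the run takes on the order of days on a single GPU, which is the motivation for distributing the training.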

  6. Model vs. Data parallelism • Data parallelism is the primary choice. • But how do we efficiently exchange updates to the global model? ADSL

  7. Ring allreduce vs. parameter server • Ring-allreduce has better network bandwidth utilization Parameter server [1] vs. Ring-allreduce [2] [1] Mu Li, David G. Andersen, et al. Scaling distributed machine learning with the parameter server. In OSDI, volume 14, pages 583–598, 2014. [2] https://developer.nvidia.com/nccl, NVIDIA Collective Communications Library ADSL

  8. Ring-allreduce - Concept The gradient is split into chunks (a, b, c in the figure), one per node. In the scatter-reduce phase each node forwards partial sums around the ring until every node holds the complete sum of one chunk; in the allgather phase those completed chunks circulate until every node holds a1+a2+a3, b1+b2+b3, c1+c2+c3. Let N be the number of nodes and m the message size. Each phase sends (N-1)·m/N bytes per node, so the allreduce send cost per node is 2(N-1)·m/N, roughly 2m. (Figure: three nodes exchanging chunks a, b, c through scatter-reduce followed by allgather; a minimal simulation follows below.) ADSL
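
  The two phases can be simulated in a few lines of NumPy. This is an illustrative sketch of the standard ring schedule, not NCCL's implementation; three nodes and a six-element gradient mirror the a/b/c example in the figure.

    import numpy as np

    def ring_allreduce(node_grads):
        """Simulate ring allreduce over one gradient vector per node."""
        n = len(node_grads)
        # Each node splits its gradient into n chunks.
        chunks = [np.array_split(np.asarray(g, dtype=float), n) for g in node_grads]

        # Scatter-reduce: in step t, node i sends chunk (i - t) mod n to its
        # right neighbour, which accumulates it. All sends in a step happen
        # simultaneously, so outgoing payloads are snapshotted first.
        for t in range(n - 1):
            outgoing = [(i, (i - t) % n, chunks[i][(i - t) % n].copy())
                        for i in range(n)]
            for i, c, payload in outgoing:
                chunks[(i + 1) % n][c] += payload

        # Allgather: the fully reduced chunks (chunk (i + 1) mod n lives on
        # node i after scatter-reduce) circulate around the ring.
        for t in range(n - 1):
            outgoing = [(i, (i + 1 - t) % n, chunks[i][(i + 1 - t) % n].copy())
                        for i in range(n)]
            for i, c, payload in outgoing:
                chunks[(i + 1) % n][c] = payload

        return [np.concatenate(ch) for ch in chunks]

    # Three nodes, as in the slide's a/b/c illustration.
    grads = [np.arange(6, dtype=float) + 10 * k for k in range(3)]
    reduced = ring_allreduce(grads)
    assert all(np.allclose(r, grads[0] + grads[1] + grads[2]) for r in reduced)
    print(reduced[0])    # every node ends with the same summed gradient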

  9. Outline • Background • Sparsity identification and impact • Sparsity-aware architecture • Evaluation • Conclusion ADSL

  10. Identify the sparsity – BERT [1] as an example (Figure: BERT architecture – word, position, and token embeddings feed encoder 0 through encoder 11, each with self-attention and feed-forward layers, followed by the pooler and the classification head.) [1] Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding. 2018. ADSL

  11. Word embedding of BERT Let b be the batch size and l the length of an input sentence. One step touches at most b·l distinct rows of the embedding table, so the utilization of the vocabulary V is at most b·l / |V|, a small fraction for a vocabulary of tens of thousands of tokens; the embedding gradient is therefore sparse (see the sketch below). ADSL
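
  A quick sketch makes the bound concrete. The sizes below are assumptions for illustration (a ~30k WordPiece vocabulary, batch size 32, sequence length 128, uniformly random tokens), not the paper's configuration; real token distributions are skewed, so actual utilization is even lower.

    import numpy as np

    VOCAB_SIZE = 30_522     # BERT-base-like WordPiece vocabulary (assumed)
    BATCH_SIZE = 32         # b
    SEQ_LEN = 128           # l

    rng = np.random.default_rng(0)
    token_ids = rng.integers(0, VOCAB_SIZE, size=(BATCH_SIZE, SEQ_LEN))

    # Only rows of the embedding table that appear in the batch receive a
    # non-zero gradient, so the embedding gradient is row-sparse.
    touched = np.unique(token_ids).size
    print(f"upper bound b*l/|V| = {BATCH_SIZE * SEQ_LEN / VOCAB_SIZE:.2%}")
    print(f"rows actually touched = {touched / VOCAB_SIZE:.2%}")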

  12. Sparse and dense models Taking advantage of sparsity to reduce the communication overhead ADSL

  13. Outline • Background • Sparsity identification • Sparsity-aware architecture • Evaluation • Conclusion ADSL

  14. Communication cost when using PS • For a 1-byte message with N workers • Per worker: pull 1, push 1, total 2 • At the server: pull N, push N, total 2N ADSL

  15. Communication cost when using PS • For an m-byte message, traffic per machine (each machine co-locates one server holding m/N of the variable and one worker) • As worker: send (N-1)·m/N, recv (N-1)·m/N (the locally stored shard needs no network traffic) • As server: send (N-1)·m/N, recv (N-1)·m/N • Total per machine: 2(N-1)·m/N sent and received, roughly 2m regardless of N • Sparse message: with only s non-zero bytes, every term scales with s instead of m, so the total stays around 2(N-1)·s/N (a cost-model sketch follows below) (Figure: N machines, each hosting one server and one worker.) ADSL
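
  A small cost-model sketch of the per-machine traffic just derived (illustrative only; N machines each co-locating one server shard and one worker, m dense bytes, s non-zero bytes in the sparse case; the 100 MB / 5 MB sizes below are assumed, not from the slides):

    def ps_bytes_sent_per_machine(num_machines: int, msg_bytes: float) -> float:
        """Bytes sent per machine per step under the PS architecture."""
        n = num_machines
        as_worker = (n - 1) * msg_bytes / n   # push gradients to remote shards
        as_server = (n - 1) * msg_bytes / n   # serve pulls from remote workers
        return as_worker + as_server          # ~ 2 * msg_bytes, flat in n

    m, s = 100e6, 5e6                         # assumed dense / sparse sizes
    for n in (4, 8, 16, 64):
        print(f"N={n:3d}  dense {ps_bytes_sent_per_machine(n, m) / 1e6:6.1f} MB"
              f"  sparse {ps_bytes_sent_per_machine(n, s) / 1e6:5.1f} MB")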

  16. Communication cost when using AR • Dense variables use AllReduce: for an m-byte message, ring allreduce sends 2(N-1)·m/N per node, roughly 2m, the same order as PS • Sparse variables must fall back to AllGather (the non-zero indices differ across workers): each node's s non-zero bytes are replicated to every other node, costing (N-1)·s sent per node, which grows linearly with N ADSL
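
  Extending the same sketch to the collective side shows why AllGather hurts for sparse data: its per-node traffic grows linearly with N, while the PS cost stays flat. Again illustrative only, with assumed message sizes.

    def allreduce_bytes_sent(num_machines: int, dense_bytes: float) -> float:
        n = num_machines
        return 2 * (n - 1) * dense_bytes / n      # ring allreduce, ~ 2m per node

    def allgather_bytes_sent(num_machines: int, sparse_bytes: float) -> float:
        n = num_machines
        return (n - 1) * sparse_bytes             # every node's s bytes go to every other node

    def ps_bytes_sent(num_machines: int, msg_bytes: float) -> float:
        n = num_machines
        return 2 * (n - 1) * msg_bytes / n        # PS formula from the previous sketch

    m, s = 100e6, 5e6                             # assumed dense / sparse sizes
    for n in (4, 8, 16, 64):
        print(f"N={n:3d}  dense: AR {allreduce_bytes_sent(n, m) / 1e6:6.1f} MB"
              f" vs PS {ps_bytes_sent(n, m) / 1e6:6.1f} MB |"
              f" sparse: AllGather {allgather_bytes_sent(n, s) / 1e6:6.1f} MB"
              f" vs PS {ps_bytes_sent(n, s) / 1e6:5.1f} MB")

  For dense variables the two approaches move roughly the same amount of data, so AR's better bandwidth utilization wins; for sparse variables the AllGather term grows with N, which is exactly the gap the hybrid architecture targets.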

  17. Communication cost comparison • PS is cheaper for sparse variables: its per-machine cost stays flat as N grows, while AllGather's grows linearly • AR is the better fit for dense variables thanks to its balanced bandwidth utilization • PS for sparse + AR for dense: a hybrid? ADSL

  18. Hybrid architecture of Parallax • one server process per machine • one worker process per GPU ADSL
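
  A minimal sketch of the routing idea behind the hybrid architecture (hypothetical helper names, not the Parallax API): variables whose gradients are sparse (e.g. embedding tables, whose gradients arrive as TensorFlow IndexedSlices) go to the parameter servers, everything else goes through allreduce.

    from dataclasses import dataclass

    @dataclass
    class Variable:
        name: str
        grad_is_sparse: bool   # e.g. tf.IndexedSlices gradients from embedding lookups

    def partition_variables(variables):
        """Split trainable variables into (ps_vars, allreduce_vars)."""
        ps_vars = [v for v in variables if v.grad_is_sparse]
        ar_vars = [v for v in variables if not v.grad_is_sparse]
        return ps_vars, ar_vars

    model = [
        Variable("word_embedding", grad_is_sparse=True),
        Variable("encoder0/self_attention/qkv", grad_is_sparse=False),
        Variable("encoder0/feed_forward/w1", grad_is_sparse=False),
    ]
    ps_vars, ar_vars = partition_variables(model)
    print("PS:       ", [v.name for v in ps_vars])
    print("AllReduce:", [v.name for v in ar_vars])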

  19. Partition sparse variables in PS • Partition sparse variables across servers for load balance and to fit in server memory • Pros: partitions enable parallel computation on the sparse arrays • Cons: overhead for stitching the partitions back together and for managing them (see the sketch below) ADSL
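
  The sketch below illustrates the partitioning and the stitching overhead mentioned on the cons side: an embedding table split row-wise across server shards, with lookups gathered per shard and stitched back into the original order (hypothetical helpers, not the Parallax implementation).

    import numpy as np

    def partition_rows(table: np.ndarray, num_shards: int):
        """Split an embedding table into contiguous row shards."""
        return np.array_split(table, num_shards, axis=0)

    def sharded_lookup(shards, ids: np.ndarray) -> np.ndarray:
        """Look up rows from the shards and stitch results back into id order."""
        bounds = np.cumsum([0] + [s.shape[0] for s in shards])
        out = np.empty((ids.size, shards[0].shape[1]))
        for k, shard in enumerate(shards):
            mask = (ids >= bounds[k]) & (ids < bounds[k + 1])
            out[mask] = shard[ids[mask] - bounds[k]]   # gather within the shard
        return out

    vocab, dim, num_servers = 16, 4, 3
    table = np.arange(vocab * dim, dtype=float).reshape(vocab, dim)
    shards = partition_rows(table, num_servers)
    ids = np.array([0, 7, 15, 7])
    assert np.allclose(sharded_lookup(shards, ids), table[ids])
    print("stitched lookup matches the unpartitioned table")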

  20. Outline • Background • Sparsity identification • Sparsity-aware architecture • Evaluation • Conclusion ADSL

  21. Experiment setups • Hardware: 8 machines • two 18-core Intel Xeon E5-2695 @ 2.10 GHz processors, 256 GB DRAM • 6 NVIDIA GeForce TITAN Xp GPU cards • Mellanox ConnectX-4 cards with 100 Gbps InfiniBand • Software: • TensorFlow v1.6, Horovod v0.11.2 • Ubuntu 16.04, CUDA 9.0, cuDNN 7, OpenMPI 3.0.0, NCCL 2.1 • Model and datasets • image classification: ResNet-50, Inception-v3, trained on ImageNet (ILSVRC 2012) • NLP: LM on the One Billion Word Benchmark, NMT on WMT English-German ADSL

  22. Model Convergence ADSL

  23. Training throughput: up to 2.8x speedup over the PS baseline ADSL

  24. Scalability of Parallax (figure; annotated values: 38.33% and 19.58%) ADSL

  25. Effect of hybrid architecture ADSL

  26. Outline • Background • Sparsity identification • Sparsity-aware architecture • Evaluation • Conclusion ADSL

  27. Conclusion • Sparsity is common in popular DNN models • Different communication architectures suit different data types • Hybrid of PS and AR: PS for sparse variables, AR for dense variables ADSL

  28. Parallax: Sparsity-aware Data Parallel Training of Deep Neural Networks Thanks & QA! July 3rd, 2019 ADSL
