Holistic Approach to DNN Training Efficiency: Analysis and Optimizations

  1. Holistic Approach to DNN Training Efficiency: Analysis and Optimizations Gennady Pekhimenko Assistant Professor Computer Systems and Networking Group (CSNG) EcoSystem Group

  2. 1. Machine Learning Benchmarking and Analysis

  3. The dilemma of an ML researcher: try a new framework (TF, MXNet, PyTorch, …)? Change hyper-parameters? Try a new library? Buy a new GPU (V100, P100, 1080 Ti, Titan Xp, …)? Add or remove a layer? Or never mind, since you have to pay this much time anyway? Every option means waiting for hours or days.

  4. A diverse benchmark suite with state-of-the-art models • Understand performance bottlenecks in DNN training • Pin-pointing tools • Key performance metrics

  5. A diverse benchmark suite with state-of-the-art models • Understand performance bottlenecks in DNN training • Pin-pointing tools • Key performance metrics

  6. Why Do We Need an ML Benchmark Suite? There is no standard, diverse benchmark set with state-of-the-art models for DNN training. Training is different from inference; our focus is on training.

  7. Need for Benchmark Diversity (early 2017): lack of a standard, diverse benchmark set with state-of-the-art models for DNN training. • DNNs have been widely successful • Most research used only image classification and CNN models • Performance characteristics differ across different DNNs

  8. State-of-the-art Models: AlexNet (2012), VGG (2013), GoogLeNet (2014), ResNet (2015); RCNN (2014), Fast RCNN (2015), Faster RCNN (2015), YOLO (2016), YOLO v2 (2017). State-of-the-art models are constantly evolving; old models can quickly become outdated.

  9. Training Benchmarks for DNNs (TBD) (footnotes indicate available implementations: T for TensorFlow, M for MXNet, C for CNTK, P for PyTorch) https://github.com/tbd-ai/tbd-suite

  10. Our Focus: Benchmarking and Analysis. TBD (http://tbd-suite.ai): building tools to analyze ML performance/efficiency. MLPerf (https://mlperf.org/): the industry/academia de-facto standard. Our group is responsible for the reference model implementation for speech recognition (inference): DeepSpeech2 from UofT.

  11. A diverse benchmark suite with state-of-the-art models • Understand performance bottlenecks in DNN training • Pin-pointing tools • Key performance metrics

  12. Performance Metrics • Throughput: # of data samples processed per second • Compute Utilization: GPU busy time over elapsed time • FP32 Utilization: total FP32 instructions over maximum FP32 instructions • Memory Breakdown: which data structures occupy how much memory

  13. Throughput: # of data samples processed per second. Time-to-accuracy is the metric people truly care about, but it is too expensive to measure and hyper-parameter tuning plays a big role in it. Throughput is easy to measure (samples with varying sizes need extra handling), and we assume that there exists a hyper-parameter configuration that guarantees training quality.
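As a rough illustration (not TBD's actual tooling), throughput can be measured over a short timed window of a training loop. The sketch below assumes placeholder train_step and data_loader objects and a data loader that yields enough batches:

```python
import time

def measure_throughput(train_step, data_loader, warmup_iters=10, timed_iters=100):
    """Samples processed per second over a short timed window (a sketch, not TBD's tool)."""
    it = iter(data_loader)
    for _ in range(warmup_iters):          # skip warm-up (allocation, auto-tuning, ...)
        train_step(next(it))
    samples = 0
    start = time.time()
    for _ in range(timed_iters):
        batch = next(it)
        samples += len(batch)              # assumes len(batch) is the mini-batch size
        train_step(batch)
    return samples / (time.time() - start)
```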

  14. Compute Utilization: GPU busy time over elapsed time. Indicates how well non-GPU work (data loading, communication over PCIe/networking, …) overlaps with GPU computation. If the GPU is busy during intervals t1 and t2 within an elapsed time t3, then Compute Utilization = (t1 + t2) / t3.
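A minimal sketch of how compute utilization could be derived from a list of GPU kernel (start, end) timestamps, e.g., extracted from a profiler trace; the (start, end) tuple format is an assumption for illustration, not nvprof's actual schema:

```python
def compute_utilization(kernel_intervals, elapsed):
    """kernel_intervals: list of (start, end) times in seconds; elapsed: total elapsed time.
    Overlapping kernels are merged so busy time is not double-counted."""
    busy = 0.0
    cur_start = cur_end = None
    for start, end in sorted(kernel_intervals):
        if cur_end is None or start > cur_end:   # gap: close the previous busy interval
            if cur_end is not None:
                busy += cur_end - cur_start
            cur_start, cur_end = start, end
        else:                                    # overlapping/adjacent kernel: extend it
            cur_end = max(cur_end, end)
    if cur_end is not None:
        busy += cur_end - cur_start
    return busy / elapsed

# Example: busy periods t1 = 0.4 s and t2 = 0.3 s within t3 = 1.0 s -> 0.7
print(compute_utilization([(0.0, 0.4), (0.5, 0.8)], elapsed=1.0))
```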

  15. FP32/FP16/TensorCore Utilization: when the GPU is busy, how well are the GPU cores utilized? Defined as the average # of instructions executed per cycle over the maximum instructions per cycle (provided by nvprof). • Indicates speed-up potential at the kernel level • Helps identify the "straggler" kernels (usually not MatMul or CNN kernels) • Most models are trained with single-precision floats
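As an illustration only, the sketch below computes a time-weighted FP32 utilization from per-kernel records and flags low-utilization "straggler" kernels; the record fields ('name', 'time', 'fp32_util') are assumptions, not nvprof's real output format:

```python
def fp32_utilization(kernels, straggler_threshold=0.2):
    """kernels: list of dicts with 'name', 'time' (seconds), and 'fp32_util' in [0, 1]
    (achieved FP32 instructions per cycle / maximum per cycle)."""
    total_time = sum(k["time"] for k in kernels)
    weighted = sum(k["fp32_util"] * k["time"] for k in kernels) / total_time
    stragglers = [k["name"] for k in kernels if k["fp32_util"] < straggler_threshold]
    return weighted, stragglers

util, stragglers = fp32_utilization([
    {"name": "sgemm",       "time": 0.60, "fp32_util": 0.85},
    {"name": "elementwise", "time": 0.15, "fp32_util": 0.10},
])
print(util, stragglers)   # ~0.70, ['elementwise']
```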

  16. Memory Breakdown • Goal: understand which data structures contribute how much to the total memory consumption • Data structures: weights and gradients (allocated before training starts); activations, workspace, and dynamic allocations (allocated and released during training)
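A rough sketch, assuming PyTorch, of how the weight and gradient portions of such a breakdown could be measured; activations, workspace, and dynamic allocations require framework instrumentation (as in the memory profiler flow later) and are not covered here:

```python
import torch
import torch.nn as nn

def weight_grad_bytes(model: nn.Module):
    """Bytes held by weights and by their gradients (once materialized)."""
    weights = sum(p.numel() * p.element_size() for p in model.parameters())
    grads = sum(p.grad.numel() * p.grad.element_size()
                for p in model.parameters() if p.grad is not None)
    return {"weights": weights, "gradients": grads}

model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10))
model(torch.randn(32, 1024)).sum().backward()   # materializes gradients
print(weight_grad_bytes(model))
```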

  17. A diverse benchmark suite with state-of-the-art models • Understand performance bottlenecks in DNN training • Pin-pointing tools • Key performance metrics

  18. Toolchain: How to get the required metrics?

  19. Toolchain: sampling, setup, warmup • Sampling: fully training a DNN takes days or weeks, but the training algorithm is iterative and each iteration follows the same logic, so a short sampled period suffices (see the sketch below) • Setup: need to verify training accuracy; different frameworks may use different hyper-parameters for the same models • Skipping warmup: before training stabilizes, a framework needs to initialize its dataflow, allocate memory, and perform auto-tuning
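To illustrate why sampling a short window is reasonable, here is a hedged sketch (placeholder train_step and data_loader, not TBD code) that skips warm-up iterations and checks how stable per-iteration times are:

```python
import time
import statistics

def iteration_time_stats(train_step, data_loader, warmup_iters=20, sample_iters=50):
    """Per-iteration time after warm-up; a low coefficient of variation suggests that
    a short sampled window is representative of the full training run."""
    it = iter(data_loader)
    for _ in range(warmup_iters):       # dataflow init, memory allocation, auto-tuning
        train_step(next(it))
    times = []
    for _ in range(sample_iters):
        batch = next(it)
        start = time.time()
        train_step(batch)
        times.append(time.time() - start)
    mean = statistics.mean(times)
    return mean, statistics.stdev(times) / mean
```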

  20. Toolchain: Overview. Setup makes the implementations comparable; warm-up and auto-tuning are excluded from data collection, and a short training period is sampled. The outputs of the DNN model implementation map to metrics as follows: training logs -> training throughput; memory profiler -> memory consumption; vTune -> CPU utilization; nvprof (.nvvp files) -> compute utilization and FP32 utilization.

  21. Memory Profiler Flow: the DNN model implementation runs on a modified framework, which produces training logs with functionality tagged; a parser program then attributes memory consumption to the data structures: weights, gradients, activations, workspace, and dynamic allocations.
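A toy sketch of the parser side of such a flow; the log-line format ("<category> alloc <bytes>") is invented for illustration and is not the format actually emitted by the modified framework:

```python
import re
from collections import defaultdict

CATEGORIES = {"weights", "gradients", "activations", "workspace", "dynamic"}

def parse_memory_log(lines):
    """Aggregate allocated bytes per tagged data-structure category."""
    totals = defaultdict(int)
    pattern = re.compile(r"^(\w+)\s+alloc\s+(\d+)\s*$")
    for line in lines:
        m = pattern.match(line.strip())
        if m and m.group(1) in CATEGORIES:
            totals[m.group(1)] += int(m.group(2))
    return dict(totals)

print(parse_memory_log(["weights alloc 4096", "activations alloc 8192",
                        "activations alloc 1024"]))
# {'weights': 4096, 'activations': 9216}
```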

  22. A diverse benchmark suite with state-of-the-art models • Understand performance bottlenecks in DNN training • Pin-pointing tools • Key performance metrics

  23. Experimental setup • All results are carried out in a single-machine, single-GPU environment • OS: Ubuntu 16.04 • Libraries: CUDA 9, cuDNN 7 • Frameworks: TensorFlow v1.8, MXNet v1.1.0, PyTorch v0.4.0, CNTK v2.0 • GPUs: Quadro P4000, 1080 Ti, Titan Xp, P100, 2080 Ti, Titan V, V100 • CPU: 28-core Intel Xeon E5-2680 • Networking: 1 Gb/s Ethernet, 100 Gb/s InfiniBand, 16 GB/s PCIe

  24. Results: Training Quality. The expected training accuracy is reached.

  25. Results: Throughput. Mini-batch size matters for training throughput; performance improves with larger mini-batches.

  26. Results: Throughput Diversity. The performance of RNN-based models does not saturate within the GPU memory budget.

  27. Results Analysis: GPU Compute Utilization. Mini-batch size should be large enough to keep the GPU busy; GPU compute utilization is low for LSTM-based models.

  28. Results Analysis: GPU FP32 Utilization. Mini-batch size should be large enough to achieve high FP utilization.

  29. Hardware Sensitivity. A better GPU does NOT always mean better performance and utilization; we need better system designs and libraries.

  30. GPU Memory Profiling. Feature maps are the dominant GPU memory consumers.

  31. Results: Distributed Training. Training ResNet-50 on MXNet (left: multi-machine; right: multi-GPU on a single machine). Ethernet (eth) bandwidth = 1 Gb/s; InfiniBand (ib) bandwidth = 100 Gb/s; PCIe bandwidth = 16 GB/s. Networking bandwidth should be large enough for weight/gradient updates.

  32. Project Status • GitHub repo: github.com/tbd-ai • TBD project website is live: tbd-suite.ai

  33. TBD Summary • A new benchmark suite for DNN training • Currently, 7 application domains, 9 state-of-the-art models • Comes with tools to analyze: • performance, efficiency, memory, and network consumption • Part of the community effort (MLPerf) to standardize benchmarking for machine learning

  34. 2. Gist: Efficient Data Encoding for Deep Neural Network Training 

  35. DNN Training vs. Inference. Step 1, the forward pass, makes a prediction; step 2, the backward pass, calculates error gradients. The intermediate layer outputs (feature maps) of layers L1 … Ln are generated in the forward pass and used again in the backward pass. DNN training therefore requires stashing feature maps for the backward pass (not required in inference).

  36. Training Deeper Networks. Goal: train larger networks on a single GPU by reducing the memory footprint. Feature maps are a major consumer of GPU memory; a larger mini-batch size can lead to a crash/out-of-memory.

  37. Limitations of Prior Work • Prior work focuses on DNN inference, i.e., on weights • It applies pruning, quantization, and Huffman encoding • However, weights are a small fraction of the memory footprint • Additionally, these techniques are not well suited for training: training requires frequent weight updates, and they map poorly onto GPU hardware

  38. Our Insight. On the timeline of a training iteration, a feature map is generated and first used in the forward pass (layers Lx, Ly, Lz) and used a second time in the backward pass, so there is a large temporal gap between its two uses. Baseline: the feature map is stored in FP32 format the whole time. Our approach: Encode() it into a smaller format after the first use and Decode() it just before the second use.
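Gist uses layer-specific encodings (binarization, sparse storage, delayed precision reduction); the sketch below only illustrates the encode-between-uses idea using PyTorch's saved-tensor hooks (PyTorch 1.10+), with FP16 as a stand-in "smaller format". This is not Gist's implementation, and it also compresses saved weight copies, whereas Gist targets feature maps specifically:

```python
import torch
import torch.nn as nn

def pack(t):
    # Encode(): stash the saved copy in FP16; the forward computation itself stays FP32.
    return t.to(torch.float16) if t.dtype == torch.float32 else t

def unpack(t):
    # Decode(): restore FP32 right before the tensor's 2nd use in the backward pass.
    return t.to(torch.float32) if t.dtype == torch.float16 else t

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))
x = torch.randn(64, 512)

with torch.autograd.graph.saved_tensors_hooks(pack, unpack):
    loss = model(x).sum()   # forward pass: saved tensors are encoded as they are stashed
loss.backward()             # backward pass: they are decoded for their 2nd use
```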

  39. Layer-Specific Encodings. Key idea: use layer-specific compression, which can be both fast and efficient, and can even be lossless (usually difficult for generic FP32 data).

  40. ReLU Importance. CNTK profiling shows that a significant fraction of the memory footprint is due to ReLU layers.

  41. ReLU -> Pool. ReLU backward propagation only needs to know whether each output element was positive, so the stashed feature map can be binarized to a 1-bit representation (lossless).
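A hedged sketch of the binarization idea in PyTorch (not Gist's CNTK implementation): a custom ReLU that saves only a boolean mask for the backward pass instead of the full FP32 output. Note that PyTorch stores bool tensors as one byte per element, whereas Gist packs the mask into actual single bits:

```python
import torch

class BinarizedReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        mask = x > 0
        ctx.save_for_backward(mask)   # 1 bit of information per element
        return x * mask               # standard ReLU output

    @staticmethod
    def backward(ctx, grad_output):
        (mask,) = ctx.saved_tensors
        return grad_output * mask     # pass gradients only where the input was positive

x = torch.randn(4, 8, requires_grad=True)
BinarizedReLU.apply(x).sum().backward()
print(torch.equal(x.grad, (x > 0).float()))   # True
```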

  42. ReLU/Pool -> Conv. Feature maps after ReLU contain many zeros, so the stashed copy can be kept in a sparse format and converted back to dense for computation: sparse storage, dense compute (lossless).
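A tiny illustration of sparse storage with dense compute using PyTorch's built-in COO sparse tensors; Gist uses its own GPU-friendly sparse format, so this is only a conceptual stand-in:

```python
import torch

activation = torch.relu(torch.randn(256, 256))   # roughly half zeros for random input;
                                                 # real ReLU feature maps are often sparser
stored = activation.to_sparse()                  # sparse storage between the two uses
# ... forward pass continues; the backward pass eventually needs the value again ...
restored = stored.to_dense()                     # dense compute for the 2nd use

assert torch.equal(activation, restored)
nonzeros = (activation != 0).sum().item()
print(f"nonzeros: {nonzeros} / {activation.numel()}")
```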

  43. Opportunity for Lossy Encoding: precision reduction. Reducing precision in the forward pass quickly degrades accuracy as the error propagates through layers L1–L4 (AlexNet does not train at 16 bits). Restricting precision reduction to the 2nd use, in the backward pass, results in aggressive bit savings with no effect on accuracy.

  44. Delayed Precision Reduction (lossy): rather than training with reduced precision throughout, precision is reduced only for the stashed copy used in the backward pass.

  45. Proposed System Architecture: Gist. Gist takes the DNN's execution graph, identifies encoding opportunities, and produces a modified execution graph, handling memory allocation for the new data structures and efficient memory sharing.

  46. Compression Ratio: up to 2X compression ratio with minimal performance overhead.

  47. Gist Summary • Systematic memory breakdown analysis for image classification • Layer-specific lossless encodings: binarization and sparse storage/dense compute • Aggressive lossy encodings with delayed precision reduction • Footprint reduction measured on real systems: up to 2X reduction with only 4% performance overhead; further optimizations yield more than 4X reduction

  48. 3. Priority-based Parameter Propagation (P3) for Distributed DNN Training 

  49. Networking Profiler for DNN Training: there is significant communication, especially from workers to the parameter server.

  50. Network Utilization: ResNet-50 on MXNet over 4 Gb/s Ethernet shows huge network underutilization, with significant spikes.
