This article presents a holistic approach to improving the efficiency of deep neural network (DNN) training. It includes analysis techniques, optimizations, and benchmarking tools to identify and address performance bottlenecks. The focus is on diverse benchmarking with state-of-the-art models and key performance metrics.
Holistic Approach to DNN Training Efficiency: Analysis and Optimizations Gennady Pekhimenko Assistant Professor Computer Systems and Networking Group (CSNG) EcoSystem Group
Overview • Machine Learning Benchmarking and Analysis • Gist: Efficient Data Encoding for DNN Training • EcoRNN: Efficient Training of LSTM RNNs on GPUs • Priority-based Parameter Propagation for Distributed DNN Training
An ML researcher, waiting hours or days for training to finish, faces many choices: try a new framework (TF, MXNet, PyTorch, …)? Change hyper-parameters? Try a new library? Buy a new GPU (V100, P100, 1080 Ti, Titan Xp, …)? Add or remove a layer? Or never mind, since you have to pay this much time anyway?
Goal: understand performance bottlenecks in DNN training. Ingredients: a diverse benchmark suite with state-of-the-art models, key performance metrics, and pin-pointing tools.
Why Do We Need an ML Benchmark Suite? There is a lack of a standard, diverse benchmark set with state-of-the-art models for DNN training. • How is training different from inference? Our focus is on training.
Need for Benchmark Diversity Lack of a standard diverse benchmark set with state-of-the-art models for DNN training • Need for benchmark diversity: • DNNs have been widely successful • Most research used only image classification and CNN models • Performance characteristics are different for different DNNs
State-of-the-art Models: AlexNet (2012), VGG (2013), GoogLeNet (2014), ResNet (2015); RCNN (2014), Fast RCNN (2015), Faster RCNN (2015), YOLO (2016), YOLO v2 (2017). State-of-the-art models are constantly evolving; old models can quickly become out-dated.
Training Benchmarks for DNNs (TBD) (Footnotes indicate available implementation: T for TensorFlow, M for MXNet, C for CNTK, P for PyTorch) https://github.com/tbd-ai/tbd-suite
TBD vs. Prior Work: comparison against other DNN benchmark suites. We aimed (back in late 2016) for a standard DNN benchmark suite like SPEC.
Our Focus: Benchmarking and Analysis • TBD (http://tbd-suite.ai): building tools to analyze ML performance/efficiency • MLPerf (https://mlperf.org/): industry/academia de-facto standard • Our group owns the reference model implementation for speech recognition (inference): DeepSpeech2 from UofT
Our Community Involvement • ASPLOS 2019 and ISCA 2019 tutorials (we lead the ISCA 2019 one) • SysML 2019 demo • MLPerf Academics/Researchers group (co-chairing)
Goal: understand performance bottlenecks in DNN training. Ingredients: a diverse benchmark suite with state-of-the-art models, key performance metrics, and pin-pointing tools.
Performance Metrics • Throughput: # of data samples processed per second • Compute Utilization: GPU busy time over elapsed time • FP16/FP32/TensorCore Utilization: total FP32 instructions over maximum FP32 instructions • Memory Breakdown: which data structures occupy how much memory
Throughput: # of data samples processed per second. Time-to-accuracy is the metric people truly care about, but it is too expensive to measure and hyper-parameter tuning plays a big role. Throughput is easy to measure (one only needs to handle samples of varying sizes); we assume there exists a hyper-parameter configuration that guarantees training quality.
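As a minimal illustration of the metric (a sketch, not the TBD toolchain itself; `run_iteration` and `batch_size` are hypothetical placeholders):

```python
import time

def measure_throughput(run_iteration, batch_size, num_iters=100, warmup=10):
    """Samples/second over num_iters iterations, after skipping warm-up."""
    for _ in range(warmup):          # exclude initialization/auto-tuning effects
        run_iteration()
    start = time.time()
    for _ in range(num_iters):
        run_iteration()
    elapsed = time.time() - start
    return num_iters * batch_size / elapsed
```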
Compute Utilization: GPU busy time over elapsed time. It indicates how well non-GPU work (data loading, communication over PCIe/networking, …) overlaps with GPU computation. If the GPU is busy for intervals t1 and t2 within an elapsed time t3, Compute Utilization = (t1 + t2) / t3.
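A minimal sketch of this computation, assuming we already have the (start, end) times of GPU-busy intervals (e.g., kernel and memcpy spans exported from an nvprof trace):

```python
def compute_utilization(busy_intervals, elapsed_time):
    """busy_intervals: list of (start, end) times when the GPU had work running.
    Overlapping intervals are merged so concurrent kernels are not double-counted."""
    merged = []
    for start, end in sorted(busy_intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    busy_time = sum(end - start for start, end in merged)
    return busy_time / elapsed_time

# e.g. busy for t1 = 2s and t2 = 3s out of t3 = 10s elapsed -> 0.5
print(compute_utilization([(0.0, 2.0), (5.0, 8.0)], 10.0))
```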
FP32/FP16/TensorCore Utilization • When the GPU is busy, how well are the GPU cores utilized? • Average # of instructions executed per cycle over maximum instructions per cycle (provided by nvprof) • Indicates kernel-level speed-up potential • Helps identify the “straggler” kernels (usually not MatMul or CNN kernels) • Most models are trained with single-precision floats
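The metric itself is a simple ratio; the sketch below assumes per-kernel instruction and cycle counts are already available (the field names are illustrative, not actual nvprof output):

```python
def fp_utilization(kernels, max_instr_per_cycle):
    """kernels: list of dicts with 'name', 'fp_instructions', 'cycles'.
    Per-kernel utilization makes straggler kernels easy to spot."""
    report = {}
    for k in kernels:
        achieved_ipc = k['fp_instructions'] / k['cycles']
        report[k['name']] = achieved_ipc / max_instr_per_cycle
    return report
```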
Memory Breakdown • Goal: understand which data structures contribute how much to the total memory consumption • Memory usage can be broken down along two dimensions • Data structures: weights, gradients (allocated before training starts); activations, workspace, dynamic (allocated and released during training) • Layer types: Conv, Recurrent, LSTM, Fully-connected, …
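As a rough sketch of how such a breakdown can be produced, assume every allocation is tagged with its data-structure category at allocation time (the framework-specific tagging hook is not shown):

```python
from collections import defaultdict

def memory_breakdown(allocations):
    """allocations: list of (tag, num_bytes) tuples, e.g. ('weights', 4096).
    Returns total bytes consumed per data-structure category."""
    totals = defaultdict(int)
    for tag, num_bytes in allocations:
        totals[tag] += num_bytes
    return dict(totals)

# Hypothetical numbers: weights/gradients are long-lived, activations dominate.
print(memory_breakdown([('weights', 100 << 20),
                        ('gradients', 100 << 20),
                        ('activations', 900 << 20),
                        ('workspace', 64 << 20)]))
```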
Goal: understand performance bottlenecks in DNN training. Ingredients: a diverse benchmark suite with state-of-the-art models, key performance metrics, and pin-pointing tools.
Toolchain: sampling, setup, warm-up • Sampling (see the sketch below): fully training a DNN takes days or weeks, but the training algorithm is iterative and each iteration follows the same logic • Setup: need to verify training accuracy; different frameworks may use different hyper-parameters for the same models • Skipping warm-up: before training stabilizes, a framework needs to initialize its dataflow, allocate memory, and perform auto-tuning
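A hedged sketch of the sampling idea (not the actual toolchain code; `train_step` and `profiler` are hypothetical stand-ins): profile only a short window of iterations after the warm-up/auto-tuning phase.

```python
def profile_short_window(train_step, profiler, warmup_iters=50, sample_iters=20):
    """Skip warm-up (initialization, allocation, auto-tuning), then profile a
    small sample of iterations; the sample is representative because every
    training iteration follows the same logic."""
    for _ in range(warmup_iters):
        train_step()                 # excluded from data collection
    profiler.start()
    for _ in range(sample_iters):
        train_step()
    profiler.stop()
    return profiler.results()
```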
Toolchain: Overview • Setup: make implementations comparable • Warm-up & auto-tuning: excluded from data collection • Sampling: a short training period is profiled • Training logs from the DNN model implementation give training throughput • A memory profiler gives the memory consumption breakdown • nvprof .nvvp files, after post-processing, give FP32/FP16/TensorCore utilization • nvprof .nvvp files, inspected in the Visual Profiler, give compute utilization
Goal: understand performance bottlenecks in DNN training. Ingredients: a diverse benchmark suite with state-of-the-art models, key performance metrics, and pin-pointing tools.
Experimental setup • All results are collected in a single-machine, single-GPU environment (except the distributed-training experiments) • OS: Ubuntu 16.04 • Libraries: CUDA 9, cuDNN 7 • Frameworks: TensorFlow v1.8, MXNet v1.1.0, PyTorch v0.4.0, CNTK v2.0 • GPUs: Quadro P4000, 1080 Ti, Titan Xp, P100, 2080 Ti, Titan V, V100 • CPU: 28-core Intel Xeon E5-2680 • Networking: 1 Gb/s Ethernet, 100 Gb/s InfiniBand, 12 GB/s PCIe
Results: Training Quality. The expected training accuracy is reached.
Results: Throughput. Mini-batch size matters for training throughput: performance improves with larger mini-batches.
Results: Throughput Diversity. Performance of RNN-based models does not saturate within the GPU memory budget.
Results Analysis: GPU Compute Utilization. Mini-batch size should be large enough to keep the GPU busy; GPU compute utilization is low for LSTM-based models.
Results Analysis: GPU FP32 Utilization. Mini-batch size should be large enough to achieve high FP utilization.
Hardware Sensitivity. A better GPU does NOT always mean better performance and utilization; we need better system designs and libraries.
GPU Memory Profiling. Feature maps are the dominant GPU memory consumers.
Results: Distributed Training. Training ResNet-50 on MXNet (left: multi-machine; right: multi-GPU on a single machine). Ethernet (eth) bw = 1 Gb/s; InfiniBand (ib) bw = 100 Gb/s; PCIe bw = 16 GB/s. Networking bandwidth must be large enough for weight/gradient updates.
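A rough back-of-envelope sketch of why networking bandwidth matters (the parameter count is the commonly cited figure for ResNet-50; the numbers are illustrative, not measurements from this study):

```python
params = 25.5e6                  # approximate ResNet-50 parameter count
bytes_per_update = params * 4    # FP32 gradients: ~102 MB per iteration
eth_bw = 1e9 / 8                 # 1 Gb/s Ethernet ~ 125 MB/s
ib_bw = 100e9 / 8                # 100 Gb/s InfiniBand ~ 12.5 GB/s

print(bytes_per_update / eth_bw)   # ~0.8 s just to ship gradients once
print(bytes_per_update / ib_bw)    # ~0.008 s
```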
Project Status • GitHub repo: github.com/tbd-ai • TBD project website is live: tbd-suite.ai
TBD Summary • A new benchmark suite for DNN training • Currently, 7 application domains, 9 state-of-the-art models • Comes with tools to analyze: • performance, efficiency, memory, and network consumption • Part of the community effort (MLPerf) to standardize benchmarking for machine learning
2. Gist: Efficient Data Encoding for Deep Neural Network Training
DNN Training vs. Inference. Step 1, forward pass: makes a prediction. Step 2, backward pass: calculates error gradients. The intermediate layer outputs (feature maps) generated in the forward pass are used again in the backward pass. DNN training therefore requires stashing feature maps for the backward pass, which is not required in inference.
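A toy numpy sketch of this asymmetry (an assumed two-layer network, not any particular framework): the forward pass must stash its intermediate output because the backward pass reads it again.

```python
import numpy as np

def forward(x, w1, w2):
    h = np.maximum(x @ w1, 0)   # feature map: stashed for the backward pass
    y = h @ w2
    return y, h                 # inference could discard h immediately

def backward(x, h, w2, grad_y):
    grad_w2 = h.T @ grad_y      # backward pass re-reads the stashed feature map
    grad_h = grad_y @ w2.T
    grad_h[h <= 0] = 0          # ReLU gradient needs to know where h was zero
    grad_w1 = x.T @ grad_h
    return grad_w1, grad_w2
```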
Training Deeper Networks. Feature maps are a major consumer of GPU memory, and a larger minibatch size risks a crash/out-of-memory error. Goal: train larger networks on a single GPU by reducing the memory footprint.
Limitations of Prior Work • Prior work focuses on DNN inference, i.e., on compressing weights • It applies pruning, quantization, and Huffman encoding • However, weights are a small fraction of the memory footprint • Additionally, these techniques are not well suited for training • Training requires frequent weight updates • They map poorly onto the GPU hardware
Our Insight. A feature map is generated in the forward pass (1st use) and consumed again in the backward pass (2nd use), with a large temporal gap between the two uses. Baseline: the feature map is kept in FP32 format the whole time. Our approach: Encode() after the 1st use, keep the data in a smaller format between the two uses, and Decode() just before the 2nd use.
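In sketch form (hypothetical names, not the Gist implementation), the mechanism is a pair of hooks around the stashed tensor; the full-precision copy can be freed as soon as the encoded copy exists:

```python
class StashedFeatureMap:
    """Keeps a feature map in a smaller encoded form between its two uses."""
    def __init__(self, encode, decode):
        self.encode, self.decode = encode, decode
        self.encoded = None

    def after_first_use(self, feature_map):
        self.encoded = self.encode(feature_map)   # FP32 copy can now be freed

    def before_second_use(self):
        return self.decode(self.encoded)          # restored for the backward pass
```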
Layer-Specific Encodings • Key idea: use layer-specific compression • Can be both fast and efficient • Can even be lossless, which is usually difficult for FP32 data
ReLU Importance. CNTK profiling shows that a significant fraction of the footprint is due to ReLU layers.
ReLU -> Pool. ReLU backward propagation only needs to know which outputs were positive, so the stashed feature map can be binarized to a 1-bit representation (lossless).
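A minimal numpy sketch of such a binarization (assumption: only the positivity pattern of the ReLU output is needed by its backward pass); `np.packbits` stores 8 flags per byte, a 32x reduction relative to FP32:

```python
import numpy as np

def encode_relu_output(relu_out):
    """Stash only a 1-bit mask of which elements were positive."""
    mask = relu_out > 0
    return np.packbits(mask), relu_out.shape    # 1 bit/element instead of 32

def relu_backward(grad_out, packed, shape):
    """Zero the gradient wherever the ReLU output was zero."""
    count = int(np.prod(shape))
    mask = np.unpackbits(packed, count=count).reshape(shape).astype(bool)
    return np.where(mask, grad_out, 0.0)
```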
ReLU/Pool -> Conv. The stashed feature map is sparse (mostly zeros after ReLU), so it can be kept in a sparse storage format while the computation itself stays dense (lossless).
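A hedged numpy sketch of sparse storage with dense compute (not the actual Gist GPU kernels): keep only the non-zero values and their indices, and densify again right before the backward computation.

```python
import numpy as np

def sparse_encode(feature_map):
    """Store only the non-zero values and their flat indices."""
    flat = feature_map.ravel()
    idx = np.flatnonzero(flat)
    return idx, flat[idx], feature_map.shape

def dense_decode(idx, values, shape):
    """Rebuild the dense tensor so the compute kernels remain dense."""
    flat = np.zeros(int(np.prod(shape)), dtype=values.dtype)
    flat[idx] = values
    return flat.reshape(shape)
```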
Opportunity for Lossy Encoding: Precision Reduction. Reducing precision in the forward pass quickly degrades accuracy (e.g., AlexNet does not train at 16 bits). Restricting precision reduction to the 2nd use, in the backward pass, allows aggressive bit savings with no effect on accuracy.
Delayed Precision Reduction (Lossy). Instead of training with reduced precision throughout, precision is reduced only for the stashed copy of the feature map that the backward pass reads.
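A toy numpy illustration of the idea (FP16 is used here just as a stand-in for a smaller format; Gist picks the actual bit width per layer): the forward pass computes in full precision, and only the stashed copy used by the backward pass is reduced.

```python
import numpy as np

def forward_with_delayed_reduction(x, w):
    h = np.maximum(x @ w, 0)            # full-precision forward compute (1st use)
    stash = h.astype(np.float16)        # only the stashed copy is low precision
    return h, stash

def backward_second_use(stash, grad_out):
    h = stash.astype(np.float32)        # 2nd use reads the reduced-precision copy
    return np.where(h > 0, grad_out, 0.0)   # ReLU backward using the stash
```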
Proposed System Architecture: Gist. Gist takes the execution graph of a DNN, identifies encoding opportunities, and emits a modified execution graph with memory allocated for the new data structures and efficient memory sharing.
Compression Ratio. Up to 2X compression ratio with minimal performance overhead.
Gist Summary • Systematic memory breakdown analysis for image classification • Layer-specific lossless encodings: binarization and sparse storage/dense compute • Aggressive lossy encodings with delayed precision reduction • Footprint reduction measured on real systems: up to 2X reduction with only 4% performance overhead; further optimizations yield more than 4X reduction