Optimizing DNN Training Efficiency: A Holistic Approach

Holistic Approach to DNN Training Efficiency: Analysis and Optimizations. Gennady Pekhimenko, Assistant Professor, Computer Systems and Networking Group (CSNG), EcoSystem Group

1. Machine Learning Benchmarking and Analysis

An ML researcher is waiting for hours or days for training to finish. What to do? Try a new framework (TF, MXNet, PyTorch, …)? Change hyper-parameters? Try a new library? Buy a new GPU (V100, P100, 1080 Ti, Titan Xp, …)? Add/remove a layer? Or … never mind, you have to pay this much time anyway?

Understand performance bottlenecks in DNN training: • A diverse benchmark suite with state-of-the-art models • Pin-pointing tools • Key performance metrics

Why Do We Need an ML Benchmark Suite? • There is no standard, diverse benchmark set with state-of-the-art models for DNN training • Training differs from inference: our focus is on training

Need for Benchmark Diversity (early 2017): • There is no standard, diverse benchmark set with state-of-the-art models for DNN training • DNNs have been widely successful • Most research used only image classification and CNN models • Performance characteristics differ across DNNs

State-of-the-art Models • AlexNet (2012), VGG (2013), GoogleNet (2014), ResNet (2015), ? • RCNN (2014), Fast RCNN (2015), Faster RCNN (2015), YOLO (2016), YOLO v2 (2017), ? • State-of-the-art models are constantly evolving • Old models quickly become outdated

Training Benchmarks for DNNs (TBD) (Footnotes indicate available implementations: T for TensorFlow, M for MXNet, C for CNTK, P for PyTorch) https://github.com/tbd-ai/tbd-suite

Our Focus: Benchmarking and Analysis • http://tbd-suite.ai: building tools to analyze ML performance/efficiency • https://mlperf.org/: industry/academia de-facto standard; our group is responsible for the reference model implementation for speech recognition (inference): DeepSpeech2 from UofT

Performance Metrics • Throughput: # of data samples processed per second • Compute Utilization: GPU busy time over elapsed time • FP32 Utilization: total FP32 instructions over maximum FP32 instructions • Memory Breakdown: which data structures occupy how much memory

Throughput: # of data samples processed per second • Time-to-accuracy: this is the metric that people truly care about, but it is too expensive to measure, and hyper-parameter tuning plays a big role • Throughput: easy to measure, but needs to handle samples of varying sizes • We assume that there exists a hyper-parameter configuration that guarantees training quality

Compute Utilization: GPU busy time over elapsed time • Indicates how well the non-GPU workloads overlap with GPU computation: • Data loading • Communication (PCIe, networking) • … • Compute Utilization = (t1 + t2) / t3, where t1 and t2 are GPU busy intervals and t3 is the elapsed time
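The ratio on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the TBD toolchain: the function name and the timeline values are hypothetical, and busy intervals are merged first so that overlapping kernels are not counted twice.

```python
def compute_utilization(busy_intervals, elapsed):
    """Return GPU busy time divided by elapsed time.

    busy_intervals: list of (start, end) times in seconds; may overlap.
    elapsed: length of the measured window in seconds (t3 on the slide).
    """
    if elapsed <= 0:
        raise ValueError("elapsed must be positive")
    busy = 0.0
    cur_start, cur_end = None, None
    for start, end in sorted(busy_intervals):
        if cur_end is None or start > cur_end:
            # A gap was found: close the previous merged interval.
            if cur_end is not None:
                busy += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent interval: extend the current one.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        busy += cur_end - cur_start
    return busy / elapsed


# Hypothetical timeline: t1 = 3 s and t2 = 4 s of GPU work in a t3 = 10 s window.
utilization = compute_utilization([(0.0, 3.0), (5.0, 9.0)], elapsed=10.0)
print(f"compute utilization = {utilization:.2f}")
```

The idle gap between the two busy intervals is where data loading or communication failed to overlap with GPU computation.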

FP32/FP16/TensorCore Utilization • Indicates speed-up potential at the kernel level • Helps identify the “straggler” kernels (usually not MatMul or CNN kernels) • Average # of instructions executed per cycle over maximum instructions per cycle • When the GPU is busy, how well are the GPU cores utilized? • Most models are trained with single-precision floats • Provided by nvprof
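As a worked example of the definition above, the following sketch computes the ratio for one kernel. The function name and all three input numbers are hypothetical; in practice the instruction and cycle counts would come from a profiler, and the peak rate depends on the specific GPU.

```python
def fp32_utilization(fp32_instructions, busy_cycles, peak_fp32_per_cycle):
    """Average FP32 instructions per busy cycle over the hardware peak."""
    if busy_cycles <= 0 or peak_fp32_per_cycle <= 0:
        raise ValueError("cycle count and peak rate must be positive")
    achieved_per_cycle = fp32_instructions / busy_cycles
    return achieved_per_cycle / peak_fp32_per_cycle


# Hypothetical kernel: 6.4e9 FP32 instructions over 1e8 busy cycles,
# on a device assumed to issue at most 128 FP32 instructions per cycle.
kernel_util = fp32_utilization(6_400_000_000, 100_000_000, 128)
print(f"FP32 utilization = {kernel_util:.0%}")
```

A kernel with a low ratio and a long runtime is a candidate "straggler" in the sense used on the slide.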

Memory Breakdown • Goal: understand which data structures contribute how much to the total memory consumption • Data structures: • Weights • Gradients • Activations • Workspace • Dynamic • Some are allocated before training starts; others are allocated and released during training

Toolchain: How to get the required metrics?

Toolchain: sampling, setup, warmup • Sampling • Fully training a DNN takes days or weeks • The training algorithm is iterative, and each iteration follows the same logic, so a short sample is representative • Setup • Need to verify training accuracy • Different frameworks may use different hyper-parameters for the same models • Skipping warmup • Before training stabilizes, a framework needs to initialize the dataflow, allocate memory, and auto-tune
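The sampling and warm-up points above can be sketched as a small timing harness. This is an illustration under stated assumptions, not the TBD tooling: `measure_throughput` and `dummy_step` are hypothetical names, and `dummy_step` stands in for one real training iteration. With a GPU framework that launches work asynchronously, the step would also need to synchronize with the device before the timer is read.

```python
import time


def measure_throughput(train_step, batch_size, warmup_iters=10, sample_iters=50):
    """Return samples per second over a short sampled window.

    Warm-up iterations (initialization, memory allocation, auto-tuning)
    run first and are excluded from the timed window.
    """
    for _ in range(warmup_iters):
        train_step()
    start = time.perf_counter()
    for _ in range(sample_iters):
        train_step()
    elapsed = time.perf_counter() - start
    return batch_size * sample_iters / elapsed


calls = []


def dummy_step():
    # Hypothetical stand-in for one training iteration.
    calls.append(1)
    sum(range(1000))


throughput = measure_throughput(dummy_step, batch_size=32,
                                warmup_iters=2, sample_iters=5)
print(f"{throughput:.0f} samples/s over 5 sampled iterations")
```

Because every iteration follows the same logic, a few dozen timed iterations are usually enough to estimate the steady-state rate without training to completion.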

Toolchain: Overview • Setup: make the DNN model implementations comparable • Short training period: warm-up & auto-tuning (excluded from data collection), then sampling • Metrics and their sources: • Training throughput: training logs • Compute utilization: nvprof (.nvvp file) • FP32 utilization: nvprof (.nvvp file) • CPU utilization: vTune • Memory consumption: memory profiler

Memory Profiler Flow • DNN model implementation runs on a modified framework • The framework emits training logs with functionality tagged • A parser program turns the logs into a breakdown by data structure: Weights, Gradients, Activations, Workspace, Dynamic
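The parser stage of this flow could look roughly like the sketch below. The slides do not show the log format, so the `ALLOC <tag> <bytes>` / `FREE <tag> <bytes>` lines here are an assumed, hypothetical format, and the function reports the per-category footprint at the moment of peak total usage.

```python
from collections import defaultdict

CATEGORIES = ("weights", "gradients", "activations", "workspace", "dynamic")


def memory_breakdown(log_lines):
    """Return bytes per category at the point of peak total memory usage."""
    current = defaultdict(int)
    peak_total = 0
    peak_snapshot = {c: 0 for c in CATEGORIES}
    for line in log_lines:
        parts = line.split()
        # Skip anything that is not a well-formed tagged allocation record.
        if len(parts) != 3 or parts[0] not in ("ALLOC", "FREE"):
            continue
        op, tag, size_text = parts
        if tag not in CATEGORIES or not size_text.isdigit():
            continue
        size = int(size_text)
        current[tag] += size if op == "ALLOC" else -size
        total = sum(current.values())
        if total > peak_total:
            peak_total = total
            peak_snapshot = {c: current[c] for c in CATEGORIES}
    return peak_snapshot


# Hypothetical tagged log from a modified framework.
log = [
    "ALLOC weights 100",
    "ALLOC gradients 100",
    "ALLOC activations 300",
    "ALLOC workspace 50",
    "FREE workspace 50",
    "ALLOC activations 200",
    "FREE activations 500",
]
breakdown = memory_breakdown(log)
print(breakdown)
```

Reporting the snapshot at peak usage, rather than cumulative allocations, matches the question the breakdown is meant to answer: what fills GPU memory when it is fullest.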

Experimental setup • All results were obtained in a single-machine, single-GPU environment • OS: Ubuntu 16.04 • Libraries: CUDA 9, cuDNN 7 • Frameworks: TensorFlow v1.8, MXNet v1.1.0, PyTorch v0.4.0, CNTK v2.0 • GPUs: Quadro P4000, 1080 Ti, Titan Xp, P100, 2080 Ti, Titan V, V100 • CPU: 28-core Intel Xeon E5-2680 • Networking: 1 Gb/s Ethernet, 100 Gb/s InfiniBand, 16 GB/s PCIe

Results: Training Quality Expected training accuracy reached

Results: Throughput Mini-batch size matters for training throughput Performance improves with larger mini-batches

Results: Throughput Diversity Performance of RNN-based models does not saturate within GPU memory budget

Results Analysis: GPU Compute Utilization Mini-batch size should be large enough to keep GPU busy GPU compute utilization is low for LSTM-based models

Results Analysis: GPU FP32 Utilization Mini-batch size should be large enough to have high FP utilization

Hardware Sensitivity Better GPU does NOT always mean better performance and utilization We need better system designs and libraries

GPU Memory Profiling Feature maps are the dominant GPU memory consumers

Results: Distributed Training Training ResNet-50 on MXNet (left: multi-machine; right: multi-GPU on single machine) Ethernet (eth) bw = 1Gb/s; InfiniBand (ib) bw = 100Gb/s; PCIe bw = 16GB/s Networking BW should be large enough for weight/gradient updates
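A back-of-the-envelope estimate shows why the link bandwidth matters here. The sketch below assumes ResNet-50 has roughly 25.6 million parameters stored as FP32 (a figure not stated on the slide) and computes only the one-way time to move one full set of gradients, ignoring latency, protocol overhead, and any overlap with computation.

```python
PARAMS = 25_600_000      # assumed approximate ResNet-50 parameter count
BYTES_PER_PARAM = 4      # FP32


def transfer_seconds(num_bytes, bandwidth_bytes_per_s):
    """Ideal time to push num_bytes through a link of the given bandwidth."""
    return num_bytes / bandwidth_bytes_per_s


# Link bandwidths from the slide, converted to bytes per second.
links = {
    "Ethernet 1 Gb/s": 1e9 / 8,
    "InfiniBand 100 Gb/s": 100e9 / 8,
    "PCIe 16 GB/s": 16e9,
}

gradient_bytes = PARAMS * BYTES_PER_PARAM
times = {name: transfer_seconds(gradient_bytes, bw) for name, bw in links.items()}
for name, seconds in times.items():
    print(f"{name}: {seconds * 1000:.1f} ms per gradient exchange")
```

Under these assumptions a single exchange takes most of a second over 1 Gb/s Ethernet but only a few milliseconds over InfiniBand or PCIe, which is consistent with the slide's point that networking bandwidth must be large enough for weight/gradient updates.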

Project Status • GitHub repo: github.com/tbd-ai • TBD project website is live: tbd-suite.ai

TBD Summary • A new benchmark suite for DNN training • Currently, 7 application domains, 9 state-of-the-art models • Comes with tools to analyze: • performance, efficiency, memory, and network consumption • Part of the community effort (MLPerf) to standardize benchmarking for machine learning

2. Gist: Efficient Data Encoding for Deep Neural Network Training

DNN Training vs. Inference • Step 1 - Forward Pass (makes a prediction) • Step 2 - Backward Pass (calculates error gradients) • Feature maps: intermediate layer outputs (L1, L2, L3, L4, …, Ln), generated in the forward pass and used in the backward pass • DNN training requires stashing feature maps for the backward pass (not required in inference)

Training Deeper Networks • Train larger networks on a single GPU by reducing the memory footprint • Feature maps are a major consumer of GPU memory • Larger minibatch size → potential crash/out-of-memory

Limitations of Prior Work • Focuses on DNN inference, i.e., weights • Applies pruning, quantization and Huffman encoding • However, weights are a small fraction of the memory footprint • Additionally, these techniques are not well suited for training • Training requires frequent weight updates • They map poorly onto GPU hardware

Our Insight • A feature map is generated and first used in the forward pass, then used a second time in the backward pass • There is a large temporal gap between the 2 uses • Baseline: feature map stored in FP32 format • Our approach: smaller format between the 2 uses, with Encode() after the 1st use and Decode() before the 2nd use

Layer-Specific Encodings • Key Idea: • Use layer-specific compression • Can be both fast and efficient • Can be even lossless • Usually difficult for FP32

ReLU Importance • A significant part of the footprint is due to the ReLU layer (CNTK profiling)

ReLU -> Pool • ReLU backward propagation • Binarize: 1-bit representation (lossless)
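The reason one bit suffices is that the ReLU backward pass only needs to know whether each output was positive. The sketch below illustrates that idea in plain Python; it is not the Gist implementation, and the function names and sample values are hypothetical.

```python
def encode_relu_mask(relu_output):
    """Pack (y > 0) for each element into bytes, 8 elements per byte."""
    packed = bytearray((len(relu_output) + 7) // 8)
    for i, y in enumerate(relu_output):
        if y > 0:
            packed[i // 8] |= 1 << (i % 8)
    return packed


def relu_backward(packed_mask, grad_output):
    """Pass the gradient through where the stashed bit is set, else zero."""
    return [
        g if (packed_mask[i // 8] >> (i % 8)) & 1 else 0.0
        for i, g in enumerate(grad_output)
    ]


# Forward pass: compute the ReLU output, then stash only the 1-bit mask.
x = [-1.0, 2.0, 0.0, 3.5, -0.5, 1.0, 4.0, -2.0, 0.7]
y = [max(0.0, v) for v in x]
mask = encode_relu_mask(y)

# Backward pass: the mask alone reproduces the exact ReLU gradient.
grad_out = [0.5, -1.0, 2.0, 0.25, 1.0, -3.0, 0.75, 1.5, -0.5]
grad_in = relu_backward(mask, grad_out)
print(len(mask), "bytes stashed for", len(y), "elements")
```

Compared with stashing each element as a 32-bit float, the mask is 32 times smaller, and the gradient it yields is identical, which is why the encoding is lossless.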

ReLU/Pool -> Conv • Sparse Storage, Dense Compute (lossless)
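The idea can be illustrated as follows: ReLU outputs contain many zeros, so the stashed copy keeps only the non-zero entries, and it is expanded back to a dense array just before the backward computation needs it. The slides do not specify the sparse format, so the index/value layout below is an assumption, and the function names are hypothetical.

```python
def encode_sparse(dense):
    """Keep only non-zero entries as (length, indices, values)."""
    indices = [i for i, v in enumerate(dense) if v != 0.0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values


def decode_dense(encoded):
    """Rebuild the dense array so the backward pass can use dense kernels."""
    length, indices, values = encoded
    dense = [0.0] * length
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


# Hypothetical ReLU output with mostly zero entries.
feature_map = [0.0, 0.0, 1.5, 0.0, 0.0, 0.0, 2.25, 0.0, 0.0, 0.0, 0.0, 0.5]
stashed = encode_sparse(feature_map)   # stored between the two uses
restored = decode_dense(stashed)       # decoded right before the 2nd use
print(len(stashed[2]), "of", len(feature_map), "values stored")
```

The saving depends on sparsity, since each stored value also carries an index; the encoding only pays off when enough of the feature map is zero.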

Opportunity for Lossy Encoding • Precision reduction in the forward pass quickly degrades accuracy (AlexNet: 16-bit doesn’t train) • Restricting precision reduction to the 2nd uses (backward pass) results in aggressive bit savings with no effect on accuracy

Delayed Precision Reduction (Lossy) [Figure: training with reduced precision vs. delayed precision reduction]
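A minimal sketch of the delayed scheme: the forward pass hands the full-precision feature map to the next layer, and only the copy stashed for the backward pass is reduced. FP16 is used here purely as an illustrative target format, the function names are hypothetical, and Python floats are 64-bit, so the byte counts in the comments refer to the FP32 baseline described in the slides.

```python
import struct


def to_fp16_bytes(values):
    """Encode values as little-endian IEEE 754 half-precision floats."""
    return struct.pack(f"<{len(values)}e", *values)


def from_fp16_bytes(blob):
    """Decode a half-precision byte string back into Python floats."""
    return list(struct.unpack(f"<{len(blob) // 2}e", blob))


feature_map = [0.1, 1.5, 0.0, 3.14159, 2.0]

# Forward pass: the next layer consumes the unreduced values (1st use).
next_layer_input = feature_map

# Only the stashed copy is reduced: 2 bytes per value instead of 4 for FP32.
stashed = to_fp16_bytes(feature_map)

# Backward pass: decode right before the 2nd use.
restored = from_fp16_bytes(stashed)
max_error = max(abs(a - b) for a, b in zip(feature_map, restored))
print(f"{len(stashed)} bytes stashed, max abs error {max_error:.6f}")
```

Because the rounding error enters only at the second use, it does not propagate through the remaining forward layers, which is the distinction the slide draws against reducing precision in the forward pass.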

Proposed System Architecture - Gist • Input: DNN execution graph • Gist identifies encoding opportunities and produces a modified execution graph • Memory allocation for the new data structures, with efficient memory sharing

Compression Ratio • Up to 2X compression ratio with minimal performance overhead

Gist Summary • Systematic memory breakdown analysis for image classification • Layer-specific lossless encodings • Binarization and sparse storage/dense compute • Aggressive lossy encodings • With delayed precision reduction • Footprint reduction measured on real systems: • Up to 2X reduction with only 4% performance overhead • Further optimizations: more than 4X reduction

3. Priority-based Parameter Propagation (P3) for Distributed DNN Training

Networking Profiler for DNN Training • Significant communication, especially from workers to the parameter server

Network Utilization: ResNet-50 on MXNet 4 Gbps Ethernet Huge network underutilization, with significant spikes
