
SAGE: Self-Tuning Approximation for Graphics Engines


Presentation Transcript


  1. SAGE: Self-Tuning Approximation for Graphics Engines. Mehrzad Samadi¹, Janghaeng Lee¹, D. Anoushe Jamshidi¹, Amir Hormati², and Scott Mahlke¹. ¹University of Michigan, ²Google Inc. December 2013. University of Michigan, Electrical Engineering and Computer Science. Compilers creating custom processors.

  2. Approximate Computing • Different domains: machine learning, image processing, video processing, physical simulation, ... • Trading output quality (100% → 95% → 90% → 85%) buys higher performance, lower power consumption, and less work.

  3. Ubiquitous Graphics Processing Units • Wide range of devices: supercomputers, servers, desktops, cell phones • Mostly regular applications • Works on large data sets • Good opportunity for automatic approximation.

  4. SAGE Framework • Self-Tuning Approximation on Graphics Engines • Simplifies or skips processing that is computationally expensive but has the lowest impact on output quality • Write the program once; approximation is applied automatically and self-tuned dynamically • Target: 2-3x speedup at roughly 10% output error.

  5. Overview

  6. SAGE Framework (diagram): a static compiler turns the input program into approximate kernels using the approximation methods; the runtime system steers those kernels through their tuning parameters.

  7. Static Compilation (diagram): given CUDA code, an evaluation metric, and a target output quality (TOQ), the compiler applies three methods (atomic operation tuning, data packing, and thread fusion) and emits approximate kernels plus their tuning parameters.

  8. Runtime System (timeline): after preprocessing on the CPU, the GPU interleaves calibration and tuning phases (at times T, T + C, T + 2C, ...) with normal execution, tracking output quality against the TOQ and the achieved speedup over time.

  9. Runtime System (timeline, continued): periodic calibration re-checks quality so the runtime can re-tune if quality drifts relative to the TOQ.

  10. Approximation Methods

  11. Approximation Methods (divider, repeating the static-compilation diagram): atomic operation tuning is covered first.

  12. Atomic Operations • Atomic operations update a memory location such that the update appears to happen atomically.

      // Compute histogram of colors in an image
      __global__ void histogram(int n, int *colors, int *bucket) {
          int tid = threadIdx.x + blockDim.x * blockIdx.x;
          int nThreads = gridDim.x * blockDim.x;
          for (int i = tid; i < n; i += nThreads) {
              int c = colors[i];
              atomicAdd(&bucket[c], 1);
          }
      }

  13-16. Atomic Operations (animation): threads that update the same histogram bucket serialize on the atomic operation, so the more threads conflict on one address, the longer they all wait.

  17. Atomic Add Slowdown (chart): the slowdown of atomicAdd grows with the number of threads conflicting on the same address.

  18. Atomic Operation Tuning (figure: iterations of threads T0, T32, T64, T96, T128, T160 running the histogram kernel):

      // Compute histogram of colors in an image
      __global__ void histogram(int n, int *colors, int *bucket) {
          int tid = threadIdx.x + blockDim.x * blockIdx.x;
          int nThreads = gridDim.x * blockDim.x;
          for (int i = tid; i < n; i += nThreads) {
              int c = colors[i];
              atomicAdd(&bucket[c], 1);
          }
      }

  19. Atomic Operation Tuning • SAGE skips one iteration per thread • To improve performance, it drops the iteration with the maximum number of conflicts.

  20. Atomic Operation Tuning • SAGE skips one iteration per thread, dropping the one with the maximum number of conflicts • With two iterations per thread, this drops 50% of the iterations.

  21. Atomic Operation Tuning • With more iterations per thread, the drop rate goes down to 25%.

  22. Dropping One Iteration (diagram): a conflict detection (CD) step measures the number of conflicts in each iteration (the figure shows counts such as 2, 8, 17, and 12 for CD 0 through CD 3); after optimization, the iteration with the maximum conflicts (17) is the one dropped.
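The deck never shows the transformed kernel, so here is a minimal sketch of the dropped-iteration idea, assuming conflicts are counted per warp with the __match_any_sync intrinsic (compute capability 7.0+); SAGE's actual conflict-detection code may well differ:

      // Sketch only: approximate histogram that drops, for each thread, the
      // single iteration observed to have the most intra-warp conflicts.
      __global__ void histogram_approx(int n, const int *colors, int *bucket) {
          int tid = threadIdx.x + blockDim.x * blockIdx.x;
          int nThreads = gridDim.x * blockDim.x;
          int heldColor = -1;  // update withheld from the worst iteration so far
          int heldConf  = -1;  // its conflict count
          for (int i = tid; i < n; i += nThreads) {
              int c = colors[i];
              // Conflict detection: count warp lanes hitting the same bucket.
              int conflicts = __popc(__match_any_sync(__activemask(), c));
              if (conflicts > heldConf) {
                  // New worst iteration: withhold it, commit the previous one.
                  if (heldColor >= 0) atomicAdd(&bucket[heldColor], 1);
                  heldColor = c;
                  heldConf  = conflicts;
              } else {
                  atomicAdd(&bucket[c], 1);
              }
          }
          // The withheld update is never committed, so exactly one iteration
          // per thread (the max-conflict one) is dropped.
      }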

  23. Approximation Methods (divider, repeating the static-compilation diagram): data packing is next.

  24-27. Data Packing (animation): instead of every thread fetching a full-width value, several quantized values are packed into each memory word, so the same threads are served by fewer, narrower memory accesses.

  28. Quantization • Preprocessing finds the min and max of the input sets and packs the data • During execution, each thread unpacks the data and converts each quantization level back to a data value by applying a linear transformation.

  29. Quantization (figure): 8 levels, 0 through 7, span [Min, Max]; the packed bit pattern 101 encodes level 5.
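As a concrete illustration of that linear transformation, here is a hypothetical pack/unpack pair assuming 3-bit packing into 8 levels (the function names and bit width are illustrative, not from the paper):

      #include <math.h>

      // Pack: map a value in [mn, mx] onto a 3-bit quantization level 0..7.
      __host__ __device__ int pack(float x, float mn, float mx) {
          return (int)roundf((x - mn) / (mx - mn) * 7.0f);
      }

      // Unpack: linear transform from a level back to an approximate value;
      // e.g. the bit pattern 101 (level 5) becomes mn + (mx - mn) * 5 / 7.
      __host__ __device__ float unpack(int level, float mn, float mx) {
          return mn + (mx - mn) * (float)level / 7.0f;
      }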

  30. Approximation Methods (divider, repeating the static-compilation diagram): thread fusion is next.

  31-32. Thread Fusion (animation): adjacent threads read neighboring data and often produce approximately equal (≈) outputs, so their work can be fused and the result shared.

  33. Thread Fusion (diagram): threads T0 and T1 each do computation followed by output writing; after fusion, one thread computes once and performs both output writes.

  34. Thread Fusion (diagram): with 256-thread blocks (T0-T255 in blocks 0 and 1), fusing adjacent threads means each block now needs only half of its threads.

  35. Thread Fusion (figure: blocks 0 and 1 reduced to T0-T127 each) • Reducing the number of threads per block results in poor resource utilization.

  36. Block Fusion (diagram): fusing blocks 0 and 1 into a single block of T0-T255 restores a full-size block while keeping the fused threads' savings; a sketch follows.
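A minimal sketch of what a fused kernel could look like, assuming a fusion factor of two and that one computed value is reused for the neighboring output (compute() is a hypothetical stand-in for the per-element work):

      // Hypothetical per-element work.
      __device__ float compute(float x) { return x * x; }

      // Fused kernel: each thread produces two adjacent outputs from one
      // computation, exploiting the similarity of neighboring results.
      __global__ void kernel_fused(int n, const float *in, float *out) {
          int i = 2 * (threadIdx.x + blockDim.x * blockIdx.x);
          if (i + 1 < n) {
              float v = compute(in[i]);  // computed once
              out[i]     = v;            // exact for element i
              out[i + 1] = v;            // reused as an approximation for i+1
          }
      }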

  37. Runtime

  38. How to Compute Output Quality? • Comparing the approximate version against the accurate version with the evaluation metric has high overhead • Tuning should find a good enough approximate version as fast as possible • SAGE therefore uses a greedy algorithm.

  39. Tuning • TOQ = 90% • K(x,y) denotes a kernel configuration, where x is the tuning parameter of the first optimization and y that of the second • The search starts at K(0,0): quality = 100%, speedup = 1x.

  40. Tuning • From K(0,0), the greedy search evaluates the neighbors K(1,0) and K(0,1), measured at 94% quality / 1.15x and 96% / 1.5x.

  41. Tuning • The search then advances to K(1,1) at 94% / 2.5x and K(0,2) at 95% / 1.6x.

  42. Tuning • K(2,1) at 87% / 2.8x and K(1,2) at 88% / 3x fall below the 90% TOQ, so the tuning path stops and the final choice is K(1,1): 94% quality at 2.5x speedup.
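A host-side sketch of this greedy walk, assuming hypothetical quality() and speedup() measurement hooks (all names are illustrative; the paper's tuner has more machinery around calibration):

      // Greedy tuning sketch: from K(x, y), try incrementing each tuning
      // parameter, move to the fastest neighbor still above the TOQ, and
      // stop when no neighbor qualifies.
      struct Config { int x, y; };

      float quality(Config k);   // assumed hook: calibrated output quality
      float speedup(Config k);   // assumed hook: measured speedup vs. accurate

      Config tune(float toq) {
          Config k = {0, 0};                       // start at K(0,0)
          for (;;) {
              Config cands[2] = {{k.x + 1, k.y}, {k.x, k.y + 1}};
              Config best = k;
              float bestSpeed = speedup(k);
              for (Config c : cands) {
                  if (quality(c) >= toq && speedup(c) > bestSpeed) {
                      best = c;
                      bestSpeed = speedup(c);
                  }
              }
              if (best.x == k.x && best.y == k.y)
                  return k;                        // e.g. K(1,1) in the example
              k = best;
          }
      }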

  43. Evaluation

  44. Experimental Setup • Implemented as a backend of the Cetus compiler • GPU: NVIDIA GTX 560 with 2 GB GDDR5 • CPU: Intel Core i7 • Benchmarks: image processing and machine learning.

  45. K-Means (chart): output quality stays at or above the TOQ while the cumulative speedup grows as tuning proceeds.

  46. Confidence • Quality is checked on a sample of outputs; the pass rate is modeled with a Binomial likelihood and a uniform prior, giving a Beta posterior • After checking 50 samples, we are 93% confident that 95% of the outputs satisfy the TOQ threshold.
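The slide's figure can be reproduced under that Beta/Binomial model (assuming a uniform Beta(1,1) prior and all 50 sampled outputs passing): the posterior on the pass rate p is Beta(51,1), whose CDF is p^51, so

      Pr(p >= 0.95) = 1 - 0.95^51 ≈ 0.93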

  47. Calibration Overhead (chart): calibration overhead as a percentage of total execution time, per benchmark.

  48. Performance (chart): speedups of loop perforation and SAGE at TOQ = 95% and TOQ = 90%, with geometric means; the tallest SAGE bar reaches 6.4x.

  49. Conclusion • Automatic approximation is possible • SAGE automatically generates approximate kernels with different tuning parameters • The runtime system uses those parameters to control output quality during execution • Result: 2.5x speedup with less than 10% quality loss compared to accurate execution.

  50. SAGE: Self-Tuning Approximation for Graphics Engines. Mehrzad Samadi¹, Janghaeng Lee¹, D. Anoushe Jamshidi¹, Amir Hormati², and Scott Mahlke¹. ¹University of Michigan, ²Google Inc. December 2013. University of Michigan, Electrical Engineering and Computer Science. Compilers creating custom processors.
