
Communication-Minimizing 2D Convolution in GPU Registers


Presentation Transcript


  1. Communication-Minimizing 2D Convolution in GPU Registers
Forrest N. Iandola, David Sheffield, Michael Anderson, P. Mangpo Phothilimthana, Kurt Keutzer
forresti@eecs.berkeley.edu
University of California, Berkeley

  2. Overview
• Convolution is a recurring computational pattern in a broad range of computer vision applications
• Memory communication is the bottleneck for convolution on modern GPUs
• How to minimize memory communication overhead in convolution:
  • Texture cache
  • Loop blocking
• Up to 4.5x speedup over existing GPU implementations from NVIDIA, OpenCV, and others

  3. Why focus on convolution?
• Berkeley ParLab project identified 15 recurring computational patterns in computer vision [figure: 15 Computer Vision Patterns]
• CVPR 2007 – 2011 object recognition track:
  • Small filters (2x2 – 7x7)
  • Feature extraction
  • Sliding-window object detection
• If we want fast computer vision, we need fast convolution

  4. What limits the performance of convolution?
• Roofline model [1] divides a program's execution time into two parts:
  • Computational cost (GFLOP/s)
  • Communication cost (GB/s) – memory traffic, I/O, etc.
• No program can outperform the hardware bound on computation or communication
[1] S. Williams, A. Waterman, D. Patterson. Roofline: An Insightful Visual Performance Model for Multicore Architectures. Communications of the ACM, 2009.
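As a back-of-the-envelope illustration (a sketch with round numbers, not figures from the talk): for a kernel with arithmetic intensity I, measured in FLOPs per byte of DRAM traffic, the roofline bound is

    \text{attainable GFLOP/s} = \min\left(\text{peak GFLOP/s},\; \text{peak GB/s} \times I\right)

A 3x3 convolution with no data reuse performs about 17 FLOPs per output pixel (9 multiplies + 8 adds) while moving roughly 40 bytes (nine 4-byte input reads plus one 4-byte output write), so I is about 0.4 FLOPs/byte, far below the ridge point of a Kepler-class GPU. That is why convolution sits on the memory-bound side of the roofline.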

  5. What limits the performance of convolution?
[Figure: roofline model of computational performance – attainable performance from slow to fast, with memory-bounded and computation-bounded regions]

  6. What limits the performance of convolution?
• Convolution on NVIDIA GPUs:
  • Communication between the GPU's off-chip DRAM and on-chip caches is the bottleneck
  • This doesn't include communication between the CPU and GPU, though this can also be an issue
• If we want fast computer vision, we need fast convolution.
• If we want fast convolution on GPUs, we need to optimize memory communication.

  7. Exploiting the GPU Memory Architecture
Optimization 1: Use the Texture Cache
[Figure: memory hierarchy of the NVIDIA GTX680 – per-multiprocessor registers and L1 cache / shared memory, the texture cache (129 Gtexels/s), the L2 cache, GPU global memory (DRAM), and CPU DRAM; annotated bandwidths include 893 GB/s and 123 GB/s on the GPU side and 8 GB/s between CPU DRAM and the GPU]
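A minimal CUDA sketch of what Optimization 1 can look like (hypothetical code, not the authors' released implementation; the kernel name convTex3x3 is a placeholder, and creation of the texture object from a cudaArray is omitted): the input image is read through a texture object so every fetch goes through the texture cache.

    #include <cuda_runtime.h>

    // Hypothetical sketch: 3x3 convolution with input reads routed through the
    // texture cache via a CUDA texture object.
    __global__ void convTex3x3(cudaTextureObject_t img, const float* filt,
                               float* out, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= width || y >= height) return;

        float acc = 0.0f;
        for (int fy = -1; fy <= 1; ++fy)
            for (int fx = -1; fx <= 1; ++fx)
                // tex2D fetches go through the texture cache; the +0.5f samples
                // texel centers, and the texture address mode handles the border.
                acc += tex2D<float>(img, x + fx + 0.5f, y + fy + 0.5f)
                       * filt[(fy + 1) * 3 + (fx + 1)];
        out[y * width + x] = acc;
    }

The texture path has its own cache, tuned for 2D spatial locality, so the overlapping filter windows of neighboring threads hit in cache instead of generating extra DRAM traffic.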

  8. Data Reuse with Loop Blocking
[Figure: typical implementation – 9 input pixels produce 1 output pixel; no data reuse at the register level]
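For reference, a minimal CUDA sketch of this typical implementation (hypothetical code, not from the talk): one thread per output pixel, nine global-memory reads per output, no register-level reuse, border pixels skipped for brevity.

    // Hypothetical sketch of the "typical" kernel: one output pixel per thread.
    __global__ void conv3x3_naive(const float* in, const float* filt,
                                  float* out, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < 1 || y < 1 || x >= width - 1 || y >= height - 1) return;  // skip border

        float acc = 0.0f;
        for (int fy = -1; fy <= 1; ++fy)
            for (int fx = -1; fx <= 1; ++fx)
                // 9 loads per output; neighboring threads reload the same pixels.
                acc += in[(y + fy) * width + (x + fx)] * filt[(fy + 1) * 3 + (fx + 1)];
        out[y * width + x] = acc;
    }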

  9. Data Reuse with Loop Blocking
Optimization 2: Block the image in registers
[Figure: typical implementation loads 9 input pixels per output pixel with no register-level reuse; our approach loads 16 input pixels for 4 output pixels (4 inputs per output) by doing more work per thread]
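A minimal CUDA sketch of Optimization 2 (hypothetical code, not the authors' released implementation): each thread computes a 2x2 block of outputs from a 4x4 input patch held in registers, matching the 16-inputs-for-4-outputs count on this slide, so 16 loads replace the 36 a non-blocked kernel would issue.

    // Hypothetical sketch of register blocking for a 3x3 filter:
    // one thread produces a 2x2 block of outputs from a 4x4 register patch.
    __global__ void conv3x3_blocked(const float* in, const float* filt,
                                    float* out, int width, int height)
    {
        // Top-left output pixel of this thread's 2x2 block.
        int x0 = (blockIdx.x * blockDim.x + threadIdx.x) * 2;
        int y0 = (blockIdx.y * blockDim.y + threadIdx.y) * 2;
        if (x0 < 1 || y0 < 1 || x0 + 2 >= width || y0 + 2 >= height) return;  // skip border

        // Load the 4x4 input patch once; full unrolling keeps it in registers.
        float patch[4][4];
        #pragma unroll
        for (int i = 0; i < 4; ++i) {
            #pragma unroll
            for (int j = 0; j < 4; ++j)
                patch[i][j] = in[(y0 - 1 + i) * width + (x0 - 1 + j)];
        }

        // Each of the 4 outputs reuses 9 of the 16 values already in registers.
        #pragma unroll
        for (int oy = 0; oy < 2; ++oy) {
            #pragma unroll
            for (int ox = 0; ox < 2; ++ox) {
                float acc = 0.0f;
                #pragma unroll
                for (int fy = 0; fy < 3; ++fy) {
                    #pragma unroll
                    for (int fx = 0; fx < 3; ++fx)
                        acc += patch[oy + fy][ox + fx] * filt[fy * 3 + fx];
                }
                out[(y0 + oy) * width + (x0 + ox)] = acc;
            }
        }
    }

In practice this blocking would be combined with the texture-cache reads from Optimization 1 and tuned block sizes; the 2x2 blocking here is chosen only to match the counts on the slide.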

  10. Comparison with Related Work
[Chart: inverse roofline model, NVIDIA GTX680 (Kepler)]

  11. Comparison with Related Work
[Chart: with texture cache and blocking (ours), NVIDIA GTX680 (Kepler)]

  12. – 15. Comparison with Related Work
[Charts: NVIDIA GTX680 (Kepler)]

  16. Comparison with Related Work
[Chart: 4.5x speedup, NVIDIA GTX680 (Kepler)]

  17. Are we done?
• Are we done optimizing memory communication?
  • I think so. We achieved the memory bandwidth bound for small filters.
• Future work: optimize computation some more!

  18. Conclusions
• If we want fast computer vision, we need fast convolution.
• If we want fast convolution on GPUs, we need to optimize memory communication.
• Up to 4.5x faster than existing GPU languages and libraries
• Download our code! https://github.com/forresti/convolution
  • Use/modify it for your language/library/application
