Accelerating MATLAB Image Processing Toolbox Functions on GPUs
Jingfei Kong, Martin Dimitrov, Yi Yang, Janaka Liyanage, Lin Cao, Jacob Staples, Mike Mantor, Huiyang Zhou
Motivation
• With high memory bandwidth and teraflops of computing capability, Graphics Processing Units (GPUs) have become quite attractive for accelerating general-purpose applications
• Developing high-performance GPU programs, however, requires a deep understanding of both the application algorithms and the GPU hardware architecture
• A systematic way of dealing with a generic class of applications is missing
University of Central Florida
Our Contributions
• Compare performance-critical hardware features in different GPUs
• Develop high-quality open-source library code for some representative functions in MATLAB™ Image Processing Toolbox (IPT)
  • https://sites.google.com/site/iptatiproject/ [15]
• Reveal insights on efficiently accelerating a wide range of image processing algorithms
Presentation Outline
• Motivation
• Our Contributions
• Implication of GPU hardware on GPGPU programming
• A GPGPU library for IPT functions
  • categorization and optimization strategies
• Case Studies
  • 2D convolution
  • dither
• Conclusions
Implication of GPU hardware on GPGPU programming
Summary of the Library: MATLAB Image Processing Toolbox (IPT) Function Classification
MATLAB IPT Function Classification and Optimization Strategies
• Characteristics: straightforward one-to-one mapping from input pixels to output pixels, abundant parallelism
• Strategies: use bandwidth effectively by packing multiple pixels into each memory access; where possible, perform several such lightweight tasks per transfer to amortize the CPU-GPU data transfer overhead
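To make the pixel-packing strategy concrete, here is a minimal CPU-side sketch in Python. It assumes 8-bit pixels packed four to a 32-bit word, and it uses an image complement (255 minus each pixel, in the spirit of IPT's imcomplement) purely as an illustration; the function names are hypothetical and the slides do not say which library functions are packed this way.

```python
def pack4(pixels):
    """Pack four 8-bit pixels into one 32-bit word (first pixel in the low byte)."""
    assert len(pixels) == 4 and all(0 <= p <= 255 for p in pixels)
    word = 0
    for k, p in enumerate(pixels):
        word |= p << (8 * k)
    return word


def unpack4(word):
    """Split a 32-bit word back into its four 8-bit pixels."""
    return [(word >> (8 * k)) & 0xFF for k in range(4)]


def complement_packed(word):
    """Complement all four pixels with a single 32-bit operation.

    Each byte is at most 0xFF, so the subtraction never borrows across
    byte boundaries and every pixel p becomes 255 - p.
    """
    return 0xFFFFFFFF - word


packed = pack4([0, 1, 128, 255])
result = unpack4(complement_packed(packed))
print(result)
```

One word-wide load, one operation, and one store handle four pixels, which is the kind of bandwidth saving the bullet above describes.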
MATLAB IPT Function Classification and Optimization Strategies
• Characteristics: still a one-to-one mapping, but adjacent output pixels are computed from overlapping sets of input pixels
• Strategies: data reuse, computation reuse
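As a small illustration of computation reuse, the Python sketch below computes sliding-window sums along one image row. The windowed sum is an assumed example (the slides do not name a specific function here), and the function names are hypothetical. Adjacent windows share all but one pixel, so each new sum can be derived from the previous one instead of being recomputed.

```python
def window_sums_naive(row, w):
    """Recompute every window from scratch: w additions per output."""
    return [sum(row[i:i + w]) for i in range(len(row) - w + 1)]


def window_sums_reuse(row, w):
    """Reuse the previous sum: add the entering pixel, subtract the leaving one."""
    if len(row) < w:
        return []
    s = sum(row[:w])
    out = [s]
    for i in range(w, len(row)):
        s += row[i] - row[i - w]
        out.append(s)
    return out


row = [3, 1, 4, 1, 5, 9, 2, 6]
print(window_sums_reuse(row, 3))
```

Both functions return the same values; the second does a constant amount of work per output regardless of the window width.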
MATLAB IPT Function Classification and Optimization Strategies
• Characteristics: lack of explicit parallelism
• Strategies: re-think algorithms, explore inherent parallelism
MATLAB IPT Function Classification and Optimization Strategies
• Characteristics: lack of explicit parallelism, sequential nature with data dependencies and fine-grained communication requirements
• Strategies: try it anyway; the speedup may be a pleasant surprise
Summary of the Library: Performance Comparison against MATLAB CPU (single-threaded)
2D Convolution Overview
[Figure: a 3 x 3 filter applied to a 3 x 3 block of input pixels (values 1 to 9) produces one output pixel (55)]
2D Convolution Overview
• Drag the filter over each pixel of the source image, then multiply and accumulate the overlapped input elements to generate an output pixel.
[Figure: the filter positioned over a pixel of the input image]
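The multiply-and-accumulate step can be sketched as a plain reference implementation in Python. This is a minimal CPU-side sketch rather than the library's GPU kernel: the function name is hypothetical, only the "valid" region (where the filter fits entirely inside the image) is computed, and the filter is flipped in both axes, which is the usual definition of convolution.

```python
def conv2d_valid(image, kernel):
    """Direct 2D convolution over the region where the filter fully overlaps the image."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for y in range(ih - kh + 1):
        out_row = []
        for x in range(iw - kw + 1):
            acc = 0
            for u in range(kh):
                for v in range(kw):
                    # Flip the filter in both axes (convolution, not correlation).
                    acc += image[y + u][x + v] * kernel[kh - 1 - u][kw - 1 - v]
            out_row.append(acc)
        out.append(out_row)
    return out


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
ones = [[1, 1, 1],
        [1, 1, 1],
        [1, 1, 1]]
print(conv2d_valid(image, ones))
```

Every output pixel is independent of the others, which is why the operation maps naturally onto one GPU thread per output pixel (or per group of output pixels).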
2D Convolution: Intra-Thread Data Reuse
• Each thread computes multiple pixels along the column
• Intra-thread reuse:
  • For a 7x7 filter we reuse each input pixel up to 7 times
[Figure: successive positions of Thread i down a column of the input image]
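The effect of intra-thread reuse on memory traffic can be sketched by simulating a single thread in Python. The function below is a hypothetical illustration, not the authors' kernel: it keeps the last k rows of a k-wide strip in a sliding window (standing in for registers), loads each input row once, and counts the loads. For brevity it applies the filter without flipping it.

```python
from collections import deque


def column_outputs_with_reuse(image, kernel, x, y0, n):
    """Compute n vertically adjacent outputs starting at row y0, column x.

    Returns the outputs and the number of input pixels loaded.
    """
    k = len(kernel)
    window = deque(maxlen=k)  # the k most recently loaded rows
    loads = 0
    outputs = []
    for r in range(y0, y0 + n + k - 1):
        window.append(image[r][x:x + k])  # load k pixels exactly once
        loads += k
        if len(window) == k:
            acc = sum(window[u][v] * kernel[u][v]
                      for u in range(k) for v in range(k))
            outputs.append(acc)
    return outputs, loads


image = [[5 * r + c for c in range(5)] for r in range(5)]
ones = [[1] * 3 for _ in range(3)]
outputs, loads = column_outputs_with_reuse(image, ones, x=1, y0=0, n=3)
naive_loads = 3 * 3 * 3  # n outputs, each reloading a full k x k window
print(outputs, loads, naive_loads)
```

With reuse the thread loads (n + k - 1) * k pixels instead of n * k * k, so for long columns each loaded pixel contributes to up to k outputs, matching the "up to 7 times" figure for a 7x7 filter.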
2D Convolution: Inter-Thread Data Reuse
• Threads in the same warp/wavefront access the same row.
• Inter-thread reuse:
  • The row is fetched into texture cache/shared memory and reused by different threads on subsequent accesses.
[Figure: threads 0 to 3 reading a reused row of the input image held in texture cache/shared memory]
2D Convolution Performance
A 4096 x 4096 image with a 7 x 7 filter
• Jacket:
  • around 20 GFLOPS on GTX 280
  • Jacket 1.2.2 trial version (released on 1/4/2010) from AccelerEyes®
• Ours:
  • around 350 GFLOPS on GTX 280
  • around 733 GFLOPS on HD 5870
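To put these throughput figures in terms of running time, the following Python sketch does the arithmetic under stated assumptions: one multiply and one add per filter tap (2 floating-point operations), one output per input pixel, and no special treatment of image borders. The slides do not say how the operation count was measured, so the resulting times are rough estimates only.

```python
def conv_flops(height, width, k, flops_per_tap=2):
    """Operation count for a k x k filter: one multiply and one add per tap (assumed)."""
    return height * width * k * k * flops_per_tap


def runtime_ms(flops, gflops):
    """Time in milliseconds to execute `flops` operations at `gflops` GFLOPS."""
    return flops / (gflops * 1e9) * 1e3


total = conv_flops(4096, 4096, 7)
for label, rate in [("Jacket, GTX 280", 20),
                    ("Ours, GTX 280", 350),
                    ("Ours, HD 5870", 733)]:
    print(f"{label}: about {runtime_ms(total, rate):.1f} ms")
```

Under these assumptions the filter costs roughly 1.6 billion operations, so the quoted rates correspond to tens of milliseconds versus a few milliseconds per image.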
Data Dependent Case Study: Dither
Dither
[Figure: an input pixel of value 230 is compared with the threshold 128 ("230 < 128?") to choose a 0/1 output pixel, here 1; Error = 230 – 128 = 102]
Dither – Data Dependency
[Figure: data dependency of the pixel at (i, j); axis labels i, j and i+j]
Dither – Parallel Processing Schedule
From P. Metaxas [8]

 1  2  3  4  5  6  7  8 ...
 3  4  5  6  7  8  9 10
 5  6  7  8  9 10 11 12
 7  8  9 10 11 12 13 14
 9 10 11 12 13 14 15 16
11 12 13 14 15 16 17 18
13 14 15 16 17 18 19 20
15 16 17 18 19 20 21 22
...
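The schedule in this figure assigns the pixel in row i, column j (counting from 0) the time step 2i + j + 1. The Python sketch below checks that such a schedule respects the data dependencies, assuming the Floyd-Steinberg error-diffusion pattern in which a pixel needs the results of its left, upper-left, upper, and upper-right neighbors. That neighbor pattern and all function names are assumptions made for illustration; the slide text does not spell them out.

```python
# Offsets (row, column) of the neighbors a pixel depends on (assumed pattern).
DEPS = [(0, -1), (-1, -1), (-1, 0), (-1, 1)]


def schedule(rows, cols):
    """Time step for every pixel: 2*i + j + 1."""
    return [[2 * i + j + 1 for j in range(cols)] for i in range(rows)]


def is_valid(sched):
    """True if every dependency is processed in a strictly earlier step."""
    rows, cols = len(sched), len(sched[0])
    for i in range(rows):
        for j in range(cols):
            for di, dj in DEPS:
                a, b = i + di, j + dj
                if 0 <= a < rows and 0 <= b < cols and sched[a][b] >= sched[i][j]:
                    return False
    return True


def wavefronts(sched):
    """Group pixel coordinates by time step; each group can run in parallel."""
    groups = {}
    for i, row in enumerate(sched):
        for j, t in enumerate(row):
            groups.setdefault(t, []).append((i, j))
    return [groups[t] for t in sorted(groups)]


s = schedule(8, 8)
fronts = wavefronts(s)
print(len(fronts), max(len(f) for f in fronts))
```

For an 8 x 8 image the 64 pixels need 22 steps and at most 4 pixels are ever processed together, which shows how limited the available parallelism is for this kind of data-dependent function.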
Dither – Our GPU Implementation
[Figure: processing-order grids for the GPU implementation]
• A relatively small number of thread blocks/threads are active at any given time
  • low resource utilization
  • synchronization overhead (among thread blocks/threads)
• We still get up to 10.3x kernel speedup and 3.5x overall speedup!
Conclusions
• We identify performance-critical hardware features for GPGPU programs
• We present our experience and optimization strategies in developing high-performance GPU code for functions from the MATLAB Image Processing Toolbox
Our Open-source Library Project Website
https://sites.google.com/site/iptatiproject/ [15]
You are more than welcome to contribute!
Thank you. Questions?