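This project aims to accelerate the fast Fourier transform (FFT) on GPUs so that programmable GPUs can replace fixed-function ASIC hardware. The FFT is a critical building block across many disciplines, including wideband channelization.
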
Accelerating Fast Fourier Transform for Wideband Channelization
Carlo del Mundo*, Vignesh Adhinarayanan§, Wu-chun Feng*§
* Department of Electrical and Computer Engineering, § Department of Computer Science, Virginia Tech

Forecast
• Goal: Accelerate the Fast Fourier Transform (FFT) using graphics processing units (GPUs)
• Replace fixed-hardware ASICs with programmable GPUs
Image sources:
http://upload.wikimedia.org/wikipedia/commons/7/79/SSDTR-ASIC_technology.jpga
http://images.bit-tech.net/content_images/2011/12/amd-radeon-hd-7970-3gb-review/amd-radeon-hd7970-e.jpg

Motivation
• FFT is a critical building block across many disciplines
Image sources:
http://www.ajnr.org/content/27/6/1230/F1.large.jpg
http://www.elektrodaily.com/wp-content/uploads/2013/02/shazam-app.png
http://www.wireless.vt.edu/symposium/2012/tutorials/sessionA2.html

Introduction
• Wideband Channelization
  • Purpose: To isolate channels within a wideband signal
[Figure: Stages in a PFB Channelizer]
Image source: http://www.wireless.vt.edu/symposium/2012/tutorials/sessionA2.html

Introduction (Channelization)
• Algorithm: Polyphase filter bank (PFB) channelizer
• Problem: Of the channelizer's stages, the FFT stage grows fastest
[Figure: Stages in a PFB Channelizer]

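As a rough illustration of the PFB channelizer named above, the following plain-C sketch assumes the common two-stage form: a polyphase filter followed by a transform across the filter branches. The function name `pfb_step`, its parameters, and the `float2` struct (standing in for OpenCL's `float2`) are hypothetical and not taken from the slides. The second stage is written as a naive DFT with a caller-supplied twiddle table for clarity; it is the stage that an FFT would replace.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for OpenCL's float2: x is the real part, y the imaginary part. */
typedef struct { float x, y; } float2;

/* Complex multiply: (a.x + j a.y) * (b.x + j b.y). */
static float2 cmul(float2 a, float2 b)
{
    float2 r = { a.x * b.x - a.y * b.y, a.x * b.y + a.y * b.x };
    return r;
}

/* One output step of a PFB channelizer (hypothetical helper).
 *   x : the M*P most recent complex input samples
 *   h : M*P real taps of the prototype low-pass filter
 *   w : M twiddles, w[k] = exp(-j*2*pi*k/M), precomputed by the caller
 *   u : scratch space for the M branch outputs
 *   y : M outputs, one sample per channel
 */
void pfb_step(const float2 *x, const float *h, const float2 *w,
              size_t M, size_t P, float2 *u, float2 *y)
{
    assert(M > 0 && P > 0);

    /* Stage 1: polyphase filter. Branch r sums every M-th weighted sample. */
    for (size_t r = 0; r < M; ++r) {
        float2 acc = { 0.0f, 0.0f };
        for (size_t p = 0; p < P; ++p) {
            size_t n = r + p * M;
            acc.x += h[n] * x[n].x;
            acc.y += h[n] * x[n].y;
        }
        u[r] = acc;
    }

    /* Stage 2: M-point DFT across the branches, O(M^2) as written.
       An FFT computes the same result in O(M log M). */
    for (size_t k = 0; k < M; ++k) {
        float2 acc = { 0.0f, 0.0f };
        for (size_t r = 0; r < M; ++r) {
            float2 t = cmul(u[r], w[(k * r) % M]);
            acc.x += t.x;
            acc.y += t.y;
        }
        y[k] = acc;
    }
}
```

Under these assumptions, calling `pfb_step` once for every M new input samples yields one output sample per channel. A tone centered on channel k then shows up in `y[k]`, while the prototype filter suppresses it in the other channels.
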
Choosing the Right Processor
• Criteria: Programmability & Performance
Image sources:
http://images.bit-tech.net/content_images/2011/12/amd-radeon-hd-7970-3gb-review/amd-radeon-hd7970-e.jpg
http://www.maximumpc.com/files/u154082/intel_cpu_socket3.jpg
http://fr.academic.ru/pictures/frwiki/70/Fpga_xilinx_spartan.jpg
http://upload.wikimedia.org/wikipedia/commons/7/79/SSDTR-ASIC_technology.jpga
Carlo del Mundo, cdel@vt.edu, carlodelmundo.com

Outline
• Motivation
• Introduction
• Background
• Approach
  • System-level optimizations
  • Algorithm-level optimizations
• Results
  • Optimizations in isolation
  • Optimizations in concert
• Conclusion

Background (GPUs)
• GPU Memory Hierarchy
  • Global Memory
  • Image Memory
  • Constant Memory
  • Local Memory
  • Registers
[Table: Memory Read Bandwidth for Radeon HD 6970]

Approach
• Act as the “human compiler”
• Derive a candidate set of optimizations for FFT on GPUs
• Apply optimizations in isolation
• Apply optimizations in concert
[Diagram: Candidate Optimizations, Optimizations in Isolation, Optimizations in Concert]

Approach
• System-level Optimizations (applicable to any application)
  • Register Preloading
  • Vector Access/{Vector,Scalar} Arithmetic
  • Constant Memory Usage
  • Dynamic Instruction Reduction
  • Memory Coalescing
  • Image Memory
• Algorithm-level Optimizations (LM = local memory)
  • Naïve Transpose via LM (LM-CM)
  • Compute/Transpose via LM (LM-CC)
  • Compute/No Transpose via LM (LM-CT)
C. del Mundo et al., “Accelerating Fast Fourier Transform for Wideband Channelization,” IEEE ICC, Budapest, Hungary, June 2013.

System-level Optimizations
• Register Preloading (RP)
  • Load to registers first

Without Register Preloading

    __kernel void unoptimized(__global float2 *buffer)
    {
        int index = …;
        buffer += index;

        // Operands are passed as pointers into global memory
        FFT4_in_order_output(&buffer[0], &buffer[4], &buffer[8], &buffer[12]);
        // … remainder of the kernel is not shown in the slides
    }

With Register Preloading

    __kernel void optimized(__global float2 *buffer)
    {
        int index = …;
        buffer += index;

        __private float2 r0, r1, r2, r3; // Register declaration
        // Explicit loads from global memory into registers
        r0 = buffer[0]; r1 = buffer[1]; r2 = buffer[2]; r3 = buffer[3];
        FFT4_in_order_output(&r0, &r1, &r2, &r3);
        // … remainder of the kernel is not shown in the slides
    }

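The slides do not show the body of `FFT4_in_order_output`. The sketch below, in plain C rather than OpenCL, assumes it is a 4-point forward DFT that writes its results back in natural order. The `float2` struct stands in for OpenCL's `float2`, and the names `fft4_in_order` and `fft4_preloaded` are hypothetical. `fft4_preloaded` mirrors the register-preloading pattern: copy the operands into locals, run the butterfly on the locals, then store the results.

```c
#include <assert.h>

/* Stand-in for OpenCL's float2: x is the real part, y the imaginary part. */
typedef struct { float x, y; } float2;

static float2 cadd(float2 a, float2 b)
{
    float2 r = { a.x + b.x, a.y + b.y };
    return r;
}

static float2 csub(float2 a, float2 b)
{
    float2 r = { a.x - b.x, a.y - b.y };
    return r;
}

/* Multiply by -j: (x + jy) * (-j) = y - jx. */
static float2 mul_neg_j(float2 a)
{
    float2 r = { a.y, -a.x };
    return r;
}

/* Assumed behavior: in-place 4-point forward DFT, natural-order output.
   The radix-4 twiddles are 1, -j, -1, j, so no multiplications are needed. */
void fft4_in_order(float2 *a0, float2 *a1, float2 *a2, float2 *a3)
{
    float2 s02 = cadd(*a0, *a2);
    float2 d02 = csub(*a0, *a2);
    float2 s13 = cadd(*a1, *a3);
    float2 d13 = mul_neg_j(csub(*a1, *a3));

    *a0 = cadd(s02, s13);  /* X[0] = x0 + x1 + x2 + x3       */
    *a1 = cadd(d02, d13);  /* X[1] = (x0 - x2) - j (x1 - x3) */
    *a2 = csub(s02, s13);  /* X[2] = x0 - x1 + x2 - x3       */
    *a3 = csub(d02, d13);  /* X[3] = (x0 - x2) + j (x1 - x3) */
}

/* Register-preloading pattern: operands live in locals during the butterfly,
   so the buffer is read once and written once. */
void fft4_preloaded(float2 *buffer)
{
    assert(buffer != 0);
    float2 r0 = buffer[0], r1 = buffer[1], r2 = buffer[2], r3 = buffer[3];
    fft4_in_order(&r0, &r1, &r2, &r3);
    buffer[0] = r0; buffer[1] = r1; buffer[2] = r2; buffer[3] = r3;
}
```

On a CPU the compiler is free to keep the locals in registers either way; on a GPU the explicit loads are meant to keep the butterfly's intermediate reads and writes out of global memory.
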
System-level Optimizations
• Vector Access (float{2, 4, 8, 16})
[Diagram: elements a[0], a[1], a[2], a[3]]