Enabling Efficient Intra-Warp Communication for Fourier Transforms in a Many-Core Architecture
Student: Carlo C. del Mundo*, Virginia Tech (Undergrad)
Advisor: Dr. Wu-chun Feng*§, Virginia Tech
* Department of Electrical and Computer Engineering, § Department of Computer Science, Virginia Tech
Forecast: Hardware-Software Co-Design
• Software (Transpose)
• Hardware (K20c and shuffle)
Figure labels: Shuffle Mechanism; NVIDIA Kepler K20c
Q: What is shuffle?
Cheaper data movement
• Faster than shared memory
• Only in NVIDIA Tesla Kepler GPUs
• Limited to a warp
>>> Idea: reduce data communication between threads <<<
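To make the data movement concrete, here is a minimal host-side sketch of what an indexed warp shuffle does: every lane publishes one register value and reads the value published by a lane of its choice. On the device this role is played by CUDA's `__shfl` family of intrinsics; the names `model_shuffle` and `rotate_left_by_one` are hypothetical and exist only for this illustration, and the code models the semantics rather than the hardware.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kWarpSize = 32;  // lanes per warp, as stated on the slides

// Host-side model of an indexed warp shuffle: lane i receives the value
// that lane src_lane[i] currently holds. No shared memory is involved.
template <typename T>
std::array<T, kWarpSize> model_shuffle(const std::array<T, kWarpSize>& value,
                                       const std::array<int, kWarpSize>& src_lane) {
    std::array<T, kWarpSize> result{};
    for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
        // A lane can only read from another lane of the same warp.
        assert(src_lane[lane] >= 0 && src_lane[lane] < static_cast<int>(kWarpSize));
        result[lane] = value[static_cast<std::size_t>(src_lane[lane])];
    }
    return result;
}

// Example pattern: each lane reads from its right-hand neighbour, wrapping around.
inline std::array<int, kWarpSize> rotate_left_by_one(const std::array<int, kWarpSize>& value) {
    std::array<int, kWarpSize> src_lane{};
    for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
        src_lane[lane] = static_cast<int>((lane + 1) % kWarpSize);
    }
    return model_shuffle(value, src_lane);
}
```

The bounds check in `model_shuffle` mirrors the "limited to a warp" restriction: there is no way to name a lane outside the 32 that make up the warp.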
Q: What are you solving?
• Enable efficient data communication
  • Shared Memory (the “old” way)
  • Shuffle (the “new” way)
Approach
• Evaluate shuffle using matrix transpose
  • Matrix transpose is a data communication step in FFT
• Devised Shuffle Transpose Algorithm
  • Consists of horizontal (inter-thread shuffle) and vertical (intra-thread) stages
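One way to combine horizontal and vertical stages into a transpose is sketched below for 4 threads holding 4 registers each, modelled on the host. This is an illustrative three-stage scheme (rotate, shuffle, rotate) consistent with the description above; the slides do not spell out the exact stage order or index formulas, so the function names and the specific rotations are assumptions.

```cpp
#include <array>
#include <cassert>

constexpr int kN = 4;                               // 4 threads, 4 registers per thread
using Regs = std::array<std::array<int, kN>, kN>;   // regs[tid][k]

inline int wrap(int x) { return ((x % kN) + kN) % kN; }

// Vertical (intra-thread): thread tid rotates its own registers by tid positions.
Regs vertical_rotate_down(const Regs& in) {
    Regs out{};
    for (int tid = 0; tid < kN; ++tid)
        for (int k = 0; k < kN; ++k)
            out[tid][k] = in[tid][wrap(k - tid)];
    return out;
}

// Horizontal (inter-thread): one shuffle per register index k.
// In shuffle k, thread tid reads register k of lane (k - tid) mod 4.
Regs horizontal_shuffle(const Regs& in) {
    Regs out{};
    for (int k = 0; k < kN; ++k)
        for (int tid = 0; tid < kN; ++tid)
            out[tid][k] = in[wrap(k - tid)][k];
    return out;
}

// Vertical (intra-thread): undo the skew so register j of thread tid holds in[j][tid].
Regs vertical_rotate_up(const Regs& in) {
    Regs out{};
    for (int tid = 0; tid < kN; ++tid)
        for (int k = 0; k < kN; ++k)
            out[tid][k] = in[tid][wrap(k + tid)];
    return out;
}

// Full transpose: thread tid starts with row tid and ends with column tid.
Regs shuffle_transpose(const Regs& in) {
    return vertical_rotate_up(horizontal_shuffle(vertical_rotate_down(in)));
}
```

Only the middle stage moves data between threads. The two vertical stages index a thread's registers by an expression that depends on `tid`, which is the kind of access the Analysis slides identify as the bottleneck.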
Analysis
• Bottleneck: Intra-thread data movement
Figure: Stage 2 (Vertical), register file of threads t0, t1, t2, t3
Code 1 (NAIVE), annotated 15x on the slide:
    for (int k = 0; k < 4; ++k)
        dst_registers[k] = src_registers[(4 - tid + k) % 4];
Analysis: General strategies
• Registers are fast.
• CUDA local memory is slow.
• The compiler is forced to place data into CUDA local memory if array indices CANNOT be determined at compile time.

Code 1 (NAIVE), annotated 15x on the slide:
    for (int k = 0; k < 4; ++k)
        dst_registers[k] = src_registers[(4 - tid + k) % 4];

Code 2 (DIV), annotated 6% on the slide; the branches are marked "Divergence":
    int tmp = src_registers[0];
    if (tid == 1) {
        src_registers[0] = src_registers[3];
        src_registers[3] = src_registers[2];
        src_registers[2] = src_registers[1];
        src_registers[1] = tmp;
    } else if (tid == 2) {
        src_registers[0] = src_registers[2];
        src_registers[2] = tmp;
        tmp = src_registers[1];
        src_registers[1] = src_registers[3];
        src_registers[3] = tmp;
    } else if (tid == 3) {
        src_registers[0] = src_registers[1];
        src_registers[1] = src_registers[2];
        src_registers[2] = src_registers[3];
        src_registers[3] = tmp;
    }

Code 3 (SELP OOP), annotated 44% on the slide:
    dst_registers[0] = (tid == 0) ? src_registers[0] : dst_registers[0];
    dst_registers[1] = (tid == 0) ? src_registers[1] : dst_registers[1];
    dst_registers[2] = (tid == 0) ? src_registers[2] : dst_registers[2];
    dst_registers[3] = (tid == 0) ? src_registers[3] : dst_registers[3];

    dst_registers[0] = (tid == 1) ? src_registers[3] : dst_registers[0];
    dst_registers[3] = (tid == 1) ? src_registers[2] : dst_registers[3];
    dst_registers[2] = (tid == 1) ? src_registers[1] : dst_registers[2];
    dst_registers[1] = (tid == 1) ? src_registers[0] : dst_registers[1];

    dst_registers[0] = (tid == 2) ? src_registers[2] : dst_registers[0];
    dst_registers[2] = (tid == 2) ? src_registers[0] : dst_registers[2];
    dst_registers[1] = (tid == 2) ? src_registers[3] : dst_registers[1];
    dst_registers[3] = (tid == 2) ? src_registers[1] : dst_registers[3];

    dst_registers[0] = (tid == 3) ? src_registers[1] : dst_registers[0];
    dst_registers[1] = (tid == 3) ? src_registers[2] : dst_registers[1];
    dst_registers[2] = (tid == 3) ? src_registers[3] : dst_registers[2];
    dst_registers[3] = (tid == 3) ? src_registers[0] : dst_registers[3];
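The three variants are meant to compute the same per-thread rotation and differ only in how the register indices are expressed. The host-side sketch below checks that equivalence for `tid` in 0..3. The function names are hypothetical, and the select variant is written as two loops for brevity (the slide spells out all sixteen selects); it says nothing about performance, which depends on how the device compiler handles each form.

```cpp
#include <array>
#include <cassert>

using Regs4 = std::array<int, 4>;

// Code 1 (NAIVE): the source index depends on tid, so it is unknown at compile time.
Regs4 rotate_naive(const Regs4& src, int tid) {
    assert(tid >= 0 && tid < 4);
    Regs4 dst{};
    for (int k = 0; k < 4; ++k)
        dst[k] = src[(4 - tid + k) % 4];
    return dst;
}

// Code 2 (DIV): constant indices, but threads take different branches.
Regs4 rotate_div(Regs4 r, int tid) {
    int tmp = r[0];
    if (tid == 1) {
        r[0] = r[3]; r[3] = r[2]; r[2] = r[1]; r[1] = tmp;
    } else if (tid == 2) {
        r[0] = r[2]; r[2] = tmp;
        tmp = r[1]; r[1] = r[3]; r[3] = tmp;
    } else if (tid == 3) {
        r[0] = r[1]; r[1] = r[2]; r[2] = r[3]; r[3] = tmp;
    }
    return r;  // tid == 0 leaves the registers unchanged
}

// Code 3 (SELP OOP): no branches. Every thread evaluates every select, and
// once both loops are unrolled each index is a compile-time constant.
Regs4 rotate_selp(const Regs4& src, int tid) {
    Regs4 dst{};
    for (int t = 0; t < 4; ++t)
        for (int k = 0; k < 4; ++k)
            dst[k] = (tid == t) ? src[(4 - t + k) % 4] : dst[k];
    return dst;
}
```

For example, with source registers {10, 11, 12, 13} and `tid == 1`, all three return {13, 10, 11, 12}.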
Results
[Figures not preserved in this transcript.]
Conclusion
• Overall Performance
  • Max. Speedup (Amdahl’s Law): 1.19-fold
  • Achieved Speedup: 1.17-fold
• Surprise Result
  • Goal: Accelerate communication (“gray bar”)
  • Result: Accelerated the computation also (“black bar”)
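For readers who want to see where an Amdahl bound of this kind comes from, the sketch below evaluates the law directly. It assumes the 1.19-fold figure is the limit reached when the communication step costs nothing at all; under that assumption the communication share of the runtime works out to roughly 16%. That share is derived here for illustration and is not a number given on the slides.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Amdahl's law: overall speedup when a fraction p of the runtime
// is accelerated by a factor s and the rest is unchanged.
double amdahl_speedup(double p, double s) {
    assert(p >= 0.0 && p <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - p) + p / s);
}

// Inverse view: the runtime fraction that must be accelerable for the
// upper bound (s -> infinity) to equal max_speedup.
double fraction_from_bound(double max_speedup) {
    assert(max_speedup >= 1.0);
    return 1.0 - 1.0 / max_speedup;
}
```

Calling `fraction_from_bound(1.19)` gives about 0.16, and `amdahl_speedup` with that fraction and an infinite local speedup returns 1.19. The model treats the remaining runtime as fixed, so it does not capture the surprise result that the computation was accelerated as well.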
Thank You!
• Enabling Efficient Intra-Warp Communication for Fourier Transforms in a Many-Core Architecture
• Student: Carlo del Mundo, Virginia Tech (undergrad)
• Overall Performance
  • Theoretical Speedup: 1.19-fold
  • Achieved Speedup: 1.17-fold
Code 1 (NAIVE):
    for (int k = 0; k < 4; ++k)
        dst_registers[k] = src_registers[(4 - tid + k) % 4];
Appendix
Motivation
• Goal
  • Accelerate an application using hardware-specific mechanisms (the “hardware-software co-design process”)
• Case Study
  • Application: Matrix transpose as part of a 256-pt FFT
  • Architecture: NVIDIA Kepler K20c
  • Use shuffle to accelerate communication
• Results
  • Max. Theoretical Speedup: 1.19-fold
  • Achieved Speedup: 1.17-fold
Background: The New and Old
• Shuffle
  • Idea
    • Communicate data within a warp without shared memory
  • Pros
    • Faster (1 cycle to perform load and store)
    • Eliminates the use of shared memory, allowing higher thread occupancy
  • Cons
    • Poorly understood
    • Only available in Kepler GPUs
    • Limited to a warp (32 threads)
• Shared Memory
  • Idea
    • Scratchpad memory to communicate data
  • Pros
    • Easy to program
    • Scales to a block (up to 1536 threads)
  • Cons
    • Prone to bank conflicts
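The bank-conflict drawback of shared memory can be illustrated with a small host-side model. It assumes 32 banks, each one 4-byte word wide, with consecutive words mapped to consecutive banks; that layout is an assumption made for this sketch and is not stated on the slides. The function name `conflict_degree` is hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

constexpr int kBanks = 32;      // assumed: 32 banks of 4-byte words
constexpr int kWarpLanes = 32;  // threads per warp

// Worst-case number of lanes hitting the same bank when lane i accesses
// the word at index i * stride_words. A result of 1 means conflict-free;
// a result of n means the access is serialized into n passes in this model.
int conflict_degree(int stride_words) {
    assert(stride_words >= 1);  // stride 0 would be a broadcast, not a conflict
    std::array<int, kBanks> hits{};
    for (int lane = 0; lane < kWarpLanes; ++lane) {
        hits[(lane * stride_words) % kBanks] += 1;
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

Under this model, reading a column of a 32-word-wide tile (stride 32) sends all 32 lanes to one bank, while padding each row to 33 words (stride 33) spreads them over all banks. A shuffle never touches shared memory, so this class of problem does not arise.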