Programming Massively Parallel Processors Lecture Slides for Chapter 8: Application Case Study – Advanced MRI Reconstruction © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign
Objective • To learn about computational thinking skills through a concrete example • Problem formulation • Designing implementations to steer around limitations • Validating results • Understanding the impact of your improvements • A top to bottom experience! © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign
Acknowledgements Sam S. Stone§, Haoran Yi§, Justin P. Haldar†, Deepthi Nandakumar, Bradley P. Sutton†, Zhi-Pei Liang†, Keith Thulburin* §Center for Reliable and High-Performance Computing †Beckman Institute for Advanced Science and Technology Department of Electrical and Computer Engineering University of Illinois at Urbana-Champaign * University of Illinois, Chicago Medical Center © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign
Overview • Magnetic resonance imaging • Non-Cartesian Scanner Trajectory • Least-squares (LS) reconstruction algorithm • Optimizing the LS reconstruction on the G80 • Overcoming bottlenecks • Performance tuning • Summary © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign
Reconstructing MR Images
[Diagram: Cartesian scan data → FFT; spiral scan data → gridding → FFT; spiral scan data → LS]
Cartesian scan data + FFT: Slow scan, fast reconstruction, images may be poor
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign

Reconstructing MR Images
[Reconstruction-path diagram, as above]
Spiral scan data + Gridding[1] + FFT: Fast scan, fast reconstruction, better images
[1] Based on Fig 1 of Lustig et al, Fast Spiral Fourier Transform for Iterative MR Image Reconstruction, IEEE Int'l Symp. on Biomedical Imaging, 2004
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign

Reconstructing MR Images
[Reconstruction-path diagram, as above]
Spiral scan data + Least-Squares (LS): Superior images at the expense of significantly more computation
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign
An Exciting Revolution - Sodium Map of the Brain • Images of sodium in the brain • Very large number of samples for increased SNR • Requires high-quality reconstruction • Enables study of brain-cell viability before anatomic changes occur in stroke and cancer treatment – within days! Courtesy of Keith Thulborn and Ian Atkinson, Center for MR Research, University of Illinois at Chicago © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign
Least-Squares Reconstruction • Q depends only on the scanner configuration • FHd depends on the scan data • ρ is found using a linear solver • Accelerate Q and FHd on the G80 • Q: 1-2 days on CPU • FHd: 6-7 hours on CPU • ρ: 1.5 minutes on CPU [Flow diagram: Compute Q for FHF → Acquire Data → Compute FHd → Find ρ] © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign
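The slides give only the flow above; as a minimal host-side sketch of that pipeline (the function names computeQ, computeFHd, and linearSolve are placeholders invented here, not APIs from the slides or the book):

#include <stdlib.h>

/* Placeholder prototypes: illustrative names only, not defined in these slides. */
void computeQ  (const float*, const float*, const float*, const float*, const float*,
                const float*, const float*, const float*, float*, float*, int, int);
void computeFHd(const float*, const float*, const float*, const float*, const float*,
                const float*, const float*, const float*, const float*, const float*,
                float*, float*, int, int);
void linearSolve(const float*, const float*, const float*, const float*, float*, int);

void reconstruct(const float* kx, const float* ky, const float* kz,  /* trajectory, M points   */
                 const float* rD, const float* iD,                   /* acquired k-space data  */
                 const float* rPhi, const float* iPhi,
                 const float* x, const float* y, const float* z,     /* N pixel coordinates    */
                 float* rho, int M, int N)
{
  float* rQ   = malloc(N * sizeof(float));
  float* iQ   = malloc(N * sizeof(float));
  float* rFhD = malloc(N * sizeof(float));
  float* iFhD = malloc(N * sizeof(float));

  /* Q depends only on the scanner configuration: computed once per trajectory
     and reused across scans (1-2 days on the CPU, per the slide). */
  computeQ(kx, ky, kz, rPhi, iPhi, x, y, z, rQ, iQ, M, N);

  /* FHd depends on the scan data: computed once per acquisition (6-7 hours on CPU). */
  computeFHd(kx, ky, kz, rD, iD, rPhi, iPhi, x, y, z, rFhD, iFhD, M, N);

  /* rho is found with an iterative linear solver (about 1.5 minutes on the CPU). */
  linearSolve(rQ, iQ, rFhD, iFhD, rho, N);

  free(rQ); free(iQ); free(rFhD); free(iFhD);
}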
Q vs. FHd

(a) Q computation

for (m = 0; m < M; m++) {
  phiMag[m] = rPhi[m]*rPhi[m] + iPhi[m]*iPhi[m];
  for (n = 0; n < N; n++) {
    expQ = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
    rQ[n] += phiMag[m]*cos(expQ);
    iQ[n] += phiMag[m]*sin(expQ);
  }
}

(b) FHd computation

for (m = 0; m < M; m++) {
  rMu[m] = rPhi[m]*rD[m] + iPhi[m]*iD[m];
  iMu[m] = rPhi[m]*iD[m] - iPhi[m]*rD[m];
  for (n = 0; n < N; n++) {
    expFhD = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
    cArg = cos(expFhD);
    sArg = sin(expFhD);
    rFhD[n] += rMu[m]*cArg - iMu[m]*sArg;
    iFhD[n] += iMu[m]*cArg + rMu[m]*sArg;
  }
}

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign
Algorithms to Accelerate

for (m = 0; m < M; m++) {
  rMu[m] = rPhi[m]*rD[m] + iPhi[m]*iD[m];
  iMu[m] = rPhi[m]*iD[m] - iPhi[m]*rD[m];
  for (n = 0; n < N; n++) {
    expFhD = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
    cArg = cos(expFhD);
    sArg = sin(expFhD);
    rFhD[n] += rMu[m]*cArg - iMu[m]*sArg;
    iFhD[n] += iMu[m]*cArg + rMu[m]*sArg;
  }
}

• Scan data
  • M = # scan points
  • kx, ky, kz = 3D scan data
• Pixel data
  • N = # pixels
  • x, y, z = input 3D pixel data
  • rFhD, iFhD = output pixel data
• Complexity is O(MN)
• Inner loop
  • 13 FP MUL or ADD ops
  • 2 FP trig ops
  • 12 loads, 2 stores

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign
From C to CUDA, Step 1: What unit of work is assigned to each thread?

for (m = 0; m < M; m++) {
  rMu[m] = rPhi[m]*rD[m] + iPhi[m]*iD[m];
  iMu[m] = rPhi[m]*iD[m] - iPhi[m]*rD[m];
  for (n = 0; n < N; n++) {
    expFhD = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
    cArg = cos(expFhD);
    sArg = sin(expFhD);
    rFhD[n] += rMu[m]*cArg - iMu[m]*sArg;
    iFhD[n] += iMu[m]*cArg + rMu[m]*sArg;
  }
}

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign
One Possibility

__global__ void cmpFHd(float* rPhi, iPhi, phiMag,
                       kx, ky, kz, x, y, z, rMu, iMu, int N) {
  int m = blockIdx.x * FHD_THREADS_PER_BLOCK + threadIdx.x;

  rMu[m] = rPhi[m]*rD[m] + iPhi[m]*iD[m];
  iMu[m] = rPhi[m]*iD[m] - iPhi[m]*rD[m];

  for (n = 0; n < N; n++) {
    expFhD = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
    cArg = cos(expFhD);
    sArg = sin(expFhD);
    rFhD[n] += rMu[m]*cArg - iMu[m]*sArg;
    iFhD[n] += iMu[m]*cArg + rMu[m]*sArg;
  }
}

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign
One Possibility - Improved

__global__ void cmpFHd(float* rPhi, iPhi, phiMag,
                       kx, ky, kz, x, y, z, rMu, iMu, int N) {
  int m = blockIdx.x * FHD_THREADS_PER_BLOCK + threadIdx.x;
  float rMu_reg, iMu_reg;

  rMu_reg = rMu[m] = rPhi[m]*rD[m] + iPhi[m]*iD[m];
  iMu_reg = iMu[m] = rPhi[m]*iD[m] - iPhi[m]*rD[m];

  for (n = 0; n < N; n++) {
    expFhD = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
    cArg = cos(expFhD);
    sArg = sin(expFhD);
    rFhD[n] += rMu_reg*cArg - iMu_reg*sArg;
    iFhD[n] += iMu_reg*cArg + rMu_reg*sArg;
  }
}

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign
Back to the Drawing Board – Maybe map the n loop to threads?

for (m = 0; m < M; m++) {
  rMu[m] = rPhi[m]*rD[m] + iPhi[m]*iD[m];
  iMu[m] = rPhi[m]*iD[m] - iPhi[m]*rD[m];
  for (n = 0; n < N; n++) {
    expFhD = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
    cArg = cos(expFhD);
    sArg = sin(expFhD);
    rFhD[n] += rMu[m]*cArg - iMu[m]*sArg;
    iFhD[n] += iMu[m]*cArg + rMu[m]*sArg;
  }
}

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign
(a) FHd computation

for (m = 0; m < M; m++) {
  rMu[m] = rPhi[m]*rD[m] + iPhi[m]*iD[m];
  iMu[m] = rPhi[m]*iD[m] - iPhi[m]*rD[m];
  for (n = 0; n < N; n++) {
    expFhD = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
    cArg = cos(expFhD);
    sArg = sin(expFhD);
    rFhD[n] += rMu[m]*cArg - iMu[m]*sArg;
    iFhD[n] += iMu[m]*cArg + rMu[m]*sArg;
  }
}

(b) after code motion

for (m = 0; m < M; m++) {
  for (n = 0; n < N; n++) {
    rMu[m] = rPhi[m]*rD[m] + iPhi[m]*iD[m];
    iMu[m] = rPhi[m]*iD[m] - iPhi[m]*rD[m];
    expFhD = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
    cArg = cos(expFhD);
    sArg = sin(expFhD);
    rFhD[n] += rMu[m]*cArg - iMu[m]*sArg;
    iFhD[n] += iMu[m]*cArg + rMu[m]*sArg;
  }
}

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign
A Second Option for the cmpFHd Kernel

__global__ void cmpFHd(float* rPhi, iPhi, phiMag,
                       kx, ky, kz, x, y, z, rMu, iMu, int M) {
  int n = blockIdx.x * FHD_THREADS_PER_BLOCK + threadIdx.x;

  for (m = 0; m < M; m++) {
    float rMu_reg = rMu[m] = rPhi[m]*rD[m] + iPhi[m]*iD[m];
    float iMu_reg = iMu[m] = rPhi[m]*iD[m] - iPhi[m]*rD[m];
    float expFhD = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
    float cArg = cos(expFhD);
    float sArg = sin(expFhD);
    rFhD[n] += rMu_reg*cArg - iMu_reg*sArg;
    iFhD[n] += iMu_reg*cArg + rMu_reg*sArg;
  }
}

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign
We do have another option. © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign
(a) FHd computation

for (m = 0; m < M; m++) {
  rMu[m] = rPhi[m]*rD[m] + iPhi[m]*iD[m];
  iMu[m] = rPhi[m]*iD[m] - iPhi[m]*rD[m];
  for (n = 0; n < N; n++) {
    expFhD = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
    cArg = cos(expFhD);
    sArg = sin(expFhD);
    rFhD[n] += rMu[m]*cArg - iMu[m]*sArg;
    iFhD[n] += iMu[m]*cArg + rMu[m]*sArg;
  }
}

(b) after loop fission

for (m = 0; m < M; m++) {
  rMu[m] = rPhi[m]*rD[m] + iPhi[m]*iD[m];
  iMu[m] = rPhi[m]*iD[m] - iPhi[m]*rD[m];
}

for (m = 0; m < M; m++) {
  for (n = 0; n < N; n++) {
    expFhD = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
    cArg = cos(expFhD);
    sArg = sin(expFhD);
    rFhD[n] += rMu[m]*cArg - iMu[m]*sArg;
    iFhD[n] += iMu[m]*cArg + rMu[m]*sArg;
  }
}

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign
A Separate cmpMu Kernel

__global__ void cmpMu(float* rPhi, iPhi, rD, iD, rMu, iMu) {
  int m = blockIdx.x * MU_THREADS_PER_BLOCK + threadIdx.x;
  rMu[m] = rPhi[m]*rD[m] + iPhi[m]*iD[m];
  iMu[m] = rPhi[m]*iD[m] - iPhi[m]*rD[m];
}

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign
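The slides do not show the host-side launch for cmpMu; a minimal sketch, assuming M is a multiple of MU_THREADS_PER_BLOCK (the value 256 below is only an example) and that all float* arguments are device pointers allocated with cudaMalloc:

/* Prototype matching the kernel above, with the shorthand parameter list expanded. */
__global__ void cmpMu(float*, float*, float*, float*, float*, float*);

#define MU_THREADS_PER_BLOCK 256   /* example value; tuned in practice */

void launch_cmpMu(float* rPhi, float* iPhi, float* rD, float* iD,
                  float* rMu, float* iMu, int M)
{
  int numBlocks = M / MU_THREADS_PER_BLOCK;   /* one thread per scan point */
  cmpMu<<<numBlocks, MU_THREADS_PER_BLOCK>>>(rPhi, iPhi, rD, iD, rMu, iMu);
  cudaDeviceSynchronize();                    /* wait so any launch error surfaces here */
}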
__global__ void cmpFHd(float* rPhi, iPhi, phiMag,
                       kx, ky, kz, x, y, z, rMu, iMu, int N) {
  int m = blockIdx.x * FHD_THREADS_PER_BLOCK + threadIdx.x;

  for (n = 0; n < N; n++) {
    float expFhD = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
    float cArg = cos(expFhD);
    float sArg = sin(expFhD);
    rFhD[n] += rMu[m]*cArg - iMu[m]*sArg;
    iFhD[n] += iMu[m]*cArg + rMu[m]*sArg;
  }
}

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign
(a) before loop interchange

for (m = 0; m < M; m++) {
  for (n = 0; n < N; n++) {
    expFhD = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
    cArg = cos(expFhD);
    sArg = sin(expFhD);
    rFhD[n] += rMu[m]*cArg - iMu[m]*sArg;
    iFhD[n] += iMu[m]*cArg + rMu[m]*sArg;
  }
}

(b) after loop interchange

for (n = 0; n < N; n++) {
  for (m = 0; m < M; m++) {
    expFhD = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
    cArg = cos(expFhD);
    sArg = sin(expFhD);
    rFhD[n] += rMu[m]*cArg - iMu[m]*sArg;
    iFhD[n] += iMu[m]*cArg + rMu[m]*sArg;
  }
}

Figure 7.9: Loop interchange of the FHd computation

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign
A Third Option of the FHd Kernel

__global__ void cmpFHd(float* rPhi, iPhi, phiMag,
                       kx, ky, kz, x, y, z, rMu, iMu, int M) {
  int n = blockIdx.x * FHD_THREADS_PER_BLOCK + threadIdx.x;

  for (m = 0; m < M; m++) {
    float expFhD = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
    float cArg = cos(expFhD);
    float sArg = sin(expFhD);
    rFhD[n] += rMu[m]*cArg - iMu[m]*sArg;
    iFhD[n] += iMu[m]*cArg + rMu[m]*sArg;
  }
}

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign
Using Registers to Reduce Global Memory Traffic

__global__ void cmpFHd(float* rPhi, iPhi, phiMag,
                       kx, ky, kz, x, y, z, rMu, iMu, int M) {
  int n = blockIdx.x * FHD_THREADS_PER_BLOCK + threadIdx.x;

  float xn_r = x[n]; float yn_r = y[n]; float zn_r = z[n];
  float rFhDn_r = rFhD[n]; float iFhDn_r = iFhD[n];

  for (m = 0; m < M; m++) {
    float expFhD = 2*PI*(kx[m]*xn_r + ky[m]*yn_r + kz[m]*zn_r);
    float cArg = cos(expFhD);
    float sArg = sin(expFhD);
    rFhDn_r += rMu[m]*cArg - iMu[m]*sArg;
    iFhDn_r += iMu[m]*cArg + rMu[m]*sArg;
  }
  rFhD[n] = rFhDn_r;
  iFhD[n] = iFhDn_r;
}

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign
Tiling of Scan Data

LS reconstruction uses multiple grids
• Each grid operates on all pixels
• Each grid operates on a distinct subset of scan data
• Each thread in the same grid operates on a distinct pixel

Thread n operates on pixel n:

for (m = 0; m < M/32; m++) {
  exQ = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
  rQ[n] += phi[m]*cos(exQ);
  iQ[n] += phi[m]*sin(exQ);
}

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign
Chunking k-space data to fit into constant memory

__constant__ float kx_c[CHUNK_SIZE],
                   ky_c[CHUNK_SIZE], kz_c[CHUNK_SIZE];
…
int main(void) {
  int i;
  for (i = 0; i < M/CHUNK_SIZE; i++) {
    cudaMemcpyToSymbol(kx_c, &kx[i*CHUNK_SIZE], 4*CHUNK_SIZE);
    cudaMemcpyToSymbol(ky_c, &ky[i*CHUNK_SIZE], 4*CHUNK_SIZE);
    cudaMemcpyToSymbol(kz_c, &kz[i*CHUNK_SIZE], 4*CHUNK_SIZE);
    …
    cmpFHd<<<N/FHD_THREADS_PER_BLOCK, FHD_THREADS_PER_BLOCK>>>
      (rPhi, iPhi, phiMag, x, y, z, rMu, iMu, CHUNK_SIZE);
  }
  /* Need to call the kernel one more time if M is not a */
  /* perfect multiple of CHUNK_SIZE */
}

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign
Revised Kernel for Constant Memory

__global__ void cmpFHd(float* rPhi, iPhi, phiMag,
                       x, y, z, rMu, iMu, int M) {
  int n = blockIdx.x * FHD_THREADS_PER_BLOCK + threadIdx.x;

  float xn_r = x[n]; float yn_r = y[n]; float zn_r = z[n];
  float rFhDn_r = rFhD[n]; float iFhDn_r = iFhD[n];

  for (m = 0; m < M; m++) {
    float expFhD = 2*PI*(kx_c[m]*xn_r + ky_c[m]*yn_r + kz_c[m]*zn_r);
    float cArg = cos(expFhD);
    float sArg = sin(expFhD);
    rFhDn_r += rMu[m]*cArg - iMu[m]*sArg;
    iFhDn_r += iMu[m]*cArg + rMu[m]*sArg;
  }
  rFhD[n] = rFhDn_r;
  iFhD[n] = iFhDn_r;
}

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign
(a) k-space data stored in separate arrays. (b) k-space data stored in an array whose elements are structs. Figure 7.14 Effect of k-space data layout on constant cache efficiency. © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign 25
struct kdata {
  float x, y, z;
};

__constant__ struct kdata k_c[CHUNK_SIZE];
…
int main(void) {
  int i;
  for (i = 0; i < M/CHUNK_SIZE; i++) {
    cudaMemcpyToSymbol(k_c, &k[i*CHUNK_SIZE], 12*CHUNK_SIZE);
    cmpFHd<<<N/FHD_THREADS_PER_BLOCK, FHD_THREADS_PER_BLOCK>>>();
  }
}

Figure 7.15: Adjusting the k-space data layout to improve cache efficiency

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign
__global__ void cmpFHd(float* rPhi, iPhi, phiMag,
                       x, y, z, rMu, iMu, int M) {
  int n = blockIdx.x * FHD_THREADS_PER_BLOCK + threadIdx.x;

  float xn_r = x[n]; float yn_r = y[n]; float zn_r = z[n];
  float rFhDn_r = rFhD[n]; float iFhDn_r = iFhD[n];

  for (m = 0; m < M; m++) {
    float expFhD = 2*PI*(k_c[m].x*xn_r + k_c[m].y*yn_r + k_c[m].z*zn_r);
    float cArg = cos(expFhD);
    float sArg = sin(expFhD);
    rFhDn_r += rMu[m]*cArg - iMu[m]*sArg;
    iFhDn_r += iMu[m]*cArg + rMu[m]*sArg;
  }
  rFhD[n] = rFhDn_r;
  iFhD[n] = iFhDn_r;
}

Figure 7.16: Adjusting the k-space data memory layout in the FHd kernel

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign
From C to CUDA, Step 2: Where are the potential bottlenecks?

Q(float* x, y, z, rQ, iQ, kx, ky, kz, phi, int startM, endM) {
  n = blockIdx.x*TPB + threadIdx.x;
  for (m = startM; m < endM; m++) {
    exQ = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
    rQ[n] += phi[m] * cos(exQ);
    iQ[n] += phi[m] * sin(exQ);
  }
}

Bottlenecks
• Memory BW
• Trig ops
• Overheads (branches, addr calcs)

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign
Step 3: Overcoming bottlenecks • LS recon on CPU (SP) • Q: 45 hours, 0.5 GFLOPS • FHd: 7 hours, 0.7 GFLOPS • Counting each trig op as 1 FLOP © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign 14
Step 3: Overcoming Bottlenecks (Mem BW) • Register allocate pixel data • Inputs (x, y, z); Outputs (rQ, iQ) • Exploit temporal and spatial locality in access to scan data • Constant memory + constant caches • Shared memory © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign 15
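The constant-memory route is the one developed on the following slides; as a sketch of the shared-memory alternative listed above (the tile size KTILE, the kernel name, and the assumption that blockDim.x equals the tile size and that M is a multiple of it are choices made here, not from the slides), each block can cooperatively stage a tile of scan data before using it:

#define KTILE 256                       /* illustrative tile size */
#define PI    3.14159265358979f

__global__ void cmpFHd_shared(const float* kx, const float* ky, const float* kz,
                              const float* rMu, const float* iMu,
                              const float* x, const float* y, const float* z,
                              float* rFhD, float* iFhD, int M)
{
  /* One pixel per thread; each block stages KTILE scan points into shared
     memory, then every thread consumes the whole tile. */
  __shared__ float kx_s[KTILE], ky_s[KTILE], kz_s[KTILE],
                   rMu_s[KTILE], iMu_s[KTILE];

  int n = blockIdx.x * blockDim.x + threadIdx.x;
  float xn = x[n], yn = y[n], zn = z[n];
  float rSum = rFhD[n], iSum = iFhD[n];

  for (int base = 0; base < M; base += KTILE) {
    int m = base + threadIdx.x;
    kx_s[threadIdx.x]  = kx[m];
    ky_s[threadIdx.x]  = ky[m];
    kz_s[threadIdx.x]  = kz[m];
    rMu_s[threadIdx.x] = rMu[m];
    iMu_s[threadIdx.x] = iMu[m];
    __syncthreads();                    /* tile fully loaded before use */

    for (int j = 0; j < KTILE; j++) {
      float expFhD = 2*PI*(kx_s[j]*xn + ky_s[j]*yn + kz_s[j]*zn);
      float cArg = cosf(expFhD);
      float sArg = sinf(expFhD);
      rSum += rMu_s[j]*cArg - iMu_s[j]*sArg;
      iSum += iMu_s[j]*cArg + rMu_s[j]*sArg;
    }
    __syncthreads();                    /* done with this tile before overwriting it */
  }

  rFhD[n] = rSum;
  iFhD[n] = iSum;
}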
Step 3: Overcoming Bottlenecks (Mem BW) • Register allocation of pixel data • Inputs (x, y, z); Outputs (rQ, iQ) • FP arithmetic to off-chip loads: 2 to 1 • Performance • 5.1 GFLOPS (Q), 5.4 GFLOPS (FHd) • Still bottlenecked on memory BW © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign 16
Step 3: Overcoming Bottlenecks (Mem BW) • Old bottleneck: off-chip BW • Solution: constant memory • FP arithmetic to off-chip loads: 284 to 1 • Performance • 18.6 GFLOPS (Q), 22.8 GFLOPS (FHd) • New bottleneck: trig operations © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign 17
Sidebar: Estimating Off-Chip Loads with Const Cache • How can we approximate the number of off-chip loads when using the constant caches? • Given: 128 tpb, 4 blocks per SM, 256 scan points per grid • Assume no evictions due to cache conflicts • 7 accesses to global memory per thread (x, y, z, rQ x 2, iQ x 2) • 4 blocks/SM * 128 threads/block * 7 accesses/thread = 3,584 global mem accesses • 4 accesses to constant memory per scan point (kx, ky, kz, phi) • 256 scan points * 4 loads/point = 1,024 constant mem accesses • Total off-chip memory accesses = 3,584 + 1,024 = 4,608 • Total FP arithmetic ops = 4 blocks/SM * 128 threads/block * 256 iters/thread * 10 ops/iter = 1,310,720 • FP arithmetic to off-chip loads: 284 to 1 © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign 18
Step 3: Overcoming Bottlenecks (Trig) • Old bottleneck: trig operations • Solution: SFUs • Performance • 98.2 GFLOPS (Q), 92.2 GFLOPS (FHd) • New bottleneck: overhead of branches and address calculations © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign 19
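The slides do not show the modified code; a minimal sketch of the inner loop of the register-allocated, constant-memory kernel with the trig calls redirected to the SFU-backed CUDA fast-math intrinsics __sinf() and __cosf() (a similar effect can be obtained by compiling the original cos/sin calls with nvcc -use_fast_math):

/* Inner loop of the earlier cmpFHd kernel, using the special function units.
   Accuracy is reduced slightly; see the "Effects of Approximations" sidebar
   that follows. */
for (m = 0; m < M; m++) {
  float expFhD = 2*PI*(kx_c[m]*xn_r + ky_c[m]*yn_r + kz_c[m]*zn_r);
  float cArg = __cosf(expFhD);          /* SFU-based cosine approximation */
  float sArg = __sinf(expFhD);          /* SFU-based sine approximation   */
  rFhDn_r += rMu[m]*cArg - iMu[m]*sArg;
  iFhDn_r += iMu[m]*cArg + rMu[m]*sArg;
}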
Sidebar: Effects of Approximations • Avoid temptation to measure only absolute error (I0 – I) • Can be deceptively large or small • Metrics • PSNR: Peak signal-to-noise ratio • SNR: Signal-to-noise ratio • Avoid temptation to consider only the error in the computed value • Some apps are resistant to approximations; others are very sensitive A.N. Netravali and B.G. Haskell, Digital Pictures: Representation, Compression, and Standards (2nd Ed), Plenum Press, New York, NY (1995). © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign 20
Step 3: Overcoming Bottlenecks (Overheads) • Old bottleneck: Overhead of branches and address calculations • Solution: Loop unrolling and experimental tuning • Performance • 179 GFLOPS (Q), 145 GFLOPS (FHd) © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign 21
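The unrolled code is not shown on the slides; in CUDA the simplest form is a #pragma unroll hint on the scan-point loop of the kernel above. The factor 4 below is only an example; the best value came out of the experimental tuning described next:

/* Loop unrolling reduces branch and address-calculation overhead.
   The unroll factor is a tuning parameter (see the next slides). */
#pragma unroll 4
for (m = 0; m < M; m++) {
  float expFhD = 2*PI*(kx_c[m]*xn_r + ky_c[m]*yn_r + kz_c[m]*zn_r);
  float cArg = __cosf(expFhD);
  float sArg = __sinf(expFhD);
  rFhDn_r += rMu[m]*cArg - iMu[m]*sArg;
  iFhDn_r += iMu[m]*cArg + rMu[m]*sArg;
}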
Experimental Tuning: Tradeoffs • In the Q kernel, three parameters are natural candidates for experimental tuning • Loop unrolling factor (1, 2, 4, 8, 16) • Number of threads per block (32, 64, 128, 256, 512) • Number of scan points per grid (32, 64, 128, 256, 512, 1024, 2048) • Can’t optimize these parameters independently • Resource sharing among threads (register file, shared memory) • Optimizations that increase a thread’s performance often increase the thread’s resource consumption, reducing the total number of threads that execute in parallel • Optimization space is not linear • Threads are assigned to SMs in large thread blocks • Causes discontinuity and non-linearity in the optimization space © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign 22
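One possible way to drive such a sweep, sketched under assumptions made here (setting the parameters as -D compile-time macros and timing with CUDA events is illustrative, not the authors' tuning framework; only the event API calls are real):

/* Times one (unroll factor, threads/block, scan points/grid) configuration.
   UNROLL, FHD_THREADS_PER_BLOCK, and CHUNK_SIZE would be set per build with
   nvcc -D options and swept by a build script. The kernel launch is left as a
   comment because its arguments appear on earlier slides. */
float timeOneConfiguration(int N)
{
  (void)N;   /* used by the launch configuration when the launch is enabled */

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  cudaEventRecord(start, 0);
  /* cmpFHd<<<N / FHD_THREADS_PER_BLOCK, FHD_THREADS_PER_BLOCK>>>( ... ); */
  cudaEventRecord(stop, 0);
  cudaEventSynchronize(stop);

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);   /* elapsed time in milliseconds */
  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return ms;
}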
Experimental Tuning: Example Increase in per-thread performance, but fewer threads: Lower overall performance © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign 23
Experimental Tuning: Scan Points Per Grid © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign 24
Sidebar: Cache-Conscious Data Layout • The kx, ky, kz, and phi components of the same scan point have spatial and temporal locality • Prefetching • Caching • The old layout (separate kx, ky, kz, and phi arrays) does not fully exploit that locality • The new layout (an array of per-scan-point structs) does © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign
Experimental Tuning: Scan Points Per Grid (Improved Data Layout) © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign 26
Experimental Tuning: Loop Unrolling Factor © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign 27
Sidebar: Optimizing the CPU Implementation • Optimizing the CPU implementation of your application is very important • Often, the transformations that increase performance on CPU also increase performance on GPU (and vice-versa) • The research community won’t take your results seriously if your baseline is crippled • Useful optimizations • Data tiling • SIMD vectorization (SSE) • Fast math libraries (AMD, Intel) • Classical optimizations (loop unrolling, etc) • Intel compiler (icc, icpc) © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign 28
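As a concrete illustration of the data-tiling bullet above (a sketch under assumptions made here: tile size, function name, and single-threaded scalar code; it is not the authors' CPU implementation), the FHd loops can be blocked over scan points so one tile of kx/ky/kz/Mu values stays cache-resident while all pixels are updated:

#include <math.h>

#define TILE 256   /* illustrative tile size; tuned in practice */

/* Blocked (tiled) CPU version of the FHd loops: process TILE scan points at a
   time against all N pixels, so the tile of kx/ky/kz/rMu/iMu stays in cache. */
void cmpFHd_cpu_tiled(const float* kx, const float* ky, const float* kz,
                      const float* rMu, const float* iMu,
                      const float* x, const float* y, const float* z,
                      float* rFhD, float* iFhD, int M, int N)
{
  const float PI = 3.14159265358979f;
  for (int mb = 0; mb < M; mb += TILE) {
    int mEnd = (mb + TILE < M) ? mb + TILE : M;
    for (int n = 0; n < N; n++) {
      float rAcc = rFhD[n], iAcc = iFhD[n];
      for (int m = mb; m < mEnd; m++) {
        float expFhD = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
        float cArg = cosf(expFhD);
        float sArg = sinf(expFhD);
        rAcc += rMu[m]*cArg - iMu[m]*sArg;
        iAcc += iMu[m]*cArg + rMu[m]*sArg;
      }
      rFhD[n] = rAcc;
      iFhD[n] = iAcc;
    }
  }
}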
Summary of Results 8X © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign 29
Summary of Results 357X 228X 108X © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign 30
Questions? Scanner image distributed under the GNU Free Documentation License. GeForce 8800 GTX image obtained from http://www.nvnews.net/previews/geforce_8800_gtx/index.shtml © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign
Algorithms to Accelerate: Compute FHd

for (K = 0; K < numK; K++) {
  rRho[K] = rPhi[K]*rD[K] + iPhi[K]*iD[K];
  iRho[K] = rPhi[K]*iD[K] - iPhi[K]*rD[K];
}

for (X = 0; X < numP; X++) {
  for (K = 0; K < numK; K++) {
    exp = 2*PI*(kx[K]*x[X] + ky[K]*y[X] + kz[K]*z[X]);
    cArg = cos(exp);
    sArg = sin(exp);
    rFH[X] += rRho[K]*cArg - iRho[K]*sArg;
    iFH[X] += iRho[K]*cArg + rRho[K]*sArg;
  }
}

• Inner loop
  • 14 FP MUL or ADD ops
  • 4 FP trig ops
  • 12 loads (naively)

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign