CUDA ITK
Won-Ki Jeong
SCI Institute, University of Utah
NVIDIA G80
• New architecture for computing on the GPU
• GPU as a massively parallel, multithreaded machine
• One step beyond the streaming model
• New hardware features:
  • Unified shaders (ALUs)
  • Flexible memory access (scatter)
  • Fast, user-controllable on-chip memory
  • Integer and bitwise operations
NVIDIA CUDA
• C-extension programming language for NVIDIA GPUs
• No graphics API overhead
• Easy to learn
• Supported by development tools
• Extensions / API:
  • Function qualifiers: __global__, __device__, __host__
  • Variable qualifiers: __shared__, __constant__
  • cudaMalloc(), cudaFree(), cudaMemcpy(), …
  • __syncthreads(), atomicAdd(), …
• Program types:
  • Device program (kernel): runs on the GPU
  • Host program: runs on the CPU and calls device programs
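As a minimal, generic illustration of these extensions (not code from CUDA ITK; scaleKernel and scaleOnGPU are made-up names):

  #include <cuda_runtime.h>

  // Device program (kernel): runs on the GPU, one thread per element
  __global__ void scaleKernel(float* data, float s, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) data[i] *= s;
  }

  // Host program: runs on the CPU and calls the device program
  void scaleOnGPU(float* h_data, float s, int n)
  {
      float* d_data;
      cudaMalloc((void**)&d_data, n * sizeof(float));
      cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

      int threads = 256;
      int blocks  = (n + threads - 1) / threads;
      scaleKernel<<<blocks, threads>>>(d_data, s, n);

      cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
      cudaFree(d_data);
  }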
CUDA ITK
• ITK powered by CUDA
  • Many registration / image processing functions are still computationally expensive and parallelizable
  • Current ITK parallelization is bound by the number of CPU cores
• Our approach
  • Implement several well-known ITK image filters using NVIDIA CUDA
  • Focus on 3D volume processing: CT / MRI datasets are mostly 3D volumes
CUDA ITK
• CUDA code is integrated into ITK
  • Transparent to ITK users: no need to modify existing code that uses ITK
• Check the environment variable ITK_CUDA
  • Entry point: GenerateData() or ThreadedGenerateData()
  • If ITK_CUDA == 0, execute the original ITK code
  • If ITK_CUDA == 1, execute the CUDA code
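A hypothetical sketch of how this dispatch could look inside a filter's GenerateData(); CudaCapableFilter and GenerateDataOnGPU() are made-up names, not the actual CUDA ITK classes:

  #include <cstdlib>   // std::getenv, std::atoi

  template <class TImage>
  void CudaCapableFilter<TImage>::GenerateData()
  {
      const char* env = std::getenv("ITK_CUDA");
      const bool useCuda = (env != 0) && (std::atoi(env) == 1);

      if (useCuda)
      {
          this->GenerateDataOnGPU();        // assumed CUDA code path
      }
      else
      {
          this->Superclass::GenerateData(); // original CPU ITK code path
      }
  }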
ITK image space filters
• Convolution filters
  • Mean filter
  • Gaussian filter
  • Derivative filter
  • Hessian of Gaussian filter
• Statistical filter
  • Median filter
• PDE-based filter
  • Anisotropic diffusion filter
Speed up using CUDA
• Mean filter: ~140x
• Median filter: ~25x
• Gaussian filter: ~60x
• Anisotropic diffusion: ~70x
Convolution filters
• Separable filter: N-dimensional convolution = N 1D convolutions
• For filter radius r, the per-pixel cost drops from (2r+1)^N to N(2r+1) operations
• Example: 2D Gaussian = 2 x 1D Gaussian
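As a concrete instance of the separability above (a standard identity, not reproduced from the slides), the 2D Gaussian factors into two 1D Gaussians:

$$G_{2D}(x,y) = \frac{1}{2\pi\sigma^{2}} e^{-\frac{x^{2}+y^{2}}{2\sigma^{2}}} = \underbrace{\frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{x^{2}}{2\sigma^{2}}}}_{G(x)} \cdot \underbrace{\frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{y^{2}}{2\sigma^{2}}}}_{G(y)}$$

so $I * G_{2D} = (I * G_x) * G_y$: one pass along x followed by one pass along y, reducing the per-pixel cost from $(2r+1)^2$ to $2(2r+1)$ multiply-adds.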
GPU implementation
• Apply a 1D convolution along each axis
• Minimize overlapping
[Diagram: the input block is read from global memory into shared memory, convolved with the 1D kernel, and written to the output in global memory]
Minimize overlapping
• The kernel width is usually large (> 20 for a Gaussian)
• Max block size ~ 8x8x8
• Each pixel has 6 neighbors in 3D
• Use long, thin blocks to minimize overlapping
[Diagram: compact blocks overlap multiple neighboring blocks; long, thin blocks aligned with the filtered axis have no overlapping]
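Putting the two slides above together, a minimal CUDA sketch of the x-axis pass; the block shape, names, and clamp-to-border handling are assumptions, not the original CUDA ITK code, and the weights are assumed to live in constant memory:

  #define RADIUS 10          // assumed filter radius
  #define BLOCK_X 64         // long, thin block along the filtered axis
  __constant__ float d_kernel[2 * RADIUS + 1];

  __global__ void convolveX(const float* in, float* out, int w, int h, int d)
  {
      __shared__ float tile[BLOCK_X + 2 * RADIUS];

      // 2D grid: blockIdx.y enumerates (y, z) rows, which also works on
      // compute-capability-1.x hardware such as the G80
      int x = blockIdx.x * BLOCK_X + threadIdx.x;
      int y = blockIdx.y % h;
      int z = blockIdx.y / h;
      int row = (z * h + y) * w;

      // Load the block plus left/right apron into shared memory (clamp at borders)
      tile[threadIdx.x + RADIUS] = in[row + min(x, w - 1)];
      if (threadIdx.x < RADIUS) {
          tile[threadIdx.x] = in[row + max(x - RADIUS, 0)];
          tile[threadIdx.x + BLOCK_X + RADIUS] = in[row + min(x + BLOCK_X, w - 1)];
      }
      __syncthreads();

      if (x < w) {
          float sum = 0.0f;
          for (int i = -RADIUS; i <= RADIUS; i++)
              sum += d_kernel[i + RADIUS] * tile[threadIdx.x + RADIUS + i];
          out[row + x] = sum;
      }
  }

The same pattern would be repeated for the y and z passes; launch with dim3 grid((w + BLOCK_X - 1) / BLOCK_X, h * d) and BLOCK_X threads per block, after uploading the weights once with cudaMemcpyToSymbol(d_kernel, h_kernel, sizeof(h_kernel)).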
Median filter
• Viola et al. [VIS 03]
• Find the median by bisection of histogram bins
• log(# bins) iterations (e.g., 8-bit pixels: 8 iterations)
[Diagram: worked example bisecting the intensity range 0-7 in four steps]
Pseudocode (GPU median filter):

  copy current block from global to shared memory
  min = 0; max = 255;
  pivot = (min + max) / 2.0f;
  for (i = 0; i < 8; i++) {
      count = 0;
      for (j = 0; j < kernelsize; j++) {
          if (kernel[j] > pivot) count++;
      }
      if (count < kernelsize / 2.0f) max = floor(pivot);
      else                           min = ceil(pivot);
      pivot = (min + max) / 2.0f;
  }
  return floor(pivot);
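For illustration, a self-contained CUDA kernel that applies the same bisection to a 2D image, gathering the window into per-thread local memory rather than the shared-memory staging used in the pseudocode above; radius, names, and 2D layout are assumptions:

  #define MR 2                               // assumed median filter radius
  #define WIN ((2 * MR + 1) * (2 * MR + 1))  // window size ("kernelsize")

  __global__ void median2D(const unsigned char* in, unsigned char* out, int w, int h)
  {
      int x = blockIdx.x * blockDim.x + threadIdx.x;
      int y = blockIdx.y * blockDim.y + threadIdx.y;
      if (x >= w || y >= h) return;

      // Gather the (2*MR+1)^2 neighborhood, clamping at the image border
      unsigned char win[WIN];
      int n = 0;
      for (int dy = -MR; dy <= MR; dy++)
          for (int dx = -MR; dx <= MR; dx++) {
              int xx = min(max(x + dx, 0), w - 1);
              int yy = min(max(y + dy, 0), h - 1);
              win[n++] = in[yy * w + xx];
          }

      // Bisection on the intensity range [0,255]: 8 iterations for 8-bit data
      float minv = 0.0f, maxv = 255.0f;
      float pivot = (minv + maxv) * 0.5f;
      for (int i = 0; i < 8; i++) {
          int count = 0;
          for (int j = 0; j < WIN; j++)
              if (win[j] > pivot) count++;
          if (count * 2 < WIN) maxv = floorf(pivot);  // median is at or below the pivot
          else                 minv = ceilf(pivot);   // median is above the pivot
          pivot = (minv + maxv) * 0.5f;
      }
      out[y * w + x] = (unsigned char)floorf(pivot);
  }

Launched with a 2D grid covering the image, e.g. dim3 block(16, 16) and dim3 grid((w + 15) / 16, (h + 15) / 16).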
Perona & Malik anisotropic PDE
• Nonlinear diffusion
  • The fall-off function c (conductance) controls anisotropy
  • Less smoothing across high gradients
  • Contrast parameter k
• Numerical solution
  • Euler explicit integration (iterative method)
  • Finite differences for derivative computation
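For reference, the standard Perona-Malik formulation with one of the two conductance functions proposed in the original paper (the exact conductance used in the CUDA ITK filter is not stated on the slide):

$$\frac{\partial I}{\partial t} = \nabla \cdot \big( c(\lVert \nabla I \rVert)\, \nabla I \big), \qquad c(x) = e^{-(x/k)^{2}}$$

where k is the contrast parameter: gradients well above k receive little smoothing, which preserves edges.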
Gradient & conductance map
• Half x / y / z direction gradients / conductance for each pixel
• 2D example: for an n x n block, 4(n+1)^2 + (n+2)^2 shared memory entries are required
[Diagram: an n x n output block; an (n+2) x (n+2) input tile and four (n+1) x (n+1) maps (grad x, grad y, cond x, cond y) staged in shared memory]
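As a quick sanity check of the shared-memory formula above, assuming a hypothetical 16 x 16 block and 4-byte floats:

$$4(n+1)^{2} + (n+2)^{2}\ \big|_{\,n=16} = 4 \cdot 17^{2} + 18^{2} = 1156 + 324 = 1480 \text{ floats} \approx 5.8\ \text{KB},$$

which fits comfortably within the 16 KB of shared memory available per multiprocessor on G80-class hardware.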
Euler integration
• Use pre-computed gradients and conductance
• Each gradient / conductance value is used twice
• Avoid redundant computation by using the pre-computed gradient / conductance map
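A minimal sketch of the explicit Euler update implied by the two slides above, written in 2D with assumed notation (the half-pixel gradients and conductances are the precomputed maps; each value is shared by the two pixels it lies between):

$$I^{t+1}_{i,j} = I^{t}_{i,j} + \Delta t \Big[ c_{i+\frac12,j}\,(I^{t}_{i+1,j} - I^{t}_{i,j}) - c_{i-\frac12,j}\,(I^{t}_{i,j} - I^{t}_{i-1,j}) + c_{i,j+\frac12}\,(I^{t}_{i,j+1} - I^{t}_{i,j}) - c_{i,j-\frac12}\,(I^{t}_{i,j} - I^{t}_{i,j-1}) \Big]$$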
Experiments
• Test environment
  • CPU: AMD Opteron dual-core, 1.8 GHz
  • GPU: NVIDIA Tesla C870
• Input volume: 128^3
Result • Mean filter • Gaussian filter
Result • Median filter • Anisotropic diffusion
Summary
• ITK powered by CUDA
  • Image space filters using CUDA
  • Up to 140x speedup
• Future work
  • GPU image class for ITK
    • Reduce CPU-to-GPU memory I/O
    • Pipelining support
  • Image registration
  • Numerical library (vnl)
  • Out-of-GPU-core processing
    • Seismic volumes (~10s to 100s of GB)