CUDA ITK
Won-Ki Jeong
SCI Institute, University of Utah
NVIDIA G80
• New architecture for computing on the GPU
• GPU as a massively parallel, multithreaded machine
• One step beyond the streaming model
• New hardware features:
  • Unified shaders (ALUs)
  • Flexible memory access (scatter)
  • Fast, user-controllable on-chip memory
  • Integer and bitwise operations
NVIDIA CUDA
• C-extension programming language for NVIDIA GPUs
• No graphics API overhead
• Easy to learn
• Supported by development tools
• Extensions / API:
  • Function qualifiers: __global__, __device__, __host__
  • Variable qualifiers: __shared__, __constant__
  • cudaMalloc(), cudaFree(), cudaMemcpy(), …
  • __syncthreads(), atomicAdd(), …
• Program types:
  • Device program (kernel): runs on the GPU
  • Host program: runs on the CPU and calls device programs
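As a minimal, generic illustration of these extensions (not code from CUDA ITK; scaleKernel and scaleOnGPU are made-up names):

  #include <cuda_runtime.h>

  // Device program (kernel): runs on the GPU, one thread per element
  __global__ void scaleKernel(float* data, float s, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) data[i] *= s;
  }

  // Host program: runs on the CPU and calls the device program
  void scaleOnGPU(float* h_data, float s, int n)
  {
      float* d_data;
      cudaMalloc((void**)&d_data, n * sizeof(float));
      cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

      int threads = 256;
      int blocks  = (n + threads - 1) / threads;
      scaleKernel<<<blocks, threads>>>(d_data, s, n);

      cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
      cudaFree(d_data);
  }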
CUDA ITK
• ITK powered by CUDA
  • Many registration / image processing functions are still computationally expensive and parallelizable
  • Current ITK parallelization is bound by the number of CPU cores
• Our approach
  • Implement several well-known ITK image filters using NVIDIA CUDA
  • Focus on 3D volume processing: CT / MRI datasets are mostly 3D volumes
CUDA ITK
• CUDA code is integrated into ITK
  • Transparent to ITK users: no need to modify existing code that uses ITK
• Check the environment variable ITK_CUDA
  • Entry point: GenerateData() or ThreadedGenerateData()
  • If ITK_CUDA == 0, execute the original ITK code
  • If ITK_CUDA == 1, execute the CUDA code
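A hypothetical sketch of how this dispatch could look inside a filter's GenerateData(); CudaCapableFilter and GenerateDataOnGPU() are made-up names, not the actual CUDA ITK classes:

  #include <cstdlib>   // std::getenv, std::atoi

  template <class TImage>
  void CudaCapableFilter<TImage>::GenerateData()
  {
      const char* env = std::getenv("ITK_CUDA");
      const bool useCuda = (env != 0) && (std::atoi(env) == 1);

      if (useCuda)
      {
          this->GenerateDataOnGPU();        // assumed CUDA code path
      }
      else
      {
          this->Superclass::GenerateData(); // original CPU ITK code path
      }
  }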
ITK image space filters
• Convolution filters
  • Mean filter
  • Gaussian filter
  • Derivative filter
  • Hessian of Gaussian filter
• Statistical filter
  • Median filter
• PDE-based filter
  • Anisotropic diffusion filter
Speed up using CUDA
• Mean filter: ~140x
• Median filter: ~25x
• Gaussian filter: ~60x
• Anisotropic diffusion: ~70x
Convolution filters
• Separable filter: N-dimensional convolution = N 1D convolutions
• For filter radius r, the per-pixel cost drops from (2r+1)^N to N(2r+1) operations
• Example: 2D Gaussian = 2 x 1D Gaussian
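As a concrete instance of the separability above (a standard identity, not reproduced from the slides), the 2D Gaussian factors into two 1D Gaussians:

$$G_{2D}(x,y) = \frac{1}{2\pi\sigma^{2}} e^{-\frac{x^{2}+y^{2}}{2\sigma^{2}}} = \underbrace{\frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{x^{2}}{2\sigma^{2}}}}_{G(x)} \cdot \underbrace{\frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{y^{2}}{2\sigma^{2}}}}_{G(y)}$$

so $I * G_{2D} = (I * G_x) * G_y$: one pass along x followed by one pass along y, reducing the per-pixel cost from $(2r+1)^2$ to $2(2r+1)$ multiply-adds.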
GPU implementation
• Apply a 1D convolution along each axis
• Minimize overlapping
[Diagram: the input block is read from global memory into shared memory, convolved with the 1D kernel, and written to the output in global memory]
Minimize overlapping
• The kernel width is usually large (> 20 for a Gaussian)
• Max block size ~ 8x8x8
• Each pixel has 6 neighbors in 3D
• Use long, thin blocks to minimize overlapping
[Diagram: compact blocks overlap multiple neighboring blocks; long, thin blocks aligned with the filtered axis have no overlapping]
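Putting the two slides above together, a minimal CUDA sketch of the x-axis pass; the block shape, names, and clamp-to-border handling are assumptions, not the original CUDA ITK code, and the weights are assumed to live in constant memory:

  #define RADIUS 10          // assumed filter radius
  #define BLOCK_X 64         // long, thin block along the filtered axis
  __constant__ float d_kernel[2 * RADIUS + 1];

  __global__ void convolveX(const float* in, float* out, int w, int h, int d)
  {
      __shared__ float tile[BLOCK_X + 2 * RADIUS];

      // 2D grid: blockIdx.y enumerates (y, z) rows, which also works on
      // compute-capability-1.x hardware such as the G80
      int x = blockIdx.x * BLOCK_X + threadIdx.x;
      int y = blockIdx.y % h;
      int z = blockIdx.y / h;
      int row = (z * h + y) * w;

      // Load the block plus left/right apron into shared memory (clamp at borders)
      tile[threadIdx.x + RADIUS] = in[row + min(x, w - 1)];
      if (threadIdx.x < RADIUS) {
          tile[threadIdx.x] = in[row + max(x - RADIUS, 0)];
          tile[threadIdx.x + BLOCK_X + RADIUS] = in[row + min(x + BLOCK_X, w - 1)];
      }
      __syncthreads();

      if (x < w) {
          float sum = 0.0f;
          for (int i = -RADIUS; i <= RADIUS; i++)
              sum += d_kernel[i + RADIUS] * tile[threadIdx.x + RADIUS + i];
          out[row + x] = sum;
      }
  }

The same pattern would be repeated for the y and z passes; launch with dim3 grid((w + BLOCK_X - 1) / BLOCK_X, h * d) and BLOCK_X threads per block, after uploading the weights once with cudaMemcpyToSymbol(d_kernel, h_kernel, sizeof(h_kernel)).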
Median filter
• Viola et al. [VIS 03]
• Find the median by bisection of histogram bins
• log(# bins) iterations (e.g., 8-bit pixels: 8 iterations)
[Diagram: worked example bisecting the intensity range 0-7 in four steps]
Pseudocode (GPU median filter):

  copy current block from global to shared memory
  min = 0; max = 255;
  pivot = (min + max) / 2.0f;
  for (i = 0; i < 8; i++) {
      count = 0;
      for (j = 0; j < kernelsize; j++) {
          if (kernel[j] > pivot) count++;
      }
      if (count < kernelsize / 2.0f) max = floor(pivot);
      else                           min = ceil(pivot);
      pivot = (min + max) / 2.0f;
  }
  return floor(pivot);
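For illustration, a self-contained CUDA kernel that applies the same bisection to a 2D image, gathering the window into per-thread local memory rather than the shared-memory staging used in the pseudocode above; radius, names, and 2D layout are assumptions:

  #define MR 2                               // assumed median filter radius
  #define WIN ((2 * MR + 1) * (2 * MR + 1))  // window size ("kernelsize")

  __global__ void median2D(const unsigned char* in, unsigned char* out, int w, int h)
  {
      int x = blockIdx.x * blockDim.x + threadIdx.x;
      int y = blockIdx.y * blockDim.y + threadIdx.y;
      if (x >= w || y >= h) return;

      // Gather the (2*MR+1)^2 neighborhood, clamping at the image border
      unsigned char win[WIN];
      int n = 0;
      for (int dy = -MR; dy <= MR; dy++)
          for (int dx = -MR; dx <= MR; dx++) {
              int xx = min(max(x + dx, 0), w - 1);
              int yy = min(max(y + dy, 0), h - 1);
              win[n++] = in[yy * w + xx];
          }

      // Bisection on the intensity range [0,255]: 8 iterations for 8-bit data
      float minv = 0.0f, maxv = 255.0f;
      float pivot = (minv + maxv) * 0.5f;
      for (int i = 0; i < 8; i++) {
          int count = 0;
          for (int j = 0; j < WIN; j++)
              if (win[j] > pivot) count++;
          if (count * 2 < WIN) maxv = floorf(pivot);  // median is at or below the pivot
          else                 minv = ceilf(pivot);   // median is above the pivot
          pivot = (minv + maxv) * 0.5f;
      }
      out[y * w + x] = (unsigned char)floorf(pivot);
  }

Launched with a 2D grid covering the image, e.g. dim3 block(16, 16) and dim3 grid((w + 15) / 16, (h + 15) / 16).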
Perona & Malik anisotropic PDE
• Nonlinear diffusion
  • The fall-off function c (conductance) controls anisotropy
  • Less smoothing across high gradients
  • Contrast parameter k
• Numerical solution
  • Euler explicit integration (iterative method)
  • Finite differences for derivative computation
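For reference, the standard Perona-Malik formulation with one of the two conductance functions proposed in the original paper (the exact conductance used in the CUDA ITK filter is not stated on the slide):

$$\frac{\partial I}{\partial t} = \nabla \cdot \big( c(\lVert \nabla I \rVert)\, \nabla I \big), \qquad c(x) = e^{-(x/k)^{2}}$$

where k is the contrast parameter: gradients well above k receive little smoothing, which preserves edges.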
Gradient & conductance map
• Half x / y / z direction gradients / conductance for each pixel
• 2D example: for an n x n block, 4(n+1)^2 + (n+2)^2 shared memory entries are required
[Diagram: an n x n output block; an (n+2) x (n+2) input tile and four (n+1) x (n+1) maps (grad x, grad y, cond x, cond y) staged in shared memory]
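As a quick sanity check of the shared-memory formula above, assuming a hypothetical 16 x 16 block and 4-byte floats:

$$4(n+1)^{2} + (n+2)^{2}\ \big|_{\,n=16} = 4 \cdot 17^{2} + 18^{2} = 1156 + 324 = 1480 \text{ floats} \approx 5.8\ \text{KB},$$

which fits comfortably within the 16 KB of shared memory available per multiprocessor on G80-class hardware.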
Euler integration
• Use pre-computed gradients and conductance
• Each gradient / conductance value is used twice
• Avoid redundant computation by using the pre-computed gradient / conductance map
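A minimal sketch of the explicit Euler update implied by the two slides above, written in 2D with assumed notation (the half-pixel gradients and conductances are the precomputed maps; each value is shared by the two pixels it lies between):

$$I^{t+1}_{i,j} = I^{t}_{i,j} + \Delta t \Big[ c_{i+\frac12,j}\,(I^{t}_{i+1,j} - I^{t}_{i,j}) - c_{i-\frac12,j}\,(I^{t}_{i,j} - I^{t}_{i-1,j}) + c_{i,j+\frac12}\,(I^{t}_{i,j+1} - I^{t}_{i,j}) - c_{i,j-\frac12}\,(I^{t}_{i,j} - I^{t}_{i,j-1}) \Big]$$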
Experiments
• Test environment
  • CPU: AMD Opteron dual-core, 1.8 GHz
  • GPU: NVIDIA Tesla C870
• Input volume: 128^3
Result • Mean filter • Gaussian filter
Result • Median filter • Anisotropic diffusion
Summary
• ITK powered by CUDA
  • Image space filters using CUDA
  • Up to 140x speedup
• Future work
  • GPU image class for ITK
    • Reduce CPU-to-GPU memory I/O
    • Pipelining support
  • Image registration
  • Numerical library (vnl)
  • Out-of-GPU-core processing
    • Seismic volumes (~10s to 100s of GB)