Implementation of Voxel Volume Projection Operators Using CUDA: Applications to Iterative Cone-Beam CT Reconstruction
Three questions to keep in mind during the presentation • Name three differences between the GPU and CPU contributing to the observed performance improvements. • Scattered accumulation (a[ix] = a[ix] + b) can result in reduced performance and unexpected results in CUDA. Why? Please state two possible problems! • Suppose that each thread in a CUDA program is responsible for writing to one unique output element. Give one possible solution for handling ”odd sized” output data.
What are voxel volume projectors? • Object: f • Projection data: p (line integrals through f) • Forward projection: P, mapping f to p • Backprojection: Pᵀ, mapping p to f
[Figure: X-ray source, object f, detector with projection data p]
The tomography problem • Given p, find an image/volume f such that the difference between Pf and p is small in some sense, e.g. z(f) = ||Pf - p||² attains a minimum. • Steepest descent approach: • Gradient: ∇z(f) = 2Pᵀ(Pf - p) • Update step: fₖ₊₁ = fₖ - 2αPᵀ(Pfₖ - p) • We use this method only to illustrate why P and Pᵀ are needed. In practice, faster and better regularized methods are required.
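As an illustration, a minimal host-side sketch of this steepest descent iteration is given below, assuming hypothetical wrappers forwardProject() and backProject() around the CUDA kernels for P and Pᵀ; all names and the fixed step size are illustrative only.

    #include <cstddef>
    #include <vector>

    // Hypothetical wrappers around the CUDA kernels for P and P^T (not shown here)
    void forwardProject(const float* f, float* p);   // p = P f
    void backProject(const float* p, float* f);      // f = P^T p

    // One steepest descent run: f_{k+1} = f_k - 2*alpha*P^T(P f_k - p)
    void steepestDescent(float* f, std::size_t nF,
                         const float* p, std::size_t nP,
                         int nIter, float alpha)
    {
        std::vector<float> r(nP);   // residual  P f - p
        std::vector<float> g(nF);   // gradient direction  P^T (P f - p)

        for (int k = 0; k < nIter; ++k) {
            forwardProject(f, r.data());              // r = P f
            for (std::size_t i = 0; i < nP; ++i)
                r[i] -= p[i];                         // r = P f - p
            backProject(r.data(), g.data());          // g = P^T (P f - p)
            for (std::size_t i = 0; i < nF; ++i)
                f[i] -= 2.0f * alpha * g[i];          // gradient step
        }
    }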
Why implement P and Pᵀ on the GPU? • Reasonable sizes of f and p are around 2 GB, implying a size of 2 GB x 2 GB for P and Pᵀ. • Although these matrices are sparse, their nonzero elements amount to approximately 2000 GB, i.e., matrix-vector products involving these matrices are computationally demanding. • The GPU offers several features that help speed up the computations: • Massive parallelism • Hardware linear interpolation • Very high memory bandwidth • Texture caches ”optimized for 2D locality”
The Joseph forward projector [1] • Illustration in the 2D case: pᵢ = L(x₀ + x₁ + x₂ + x₃), where x₀, …, x₃ are linearly interpolated samples along the ray and L is the step length between samples. • Generalization to 3D is straightforward. Instead of linear interpolation, bilinear interpolation is used.
[1] Joseph, P. M., An improved algorithm for reprojecting rays through pixel images, IEEE Transactions on Medical Imaging, 1982, MI-1, 192-196
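A minimal 2D sketch of this sampling scheme, for a ray that is mostly horizontal and steps one column per sample; all names are illustrative, not the actual implementation.

    #include <cmath>

    // 2D sketch of the Joseph projector: step one column at a time, linearly
    // interpolate between the two pixels straddling the ray in each column,
    // and scale the sum by the step length L along the ray.
    float josephLineIntegral2D(const float* img, int nx, int ny,
                               float y0, float slope, float L)
    {
        float sum = 0.f;
        for (int ix = 0; ix < nx; ++ix) {
            float y  = y0 + slope * ix;            // ray height at this column
            int   iy = (int)std::floor(y);
            float fy = y - iy;                     // linear interpolation weight
            if (iy < 0 || iy + 1 >= ny) continue;  // sample outside the image
            sum += (1.f - fy) * img[iy * nx + ix] + fy * img[(iy + 1) * nx + ix];
        }
        return L * sum;                            // p_i = L * (x_0 + x_1 + ...)
    }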
Implementation sketch for P • We choose to let one thread correspond to one ray. • For each ray/thread: • Determine the ray origin and direction. • Determine the intersection of the ray with the voxel volume. • Step along the ray and accumulate values from the voxel volume, using the 3D texture cache. • Multiply by L and store to the output element.
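A skeleton of such a kernel is sketched below (one thread per ray). Ray, computeRay() and intersectVolume() are placeholders for the geometry code shown on the next slides, and the float3 arithmetic assumes the CUDA SDK helper headers used by ”volumeRender”.

    // Forward projection kernel skeleton: one thread per detector ray.
    struct Ray { float3 o; float3 d; };

    __device__ Ray  computeRay(int chIx, int rowIx);                      // geometry, see next slides
    __device__ bool intersectVolume(const Ray& ray, float* tnear, float* tfar);

    texture<float, 3, cudaReadModeElementType> devT_reconVolume;          // voxel volume

    __global__ void forwardProjectKernel(float* projectionData,
                                         int Nch, int Nrows,
                                         float tstep, float3 spacing)
    {
        int chIx  = blockIdx.x * blockDim.x + threadIdx.x;   // detector channel
        int rowIx = blockIdx.y * blockDim.y + threadIdx.y;   // detector row
        if (chIx >= Nch || rowIx >= Nrows) return;           // outside detector

        Ray ray = computeRay(chIx, rowIx);
        float tnear, tfar;
        if (!intersectVolume(ray, &tnear, &tfar)) {          // ray misses the volume
            projectionData[rowIx * Nch + chIx] = 0.f;
            return;
        }

        float dValue = 0.f;
        for (float t = tnear; t < tfar; t += tstep) {
            float3 pos = ray.o + ray.d * t + 0.5f;           // texture index coordinates
            dValue += tex3D(devT_reconVolume, pos.x, pos.y, pos.z);
        }
        // Multiply by the step length L along the ray and store
        projectionData[rowIx * Nch + chIx] = dValue * tstep * length(ray.d * spacing);
    }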
Handling of ”odd” detector sizes • The x-ray detector is divided into 2D blocks corresponding to CUDA thread blocks, using a static block size (16x2). • To handle detector dimensions that are not divisible by the block size, conditional statements were used:

    // Return if outside detector
    if (rowIx >= Nrows) return;
    if (chIx >= Nch) return;

• Although this reduces efficiency due to divergent branches, the reduction is small for detectors with a reasonably large number of channels and rows.
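On the host side, the matching launch configuration rounds the grid up so the whole detector is covered; the guard above then retires the surplus threads. A sketch, with names as in the kernel skeleton above:

    // Fixed 16x2 blocks; grid rounded up with a ceiling division so detectors
    // of any size are fully covered.
    dim3 blockSize(16, 2);
    dim3 gridSize((Nch   + blockSize.x - 1) / blockSize.x,
                  (Nrows + blockSize.y - 1) / blockSize.y);
    forwardProjectKernel<<<gridSize, blockSize>>>(devProjectionData,
                                                  Nch, Nrows, tstep, spacing);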
Implementation details 1: Calculation of source and detector positions

    // Calculate focus center and displacement vectors
    float alpha = theta + D_ALPHA(chIx);
    float3 fc = make_float3(Rf*cos(alpha), Rf*sin(alpha), z0+D_Z(chIx));
    float3 fw = make_float3(-sin(alpha), cos(alpha), 0.f);
    float3 fh = make_float3(-cos(alpha)*sin(aAngle), -sin(alpha)*sin(aAngle), cos(aAngle));

    // Calculate detector center and displacement vectors
    float3 dc = fc + (Rf+Rd) * make_float3(-cos(theta), -sin(theta), D_SLOPE(rowIx));
    float3 dw = make_float3(sin(theta), -cos(theta), 0.f);
    float3 dh = make_float3(0.f, 0.f, 1.f);

    // Calculate focus position in texture index coordinates
    float3 f = ((fc + fOffset.x*fw + fOffset.y*fh) - origin) / spacing;

    // Calculate detector position in texture index coordinates
    float3 d = ((dc + dOffset.x*dw + dOffset.y*dh) - origin) / spacing;

    // Create ray struct
    Ray ray;
    ray.o = f;
    ray.d = normalize(d - f);

Code based on NVIDIA CUDA SDK ”volumeRender”
Implementation details 2: Accumulation of contributions

    // Accumulate contributions along ray
    float dValue = 0;
    float t = tnear;
    for (int i = 0; i < dimensions.x; i++)
    {
        // Calculate current position
        float3 pos = ray.o + ray.d*t + 0.5f;  // texture origin: (0.5f, 0.5f, 0.5f)

        // Read from 3D texture
        dValue += tex3D(devT_reconVolume, pos.x, pos.y, pos.z);

        // Increase t
        t += tstep;
        if (t >= tfar) break;
    }

    // Update detector data value
    dValue *= length((ray.d*spacing))*tstep;
    projectionData[rowIx*Nch+chIx] += dValue;

Code based on NVIDIA CUDA SDK ”volumeRender”
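The hardware trilinear interpolation used by tex3D above requires the voxel volume to reside in a cudaArray bound to a 3D texture. A host-side setup sketch following the pattern of the ”volumeRender” sample (legacy texture-reference API; names and sizes illustrative):

    texture<float, 3, cudaReadModeElementType> devT_reconVolume;

    void bindReconVolume(const float* hostVolume, int nx, int ny, int nz)
    {
        cudaExtent extent = make_cudaExtent(nx, ny, nz);
        cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();

        // Allocate a 3D cudaArray and copy the voxel volume into it
        cudaArray* d_volumeArray = 0;
        cudaMalloc3DArray(&d_volumeArray, &desc, extent);

        cudaMemcpy3DParms copyParams = {0};
        copyParams.srcPtr   = make_cudaPitchedPtr((void*)hostVolume,
                                                  nx*sizeof(float), nx, ny);
        copyParams.dstArray = d_volumeArray;
        copyParams.extent   = extent;
        copyParams.kind     = cudaMemcpyHostToDevice;
        cudaMemcpy3D(&copyParams);

        // Unnormalized coordinates, hardware trilinear interpolation
        devT_reconVolume.normalized     = false;
        devT_reconVolume.filterMode     = cudaFilterModeLinear;
        devT_reconVolume.addressMode[0] = cudaAddressModeClamp;
        devT_reconVolume.addressMode[1] = cudaAddressModeClamp;
        devT_reconVolume.addressMode[2] = cudaAddressModeClamp;
        cudaBindTextureToArray(devT_reconVolume, d_volumeArray, desc);
    }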
Experiments – forward projection (P) • Input data dimensions: 512x512x257 floats (voxel volume) • Output data dimensions: 672x24x3000 floats (projection data) • Only very small differences in accuracy, without practical implications, occur between the CPU and GPU implementations. • Calculation times • CPU: 2500 s • GPU: 47 s • Speed-up factor: approximately 50x • For a larger collection of problems, speedups between 20x and 50x have been observed.
Efficient implementation of the exact adjoint operator Pᵀ is much trickier. Why? • 2D illustration of the procedure: the same interpolation coefficients as for P are used, but with scattered accumulation instead of reading. • No hardware interpolation. No 2D/3D textures. • A new parallelization setup is needed: one ray per thread leads to corrupted results.
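The corruption comes from non-atomic read-modify-write updates. A minimal sketch (hypothetical names, not the implementation used here) of the problem, together with the atomicAdd() alternative, which is correct but serializes conflicting updates and rules out cudaArray textures for the output:

    // Scattered accumulation with one update per thread. The plain "+=" below
    // is a read-modify-write: two threads hitting the same voxel can both read
    // the old value, and one contribution is lost. atomicAdd() on float
    // (compute capability >= 2.0) fixes correctness but serializes collisions.
    __global__ void scatteredAccumulate(float* volume, const int* voxelIx,
                                        const float* weight, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        volume[voxelIx[i]] += weight[i];              // race: not atomic

        // Correct but slower alternative:
        // atomicAdd(&volume[voxelIx[i]], weight[i]);
    }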
A slow but exact implementation of the transpose of P • Let one thread represent only one accumulation from a ray, i.e. • calculation of the position in the voxel volume, • calculation of and multiplication by the interpolation coefficients, • accumulation to the 4 voxels closest to the intersection of the active ray with the voxel volume plane. • Very short threads, and very many of them. • One thread block represents a number of rays, separated so that conflicting updates do not occur:
[Figure: rays grouped into Block 0, Block 1, Block 2, …]
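A sketch (hypothetical names, boundary checks omitted) of the per-intersection scatter such a thread performs; the weights are the same bilinear coefficients as in the Joseph forward projector:

    // Scatter one ray contribution into the four voxels surrounding the
    // intersection point (x, y) of the ray with the active voxel plane.
    // slice points to that plane, nx is its width.
    __device__ void scatterBilinear(float* slice, int nx,
                                    float x, float y, float value)
    {
        int   ix = (int)floorf(x);
        int   iy = (int)floorf(y);
        float fx = x - ix;                        // bilinear weights
        float fy = y - iy;

        slice[ iy      * nx + ix    ] += (1.f - fx) * (1.f - fy) * value;
        slice[ iy      * nx + ix + 1] += fx         * (1.f - fy) * value;
        slice[(iy + 1) * nx + ix    ] += (1.f - fx) * fy         * value;
        slice[(iy + 1) * nx + ix + 1] += fx         * fy         * value;
    }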
Experiments – backprojection (Pᵀ) • Input data dimensions: 672x24x3000 floats (projection data) • Output data dimensions: 512x512x257 floats (voxel volume) • Calculation times • CPU: 2700 s • GPU: 700 s • Speed-up factor: approximately 4x • For a larger collection of problems, speedups between 4x and 8x have been observed.
Approximate implementation of the adjoint operator Pᵀ • Bilinear interpolation on the detector. • 2D illustration: each pixel (voxel in 3D) is projected onto the detector and gathers an interpolated detector value. • Parallelization is now accomplished by letting one pixel correspond to one thread. • This method is approximate since the interpolation coefficients generally differ from those of Pᵀ.
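A sketch of this gather-style parallelization, one thread per (x, y) voxel column, using a 2D texture with hardware bilinear interpolation for the detector data. Names are hypothetical and projectToDetector() is only a stand-in for the cone-beam geometry:

    // Projection data for one view, bound with filterMode = cudaFilterModeLinear
    // so tex2D performs the bilinear interpolation on the detector in hardware.
    texture<float, 2, cudaReadModeElementType> devT_projection;

    // Placeholder for the actual cone-beam geometry (illustrative only)
    __device__ float2 projectToDetector(int ix, int iy, int iz)
    {
        return make_float2((float)ix, (float)iz);   // parallel-beam stand-in
    }

    // Gathers instead of scattering, so no conflicting updates occur.
    __global__ void approxBackProjectKernel(float* volume,
                                            int nx, int ny, int nz, float weight)
    {
        int ix = blockIdx.x * blockDim.x + threadIdx.x;
        int iy = blockIdx.y * blockDim.y + threadIdx.y;
        if (ix >= nx || iy >= ny) return;

        for (int iz = 0; iz < nz; ++iz) {
            float2 det = projectToDetector(ix, iy, iz);
            volume[((size_t)iz * ny + iy) * nx + ix] +=
                weight * tex2D(devT_projection, det.x + 0.5f, det.y + 0.5f);
        }
    }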
Performance comparison: approximate versus exact Pᵀ operator • Calculation times • CPU Pᵀ: 2700 s • GPU Pᵀ (exact): 700 s • GPU Pᵀ (approximate): 60 s
[Figure: axial images of head phantom reconstructions (25 HU – 75 HU), exact versus approximate Pᵀ]
Conclusions and future research • For operations such as P and Pᵀ, speedups of 4x to 50x can be obtained. • The amount of speedup is highly dependent on • the possibility to read memory efficiently (coalesced, or via texture caches / constant memory), • the possibility to use hardware interpolation, • the complexity of the kernel: using too many registers slows down the program, and lookup tables in constant memory can help reduce the number of registers. • Although scattered writes are supported by CUDA, for these problems they did not seem attractive from a performance point of view. Note that scattered writes also rule out the use of cudaArray textures for the output. • Remark: the approximate Pᵀ operator gives a bad result in the example shown here, and there are other situations where the exact operator is superior. It is therefore of interest to find better approximations.