150 likes | 268 Views
Improvement of CT Slice Image Reconstruction Speed Using SIMD Technology. Xingxing Wu Yi Zhang Instructor: Prof. Yu Hen Hu Department of Electrical & Computer Engineering University of Wisconsin, Madison. Motivation.
E N D
Improvement of CT Slice Image Reconstruction Speed Using SIMD Technology Xingxing Wu Yi Zhang Instructor: Prof. Yu Hen Hu Department of Electrical & Computer Engineering University of Wisconsin, Madison
Motivation • CT Slice Image Reconstruction is a very important part which will affect the reconstructed image quality and scanning speed • CT Slice Image Reconstruction is very time-consuming • Traditional methods for speedup: • Specially designed hardware • Parallel algorithm running on super computer • Explore a new method: SIMD implementation
Parallel-Beam FBP Image Reconstruction Algorithm • The Algorithm consists on three parts: • data rebinning: • data filtering • back-projection
Parallel-Beam FBP Image Reconstruction Algorithm • Projection: • Data Rebinning: • Data Filtering: • Data Backprojection:
CT Slice Image Reconstruction Is Very Time Consuming A Whole Head Spiral Scanning will generate several GB projection data
Can FBP Algorithm Benefit from SIMD? • The Algorithm has the following features: • Small, highly repetitive loops that operate on sequential arrays of integers and floating-point values • Frequent multiplies and accumulates • Computation-intensive algorithms • Inherently parallel operations • Wide dynamic range, hence floating-point based • Regular memory access patterns • Data independent control flow
Analysis of Data Dynamic Range and Quantization Errors • Wide Dynamic Range • Relative Error Metric • 32-Bit Single-Precision Floating Point and SSE2
Updated Algorithm to Fit SIMD • Update the algorithm to eliminate some conditional branches • Reduce the on-the-fly calculations which are not suitable for the SIMD implementation
A4 A5 A6 A7 A0 A1 A2 A3 A0*B0+A4*B4 B4 A1*B1+A5*B5 B5 B6 A2*B2+A6*B6 B7 A3*B3+A7*B7 B0 B1 B2 B3 Parallel Implementation of Data Filtering In SIMD Rebinned Data * * * * * * * * Weight Filtered Data
E0 C0 F0 B0 G0 -0.5 D0 H0 A0 +0.5 H1 E1 A1 B1 F1 C1 -0.5 G1 +0.5 D1 B2 A2 +0.5 -0.5 D2 H2 C2 E2 G2 F2 D3 H3 A3 +0.5 G3 -0.5 E3 B3 C3 F3 Parallel Implementation of Backprojection in SIMD Index Calculation Index Floor (index) Ceil (index) (fetch data) Filtered Data Weight Reconstructed Image
Optimization of The Implementation • Optimize Memory Access • Ensure proper alignment to prevent data split across cache line boundary: data alignment, stack alignment, code alignment • Observe store-forwarding constraints • Optimize data structure layout and data locality to ensure efficient use of 64-byte cache line size and also reduce the frequency of memory loading and storing • Use prefetching cacheability instructions control appropriately • Minimize bus latency by segmenting the reads and writes into phases • Replace Branches with Logic Operations • Optimize Instruction Scheduling • Optimize the Parallelism • Loop Unrolling • Break dependence chains
Optimization of The Implementation • Optimize Instruction Selection • avoid longer latency instruction • avoid instructions that unnecessarily introduce dependence-related stalls • Optimize the Floating-point Performance • avoid exceeding the representable range • avoid change floating-point control/status register • enable flush-to-zero and DAZ mode
Improvement of Performance The differences of the reconstructed image pixel values between C implementation and SIMD implementation are less than 0.01