Three-Dimensional Template Correlation: Object Recognition in 3D Voxel Data
Tom VanCourt, Yongfeng Gu, Martin Herbordt
Boston University, ECE Department, CAAD lab
www.bu.edu/caadlab

3D Template Matching
• Increasing use of volumetric data sets
  • MRI / CAT, confocal microscopy, molecular structure
• Increased complexity of correlation
  • 2D: O(n²) positions (x,y) × O(n¹) rotations = O(n³)
  • 3D: O(n³) positions (x,y,z) × O(n³) rotations = O(n⁶)
• Transform techniques help a little:
  • O(n³) → O(n² log n)
  • O(n⁶) → O(n⁴ log n)
• Solution: Application-specific accelerators
  • Programmable off-the-shelf hardware
  • Custom logic design, unique to each application

Volumetric Data Sets
• Complex data types
  • Multiple fluorescence channels
  • Oriented data: flow vectors
  • Nonlinear scoring models
• True 3D data acquisition
  • Medical imaging (MRI, PET, CAT, …)
  • Confocal microscopy
  • Emerging techniques: Diffusion tensor tomography

COTS AND Custom? How?
• Field Programmable Gate Arrays
  • 1000s of uncommitted elements
  • Custom processor built on demand
  • On-chip RAM bandwidth: >1 Tbit/sec
  • Massive parallelism: 100s-1000s of PEs
• Accelerator is tailored to each application
  • ~100% payload computation cycles
    • No load/store cycles
    • No loop overhead cycles
    • No address arithmetic cycles
  • ~0% logic dedicated to unused features

Acceleration Strategy
• Standard approach: Molecule Grid → Rotated Image → Transform Per Channel (FFT) → Products of Transforms (×) → FFT⁻¹ → Correlation Result
• Accelerated approach: Molecule Grid → Rotated Addressing → Direct Correlation by Systolic Array → Correlation Result

Correlation Pipeline
• Direct correlation
  • Beats FFT for modest problems
  • Generalizes correlation sum: Σ_ijk F(A_xyz, T_ijk)
• Natural for FPGA implementation
  • Regular structure
  • Simple data elements
  • Customizable functions
  • High data reuse
• Pipeline stages: Rotated Image Access → Voxel Value Rotation → Systolic 3D Correlation → Data Reduction Filtering

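The generalized correlation sum can be illustrated in software. The following is a minimal sketch, not the FPGA design: the function name `correlate3d`, the window indexing A[x+i][y+j][z+k], and the restriction to positions where the template fits entirely inside the image are all assumptions made for illustration. The `score` argument plays the role of the custom F; the default product gives ordinary correlation.

```python
# Sketch only: the x+i, y+j, z+k window indexing and the "valid positions
# only" output range are assumptions, not taken from the slides.
def correlate3d(image, template, score=lambda a, t: a * t):
    """Return R[x][y][z] = sum over i,j,k of score(A[x+i][y+j][z+k], T[i][j][k])."""
    nx, ny, nz = len(image), len(image[0]), len(image[0][0])
    tx, ty, tz = len(template), len(template[0]), len(template[0][0])
    rx, ry, rz = nx - tx + 1, ny - ty + 1, nz - tz + 1
    result = [[[0] * rz for _ in range(ry)] for _ in range(rx)]
    for x in range(rx):
        for y in range(ry):
            for z in range(rz):
                total = 0
                for i in range(tx):
                    for j in range(ty):
                        for k in range(tz):
                            total += score(image[x + i][y + j][z + k],
                                           template[i][j][k])
                result[x][y][z] = total
    return result
```

Passing a nonlinear `score`, for example `lambda a, t: 1 if a == t else 0` to count exact matches, changes the scoring model without touching the loop structure, which is the property that makes the direct form attractive for custom hardware.
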
Rotated Memory Access
• Load image once & reuse
• Access image in rotated order via index transformation

    | xi xj xk |   | i |   | x |
    | yi yj yk | × | j | = | y |
    | zi zj zk |   | k |   | z |

• Allows axis scaling, mirror reversal
  • Anisotropic: e.g. X,Y resolution ≠ Z
  • No need for resampling
• ~0 delay & buffer overhead
  • Strength reduction eliminates multiplication
  • Arithmetic cost hidden by pipelining

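Strength reduction here means that the matrix product is never evaluated per voxel: stepping k by one just adds the k column of the matrix to the current address, and likewise for j and i. A minimal sketch follows, assuming integer matrix entries and a zero origin offset (hardware would more plausibly use fixed-point increments); the name `rotated_addresses` is hypothetical.

```python
# Sketch only: integer matrix entries and a zero origin offset are assumed;
# hardware would more likely use fixed-point increments.
def rotated_addresses(matrix, n):
    """Yield ((i, j, k), (x, y, z)) for an n*n*n traversal using additions only."""
    (xi, xj, xk), (yi, yj, yk), (zi, zj, zk) = matrix
    px, py, pz = 0, 0, 0                  # address at the start of plane i
    for i in range(n):
        lx, ly, lz = px, py, pz           # address at the start of line j
        for j in range(n):
            x, y, z = lx, ly, lz
            for k in range(n):
                yield (i, j, k), (x, y, z)
                x, y, z = x + xk, y + yk, z + zk    # step along k: add column k
            lx, ly, lz = lx + xj, ly + yj, lz + zj  # step along j: add column j
        px, py, pz = px + xi, py + yi, pz + zi      # step along i: add column i
```

A matrix such as `((0, -1, 0), (2, 0, 0), (0, 0, -1))` combines a 90° rotation, a 2x scaling of one axis, and a mirror reversal in a single traversal, with no resampling pass.
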
Voxel Value Rotation
• Not needed for scalar data (RGB, gray scale, etc.)
  • Step exists architecturally, as identity transform
• For spatially oriented data (e.g. fluid flow in brain tissue)
  • Perform rigid rotation of image …
  • Then rotate oriented voxel values

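The distinction between rotating voxel positions and rotating voxel values can be sketched as below. The function name and the `oriented` flag are hypothetical; the sketch assumes an oriented value is a 3-component vector and that the same rotation matrix used for addressing is applied to it.

```python
# Sketch only: the `oriented` flag and the function name are hypothetical.
def rotate_voxel_value(matrix, value, oriented):
    """Rotate a 3-vector voxel value by `matrix`; pass other values through."""
    if not oriented:
        return value                      # identity transform: scalar, RGB, etc.
    return tuple(sum(row[c] * value[c] for c in range(3)) for row in matrix)
```

For a 90° rotation about z, `((0, -1, 0), (1, 0, 0), (0, 0, 1))`, a flow vector pointing along +x comes out pointing along +y, while an RGB triple passed with `oriented=False` is returned unchanged.
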
Correlation Array
• 3D extension of conventional array
• Custom unit cell
  • Holds constant value for template
  • Custom F(a, b)
• … 1D array + line buffer
  • Extend line to result width
• … 2D array + plane buffer
  • Extend plane to result size
• … 3D array
  • One input voxel per cycle, padded
  • One output correlation point per cycle
[Figure: unit cell with template value T, input A, function F, adder, and partial sum in/out (Sin, Sout); arrays chained through RAM FIFOs]

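The "1D array + line buffer" step can be modeled in software as a single chain of cells. This is a behavioral sketch under stated assumptions, not the authors' circuit: the input is broadcast to every cell each cycle, a `None` entry stands for one FIFO delay slot that forwards its partial sum unchanged, and only positions where the template fits inside the image are extracted (the slides describe a padded input instead). Both function names are hypothetical.

```python
# Sketch only: a software model, not the authors' design. Broadcasting the
# input to every cell and the None-marks-a-FIFO-slot encoding are assumptions.
def systolic_chain(stream, cells, score=lambda a, t: a * t):
    """Feed one input per cycle; return the last cell's Sout after each cycle."""
    sums = [0] * len(cells)               # one partial-sum register per cell
    outputs = []
    for a in stream:
        # Update back to front so each cell reads its neighbor's previous Sout.
        for i in range(len(cells) - 1, -1, -1):
            s_in = sums[i - 1] if i > 0 else 0
            add = 0 if cells[i] is None else score(a, cells[i])
            sums[i] = s_in + add
        outputs.append(sums[-1])
    return outputs


def correlate2d_streamed(image, template):
    """2D correlation of a raster-scanned image: template rows joined by line buffers."""
    height, width = len(image), len(image[0])
    t_rows, t_cols = len(template), len(template[0])
    cells = []
    for r, row in enumerate(template):
        cells += list(row)                        # one 1D array per template row
        if r < t_rows - 1:
            cells += [None] * (width - t_cols)    # FIFO line buffer pads to image width
    stream = [v for row in image for v in row]    # one pixel per cycle
    out = systolic_chain(stream, cells)
    latency = len(cells) - 1                      # cycles from first to last cell
    return [[out[y * width + x + latency] for x in range(width - t_cols + 1)]
            for y in range(height - t_rows + 1)]
```

The same construction extends one level further: joining 2D arrays with plane-sized delays gives the 3D case, with one input voxel consumed and one correlation point produced per cycle.
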
3D Correlation Result
• Template is stored in computation array
• FIFOs hold partial correlation sums
[Figure labels: FIFO line buffers (pad to result width); template data and computation array; FIFO plane buffers (pad to result depth); 3D correlation result, whole volume shown; correlation complete, result passed to data reduction filter]

Peak Capture / Data Reduction
• 3D result ≥ image size
  • Full result would slow host
• Template may occur > 1x
  • Find multiple maxima
• Reporting N highest points is not effective
  • Broad maximum reported redundantly
  • Local maxima missed
• Instead: Local max by region
  • 8x8x8 region: 512:1 reduction
  • More maxima, less redundancy
  • Record exact (x,y,z) in region
  • BUT may miss close maxima
  • Region template size may be OK

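Local-maximum capture by region can be sketched as a streaming reduction: each correlation point is compared against the best value seen so far in its region, and the exact coordinates are kept alongside. The function name and the dictionary layout are assumptions for illustration; the 8x8x8 default matches the region size on the slide.

```python
# Sketch only: function name and result layout are hypothetical.
def regional_maxima(volume, region=8):
    """Map each region index to (max value, exact (x, y, z)) within that region."""
    nx, ny, nz = len(volume), len(volume[0]), len(volume[0][0])
    peaks = {}
    for x in range(nx):
        for y in range(ny):
            for z in range(nz):
                key = (x // region, y // region, z // region)
                value = volume[x][y][z]
                # Keep the first point seen, then replace only on a strictly larger value.
                if key not in peaks or value > peaks[key][0]:
                    peaks[key] = (value, (x, y, z))
    return peaks
```

With `region=8`, each group of 512 correlation points is reduced to one reported peak. Two true maxima that fall in the same region yield a single entry, which is the "may miss close maxima" caveat.
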
Why Reconfigurable?
• Massive parallelism, modest cost
  • COTS hardware, tracks technology
• Application-optimized processing
  • Tracks application changes (e.g. 1, 2, 3-channel fluorescence)
  • Flexible performance tradeoffs
  • Allows non-linear scoring
• Available now
  • PC add-ins
  • SGI Altix
  • Cray XD1
[Figure labels: 24 bit RGB, 8 bit Mono, 4 bit]

Performance Results
• Xilinx Virtex-II Pro VP70
• Measured: Score-accumulate per sec (SAC/sec)
• Complex models not limited in number of bits
• Simple models not limited by worst-case speed

Conclusions
• Accelerators enable 3D template matching
  • >100x speedup over 3D FFT (n~100)
  • Complex data types, including vector values
  • Nonlinear comparisons supported
• Programmability avoids common limitations
  • No penalty due to over-generalization
  • No limit due to data/function restrictions
• 3D data and FPGA coprocessors match well
  • Both are emerging and expanding
  • FPGAs three years ago couldn't do it!