Slides on a hardware architecture for point sets: a neighbor search module, a coherent neighbor cache, and a reconfigurable processing module, prototyped on FPGAs for processing, manipulation, and rendering of point-based graphics.
A Hardware Processing Unit For Point Sets S. Heinzle, G. Guennebaud, M. Botsch, M. Gross Graphics Hardware 2008
Motivation • Point-based graphics established • Powerful algorithms • Representation • Processing • Manipulation • Rendering • Decomposition • Get neighborhood • Operate on neighbors
Motivation • GPUs not suited for getting neighborhood • SIMD • Incoherent branching • Dynamic data structures slow • Recursive calls not supported • CPUs • Small number of FPUs • Inflexible memory caches
Contributions • Hardware architecture for point sets • Neighbor search module • Novel advanced caching mechanism • Reconfigurable processing module • Programmability using FPGA compiler • FPGA prototype and measurements • Small & lean: integration into multi-core CPU/GPU possible
Outline • Related Work • Spatial Searching and Caching • Architecture and Prototype • Results • Conclusion
Related Work • Kd-Tree [Bentley 75] • kNN on GPUs [Ma and McCool 02] • Kd-Tree on GPUs [Popov et al. 07] • Kd-Tree Hardware [Woop et al. 05] [Woop et al. 06]
Related Work • Adaptive SPH Fluid Simulation [Adams et al. 07] • Algebraic Moving Least Squares [Guennebaud and Gross 07] • Linear Moving Least Squares [Adamson and Alexa 04]
Linear Moving Least Squares • Implicit surface defined by a set of points (figure labels: points pi, normals ni, query point x) • Iterative projections onto plane: x, x’, x’’, x’’’ • Surface defined by points projecting onto themselves
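The iterative projection can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the operator used in the talk: the Gaussian kernel, the bandwidth `h`, and the choice of fitting the plane through the weighted average position with the weighted average normal are all assumptions made for the example, and `mls_project` is a hypothetical name.

```python
import math

def mls_project(x, points, normals, h=1.0, iterations=10, tol=1e-9):
    """Iteratively project x onto a locally fitted plane (illustrative variant)."""
    x = list(x)
    for _ in range(iterations):
        # Gaussian weights centred on the current estimate (assumed kernel).
        w = [math.exp(-sum((xi - pi) ** 2 for xi, pi in zip(x, p)) / (h * h))
             for p in points]
        total = sum(w)
        if total == 0.0:
            break
        # Weighted average position and normalised weighted average normal.
        c = [sum(wi * p[d] for wi, p in zip(w, points)) / total for d in range(3)]
        n = [sum(wi * nrm[d] for wi, nrm in zip(w, normals)) for d in range(3)]
        length = math.sqrt(sum(nd * nd for nd in n))
        if length == 0.0:
            break
        n = [nd / length for nd in n]
        # Signed distance of x to the plane through c with normal n.
        dist = sum((x[d] - c[d]) * n[d] for d in range(3))
        x = [x[d] - dist * n[d] for d in range(3)]
        if abs(dist) < tol:
            break  # x now (numerically) projects onto itself
    return x
```

In a real system the sums would run only over the neighbors of the current estimate, which is why each projection step needs a neighborhood query.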
Outline • Related Work • Spatial Searching and Caching • Architecture & Prototype • Results • Conclusion
Spatial Search • Spatial search: kNN (k nearest neighbors) and eNN (all neighbors within radius e) • Common in most point operations • Based on kd-tree • Example: eNN (figure)
Spatial Search • kNN search similar to eNN search: • Start with infinite radius • Sort leaf points into priority queue • Shrink radius with every point sorted
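Both searches can be sketched in software on a small kd-tree. This is an assumed, simplified version for illustration: the tuple-based tree layout, the median split, and the names `build`, `enn`, and `knn` are not from the slides, and the search radius here only starts shrinking once the queue holds k candidates.

```python
import heapq

def build(points, depth=0, leaf_size=4):
    """Build a kd-tree as nested tuples; leaves hold small point lists."""
    if len(points) <= leaf_size:
        return ("leaf", points)
    axis = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return ("node", axis, pts[mid][axis],
            build(pts[:mid], depth + 1, leaf_size),
            build(pts[mid:], depth + 1, leaf_size))

def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def enn(tree, q, r):
    """All points within distance r of q (fixed search radius)."""
    out = []
    def visit(node):
        if node[0] == "leaf":
            out.extend(p for p in node[1] if dist2(p, q) <= r * r)
            return
        _, axis, split, left, right = node
        if q[axis] - r <= split:   # ball overlaps the left half-space
            visit(left)
        if q[axis] + r >= split:   # ball overlaps the right half-space
            visit(right)
    visit(tree)
    return out

def knn(tree, q, k):
    """k nearest neighbours of q, closest first, as (squared distance, point)."""
    heap = []  # max-heap via negated squared distances, at most k entries

    def radius2():
        # Infinite until k candidates are known, then the k-th best distance.
        return -heap[0][0] if len(heap) == k else float("inf")

    def visit(node):
        if node[0] == "leaf":
            for p in node[1]:
                d = dist2(p, q)
                if d < radius2():
                    if len(heap) == k:
                        heapq.heapreplace(heap, (-d, p))  # radius shrinks
                    else:
                        heapq.heappush(heap, (-d, p))
            return
        _, axis, split, left, right = node
        near, far = (left, right) if q[axis] < split else (right, left)
        visit(near)
        # Visit the far side only if the split plane lies within the radius.
        if (q[axis] - split) ** 2 < radius2():
            visit(far)

    visit(tree)
    return sorted((-nd, p) for nd, p in heap)
```

The only difference between the two traversals is the radius: fixed for eNN, and taken from the top of the priority queue for kNN.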
Coherent Neighbor Cache (eNN) • Find neighbors in slightly bigger radius • Re-use result for spatially close query • Re-use if: [condition shown as a formula on the slide]
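The re-use condition itself is not preserved in this transcript. A sufficient condition follows from the triangle inequality: if the cached search used radius r + s around a center c, then every point within r of a new query q is in the cached result whenever the distance from q to c is at most s. The sketch below assumes that condition; the class name `EnnCache`, the `slack` parameter, and the single-entry design are illustrative choices, not details from the slides.

```python
import math

class EnnCache:
    """Single-entry coherent neighbor cache for fixed-radius (eNN) queries."""

    def __init__(self, search, r, slack):
        self.search = search   # function (q, radius) -> list of points
        self.r = r             # radius the caller actually asks for
        self.slack = slack     # extra radius fetched on a miss (assumed knob)
        self.center = None
        self.points = []
        self.hits = 0

    def query(self, q):
        if self.center is not None and math.dist(q, self.center) <= self.slack:
            self.hits += 1
        else:
            # Miss: search a slightly bigger ball and remember it.
            self.center = q
            self.points = self.search(q, self.r + self.slack)
        # The cached ball covers the ball of radius r around q; filter it.
        return [p for p in self.points if math.dist(p, q) <= self.r]
```

A larger `slack` raises the hit rate for coherent query streams but makes every miss and every filter pass more expensive.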
Coherent Neighbor Cache (kNN, exact) • Find (k+1) neighbors • Re-use result for spatially close query • Re-use if: [condition shown as a formula on the slide]
Coherent Neighbor Cache (kNN, approximation) • Approximation error e • Enlarge radius • Re-use if: [condition shown as a formula on the slide]
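The slide formulas for the kNN case are not preserved here, so the following is a derived sufficient test rather than the authors' exact condition. With cached distances d_k and d_(k+1) and a new query at distance delta from the cached one, the k cached neighbors are at most d_k + delta away and every other point is at least d_(k+1) - delta away. Treating the approximation error as a relative tolerance `eps` on the k-th distance is an assumption of this sketch, as is the function name `can_reuse`.

```python
import math

def can_reuse(cached_query, cached_dists, new_query, k, eps=0.0):
    """True if the first k cached neighbours are valid k-NN for new_query.

    cached_dists: sorted distances from cached_query to its k+1 nearest points.
    eps: tolerated relative error on the k-th distance (0 means exact).
    """
    delta = math.dist(cached_query, new_query)
    d_k, d_k1 = cached_dists[k - 1], cached_dists[k]
    # Cached neighbours are at most d_k + delta from the new query; every
    # other point is at least d_k1 - delta away (triangle inequality).
    return d_k + delta <= (1.0 + eps) * (d_k1 - delta)
```

With `eps=0` the test reduces to 2 * delta <= d_(k+1) - d_k, which is strict enough that dense, evenly spaced samples rarely pass it; a small positive `eps` widens the re-use region considerably.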
Outline • Related Work • Spatial Searching and Caching • Architecture & Prototype • Results • Conclusion
The Architecture (block diagram; figure label: Host)
Coherent Neighbor Cache • Eight cached neighborhoods • Problem: parallel queries in kd-tree module • Interleave spatially similar queries
Kd-Tree Traversal
NodeRecurse • Kd-tree structure on chip • 16 threads • Pipelining and multi-threading
Stacks • 16 stacks • Parallel read/write • Bounded in depth • 6 bytes per thread per recursion
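The stack figures translate directly into an on-chip storage budget. The depth bound is not given in the transcript, so the value of 32 levels below is purely hypothetical, as is the helper name `stack_bytes`.

```python
def stack_bytes(max_depth, threads=16, bytes_per_entry=6):
    """Total stack storage for a given bounded recursion depth."""
    return threads * max_depth * bytes_per_entry

# Hypothetical depth bound of 32 levels: 16 * 32 * 6 bytes.
print(stack_bytes(32))  # 3072
```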
Leaf • 16 parallel priority queues (1-cycle ops) • Queues store pointers and distances • Bandwidth bottleneck
Processing Module • Multithreaded quad-port bank of 16 registers • 128 threads • Programmability using FPGA technology
Further Data • Implemented on two FPGAs • 64-bit DDR DRAM • Interconnection: no overhead • Resource usage (registers and LUTs) • Virtex 2 Pro 100 (kNN): 26% registers, 38% LUTs • Virtex 2 Pro 70 (MLS): 47% registers, 52% LUTs • Clock frequency: 75 MHz
Outline • Related Work • Spatial Searching and Caching • Architecture & Prototype • Results • Conclusion
Applications • Tested on various applications • PCI interface of prototype slow • [Weyrich et al. 04] • [Adams et al. 07]
Results kNN • Small hardware footprint • FPGA slightly slower • Realistic clock frequency: prototype faster than CPU/GPU • Chart: number of queries vs. number of neighbors, bars relative to FPGA (x1); clock labels 75 MHz, 2200 MHz, 1200 MHz • Bar labels in slide order: CUDA: x4; ASIC estimate, 500 MHz: x6.6; CUDA w/o sort: x4.0; CPU: x1.5; CUDA: x2.4; CUDA w/o sort: x3.1; CPU: x1.4; CUDA: x1.6; CPU: x1.1
Results MLS • FPGA faster than CPU • kNN bottleneck • FPGA • GPU • Chart: number of queries vs. number of neighbors; clock labels 75 MHz, 2200 MHz, 1200 MHz; bar labels: MLS CUDA: x3.8; FPGA: x1; MLS CPU: x0.4
Coherent Neighbor Cache • Chart: number of queries vs. level of coherence; series: CPU, e=0.1; FPGA, e=0.1; FPGA, exact
Results Approximation Error (MLS projection) • Chart: MLS error vs. approximation e, compared with no approximation
Results Approximation Error (MLS projection) • Chart: cache hits vs. approximation e
Approximation Error (visual) • Coherent Neighbor Cache: • Not optimal for exact queries • Approximate queries • Can be tolerated in most cases • Greatly increases performance • Even for small approximations
Outline • Related Work • Spatial Searching and Caching • Architecture & Prototype • Results • Conclusion
Conclusion • Novel hardware architecture for • Nearest-neighbor searches • Generic meshless processing operators • Cache exploiting spatial coherence • Good performance considering resources • Possible GPU integration
Future Work • Programmable data structure • Support different data structures • Programmability in data structure • Construction on-chip • ‘Real’ programmability in point processing module