Kd-Jump. A Path-Preserving Stackless Traversal for Faster Isosurface Ray tracing on GPUs. David Meirion Hughes. Ik Soo Lim. Bangor University, UK. Problem Setting and Previous work. Problem Setting. Problem Setting: Ray Tracing. Tracing rays from camera Find the intersections
Kd-Jump: A Path-Preserving Stackless Traversal for Faster Isosurface Ray tracing on GPUs. David Meirion Hughes. Ik Soo Lim. Bangor University, UK.
Problem Setting and Previous work Problem Setting
Problem Setting: Ray Tracing • Tracing rays from camera • Find the intersections • Avoid uninteresting areas • Acceleration structure • Division of space • Requires ray traversal
Problem Setting: Traversal of Kd-Trees • Downward Traversal • Two branch choices • Remember furthest • Traverse nearest • Test for intersections • If branch had no hit? • Traversal Restore • Go back to other branch [Figure: a ray and its stack; each stack element holds tnear, tfar, node*, e.g. 0.5, 0.75, 0x1...]
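The downward traversal and restore steps above can be sketched as follows. This is a minimal CPU-side illustration, not the presenters' GPU code: the `Node` class, `traverse_with_stack`, and the two-level example tree are hypothetical, and the sketch assumes no leaf ever reports a hit, so every leaf along the ray is visited.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    axis: int = -1                  # split axis; -1 marks a leaf
    split: float = 0.0              # split plane position on that axis
    left: Optional["Node"] = None   # child below the split plane
    right: Optional["Node"] = None  # child above the split plane
    name: str = ""                  # leaf label, for illustration only


def traverse_with_stack(root, origin, direction, tnear, tfar):
    """Return leaf names in the order the ray segment [tnear, tfar] visits them."""
    visited, stack, node = [], [], root
    while True:
        while node.axis >= 0:  # downward traversal
            o, d = origin[node.axis], direction[node.axis]
            below = o < node.split or (o == node.split and d <= 0.0)
            near, far = (node.left, node.right) if below else (node.right, node.left)
            if d == 0.0:
                node = near  # ray parallel to the plane: only the near side
                continue
            t = (node.split - o) / d  # ray parameter at the split plane
            if t <= 0.0 or t >= tfar:
                node = near
            elif t <= tnear:
                node = far
            else:
                stack.append((far, t, tfar))  # remember the furthest branch
                node, tfar = near, t          # traverse the nearest branch
        visited.append(node.name)  # a real tracer would test for intersections here
        if not stack:
            return visited
        node, tnear, tfar = stack.pop()  # traversal restore: go back to other branch


# Hypothetical tree: split on x at 0.5, then the right half splits on y at 0.5.
tree = Node(axis=0, split=0.5, left=Node(name="A"),
            right=Node(axis=1, split=0.5, left=Node(name="B"), right=Node(name="C")))
order = traverse_with_stack(tree, (0.0, 0.25), (1.0, 0.3), 0.0, 1.0)
print(order)  # ['A', 'B', 'C']
```

Each pushed tuple corresponds to one stack element in the figure: a ray segment (tnear, tfar) plus a node reference.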
Problem Setting: GPUs • Several MPUs • Parallel execution • Kernels • Thousands of threads • Light-weight code • On-chip memory very fast • On-board memory slow
Problem Setting: Ray tracing on GPUs • Stack – still a problem? • Memory Size • Coalesced Access • One Stack element: • Ray Segment • Node address/look-up • times depth-of-tree • times ray-count • One kernel call [Figure: stack elements (tnear, tfar, node*) for Ray 1, Ray 2, Ray 3, ...; accesses labelled One, Two, and Three Memory Transactions]
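The memory-size bullet is a simple product, sketched below. The 12-byte element (4 bytes each for tnear, tfar, and a node reference), the tree depth of 24, and the ray count of 1024 × 1024 are illustrative assumptions, not figures from the slides.

```python
def stack_bytes(ray_count, tree_depth, element_bytes=12):
    """Worst-case stack storage: one element per tree level, for every ray."""
    return ray_count * tree_depth * element_bytes


# Assumed workload: a 1024 x 1024 image, one ray per pixel, tree depth 24.
total = stack_bytes(ray_count=1024 * 1024, tree_depth=24)
print(total // 2**20, "MiB")  # 288 MiB
```

Because all rays are traced in one kernel call, this storage has to be allocated up front for every ray.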
Previous Work: Stackless Traversal • Avoid using stack • Current thinking • Less memory • No global memory use • Faster [Figure: stack elements (tnear, tfar, node*) for Ray 1, Ray 2, Ray 3, ...]
Previous Work: Stackless Traversal • Avoid using stack • Kd-Restart • Restart From Root • + Very little memory • - Revisits previous nodes • - Longer thread life • - Exacerbates incoherence [Figure: nodes tested twice]
Previous Work: Stackless Traversal • Avoid using stack • Kd-Restart • Kd-Backtrack • Backtrack up tree • + Very little memory • + Better than Kd-Restart • - Revisits previous nodes • - Longer thread life • - Exacerbates incoherence [Figure: nodes tested twice]
Previous Work: Stackless Traversal • Avoid using stack • Kd-Restart • Kd-Backtrack • Ropes • Nodes have neighbour links • + Shorter ray life • - Lots of extra memory [Figure: additional pointers (per node)]
Motivation and Description Kd-Jump
Kd-Jump: Motivation • Goal: • Same path as Stack method • Least amount of memory • How: • Indices rather than pointers • Traverse down with an equation • Return using its inverse • Binary bits for return markers
Kd-Jump: Index Reference • Each node referenced by index • x, y, z, etc... • depth [Figure: levels and memory blocks; [x,y,z] memory map]
Kd-Jump: Method Description • Traversal into children • Update an index element • Determined by the split dimension • Multiply by 2 • Add child offset f: C = 2x + f [Figure: x-dimension split; [x,y,z] becomes [2x+f,y,z], with f=0 or f=1]
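The downward step C = 2x + f can be written out directly. The function name `step_to_child` and the tuple representation of the index are illustrative assumptions.

```python
def step_to_child(index, split_dim, f):
    """Apply C = 2x + f to the index element of the split dimension.

    index     -- node index, e.g. (x, y, z)
    split_dim -- which element the current node splits (0 = x, 1 = y, 2 = z)
    f         -- child offset, 0 or 1
    """
    child = list(index)
    child[split_dim] = 2 * child[split_dim] + f
    return tuple(child)


print(step_to_child((3, 5, 1), split_dim=0, f=1))  # (7, 5, 1)
```

Only one element changes per step; the others are carried through untouched.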
Kd-Jump: Method Description • Traversal back to parent • Apply inverse of downward step • Can replace f with floor function • Do not need to consider what f was • f = 0 or 1 only, so (C-f)/2 = x becomes floor(C/2) = x [Figure: x-dimension split; [x,y,z] returns to [floor(x/2),y,z]]
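A short check of the inverse step: because f is 0 or 1, integer (floor) division by 2 recovers x without knowing which child was taken. The helper name `step_to_parent` is an assumption for illustration.

```python
def step_to_parent(index, split_dim):
    """Invert C = 2x + f with floor(C / 2); the value of f is not needed."""
    parent = list(index)
    parent[split_dim] //= 2  # floor division discards the child offset
    return tuple(parent)


x = 3
for f in (0, 1):
    child = (2 * x + f, 5, 1)  # downward step on the x element
    assert step_to_parent(child, split_dim=0) == (x, 5, 1)
```

On integer indices the floor division is a one-bit right shift, which is why the route choice can be forgotten.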
Kd-Jump: Method Description • Traversal to common parent • Apply inverse on all indices • Divide elements by power of 2 • Number of splits • Matrix of Split information • Store in constant memory • (cached) • (alternative) Store on the fly [Figure: split-count matrix; [x,y,z] becomes floor(x/2^1), floor(y/2^2), ...]
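One way to read the jump to a common parent is sketched below: a table records how many times each dimension has been split above every depth, and the jump shifts each index element right by the difference between two rows. The table layout, the function names, and the x, y, z cycling split order are assumptions made for this example.

```python
def build_split_counts(split_dims):
    """Row d holds the number of splits per dimension above depth d."""
    counts = [(0, 0, 0)]
    for dim in split_dims:
        row = list(counts[-1])
        row[dim] += 1
        counts.append(tuple(row))
    return counts


def jump_to_ancestor(index, depth, target_depth, counts):
    """Divide each element by 2**(splits between the two depths), using shifts."""
    return tuple(
        element >> (counts[depth][k] - counts[target_depth][k])
        for k, element in enumerate(index)
    )


# Assumed tree: split dimension cycles x, y, z over six levels.
split_dims = [0, 1, 2, 0, 1, 2]
counts = build_split_counts(split_dims)

# Walk down with arbitrary child choices, recording the index at each depth.
path = [(0, 0, 0)]
for dim, f in zip(split_dims, [1, 0, 1, 1, 1, 0]):
    node = list(path[-1])
    node[dim] = 2 * node[dim] + f
    path.append(tuple(node))

# Jump from depth 6 straight back to the ancestor at depth 2.
ancestor = jump_to_ancestor(path[6], 6, 2, counts)
print(path[6], "->", ancestor)  # (3, 1, 2) -> (1, 0, 0)
```

The table depends only on the tree layout, not on the ray, which fits the slide's suggestion of keeping it in cached constant memory.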
Kd-Jump: Method Description • Determine jump amount • Mark common parents • 1 bit • Store in MSB order • On return • Count right-trail zero bits • This is the return depth • Subtract from current depth • Jump amount [Figure: 32-bit register, e.g. 0 1 0 0 0]
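The return-marker bookkeeping might look like the sketch below. The exact bit layout is not recoverable from the slide, so this assumes one plausible reading: the flag for depth d sits at bit 31 - d (MSB first), and the trailing-zero count is converted back to a depth. All names here are hypothetical.

```python
WORD_BITS = 32  # one 32-bit register of depth flags


def mark_return_point(flags, depth):
    """Set the flag for a node whose other branch still needs visiting."""
    return flags | (1 << (WORD_BITS - 1 - depth))


def count_trailing_zeros(flags):
    """Number of zero bits below the lowest set bit (flags must be non-zero)."""
    return (flags & -flags).bit_length() - 1


def pop_return_depth(flags):
    """Return (updated flags, depth of the deepest marked ancestor)."""
    return_depth = WORD_BITS - 1 - count_trailing_zeros(flags)
    return flags & (flags - 1), return_depth  # clear the lowest set bit


flags = 0
flags = mark_return_point(flags, 1)  # far branch pending at depth 1
flags = mark_return_point(flags, 4)  # far branch pending at depth 4

current_depth = 7
flags, return_depth = pop_return_depth(flags)
jump_amount = current_depth - return_depth
print(return_depth, jump_amount)  # 4 3
```

With MSB-first storage the deepest pending ancestor is always the lowest set bit, so a single bit scan finds it.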
Kd-Jump: Method Description • Re-clip Ray • Bounds stored or computed [Figure: Bound X, Bound Y]
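Re-clipping can be illustrated with the standard slab test against an axis-aligned box. This sketch assumes the node bounds are already available as `box_min` and `box_max`; how they are stored or recomputed is outside its scope, and the function name is hypothetical.

```python
def clip_ray_to_box(origin, direction, box_min, box_max,
                    tnear=0.0, tfar=float("inf")):
    """Return the (tnear, tfar) segment of the ray inside the box, or None."""
    for o, d, lo, hi in zip(origin, direction, box_min, box_max):
        if d == 0.0:
            if o < lo or o > hi:
                return None  # parallel to this slab and outside it
            continue
        t0, t1 = (lo - o) / d, (hi - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        tnear, tfar = max(tnear, t0), min(tfar, t1)
        if tnear > tfar:
            return None  # slab intervals do not overlap: the ray misses
    return tnear, tfar


# Ray along +x, re-clipped to a node covering x in [0.5, 1], y in [0, 1].
print(clip_ray_to_box((0.0, 0.25), (1.0, 0.0), (0.5, 0.0), (1.0, 1.0)))  # (0.5, 1.0)
```

After a jump, the clipped segment replaces the tnear and tfar values that a stack would otherwise have restored.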
Kd-Jump: Scope • Nodes referenced with indices • Traversal equations invertible • Forget route choices in inverse • Index-to-memory map • Limit wasted memory • Balanced kd-tree • Implicit kd-tree • Requires node bounds • Re-computed with implicit kd-tree
Kd-Jump: Isosurfacing with implicit Kd-tree • Wald’s implicit Kd-tree • min/max of node branch • left-balanced • No-waste memory map • Bounds/splits computed
Implementation: Isosurfacing with implicit Kd-tree • Minor differences • Node test • Test prior to traversing • Reduces number of returns • Stack, kd-jump, kd-restart
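The node test on min/max values can be sketched as a range check applied to both children before descending. The function names and the sample min/max values are illustrative assumptions.

```python
def may_contain_isosurface(node_min, node_max, iso_value):
    """A branch can hold the isosurface only if iso_value lies in [min, max]."""
    return node_min <= iso_value <= node_max


def children_worth_visiting(children, iso_value):
    """Test children before traversing, keeping near-to-far order."""
    return [
        name for name, (lo, hi) in children
        if may_contain_isosurface(lo, hi, iso_value)
    ]


# Assumed min/max values for the near and far children of one node.
children = [("near", (0.1, 0.4)), ("far", (0.3, 0.9))]
print(children_worth_visiting(children, iso_value=0.5))   # ['far']
print(children_worth_visiting(children, iso_value=0.35))  # ['near', 'far']
```

Rejecting a child before entering it means the ray never has to return from an empty branch.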
Results: Isosurfacing with implicit Kd-tree • Kd-Jump faster • Ray active time important (kd-restart) • Stack only slightly slower • High occupancy (75%) • High ray coherence = automatic coalesced access [Figure: frames per second, averaged across multiple iso-values and views]
Results: Isosurfacing with implicit Kd-tree • All use one 32-bit register • stack_size, tfar_max, depth_flags • Stack memory allocated for all rays • Single kernel • Constant memory as fast as registers • Once data is cached [Figure: memory use]
Analysis: Kd-Jump • Theoretical performance • Memory access not hidden • However, perfectly coalesced
Analysis: Kd-Jump • Bottlenecks • Stack memory bound • Kd-Jump computation bound
Hybrid Kd-tree: Exploiting Texture caching • Build implicit tree • Depth threshold • Volume stepping • Texture cache • Very fast • Threshold depth? • Intersection method • Iso-surface • View direction
Hybrid Kd-tree: Results [Figure: frames per second, averaged across multiple views]
Conclusions • Kd-Jump • Stackless • Index based • Immediate backtrack to common parent • No dependency on ray coherency • At least if bounds can be computed • Hybrid Kd-Jump • Texture cache over Acceleration Structure • Variable depth threshold of branches • View, intersection method, iso-value.
Conclusions • Future prediction • Memory access and speed improving • Current trend • Usefulness of Stackless • Reduced memory cost • Reduce dependency on coherency • Fewer iterations (Ropes) than stack • Big question: One kernel, versus many? • Stack favours one kernel. • i.e., no reorganising (can break coalesced access) • Can organise into groups of same depth though? • Many kernels = better device occupancy • Memory access better hidden
Future Work • Kd-Jump with General Kd-Trees? • Real-time explicit from implicit • CUDA 3.0 • Dynamic Warps? • Ideal for ray tracing? • Inter-device communication?
Addendum: Indices for general case kd-trees? • Nodes need bounds for re-clip • Accept the cost? • Compute them somehow? • BVH stores them anyway • Memory Map • Very difficult to remove wasted space • Feasible to minimise waste?