Real-time Ray Tracing on GPU with BVH-based Packet Traversal
Stefan Popov, Johannes Günther, Hans-Peter Seidel, Philipp Slusallek
Background
• GPUs attractive for ray tracing
  • High computational power
  • Shading-oriented architecture
• GPU ray tracers
  • Carr – the Ray Engine
  • Purcell – full ray tracing on the GPU, based on grids
  • Ernst – kd-trees with a parallel stack
  • Carr, Thrane & Simonsen – BVHs
  • Foley, Horn, Popov – kd-trees with stackless traversal
Motivation
• So far: interactive RT on the GPU, but
  • Limited model size
  • No dynamic scene support
• The G80 – a new approach to the GPU
  • High-performance general-purpose processor with graphics extensions
  • PRAM architecture
• BVHs allow for
  • Dynamic/deformable scenes
  • Small memory footprint
• Goal: recursive ordered traversal of BVHs on the G80
GPU Architecture (G80)
• Multi-threaded scalar architecture
  • 12K HW threads
  • Threads cover latencies
    • Off-chip memory ops
    • Instruction dependencies
  • 4 or 16 cycles to issue an instruction
• 16 (multi-)cores
  • 8-wide SIMD
  • 128 scalar cores in total
• Cores process threads in 32-wide SIMD chunks
[Diagram: a chunk pool feeding 16 multi-cores; each multi-core executes 32-thread chunks, with one instruction pointer per chunk]
GPU Architecture (G80)
• Scalar register file (8K registers)
  • Partitioned among running threads
• Shared memory (16KB)
  • On-chip, ~0 cycle latency
• On-board memory (768MB)
  • Large latency (~200 cycles)
  • R/W from within a thread
  • Un-cached
• Read-only L2 cache (128KB)
  • On-chip, shared among all threads
[Diagram: per-thread registers and shared memory inside each multi-core; the L2 cache and on-board memory are shared by all 16 multi-cores]
Programming the G80
• CUDA
  • C-based language with parallel extensions
• GPU utilization at 100% only if
  • Enough threads are present (>> 12K)
  • Every thread uses less than 10 registers and 5 words (32-bit) of shared memory
  • Enough computations per transferred word of data
    • Bandwidth << computational power
  • Adequate memory access pattern to allow read combining
Performance Bottlenecks
• Efficient per-thread stack implementation
  • Shared memory too small – will limit parallelism
  • On-board memory – uncached
    • Need enough computations between stack ops
• Efficient memory access pattern
  • Use texture caches
    • However, only a few words of cache per thread
  • Read successive memory locations in successive threads of a chunk
    • Single round trip to memory (read combining)
  • Cover latency with enough computations
Ray Tracing on the G80
• Map each ray to one thread
  • Enough threads to keep the GPU busy
• Recursive ray tracing
  • Use a per-thread stack stored in on-board memory (see the sketch below)
  • Efficient, since enough computations are present
• But how to do the traversal?
  • Skip pointers (Thrane) – no ordered traversal
  • Geometry images (Carr) – single mesh only
  • Shared stack traversal
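A minimal skeleton of this mapping (a sketch, not the paper's actual code): the kernel name, STACK_DEPTH and the elided node/triangle tests are hypothetical placeholders; the point is only where the per-thread stack lives.

#define STACK_DEPTH 64                 // assumed maximum BVH depth

// One ray per thread; each thread's traversal stack is a private slice of
// a large pre-allocated array in on-board (global) memory, since shared
// memory is too small to hold a full stack per thread.
__global__ void traceRays(int* stackPool, int numRays)
{
    unsigned tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid >= numRays) return;

    int* stack = stackPool + tid * STACK_DEPTH;   // this thread's stack
    int  top   = 0;
    int  node  = 0;                               // start at the root

    for (;;) {
        // ... at an inner node: intersect both children, push the far one
        //     onto 'stack' and descend into the near one;
        //     at a leaf: intersect the contained triangles ...
        if (top == 0) break;                      // stack empty: ray finished
        node = stack[--top];
    }
}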
SIMD Packet Traversal of BVH
• Traverse a node with the whole packet
• At an internal node:
  • Intersect all rays with both children and determine the traversal order
  • Push the far child (if any) on a stack and descend to the near one with the packet
• At a leaf:
  • Intersect all rays with the contained geometry
  • Pop the next node to visit from the stack
PRAM Basics
• The PRAM model
  • Implicitly synchronized processors (threads)
  • Shared memory between all processors
• Basic PRAM operations (sketched below)
  • Parallel OR in O(1)
  • Parallel reduction in O(log N)
[Diagram: parallel OR over per-thread true/false votes, and a tree-shaped sum reduction of 12, 32, 11, 9 into 44 and 20, giving 64]
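A sketch of these two primitives for one 32-thread chunk in CUDA, assuming one chunk per block; the helper names and the CHUNK constant are illustrative, not from the paper.

#define CHUNK 32   // chunk / SIMD width on the G80

// Parallel OR in O(1): every thread with a 'true' vote writes the same
// shared flag; the concurrent writers all store the same value.
__device__ bool parallelOr(bool* flag, bool vote)
{
    if (threadIdx.x == 0) *flag = false;
    __syncthreads();
    if (vote) *flag = true;
    __syncthreads();
    return *flag;
}

// Parallel sum reduction in O(log N): each round halves the number of
// active threads until buf[0] holds the total. 'buf' must point to at
// least CHUNK ints in shared memory.
__device__ int parallelSum(int* buf, int value)
{
    buf[threadIdx.x] = value;
    __syncthreads();
    for (int stride = CHUNK / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            buf[threadIdx.x] += buf[threadIdx.x + stride];
        __syncthreads();
    }
    return buf[0];   // every thread sees the same result
}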
PRAM Packet Traversal of BVH
• The G80 – a PRAM machine at chunk level
• Map packet → chunk, ray → thread
• Threads behave as in the single-ray traversal
  • At a leaf: intersect with the geometry, pop the next node from the stack
  • At a node: decide which children to visit and in what order, push the far child
• Difference: how rays choose which node to visit first
  • Might not be the one they want to visit
PRAM Packet Traversal of BVH
• Choose the child traversal order (see the sketch below)
  • PRAM OR to determine if all rays agree on visiting the same node first
    • The result is stored in shared memory
  • In case of divergence: choose the child with more ray candidates
    • Use PRAM SUM over ±1 votes, one per thread (-1 for the left node, +1 for the right)
    • Look at the sign of the result
• Guarantees synchronous traversal of the BVH
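A sketch of that vote, built on the parallelOr / parallelSum helpers sketched above. hitLeft/hitRight and the tNear values are assumed to come from each ray's box tests; the paper's exact tie-breaking may differ.

// Returns 0 to descend into the left child first, 1 for the right child.
__device__ int chooseFirstChild(bool* flag, int* buf,
                                bool hitLeft, bool hitRight,
                                float tNearLeft, float tNearRight)
{
    bool wantsLeftFirst  = hitLeft && (!hitRight || tNearLeft <= tNearRight);
    bool wantsRightFirst = hitRight && !wantsLeftFirst;

    // If no ray insists on the other child, the whole chunk agrees.
    if (!parallelOr(flag, wantsRightFirst)) return 0;
    if (!parallelOr(flag, wantsLeftFirst))  return 1;

    // Divergence: majority vote. Each ray contributes -1 for the left
    // child, +1 for the right; the sign of the sum picks the first child.
    int vote = wantsLeftFirst ? -1 : (wantsRightFirst ? +1 : 0);
    return (parallelSum(buf, vote) < 0) ? 0 : 1;
}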
PRAM Packet Traversal of BVH
• Stack:
  • Near & far child – the same for all threads => store once
  • Keep the stack in shared memory: only a few bits per thread!
  • Only thread 0 does all stack ops
• Reading data:
  • All threads work with the same node / triangle
  • Sequential threads bring in sequential words
    • Single load operation, single round trip to memory
• Implementable in CUDA (see the sketch below)
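A sketch of the chunk-shared stack and the combined node fetch; NODE_WORDS, the helper names and the stack sizing are assumptions for illustration, and overflow handling is elided.

#define NODE_WORDS 8   // assumed node size in 32-bit words

// Only thread 0 writes the stack; the pushed node index is identical for
// every ray of the chunk, so it is stored once in shared memory.
__device__ void pushNode(int* stack, int* top, int node)
{
    if (threadIdx.x == 0) stack[(*top)++] = node;
    __syncthreads();
}

// All threads read the same entry; returns -1 when the stack is empty.
__device__ int popNode(int* stack, int* top)
{
    int node = (*top > 0) ? stack[*top - 1] : -1;
    __syncthreads();                      // everyone reads before thread 0 pops
    if (threadIdx.x == 0 && *top > 0) --(*top);
    __syncthreads();
    return node;
}

// Sequential threads fetch sequential words of the same node, so the
// chunk's reads combine into a single round trip to memory.
__device__ void loadNode(const float* nodes, int node, float* nodeCache)
{
    if (threadIdx.x < NODE_WORDS)
        nodeCache[threadIdx.x] = nodes[node * NODE_WORDS + threadIdx.x];
    __syncthreads();
}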
Analysis
• Coherent branch decisions / memory accesses
• Small footprint of the data structure
  • Can trace models of up to 12 million triangles
• The program becomes compute bound
  • Determined by over-/under-clocking the core/memory
• No frustums required
  • Good for secondary rays, bad for primary rays
  • Can use rasterization for primary rays
• Implicit SIMD – easy shader programming
• Running on a GPU – shading "for free"
Dynamic Scenes
• Update parts of / the whole BVH and geometry on the GPU
• Use the GPU for RT and the CPU for BVH construction / refitting
• Construct the BVH using binning (sketched below)
  • Similar to Wald RT07 / Popov RT06
  • Bin all 3 dimensions using SIMD
  • Results in > 10% better trees
    • Measured as SAH quality, not FPS
  • Speed loss is almost negligible
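A host-side sketch of SAH binning along a single axis, using the simplified cost A_L·N_L + A_R·N_R. The Prim layout, NUM_BINS and the function name are assumptions; the paper's construction bins all three axes at once with SIMD, which is not shown here.

#include <vector>
#include <algorithm>
#include <cfloat>

const int NUM_BINS = 16;

struct Prim { float bbMin[3], bbMax[3]; };

static float halfArea(const float mn[3], const float mx[3])
{
    float dx = mx[0] - mn[0], dy = mx[1] - mn[1], dz = mx[2] - mn[2];
    return dx * dy + dy * dz + dz * dx;
}

// Returns the SAH cost of the best of the NUM_BINS - 1 candidate planes
// along 'axis'; centMin/centMax bound the primitive centroids on that axis.
float bestBinnedSplit(const std::vector<Prim>& prims, int axis,
                      float centMin, float centMax, int& bestPlane)
{
    bestPlane = -1;
    if (centMax <= centMin) return FLT_MAX;       // degenerate axis

    int   cnt[NUM_BINS] = {0};
    float bmin[NUM_BINS][3], bmax[NUM_BINS][3];
    for (int b = 0; b < NUM_BINS; ++b)
        for (int k = 0; k < 3; ++k) { bmin[b][k] = FLT_MAX; bmax[b][k] = -FLT_MAX; }

    // Project each primitive's centroid into a bin; grow that bin's bounds.
    float scale = NUM_BINS / (centMax - centMin);
    for (const Prim& p : prims) {
        float c = 0.5f * (p.bbMin[axis] + p.bbMax[axis]);
        int b = std::min(NUM_BINS - 1, std::max(0, (int)((c - centMin) * scale)));
        cnt[b]++;
        for (int k = 0; k < 3; ++k) {
            bmin[b][k] = std::min(bmin[b][k], p.bbMin[k]);
            bmax[b][k] = std::max(bmax[b][k], p.bbMax[k]);
        }
    }

    // Sweep right-to-left to accumulate area/count right of each plane.
    float rArea[NUM_BINS]; int rCnt[NUM_BINS];
    float amin[3] = {FLT_MAX, FLT_MAX, FLT_MAX};
    float amax[3] = {-FLT_MAX, -FLT_MAX, -FLT_MAX};
    int n = 0;
    for (int b = NUM_BINS - 1; b > 0; --b) {
        for (int k = 0; k < 3; ++k) {
            amin[k] = std::min(amin[k], bmin[b][k]);
            amax[k] = std::max(amax[k], bmax[b][k]);
        }
        n += cnt[b];
        rArea[b] = (n > 0) ? halfArea(amin, amax) : 0.0f;
        rCnt[b]  = n;
    }

    // Sweep left-to-right, evaluating cost = A_L * N_L + A_R * N_R.
    float best = FLT_MAX;
    for (int k = 0; k < 3; ++k) { amin[k] = FLT_MAX; amax[k] = -FLT_MAX; }
    n = 0;
    for (int b = 0; b < NUM_BINS - 1; ++b) {
        for (int k = 0; k < 3; ++k) {
            amin[k] = std::min(amin[k], bmin[b][k]);
            amax[k] = std::max(amax[k], bmax[b][k]);
        }
        n += cnt[b];
        if (n == 0 || rCnt[b + 1] == 0) continue;
        float cost = halfArea(amin, amax) * n + rArea[b + 1] * rCnt[b + 1];
        if (cost < best) { best = cost; bestPlane = b; }
    }
    return best;
}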
Conclusions
• New recursive PRAM BVH traversal algorithm
  • Very well suited to the new generation of GPUs
  • No additional pre-computed data required
• First GPU ray tracer to handle large models
  • Previous implementations were limited to < 300K triangles
• Can handle dynamic scenes
  • By using the CPU to update the geometry / BVH
Future Work
• More features
  • Shaders, adaptive anti-aliasing, …
  • Global illumination
• Code optimizations
  • Current implementation uses too many registers
CUDA Hello World

#include <cstdio>
#include <cstdlib>

// Element-wise addition: one thread per array element.
__global__ void addArrays(int *arr1, int *arr2)
{
    unsigned t = threadIdx.x + blockIdx.x * blockDim.x;
    arr1[t] += arr2[t];
}

int main()
{
    const int N = 4194304;                         // number of elements

    int *inArr1 = (int*)malloc(N * sizeof(int));
    int *inArr2 = (int*)malloc(N * sizeof(int));
    int *ta1, *ta2;
    cudaMalloc((void**)&ta1, N * sizeof(int));
    cudaMalloc((void**)&ta2, N * sizeof(int));

    for (int i = 0; i < N; i++) { inArr1[i] = rand(); inArr2[i] = rand(); }

    cudaMemcpy(ta1, inArr1, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(ta2, inArr2, N * sizeof(int), cudaMemcpyHostToDevice);

    // One thread per element: N / 512 blocks of 512 threads each.
    addArrays<<<dim3(N / 512, 1, 1), dim3(512, 1, 1)>>>(ta1, ta2);

    cudaMemcpy(inArr1, ta1, N * sizeof(int), cudaMemcpyDeviceToHost);

    for (int i = 0; i < N; i++) printf("%d ", inArr1[i]);

    cudaFree(ta1); cudaFree(ta2);
    free(inArr1); free(inArr2);
    return 0;
}