Latency considerations of depth-first GPU ray tracing
Michael Guthe, University Bayreuth, Visual Computing
Depth-first GPU ray tracing • Based on a bounding volume hierarchy or spatial hierarchy • Recursive traversal, usually using a stack • Threads inside a warp may access different data • They may also diverge
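For orientation, a minimal CUDA sketch of such a stack-based depth-first traversal; the Node layout and the intersectAABB/intersectTris helpers are illustrative assumptions, not the kernel from the talk:

```cuda
#include <cuda_runtime.h>
#include <cfloat>

struct Node { float3 bmin, bmax; int left, right, firstTri, numTris; };

// Assumed helpers (not shown): slab test and per-leaf triangle tests.
__device__ bool  intersectAABB(const Node& n, float3 orig, float3 invDir, float tMax);
__device__ float intersectTris(const Node& n, float3 orig, float3 dir);

__device__ float traceRay(const Node* nodes, float3 orig, float3 dir, float3 invDir)
{
    int stack[64];          // per-thread traversal stack
    int sp   = 0;
    int node = 0;           // start at the root
    float tHit = FLT_MAX;

    while (true) {
        const Node& n = nodes[node];
        if (n.numTris == 0) {                       // inner node
            bool hitL = intersectAABB(nodes[n.left],  orig, invDir, tHit);
            bool hitR = intersectAABB(nodes[n.right], orig, invDir, tHit);
            if (hitL && hitR) { stack[sp++] = n.right; node = n.left; continue; }
            if (hitL)         { node = n.left;  continue; }
            if (hitR)         { node = n.right; continue; }
        } else {                                    // leaf: test triangles
            tHit = fminf(tHit, intersectTris(n, orig, dir));
        }
        if (sp == 0) break;                         // stack empty: done
        node = stack[--sp];                         // threads of one warp may
                                                    // pop different nodes here
    }
    return tHit;
}
```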
Performance Analysis • What limits performance of the trace kernel? • Device memory bandwidth? Obviously not!
Performance Analysis • What limits performance of the trace kernel? • Maximum (warp) instructions per clock? Not really!
Performance Analysis • Why doesn’t the kernel fully utilize the cores? • Three possible reasons: • Instruction fetch • e.g. due to branches • Memory latency • a.k.a. data request • mainly due to random access • Read after write latency • a.k.a. execution dependency • It takes 22 clock cycles (Kepler) until the result is written to a register
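A toy illustration (not from the slides) of the read-after-write point: in the first function every FMA consumes the previous result, so each instruction waits the full pipeline latency unless another warp can issue; the second exposes two independent chains the scheduler can interleave.

```cuda
// Dependent chain: each operation reads the previous result, so on
// Kepler the warp stalls ~22 cycles per step if no other warp is eligible.
__device__ float dependent_chain(float a, float b, float c, float d)
{
    float x = a * b;
    x = x * c + d;   // waits on x
    x = x * c + d;   // waits again
    return x;
}

// Two independent chains: the second FMA can issue while the first
// result is still in flight (instruction level parallelism).
__device__ float independent_pairs(float a, float b, float c, float d)
{
    float x = a * c + d;
    float y = b * c + d;   // independent of x
    return x + y;
}
```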
Performance Analysis • Why doesn’t the kernel fully utilize the cores? • Profiling shows: Memory & RAW latency limit performance!
Reducing Latency • Standard solution for latency: • Increase occupancy • Not an option here due to register pressure • Relocate memory accesses • Automatically performed by the compiler • But not between iterations of a while loop • Loop unrolling for the triangle test (sketched below)
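A hypothetical sketch of the unrolling idea: once the loop is unrolled, the independent triangle fetches can be issued before the first test consumes its data. Triangle, loadTri() and intersectTri() are assumed helpers, and the unroll factor is illustrative.

```cuda
struct Triangle { float3 v0, v1, v2; };

// Assumed helpers (not shown): fetch and ray/triangle test.
__device__ Triangle loadTri(const float4* triData, int idx);
__device__ float    intersectTri(const Triangle& tri, float3 orig, float3 dir);

__device__ void intersectLeaf(const float4* triData, int firstTri, int numTris,
                              float3 orig, float3 dir, float& tHit)
{
    #pragma unroll 4                      // unroll by the typical leaf size
    for (int i = 0; i < numTris; ++i) {
        // After unrolling, the loads of several triangles can be moved
        // ahead of the arithmetic, hiding their memory latency.
        Triangle tri = loadTri(triData, firstTri + i);
        tHit = fminf(tHit, intersectTri(tri, orig, dir));
    }
}
```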
Reducing Latency • Instruction level parallelism • Not directly supported by the GPU • Increases the number of eligible warps • Same effect as higher occupancy • We might even spend some more registers • Wider trees • A 4-ary tree means 4 independent instruction paths • Almost doubles the number of eligible warps during node tests • Higher widths increase the number of node tests; 4 is the optimum
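A sketch of the 4-wide node test behind the "wider trees" point: the four child box tests are independent, so their loads and math overlap instead of forming one dependent chain. The Node4 layout and intersectAABB() are assumptions for illustration.

```cuda
struct Node4 { float3 bmin[4], bmax[4]; int child[4]; };

// Assumed helper (not shown): slab test against one child box.
__device__ bool intersectAABB(float3 bmin, float3 bmax,
                              float3 orig, float3 invDir, float tMax);

__device__ unsigned testChildren(const Node4& n, float3 orig,
                                 float3 invDir, float tMax)
{
    unsigned hitMask = 0;   // bit i set if child i's box is hit
    // Four independent slab tests: roughly the same effect on the
    // scheduler as having four times as many eligible warps.
    if (intersectAABB(n.bmin[0], n.bmax[0], orig, invDir, tMax)) hitMask |= 1u;
    if (intersectAABB(n.bmin[1], n.bmax[1], orig, invDir, tMax)) hitMask |= 2u;
    if (intersectAABB(n.bmin[2], n.bmax[2], orig, invDir, tMax)) hitMask |= 4u;
    if (intersectAABB(n.bmin[3], n.bmax[3], orig, invDir, tMax)) hitMask |= 8u;
    return hitMask;
}
```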
Reducing Latency • Tree construction • Start from the root • Recursively pull the largest child up • Special rules for leaves to reduce memory consumption • Goal: 4 child nodes whenever possible
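A host-side sketch of the "pull the largest child up" rule for collapsing a binary hierarchy into 4-wide nodes; the Node2 fields, ranking children by surfaceArea(), and the leaf handling are assumptions for illustration, not the authors' exact construction.

```cuda
#include <vector>

struct Node2 { int left, right; bool isLeaf; /* bounds ... */ };

float surfaceArea(const Node2& n);   // assumed helper

// Collect up to 4 children for one wide node (root assumed inner).
void collectChildren4(const Node2* bvh2, int root, std::vector<int>& children)
{
    children = { bvh2[root].left, bvh2[root].right };
    while (children.size() < 4) {
        // Pick the largest inner child and replace it by its two children.
        int best = -1; float bestArea = -1.0f;
        for (int i = 0; i < (int)children.size(); ++i) {
            const Node2& c = bvh2[children[i]];
            if (!c.isLeaf && surfaceArea(c) > bestArea) {
                bestArea = surfaceArea(c); best = i;
            }
        }
        if (best < 0) break;           // only leaves left: node stays narrower
        int pulled = children[best];
        children[best] = bvh2[pulled].left;
        children.push_back(bvh2[pulled].right);
    }
    // Recurse into each child to build the next 4-ary level ...
}
```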
Reducing Latency • Overhead: sorting the intersected nodes • Can use two independent paths with a parallel merge sort • We don't need sorting for occlusion rays • [Figure: merge network ordering four hit distances, e.g. 0.2, 0.3, 0.7]
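A sketch of ordering the (up to) four intersected children by hit distance with a small merge network: the first two compare-swaps are independent, as are the next two, matching the "two independent paths" point above. Names are illustrative.

```cuda
// Compare-swap of one (distance, child index) pair.
__device__ __forceinline__ void cswap(float& ta, float& tb, int& ca, int& cb)
{
    if (ta > tb) {
        float t = ta; ta = tb; tb = t;
        int   c = ca; ca = cb; cb = c;
    }
}

__device__ void sortChildrenByDistance(float t[4], int child[4])
{
    cswap(t[0], t[1], child[0], child[1]);  // two independent pairs
    cswap(t[2], t[3], child[2], child[3]);  // (can issue back-to-back)
    cswap(t[0], t[2], child[0], child[2]);  // parallel merge step,
    cswap(t[1], t[3], child[1], child[3]);  // also independent
    cswap(t[1], t[2], child[1], child[2]);  // final fix-up
}
```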
Results • Improved instructions per clock • Doesn’t directly translate to speedup
Results • Up to 20.1% speedup over Aila et al., "Understanding the Efficiency of Ray Traversal on GPUs", 2012 • Test scenes: Sibenik (80k tris), Fairy forest (174k tris), Conference (283k tris), San Miguel (11M tris)
Results • Latency is still the performance limiter • The improvement is mostly in memory latency