Latency considerations of depth-first GPU ray tracing
Michael Guthe, University Bayreuth, Visual Computing
Depth-first GPU ray tracing • Based on a bounding volume hierarchy or spatial hierarchy • Recursive traversal, usually using a stack • Threads inside a warp may access different data • They may also diverge
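For orientation, a minimal CUDA sketch of such a stack-based depth-first traversal; the Node layout and the intersectAABB/intersectTris helpers are illustrative assumptions, not the kernel from the talk:

```cuda
#include <cuda_runtime.h>
#include <cfloat>

struct Node { float3 bmin, bmax; int left, right, firstTri, numTris; };

// Assumed helpers (not shown): slab test and per-leaf triangle tests.
__device__ bool  intersectAABB(const Node& n, float3 orig, float3 invDir, float tMax);
__device__ float intersectTris(const Node& n, float3 orig, float3 dir);

__device__ float traceRay(const Node* nodes, float3 orig, float3 dir, float3 invDir)
{
    int stack[64];          // per-thread traversal stack
    int sp   = 0;
    int node = 0;           // start at the root
    float tHit = FLT_MAX;

    while (true) {
        const Node& n = nodes[node];
        if (n.numTris == 0) {                       // inner node
            bool hitL = intersectAABB(nodes[n.left],  orig, invDir, tHit);
            bool hitR = intersectAABB(nodes[n.right], orig, invDir, tHit);
            if (hitL && hitR) { stack[sp++] = n.right; node = n.left; continue; }
            if (hitL)         { node = n.left;  continue; }
            if (hitR)         { node = n.right; continue; }
        } else {                                    // leaf: test triangles
            tHit = fminf(tHit, intersectTris(n, orig, dir));
        }
        if (sp == 0) break;                         // stack empty: done
        node = stack[--sp];                         // threads of one warp may
                                                    // pop different nodes here
    }
    return tHit;
}
```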
Performance Analysis • What limits performance of the trace kernel? • Device memory bandwidth? Obviously not!
Performance Analysis • What limits performance of the trace kernel? • Maximum (warp) instructions per clock? Not really!
Performance Analysis • Why doesn’t the kernel fully utilize the cores? • Three possible reasons: • Instruction fetch • e.g. due to branches • Memory latency • a.k.a. data request • mainly due to random access • Read after write latency • a.k.a. execution dependency • It takes 22 clock cycles (Kepler) until the result is written to a register
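A toy illustration (not from the slides) of the read-after-write point: in the first function every FMA consumes the previous result, so each instruction waits the full pipeline latency unless another warp can issue; the second exposes two independent chains the scheduler can interleave.

```cuda
// Dependent chain: each operation reads the previous result, so on
// Kepler the warp stalls ~22 cycles per step if no other warp is eligible.
__device__ float dependent_chain(float a, float b, float c, float d)
{
    float x = a * b;
    x = x * c + d;   // waits on x
    x = x * c + d;   // waits again
    return x;
}

// Two independent chains: the second FMA can issue while the first
// result is still in flight (instruction level parallelism).
__device__ float independent_pairs(float a, float b, float c, float d)
{
    float x = a * c + d;
    float y = b * c + d;   // independent of x
    return x + y;
}
```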
Performance Analysis • Why doesn’t the kernel fully utilize the cores? • Profiling shows: Memory & RAW latency limit performance!
Reducing Latency • Standard solution for latency: • Increase occupancy • Not an option here due to register pressure • Relocate memory accesses • Automatically performed by the compiler • But not between iterations of a while loop • Loop unrolling for the triangle test (sketched below)
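A hypothetical sketch of the unrolling idea: once the loop is unrolled, the independent triangle fetches can be issued before the first test consumes its data. Triangle, loadTri() and intersectTri() are assumed helpers, and the unroll factor is illustrative.

```cuda
struct Triangle { float3 v0, v1, v2; };

// Assumed helpers (not shown): fetch and ray/triangle test.
__device__ Triangle loadTri(const float4* triData, int idx);
__device__ float    intersectTri(const Triangle& tri, float3 orig, float3 dir);

__device__ void intersectLeaf(const float4* triData, int firstTri, int numTris,
                              float3 orig, float3 dir, float& tHit)
{
    #pragma unroll 4                      // unroll by the typical leaf size
    for (int i = 0; i < numTris; ++i) {
        // After unrolling, the loads of several triangles can be moved
        // ahead of the arithmetic, hiding their memory latency.
        Triangle tri = loadTri(triData, firstTri + i);
        tHit = fminf(tHit, intersectTri(tri, orig, dir));
    }
}
```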
Reducing Latency • Instruction level parallelism • Not directly supported by the GPU • Increases the number of eligible warps • Same effect as higher occupancy • We might even spend some more registers • Wider trees • A 4-ary tree means 4 independent instruction paths • Almost doubles the number of eligible warps during node tests • Higher widths increase the number of node tests; 4 is the optimum
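A sketch of the 4-wide node test behind the "wider trees" point: the four child box tests are independent, so their loads and math overlap instead of forming one dependent chain. The Node4 layout and intersectAABB() are assumptions for illustration.

```cuda
struct Node4 { float3 bmin[4], bmax[4]; int child[4]; };

// Assumed helper (not shown): slab test against one child box.
__device__ bool intersectAABB(float3 bmin, float3 bmax,
                              float3 orig, float3 invDir, float tMax);

__device__ unsigned testChildren(const Node4& n, float3 orig,
                                 float3 invDir, float tMax)
{
    unsigned hitMask = 0;   // bit i set if child i's box is hit
    // Four independent slab tests: roughly the same effect on the
    // scheduler as having four times as many eligible warps.
    if (intersectAABB(n.bmin[0], n.bmax[0], orig, invDir, tMax)) hitMask |= 1u;
    if (intersectAABB(n.bmin[1], n.bmax[1], orig, invDir, tMax)) hitMask |= 2u;
    if (intersectAABB(n.bmin[2], n.bmax[2], orig, invDir, tMax)) hitMask |= 4u;
    if (intersectAABB(n.bmin[3], n.bmax[3], orig, invDir, tMax)) hitMask |= 8u;
    return hitMask;
}
```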
Reducing Latency • Tree construction • Start from the root • Recursively pull the largest child up • Special rules for leaves to reduce memory consumption • Goal: 4 child nodes whenever possible
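A host-side sketch of the "pull the largest child up" rule for collapsing a binary hierarchy into 4-wide nodes; the Node2 fields, ranking children by surfaceArea(), and the leaf handling are assumptions for illustration, not the authors' exact construction.

```cuda
#include <vector>

struct Node2 { int left, right; bool isLeaf; /* bounds ... */ };

float surfaceArea(const Node2& n);   // assumed helper

// Collect up to 4 children for one wide node (root assumed inner).
void collectChildren4(const Node2* bvh2, int root, std::vector<int>& children)
{
    children = { bvh2[root].left, bvh2[root].right };
    while (children.size() < 4) {
        // Pick the largest inner child and replace it by its two children.
        int best = -1; float bestArea = -1.0f;
        for (int i = 0; i < (int)children.size(); ++i) {
            const Node2& c = bvh2[children[i]];
            if (!c.isLeaf && surfaceArea(c) > bestArea) {
                bestArea = surfaceArea(c); best = i;
            }
        }
        if (best < 0) break;           // only leaves left: node stays narrower
        int pulled = children[best];
        children[best] = bvh2[pulled].left;
        children.push_back(bvh2[pulled].right);
    }
    // Recurse into each child to build the next 4-ary level ...
}
```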
Reducing Latency • Overhead: sorting the intersected nodes • Can use two independent paths with a parallel merge sort • We don't need sorting for occlusion rays • [Figure: merge network ordering four hit distances, e.g. 0.2, 0.3, 0.7]
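A sketch of ordering the (up to) four intersected children by hit distance with a small merge network: the first two compare-swaps are independent, as are the next two, matching the "two independent paths" point above. Names are illustrative.

```cuda
// Compare-swap of one (distance, child index) pair.
__device__ __forceinline__ void cswap(float& ta, float& tb, int& ca, int& cb)
{
    if (ta > tb) {
        float t = ta; ta = tb; tb = t;
        int   c = ca; ca = cb; cb = c;
    }
}

__device__ void sortChildrenByDistance(float t[4], int child[4])
{
    cswap(t[0], t[1], child[0], child[1]);  // two independent pairs
    cswap(t[2], t[3], child[2], child[3]);  // (can issue back-to-back)
    cswap(t[0], t[2], child[0], child[2]);  // parallel merge step,
    cswap(t[1], t[3], child[1], child[3]);  // also independent
    cswap(t[1], t[2], child[1], child[2]);  // final fix-up
}
```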
Results • Improved instructions per clock • Doesn’t directly translate to speedup
Results • Up to 20.1% speedup over Aila et al., "Understanding the Efficiency of Ray Traversal on GPUs", 2012 • Test scenes: Sibenik (80k tris), Fairy forest (174k tris), Conference (283k tris), San Miguel (11M tris)
Results • Latency is still the performance limiter • The improvement is mostly in memory latency