
Latency considerations of depth-first GPU ray tracing


Presentation Transcript


1. Latency considerations of depth-first GPU ray tracing
Michael Guthe, University Bayreuth, Visual Computing

2. Depth-first GPU ray tracing
• Based on a bounding box or spatial hierarchy
• Recursive traversal
  • Usually using a stack
• Threads inside a warp may access different data
  • May also diverge
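
The per-thread traversal loop described above can be sketched on the CPU as follows. This is a minimal model of the pattern a GPU trace kernel runs, not the paper's implementation; the `Node` layout and function names are illustrative.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Illustrative node layout: a leaf is marked by left < 0 and stores ~left
// as its primitive index; inner nodes store two child indices.
struct Node {
    float bmin[3], bmax[3];
    int left, right;
};

// Slab test: returns true if the ray (origin o, inverse direction invD)
// hits the node's bounding box.
static bool hitsBox(const Node& n, const float o[3], const float invD[3]) {
    float t0 = 0.0f, t1 = INFINITY;
    for (int a = 0; a < 3; ++a) {
        float tn = (n.bmin[a] - o[a]) * invD[a];
        float tf = (n.bmax[a] - o[a]) * invD[a];
        if (tn > tf) std::swap(tn, tf);
        t0 = std::max(t0, tn);
        t1 = std::min(t1, tf);
    }
    return t0 <= t1;
}

// Depth-first traversal with an explicit stack -- the pattern every GPU
// thread executes. For brevity this returns the first leaf whose box is
// hit; a real kernel tests the leaf's triangles and keeps the nearest hit.
int traceFirstLeaf(const std::vector<Node>& nodes,
                   const float o[3], const float d[3]) {
    float invD[3] = { 1.0f / d[0], 1.0f / d[1], 1.0f / d[2] };
    int stack[64];
    int sp = 0;
    stack[sp++] = 0;                      // push the root
    while (sp > 0) {
        const Node& n = nodes[stack[--sp]];
        if (!hitsBox(n, o, invD)) continue;
        if (n.left < 0) return ~n.left;   // leaf: report its primitive
        stack[sp++] = n.right;            // descend into the left child first
        stack[sp++] = n.left;
    }
    return -1;                            // ray missed everything
}
```

Because each thread in a warp pops its own stack, threads may fetch different nodes (scattered loads) and take different branches, which is exactly the divergence the slide mentions.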

3. Performance Analysis
• What limits performance of the trace kernel?
• Device memory bandwidth? Obviously not!

4. Performance Analysis
• What limits performance of the trace kernel?
• Maximum (warp) instructions per clock? Not really!

5. Performance Analysis
• Why doesn’t the kernel fully utilize the cores?
• Three possible reasons:
  • Instruction fetch, e.g. due to branches
  • Memory latency (a.k.a. data request), mainly due to random access
  • Read-after-write latency (a.k.a. execution dependency): it takes 22 clock cycles (Kepler) until the result is written to a register

6. Performance Analysis
• Why doesn’t the kernel fully utilize the cores?
• Profiling shows: memory & RAW latency limit performance!

7. Reducing Latency
• Standard solution for latency: increase occupancy
  • No option here due to register pressure
• Relocate memory accesses
  • Automatically performed by the compiler
  • But not between iterations of a while loop
  • Loop unrolling for the triangle test
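
The effect of unrolling the triangle-test loop can be illustrated with a toy stand-in: a plane distance instead of a full ray-triangle test (`Tri`, `planeT`, and `nearestT` are hypothetical names, not the paper's). Unrolling by four gives the compiler four independent load/compute chains to interleave, instead of one chain serialized across while-loop iterations.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Toy stand-in for a triangle: the plane n*x = d it lies in. A real kernel
// would store full vertex data and run e.g. a Moller-Trumbore test.
struct Tri { float nx, ny, nz, d; };

// Distance along the ray to the triangle's plane (infinity if parallel).
static float planeT(const Tri& t, const float o[3], const float dir[3]) {
    float denom = t.nx * dir[0] + t.ny * dir[1] + t.nz * dir[2];
    if (denom == 0.0f) return INFINITY;
    return (t.d - (t.nx * o[0] + t.ny * o[1] + t.nz * o[2])) / denom;
}

// Unrolled by four with four independent minima: the loads and tests of
// iterations i..i+3 have no dependencies on each other, so the memory
// latency of one overlaps the arithmetic of the others.
float nearestT(const std::vector<Tri>& tris, const float o[3], const float dir[3]) {
    float m0 = INFINITY, m1 = INFINITY, m2 = INFINITY, m3 = INFINITY;
    size_t i = 0;
    for (; i + 4 <= tris.size(); i += 4) {
        m0 = std::min(m0, planeT(tris[i + 0], o, dir));
        m1 = std::min(m1, planeT(tris[i + 1], o, dir));
        m2 = std::min(m2, planeT(tris[i + 2], o, dir));
        m3 = std::min(m3, planeT(tris[i + 3], o, dir));
    }
    for (; i < tris.size(); ++i)          // leftover triangles
        m0 = std::min(m0, planeT(tris[i], o, dir));
    return std::min(std::min(m0, m1), std::min(m2, m3));
}
```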

8. Reducing Latency
• Instruction-level parallelism
  • Not directly supported by the GPU
  • Increases the number of eligible warps, same effect as higher occupancy
  • We might even spend some more registers
• Wider trees
  • A 4-ary tree means 4 independent instruction paths
  • Almost doubles the number of eligible warps during node tests
  • Higher widths increase the number of node tests; 4 is the optimum
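
A sketch of why the wider tree helps, using an illustrative BVH4 node layout (not the paper's exact data structure): the four child box tests below share no data, so they form four independent dependency chains the warp scheduler can interleave.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Illustrative BVH4 node: the bounding boxes of its four children.
struct Node4 { float bmin[4][3], bmax[4][3]; };

// Slab-test all four children; returns a 4-bit hit mask and writes the
// entry distance of each child to tEntry. The four per-child chains are
// independent -- the instruction-level parallelism the slide refers to.
uint32_t intersect4(const Node4& n, const float o[3], const float invD[3],
                    float tEntry[4]) {
    uint32_t mask = 0;
    for (int c = 0; c < 4; ++c) {
        float t0 = 0.0f, t1 = INFINITY;
        for (int a = 0; a < 3; ++a) {
            float tn = (n.bmin[c][a] - o[a]) * invD[a];
            float tf = (n.bmax[c][a] - o[a]) * invD[a];
            if (tn > tf) std::swap(tn, tf);
            t0 = std::max(t0, tn);
            t1 = std::min(t1, tf);
        }
        tEntry[c] = t0;
        if (t0 <= t1) mask |= 1u << c;
    }
    return mask;
}
```

In a real kernel the loop over `c` would be unrolled so that all four chains are in flight at once.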

9. Reducing Latency
• Tree construction
  • Start from the root
  • Recursively pull the largest child up
  • Special rules for leaves to reduce memory consumption
• Goal: 4 child nodes whenever possible
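
The pull-up rule can be sketched like this (a hypothetical `BNode` layout with a generic `size` priority such as surface area; the paper's special leaf rules are not modeled): starting from a binary node's two children, the largest remaining inner child is repeatedly replaced by its own two children until four children are collected.

```cpp
#include <vector>

// Binary BVH node: a leaf is marked by left < 0 (storing ~left as its
// primitive); 'size' is the pull-up priority, e.g. surface area.
struct BNode { int left, right; float size; };

// Collect up to four children for one node of the 4-ary tree by greedily
// pulling the largest remaining inner child up.
std::vector<int> collectChildren4(const std::vector<BNode>& nodes, int root) {
    std::vector<int> kids = { nodes[root].left, nodes[root].right };
    while (kids.size() < 4) {
        int bestPos = -1;
        float bestSize = -1.0f;
        for (int i = 0; i < (int)kids.size(); ++i) {
            const BNode& c = nodes[kids[i]];
            if (c.left >= 0 && c.size > bestSize) {  // inner nodes only
                bestSize = c.size;
                bestPos = i;
            }
        }
        if (bestPos < 0) break;              // all children are leaves
        int k = kids[bestPos];
        kids[bestPos] = nodes[k].left;       // replace it by its two children
        kids.push_back(nodes[k].right);
    }
    return kids;
}
```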

10. Reducing Latency
• Overhead: sorting the intersected nodes
• Can have two independent paths with a parallel merge sort
• We don’t need sorting for occlusion rays
• [Slide figure: merging the child hit distances 0.7, 0.3, 0.2 into sorted order 0.2, 0.3, 0.7]
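
The sort can be done with a standard five-comparator order-4 sorting network (an illustrative sketch; names are not the paper's). Its first two compare-exchanges are the two independent paths mentioned above, and the remaining three merge the sorted pairs; children the ray missed carry INFINITY and sink to the back.

```cpp
#include <algorithm>
#include <cmath>

// Compare-exchange: afterwards a <= b.
static inline void cswap(float& a, float& b) { if (a > b) std::swap(a, b); }

// Sort the four child hit distances in place. Stages 1a and 1b are
// independent of each other (two parallel chains); stages 2-3 merge
// the two sorted pairs.
void sortHits4(float t[4]) {
    cswap(t[0], t[1]);   // stage 1a
    cswap(t[2], t[3]);   // stage 1b, independent of 1a
    cswap(t[0], t[2]);   // stage 2
    cswap(t[1], t[3]);   // stage 2
    cswap(t[1], t[2]);   // stage 3
}
```

With the slide's example distances 0.7, 0.3, 0.2 and one missed child at INFINITY, the network yields 0.2, 0.3, 0.7, INFINITY.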

11. Results
• Improved instructions per clock
• Doesn’t directly translate to speedup

12. Results
• Up to 20.1% speedup over Aila et al.: “Understanding the Efficiency of Ray Traversal on GPUs”, 2012
• Sibenik, 80k tris.

13. Results
• Up to 20.1% speedup over Aila et al.: “Understanding the Efficiency of Ray Traversal on GPUs”, 2012
• Fairy forest, 174k tris.

14. Results
• Up to 20.1% speedup over Aila et al.: “Understanding the Efficiency of Ray Traversal on GPUs”, 2012
• Conference, 283k tris.

15. Results
• Up to 20.1% speedup over Aila et al.: “Understanding the Efficiency of Ray Traversal on GPUs”, 2012
• San Miguel, 11M tris.

16. Results
• Latency is still the performance limiter
• Mostly the memory latency improved
