1 / 53

RPU: A Programmable Ray Processing Unit for Realtime Ray Tracing

RPU: A Programmable Ray Processing Unit for Realtime Ray Tracing. Sven Woop Jörg Schmittler Philipp Slusallek Computer Graphics Lab Saarland University, Germany.

lulu
Download Presentation

RPU: A Programmable Ray Processing Unit for Realtime Ray Tracing

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. RPU: A Programmable Ray Processing Unit for Realtime Ray Tracing Sven Woop Jörg Schmittler Philipp Slusallek Computer Graphics Lab Saarland University, Germany

  2. „Some argue that in the very long term, rendering may best be solved by some variant of ray tracing, in which huge numbers of rays sample the environment for the eye’s view of each frame. Motivation And there will also be colonies on Mars, underwater cities, and personal jet packs.“ „Real-Time Rendering“, 1999, 1st edition, page 391

  3. Rasterization • Fundamental Operation: Project isolated triangles • No global access to the scene • All interesting visual effects need 2+ triangles • Shadows, reflection, global illumination, … • Requires multiple passes & approximations, has many issues

  4. Ray Tracing • Fundamental Operation: Trace a Ray • Global scene access • Individual rays in O(log N) • Flexibility in space and time • Automatic combination of visual effects • Demand driven • Physical light simulation • Embarrassingly parallel

  5. OpenRT Project:Realtime Ray Tracing in SW • Results: • Exploit inherent coherence • Realtime performance (> 30x) • Scalability (> 80 CPUs) • Realtime indirect lighting &caustic computation • Large model visualization

  6. OpenRT Project:Realtime Ray Tracing in SW • Results: • Exploit inherent coherence • Realtime performance (> 30x) • Scalability (> 80 CPUs) • Realtime indirect lighting &caustic computation • Large model visualization

  7. OpenRT Project:Realtime Ray Tracing in SW • Results: • Exploit inherent coherence • Realtime performance (> 30x) • Scalability (> 80 CPUs) • Realtime indirect lighting &caustic computation • Large model visualization

  8. OpenRT Project:Realtime Ray Tracing in SW • Results: • Exploit inherent coherence • Realtime performance (> 30x) • Scalability (> 80 CPUs) • Realtime indirect lighting &caustic computation • Large model visualization

  9. OpenRT Project:Realtime Ray Tracing in SW • Results: • Exploit inherent coherence • Realtime performance (> 30x) • Scalability (> 80 CPUs) • Realtime indirect lighting &caustic computation • Large model visualization  Compact Hardware?

  10. Previous Work • CPUs [Parker’99, Wald’01] • Flexible, but needs massive number of CPUs • GPUs [Purcell’02, Carr’02] • Stream programming model is too limited • Custom HW [Green’91, Hall’01, Kobayashi’02, Schmittler’04] • Mostly focused on parts of the RT pipeline

  11. RPU Approach • Shading processor • Design similar to fragment processors on GPUs • Highly parallel, highly efficient • Improved programming model • Add highly efficient recursion, conditional branching • Add flexible memory access (beyond textures) • Custom traversal hardware • High-performance kd-tree traversal

  12. RPU Design • Shader Processing Units (SPU) • Intersection, Shading, Lighting, … • Significant vector processing • High instruction level parallelism • SPU approach: instruction set similar to GPUs • 4-vector SIMD • Dual issue & pairing • Arithmetic splitting (3+1 and 2+2 vector) • Arithmetic + load • Arithmetic + conditional jump, call, return • High code density Instruction Level Parallelism +

  13. Instruction Set of SPU • Short vector instruction set • mov, add, mul, mad, frac • dph2, dp3, dph3, dp4 • Input modifiers • Swizzeling, negation, masking • Multiply with power of 2 • Special operations (modifiers) • rcp, rsq, sat • Fast 2D texture addressing • texload, texload4x • Random bulk memory access • load, load4x, store • Conditional instructions (paired) • if <condition> jmp label • if <condition> call <fun> • If <condition> return • Efficient recursion • HW-managed register stack • Single cycle function call • Ray traversal function call • trace() Instruction Level Parallelism +

  14. Instruction Set of SPU • Short vector instruction set • mov, add, mul, mad, frac • dph2, dp3, dph3, dp4 • Input modifiers • Swizzeling, negation, masking • Multiply with power of 2 • Special operations (modifiers) • rcp, rsq, sat • Fast 2D texture addressing • texload, texload4x • Random bulk memory access • load, load4x, store • Conditional instructions (paired) • if <condition> jmp label • if <condition> call <fun> • If <condition> return • Efficient recursion • HW-managed register stack • Single cycle function call • Ray traversal function call • trace() Instruction Level Parallelism +

  15. Instruction Set of SPU • Short vector instruction set • mov, add, mul, mad, frac • dph2, dp3, dph3, dp4 • Input modifiers • Swizzeling, negation, masking • Multiply with power of 2 • Special operations (modifiers) • rcp, rsq, sat • Fast 2D texture addressing • texload, texload4x • Random bulk memory access • load, load4x, store • Conditional instructions (paired) • if <condition> jmp label • if <condition> call <fun> • If <condition> return • Efficient recursion • HW-managed register stack • Single cycle function call • Ray traversal function call • trace() Instruction Level Parallelism +

  16. Instruction Set of SPU • Short vector instruction set • mov, add, mul, mad, frac • dph2, dp3, dph3, dp4 • Input modifiers • Swizzeling, negation, masking • Multiply with power of 2 • Special operations (modifiers) • rcp, rsq, sat • Fast 2D texture addressing • texload, texload4x • Random bulk memory access • load, load4x, store • Conditional instructions (paired) • if <condition> jmp label • if <condition> call <fun> • If <condition> return • Efficient recursion • HW-managed register stack • Single cycle function call • Ray traversal function call • trace() Instruction Level Parallelism +

  17. Instruction Set of SPU • Short vector instruction set • mov, add, mul, mad, frac • dph2, dp3, dph3, dp4 • Input modifiers • Swizzeling, negation, masking • Multiply with power of 2 • Special operations (modifiers) • rcp, rsq, sat • Fast 2D texture addressing • texload, texload4x • Random bulk memory access • load, load4x, store • Conditional instructions (paired) • if <condition> jmp label • if <condition> call <fun> • If <condition> return • Efficient recursion • HW-managed register stack • Single cycle function call • Ray traversal function call • trace() Instruction Level Parallelism +

  18. Instruction Set of SPU • Short vector instruction set • mov, add, mul, mad, frac • dph2, dp3, dph3, dp4 • Input modifiers • Swizzeling, negation, masking • Multiply with power of 2 • Special operations (modifiers) • rcp, rsq, sat • Fast 2D texture addressing • texload, texload4x • Random bulk memory access • load, load4x, store • Conditional instructions (paired) • if <condition> jmp label • if <condition> call <fun> • If <condition> return • Efficient recursion • HW-managed register stack • Single cycle function call • Ray traversal function call • trace() Instruction Level Parallelism +

  19. RPU Design • Shader Processing Units (SPU) • Custom Ray Traversal Unit (TPU) • Uses kd-trees: Flexible and adaptive • Handles general scenes well [Havran2000] • Simple traversal algorithm • But many instruction dependencies in inner loop • About 10 instructions • CPU: >100 cycles (un-optimized) ~15 cycles (optimized OpenRT) • TPU approach: Optimized pipeline • 1 cycle throughput Instruction Level Parallelism Optimized Pipelining + +

  20. RPU Design • Shader Processing Units (SPU) • Custom Ray Traversal Unit (TPU) • Multi-Threading • High computational & memory latency • Take advantage of thread level parallelism • High utilization of hardware units • No overhead for switching threads in HW Instruction Level Parallelism Optimized Pipelining Thread LevelParallelism + + +

  21. RPU Design • Shader Processing Units (SPU) • Custom Ray Traversal Unit (TPU) • Multi-Threading • Chunking • Takes advantage of ray coherence • SIMD execution of threads • Reduces hardware complexity • Shared processor infrastructure • Reduces external bandwidth • Combining memory requests • Automatic handling of incoherence • Splitting and masked execution Instruction Level Parallelism Optimized Pipelining Thread LevelParallelism Control FlowCoherence + + + +

  22. RPU Design • Shader Processing Units (SPU) • Custom Ray Traversal Unit (TPU) • Multi-Threading • Chunking • Mailbox Processing Unit (MPU) • Objects often span many kd-tree cells • Redundant intersections • Mailboxing with a small per-chunk cache (e.g. 4 entries) • Up to 10x performance for some scenes Instruction Level Parallelism Optimized Pipelining Thread LevelParallelism Control FlowCoherence AvoidingRedundancy + + + +

  23. RPU Design • Shader Processing Units (SPU) • Custom Ray Traversal Unit (TPU) • Multi-Threading • Chunking • Mailbox Processing Unit (MPU) Instruction Level Parallelism Optimized Pipelining Thread LevelParallelism Control FlowCoherence AvoidingRedundancy + + + +

  24. RPU Design • Shader Processing Units (SPU) • Custom Ray Traversal Unit (TPU) • Multi-Threading • Chunking • Mailbox Processing Unit (MPU) Instruction Level Parallelism Optimized Pipelining Thread LevelParallelism Control FlowCoherence AvoidingRedundancy + + + +

  25. RPU Design • Shader Processing Units (SPU) • Custom Ray Traversal Unit (TPU) • Multi-Threading • Chunking • Mailbox Processing Unit (MPU) Instruction Level Parallelism Optimized Pipelining Thread LevelParallelism Control FlowCoherence AvoidingRedundancy + + + +

  26. RPU Design • Shader Processing Units (SPU) • Custom Ray Traversal Unit (TPU) • Multi-Threading • Chunking • Mailbox Processing Unit (MPU) Instruction Level Parallelism Optimized Pipelining Thread LevelParallelism Control FlowCoherence AvoidingRedundancy + + + +

  27. RPU Design • Shader Processing Units (SPU) • Custom Ray Traversal Unit (TPU) • Multi-Threading • Chunking • Mailbox Processing Unit (MPU) Instruction Level Parallelism Optimized Pipelining Thread LevelParallelism Control FlowCoherence AvoidingRedundancy + + + +

  28. RPU Architecture external internal SPU RPU SPU SPU Vector-Cache SPU ExternalDDR Memory Thread Generator TPU TPU TPU Node-Cache TPU List-Cache MPU

  29. RPU Architecture external internal SPU RPU SPU SPU Vector-Cache SPU ExternalDDR Memory Thread Generator TPU TPU TPU Node-Cache TPU List-Cache MPU

  30. RPU Architecture external internal SPU RPU SPU SPU Vector-Cache SPU ExternalDDR Memory Thread Generator TPU TPU TPU Node-Cache TPU List-Cache MPU

  31. RPU Architecture external internal SPU RPU SPU SPU Vector-Cache SPU ExternalDDR Memory Thread Generator TPU TPU TPU Node-Cache TPU List-Cache MPU

  32. RPU Architecture external internal SPU RPU SPU SPU Vector-Cache SPU ExternalDDR Memory Thread Generator TPU TPU TPU Node-Cache TPU List-Cache MPU

  33. RPU Architecture external internal SPU RPU SPU SPU Vector-Cache SPU ExternalDDR Memory Thread Generator TPU TPU TPU Node-Cache TPU List-Cache MPU

  34. RPU Architecture external internal SPU RPU SPU SPU Vector-Cache SPU ExternalDDR Memory Thread Generator TPU TPU TPU Node-Cache TPU List-Cache MPU

  35. Ray Tracing on an RPU • Thread generation: initialize SPU registers with pixel coordinates Top-Level Object Intersector TPU/ MPU Top-Level Object Intersector Intersector Primary Shader ThreadGenerator

  36. Ray Tracing on an RPU • Thread generation: initialize SPU registers with pixel coordinates • Primary shader generates camera ray and calls trace() Top-Level Object Intersector TPU/ MPU Top-Level Object Intersector Intersector Primary Shader ThreadGenerator

  37. Ray Tracing on an RPU • Thread generation: initialize SPU registers with pixel coordinates • Primary shader generates camera ray and calls trace() • Ray traversal performed on TPU with mailboxing on MPU Top-Level Object Intersector TPU/ MPU Top-Level Object Intersector Intersector Primary Shader ThreadGenerator

  38. Ray Tracing on an RPU • Thread generation: initialize SPU registers with pixel coordinates • Primary shader generates camera ray and calls trace() • Ray traversal performed on TPU with mailboxing on MPU • Data dependent call to object/intersection shader on SPU • Programmable geometry (triangles, spheres, quadrics, bicubic splines, …) shader calls Top-Level Object Intersector TPU/ MPU Top-Level Object Intersector Intersector Primary Shader ThreadGenerator

  39. Ray Tracing on an RPU • Thread generation: initialize SPU registers with pixel coordinates • Primary shader generates camera ray and calls trace() • Ray traversal performed on TPU with mailboxing on MPU • Data dependent call to object/intersection shader on SPU • Programmable geometry (triangles, spheres, quadrics, bicubic splines, …) • Nested ray traversal for object-based dynamic scenes top-level traversal Geometry Intersector TPU/ MPU Top-Level Object Intersector TPU/ MPU Geometry Intersector Top-Level Object Intersector Geometry Intersector Top-Level Object Intersector Shader ThreadGenerator

  40. Ray Tracing on an RPU • Thread generation: initialize SPU registers with pixel coordinates • Primary shader generates camera ray and calls trace() • Ray traversal performed on TPU with mailboxing on MPU • Data dependent call to object/intersection shader on SPU • Programmable geometry (triangles, spheres, quadrics, bicubic splines, …) • Nested ray traversal for object-based dynamic scenes top-level traversal Geometry Intersector TPU/ MPU Top-Level Object Intersector TPU/ MPU Geometry Intersector Top-Level Object Intersector Geometry Intersector Top-Level Object Intersector Shader ThreadGenerator

  41. Ray Tracing on an RPU • Thread generation: initialize SPU registers with pixel coordinates • Primary shader generates camera ray and calls trace() • Ray traversal performed on TPU with mailboxing on MPU • Data dependent call to object/intersection shader on SPU • Programmable geometry (triangles, spheres, quadrics, bicubic splines, …) • Nested ray traversal for object-based dynamic scenes top-level traversal Geometry Intersector TPU/ MPU Top-Level Object Intersector TPU/ MPU Geometry Intersector Top-Level Object Intersector Geometry Intersector Top-Level Object Intersector Shader ThreadGenerator

  42. Ray Tracing on an RPU • Thread generation: initialize SPU registers with pixel coordinates • Primary shader generates camera ray and calls trace() • Ray traversal performed on TPU with mailboxing on MPU • Data dependent call to object/intersection shader on SPU • Programmable geometry (triangles, spheres, quadrics, bicubic splines, …) • Nested ray traversal for object-based dynamic scenes bottom-level traversal top-level traversal Geometry Intersector TPU/ MPU Top-Level Object Intersector TPU/ MPU Geometry Intersector Top-Level Object Intersector Geometry Intersector Top-Level Object Intersector Shader ThreadGenerator

  43. FPGA prototype • Xilinx Virtex II 6000 • Usage: 99% logic, 70% on-chip memory • 128 MB DDR-RAM with 350 MB/s • 24 bit floating point • Configuration of prototype RPU • 32 threads per SPU 60% usage • Chunk size of 4 95% efficiency • 12 kB caches in total 90% hit rate

  44. Performance • Single FPGA at only 66 MHz • 4 million rays/s • 20 fps @ 512x384 • Same performance as CPU • 40x clock rate (2.66 GHz) • Using our highly optimized software (OpenRT with SSE) • Linear scalability with HW resources • Tested: 4x FPGA  4x performance • Independent of scenes

  45. Prototype Performance • Technology: FPGA versus ASIC (GPU) • Large headroom for scaling performance

  46. Demo

  47. Conclusions • Programmable Architecture for Ray Tracing • GPU-like shading processor • More flexible programming model • Flexible control flow and memory access • Efficient recursion with HW-managed stack • Efficient traversal through custom hardware • Realtime performance on FPGA prototype

  48. Outlook • ASIC design • Evaluate scalability with technology • RPU as a general purpose processor • Explore non-rendering use of RPU • New fundamental operation: Tracing a ray • Basis for next generation interactive 3D graphics

  49. Siggraph 2005:More Realtime Ray Tracing • Introduction to Realtime Ray Tracing • Full day course: Wednesday, Petree Hall D • Booth 1155: Mercury Computer Systems • Realtime ray tracing product on PC clusters • Realtime ray tracing on the Cell Processor • Realtime previewing in Cinema-4D • Booth 1511: SGI • Ray tracing massive model : Boeing 777

More Related