
Estimating Performance of a Ray-Tracing ASIC Design

Estimating Performance of a Ray-Tracing ASIC Design. Sven Woop † Erik Brunvand ‡ Philipp Slusallek †. Ray Tracing in Car Industry. Ray Tracing Games. Previous Work. Ray Tracers for Static Scenes CPU based: [OpenRT], [MLRT SIGGRAPH05]


Presentation Transcript


  1. Estimating Performance of a Ray-Tracing ASIC Design • Sven Woop†, Erik Brunvand‡, Philipp Slusallek† • † Saarland University, Germany • ‡ University of Utah, USA

  2. Ray Tracing in Car Industry

  3. Ray Tracing Games

  4. Previous Work • Ray Tracers for Static Scenes • CPU based: [OpenRT], [MLRT SIGGRAPH05] • GPU based: Purcell (Grids) [SIGGRAPH02], Foley et al. (KD Trees) [GH05] • Custom Hardware: Commercial Hardware (ART-VPS), Schmittler (KD Trees) [GH04], RPU (KD Trees) [SIGGRAPH05] • Ray Tracers for Dynamic Scenes • CPU based: Wald (Grids) [SIGGRAPH06], Wald (AABVHs) [TOG / Tech. Rep. 2006] • Custom Hardware: Woop (B-KD Trees) [GH06]

  5. Outline • Previous Work • DRPU Architecture • B-KD Trees • Traversal Processor • Prototype Implementations • DRPU-FPGA • DRPU-ASICs • Conclusion

  6. Definition of B-KD Trees B-KD Tree (Bounded KD-Tree) • Binary tree • 1D bounding intervals for each child • Leaf nodes point to a single primitive
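The definition above can be sketched as a small data structure. This is only an illustration under stated assumptions: the names `Leaf`, `Inner`, `axis`, `left_bounds`, `right_bounds` and `count_primitives` are hypothetical and do not come from the slides, and the node layout used by the hardware is not described in this transcript.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass
class Leaf:
    primitive: int  # index of the single primitive this leaf points to


@dataclass
class Inner:
    axis: int                          # assumed: 0, 1, 2 selects x, y, z
    left_bounds: Tuple[float, float]   # 1D interval (lo, hi) bounding the left child
    right_bounds: Tuple[float, float]  # 1D interval (lo, hi) bounding the right child
    left: "Node"
    right: "Node"


Node = Union[Leaf, Inner]


def count_primitives(node: Node) -> int:
    """Count the leaves (one primitive each) below a node."""
    if isinstance(node, Leaf):
        return 1
    return count_primitives(node.left) + count_primitives(node.right)


# Two primitives whose 1D intervals overlap along the x axis.
tree = Inner(axis=0, left_bounds=(0.0, 1.0), right_bounds=(0.5, 2.0),
             left=Leaf(primitive=0), right=Leaf(primitive=1))
```

Unlike a KD tree with a single split plane, each child carries its own interval, so the two children may overlap or leave a gap between them.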

  7. B-KD Tree Subdivision • Bounding Volume Hierarchy (partially unbounded) • Each node can be associated with a full bounding box • Bounds may overlap • Primitives in single leaf nodes • More traversal steps than for a KD tree • Support for dynamic scenes


  12. Update of B-KD Trees Update Procedure • Bounds updated on changed geometry • B-KD tree structure remains constant • Linear updating complexity
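A minimal sketch of such an update, assuming a recursive bottom-up refit over dictionary nodes (the hardware Update Processor is described later as stream-like, so this is not its actual schedule). The function name `refit`, the node keys and the `prim_boxes` list are hypothetical. Each node is visited once, which gives the linear cost, and no node is added, removed or re-linked.

```python
def refit(node, prim_boxes):
    """Update the 1D child intervals below `node` in place.

    `prim_boxes[i]` is the current box ((x0, y0, z0), (x1, y1, z1)) of primitive i.
    Returns the full bounding box of `node`.
    """
    if "prim" in node:                      # leaf: box of its single primitive
        return prim_boxes[node["prim"]]
    lbox = refit(node["left"], prim_boxes)
    rbox = refit(node["right"], prim_boxes)
    a = node["axis"]
    # Only the bounds change; the tree structure stays constant.
    node["left_bounds"] = (lbox[0][a], lbox[1][a])
    node["right_bounds"] = (rbox[0][a], rbox[1][a])
    lo = tuple(min(l, r) for l, r in zip(lbox[0], rbox[0]))
    hi = tuple(max(l, r) for l, r in zip(lbox[1], rbox[1]))
    return lo, hi


root = {"axis": 0, "left": {"prim": 0}, "right": {"prim": 1}}
boxes = [((0.0, 0.0, 0.0), (1.0, 1.0, 1.0)), ((2.0, 0.0, 0.0), (3.0, 1.0, 1.0))]
root_box = refit(root, boxes)

# Primitive 1 moves; a second refit adjusts the bounds without rebuilding.
boxes[1] = ((5.0, 0.0, 0.0), (6.0, 1.0, 1.0))
moved_box = refit(root, boxes)
```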

  13. DRPU Architecture (block diagram)

  14. DRPU Architecture • Rendering Units • Highly multi-threaded (higher hardware usage) • Synchronous execution of packets of 4 rays (memory bandwidth reduction) • First-level caches (memory bandwidth reduction)

  15. DRPU Architecture • Programmable Shading Processor • Design similar to fragment processors on GPUs • Improved Programming Model • Add highly efficient recursion • Add flexible memory access • Programming Model • Ray generation tasks • Material shading • Calls Ray Casting Units to cast rays

  16. DRPU Architecture • Programmable Shading Unit • Ray Casting Units • High-performance traversal and intersection • Support for continuous dynamic scenes • B-KD Trees approach

  17. DRPU Architecture • Programmable Shading Unit • Ray Casting Units • Traversal Processor • Efficient traversal of B-KD trees

  18. DRPU Architecture • Programmable Shading Unit • Ray Casting Units • Traversal Processor • Efficient traversal of B-KD trees • Geometry Unit • Ray transformations • Vertex-based ray/triangle intersection [Möller Trumbore] • Shared vertices save memory 6x
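For reference, the Möller-Trumbore test cited for the Geometry Unit can be sketched in software as follows. This is the textbook formulation, not the hardware datapath; the helper names and the `eps` tolerance are assumptions made for this example. Triangles index into a shared vertex list, which is the idea behind the shared-vertices bullet.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def intersect(orig, direction, v0, v1, v2, eps=1e-9):
    """Return (t, u, v) for a hit in front of the origin, else None."""
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(direction, e2)
    det = dot(e1, p)
    if abs(det) < eps:            # ray parallel to the triangle plane
        return None
    inv = 1.0 / det
    s = sub(orig, v0)
    u = dot(s, p) * inv           # first barycentric coordinate
    if u < 0.0 or u > 1.0:
        return None
    q = cross(s, e1)
    v = dot(direction, q) * inv   # second barycentric coordinate
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(e2, q) * inv          # distance along the ray
    return (t, u, v) if t > eps else None


# Shared vertex list; each triangle stores three indices into it.
vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
triangles = [(0, 1, 2), (1, 3, 2)]
tri = [vertices[i] for i in triangles[0]]
hit = intersect((0.25, 0.25, 1.0), (0.0, 0.0, -1.0), *tri)
miss = intersect((2.0, 2.0, 1.0), (0.0, 0.0, -1.0), *tri)
```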

  19. DRPU Architecture • Programmable Shading Unit • Ray Casting Units • Scene Changes • Skinning Processor • Skeleton Subspace Deformation • Re-uses Geometry Unit • Pure stream architecture

  20. DRPU Architecture • Programmable Shading Unit • Ray Casting Units • Scene Changes • Skinning Processor (see paper) • Skeleton Subspace Deformation • Re-uses Geometry Unit • Pure stream architecture • Update Processor • Stream-like architecture • Partial breadth-first execution • One B-KD node update per clock cycle peak
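Skeleton subspace deformation is commonly formulated as a weighted blend of bone-transformed vertex positions. The sketch below shows that general formulation only; the slides defer the Skinning Processor details to the paper, so the function names, the 3x4 matrix layout and the weights are all assumptions made for illustration.

```python
def transform(m, v):
    """Apply a 3x4 affine matrix (three rows of four floats) to a 3D point."""
    return tuple(r[0] * v[0] + r[1] * v[1] + r[2] * v[2] + r[3] for r in m)


def skin_vertex(v, bones, weights):
    """Blend the bone-transformed copies of v; weights are assumed to sum to 1."""
    out = [0.0, 0.0, 0.0]
    for m, w in zip(bones, weights):
        p = transform(m, v)
        for i in range(3):
            out[i] += w * p[i]
    return tuple(out)


identity = ((1.0, 0.0, 0.0, 0.0), (0.0, 1.0, 0.0, 0.0), (0.0, 0.0, 1.0, 0.0))
shift_x = ((1.0, 0.0, 0.0, 2.0), (0.0, 1.0, 0.0, 0.0), (0.0, 0.0, 1.0, 0.0))

# A vertex influenced equally by a resting bone and a bone moved 2 units in x.
skinned = skin_vertex((1.0, 1.0, 1.0), [identity, shift_x], [0.5, 0.5])
```

Each vertex is processed independently of the others, which fits the stream-style processing mentioned on the slide.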

  21. DRPU Architecture (block diagram)

  22. Traversal of B-KD Trees • Early ray termination • Clipping of near/far interval against both bounding intervals • Take closer child, push farther child to stack • Traversal order does not affect correctness Complexity • 4x the computational cost of a KD tree traversal step • 2x stack memory
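The traversal steps above can be sketched for a single ray as follows. This is a software illustration, not the Traversal Processor itself: the dictionary node layout, the function names and the `hit_fn` callback (which returns a hit distance or None for a primitive) are hypothetical.

```python
import math


def clip(near, far, lo, hi, o, d):
    """Clip the ray interval [near, far] against the slab lo <= x <= hi on one axis."""
    if d == 0.0:  # ray parallel to the slab: inside keeps the interval, outside empties it
        return (near, far) if lo <= o <= hi else (math.inf, -math.inf)
    t0, t1 = (lo - o) / d, (hi - o) / d
    if t0 > t1:
        t0, t1 = t1, t0
    return max(near, t0), min(far, t1)


def traverse(root, orig, direction, hit_fn, near=0.0, far=math.inf):
    """Return (t, primitive) of the nearest hit, or None."""
    best_t, best_prim = far, None
    stack = [(root, near, far)]
    while stack:
        node, n, f = stack.pop()
        f = min(f, best_t)
        if n > f:                 # early ray termination: a closer hit already exists
            continue
        if "prim" in node:        # leaf: test its single primitive
            t = hit_fn(node["prim"], orig, direction)
            if t is not None and t < best_t:
                best_t, best_prim = t, node["prim"]
            continue
        a = node["axis"]
        ln, lf = clip(n, f, *node["left_bounds"], orig[a], direction[a])
        rn, rf = clip(n, f, *node["right_bounds"], orig[a], direction[a])
        children = []
        if ln <= lf:
            children.append((ln, lf, node["left"]))
        if rn <= rf:
            children.append((rn, rf, node["right"]))
        # Push the farther child first so the closer child is popped next.
        children.sort(key=lambda c: c[0], reverse=True)
        for cn, cf, child in children:
            stack.append((child, cn, cf))
    return (best_t, best_prim) if best_prim is not None else None
```

Because overlapping bounds mean the first hit found is not guaranteed to be the nearest, the sketch keeps the best hit so far and only uses the visiting order to terminate early.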

  23. Traversal Processor • Stack control computes next address

  24. Traversal Processing Unit • Stack control computes next address • Next node is fetched from cache

  25. Traversal Processing Unit • Stack control computes next address • Next node is fetched from cache • 4 traversal slices compute 4x4 distances to bounding planes

  26. Traversal Processing Unit • Stack control computes next address • Next node is fetched from cache • 4 traversal slices compute 4x4 distances to bounding planes • 4 Decision Units compute per ray traversal decision

  27. Traversal Processing Unit • Stack control computes next address • Next node is fetched from cache • 4 traversal slices compute 4x4 distances to bounding planes • 4 Decision Units compute per ray traversal decision • Packet Decision Unit computes packet traversal decision • Packet goes left if there exists a ray that goes left • Packet goes right if there exists a ray that goes right • Packet goes from left to right if there exists a ray that goes into both children from left to right

  28. Traversal Processing Unit • Stack control computes next address • Next node is fetched from cache • 4 traversal slices compute 4x4 distances to bounding planes • 4 Decision Units compute per ray traversal decision • Packet Decision Unit computes packet traversal decision • Packet goes left if there exists a ray that goes left • Packet goes right if there exists a ray that goes right • Packet goes from left to right if there exists a ray that goes into both children from left to right ⇒ Incoherent packets possible
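The three packet rules can be sketched as a small function over the per-ray decisions. The tuple encoding `(goes_left, goes_right, left_first)` and the function name are hypothetical. The slides do not say which child comes first when both are visited but no ray goes from left to right; the sketch assumes right first in that case.

```python
def packet_decision(ray_decisions):
    """Combine per-ray decisions into the ordered children the packet visits.

    Each entry is (goes_left, goes_right, left_first) for one ray of the packet.
    """
    go_left = any(l for l, r, lf in ray_decisions)
    go_right = any(r for l, r, lf in ray_decisions)
    # Left-to-right order if some ray enters both children, left child first.
    left_to_right = any(l and r and lf for l, r, lf in ray_decisions)
    if go_left and go_right:
        # Assumption: default to right first when no ray asks for left-to-right.
        return ("left", "right") if left_to_right else ("right", "left")
    if go_left:
        return ("left",)
    if go_right:
        return ("right",)
    return ()


# A packet of 4 rays that disagree: the packet still visits both children.
order = packet_decision([
    (True, False, True),    # ray 0: left only
    (True, True, True),     # ray 1: both, left first
    (False, True, False),   # ray 2: right only
    (False, False, False),  # ray 3: misses both children
])
```

Rays that disagree simply ride along with the packet, which is why incoherent packets remain possible.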

  29. FPGA Implementation Hardware • Xilinx Virtex4 LX160 • 66 MHz • 1.0 GB/s (limited to 0.5 GB/s) • 7.5 GFLOPS • 2.3 GFLOPS programmable • 5.2 GFLOPS fixed function Implementation • Packets of 4 rays • 32 packets of rays • 3x 8 KB caches, direct mapped • 24 bit floating point (Figure: Virtex4 board)

  30. ASIC Design • Synthesis • Synopsys Synthesis • UMC 130nm CMOS process • Place & Route • Cadence Encounter • Some manual placements to achieve good results • Only DRPU Core • No chip interface designed (PCI Express, DRAM, ...) • No power estimation (Figure: DRPU-ASIC)

  31. DRPU-ASIC Hardware • UMC 130nm process • Die size: 49 mm² (7 mm x 7 mm) • 266 MHz clock • 2.1 GB/s bandwidth • 30 GFLOPS Implementation Differences • Larger caches (3x 16 KB, 4-way associative) • 32 bit floating point

  32. GPU Complexity ATI R520 (October, 2005) • 90nm process • 288 mm² die • 600 MHz clock speed • 170 GFLOPS programmable? • 44.8 GB/s memory bandwidth Implementation • Packets of 4 fragments • 16 fragment pipelines • 8 vertex pipelines • 32 bit floating point (Figure label: 7 mm)

  33. On-Chip Parallelization • Thread Scheduler schedules packets • High bandwidth memory interface to Rendering Units

  34. DRPU4 ASIC Hardware • UMC 130nm process • 196 mm² die (4 x 49 mm², 14 mm x 14 mm) • 266 MHz clock • 8.5 GB/s bandwidth • 120 GFLOPS Implementation Differences • 4x DRPU ASIC • No high level control

  35. DRPU8-ASIC Hardware • 90nm process (extrapolated using constant field scaling) • 186 mm² die (19.3 mm x 9.6 mm) • 400 MHz clock speed • 25.6 GB/s bandwidth • 361 GFLOPS • 110 GFLOPS programmable • 471 GFLOPS fixed function Implementation Differences • 8x DRPU-ASIC
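The headline totals on the ASIC slides can be reproduced with simple arithmetic, assuming peak throughput scales linearly with clock rate and core count (an assumption of this sketch, not a statement from the slides). The second helper applies ideal constant field scaling, where area shrinks with the square of the feature-size ratio and clock rises with the ratio. Both function names are hypothetical.

```python
def peak_gflops(base_gflops, base_mhz, target_mhz, cores):
    """Scale a single-core peak linearly with clock and number of cores."""
    return base_gflops * (target_mhz / base_mhz) * cores


def constant_field_scale(area_mm2, clock_mhz, old_nm, new_nm):
    """Ideal constant field scaling: area / s^2 and clock * s, with s = old / new."""
    s = old_nm / new_nm
    return area_mm2 / s ** 2, clock_mhz * s


# Single DRPU-ASIC from slide 31: 30 GFLOPS at 266 MHz on 49 mm^2.
drpu4 = peak_gflops(30.0, 266.0, 266.0, cores=4)   # 120 GFLOPS
drpu8 = peak_gflops(30.0, 266.0, 400.0, cores=8)   # about 361 GFLOPS

# Eight 130 nm cores shrunk to 90 nm under ideal scaling.
area_90, clock_90 = constant_field_scale(8 * 49.0, 266.0, 130.0, 90.0)
```

Ideal scaling gives roughly 188 mm² and 384 MHz, in the same range as the 186 mm² and 400 MHz listed for the DRPU8.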

  36. Results 1024x768, shadows

  37. Results 1024x768, shadows

  38. Results for DRPU8 • Performance sufficient for game play • Room for improving image quality • Gael: 91.2 fps • DynGael: 96.0 fps

  39. Conclusions and Future Work • Ray Tracing Hardware Design • Support for programmable recursive shading • Coherent scene changes • Working Prototype Implementation • Post-layout ASIC results • Still no power results • No direct performance comparison against GPU

  40. Questions? • Project Homepage: http://www.saarcor.de • Computer Graphics Lab at Saarland University: http://graphics.cs.uni-sb.de
