Introduction to Realtime Ray Tracing, Course 41
Philipp Slusallek, Peter Shirley, Bill Mark, Gordon Stoll, Ingo Wald

Hardware for Realtime Ray Tracing
• Custom Hardware for Realtime Ray Tracing
  • Characteristics and requirements
• RPU Design and Implementation
  • GPU + Recursion + Custom Traversal HW
  • Programming Model
  • FPGA Prototype
  • Performance and Scalability

Ray Tracing on CPUs
• Characteristics
  • Commodity, well understood HW
  • High FP performance, yet still too slow
  • Limited parallelism, bulky clusters
  • Poor silicon usage (e.g. cache)
• Outlook
  • Multi-core designs are coming
  • Will still take too long

Ray Tracing on GPUs
• Characteristics
  • Very high raw FP performance
  • High degree of parallelism
  • Fast development cycle
  • Stream programming model
• Still too limited for efficient ray tracing
  • No support for recursion
  • Limited memory access

Ray Tracing Characteristics: kd-Tree Traversal
• One-dimensional computation along ray
• Compute location of d relative to t_min / t_max
• Iterate or recurse with updated t_min / t_max
[Figure: three cases for the distance d to the split plane along the ray]
• Near: t_min < t_max < d
• Both: t_min < d < t_max
• Far: d < t_min < t_max

Ray Tracing Characteristics: kd-Tree Traversal
• Inner traversal loop
    tmp  = node.split - ray.origin
    d    = tmp * 1/ray.direction
    near = d > t_min
    far  = d < t_max
    if (near & far) push(node.far, d, t_max)
    if (near) iterate(node.near, t_min, d)
    else      iterate(node.far, d, t_max)
• Advantages of using kd-trees
  • Simple and fast traversal & building algorithm
  • Robust & very good handling of large scenes

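The inner loop above can be sketched in Python. This is a minimal illustration under stated assumptions, not the TPU's implementation: the `Node` class, the `traverse` function, and the 2D example tree are hypothetical. In the near-only and far-only cases the sketch keeps the current interval unchanged, which amounts to clamping d to [t_min, t_max]. The guard for rays parallel to the split plane is also an addition.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical kd-tree node: inner nodes split, leaves hold primitive names."""
    axis: int = 0
    split: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    prims: Optional[List[str]] = None  # set only on leaves


def traverse(root, origin, direction, t_min, t_max):
    """Return (prims, t_min, t_max) for each visited leaf, front to back."""
    visited = []
    stack = [(root, t_min, t_max)]
    while stack:
        node, t0, t1 = stack.pop()
        while node.prims is None:
            o, dr = origin[node.axis], direction[node.axis]
            if dr == 0.0:
                # Ray parallel to the split plane: stay on the origin's side.
                node = node.left if o < node.split else node.right
                continue
            # The near child is the one the ray enters first along this axis.
            near, far = (node.left, node.right) if dr > 0 else (node.right, node.left)
            d = (node.split - o) / dr
            go_near = d > t0
            go_far = d < t1
            if go_near and go_far:
                stack.append((far, d, t1))  # visit far later on [d, t_max]
                node, t1 = near, d          # continue in near on [t_min, d]
            elif go_near:
                node = near
            else:
                node = far
        visited.append((node.prims, t0, t1))
    return visited


# Example tree: x = 1 separates leaf A from a subtree that y = 1 splits into B and C.
tree = Node(axis=0, split=1.0,
            left=Node(prims=["A"]),
            right=Node(axis=1, split=1.0,
                       left=Node(prims=["B"]), right=Node(prims=["C"])))
order = traverse(tree, origin=(0.0, 0.5), direction=(1.0, 0.25), t_min=0.0, t_max=4.0)
```

A full ray tracer would intersect the primitives of each leaf as it is reached and stop at the first hit that lies inside the leaf's interval. The sketch only records the visiting order.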
Ray Tracing Characteristics: kd-Tree Traversal
• Traversal Processing
  • 50-80 kd-tree steps per ray @ 10 instructions/step → many instructions → many clock cycles
  • Serial dependency → low pipeline efficiency, stalls, latency
  • Limited but flexible control flow and memory access
• Custom HW unit
  • One clock tick per traversal step (fully pipelined)
  • Up to 100:1 improvement

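The per-ray cost figures above can be worked through as a small calculation. The step counts and the 10 instructions per step come from the slide. The `cycles_per_instr` parameter is an assumption standing in for stalls and latency, and its value of 10 is chosen only to show what a 100:1 ratio would require; the slides give no such number.

```python
def software_cycles(steps, instr_per_step=10, cycles_per_instr=1.0):
    """Cycles per ray for traversal on a programmable processor."""
    return steps * instr_per_step * cycles_per_instr


def pipelined_cycles(steps):
    """Cycles per ray for a fully pipelined unit: one tick per traversal step."""
    return steps


for steps in (50, 80):
    ideal = software_cycles(steps) / pipelined_cycles(steps)
    # cycles_per_instr=10.0 is an illustrative assumption, not a measured value.
    stalled = software_cycles(steps, cycles_per_instr=10.0) / pipelined_cycles(steps)
    print(f"{steps} steps: {ideal:.0f}:1 without stalls, {stalled:.0f}:1 with assumed stalls")
```

Instruction count alone accounts for a factor of 10. In this model the rest of the quoted improvement would have to come from the pipeline inefficiency the slide attributes to serial dependencies.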
Ray Tracing Characteristics: Intersection
• Intersection computation
  • Triggered by traversal at every leaf node
  • Called with: ray and address of geometry
• Option 1: Custom hardware [SaarCOR’05]
• Option 2: Software on programmable processor
  • Can be implemented efficiently
  • Enables arbitrary programmable primitives
Do not use costly dedicated hardware

Ray Tracing Characteristics: Shading
• Shading computation
  • Triggered by finished ray traversal
  • Called with: ray, hit point, shader-id, address of parameters
• Characteristics:
  • General-purpose computation, many 3-/4-vectors
  • Needs support for efficient texture and memory access
  • Needs support for arbitrary recursive tracing of rays
  • E.g. support dependent ray tracing
Main feature of ray tracing: Do not put limits on it

Ray Tracing Characteristics: Coherence
• Ray coherence
  • Neighboring primary rays
    • Traverse highly similar kd-tree nodes in the same order
    • Often hit the same geometric primitives
    • Often execute the same shader, access the same textures, …
  • Similar for shadow rays to one light source
  • Often (but not always) applies to secondary rays
HW should take advantage of this coherence

Previous Work
• SaarCOR I
  • Fixed function ray tracing chip [GH’05]

RPU Approach
• Take GPUs as basis and core component
  • Highly parallel, highly efficient
• Improve programming model
  • Add efficient recursion, conditionals
  • Add memory access options
• Add custom traversal unit
  • Slave to RPU
  • Performs indirect, data-dependent function calls

RPU Design
• Shader Processing Units (SPU)
  • General purpose computation
  • For shading, geometry, lighting computations
  • Operates on 4-component vectors
  • Integer and float
  • Dual issue, split vector
  • GPU-like instruction set
  • Arbitrary read/write
  • Texture addressing mode
  • No texture filtering (left to software)

RPU Design
• Shader Processing Units (SPU)
• Custom Ray Traversal Unit (TPU)
  • Efficient traversal of kd-trees
  • Communicates with SPU over dedicated registers

RPU Design
• Shader Processing Units (SPU)
• Custom Ray Traversal Unit (TPU)
• Multi-Threading
  • Increases usage of HW resources
  • Hides latency due to
    • Memory access
    • Instruction dependencies
    • Long traversal operations
  • Separate thread pool for SPU & TPU
  • Software scheduling (compiler)
  • No overhead for switching threads
  • Increases resources (mainly register file)

RPU Design
• Shader Processing Units (SPU)
• Custom Ray Traversal Unit (TPU)
• Multi-Threading
• Chunking
  • SIMD execution (SPUs & TPUs)
  • Takes advantage of coherence
    • Reduces hardware complexity
    • Can combine memory requests
    • Reduces external bandwidth
  • Must allow for incoherence
    • Chunks may split at conditionals
    • Inactive sub-chunk put on stack
    • Masked execution
    • Worst case: serial computation

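The chunk-splitting behaviour can be illustrated with a short Python sketch. It is a simplified model, not the RPU's control logic: `run_conditional`, the lane values, and the two branch functions are hypothetical, and a single if/else stands in for arbitrary control flow.

```python
def run_conditional(chunk, cond, then_fn, else_fn):
    """Execute an if/else over a chunk of lanes using masks and a stack.

    Returns the per-lane results and the number of masked passes needed.
    """
    taken = [cond(v) for v in chunk]
    not_taken = [not t for t in taken]
    stack = []
    if any(not_taken):
        stack.append((not_taken, else_fn))  # inactive sub-chunk is parked
    if any(taken):
        stack.append((taken, then_fn))      # active sub-chunk runs first
    result = list(chunk)
    passes = 0
    while stack:
        mask, fn = stack.pop()
        passes += 1
        for lane, active in enumerate(mask):
            if active:                      # masked execution
                result[lane] = fn(result[lane])
    return result, passes


is_even = lambda v: v % 2 == 0
coherent = run_conditional([2, 4, 6, 8], is_even, lambda v: v * 10, lambda v: -v)
split = run_conditional([1, 2, 3, 4], is_even, lambda v: v * 10, lambda v: -v)
```

A coherent chunk finishes in one pass, while a divergent chunk needs one pass per sub-chunk. With nested conditionals the sub-chunks can shrink to single lanes, which is the serial worst case named on the slide.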
RPU Design
• Shader Processing Units (SPU)
• Custom Ray Traversal Unit (TPU)
• Multi-Threading
• Chunking
• Mailbox Processing (MPU)
  • Per thread caching mechanism
  • Avoids multiple processing of same kd-tree entry (e.g. triangle)
  • 10x performance for some scenes

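The mailbox idea can be sketched as follows, assuming a primitive may be referenced from several kd-tree leaves. The function name and the example leaves are hypothetical. The unbounded Python `set` is a simplification: a hardware mailbox would presumably be a small fixed-size cache that can occasionally forget an entry.

```python
def intersect_with_mailbox(leaves, intersect):
    """Run `intersect` once per primitive id, even if several leaves list it."""
    mailbox = set()   # per-ray record of primitives already processed
    tested = []
    for leaf in leaves:
        for prim_id in leaf:
            if prim_id in mailbox:
                continue          # already processed for this ray
            mailbox.add(prim_id)
            intersect(prim_id)
            tested.append(prim_id)
    return tested


# Three leaves visited by one ray; every triangle straddles two leaves.
leaves = [[1, 2], [2, 3], [3, 1]]
calls = []
tested = intersect_with_mailbox(leaves, calls.append)
naive_tests = sum(len(leaf) for leaf in leaves)
```

In this example the mailbox halves the number of intersection tests (3 instead of 6). The gain depends on how often primitives overlap several leaves.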
SPU Vector Registers
• All registers have 4 components (float or integer)
• R0 to R15: General registers
  • Index into a HW managed register stack
  • Allows for single-cycle function call
• P0 to P15: Shader parameters
• I0 to I3: Data read from memory
• A = (A0,A1,A2,A3)
  • Memory addressing
• ORG, DIR, ...
  • TPU communication registers

Instruction Set of SPU
• Short vector instruction set
  • mov, add, mul, mad, frac
  • dph2, dp3, dph3, dp4
• Input modifiers
  • Swizzling, negation, masking
  • Multiply with power of 2
• Special operations (modifiers)
  • rcp, rsq, sat
• Fast 2D texture lookups
  • texload, texload4x
• Read from and write to memory
  • load, load4x, store
• Ray traversal operation
  • trace
• Conditional instructions (paired)
  • if <condition> jmp label
  • if <condition> call <fun>
  • if <condition> return
• Dual issue (pairing)
  • 3/1 and 2/2 arithmetic splitting
  • Arithmetic + load
  • Arithmetic + conditional jump, call, return

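For readers unfamiliar with GPU-style short vector instructions, a few of the operations listed above can be modelled in Python. The sketch assumes the conventional GPU meaning of `mad`, `dp3`, `dp4`, and `rcp`; the slides do not define the exact SPU semantics, and the `dph2`/`dph3` variants are left out for that reason. The helper names are illustrative only.

```python
def swizzle(v, pattern):
    """Input modifier: reorder or replicate components, e.g. 'wzyx' or 'xxxx'."""
    index = {"x": 0, "y": 1, "z": 2, "w": 3}
    return tuple(v[index[c]] for c in pattern)


def neg(v):
    """Input modifier: negate every component."""
    return tuple(-c for c in v)


def mad(a, b, c):
    """Component-wise multiply-add: a * b + c."""
    return tuple(x * y + z for x, y, z in zip(a, b, c))


def dp3(a, b):
    """Dot product over the x, y, z components."""
    return sum(x * y for x, y in zip(a[:3], b[:3]))


def dp4(a, b):
    """Dot product over all four components."""
    return sum(x * y for x, y in zip(a, b))


def rcp(x):
    """Reciprocal, modelled here as a plain division."""
    return 1.0 / x


v = (1, 2, 3, 4)
ones = (1, 1, 1, 1)
```

For example, `mad(v, (2, 2, 2, 2), ones)` scales and offsets all four lanes in one operation, and `rcp(dp3(v, v))` mirrors how a modifier can be applied to a dot product result.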
Ray Triangle Intersection: Unit-Triangle Test
    ; load triangle transformation
    load4x A.y,0
    ; transform ray
    dp3_rcp R7.z,I2,R3
    dp3 R7.y,I1,R3
    dp3 R7.x,I0,R3
    dph3 R6.x,I0,R2
    dph3 R6.y,I1,R2
    dph3 R6.z,I2,R2
    ; compute hit distance
    mul R8.z,-R6.z,S.z + if z <0 return
    ; barycentric coordinates
    mad R8.xy,R8.z,R7,R6 + if or xy (<0 or >=1) return
    ; hit if u + v < 1
    add R8.w,R8.x,R8.y + if w >=1 return
    ; hit distance closer than last one?
    add R8.w,R8.z,-R4.z + if w >=0 return
    ; save hit information
    mov SID,I3.x + mov MAX,R8.z
    mov R4.xyz,R8 + return
Slide annotations: Input; Arithmetic (dot products); Multi-issue (arith. & cond.)

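The same test can be followed step by step in a Python sketch. It assumes that the loaded transformation is an affine map taking the triangle to the unit triangle with vertices (0,0,0), (1,0,0), (0,1,0) in the z = 0 plane, passed as three rows of four values. The function name, argument layout, and the guard for rays parallel to the triangle plane are illustrative additions, not part of the SPU code.

```python
def unit_triangle_test(rows, origin, direction, t_closest=float("inf")):
    """Return (t, u, v) for a hit closer than t_closest, else None.

    rows: three (a, b, c, d) rows of the assumed world-to-unit-triangle transform.
    """
    def dp3(row, vec):
        return row[0] * vec[0] + row[1] * vec[1] + row[2] * vec[2]

    # Transform ray: the direction uses the linear part only,
    # the origin also receives the translation stored in the fourth column.
    d = [dp3(row, direction) for row in rows]
    o = [dp3(row, origin) + row[3] for row in rows]

    if d[2] == 0.0:
        return None                      # parallel to the plane (added guard)
    t = -o[2] / d[2]                     # hit distance to the z = 0 plane
    if t < 0.0:
        return None
    u = o[0] + t * d[0]                  # barycentric coordinates
    v = o[1] + t * d[1]
    if u < 0.0 or v < 0.0 or u >= 1.0 or v >= 1.0:
        return None
    if u + v >= 1.0:
        return None
    if t >= t_closest:                   # not closer than the last hit
        return None
    return t, u, v


identity = [(1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0)]
hit = unit_triangle_test(identity, (0.25, 0.25, 1.0), (0.0, 0.0, -1.0))
miss = unit_triangle_test(identity, (0.75, 0.75, 1.0), (0.0, 0.0, -1.0))
```

With the identity transform the triangle is the unit triangle itself, so the first ray hits at distance 1 and the second misses because u + v exceeds 1.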
Shader Processing Unit: Pipelining
[Figure: SPU pipeline diagram. Stages shown: Read Instruction; Read 3 Source Registers; Swizzling; Memory Access; arithmetic (drawn as four * and four + units); Thread Control, Clamp, Branching, RCP/RSQ; Masking, Stack Control; Writeback, including writeback to I0 to I3. Example instructions in flight: mov R0,R1; mov R2,R3; mov R0,R2]

RPU Programming Model
• ↨: Direct function calls
• ↔: Indirect function calls via TPU
[Figure: shader call graph. SPU processing: Primary Ray Shader, Top-Level Object Intersector, Geometry Intersector, Surface/BRDF Shader, Lighting Shader, Light Source Shaders. TPU/MPU processing links the stages and carries primary rays, secondary rays, and shadow rays.]

RPU Programming Model
• Threads are started for each pixel
  • Registers initialized from an input stream
  • 2D Hilbert curve generator sampling the screen
  • Memory stream for multi-pass
  • Shader computes ray

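A Hilbert curve visits the screen so that consecutive samples are always neighboring pixels, which keeps successive rays coherent. The sketch below uses the widely known iterative index-to-coordinate conversion. It is a software illustration only; the function name and grid size are assumptions, and nothing here describes how the hardware generator is built.

```python
def hilbert_d2xy(order, d):
    """Map index d along a Hilbert curve to (x, y) on a 2**order x 2**order grid."""
    x = y = 0
    t = d
    s = 1
    n = 1 << order
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:
                # Flip the quadrant before swapping axes.
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


# Pixel order for an 8 x 8 tile of the screen.
tile = [hilbert_d2xy(3, d) for d in range(64)]
```

Each entry of `tile` would seed one thread's registers. Because neighbors on the curve are neighbors on the screen, rays started back to back tend to traverse the same kd-tree nodes.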
RPU Programming Model
• Shooting Primary Rays
  • Ray traversal performed on the TPU
  • Started in top-level kd-tree
  • Intersector transforms ray into local coordinate system

RPU Programming Model
• Shooting Primary Rays (II)
  • Transformed ray traversed through object kd-tree on TPU
  • Geometry intersection performed on programmable SPU
  • Programmable geometry: triangles, spheres, bicubic splines, quadrics, …

RPU Programming Model
• Surface shading performed on programmable SPU
  • Surface shader is called directly from primary shader
  • Arguments passed on HW stack
  • May trace secondary rays at any time: reflection, refraction, …
• Writing shaders is easy due to global access to the scene and physically-based computation

RPU Programming Model
• Light properties and illumination can be abstracted using function calls
  • Illumination shader iterates over all light sources
  • For each light source a light source shader is called

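The function-call abstraction for lighting can be sketched as below. Everything named here is hypothetical: `point_light` stands in for one kind of light source shader, `occluded` stands in for tracing a shadow ray, and the diffuse cosine term is just one possible surface response.

```python
import math


def point_light(position, intensity):
    """Hypothetical light source shader: returns (direction, distance, radiance)."""
    def shader(hit_point):
        to_light = [p - h for p, h in zip(position, hit_point)]
        distance = math.sqrt(sum(c * c for c in to_light))
        direction = [c / distance for c in to_light]
        return direction, distance, intensity / (distance * distance)
    return shader


def lighting_shader(hit_point, normal, lights, occluded):
    """Iterate over all light sources and call each light source shader."""
    total = 0.0
    for light in lights:
        direction, distance, radiance = light(hit_point)
        if occluded(hit_point, direction, distance):   # shadow ray blocked
            continue
        cos_theta = max(0.0, sum(n * d for n, d in zip(normal, direction)))
        total += radiance * cos_theta
    return total


lights = [point_light((0.0, 0.0, 2.0), 4.0), point_light((0.0, 0.0, -2.0), 4.0)]
hit_point, normal = (0.0, 0.0, 0.0), (0.0, 0.0, 1.0)
lit = lighting_shader(hit_point, normal, lights, lambda p, d, t: False)
shadowed = lighting_shader(hit_point, normal, lights, lambda p, d, t: True)
```

The lighting shader never needs to know what kind of light it is calling, so new light types can be added without touching it.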
Prototype Performance
• FPGA prototype
  • Xilinx Virtex II 6000
  • 128 MB DDR-RAM at 350 MB/s
  • PCI bus for up-/download (no VGA)
• Single RPU at only 66 MHz
  • Up to 4 million rays per second
  • Up to 20 fps @ 512x384
  • Same ray tracing performance as Intel P4 @ 2.66 GHz

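The frame rate and ray rate above are consistent with each other, as a quick calculation shows. The only assumption is one primary ray per pixel with no secondary or shadow rays; the variable names are illustrative.

```python
width, height, fps = 512, 384, 20
clock_hz = 66_000_000

rays_per_frame = width * height          # assumes one primary ray per pixel
rays_per_second = rays_per_frame * fps
cycles_per_ray = clock_hz / rays_per_second

print(rays_per_frame, rays_per_second, round(cycles_per_ray, 1))
```

At 20 fps this gives just under 4 million rays per second, matching the quoted peak, and an average budget of roughly 17 clock cycles per ray at 66 MHz.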
Scalability
• Larger Chunk Size
  • Less ray coherence
  • More data is accessed
  • Increased cache bandwidth
  • Larger caches

Scalability
• Larger Chunk Size
• Multiple RPUs on a Chip
  • Limited by
    • VLSI technology
    • Memory bandwidth
  • FPGA prototype versus current GPUs
    • Floating point units 50x
    • Memory bandwidth 100x
    • Clock rate 7x

Scalability
• Larger Chunk Size
• Multiple RPUs on a Chip
• Multiple chips on a board
  • Fast interconnect for data exchange
  • Cache sizes accumulate
  • Managed through virtual memory [Schmittler’2003]
  • Limited through external bandwidth due to scene changes

Scalability
• Larger Chunk Size
• Multiple RPUs on a Chip
• Multiple chips on a board
• Multiple boards in a PC
  • Similar to today’s PC clusters in a much smaller form factor

Future Work
• Support for fully dynamic scenes
  • Vertex shader + building kd-trees
• Efficient photon mapping
  • kd-tree construction + kNN filtering
• OpenRT-API [Dietrich’03]
• ASIC prototype

Questions?
http://graphics.cs.uni-sb.de
http://www.OpenRT.de
http://www.SaarCOR.de