DDDDRRaw: A Prototype Toolkit for Distributed Real-Time Rendering on Commodity Clusters

Thu D. Nguyen and Christopher Peery, Department of Computer Science, Rutgers University • John Zahorjan, Department of Computer Science & Engineering, University of Washington

Overview • Improve real-time rendering performance using distributed rendering on commodity clusters • Real-time rendering -> interactive rendering applications • Improve performance -> Render more complex scenes at interactive rates • Why real-time rendering? • A critical component of an increasing number of continuous media applications • Virtual reality, data visualization, CAD, flight simulators, etc. • Rendering performance will continue to be a bottleneck • Model complexity is increasing as fast as (or faster than) hardware performance • Part of the challenge is to leverage increasingly powerful hardware accelerators

Challenges • How to structure the distributed renderer to leverage hardware-assisted rendering • Information that is useful for work partitioning and assignment may be hidden in the hardware rendering pipeline • How to minimize non-parallelizable overheads (avoiding Amdahl's Law) • How to decouple bandwidth requirements from the complexity of the scene and the cluster size
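To make the Amdahl's Law concern concrete, here is a minimal sketch of the standard speed-up bound. The function name and the 20% serial fraction are illustrative assumptions, not measurements from DDDDRRaW.

```python
def amdahl_speedup(serial_fraction: float, nodes: int) -> float:
    """Upper bound on speed-up when `serial_fraction` of the per-frame
    work cannot be parallelized (Amdahl's Law)."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be in [0, 1]")
    if nodes < 1:
        raise ValueError("nodes must be >= 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / nodes)


if __name__ == "__main__":
    # Hypothetical case: 20% of each frame is serial overhead.
    for n in (1, 4, 12, 32):
        print(n, round(amdahl_speedup(0.2, n), 2))
```

With a fixed serial fraction the bound flattens as nodes are added, which is why the non-parallelizable overheads matter more than raw rendering speed once the cluster grows.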

Image Layer Decomposition (ILD) • Per-frame rendering load is partitioned using ILD • presented in IPDPS 2000 • Briefly review ILD because it affects DDDDRRaW’s architecture and performance • Basic idea: assign scene objects such that sets of objects assigned to different nodes are not mutually occlusive • Advantages of using ILD • Do not need position of polygons in 2D • This information may be hidden inside the graphics pipeline • Do not need Z-buffer information • This reduces the required bandwidth by at least 50%
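The bandwidth claim can be illustrated with simple arithmetic. The sketch below assumes, hypothetically, a 640x480 window, 4 bytes of color per pixel, and 4 bytes of depth per pixel; none of these figures come from the slides. Under those assumptions, shipping color only halves the per-layer payload compared with shipping color plus Z-buffer.

```python
def layer_bytes(width: int, height: int,
                color_bytes: int = 4, depth_bytes: int = 0) -> int:
    """Uncompressed size of one full-window image layer in bytes."""
    return width * height * (color_bytes + depth_bytes)


def bandwidth_mbps(bytes_per_frame: int, fps: float) -> float:
    """Link bandwidth (megabits per second) needed to ship one layer per frame."""
    return bytes_per_frame * 8 * fps / 1e6


if __name__ == "__main__":
    # Assumed window size and pixel formats (illustrative only).
    color_only = layer_bytes(640, 480)
    color_and_z = layer_bytes(640, 480, depth_bytes=4)
    saving = 1.0 - color_only / color_and_z
    print(f"color only: {color_only} B, color + Z: {color_and_z} B, saving: {saving:.0%}")
    print(f"color only at 10 fps: {bandwidth_mbps(color_only, 10):.1f} Mb/s")
```

If the depth format is wider than the color format, the saving exceeds one half, which is consistent with the "at least 50%" wording.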

Image Layer Decomposition (ILD) [Figure: scene objects numbered 1-6; panel label: Spatial partitioning]

ILD: Work Assignment • Non-mutually occlusive assignment -> legal for back-to-front compositing • Use a heuristic-based algorithm to • Balance load across the cluster • Minimize the screen real estate covered by each assignment [Figure: scene objects numbered 1-6; label: Legal]
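Because a legal assignment yields layers that can be ordered back to front, compositing needs no depth comparison. The following is a minimal sketch of that idea, not the DDDDRRaW implementation: the layer representation (nested lists with `None` for uncovered pixels) and the function name are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Color = Tuple[int, int, int]
Pixel = Optional[Color]          # None marks a pixel the layer did not cover
Layer = List[List[Pixel]]        # rows of pixels


def composite_back_to_front(layers: List[Layer],
                            background: Color = (0, 0, 0)) -> List[List[Color]]:
    """Paint layers in order, farthest first; nearer layers overwrite
    farther ones wherever they cover a pixel. No Z values are consulted."""
    if not layers:
        raise ValueError("need at least one layer")
    height, width = len(layers[0]), len(layers[0][0])
    frame = [[background] * width for _ in range(height)]
    for layer in layers:
        for y in range(height):
            for x in range(width):
                if layer[y][x] is not None:
                    frame[y][x] = layer[y][x]
    return frame


if __name__ == "__main__":
    far = [[(1, 1, 1), (1, 1, 1), None]]
    near = [[None, (2, 2, 2), None]]
    print(composite_back_to_front([far, near]))
```

The sketch is only correct when the caller supplies layers in a valid back-to-front order, which is exactly what the non-mutually-occlusive assignment is meant to guarantee.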

Implementation: Architecture • Display Node: App. (VRML Scene, Display Window) on top of the DDDDRRaW Library • Partitioning • Assignment • Decompress • Compositing • Display • Display Node -> Rendering Nodes: Viewpoint, Work Assignment • Rendering Nodes: one DDDDRRaW Library instance per node • Rendering • Compress • Rendering Nodes -> Display Node: Partial Image

Implementation Details • Implemented an optimization to ILD: dynamic selection of octants to be rendered • Minimize overhead of geometric transformation due to polygon splitting (in scene decomposition) • Compression of image layers before communication • Reduce bandwidth requirement to accommodate slower networks (e.g., 100 Mb/s LANs) • Use dynamic clipping to enforce octant boundaries for scenes with smooth shading and/or texturing • Simplification to ease implementation of the prototype; this clipping could/should be done statically • 20-25 percent overhead for 5 of our 6 test scenes that would not be present in a production system
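The slides do not say which compression scheme is used, so the sketch below substitutes Python's standard `zlib` purely as a stand-in. It illustrates why layer compression pays off: a layer that covers only part of the screen is mostly background and compresses well. The layer builder, its dimensions, and the colors are hypothetical.

```python
import zlib


def make_layer(width: int, height: int, covered_rows: int,
               color: bytes = b"\x80\x40\x20") -> bytes:
    """Build a hypothetical RGB layer: the first `covered_rows` rows hold
    rendered pixels, the remaining rows are black background."""
    covered = color * width * covered_rows
    empty = b"\x00\x00\x00" * width * (height - covered_rows)
    return covered + empty


def compress_layer(raw: bytes) -> bytes:
    # Level 1 favors speed over ratio, since compression time is per-frame overhead.
    return zlib.compress(raw, 1)


def decompress_layer(packed: bytes) -> bytes:
    return zlib.decompress(packed)


if __name__ == "__main__":
    raw = make_layer(640, 480, covered_rows=120)
    packed = compress_layer(raw)
    assert decompress_layer(packed) == raw
    print(f"raw: {len(raw)} B, compressed: {len(packed)} B")
```

Real rendered layers are far less uniform than this synthetic one, so actual ratios would be lower; the point is only that compression trades CPU time on both ends for reduced bytes on the wire.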

Performance Measurement • Application: VRML viewer • VRweb – http://www.iicm.edu/vrwave • Collected 6 VRML scenes from the web • Use fixed paths through scenes to measure performance in terms of average frame rate (frames/sec) • Two clusters representing different points in the technology spectrum • Cluster of 5 SGI O2s • 180 MHz MIPS R5000, 256 MB memory, SGI Graphics Accelerator, 100 Mb/s switched Ethernet LAN • IRIX 6.5.7 • Cluster of 13 PCs • Pentium III 800 MHz, 512 MB memory, Giganet 1 Gb/s cLAN • Red Hat Linux (kernel 2.2.14), Mesa 3D library version 3.2

Two Test Scenes

Overheads on SGI O2s

Overheads on PCs

Speed-up of Average Frame Rate on O2s

Speed-up of Average Frame Rate on PCs

Speed-up of Rendering Component on PCs

Conclusions • Can build an ILD-based distributed renderer to significantly improve real-time rendering performance on commodity hardware • DDDDRRaW currently scales to modestly sized clusters • This limitation is due to non-optimal hardware configurations • This is NOT because more suitable hardware is not available! • Expect good scalability to clusters of 16-32 nodes • Overlapping communication with computation increases average frame rate but ONLY at the expense of increasing frame latency • Problem is CPU contention for rendering & communication • Either need dedicated hardware or can only optimize after reaching 10-15 fps, the nominal interactive frame rate • Project URL: www.cs.washington.edu/research/ddddrraw/

Overlapping Communication & Computation • Communication and compression are significant sources of overhead • Apply standard parallel optimization technique: overlap communication of rendered image layers for one frame with rendering of the next • Requires pipelining of DDDDRRaW

The DDDDRRaw Pipeline • Stage 1 (Display Node): ILD, Send • Stage 2 (Rendering Nodes): Receive, Render, Compress, Send • Stage 3 (Display Node): Receive, Decompress, Composite & Display
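The frame-rate versus latency trade-off of pipelining can be sketched with an idealized timing model. Everything here is an assumption for illustration: the three stage times, the `contention` factor that models stages slowing down when they share a CPU, and the simplification that pipelined latency equals the sum of the (inflated) stage times.

```python
from typing import Sequence, Tuple


def sequential(stage_ms: Sequence[float]) -> Tuple[float, float]:
    """No overlap: each frame passes through all stages before the next starts.
    Returns (frames per second, latency in ms)."""
    total = sum(stage_ms)
    return 1000.0 / total, total


def pipelined(stage_ms: Sequence[float], contention: float = 1.0) -> Tuple[float, float]:
    """Stages overlap across consecutive frames. Throughput is set by the
    slowest stage; `contention` >= 1 inflates every stage to model CPU
    sharing between rendering and communication (a hypothetical knob)."""
    inflated = [t * contention for t in stage_ms]
    return 1000.0 / max(inflated), sum(inflated)


if __name__ == "__main__":
    stages = [20.0, 50.0, 30.0]  # assumed times for stages 1-3, in ms
    print("sequential:", sequential(stages))
    print("pipelined, no contention:", pipelined(stages))
    print("pipelined, 1.5x contention:", pipelined(stages, contention=1.5))
```

In this model, contention leaves the pipelined frame rate above the sequential one while pushing latency past it, which matches the qualitative behavior reported in the conclusions.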

Average Frame Rates

Average Frame Latency
