Learning From the Stanford/DOE Visualization Cluster Mike Houston, Greg Humphreys, Randall Frank, Pat Hanrahan
Outline • Stanford’s current cluster • Design decisions • Performance evaluation • Bottleneck evaluation • Cluster “Landscape” • General classification • Bottleneck evaluation • Stanford’s next cluster • Design goals • Research directions
Stanford/DOE Visualization Cluster The Chromium Cluster
Cluster Configuration (Jan. 2000) • Cluster: 32 graphics nodes + 4 server nodes • Computer: Compaq SP750 • 2 processors (800 MHz PIII Xeon, 133MHz FSB) • i840 core logic (big issue for vis-clusters) • Simultaneous fast graphics and networking • Network: 64-bit, 66 MHz PCI • Graphics: AGP-4x • 256 MB memory • 18GB SCSI 160 disk (+ 3*36GB on servers) • Graphics (Sept. 2002) • 16 NVIDIA GeForce3 w/ DVI (64 MB) • 16 NVIDIA GeForce4 TI4200 w/ DVI (128 MB) • Network • Myrinet 64-bit, 66 MHz (LANai 7)
Graphics Evaluation • NVIDIA GeForce3 • 25 MTri/s triangle rate observed • 680 MPix/s fill rate observed • NVIDIA GeForce4 • 60 MTri/s triangle rate observed • 800 MPix/s fill rate observed • Read Pixels performance • 35 MPix/s (140 MB/s) RGBA • 22 MPix/s (87 MB/s) Depth • Draw Pixels performance • 45 MPix/s (180 MB/s) RGBA • 21 MPix/s (85 MB/s) Depth
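The MB/s figures above follow directly from the pixel rates: a quick sanity check, assuming 4 bytes per pixel for both RGBA and the depth transfers (an assumption consistent with the observed numbers, not stated on the slide):

```python
# Convert observed pixel-transfer rates (MPix/s) into bandwidth (MB/s).
def bandwidth_mb_s(mpix_per_s, bytes_per_pixel=4):
    return mpix_per_s * bytes_per_pixel  # 1 MPix * 4 B/pixel = 4 MB

print(bandwidth_mb_s(35))  # 140 MB/s RGBA readback, matches the slide
print(bandwidth_mb_s(45))  # 180 MB/s RGBA draw, matches the slide
print(bandwidth_mb_s(22))  # 88 MB/s, ~the 87 MB/s observed for depth reads
```

The depth numbers landing at ~4 bytes/pixel suggests depth is transferred as a 32-bit quantity as well.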
Network Evaluation • Myrinet LANai 7 PCI64A boards • Theoretical Limit: 160 MB/s • 142 MB/s observed peak under Linux • ~100 MB/s observed sustained under Linux • ServerNet not chosen • Driver support • Large switching infrastructure required • Gigabit Ethernet • Performance and scalability concerns
Myrinet Issues • Fairness: Clients starved of network resources • Implemented credit scheme to minimize congestion • Lack of buffering in switching fabric • Causes poor performance in high load conditions • Open issue [Figure: performance, partitioned vs. unpartitioned cluster]
i840 Chipset Evaluation • 66 MHz, 64-bit PCI performance not full speed: • 210 MB/s PCI read (40% of theoretical peak) • 288 MB/s PCI write (54% of theoretical peak) • Combined read/write ~121 MB/s • AGP • Fast Writes / Side Band Addressing unstable under Linux
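The theoretical peak here is the 64-bit, 66 MHz PCI bus limit; the quoted percentages check out against it:

```python
# 64-bit (8-byte) transfers at 66 MHz give the PCI theoretical peak.
pci_peak_mb_s = 8 * 66  # 528 MB/s

for label, mb_s in [("read", 210), ("write", 288)]:
    print(f"i840 PCI {label}: {mb_s} MB/s = {mb_s / pci_peak_mb_s:.0%} of peak")
```

210/528 is ~40% and 288/528 is ~54%, matching the slide.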
Sort-First Performance • Configuration • Application runs on the client • Primitives distributed to servers • Tiled Display • 4x3 @ 1024x768 • Total resolution: 4096x2304, 9 Megapixel • Quake 3 • 50 fps • Atlantis • 450 fps
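The tiled-display arithmetic for the slide above:

```python
# 4x3 array of 1024x768 tiles driven by the server nodes
tiles_x, tiles_y = 4, 3
tile_w, tile_h = 1024, 768

width, height = tiles_x * tile_w, tiles_y * tile_h
total_mpix = width * height / 1e6
print(width, height, round(total_mpix, 1))  # 4096 2304 9.4
```

4096x2304 is ~9.4 million pixels, quoted on the slide as "9 Megapixel".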
Sort-Last Performance • Configuration • Parallel rendering on multiple nodes • Composite to final display node • Volume Rendering on 16 nodes • 1.57 GVox/s [Humphreys 02] • 1.82 GVox/s (tuned) 9/02 • 256x256x1024 volume¹ rendered twice ¹Data courtesy of G.A. Johnson, G.P. Cofer, S.L. Gewalt, and L.W. Hedlund from the Duke Center for In Vivo Microscopy (an NIH/NCRR National Resource)
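The GVox/s figure is just volume size times passes times frame rate; working backwards from the tuned throughput gives the implied frame rate (the fps value is derived here, not stated on the slide):

```python
# 256x256x1024 volume, rendered twice per frame, across 16 nodes
voxels_per_frame = 256 * 256 * 1024 * 2  # ~134 M voxels

rate_vox_s = 1.82e9                      # tuned throughput from the slide
fps = rate_vox_s / voxels_per_frame
print(f"{fps:.1f} fps")                  # ~13.6 fps implied
```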
Cluster Accomplishments • Development Platform • WireGL • Chromium • Cluster configuration replicated • Interactive Performance • 256x512x1024 volume @ 15fps • 9 Megapixel Quake3 @ 50fps
Sources of Bottlenecks • Sort-First • Packing speed (processor) • Primitive distribution (network and bus) • Rendering (processor and graphics chip) • Sort-Last • Rendering (graphics chip) • Composite (network, bus, and read/draw pixels)
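For the sort-last composite bottleneck in particular, the cost is easy to model: each contributing node reads back its full framebuffer and ships it over the interconnect every frame. A rough per-node estimate (the resolution and frame rate below are illustrative, not from the slides):

```python
# Per-node sort-last composite cost: full-frame color readback
# followed by a network transfer of the same data.
def composite_mb_per_s(width, height, fps, bytes_per_pixel=4):
    return width * height * bytes_per_pixel * fps / 1e6

# e.g. a 1024x768 RGBA framebuffer composited at 15 fps
print(composite_mb_per_s(1024, 768, 15))  # ~47 MB/s of readback + network traffic
```

Against the ~140 MB/s readback and ~100 MB/s sustained Myrinet figures measured earlier, it is clear why read/draw pixels and the network show up as the sort-last bottlenecks.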
Bottleneck Evaluation – Stanford • Sort-First: Processor and Network • Sort-Last: Network and Read/Draw
The Landscape of Graphics Clusters • Many Options • Low End <$2500/node • Mid End ~$5000/node • High End >$7500/node • Tradeoffs • Different bottlenecks • Price/Performance • Scalability • Usage • Evaluation • Based on published benchmarks and specs
Cluster Interconnect Options • Many choices • GigE • ~100 MB/s • Myrinet 2000 (http://www.myrinet.com) • 245 MB/s • SCI/Dolphin (http://www.dolphinics.com) • 326 MB/s • Quadrics (http://www.quadrics.com) • 340 MB/s • Future options • 10 GigE • Infiniband • HyperTransport
Low End • General Definition • Single CPU • Consumer Mainboard • Integrated Graphics • High Speed commodity network • Example Node Configuration • Nvidia NForce2 • AMD Athlon 2400+ • 512 MB DDR • GigE and 10/100 • 1U rack chassis • Estimated Price: $1500
Bottleneck Evaluation – Low End • Bus/Network limited
Mid End • General Definition • Dual Processor • “Workstation” mainboard • High performance bus • 64-bit PCI or PCI-X • High Speed Commodity / Low end cluster interconnect • High-End consumer graphics board • Example Node Configuration • Intel i860 • Dual Intel P4 Xeon 2.4GHz • 2GB RDRAM • ATI Radeon 9700 • GigE onboard + Myrinet 2000 • 2U rack chassis • Estimated Price: $4000
Bottleneck Evaluation – Mid End • Sort-First: Network limited • Sort-Last: Read/Draw and Network limited
High End • General Definition • Dual or Quad processor • Cutting edge bus • PCI-X, HyperTransport, PCI Enhanced • High Speed Commodity/ High end cluster interconnect • “Professional” graphics board • RAID system • Example Node Configuration • ServerWorks GC-WS • Dual P4 Xeon 2.6GHz • Nvidia Quadro4 900XGL • 4GB DDR • GigE onboard + Infiniband • Estimated Price: $7500
Bottleneck Evaluation – High End • Sort-First: Well balanced • Sort-Last: Read/Draw limited
Balanced System is Key • Only as fast as slowest component • Spend money where it matters!
Goals for Next Cluster • Performance • Sort-Last • 5 GVox/s • 1 GTri/s • Sort-First at 4096x2304 • Quake3 @ >100fps • Research • Remote visualization • Time-varying datasets • Compositing
What we plan to build • 16 Node cluster, 1U nodes • Mainboard chipsets • Intel Placer • ServerWorks GC-WS • AMD Hammer • Memory • 2-4GB • Graphics Chip • Nvidia NV30 • ATI R300/350 • Interconnect • Infiniband, Quadrics • Disk • IDE RAID or SCSI
Continuing Chipset Issues • Why do chipsets perform so poorly? • “Workstation” • Intel i860 • 215 MB/s read (40% of theoretical) • 300 MB/s write (56% of theoretical) • AMD 760MPX • 300 MB/s read (56% of theoretical) • 312 MB/s write (59% of theoretical) • “Server” • ServerWorks ServerSet III LE • 423 MB/s read (79% of theoretical) • 486 MB/s write (91% of theoretical) • Why can’t a “server” have an AGP slot? Performance numbers from http://www.conservativecomputer.com
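All of the efficiency percentages above are against the same 528 MB/s peak of a 64-bit, 66 MHz PCI bus (the slide's quoted figures appear to round down):

```python
pci_peak_mb_s = 8 * 66  # 528 MB/s: 64-bit transfers at 66 MHz

chipsets = {
    "Intel i860 read": 215, "Intel i860 write": 300,
    "AMD 760MPX read": 300, "AMD 760MPX write": 312,
    "ServerSet III LE read": 423, "ServerSet III LE write": 486,
}
for name, mb_s in chipsets.items():
    print(f"{name}: {mb_s / pci_peak_mb_s:.0%} of peak")
```

Only the "server" chipset gets anywhere near the bus limit, which is the point of the slide.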
Ongoing Bottlenecks • Readback performance • Will be fixed “soon” • Hardware compositing? • Chipset Performance • Achieve fraction of theoretical • Need faster busses in commodity chipsets • Network Performance • Scalability • Fast is VERY expensive
Conclusions • What we still need • More vendors • More chipsets • More performance • Graphics Clusters are getting better • Chipsets • Interconnects • Form factor • Processing • Graphics Chips • Things are really starting to get interesting!