Learning From the Stanford/DOE Visualization Cluster Mike Houston, Greg Humphreys, Randall Frank, Pat Hanrahan
Outline • Stanford’s current cluster • Design decisions • Performance evaluation • Bottleneck evaluation • Cluster “Landscape” • General classification • Bottleneck evaluation • Stanford’s next cluster • Design goals • Research directions
Stanford/DOE Visualization Cluster The Chromium Cluster
Cluster Configuration (Jan. 2000) • Cluster: 32 graphics nodes + 4 server nodes • Computer: Compaq SP750 • 2 processors (800 MHz PIII Xeon, 133MHz FSB) • i840 core logic (big issue for vis-clusters) • Simultaneous fast graphics and networking • Network: 64-bit, 66 MHz PCI • Graphics: AGP-4x • 256 MB memory • 18GB SCSI 160 disk (+ 3*36GB on servers) • Graphics (Sept. 2002) • 16 NVIDIA GeForce3 w/ DVI (64 MB) • 16 NVIDIA GeForce4 TI4200 w/ DVI (128 MB) • Network • Myrinet 64-bit, 66 MHz (LANai 7)
Graphics Evaluation • NVIDIA GeForce3 • 25 MTri/s triangle rate observed • 680 MPix/s fill rate observed • NVIDIA GeForce4 • 60 MTri/s triangle rate observed • 800 MPix/s fill rate observed • Read Pixels performance • 35 MPix/s (140 MB/s) RGBA • 22 MPix/s (87 MB/s) Depth • Draw Pixels performance • 45 MPix/s (180 MB/s) RGBA • 21 MPix/s (85 MB/s) Depth
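The MB/s figures above follow directly from the pixel rates: a quick sanity check, assuming 4 bytes per pixel for both RGBA and the depth transfers (an assumption consistent with the observed numbers, not stated on the slide):

```python
# Convert observed pixel-transfer rates (MPix/s) into bandwidth (MB/s).
def bandwidth_mb_s(mpix_per_s, bytes_per_pixel=4):
    return mpix_per_s * bytes_per_pixel  # 1 MPix * 4 B/pixel = 4 MB

print(bandwidth_mb_s(35))  # 140 MB/s RGBA readback, matches the slide
print(bandwidth_mb_s(45))  # 180 MB/s RGBA draw, matches the slide
print(bandwidth_mb_s(22))  # 88 MB/s, ~the 87 MB/s observed for depth reads
```

The depth numbers landing at ~4 bytes/pixel suggests depth is transferred as a 32-bit quantity as well.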
Network Evaluation • Myrinet LANai 7 PCI64A boards • Theoretical Limit: 160 MB/s • 142 MB/s observed peak under Linux • ~100 MB/s observed sustained under Linux • ServerNet not chosen • Driver support • Large switching infrastructure required • Gigabit Ethernet • Performance and scalability concerns
Myrinet Issues • Fairness: Clients starved of network resources • Implemented credit scheme to minimize congestion • Lack of buffering in switching fabric • Causes poor performance in high load conditions • Open issue [Figure: performance, partitioned vs. unpartitioned cluster]
i840 Chipset Evaluation • 66 MHz, 64-bit PCI performance not full speed: • 210 MB/s PCI read (40% of theoretical peak) • 288 MB/s PCI write (54% of theoretical peak) • Combined read/write ~121 MB/s • AGP • Fast Writes / Side Band Addressing unstable under Linux
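The theoretical peak here is the 64-bit, 66 MHz PCI bus limit; the quoted percentages check out against it:

```python
# 64-bit (8-byte) transfers at 66 MHz give the PCI theoretical peak.
pci_peak_mb_s = 8 * 66  # 528 MB/s

for label, mb_s in [("read", 210), ("write", 288)]:
    print(f"i840 PCI {label}: {mb_s} MB/s = {mb_s / pci_peak_mb_s:.0%} of peak")
```

210/528 is ~40% and 288/528 is ~54%, matching the slide.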
Sort-First Performance • Configuration • Application runs on the client • Primitives distributed to servers • Tiled Display • 4x3 @ 1024x768 • Total resolution: 4096x2304, 9 Megapixel • Quake 3 • 50 fps • Atlantis • 450 fps
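The tiled-display arithmetic for the slide above:

```python
# 4x3 array of 1024x768 tiles driven by the server nodes
tiles_x, tiles_y = 4, 3
tile_w, tile_h = 1024, 768

width, height = tiles_x * tile_w, tiles_y * tile_h
total_mpix = width * height / 1e6
print(width, height, round(total_mpix, 1))  # 4096 2304 9.4
```

4096x2304 is ~9.4 million pixels, quoted on the slide as "9 Megapixel".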
Sort-Last Performance • Configuration • Parallel rendering on multiple nodes • Composite to final display node • Volume Rendering on 16 nodes • 1.57 GVox/s [Humphreys 02] • 1.82 GVox/s (tuned) 9/02 • 256x256x1024 volume¹ rendered twice ¹Data courtesy of G.A. Johnson, G.P. Cofer, S.L. Gewalt, and L.W. Hedlund from the Duke Center for In Vivo Microscopy (an NIH/NCRR National Resource)
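The GVox/s figure is just volume size times passes times frame rate; working backwards from the tuned throughput gives the implied frame rate (the fps value is derived here, not stated on the slide):

```python
# 256x256x1024 volume, rendered twice per frame, across 16 nodes
voxels_per_frame = 256 * 256 * 1024 * 2  # ~134 M voxels

rate_vox_s = 1.82e9                      # tuned throughput from the slide
fps = rate_vox_s / voxels_per_frame
print(f"{fps:.1f} fps")                  # ~13.6 fps implied
```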
Cluster Accomplishments • Development Platform • WireGL • Chromium • Cluster configuration replicated • Interactive Performance • 256x512x1024 volume @ 15fps • 9 Megapixel Quake3 @ 50fps
Sources of Bottlenecks • Sort-First • Packing speed (processor) • Primitive distribution (network and bus) • Rendering (processor and graphics chip) • Sort-Last • Rendering (graphics chip) • Composite (network, bus, and read/draw pixels)
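For the sort-last composite bottleneck in particular, the cost is easy to model: each contributing node reads back its full framebuffer and ships it over the interconnect every frame. A rough per-node estimate (the resolution and frame rate below are illustrative, not from the slides):

```python
# Per-node sort-last composite cost: full-frame color readback
# followed by a network transfer of the same data.
def composite_mb_per_s(width, height, fps, bytes_per_pixel=4):
    return width * height * bytes_per_pixel * fps / 1e6

# e.g. a 1024x768 RGBA framebuffer composited at 15 fps
print(composite_mb_per_s(1024, 768, 15))  # ~47 MB/s of readback + network traffic
```

Against the ~140 MB/s readback and ~100 MB/s sustained Myrinet figures measured earlier, it is clear why read/draw pixels and the network show up as the sort-last bottlenecks.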
Bottleneck Evaluation – Stanford • Sort-First: Processor and Network • Sort-Last: Network and Read/Draw
The Landscape of Graphics Clusters • Many Options • Low End <$2500/node • Mid End ~$5000/node • High End >$7500/node • Tradeoffs • Different bottlenecks • Price/Performance • Scalability • Usage • Evaluation • Based on published benchmarks and specs
Cluster Interconnect Options • Many choices • GigE • ~100 MB/s • Myrinet 2000 (http://www.myrinet.com) • 245 MB/s • SCI/Dolphin (http://www.dolphinics.com) • 326 MB/s • Quadrics (http://www.quadrics.com) • 340 MB/s • Future options • 10 GigE • Infiniband • HyperTransport
Low End • General Definition • Single CPU • Consumer Mainboard • Integrated Graphics • High Speed commodity network • Example Node Configuration • Nvidia NForce2 • AMD Athlon 2400+ • 512 MB DDR • GigE and 10/100 • 1U rack chassis • Estimated Price: $1500
Bottleneck Evaluation – Low End • Bus/Network limited
Mid End • General Definition • Dual Processor • “Workstation” mainboard • High performance bus • 64-bit PCI or PCI-X • High Speed Commodity / Low end cluster interconnect • High-End consumer graphics board • Example Node Configuration • Intel i860 • Dual Intel P4 Xeon 2.4GHz • 2GB RDRAM • ATI Radeon 9700 • GigE onboard + Myrinet 2000 • 2U rack chassis • Estimated Price: $4000
Bottleneck Evaluation – Mid End • Sort-First: Network limited • Sort-Last: Read/Draw and Network limited
High End • General Definition • Dual or Quad processor • Cutting edge bus • PCI-X, HyperTransport, PCI Enhanced • High Speed Commodity/ High end cluster interconnect • “Professional” graphics board • RAID system • Example Node Configuration • ServerWorks GC-WS • Dual P4 Xeon 2.6GHz • Nvidia Quadro4 900XGL • 4GB DDR • GigE onboard + Infiniband • Estimated Price: $7500
Bottleneck Evaluation – High End • Sort-First: Well balanced • Sort-Last: Read/Draw limited
Balanced System is Key • Only as fast as slowest component • Spend money where it matters!
Goals for Next Cluster • Performance • Sort-Last • 5 GVox/s • 1 GTri/s • Sort-First at 4096x2304 • Quake3 @ >100fps • Research • Remote visualization • Time-varying datasets • Compositing
What we plan to build • 16 Node cluster, 1U nodes • Mainboard chipsets • Intel Placer • ServerWorks GC-WS • AMD Hammer • Memory • 2-4GB • Graphics Chip • Nvidia NV30 • ATI R300/350 • Interconnect • Infiniband, Quadrics • Disk • IDE RAID or SCSI
Continuing Chipset Issues • Why do chipsets perform so poorly? • “Workstation” • Intel i860 • 215 MB/s read (40% of theoretical) • 300 MB/s write (56% of theoretical) • AMD 760MPX • 300 MB/s read (56% of theoretical) • 312 MB/s write (59% of theoretical) • “Server” • ServerWorks ServerSet III LE • 423 MB/s read (79% of theoretical) • 486 MB/s write (91% of theoretical) • Why can’t a “server” have an AGP slot? Performance numbers from http://www.conservativecomputer.com
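All of the efficiency percentages above are against the same 528 MB/s peak of a 64-bit, 66 MHz PCI bus (the slide's quoted figures appear to round down):

```python
pci_peak_mb_s = 8 * 66  # 528 MB/s: 64-bit transfers at 66 MHz

chipsets = {
    "Intel i860 read": 215, "Intel i860 write": 300,
    "AMD 760MPX read": 300, "AMD 760MPX write": 312,
    "ServerSet III LE read": 423, "ServerSet III LE write": 486,
}
for name, mb_s in chipsets.items():
    print(f"{name}: {mb_s / pci_peak_mb_s:.0%} of peak")
```

Only the "server" chipset gets anywhere near the bus limit, which is the point of the slide.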
Ongoing Bottlenecks • Readback performance • Will be fixed “soon” • Hardware compositing? • Chipset Performance • Achieve fraction of theoretical • Need faster busses in commodity chipsets • Network Performance • Scalability • Fast is VERY expensive
Conclusions • What we still need • More vendors • More chipsets • More performance • Graphics Clusters are getting better • Chipsets • Interconnects • Form factor • Processing • Graphics Chips • Things are really starting to get interesting!