Clusters, Technology, and “All that Stuff” Philip Papadopoulos San Diego Supercomputer Center 31 August 2000
Motivation for COTS Clusters Gigabit networks! - Myrinet, SCI, FC-AL, Giganet, GigE, ATM, Servernet, and (soon) Infiniband • Killer micros: low-cost gigaflop processors are here for a few kilo-$$ per processor • Killer networks: gigabit network hardware and high-performance software (e.g., Fast Messages), soon at hundreds of $$ per connection • Leverage commodity hardware and software (Linux, NT); build the key technologies • Technology dislocation coming very soon!
High Performance Communication (figure: switched multigigabit, user-level access networks vs. switched 100 Mbit, OS-mediated access; labels: level of network interface support, NIC/network router latency, overhead and latency of communication, deliverable bandwidth) • High-performance communication enables programmability! • Low-latency, low-overhead, high-bandwidth cluster communication • … much more is needed … • Usability issues, I/O, reliability, availability • Remote process debugging/monitoring • Techniques for scalable cluster management
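To make the latency, overhead, and bandwidth terms above concrete, here is a minimal MPI ping-pong sketch of the kind typically used to measure them. This is not SDSC's benchmark code; it assumes a generic MPI installation (e.g., MPICH over GM) and exactly two ranks, and the message size is a parameter you choose (small messages expose latency, large ones expose deliverable bandwidth).

/* Minimal ping-pong sketch (not SDSC's benchmark): rank 0 and rank 1
 * bounce a message back and forth; we report half the round-trip time
 * and the implied bandwidth.  Run with two ranks, e.g.
 *   mpirun -np 2 ./pingpong 8        (small message: latency)
 *   mpirun -np 2 ./pingpong 1048576  (large message: bandwidth) */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size, i, iters = 1000;
    int nbytes = (argc > 1) ? atoi(argv[1]) : 8;
    char *buf;
    double t0, t1;
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) { MPI_Finalize(); return 1; }
    buf = malloc(nbytes);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &st);
        } else if (rank == 1) {
            MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
            MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0) {
        double one_way = (t1 - t0) / iters / 2.0;        /* seconds */
        printf("one-way time: %.2f us\n", one_way * 1e6);
        printf("bandwidth   : %.1f MB/s\n", nbytes / one_way / 1e6);
    }
    free(buf);
    MPI_Finalize();
    return 0;
}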
SDSC’s First HPC Cluster (diagram: front and back racks with Myrinet, Fast Ethernet, and power connections) • 16 compute nodes (32 processors; 25 Gflops, 20 Gbit/s bisection bandwidth) • 2 front-end servers • 11 GB total memory • 216 GB total disk (72 GB on the front ends)
FY00 Cluster Hardware (August Deployment) • SDSC Cluster: 90 IA-32 2-way Nodes • Keck 1 Cluster: 16 IA-32 2-way Nodes • Keck 2 Cluster: 38 IA-32 2-way Nodes • System Stakeholders: NBCR, SDSC, Keck, Onuchic • Vendors: • IBM and Compaq nodes @ each site • Compaq Servernet II interconnect at Keck sites • Myrinet 2000 (Myricom, Inc.) interconnect at SDSC
FY01 SDSC Cluster Hardware Plans • Additional 104 nodes via loaners and donations expected • Additional stakeholders possible (e.g., SIO) • Goal: Get to 256 nodes
Key Vendors • IBM • Best node pricing • Donations • Strong relationship • Compaq • Best packaging • Best for Keck centers • Equipment loans • Offers evaluation of another interconnect • Myricom • Myrinet 2000: >200 MB/s inter-node bandwidth
Application Collaborations at UCSD • Keck 1 (Ellisman, 3D Electron Microscope Tomography) • Keck 2 (McCammon, Onuchic, Computational Chemistry) • SIO (Stammer, Global Ocean Modeling) • NBCR (Baldridge, National Biomedical Computational Resource) • Goals: • Assistance, consulting, software packaging, troubleshooting, federation of clusters on campus with high-speed network (our own version of chaos)
Keck Sites • Working with two key sites to work out kinks in transferring management/infrastructure technology to application groups • Start of a Campus Grid for Computing • Microscopy Lab (Ellisman) • 32-way (64 Processor) Cluster by Jan ‘01 • Servernet II interconnect • 3D Tomography • Computational Chemistry Lab • 64-way (128 Processor) Cluster by Jan ‘01 • Servernet II Interconnect
HPC Machine History • The 1980’s – Decade of Vector Supers • Many New Vendors • The 1990’s – Decade of MPPs • Many Vendors Lost (Dead Supercomputer Society) • The 2000’s – Decade of Clusters • End-users as “vendors”? • The 2010’s – The Grid • Harnessing Chaos (or perhaps just Chaos)
HPC Reality: Losing Software Capability • As HPC has gone through technological shifts, we have lost capability in software • HPC has ever-decreasing influence on computing trends • AOL, Amazon, Exodus, …, WalMart, Yahoo, and ZDNet are where the $$ are • The challenge is to leverage commodity trends with a community-maintained HPC software base (“The Open Source Buzz”)
Technological Shifts Coming Now (or Soon) • Memory bandwidth of COTS systems • 4 – 8X increase this year (RIMM, double-clocked SDRAM) • Increased I/O performance • 4X improvement today (64-bit/66 MHz) • 10X (PCI-X) within 12 months • Increased network performance / decrease in $$ • 1X Infiniband (2.5 Gbit/s) – hardware convergence • Intel is designing motherboards with multiple I/O buses and on-board Infiniband • 64-bit integer performance everywhere
Taking Stock • Clusters are Proven Computational Engines (Many existence proofs) • Upcoming technology dislocation makes them very attractive at multiple scales • Today’s Vendors care about HPC issues, but the economic realities make it harder and harder to fully support our unique software stack. • Can they be turned into general-purpose, highly-scalable, maintainable, production-quality machines? • YES! (But there is work to do)
Cluster Interconnect Today • Myrinet (Myricom) • 1.28 Gb/s, full duplex • 18 us latency • 145 MB/s bandwidth • $1500 / port • Servernet II (Compaq)
Cluster Interconnect Tomorrow • Myrinet (Myricom) • 2.0 Gb/s, full duplex • 9 us latency • 250 MB/s bandwidth • $ ??? / port • Available: today • Infiniband
Cluster Compute Node Tomorrow (diagram: 1.6 GHz CPU; system bus 64-bit @ 400 MHz = 3.2 GB/s; memory 2 channels × 16-bit @ 800 MHz = 3.2 GB/s; PCI-X 64-bit @ 133 MHz = 1.06 GB/s) • In the next 9 months, every speed and feed gets at least a 2x bump!
Commodity CPU – Pentium 3 • 0.8 Gflops (Peak) • 1 Flop / cycle @ 800 MHz • 25.6 GB/s L2 cache feed • 800 MHz * 256-bit • 1.06 GB/s Memory-I/O bus • 133 MHz * 64-bit
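The cache and bus feeds quoted on this and the following CPU slides all follow from the same arithmetic: peak bytes/s = (bus width in bits / 8) × clock rate. Here is a small sketch that reproduces the quoted numbers; the inputs are taken from the slides, and this is only a cross-check, not a vendor specification.

#include <stdio.h>

/* Peak bus bandwidth in GB/s from width (bits) and clock (MHz). */
static double feed_gb(double bits, double mhz)
{
    return bits / 8.0 * mhz * 1e6 / 1e9;
}

int main(void)
{
    printf("P3 L2 feed       %5.1f GB/s\n", feed_gb(256,  800));  /* 25.6  */
    printf("P3 memory/I/O    %5.2f GB/s\n", feed_gb( 64,  133));  /*  1.06 */
    printf("P4 L2 feed       %5.1f GB/s\n", feed_gb(256, 1400));  /* ~44   */
    printf("P4 system bus    %5.1f GB/s\n", feed_gb( 64,  400));  /*  3.2  */
    printf("Power3 L2 feed   %5.1f GB/s\n", feed_gb(256,  222));  /*  7.1  */
    return 0;
}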
Commodity CPU – Pentium 4 (block diagram: 44 GB/s L2 cache and control, 256-bit @ 1.4 GHz; 3.2 GB/s system interface, 64-bit @ 400 MHz; trace cache, decoder, BTBs and TLBs, integer ALUs/AGUs, L1 D-cache, FP/MMX/SSE units)
Commodity CPU – Pentium 4 • 2.8 Gflops • 2 Flops / cycle @ 1.4 GHz • 128-bit vector registers (Streaming SIMD Extensions) • Can apply operations to two 64-bit floating-point values per clock • 44 GB/s L2 cache feed • 3.2 GB/s Memory-I/O bus
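As an illustration of the "two 64-bit values per clock" point, here is a hedged sketch of the packed double-precision style of operation SSE2 provides, assuming a compiler that ships <emmintrin.h>; it is not Intel reference code and says nothing about the P4's actual instruction scheduling.

/* Packed double-precision sketch: one instruction operates on two
 * 64-bit floating-point values.  Compile with SSE2 enabled
 * (e.g. gcc -msse2). */
#include <emmintrin.h>
#include <stdio.h>

int main(void)
{
    double a[2] = {1.0, 2.0}, b[2] = {3.0, 4.0}, c[2];

    __m128d va = _mm_loadu_pd(a);      /* pack two doubles into one register */
    __m128d vb = _mm_loadu_pd(b);
    __m128d vc = _mm_add_pd(va, vb);   /* two adds in a single instruction   */
    _mm_storeu_pd(c, vc);

    printf("%f %f\n", c[0], c[1]);     /* prints 4.0 6.0 */
    return 0;
}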
Looking Forward • Historical CAGRs • Pentium Core: 41% CAGR • P6 Core (Pentium Pro, Pentium II, Pentium III): 49% CAGR • 1.9 GHz clock in 2H01 • 3.8 Gflops • L2 cache feed of 60 GB/s!
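The 2H01 numbers follow from compounding the quoted P6-family CAGR. Below is a small sketch of that arithmetic; the 1.4 GHz starting point comes from the previous slides, while the roughly nine-month horizon (and applying the historical rate to the new core) are assumptions for illustration.

/* Project clock, peak flops, and L2 feed by compounding a CAGR.
 * Link with -lm. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double base_ghz = 1.4;    /* Pentium 4 at introduction (2H00)      */
    double cagr     = 0.49;   /* historical P6-family clock CAGR       */
    double years    = 0.75;   /* assumed horizon to 2H01               */
    double clock    = base_ghz * pow(1.0 + cagr, years);

    printf("projected clock : %.2f GHz\n", clock);                /* ~1.9 */
    printf("peak flops      : %.1f Gflops\n", 2.0 * clock);       /* ~3.8 */
    printf("L2 cache feed   : %.0f GB/s\n", 256.0 / 8.0 * clock); /* ~60  */
    return 0;
}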
Power3 • Current CPU used in Blue Horizon (222 MHz) • 4 Flops/cycle Peak • Fused Multiply-Add • 888 MFlops for BH • L2 Cache Feed: • 7.1 GB/s • 256-bit @ 222MHz • Memory-I/O Bus Feed: • 1.8 GB/s • 128-bit @ 111MHz
Power3 @ 375 MHz • 1.5 GFlops (Peak) • 4 Flops / cycle @ 375MHz • 8 GB/s L2 Cache Feed: • 256-bit @ 250MHz • Memory-I/O Bus Feed: • 1.5 GB/s • 128-bit @ 93.75MHz
Power4 • Chip Multiprocessor (CMP) • 4.0 GFlop / CPU (Peak) • 50 GB/s / CPU L2 cache feed • 2.5 GB/s / CPU memory bus feed • Numbers in the figure are aggregate • 10 GB/s / CPU in 8-way configuration • 5 GB/s / CPU I/O feed • Available 2H01
CPU Summary (comparison table of the CPUs above; * GB/s per CPU)
Commodity Benefits • Ride the commodity performance curve
Commodity Benefits • Ride the commodity RDRAM price curve
Some Small NPB Results (chart: some benchmarks run faster on Blue Horizon, others faster on the cluster)
Some Realities (and Advantages)? • More Heterogeneity • Node performance • Node architecture • System can be designed with different resources in different partitions • Large Memory • Large Disk • Bandwidth • Staged acquisitions can take advantage of commodity trends
Some Deep Dark Secrets • Clusters are phenomenal price/performance computational engines … • Can be hard to manage without experience • High-performance I/O is still unsolved • The effort of finding out where something has failed grows at least linearly with cluster size • Not cost-effective if every cluster “burns” a person just for care and feeding • We’re working to change that…
It’s all in the software … … and the management
Cluster Projects have focused on high-performance messaging • BIP (Basic Interface for Parallelism) [Linux] • M-VIA – Berkeley Lab Modular VIA project • Active Messages – Berkeley NOW/Millennium • GM – from Myricom • General purpose (what we use on our Linux cluster) • Real World Computing Partnership – Japanese consortium • U-Net – Cornell • High performance over ATM and Fast Ethernet • HPVM – Fast Messages and NT
We’re starting with infrastructure • Concentrating on management, deployment, scaling, automation, security • Complete cluster install of 16 nodes in under 10 minutes • Reinstallation (unattended) in under 7 minutes • Easy to maintain software consistency • If there's a question about what's on a machine, don't think about it, reinstall it! • Easy to upgrade • Add new packages to the system configuration (stored on a front-end machine) then reinstall all the machines • Automatic scheduling of upgrades • can schedule upgrades through a batch system or periodic script
Working with the MPI-GM Model: Usher/Patron • Sender transmits to a hostname & port number • Receiver de-multiplexes by port number • Port numbers assigned from a single configuration file • Consequences • Port numbers must be agreed upon a priori • “Guaranteed” collision of port numbers when multiple jobs are run • Usher/Patron (developed by Katz) removes the need for a centralized database for port assignment • Uses an RPC-based reservation/claim system on each node to dynamically assign port numbers to applications • Timeouts allow for recovery of allocated ports
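To illustrate the reserve/claim/timeout idea, here is a hypothetical stand-alone sketch of a per-node port table. All names, the port range, and the TTL are invented for illustration; the real Usher/Patron system works over RPC rather than as an in-process table.

/* Illustrative per-node reservation/claim table with timeouts, in the
 * spirit of the Usher/Patron scheme described above (not its actual code). */
#include <stdio.h>
#include <time.h>

#define NPORTS      64
#define BASE_PORT   8000
#define RESERVE_TTL 30          /* seconds before an unclaimed port expires */

enum state { FREE, RESERVED, CLAIMED };

struct port_entry {
    enum state st;
    time_t     reserved_at;
};

static struct port_entry table[NPORTS];

/* Hand out the first free port; the caller must claim it before the TTL expires. */
int reserve_port(void)
{
    for (int i = 0; i < NPORTS; i++) {
        if (table[i].st == RESERVED &&
            time(NULL) - table[i].reserved_at > RESERVE_TTL)
            table[i].st = FREE;                 /* reclaim a stale reservation */
        if (table[i].st == FREE) {
            table[i].st = RESERVED;
            table[i].reserved_at = time(NULL);
            return BASE_PORT + i;
        }
    }
    return -1;                                  /* no ports available */
}

/* The application claims the port when it actually starts using it. */
int claim_port(int port)
{
    int i = port - BASE_PORT;
    if (i < 0 || i >= NPORTS || table[i].st != RESERVED)
        return -1;
    table[i].st = CLAIMED;
    return 0;
}

/* Release the port when the job exits. */
void vacate_port(int port)
{
    int i = port - BASE_PORT;
    if (i >= 0 && i < NPORTS)
        table[i].st = FREE;
}

int main(void)
{
    int p = reserve_port();
    printf("reserved port %d\n", p);
    claim_port(p);
    vacate_port(p);
    return 0;
}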
Job launch • MPI-Launch • Handles reserve/claim/vacate system • Starts jobs with SSH • Runs the first node in the foreground • Interactive node (MPI Rank 0) • Runs subsequent nodes in the background • Non-interactive nodes • Multiple cluster-wide jobs now work • A secure and scalable replacement for mpirun
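Below is a minimal sketch of the launch pattern just described: rank 0 runs over ssh in the foreground so it stays interactive, and the remaining ranks run in the background. The node names and application path are placeholders, and this omits the reserve/claim/vacate steps that the actual MPI-Launch tool handles.

/* Sketch of foreground/background job launch over ssh (not MPI-Launch itself). */
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    const char *nodes[] = { "node0", "node1", "node2", "node3" };  /* placeholders */
    const char *app     = "/home/user/a.out";                      /* hypothetical MPI binary */
    int nnodes = 4;

    /* Start ranks 1..n-1 in the background. */
    for (int i = 1; i < nnodes; i++) {
        if (fork() == 0) {
            execlp("ssh", "ssh", nodes[i], app, (char *)NULL);
            perror("execlp");
            _exit(1);
        }
    }

    /* Rank 0 runs in the foreground so stdin/stdout remain interactive. */
    pid_t rank0 = fork();
    if (rank0 == 0) {
        execlp("ssh", "ssh", nodes[0], app, (char *)NULL);
        perror("execlp");
        _exit(1);
    }
    waitpid(rank0, NULL, 0);       /* wait for the interactive rank first */
    while (wait(NULL) > 0)         /* then reap the background ranks      */
        ;
    return 0;
}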
Other software … • Portland Group F77/F90 Compiler • Portable Batch System • Trivial configuration at the moment • Rudimentary up/down health monitoring • GM hardware/software on each node for the high-performance network • MPI (GM-aware version of MPICH 1.2) • Globus 1.1.3 – integration with the batch system not done yet • Public Key Certificate Client (and server) (Link, Schroeder) • Standard Red Hat Linux 6.2 on each node
Conclusions • Clustering is actively being pursued at SDSC • Becoming a gathering point of technology for NPACI • Aggressively attacking issues for ultra-scale • Actively transferring technology to make clusters more available to application groups • Working to build a Campus Grid of Clusters • Want to harness expertise at SDSC to • Define the production system environment • Build/port/deploy the needed infrastructure components • Web site coming … but not ready yet • New mailing list: clusters@sdsc.edu