High performance cluster technology: the HPVM experience — Mario Lauria, Dept. of Computer and Information Science, The Ohio State University. Summer Institute on Advanced Computation, Wright State University, August 20-23, 2000
Thank You! • My thanks to the organizers of SAIC 2000 for the invitation • It is an honor and privilege to be here today
Acknowledgements • HPVM is a project of the Concurrent Systems Architecture Group - CSAG (formerly UIUC Dept. of Computer Science, now UCSD Dept. of Computer Sci. & Eng.) • Andrew Chien (Faculty) • Phil Papadopoulos (Research faculty) • Greg Bruno, Mason Katz, Caroline Papadopoulos (Research Staff) • Scott Pakin, Louis Giannini, Kay Connelly, Matt Buchanan, Sudha Krishnamurthy, Geetanjali Sampemane, Luis Rivera, Oolan Zimmer, Xin Liu, Ju Wang (Graduate Students) • NT Supercluster: collaboration with NCSA Leading Edge Site • Robert Pennington (Technical Program Manager) • Mike Showerman, Qian Liu (Systems Programmers) • Qian Liu, Avneesh Pant (Systems Engineers)
Outline • The software/hardware interface (FM 1.1) • The layer-to-layer interface (MPI-FM and FM 2.0) • A production-grade cluster (NT Supercluster) • Current status and projects (Storage Server)
Motivation for cluster technology • Gigabit/sec networks: Myrinet, SCI, FC-AL, Giganet, Gigabit Ethernet, ATM • Killer micros: low-cost gigaflop processors available for a few thousand dollars per processor • Killer networks: gigabit network hardware and high performance software (e.g. Fast Messages), soon at a few hundred dollars per connection • Leverage commodity hardware and software (Windows NT) to build the key technologies • high performance computing in a RICH and ESTABLISHED software environment
Ideal Model: HPVMs • [Diagram: application programs see a "Virtual Machine Interface" that hides the actual system configuration] • HPVM = High Performance Virtual Machine • Provides a simple uniform programming model, abstracts and encapsulates underlying resource complexity • Simplifies use of complex resources
HPVM = Cluster Supercomputers • High Performance Virtual Machine (HPVM) • Standard APIs hiding network topology and non-standard communication software • Turnkey supercomputing clusters • high performance communication, convenient use, coordinated resource management • Windows NT and Linux, provides front-end queueing & management (LSF integrated) • [Software stack: PGI HPF, MPI, Put/Get, and Global Arrays layered over Fast Messages, over Myrinet and Sockets] • HPVM 1.0 released Aug 19, 1997
Motivation for new communication software • "Killer networks" have arrived ... • Gigabit links, moderate cost (dropping fast), low latency routers • … however, network software only delivers the network's performance for large messages • [Graph annotation: 1 Gbit network (Ethernet, Myrinet), 125 µs overhead, N1/2 = 15 KB]
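To make the N1/2 figure above concrete, here is a minimal sketch (not from the original slides) assuming the usual linear cost model T(n) = overhead + n/bandwidth, under which the half-power message size is N1/2 = overhead x bandwidth; with the 125 µs overhead and 125 MB/s (1 Gbit/s) quoted on the slide it reproduces the ~15 KB figure.

    /* Half-power point (N1/2) under a linear cost model T(n) = o + n/BW.
     * A message of size N1/2 achieves half of the peak bandwidth, so
     * N1/2 = o * BW.  The values below are the ones quoted on the slide. */
    #include <stdio.h>

    int main(void) {
        double overhead_s = 125e-6;   /* 125 us software overhead per message */
        double peak_bw    = 125e6;    /* 1 Gbit/s link = 125 MB/s             */
        double n_half     = overhead_s * peak_bw;
        printf("N1/2 = %.1f KB\n", n_half / 1024.0);   /* ~15.3 KB */
        return 0;
    }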
Motivation (cont.) • Problem: most messages are small • Message size studies: < 576 bytes [Gusella90]; 86-99% < 200 B [Kay & Pasquale]; 300-400 B average size [U. Buffalo monitors] • => Most messages/applications see little performance improvement; overhead is the key (LogP studies, Culler et al.) • Communication is an enabling technology; how do we fulfill its promise?
Fast Messages Project Goals • Explore network architecture issues to enable delivery of underlying hardware performance (bandwidth, latency) • Delivering performance means: • considering realistic packet size distributions • measuring performance at the application level • Approach: • minimize communication overhead • Hardware/software, multilayer integrated approach
Getting performance is hard! • Slow Myrinet NIC processor (~5 MIPS) • Early I/O bus (Sun's Sbus) not optimized for small transfers • 24 MB/s bandwidth with PIO, 45 MB/s with DMA
Simple Buffering and Flow Control • [Chart: bandwidth (MB/s) vs. message size (16-512 bytes) for PIO, buffer management, and flow control] • Dramatically simplified buffering scheme, still performance critical • Basic buffering + flow control can be implemented at acceptable cost • Integration between NIC and host critical to provide services efficiently • critical issues: division of labor, bus management, NIC-host interaction
FM 1.x Performance (6/95) • [Chart: bandwidth (MB/s) vs. message size (16-2048 bytes), FM vs. 1 Gb Ethernet] • Latency 14 µs, peak bandwidth 21.4 MB/s [Pakin, Lauria et al., Supercomputing '95] • Hardware limits PIO performance, but N1/2 = 54 bytes • Delivers 17.5 MB/s @ 128-byte messages (140 Mbit/s, greater than deliverable OC-3 ATM)
Illinois Fast Messages 1.x • API: Berkeley Active Messages • Key distinctions: guarantees (reliable, in-order, flow control), network-processor decoupling (DMA region) • Focus on short-packet performance • Programmed I/O (PIO) instead of DMA • Simple buffering and flow control • user-space communication • Sender: FM_send(NodeID, Handler, Buffer, size); // handlers are remote procedures • Receiver: FM_extract()
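The following is an illustration-only sketch of how the two calls named on the slide fit together; the prototypes and the handler signature below are assumptions made for this sketch, not the real FM 1.1 declarations.

    /* Minimal sender/receiver pair in the FM 1.x style described above.
     * FM_send() and FM_extract() are the calls named on the slide; the
     * declarations below are assumed for illustration. */
    #include <stdio.h>

    typedef void (*fm_handler_t)(void *buf, int size);                 /* assumed */
    void FM_send(int node, fm_handler_t handler, void *buf, int size); /* assumed */
    void FM_extract(void);                                             /* assumed */

    #define PEER_NODE 1

    /* Handler: invoked on the receiving node when the message arrives
     * ("handlers are remote procedures"). */
    static void ping_handler(void *buf, int size) {
        printf("received %d bytes: %s\n", size, (char *)buf);
    }

    static void sender(void) {
        char msg[] = "hello";
        FM_send(PEER_NODE, ping_handler, msg, sizeof msg);   /* PIO send */
    }

    static void receiver(void) {
        for (;;)
            FM_extract();   /* drain the network, dispatching each
                               pending message to its handler */
    }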
The FM layering efficiency issue • How good is the FM 1.1 API? • Test: build a user-level library on top of it and measure the available performance • MPI chosen as representative user-level library • port of MPICH (ANL/MSU) to FM • Purpose: study which services are important when layering communication libraries • integration issues: what kinds of inefficiencies arise at the interface, and what is needed to reduce them [Lauria & Chien, JPDC 1997]
MPI on FM 1.x • [Chart: bandwidth (MB/s) vs. message size (16-2048 bytes), FM vs. MPI-FM] • First implementation of MPI on FM was ready in Fall 1995 • Disappointing performance: only a fraction of FM bandwidth available to MPI applications
MPI-FM Efficiency • [Chart: % efficiency vs. message size (16-2048 bytes)] • Result: FM fast, but its interface not efficient
MPI-FM layering inefficiencies • [Diagram: MPI source buffer plus header copied into an FM buffer on send, then copied again into the destination buffer on receive] • Too many copies due to header attachment/removal and lack of coordination between transport and application layers
The new FM 2.x API • Sending: • FM_begin_message(NodeID, Handler, size), FM_end_message() • FM_send_piece(stream, buffer, size) // gather • Receiving: • FM_receive(buffer, size) // scatter • FM_extract(total_bytes) // receiver flow control • Implementation based on a lightweight thread for each message received
MPI-FM 2.x improved layering • [Diagram: MPI header and source buffer gathered directly into the FM stream on send; header and payload scattered directly into the destination buffer on receive] • Gather-scatter interface + handler multithreading enables efficient layering and data manipulation without copies
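As an illustration of the copy-free layering just described, here is a sketch of how a library such as MPI-FM could use the FM 2.x gather/scatter calls listed two slides above. The FM_stream type, the prototypes, the handler signature, the header layout, and lookup_user_buffer() are all assumptions for this sketch, not the real FM 2.x or MPI-FM code.

    #include <stddef.h>

    typedef struct FM_stream FM_stream;                               /* assumed */
    typedef void (*fm_handler_t)(void);                               /* assumed */
    FM_stream *FM_begin_message(int node, fm_handler_t h, size_t n);  /* assumed */
    void FM_send_piece(FM_stream *s, const void *buf, size_t n);      /* assumed */
    void FM_end_message(FM_stream *s);                                /* assumed */
    void FM_receive(void *buf, size_t n);                             /* assumed */

    typedef struct { int tag; int len; } msg_header;  /* hypothetical library header */

    static char scratch[4096];                        /* stand-in destination buffer */
    static void *lookup_user_buffer(int tag) { (void)tag; return scratch; }

    void recv_handler(void);   /* handler registered with FM (forward decl.) */

    /* Sender: gather the library header and the user's buffer directly into
     * one FM message -- no intermediate staging copy. */
    void send_with_header(int dest, const void *payload, int len) {
        msg_header hdr = { 42, len };
        FM_stream *s = FM_begin_message(dest, recv_handler, sizeof hdr + (size_t)len);
        FM_send_piece(s, &hdr, sizeof hdr);      /* piece 1: header    */
        FM_send_piece(s, payload, (size_t)len);  /* piece 2: user data */
        FM_end_message(s);
    }

    /* Receiver handler: look at the header first, then scatter the payload
     * straight into the destination buffer it names. */
    void recv_handler(void) {
        msg_header hdr;
        FM_receive(&hdr, sizeof hdr);               /* piece 1 */
        void *dst = lookup_user_buffer(hdr.tag);
        FM_receive(dst, (size_t)hdr.len);           /* piece 2, no extra copy */
    }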
MPI on FM 2.x • [Chart: bandwidth (MB/s) vs. message size (4 bytes - 64 KB), FM vs. MPI-FM] • MPI-FM: 91 MB/s, 13 µs latency, ~4 µs overhead • Short messages much better than IBM SP2; PCI limited • Latency ~ SGI O2K
MPI-FM 2.x Efficiency • [Chart: % efficiency vs. message size (4 bytes - 64 KB)] • High transfer efficiency, approaches 100% [Lauria, Pakin et al., HPDC-7 '98] • Other systems much lower even at 1 KB (100 Mbit: 40%, 1 Gbit: 5%)
MPI-FM at work: the NCSA NT Supercluster • 192 Pentium II, April 1998, 77 Gflops • 3-level fat tree (large switches), scalable bandwidth, modular extensibility • 256 Pentium II and III, June 1999, 110 Gflops (UIUC), w/ NCSA • 512 x Merced, early 2001, teraflop performance (@ NCSA) • [Photos: the 77 GF system (April 1998) and the 110 GF system (June '99)]
The NT Supercluster at NCSA • Andrew Chien, CS UIUC --> UCSD • Rob Pennington, NCSA • Myrinet network, HPVM, Fast Messages • Microsoft NT OS, MPI API, etc. • 192 Hewlett-Packard nodes, 300 MHz • 64 Compaq nodes, 333 MHz
HPVM III
MPI applications on the NT Supercluster • Zeus-MP (192P, Mike Norman) • ISIS++ (192P, Robert Clay) • ASPCG (128P, Danesh Tafti) • Cactus (128P, Paul Walker/John Shalf/Ed Seidel) • QMC (128P, Lubos Mitas) • Boeing CFD Test Codes (128P, David Levine) • Others (no graphs): • SPRNG (Ashok Srinivasan), Gamess, MOPAC (John McKelvey), freeHEP (Doug Toussaint), AIPS++ (Dick Crutcher), Amber (Balaji Veeraraghavan), Delphi/Delco Codes, Parallel Sorting => No code retuning required (generally) after recompiling with MPI-FM
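For illustration, a minimal standard MPI program of the kind the slide refers to: nothing in the source is FM-specific, so the same code can be recompiled against MPI-FM (or any other MPI implementation) without retuning. The program itself is a trivial example, not one of the applications listed above.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double local = (double)rank, sum = 0.0;
        /* Collective call; an MPI-FM build maps this onto Fast Messages. */
        MPI_Reduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum of ranks over %d processes = %g\n", size, sum);
        MPI_Finalize();
        return 0;
    }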
Solving 2D Navier-Stokes Kernel - Performance of Scalable Systems • Preconditioned conjugate gradient method with multi-level additive Schwarz Richardson pre-conditioner (2D, 1024x1024) • Danesh Tafti, Rob Pennington (NCSA); Andrew Chien (UIUC, UCSD)
NCSA NT Supercluster Solving Navier-Stokes Kernel • Single processor performance: MIPS R10k 117 MFLOPS, Intel Pentium II 80 MFLOPS • Preconditioned conjugate gradient method with multi-level additive Schwarz Richardson pre-conditioner (2D, 1024x1024) • Danesh Tafti, Rob Pennington, Andrew Chien, NCSA
Solving 2D Navier-Stokes Kernel (cont.) • Excellent scaling to 128P; single precision ~25% faster • Preconditioned conjugate gradient method with multi-level additive Schwarz Richardson pre-conditioner (2D, 4094x4094) • Danesh Tafti, Rob Pennington (NCSA); Andrew Chien (UIUC, UCSD)
Near Perfect Scaling of Cactus - 3D Dynamic Solver for the Einstein GR Equations • Cactus was developed by Paul Walker, MPI-Potsdam, UIUC, NCSA • Ratio of GFLOPS: Origin = 2.5x NT SC • Paul Walker, John Shalf, Rob Pennington, Andrew Chien, NCSA
Quantum Monte Carlo - Origin and HPVM Cluster • Origin is about 1.7x faster than NT SC • T. Torelli (UIUC CS), L. Mitas (NCSA, Alliance Nanomaterials Team)
Supercomputer Performance Characteristics

    System                  Mflops/Proc   Flops/Byte   Flops/Network RT
    Cray T3E                1200          ~2           ~2,500
    SGI Origin2000          500           ~0.5         ~1,000
    HPVM NT Supercluster    300           ~3.2         ~6,000
    Berkeley NOW II         100           ~3.2         ~2,000
    IBM SP2                 550           ~3.7         ~38,000
    Beowulf (100Mbit)       300           ~25          ~500,000

• Compute/communicate and compute/latency ratios • Clusters can provide programmable characteristics at a dramatically lower system cost
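A back-of-the-envelope check (not from the slides) of the HPVM row, under the assumption that Flops/Byte is per-processor Mflops divided by per-processor MPI bandwidth and Flops/Network RT is Mflops times the round-trip latency; the bandwidth and latency figures plugged in are the MPI-FM numbers quoted elsewhere in the talk.

    #include <stdio.h>

    int main(void) {
        double flops_per_s = 300e6;   /* per-processor rate used in the table */
        double bw          = 91e6;    /* MPI-FM bandwidth, bytes/s            */
        double rt_latency  = 20e-6;   /* ~2 x 10 us one-way MPI latency       */
        printf("flops/byte       ~= %.1f\n", flops_per_s / bw);          /* ~3.3  */
        printf("flops/network RT ~= %.0f\n", flops_per_s * rt_latency);  /* ~6000 */
        return 0;
    }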
HPVM today: HPVM 1.9 • [Software stack: SHMEM, Global Arrays, MPI, and BSP layered over Fast Messages, over Myrinet or VIA or shared memory (SMP)] • Added support for: • Shared memory • VIA interconnect • New API: • BSP
Show me the numbers! • Basics • Myrinet • FM: 100+ MB/s, 8.6 µs latency • MPI: 91 MB/s @ 64K, 9.6 µs latency • Approximately 10% overhead • Giganet • FM: 81 MB/s, 14.7 µs latency • MPI: 77 MB/s, 18.6 µs latency • 5% bandwidth overhead, 26% latency overhead! • Shared memory transport • FM: 195 MB/s, 3.13 µs latency • MPI: 85 MB/s, 5.75 µs latency
Bandwidth Graphs • FM bandwidth usually a good indicator of deliverable bandwidth • High BW attained for small messages • N1/2 ~ 512 bytes
Other HPVM-related projects • Approximately three hundred groups have downloaded HPVM 1.2 at last count • Some interesting research projects: • Low-level support for collective communication, OSU • FM with multicast (FM-MC), Vrije Universiteit, Amsterdam • Video-on-demand server, Univ. of Naples • Together with AM, U-Net and VMMC, FM has been the inspiration for the VIA industry standard by Intel, Compaq, IBM • Latest release of HPVM is available from http://www-csag.ucsd.edu
Current project: an HPVM-based Terabyte Storage Server • High performance parallel architectures increasingly associated with data-intensive applications: • NPACI large-dataset applications requiring hundreds of GB: • Digital Sky Survey, brain wave analysis • digital data repositories, web indexing, multimedia servers: • Microsoft TerraServer, AltaVista, RealPlayer/Windows Media servers (Audionet, CNN), streamed audio/video • genomic and proteomic research • large centralized data banks (GenBank, SwissProt, PDB, …) • Commercial terabyte systems (StorageTek, EMC) have price tags in the M$ range
The HPVM approach to a Terabyte Storage Server • Exploit commodity PC technologies to build a large (2 TB) and smart (50 Gflops) storage server • benefits: inexpensive PC disks, modern I/O bus • The cluster advantage: • 10 µs communication latency vs. 10 ms disk access latency provides the opportunity for data declustering, redistribution, and aggregation of I/O bandwidth • distributed buffering, data processing capability • scalable architecture • Integration issues: • efficient data declustering, I/O bus bandwidth allocation, remote/local programming interface, external connectivity
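To illustrate the data-declustering idea mentioned above, here is a minimal sketch (not HPVM's actual storage code): blocks of a logical file are striped round-robin across the cluster's disks so that accesses aggregate the per-disk bandwidth. The disk count matches the 64-disk configuration on the next slide; the block size is an assumption.

    #include <stdio.h>

    #define NUM_DISKS  64          /* one disk per slot in the 64-disk server */
    #define BLOCK_SIZE (64 * 1024) /* striping unit, bytes (assumed)          */

    /* Map a logical file offset to (disk, offset within that disk). */
    static void decluster(long long logical_off, int *disk, long long *local_off) {
        long long block = logical_off / BLOCK_SIZE;
        *disk      = (int)(block % NUM_DISKS);               /* round-robin */
        *local_off = (block / NUM_DISKS) * BLOCK_SIZE
                   + logical_off % BLOCK_SIZE;
    }

    int main(void) {
        int disk; long long off;
        decluster(10LL * 1024 * 1024, &disk, &off);   /* byte 10 MB of the file */
        printf("10 MB maps to disk %d, local offset %lld\n", disk, off);
        return 0;
    }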
Global Picture • [Diagram: Myrinet-connected HPVM cluster at the Dept. of CSE, UCSD, linked to the San Diego Supercomputer Center over a 1 GB/s link] • 1 GB/s link between the two sites • 8 parallel Gigabit Ethernet connections • Ethernet cards installed in some of the nodes on each machine
The Hardware Highlights • Main features: • 1.6 TB = 64 * 25 GB disks = $30K (UltraATA disks) • 1 GB/s of aggregate I/O bandwidth (= 64 disks * 15 MB/s) • 45 GB RAM, 48 Gflop/s • 2.4 Gb/s Myrinet network • Challenges: • make the aggregate I/O bandwidth available to applications • balance I/O load across nodes/disks • transport of terabytes of data in and out of the cluster
The Software Components • Storage Resource Broker (SRB) used for interoperability with existing NPACI applications at SDSC • Parallel I/O library (e.g. Panda, MPI-IO) to provide high performance I/O to code running on the cluster • The HPVM suite provides support for fast communication and standard APIs on the NT cluster • [Software stack: SRB, Panda, MPI, Put/Get, and Global Arrays layered over Fast Messages, over Myrinet]
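As a concrete example of the kind of parallel I/O interface mentioned above, here is a minimal sketch using standard MPI-IO (part of MPI-2); it is not Panda- or HPVM-specific code. Each rank writes its own block of a shared file with a collective call, which lets the I/O library spread the file across the storage nodes. The file name and block size are placeholders.

    #include <mpi.h>

    #define BLOCK_DOUBLES 1024

    int main(int argc, char **argv) {
        int rank;
        double block[BLOCK_DOUBLES];
        MPI_File fh;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        for (int i = 0; i < BLOCK_DOUBLES; i++)
            block[i] = rank;                       /* dummy data */

        MPI_File_open(MPI_COMM_WORLD, "dataset.bin",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        /* Collective write: rank k writes the k-th block of the file. */
        MPI_Offset offset = (MPI_Offset)rank * BLOCK_DOUBLES * sizeof(double);
        MPI_File_write_at_all(fh, offset, block, BLOCK_DOUBLES,
                              MPI_DOUBLE, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }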
Related Work • User-level fast networking: • VIA list: AM (Fast Sockets) [Culler92, Rodrigues97], U-Net (U-Net/MM) [Eicken95, Welsh97], VMMC-2 [Li97] • RWCP PM [Tezuka96], BIP [Prylli97] • High-performance cluster-based storage: • UC Berkeley Tertiary Disks [Talagala98] • CMU network-attached devices [Gibson97], UCSB Active Disks [Acharya98] • UCLA Randomized I/O (RIO) server [Fabbrocino98] • UC Berkeley River system (Arpaci-Dusseau, unpublished) • ANL ROMIO and RIO projects (Foster, Gropp)
Conclusions • HPVM provides all the necessary tools to transform a PC cluster into a production supercomputer • Projects like HPVM demonstrate: • the level of maturity achieved so far by cluster technology with respect to conventional HPC use • a springboard for further research on new uses of the technology • Efficient component integration at several levels is key to performance: • tight coupling of the host and NIC is crucial to minimize communication overhead • software layering on top of FM has exposed the need for a client-conscious design at the interface between layers
Future Work • Moving toward a more dynamic model of computation: • dynamic process creation, interaction between computations • communication group management • long-term targets are dynamic communication and support for adaptive applications • Wide-area computing: • integration within computational grid infrastructure • LAN/WAN bridges, remote cluster connectivity • Cluster applications: • enhanced-functionality storage, scalable multimedia servers • Semi-regular network topologies