Supercomputing on Windows Clusters: Experience and Future Directions

Andrew A. Chien • CTO, Entropia, Inc. • SAIC Chair Professor, Computer Science and Engineering, UCSD • National Computational Science Alliance • Invited Talk, USENIX Windows, August 4, 2000

Overview • Critical Enabling Technologies • The Alliance’s Windows Supercluster • Design and Performance • Other Windows Cluster Efforts • Future • Terascale Clusters • Entropia

External Technology Factors

Microprocessor Performance • Micros: 10MF -> 100 MF -> 1GF -> 3GF -> 6GF (2001?) • => Memory system performance catching up (2.6 GB/s 21264 memory BW) [Figure: clock period (ns, log scale from 1 to 100) vs. year introduced, 1975-1995. Microprocessors: MIPS R2000 (125), MIPS R3000 (40), HP 7000 (15), R4000 (10), R4400 (6.7), DEC Alpha (5), X86/Alpha (1). Vector supercomputers: Cray 1S (12.5), Cray X-MP (8.5), Cray Y-MP (6), Cray C90 (4.2).] Adapted from Baskett, SGI and CSC Vanguard

Killer Networks • LAN: 10Mb/s -> 100Mb/s -> ? • SAN: 12MB/s -> 110MB/s (Gbps) -> 1100MB/s -> ? • Myricom, Compaq, Giganet, Intel,... • Network bandwidths limited by system internal memory bandwidths • Cheap and very fast communication hardware [Figure: bandwidth comparison. Ethernet: 1 MB/s; FastE: 12 MB/s; UW SCSI: 40 MB/s; GigSAN/GigE: 110 MB/s]

Rich Desktop Operating Systems Environments • Desktop (PC) operating systems now provide • richest OS functionality • best program development tools • broadest peripheral/driver support • broadest application software/ISV support [Figure: timeline of desktop OS capabilities, 1981-1999: Basic device access; Audio/Graphics; Graphical Interfaces; Networks; HD Storage; SMP support; Multiprocess Protection; Clustering, Performance, Mass store, HP networking, Management, Availability, etc.]

Critical Enabling Technologies

Critical Enabling Technologies • Cluster management and resource integration (“use like” one system) • Delivered communication performance • IP protocols inappropriate • Balanced systems • Memory bandwidth • I/O capability

The HPVM System • Goals • Enable tightly coupled and distributed clusters with high efficiency and low effort (integrated solution) • Provide usable access thru convenient standard parallel interfaces • Deliver highest possible performance and simple programming model

Delivered Communication Performance • Early 1990’s, Gigabit testbeds • 500 Mbit/s (~60MB/s) @ 1 MegaByte packets • IP protocols not for Gigabit SAN’s • Cluster Objective: High performance communication for small and large messages • Performance Balance Shift: Networks faster than I/O, memory, processor

Fast Messages Design Elements • User-level network access • Lightweight protocols • flow control, reliable delivery • tightly-coupled link, buffer, and I/O bus management • Poll-based notification • Streaming API for efficient composition • Many generations 1994-1999 • [IEEE Concurrency, 6/97] • [Supercomputing ’95, 12/95] • Related efforts: UCB AM, Cornell U-Net, RWCP PM, Princeton VMMC/Shrimp, Lyon BIP => VIA standard
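The slide names flow control and poll-based notification but does not say which flow-control scheme is used. As a rough illustration only, the sketch below shows credit-based flow control, one common lightweight scheme: the sender holds one credit per free receive buffer and never transmits without one, so the receiver cannot be overrun and no message needs to be dropped. All class and method names are hypothetical and are not the FM API.

```python
from collections import deque


class CreditReceiver:
    """Receiver with a fixed number of buffers (hypothetical sketch)."""

    def __init__(self, nbuffers):
        self.nbuffers = nbuffers
        self.queue = deque()

    def deliver(self, msg):
        # With correct credit accounting this can never overflow.
        assert len(self.queue) < self.nbuffers, "sender violated flow control"
        self.queue.append(msg)

    def poll(self, sender):
        """Poll-based notification: the application asks for pending messages."""
        drained = list(self.queue)
        self.queue.clear()
        sender.add_credits(len(drained))   # freed buffers become new credits
        return drained


class CreditSender:
    """Sender that spends one credit per message (hypothetical sketch)."""

    def __init__(self, receiver, credits):
        self.receiver = receiver
        self.credits = credits             # one credit per free receive buffer

    def try_send(self, msg):
        if self.credits == 0:
            return False                   # no buffer available: caller retries
        self.credits -= 1
        self.receiver.deliver(msg)
        return True

    def add_credits(self, n):
        self.credits += n


rx = CreditReceiver(nbuffers=2)
tx = CreditSender(rx, credits=2)
results = [tx.try_send(m) for m in ("a", "b", "c")]   # third send is refused
received = rx.poll(tx)                                 # frees both buffers
retry = tx.try_send("c")                               # now succeeds
```

The sender blocks itself rather than relying on the network to drop and retransmit, which is one way a protocol can stay lightweight while still delivering reliably.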

Improved Bandwidth • 20MB/s -> 200+ MB/s (10x) • Much of advance is software structure: API’s and implementation • Deliver *all* of the underlying hardware performance

Improved Latency • 100 μs to 2 μs overhead (50x) • Careful design to minimize overhead while maintaining throughput • Efficient event handling, fine-grained resource management and interlayer coordination • Deliver *all* of the underlying hardware performance

HPVM = Cluster Supercomputers • Turnkey Cluster Computing; Standard API’s • Network hardware and API’s increase leverage for users, achieve critical mass for system • Each involved new research challenges and provided deeper insights into the research issues • Drove continually better solutions (e.g. multi-transport integration, robust flow control and queue management) • Releases: HPVM 1.0 (8/1997); HPVM 1.2 (2/1999) - multi, dynamic, install; HPVM 1.9 (8/1999) - giganet, smp [Figure: HPVM software stack. APIs: MPI, Put/Get, Global Arrays, BSP; Scheduling & Mgmt (LSF); Performance Tools; Fast Messages; transports: Myrinet, ServerNet, Giganet VIA, SMP, WAN]

HPVM Communication Performance • Delivers underlying performance for small messages, endpoints are the limits • 100MB/s at 1K vs 60MB/s at 1000K • >1500x improvement • N1/2 ~ 400 Bytes
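N1/2 is the message size at which half of the peak bandwidth is delivered. A minimal sketch of how it relates to per-message overhead, assuming a simple linear cost model T(n) = t0 + n / r_peak (this model and the example numbers are assumptions chosen for illustration, not measurements from the talk): under that model N1/2 = t0 * r_peak.

```python
def delivered_bandwidth(n_bytes, t0_s, peak_bytes_per_s):
    """Bandwidth seen by an n-byte message under T(n) = t0 + n / peak."""
    return n_bytes / (t0_s + n_bytes / peak_bytes_per_s)


def half_power_point(t0_s, peak_bytes_per_s):
    """N1/2: the message size that achieves half of the peak bandwidth."""
    return t0_s * peak_bytes_per_s


# Illustrative inputs only: 4 microseconds fixed overhead, 100 MB/s peak.
T0 = 4e-6
PEAK = 100e6

n_half = half_power_point(T0, PEAK)
bw_at_n_half = delivered_bandwidth(n_half, T0, PEAK)
print(f"N1/2 = {n_half:.0f} bytes, bandwidth there = {bw_at_n_half / 1e6:.1f} MB/s")
```

The model makes the design pressure visible: halving the fixed overhead t0 halves N1/2, so small messages reach a given fraction of peak bandwidth sooner.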

HPVM/FM on VIA • FM Protocol/techniques portable to Giganet VIA • Slightly lower performance, comparable N1/2 • Commercial version: WSDI (stay tuned) • N1/2 ~ 400 Bytes

Unified Transfer and Notification (all transports) • Solution: Uniform notify and poll (single Q representation) • Scalability: n into k (hash); arbitrary SMP size or number of NIC cards • Key: integrate variable-sized messages; achieve single DMA transfer • no pointer-based memory management, no special synchronization primitives, no complex computation • Memory format provides atomic notification in single contiguous memory transfer (bcopy or DMA) [Figure: queue memory layout in increasing address order. Fixed Size Frames, each holding Variable Size Data followed by a Fixed Size Trailer + Length/Flag; producers are Procs and Networks]
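One way to read the memory format described above is sketched below: each fixed-size frame ends in a fixed-size trailer holding a length and a flag, the variable-size data sits directly below the trailer, and a single contiguous copy writes data and trailer together so that the flag lands last. The frame size, trailer layout, and all names here are hypothetical; this is a single-threaded simulation of the idea, not the HPVM implementation.

```python
import struct

FRAME_SIZE = 64                     # hypothetical fixed frame size (bytes)
TRAILER = struct.Struct("<HH")      # hypothetical trailer: (length, flag)
DATA_MAX = FRAME_SIZE - TRAILER.size


class FrameQueue:
    """Single queue of fixed-size frames that every transport could fill."""

    def __init__(self, nframes):
        self.buf = bytearray(nframes * FRAME_SIZE)   # zeroed: all flags clear
        self.nframes = nframes
        self.head = 0   # next frame the receiver polls
        self.tail = 0   # next frame a sender fills

    def send(self, data):
        if len(data) > DATA_MAX:
            raise ValueError("message does not fit in one frame")
        end = (self.tail + 1) * FRAME_SIZE
        _, flag = TRAILER.unpack_from(self.buf, end - TRAILER.size)
        if flag:
            return False            # frame still in use: queue is full
        # Data and trailer form one contiguous region ending at the frame
        # boundary, so one copy delivers both and the flag is written last.
        block = bytes(data) + TRAILER.pack(len(data), 1)
        self.buf[end - len(block):end] = block   # stands in for bcopy or DMA
        self.tail = (self.tail + 1) % self.nframes
        return True

    def poll(self):
        end = (self.head + 1) * FRAME_SIZE
        length, flag = TRAILER.unpack_from(self.buf, end - TRAILER.size)
        if not flag:
            return None             # nothing has arrived in this frame yet
        data = bytes(self.buf[end - TRAILER.size - length:end - TRAILER.size])
        TRAILER.pack_into(self.buf, end - TRAILER.size, 0, 0)   # free the frame
        self.head = (self.head + 1) % self.nframes
        return data


q = FrameQueue(nframes=2)
sent = [q.send(b"hi"), q.send(b"abc"), q.send(b"x")]   # third finds queue full
got = [q.poll(), q.poll(), q.poll()]                    # third finds it empty
```

Because the receiver only ever inspects the trailer at a fixed offset, polling costs the same regardless of message size or of which transport produced the frame.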

Integrated Notification Results • No polling or discontiguous access performance penalties • Uniform high performance which is stable over changes of configuration or the addition of new transports • no custom tuning for configuration required • Framework is scalable to large numbers of SMP processors and network interfaces
Transport (metric) | Single Transport | Integrated
Myrinet (latency) | 8.3 μs | 8.4 μs
Myrinet (BW) | 101 MB/s | 101 MB/s
Shared Memory (latency) | 3.4 μs | 3.5 μs
Shared Memory (BW) | 200+ MB/s | 200+ MB/s

Supercomputer Performance Characteristics (11/99)
System | MF/Proc | Flops/Byte | Flops/Network RT
Cray T3E | 1200 | ~2 | ~2,500
SGI Origin2000 | 500 | ~0.5 | ~1,000
HPVM NT Supercluster | 600 | ~8 | ~12,000
IBM SP2 (4 or 8-way) | 2.6-5.2GF | ~12-25 | ~150-300K
Beowulf (100Mbit) | 600 | ~50 | ~200,000
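The balance ratios on this slide can be inverted to see what network they imply. The sketch below assumes that Flops/Byte means per-processor flop rate divided by network bandwidth, and that Flops/Network RT means flop rate multiplied by network round-trip time; both readings are assumptions, and the derived bandwidths and round-trip times are back-calculated from the slide's figures rather than taken from the talk. The IBM SP2 row is left out because the slide gives ranges for it.

```python
# (system, MF per processor, flops per byte, flops per network round trip),
# copied from the slide.
SYSTEMS = [
    ("Cray T3E", 1200, 2, 2_500),
    ("SGI Origin2000", 500, 0.5, 1_000),
    ("HPVM NT Supercluster", 600, 8, 12_000),
    ("Beowulf (100Mbit)", 600, 50, 200_000),
]


def implied_bandwidth_mb_s(mflops, flops_per_byte):
    """(Mflop/s) / (flop/byte) = MB/s."""
    return mflops / flops_per_byte


def implied_round_trip_us(mflops, flops_per_rt):
    """flop / (Mflop/s) = microseconds."""
    return flops_per_rt / mflops


implied = {
    name: (implied_bandwidth_mb_s(mf, fpb), implied_round_trip_us(mf, frt))
    for name, mf, fpb, frt in SYSTEMS
}

for name, (bw, rt) in implied.items():
    print(f"{name}: ~{bw:.0f} MB/s, ~{rt:.0f} us round trip")
```

Under these assumptions the Beowulf row works out to about 12 MB/s, which lines up with the Fast Ethernet figure quoted earlier in the talk.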

Windows The NT Supercluster

Windows Clusters • Early prototypes in CSAG • 1/1997, 30P, 6GF • 12/1997, 64P, 20GF • Alliance’s Supercluster • 4/1998, 256P, 77GF • 6/1999, 256P*, 109GF

NCSA’s Windows Supercluster • 128 HP Kayak XU, Dual PIII 550 MHz/1GB RAM • Using NT, Myrinet Interconnect, and HPVM • #207 in Top 500 Supercomputing Sites [Figure: Engineering Fluid Flow Problem, with series labeled Origin, 550 MHz, and 300 MHz; D. Tafti, NCSA] Rob Pennington (NCSA), Andrew Chien (UCSD)

Windows Cluster System • Front-End Systems • Apps development • Job submission • Internet • File servers, LSF master • 128 GB Home, 200 GB Scratch • FTP to Mass Storage • Daily backups • LSF Batch Job Scheduler • Fast Ethernet • 128 Compute Nodes, 256 CPUs • 128 Dual 550 MHz Systems • Windows NT, Myrinet and HPVM • Infrastructure and Development Testbeds • Windows 2K and NT • 8 4p 550 + 32 2p 300 + 8 2p 333 (courtesy Rob Pennington, NCSA)

Example Application Results • MILC – QCD • Navier-Stokes Kernel • Zeus-MP – Astrophysics CFD • Large-scale Science and Engineering codes • Comparisons to SGI O2K and Linux clusters

MILC Performance [Figure: GFLOPs (0 to 12) vs. number of processors (0 to 100+) for IA-32/Win NT 300 MHz PII, 250 MHz SGI O2K, T3E 900, and IA-32/Win NT 550MHz Xeon] Src: D. Toussaint and K. Orginos, Arizona

Zeus-MP (Astrophysics CFD)

2D Navier Stokes Kernel Source: Danesh Tafti, NCSA

Applications with High Performance on Windows Supercluster • Zeus-MP (256P, Mike Norman) • ISIS++ (192P, Robert Clay) • ASPCG (256P, Danesh Tafti) • Cactus (256P, Paul Walker/John Shalf/Ed Seidel) • MILC QCD (256P, Lubos Mitas) • QMC Nanomaterials (128P, Lubos Mitas) • Boeing CFD Test Codes, CFD Overflow (128P, David Levine) • freeHEP (256P, Doug Toussaint) • ARPI3D (256P, weather code, Dan Weber) • GMIN (L. Munro in K. Jordan) • DSMC-MEMS (Ravaioli) • FUN3D with PETSc (Kaushik) • SPRNG (Srinivasan) • MOPAC (McKelvey) • Astrophysical N body codes (Bode) • => Little code retuning and quickly running ... • Parallel Sorting (Rivera – CSAG): 18.3 GB MinuteSort World Record

MinuteSort • Sort max data disk-to-disk in 1 minute • “Indy sort” • fixed size keys, special sorter, and file format • HPVM/Windows Cluster winner for 1999 (10.3GB) and 2000 (18.3GB) • Adaptation of Berkeley NOWSort code (Arpaci and Dusseau) • Commodity configuration ($$ not a metric) • PC’s, IDE disks, Windows • HPVM and 1Gbps Myrinet

MinuteSort Architecture • 32 HP Kayaks: 3Ware Controllers, 4 x 20GB IDE disks • 32 HP Netservers: 2 x 16GB SCSI disks • HPVM & 1Gbps Myrinet [Figure: Kayak and Netserver nodes connected through the Myrinet switch fabric] (Luis Rivera UIUC, Xianan Zhang UCSD)

Sort Scaling • Concurrent read/bucket-sort/communicate is bottleneck – faster I/O infrastructure required (busses and memory, not disks)
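The read/bucket-sort/communicate pipeline named above can be sketched as a range-partitioned one-pass sort: every node reads its local records, routes each record to the node that owns its key range, and each node then sorts what it received. This is a toy single-process illustration under stated assumptions (100-byte records with 10-byte keys, uniformly distributed keys, partitioning on the first key byte); it is not the NOWSort or HPVM code, and all names are hypothetical.

```python
import random

KEY_BYTES = 10        # assumed fixed key size
RECORD_BYTES = 100    # assumed fixed record size


def bucket_for(record, n_nodes):
    """Range-partition on the first key byte: low keys go to low node ids."""
    return record[0] * n_nodes // 256


def one_pass_sort(records_per_node):
    """Simulate read -> bucket -> communicate -> local sort across nodes."""
    n_nodes = len(records_per_node)
    inbox = [[] for _ in range(n_nodes)]
    for local_records in records_per_node:          # each node reads its disk
        for rec in local_records:
            inbox[bucket_for(rec, n_nodes)].append(rec)   # send to key owner
    # Every node sorts only its own key range.
    return [sorted(recs, key=lambda r: r[:KEY_BYTES]) for recs in inbox]


rng = random.Random(0)
nodes = [
    [bytes(rng.getrandbits(8) for _ in range(RECORD_BYTES)) for _ in range(50)]
    for _ in range(4)
]
sorted_nodes = one_pass_sort(nodes)
flat = [rec for node in sorted_nodes for rec in node]
```

Because the partition function is monotone in the key, concatenating the per-node outputs in node order yields the globally sorted file without a merge step. In a real run the three phases overlap, which is why the slide points at busses and memory rather than disks as the limit.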

MinuteSort Execution Time

Reliability • Gossip: “Windows platforms are not reliable” • Larger systems => intolerably low MTBF • Our Experience: “Nodes don’t crash” • Application runs of 1000s of hours • Node failure means an application failure; effectively not a problem • Hardware • Short term: Infant mortality (1 month burn-in) • Long term: ~1 hardware problem/100 machines/month • Disks, network interfaces, memory • No processor or motherboard problems
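To put the long-term hardware rate in perspective, here is a back-of-the-envelope sketch. It assumes failures are independent and arrive as a Poisson process at the quoted rate of ~1 problem per 100 machines per month, and it assumes a 30-day month; the 32-machine, 9-day job is a hypothetical example, not a run reported in the talk.

```python
import math

FAILURES_PER_MACHINE_MONTH = 1 / 100   # from the slide: ~1 per 100 machines/month
HOURS_PER_MONTH = 30 * 24              # assumption: 30-day month


def expected_failures(n_machines, hours):
    """Expected hardware problems for n machines over the given wall-clock hours."""
    return FAILURES_PER_MACHINE_MONTH * n_machines * hours / HOURS_PER_MONTH


def prob_job_hit(n_machines, hours):
    """P(at least one hardware problem), assuming independent Poisson failures."""
    return 1.0 - math.exp(-expected_failures(n_machines, hours))


# Whole 128-node cluster: mean time between hardware problems, in hours.
cluster_mtbf_hours = HOURS_PER_MONTH / expected_failures(128, HOURS_PER_MONTH)

# Hypothetical job: 32 machines for 9 days.
job_risk = prob_job_hit(32, 9 * 24)

print(f"cluster MTBF ~{cluster_mtbf_hours:.0f} h, job risk ~{job_risk:.1%}")
```

Under these assumptions the whole cluster sees a hardware problem every few weeks, while an individual multi-day job on a subset of nodes has a modest chance of being affected.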

Windows Cluster Usage • Lots of large jobs • Runs up to ~14,000 CPU-hours (64p * 9 days * 24 hours/day = 13,824)

Other Large Windows Clusters • Sandia’s Kudzu Cluster (144 procs, 550 disks, 10/98) • Cornell’s AC3 Velocity Cluster (256 procs, 8/99) • Others (sampled from vendors) • GE Research Labs (16, Scientific) • Boeing (32, Scientific) • PNNL (96, Scientific) • Sandia (32, Scientific) • NCSA (32, Scientific) • Rice University (16, Scientific) • U. of Houston (16, Scientific) • U. of Minnesota (16, Scientific) • Oil & Gas (8, Scientific) • Merrill Lynch (16, Ecommerce) • UIT (16, ASP/Ecommerce)

The AC3 Velocity • 64 Dell PowerEdge 6350 Servers • Quad Pentium III 500 MHz/2 MB Cache Processors (SMP) • 4 GB RAM/Node • 50 GB Disk (RAID 0)/Node • GigaNet Full Interconnect • 100 MB/Sec Bandwidth between any 2 Nodes • Very Low Latency • 2 Terabytes Dell PowerVault 200S Storage • 2 Dell PowerEdge 6350 Dual Processor File Servers • 4 PowerVault 200S Units/File Server • 8 36 GB/Disk Drives/PowerVault 200S • Quad Channel SCSI Raid Adapter • 180 MB/sec Sustained Throughput/Server • 2 Terabyte PowerVault 130T Tape Library • 4 DLT 7000 Tape Drives • 28 Tape Capacity • #381 in Top 500 Supercomputing Sites (courtesy David A. Lifka, Cornell TC)

Recent AC3 Additions • 8 Dell PowerEdge 2450 Servers (Serial Nodes) • Pentium III 600 MHz/512 KB Cache • 1 GB RAM/Node • 50 GB Disk (RAID 0)/Node • 7 Dell PowerEdge 2450 Servers (First All NT Based AFS Cell) • Dual Processor Pentium III 600 MHz/512 KB Cache • 1 GB RAM/Node Fileservers, 512 MB RAM/Node Database servers • 1 TB SCSI based RAID 5 Storage • Cross platform filesystem support • 64 Dell PowerEdge 2450 Servers (Protein Folding, Fracture Analysis) • Dual Processor Pentium III 733 MHz/256 KB Cache • 2 GB RAM/Node • 27 GB Disk (RAID 0)/Node • Full Giganet Interconnect • 3 Intel ES6000 & 1 ES1000 Gigabit switches • Upgrading our Server Backbone network to Gigabit Ethernet (courtesy David A. Lifka, Cornell TC)

AC3 Goals • Only commercially supported technology • Rapid spinup and spinout • Package technologies for vendors to sell as integrated systems • => All of the commercial packages were moved from SP2 to Windows; all users are back, and more! • Users: “I don’t do windows” => • “I’m agnostic about operating systems, and just focus on getting my work done.”

Protein Folding • Reaction path study of ligand diffusion in leghemoglobin. The ligand is CO (white) and it is moving from the binding site, the heme pocket, to the protein exterior. A study by Wieslaw Nowak and Ron Elber. • The cooperative motion of ion and water through the gramicidin ion channel. The effective quasi-particle that permeates through the channel includes eight water molecules and the ion. Work of Ron Elber with Bob Eisenberg, Danuta Rojewska and Duan Pin. http://www.tc.cornell.edu/reports/NIH/resource/CompBiologyTools/ (courtesy David A. Lifka, Cornell TC)

Protein Folding Per/Processor Performance
Results on different computers for α protein structures:
Machine | System | CPU | CPU speed [MHz] | Compiler | Energy evaluations per second
Blue Horizon (SP San Diego) | AIX 4 | Power3 | 222 | xlf | 44.3
Linux cluster | Linux 2.2 | PentiumIII | 650 | PGF 3.1 | 59.1
Velocity (CTC) | Win 2000 | PentiumIII Xeon | 500 | df v6.1 | 46.0
Velocity+ (CTC) | Win 2000 | PentiumIII | 733 | df v6.1 | 59.2
Results on different computers for (α/β or β proteins):
Machine | System | CPU | CPU speed [MHz] | Compiler | Energy evaluations per second
Blue Horizon (SP San Diego) | AIX 4 | Power3 | 222 | xlf | 15.0
Linux cluster | Linux 2.2 | PentiumIII | 650 | PGF 3.1 | 21.0
Velocity (CTC) | Win 2000 | PentiumIII Xeon | 500 | df v6.1 | 16.9
Velocity+ (CTC) | Win 2000 | PentiumIII | 733 | df v6.1 | 22.4
(courtesy David A. Lifka, Cornell TC)

AC3 Corporate Members • Air Products and Chemicals • Candle Corporation • Compaq Computer Corporation • Conceptual Reality Presentations • Dell Computer Corporation • Etnus, Inc. • Fluent, Inc. • Giganet, Inc. • IBM Corporation • ILOG, Inc. • Intel Corporation • KLA-Tencor Corporation • Kuck & Associates, Inc. • Lexis-Nexis • MathWorks, Inc. • Microsoft Corporation • MPI Software Technologies, Inc. • Numerical Algorithms Group • Portland Group, Inc. • Reed Elsevier, Inc. • Reliable Network Solutions, Inc. • SAS Institute, Inc. • Seattle Lab, Inc. • Visual Numerics, Inc. • Wolfram Research, Inc. (courtesy David A. Lifka, Cornell TC)

Windows Cluster Summary • Good performance • Lots of Applications • Good reliability • Reasonable Management complexity (TCO) • Future is bright; uses are proliferating!

Windows Cluster Resources • NT Supercluster, NCSA • http://www.ncsa.uiuc.edu/General/CC/ntcluster/ • http://www-csag.ucsd.edu/projects/hpvm.html • AC3 Cluster, TC • http://www.tc.cornell.edu/UserDoc/Cluster/ • University of Southampton • http://www.windowsclusters.org/ • => application and hardware/software evaluation • => many of these folks will work with you on deployment

Tools and Technologies for Building Windows Clusters • Communication Hardware • Myrinet, http://www.myri.com/ • Giganet, http://www.giganet.com/ • Servernet II, http://www.compaq.com/ • Cluster Management and Communication Software • LSF, http://www.platform.com/ • Codeine, http://www.gridware.net/ • Cluster CoNTroller, MPI, http://www.mpi-softtech.com/ • Maui Scheduler, http://www.cs.byu.edu/ • MPICH, http://www-unix.mcs.anl.gov/mpi/mpich/ • PVM, http://www.epm.ornl.gov/pvm/ • Microsoft Cluster Info • Win2000, http://www.microsoft.com/windows2000/ • MSCS, http://www.microsoft.com/ntserver/ntserverenterprise/exec/overview/clustering.asp

Future Directions • Terascale Clusters • Entropia

A Terascale Cluster • 10+ Teraflops in 2000? • NSF currently running a $36M Terascale competition • Budget could buy • an Itanium cluster (3000+ processors) • ~3TB of main memory • > 1.5Gbps high speed network interconnect • #1 in Top 500 Supercomputing Sites?

Entropia: Beyond Clusters • COTS, SHV enable larger, cheaper, faster systems • Supercomputers (MPP’s) to… Commodity Clusters (NT Supercluster) to… Entropia

Internet Computing • Idea: Assemble large numbers of idle PC’s in people’s homes, offices, into a massive computational resource • Enabled by broadband connections, fast microprocessors, huge PC volumes

Unprecedented Power • Entropia network: ~30,000 machines (and growing fast!) • 100,000 machines at 1 GHz => 100 TeraOp system • 1,000,000 machines at 1 GHz => 1,000 TeraOp system (1 PetaOp) • IBM ASCI White (12 TeraOp, 8K processors, $110 Million system)
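The aggregate figures above follow from simple multiplication. The sketch below spells that out, assuming one operation per clock cycle and every machine fully available; both assumptions are mine, and the `availability` parameter is a hypothetical knob for discounting idle-time-only use.

```python
def aggregate_teraops(n_machines, clock_ghz, ops_per_cycle=1.0, availability=1.0):
    """Peak aggregate rate in TeraOps for a pool of identical machines."""
    ops_per_second = n_machines * clock_ghz * 1e9 * ops_per_cycle * availability
    return ops_per_second / 1e12


ASCI_WHITE_TERAOPS = 12   # figure quoted on the slide

small_pool = aggregate_teraops(100_000, 1)      # 100,000 machines at 1 GHz
large_pool = aggregate_teraops(1_000_000, 1)    # 1,000,000 machines at 1 GHz
ratio_to_asci_white = small_pool / ASCI_WHITE_TERAOPS

print(f"{small_pool:.0f} TeraOps, {large_pool:.0f} TeraOps, "
      f"{ratio_to_asci_white:.1f}x ASCI White")
```

These are peak numbers only; delivered throughput depends on how often machines are idle and on how little the application needs to communicate.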

Why Participate: Cause Computing!
