This presentation explores the challenges and limitations of building and running clusters of SMPs in the US supercomputing industry. It discusses the need for scalable system software and the advantages of using Linux for clusters of SMPs. It also highlights the importance of open source software and gives examples of high-performance cluster monitoring tools implemented in Linux.
Extreme Linux: A Strategy for the Future of High Performance Computing?
Pete Beckman
Advanced Computing Laboratory, Los Alamos National Laboratory
Observations: The US Supercomputing Industry
• All US high-performance vendors are building clusters of SMPs (with the exception of Tera)
• Each company (IBM, SGI, Compaq, HP, and Sun) has a different version of Unix
• Each company attempts to scale system software designed for database, internet, and technical servers
• This fractured market forces five different parallel file systems, fast messaging implementations, etc.
• Supercomputer companies tend to go out of business
New Limitations
People used to say: "The number of Tflops available is limited only by the amount of money you wish to spend."
The reality: we are at a point where our ability to build machines from components exceeds our ability to administer, program, and run them.
But we do it anyway. Many large clusters are being installed...
Scalable System Software Is Currently the Weak Link
Software for Tflop clusters of SMPs is hard:
• System administration, configuration, booting, management, and monitoring
• Scalable smart-NIC messaging (zero copy)
• Cluster/global/parallel file systems
• Job queuing and running
• I/O (scratch, prefetch, NASD)
• Fault tolerance and on-the-fly reconfiguration
Why use Linux for clusters of SMPs, and as a basis for system software research?
Linux is a lot of fun (Shagadelic, Baby!)
• The OS for scalable clusters needs more research
• Open Source! (it's more than just geek chic)
• No lawyers, no NDAs, no worries, mate!
• Visible code improves faster
• The whole environment, or just the mods, can be distributed
• Scientific collaboration is just a URL away...
• Small, well-designed, stable, mature kernel
• ~240K lines of code without device drivers
• /proc filesystem and dynamically loadable modules
• The OS is extendable, optimizable, tunable
Did I mention no lawyers?
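The loadable-module point is worth making concrete. The sketch below is not from the talk; it is a minimal, hypothetical module written against the later module_init/module_exit API rather than the 2.2-era kernels of 1999, showing how site-specific instrumentation can be added to a running kernel without a vendor and without a rebuild.

```c
/* hello_hpc.c -- minimal loadable-module sketch (hypothetical example,
 * modern module API assumed). Build out of tree with a standard
 * kernel-module Makefile, then load/unload with insmod and rmmod. */
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/module.h>

MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Illustration: the kernel can be extended without a rebuild");

static int __init hello_hpc_init(void)
{
        printk(KERN_INFO "hello_hpc: loaded\n");    /* appears in dmesg */
        return 0;                                   /* 0 = successful load */
}

static void __exit hello_hpc_exit(void)
{
        printk(KERN_INFO "hello_hpc: unloaded\n");
}

module_init(hello_hpc_init);
module_exit(hello_hpc_exit);
```

This is the delivery mechanism that makes cluster-specific kernel extensions practical: the change ships as a module the site can load, inspect, and modify, rather than as a closed vendor patch.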
Isn't Open Source hype? Do you really need it?
A very quick example: Supermon and Superview, high-performance cluster monitoring tools
Ron Minnich, Karen Reid, Matt Sottile
The problem: get really fast stats from a very large cluster
• Monitor hundreds of nodes at rates up to 100 Hz
• Monitor at 10 Hz without significant impact on the application
• Monitor hardware performance counters
• Collect a wide range of kernel information (disk blocks, memory, interrupts, etc.)
Solution
• Modify the kernel so all the parameters can be grabbed without going through /proc
• Tightly coupled clusters can get real-time monitoring stats
• This is not of general use to the desktop and web-server markets
• Stats for 100 nodes take about 20 ms
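For contrast, here is a sketch of the conventional approach Supermon avoids: a user-space sampler that re-opens and re-parses /proc/stat at 10 Hz. This is illustrative only (the file, fields, and rate are generic assumptions, not Supermon's implementation); parsing text like this on every node, every sample, is exactly the overhead that motivates exporting binary statistics directly from the kernel.

```c
/* procmon.c -- naive 10 Hz sampler of /proc/stat (hypothetical baseline,
 * not Supermon). Prints the aggregate CPU tick counters each sample. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char line[256];
    for (;;) {
        FILE *fp = fopen("/proc/stat", "r");
        if (!fp)
            return 1;
        while (fgets(line, sizeof(line), fp)) {
            /* the first "cpu" line holds aggregate user/nice/system/idle ticks */
            if (strncmp(line, "cpu ", 4) == 0) {
                unsigned long user, nice, sys, idle;
                if (sscanf(line, "cpu %lu %lu %lu %lu",
                           &user, &nice, &sys, &idle) == 4)
                    printf("user=%lu nice=%lu sys=%lu idle=%lu\n",
                           user, nice, sys, idle);
                break;
            }
        }
        fclose(fp);
        usleep(100000);   /* 100 ms -> 10 Hz sampling rate */
    }
}
```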
Superview: the Java tool for Supermon
Scalable Linux System Software
Where should we concentrate our efforts? Some areas for improvement...
Software: the hard part. Linux environments (page 1)
• Compilers
  • F90 (PGI, Absoft, Compaq)
  • F77 (GNU, PGI, Absoft, Compaq, Fujitsu)
  • HPF (PGI, Compaq?)
  • C/C++ (PGI, KAI, GNU, Compaq, Fujitsu)
  • OpenMP (PGI)
  • Metrowerks Code Warrior for C, C++, (Fortran?)
• Debuggers
  • Totalview... maybe, real soon now, almost?
  • gdb, DDD, etc.
Software: the hard part. Linux environments (page 2)
• Message Passing
  • MPICH, PVM, MPI MSTI, Nexus
  • OS bypass: ST, FM, AM, PM, GM, VIA, Portals, etc.
  • Fast interconnects: Myrinet, GigE, HiPPI, SCI
• Shared Memory Programming
  • Pthreads, Tulip-Threads, etc.
• Parallel Performance Tools
  • TAU, Vampir, PGI PGProf, Jumpshot, etc.
Software: the hard part. Linux environments (page 3)
• File Systems & I/O
  • e2fs (native), NFS
  • PVFS, Coda, GFS
  • MPI-IO, ROMIO
• Archival Storage
  • HPSS & ADSM clients
• Job Control
  • LSF, PBS, Maui
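As a concrete instance of the MPI-IO/ROMIO entry above, the sketch below (illustrative, not taken from the talk; the file name and block size are made up) has every rank write its own block of doubles into one shared scratch file at a rank-derived offset, the usual ROMIO-backed pattern for parallel scratch I/O.

```c
/* mpiio_demo.c -- minimal MPI-IO sketch: each rank writes its own
 * non-overlapping region of a shared file. Compile with mpicc. */
#include <mpi.h>
#include <stdio.h>

#define N 1024   /* doubles per rank (arbitrary for the sketch) */

int main(int argc, char **argv)
{
    int rank;
    double buf[N];
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < N; i++)
        buf[i] = rank + i * 1e-6;          /* dummy data tagged by rank */

    MPI_File_open(MPI_COMM_WORLD, "scratch.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* each rank lands in its own region, so no locking is needed */
    MPI_Offset offset = (MPI_Offset)rank * N * sizeof(double);
    MPI_File_write_at(fh, offset, buf, N, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```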
Software: the hard part. Linux environments (page 4)
• Libraries and Frameworks
  • BLAS, OVERTURE, POOMA, ATLAS
  • Alpha math libraries (Compaq)
• System Administration
  • Building and booting tools
  • Cfengine
  • Monitoring and management tools
  • Configuration database
  • SGI Project Accounting
Quick Summary: Software for Linux clusters, a report card (current status)
• Compilers: A
• Parallel debuggers: I
• Message passing: A-
• Shared memory programming: A
• Parallel performance tools: C+
• File systems: D
• Archival storage: C
• Job control: B-
• Math libraries: B
Summary of the most important areas
• First Priority
  • Cluster management, administration, images, monitoring, etc.
  • Cluster/parallel/global file systems
  • Continued work on scalable messaging
  • Faster, more scalable SMP
  • Virtual memory optimized for HPC
  • TCP/IP improvements
• Wish List
  • NIC boot, BIOS NVRAM, serial console
  • OS-bypass standards in the kernel
  • Tightly coupled scheduling, accounting
  • Newest drivers
Honest cluster costs: publish the numbers
• How many sysadmins and programmers are required for support?
• What are the service and replacement costs?
• How much was hardware integration?
• How many users can you support, and at what levels?
• How much was the hardware?
Tera-Scale SMP Cluster Architecture
[Architecture diagram: compute nodes and control nodes on a gigabit multistage interconnection fabric, with network-attached secure disks and a gigabit Ethernet control network]
Let someone else put it together
• Compaq
• Dell
• Penguin Computing
• Alta Tech
• VA Linux
• DCG
• Paralogic
• Microway
Ask about support
Cluster Benchmarking: Lies, Damn Lies, and the Top500
Vendor-published Linpack, latency, and bandwidth numbers are worthless
• Make MPI zero-byte messaging a special case (improves latency numbers)
• Convert multiply flops to additions, recount flops
• Hire a Linpack consultant to help you achieve "the number" the vendor promised
• "We unloaded the trucks, and 24 hrs later, we calculated the size of the galaxy in acres."
• For $15K and 3 rolls of duct tape I built a supercomputer in my cubicle...
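A ping-pong microbenchmark makes the zero-byte complaint concrete: quoting latency only at zero bytes hides how quickly the cost grows with message size. The sketch below is a generic illustration, not any vendor's benchmark; run it with exactly two MPI ranks.

```c
/* pingpong.c -- ping-pong latency sketch between ranks 0 and 1.
 * Sweeps the message size from 0 bytes upward, so the zero-byte
 * "special case" can be compared with realistic sizes. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define MAX_BYTES (64 * 1024)
#define REPS 1000

int main(int argc, char **argv)
{
    int rank;
    static char buf[MAX_BYTES];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    memset(buf, 0, sizeof(buf));

    for (int bytes = 0; bytes <= MAX_BYTES; bytes = bytes ? bytes * 2 : 1) {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < REPS; i++) {
            if (rank == 0) {
                MPI_Send(buf, bytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, bytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, bytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, bytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();
        if (rank == 0)   /* one-way latency = half the round-trip time */
            printf("%7d bytes  %10.2f us\n", bytes,
                   (t1 - t0) / REPS / 2 * 1e6);
    }

    MPI_Finalize();
    return 0;
}
```

Publishing the full curve, rather than the single best point, is the honest version of the latency number.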
Plug-in Framework for Cluster Benchmarks
MPI Message Matching
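The original slide's figure is not reproduced here. As a stand-in, the sketch below illustrates what MPI message matching means: a receive is matched against pending sends by (communicator, source, tag), and wildcards such as MPI_ANY_SOURCE and MPI_ANY_TAG widen the match, forcing the library to search its unexpected-message queue, which is where matching cost shows up at scale.

```c
/* matching.c -- illustration of MPI message matching with wildcards.
 * Every non-root rank sends one message tagged with its own rank;
 * rank 0 posts wildcard receives and reports what each one matched. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank != 0) {
        int payload = rank * 100;
        MPI_Send(&payload, 1, MPI_INT, 0, /*tag=*/rank, MPI_COMM_WORLD);
    } else {
        /* wildcard receives may match any sender and any tag, so the
         * library must search its queue of unexpected messages */
        for (int i = 1; i < size; i++) {
            int payload;
            MPI_Status status;
            MPI_Recv(&payload, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            printf("matched message from rank %d, tag %d, value %d\n",
                   status.MPI_SOURCE, status.MPI_TAG, payload);
        }
    }

    MPI_Finalize();
    return 0;
}
```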
Conclusions
• Lots of Linux clusters will be at SC99
• The Big 5 vendors do not have the critical mass to develop the system software for multi-teraflop clusters
• The HPC community (labs, vendors, universities, etc.) needs to work together
• The hardware consolidation is nearly over; the software consolidation is on its way
• A Linux-based "commodity" Open Source strategy could provide a mechanism for:
  • open vendor collaboration
  • academic and laboratory participation
  • one Open Source software environment
News and Announcements
• The next Extreme Linux conference will be in Williamsburg in October. The call for papers will be out soon; start preparing those technical papers...
• There will be several cluster tutorials at SC99. Remy Evard, Bill Saphir, and Pete Beckman will be running one focused on system administration and the user environment for large clusters.