MPI and MPICH on Clusters
Rusty Lusk
Mathematics and Computer Science Division
Argonne National Laboratory
Outline
• MPI implementations on clusters
  • Unix
  • NT
• Cluster-related activities at Argonne
  • new cluster
  • MPICH news and plans
  • MPICH on NT
• Other stuff
  • scalable performance visualization
  • managing jobs and processes
  • parallel I/O
MPI Implementations
• MPI's design for portability + performance has inspired a wide variety of implementations
• Vendor implementations for their own machines
• Implementations from software vendors
• Freely available public implementations for a number of environments
• Experimental implementations to explore research ideas
MPI Implementations from Vendors
• IBM for SP and RS/6000 workstations, OS/390
• Sun for Solaris systems, clusters of them
• SGI for SGI Origin, Power Challenge, Cray T3E and C90
• HP for Exemplar, HP workstations
• Compaq for (Digital) parallel machines
• MPI Software Technology, for Microsoft Windows NT, Linux, Mac
• Fujitsu for VPP (Pallas)
• NEC
• Hitachi
• Genias (for NT)
From the Public for the Public
• MPICH
  • for many architectures, including clusters
  • http://www.mcs.anl.gov/mpi/mpich
  • new: http://www.mcs.anl.gov/mpi/mpich/mpich-nt
• LAM
  • for clusters
  • http://lam.nd.edu
• MPICH-based NT implementations
  • Aachen
  • Portugal
  • Argonne
Experimental Implementations
• Real-time (Hughes)
• Special networks and protocols (not TCP)
  • MPI-FM (U. of I., UCSD)
  • MPI-BIP (Lyon)
  • MPI-AM (Berkeley)
  • MPI over VIA (Berkeley, Parma)
  • MPI-MBCF (Tokyo)
  • MPI over SCI (Oslo)
• Wide-area networks
  • Globus (MPICH-G, Argonne)
  • Legion (U. of Virginia)
  • MetaMPI (Germany)
• Ames Lab (highly optimized subset)
• More implementations at http://www.mpi.nd.edu/lam
Status of MPI-2 Implementations
• Fujitsu: complete (from PALLAS)
• NEC has I/O; complete early 2000, for SX
• MPICH & LAM (C++ bindings and most of I/O)
• LAM has parts of dynamic and one-sided
• HP has part of one-sided
• HP and SGI have most of I/O
• Sun, Compaq are working on parts of MPI-2
• IBM has I/O, soon will have one-sided
• EPCC did one-sided for Cray T3E
• Experimental implementations (esp. one-sided)
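Several of the items above concern the MPI-2 one-sided operations. For reference, a minimal, implementation-neutral C sketch of that part of the standard (it assumes at least two processes and is not specific to any implementation listed above):

/* Minimal sketch of MPI-2 one-sided communication:
   rank 0 puts a value into a window exposed by rank 1. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 0;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Every process exposes one int as a window. */
    MPI_Win_create(&value, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0) {
        int data = 42;
        /* Write into rank 1's window; rank 1 makes no receive call. */
        MPI_Put(&data, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);

    if (rank == 1)
        printf("rank 1 now holds %d\n", value);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}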
Cluster Activities at Argonne
• New cluster - Chiba City
• MPICH
• Other software
Chiba City
• 8 Computing Towns: 256 dual Pentium III systems
• 1 Storage Town: 8 Xeon systems with 300G disk each
• 1 Visualization Town: 32 Pentium III systems with Matrox G400 cards
• Cluster Management: 12 PIII mayor systems, 4 PIII front-end systems, 2 Xeon file servers, 3.4 TB disk
• High Performance Net: 64-bit Myrinet
• Management Net: Gigabit and Fast Ethernet
• Gigabit external link
Chiba City System Details
• Purpose: scalable CS research; prototype application support
• System - 314 computers:
  • 256 computing nodes: PIII 500 MHz, 512M, 9G local disk
  • 32 visualization nodes: PIII 500 MHz, 512M, Matrox G200
  • 8 storage nodes: 500 MHz Xeon, 512M, 300GB disk (2.4TB total)
  • 10 town mayors, 1 city mayor, other management systems: PIII 500 MHz, 512M, 3TB disk
• Communications:
  • 64-bit Myrinet computing net
  • Switched fast/gigabit Ethernet management net
  • Serial control network
• Software environment: Linux (based on RH 6.0), plus "install your own" OS support
• Compilers: GNU g++, PGI, etc.
• Libraries and tools: PETSc, MPICH, Globus, ROMIO, SUMMA3d, Jumpshot, Visualization, PVFS, HPSS, ADSM, PBS + Maui Scheduler
Software Research on Clusters at ANL
• Scalable Systems Management
  • Chiba City Management Model (w/LANL, LBNL)
• MPI and Communications Software
  • GigaNet, Myrinet, ServerNetII
• Data Management and Grid Services
  • Globus Services on Linux (w/LBNL, ISI)
• Visualization and Collaboration Tools
  • Parallel OpenGL server (w/Princeton, UIUC)
  • vTK and CAVE Software for Linux Clusters
  • Scalable Media Server (FL Voyager Server on Linux Cluster)
• Scalable Display Environment and Tools
  • Virtual Frame Buffer Software (w/Princeton)
  • VNC (ATT) modifications for ActiveMural
• Parallel I/O
  • MPI-IO and Parallel Filesystems Developments (w/Clemson, PVFS)
MPICH
• Goals
• Misconceptions about MPICH
• MPICH architecture
  • the Abstract Device Interface (ADI)
• Current work at Argonne on MPICH
  • Work above the ADI
  • Work below the ADI
  • A new ADI
Goals of MPICH
• As a research project:
  • to explore tradeoffs between performance and portability in the context of the MPI standard
  • to study algorithms applicable to MPI implementation
  • to investigate interfaces between MPI and tools
• As a software project:
  • to provide a portable, freely available MPI to everyone
  • to give vendors and others a running start in the development of specialized MPI implementations
  • to provide a testbed for other research groups working on particular aspects of message passing
Misconceptions About MPICH
• It is pronounced (by its authors, at least) as "em-pee-eye-see-aitch", not "em-pitch".
• It runs on networks of heterogeneous machines.
• It runs MIMD parallel programs, not just SPMD.
• It can use TCP, shared memory, or both at the same time (for networks of SMPs).
• It runs over native communication on machines like the IBM SP and Cray T3E (not just TCP).
• It is not for Unix only (new NT version).
• It doesn't necessarily poll (depends on device).
MPICH Architecture
[Layered architecture diagram]
• Above the device: MPI routines
• The Abstract Device Interface (ADI): ADI routines
• Below the device: the channel device (ch_p4, ch_shmem, ch_eui, ch_NT) and other devices (t3e, Globus), built on sockets, shared memory, or sockets + shared memory
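To make the layering concrete, here is a purely hypothetical C sketch of what a device interface of this kind looks like; the names are illustrative only and are not the actual MPICH ADI or channel interface.

/* Hypothetical sketch (not the real ADI): an abstract device is a
   small set of operations that each device implements, while the
   portable MPI routines above call only these operations. */
typedef struct AbstractDevice {
    /* Illustrative names only -- not the real ADI functions. */
    int  (*init)(int *argc, char ***argv);
    int  (*send_contig)(const void *buf, int len, int dest, int tag);
    int  (*recv_contig)(void *buf, int maxlen, int src, int tag);
    int  (*make_progress)(void);   /* poll or block, depending on the device */
    void (*finalize)(void);
} AbstractDevice;

/* Each device (ch_p4, ch_shmem, ch_NT, ...) would supply its own
   implementation over sockets, shared memory, or native APIs. */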
Recent Work Above the Device
• Complete 1.2 compliance (in MPICH-1.2.0)
  • including even MPI_Cancel for sends
• Better MPI derived datatype packing
  • can also be done below the device
• MPI-2 C++ bindings
  • thanks to Notre Dame group
• MPI-2 Fortran-90 module
  • permits "use mpi" instead of #include 'mpif.h'
  • extends work of Michael Hennecke
• MPI-2 I/O
  • the ROMIO project
  • layers MPI I/O on any MPI implementation, file system
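The derived-datatype packing mentioned above concerns non-contiguous buffers described by MPI datatypes. A minimal, implementation-neutral C example of such a datatype (the matrix size and routine name are illustrative):

#include <mpi.h>

#define N 8   /* illustrative matrix dimension */

/* Send column 0 of an N x N matrix with a strided derived datatype.
   Packing data like this efficiently is what the datatype-packing
   work refers to. */
void send_column(double a[N][N], int dest)
{
    MPI_Datatype column;

    /* N blocks of 1 double, each separated by a stride of N doubles. */
    MPI_Type_vector(N, 1, N, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    MPI_Send(&a[0][0], 1, column, dest, 0, MPI_COMM_WORLD);

    MPI_Type_free(&column);
}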
Above the Device (continued)
• Error message architecture
• Instance-specific error reporting
  • "rank 789 invalid" rather than "Invalid rank"
• Internationalization
  • German
  • Thai
• Thread safety
• Globus/NGI/collective
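As a sketch of how an application sees such instance-specific messages: the calls below (MPI_Errhandler_set, MPI_Error_string) are standard MPI, while the exact message text is up to the implementation; the destination rank is deliberately invalid for illustration.

#include <mpi.h>
#include <stdio.h>

/* Ask for error codes instead of aborting, then convert a failure
   into the implementation's message text. */
void example(void)
{
    char msg[MPI_MAX_ERROR_STRING];
    int err, len, buf = 0;

    MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    /* Deliberately invalid destination rank. */
    err = MPI_Send(&buf, 1, MPI_INT, 789, 0, MPI_COMM_WORLD);
    if (err != MPI_SUCCESS) {
        MPI_Error_string(err, msg, &len);
        fprintf(stderr, "MPI error: %s\n", msg);   /* e.g. "rank 789 invalid" */
    }
}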
Below the Device (continued)
• Flow control for socket-based devices as defined by IMPI
• Better multi-protocol support for Linux SMPs
  • the mmap solution, portable among other Unixes, is not usable on Linux
    • can't use MAP_ANONYMOUS with MAP_SHARED!
  • the SYSV solution works but can cause problems (race condition in deallocation)
  • works better than before in MPICH-1.2.0
• New NT implementation of the channel device
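For context, a simplified single-process C sketch of the SYSV shared-memory calls involved (shmget/shmat/shmctl); real device code must coordinate attachment across all local processes, which is where the deallocation race mentioned above comes in. The segment size is arbitrary.

#include <sys/ipc.h>
#include <sys/shm.h>
#include <stdio.h>

#define SHM_SIZE (1 << 20)   /* arbitrary 1 MB shared region */

int main(void)
{
    int shmid;
    void *region;

    /* Create a segment readable/writable by this user. */
    shmid = shmget(IPC_PRIVATE, SHM_SIZE, IPC_CREAT | 0600);
    if (shmid < 0) { perror("shmget"); return 1; }

    region = shmat(shmid, NULL, 0);
    if (region == (void *) -1) { perror("shmat"); return 1; }

    /* Mark for removal so the kernel frees it when the last attacher
       detaches; attaching after this point is not portable, so all
       processes must attach first -- hence the care needed here. */
    shmctl(shmid, IPC_RMID, NULL);

    /* ... use region for queues shared by local processes ... */

    shmdt(region);
    return 0;
}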
MPICH for NT
• open source: build with MS Visual C++ 6.0 and Digital Visual Fortran 6.0, or download dll
• complete MPI 1.2, shares above-device code with Unix MPICH
• correct implementation, passes test suites from ANL, IBM, Intel
• not yet fully optimized
• only socket device in current release
• DCOM-based job launcher
• working on other devices, ADI-3 implementation
• http://www.mcs.anl.gov/~ashton/mpichbeta.html
ADI-3: A New Abstract Device
Motivated by:
• MPI-2 requirements that are inconsistent with the ADI-2 design
  • thread safety
  • dynamic process management
  • one-sided operations
• New capabilities of user-accessible hardware
  • LAPI
  • VIA/SIO (NGIO, FIO)
  • other network interfaces (Myrinet)
• Desire for peak efficiency
  • top-to-bottom overhaul of MPICH
  • ADI-1: speed of implementation; ADI-2: portability
Runtime Environment Research
[Diagram: scheduler, job, mpirun, and process interactions]
• Fast startup of MPICH jobs via the mpd
• Experiment with process manager, job manager, scheduler interface for parallel jobs
Scalable Logfiles: SLOG
• From IBM via AIX tracing
• Using MPI profiling mechanism
• Both automatic and user-defined states
• Can support large logfiles, yet find and display sections quickly
• Freely-available API for reading/writing SLOG files
• Current format read by Jumpshot
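The "MPI profiling mechanism" is the standard PMPI name-shifted interface. A minimal C sketch of the interception pattern a logging layer builds on (not the actual SLOG/MPE code; the timing and output shown are illustrative):

#include <mpi.h>
#include <stdio.h>

/* The tool provides its own MPI_Send, which records an event and
   forwards to the real implementation via PMPI_Send. */
int MPI_Send(void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    double t0 = MPI_Wtime();
    int err = PMPI_Send(buf, count, datatype, dest, tag, comm);
    double t1 = MPI_Wtime();

    /* A real tool would append a logfile record here instead. */
    fprintf(stderr, "MPI_Send to %d took %g s\n", dest, t1 - t0);
    return err;
}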
Parallel I/O for Clusters
• ROMIO is an implementation of (almost all of) the I/O part of the MPI standard.
• It can utilize multiple file systems and MPI implementations.
• Included in MPICH, LAM, and MPI from SGI, HP, and NEC
• A combination for clusters: Linux, MPICH, and PVFS (the Parallel Virtual File System from Clemson).
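A minimal C sketch of the MPI-I/O interface that ROMIO implements: each process writes its own block of a shared file at a rank-based offset (the filename and buffer size are illustrative).

#include <mpi.h>

#define COUNT 1024   /* illustrative buffer size */

int main(int argc, char **argv)
{
    int i, rank, buf[COUNT];
    MPI_File fh;
    MPI_Offset offset;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < COUNT; i++)
        buf[i] = rank;

    MPI_File_open(MPI_COMM_WORLD, "testfile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank writes COUNT ints at its own offset in the shared file. */
    offset = (MPI_Offset) rank * COUNT * sizeof(int);
    MPI_File_write_at(fh, offset, buf, COUNT, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}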
Conclusion
• There are many MPI implementations for clusters; MPICH is one.
• MPI implementation, particularly for fast networks, remains an active research area.
• Argonne National Laboratory, all of whose software is open source, has a number of ongoing and new cluster-related activities:
  • New cluster
  • MPICH
  • Tools