
MPI and MPICH on Clusters



Presentation Transcript


  1. MPI and MPICH on Clusters Rusty Lusk Mathematics and Computer Science Division Argonne National Laboratory

  2. Outline • MPI implementations on clusters • Unix • NT • Cluster-related activities at Argonne • new cluster • MPICH news and plans • MPICH on NT • Other stuff • scalable performance visualization • managing jobs and processes • parallel I/O

  3. MPI Implementations • MPI’s design for portability + performance has inspired a wide variety of implementations • Vendor implementations for their own machines • Implementations from software vendors • Freely available public implementations for a number of environments • Experimental implementations to explore research ideas

  4. MPI Implementations from Vendors • IBM for SP and RS/6000 workstations, OS/390 • Sun for Solaris systems, clusters of them • SGI for SGI Origin, Power Challenge, Cray T3E and C90 • HP for Exemplar, HP workstations • Compaq for (Digital) parallel machines • MPI Software Technology, for Microsoft Windows NT, Linux, Mac • Fujitsu for VPP (Pallas) • NEC • Hitachi • Genias (for NT)

  5. From the Public for the Public • MPICH • for many architectures, including clusters • http://www.mcs.anl.gov/mpi/mpich • new: http://www.mcs.anl.gov/mpi/mpich/mpich-nt • LAM • for clusters • http://lam.nd.edu • MPICH-based NT implementations • Aachen • Portugal • Argonne

  6. Experimental Implementations • Real-time (Hughes) • Special networks and protocols (not TCP) • MPI-FM (U. of I., UCSD) • MPI-BIP (Lyon) • MPI-AM (Berkeley) • MPI over VIA (Berkeley, Parma) • MPI-MBCF (Tokyo) • MPI over SCI (Oslo) • Wide-area networks • Globus (MPICH-G, Argonne) • Legion (U. of Virginia) • MetaMPI (Germany) • Ames Lab (highly optimized subset) • More implementations at http://www.mpi.nd.edu/lam

  7. Status of MPI-2 Implementations • Fujitsu: complete (from PALLAS) • NEC has I/O, complete early 2000, for SX • MPICH & LAM (C++ bindings and most of I/O) • LAM has parts of dynamic and one-sided • HP has part of one-sided • HP and SGI have most of I/O • Sun, Compaq are working on parts of MPI-2 • IBM has I/O, soon will have one-sided • EPCC did one-sided for Cray T3E • Experimental implementations (esp. one-sided)

  8. Cluster Activities at Argonne • New cluster - Chiba City • MPICH • Other software

  9. Chiba City • 8 Computing Towns: 256 Dual Pentium III systems • 1 Storage Town: 8 Xeon systems with 300G disk each • 1 Visualization Town: 32 Pentium III systems with Matrox G400 cards • Cluster Management: 12 PIII Mayor Systems, 4 PIII Front End Systems, 2 Xeon File Servers, 3.4 TB disk • High Performance Net: 64-bit Myrinet • Management Net: Gigabit and Fast Ethernet • Gigabit External Link

  10. Chiba City System Details • Purpose: scalable CS research and prototype application support • System - 314 computers: 256 computing nodes (PIII 500MHz, 512M, 9G local disk); 32 visualization nodes (PIII 500MHz, 512M, Matrox G200); 8 storage nodes (500 MHz Xeon, 512M, 300GB disk: 2.4TB total); 10 town mayors, 1 city mayor, and other management systems (PIII 500 MHz, 512M, 3TB disk) • Communications: 64-bit Myrinet computing net, switched fast/gigabit Ethernet management net, serial control network • Software Environment: Linux (based on RH 6.0), plus “install your own” OS support • Compilers: GNU g++, PGI, etc. • Libraries and Tools: PETSc, MPICH, Globus, ROMIO, SUMMA3d, Jumpshot, Visualization, PVFS, HPSS, ADSM, PBS + Maui Scheduler

  11. Software Research on Clusters at ANL • Scalable Systems Management • Chiba City Management Model (w/LANL, LBNL) • MPI and Communications Software • GigaNet, Myrinet, ServerNetII • Data Management and Grid Services • Globus Services on Linux (w/LBNL, ISI) • Visualization and Collaboration Tools • Parallel OpenGL server (w/Princeton, UIUC) • vTK and CAVE Software for Linux Clusters • Scalable Media Server (FL Voyager Server on Linux Cluster) • Scalable Display Environment and Tools • Virtual Frame Buffer Software (w/Princeton) • VNC (ATT) modifications for ActiveMural • Parallel I/O • MPI-IO and Parallel Filesystems Developments (w/Clemson, PVFS)

  12. MPICH • Goals • Misconceptions about MPICH • MPICH architecture • the Abstract Device Interface (ADI) • Current work at Argonne on MPICH • Work above the ADI • Work below the ADI • A new ADI

  13. Goals of MPICH • As a research project: • to explore tradeoffs between performance and portability in the context of the MPI standard • to study algorithms applicable to MPI implementation • to investigate interfaces between MPI and tools • As a software project: • to provide a portable, freely available MPI to everyone • to give vendors and others a running start in the development of specialized MPI implementations • to provide a testbed for other research groups working on particular aspects of message passing

  14. Misconceptions About MPICH • It is pronounced (by its authors, at least) as “em-pee-eye-see-aitch”, not “em-pitch”. • It runs on networks of heterogeneous machines. • It runs MIMD parallel programs, not just SPMD. • It can use TCP, shared memory, or both at the same time (for networks of SMPs). • It runs over native communication on machines like the IBM SP and Cray T3E (not just TCP). • It is not for Unix only (there is a new NT version). • It doesn’t necessarily poll (depends on the device).

  15. MPICH Architecture [architecture diagram: MPI routines sit above the Abstract Device Interface (ADI); below the ADI are the channel device implementations (ch_p4, ch_shmem, ch_eui, ch_NT), built over sockets and/or shared memory, and other devices (t3e, Globus)]
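
To make the layering concrete, here is a toy sketch of the idea. The structure and function names are hypothetical, not MPICH's actual internal symbols: the portable MPI layer is written once against a small table of device operations, and each device (sockets, shared memory, a native interface) supplies its own implementation of that table.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical sketch of the ADI layering only; real MPICH internals
   use different names and a much richer interface. */

/* "Below the device": each device supplies a small table of operations. */
typedef struct {
    int (*send_contig)(const void *buf, int len, int dest, int tag);
} adi_device_ops;

/* Stand-in for a real device (ch_p4 over sockets, ch_shmem, ch_NT, ...). */
static int sockets_send(const void *buf, int len, int dest, int tag)
{
    (void)buf;
    printf("[sockets device] %d bytes to rank %d, tag %d\n", len, dest, tag);
    return 0;
}
static adi_device_ops ch_sockets_ops = { sockets_send };

static adi_device_ops *the_device = &ch_sockets_ops;

/* "Above the device": the portable MPI layer is written once against the
   device table; datatype handling, error checking, etc. live up here. */
static int MY_Send(const void *buf, int nbytes, int dest, int tag)
{
    return the_device->send_contig(buf, nbytes, dest, tag);
}

int main(void)
{
    const char msg[] = "hello";
    return MY_Send(msg, (int)strlen(msg), 1, 99);
}
```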

  16. Recent Work Above the Device • Complete 1.2 compliance (in MPICH-1.2.0) • including even MPI_Cancel for sends • Better MPI derived datatype packing • can also be done below the device • MPI-2 C++ bindings • thanks to the Notre Dame group • MPI-2 Fortran 90 module • permits “use mpi” instead of include ‘mpif.h’ • MPI-2 I/O • the ROMIO project • layers MPI I/O on any MPI implementation and any file system
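
As a reminder of what the derived-datatype packing work is about, here is a minimal illustrative example (standard MPI calls, not MPICH internals) of the kind of non-contiguous communication involved: sending one column of a row-major matrix with MPI_Type_vector, which the implementation must pack efficiently.

```c
#include <mpi.h>
#include <stdio.h>

#define N 8

/* Rank 0 sends one column of a row-major NxN matrix using an
   MPI_Type_vector instead of copying it into a scratch buffer by hand. */
int main(int argc, char **argv)
{
    int rank, i, j;
    double a[N][N], col[N];
    MPI_Datatype column;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* N blocks of 1 double, a stride of N doubles apart: one column */
    MPI_Type_vector(N, 1, N, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    if (rank == 0) {
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                a[i][j] = i * N + j;
        MPI_Send(&a[0][0], 1, column, 1, 0, MPI_COMM_WORLD);   /* column 0 */
    } else if (rank == 1) {
        MPI_Recv(col, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
        printf("received column: first %g, last %g\n", col[0], col[N - 1]);
    }

    MPI_Type_free(&column);
    MPI_Finalize();
    return 0;
}
```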

  17. Above the Device (continued) • Error message architecture • Instance-specific error reporting • “rank 789 invalid” rather than “Invalid rank” • Internationalization • German • Thai • Thread safety • Globus/NGI/collective
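
A small example of how instance-specific error reporting surfaces to an application (standard MPI calls only; the exact message text depends on the MPICH version): with MPI_ERRORS_RETURN installed, the failing call returns a code that MPI_Error_string turns into a message naming the offending rank.

```c
#include <mpi.h>
#include <stdio.h>

/* MPI_Errhandler_set is the MPI-1 name used in the MPICH-1.2 era; newer
   MPI versions spell it MPI_Comm_set_errhandler. */
int main(int argc, char **argv)
{
    int err, len, dummy = 42;
    char msg[MPI_MAX_ERROR_STRING];

    MPI_Init(&argc, &argv);
    MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    /* rank 789 does not exist in a small job, so this send must fail */
    err = MPI_Send(&dummy, 1, MPI_INT, 789, 0, MPI_COMM_WORLD);
    if (err != MPI_SUCCESS) {
        MPI_Error_string(err, msg, &len);
        printf("MPI_Send failed: %s\n", msg);
    }

    MPI_Finalize();
    return 0;
}
```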

  18. Below the Device (continued) • Flow control for socket-based devices, as defined by IMPI • Better multi-protocol support for Linux SMPs • the mmap solution, portable among other Unixes, is not usable on Linux • can’t use MAP_ANONYMOUS with MAP_SHARED! • the SYSV solution works but can cause problems (race condition in deallocation) • works better than before in MPICH-1.2.0 • New NT implementation of the channel device
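
A minimal sketch of the System V alternative mentioned above, assuming the usual pattern of a parent creating and attaching a segment and then forking its workers; this is not MPICH's actual code, just the mechanism.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <unistd.h>

/* A segment created and attached before fork() is visible to both
   processes.  Marking it for removal once everyone has it mapped keeps it
   from leaking if a process dies unexpectedly, which is the kind of
   deallocation hazard the slide alludes to. */
int main(void)
{
    int id;
    char *buf;
    pid_t pid;

    id = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
    if (id < 0) { perror("shmget"); return 1; }

    buf = (char *) shmat(id, NULL, 0);
    if (buf == (char *) -1) { perror("shmat"); return 1; }

    pid = fork();
    if (pid == 0) {                       /* child: write into the segment */
        strcpy(buf, "hello through SysV shared memory");
        _exit(0);
    }

    /* both processes already have the mapping; the segment now goes away
       automatically when the last one detaches or exits */
    shmctl(id, IPC_RMID, NULL);

    wait(NULL);
    printf("parent sees: %s\n", buf);
    shmdt(buf);
    return 0;
}
```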

  19. MPICH for NT • open source: build with MS Visual C++ 6.0 and Digital Visual Fortran 6.0, or download the DLL • complete MPI 1.2; shares above-the-device code with Unix MPICH • correct implementation: passes test suites from ANL, IBM, and Intel • not yet fully optimized • only the socket device is in the current release • DCOM-based job launcher • working on other devices and an ADI-3 implementation • http://www.mcs.anl.gov/~ashton/mpichbeta.html

  20. Preliminary Experiments with Shared Memory on NT

  21. ADI-3: A New Abstract Device Motivated by: • the requirements of MPI-2, which are inconsistent with the ADI-2 design • thread safety • dynamic process management • one-sided operations • New capabilities of user-accessible hardware • LAPI • VIA/SIO (NGIO, FIO) • other network interfaces (Myrinet) • Desire for peak efficiency • top-to-bottom overhaul of MPICH • ADI-1 emphasized speed of implementation; ADI-2 emphasized portability
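
For reference, the MPI-2 one-sided operations that strain the ADI-2 design look like this from the application side (a minimal standard-MPI example, independent of any particular device):

```c
#include <mpi.h>
#include <stdio.h>

/* Rank 0 puts a value directly into a window exposed by rank 1, with
   fences providing the synchronization.  Run with at least two ranks. */
int main(int argc, char **argv)
{
    int rank, value = 0;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* every rank exposes one int as its window */
    MPI_Win_create(&value, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0) {
        int forty_two = 42;
        MPI_Put(&forty_two, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);

    if (rank == 1)
        printf("rank 1 received %d via MPI_Put\n", value);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```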

  22. Runtime Environment Research [diagram: scheduler, job, mpirun, and process] • Fast startup of MPICH jobs via the mpd • Experiment with process manager, job manager, and scheduler interfaces for parallel jobs

  23. Scalable Logfiles: SLOG • From IBM via AIX tracing • Using MPI profiling mechanism • Both automatic and user-defined states • Can support large logfiles, yet find and display sections quickly • Freely-available API for reading/writing SLOG files • Current format read by Jumpshot
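
The profiling mechanism referred to here is the standard PMPI interface: a logging library provides its own MPI_Send (and friends), records timestamps, and forwards to the real routine under its PMPI_ name. The sketch below only prints the elapsed time; an MPE/SLOG-style logger would append a state record to its logfile instead, and the SLOG writer API itself is not shown.

```c
#include <mpi.h>
#include <stdio.h>

/* Profiling-interface wrapper.  The const-qualified prototype matches
   MPI-3 headers; drop const when building against older MPI versions. */
int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    double t0 = MPI_Wtime();
    int err = PMPI_Send(buf, count, datatype, dest, tag, comm);
    double t1 = MPI_Wtime();

    /* a real tool would write a state record to its logfile here */
    printf("MPI_Send to %d (tag %d) took %.6f s\n", dest, tag, t1 - t0);
    return err;
}
```

Compiled into the application, or linked ahead of the MPI library, the wrapper intercepts every MPI_Send without any source changes.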

  24. Jumpshot

  25. Parallel I/O for Clusters • ROMIO is an implementation of (almost all of) the I/O part of the MPI standard. • It can utilize multiple file systems and MPI implementations. • Included in MPICH, LAM, and MPI from SGI, HP, and NEC • A combination for Clusters: Linux, MPICH, and PVFS (Parallel Virtual File System from Clemson).
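
A minimal MPI-IO example of the Linux/MPICH/ROMIO combination: each rank writes its own block of a shared file with a collective call. The file name here is arbitrary, and cb_buffer_size is one of the standard collective-buffering hints that ROMIO understands; on a PVFS volume ROMIO selects the appropriate file-system driver underneath.

```c
#include <mpi.h>
#include <stdio.h>

#define COUNT 1024

int main(int argc, char **argv)
{
    int rank, i, buf[COUNT];
    MPI_File fh;
    MPI_Info info;
    MPI_Offset offset;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < COUNT; i++)
        buf[i] = rank;

    MPI_Info_create(&info);
    MPI_Info_set(info, "cb_buffer_size", "4194304");   /* 4 MB buffers */

    MPI_File_open(MPI_COMM_WORLD, "pfs_testfile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    /* rank r owns bytes [r*COUNT*sizeof(int), (r+1)*COUNT*sizeof(int)) */
    offset = (MPI_Offset) rank * COUNT * sizeof(int);
    MPI_File_write_at_all(fh, offset, buf, COUNT, MPI_INT, &status);

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}
```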

  26. Conclusion • There are many MPI implementations for clusters; MPICH is one. • MPI implementation, particularly for fast networks, remains an active research area • Argonne National Laboratory, all of whose software is open source, has a number of ongoing and new cluster-related activities • New cluster • MPICH • Tools

  27. Available in November
