Distributed Parallel Processing – MPICH-VMI Avneesh Pant
VMI • What is VMI? • Virtual Machine Interface • High-performance communication middleware • Abstracts the underlying communication network • What is MPICH-VMI? • An MPI library based on MPICH 1.2 from Argonne that uses VMI for underlying communication
Features • Communication over heterogeneous networks • Infiniband, Myrinet, TCP, Shmem supported • Underlying networks selected at runtime • Enables cross-site jobs over compute grids • Optimized point-to-point communication • Higher-level MPICH protocols (eager and rendezvous) implemented over RDMA Put and Get primitives • RDMA emulated on networks without native RDMA support (TCP) • Extensive support for profiling • Profiling counters collect information about the communication pattern of the application • Profiling information logged to a database during MPI_Finalize • Profile Guided Optimization (PGO) framework uses profile databases to optimize subsequent executions of the application
Features • Hiding point-to-point communication latency • RDMA get protocol very useful in overlapping communication and computation • PGO infrastructure maps MPI processes to nodes to take advantage of the heterogeneity of the underlying network, effectively hiding latencies • Optimized collectives • RDMA-based collectives (e.g., MPI_Barrier) • Multicast-based collectives (e.g., MPI_Bcast has an experimental implementation using multicast) • Topology-aware collectives (currently MPI_Bcast, MPI_Reduce, MPI_Allreduce supported)
MPI on Teragrid • MPI flavors available on Teragrid • MPICH-GM • MPICH-G2 • MPICH-VMI 1 • Deprecated! Was part of CTSS v1 • MPICH-VMI 2 • Available as part of CTSS v2 and v3 • All are part of CTSS • Which one to use? • We are biased!
MPI on Teragrid • What each MPI is designed for • MPICH-GM -> Single-site runs using Myrinet • MPICH-G2 -> Running across Globus grids • MPICH-VMI2 -> Scaling out seamlessly from a single site to across the grid • Currently you need to keep two separate executables • Single-site runs use MPICH-GM and grid jobs use MPICH-G2 • MPICH-VMI2 allows you to use the same executable for both, with comparable or better performance
Using MPICH-VMI • Two flavors of MPICH-VMI2 on Teragrid • GCC-compiled library • Intel-compiled library • Do not mix the two • CTSS defines a key for each compiled library • GCC: mpich-vmi-2.1.0-1-gcc-3-2 • Intel: mpich-vmi-2.1.0-1-intel-8.0
Setting the Environment • To use MPICH-VMI 2.1 • $ soft add +mpich-vmi-2.1.0-1-{gcc-3-2 | intel-8.0} • To preserve the VMI 2.1 environment across sessions, add "+mpich-vmi-2.1.0-1-{gcc-3-2 | intel-8.0}" to the .soft file in your home directory • Intel 8.1 is also available at NCSA; other sites do not have Intel 8.1 completely installed yet • SoftEnv brings the compiler wrapper scripts into your environment • mpicc and mpiCC for C and C++ codes • mpif77 and mpif90 for F77 and F90 codes • Some underlying compilers, such as the GNU compiler suite, do not support F90. Use "mpif90 -show" to determine the underlying compiler being used.
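A minimal session sketch tying the above together (assumes a TeraGrid login node; the resoft command, which rereads your .soft file, is standard SoftEnv but is not shown on this slide, and the GCC flavor is chosen only for illustration):

  $ soft add +mpich-vmi-2.1.0-1-gcc-3-2             # current session only
  $ echo "+mpich-vmi-2.1.0-1-gcc-3-2" >> ~/.soft    # persist across sessions
  $ resoft                                          # reread .soft (standard SoftEnv)
  $ mpif90 -show                                    # confirm which underlying compiler the wrapper calls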
Compiling with MPICH-VMI • The compiler scripts are wrappers that include all MPICH-VMI specific libraries and paths • All underlying compiler switches are supported and passed through to the compiler • e.g., mpicc hello.c -o hello • By default, the MPICH-VMI library is compiled with debug symbols.
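A compile sketch under those assumptions (hello.c is the example source named above; solver.f90 and the -O2/-g switches are illustrative, showing ordinary compiler flags being passed through by the wrappers):

  $ mpicc -O2 hello.c -o hello          # C code, optimization flag passed through
  $ mpif90 -g solver.f90 -o solver      # F90 code built with debug symbols
  $ mpicc -show                         # print the full underlying compile command without running it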
Running with MPICH-VMI • The mpirun script is available for launching jobs • Supports all standard arguments in addition to MPICH-VMI specific arguments • mpirun uses ssh, rsh, and MPD for launching jobs; the default is MPD • Provides automatic selection/failover • If an MPD ring is not available, it falls back to ssh/rsh • Supports the standard way to run jobs • mpirun -np <# of procs> -machinefile <nodelist file> <executable> <arguments> • The -machinefile argument is not needed when running within a PBS or LSF environment • The network can be selected at runtime by specifying -specfile <network> • Supported networks are myrinet, tcp, and xsite-myrinet-tcp • The default network on Teragrid is Myrinet • We recommend always specifying the network explicitly using the -specfile switch
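A run sketch showing runtime network selection with a single binary (the process count, the nodes.txt machine file, and the hello executable are illustrative):

  $ mpirun -np 8 -specfile myrinet ./hello          # Myrinet, the Teragrid default
  $ mpirun -np 8 -specfile tcp ./hello              # same executable, TCP transport
  $ mpirun -np 8 -machinefile nodes.txt ./hello     # explicit node list, outside PBS/LSF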
Running with MPICH-VMI • MPICH-VMI 2.1 specific arguments fall into three broad categories • Parameters for runtime tuning • Parameters for launching grid jobs • Parameters for controlling profiling of the job • Use the mpirun -help option to list all tunable parameters • All MPICH-VMI 2.1 specific parameters are optional for single-site runs; grid jobs require some parameters to be set • To run a simple job within a Teragrid cluster • mpirun -np 4 /path/to/hello • mpirun -np 4 -specfile myrinet /path/to/hello • Within PBS, $PBS_NODEFILE contains the path to the file listing the nodes allocated at runtime • mpirun -np <# procs> -machinefile $PBS_NODEFILE /path/to/hello • For cross-site jobs, additional arguments are required (discussed later)
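A sketch of a PBS batch script for such a run ($PBS_NODEFILE is set by PBS as described above; the resource request, walltime, and $PBS_O_WORKDIR usage are standard PBS, but the exact values are assumptions about the local queue):

  #!/bin/sh
  #PBS -l nodes=2:ppn=2          # assumed node/processor request; check your site's syntax
  #PBS -l walltime=00:10:00
  cd $PBS_O_WORKDIR              # PBS sets this to the directory the job was submitted from
  mpirun -np 4 -specfile myrinet -machinefile $PBS_NODEFILE /path/to/hello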
For Detecting/Reporting Errors • Verbosity switches • -v Verbose level 1. Output VMI startup messages and make mpirun verbose. • -vv Verbose level 2. Additionally output any warning messages. • -vvv Verbose level 3. Additionally output any error messages. • -vvvv Verbose level 10. Excess debug output; useful only for MPICH-VMI developers and for submitting crash dumps.
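A diagnostic run sketch (capturing output with tee and the run.log file name are illustrative, not part of mpirun):

  $ mpirun -np 4 -vv -specfile myrinet ./hello 2>&1 | tee run.log   # level 2: startup messages plus warnings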
Running Inter Site Jobs • An MPICH-VMI grid job consists of one or more subjobs • A subjob is launched at each site using an individual mpirun command. The specfile selected should be one of the cross-site network transports (xsite-mst-tcp or xsite-myrinet-tcp). • The higher-performance SAN (Infiniband or Myrinet) is used for intra-site communication; cross-site communication uses TCP automatically • In addition to the intra-site parameters, all inter-site runs must specify the same grid-specific parameters • A grid CRM must be available on the network to synchronize the subjobs • The grid CRM on Teragrid is available at tg-master2.ncsa.uiuc.edu • There is no reason why any other site can't host its own • In fact, you can run one on your own desktop! • Grid-specific parameters • -grid-procs Specifies the total number of processes in the job; the -np parameter to mpirun still specifies the number of processes in the subjob • -grid-crm Specifies the host running the grid CRM to be used for subjob synchronization • -key An alphanumeric string that uniquely identifies the grid job. This must be the same for all subjobs!
Running Inter Site Jobs • Running xsite across SDSC (2 procs) and NCSA (6 procs) • @SDSC: mpirun -np 2 -grid-procs 8 -key myxsitejob -specfile xsite-myrinet-tcp -grid-crm tg-master2.ncsa.teragrid.org cpi • @NCSA: mpirun -np 6 -grid-procs 8 -key myxsitejob -specfile xsite-myrinet-tcp -grid-crm tg-master2.ncsa.teragrid.org cpi • Note that the two -np values (2 + 6) sum to the -grid-procs total of 8, and both subjobs share the same -key and -grid-crm
MPICH-VMI2 Support • Support • help@teragrid.org • Mailing lists: http://vmi.ncsa.uiuc.edu/mailingLists.php • Announcements: vmi-announce@yams.ncsa.uiuc.edu • Users: vmi-user@yams.ncsa.uiuc.edu • Developers: vmi-devel@yams.ncsa.uiuc.edu