

  1. Managing Heterogeneous MPI Application Interoperation and Execution. From PVMPI to SNIPE based MPI_Connect() Graham E. Fagg*, Kevin S. London, Jack J. Dongarra and Shirley V. Browne. University of Tennessee and Oak Ridge National Laboratory * contact at fagg@cs.utk.edu

  2. Project Targets • Allow intercommunication between different MPI implementations or instances of the same implementation on different machines • Provide heterogeneous intercommunicating MPI applications with access to some of the MPI-2 process control and parallel I/O features • Allow use of optimized vendor MPI implementations while still permitting distributed heterogeneous parallel computing in a transparent manner.

  3. MPI-1 Communicators • All processes in an MPI-1 application belong to a global communicator called MPI_COMM_WORLD • All other communicators are derived from this global communicator. • Communication can only occur within a communicator. • Communicators therefore provide safe communication contexts: a message sent in one communicator can never be received by an operation on another.

  4. MPI Internals: Processes • All process groups are derived from the membership of the MPI_COMM_WORLD communicator, i.e. there are no external processes. • MPI-1 process membership is static, not dynamic as in PVM or LAM 6.X • simplified consistency reasoning • fast communication (fixed addressing) even across complex topologies. • interfaces well to the simple run-time systems found on many MPPs.
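  To make the last two slides concrete, here is a minimal sketch, using only standard MPI-1 calls, of how every group and communicator is carved out of MPI_COMM_WORLD's membership; the subgroup chosen here (just world rank 0) is arbitrary and purely illustrative.

  /* Minimal MPI-1 sketch: all groups/communicators derive from MPI_COMM_WORLD,
     so no process outside the original world can ever be addressed. */
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      MPI_Group world_group, sub_group;
      MPI_Comm  sub_comm;
      int       ranks[1] = { 0 };            /* arbitrary subset: world rank 0 */

      MPI_Init(&argc, &argv);

      MPI_Comm_group(MPI_COMM_WORLD, &world_group);       /* the world's group  */
      MPI_Group_incl(world_group, 1, ranks, &sub_group);  /* a subset of it     */
      MPI_Comm_create(MPI_COMM_WORLD, sub_group, &sub_comm);

      /* sub_comm is MPI_COMM_NULL on processes outside the subgroup. */
      if (sub_comm != MPI_COMM_NULL)
          MPI_Comm_free(&sub_comm);

      MPI_Group_free(&sub_group);
      MPI_Group_free(&world_group);
      MPI_Finalize();
      return 0;
  }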

  5. MPI-1 Application [Diagram: a single MPI-1 application, with a derived communicator nested inside its MPI_COMM_WORLD.]

  6. Disadvantages of MPI-1 • Static process model • If a process fails, all communicators it belongs to become invalid, i.e. there is no fault tolerance. • Dynamic resources either cause applications to fail due to loss of nodes, or make applications inefficient because they cannot take advantage of new nodes by starting/spawning additional processes. • When using a dedicated MPP MPI implementation you cannot usually use off-machine or even off-partition nodes.

  7. MPI-2 • Problem areas and additional features identified as missing in MPI-1 are being addressed by the MPI-2 forum. • These include: • inter-language operation • dynamic process control / management • parallel I/O • extended collective operations • Support for inter-implementation communication/control was considered but not adopted • See other projects such as NIST IMPI, PACX and PLUS

  8. User requirements for Inter-Operation • Dynamic connections between tasks on different platforms/systems/sites • Transparent inter-operation once it has started • A single style of API, i.e. MPI only! • Access to virtual machine / resource management facilities when required, without making their use mandatory • Support for complex features across implementations, such as user-derived data types

  9. MPI_Connect/PVMPI 2 • API in the MPI-1 style • Inter-application communication using standard point-to-point MPI-1 function calls • Supporting all variations of send and receive • All MPI data types, including user-derived ones • Naming service functions similar in semantics and functionality to the current MPI inter-communicator functions • Ease of use for programmers experienced only in MPI and not in other message-passing systems such as PVM, Nexus or TCP socket programming!

  10. Process Identification • Process groups are identified by a character name. Individual processes are addressed by a tuple. • In MPI: • {communicator, rank} • {process group, rank} • In MPI_Connect/PVMPI2: • {name, rank} • {name, instance} • Instance and rank are identical and range from 0..N-1, where N is the number of processes.

  11. Registration, i.e. naming MPI applications • Process groups register their name with a global naming service, which returns them a system handle used for future operations on this name / communicator pair. • int MPI_Conn_register (char *name, MPI_Comm local_comm, int *handle); • call MPI_CONN_REGISTER (name, local_comm, handle, ierr) • Processes can remove their name from the naming service with: • int MPI_Conn_remove (int handle); • A process may have multiple names associated with it. • Names can be registered, removed and re-registered multiple times without restriction.
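  As an illustration only (not from the original slides), here is a minimal C sketch of the registration calls listed above; the full coupled-model usage appears on slides 15 and 16.

  /* Minimal registration sketch using the MPI_Conn_* prototypes above
     (illustrative only; slides 15-16 show the full coupled-model example). */
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int handle;

      MPI_Init(&argc, &argv);

      /* Publish this MPI application under a character name. */
      MPI_Conn_register("AirModel", MPI_COMM_WORLD, &handle);

      /* ... create inter-communicators and exchange data here ... */

      /* Remove the name when it is no longer needed; it may be
         re-registered later without restriction. */
      MPI_Conn_remove(handle);

      MPI_Finalize();
      return 0;
  }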

  12. MPI-1 Inter-communicators • An inter-communicator is used for point-to-point communication between disjoint groups of processes. • Inter-communicators are formed using MPI_Intercomm_create(), which operates upon two existing non-overlapping intra-communicators and a bridge communicator. • MPI_Connect/PVMPI could not use this mechanism, as there is no MPI bridge communicator between groups formed from separate MPI applications: their MPI_COMM_WORLDs do not overlap.
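  For contrast, a runnable sketch of the standard MPI-1 mechanism inside a single application (run with at least two processes): the two halves of MPI_COMM_WORLD are joined through a bridge communicator, here MPI_COMM_WORLD itself, which is exactly what two separately started MPI applications do not share.

  /* Standard MPI-1 inter-communicator creation inside one application.
     Note the bridge (peer) communicator argument; separate MPI applications
     have no such shared communicator, hence MPI_Conn_intercomm_create. */
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int rank, size, color;
      MPI_Comm half_comm, inter_comm;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      color = (rank < size / 2) ? 0 : 1;    /* split the world into two groups */
      MPI_Comm_split(MPI_COMM_WORLD, color, rank, &half_comm);

      /* MPI_COMM_WORLD acts as the bridge that names the remote leader. */
      MPI_Intercomm_create(half_comm, 0, MPI_COMM_WORLD,
                           (color == 0) ? size / 2 : 0,   /* remote leader */
                           99, &inter_comm);

      MPI_Comm_free(&inter_comm);
      MPI_Comm_free(&half_comm);
      MPI_Finalize();
      return 0;
  }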

  13. Forming Inter-communicators • MPI_Connect/PVMPI2 forms its inter-communicators with a modified MPI_Intercomm_create call. • The bridging communication is performed automatically and the user only has to specify the remote group's registered name. • int MPI_Conn_intercomm_create (int local_handle, char *remote_group_name, MPI_Comm *new_inter_comm); • call MPI_CONN_INTERCOMM_CREATE (localhandle, remotename, newcomm, ierr)

  14. Inter-communicators • Once an inter-communicator has been formed it can be used almost exactly like any other MPI inter-communicator • All point-to-point operations • Communicator comparison and duplication • Remote group information • Resources are released by MPI_Comm_free()
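  As a hedged sketch (not from the slides), the standard MPI-1 inter-communicator operations listed above applied to a communicator returned by MPI_Conn_intercomm_create; ocean_comm is assumed to have been created as in the air-model example on the next slide.

  /* Sketch: standard MPI-1 operations on an inter-communicator obtained
     from MPI_Conn_intercomm_create (ocean_comm, as on the next slide). */
  int is_inter, remote_size, cmp;
  MPI_Comm dup_comm;

  MPI_Comm_test_inter(ocean_comm, &is_inter);      /* true for an inter-comm   */
  MPI_Comm_remote_size(ocean_comm, &remote_size);  /* size of the remote group */
  MPI_Comm_dup(ocean_comm, &dup_comm);             /* duplication              */
  MPI_Comm_compare(ocean_comm, dup_comm, &cmp);    /* yields MPI_CONGRUENT     */

  MPI_Comm_free(&dup_comm);
  MPI_Comm_free(&ocean_comm);                      /* release resources        */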

  15. Simple example: Air Model

  /* air model */
  MPI_Init (&argc, &argv);
  MPI_Conn_register ("AirModel", MPI_COMM_WORLD, &air_handle);
  MPI_Conn_intercomm_create (air_handle, "OceanModel", &ocean_comm);
  MPI_Comm_rank (MPI_COMM_WORLD, &myrank);

  while (!done) {
      /* do work using intra-comms */
      /* swap values with other model */
      MPI_Send (databuf, cnt, MPI_DOUBLE, myrank, tag, ocean_comm);
      MPI_Recv (databuf, cnt, MPI_DOUBLE, myrank, tag, ocean_comm, &status);
  } /* end while done work */

  MPI_Conn_remove (air_handle);
  MPI_Comm_free (&ocean_comm);
  MPI_Finalize ();

  16. Ocean Model

  /* ocean model */
  MPI_Init (&argc, &argv);
  MPI_Conn_register ("OceanModel", MPI_COMM_WORLD, &ocean_handle);
  MPI_Conn_intercomm_create (ocean_handle, "AirModel", &air_comm);
  MPI_Comm_rank (MPI_COMM_WORLD, &myrank);

  while (!done) {
      /* do work using intra-comms */
      /* swap values with other model */
      MPI_Recv (databuf, cnt, MPI_DOUBLE, myrank, tag, air_comm, &status);
      MPI_Send (databuf, cnt, MPI_DOUBLE, myrank, tag, air_comm);
  }

  MPI_Conn_remove (ocean_handle);
  MPI_Comm_free (&air_comm);
  MPI_Finalize ();

  17. Coupled model [Diagram: two MPI applications, the Air Model and the Ocean Model, each with its own MPI_COMM_WORLD; the Air Model's ocean_comm and the Ocean Model's air_comm both refer to the same global inter-communicator joining them.]

  18. MPI_Connect Internals, i.e. SNIPE • SNIPE is a meta-computing system from UTK that was designed to support long-term distributed applications. • MPI_Connect uses SNIPE as its communications layer. • The naming service is provided by the RCDS RC_Server system (which is also used by HARNESS for repository information). • MPI application startup is via SNIPE daemons that interface to standard batch/queuing systems • These understand LAM, MPICH, MPIF (POE), SGI MPI and qsub variations • soon Condor and LSF

  19. PVMPI2 Internals, i.e. PVM • Uses PVM as a communications layer • The naming service is provided by the PVM Group Server in PVM 3.3.x and by the Mailbox system in PVM 3.4. • PVM 3.4 is simpler to use as it has user-controlled message contexts and message handlers. • MPI application startup is provided by specialist PVM tasker processes. • These understand LAM, MPICH, IBM MPI (POE) and SGI MPI

  20. Internals: linking with MPI • MPI_Connect and PVMPI are built as an MPI profiling interface, so they are transparent to user applications. • At build time they can be configured to call other profiling interfaces, and hence inter-operate with other MPI monitoring/tracing/debugging tool sets.

  21. MPI_Connect / PVMPI Layering [Diagram: the user's code calls an MPI_Function in the intercomm library, which looks up the communicator. If it is a true MPI intra-communicator, the profiled PMPI_Function is used; otherwise the call is translated into SNIPE/PVM addressing and handled by the SNIPE/PVM functions of the other library. The correct return code is then worked out and passed back to the user's code.]
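  An illustrative sketch (not the actual MPI_Connect source) of how such a profiling-interface wrapper can work: the helpers conn_lookup() and conn_snipe_send() are hypothetical stand-ins for the library's communicator table and its SNIPE/PVM send path.

  /* Illustrative profiling-interface wrapper (not the actual MPI_Connect
     source). conn_lookup() and conn_snipe_send() are hypothetical helpers. */
  #include <mpi.h>

  /* Hypothetical: returns non-NULL if comm was created by
     MPI_Conn_intercomm_create, i.e. it spans two MPI applications. */
  extern void *conn_lookup(MPI_Comm comm);

  /* Hypothetical: sends the buffer over SNIPE/PVM to the remote application. */
  extern int conn_snipe_send(void *conn, const void *buf, int count,
                             MPI_Datatype type, int dest, int tag);

  int MPI_Send(const void *buf, int count, MPI_Datatype type,
               int dest, int tag, MPI_Comm comm)
  {
      void *conn = conn_lookup(comm);

      if (conn == NULL)
          /* Ordinary intra-application communicator: fall through to the
             profiled (vendor) implementation underneath. */
          return PMPI_Send(buf, count, type, dest, tag, comm);

      /* Inter-application communicator: translate into SNIPE/PVM addressing
         and send outside the local MPI_COMM_WORLD. */
      return conn_snipe_send(conn, buf, count, type, dest, tag);
  }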

  22. Process Management • PVM and/or SNIPE can handle the startup of MPI jobs • General Resource Manager / Specialist PVM Tasker control / SNIPE daemons • Jobs can also be started by MPIRUN • useful when testing on interactive nodes • Once enrolled, MPI processes are under SNIPE or PVM control • signals (such as TERM/HUP) • notification messages (fault tolerance)

  23. Process Management [Diagram: a user request goes to the General Resource Manager (GRM), which drives machine-specific starters (an MPICH tasker, a POE tasker, and a PBS/qsub tasker with a PVMD) across an SGI O2K, an IBM SP2 and an MPICH cluster.]

  24. Conclusions • MPI_Connect and PVMPI allow different MPI implementations to inter-operate. • Only 3 additional calls are required. • Layering requires a fully profiled MPI library (complex linking). • Intra-communication performance may be slightly affected. • Inter-communication is as fast as the intercomm library used (either PVM or SNIPE).

  25. The MPI_Connect interface is much like names, addresses and ports in MPI-2 • Well-known address host:port • MPI_PORT_OPEN makes a port • MPI_ACCEPT lets a client connect • MPI_CONNECT is the client-side connection • Service naming • MPI_NAME_PUBLISH (port, info, service) • MPI_NAME_GET on the client side to get the port

  26. Server-Client model [Diagram: the server calls MPI_Accept(); the client calls MPI_Connect with the server's {host:port}; an inter-communicator links the two.]

  27. Server-Client model [Diagram: the server calls MPI_Port_open() to obtain a {host:port} address and publishes it under a service NAME with MPI_Name_publish; the client retrieves the {host:port} with MPI_Name_get.]
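  For reference, a hedged sketch of the same pattern written with the function names that appeared in the final MPI-2 standard (the slides use the draft-era names); the service name "OceanModel" is only an example.

  /* The same server-client pattern with the names adopted in the final
     MPI-2 standard (the slides show draft-era names). */

  /* ---- server side ---- */
  char port[MPI_MAX_PORT_NAME];
  MPI_Comm inter_comm;

  MPI_Open_port(MPI_INFO_NULL, port);                    /* make a port          */
  MPI_Publish_name("OceanModel", MPI_INFO_NULL, port);   /* publish under a name */
  MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD,
                  &inter_comm);                          /* wait for a client    */

  /* ---- client side ---- */
  char port[MPI_MAX_PORT_NAME];
  MPI_Comm inter_comm;

  MPI_Lookup_name("OceanModel", MPI_INFO_NULL, port);    /* find the port        */
  MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD,
                   &inter_comm);                         /* connect to server    */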

  28. SNIPE

  29. SNIPE • Single global name space • Built using RCDS; supports URLs, URNs and LIFNs • testing against LDAP • Scalable / Secure • Multiple resource managers • Safe execution environment • Java etc. • Parallelism is the basic unit of execution

  30. Additional Information • PVM : http://www.epm.ornl.gov/pvm • PVMPI : http://icl.cs.utk.edu/projects/pvmpi/ • MPI_Connect : http://icl.cs.utk.edu/projects/mpi_connect/ • SNIPE : http://www.nhse.org/snipe/ and http://icl.cs.utk.edu/projects/snipe/ • RCDS : http://www.netlib.org/utk/projects/rcds/ • ICL : http://www.netlib.org/icl/ • CRPC : http://www.crpc.rice.edu/CRPC • DOD HPC MSRCs • CEWES : http://www.wes.hpc.mil • ARL : http://www.arl.hpc.mil • ASC : http://www.asc.hpc.mil
