

  1. MPI_Connect and Parallel I/O for Distributed Applications Dr Graham E Fagg Innovative Computing Laboratory University of Tennessee Knoxville, TN 37996-1301

  2. Overview • Aims of Project • MPI_Connect (PVMPI) • Changes to MPI_Connect • PVM • SNIPE and RCDS • Single Thread comms • Multi-threaded comms

  3. Overview (cont) • Parallel IO • Subsystems • ROMIO (MPICH) • High performance Platforms • File handling and management • Pre-caching • IBP and SNIPE-Lite • Experimental system

  4. Overview (cont) • Future Work • File handling and exemplars

  5. Aims of Project • Continue development of MPI_Connect • Enhance features as requested by users • Support Parallel IO (as in MPI-2 Parallel IO) • Support complex file management across systems and sites • Since we already support the computation, why not support the input and result files as well?

  6. Aims of Project (cont) • Support better scheduling of application runs across sites and systems • I.e. gang scheduling of both processors and pre-fetching of data (logistical scheduling)

  7. Background on MPI_Connect • What is MPI_Connect? • A system that allows two or more high performance MPI applications to inter-operate across systems/sites. • It allows each application to use the tuned vendor-supplied MPI implementation without forcing the loss of local performance that occurs with systems like MPICH (p2) and Globus MPI (Nexus MPI).

  8. Coupled model example Two MPI applications, an Ocean Model and an Air Model, each with its own MPI_COMM_WORLD, are joined by a global inter-communicator (air_comm <-> ocean_comm) provided by MPI_Connect.

  9. MPI_Connect • The application developer adds just three extra calls to an application to allow it to inter-operate with any other application. • MPI_Conn_register, MPI_Conn_intercomm_create, MPI_Conn_remove • Once the above calls are added, normal MPI point-to-point calls can be used to send messages between systems (see the sketch below). • The only requirements are that both applications can access a common name service (usually via IP) and that the MPI implementation has a profiling layer available.
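As a rough illustration of the model above, here is a minimal sketch in C. The MPI_Conn_* call names are those listed on the slide, but their exact signatures and the application names ("air_model", "ocean_model") are assumptions made for this example, not the library's documented API.

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm ocean_comm;   /* inter-communicator to the other application */
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Register this application with the common name service under a
       global name (signature assumed for illustration). */
    MPI_Conn_register("air_model", MPI_COMM_WORLD);

    /* Build an inter-communicator to the application registered as
       "ocean_model" (signature assumed). */
    MPI_Conn_intercomm_create("ocean_model", MPI_COMM_WORLD, &ocean_comm);

    /* Ordinary point-to-point calls now cross between the systems. */
    if (rank == 0) {
        double flux = 1.0;
        MPI_Send(&flux, 1, MPI_DOUBLE, 0, 0, ocean_comm);
    }

    /* Withdraw from the name service before finalizing (signature assumed). */
    MPI_Conn_remove("air_model");
    MPI_Finalize();
    return 0;
}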

  10. Intercomm Library The user's code calls an MPI_Function; the library looks up the communicator. If it is a true MPI intracommunicator, the profiled PMPI_Function is used; otherwise the call is translated into SNIPE/PVM addressing and handled by the SNIPE/PVM library. MPI_Connect then works out the correct return code and hands it back to the user's code.

  11. Changes to MPI_Connect • PVM • Worked well for the SC98 High Performance Computing Challenge demo. • BUT • Not everything worked as well as it should have • PVM got in the way of the IBM POE job control system • Asynchronous messages were non-blocking but not truly asynchronous • As discovered by the SPLICE team.

  12. MPI_Connect and SNIPE • SNIPE (Scalable Networked Information Processing Environment) • Was seen as a replacement for PVM • No central point of failure • High speed reliable communications with limited QoS • Powerful, Secure MetaData service based on RCDS

  13. MPI_Connect and SNIPE • SNIPE used RCDS for its name service • This worked on the SGI Origin and IBM SP systems but did not, and still does not, work on the Cray T3E (jim). • Solution • Kept the communications layer (SNIPE_Lite) and dropped RCDS in favour of a custom name service daemon

  14. MPI_Connect and single threaded communications • The SNIPE_Lite communications library was by default single threaded • Note: single-threaded Nexus is also called NexusLite. • This meant that asynchronous non-blocking calls became merely non-blocking: no progress could be made outside of an MPI call • (just as in the PVM case when using direct IP sockets)

  15. Single threaded communications A message sent in MPI_Isend() between different systems: each time an MPI call is made, the sender can check the outgoing socket and force some more data through it. The socket should be marked non-blocking so that the MPI application cannot be deadlocked by the actions of an external system, i.e. so the local system does not stall waiting on the remote one. When the user's application does an MPI_Wait(), this communication is forced through to completion.
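A minimal sketch of this progress-on-every-call idea, in C with POSIX sockets. The pending_send structure and try_progress() helper are illustrative names invented here; they are not part of MPI_Connect or SNIPE_Lite.

#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

struct pending_send {
    int    fd;    /* non-blocking TCP socket to the remote system */
    char  *buf;   /* message data being sent                      */
    size_t done;  /* bytes already pushed through the socket      */
    size_t len;   /* total message length                         */
};

/* Called on entry to every intercepted MPI call: push as much of the
   outstanding message as the socket will accept without blocking. */
static void try_progress(struct pending_send *p)
{
    while (p->done < p->len) {
        ssize_t n = send(p->fd, p->buf + p->done, p->len - p->done, 0);
        if (n > 0)
            p->done += (size_t)n;
        else if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            return;   /* socket full: hand control back to the MPI call */
        else
            return;   /* error handling omitted in this sketch */
    }
}

In this scheme an MPI_Wait() simply loops on the same routine until the whole message has gone, which is why the transfer only completes while the application is inside an MPI call.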

  16. Multi-threaded Communications • The solution was to use multi-threaded communications for external traffic. • 3 threads in the initial implementation • 1 send thread • 1 receive thread • 1 control thread that handles name service requests and sets up connections to external applications

  17. Multi-threaded Communications • How does it work? • Sends put message descriptions onto a send queue • Receive operations put requests onto a receive queue • If the operation is blocking, the caller is suspended until a condition arises that would wake it up (using condition variables) • While the main thread continues after 'posting' a non-blocking operation, the threading library steals cycles to send/receive the message (see the sketch below).
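A sketch of the queue-plus-threads scheme in C with POSIX threads; the structure and function names below are illustrative, not MPI_Connect's own API. A dedicated send thread (not shown) pops requests off the queue, transmits them, sets done and signals the condition variable.

#include <pthread.h>
#include <stddef.h>

struct send_req {
    void            *buf;   /* user buffer to transmit              */
    size_t           len;
    int              done;  /* set by the send thread on completion */
    struct send_req *next;
};

static struct send_req *send_queue = NULL;
static pthread_mutex_t  q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t   q_cond = PTHREAD_COND_INITIALIZER;

/* Non-blocking post (the MPI_Isend analogue): enqueue and return at once. */
void post_send(struct send_req *req)
{
    pthread_mutex_lock(&q_lock);
    req->done  = 0;
    req->next  = send_queue;
    send_queue = req;
    pthread_cond_signal(&q_cond);   /* wake the send thread */
    pthread_mutex_unlock(&q_lock);
}

/* Blocking wait (the MPI_Wait analogue): sleep on the condition variable
   until the send thread has marked this request complete. */
void wait_send(struct send_req *req)
{
    pthread_mutex_lock(&q_lock);
    while (!req->done)
        pthread_cond_wait(&q_cond, &q_lock);
    pthread_mutex_unlock(&q_lock);
}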

  18. Multi-threaded Communications Performance Test done by posting a non-blocking send and measuring the number of operations the main thread could perform while waiting for the 'send' to complete. The system switched to non-blocking TCP sockets when more than one external connection was open.

  19. MPI_Connect used? • On some machines we could not use Globus to interconnect them without too much loss of internal performance • 3 large IBM SPs • Single computation, 5800 CPUs, over 6 hours of computation and a peak performance of 2.144 TFlops

  20. Parallel IO • Parallel IO gives parallel user applications access to large volumes of data in such a way that, by avoiding sharing, throughput can be increased through optimizations at the OS and H/W architecture levels. • MPI-2 provides an API for accessing high performance Parallel IO subsystems (see the example below).
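For reference, a minimal MPI-2 parallel I/O example. These are standard MPI-2 calls (as implemented by ROMIO), not MPI_Connect additions; each rank writes its own disjoint block of a shared file.

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, data[4] = {0, 1, 2, 3};
    MPI_File fh;
    MPI_Offset offset;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_File_open(MPI_COMM_WORLD, "results.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each process writes at its own offset, so no data is shared. */
    offset = (MPI_Offset)rank * sizeof(data);
    MPI_File_write_at(fh, offset, data, 4, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}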

  21. Parallel IO • Most Parallel IO implementations are built from ROMIO, a model implementation supplied with MPICH 1.1.2. • The SGI Origin at CEWES is at MPT 1.2.1.0 and the version needed is MPT 1.3. • The Cray T3E (jim) is at MPT 1.2.1.3, although 1.3 and 1.3.01 (patched) are reported by the system.

  22. File handling and Management • MPI_Connect handles the communication between separate MPI applications BUT it does not handle the files that they work on or produce. The aims of the second half of the project are to provide users of MPI_Connect with the ability to share files across multiple systems and sites in such a way that it complements their application execution and their use of MPI-2 Parallel IO.

  23. File handling and Management • This project should produce tools that allow applications to share whole files or parts of files, and allow these files to be accessed by a running application no matter where it executes. • If an application runs at CEWES or ASC, the input file should be in a single location at the beginning of the run, and the result file should also be in a single location at the end of the run, regardless of where the application executed.

  24. File handling and Management • Two systems used: • Internet Backplane Protocol (IBP) • Part of the Internet2 project Distributed Storage Infrastructure (I2-DSI). • Code developed at UTK. Tested on the I2 system. • SNIPE_Lite store&forward daemon (SFD) • SNIPE_Lite is already used by MPI_Connect. • Code developed at UTK.

  25. File handling and management The system uses five extra commands: MPI_Conn_getfile, MPI_Conn_getfile_view, MPI_Conn_putfile, MPI_Conn_putfile_view, MPI_Conn_releasefile. Getfile gets a file from a central location into the local 'parallel' filesystem. Putfile puts a file from a local filesystem into a central location. The _view versions work on subsets of files.

  26. File handling and management • Example code: MPI_Init(&argc, &argv); …… MPI_Conn_getfile_view(mydata, myworkdata, me, num_of_apps, &size); /* Get my part of the file called mydata and call it myworkdata */ … MPI_File_open(MCW, myworkdata, …); /* file is now available via MPI-2 IO */
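Filling out the slide's fragment into a fuller, but still hypothetical, round trip: the MPI_Conn_*file* names come from the slides, while their exact signatures, the file names, and the put/release steps at the end are assumptions made here for illustration.

#include <mpi.h>

int main(int argc, char **argv)
{
    int me = 0;               /* which application this is (assumed known) */
    int num_of_apps = 2;      /* how many applications share the file      */
    int size;
    MPI_File fh;

    MPI_Init(&argc, &argv);

    /* Fetch this application's part of the central file "mydata" into the
       local parallel filesystem as "myworkdata" (signature assumed). */
    MPI_Conn_getfile_view("mydata", "myworkdata", me, num_of_apps, &size);

    /* The local copy is now an ordinary file as far as MPI-2 IO is concerned. */
    MPI_File_open(MPI_COMM_WORLD, "myworkdata",
                  MPI_MODE_RDWR, MPI_INFO_NULL, &fh);
    /* ... read, compute and write via MPI-2 IO ... */
    MPI_File_close(&fh);

    /* Push the result part back to the central location and release the
       local copy (signatures assumed). */
    MPI_Conn_putfile_view("myresults", "myworkdata", me, num_of_apps);
    MPI_Conn_releasefile("myworkdata");

    MPI_Finalize();
    return 0;
}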

  27. Test-bed • Between two clusters of Sun UltraSparc systems. • MPI applications are implemented using MPICH. • MPI Parallel IO is implemented using ROMIO. • The system is tested with both IBP and SNIPE_Lite, as the user API is the same.

  28. Test-bed • Networking tests between three sites • UTK • ERDC MSRC • some DoE site

  29. Example Two MPI applications (MPI_App 1 and MPI_App 2), the input file, and the File Support Daemon (IBP or SLSFD).

  30. Example Each application issues a Getfile request, Getfile (0,2..) and Getfile (1,2..), to the File Support Daemon (IBP or SLSFD).

  31. Example The File Support Daemon locates the input file and begins serving the two Getfile requests.

  32. Example File data is passed to the applications in block fashion so that it does not overload the daemon.

  33. Example Each application now holds a local copy of its part of the input file; the files are ready to be opened by the MPI-2 MPI_File_open function.

  34. Changes.. • Single interface • I.e. no separate get and put file calls • All work is done at MPI_File_open('global_name') • The library catches the open and copies the file using the profiling interface (see the sketch below)… • Also have utilities that move files for you….
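A sketch of how such an intercept could work through the MPI profiling layer. fetch_global_file() is a hypothetical helper standing in for the copy from the file support daemon; this is an illustration of the interception idea, not the project's actual implementation.

#include <mpi.h>

/* Hypothetical helper: copies 'global_name' from the file support daemon
   (IBP or SLSFD) to a local path and returns that path in 'local_name'. */
extern int fetch_global_file(const char *global_name, char *local_name);

int MPI_File_open(MPI_Comm comm, char *filename, int amode,
                  MPI_Info info, MPI_File *fh)
{
    char local_name[1024];
    int rank;

    MPI_Comm_rank(comm, &rank);

    /* Only rank 0 fetches the file; everyone then learns the local name. */
    if (rank == 0)
        fetch_global_file(filename, local_name);
    MPI_Bcast(local_name, sizeof(local_name), MPI_CHAR, 0, comm);

    /* Hand the now-local file to the vendor MPI-IO implementation. */
    return PMPI_File_open(comm, local_name, amode, info, fh);
}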

  35.-42. A tale of two networks Diagrams of the links between Site A, Site B and the DoE site. The observed bandwidth between the sites varied widely: figures of 500 KB/sec, 100 KB/sec and 300-600 KB/sec appeared on different links at different times, and sometimes the achievable bandwidth was simply unknown (?). Also: watch out for the speed traps, FIREWALLS!

  43. Future • More changes under the covers • catching the etypes and changing them so that only the smallest amount of data possible is loaded on each disk • MPI_Connect PIO layer becomes a device used by HDF5… • Who uses MPI-2 PIO anyway???
