The Check-Pointed and Error-Recoverable MPI Java of AgentTeamwork Grid Computing Middleware
Munehiro Fukuda and Zhiji Huang
Computing & Software Systems, University of Washington, Bothell
Funded by
IEEE PacRim 2005
Background
• Target applications in most grid-computing systems
  • Communication takes place at the beginning and the end of each sub task.
  • A crashed sub task will be simply repeated.
  • Example: Master-worker and parameter-sweep models
• Fault tolerance
  • FT-MPI or MPI in Legion/Avaki
    • The system will not recover lost messages.
    • Users must specify variables to save and add a function called from MPI.Init( ) upon a job resumption.
  • Condor MW
    • Messages between the master and each slave will be saved.
    • No inter-slave communication will be check-pointed.
  • Rock/Rack
    • Socket buffers will be saved at application level.
    • A process must be mobile-aware to keep track of its communication counterpart.
Objective
• More programming models
  • Not restricted to master-slave or parameter-sweep models
  • Targeting heartbeat, pipeline, and collective-communication-oriented applications
• Mid-execution process resumption
  • Resuming a process from its last checkpoint
  • Allowing process migration for performance improvement
• Error-recovery support from sockets to MPI
  • Providing a check-pointed, error-recoverable Java socket
  • Implementing the mpiJava API on top of our fault-tolerant socket
System Overview
Figure: User A's processes and User B's process, each made of snapshot methods, GridTCP, and a user program wrapper, linked by TCP communication. Labeled agents: commander agents (one each for User A and User B), resource agents, sentinel agents, and bookkeeper agents. Other labels: FTP server, snapshot/snapshots, results.
Execution Layer
Figure: layer stack, top to bottom:
• Java user applications
• mpiJava API
• mpiJava-S | mpiJava-A
• Java socket | GridTcp
• User program wrapper
• Commander, resource, sentinel, and bookkeeper agents
• UWAgents mobile agent execution platform
• Operating systems
Programming Interface
public class MyApplication {
    public GridIpEntry ipEntry[];       // used by the GridTcp socket library
    public int funcId;                  // used by the user program wrapper
    public GridTcp tcp;                 // the GridTcp error-recoverable socket
    public int nprocess;                // #processors
    public int myRank;                  // processor id (or mpi rank)

    public int func_0( String args[] ) { // constructor
        MPJ.Init( args, ipEntry, tcp );  // invoke mpiJava-A
        .....;                           // more statements to be inserted
        return 1;                        // calls func_1( )
    }

    public int func_1( ) {               // called from func_0
        if ( MPJ.COMM_WORLD.Rank( ) == 0 )
            MPJ.COMM_WORLD.Send( ... );
        else
            MPJ.COMM_WORLD.Recv( ... );
        .....;                           // more statements to be inserted
        return 2;                        // calls func_2( )
    }

    public int func_2( ) {               // called from func_1, the last function
        .....;                           // more statements to be inserted
        MPJ.Finalize( );                 // stops mpiJava-A
        return -2;                       // application terminated
    }
}
MPJ Package
• MPJ: Init( ), Rank( ), Size( ), and Finalize( )
• Communicator: all communication functions: Send( ), Recv( ), Gather( ), Reduce( ), etc.
  • JavaComm (mpiJava-S): uses Java sockets and server sockets.
  • GridComm (mpiJava-A): uses GridTcp sockets.
    • InputStream for each rank
    • OutputStream for each rank
    • Uses a permanent 64K buffer for serialization
    • Emulates collective communication by sending the same data to each OutputStream, which deteriorates performance
• DataType: MPJ.INT, MPJ.LONG, etc.
• MPJMessage: getStatus( ), getMessage( ), etc.
• Op: Operate( ), etc.
• Other utilities
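The emulated collective communication described on this slide can be illustrated with a minimal sketch. It assumes one OutputStream per rank and a fixed 64K serialization buffer, as the slide states; the class name BcastSketch, the method bcast, and the chunking policy are hypothetical and are not taken from the mpiJava-A source.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;

// Hypothetical illustration only; not the real GridComm class.
class BcastSketch {
    // Permanent 64K buffer reused for every serialization.
    private final byte[] buffer = new byte[64 * 1024];

    // Emulates a broadcast: serializes the doubles chunk by chunk and writes
    // the same bytes to every rank's stream except the root's.
    // Returns the number of stream writes performed.
    int bcast(double[] data, int root, OutputStream[] outs) {
        int writes = 0;
        int perChunk = buffer.length / Double.BYTES;
        for (int offset = 0; offset < data.length; offset += perChunk) {
            int count = Math.min(perChunk, data.length - offset);
            // Serialize this chunk into the shared buffer (big-endian).
            ByteBuffer.wrap(buffer).asDoubleBuffer().put(data, offset, count);
            for (int rank = 0; rank < outs.length; rank++) {
                if (rank == root) continue; // the root already holds the data
                try {
                    outs[rank].write(buffer, 0, count * Double.BYTES);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
                writes++;
            }
        }
        return writes;
    }

    public static void main(String[] args) {
        ByteArrayOutputStream[] outs = {
            new ByteArrayOutputStream(), new ByteArrayOutputStream(), new ByteArrayOutputStream()
        };
        int writes = new BcastSketch().bcast(new double[] {1.5, 2.5}, 0, outs);
        System.out.println("stream writes: " + writes);
    }
}
```

The number of writes grows linearly with the number of ranks, which is one way to read the slide's remark that this emulation deteriorates performance.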
GridTcp – Check-Pointed Connection
Figure: user program wrappers on n1.uwb.edu, n2.uwb.edu, and n3.uwb.edu connected over TCP. Each wrapper holds a user program, a rank/ip table, and incoming, outgoing, and backup queues, with snapshot maintenance. The rank/ip table maps rank 1 to n1.uwb.edu and rank 2 to n2.uwb.edu, with the rank 2 entry updated to n3.uwb.edu.
• Outgoing packets saved in a backup queue
• All packets serialized in a backup file at every checkpoint
• Upon a migration
  • Packets de-serialized from a backup file
  • Backup packets restored in outgoing queue
  • IP table updated
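The checkpoint and migration steps listed on this slide can be sketched as follows. This is a minimal illustration under stated assumptions, not the GridTcp implementation: the class BackupQueueSketch and all of its methods are hypothetical, a byte array stands in for the backup file, and no real socket is opened.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the real GridTcp class.
class BackupQueueSketch {
    // Packets already sent that may have to be re-sent after a migration.
    private final ArrayDeque<byte[]> backup = new ArrayDeque<>();
    // Hypothetical rank -> host table (the rank/ip table on the slide).
    private final Map<Integer, String> ipTable = new HashMap<>();

    BackupQueueSketch(Map<Integer, String> hosts) {
        ipTable.putAll(hosts);
    }

    // A real library would also write the packet to a TCP socket here.
    void send(byte[] packet) {
        backup.addLast(packet.clone());
    }

    // Checkpoint: serialize every backed-up packet into a snapshot.
    byte[] checkpoint() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(new ArrayList<>(backup));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Migration: de-serialize the packets, restore them for re-sending,
    // and update the table entry of the rank that moved.
    @SuppressWarnings("unchecked")
    static BackupQueueSketch restore(byte[] snapshot, Map<Integer, String> hosts,
                                     int movedRank, String newHost) {
        BackupQueueSketch q = new BackupQueueSketch(hosts);
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(snapshot))) {
            q.backup.addAll((List<byte[]>) in.readObject());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
        q.ipTable.put(movedRank, newHost);
        return q;
    }

    int pending() { return backup.size(); }
    byte[] nextToResend() { return backup.peekFirst(); }
    String hostOf(int rank) { return ipTable.get(rank); }
}
```

A caller would send packets, take a snapshot with checkpoint( ), and after a migration call restore( ) with the moved rank and its new host, for example rank 2 moving from n2.uwb.edu to n3.uwb.edu as in the figure.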
GridTcp – Over-Gateway Connection
Figure: four user program wrappers, each running a user program, on mnode0 (rank 0), medusa.uwb.edu (rank 1), uw1-320.uwb.edu (rank 2), and uw1-320-00 (rank 3).
• RIP-like connection
• Restriction: each node name must be unique.
User Program Wrapper

Source Code
    statement_1; statement_2; statement_3;
    check_point( );
    statement_4; statement_5; statement_6;
    check_point( );
    statement_7; statement_8; statement_9;
    check_point( );

Preprocessed
    func_0( ) { statement_1; statement_2; statement_3; return 1; }
    func_1( ) { statement_4; statement_5; statement_6; return 2; }
    func_2( ) { statement_7; statement_8; statement_9; return -2; }

User Program Wrapper
    int func_id = 0;                    // start from func_0
    while ( func_id != -2 ) {           // -2 means the application terminated
        switch ( func_id ) {
            case 0: func_id = func_0( ); break;
            case 1: func_id = func_1( ); break;
            case 2: func_id = func_2( ); break;
        }
        check_point( );                 // snapshot after each func_n returns
    }

    check_point( ) {
        // save this object
        // including func_id
        // into a file
    }
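The dispatch-and-snapshot idea on this slide can be tried out with a small, self-contained sketch. It assumes that a snapshot is taken after every func_n returns (consistent with the Conclusions slide) and uses Java object serialization into a byte array in place of a snapshot file. The class WrapperSketch, the field sum, and the maxCalls parameter used to simulate a crash are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical illustration only; not the AgentTeamwork user program wrapper.
class WrapperSketch implements Serializable {
    private static final long serialVersionUID = 1L;

    int funcId = 0;                 // next func_n to run; -2 means terminated
    int sum = 0;                    // example application state kept in the snapshot
    transient byte[] lastSnapshot;  // stand-in for the snapshot file

    int func_0() { sum += 1;   return 1;  }
    int func_1() { sum += 10;  return 2;  }
    int func_2() { sum += 100; return -2; }

    // Runs func_n calls until termination, or until maxCalls calls have been
    // made (used here to simulate a crash between two checkpoints).
    void run(int maxCalls) {
        for (int calls = 0; funcId != -2 && calls < maxCalls; calls++) {
            switch (funcId) {
                case 0: funcId = func_0(); break;
                case 1: funcId = func_1(); break;
                case 2: funcId = func_2(); break;
                default: throw new IllegalStateException("unknown func id " + funcId);
            }
            lastSnapshot = checkPoint(); // save this object, including funcId
        }
    }

    byte[] checkPoint() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(this);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Rebuilds the application object from a snapshot so that run( )
    // continues from the function recorded in funcId.
    static WrapperSketch resume(byte[] snapshot) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(snapshot))) {
            return (WrapperSketch) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        WrapperSketch first = new WrapperSketch();
        first.run(2);                               // "crash" after func_1
        WrapperSketch resumed = resume(first.lastSnapshot);
        resumed.run(Integer.MAX_VALUE);             // continues with func_2
        System.out.println("funcId=" + resumed.funcId + " sum=" + resumed.sum);
    }
}
```

Because the whole object is serialized at each step, the resumed copy skips func_0 and func_1 and only executes func_2.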
Preprocessor and Drawback

Source Code
    statement_1; statement_2; statement_3;
    check_point( );
    while (…) {
        statement_4;
        if (…) { statement_5; check_point( ); statement_6; }
        else statement_7;
        statement_8;
    }
    check_point( );

Preprocessed Code
    int func_0( ) { statement_1; statement_2; statement_3; return 1; }

    int func_1( ) {                     // before check_point( ) in if-clause
        while (…) {
            statement_4;
            if (…) { statement_5; return 2; }
            else statement_7;
            statement_8;
        }
    }

    int func_2( ) {                     // after check_point( ) in if-clause
        statement_6; statement_8;
        while (…) {
            statement_4;
            if (…) { statement_5; return 2; }
            else statement_7;
            statement_8;
        }
    }

• No recursion
• Line numbers reported on errors no longer match the original source
• Explicit snapshot points are still needed
MPI Job Coordination
Figure: a sentinel agent on UWPlace (UWAgent Execution Platform) containing a Main thread, a SendSnapshot thread, a ReceiveMsg thread, a TCPError thread, and a user program wrapper (user program, rank/ip table, and incoming, outgoing, and backup queues over TCP). Other labels: bookkeeper agent, snapshot, resumed sentinel agent, restart message (a new rank/ip pair), MPI connection. The rank/ip table maps rank 1 to n1.uwb.edu and rank 2 to n2.uwb.edu, with the rank 2 entry updated to n3.uwb.edu.
MPJ.Send and Recv Performance
MPJ.Bcast Performance - Doubles
Conclusions
• Raw bandwidth
  • mpiJava-S comes to about 95-100% of maximum Java performance.
  • mpiJava-A (with check-pointing and error recovery) incurs 20-60% overhead, but still overtakes mpiJava with bigger data segments.
• Serialization
  • When dealing with primitives or objects that need serialization, a 25-50% overhead is incurred.
• Memory issues related to mpiJava-A
  • Caused by the snapshots created at every func_n call.
• Future work
  • Performance and memory-usage improvement
  • Preprocessor implementation