CSS490 Group Communication and MPI
Textbook Ch. 3
Instructor: Munehiro Fukuda
These slides were compiled from the course textbook, the reference books, and the instructor's original materials.
Group Communication
• Communication types:
  • One-to-many: broadcast
  • Many-to-one: synchronization, collective communication
  • Many-to-many: gather and scatter
• Group addressing:
  • Using a special network address: IP class D and UDP
  • Emulating a broadcast with a repetition of one-to-one communication:
    • Performance drawback on bus-type networks
    • Simpler for switch-based networks
• Semantics:
  • Send-to-all and bulletin-board semantics
  • 0-, 1-, m-out-of-n, and all-reliable delivery
Atomic Multicast
• Send-to-all semantics, all-reliable
• Simple emulation:
  • A repetition of one-to-one communication with acknowledgments
• What if a receiver fails?
  • Time-out and retransmission
• What if the sender fails before all receivers receive the message?
  • All receivers forward the message to the same group.
  • A receiver discards the second and any following copies of the message (see the sketch below).
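To make the duplicate-discard step concrete, here is a minimal sketch in C++ (not from the textbook): Message, delivered, onReceive(), and the commented-out forwardToGroup() are all hypothetical names, and the transport layer is assumed to exist elsewhere.

  #include <iostream>
  #include <set>
  #include <string>

  // Hypothetical message: a globally unique id plus a payload.
  struct Message {
      long id;            // assigned by the original sender
      std::string body;
  };

  // Ids of messages already delivered to the application.
  static std::set<long> delivered;

  // Called for every copy that arrives, including copies forwarded
  // by other receivers after a sender crash.
  void onReceive(const Message& m) {
      if (delivered.count(m.id))      // 2nd or later copy: discard
          return;
      delivered.insert(m.id);
      // forwardToGroup(m);           // re-multicast so the message survives
                                      // a sender crash (transport not shown)
      std::cout << "deliver: " << m.body << std::endl;  // deliver exactly once
  }

  int main() {
      Message m = { 42, "hello group" };
      onReceive(m);   // first copy: delivered
      onReceive(m);   // duplicate (e.g., a forwarded copy): discarded
  }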
Message Ordering
• R1 and R2 receive m1 and m2 in different orders!
• Some message ordering is required:
  • Absolute ordering
  • Consistent ordering
  • Causal ordering
  • FIFO ordering
(Figure: S1 multicasts m1 and S2 multicasts m2; R1 and R2 see them in different orders.)
Absolute Ordering
• Rule:
  • mi must be delivered before mj if Ti < Tj (Ti, Tj: the messages' timestamps)
• Implementation (sketched below):
  • A clock synchronized among machines
  • A sliding time window used to commit delivery of messages whose timestamps fall within the window
• Example:
  • Distributed simulation
• Drawbacks:
  • Too strict a constraint
  • No absolutely synchronized clock
  • No guarantee of catching all tardy messages
(Figure: mi and mj, with Ti < Tj, delivered in the same timestamp order at both receivers.)
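A minimal sketch of the sliding-window idea, assuming a synchronized clock whose readings arrive through the caller of commit(); TimedMessage, WINDOW, and the function names are hypothetical. Arrivals are buffered in a min-heap by timestamp, and a message is committed only after it is older than the window, so any slower message with a smaller timestamp has hopefully already arrived.

  #include <iostream>
  #include <queue>
  #include <string>
  #include <vector>

  struct TimedMessage {
      double timestamp;                 // synchronized-clock time at the sender
      std::string body;
      bool operator>(const TimedMessage& o) const { return timestamp > o.timestamp; }
  };

  const double WINDOW = 0.5;            // seconds; chosen larger than the max delay

  // Min-heap ordered by timestamp: the hold-back queue.
  std::priority_queue<TimedMessage, std::vector<TimedMessage>,
                      std::greater<TimedMessage> > holdback;

  void onReceive(const TimedMessage& m) { holdback.push(m); }

  // Called periodically; 'now' would come from the synchronized clock.
  void commit(double now) {
      while (!holdback.empty() && holdback.top().timestamp < now - WINDOW) {
          std::cout << "deliver: " << holdback.top().body << std::endl;
          holdback.pop();               // delivered in global timestamp order
      }
  }

  int main() {
      onReceive(TimedMessage{ 2.0, "m2" });   // arrives first ...
      onReceive(TimedMessage{ 1.0, "m1" });   // ... but has the smaller timestamp
      commit(3.0);                            // delivers m1, then m2
  }

A message that arrives with a timestamp already outside the window is a tardy message that this scheme cannot order correctly, which is the last drawback listed above.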
Consistent Ordering
• Rule:
  • Messages are received in the same order at every site (regardless of their timestamps).
• Implementation (a sequencer sketch follows below):
  • A message is sent to a sequencer, assigned a sequence number, and finally multicast to the receivers
  • A receiver delivers messages in increasing sequence-number order
• Example:
  • Replicated database update
• Drawback:
  • A centralized algorithm
(Figure: even with Ti < Tj, both receivers deliver mj before mi, i.e., the same order everywhere.)
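A minimal sketch of the sequencer side, assuming one sequencer process reachable by every sender; Sequencer, assignAndForward(), and multicastToGroup() are hypothetical names, and the multicast transport is stubbed out. Since every message passes through the sequencer, all receivers see one global numbering.

  #include <iostream>
  #include <string>

  // Hypothetical stub: multicast a numbered message to the whole group.
  void multicastToGroup(long seq, const std::string& body) {
      std::cout << "multicast #" << seq << ": " << body << std::endl;
  }

  // The sequencer: a single process that stamps every message with
  // the next global sequence number before it is multicast.
  class Sequencer {
      long next;                            // next global sequence number
  public:
      Sequencer() : next(0) {}
      void assignAndForward(const std::string& body) {
          multicastToGroup(next++, body);   // one global order for everyone
      }
  };

  int main() {
      Sequencer seq;
      seq.assignAndForward("update X");     // becomes #0 at every receiver
      seq.assignAndForward("update Y");     // becomes #1 at every receiver
  }

The receiver side then delivers messages in increasing sequence-number order, holding back early arrivals; a hold-back sketch appears after the FIFO Ordering slide.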
Causal Ordering
• Rule: the happened-before relation →
  • If e_i^k, e_i^l ∈ h_i (process i's history) and k < l, then e_i^k → e_i^l
  • If e_i = send(m) and e_j = receive(m), then e_i → e_j
  • If e → e' and e' → e'', then e → e''
• Implementation:
  • Use of a vector message (next slide)
• Example:
  • Distributed file system
• Drawbacks:
  • The vector is an overhead
  • Broadcast is assumed
(Figure: S1 and S2 multicast m1-m4 to R1-R3; from R2's viewpoint, m1 → m2.)
Vector Message
• A receiver with vector R delivers a message stamped with vector S from source i only when (checked in code below):
  • S[i] = R[i] + 1, where i is the source id
  • S[j] ≤ R[j] for all j ≠ i
(Figure: sites A-D exchange vector-stamped messages such as 2,1,0,0 and 2,1,1,0; a message that arrives before its causal predecessors is delayed, then delivered once they arrive.)
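The two conditions translate directly into code. A minimal sketch, assuming each of the n sites keeps a vector R counting the messages it has delivered from every site; canDeliver() is a hypothetical helper implementing exactly the test above.

  #include <iostream>
  #include <vector>

  // True if a message stamped with vector S from site 'src' may be
  // delivered at a receiver whose current vector is R.
  bool canDeliver(const std::vector<int>& S, const std::vector<int>& R, int src) {
      for (size_t j = 0; j < S.size(); j++) {
          if ((int)j == src) {
              if (S[j] != R[j] + 1) return false;   // exactly the next message from src
          } else {
              if (S[j] > R[j]) return false;        // a causal predecessor is missing
          }
      }
      return true;
  }

  int main() {
      std::vector<int> R(4); R[0] = 1; R[1] = 1;    // seen 1 msg from A, 1 from B

      std::vector<int> early(4);                    // from site C (id 2) ...
      early[0] = 2; early[1] = 1; early[2] = 1;     // ... presupposing a 2nd msg from A
      std::cout << canDeliver(early, R, 2) << std::endl;   // 0: delayed

      std::vector<int> next(4);                     // the missing 2nd message from A
      next[0] = 2; next[1] = 1;
      std::cout << canDeliver(next, R, 0) << std::endl;    // 1: delivered
  }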
FIFO Ordering
• Rule:
  • Messages from the same sender are received in the order they were sent.
• Implementation (sketched below):
  • Messages are assigned per-sender sequence numbers
• Example:
  • TCP
• This is the weakest ordering.
(Figure: m1-m4 travel from S to R over two different routers; sequence numbers let R restore the send order.)
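A minimal receiver-side sketch, assuming every message carries its sender's sequence number; expected, holdback, and onReceive() are hypothetical names. Early arrivals wait in a map until the next expected number shows up.

  #include <iostream>
  #include <map>
  #include <string>

  long expected = 0;                     // next sequence number to deliver
  std::map<long, std::string> holdback;  // early arrivals, keyed by seq. number

  void onReceive(long seq, const std::string& body) {
      holdback[seq] = body;
      // Deliver as long as the next expected message is available.
      while (holdback.count(expected)) {
          std::cout << "deliver #" << expected << ": "
                    << holdback[expected] << std::endl;
          holdback.erase(expected++);
      }
  }

  int main() {
      onReceive(1, "m2");   // early: held back
      onReceive(0, "m1");   // delivers m1 (#0), then m2 (#1)
  }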
Why High-Level Message Passing Tools?
• Data formatting:
  • Data are formatted into appropriate types at the user level
• Non-blocking communication:
  • Polling and interrupts are handled at the system-call level
• Process addressing:
  • Low-level addressing is hardwired and inflexible: machine id + local id
• Group communication:
  • A group server is implemented at the user level
  • Broadcasting is simulated by a repetition of one-to-one communication
PVM and MPI
• PVM: Parallel Virtual Machine
  • Developed in the 1980s
  • The pioneering library providing high-level message-passing functions
  • A PVM daemon process takes care of message transfers for user processes in the background
• MPI: Message Passing Interface
  • Defined in the 1990s
  • A specification of high-level message-passing functions
  • Several implementations available: MPICH, LAM/MPI
  • Library functions are directly linked into user programs (no background daemons)
• The detailed differences are described in: PVMvsMPI.ps
Getting Started with MPI
• Website: http://www-unix.mcs.anl.gov/mpi/mpich/
• Create a hostfile:
  [mfukuda@UW1-320-00 mfukuda]$ vi hosts
  uw1-320-00
  uw1-320-01
  uw1-320-02
  uw1-320-03
• Compile a source program:
  [mfukuda@UW1-320-00 mfukuda]$ mpiCC source.cpp -o myProg
• Run the executable file:
  [mfukuda@UW1-320-00 mfukuda]$ mpirun -np 4 myProg args
Program Using MPI

  #include <iostream>
  #include "mpi++.h"
  using namespace std;

  int main(int argc, char *argv[]) {
      MPI::Init(argc, argv);                  // start MPI computation
      int rank = MPI::COMM_WORLD.Get_rank();  // process id (from 0 to #processes - 1)
      int size = MPI::COMM_WORLD.Get_size();  // # participating processes
      cout << "Hello World! I am " << rank << " of " << size << endl;
      MPI::Finalize();                        // finish MPI computation
      return 0;
  }
MPI_Send and MPI_Recv

  void MPI::COMM_WORLD.Send(
      void* message           /* in */,
      int count               /* in */,
      MPI::Datatype datatype  /* in */,
      int dest                /* in */,
      int tag                 /* in */)

  void MPI::COMM_WORLD.Recv(
      void* message           /* out */,
      int count               /* in */,
      MPI::Datatype datatype  /* in */,
      int source              /* in */,   /* or MPI::ANY_SOURCE */
      int tag                 /* in */,
      MPI::Status& status     /* out */)  /* can be omitted */

MPI::Datatype = CHAR, SHORT, INT, LONG,
  UNSIGNED_CHAR, UNSIGNED_SHORT, UNSIGNED, UNSIGNED_LONG,
  FLOAT, DOUBLE, LONG_DOUBLE, BYTE, PACKED

Status fields are accessed as status.Get_source(), status.Get_tag(), and status.Get_error().
MPI_Send and MPI_Recv

  #include <iostream>
  #include "mpi++.h"
  using namespace std;

  int main(int argc, char *argv[]) {
      int tag0 = 0;
      MPI::Init(argc, argv);                          // start MPI computation
      if (MPI::COMM_WORLD.Get_rank() == 0) {          // rank 0...sender
          int loop = 3;
          MPI::COMM_WORLD.Send("Hello World!", 13, MPI::CHAR, 1, tag0);  // 13 = 12 chars + '\0'
          MPI::COMM_WORLD.Send(&loop, 1, MPI::INT, 1, tag0);
      } else {                                        // rank 1...receiver
          int loop;
          char msg[13];
          MPI::COMM_WORLD.Recv(msg, 13, MPI::CHAR, 0, tag0);
          MPI::COMM_WORLD.Recv(&loop, 1, MPI::INT, 0, tag0);
          for (int i = 0; i < loop; i++)
              cout << msg << endl;
      }
      MPI::Finalize();                                // finish MPI computation
      return 0;
  }
Message Ordering in MPI
• FIFO ordering within each data type: messages between a source and a destination are received in the order sent
• Messages can be retrieved out of arrival order at the receiver by matching on their tags (example below)
(Figure: left, in-order delivery from source to destination; right, messages picked out of order by tags 1, 2, and 3.)
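A small demonstration of tag-based retrieval, written against the same MPI C++ bindings used elsewhere in these slides (the program itself and its variable names are mine, and it assumes two processes, e.g., mpirun -np 2): rank 0 sends tags 1 and 2 in that order, and rank 1 receives tag 2 first.

  #include <iostream>
  #include "mpi++.h"
  using namespace std;

  int main(int argc, char *argv[]) {
      MPI::Init(argc, argv);
      if (MPI::COMM_WORLD.Get_rank() == 0) {             // sender
          int a = 100, b = 200;
          MPI::COMM_WORLD.Send(&a, 1, MPI::INT, 1, 1);   // tag 1 sent first
          MPI::COMM_WORLD.Send(&b, 1, MPI::INT, 1, 2);   // tag 2 sent second
      } else if (MPI::COMM_WORLD.Get_rank() == 1) {      // receiver
          int a, b;
          MPI::COMM_WORLD.Recv(&b, 1, MPI::INT, 0, 2);   // ask for tag 2 first
          MPI::COMM_WORLD.Recv(&a, 1, MPI::INT, 0, 1);   // then tag 1
          cout << "b = " << b << ", a = " << a << endl;  // retrieved out of send order
      }
      MPI::Finalize();
      return 0;
  }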
MPI_Bcast

  void MPI::COMM_WORLD.Bcast(
      void* message           /* in (at root), out (at the others) */,
      int count               /* in */,
      MPI::Datatype datatype  /* in */,
      int root                /* in */)

Example: rank 2 broadcasts msg to ranks 0, 1, 3, and 4; every rank makes the same call (a complete program follows below):
  MPI::COMM_WORLD.Bcast(&msg, 1, MPI::INT, 2);
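A minimal complete broadcast program in the same style (variable names are mine, assuming at least three processes so that rank 2 exists): every rank calls Bcast with identical arguments, and only the root's initial value survives.

  #include <iostream>
  #include "mpi++.h"
  using namespace std;

  int main(int argc, char *argv[]) {
      MPI::Init(argc, argv);
      int rank = MPI::COMM_WORLD.Get_rank();
      int msg = (rank == 2) ? 777 : 0;                    // only the root's value matters
      MPI::COMM_WORLD.Bcast(&msg, 1, MPI::INT, 2);        // same call at every rank
      cout << "rank " << rank << " has " << msg << endl;  // all print 777
      MPI::Finalize();
      return 0;
  }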
MPI_Reduce

  void MPI::COMM_WORLD.Reduce(
      void* operand           /* in */,
      void* result            /* out (at root) */,
      int count               /* in */,
      MPI::Datatype datatype  /* in */,
      MPI::Op op              /* in */,
      int root                /* in */)

MPI::Op = MPI::MAX (maximum), MPI::MIN (minimum), MPI::SUM (sum),
  MPI::PROD (product), MPI::LAND (logical and), MPI::BAND (bitwise and),
  MPI::LOR (logical or), MPI::BOR (bitwise or), MPI::LXOR (logical xor),
  MPI::BXOR (bitwise xor), MPI::MAXLOC (max location), MPI::MINLOC (min location)

Example: ranks 0-4 hold 15, 10, 12, 8, and 4; rank 2 receives the sum 49 (a complete program follows below):
  MPI::COMM_WORLD.Reduce(&msg, &result, 1, MPI::INT, MPI::SUM, 2);
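A minimal complete reduction in the same style (variable names are mine): every rank contributes one integer and rank 2 collects the sum; the result buffer is meaningful only at the root.

  #include <iostream>
  #include "mpi++.h"
  using namespace std;

  int main(int argc, char *argv[]) {
      MPI::Init(argc, argv);
      int rank = MPI::COMM_WORLD.Get_rank();
      int operand = rank * 10;                 // each rank's contribution
      int result = 0;                          // meaningful only at the root
      MPI::COMM_WORLD.Reduce(&operand, &result, 1, MPI::INT, MPI::SUM, 2);
      if (rank == 2)                           // with -np 5: 0+10+20+30+40 = 100
          cout << "sum = " << result << endl;
      MPI::Finalize();
      return 0;
  }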
MPI_Allreduce

  void MPI::COMM_WORLD.Allreduce(
      void* operand           /* in */,
      void* result            /* out (at every rank) */,
      int count               /* in */,
      MPI::Datatype datatype  /* in */,
      MPI::Op op              /* in */)

(Figure: a butterfly exchange among ranks 0-7; after log2(8) = 3 rounds, every rank holds the reduced result.)
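Semantically, Allreduce behaves like a Reduce followed by a Bcast, but the butterfly exchange gives every rank the result in about log2(n) communication rounds. A minimal usage sketch in the same style (variable names are mine):

  #include <iostream>
  #include "mpi++.h"
  using namespace std;

  int main(int argc, char *argv[]) {
      MPI::Init(argc, argv);
      int operand = MPI::COMM_WORLD.Get_rank();   // each rank contributes its id
      int result;
      // No root argument: every rank passes the same arguments and gets the sum.
      MPI::COMM_WORLD.Allreduce(&operand, &result, 1, MPI::INT, MPI::SUM);
      cout << "rank " << operand << " sees sum " << result << endl;
      MPI::Finalize();
      return 0;
  }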
Exercises (No turn-in)
• Consider an application requiring both one-to-many and many-to-one communication.
• Consider an application requiring atomic multicast.
• Assume that four processes communicate with one another in causal ordering. Their current vectors are shown below. If process A sends a message, which processes can receive it immediately?
• Consider the pros and cons of PVM's daemon-based and MPI's library-linking-based message passing.
• Why can MPI maintain FIFO ordering?