Friday, October 13, 2006
"The biggest difference between time and space is that you can't reuse time." (M. Furst)
machinefile is a text file (also called a boot schema file) containing the following:

    hpcc.lums.edu.pk
    compute-0-0.local
    compute-0-1.local
    compute-0-2.local
    compute-0-3.local
    compute-0-4.local
    compute-0-5.local
    compute-0-6.local

• lamboot -v machinefile
  • Launches the LAM runtime environment on the listed nodes
• mpirun -np 4 hello
  • Launches 4 copies of hello
• Scheduling of copies is implementation dependent.
• LAM schedules copies round-robin across the nodes, taking into account the number of CPUs listed per node.
Every copy launched by mpirun -np 4 hello runs the same program, here on hpcc.lums.edu.pk, compute-0-0.local, compute-0-1.local and compute-0-2.local:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */
        printf("Hello, world! I am %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size, namelen;
        char name[MPI_MAX_PROCESSOR_NAME];   /* large enough for any processor name */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(name, &namelen);   /* name of the node running this copy */
        printf("Rank:%d Name:%s\n", rank, name);
        MPI_Finalize();
        return 0;
    }
mpirun -np 4 pname

    Rank:0 Name:hpcc.lums.edu.pk
    Rank:2 Name:compute-0-1.local
    Rank:1 Name:compute-0-0.local
    Rank:3 Name:compute-0-2.local

Processes on remote nodes have their stdout redirected to that of mpirun.
mpirun -np 8 pname

    Rank:0 Name:hpcc.lums.edu.pk
    Rank:2 Name:compute-0-1.local
    Rank:1 Name:compute-0-0.local
    Rank:3 Name:compute-0-2.local
    Rank:4 Name:compute-0-3.local
    Rank:5 Name:compute-0-4.local
    Rank:6 Name:compute-0-5.local
    Rank:7 Name:compute-0-6.local

mpirun -np 16 pname

    Rank:0 Name:hpcc.lums.edu.pk
    Rank:8 Name:hpcc.lums.edu.pk
    Rank:1 Name:compute-0-0.local
    Rank:3 Name:compute-0-2.local
    Rank:11 Name:compute-0-2.local
    Rank:7 Name:compute-0-6.local
    Rank:4 Name:compute-0-3.local
    Rank:2 Name:compute-0-1.local
    Rank:5 Name:compute-0-4.local
    Rank:6 Name:compute-0-5.local
    Rank:9 Name:compute-0-0.local
    Rank:15 Name:compute-0-6.local
    Rank:12 Name:compute-0-3.local
    Rank:10 Name:compute-0-1.local
    Rank:13 Name:compute-0-4.local
    Rank:14 Name:compute-0-5.local
Suppose the boot schema file contains:

    hpcc.lums.edu.pk cpu=2
    compute-0-0.local cpu=2
    compute-0-1.local cpu=2
    compute-0-2.local cpu=2
    compute-0-3.local cpu=2
    compute-0-4.local cpu=2
    compute-0-5.local cpu=2
    compute-0-6.local cpu=2

mpirun -np 8 pname

    Rank:0 Name:hpcc.lums.edu.pk
    Rank:1 Name:hpcc.lums.edu.pk
    Rank:4 Name:compute-0-1.local
    Rank:2 Name:compute-0-0.local
    Rank:6 Name:compute-0-2.local
    Rank:3 Name:compute-0-0.local
    Rank:7 Name:compute-0-2.local
    Rank:5 Name:compute-0-1.local

mpirun -np 16 pname

    Rank:0 Name:hpcc.lums.edu.pk
    Rank:1 Name:hpcc.lums.edu.pk
    Rank:8 Name:compute-0-3.local
    Rank:2 Name:compute-0-0.local
    Rank:6 Name:compute-0-2.local
    Rank:10 Name:compute-0-4.local
    Rank:14 Name:compute-0-6.local
    Rank:4 Name:compute-0-1.local
    Rank:12 Name:compute-0-5.local
    Rank:3 Name:compute-0-0.local
    Rank:7 Name:compute-0-2.local
    Rank:9 Name:compute-0-3.local
    Rank:13 Name:compute-0-5.local
    Rank:11 Name:compute-0-4.local
    Rank:15 Name:compute-0-6.local
    Rank:5 Name:compute-0-1.local
• mpirun C hello
  • Launches one copy of hello on every CPU listed in the boot schema
• mpirun N hello
  • Launches one copy of hello on every node in the LAM universe (disregards CPU count)
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        /* Same program on every process; behaviour branches on the rank */
        if (rank % 2 == 0) {
            printf("Rank:%d, I am EVEN\n", rank);
        } else {
            printf("Rank:%d, I am ODD\n", rank);
        }
        MPI_Finalize();
        return 0;
    }

mpirun -np 8 rpdt

    Rank:0, I am EVEN
    Rank:2, I am EVEN
    Rank:1, I am ODD
    Rank:5, I am ODD
    Rank:3, I am ODD
    Rank:7, I am ODD
    Rank:6, I am EVEN
    Rank:4, I am EVEN
Point to point communication

    int MPI_Recv(void* buf, int count, MPI_Datatype datatype,
                 int source, int tag, MPI_Comm comm, MPI_Status *status)

    int MPI_Send(void* buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm)
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size, source = 0, dest = 1, tag = 12;
        float sent = 23.65, recv;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (rank == 0) {
            /* Rank 0 sends one float to rank 1 */
            MPI_Send(&sent, 1, MPI_FLOAT, dest, tag, MPI_COMM_WORLD);
            printf("I am %d of %d Sent %f\n", rank, size, sent);
        } else {
            /* Every other rank waits for one float from rank 0 */
            MPI_Recv(&recv, 1, MPI_FLOAT, source, tag, MPI_COMM_WORLD, &status);
            printf("I am %d of %d Received %f\n", rank, size, recv);
        }
        MPI_Finalize();
        return 0;
    }
mf2 is a text file containing the following:

    hpcc.lums.edu.pk
    compute-0-0.local

lamboot -v mf2

    LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University
    n-1<10818> ssi:boot:base:linear: booting n0 (hpcc.lums.edu.pk)
    n-1<10818> ssi:boot:base:linear: booting n1 (compute-0-0.local)
    n-1<10818> ssi:boot:base:linear: finished

mpirun -np 2 sendrecv

    I am 0 of 2 Sent 23.650000
    I am 1 of 2 Received 23.650000

• What will happen if I use np > 2?
• What will happen if I use np = 1?
• MPI_Recv is a blocking receive operation.
• MPI allows two different implementations of MPI_Send: buffered and unbuffered.
• MPI programs must run correctly regardless of which of the two methods is used to implement MPI_Send.
• Such programs are called safe.
Avoiding Deadlocks

    int a[10], b[10], myrank;
    MPI_Status status;
    ...
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    if (myrank == 0) {
        MPI_Send(a, 10, MPI_INT, 1, 1, MPI_COMM_WORLD);   /* tag 1 first */
        MPI_Send(b, 10, MPI_INT, 1, 2, MPI_COMM_WORLD);   /* then tag 2 */
    }
    else if (myrank == 1) {
        MPI_Recv(b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, &status);   /* tag 2 first */
        MPI_Recv(a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);   /* then tag 1 */
    }
    ...

Note: count is the number of entries of type datatype.

If MPI_Send is blocking and nonbuffered, there is a deadlock: rank 0 waits for its tag-1 message to be received, while rank 1 waits for the tag-2 message that rank 0 has not yet sent.
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size, source = 0, dest = 1;
        float sent[5] = {10, 20, 30, 40, 50};
        float recv;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
            /* Sends use tags 12, 13, 14, in that order */
            MPI_Send(&sent[0], 1, MPI_FLOAT, dest, 12, MPI_COMM_WORLD);
            printf("Rank:%d Sent %f\n", rank, sent[0]);
            MPI_Send(&sent[1], 1, MPI_FLOAT, dest, 13, MPI_COMM_WORLD);
            printf("Rank:%d Sent %f\n", rank, sent[1]);
            MPI_Send(&sent[2], 1, MPI_FLOAT, dest, 14, MPI_COMM_WORLD);
            printf("Rank:%d Sent %f\n", rank, sent[2]);
        } else {
            /* Receives ask for tags 12, 13, 14, in the same order */
            MPI_Recv(&recv, 1, MPI_FLOAT, source, 12, MPI_COMM_WORLD, &status);
            printf("Rank:%d Received %f\n", rank, recv);
            MPI_Recv(&recv, 1, MPI_FLOAT, source, 13, MPI_COMM_WORLD, &status);
            printf("Rank:%d Received %f\n", rank, recv);
            MPI_Recv(&recv, 1, MPI_FLOAT, source, 14, MPI_COMM_WORLD, &status);
            printf("Rank:%d Received %f\n", rank, recv);
        }
        MPI_Finalize();
        return 0;
    }

Output:

    Rank:0 Sent 10.000000
    Rank:0 Sent 20.000000
    Rank:0 Sent 30.000000
    Rank:1 Received 10.000000
    Rank:1 Received 20.000000
    Rank:1 Received 30.000000
NOTE: Unsafe. The sends use tags 14, 13, 12 but the receives ask for 12, 13, 14, so this completes only if the system buffers the sends.

    if (rank == 0) {
        MPI_Send(&sent[0], 1, MPI_FLOAT, dest, 14, MPI_COMM_WORLD);
        printf("Rank:%d Sent %f\n", rank, sent[0]);
        MPI_Send(&sent[1], 1, MPI_FLOAT, dest, 13, MPI_COMM_WORLD);
        printf("Rank:%d Sent %f\n", rank, sent[1]);
        MPI_Send(&sent[2], 1, MPI_FLOAT, dest, 12, MPI_COMM_WORLD);
        printf("Rank:%d Sent %f\n", rank, sent[2]);
    } else {
        MPI_Recv(&recv, 1, MPI_FLOAT, source, 12, MPI_COMM_WORLD, &status);
        printf("Rank:%d Received %f\n", rank, recv);
        MPI_Recv(&recv, 1, MPI_FLOAT, source, 13, MPI_COMM_WORLD, &status);
        printf("Rank:%d Received %f\n", rank, recv);
        MPI_Recv(&recv, 1, MPI_FLOAT, source, 14, MPI_COMM_WORLD, &status);
        printf("Rank:%d Received %f\n", rank, recv);
    }

Output (with system buffering):

    Rank:0 Sent 10.000000
    Rank:0 Sent 20.000000
    Rank:0 Sent 30.000000
    Rank:1 Received 30.000000
    Rank:1 Received 20.000000
    Rank:1 Received 10.000000
NOTE: Unsafe. The sends use tags 14, 14, 13, 12 but the receives ask for 12, 13, 14, 14, so this completes only if the system buffers the sends.

    if (rank == 0) {
        MPI_Send(&sent[0], 1, MPI_FLOAT, dest, 14, MPI_COMM_WORLD);
        printf("Rank:%d Sent %f\n", rank, sent[0]);
        MPI_Send(&sent[1], 1, MPI_FLOAT, dest, 14, MPI_COMM_WORLD);
        printf("Rank:%d Sent %f\n", rank, sent[1]);
        MPI_Send(&sent[2], 1, MPI_FLOAT, dest, 13, MPI_COMM_WORLD);
        printf("Rank:%d Sent %f\n", rank, sent[2]);
        MPI_Send(&sent[3], 1, MPI_FLOAT, dest, 12, MPI_COMM_WORLD);
        printf("Rank:%d Sent %f\n", rank, sent[3]);
    } else {
        MPI_Recv(&recv, 1, MPI_FLOAT, source, 12, MPI_COMM_WORLD, &status);
        printf("Rank:%d Received %f\n", rank, recv);
        MPI_Recv(&recv, 1, MPI_FLOAT, source, 13, MPI_COMM_WORLD, &status);
        printf("Rank:%d Received %f\n", rank, recv);
        MPI_Recv(&recv, 1, MPI_FLOAT, source, 14, MPI_COMM_WORLD, &status);
        printf("Rank:%d Received %f\n", rank, recv);
        MPI_Recv(&recv, 1, MPI_FLOAT, source, 14, MPI_COMM_WORLD, &status);
        printf("Rank:%d Received %f\n", rank, recv);
    }

Output (with system buffering). The two tag-14 messages arrive in the order they were sent:

    Rank:0 Sent 10.000000
    Rank:0 Sent 20.000000
    Rank:0 Sent 30.000000
    Rank:0 Sent 40.000000
    Rank:1 Received 40.000000
    Rank:1 Received 30.000000
    Rank:1 Received 10.000000
    Rank:1 Received 20.000000
Sending and Receiving Messages
• MPI allows specification of wildcard arguments for both source and tag.
• If source is set to MPI_ANY_SOURCE, then any process of the communication domain can be the source of the message.
• If tag is set to MPI_ANY_TAG, then messages with any tag are accepted.
• On the receive side, the message must be of length equal to or less than the length field specified.
Example • Numerical Integration
Numerical Integration (Serial)

    #include <stdio.h>

    float f(float x);   /* Function we're integrating (defined elsewhere) */

    int main(void)
    {
        float integral, a, b, h, x;
        int n, i;

        printf("Enter a, b, and n\n");
        scanf("%f %f %d", &a, &b, &n);

        h = (b-a)/n;                        /* width of each trapezoid */
        integral = (f(a) + f(b))/2.0;       /* endpoints count half */
        x = a;
        for (i = 1; i <= n-1; i++) {        /* interior points count fully */
            x = x + h;
            integral = integral + f(x);
        }
        integral = integral*h;

        printf("With n = %d trapezoids, our estimate\n", n);
        printf("of the integral from %f to %f = %f\n", a, b, integral);
        return 0;
    } /* main */
Numerical Integration (Parallel)

    #include <stdio.h>
    #include <mpi.h>

    float f(float x);   /* Function we're integrating (defined elsewhere) */
    float Trap(float local_a, float local_b, int local_n, float h);

    int main(int argc, char** argv)
    {
        int   my_rank;     /* My process rank */
        int   p;           /* The number of processes */
        float a = 0.0;     /* Left endpoint */
        float b = 1.0;     /* Right endpoint */
        int   n = 1024;    /* Number of trapezoids */
        float h;           /* Trapezoid base length */
        float local_a;     /* Left endpoint my process */
        float local_b;     /* Right endpoint my process */
        int   local_n;     /* Number of trapezoids for my calculation */
        float integral;    /* Integral over my interval */
        float total;       /* Total integral */
        int   source;      /* Process sending integral */
        int   dest = 0;    /* All messages go to 0 */
        int   tag = 0;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        h = (b-a)/n;       /* h is the same for all processes */
        local_n = n/p;     /* So is the number of trapezoids (p must divide n) */
        local_a = a + my_rank*local_n*h;
        local_b = local_a + local_n*h;
        integral = Trap(local_a, local_b, local_n, h);

        /* Add up the integrals calculated by each process */
        if (my_rank == 0) {
            total = integral;
            for (source = 1; source < p; source++) {
                MPI_Recv(&integral, 1, MPI_FLOAT, source, tag,
                         MPI_COMM_WORLD, &status);
                total = total + integral;
            }
        } else {
            MPI_Send(&integral, 1, MPI_FLOAT, dest, tag, MPI_COMM_WORLD);
        }

        /* Print the result */
        if (my_rank == 0) {
            printf("With n = %d trapezoids, our estimate\n", n);
            printf("of the integral from %f to %f = %f\n", a, b, total);
        }

        /* Shut down MPI */
        MPI_Finalize();
        return 0;
    } /* main */

    float Trap(float local_a, float local_b, int local_n, float h)
    {
        float integral;   /* Store result in integral */
        float x;
        int i;

        integral = (f(local_a) + f(local_b))/2.0;
        x = local_a;
        for (i = 1; i <= local_n-1; i++) {
            x = x + h;
            integral = integral + f(x);
        }
        integral = integral*h;
        return integral;
    }
Avoiding Deadlocks

Consider the following piece of code, in which process i sends a message to process i + 1 (modulo the number of processes) and receives a message from process i - 1 (modulo the number of processes).

    int a[10], b[10], npes, myrank;
    MPI_Status status;
    ...
    MPI_Comm_size(MPI_COMM_WORLD, &npes);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Send(a, 10, MPI_INT, (myrank+1)%npes, 1, MPI_COMM_WORLD);
    MPI_Recv(b, 10, MPI_INT, (myrank-1+npes)%npes, 1, MPI_COMM_WORLD, &status);
    ...

Once again, we have a deadlock if MPI_Send is blocking and nonbuffered: every process waits in MPI_Send, so no process ever reaches its MPI_Recv.
Avoiding Deadlocks

We can break the circular wait to avoid deadlocks as follows:

    int a[10], b[10], npes, myrank;
    MPI_Status status;
    ...
    MPI_Comm_size(MPI_COMM_WORLD, &npes);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    if (myrank%2 == 1) {
        /* Odd ranks send first, then receive */
        MPI_Send(a, 10, MPI_INT, (myrank+1)%npes, 1, MPI_COMM_WORLD);
        MPI_Recv(b, 10, MPI_INT, (myrank-1+npes)%npes, 1, MPI_COMM_WORLD, &status);
    }
    else {
        /* Even ranks receive first, then send */
        MPI_Recv(b, 10, MPI_INT, (myrank-1+npes)%npes, 1, MPI_COMM_WORLD, &status);
        MPI_Send(a, 10, MPI_INT, (myrank+1)%npes, 1, MPI_COMM_WORLD);
    }
    ...
Sending and Receiving Messages Simultaneously

To exchange messages, MPI provides the following function:

    int MPI_Sendrecv(void *sendbuf, int sendcount, MPI_Datatype senddatatype,
                     int dest, int sendtag,
                     void *recvbuf, int recvcount, MPI_Datatype recvdatatype,
                     int source, int recvtag,
                     MPI_Comm comm, MPI_Status *status)
All-to-All broadcast in Hypercube

[Figure: four snapshots of an 8-node hypercube with nodes 0 to 7.]
• Initially, each node i holds only its own message i.
• After step 1, each node holds two messages: 0,1 or 2,3 or 4,5 or 6,7.
• After step 2, each node holds four messages: 0,1,2,3 or 4,5,6,7.
• After step 3, every node holds all eight messages: 0,1,2,3,4,5,6,7.
Possibility of deadlock if implemented as shown and system buffering not provided.
    #include <stdio.h>
    #include <stdlib.h>    /* malloc, exit */
    #include <string.h>
    #include <mpi.h>

    #define MAXMSG 100

    int main(int argc, char *argv[])
    {
        int i, rank, size, partner, tag = 11;
        int d = 3;         /* hypercube dimension: run with -np 8 */
        char *result, *received;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Both buffers hold up to MAXMSG chars: received messages grow each step */
        received = (char *) malloc(MAXMSG);
        result = (char *) malloc(MAXMSG);

        /* One command line argument (message) per process */
        if (argc != (size+1)) {
            fprintf(stderr, "Command line arguments missing\n");
            MPI_Finalize();
            exit(1);
        }

        strcpy(result, argv[rank+1]);
        for (i = 0; i < d; i++) {
            partner = rank ^ (1<<i);   /* neighbour along dimension i */
            MPI_Sendrecv(result, strlen(result)+1, MPI_CHAR, partner, tag,
                         received, MAXMSG, MPI_CHAR, partner, tag,
                         MPI_COMM_WORLD, &status);
            printf("I am node %d: Sent %s\t Received %s\n", rank, result, received);
            strcat(result, received);
        }
        printf("I am node %d: My final result is %s\n", rank, result);
        MPI_Finalize();
        return 0;
    }
mpirun -np 8 hbroadcast "one " "two " "three " "four " "five " "six " "seven " "eight "

    I am node 0: Sent one Received two
    I am node 4: Sent five Received six
    I am node 5: Sent six Received five
    I am node 1: Sent two Received one
    I am node 3: Sent four Received three
    I am node 2: Sent three Received four
    I am node 0: Sent one two Received three four
    I am node 3: Sent four three Received two one
    I am node 7: Sent eight Received seven
    I am node 1: Sent two one Received four three
    I am node 2: Sent three four Received one two
    I am node 7: Sent eight seven Received six five
    I am node 6: Sent seven Received eight
    I am node 0: Sent one two three four Received five six seven eight
    I am node 0: My final result is one two three four five six seven eight
    I am node 5: Sent six five Received eight seven
    I am node 6: Sent seven eight Received five six
    I am node 7: Sent eight seven six five Received four three two one
    I am node 3: Sent four three two one Received eight seven six five
    I am node 5: Sent six five eight seven Received two one four three
    I am node 4: Sent five six Received seven eight
    I am node 3: My final result is four three two one eight seven six five
    I am node 1: Sent two one four three Received six five eight seven
    I am node 1: My final result is two one four three six five eight seven
    I am node 5: My final result is six five eight seven two one four three
    I am node 6: Sent seven eight five six Received three four one two
    I am node 4: Sent five six seven eight Received one two three four
    I am node 4: My final result is five six seven eight one two three four
    I am node 2: Sent three four one two Received seven eight five six
    I am node 7: My final result is eight seven six five four three two one
    I am node 2: My final result is three four one two seven eight five six
    I am node 6: My final result is seven eight five six three four one two