Parallel Programming in MPI (January 8, 2013)
Today's lab environment
• Host name: sandy.cc.kyushu-u.ac.jp
• User IDs and passwords will be announced later.
Example of a process-parallel program from the previous lecture
• Not only the computation but also the data is divided among the processes.
[Figure: processes 0 to 3 each hold their own 25-element arrays A, B and C (indices 0 to 24) and each computes A[i] = B[i] + C[i] on its own portion.]
    double A[25], B[25], C[25];
    ...
    for (i = 0; i < 25; i++)
        A[i] = B[i] + C[i];
(The same code runs on each of processes 0 to 3.)
• What if process 0 needs to read A[10] of process 1? ⇒ Inter-process communication.
How to describe communications in a program?
• TCP, UDP?
  • Good: Portable, available on many networks.
  • Bad: The protocols for connection and data transfer are complicated.
  • Bad: High overhead, since they are designed for wide-area (= unreliable) networks.
• Possible, but not suitable for parallel processing.
MPI (Message Passing Interface)
• A set of communication functions designed for parallel processing.
• Can be called from C/C++/Fortran programs.
• "Message Passing" = Send + Receive.
• Actually, many functions other than Send and Receive are available.
• Let's look at a sample program first.
#include <stdio.h> #include "mpi.h" int main(intargc, char *argv[]) { intmyid, procs, ierr, i; double myval, val; MPI_Status status; FILE *fp; char s[64]; MPI_Init(&argc, &argv); MPI_Comm_rank(MPI_COMM_WORLD, &myid); MPI_Comm_size(MPI_COMM_WORLD, &procs); if (myid == 0) { fp = fopen("test.dat", "r"); fscanf(fp, "%lf", &myval); for (i = 1; i < procs; i++){ fscanf(fp, "%lf", &val); MPI_Send(&val, 1, MPI_DOUBLE, i, 0, MPI_COMM_WORLD); } fclose(fp); } else MPI_Recv(&myval, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status); printf("PROCS: %d, MYID: %d, MYVAL: %e\n", procs, myid, myval); MPI_Finalize(); return 0; } Setup MPI environment Get own process ID (= rank) Get total number of processes If my ID is 0 input data for this process and keep it in myval i = 1~procs-1 input data and keep it in val useMPI_Send to send value in valto process i processes with ID other than 0use MPI_Recv to receive data from process 0 and keep it in myval print-out its own myval end of parallel computing 6 6
Flow of the sample program
[Figure: rank 0 reads its own value (myval) from the file, then repeatedly reads a value (val) and sends it to rank 1, rank 2, and so on; each of the other ranks waits for the data from rank 0 and receives it into myval; finally, every rank prints its own myval.]
• Multiple "processes" execute the program according to their number (= rank).
Sample of the result of execution
• The order of the output can differ from run to run, since each process proceeds independently.
    PROCS: 4 MYID: 1 MYVAL: 20.0000000000000000   (rank 1)
    PROCS: 4 MYID: 2 MYVAL: 30.0000000000000000   (rank 2)
    PROCS: 4 MYID: 0 MYVAL: 10.0000000000000000   (rank 0)
    PROCS: 4 MYID: 3 MYVAL: 40.0000000000000000   (rank 3)
Characteristics of the MPI interface
• MPI programs are ordinary C-language programs, not programs in a new language.
• Every process executes the same program.
• Each process does its own work according to its rank (= process number).
• A process cannot read or write variables of another process directly.
[Figure: rank 0 reads the file and sends values to ranks 1 and 2, which receive them into myval; each rank then prints its own myval.]
TCP, UDP vs. MPI
• MPI: a simple communication interface dedicated to parallel computing.
  • SPMD (Single Program, Multiple Data-stream) model: all processes execute the same program.
• TCP, UDP: generic communication interfaces intended for various uses, such as internet servers.
  • Server/client model: each process executes its own program.
MPI vs. TCP
MPI: the sample program shown above (MPI_Init, MPI_Comm_rank and MPI_Comm_size to initialize, then one MPI_Send / MPI_Recv pair per value).
TCP Client:
    /* initialize */
    sock = socket(PF_INET, SOCK_STREAM, IPPROTO_TCP);
    memset(&echoServAddr, 0, sizeof(echoServAddr));
    echoServAddr.sin_family = AF_INET;
    echoServAddr.sin_addr.s_addr = inet_addr(servIP);
    echoServAddr.sin_port = htons(echoServPort);
    connect(sock, (struct sockaddr *) &echoServAddr, sizeof(echoServAddr));
    echoStringLen = strlen(echoString);
    send(sock, echoString, echoStringLen, 0);
    totalBytesRcvd = 0;
    printf("Received: ");
    while (totalBytesRcvd < echoStringLen) {
        bytesRcvd = recv(sock, echoBuffer, RCVBUFSIZE - 1, 0);
        totalBytesRcvd += bytesRcvd;
        echoBuffer[bytesRcvd] = '\0';
        printf(echoBuffer);
    }
    printf("\n");
    close(sock);
TCP Server:
    /* initialize */
    servSock = socket(PF_INET, SOCK_STREAM, IPPROTO_TCP);
    memset(&echoServAddr, 0, sizeof(echoServAddr));
    echoServAddr.sin_family = AF_INET;
    echoServAddr.sin_addr.s_addr = htonl(INADDR_ANY);
    echoServAddr.sin_port = htons(echoServPort);
    bind(servSock, (struct sockaddr *) &echoServAddr, sizeof(echoServAddr));
    listen(servSock, MAXPENDING);
    for (;;) {
        clntLen = sizeof(echoClntAddr);
        clntSock = accept(servSock, (struct sockaddr *) &echoClntAddr, &clntLen);
        recvMsgSize = recv(clntSock, echoBuffer, RCVBUFSIZE, 0);
        while (recvMsgSize > 0) {
            send(clntSock, echoBuffer, recvMsgSize, 0);
            recvMsgSize = recv(clntSock, echoBuffer, RCVBUFSIZE, 0);
        }
        close(clntSock);
    }
Layer of MPI
• MPI hides the differences between networks.
[Figure: applications run on top of MPI, Sockets or XTI; Sockets and XTI use TCP or UDP over IP over an Ethernet driver and card, while MPI can also use a high-speed interconnect (InfiniBand, etc.) directly.]
How to compile MPI programs
• Compile command: mpicc
  Example)  mpicc -O3 test.c -o test
  • -O3: optimization option (capital letter O, not zero)
  • test.c: source file to compile
  • -o test: executable file to create
How to execute MPI programs (on 'sandy')
• Prepare a script file and submit it:  qsub test.sh
• Other commands: qstat (= check status), qdel job_number (= cancel a job)
Sample script:
    #!/bin/sh
    #PBS -l nodes=2:ppn=4,walltime=00:10:00
    #PBS -j oe
    #PBS -q middle
    cd $PBS_O_WORKDIR
    mpirun -np 8 ./test-mpi
  • nodes=2: number of nodes (ex: 2); ppn=4: number of processes per node (ex: 4)
  • walltime=00:10:00: maximum execution time (ex: 10 min.)
  • -q middle: job queue
  • cd $PBS_O_WORKDIR: move to the directory from which this job was submitted
  • mpirun -np 8: run the MPI program with the specified number of processes (ex: 8)
Ex 0) Execution of an MPI program
Log in to sandy and try the following commands.
If you have time to spare, try changing the number of processes or modifying the source program.
    $ cp /tmp/test-mpi.c .
    $ cp /tmp/test.dat .
    $ cp /tmp/test.sh .
    $ cat test-mpi.c
    $ cat test.dat
    $ mpicc test-mpi.c -o test-mpi
    $ qsub test.sh
    (wait for a while)
    $ ls            (check the name of the result file (test.sh.o????))
    $ less test.sh.o????
MPI Library
• The bodies of the MPI functions are stored in the MPI library.
• mpicc automatically links the MPI library to the program.
[Figure: the source program (main() calling MPI_Init, MPI_Comm_rank, MPI_Send, ...) is compiled by mpicc and linked with the MPI library (MPI_Init, MPI_Comm_rank, ...) to produce the executable file.]
Basic structure of MPI programs
• The crucial lines: include the header file "mpi.h", call MPI_Init at start-up and MPI_Finalize at the end; other MPI functions can be called anywhere in between.
    #include <stdio.h>
    #include "mpi.h"                    /* header file "mpi.h" */

    int main(int argc, char *argv[])
    {
        ...
        MPI_Init(&argc, &argv);         /* function for start-up */
        ...
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);
        MPI_Comm_size(MPI_COMM_WORLD, &procs);
        ...
        MPI_Send(&val, 1, MPI_DOUBLE, i, 0, MPI_COMM_WORLD);
        ...                             /* you can call MPI functions in this area */
        MPI_Recv(&myval, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
        ...
        MPI_Finalize();                 /* function for finishing */
        return 0;
    }
Basic functions of MPI
• MPI_Init: initialization
• MPI_Finalize: finalization
• MPI_Comm_size: get the number of processes
• MPI_Comm_rank: get the rank (= process number) of this process
• MPI_Send & MPI_Recv: message passing
MPI_Init
Usage: int MPI_Init(int *argc, char ***argv);
• Starts parallel execution in MPI: starts the processes and establishes the communication paths among them.
• Must be called once before calling any other MPI function.
• Parameters: pointers to the two arguments of the main function. They are referenced when each process starts, so that every process shares the name of the executable file and the options given to the mpirun command.
Example:
    #include <stdio.h>
    #include "mpi.h"

    int main(int argc, char *argv[])
    {
        int myid, procs, ierr;
        double myval, val;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);
        MPI_Comm_size(MPI_COMM_WORLD, &procs);
        ...
MPI_Finalize
Usage: int MPI_Finalize(void);
• Finishes the parallel execution.
• MPI functions cannot be called after this function.
• Every process must call this function before exiting the program.
Example:
    main()
    {
        ...
        MPI_Finalize();
    }
MPI_Comm_rank
Usage: int MPI_Comm_rank(MPI_Comm comm, int *rank);
• Gets the rank (= process number) of the calling process; the result is stored in the second argument.
• The first argument is a "communicator": an identifier for a group of processes.
  • In most cases, just specify MPI_COMM_WORLD here, the group that consists of all the processes in this execution.
  • Processes can also be divided into multiple groups and given different jobs (see the sketch below).
Example:
    ...
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    ...
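One common way to divide the processes into groups is MPI_Comm_split, which is not used elsewhere in this lecture; a minimal sketch, assuming myid has already been obtained with MPI_Comm_rank (the names subcomm and subrank are illustrative):
    /* Sketch: split MPI_COMM_WORLD into two groups, even ranks and odd ranks.
       Each process also gets a new rank inside its own group. */
    MPI_Comm subcomm;
    int subrank;
    MPI_Comm_split(MPI_COMM_WORLD, myid % 2, myid, &subcomm);  /* color = myid % 2, key = myid */
    MPI_Comm_rank(subcomm, &subrank);
    printf("world rank %d has rank %d in its group\n", myid, subrank);
    MPI_Comm_free(&subcomm);
Communications performed on subcomm then involve only the processes in that group.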
MPI_Comm_size
Usage: int MPI_Comm_size(MPI_Comm comm, int *size);
• Gets the number of processes; the result is stored in the second argument.
Example:
    ...
    MPI_Comm_size(MPI_COMM_WORLD, &procs);
    ...
Point-to-point communication (message passing)
• Communication between a "sender" process and a "receiver" process.
• The sending function and the receiving function must be called in a matching manner:
  • the "from" rank and the "to" rank are correct,
  • the size of the data specified on each side is the same,
  • the same tag is specified on both sides.
[Figure: rank 0 calls Send (to: rank 1, size: 10 integers, tag: 100); rank 1 calls Receive (from: rank 0, size: 10 integers, tag: 100) and waits for the message.]
A matched pair in C is sketched below.
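A minimal sketch of such a matched pair, assuming myid has been set with MPI_Comm_rank and at least two processes are running (the buffer name data and the tag value 100 are illustrative):
    int i, data[10];
    if (myid == 0) {
        for (i = 0; i < 10; i++) data[i] = i;
        /* to: rank 1, size: 10 integers, tag: 100 */
        MPI_Send(data, 10, MPI_INT, 1, 100, MPI_COMM_WORLD);
    } else if (myid == 1) {
        MPI_Status status;
        /* from: rank 0, size: 10 integers, tag: 100; all three must match the send */
        MPI_Recv(data, 10, MPI_INT, 0, 100, MPI_COMM_WORLD, &status);
    }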
MPI_Send
Usage: int MPI_Send(void *b, int c, MPI_Datatype d, int dest, int t, MPI_Comm comm);
• Information of the message to send: start address of the data, number of elements, data type, rank of the destination, tag, communicator (= MPI_COMM_WORLD, in most cases).
• data types: MPI_INT, MPI_DOUBLE, MPI_CHAR, and so on.
• tag: an integer number attached to each message. It is used in programs that handle messages arriving from unspecified processes; usually you can just specify 0.
Example:
    ...
    MPI_Send(&val, 1, MPI_DOUBLE, i, 0, MPI_COMM_WORLD);
    ...
Examples of MPI_Send
• Send the value of an integer variable 'd' (one integer):
    MPI_Send(&d, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
• Send the first 100 elements of the array 'mat' (of MPI_DOUBLE type):
    MPI_Send(mat, 100, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
• Send 50 elements of the integer array 'data', starting from data[10]:
    MPI_Send(&(data[10]), 50, MPI_INT, 1, 0, MPI_COMM_WORLD);
MPI_Recv
Usage: int MPI_Recv(void *b, int c, MPI_Datatype d, int src, int t, MPI_Comm comm, MPI_Status *st);
• Information of the message to receive: start address for storing the data, number of elements, data type, rank of the source, tag (= 0, in most cases), communicator (= MPI_COMM_WORLD, in most cases), status.
• status: a variable (of type MPI_Status) that stores information about the arrived message, such as the source rank and the tag. It is not used in most cases; see the sketch below for when it is.
Example:
    ...
    MPI_Recv(&myval, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
    ...
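For the rare case where the status is needed, for example when the sender is not known in advance, a minimal sketch of how it can be inspected (the buffer name buf is illustrative; MPI_ANY_SOURCE, MPI_ANY_TAG and MPI_Get_count are standard MPI features):
    int buf[100], count;
    MPI_Status status;
    /* accept a message from any rank, with any tag */
    MPI_Recv(buf, 100, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
    MPI_Get_count(&status, MPI_INT, &count);   /* actual number of elements received */
    printf("from rank %d, tag %d, %d elements\n", status.MPI_SOURCE, status.MPI_TAG, count);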
Ex 1) A program that displays random numbers in rank order
Write a program in which each process creates one random integer and the numbers are displayed in rank order.
Example of the output:
    0: 1524394631
    1: 999094501
    2: 941763604
    3: 526956378
    4: 152374643
    5: 1138154117
    6: 1926814754
    7: 156004811
Sample program 1: create and display a random number
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>

    int main(int argc, char *argv[])
    {
        int r;
        struct timeval tv;

        gettimeofday(&tv, NULL);
        srand(tv.tv_usec);      /* seed the generator with the current microseconds */
        r = rand();
        printf("%d\n", r);
        return 0;
    }
Sample program 2: display random numbers (with no sort)
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>
    #include "mpi.h"

    int main(int argc, char *argv[])
    {
        int r, myid, procs;
        struct timeval tv;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);
        MPI_Comm_size(MPI_COMM_WORLD, &procs);
        gettimeofday(&tv, NULL);
        srand(tv.tv_usec);      /* each process gets a (most likely) different seed */
        r = rand();
        printf("%d: %d\n", myid, r);
        MPI_Finalize();
        return 0;
    }
Hint:
• The order of the output can be controlled by letting only one process print.
Sample answer
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>
    #include "mpi.h"

    int main(int argc, char *argv[])
    {
        int r, myid, procs, i;
        struct timeval tv;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);
        MPI_Comm_size(MPI_COMM_WORLD, &procs);
        gettimeofday(&tv, NULL);
        srand(tv.tv_usec);
        r = rand();
        if (myid == 0) {
            /* rank 0 prints its own number, then receives and prints the others in rank order */
            printf("%d: %d\n", myid, r);
            for (i = 1; i < procs; i++) {
                MPI_Recv(&r, 1, MPI_INT, i, 0, MPI_COMM_WORLD, &status);
                printf("%d: %d\n", i, r);
            }
        } else {
            MPI_Send(&r, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }
        MPI_Finalize();
        return 0;
    }
Advanced functions of MPI
• Collective communication (= group communication)
• Non-blocking communication: execute other instructions while waiting for the completion of a communication.
Collective communications
• Communications performed by all of the processes in a group.
• Examples:
  • MPI_Bcast: copy data from one process to the other processes.
  • MPI_Gather: gather data from the processes into one array.
  • MPI_Reduce: apply a reduction operation (sum, maximum, etc.) to the distributed data to produce one result.
[Figure: Bcast copies the array {3, 1, 8, 2} from rank 0 to ranks 1 and 2; Gather collects the values 7, 5 and 9 held by ranks 0, 1 and 2 into one array on rank 0; Reduce adds {1, 2, 3}, {4, 5, 6} and {7, 8, 9} element-wise across ranks 0 to 2 to produce {12, 15, 18}.]
A sketch of MPI_Reduce in C follows.
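Since MPI_Reduce is only named above, here is a minimal hedged sketch of a summation with it (the names local and total are illustrative; MPI_SUM is the standard summation operator, and myid is assumed to be set as in the earlier examples):
    /* Sketch: element-wise sum of each rank's local array, stored on rank 0. */
    double local[3] = { 1.0 * myid, 2.0 * myid, 3.0 * myid };
    double total[3];
    /* send buffer, receive buffer, count, type, operation, root rank, communicator */
    MPI_Reduce(local, total, 3, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (myid == 0)
        printf("%f %f %f\n", total[0], total[1], total[2]);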
MPI_Bcast
Usage: int MPI_Bcast(void *b, int c, MPI_Datatype d, int root, MPI_Comm comm);
• Copies data on one process to all of the processes.
• Parameters: start address, number of elements, data type, root rank, communicator.
• root rank: the rank of the process that has the original data.
• Example: MPI_Bcast(a, 3, MPI_DOUBLE, 0, MPI_COMM_WORLD);
[Figure: the three elements of array a on rank 0 are copied to a on ranks 1, 2 and 3.]
MPI_Gather
Usage: int MPI_Gather(void *sb, int sc, MPI_Datatype st, void *rb, int rc, MPI_Datatype rt, int root, MPI_Comm comm);
• Gathers data from the processes to construct one array.
• Parameters:
  • send data: start address, number of elements, data type;
  • receive data: start address, number of elements, data type (meaningful only on the root rank);
  • root rank, communicator.
• root rank: the rank of the process that stores the resulting array.
• Example: MPI_Gather(a, 3, MPI_DOUBLE, b, 3, MPI_DOUBLE, 0, MPI_COMM_WORLD);
[Figure: the three elements of a on each of ranks 0 to 3 are gathered into the array b on rank 0.]
Usage of collective communications
• Every process must call the same function. For example, MPI_Bcast must be called not only by the root rank but also by all of the other ranks.
• For functions that take information for both sending and receiving, the specified address ranges for the send data and the receive data must not overlap: MPI_Gather, MPI_Allgather, MPI_Gatherv, MPI_Allgatherv, MPI_Reduce, MPI_Allreduce, MPI_Alltoall, MPI_Alltoallv, etc.
Non-blocking communication functions
• Non-blocking = do not wait for the completion of an instruction; proceed to the next instruction.
• Example: MPI_Irecv & MPI_Wait.
[Figure: with the blocking MPI_Recv, the process waits for the arrival of the data before executing the next instructions; with MPI_Irecv, the process proceeds to the next instructions without waiting for the data, and calls MPI_Wait later to wait for it.]
A sketch of overlapping computation with communication follows.
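A minimal sketch of this pattern, assuming the data comes from rank 0 and that do_other_work() stands for any computation that does not touch the receive buffer (both names are illustrative):
    double recvbuf[100];
    MPI_Request req;
    MPI_Status status;
    MPI_Irecv(recvbuf, 100, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);
    do_other_work();            /* overlaps with the message transfer; must not read recvbuf */
    MPI_Wait(&req, &status);    /* from here on, recvbuf holds the received data */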
MPI_Irecv
Usage: int MPI_Irecv(void *b, int c, MPI_Datatype d, int src, int t, MPI_Comm comm, MPI_Request *r);
• Non-blocking receive.
• Parameters: start address for storing the received data, number of elements, data type, rank of the source, tag (= 0, in most cases), communicator (= MPI_COMM_WORLD, in most cases), request.
• request: a communication request, used later when waiting for the completion of this communication.
• Example:
    MPI_Request req;
    ...
    MPI_Irecv(a, 100, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
    ...
    MPI_Wait(&req, &status);
MPI_Isend
Usage: int MPI_Isend(void *b, int c, MPI_Datatype d, int dest, int t, MPI_Comm comm, MPI_Request *r);
• Non-blocking send.
• Parameters: start address of the data to send, number of elements, data type, rank of the destination, tag (= 0, in most cases), communicator (= MPI_COMM_WORLD, in most cases), request.
• Example:
    MPI_Request req;
    ...
    MPI_Isend(a, 100, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
    ...
    MPI_Wait(&req, &status);
Non-blocking send?
• Blocking send (MPI_Send): waits until the data to be sent has been copied somewhere else, that is, until the data has been handed to the network or copied to a temporary buffer.
• Non-blocking send (MPI_Isend): does not wait.
Notice: the data is not fixed during a non-blocking communication
• MPI_Irecv: the value of the variable specified for receiving the data is not fixed until MPI_Wait.
[Figure: A holds 10 when MPI_Irecv on A is posted; while the arriving data (50) is still in flight, reading A may yield either 10 or 50; after MPI_Wait, the value of A is 50.]
Notice: the data is not fixed during a non-blocking communication
• MPI_Isend: if the variable holding the data to be sent is modified before MPI_Wait, the value that is actually sent is unpredictable.
[Figure: A holds 10 when MPI_Isend on A is posted; assigning A = 50 before MPI_Wait means that either 10 or 50 may actually be sent (incorrect communication); after MPI_Wait, the value of A can be modified (e.g. A = 100) without any problem.]
MPI_Wait
Usage: int MPI_Wait(MPI_Request *req, MPI_Status *stat);
• Waits for the completion of an MPI_Isend or MPI_Irecv; after it returns, the send data can be modified and the received data can be read.
• Parameters: request, status.
• status: at the completion of MPI_Irecv, the status of the received message is stored here.
MPI_Waitall
Usage: int MPI_Waitall(int c, MPI_Request *requests, MPI_Status *statuses);
• Waits for the completion of the specified number of non-blocking communications.
• Parameters: count, requests, statuses.
  • count: the number of non-blocking communications.
  • requests, statuses: arrays of MPI_Request and MPI_Status with at least 'count' elements.
A sketch with multiple outstanding receives follows.
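A minimal hedged sketch, assuming rank 0 collects one value from each of the other procs-1 ranks (procs is assumed to be set with MPI_Comm_size; the names vals, reqs and stats are illustrative, and stdlib.h is included for malloc):
    /* Sketch: post one non-blocking receive per sender, then wait for all of them. */
    double *vals = malloc(sizeof(double) * procs);
    MPI_Request *reqs = malloc(sizeof(MPI_Request) * procs);
    MPI_Status *stats = malloc(sizeof(MPI_Status) * procs);
    int i;
    for (i = 1; i < procs; i++)
        MPI_Irecv(&vals[i], 1, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, &reqs[i - 1]);
    MPI_Waitall(procs - 1, reqs, stats);   /* all procs-1 receives have completed here */
    free(vals); free(reqs); free(stats);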
Inside of the functions of collective communications
• Usually, the collective communication functions are implemented with point-to-point message passing functions such as MPI_Send, MPI_Recv, MPI_Isend and MPI_Irecv.
Inside of MPI_Bcast: one of the simplest implementations
    int MPI_Bcast(char *a, int c, MPI_Datatype d, int root, MPI_Comm comm)
    {
        int i, myid, procs;
        MPI_Status st;

        MPI_Comm_rank(comm, &myid);
        MPI_Comm_size(comm, &procs);
        if (myid == root) {
            /* the root sends the data to every other rank, one by one */
            for (i = 0; i < procs; i++)
                if (i != root)
                    MPI_Send(a, c, d, i, 0, comm);
        } else {
            MPI_Recv(a, c, d, root, 0, comm, &st);
        }
        return 0;
    }
Another implementation: with MPI_Isend
    int MPI_Bcast(char *a, int c, MPI_Datatype d, int root, MPI_Comm comm)
    {
        int i, myid, procs, cntr;
        MPI_Status st, *stats;
        MPI_Request *reqs;

        MPI_Comm_rank(comm, &myid);
        MPI_Comm_size(comm, &procs);
        if (myid == root) {
            stats = (MPI_Status *)malloc(sizeof(MPI_Status) * procs);
            reqs = (MPI_Request *)malloc(sizeof(MPI_Request) * procs);
            cntr = 0;
            /* start all sends at once, then wait for all of them */
            for (i = 0; i < procs; i++)
                if (i != root)
                    MPI_Isend(a, c, d, i, 0, comm, &(reqs[cntr++]));
            MPI_Waitall(procs - 1, reqs, stats);
            free(stats);
            free(reqs);
        } else {
            MPI_Recv(a, c, d, root, 0, comm, &st);
        }
        return 0;
    }
Another implementation: binomial tree
    int MPI_Bcast(char *a, int c, MPI_Datatype d, int root, MPI_Comm comm)
    {
        int myid, procs;
        MPI_Status st;
        int mask, relative_rank, src, dst;

        MPI_Comm_rank(comm, &myid);
        MPI_Comm_size(comm, &procs);
        relative_rank = myid - root;
        if (relative_rank < 0)
            relative_rank += procs;
        /* receive phase: wait for the data from the parent in the tree */
        mask = 1;
        while (mask < procs) {
            if (relative_rank & mask) {
                src = myid - mask;
                if (src < 0)
                    src += procs;
                MPI_Recv(a, c, d, src, 0, comm, &st);
                break;
            }
            mask <<= 1;
        }
        /* send phase: forward the data to the children in the tree */
        mask >>= 1;
        while (mask > 0) {
            if (relative_rank + mask < procs) {
                dst = myid + mask;
                if (dst >= procs)
                    dst -= procs;
                MPI_Send(a, c, d, dst, 0, comm);
            }
            mask >>= 1;
        }
        return 0;
    }
Flow of the binomial tree
• Each rank uses 'mask' to determine when and with whom to Send/Recv.
• Example with 8 ranks and root = 0:
  • mask = 4: rank 0 sends to rank 4 (rank 4 receives from rank 0).
  • mask = 2: rank 0 sends to rank 2, and rank 4 sends to rank 6 (ranks 2 and 6 receive).
  • mask = 1: rank 0 sends to rank 1, rank 2 to rank 3, rank 4 to rank 5, and rank 6 to rank 7 (ranks 1, 3, 5 and 7 receive).
• The broadcast thus completes in log2(8) = 3 steps instead of the 7 sequential sends of the simple implementation.
Deadlock
• A state in which the program cannot proceed for some reason.
• Places where you need to be careful about deadlocks in MPI programs:
  1. MPI_Recv, MPI_Wait, MPI_Waitall
  2. Collective communications: a program cannot proceed until all processes have called the same collective communication function.
• Wrong case: both ranks block in MPI_Recv, so neither reaches its MPI_Send.
    if (myid == 0) {
        MPI_Recv from rank 1
        MPI_Send to rank 1
    }
    if (myid == 1) {
        MPI_Recv from rank 0
        MPI_Send to rank 0
    }
• One solution: use MPI_Irecv, so that each rank can go on to its MPI_Send.
    if (myid == 0) {
        MPI_Irecv from rank 1
        MPI_Send to rank 1
        MPI_Wait
    }
    if (myid == 1) {
        MPI_Irecv from rank 0
        MPI_Send to rank 0
        MPI_Wait
    }
A concrete C sketch of this exchange follows.
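A minimal C sketch of the corrected exchange between ranks 0 and 1 (the names sendbuf and recvbuf are illustrative; each rank posts its receive first, then sends, then waits):
    /* Sketch: deadlock-free exchange of one double between ranks 0 and 1. */
    double sendbuf = 1.0 * myid, recvbuf;
    MPI_Request req;
    MPI_Status status;
    int partner = 1 - myid;        /* rank 0 exchanges with rank 1, and vice versa */
    if (myid == 0 || myid == 1) {
        MPI_Irecv(&recvbuf, 1, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &req);
        MPI_Send(&sendbuf, 1, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD);
        MPI_Wait(&req, &status);   /* recvbuf is valid from here on */
    }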