
CS 179: GPU Programming



Presentation Transcript


  1. CS 179: GPU Programming Lecture 14: Inter-process Communication

  2. The Problem • What if we want to use GPUs across a distributed system? (Image: GPU cluster, CSIRO)

  3. Distributed System • A collection of computers • Each computer has its own local memory! • Communication over a network • Communication suddenly becomes harder! (and slower!) • GPUs can’t be trivially used between computers

  4. We need some way to communicate between processes…

  5. Message Passing Interface (MPI) • A standard for message-passing • Multiple implementations exist • Standard functions that allow easy communication of data between processes • Works on non-networked systems too! • (Equivalent to memory copying on local system)

  6. MPI Functions • There are seven basic functions: • MPI_Init: initialize the MPI environment • MPI_Finalize: terminate the MPI environment • MPI_Comm_size: how many processes we have running • MPI_Comm_rank: the ID of our process • MPI_Isend: send data (non-blocking) • MPI_Irecv: receive data (non-blocking) • MPI_Wait: wait for a request to complete

  7. MPI Functions • Some additional functions: • MPI_Barrier: wait for all processes to reach a certain point • MPI_Bcast: send data to all other processes • MPI_Reduce: receive data from all processes and reduce it to a value • MPI_Send: send data (blocking) • MPI_Recv: receive data (blocking)

  8. The workings of MPI • Big idea: • We have some process ID • We know how many processes there are • A matching send request/receive request pair is necessary to communicate data!

  9. MPI Functions – Init • int MPI_Init( int *argc, char ***argv ) • Takes in pointers to the command-line parameters (more on this later)

  10. MPI Functions – Size, Rank • int MPI_Comm_size( MPI_Comm comm, int *size ) • int MPI_Comm_rank( MPI_Comm comm, int *rank ) • Take in a “communicator” • For us, this will be MPI_COMM_WORLD – essentially, our communication group contains all processes • Take in a pointer to where we will store our size (or rank)
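
  A minimal sketch of how these calls typically fit together (the variable names rank and numprocs are just illustrative):

  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char **argv) {
      int rank, numprocs;
      MPI_Init(&argc, &argv);                    /* initialize the MPI environment */
      MPI_Comm_size(MPI_COMM_WORLD, &numprocs);  /* how many processes are running */
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);      /* this process's ID */
      printf("process %d of %d\n", rank, numprocs);
      MPI_Finalize();                            /* terminate the MPI environment */
      return 0;
  }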

  11. MPI Functions – Send • int MPI_Isend(const void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request) • Non-blocking send • Program can proceed without waiting for the operation to complete! • (Can use MPI_Send for “blocking” send)

  12. MPI Functions – Send • int MPI_Isend(const void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request) • Takes in… • A pointer to what we send • Number of elements to send • Type of data (special MPI designations MPI_INT, MPI_FLOAT, etc.) • Destination process ID …

  13. MPI Functions – Send • int MPI_Isend(const void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request) • Takes in… (continued) • A “tag” – a nonnegative integer (0 up to at least 32767) • Identifies our operation – used to tell messages apart when many are sent simultaneously • MPI_ANY_TAG matches any incoming tag on a receive • A “communicator” (MPI_COMM_WORLD) • A “request” object • Holds information about our operation
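
  For concreteness, a single MPI_Isend call with the arguments spelled out (the values and variable names are arbitrary):

  int value = 42;
  MPI_Request request;
  /* send one MPI_INT from &value to process 1, tagged 7 */
  MPI_Isend(&value, 1, MPI_INT, /* dest */ 1, /* tag */ 7,
            MPI_COMM_WORLD, &request);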

  14. MPI Functions - Recv • int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request) • Non-blocking receive (similar idea) • (Can use MPI_Recv for “blocking” receive)

  15. MPI Functions - Recv • int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request) • Takes in… • A buffer (pointer to where received data is placed) • Number of elements to receive • Type of data (special MPI designations MPI_INT, MPI_FLOAT, etc.) • Sender’s process ID …

  16. MPI Functions - Recv • int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request) • Takes in… • A “tag” • Should be the same tag as the one used to send the message! (assuming MPI_ANY_TAG is not in use) • A “communicator” (MPI_COMM_WORLD) • A “request” object
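
  The matching non-blocking receive for such a send would look something like this on process 1 (same tag, source 0; names again arbitrary):

  int value;
  MPI_Request request;
  /* receive one MPI_INT from process 0, tagged 7, into &value */
  MPI_Irecv(&value, 1, MPI_INT, /* source */ 0, /* tag */ 7,
            MPI_COMM_WORLD, &request);
  MPI_Wait(&request, MPI_STATUS_IGNORE);  /* value is only valid after the wait */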

  17. Blocking vs. Non-blocking • MPI_Isend and MPI_Irecv are non-blocking communications • Calling these functions returns immediately • Operation may not be finished! • Should use MPI_Wait to make sure operations are completed • Special “request” objects for tracking status • MPI_Send and MPI_Recv are blocking communications • Functions don’t return until operation is complete • Can cause deadlock! • (we won’t focus on these)
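
  A sketch of the classic pitfall with the blocking versions, and the non-blocking fix (sendbuf, recvbuf, N, and rank are assumed to exist already; whether the blocking version actually hangs depends on the message size and the implementation's internal buffering):

  /* Both ranks send first, then receive. MPI_Send is allowed to block until
     the matching receive is posted, so for large messages both ranks can sit
     in MPI_Send forever: deadlock. */
  int other = (rank == 0) ? 1 : 0;
  MPI_Send(sendbuf, N, MPI_FLOAT, other, 0, MPI_COMM_WORLD);
  MPI_Recv(recvbuf, N, MPI_FLOAT, other, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

  /* Non-blocking version: post both operations, then wait on both. Neither
     call blocks, so the exchange always completes. */
  MPI_Request reqs[2];
  MPI_Isend(sendbuf, N, MPI_FLOAT, other, 0, MPI_COMM_WORLD, &reqs[0]);
  MPI_Irecv(recvbuf, N, MPI_FLOAT, other, 0, MPI_COMM_WORLD, &reqs[1]);
  MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);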

  18. MPI Functions - Wait • int MPI_Wait(MPI_Request *request, MPI_Status *status) • Takes in… • A “request” object corresponding to a previous operation • Indicates what we’re waiting on • A “status” object • Basically, information about the incoming data
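
  A short sketch of the request/status pairing, assuming request came from an earlier MPI_Irecv:

  MPI_Status status;
  MPI_Wait(&request, &status);
  /* After the wait completes, the status object describes the message. */
  printf("got a message from rank %d with tag %d\n",
         status.MPI_SOURCE, status.MPI_TAG);

  If the status information isn't needed, MPI_STATUS_IGNORE can be passed in its place.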

  19. MPI Functions - Reduce • int MPI_Reduce(const void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm) • Takes in… • A “send buffer” (data obtained from every process) • A “receive buffer” (where our final result will be placed) • Number of elements in the send buffer • Can reduce element-wise: array -> array • Type of data (MPI label, as before) …

  20. MPI Functions - Reduce • int MPI_Reduce(const void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm) • Takes in… (continued) • Reducing operation (special MPI labels, e.g. MPI_SUM, MPI_MIN) • ID of the process that obtains the result • MPI communicator (as before)
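
  A minimal example: summing one float per process onto process 0 (variable names are illustrative):

  float local_sum = 1.0f;   /* stand-in for this process's partial result */
  float global_sum = 0.0f;
  /* every process contributes local_sum; only process 0 receives the total */
  MPI_Reduce(&local_sum, &global_sum, 1, MPI_FLOAT, MPI_SUM, 0,
             MPI_COMM_WORLD);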

  21. MPI Example • Two processes • Sends a number from process 0 to process 1 • Note: Both processes are running this code!

  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char **argv) {
      int rank, numprocs;
      MPI_Status status;
      MPI_Request request;

      MPI_Init(&argc, &argv);
      MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      int tag = 1234;
      int source = 0;
      int destination = 1;
      int count = 1;
      int send_buffer;
      int recv_buffer;

      /* Process 0 posts the send; process 1 posts the matching receive. */
      if (rank == source) {
          send_buffer = 5678;
          MPI_Isend(&send_buffer, count, MPI_INT, destination, tag,
                    MPI_COMM_WORLD, &request);
      }
      if (rank == destination) {
          MPI_Irecv(&recv_buffer, count, MPI_INT, source, tag,
                    MPI_COMM_WORLD, &request);
      }

      /* Each process waits on its own request. */
      MPI_Wait(&request, &status);

      if (rank == source) {
          printf("processor %d sent %d\n", rank, send_buffer);
      }
      if (rank == destination) {
          printf("processor %d got %d\n", rank, recv_buffer);
      }

      MPI_Finalize();
      return 0;
  }
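
  With a typical MPI implementation such as Open MPI or MPICH, a program like this is usually compiled with the mpicc wrapper and launched with mpirun or mpiexec, e.g. mpirun -np 2 ./a.out, so that exactly two copies of the program run.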

  22. Solving Problems • Goal: To use GPUs across multiple computers! • Two levels of division: • Divide work between processes (using MPI) • Divide work within a GPU (using CUDA, as usual) • Important note: • Single-program, multiple-data (SPMD) for our purposes, i.e. running multiple copies of the same program!

  23. Multiple Computers, Multiple GPUs • Can combine MPI, CUDA streams/device selection, and single-GPU CUDA!

  24. MPI/CUDA Example • Suppose we have our “polynomial problem” again • Suppose we have n processes for n computers • Each with one GPU • (Suppose array of data is obtained through process 0) • Every process will run the following:

  25. MPI/CUDA Example

  MPI_Init
  declare rank, numberOfProcesses
  set rank with MPI_Comm_rank
  set numberOfProcesses with MPI_Comm_size
  if process rank is 0:
      array <- (obtain our raw data)
      for idx = 0 through numberOfProcesses - 1:
          send fragment idx of the array to process ID idx using MPI_Isend
          (each fragment has size (size of array) / numberOfProcesses)
  declare local_data
  read into local_data for our process using MPI_Irecv
  MPI_Wait for our process's receive request to finish
  local_sum <- process local_data with CUDA and obtain a sum
  declare global_sum
  set global_sum on process 0 with MPI_Reduce
  (use global_sum somehow)
  MPI_Finalize
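
  A rough C sketch of this pseudocode, assuming a hypothetical helper gpu_sum(float *data, int n) that launches a CUDA kernel and returns the sum of its input (a trivial CPU stand-in is shown here), assuming the array length divides evenly among processes, and using MPI_Waitall to complete process 0's sends:

  #include <stdio.h>
  #include <stdlib.h>
  #include <mpi.h>

  /* hypothetical CUDA wrapper; a trivial CPU stand-in for the sketch */
  float gpu_sum(float *data, int n) {
      float s = 0.0f;
      for (int i = 0; i < n; i++) s += data[i];
      return s;
  }

  int main(int argc, char **argv) {
      int rank, numberOfProcesses;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &numberOfProcesses);

      int total = 1 << 20;                     /* assumed total problem size */
      int frag = total / numberOfProcesses;    /* elements per process */

      /* Process 0 obtains the raw data and sends one fragment to each
         process (including itself). */
      float *array = NULL;
      MPI_Request *send_reqs = NULL;
      if (rank == 0) {
          array = malloc(total * sizeof(float));
          for (int i = 0; i < total; i++) array[i] = 1.0f;  /* stand-in data */
          send_reqs = malloc(numberOfProcesses * sizeof(MPI_Request));
          for (int idx = 0; idx < numberOfProcesses; idx++)
              MPI_Isend(array + idx * frag, frag, MPI_FLOAT, idx, 0,
                        MPI_COMM_WORLD, &send_reqs[idx]);
      }

      /* Every process receives its own fragment. */
      float *local_data = malloc(frag * sizeof(float));
      MPI_Request recv_req;
      MPI_Irecv(local_data, frag, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &recv_req);
      MPI_Wait(&recv_req, MPI_STATUS_IGNORE);
      if (rank == 0)
          MPI_Waitall(numberOfProcesses, send_reqs, MPI_STATUSES_IGNORE);

      /* Local GPU work, then reduce all local sums onto process 0. */
      float local_sum = gpu_sum(local_data, frag);
      float global_sum = 0.0f;
      MPI_Reduce(&local_sum, &global_sum, 1, MPI_FLOAT, MPI_SUM, 0,
                 MPI_COMM_WORLD);

      if (rank == 0)
          printf("global sum: %f\n", global_sum);

      MPI_Finalize();
      return 0;
  }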

  26. MPI/CUDA – Wave Equation • Recall our calculation: the value at position x and timestep t+1 is computed from the values at positions x-1, x, and x+1 at timestep t and the value at position x at timestep t-1 (the standard 1D finite-difference update, y[x][t+1] = 2*y[x][t] - y[x][t-1] + (c*dt/dx)^2 * (y[x+1][t] - 2*y[x][t] + y[x-1][t]))

  27. MPI/CUDA – Wave Equation • Big idea: Divide our data array between n processes!

  28. MPI/CUDA – Wave Equation • In reality: We have three regions of data at a time (old, current, new)

  29. MPI/CUDA – Wave Equation • Calculation for timestep t+1 uses the following data: (diagram: rows of grid values for timesteps t+1, t, and t-1)

  30. MPI/CUDA – Wave Equation • Problem if we’re at the boundary of a process! (diagram: timestep rows t+1, t, t-1, with position x at the edge of this process’s region) • Where do we get the neighboring value at timestep t? (It’s outside our process!)

  31. Wave Equation – Simple Solution • After every time-step, each process gives its leftmost and rightmost piece of “current” data to neighbor processes! (diagram: processes Proc0 through Proc4)

  32. Wave Equation – Simple Solution • Pieces of data to communicate: (diagram: processes Proc0 through Proc4)

  33. Wave Equation – Simple Solution • Can do this with MPI_Irecv, MPI_Isend, MPI_Wait: • Suppose process has rank r: • If we’re not the rightmost process: • Send data to process r+1 • Receive data from process r+1 • If we’re not the leftmost process: • Send data to process r-1 • Receive data from process r-1 • Wait on requests
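
  A minimal C sketch of this exchange, assuming rank and size came from MPI_Comm_rank / MPI_Comm_size, that each process holds its piece of the “current” timestep in a float array cur of length n, and that left_ghost / right_ghost are slots for the neighbors’ values; it waits on all posted requests with MPI_Waitall. Tag 0 marks data moving right, tag 1 data moving left:

  MPI_Request reqs[4];
  int nreq = 0;
  float left_ghost, right_ghost;

  if (rank != size - 1) {   /* not the rightmost process */
      MPI_Isend(&cur[n - 1], 1, MPI_FLOAT, rank + 1, 0,   /* our right edge */
                MPI_COMM_WORLD, &reqs[nreq++]);
      MPI_Irecv(&right_ghost, 1, MPI_FLOAT, rank + 1, 1,  /* neighbor's left edge */
                MPI_COMM_WORLD, &reqs[nreq++]);
  }
  if (rank != 0) {          /* not the leftmost process */
      MPI_Isend(&cur[0], 1, MPI_FLOAT, rank - 1, 1,       /* our left edge */
                MPI_COMM_WORLD, &reqs[nreq++]);
      MPI_Irecv(&left_ghost, 1, MPI_FLOAT, rank - 1, 0,   /* neighbor's right edge */
                MPI_COMM_WORLD, &reqs[nreq++]);
  }

  /* Wait on everything we posted before using the ghost values. */
  MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);

  The rank 0 and rank size-1 processes skip one side of the exchange and instead apply the boundary conditions there, as the next slide describes.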

  34. Wave Equation – Simple Solution • Boundary conditions: • Use MPI_Comm_rank and MPI_Comm_size • Rank 0 process will set leftmost condition • Rank (size-1) process will set rightmost condition

  35. Simple Solution – Problems • Communication can be expensive! • It’s expensive to communicate every timestep just to send 1 value! • Better solution: Send some m values every m timesteps! • (Sending a boundary region of width m lets each process advance m timesteps before it has to communicate again)
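
  A rough sketch of that loop structure in terms of a “ghost zone” of width m on each side of a process’s fragment (exchange_ghost_cells and compute_one_timestep are hypothetical helpers):

  for (int t = 0; t < num_timesteps; t += m) {
      exchange_ghost_cells(m);   /* one MPI exchange of m values per side */
      for (int k = 0; k < m; k++) {
          /* Each local step invalidates one more ghost cell on each side,
             so after m steps the process must exchange again. */
          compute_one_timestep();
      }
  }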
