
Parallel Programming


Presentation Transcript


  1. Parallel Programming AMANO, Hideharu

  2. Parallel Programming • Message Passing • PVM • MPI • Shared Memory • POSIX thread • OpenMP • CUDA/OpenCL • Automatic Parallelizing Compilers

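  3. Message passing (blocking: rendezvous). [Diagram: Send and Receive on two process timelines.] The sender and the receiver wait for each other, so the transfer happens only when both are ready.

  4. Message passing (with buffer). [Diagram: Send and Receive on two process timelines.] The message is stored in a buffer, so the sender can continue before the receiver is ready.

  5. Message passing (non-blocking). [Diagram: Send, Receive, and Other Job.] A process does another job while the communication is pending.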
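The buffered style of message passing above can be sketched inside a single process with POSIX threads, which a later slide lists. This is only an illustration under stated assumptions: the names mailbox_t, mailbox_send, mailbox_recv, sender_main and run_exchange are hypothetical and are not part of PVM or MPI, and the buffer holds a single message.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical one-slot mailbox: a buffered channel with capacity 1. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  not_full;   /* signaled when the slot becomes empty */
    pthread_cond_t  not_empty;  /* signaled when the slot is filled */
    int             full;       /* 1 while a message is waiting */
    int             value;      /* the buffered message */
} mailbox_t;

void mailbox_init(mailbox_t *mb)
{
    pthread_mutex_init(&mb->lock, NULL);
    pthread_cond_init(&mb->not_full, NULL);
    pthread_cond_init(&mb->not_empty, NULL);
    mb->full = 0;
    mb->value = 0;
}

/* Send: returns as soon as the message is in the buffer. It blocks only
   while the slot is still occupied by an earlier message. */
void mailbox_send(mailbox_t *mb, int msg)
{
    pthread_mutex_lock(&mb->lock);
    while (mb->full)
        pthread_cond_wait(&mb->not_full, &mb->lock);
    mb->value = msg;
    mb->full = 1;
    pthread_cond_signal(&mb->not_empty);
    pthread_mutex_unlock(&mb->lock);
}

/* Receive: blocks until a message is available, then empties the slot. */
int mailbox_recv(mailbox_t *mb)
{
    int msg;
    pthread_mutex_lock(&mb->lock);
    while (!mb->full)
        pthread_cond_wait(&mb->not_empty, &mb->lock);
    msg = mb->value;
    mb->full = 0;
    pthread_cond_signal(&mb->not_full);
    pthread_mutex_unlock(&mb->lock);
    return msg;
}

/* Sender thread body: sends the messages 1..5 through the mailbox. */
void *sender_main(void *arg)
{
    mailbox_t *mb = arg;
    for (int i = 1; i <= 5; i++)
        mailbox_send(mb, i);
    return NULL;
}

/* Starts one sender thread and returns the sum of the five messages. */
int run_exchange(void)
{
    mailbox_t mb;
    pthread_t tid;
    int sum = 0;

    mailbox_init(&mb);
    pthread_create(&tid, NULL, sender_main, &mb);
    for (int i = 0; i < 5; i++)
        sum += mailbox_recv(&mb);
    pthread_join(tid, NULL);
    return sum;
}
```

Calling run_exchange() receives 1 through 5 and returns their sum. With a one-slot buffer the sender can run at most one message ahead of the receiver. A rendezvous would correspond to a send that also waits until the receiver has taken the message.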

  6. PVM (Parallel Virtual Machine) • A buffer is provided for the sender. • Both blocking and non-blocking receives are provided. • Barrier synchronization

  7. MPI (Message Passing Interface) • Superset of PVM for one-to-one communication. • Group communication • Various kinds of communication are supported. • Error checking with communication tags.

  8. Programming style using MPI • SPMD (Single Program, Multiple Data streams) • Multiple processes execute the same program. • Each process does independent work based on its process number. • Program execution using MPI • A specified number of processes is generated. • They are distributed across the nodes of a NORA machine or PC cluster.

  9. Communication methods • Point-to-point communication • A sender and a receiver execute functions for sending and receiving, respectively. • The two function calls must match exactly. • Collective communication • Communication among multiple processes. • The same function is executed by all participating processes. • Can be replaced with a sequence of point-to-point communications, but is sometimes more efficient.

  10. Fundamental MPI functions • Most programs can be written using six fundamental functions • MPI_Init() … MPI initialization • MPI_Comm_rank() … Get the process number (rank) • MPI_Comm_size() … Get the total number of processes • MPI_Send() … Send a message • MPI_Recv() … Receive a message • MPI_Finalize() … MPI termination

  11. Other MPI functions • Functions for measurement • MPI_Barrier() … barrier synchronization • MPI_Wtime() … get the wall-clock time • Non-blocking functions • Consist of a communication request and a completion check. • Other computation can be executed while waiting.

  12. An Example

    #include <stdio.h>
    #include <mpi.h>

    #define MSIZE 64

    int main(int argc, char **argv)
    {
        char msg[MSIZE];
        int pid, nprocs, i;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &pid);     /* this process's number */
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);  /* total number of processes */

        if (pid == 0) {
            /* process 0 receives one message from each other process, in order */
            for (i = 1; i < nprocs; i++) {
                MPI_Recv(msg, MSIZE, MPI_CHAR, i, 0, MPI_COMM_WORLD, &status);
                fputs(msg, stdout);
            }
        }
        else {
            /* every other process sends a greeting to process 0 */
            sprintf(msg, "Hello, world! (from process #%d)\n", pid);
            MPI_Send(msg, MSIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();

        return 0;
    }

  13. Initialize and Terminate

    int MPI_Init(
        int *argc,      /* pointer to argc */
        char ***argv    /* pointer to argv */
    );

    mpi_init(ierr)
    integer ierr    ! return code

    The command-line arguments must be passed directly through argc and argv.

    int MPI_Finalize();

    mpi_finalize(ierr)
    integer ierr    ! return code

  14. Communicator functions

    MPI_Comm_rank returns the rank (process ID) in the communicator comm.

    int MPI_Comm_rank(
        MPI_Comm comm,  /* communicator */
        int *rank       /* process ID (output) */
    );

    mpi_comm_rank(comm, rank, ierr)
    integer comm, rank
    integer ierr    ! return code

    MPI_Comm_size returns the total number of processes in the communicator comm.

    int MPI_Comm_size(
        MPI_Comm comm,  /* communicator */
        int *size       /* number of processes (output) */
    );

    mpi_comm_size(comm, size, ierr)
    integer comm, size
    integer ierr    ! return code

    • Communicators are used for sharing a communication space among a subset of processes. MPI_COMM_WORLD is the predefined communicator that contains all processes.

  15. MPI_Send

    It sends data to the process "dest".

    int MPI_Send(
        void *buf,              /* send buffer */
        int count,              /* # of elements to send */
        MPI_Datatype datatype,  /* datatype of elements */
        int dest,               /* destination (receiver) process ID */
        int tag,                /* tag */
        MPI_Comm comm           /* communicator */
    );

    mpi_send(buf, count, datatype, dest, tag, comm, ierr)
    <type> buf(*)
    integer count, datatype, dest, tag, comm
    integer ierr    ! return code

    • Tags are used to identify messages.

  16. MPI_Recv

    int MPI_Recv(
        void *buf,              /* receive buffer */
        int count,              /* # of elements to receive */
        MPI_Datatype datatype,  /* datatype of elements */
        int source,             /* source (sender) process ID */
        int tag,                /* tag */
        MPI_Comm comm,          /* communicator */
        MPI_Status *status      /* status (output) */
    );

    mpi_recv(buf, count, datatype, source, tag, comm, status, ierr)
    <type> buf(*)
    integer count, datatype, source, tag, comm, status(mpi_status_size)
    integer ierr    ! return code

    • The same tag as the sender's must be passed to MPI_Recv.
    • Pass a pointer to an MPI_Status variable. It is a structure with three members, MPI_SOURCE, MPI_TAG and MPI_ERROR, which store the process ID of the sender, the tag, and the error code.

  17. datatype and count • The size of the message is determined by count and datatype. • MPI_CHAR: char • MPI_INT: int • MPI_FLOAT: float • MPI_DOUBLE: double • … etc.

  18. Compile and Execution

    % icc -o hello hello.c -lmpi
    % mpirun -np 8 ./hello
    Hello, world! (from process #1)
    Hello, world! (from process #2)
    Hello, world! (from process #3)
    Hello, world! (from process #4)
    Hello, world! (from process #5)
    Hello, world! (from process #6)
    Hello, world! (from process #7)

  19. POSIX Thread • Standard API on Linux for controlling threads. • POSIX: Portable Operating System Interface • Thread handling • pthread_create(); pthread_join(); pthread_exit(); • Synchronization • Mutex • pthread_mutex_lock(); pthread_mutex_trylock(); pthread_mutex_unlock(); • Condition variable: Semaphore • pthread_cond_signal(); pthread_cond_wait(); etc.
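As a minimal sketch of the thread-handling and mutex calls listed above, the following code starts several threads that increment one shared counter. The names worker, count_with_threads and MAX_THREADS are hypothetical and chosen only for this illustration. Error checking of the pthread calls is left out for brevity.

```c
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 16          /* arbitrary limit for this sketch */

static long counter;            /* shared variable, visible to all threads */
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

/* Thread body: adds 1 to the shared counter n times, holding the mutex
   for each update so that no increment is lost. */
static void *worker(void *arg)
{
    long n = *(long *)arg;
    for (long i = 0; i < n; i++) {
        pthread_mutex_lock(&counter_lock);
        counter++;
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}

/* Runs nthreads workers and returns the final counter value,
   or -1 if nthreads is outside 1..MAX_THREADS. */
long count_with_threads(int nthreads, long increments)
{
    pthread_t tids[MAX_THREADS];

    if (nthreads < 1 || nthreads > MAX_THREADS)
        return -1;

    counter = 0;
    for (int i = 0; i < nthreads; i++)
        pthread_create(&tids[i], NULL, worker, &increments);
    for (int i = 0; i < nthreads; i++)
        pthread_join(tids[i], NULL);   /* wait for every worker to finish */
    return counter;
}
```

For example, count_with_threads(4, 10000) returns 40000. Without the lock and unlock calls the increments of different threads could interleave and the result could be smaller.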

  20. OpenMP

    #include <stdio.h>
    #include <omp.h>

    int main()
    {
    #pragma omp parallel
        {
            int tid, npes;
            tid = omp_get_thread_num();     /* number of this thread */
            npes = omp_get_num_threads();   /* number of threads in the team */
            printf("Hello World from %d of %d\n", tid, npes);
        }
        return 0;
    }

    • Multiple threads are generated by using a pragma.
    • Variables declared globally are shared among the threads.

  21. Convenient pragma for parallel execution

    #pragma omp parallel
    {
    #pragma omp for
        for (i = 0; i < 1000; i++) {
            c[i] = a[i] + b[i];
        }
    }

    • The assignment of iterations i to threads is adjusted automatically so that the load on each thread is even.
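How iterations end up on threads can be illustrated with a small calculation. The OpenMP specification leaves the default schedule to the implementation, so the following is only a sketch of one even block partition, not necessarily what a given compiler does. The names range_t and static_chunk are hypothetical.

```c
/* Half-open iteration range [start, end) assigned to one thread. */
typedef struct {
    int start;
    int end;
} range_t;

/* Splits n iterations among nthreads threads as evenly as possible and
   returns the range for thread tid (0 <= tid < nthreads). When n is not
   divisible by nthreads, the first n % nthreads threads get one extra
   iteration each. */
range_t static_chunk(int n, int nthreads, int tid)
{
    int base = n / nthreads;
    int extra = n % nthreads;
    range_t r;

    r.start = tid * base + (tid < extra ? tid : extra);
    r.end = r.start + base + (tid < extra ? 1 : 0);
    return r;
}
```

With the loop above (1000 iterations) and 4 threads, static_chunk(1000, 4, 3) gives the range 750 to 1000, so every thread handles 250 iterations. With 10 iterations and 4 threads the chunk sizes are 3, 3, 2 and 2.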

  22. CUDA/OpenCL • CUDA was developed for GPGPU programming. • SPMD (Single Program, Multiple Data) • 3-D management of threads • 32 threads are managed as a warp • SIMD programming • Architecture-dependent memory model • OpenCL is a standard language for heterogeneous accelerators.

  23. Heterogeneous Programming with CUDA. [Diagram: execution alternates between host and device.] Host: serial code • Device: parallel kernel, KernelA(args); • Host: serial code • Device: parallel kernel, KernelB(args); • …

  24. Threads and thread blocks
    • Each thread executes the same code.
    • Kernel = grid of thread blocks (Thread Block 0, Thread Block 1, …, Thread Block N-1).
    • [Diagram: each block holds threads with threadID 0 to 7, all running the code below.]

        …
        float x = input[threadID];
        float y = func(x);
        output[threadID] = y;
        …

    • Threads in the same block may synchronize with barriers: __syncthreads();
    • Thread blocks cannot synchronize with each other, so their execution order depends on the machine.

  25. Memory Hierarchy • Thread: per-thread local memory • Block: per-block shared memory • Sequential kernels (Kernel 0, Kernel 1, …): per-device global memory • For transfers to and from host memory, cudaMemcpy() is used.

  26. CUDA extensions • Declaration specifiers: • __global__ void kernelFunc(…); // kernel function, runs on device • __device__ int GlobalVar; // variable in device memory • __shared__ int sharedVar; // variable in per-block shared memory • Extended function invocation syntax for parallel kernel launch • KernelFunc<<<dimGrid, dimBlock>>>(…); // launch dimGrid blocks with dimBlock threads each • Special variables for thread identification in kernels • dim3 threadIdx; dim3 blockIdx; dim3 blockDim; dim3 gridDim; • Barrier synchronization between threads • __syncthreads();

  27. CUDA runtime • Device management: • cudaGetDeviceCount(), cudaGetDeviceProperties() • Device memory management: • cudaMalloc(), cudaFree(), cudaMemcpy() • Graphics interoperability: • cudaGLMapBufferObject(), cudaD3D9MapResources() • Texture management: • cudaBindTexture(), cudaBindTextureToArray()

  28. Example: Increment Array Elements

    GPU program:

    __global__ void increment_gpu(float *a, float b, int N)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N)
            a[idx] = a[idx] + b;
    }

    int main()
    {
        …
        dim3 dimBlock(blocksize);
        dim3 dimGrid(ceil(N / (float)blocksize));
        increment_gpu<<<dimGrid, dimBlock>>>(a, b, N);
    }

    CPU program:

    void increment_cpu(float *a, float b, int N)
    {
        for (int idx = 0; idx < N; idx++)
            a[idx] = a[idx] + b;
    }

    int main()
    {
        …
        increment_cpu(a, b, N);
    }

  29. Example: Increment Array Elements

    Let's assume N=16 and blockDim=4:

    blockIdx.x=0  blockDim.x=4  threadIdx.x=0,1,2,3  idx=0,1,2,3
    blockIdx.x=1  blockDim.x=4  threadIdx.x=0,1,2,3  idx=4,5,6,7
    blockIdx.x=2  blockDim.x=4  threadIdx.x=0,1,2,3  idx=8,9,10,11
    blockIdx.x=3  blockDim.x=4  threadIdx.x=0,1,2,3  idx=12,13,14,15

    • int idx = blockDim.x * blockIdx.x + threadIdx.x; maps the local index threadIdx to a global index.
    • blockDim should be >= 32 in real code!
    • Using a larger number of blocks hides the memory latency of the GPU.
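The index mapping above can be checked on the host with plain C. The sketch below is an emulation, not CUDA code: block_idx, block_dim and thread_idx stand in for the built-in variables blockIdx.x, blockDim.x and threadIdx.x, and the function names are hypothetical.

```c
/* Global element index of one thread, as computed inside the kernel. */
int global_index(int block_idx, int block_dim, int thread_idx)
{
    return block_idx * block_dim + thread_idx;
}

/* Number of blocks needed to cover n elements: ceil(n / block_dim),
   computed in integer arithmetic. */
int blocks_needed(int n, int block_dim)
{
    return (n + block_dim - 1) / block_dim;
}

/* Serial emulation of increment_gpu: visits every (block, thread) pair
   that a launch with blocks_needed(n, block_dim) blocks would create. */
void increment_emulated(float *a, float b, int n, int block_dim)
{
    int nblocks = blocks_needed(n, block_dim);

    for (int blk = 0; blk < nblocks; blk++) {
        for (int t = 0; t < block_dim; t++) {
            int idx = global_index(blk, block_dim, t);
            if (idx < n)            /* same bounds guard as the kernel */
                a[idx] = a[idx] + b;
        }
    }
}
```

For N=16 and a block size of 4, global_index(3, 4, 0) is 12, which matches the first thread of the last block. When N is not a multiple of the block size, for example N=17, blocks_needed returns 5 and the guard idx < n keeps the surplus threads of the last block from writing past the array.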

  30. Host code

    // allocate host memory
    unsigned int numBytes = N * sizeof(float);
    float *h_A = (float *) malloc(numBytes);

    // allocate device memory
    float *d_A = 0;
    cudaMalloc((void **)&d_A, numBytes);

    // copy data from host to device
    cudaMemcpy(d_A, h_A, numBytes, cudaMemcpyHostToDevice);

    // execute the kernel (N/blockSize covers all elements only if N is a multiple of blockSize)
    increment_gpu<<<N / blockSize, blockSize>>>(d_A, b, N);

    // copy data from device back to host
    cudaMemcpy(h_A, d_A, numBytes, cudaMemcpyDeviceToHost);

    // free device memory
    cudaFree(d_A);

  31. GeForce GTX280: 240 cores. [Block diagram: Host, Input Assembler, Thread Execution Manager, groups of Thread Processors each with PBSM blocks, Load/Store units, Global Memory.]

  32. Hardware Implementation: Execution Model. [Diagram: the host launches Kernel 1 and Kernel 2 on the device. Kernel 1 runs as Grid 1 with Blocks (0,0) to (2,1), and Kernel 2 runs as Grid 2. One block, Block (1,1), is expanded into its threads: Thread (0,0) to Thread (31,0) form Warp 0, Thread (32,0) to Thread (63,0) form Warp 1, Thread (0,1) to Thread (31,1) form Warp 2, Thread (32,1) to Thread (63,1) form Warp 3, Thread (0,2) to Thread (31,2) form Warp 4, and Thread (32,2) to Thread (63,2) form Warp 5.] A multiprocessor executes the same instruction on a group of threads called a warp. Warp size = the number of threads in a warp.
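The warp numbering in the diagram follows from numbering the threads of a block row by row and cutting that sequence into groups of 32. A minimal host-side sketch in plain C, assuming a warp size of 32 and a block that is 64 threads wide as in the diagram (warp_of_thread and WARP_SIZE are hypothetical names):

```c
#define WARP_SIZE 32    /* threads per warp, as stated in the slides */

/* Warp number of thread (x, y) in a block that is block_dim_x threads
   wide. Threads are numbered row by row (x varies fastest), and each
   run of WARP_SIZE consecutive threads forms one warp. */
int warp_of_thread(int x, int y, int block_dim_x)
{
    int linear = y * block_dim_x + x;
    return linear / WARP_SIZE;
}
```

For the 64-wide block in the diagram, warp_of_thread(32, 1, 64) is 3 and warp_of_thread(63, 2, 64) is 5, in agreement with the labels Warp 3 and Warp 5.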

  33. Automatic Parallelizing Compilers • Automatically translate code written for uniprocessors into code for multiprocessors. • Loop-level parallelism is the main target of parallelization. • Fortran codes have been the main targets • No pointers • The array structure is simple • Recently, restricted C has become a target language • OSCAR compiler (Waseda Univ.), COINS

  34. Shared memory model vs. message passing model • Shared memory: benefits • A distributed OS is easy to implement. • Automatic parallelizing compilers. • POSIX thread, OpenMP • Message passing: benefits • Formal verification is easy (blocking) • No side effects (a shared variable is itself a side effect) • Small cost

  35. Parallel Programming Contest • In this lecture, a parallel programming contest will be held. • All students who want to get the credit must join it. • At a minimum, the program must run correctly. • Students with good results will be given the credit unconditionally.
