Learn about hybrid parallel programming, which combines MPI and OpenMP to take advantage of both the shared memory within each multi-core computer and the distributed memory across a cluster. Explore the benefits and how to create a hybrid program.
Hybrid Parallel Programming: Introduction. ITCS 4145/5145, Parallel Programming. C. Ferner and B. Wilkinson, March 13, 2014. hybrid.ppt
Hybrid Systems
Since most computers are multi-core, most clusters have both shared memory and distributed memory.
[Figure: four multi-core computers, each with several cores sharing a local memory, connected by an interconnection network.]
Hybrid (MPI-OpenMP) Parallel Computing
• We can use MPI to run processes concurrently on each computer.
• We can use OpenMP to run threads concurrently on each core of a computer.
• Advantage: we can make use of shared memory where communication is required.
• Why? Because inter-computer communication is an order of magnitude slower than synchronization.
Message-passing routines pass messages between computer systems, while threads execute on each computer system using the multiple cores on that system.
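The overall structure can be sketched as follows (a minimal illustration written for these notes, not taken from the original slides): each MPI process runs on one computer and creates a team of OpenMP threads that share that computer's memory.

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char *argv[]) {
    int rank, size;

    MPI_Init(&argc, &argv);                 /* one MPI process per computer */
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel                    /* a team of threads within each process */
    printf("MPI process %d of %d, OpenMP thread %d of %d\n",
           rank, size, omp_get_thread_num(), omp_get_num_threads());

    MPI_Finalize();
    return 0;
}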
How to create a hybrid OpenMP-MPI program
• Write source code with both MPI routines and OpenMP directives/routines.
• mpicc uses gcc linked with the appropriate MPI libraries. gcc supports OpenMP with the -fopenmp option, so we can use:
  mpicc -fopenmp -o hybrid hybrid.c
• Execute as an MPI program. For example, on the UNCC cluster cci-gridgw.uncc.edu (VERY IMPORTANT: not from cci-grid05):
  mpiexec.hydra -f <machinesfile> -n <number of processes> ./hybrid
An illustrative example follows.
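For illustration only, with a hypothetical machines file (the node names below are made up, not taken from the slides), the full sequence might look like:

$ cat machines
node01
node02
node03
node04
$ mpicc -fopenmp -o hybrid hybrid.c
$ mpiexec.hydra -f machines -n 4 ./hybrid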
Example

#include <stdio.h>
#include <string.h>
#include <stddef.h>
#include <stdlib.h>
#include <omp.h>          /* added: needed for the OpenMP calls on the next slide */
#include "mpi.h"
#define CHUNKSIZE 10
#define N 100

void openmp_code();       /* defined on the next slide */

int main(int argc, char **argv) {
    char message[20];
    int i, rank, size, type = 99;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        strcpy(message, "Hello, world");
        for (i = 1; i < size; i++)
            MPI_Send(message, 13, MPI_CHAR, i, type, MPI_COMM_WORLD);
    } else
        MPI_Recv(message, 20, MPI_CHAR, 0, type, MPI_COMM_WORLD, &status);

    openmp_code();        /* all MPI processes run the OpenMP code, no message passing */

    printf("Message from process = %d : %.13s\n", rank, message);

    MPI_Finalize();
}
void openmp_code() {
    int nthreads, tid, i, chunk;
    float a[N], b[N], c[N];

    for (i = 0; i < N; i++)
        a[i] = b[i] = i * 1.0;              /* initialize arrays */
    chunk = CHUNKSIZE;

    #pragma omp parallel shared(a,b,c,nthreads,chunk) private(i,tid)
    {
        tid = omp_get_thread_num();
        if (tid == 0) {
            nthreads = omp_get_num_threads();
            printf("Number of threads = %d\n", nthreads);
        }
        printf("Thread %d starting...\n", tid);

        #pragma omp for schedule(dynamic,chunk)
        for (i = 0; i < N; i++) {
            c[i] = a[i] + b[i];
            printf("Thread %d: c[%d]= %f\n", tid, i, c[i]);
        }
    } /* end of parallel section */
}
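Nothing in this example sets the number of threads each MPI process creates; the OpenMP runtime chooses a default (typically the number of cores, or the value of the OMP_NUM_THREADS environment variable if set). A minimal sketch of setting it explicitly in code, written for these notes and not part of the original example:

#include <stdio.h>
#include <omp.h>

int main(void) {
    omp_set_num_threads(4);    /* request 4 threads in subsequent parallel regions */
    #pragma omp parallel
    printf("thread %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
    return 0;
}

A num_threads(4) clause on the parallel directive achieves the same thing, as in the Paraguin example later.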
Parallelizing a double for loop

int main(int argc, char *argv[]) {
    int i, j, blksz, rank, P, tid;
    char *usage = "Usage: %s \n";
    FILE *fd;
    char message[80];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &P);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    blksz = (int) ceil(((double) N) / P);

    #pragma omp parallel private(tid, i, j)
    {
        tid = omp_get_thread_num();
        for (i = rank*blksz; i < min((rank + 1) * blksz, N); i++) {   /* i loop parallelized across computers */
            #pragma omp for
            for (j = 0; j < N; j++) {                                 /* j loop parallelized across threads */
                printf("rank %d, thread %d: executing loop iteration i=%d j=%d\n",
                       rank, tid, i, j);
            }
        }
    }
}

Code and results from Dr. Ferner.
Simple Example Result

rank 0, thread 4: executing loop iteration i=0 j=4
rank 0, thread 2: executing loop iteration i=0 j=2
rank 0, thread 1: executing loop iteration i=0 j=1
rank 1, thread 0: executing loop iteration i=2 j=0
rank 1, thread 4: executing loop iteration i=2 j=4
rank 1, thread 2: executing loop iteration i=2 j=2
rank 1, thread 3: executing loop iteration i=2 j=3
rank 0, thread 0: executing loop iteration i=0 j=0
rank 1, thread 1: executing loop iteration i=2 j=1
rank 0, thread 3: executing loop iteration i=0 j=3
rank 2, thread 2: executing loop iteration i=4 j=2
rank 2, thread 0: executing loop iteration i=4 j=0
rank 2, thread 3: executing loop iteration i=4 j=3
rank 2, thread 4: executing loop iteration i=4 j=4
rank 2, thread 1: executing loop iteration i=4 j=1
rank 0, thread 2: executing loop iteration i=1 j=2
rank 0, thread 4: executing loop iteration i=1 j=4
rank 0, thread 3: executing loop iteration i=1 j=3
rank 0, thread 0: executing loop iteration i=1 j=0
rank 0, thread 1: executing loop iteration i=1 j=1
rank 1, thread 0: executing loop iteration i=3 j=0
rank 1, thread 2: executing loop iteration i=3 j=2
rank 1, thread 3: executing loop iteration i=3 j=3
rank 1, thread 1: executing loop iteration i=3 j=1
rank 1, thread 4: executing loop iteration i=3 j=4
Hybrid (MPI-OpenMP) Parallel Computing
Caution: Using the hybrid approach does not necessarily result in increased performance; the benefit depends strongly on the application.
Matrix Multiplication, C = A * B where A is an n x l matrix and B is an l x m matrix.
One way to parallelize matrix multiplication using the hybrid approach: partition the i loop among the computers with MPI, and partition the j loop among the cores within each computer using OpenMP.

for (i = 0; i < N; i++)              /* i loop: partitioned among the computers (MPI) */
    for (j = 0; j < N; j++) {        /* j loop: partitioned among the cores (OpenMP) */
        c[i][j] = 0.0;
        for (k = 0; k < N; k++) {
            c[i][j] += a[i][k] * b[k][j];
        }
    }
Matrix Multiplication

MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &P);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);

blksz = (int) ceil(((double) N) / P);

MPI_Scatter(a, N*blksz, MPI_FLOAT, a, N*blksz, MPI_FLOAT, 0, MPI_COMM_WORLD);
MPI_Bcast(b, N*N, MPI_FLOAT, 0, MPI_COMM_WORLD);

#pragma omp parallel private(tid, i, j, k)
{
    for (i = 0; i < blksz && rank * blksz < N; i++) {   /* i loop: partitioned among the computers (MPI) */
        #pragma omp for nowait
        for (j = 0; j < N; j++) {                       /* j loop: partitioned among threads (OpenMP) */
            c[i][j] = 0.0;
            for (k = 0; k < N; k++) {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }
}

Code and results from Dr. Ferner.
Matrix Multiplication Results

$ diff out MMULT.o5356
1c1
< elapsed_time= 1.525183 (seconds)
---
> elapsed_time= 0.659652 (seconds)
$ diff out MMULT.o5357
1c1
< elapsed_time= 1.525183 (seconds)
---
> elapsed_time= 0.626821 (seconds)
$

In the first diff, 1.525183 seconds is the sequential execution time and 0.659652 seconds is the hybrid execution time; in the second, 0.626821 seconds is the MPI-only execution time. Hybrid did not do better than MPI only.
Perhaps we could do better by parallelizing the i loop with both MPI and OpenMP:

#pragma omp parallel private(tid, i, j, k)
{
    #pragma omp for nowait
    for (i = 0; i < blksz && rank * blksz < N; i++) {   /* i loop: partitioned among computers (MPI) and threads (OpenMP) */
        for (j = 0; j < N; j++) {                       /* j loop not parallelized */
            c[i][j] = 0.0;
            for (k = 0; k < N; k++) {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }
}

But this loop is too complicated for OpenMP: the compound loop test is not in the canonical form that an omp for loop requires.
An if statement can simplify the loop:

#pragma omp parallel private(tid, i, j, k)
{
    #pragma omp for nowait
    for (i = 0; i < blksz; i++) {
        if (rank * blksz < N) {
            for (j = 0; j < N; j++) {
                c[i][j] = 0.0;
                for (k = 0; k < N; k++) {
                    c[i][j] += a[i][k] * b[k][j];
                }
            }
        }
    }
}
Matrix Multiplication Results

$ diff out MMULT.o5356
1c1
< elapsed_time= 1.525183 (seconds)
---
> elapsed_time= 0.688119 (seconds)

Here 1.525183 seconds is the sequential execution time and 0.688119 seconds is the hybrid execution time. Still not better.
Discussion Point
• Why does the hybrid approach not outperform MPI-only for this problem?
• For what kinds of problems might a hybrid approach do better?
Hybrid Parallel Programming with the Paraguin compiler
• The Paraguin compiler can also create hybrid programs.
• Because it uses mpicc, it passes the OpenMP pragmas through to the resulting source.
Compiling
• First we compile to source code with the Paraguin compiler:
  scc -DPARAGUIN -D__x86_64__ matrixmult.c -.out.c
• Then we compile the result with MPI and OpenMP:
  mpicc -fopenmp matrixmult.out.c -o matrixmult.out
The program is then run as an MPI program, as shown below.
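As before, the executable is launched with mpiexec (the machines file and process count are placeholders, exactly as in the earlier run command):

mpiexec.hydra -f <machinesfile> -n <number of processes> ./matrixmult.out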
Hybrid Matrix Multiplication using Paraguin

#pragma paraguin begin_parallel
#pragma paraguin scatter a
#pragma paraguin bcast b
#pragma paraguin forall
for (i = 0; i < N; i++) {                 /* i loop: partitioned among the computers */
    #pragma omp parallel for private(tID, j, k) num_threads(4)
    for (j = 0; j < N; j++) {             /* j loop: partitioned among the 4 cores within a computer */
        c[i][j] = 0.0;
        for (k = 0; k < N; k++) {
            c[i][j] = c[i][j] + a[i][k] * b[k][j];
        }
    }
}
Debug Statements

<pid 0, thread 1>: c[0][1] += a[0][0] * b[1][0]
<pid 0, thread 1>: c[0][1] += a[0][1] * b[1][1]
<pid 0, thread 1>: c[0][1] += a[0][2] * b[1][2]
<pid 0, thread 2>: c[0][2] += a[0][0] * b[2][0]
<pid 0, thread 2>: c[0][2] += a[0][1] * b[2][1]
<pid 0, thread 2>: c[0][2] += a[0][2] * b[2][2]
<pid 1, thread 1>: c[1][1] += a[1][0] * b[1][0]
<pid 1, thread 1>: c[1][1] += a[1][1] * b[1][1]
<pid 1, thread 1>: c[1][1] += a[1][2] * b[1][2]
<pid 2, thread 1>: c[2][1] += a[2][0] * b[1][0]
<pid 2, thread 1>: c[2][1] += a[2][1] * b[1][1]
<pid 2, thread 1>: c[2][1] += a[2][2] * b[1][2]
<pid 0, thread 0>: c[0][0] += a[0][0] * b[0][0]
<pid 0, thread 0>: c[0][0] += a[0][1] * b[0][1]
<pid 0, thread 0>: c[0][0] += a[0][2] * b[0][2]
<pid 2, thread 0>: c[2][0] += a[2][0] * b[0][0]
<pid 2, thread 0>: c[2][0] += a[2][1] * b[0][1]
<pid 2, thread 0>: c[2][0] += a[2][2] * b[0][2]
<pid 1, thread 0>: c[1][0] += a[1][0] * b[0][0]
<pid 1, thread 0>: c[1][0] += a[1][1] * b[0][1]
<pid 1, thread 0>: c[1][0] += a[1][2] * b[0][2]
<pid 1, thread 2>: c[1][2] += a[1][0] * b[2][0]
<pid 1, thread 2>: c[1][2] += a[1][1] * b[2][1]
<pid 1, thread 2>: c[1][2] += a[1][2] * b[2][2]
<pid 2, thread 2>: c[2][2] += a[2][0] * b[2][0]
<pid 2, thread 2>: c[2][2] += a[2][1] * b[2][1]
<pid 2, thread 2>: c[2][2] += a[2][2] * b[2][2]
What does not work with Paraguin
• Consider:
  #pragma omp parallel
  structured_block
• Example:
  #pragma omp parallel private(tID) num_threads(4)
  {
      tID = omp_get_thread_num();
      printf("<pid %d>: tid = %d\n", __guin_rank, tID);
  }
Very important: the opening brace must be on a new line.
What does not work with Paraguin
• The SUIF compiler removes the braces because they are not associated with a control structure.
• A #pragma is not a control structure, but rather a preprocessor directive.
• After compiling with scc, the braces are removed:
  #pragma omp parallel private(tID) num_threads(4)
  tID = omp_get_thread_num();
  printf("<pid %d>: tid = %d\n", __guin_rank, tID);
The Fix
• The trick is to put in a control structure that basically does nothing:
  dummy = 0;
  #pragma omp parallel private(tID) num_threads(4)
  if (dummy == 0)
  {
      tID = omp_get_thread_num();
      printf("<pid %d>: tid = %d\n", __guin_rank, tID);
  }
• The if statement is always true, so this code is left essentially intact.
• Note: writing "if (1)" does not work.