
Hybrid Parallel Programming



Presentation Transcript


  1. Hybrid Parallel Programming Introduction

  2. Hybrid Systems Since most computers are multi-core, most clusters have both shared memory (among the cores of a node) and distributed memory (among the nodes). [Slide diagram: four multi-core computers, each with its cores sharing a local memory, connected by an interconnection network.]

  3. Hybrid Parallel Computing
• We can use MPI to run processes concurrently on each computer.
• We can use OpenMP to run threads concurrently on the cores within each computer.
• Advantage: we can use shared memory where communication is required.
• Why? Because inter-computer communication is an order of magnitude slower than intra-node synchronization (a minimal skeleton follows).
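As a minimal sketch of this structure (the file name hybrid.c and the use of MPI_Init_thread, rather than the plain MPI_Init shown on later slides, are assumptions made here), each computer runs one MPI process that then spawns OpenMP threads:

    /* hybrid.c -- minimal MPI + OpenMP skeleton (illustrative sketch only) */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int provided, rank, np;

        /* Request FUNNELED support: only the main thread will make MPI calls. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &np);

        /* One MPI process per computer; OpenMP threads on its cores. */
        #pragma omp parallel
        printf("process %d of %d, thread %d of %d\n",
               rank, np, omp_get_thread_num(), omp_get_num_threads());

        MPI_Finalize();
        return 0;
    }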

  4. • “More effective forms of parallelism need to be investigated and perhaps utilized within the node for certain applications to maximize efficiency.” [Holt:2011]
• MPI implementation designers have recognized this notion, stating that MPI alone “does not make the most efficient use of the shared resources within the node of a HPC system.” [Dózsa:2010]

  5. • Research conducted at the University of Tokyo on three-level hybrid parallel programs running on a large SMP cluster found it inconclusive whether the perceived benefit justified the extra development effort. [Nakajima:2005]
• Hybrid parallel programs using MPI and OpenMP have been developed mainly in the past decade, with mixed results when their performance is compared to MPI-only versions. [Henty:2000], [Chow & Hysom:2001]

  6. Matrix Multiplication, C = A * B, where A is an n × l matrix and B is an l × m matrix (so C is n × m). A sequential reference version is sketched below.
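For reference, the sequential computation that the later slides parallelize is the usual triple loop; a sketch assuming square N × N float matrices a, b, c, matching the code slides:

    /* Sequential matrix multiply, C = A * B, for N x N matrices (sketch). */
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            c[i][j] = 0.0;
            for (k = 0; k < N; k++) {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }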

  7. Matrix Multiplication Again
• Each element C[i][j] can be computed independently of the other elements of the C matrix.
• If we are multiplying two matrices that are N × N, then we can use up to N² processors without any communication.
• But the computation of C[i][j] is a dot product of a row of A and a column of B.
• In other words, it is a reduction (an OpenMP sketch follows).
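A sketch of one element computed as an OpenMP reduction (sum is a local variable introduced here; a, b, N as in the code slides):

    /* Dot product for a single element C[i][j], expressed as an OpenMP
       reduction: each thread accumulates a private sum, and OpenMP
       combines the partial sums at the end of the loop. */
    float sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (k = 0; k < N; k++) {
        sum += a[i][k] * b[k][j];
    }
    c[i][j] = sum;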

  8. Matrix Multiplication Again
• If we have more than N² processors, then we need to divide the work of each reduction among processors.
• The reduction then requires communication.
• Although we can do a reduction using MPI, the communication is much slower than doing a reduction in OpenMP (compare the MPI sketch below with the OpenMP one above).
• However, N usually needs to be really big to justify parallel computation, and how likely are we to have N³ processors available!
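For contrast, splitting the same dot product across MPI processes needs an explicit reduction message; a hedged sketch in which the k range is divided into chunks of blksz = ceil(N/NP) elements per process (reusing the variable names from the later slides):

    /* Each process sums its own chunk of the dot product, then MPI_Reduce
       combines the partial sums at rank 0. This message exchange is the
       communication that OpenMP's shared-memory reduction avoids. */
    float partial = 0.0, sum = 0.0;
    for (k = rank * blksz; k < (rank + 1) * blksz && k < N; k++) {
        partial += a[i][k] * b[k][j];
    }
    MPI_Reduce(&partial, &sum, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);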

  9. How is this possible?
• OpenMP is supported by the gcc and icc compilers:
    gcc -fopenmp <file.c>
    icc -openmp <file.c>
• MPI is a library that is linked with your C program:
    mpicc <file.c>
• mpicc is a wrapper that invokes gcc and links in the appropriate MPI libraries.

  10. How is this possible?
• So to use both MPI and OpenMP:
    mpicc -fopenmp <file.c>
• mpicc is simply a wrapper script, so it passes the -fopenmp flag through to the underlying compiler (a typical build-and-run sequence follows).
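A typical build-and-run sequence might look like the following; the process and thread counts here are only placeholders, and launcher options differ between MPI implementations:

    mpicc -fopenmp -o hybrid hybrid.c
    export OMP_NUM_THREADS=4      # OpenMP threads per MPI process
    mpirun -np 4 ./hybrid         # one MPI process per computer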

  11. Simple Example
int main(int argc, char *argv[])
{
    int i, j, blksz, rank, NP, tid;
    char *usage = "Usage: %s \n";
    FILE *fd;
    char message[80];

    MPI_Init (&argc, &argv);
    MPI_Comm_size (MPI_COMM_WORLD, &NP);    /* NP = number of MPI processes */
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);  /* rank of this process */
    blksz = (int) ceil (((double) N)/NP);   /* iterations of the i loop per process */

  12. Simple Example (continued)
    #pragma omp parallel private (tid, i, j)
    {
        tid = omp_get_thread_num();
        /* Loop i is parallelized across computers: each MPI rank handles its
           own block of i values. min() is assumed to be defined elsewhere. */
        for (i = rank*blksz; i < min((rank + 1) * blksz, N); i++) {
            /* Loop j is parallelized across the OpenMP threads. */
            #pragma omp for
            for (j = 0; j < N; j++) {
                printf ("rank %d, thread %d: executing loop iteration i=%d j=%d\n",
                        rank, tid, i, j);
            }
        }
    }
    MPI_Finalize();   /* not shown on the slide, but needed to finish cleanly */
    return 0;
}

  13. Simple Example Result
rank 0, thread 4: executing loop iteration i=0 j=4
rank 0, thread 2: executing loop iteration i=0 j=2
rank 0, thread 1: executing loop iteration i=0 j=1
rank 1, thread 0: executing loop iteration i=2 j=0
rank 1, thread 4: executing loop iteration i=2 j=4
rank 1, thread 2: executing loop iteration i=2 j=2
rank 1, thread 3: executing loop iteration i=2 j=3
rank 0, thread 0: executing loop iteration i=0 j=0
rank 1, thread 1: executing loop iteration i=2 j=1
rank 0, thread 3: executing loop iteration i=0 j=3
rank 2, thread 2: executing loop iteration i=4 j=2
rank 2, thread 0: executing loop iteration i=4 j=0
rank 2, thread 3: executing loop iteration i=4 j=3

  14. Simple Example Result (continued)
rank 2, thread 4: executing loop iteration i=4 j=4
rank 2, thread 1: executing loop iteration i=4 j=1
rank 0, thread 2: executing loop iteration i=1 j=2
rank 0, thread 4: executing loop iteration i=1 j=4
rank 0, thread 3: executing loop iteration i=1 j=3
rank 0, thread 0: executing loop iteration i=1 j=0
rank 0, thread 1: executing loop iteration i=1 j=1
rank 1, thread 0: executing loop iteration i=3 j=0
rank 1, thread 2: executing loop iteration i=3 j=2
rank 1, thread 3: executing loop iteration i=3 j=3
rank 1, thread 1: executing loop iteration i=3 j=1
rank 1, thread 4: executing loop iteration i=3 j=4

  15. Back to Matrix Multiplication
    MPI_Init (&argc, &argv);
    MPI_Comm_size (MPI_COMM_WORLD, &NP);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);
    blksz = (int) ceil (((double) N)/NP);   /* rows of A (and C) per process */

    /* Distribute blksz rows of A to each process. Note that the slide reuses
       a as both send and receive buffer; strictly, the MPI standard requires
       the root to pass MPI_IN_PLACE instead of an aliased receive buffer. */
    MPI_Scatter (a, N*blksz, MPI_FLOAT, a, N*blksz, MPI_FLOAT, 0, MPI_COMM_WORLD);

    /* Every process needs all of B. */
    MPI_Bcast (b, N*N, MPI_FLOAT, 0, MPI_COMM_WORLD);
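The slide stops after distributing the data; a complete program would also collect the computed rows of C back at the root. A hedged sketch (c_full is a hypothetical N × N result buffer allocated only on rank 0, not part of the slide code):

    /* Gather each rank's blksz rows of C into the root's result buffer.
       c_full is assumed here so the send and receive buffers do not alias. */
    MPI_Gather (c, N*blksz, MPI_FLOAT,
                c_full, N*blksz, MPI_FLOAT, 0, MPI_COMM_WORLD);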

  16. Back to Matrix Multiplication
    #pragma omp parallel private (tid, i, j, k)
    {
        for (i = 0; i < blksz && rank * blksz < N; i++) {   /* i loop: this rank's block */
            #pragma omp for nowait
            for (j = 0; j < N; j++) {                        /* j loop: OpenMP threads */
                c[i][j] = 0.0;
                for (k = 0; k < N; k++) {
                    c[i][j] += a[i][k] * b[k][j];
                }
            }
        }
    }
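The elapsed times on the next slide are presumably measured around this compute phase; the slides do not show the timing code, but with MPI it might look roughly like this:

    /* Timing sketch (not from the slides): bracket the compute phase
       with MPI_Wtime and report from rank 0. */
    double start, elapsed_time;
    MPI_Barrier (MPI_COMM_WORLD);        /* line the processes up first */
    start = MPI_Wtime();
    /* ... hybrid matrix-multiply loops above ... */
    elapsed_time = MPI_Wtime() - start;
    if (rank == 0)
        printf ("elapsed_time= %f (seconds)\n", elapsed_time);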

  17. Matrix Multiplication Results
$ diff out MMULT.o5356
1c1
< elapsed_time= 1.525183 (seconds)
---
> elapsed_time= 0.659652 (seconds)
$ diff out MMULT.o5357
1c1
< elapsed_time= 1.525183 (seconds)
---
> elapsed_time= 0.626821 (seconds)
$
Here 1.525183 s is the sequential execution time, 0.659652 s is the hybrid execution time, and 0.626821 s is the MPI-only execution time. Hybrid did not do better than MPI-only.

  18. Back to Matrix Multiplication
Perhaps we could do better by parallelizing the i loop with both MPI and OpenMP:
    #pragma omp parallel private (tid, i, j, k)
    {
        #pragma omp for nowait
        for (i = 0; i < blksz && rank * blksz < N; i++) {
            for (j = 0; j < N; j++) {
                c[i][j] = 0.0;
                for (k = 0; k < N; k++) {
                    c[i][j] += a[i][k] * b[k][j];
                }
            }
        }
    }
But this loop is too complicated for OpenMP: the compound test is not the canonical loop form that an omp for requires.

  19. Back to Matrix Multiplication
An if statement inside the loop restores the canonical loop form and simplifies the loop test:
    #pragma omp parallel private (tid, i, j, k)
    {
        #pragma omp for nowait
        for (i = 0; i < blksz; i++) {
            if (rank * blksz < N) {
                for (j = 0; j < N; j++) {
                    c[i][j] = 0.0;
                    for (k = 0; k < N; k++) {
                        c[i][j] += a[i][k] * b[k][j];
                    }
                }
            }
        }
    }
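One alternative to the if-statement version, offered only as a sketch rather than a tested change, is OpenMP's collapse clause, which merges the i and j loops into a single iteration space shared by the threads (the guard is the same one used on the slide):

    /* Sketch: collapse(2) lets OpenMP distribute all blksz*N (i,j) pairs
       across the threads instead of only the blksz iterations of i. */
    #pragma omp parallel for collapse(2) private (k)
    for (i = 0; i < blksz; i++) {
        for (j = 0; j < N; j++) {
            if (rank * blksz < N) {
                c[i][j] = 0.0;
                for (k = 0; k < N; k++) {
                    c[i][j] += a[i][k] * b[k][j];
                }
            }
        }
    }

Whether this helps in practice is exactly the question slide 21 raises.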

  20. Matrix Multiplication Results
$ diff out MMULT.o5356
1c1
< elapsed_time= 1.525183 (seconds)
---
> elapsed_time= 0.688119 (seconds)
Again, 1.525183 s is the sequential execution time and 0.688119 s is the hybrid execution time. Still not better than MPI-only.

  21. Discussion Point
• Why does the hybrid approach not outperform MPI-only for this problem?
• For what kinds of problems might a hybrid approach do better?

  22. Questions
