
Hybrid Parallel Programming



Presentation Transcript


  1. Hybrid Parallel Programming Introduction

  2. Hybrid Systems Since most computers are multi-core, most clusters have both shared memory (among the cores of a node) and distributed memory (among the nodes). [Slide diagram: four multi-core computers, each with its cores sharing a local memory, connected by an interconnection network.]

  3. Hybrid Parallel Computing
• We can use MPI to run processes concurrently on each computer.
• We can use OpenMP to run threads concurrently on the cores within each computer.
• Advantage: we can use shared memory where communication is required.
• Why? Because inter-computer communication is an order of magnitude slower than intra-node synchronization (a minimal skeleton follows).
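As a minimal sketch of this structure (the file name hybrid.c and the use of MPI_Init_thread, rather than the plain MPI_Init shown on later slides, are assumptions made here), each computer runs one MPI process that then spawns OpenMP threads:

    /* hybrid.c -- minimal MPI + OpenMP skeleton (illustrative sketch only) */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int provided, rank, np;

        /* Request FUNNELED support: only the main thread will make MPI calls. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &np);

        /* One MPI process per computer; OpenMP threads on its cores. */
        #pragma omp parallel
        printf("process %d of %d, thread %d of %d\n",
               rank, np, omp_get_thread_num(), omp_get_num_threads());

        MPI_Finalize();
        return 0;
    }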

  4. • “More effective forms of parallelism need to be investigated and perhaps utilized within the node for certain applications to maximize efficiency.” [Holt:2011]
• MPI implementation designers have recognized this notion, stating that MPI alone “does not make the most efficient use of the shared resources within the node of a HPC system.” [Dózsa:2010]

  5. • Research conducted at the University of Tokyo on three-level hybrid parallel programs running on a large SMP cluster found it inconclusive whether the perceived benefit justified the extra development effort. [Nakajima:2005]
• Hybrid parallel programs using MPI and OpenMP have been developed mainly in the past decade, with mixed results when their performance is compared to MPI-only versions. [Henty:2000], [Chow & Hysom:2001]

  6. Matrix Multiplication, C = A * B, where A is an n × l matrix and B is an l × m matrix (so C is n × m). A sequential reference version is sketched below.
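For reference, the sequential computation that the later slides parallelize is the usual triple loop; a sketch assuming square N × N float matrices a, b, c, matching the code slides:

    /* Sequential matrix multiply, C = A * B, for N x N matrices (sketch). */
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            c[i][j] = 0.0;
            for (k = 0; k < N; k++) {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }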

  7. Matrix Multiplication Again
• Each element C[i][j] can be computed independently of the other elements of the C matrix.
• If we are multiplying two matrices that are N × N, then we can use up to N² processors without any communication.
• But the computation of C[i][j] is a dot product of a row of A and a column of B.
• In other words, it is a reduction (an OpenMP sketch follows).
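A sketch of one element computed as an OpenMP reduction (sum is a local variable introduced here; a, b, N as in the code slides):

    /* Dot product for a single element C[i][j], expressed as an OpenMP
       reduction: each thread accumulates a private sum, and OpenMP
       combines the partial sums at the end of the loop. */
    float sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (k = 0; k < N; k++) {
        sum += a[i][k] * b[k][j];
    }
    c[i][j] = sum;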

  8. Matrix Multiplication Again
• If we have more than N² processors, then we need to divide the work of each reduction among processors.
• The reduction then requires communication.
• Although we can do a reduction using MPI, the communication is much slower than doing a reduction in OpenMP (compare the MPI sketch below with the OpenMP one above).
• However, N usually needs to be really big to justify parallel computation, and how likely are we to have N³ processors available!
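For contrast, splitting the same dot product across MPI processes needs an explicit reduction message; a hedged sketch in which the k range is divided into chunks of blksz = ceil(N/NP) elements per process (reusing the variable names from the later slides):

    /* Each process sums its own chunk of the dot product, then MPI_Reduce
       combines the partial sums at rank 0. This message exchange is the
       communication that OpenMP's shared-memory reduction avoids. */
    float partial = 0.0, sum = 0.0;
    for (k = rank * blksz; k < (rank + 1) * blksz && k < N; k++) {
        partial += a[i][k] * b[k][j];
    }
    MPI_Reduce(&partial, &sum, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);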

  9. How is this possible?
• OpenMP is supported by the gcc and icc compilers:
    gcc -fopenmp <file.c>
    icc -openmp <file.c>
• MPI is a library that is linked with your C program:
    mpicc <file.c>
• mpicc is a wrapper that invokes gcc and links in the appropriate MPI libraries.

  10. How is this possible?
• So to use both MPI and OpenMP:
    mpicc -fopenmp <file.c>
• mpicc is simply a wrapper script, so it passes the -fopenmp flag through to the underlying compiler (a typical build-and-run sequence follows).
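A typical build-and-run sequence might look like the following; the process and thread counts here are only placeholders, and launcher options differ between MPI implementations:

    mpicc -fopenmp -o hybrid hybrid.c
    export OMP_NUM_THREADS=4      # OpenMP threads per MPI process
    mpirun -np 4 ./hybrid         # one MPI process per computer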

  11. Simple Example
int main(int argc, char *argv[])
{
    int i, j, blksz, rank, NP, tid;
    char *usage = "Usage: %s \n";
    FILE *fd;
    char message[80];

    MPI_Init (&argc, &argv);
    MPI_Comm_size (MPI_COMM_WORLD, &NP);    /* NP = number of MPI processes */
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);  /* rank of this process */
    blksz = (int) ceil (((double) N)/NP);   /* iterations of the i loop per process */

  12. Simple Example (continued)
    #pragma omp parallel private (tid, i, j)
    {
        tid = omp_get_thread_num();
        /* Loop i is parallelized across computers: each MPI rank handles its
           own block of i values. min() is assumed to be defined elsewhere. */
        for (i = rank*blksz; i < min((rank + 1) * blksz, N); i++) {
            /* Loop j is parallelized across the OpenMP threads. */
            #pragma omp for
            for (j = 0; j < N; j++) {
                printf ("rank %d, thread %d: executing loop iteration i=%d j=%d\n",
                        rank, tid, i, j);
            }
        }
    }
    MPI_Finalize();   /* not shown on the slide, but needed to finish cleanly */
    return 0;
}

  13. Simple Example Result
rank 0, thread 4: executing loop iteration i=0 j=4
rank 0, thread 2: executing loop iteration i=0 j=2
rank 0, thread 1: executing loop iteration i=0 j=1
rank 1, thread 0: executing loop iteration i=2 j=0
rank 1, thread 4: executing loop iteration i=2 j=4
rank 1, thread 2: executing loop iteration i=2 j=2
rank 1, thread 3: executing loop iteration i=2 j=3
rank 0, thread 0: executing loop iteration i=0 j=0
rank 1, thread 1: executing loop iteration i=2 j=1
rank 0, thread 3: executing loop iteration i=0 j=3
rank 2, thread 2: executing loop iteration i=4 j=2
rank 2, thread 0: executing loop iteration i=4 j=0
rank 2, thread 3: executing loop iteration i=4 j=3

  14. Simple Example Result (continued)
rank 2, thread 4: executing loop iteration i=4 j=4
rank 2, thread 1: executing loop iteration i=4 j=1
rank 0, thread 2: executing loop iteration i=1 j=2
rank 0, thread 4: executing loop iteration i=1 j=4
rank 0, thread 3: executing loop iteration i=1 j=3
rank 0, thread 0: executing loop iteration i=1 j=0
rank 0, thread 1: executing loop iteration i=1 j=1
rank 1, thread 0: executing loop iteration i=3 j=0
rank 1, thread 2: executing loop iteration i=3 j=2
rank 1, thread 3: executing loop iteration i=3 j=3
rank 1, thread 1: executing loop iteration i=3 j=1
rank 1, thread 4: executing loop iteration i=3 j=4

  15. Back to Matrix Multiplication
    MPI_Init (&argc, &argv);
    MPI_Comm_size (MPI_COMM_WORLD, &NP);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);
    blksz = (int) ceil (((double) N)/NP);   /* rows of A (and C) per process */

    /* Distribute blksz rows of A to each process. Note that the slide reuses
       a as both send and receive buffer; strictly, the MPI standard requires
       the root to pass MPI_IN_PLACE instead of an aliased receive buffer. */
    MPI_Scatter (a, N*blksz, MPI_FLOAT, a, N*blksz, MPI_FLOAT, 0, MPI_COMM_WORLD);

    /* Every process needs all of B. */
    MPI_Bcast (b, N*N, MPI_FLOAT, 0, MPI_COMM_WORLD);
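The slide stops after distributing the data; a complete program would also collect the computed rows of C back at the root. A hedged sketch (c_full is a hypothetical N × N result buffer allocated only on rank 0, not part of the slide code):

    /* Gather each rank's blksz rows of C into the root's result buffer.
       c_full is assumed here so the send and receive buffers do not alias. */
    MPI_Gather (c, N*blksz, MPI_FLOAT,
                c_full, N*blksz, MPI_FLOAT, 0, MPI_COMM_WORLD);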

  16. Back to Matrix Multiplication
    #pragma omp parallel private (tid, i, j, k)
    {
        for (i = 0; i < blksz && rank * blksz < N; i++) {   /* i loop: this rank's block */
            #pragma omp for nowait
            for (j = 0; j < N; j++) {                        /* j loop: OpenMP threads */
                c[i][j] = 0.0;
                for (k = 0; k < N; k++) {
                    c[i][j] += a[i][k] * b[k][j];
                }
            }
        }
    }
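The elapsed times on the next slide are presumably measured around this compute phase; the slides do not show the timing code, but with MPI it might look roughly like this:

    /* Timing sketch (not from the slides): bracket the compute phase
       with MPI_Wtime and report from rank 0. */
    double start, elapsed_time;
    MPI_Barrier (MPI_COMM_WORLD);        /* line the processes up first */
    start = MPI_Wtime();
    /* ... hybrid matrix-multiply loops above ... */
    elapsed_time = MPI_Wtime() - start;
    if (rank == 0)
        printf ("elapsed_time= %f (seconds)\n", elapsed_time);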

  17. Matrix Multiplication Results
$ diff out MMULT.o5356
1c1
< elapsed_time= 1.525183 (seconds)
---
> elapsed_time= 0.659652 (seconds)
$ diff out MMULT.o5357
1c1
< elapsed_time= 1.525183 (seconds)
---
> elapsed_time= 0.626821 (seconds)
$
Here 1.525183 s is the sequential execution time, 0.659652 s is the hybrid execution time, and 0.626821 s is the MPI-only execution time. Hybrid did not do better than MPI-only.

  18. Back to Matrix Multiplication
Perhaps we could do better by parallelizing the i loop with both MPI and OpenMP:
    #pragma omp parallel private (tid, i, j, k)
    {
        #pragma omp for nowait
        for (i = 0; i < blksz && rank * blksz < N; i++) {
            for (j = 0; j < N; j++) {
                c[i][j] = 0.0;
                for (k = 0; k < N; k++) {
                    c[i][j] += a[i][k] * b[k][j];
                }
            }
        }
    }
But this loop is too complicated for OpenMP: the compound test is not the canonical loop form that an omp for requires.

  19. Back to Matrix Multiplication
An if statement inside the loop restores the canonical loop form and simplifies the loop test:
    #pragma omp parallel private (tid, i, j, k)
    {
        #pragma omp for nowait
        for (i = 0; i < blksz; i++) {
            if (rank * blksz < N) {
                for (j = 0; j < N; j++) {
                    c[i][j] = 0.0;
                    for (k = 0; k < N; k++) {
                        c[i][j] += a[i][k] * b[k][j];
                    }
                }
            }
        }
    }
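One alternative to the if-statement version, offered only as a sketch rather than a tested change, is OpenMP's collapse clause, which merges the i and j loops into a single iteration space shared by the threads (the guard is the same one used on the slide):

    /* Sketch: collapse(2) lets OpenMP distribute all blksz*N (i,j) pairs
       across the threads instead of only the blksz iterations of i. */
    #pragma omp parallel for collapse(2) private (k)
    for (i = 0; i < blksz; i++) {
        for (j = 0; j < N; j++) {
            if (rank * blksz < N) {
                c[i][j] = 0.0;
                for (k = 0; k < N; k++) {
                    c[i][j] += a[i][k] * b[k][j];
                }
            }
        }
    }

Whether this helps in practice is exactly the question slide 21 raises.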

  20. Matrix Multiplication Results
$ diff out MMULT.o5356
1c1
< elapsed_time= 1.525183 (seconds)
---
> elapsed_time= 0.688119 (seconds)
Again, 1.525183 s is the sequential execution time and 0.688119 s is the hybrid execution time. Still not better than MPI-only.

  21. Discussion Point
• Why does the hybrid approach not outperform MPI-only for this problem?
• For what kinds of problems might a hybrid approach do better?

  22. Questions
