OpenMP EXERCISE part 1 – OpenMP v2.5 Ing. Andrea Marongiu a.marongiu@unibo.it
Download, compile and run • Download file OpenMP_Exercise.tgz from the website • Extract it to a local folder: tar xvf OpenMP_Exercise.tgz • What’s in the package • All tests are in file test.c • Compile and run with make clean all run • Take a look at test.c. Different exercises are #ifdef-ed • To compile and execute the desired one: make clean all run -e MYOPTS="-DEX1 -DEX2 …"
EX 1 – Hello world! Parallelism creation
#pragma omp parallel num_threads (?)
  printf "Hello world, I’m thread ??"
• Use the parallel directive to create multiple threads • Each thread executes the code enclosed within the scope of the directive • Use runtime library functions to determine the thread ID • All SPMD parallelization is based on this approach
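A minimal sketch of what the EX1 code could look like (the thread count of 4 and the message format are assumptions; the exercise leaves them as ? and ??):

#include <stdio.h>
#include <omp.h>

int main(void)
{
  /* The parallel directive creates a team of threads; every thread
     executes the enclosed block (SPMD style). */
  #pragma omp parallel num_threads(4)   /* thread count is an assumption */
  {
    /* Runtime library call that returns the ID of the calling thread */
    printf("Hello world, I'm thread %d\n", omp_get_thread_num());
  }
  return 0;
}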
EX 2 - Loop partitioning – Static scheduling
#pragma omp parallel for \
        num_threads (4) schedule (static)
for (uint i=0; i<16; i++)
  { /* BALANCED LOOP CODE */ }
/* (implicit) SYNCH POINT */
[Figure: T0 executes iterations 0-3, T1 iterations 4-7, T2 iterations 8-11, T3 iterations 12-15; all threads meet at the synch point]
• Iterations are statically assigned to threads • 16 iter / 4 threads = 4 iter/thread • Small overhead: loop indexes are computed according to thread ID • Optimal scheduling if the workload is balanced
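A minimal sketch of static scheduling for this loop (do_balanced_work is a hypothetical stand-in for the balanced loop body, not part of test.c):

/* schedule(static) with no chunk size splits the 16 iterations into
   4 contiguous blocks of 4 — exactly the mapping shown above. */
#pragma omp parallel for num_threads(4) schedule(static)
for (unsigned int i = 0; i < 16; i++)
  do_balanced_work(i);   /* every iteration costs roughly the same */
/* implicit barrier: the 4 threads synchronize here before continuing */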
EX 2 - Loop partitioning – Dynamic scheduling
#pragma omp parallel for \
        num_threads (4) schedule (dynamic, 4)
for (uint i=0; i<16; i++)
  { /* BALANCED LOOP CODE */ }
/* (implicit) SYNCH POINT */
[Figure: each thread grabs one CHUNK (size = 4 iter); scheduling OVERHEAD is paid only at the beginning and end, and the resulting mapping is the same as with static scheduling]
• Iterations are dynamically assigned to threads • 16 iter, 4 by 4 • Same allocation of iterations as static (prev. slide) • Coarse granularity • Overhead only at beginning and end
EX 2 - Loop partitioning – Dynamic scheduling
#pragma omp parallel for \
        num_threads (4) schedule (dynamic, 1)
for (uint i=0; i<16; i++)
  { /* BALANCED LOOP CODE */ }
/* (implicit) SYNCH POINT */
[Figure: each thread repeatedly grabs a CHUNK (size = 1 iter); scheduling OVERHEAD is paid at every iteration]
• Iterations are dynamically assigned to threads • 16 iter, 1 by 1 • Finest granularity possible • Overhead at every iteration • Worst performance under balanced workloads
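A sketch contrasting the two dynamic variants (do_balanced_work is again a hypothetical placeholder body):

/* Chunks of 4: a thread asks the runtime for 4 iterations at a time,
   so the scheduling overhead is paid only 4 times in total. */
#pragma omp parallel for num_threads(4) schedule(dynamic, 4)
for (unsigned int i = 0; i < 16; i++)
  do_balanced_work(i);

/* Chunks of 1: a thread asks the runtime before every single iteration,
   paying the scheduling overhead 16 times. */
#pragma omp parallel for num_threads(4) schedule(dynamic, 1)
for (unsigned int i = 0; i < 16; i++)
  do_balanced_work(i);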
EX 3 – Unbalanced Loop partitioning
#pragma omp parallel for \
        num_threads (4) schedule (dynamic, 4)
for (uint i=0; i<16; i++)
  { /* UNBALANCED LOOP CODE */ }
/* (implicit) SYNCH POINT */
[Figure: with an unbalanced loop body the threads finish their chunks of 4 iterations at different times, and all of them wait at the final SYNCH POINT for the slowest one]
• Iterations are dynamically assigned to threads • 16 iter, 4 by 4 • Coarse granularity (same as static scheduling) • Due to the barrier at the end of the parallel region, all threads have to wait for the slowest one
EX 3 – Unbalanced Loop partitioning
#pragma omp parallel for \
        num_threads (4) schedule (dynamic, 1)
for (uint i=0; i<16; i++)
  { /* UNBALANCED LOOP CODE */ }
/* (implicit) SYNCH POINT */
[Figure: with chunks of 1 iteration the unbalanced work is spread evenly across the threads, yielding a SPEEDUP over the chunk-of-4 case]
• Iterations are dynamically assigned to threads • 16 iter, 1 by 1 • Finest granularity balances the workload among threads • In this case, the extra scheduling overhead is worth paying
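A sketch of the unbalanced case at the finest granularity (do_unbalanced_work is a hypothetical body whose cost grows with i, standing in for the exercise's unbalanced loop code):

/* With schedule(dynamic, 1) a thread that drew a cheap iteration simply
   asks for the next one, so the expensive iterations end up spread
   across the team instead of piling onto a single thread. */
#pragma omp parallel for num_threads(4) schedule(dynamic, 1)
for (unsigned int i = 0; i < 16; i++)
  do_unbalanced_work(i);   /* e.g. cost roughly proportional to i */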
EX 4 – Chunking overhead
#pragma omp parallel for \
        num_threads (4) schedule (dynamic, 1)
for (uint i=0; i<16; i++)
  { /* SMALL LOOP CODE */ }
/* (implicit) SYNCH POINT */
[Figure: chunks of 1 iteration on a very small loop body; the per-chunk scheduling OVERHEAD dominates the execution time]
• Iterations are dynamically assigned to threads • 16 iter, 1 by 1 • Finest granularity possible • Overhead at every iteration • Serious performance loss for very small loop bodies
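For very cheap iterations the trade-off flips; a sketch (small_work is a hypothetical, nearly free body):

/* Fetching one iteration at a time can cost more than the iteration itself... */
#pragma omp parallel for num_threads(4) schedule(dynamic, 1)
for (unsigned int i = 0; i < 16; i++)
  small_work(i);

/* ...whereas static scheduling computes the whole assignment up front
   and pays no per-iteration scheduling cost. */
#pragma omp parallel for num_threads(4) schedule(static)
for (unsigned int i = 0; i < 16; i++)
  small_work(i);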
EX 5 – Task parallelism with sections
void sections() {
  work(1000000);
  printf("%hu: Done with first elaboration!\n", …);
  work(2000000);
  printf("%hu: Done with second elaboration!\n", …);
  work(3000000);
  printf("%hu: Done with third elaboration!\n", …);
  work(4000000);
  printf("%hu: Done with fourth elaboration!\n", …);
}
• Distribute the workload among 4 threads using SPMD parallelization • Get the thread ID • Use if/else or switch/case to differentiate the workload • Implement the same workload partitioning with the sections directive (see the sketch below)
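One possible sections-based solution, as a sketch (work() and the message texts come from the exercise skeleton; the function name and the use of %d instead of %hu are illustrative, since omp_get_thread_num() returns an int):

void sections_version()
{
  #pragma omp parallel sections num_threads(4)
  {
    #pragma omp section
    { work(1000000); printf("%d: Done with first elaboration!\n",  omp_get_thread_num()); }
    #pragma omp section
    { work(2000000); printf("%d: Done with second elaboration!\n", omp_get_thread_num()); }
    #pragma omp section
    { work(3000000); printf("%d: Done with third elaboration!\n",  omp_get_thread_num()); }
    #pragma omp section
    { work(4000000); printf("%d: Done with fourth elaboration!\n", omp_get_thread_num()); }
  }
}

Each section is assigned to one of the 4 threads, so the four elaborations run concurrently; the thread that drew the longest one (work(4000000)) determines the total time.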
EX 6 – Task parallelism with task
void tasks() {
  unsigned int i;
  for(i=0; i<4; i++) {
    work((i+1)*1000000);
    printf("%hu: Done with elaboration\n", …);
  }
}
• Distribute the workload among 4 threads using the task directive • Same program as before • But we had to manually unroll the loop to use sections • Performance?
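A task-based sketch of the same loop (the function name and clauses are illustrative, not necessarily the intended exercise solution):

void tasks_version()
{
  unsigned int i;
  #pragma omp parallel num_threads(4)
  #pragma omp single                     /* one thread creates the tasks...        */
  for (i = 0; i < 4; i++) {
    #pragma omp task firstprivate(i)     /* ...and the whole team executes them    */
    {
      work((i+1)*1000000);
      printf("%d: Done with elaboration\n", omp_get_thread_num());
    }
  }
}

No manual unrolling is needed: the loop stays as it is and each iteration becomes a task.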
EX 6 – Task parallelism with task
void tasks() {
  unsigned int i;
  for(i=0; i<1024; i++) {
    work(1000000);
    printf("%hu: Done with elaboration\n", …);
  }
}
Modify the EX6 exercise code as indicated on this slide • Parallelize the loop with the task directive • Use the single directive to force a single processor to create tasks • Alternatively, parallelize the loop with the single directive • Use the nowait clause to allow for parallel execution • Performance?
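A sketch of the two variants asked for on this slide (this is one interpretation of the exercise, with illustrative function names, not necessarily the intended solution):

/* Variant 1: task + single, as on the previous slide but with 1024 iterations. */
void tasks_1024(void)
{
  unsigned int i;
  #pragma omp parallel num_threads(4)
  #pragma omp single                     /* only one thread runs the loop and spawns tasks */
  for (i = 0; i < 1024; i++) {
    #pragma omp task
    {
      work(1000000);
      printf("%d: Done with elaboration\n", omp_get_thread_num());
    }
  }
}

/* Variant 2: single nowait inside the loop. All threads run the loop, but each
   iteration's body is executed by whichever thread reaches it first; nowait
   removes the barrier at the end of every single region, so the other threads
   move on to the following iterations instead of waiting. */
void single_nowait_1024(void)
{
  unsigned int i;
  #pragma omp parallel num_threads(4) private(i)
  for (i = 0; i < 1024; i++) {
    #pragma omp single nowait
    {
      work(1000000);
      printf("%d: Done with elaboration\n", omp_get_thread_num());
    }
  }
}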