COM503 Parallel Computer Architecture & Programming Lecture 5. OpenMP Intro. Prof. Taeweon Suh Computer Science Education Korea University
References • Lots of the lecture slides are based on the following web materials with some modifications • https://computing.llnl.gov/tutorials/openMP/ • Official OpenMP page • http://openmp.org/ • It contains up-to-date information on OpenMP • It also includes tutorials, book example codes, etc.
OpenMP • What does OpenMP stand for? • Short version: Open Multi-Processing • Long version: Open specifications for Multi-Processing via collaborative work between interested parties from the hardware and software industry, government, and academia • Standardized: • Jointly defined and endorsed by a group of major computer hardware and software vendors • Comprised of three primary API components (sketched below): • Compiler Directives (#pragma) • Runtime Library Routines • Environment Variables • Portable: • The API is specified for C/C++ and Fortran • Implementations exist for most major platforms, including Linux and Windows
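A minimal sketch (not from the original slides) showing all three API components together in one toy program; the file name and thread count are arbitrary:

/* components.c - illustrative only; compile with e.g. gcc -fopenmp components.c -o components */
#include <stdio.h>
#include <omp.h>                        /* runtime library routines */

int main(void)
{
    /* Environment variable component: set before running,
       e.g. export OMP_NUM_THREADS=4 in bash               */
    #pragma omp parallel                /* compiler directive component */
    {
        /* runtime library routine component */
        printf("Hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}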
History • In the early 1990s, shared-memory machine vendors supplied similar, directive-based Fortran programming extensions • The user would augment a serial Fortran program with directives specifying which loops were to be parallelized • The compiler would be responsible for automatically parallelizing such loops across the SMP processors • The OpenMP standard specification started in 1997 • Led by the OpenMP Architecture Review Board (ARB) • The ARB members included Compaq/Digital, HP, Intel, IBM, Kuck & Associates, Inc. (KAI), Silicon Graphics, Sun Microsystems, and the U.S. Department of Energy
Release History • Our textbook covers the OpenMP 2.5 specification • Starting with OpenMP 2.5, the specification has been released together for C and Fortran • OpenMP 4.0 was released in July 2013
Shared Memory Machines • OpenMP is designed for shared memory machines (UMA or NUMA)
Fork-Join Model • Begin as a single process (master thread). • The master thread executes sequentially until the first parallel region construct is encountered. • FORK: the master thread creates a team of parallel threads • The statements in the program that are enclosed by the parallel region construct are then executed in parallel among the various team threads • JOIN: When the team threads complete the statements in the parallel region, they synchronize and terminate, leaving only the master thread Source: https://computing.llnl.gov/tutorials/openMP/#Introduction
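As a minimal illustration (an assumed example, not code from the slides), the serial parts below are executed by the master thread alone, while each #pragma omp parallel region forks a team and joins it again at the closing brace:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    printf("Serial part: master thread only\n");

    #pragma omp parallel                 /* FORK: a team of threads is created   */
    {
        printf("First parallel region: thread %d\n", omp_get_thread_num());
    }                                    /* JOIN: implicit barrier, team retires */

    printf("Serial part again: master thread only\n");

    #pragma omp parallel                 /* FORK again for the next region       */
    {
        printf("Second parallel region: thread %d\n", omp_get_thread_num());
    }                                    /* JOIN again                           */

    return 0;
}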
Other Parallel Programming Models? • MPI: Message Passing Interface • Developed for distributed-memory architectures, where multiple processes execute independently and communicate data • Most widely used in the high-end technical computing community, where clusters are common • Most vendors of shared memory systems also provide MPI implementations • Most MPI implementations consist of a specific set of APIs callable from C, C++, Fortran, or Java • MPI implementations • MPICH • Freely available, portable implementation of MPI • Free software, available for most flavors of Unix and Windows • Open MPI
Other Parallel Programming Models? • Pthreads: POSIX (Portable Operating System Interface) Threads • Shared-memory programming model • Defined as a set of C and C++ programming types and procedure calls • A collection of routines for creating, managing, and coordinating a collection of threads • Programming with Pthreads is much more complex than with OpenMP
Serial Code Example • Serial version of the dot product program • The dot product is an algebraic operation that takes two equal-length sequences of numbers (usually coordinate vectors) and returns a single number obtained by multiplying corresponding entries and adding up those products

#include <stdio.h>

int main(int argc, char *argv[])
{
    double sum;
    double a[256], b[256];
    int i, n;

    n = 256;
    for (i = 0; i < n; i++) {
        a[i] = i * 0.5;
        b[i] = i * 2.0;
    }

    sum = 0.0;
    for (i = 0; i < n; i++) {
        sum = sum + a[i] * b[i];
    }
    printf("sum = %9.2lf\n", sum);
    return 0;
}
MPI Example • MPI • To compile, ‘mpicc dot_product_mpi.c -o dot_product_mpi’ • To run, ‘mpirun -np 4 machine_file dot_product_mpi’

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    double sum, sum_local;
    double a[256], b[256];
    int i, n;
    int numprocs, myid, my_first, my_last;

    n = 256;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    printf("#procs = %d\n", numprocs);

    /* Each process works on its own contiguous chunk of the vectors */
    my_first = myid * n / numprocs;
    my_last  = (myid + 1) * n / numprocs;

    for (i = 0; i < n; i++) {
        a[i] = i * 0.5;
        b[i] = i * 2.0;
    }

    sum_local = 0.0;
    for (i = my_first; i < my_last; i++) {
        sum_local = sum_local + a[i] * b[i];
    }

    /* Combine the partial sums from all processes */
    MPI_Allreduce(&sum_local, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (myid == 0)
        printf("sum = %9.2lf\n", sum);

    MPI_Finalize();
    return 0;
}
Pthreads Example • Pthreads • To compile, ‘gcc dot_product_pthread.c -o dot_product_pthread -pthread’

#include <stdio.h>
#include <pthread.h>
#define NUMTHRDS 4

double sum = 0;
double a[256], b[256];
int status;
int n = 256;
pthread_t thds[NUMTHRDS];
pthread_mutex_t mutexsum;

void *dotprod(void *arg);

int main(int argc, char *argv[])
{
    pthread_attr_t attr;
    int i;

    for (i = 0; i < n; i++) {
        a[i] = i * 0.5;
        b[i] = i * 2.0;
    }

    pthread_mutex_init(&mutexsum, NULL);
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);

    /* Create the worker threads; each computes a partial dot product */
    for (i = 0; i < NUMTHRDS; i++) {
        pthread_create(&thds[i], &attr, dotprod, (void *)i);
    }
    pthread_attr_destroy(&attr);

    /* Wait for all workers to finish before printing the result */
    for (i = 0; i < NUMTHRDS; i++) {
        pthread_join(thds[i], (void **)&status);
    }

    printf("sum = %9.2lf\n", sum);
    pthread_mutex_destroy(&mutexsum);
    pthread_exit(NULL);
}
Pthreads Example • Pthreads • To compile, ‘gcc dot_product_pthread.c -o dot_product_pthread -pthread’

void *dotprod(void *arg)
{
    int myid, i, my_first, my_last;
    double sum_local;

    myid = (int)arg;
    my_first = myid * n / NUMTHRDS;
    my_last  = (myid + 1) * n / NUMTHRDS;

    sum_local = 0.0;
    for (i = my_first; i < my_last; i++) {
        sum_local = sum_local + a[i] * b[i];
    }

    /* The global sum is shared, so updates must be protected by the mutex */
    pthread_mutex_lock(&mutexsum);
    sum = sum + sum_local;
    pthread_mutex_unlock(&mutexsum);

    pthread_exit((void *)0);
}
Another Pthread Example • Pthreads • To compile, ‘gcc pthread_creation.c -o pthread_creation -pthread’

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

void *print_message_function(void *ptr);

int main(void)
{
    pthread_t thread1, thread2;
    char *message1 = "Thread 1";
    char *message2 = "Thread 2";
    int iret1, iret2;

    /* Create independent threads, each of which will execute the function */
    iret1 = pthread_create(&thread1, NULL, print_message_function, (void *)message1);
    iret2 = pthread_create(&thread2, NULL, print_message_function, (void *)message2);

    /* Wait until the threads are complete before main continues. Unless we
       wait, we run the risk of executing an exit, which would terminate the
       process and all threads before the threads have completed.           */
    pthread_join(thread1, NULL);
    pthread_join(thread2, NULL);

    printf("Thread 1 returns: %d\n", iret1);
    printf("Thread 2 returns: %d\n", iret2);
    exit(0);
}

void *print_message_function(void *ptr)
{
    char *message;
    message = (char *)ptr;
    printf("%s \n", message);
    return NULL;
}
OpenMP Example • OpenMP • To compile, ‘gcc dot_product_omp.c -o dot_product_omp -fopenmp’

#include <stdio.h>
#include <omp.h>

int main(int argc, char *argv[])
{
    double sum;
    double a[256], b[256];
    int i, n;

    n = 256;
    for (i = 0; i < n; i++) {
        a[i] = i * 0.5;
        b[i] = i * 2.0;
    }

    sum = 0.0;
    /* A single directive parallelizes the loop; reduction(+:sum) combines
       the per-thread partial sums into sum                                */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; i++) {
        sum = sum + a[i] * b[i];
    }
    printf("sum = %9.2lf\n", sum);
    return 0;
}
OpenMP • Compiler Directive Based • Parallelism is specified through the use of compiler directives in source code • Nested Parallelism Support • Parallel regions can be nested inside other parallel regions • Dynamic Threads • The number of threads used to execute parallel regions can be altered dynamically • Memory Model • OpenMP provides a "relaxed-consistency" memory model. In other words, threads can cache their data and are not required to maintain exact consistency with real memory all of the time • When it is critical that all threads view a shared variable identically, the programmer is responsible for ensuring that the variable is flushed by all threads as needed (see the sketch below)
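A sketch of the flush directive under the relaxed-consistency model (an illustrative producer/consumer pattern, not code from the slides; the variable names are made up):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int data = 0, flag = 0;

    #pragma omp parallel num_threads(2) shared(data, flag)
    {
        if (omp_get_thread_num() == 0) {
            data = 42;                   /* producer writes the payload ...         */
            #pragma omp flush(data)      /* ... and forces it out to memory         */
            flag = 1;
            #pragma omp flush(flag)      /* then publishes the flag                 */
        } else {
            int ready = 0;
            while (!ready) {             /* consumer spins until the flag is seen   */
                #pragma omp flush(flag)
                ready = flag;
            }
            #pragma omp flush(data)      /* re-read data only after seeing the flag */
            printf("Thread %d sees data = %d\n", omp_get_thread_num(), data);
        }
    }
    return 0;
}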
OpenMP Parallel Construct • A parallel region is a block of code executed by multiple threads. This is the fundamental OpenMP parallel construct.

#pragma omp parallel [clause[[,] clause]…]
structured block

• This construct is used to specify the block that should be executed in parallel • A team of threads is created to execute the associated parallel region • Each thread in the team is assigned a unique thread number (0 to #threads-1) • The master thread is a member of that team and has thread number 0 • Starting from the beginning of the parallel region, the code is duplicated and all threads execute it • The construct does not distribute the work of the region among the threads in the team unless the programmer uses the appropriate syntax to specify this • There is an implied barrier at the end of a parallel region; only the master thread continues execution past this point • Code not enclosed by a parallel construct is executed serially
Example

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char *argv[])
{
    #pragma omp parallel
    {
        printf("Parallel region is executed by thread ID %d\n", omp_get_thread_num());
        if (omp_get_thread_num() == 2) {
            printf("  Thread %d does things differently\n", omp_get_thread_num());
        }
    }
    return 0;
}
How Many Threads? • The number of threads in a parallel region is determined by the following factors, in order of precedence • Evaluation of the IF clause • The IF clause is supported on the parallel construct only • #pragma omp parallel if (n > 5) • num_threads clause with a parallel construct • The num_threads clause is supported on the parallel construct only • #pragma omp parallel num_threads(8) • omp_set_num_threads() library function • omp_set_num_threads(8) • OMP_NUM_THREADS environment variable • In bash, use ‘export OMP_NUM_THREADS=4’ • Implementation default - usually the number of CPUs on a node, even though it could be dynamic
Example

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#define NUM_THREADS 8

int main(int argc, char *argv[])
{
    int n = 6;

    omp_set_num_threads(NUM_THREADS);

    //#pragma omp parallel
    #pragma omp parallel if (n > 5) num_threads(n)
    {
        printf("Parallel region is executed by thread ID %d\n", omp_get_thread_num());
        if (omp_get_thread_num() == 2) {
            printf("  Thread %d does things differently\n", omp_get_thread_num());
        }
    }
    return 0;
}
Work-Sharing Constructs • Work-sharing constructs are used to distribute computation among the threads in a team • #pragma omp for • #pragma omp sections • #pragma omp single • By default, threads wait at a barrier at the end of a work-sharing region until the last thread has completed its share of the work • However, the programmer can suppress this barrier by using the nowait clause (see the sketch below)
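A sketch of the nowait clause (an assumed example, not from the slides): the two loops are independent, so threads that finish the first loop early may start the second loop without waiting at the first loop's barrier:

#include <stdio.h>
#include <omp.h>
#define N 8

int main(void)
{
    int i, a[N], b[N];

    #pragma omp parallel shared(a, b) private(i)
    {
        #pragma omp for nowait           /* no barrier at the end of this loop    */
        for (i = 0; i < N; i++)
            a[i] = i;

        #pragma omp for                  /* barrier at the end of this loop, plus */
        for (i = 0; i < N; i++)          /* the implied barrier of the region     */
            b[i] = 2 * i;                /* b[] does not depend on a[], so the
                                            nowait above is safe                  */
    }

    for (i = 0; i < N; i++)
        printf("a[%d] = %d, b[%d] = %d\n", i, a[i], i, b[i]);
    return 0;
}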
Loop Construct • The loop construct causes the iterations of the immediately following loop to be executed in parallel

#pragma omp for [clause[[,] clause]…]
for-loop

#include <stdio.h>
#include <omp.h>
#define NUM_THREADS 4

int main(int argc, char *argv[])
{
    int n = 8;
    int i;

    omp_set_num_threads(NUM_THREADS);

    #pragma omp parallel shared(n) private(i)
    {
        #pragma omp for
        for (i = 0; i < n; i++)
            printf("  Thread %d executes loop iteration %d\n", omp_get_thread_num(), i);
    }
    return 0;
}
Section Construct • The section construct is the easiest way to have different threads execute different kinds of work

#pragma omp sections [clause[[,] clause]…]
{
    [#pragma omp section]
    structured block
    [#pragma omp section]
    structured block
}

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#define NUM_THREADS 4

void funcA();
void funcB();

int main(int argc, char *argv[])
{
    omp_set_num_threads(NUM_THREADS);

    #pragma omp parallel sections
    {
        #pragma omp section
        funcA();
        #pragma omp section
        funcB();
    }
    return 0;
}

void funcA()
{
    printf("In funcA: this section is executed by thread %d\n", omp_get_thread_num());
}

void funcB()
{
    printf("In funcB: this section is executed by thread %d\n", omp_get_thread_num());
}
Section Construct • At run time, the specified code blocks are executed by the threads in the team • Each thread executes one code block at a time • Each code block will be executed exactly once • If there are fewer threads than code blocks, some threads execute multiple code blocks • If there are fewer code blocks than threads, the remaining threads will be idle • Assignment of code blocks to threads is implementation-dependent • Depending on the type of work performed in the various code blocks and the number of threads used, this construct might lead to a load-balancing problem
Single Construct • The single construct specifies that the enclosed block should be executed by one thread only

#pragma omp single [clause[[,] clause]…]
structured block

#include <stdio.h>
#include <omp.h>
#define NUM_THREADS 4
#define N 8

int main(int argc, char *argv[])
{
    int i, a, b[N];

    omp_set_num_threads(NUM_THREADS);

    #pragma omp parallel shared(a, b) private(i)
    {
        #pragma omp single
        {
            a = 10;
            printf("Single construct executed by thread %d\n", omp_get_thread_num());
        }

        #pragma omp for
        for (i = 0; i < N; i++)
            b[i] = a;
    }

    printf("After the parallel region\n");
    for (i = 0; i < N; i++)
        printf("b[%d] = %d\n", i, b[i]);
    return 0;
}
Single Construct • Only one thread executes the block with the single construct • The other threads wait at an implicit barrier until the thread executing the single code block has completed • What if the single construct were omitted in the previous example? • Memory consistency issue? • Performance issue? • Would a barrier then be required before the #pragma omp for? (see the sketch below)
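One way to explore these questions (a sketch, not the slides' answer): replace single with the master construct, which has no implied barrier, so an explicit barrier is then needed before the loop reads the shared variable a:

#include <stdio.h>
#include <omp.h>
#define NUM_THREADS 4
#define N 8

int main(void)
{
    int i, a, b[N];

    omp_set_num_threads(NUM_THREADS);

    #pragma omp parallel shared(a, b) private(i)
    {
        #pragma omp master               /* only thread 0 runs this; no barrier  */
        a = 10;

        #pragma omp barrier              /* without this, other threads could
                                            read a before thread 0 has set it    */
        #pragma omp for
        for (i = 0; i < N; i++)
            b[i] = a;
    }

    for (i = 0; i < N; i++)
        printf("b[%d] = %d\n", i, b[i]);
    return 0;
}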
Misc • A useful Linux command • top (Display Linux tasks) provides a dynamic real-time view of a running system • Try the 1 (per-CPU view), z (color), and H (show threads) keys after running the top command
Misc • Useful Linux commands • ps -eLf • Display thread IDs for OpenMP and Pthreads • top • Display process IDs, which can be used to monitor the processes created by MPI
Misc • top does not show threads by default • ps -eLf • Display thread IDs for OpenMP and Pthreads
Goal of OpenMP • Standardization: • Provide a standard among a variety of shared memory architectures/platforms • Lean and Mean: • Establish a simple and limited set of directives for programming shared memory machines. • Significant parallelism can be implemented by using just 3 or 4 directives. • Ease of Use: • Provide capability to incrementally parallelize a serial program, unlike message-passing libraries which typically require an all or nothing approach • Provide the capability to implement both coarse-grain and fine-grain parallelism • Portability: • Supports Fortran (77, 90, and 95), C, and C++ • Public forum for API and membership