
High Performance Computing (CS 540): Shared Memory Programming with OpenMP and Pthreads



  1. High Performance Computing (CS 540) Shared Memory Programming with OpenMP and Pthreads* Jeremy R. Johnson *Some of this lecture was derived from Pthreads Programming by Nichols, Buttlar, and Farrell and the POSIX Threads Programming Tutorial (computing.llnl.gov/tutorials/pthreads) by Blaise Barney

  2. Introduction • Objective: To further study the shared memory model of parallel programming; introduction to OpenMP and Pthreads for shared memory parallel programming • Topics • Concurrent programming with UNIX Processes • Introduction to shared memory parallel programming with Pthreads • Threads • fork/join • race conditions • Synchronization • performance issues - synchronization overhead, contention and granularity, load balance, cache coherency and false sharing • Introduction to parallel program design paradigms • Data parallelism (static scheduling) • Task parallelism with workers • Divide and conquer parallelism (fork/join)

  3. Introduction • Topics • OpenMP vs. Pthreads • hello_pthreads.c • hello_openmp.c • Parallel regions and execution model • Data parallelism with loops • Shared vs. private variables • Scheduling and chunk size • Synchronization and reduction variables • Functional parallelism with parallel sections • Case studies

  4. Processes • Processes contain information about program resources and program execution state • Process ID, process group ID, user ID, and group ID • Environment • Working directory • Program instructions • Registers • Stack • Heap • File descriptors • Signal actions • Shared libraries • Inter-process communication tools (such as message queues, pipes, semaphores, or shared memory).

  5. UNIX Process

  6. Threads • An independent stream of instructions that can be scheduled to run • Stack pointer • Registers (program counter) • Scheduling properties (such as policy or priority) • Set of pending and blocked signals • Thread-specific data • “lightweight process” • The cost of creating and managing threads is much less than for processes • Threads live within a process and share process resources such as the address space • Pthreads – standard thread API (IEEE Std 1003.1)

  7. Threads within a UNIX Process

  8. Shared Memory Model • All threads have access to the same global, shared memory • All threads within a process share the same address space • Threads also have their own private data • Programmers are responsible for synchronizing (protecting) access to globally shared data.

  9. Simple Example
void do_one_thing(int *);
void do_another_thing(int *);
void do_wrap_up(int, int);

int r1 = 0, r2 = 0;

extern int main(void)
{
  do_one_thing(&r1);
  do_another_thing(&r2);
  do_wrap_up(r1, r2);
  return 0;
}

  10. Virtual Address Space (figure: a single-threaded process with one set of registers (SP, PC, general-purpose), one stack holding frames for main() and its callees with locals i, j, k, the program text for main(), do_one_thing(), and do_another_thing(), identity (PID, UID, GID), resources (open files, locks, sockets), and a data segment holding r1 and r2 plus the heap)

  11. Simple Example (Processes)
int shared_mem_id, *shared_mem_ptr;
int *r1p, *r2p;

extern int main(void)
{
  pid_t child1_pid, child2_pid;
  int status;

  /* initialize shared memory segment */
  if ((shared_mem_id = shmget(IPC_PRIVATE, 2*sizeof(int), 0660)) == -1)
    perror("shmget"), exit(1);
  if ((shared_mem_ptr = (int *)shmat(shared_mem_id, (void *)0, 0)) == (void *)-1)
    perror("shmat failed"), exit(1);

  r1p = shared_mem_ptr;
  r2p = (shared_mem_ptr + 1);
  *r1p = 0;
  *r2p = 0;

  12. Simple Example (Processes)
  if ((child1_pid = fork()) == 0) {
    /* first child */
    do_one_thing(r1p);
    return 0;
  } else if (child1_pid == -1) {
    perror("fork"), exit(1);
  }

  /* parent */
  if ((child2_pid = fork()) == 0) {
    /* second child */
    do_another_thing(r2p);
    return 0;
  } else if (child2_pid == -1) {
    perror("fork"), exit(1);
  }

  /* parent */
  if ((waitpid(child1_pid, &status, 0) == -1))
    perror("waitpid"), exit(1);
  if ((waitpid(child2_pid, &status, 0) == -1))
    perror("waitpid"), exit(1);

  do_wrap_up(*r1p, *r2p);
  return 0;
}

  13. Virtual Address Space (figure: after the forks there are two separate processes, each with its own registers, stack, text, identity, resources, data segment, and heap; the only memory they have in common is the attached shared memory segment)

  14. Simple Example (PThreads)
int r1 = 0, r2 = 0;

extern int main(void)
{
  pthread_t thread1, thread2;

  if (pthread_create(&thread1, NULL, do_one_thing, (void *) &r1) != 0)
    perror("pthread_create"), exit(1);
  if (pthread_create(&thread2, NULL, do_another_thing, (void *) &r2) != 0)
    perror("pthread_create"), exit(1);

  if (pthread_join(thread1, NULL) != 0)
    perror("pthread_join"), exit(1);
  if (pthread_join(thread2, NULL) != 0)
    perror("pthread_join"), exit(1);

  do_wrap_up(r1, r2);
  return 0;
}
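pthread_create expects a start routine of type void *(*)(void *); the examples in this deck pass their work functions through a cast, but a routine can also be written to match that signature directly. A minimal sketch (the body here is a placeholder, not the slides' code):

/* start routine written with the signature pthread_create expects */
void *do_one_thing(void *arg)
{
  int *p = (int *) arg;   /* recover the int pointer passed at creation */
  *p = *p + 1;            /* placeholder for the real work */
  return NULL;            /* value retrievable through pthread_join */
}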

  15. Virtual Address Space (figure: one process containing two threads; each thread has its own registers (SP, PC, general-purpose) and its own stack, with Thread 1 running do_one_thing() and Thread 2 running do_another_thing(), while the text, identity (PID, UID, GID), resources (open files, locks, sockets), data segment holding r1 and r2, and heap are shared)

  16. Concurrency and Parallelism (figure: timelines comparing a serial execution of do_one_thing(), do_another_thing(), and do_wrap_up(), a concurrent execution that interleaves the first two on one processor, and a parallel execution that runs them simultaneously, with do_wrap_up() last in each case)

  17. Unix Fork • The fork() call • Creates a child process that is identical to the parent process • The child has its own PID • The fork() call provides different return values to the parent [child’s PID] and the child [0]

  18. Fork (figure: a parent process with PID 7274 calls fork(); after the call both the parent, still PID 7274, and the new child, PID 7275, continue executing from the statement following fork())
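A minimal stand-alone sketch of the return-value behavior described above (hypothetical example, not from the slides):

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
  pid_t pid = fork();
  if (pid == -1) {                    /* fork failed */
    perror("fork"); exit(1);
  } else if (pid == 0) {              /* child: fork() returned 0 */
    printf("child: my PID is %d\n", getpid());
  } else {                            /* parent: fork() returned the child's PID */
    printf("parent: child's PID is %d\n", pid);
    waitpid(pid, NULL, 0);            /* wait for the child to finish */
  }
  return 0;
}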

  19. Thread Creation • pthread_create creates a new thread and makes it executable • pthread_create (thread,attr,start_routine,arg) • thread – unique identifier for the new thread • attr – attribute object (NULL for default attributes) • start_routine – the routine the newly created thread will execute • arg – a single argument passed to start_routine

  20. Thread Creation • Once created, threads are peers, and may create other threads

  21. Thread Join • "Joining" is one way to accomplish synchronization between threads. • The pthread_join() subroutine blocks the calling thread until the specified thread terminates.
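A minimal create/join sketch; it also uses pthread_join's second argument to collect the value returned by the thread, a detail not shown in the slides (hypothetical example; error checking omitted):

#include <pthread.h>
#include <stdio.h>

void *worker(void *arg)
{
  long n = (long) arg;          /* pointer-sized integer passed at creation */
  return (void *) (n * n);      /* the return value is delivered to pthread_join */
}

int main(void)
{
  pthread_t tid;
  void *result;

  pthread_create(&tid, NULL, worker, (void *) 7L);   /* create (fork) the thread */
  pthread_join(tid, &result);                        /* block until it terminates */
  printf("worker returned %ld\n", (long) result);    /* prints 49 */
  return 0;
}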

  22. Fork/Join Overhead • Compare the overhead of a procedure call, process fork/join, and thread create/join • Procedure call (no args) • 1.2 × 10^-8 sec (12 ns) • Process fork/join • 0.0012 sec (1.2 ms) • Thread create/join • 0.000042 sec (42 µs)

  23. Race Conditions • A race condition occurs when two or more threads access the same resource at the same time • Example: two threads each withdraw $50 from a balance of $125 • Thread 1 reads balance: $125 • Thread 2 reads balance: $125 • Thread 1 sets balance: $75 • Thread 2 sets balance: $75 • The final balance is $75 instead of the correct $25

  24. Bad Count
int sum = 0;

void count(int *arg)
{
  int i;
  for (i=0;i<*arg;i++) {
    sum++;                /* unsynchronized update of the shared counter */
  }
}

int main(int argc, char **argv)
{
  int error,i;
  int numcounters = NUMCOUNTERS;
  int limit = LIMIT;
  pthread_t tid[NUMCOUNTERS];

  pthread_setconcurrency(numcounters);
  for (i=0;i<numcounters;i++) {
    error = pthread_create(&tid[i],NULL,(void *(*)(void *))count,&limit);
  }
  for (i=0;i<numcounters;i++) {
    error = pthread_join(tid[i],NULL);
  }
  printf("Counters finished with count = %d\n",sum);
  printf("Count should be %d X %d = %d\n",numcounters,limit,numcounters*limit);
  return 0;
}

  25. Mutex • Mutex variables are for protecting shared data when multiple writes occur. • A mutex variable acts like a "lock" protecting access to a shared data resource. Only one thread can own (lock) a mutex at any given time

  26. Mutex Operations • pthread_mutex_lock (mutex) • The pthread_mutex_lock() routine is used by a thread to acquire a lock on the specified mutex variable. If the mutex is already locked by another thread, this call will block the calling thread until the mutex is unlocked. • pthread_mutex_unlock (mutex) • will unlock a mutex if called by the owning thread. Calling this routine is required after a thread has completed its use of protected data if other threads are to acquire the mutex for their work with the protected data.
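The usual lifecycle around these two calls, as a minimal sketch; the static initializer and pthread_mutex_destroy are standard Pthreads usage not shown in the slides:

#include <pthread.h>

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;   /* or pthread_mutex_init(&lock, NULL) */
int shared = 0;

void update(void)
{
  pthread_mutex_lock(&lock);     /* blocks if another thread owns the mutex */
  shared++;                      /* critical section: at most one thread at a time */
  pthread_mutex_unlock(&lock);   /* release so other threads can acquire it */
}

/* when the mutex is no longer needed: pthread_mutex_destroy(&lock); */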

  27. Good Count
int sum = 0;
pthread_mutex_t lock;

void count(int *arg)
{
  int i;
  for (i=0;i<*arg;i++) {
    pthread_mutex_lock(&lock);
    sum++;
    pthread_mutex_unlock(&lock);
  }
}

int main(int argc, char **argv)
{
  int error,i;
  int numcounters = NUMCOUNTERS;
  int limit = LIMIT;
  pthread_t mytid, tid[MAXCOUNTERS];

  pthread_setconcurrency(numcounters);
  pthread_mutex_init(&lock,NULL);
  for (i=1;i<=numcounters;i++) {
    error = pthread_create(&tid[i],NULL,(void *(*)(void *))count, &limit);
  }
  for (i=1;i<=numcounters;i++) {
    error = pthread_join(tid[i],NULL);
  }
  printf("Counters finished with count = %d\n",sum);
  printf("Count should be %d X %d = %d\n",numcounters,limit,numcounters*limit);
  return 0;
}

  28. Better Count
int sum = 0;
pthread_mutex_t lock;

void count(int *arg)
{
  int i;
  int localsum = 0;
  for (i=0;i<*arg;i++) {
    localsum++;             /* accumulate privately, no locking inside the loop */
  }
  pthread_mutex_lock(&lock);
  sum = sum + localsum;     /* one locked update per thread */
  pthread_mutex_unlock(&lock);
}

  29. Threadsafe Code • Refers to an application's ability to execute multiple threads simultaneously without "clobbering" shared data or creating "race" conditions.

  30. Condition Variables • While mutexes implement synchronization by controlling thread access to data, condition variables allow threads to synchronize based upon the actual value of data. • Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section), to check if the condition is met. • A condition variable is a way to achieve the same goal without polling • Always used with a mutex

  31. Using Condition Variables
Thread A
• Do work up to the point where a certain condition must occur (such as "count" reaching a specified value)
• Lock the associated mutex and check the value of a global variable
• Call pthread_cond_wait() to perform a blocking wait for a signal from Thread B. A call to pthread_cond_wait() automatically and atomically unlocks the associated mutex so that it can be used by Thread B.
• When signalled, wake up. The mutex is automatically and atomically locked.
• Explicitly unlock the mutex
• Continue
Thread B
• Do work
• Lock the associated mutex
• Change the value of the global variable that Thread A is waiting upon
• Check the value of the global variable Thread A is waiting on. If it fulfills the desired condition, signal Thread A.
• Unlock the mutex
• Continue

  32. Condition Variable Example
void *inc_count(void *idp)
{
  int i=0, save_state, save_type;
  int *my_id = idp;

  for (i=0; i<TCOUNT; i++) {
    pthread_mutex_lock(&count_lock);
    count++;
    if (count == COUNT_THRES) {
      pthread_cond_signal(&count_hit_threshold);
    }
    pthread_mutex_unlock(&count_lock);
  }
  return(NULL);
}

void *watch_count(void *idp)
{
  int i=0, save_state, save_type;
  int *my_id = idp;

  pthread_mutex_lock(&count_lock);
  while (count < COUNT_THRES) {
    pthread_cond_wait(&count_hit_threshold, &count_lock);
  }
  pthread_mutex_unlock(&count_lock);
  return(NULL);
}
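The example above omits its declarations and driver code. A hedged sketch of the scaffolding it assumes (the names count, count_lock, count_hit_threshold, TCOUNT, and COUNT_THRES come from the example; the constants' values and the main routine are assumptions):

#include <pthread.h>
#include <stdio.h>

#define TCOUNT      10    /* increments per counting thread (assumed value) */
#define COUNT_THRES 12    /* threshold the watcher waits for (assumed value) */

int count = 0;
pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t count_hit_threshold = PTHREAD_COND_INITIALIZER;

void *inc_count(void *idp);     /* defined in the example above */
void *watch_count(void *idp);

int main(void)
{
  pthread_t t1, t2, w;
  int id1 = 1, id2 = 2, id3 = 3;

  pthread_create(&w, NULL, watch_count, &id3);   /* watcher blocks until the threshold is hit */
  pthread_create(&t1, NULL, inc_count, &id1);
  pthread_create(&t2, NULL, inc_count, &id2);

  pthread_join(t1, NULL);
  pthread_join(t2, NULL);
  pthread_join(w, NULL);
  printf("final count = %d\n", count);
  return 0;
}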

  33. OpenMP • Extension to FORTRAN, C/C++ • Uses directives (comments in FORTRAN, pragmas in C/C++) • ignored without compiler support • Some library support required • Shared memory model • parallel regions • loop level parallelism • implicit thread model • communication via shared address space • private vs. shared variables (declaration) • explicit synchronization via directives (e.g. critical) • library routines for returning thread information (e.g. omp_get_num_threads(), omp_get_thread_num()) • Environment variables used to provide system info (e.g. OMP_NUM_THREADS)
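A minimal sketch of the explicit synchronization mentioned above, using #pragma omp critical with a shared accumulator and a variable that is private because it is declared inside the parallel region (illustrative example, not from the slides):

#include <omp.h>
#include <stdio.h>

int main(void)
{
  int sum = 0;                        /* shared by default */
  #pragma omp parallel
  {
    int id = omp_get_thread_num();    /* private: declared inside the region */
    #pragma omp critical
    sum += id;                        /* only one thread at a time executes this */
  }
  printf("sum of thread ids = %d\n", sum);
  return 0;
}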

  34. Benefits • Provides incremental parallelism • Small increase in code size • Simpler model than message passing • Easier to use than a thread library • With hardware and compiler support, smaller granularity than message passing.

  35. Further Information • Adopted as a standard in 1997 • Initiated by SGI • www.openmp.org • computing.llnl.gov/tutorials/openMP • Chandra, Dagum, Kohr, Maydan, McDonald, and Menon, “Parallel Programming in OpenMP,” Morgan Kaufmann Publishers, 2001. • Chapman, Jost, and Van der Pas, “Using OpenMP: Portable Shared Memory Parallel Programming,” The MIT Press, 2008.

  36. Shared vs. Distributed Memory (figure: shared memory connects processors P0, P1, ..., Pn to a single memory; distributed memory gives each processor Pi its own memory Mi, with the processors linked by an interconnection network)

  37. Shared Memory Programming Model • Shared memory programming does not require physically shared memory as long as there is support for logically shared memory (in either hardware or software) • With logically shared memory, there may be different costs for accessing memory depending on the physical location • UMA - uniform memory access • SMP - symmetric multi-processor • typically memory connected to processors via a bus • NUMA - non-uniform memory access • typically physically distributed memory connected via an interconnection network

  38. Hello_openmp.c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char **argv)
{
  int n;
  if (argc > 1) {
    n = atoi(argv[1]);
    omp_set_num_threads(n);
  }
  printf("Number of threads = %d\n",omp_get_num_threads());
  #pragma omp parallel
  {
    int id = omp_get_thread_num();
    printf("Hello World from %d\n",id);
    if (id == 0)
      printf("Number of threads = %d\n",omp_get_num_threads());
  }
  exit(0);
}

  39. Compiling & Running Hello_openmp
% gcc -fopenmp hello_openmp.c -o hello
% ./hello 4
Number of threads = 1
Hello World from 1
Hello World from 0
Hello World from 3
Number of threads = 4
Hello World from 2
The order of the print statements is nondeterministic

  40. Execution Model (figure: the master thread reaches a parallel region, where threads are created implicitly (fork); the master and slave threads execute the region; an implicit barrier synchronization (join) ends the region and only the master thread continues)

  41. Explicit Barrier
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char **argv)
{
  int n;
  if (argc > 1) {
    n = atoi(argv[1]);
    omp_set_num_threads(n);
  }
  printf("Number of threads = %d\n",omp_get_num_threads());
  #pragma omp parallel
  {
    int id = omp_get_thread_num();
    printf("Hello World from %d\n",id);
    #pragma omp barrier
    if (id == 0)
      printf("Number of threads = %d\n",omp_get_num_threads());
  }
  exit(0);
}

  42. Output with Barrier
% ./hellob 4
Number of threads = 1
Hello World from 1
Hello World from 0
Hello World from 2
Hello World from 3
Number of threads = 4
The order of the “Hello World” print statements is nondeterministic; however, the “Number of threads” print statement always comes at the end

  43. Hello_pthreads.c
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <errno.h>
#define MAXTHREADS 32

int main(int argc, char **argv)
{
  int error,i,n;
  void hello(int *pid);
  pthread_t tid[MAXTHREADS],mytid;
  int pid[MAXTHREADS];

  if (argc > 1) {
    n = atoi(argv[1]);
    if (n > MAXTHREADS) {
      printf("Too many threads\n");
      exit(1);
    }
    pthread_setconcurrency(n);
  }
  printf("Number of threads = %d\n",pthread_getconcurrency());
  for (i=0;i<n;i++) {
    pid[i]=i;
    error = pthread_create(&tid[i], NULL,(void *(*)(void *))hello, &pid[i]);
  }
  for (i=0;i<n;i++) {
    error = pthread_join(tid[i],NULL);
  }
  exit(0);
}

  44. Hello_pthreads.c
void hello(int *pid)
{
  pthread_t tid;
  tid = pthread_self();
  printf("Hello World from %d (tid = %u)\n",*pid,(unsigned int) tid);
  if (*pid == 0)
    printf("Number of threads = %d\n",pthread_getconcurrency());
}

% gcc -pthread hello_pthreads.c -o hello
% ./hello 4
Number of threads = 4
Hello World from 0 (tid = 1832728912)
Hello World from 1 (tid = 1824336208)
Number of threads = 4
Hello World from 3 (tid = 1807550800)
Hello World from 2 (tid = 1815943504)
The order of the print statements is nondeterministic

  45. Types of Parallelism (figure: with data parallelism a fork starts threads that execute the same instructions of a loop on different data, then join; with functional parallelism a fork starts threads that execute different functions F1, F2, F3, F4, then join; threads may read the same data but should write different data; a parallel-sections sketch follows)
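Functional parallelism is expressed in OpenMP with parallel sections, which are listed in the topics but not shown elsewhere in the deck. A minimal sketch (the function names F1 to F4 follow the figure; their bodies are assumptions):

#include <omp.h>
#include <stdio.h>

void F1(void) { printf("F1 on thread %d\n", omp_get_thread_num()); }
void F2(void) { printf("F2 on thread %d\n", omp_get_thread_num()); }
void F3(void) { printf("F3 on thread %d\n", omp_get_thread_num()); }
void F4(void) { printf("F4 on thread %d\n", omp_get_thread_num()); }

int main(void)
{
  #pragma omp parallel sections   /* each section may run on a different thread */
  {
    #pragma omp section
    F1();
    #pragma omp section
    F2();
    #pragma omp section
    F3();
    #pragma omp section
    F4();
  }
  return 0;
}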

  46. Parallel Loop
// Serial version
int a[1000], b[1000];
int main()
{
  int i;
  int N = 1000;
  for (i=0; i<N; i++) {
    a[i] = i;
    b[i] = N-i;
  }
  for (i=0;i<N;i++) {
    a[i] = a[i] + b[i];
  }
}

// OpenMP version
int a[1000], b[1000];
int main()
{
  int i;
  int N = 1000;
  // Serial initialization
  for (i=0; i<N; i++) {
    a[i] = i;
    b[i] = N-i;
  }
  #pragma omp parallel for shared(a,b), private(i), schedule(static)
  for (i=0;i<N;i++) {
    a[i] = a[i] + b[i];
  }
}

  47. Scheduling of Parallel Loop (figure: strip mining, with the iterations of the vector add a[i] + b[i] dealt out cyclically to threads 0, 1, 2, ..., Nthreads-1 according to thread id)

  48. Implementation of Parallel Loop
void vadd(int *id)
{
  int i;
  /* cyclic (strip-mined) distribution: thread *id handles i = *id, *id+numthreads, ... */
  for (i=*id;i<N;i+=numthreads) {
    a[i] = a[i] + b[i];
  }
}

for (i=0;i<numthreads;i++) {
  id[i] = i;
  error = pthread_create(&tid[i],NULL,(void *(*)(void *))vadd, &id[i]);
}
for (i=0;i<numthreads;i++) {
  error = pthread_join(tid[i],NULL);
}

  49. Scheduling Chunks of Parallel Loop (figure: arrays a and b are divided into chunks; chunk 0, chunk 1, chunk 2, ..., chunk Nthreads-1 are assigned to threads 0, 1, 2, ... by thread id)

  50. Implementation of Chunking
#pragma omp parallel for shared(a,b), private(i), schedule(static,CHUNK)
for (i=0;i<N;i++) {
  a[i] = a[i] + b[i];
}

/* Pthreads equivalent: each thread walks the array CHUNK elements at a time
   (assumes N is a multiple of CHUNK*numthreads) */
void vadd(int *id)
{
  int i,j;
  for (i=*id*CHUNK;i<N;i+=numthreads*CHUNK) {
    for (j=0;j<CHUNK;j++)
      a[i+j] = a[i+j] + b[i+j];
  }
}
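Reduction variables, also listed in the OpenMP topics, achieve the same effect as the mutex-protected "better count" without explicit locking. A minimal sketch (illustrative example, not from the slides):

#include <omp.h>
#include <stdio.h>

int main(void)
{
  int i, sum = 0;
  int N = 1000;

  /* each thread accumulates a private copy of sum; OpenMP combines them at the end */
  #pragma omp parallel for reduction(+:sum) schedule(static)
  for (i = 0; i < N; i++) {
    sum += i;
  }
  printf("sum = %d (expected %d)\n", sum, N*(N-1)/2);
  return 0;
}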
