1 / 62

ECE1747 Parallel Programming

Learn about parallel programming with shared memory using Pthreads, thread creation, synchronization, matrix multiplication examples, parallel SOR implementation, and thread management.

floridar
Download Presentation

ECE1747 Parallel Programming

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. ECE1747 Parallel Programming Shared Memory Multithreading Pthreads

  2. Shared Memory • All threads access the same shared memory data space. Shared Memory Address Space proc1 proc2 proc3 procN

  3. Shared Memory (continued) • Concretely, it means that a variable x, a pointer p, or an array a[] refer tothe same object, no matter what processor the reference originates from. • We have more or less implicitly assumed this to be the case in earlier examples.

  4. Shared Memory a proc1 proc2 proc3 procN

  5. Distributed Memory - Message Passing The alternative model to shared memory. mem1 mem2 mem3 memN a a a a proc1 proc2 proc3 procN network

  6. Shared Memory vs. Message Passing • Same terminology is used in distinguishing hardware. • For us: distinguish programming models, not hardware.

  7. Programming vs. Hardware • One can implement • a shared memory programming model • on shared or distributed memory hardware • (also in software or in hardware) • One can implement • a message passing programming model • on shared or distributed memory hardware

  8. Portability of programming models shared memory programming message passing programming shared memory machine distr. memory machine

  9. Shared Memory Programming: Important Point to Remember • No matter what the implementation, it conceptually looks like shared memory. • There may be some (important) performance differences.

  10. Multithreading • User has explicit control over thread. • Good: control can be used to performance benefit. • Bad: user has to deal with it.

  11. Pthreads • POSIX standard shared-memory multithreading interface. • Provides primitives for process management and synchronization.

  12. What does the user have to do? • Decide how to decompose the computation into parallel parts. • Create (and destroy) processes to support that decomposition. • Add synchronization to make sure dependences are covered.

  13. General Thread Structure • Typically, a thread is a concurrent execution of a function or a procedure. • So, your program needs to be restructured such that parallel parts form separate procedures or functions.

  14. Example of Thread Creation (contd.) main() pthread_ create(func) func()

  15. Thread Joining Example void *func(void *) { ….. } pthread_t id; int X; pthread_create(&id, NULL, func, &X); ….. pthread_join(id, NULL); …..

  16. Example of Thread Creation (contd.) main() pthread_ create(func) func() pthread_ join(id) pthread_ exit()

  17. Matrix Multiply for( i=0; i<n; i++ ) for( j=0; j<n; j++ ) { c[i][j] = 0.0; for( k=0; k<n; k++ ) c[i][j] += a[i][k]*b[k][j]; }

  18. Parallel Matrix Multiply • All i- or j-iterations can be run in parallel. • If we have p processors, n/p rows to each processor. • Corresponds to partitioning i-loop.

  19. Matrix Multiply: Parallel Part void mmult(void* s) { int slice = (int) s; int from = (slice*n)/p; int to = ((slice+1)*n)/p; for(i=from; i<to; i++) for(j=0; j<n; j++) { c[i][j] = 0.0; for(k=0; k<n; k++) c[i][j] += a[i][k]*b[k][j]; } }

  20. Matrix Multiply: Main int main() { pthread_t thrd[p]; for( i=0; i<p; i++ ) pthread_create(&thrd[i], NULL, mmult,(void*) i); for( i=0; i<p; i++ ) pthread_join(thrd[i], NULL); }

  21. Sequential SOR for some number of timesteps/iterations { for (i=0; i<n; i++ ) for( j=1, j<n, j++ ) temp[i][j] = 0.25 * ( grid[i-1][j] + grid[i+1][j] grid[i][j-1] + grid[i][j+1] ); for( i=0; i<n; i++ ) for( j=1; j<n; j++ ) grid[i][j] = temp[i][j]; }

  22. Parallel SOR • First (i,j) loop nest can be parallelized. • Second (i,j) loop nest can be parallelized. • Must wait to start second loop nest until all processors have finished first. • Must wait to start first loop nest of next iteration until all processors have second loop nest of previous iteration. • Give n/p rows to each processor.

  23. Pthreads SOR: Parallel parts (1) void* sor_1(void *s) { int slice = (int) s; int from = (slice*n)/p; int to = ((slice+1)*n)/p; for( i=from; i<to; i++) for( j=0; j<n; j++ ) temp[i][j] = 0.25*(grid[i-1][j] + grid[i+1][j] +grid[i][j-1] + grid[i][j+1]); }

  24. Pthreads SOR: Parallel parts (2) void* sor_2(void *s) { int slice = (int) s; int from = (slice*n)/p; int to = ((slice+1)*n)/p; for( i=from; i<to; i++) for( j=0; j<n; j++ ) grid[i][j] = temp[i][j]; }

  25. Pthreads SOR: main for some number of timesteps { for( i=0; i<p; i++ ) pthread_create(&thrd[i], NULL, sor_1, (void *)i); for( i=0; i<p; i++ ) pthread_join(thrd[i], NULL); for( i=0; i<p; i++ ) pthread_create(&thrd[i], NULL, sor_2, (void *)i); for( i=0; i<p; i++ ) pthread_join(thrd[i], NULL); }

  26. Summary: Thread Management • pthread_create(): creates a parallel thread executing a given function (and arguments), returns thread identifier. • pthread_exit(): terminates thread. • pthread_join(): waits for thread with particular thread identifier to terminate.

  27. Summary: Program Structure • Encapsulate parallel parts in functions. • Use function arguments to parameterize what a particular thread does. • Call pthread_create() with the function and arguments, save thread identifier returned. • Call pthread_join() with that thread identifier.

  28. Pthreads Synchronization • Create/exit/join • provide some form of synchronization, • at a very coarse level, • requires thread creation/destruction. • Need for finer-grain synchronization • mutex locks, • condition variables.

  29. Use of Mutex Locks • To implement critical sections (as needed, e.g., in en_queue and de_queue in TSP). • Pthreads provides only exclusive locks. • Some other systems allow shared-read, exclusive-write locks.

  30. Condition variables (1 of 5) pthread_cond_init( pthread_cond_t *cond, pthread_cond_attr *attr) • Creates a new condition variable cond. • Attribute: ignore for now.

  31. Condition Variables (2 of 5) pthread_cond_destroy( pthread_cond_t *cond) • Destroys the condition variable cond.

  32. Condition Variables (3 of 5) pthread_cond_wait( pthread_cond_t *cond, pthread_mutex_t *mutex) • Blocks the calling thread, waiting on cond. • Unlocks the mutex.

  33. Condition Variables (4 of 5) pthread_cond_signal( pthread_cond_t *cond) • Unblocksone thread waiting on cond. • Which one is determined by scheduler. • If no thread waiting, then signal is a no-op.

  34. Condition Variables (5 of 5) pthread_cond_broadcast( pthread_cond_t *cond) • Unblocks all threads waiting on cond. • If no thread waiting, then broadcast is a no-op.

  35. Use of Condition Variables • To implement signal-wait synchronization discussed in earlier examples. • Important note: a signal is “forgotten” if there is no corresponding wait that has already happened.

  36. Use of Wait/Signal (Pipelining) • Sequential • Parallel (Pattern -- picture; horiz. line -- processor).

  37. PIPE P1:for( i=0; i<num_pics, read(in_pic); i++ ) { int_pic_1[i] = trans1( in_pic ); signal( event_1_2[i] ); } P2: for( i=0; i<num_pics; i++ ) { wait( event_1_2[i] ); int_pic_2[i] = trans2( int_pic_1[i] ); signal( event_2_3[i] ); }

  38. PIPE Using Pthreads • Replacing the original wait/signal by a Pthreads condition variable wait/signal will not work. • signals before a wait are forgotten. • we need to remember a signal.

  39. How to remember a signal (1 of 2) semaphore_signal(i) { pthread_mutex_lock(&mutex_rem[i]); arrived [i]= 1; pthread_cond_signal(&cond[i]); pthread_mutex_unlock(&mutex_rem[i]); }

  40. How to Remember a Signal (2 of 2) semaphore_wait(i) { pthreads_mutex_lock(&mutex_rem[i]); if( arrived[i] = 0 ) { pthreads_cond_wait(&cond[i], mutex_rem[i]); } arrived[i] = 0; pthreads_mutex_unlock(&mutex_rem[i]); }

  41. PIPE with Pthreads P1:for( i=0; i<num_pics, read(in_pic); i++ ) { int_pic_1[i] = trans1( in_pic ); semaphore_signal( event_1_2[i] ); } P2: for( i=0; i<num_pics; i++ ) { semaphore_wait( event_1_2[i] ); int_pic_2[i] = trans2( int_pic_1[i] ); semaphore_signal( event_2_3[i] ); }

  42. Note • Many shared memory programming systems (other than Pthreads) have semaphores as basic primitive. • If they do, you should use it, not construct it yourself. • Implementation may be more efficient than what you can do yourself.

  43. Parallel TSP process i: while( (p=de_queue()) != NULL ) { for each expansion by one city { q = add_city(p); if complete(q) { update_best(q) }; else en_queue(q); } }

  44. Parallel TSP • Need critical section • in update_best, • in en_queue/de_queue. • In de_queue • wait if q is empty, • terminate if all processes are waiting. • In en_queue: • signal q is no longer empty.

  45. Parallel TSP: Mutual Exclusion en_queue() / de_queue() { pthreads_mutex_lock(&queue); …; pthreads_mutex_unlock(&queue); } update_best() { pthreads_mutex_lock(&best); …; pthreads_mutex_unlock(&best); }

  46. Parallel TSP: Condition Synchronization de_queue() { while( (q is empty) and (not done) ) { waiting++; if( waiting == p ) { done = true; pthreads_cond_broadcast(&empty, &queue); } else { pthreads_cond_wait(&empty, &queue); waiting--; } } if( done ) return null; else remove and return head of the queue; }

  47. Pthreads SOR: main for some number of timesteps { for( i=0; i<p; i++ ) pthread_create(&thrd[i], NULL, sor_1, (void *)i); for( i=0; i<p; i++ ) pthread_join(thrd[i], NULL); for( i=0; i<p; i++ ) pthread_create(&thrd[i], NULL, sor_2, (void *)i); for( i=0; i<p; i++ ) pthread_join(thrd[i], NULL); }

  48. Pthreads SOR: Parallel parts (1) void* sor_1(void *s) { int slice = (int) s; int from = (slice*n)/p; int to = ((slice+1)*n)/p; for(i=from;i<to;i++) for( j=0; j<n; j++ ) temp[i][j] = 0.25*(grid[i-1][j] + grid[i+1][j] +grid[i][j-1] + grid[i][j+1]); }

  49. Pthreads SOR: Parallel parts (2) void* sor_2(void *s) { int slice = (int) s; int from = (slice*n)/p; int to = ((slice+1)*n)/p; for(i=from;i<to;i++) for( j=0; j<n; j++ ) grid[i][j] = temp[i][j]; }

  50. Reality bites ... • Create/exit/join is not so cheap. • It would be more efficient if we could come up with a parallel program, in which • create/exit/join would happen rarely (once!), • cheaper synchronization were used. • We need something that makes all threads wait, until all have arrived -- a barrier.

More Related