SHAASTRA 2011 :: PARALLEL PROGRAMMING WORKSHOP CILK++ WITH HANDS-ON SESSION
Cilk++ Session Outline • Introduction to cilk • Basic cilk programming constructs • Constructing cilk programs for common Algorithms • Cilk++ Tools
Cilk++ Session Outline • Introduction to cilk
Development history of cilk • Developed at MIT • Cilk-1 - work-stealing scheduling • Parallel Programming constructs in cilk-5 • Complexity theory for cilk parallel programs • Intel - optimized cilk++ framework
What is Cilk? A C language for programming dynamic multithreaded applications on shared-memory multiprocessors. • Example applications:
Shared Memory Multiprocessor Systems [Figure: processors (P), each with its own cache ($), connected through a network to shared memory and I/O] • The abstraction that Cilk uses • One global unified shared memory • Homogeneous processors
Cilk++ Session Outline • Basic cilk programming constructs
Fibonacci Number Computation

C elision:
int fib (int n) {
  if (n<2) return (n);
  else {
    int x,y;
    x = fib(n-1);
    y = fib(n-2);
    return (x+y);
  }
}

Cilk code:
int fib (int n) {
  if (n<2) return (n);
  else {
    int x,y;
    x = cilk_spawn fib(n-1);
    y = cilk_spawn fib(n-2);
    cilk_sync;
    return (x+y);
  }
}

Cilk++ is a faithful extension of C++. A Cilk program’s serial elision (the same program with the Cilk keywords removed) is always a legal implementation of Cilk semantics. Cilk provides no new data types.
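The serial-elision property can be tried with an ordinary C++ compiler. The sketch below is an illustration and was not part of the slides: it assumes that defining `cilk_spawn` and `cilk_sync` as empty macros is an acceptable stand-in for removing the keywords, so the Cilk version of `fib` compiles unchanged as its serial elision.

```cpp
// Serial elision sketch. The two empty macros are an assumption made for
// this illustration; a Cilk++ compiler supplies the real keywords instead.
#define cilk_spawn
#define cilk_sync

int fib(int n) {
    if (n < 2) return n;
    int x, y;
    x = cilk_spawn fib(n - 1);  // under Cilk++ this child may run in parallel
    y = cilk_spawn fib(n - 2);
    cilk_sync;                  // under Cilk++ this waits for both children
    return x + y;
}
```

Calling `fib(10)` on the elided version returns 55, which is the value any parallel execution must also produce.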
Basic Cilk Keywords

int fib (int n) {
  if (n<2) return (n);
  else {
    int x,y;
    x = cilk_spawn fib(n-1);
    y = cilk_spawn fib(n-2);
    cilk_sync;
    return (x+y);
  }
}

cilk_spawn: The named child Cilk procedure can execute in parallel with the parent caller.
cilk_sync: Control cannot pass this point until all spawned children have returned.
Dynamic Multithreading

int fib (int n) {
  if (n<2) return (n);
  else {
    int x,y;
    x = cilk_spawn fib(n-1);
    y = cilk_spawn fib(n-2);
    cilk_sync;
    return (x+y);
  }
}

Example: fib(4)
[Figure: computation DAG for fib(4), one node per call, by level: 4; 3, 2; 2, 1, 1, 0; 1, 0]
The computation DAG is “processor oblivious”.
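The size and depth of the fib DAG can be measured with a small helper. This is a sketch under a simplifying assumption that is not on the slide: every call to `fib` costs one unit. The names `Cost` and `fib_cost` are hypothetical.

```cpp
#include <algorithm>

// Hypothetical cost model: each call to fib counts as one unit of work.
struct Cost {
    long work;  // total number of calls in the DAG
    long span;  // longest chain of calls that depend on one another
};

Cost fib_cost(int n) {
    if (n < 2) return {1, 1};
    Cost a = fib_cost(n - 1);
    Cost b = fib_cost(n - 2);
    // Work adds up over both children; span follows the deeper child.
    return {a.work + b.work + 1, std::max(a.span, b.span) + 1};
}
```

For `fib_cost(4)` this gives a work of 9, matching the nine nodes in the figure, and a span of 4, so under this model the DAG offers a parallelism of 9/4 regardless of how many processors are available.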
Parallelizing Vector Addition

C++:
void vadd (real *A, real *B, int n){
  int i;
  for (i=0; i<n; i++) A[i]+=B[i];
}

1. Recursive code:
void vadd (real *A, real *B, int n){
  if (n<=BASE) {
    int i;
    for (i=0; i<n; i++) A[i]+=B[i];
  } else {
    vadd (A, B, n/2);
    vadd (A+n/2, B+n/2, n-n/2);
  }
}
Parallelizing Vector Addition

C++:
void vadd (real *A, real *B, int n){
  int i;
  for (i=0; i<n; i++) A[i]+=B[i];
}

2. Insert Cilk keywords:
void vadd (real *A, real *B, int n){
  if (n<=BASE) {
    int i;
    for (i=0; i<n; i++) A[i]+=B[i];
  } else {
    cilk_spawn vadd (A, B, n/2);
    cilk_spawn vadd (A+n/2, B+n/2, n-n/2);
    cilk_sync;
  }
}
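The recursive `vadd` leaves `real` and `BASE` undefined. The sketch below fills them in with assumed values (`double` and a cutoff of 4, both chosen only for illustration) and elides the Cilk keywords with empty macros so it builds with a standard C++ compiler.

```cpp
// Assumption: empty macros stand in for the Cilk++ keywords (serial elision).
#define cilk_spawn
#define cilk_sync

// Assumption: the slide does not define `real`; double is used here.
typedef double real;
// Assumption: hypothetical cutoff below which the plain loop runs.
const int BASE = 4;

void vadd(real *A, real *B, int n) {
    if (n <= BASE) {
        for (int i = 0; i < n; i++) A[i] += B[i];
    } else {
        // The two halves touch disjoint elements, so they can run in parallel.
        cilk_spawn vadd(A, B, n / 2);
        cilk_spawn vadd(A + n / 2, B + n / 2, n - n / 2);
        cilk_sync;
    }
}
```

The cutoff trades scheduling overhead against available parallelism: a larger `BASE` means fewer spawns but longer serial loops.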
Dynamic Scheduling [Figure: processors (P), each with its own cache ($), connected through a network to shared memory and I/O] • Cilk allows the programmer to express potential parallelism in an application. • The Cilk scheduler maps Cilk threads onto processors dynamically at runtime. • Since on-line schedulers are complicated, we’ll illustrate the ideas with an off-line scheduler.
Cilk’s Work-Stealing Scheduler • Each processor maintains a work deque (double-ended queue) of ready threads, and it manipulates the bottom of the deque like a stack. • When a processor runs out of work, it steals a thread from the top of a random victim’s deque. [Animation: four processors (P), each with its own deque; successive frames show Spawn!, Spawn! Spawn!, Return!, Return!, Steal!, Spawn!]
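The deque discipline from the scheduler slides can be modeled in a few lines. This is a single-threaded sketch for intuition only, not the Cilk runtime: the `WorkDeque` type and its method names are hypothetical, tasks are plain integers, and a real scheduler needs synchronization between the owner and thieves.

```cpp
#include <deque>

// Single-threaded model of one processor's work deque (hypothetical type).
struct WorkDeque {
    std::deque<int> tasks;  // front = top (oldest), back = bottom (newest)

    // Spawn: the owner pushes new work at the bottom.
    void push_bottom(int task) { tasks.push_back(task); }

    // Return: the owner takes work back from the bottom, like a stack.
    bool pop_bottom(int &out) {
        if (tasks.empty()) return false;
        out = tasks.back();
        tasks.pop_back();
        return true;
    }

    // Steal: a thief removes the oldest work from the top.
    bool steal_top(int &out) {
        if (tasks.empty()) return false;
        out = tasks.front();
        tasks.pop_front();
        return true;
    }
};
```

After pushing tasks 1, 2, 3, a thief that calls `steal_top` receives task 1 while the owner's next `pop_bottom` yields task 3, so the two ends rarely contend for the same entry.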
Cilk++ Session Outline • Constructing cilk programs for common Algorithms
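Square-Matrix Multiplication

$$
\begin{pmatrix}
c_{11} & c_{12} & \cdots & c_{1n} \\
c_{21} & c_{22} & \cdots & c_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
c_{n1} & c_{n2} & \cdots & c_{nn}
\end{pmatrix}
=
\begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{n1} & a_{n2} & \cdots & a_{nn}
\end{pmatrix}
\times
\begin{pmatrix}
b_{11} & b_{12} & \cdots & b_{1n} \\
b_{21} & b_{22} & \cdots & b_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
b_{n1} & b_{n2} & \cdots & b_{nn}
\end{pmatrix}
$$

$C = A \times B$, where $c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}$. Assume for simplicity that $n = 2^k$.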
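The definition of $c_{ij}$ translates directly into three nested loops. The sketch below is a plain serial baseline; storing the matrices row-major in flat vectors and the name `matmul` are assumptions made for the example.

```cpp
#include <vector>

// Serial baseline: C = A * B for n x n matrices.
// Assumption: matrices are stored row-major in flat vectors of size n * n.
std::vector<double> matmul(const std::vector<double> &A,
                           const std::vector<double> &B, int n) {
    std::vector<double> C(n * n, 0.0);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                C[i * n + j] += A[i * n + k] * B[k * n + j];  // c_ij += a_ik * b_kj
    return C;
}
```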
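Recursive Matrix Multiplication

Divide and conquer:

$$
\begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix}
=
\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}
\times
\begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}
=
\begin{pmatrix} A_{11}B_{11} & A_{11}B_{12} \\ A_{21}B_{11} & A_{21}B_{12} \end{pmatrix}
+
\begin{pmatrix} A_{12}B_{21} & A_{12}B_{22} \\ A_{22}B_{21} & A_{22}B_{22} \end{pmatrix}
$$

8 multiplications of (n/2) × (n/2) matrices. 1 addition of n × n matrices.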
Matrix Multiply in Pseudo-Cilk

void Mult(*C, *A, *B, n) {
  float *T = malloc(n*n*sizeof(float));
  ⟨base case & partition matrices⟩
  cilk_spawn Mult(C11,A11,B11,n/2);
  cilk_spawn Mult(C12,A11,B12,n/2);
  cilk_spawn Mult(C22,A21,B12,n/2);
  cilk_spawn Mult(C21,A21,B11,n/2);
  cilk_spawn Mult(T11,A12,B21,n/2);
  cilk_spawn Mult(T12,A12,B22,n/2);
  cilk_spawn Mult(T22,A22,B22,n/2);
  cilk_spawn Mult(T21,A22,B21,n/2);
  cilk_sync;
  cilk_spawn Add(C,T,n);
  cilk_sync;
  return;
}

void Add(*C, *T, n) {
  ⟨base case & partition matrices⟩
  cilk_spawn Add(C11,T11,n/2);
  cilk_spawn Add(C12,T12,n/2);
  cilk_spawn Add(C21,T21,n/2);
  cilk_spawn Add(C22,T22,n/2);
  cilk_sync;
  return;
}
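One possible way to fill in the "base case & partition matrices" step is sketched below. It is an illustration under several assumptions that are not in the slides: the matrices are row-major, quadrants are addressed through extra leading-dimension parameters (`ldc`, `lda`, `ldb`, `ldt`) instead of being copied, the base case is a 1 × 1 matrix, and the Cilk keywords are elided with empty macros so the code builds with a standard compiler.

```cpp
#include <vector>

// Assumption: empty macros stand in for the Cilk++ keywords (serial elision).
#define cilk_spawn
#define cilk_sync

// Element (i, j) of a submatrix lives at p[i * ld + j]; ld is a hypothetical
// "leading dimension" parameter added for this sketch.

// C += T for n x n blocks.
void Add(float *C, int ldc, const float *T, int ldt, int n) {
    if (n == 1) { C[0] += T[0]; return; }
    int h = n / 2;
    cilk_spawn Add(C, ldc, T, ldt, h);
    cilk_spawn Add(C + h, ldc, T + h, ldt, h);
    cilk_spawn Add(C + h * ldc, ldc, T + h * ldt, ldt, h);
    cilk_spawn Add(C + h * ldc + h, ldc, T + h * ldt + h, ldt, h);
    cilk_sync;
}

// C = A * B for n x n blocks, n a power of two.
void Mult(float *C, int ldc, const float *A, int lda,
          const float *B, int ldb, int n) {
    if (n == 1) { C[0] = A[0] * B[0]; return; }
    int h = n / 2;
    std::vector<float> T(n * n);  // temporary with leading dimension n
    const float *A11 = A, *A12 = A + h, *A21 = A + h * lda, *A22 = A21 + h;
    const float *B11 = B, *B12 = B + h, *B21 = B + h * ldb, *B22 = B21 + h;
    float *C11 = C, *C12 = C + h, *C21 = C + h * ldc, *C22 = C21 + h;
    float *T11 = T.data(), *T12 = T11 + h, *T21 = T11 + h * n, *T22 = T21 + h;
    // Eight independent half-size products, four into C and four into T.
    cilk_spawn Mult(C11, ldc, A11, lda, B11, ldb, h);
    cilk_spawn Mult(C12, ldc, A11, lda, B12, ldb, h);
    cilk_spawn Mult(C22, ldc, A21, lda, B12, ldb, h);
    cilk_spawn Mult(C21, ldc, A21, lda, B11, ldb, h);
    cilk_spawn Mult(T11, n, A12, lda, B21, ldb, h);
    cilk_spawn Mult(T12, n, A12, lda, B22, ldb, h);
    cilk_spawn Mult(T22, n, A22, lda, B22, ldb, h);
    cilk_spawn Mult(T21, n, A22, lda, B21, ldb, h);
    cilk_sync;
    cilk_spawn Add(C, ldc, T.data(), n, n);
    cilk_sync;
}
```

A top-level call on full matrices passes `n` for every leading dimension, for example `Mult(C, n, A, n, B, n, n)`. Recursing down to 1 × 1 blocks keeps the sketch short; a practical version would switch to a serial loop below some block size.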
Merge Sort

void MergeSort(int *B, int *A, int n) {
  if (n==1) {
    B[0] = A[0];
  } else {
    int *C;
    C = (int*) malloc(n*sizeof(int));
    cilk_spawn MergeSort(C, A, n/2);
    cilk_spawn MergeSort(C+n/2, A+n/2, n-n/2);
    cilk_sync;
    Merge(B, C, C+n/2, n/2, n-n/2);
    free(C);  /* release the scratch buffer once it has been merged into B */
  }
}

[Figure: successive merges of the input 19 3 12 46 33 4 21 14 → (3 19) (12 46) (4 33) (14 21) → (3 12 19 46) (4 14 21 33) → 3 4 12 14 19 21 33 46]
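The slide calls `Merge` without showing it. The sketch below supplies one plausible serial `Merge` (an assumption, not the presenters' code), uses a `std::vector` for the scratch buffer, and elides the Cilk keywords with empty macros so that it builds with a standard compiler.

```cpp
#include <vector>

// Assumption: empty macros stand in for the Cilk++ keywords (serial elision).
#define cilk_spawn
#define cilk_sync

// Serial merge of sorted runs L[0..nl) and R[0..nr) into out.
// Assumed body: the slide only shows the call.
void Merge(int *out, const int *L, const int *R, int nl, int nr) {
    int i = 0, j = 0, k = 0;
    while (i < nl && j < nr) out[k++] = (R[j] < L[i]) ? R[j++] : L[i++];
    while (i < nl) out[k++] = L[i++];
    while (j < nr) out[k++] = R[j++];
}

// Sorts A[0..n) into B[0..n). As on the slide, n >= 1 is assumed.
void MergeSort(int *B, const int *A, int n) {
    if (n == 1) { B[0] = A[0]; return; }
    std::vector<int> C(n);  // scratch space, released automatically
    cilk_spawn MergeSort(C.data(), A, n / 2);
    cilk_spawn MergeSort(C.data() + n / 2, A + n / 2, n - n / 2);
    cilk_sync;
    Merge(B, C.data(), C.data() + n / 2, n / 2, n - n / 2);
}
```

With a serial `Merge` the final merge of n elements runs on one processor, which limits the parallelism of the sort as a whole.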
Parallel Merge

void P_Merge(int *C, int *A, int *B, int na, int nb) {
  if (na < nb) {
    cilk_spawn P_Merge(C, B, A, nb, na);
  } else if (na==1) {
    if (nb == 0) {
      C[0] = A[0];
    } else {
      C[0] = (A[0]<B[0]) ? A[0] : B[0]; /* minimum */
      C[1] = (A[0]<B[0]) ? B[0] : A[0]; /* maximum */
    }
  } else {
    int ma = na/2;
    int mb = BinarySearch(A[ma], B, nb);
    cilk_spawn P_Merge(C, A, B, ma, mb);
    cilk_spawn P_Merge(C+ma+mb, A+ma, B+mb, na-ma, nb-mb);
    cilk_sync;
  }
}
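`P_Merge` depends on a `BinarySearch` helper that the slide does not define. The sketch below assumes it returns the number of elements of `B` that are strictly less than the key, which is what makes the two recursive merges independent. It also adds a guard for two empty inputs and elides the Cilk keywords with empty macros; all three choices are assumptions for the illustration.

```cpp
#include <algorithm>

// Assumption: empty macros stand in for the Cilk++ keywords (serial elision).
#define cilk_spawn
#define cilk_sync

// Assumed meaning: how many elements of sorted B[0..nb) are less than key.
int BinarySearch(int key, const int *B, int nb) {
    return static_cast<int>(std::lower_bound(B, B + nb, key) - B);
}

// Merges sorted A[0..na) and B[0..nb) into C[0..na+nb).
void P_Merge(int *C, const int *A, const int *B, int na, int nb) {
    if (na < nb) {
        cilk_spawn P_Merge(C, B, A, nb, na);  // keep the larger array first
    } else if (na == 0) {
        return;  // both inputs empty; guard added for this sketch
    } else if (na == 1) {
        if (nb == 0) {
            C[0] = A[0];
        } else {
            C[0] = (A[0] < B[0]) ? A[0] : B[0];  // minimum
            C[1] = (A[0] < B[0]) ? B[0] : A[0];  // maximum
        }
    } else {
        int ma = na / 2;
        int mb = BinarySearch(A[ma], B, nb);
        // Everything in A[0..ma) and B[0..mb) is <= everything in the rest,
        // so the two halves can be merged independently.
        cilk_spawn P_Merge(C, A, B, ma, mb);
        cilk_spawn P_Merge(C + ma + mb, A + ma, B + mb, na - ma, nb - mb);
        cilk_sync;
    }
}
```

Merging {1, 4, 7, 10} with {2, 3, 8} first splits at A[2] = 7, which sends {1, 4} and {2, 3} to the left half of `C` and {7, 10} and {8} to the right half.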
Cilk-for Loop • Loop-level parallelism • Loop granularity specification

void vadd(int *C, int *A, int *B, int n) {
  cilk_for (int i=0; i<n; i++) {
    C[i] = A[i] + B[i];
  }
}
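One way to picture loop granularity is to expand the parallel loop by hand into a divide-and-conquer recursion that stops splitting once a range is small enough. The sketch below is a conceptual model only, not a description of how the Cilk++ runtime is implemented: `vadd_range` and its `grain` parameter are hypothetical, and the Cilk keywords are elided with empty macros so it builds with a standard compiler.

```cpp
// Assumption: empty macros stand in for the Cilk++ keywords (serial elision).
#define cilk_spawn
#define cilk_sync

// Conceptual expansion of a parallel loop over [lo, hi): halve the range
// until it holds at most `grain` iterations, then run a plain serial loop.
// Assumes grain >= 1.
void vadd_range(int *C, const int *A, const int *B, int lo, int hi, int grain) {
    if (hi - lo <= grain) {
        for (int i = lo; i < hi; i++) C[i] = A[i] + B[i];
        return;
    }
    int mid = lo + (hi - lo) / 2;
    cilk_spawn vadd_range(C, A, B, lo, mid, grain);  // left half may run in parallel
    vadd_range(C, A, B, mid, hi, grain);             // right half runs in the caller
    cilk_sync;
}
```

A small `grain` exposes more parallelism at the price of more spawns, while a large `grain` does the opposite.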
Truly Hands-on Session • Parallel Quick-Sort : +/- Mutex • Parallel Breadth First Search • Parallel Linear Search • Parallel partial sums
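As a possible starting point for the first exercise, here is a sketch of quicksort in the spawn/sync style used throughout the session. It is not the workshop's reference solution: the name `qsort_par` and the choice of the last element as pivot are assumptions, and the Cilk keywords are elided with empty macros so it builds with a standard compiler. No mutex is needed in this version because the two recursive calls work on disjoint parts of the array.

```cpp
#include <algorithm>

// Assumption: empty macros stand in for the Cilk++ keywords (serial elision).
#define cilk_spawn
#define cilk_sync

// Sorts the half-open range [begin, end) in place.
void qsort_par(int *begin, int *end) {
    if (end - begin < 2) return;
    int pivot = *(end - 1);
    // Move elements smaller than the pivot to the front of the range.
    int *mid = std::partition(begin, end - 1,
                              [pivot](int v) { return v < pivot; });
    std::swap(*mid, *(end - 1));       // put the pivot in its final position
    cilk_spawn qsort_par(begin, mid);  // left part may run in parallel
    qsort_par(mid + 1, end);           // right part runs in the caller
    cilk_sync;
}
```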
Cilk++ Session Outline • Cilk++ Tools
Cilkview • Performance Analysis of Parallel Programs • Estimated Speedup • Burdened Parallelism • Granularity Level of cilk_for • Efficiency of Reducer
Cilkscreen • Data race detection • Output: number of races per statement/variable
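To make the idea of a data race concrete, the sketch below contrasts a racy pattern with a race-free rewrite. It is an illustration, not Cilkscreen output: all function names are hypothetical, and the Cilk keywords are elided with empty macros, so this build runs serially and both versions return the same sum. Under a real parallel execution only the second version is safe.

```cpp
// Assumption: empty macros stand in for the Cilk++ keywords (serial elision).
#define cilk_spawn
#define cilk_sync

// Adds a[0..n) into a shared accumulator.
void add_into(const int *a, int n, long *total) {
    for (int i = 0; i < n; i++) *total += a[i];
}

// Racy when run in parallel: both children update the same variable.
long sum_racy(const int *a, int n) {
    long total = 0;
    cilk_spawn add_into(a, n / 2, &total);
    cilk_spawn add_into(a + n / 2, n - n / 2, &total);
    cilk_sync;
    return total;
}

// Returns the sum of a[0..n) without touching shared state.
long sum_range(const int *a, int n) {
    long s = 0;
    for (int i = 0; i < n; i++) s += a[i];
    return s;
}

// Race-free: each child writes only its own result variable.
long sum_safe(const int *a, int n) {
    long left, right;
    left = cilk_spawn sum_range(a, n / 2);
    right = cilk_spawn sum_range(a + n / 2, n - n / 2);
    cilk_sync;
    return left + right;
}
```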
Key Ideas • Cilk is simple: cilk_spawn, cilk_sync, cilk_for • Recursion, recursion, recursion, … • Work & span, work & span, work & span, …
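For reference, the standard relations behind "work & span" can be written as follows. The notation is an assumption, since the slides do not define symbols: $T_1$ is the work (running time on one processor), $T_\infty$ is the span (running time with unlimited processors), and $T_P$ is the running time on $P$ processors.

```latex
T_P \ge \frac{T_1}{P}, \qquad
T_P \ge T_\infty, \qquad
\text{parallelism} = \frac{T_1}{T_\infty}, \qquad
\text{speedup} = \frac{T_1}{T_P} \le \min\!\left(P,\ \frac{T_1}{T_\infty}\right)
```

The first two inequalities say that P processors can at best divide the work evenly and can never finish sooner than the longest dependency chain, which is why adding processors beyond the parallelism $T_1 / T_\infty$ cannot help.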
Thank you. Next up: OpenMP, MPI