Parallelizing C Programs Using Cilk Mahdi Javadi
Cilk Language • Cilk is a language for multithreaded parallel programming based on C. • The programmer does not need to worry about scheduling: the Cilk runtime schedules the computation to run efficiently. • There are three additional keywords: cilk, spawn and sync.
Example: Fibonacci

Serial C version:

int fib (int n) {
  int x, y;
  if (n < 2) return n;
  x = fib (n-1);
  y = fib (n-2);
  return x + y;
}

Cilk version:

cilk int fib (int n) {
  int x, y;
  if (n < 2) return n;
  x = spawn fib (n-1);
  y = spawn fib (n-2);
  sync;
  return x + y;
}
Performance Measures • Tp = execution time on P processors. • T1 is called work. • T∞ is called span. • Obvious lower bounds: Tp ≥ T1/P and Tp ≥ T∞. • p = T1/T∞ is called parallelism. Using more than p processors makes little sense.
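As a hypothetical illustration (the numbers are not from the slides): if a program has work T1 = 1200 ms and span T∞ = 30 ms, then on P = 4 processors the lower bounds give Tp ≥ max(T1/P, T∞) = max(300, 30) = 300 ms, and the parallelism is p = T1/T∞ = 40, so adding processors can help up to roughly 40 of them, after which the span becomes the bottleneck.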
Cilk Compiler • The file extension should be “.cilk”. • Example: > cilkc -O3 fib.cilk -o fib • To find the 30th Fibonacci number using 4 CPUs: > fib --nproc 4 30 • To collect timings of each processor and compute the span (not efficient): > cilkc -cilk-profile -cilk-span -O3 fib.cilk -o fib
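For concreteness, here is a minimal sketch of what a complete fib.cilk file could look like so that the commands above compile and run it. MIT Cilk 5 style is assumed; the argument handling and printing are illustrative additions, not from the slides.

#include <stdio.h>
#include <stdlib.h>

/* The cilk, spawn and sync keywords are translated by the cilkc compiler,
   so no extra header is needed for them in this sketch. */

cilk int fib(int n)
{
    int x, y;
    if (n < 2) return n;
    x = spawn fib(n - 1);   /* may run in parallel with the next spawn */
    y = spawn fib(n - 2);
    sync;                   /* wait for both children before using x and y */
    return x + y;
}

cilk int main(int argc, char *argv[])
{
    int n, result;
    if (argc < 2) {
        fprintf(stderr, "usage: fib <n>\n");
        return 1;
    }
    n = atoi(argv[1]);
    result = spawn fib(n);
    sync;
    printf("fib(%d) = %d\n", n, result);
    return 0;
}

When this is run as "fib --nproc 4 30", the Cilk runtime handles the --nproc option itself, so the program only has to read the remaining argument 30.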
Example: Matrix Multiplication • Suppose we want to multiply two n by n matrices, C = A . B. • We can recursively formulate the problem by splitting each matrix into four (n/2) by (n/2) submatrices:

[ C11 C12 ]   [ A11 A12 ]   [ B11 B12 ]   [ A11 B11 + A12 B21   A11 B12 + A12 B22 ]
[ C21 C22 ] = [ A21 A22 ] . [ B21 B22 ] = [ A21 B11 + A22 B21   A21 B12 + A22 B22 ]

• i.e. one n by n matrix multiplication reduces to 8 multiplications and 4 additions of (n/2) by (n/2) submatrices.
Multiplication Procedure

Mult(C, A, B, n)
  if (n = 1)
    C[1,1] = A[1,1] . B[1,1];
  else {
    spawn Mult(C11, A11, B11, n/2);
    …
    spawn Mult(C22, A21, B12, n/2);
    spawn Mult(T11, A12, B21, n/2);
    …
    spawn Mult(T22, A22, B22, n/2);
    sync;
    Add(C, T, n);
  }
Addition Procedure

Add(C, T, n)
  if (n = 1)
    C[1,1] = C[1,1] + T[1,1];
  else {
    spawn Add(C11, T11, n/2);
    …
    spawn Add(C22, T22, n/2);
    sync;
  }

• T1 (work) for addition = O(n²). • T∞ (span) for addition = O(log(n)).
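To make the pseudocode above concrete, here is a hedged sketch of the Add step in Cilk. It assumes the matrices are stored as flat row-major arrays of doubles with a fixed leading dimension LD; the quadrant macro Q, the element type and the constant LD are illustrative choices, not from the slides.

#define LD 1024   /* assumed leading dimension (row stride) of the full matrices */

/* Q(M, i, j, n): pointer to quadrant (i, j) of the n-by-n block starting at M */
#define Q(M, i, j, n) ((M) + (i) * ((n) / 2) * LD + (j) * ((n) / 2))

cilk void Add(double *C, double *T, int n)
{
    if (n == 1) {
        C[0] = C[0] + T[0];     /* single-element base case, as on the slide */
    } else {
        spawn Add(Q(C, 0, 0, n), Q(T, 0, 0, n), n / 2);
        spawn Add(Q(C, 0, 1, n), Q(T, 0, 1, n), n / 2);
        spawn Add(Q(C, 1, 0, n), Q(T, 1, 0, n), n / 2);
        spawn Add(Q(C, 1, 1, n), Q(T, 1, 1, n), n / 2);
        sync;                   /* all four quadrant additions finish before returning */
    }
}

In practice the recursion would stop at a larger base-case size, in the spirit of the cutoff discussed later in the slides, rather than descending all the way to single elements.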
Complexity of Multiplication • We know that matrix multiplication is O(n³), hence T1 (work) for multiplication = O(n³). • T∞ (span): M∞(n) = M∞(n/2) + O(log(n)) = O(log²(n)), since each of the O(log(n)) levels of recursion contributes the O(log(n)) span of a parallel addition. • p = T1 / T∞ = O(n³ / log²(n)). • To multiply two 1000 by 1000 matrices: p ≈ 10⁷ (a lot of CPUs!).
Discrete Fourier Transform

Sequential version:

DFT(n, w, p, …)
  ...
  t = w² mod p;
  DFT(n/2, t, p, …);
  DFT(n/2, t, p, …);
  …
  w1 = 1;
  for (i = 0; i < n/2; i++) {
    …
    a[i] = …
    w1 = w1 . w mod p;
  }

Cilk version:

cilk DFT(n, w, p, …)
  ...
  t = w² mod p;
  spawn DFT(n/2, t, p, …);
  spawn DFT(n/2, t, p, …);
  sync;
  …
  spawn ParCom(n, a, p, 1, …);

cilk ParCom(n, a, p, m, …)
  if (n <= 512)
    …
  spawn ParCom(n/2, a, p, 1, …);
  m' = m . w^(n/2) mod p;
  spawn ParCom(n/2, a+n/2, p, m', …);
  sync;
Complexity of ParCom • The sequential combining loop does n/2 multiplications. • T∞ (span) for ParCom: T∞(n) = T∞(n/2) + O(log(n)), so T∞(n) = O(log²(n)). • p = O(n/log²(n)). • We ran the FFT on "stan", which has 4 CPUs. • Thus p > 4 does not make sense, so we cut off the parallelism at some level of the recursion to speed up the program (see the sketch below).
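The cutoff idea (already visible in the n <= 512 test inside ParCom) can be shown on a self-contained toy example that is not from the slides: a parallel array sum that spawns only while the subproblem is large and falls back to a plain serial loop below an assumed grain size CUTOFF, which removes scheduling overhead near the leaves of the recursion.

#define CUTOFF 4096   /* assumed grain size below which we stop spawning */

cilk long psum(long *a, int n)
{
    long left, right;
    if (n <= CUTOFF) {           /* serial base case: no spawn overhead */
        long s = 0;
        int i;
        for (i = 0; i < n; i++)
            s += a[i];
        return s;
    }
    left  = spawn psum(a, n / 2);             /* first half, possibly in parallel */
    right = spawn psum(a + n / 2, n - n / 2); /* second half */
    sync;                                     /* wait before combining */
    return left + right;
}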
Timings • Sequential FFT: 123789 (ms)