CS 341 Programming Language Design and Implementation
• Administrative:
  • Final project part 1, topic selection: due Mon, 4/21 @ 11am
• Today:
  • parallel programming for performance…
Async vs. Parallel Programming…
• Async programming:
  • Better responsiveness…
  • Long-running operations
  • I/O operations
  • OS calls
• Parallel programming:
  • Better performance…
  • Numerical processing
  • Data analysis
  • Big data
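To make the contrast concrete, here is a minimal C# sketch (not from the slides; the method names and the url parameter are illustrative). The async method releases its thread while the I/O completes, which keeps a UI or server responsive; the parallel method spreads CPU-bound work across cores so it finishes sooner.

using System;
using System.Net.Http;
using System.Threading.Tasks;

static class AsyncVsParallelSketch
{
    // Async programming: responsiveness. The thread is released while the
    // long-running I/O operation is in flight; we resume when it completes.
    public static async Task<int> DownloadLengthAsync(string url)
    {
        using (var client = new HttpClient())
        {
            string page = await client.GetStringAsync(url);
            return page.Length;
        }
    }

    // Parallel programming: performance. CPU-bound numerical work is split
    // across cores, with one partial sum per worker thread.
    public static double SumOfSquares(double[] data)
    {
        double total = 0.0;
        object gate = new object();
        Parallel.For(0, data.Length,
            () => 0.0,                                        // per-thread partial sum
            (i, state, local) => local + data[i] * data[i],   // loop body
            local => { lock (gate) { total += local; } });    // combine once per thread
        return total;
    }
}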
HW solution for better performance?
• Multiple cores…
SW solution for better performance?
• Threading across different cores…
• [Diagram: the main thread starts Work1 (Stmt1; Stmt2; Stmt3;) and Work2 (Stmt4; Stmt5; Stmt6;) on worker threads, then continues its own work ("My work…"), with each thread running on a different core]
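As a rough C# rendering of that picture (a sketch, not code from the slides; Work1 and Work2 are placeholders for Stmt1 through Stmt6), the main thread forks two pieces of work onto worker threads, keeps doing its own work, and then joins:

using System;
using System.Threading.Tasks;

class ForkJoinSketch
{
    static void Work1() { /* Stmt1; Stmt2; Stmt3; */ }
    static void Work2() { /* Stmt4; Stmt5; Stmt6; */ }

    static void Main()
    {
        Task t1 = Task.Run(Work1);       // <<start Work1>> on a worker thread
        Task t2 = Task.Run(Work2);       // <<start Work2>> on another worker thread

        Console.WriteLine("My work…");   // main thread continues its own work

        Task.WaitAll(t1, t2);            // join: wait for both workers to finish
    }
}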
First things to consider…
• how to divide up the data?
  • rows? columns? blocks? nodes? sub-trees?
• is the workload evenly distributed, or unpredictable?
  • if the workload is even, we can decide ahead of time how to divide it up ("static")
  • if the workload is unpredictable, we need an adaptive approach ("dynamic")
  • (both approaches are sketched below)
• how to map threads onto the data?
  • which threads touch which data?
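A hedged C# sketch of the static-vs-dynamic choice (not from the slides; ProcessRow is a hypothetical per-row workload). With an even workload we can carve the rows into fixed chunks up front; with an unpredictable workload we let the runtime hand out rows as threads become free:

using System;
using System.Threading.Tasks;

static class PartitioningSketch
{
    // Static partitioning: each of numThreads workers gets a contiguous,
    // equally sized block of rows, decided ahead of time.
    public static void ProcessStatic(double[][] data, int numThreads)
    {
        int rowsPerThread = (data.Length + numThreads - 1) / numThreads;
        Parallel.For(0, numThreads, t =>
        {
            int lo = t * rowsPerThread;
            int hi = Math.Min(lo + rowsPerThread, data.Length);
            for (int r = lo; r < hi; r++)
                ProcessRow(data[r]);
        });
    }

    // Dynamic partitioning: rows are handed out adaptively, so expensive rows
    // don't leave some threads idle while others are overloaded.
    public static void ProcessDynamic(double[][] data)
    {
        Parallel.For(0, data.Length, r => ProcessRow(data[r]));
    }

    static void ProcessRow(double[] row) { /* hypothetical per-row work */ }
}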
Demo:
• The importance of data layout & locality…
• Matrix multiplication
Sequential vs. going parallel

// Naïve, triply-nested sequential solution:
for (int i = 0; i < N; i++)
{
  for (int j = 0; j < N; j++)
  {
    for (int k = 0; k < N; k++)
      C[i][j] += (A[i][k] * B[k][j]);
  }
}

// Going parallel: fork/join over the outer loop:
Parallel.For(0, N, (i) =>
{
  for (int j = 0; j < N; j++)
  {
    for (int k = 0; k < N; k++)
      C[i][j] += (A[i][k] * B[k][j]);
  }
});

Works great! 2x faster on 2 cores, 4x faster on 4 cores, …
But wait…
• What’s the other half of the chip?
• Are we using it effectively?
• Memory cache…
Memory architecture
• [Diagram: several cores, each with its own registers and caches, sharing an L3 cache]
• Key features:
  • Registers
  • L1 cache
  • L2 cache
  • L3 cache
  • RAM
• Each level of cache is roughly 10x slower to access:
  • L1: 1 cycle
  • L2: 10 cycles
  • L3: 100 cycles
  • RAM: 1000 cycles
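The practical consequence of that hierarchy is that access pattern matters as much as operation count. A small C# sketch (mine, not from the slides; exact timings vary by machine) sums the same 2-D array twice, once along rows and once down columns:

using System;
using System.Diagnostics;

class LocalityDemo
{
    static void Main()
    {
        const int N = 4000;
        double[,] a = new double[N, N];
        double sum = 0.0;

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += a[i, j];              // row order: consecutive addresses, cache-friendly
        Console.WriteLine($"row order:    {sw.ElapsedMilliseconds} ms");

        sw.Restart();
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += a[i, j];              // column order: large strides, mostly cache misses
        Console.WriteLine($"column order: {sw.ElapsedMilliseconds} ms");

        Console.WriteLine(sum);              // use the result so the loops aren't optimized away
    }
}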
A better matrix multiply…
Cache-friendly matrix multiplication: Step 1
• Loop interchange so the inner-most loop goes along a row…

Parallel.For(0, N, (i) =>
{
  for (int k = 0; k < N; k++)
    for (int j = 0; j < N; j++)
      C[i][j] += (A[i][k] * B[k][j]);
});

Factor of 2-10x improvement!
Cache-friendly matrix multiplication: Step 2
• work in blocks so they fit in the L1 cache…

for (int jj = 0; jj < N; jj += BS)
{
  int jjEND = Math.Min(jj + BS, N);

  // initialize:
  for (int i = 0; i < N; i++)
    for (int j = jj; j < jjEND; j++)
      C[i][j] = 0.0;

  // block multiply:
  for (int kk = 0; kk < N; kk += BS)
  {
    int kkEND = Math.Min(kk + BS, N);

    for (int i = 0; i < N; i++)
      for (int k = kk; k < kkEND; k++)
        for (int j = jj; j < jjEND; j++)
          C[i][j] += (A[i][k] * B[k][j]);
  }
}

Another factor of 2-4x…
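The two steps combine naturally with the earlier Parallel.For. One hedged way to do it (my sketch, not the lecture's code; it assumes A, B, C are N-by-N jagged arrays and BS is the block size, as in the snippets above) is to parallelize over the jj column blocks, since different blocks write disjoint columns of C and so never race:

Parallel.For(0, (N + BS - 1) / BS, (b) =>
{
    int jj = b * BS;
    int jjEND = Math.Min(jj + BS, N);

    // initialize this block of columns:
    for (int i = 0; i < N; i++)
        for (int j = jj; j < jjEND; j++)
            C[i][j] = 0.0;

    // blocked multiply, confined to columns [jj, jjEND):
    for (int kk = 0; kk < N; kk += BS)
    {
        int kkEND = Math.Min(kk + BS, N);
        for (int i = 0; i < N; i++)
            for (int k = kk; k < kkEND; k++)
                for (int j = jj; j < jjEND; j++)
                    C[i][j] += (A[i][k] * B[k][j]);
    }
});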
Current state of parallel programming
• in mainstream languages…
Parallel execution model in C#
• [Diagram: Parallel.For( ... ) is broken into tasks inside a Windows process (.NET app domains); the Task Parallel Library's task scheduler maps those tasks onto .NET thread pool worker threads, which the resource manager and Windows schedule across the cores]
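A small C# sketch (mine, not from the slides) showing those layers at work: Parallel.For groups the iteration range into tasks, and the TPL's task scheduler runs them on .NET thread pool worker threads; the same task machinery is also available directly through Task.

using System;
using System.Threading;
using System.Threading.Tasks;

class TplSketch
{
    static void Main()
    {
        // Parallel.For is built on the Task Parallel Library: iterations are
        // grouped into tasks and scheduled onto thread pool worker threads.
        Parallel.For(0, 8, i =>
        {
            Console.WriteLine($"iteration {i} on worker thread {Thread.CurrentThread.ManagedThreadId}");
        });

        // The same machinery, used directly: queue work, then join on the result.
        Task<int> t = Task.Run(() => 6 * 7);
        Console.WriteLine($"task result = {t.Result}");
    }
}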