CSE-700 Parallel Programming: Introduction 박성우 POSTECH Sep 6, 2007
Common Features? • ... they all run faster on multi-core CPUs.
Multi-core CPUs • IBM Power4, dual-core, 2000 • Intel reaches thermal wall, 2004 ⇒ no more free lunch! • Intel Xeon, quad-core, 2006 • Sony PlayStation 3 Cell, eight cores enabled, 2006 • Intel, 80 cores, 2011 (prototype finished) (source: Herb Sutter, "Software and the concurrency revolution")
Parallel Programming Models • POSIX threads (API) • OpenMP (API) • HPF (High Performance Fortran) • Cray's Chapel • Nesl • Sun's Fortress • IBM's X10 • ... • and a lot more.
Parallelism • Data parallelism • ability to apply a function in parallel to each element of a collection of data • Thread parallelism • ability to run multiple threads concurrently • Each thread uses its own local state. • Shared memory parallelism
Data Parallelism = Data Separation • [diagram: the elements a1 ... an, an+1 ... an+m, an+m+1 ... an+m+l of a collection are split into contiguous chunks, one chunk per hardware thread (#1, #2, #3)]
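For instance, a minimal C/OpenMP sketch of this idea (OpenMP is covered later in the course); the array size and the element-wise function f are made up for illustration:

    #include <omp.h>

    #define N 1000000

    double f(double x) { return x * x + 1.0; }   /* any pure element-wise function */

    void map_f(const double *a, double *out)
    {
        /* Each iteration is independent, so OpenMP may hand disjoint
           chunks of the index range to different hardware threads. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            out[i] = f(a[i]);
    }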
Data Parallelism in Hardware • GeForce 8800 • 128 stream processors @ 1.3 GHz, 500+ GFLOPS
Data Parallelism in Programming Languages
• Fortress
  • parallelism is the default.
        for i ← 1:m, j ← 1:n do   // 1:n is a generator
          a[i, j] := b[i] c[j]
        end
• Nesl (1990s)
  • supports nested data parallelism
  • the function being applied itself can be parallel.
        {sum(a) : a in [[2, 3], [8, 3, 9], [7]]};
Data Parallel Haskell (DAMP '07) • Haskell + nested data parallelism • flattening (vectorization) • transforms a nested parallel program such that it manipulates only flat arrays. • fusion • eliminate many intermediate arrays • Ex: 10,000x10,000 sparse matrix multiplication with 1 million elements
Thread Parallelism • [diagram: two hardware threads, each with its own local state, exchanging messages over synchronous communication]
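A small C/MPI sketch of the same model (MPI processes rather than hardware threads, but the idea is the same: each party keeps its state local and communicates only through synchronous messages; the values sent are made up):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, local_state;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        local_state = rank * 100;   /* each process owns its own local state */

        if (rank == 0) {
            /* MPI_Ssend is a synchronous send: it completes only after the
               matching receive has started. */
            MPI_Ssend(&local_state, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            int msg;
            MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", msg);
        }

        MPI_Finalize();
        return 0;
    }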
Pure Functional Threads • Purely functional threads can run concurrently. • Effect-free computations can be executed in parallel with any other effect-free computations. • Example: collision detection [diagram: objects A and B move to new positions A' and B']
Manticore (DAMP '07)
• Three layers
  • sequential base language
    • functional language drawn from SML
    • no mutable references and arrays!
  • data-parallel programming
    • implicit: the compiler and runtime system manage thread creation.
    • e.g., parallel arrays of parallel arrays
          [: 2 * n | n in nums where n > 0 :]
          fun mapP f xs = [: f x | x in xs :]
  • concurrent programming
Concurrent Programming in Manticore (DAMP '07) • Based on Concurrent ML • threads and synchronous message passing • Threads do not share mutable state. • in fact, there are no mutable references or arrays • Explicit: • The programmer manages thread creation.
Data Parallelism / Thread Parallelism / Shared Memory Parallelism (Shared-State Concurrency)
Shared Memory Parallelism • [diagram: three hardware threads all reading from and writing to a single shared memory]
Company of Heroes • Interaction of a LOT of objects: • thousands of objects • Each object has its own mutable state. • Each object update affects several other objects. • All objects are updated 30+ times per second. • Problem: • How do we handle simultaneous updates to the same memory location?
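A tiny C/OpenMP sketch of the underlying problem: several threads update the same memory location without coordination, so updates can be lost (the counter is an invented stand-in for any shared object field):

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        long hit_points = 0;               /* shared state, like an object's field */

        #pragma omp parallel for
        for (int i = 0; i < 1000000; i++)
            hit_points += 1;               /* unsynchronized read-modify-write: a data race */

        /* Often prints less than 1000000 because concurrent updates overwrite each other. */
        printf("%ld\n", hit_points);
        return 0;
    }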
Manual Lock-based Synchronization
    pthread_mutex_lock(&mutex);
    mutate_variable();
    pthread_mutex_unlock(&mutex);
• Locks and condition variables ⇒ fundamentally flawed!
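Fleshing the fragment above out into a complete program (a sketch; the shared counter, thread count, and iteration count are invented):

    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;
    static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 1000000; i++) {
            pthread_mutex_lock(&mutex);    /* protect the shared variable */
            counter++;
            pthread_mutex_unlock(&mutex);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("%ld\n", counter);          /* always 2000000 with the lock in place */
        return 0;
    }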
Bank Accounts ("Beautiful Concurrency", Peyton Jones, 2007) • Invariant: atomicity • no thread observes a state in which the money has left one account but has not yet arrived in the other. [diagram: threads #1 ... #n issue transfer requests against accounts A and B held in shared memory]
Bank Accounts using Locks
• In an object-oriented language:
        class Account {
          Int balance;
          synchronized void deposit (Int n) {
            balance = balance + n;
          }
        }
• Code for transfer:
        void transfer (Account from, Account to, Int amount) {
          from.withdraw (amount);
          // an intermediate state!
          to.deposit (amount);
        }
A Quick Fix: Explicit Locking
        void transfer (Account from, Account to, Int amount) {
          from.lock(); to.lock();
          from.withdraw (amount);
          to.deposit (amount);
          from.unlock(); to.unlock();
        }
• Now, the program is prone to deadlock.
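The conventional workaround is to acquire the locks in a globally agreed order. A hedged C sketch of that idea; the Account type, its fields, and the address-based ordering are invented for illustration and are not the course's prescribed solution:

    #include <pthread.h>
    #include <stdint.h>

    typedef struct {
        pthread_mutex_t lock;
        int balance;
    } Account;

    /* Assumes from != to. Locks are always taken in address order, so two
       concurrent transfers between the same pair of accounts cannot deadlock. */
    void transfer(Account *from, Account *to, int amount)
    {
        Account *first  = ((uintptr_t)from < (uintptr_t)to) ? from : to;
        Account *second = ((uintptr_t)from < (uintptr_t)to) ? to   : from;

        pthread_mutex_lock(&first->lock);
        pthread_mutex_lock(&second->lock);
        from->balance -= amount;    /* withdraw */
        to->balance   += amount;    /* deposit  */
        pthread_mutex_unlock(&second->lock);
        pthread_mutex_unlock(&first->lock);
    }

Even so, the ordering discipline is a whole-program convention that every caller must respect, which is exactly the modularity problem the next slide points out.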
Locks are Bad • Taking too few locks ⇒ simultaneous updates • Taking too many locks ⇒ no concurrency, or deadlock • Taking the wrong locks ⇒ error-prone programming • Taking locks in the wrong order ⇒ error-prone programming • ... • Fundamental problem: no modular programming • Correct implementations of withdraw and deposit do not give a correct implementation of transfer.
Transactional Memory • An alternative to lock-based synchronization • eliminates many problems associated with lock-based synchronization • no deadlock • read sharing • safe modular programming • Hot research area • hardware transactional memory • software transactional memory • C, Java, functional languages, ...
Transactions in Haskell
        transfer :: Account -> Account -> Int -> IO ()
        -- transfer 'amount' from account 'from' to account 'to'
        transfer from to amount =
          atomically (do { deposit to amount
                         ; withdraw from amount })
• atomically act
  • atomicity: the effects become visible to other threads all at once.
  • isolation: the action act does not see any effects from other threads.
CSE-700 Parallel Programming Fall 2007
CSE-700 in a Nutshell • Scope • Parallel computing from the viewpoint of programmers and language designers • We will not talk about hardware for parallel computing • Audience • Anyone interested in learning parallel programming • Prerequisite • C programming • Desire to learn new programming languages
Material • Books • Introduction to Parallel Computing (2nd ed.), Ananth Grama et al. • Parallel Programming with MPI, Peter Pacheco. • Parallel Programming in OpenMP, Rohit Chandra et al. • Any textbook on MPI and OpenMP is fine. • Papers
Teaching Staff • Instructors • Gla • Myson • ... • and YOU! • We will lead this course TOGETHER.
Resources • Plquad • quad-core Linux • OpenMP and MPI already installed • Ask for an account if you need one.
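As a quick sanity check of the installation on plquad, a minimal OpenMP program (a sketch, not part of any assignment; assumes a gcc that accepts -fopenmp):

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        /* On a quad-core machine this typically prints four lines. */
        #pragma omp parallel
        printf("hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
        return 0;
    }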
Basic Plan - First Half • Goal • learn the basics of parallel programming through 5+ assignments on OpenMP and MPI • Each lecture consists of: • discussion on the previous assignment • Each of you is expected to give a presentation. • presentation on OpenMP and MPI by the instructors • discussion on the next assignment
Basic Plan - Second Half • Recent parallel languages • learn a recent parallel language • write a cool program in your parallel language • give a presentation on your experience • Topics in parallel language research • choose a topic • give a presentation on it
What Matters Most? • Spirit of adventure • Proactivity • Desire to provoke Happy Chaos • I want you to develop this course into a total, complete, yet happy chaos. • A truly inspirational course borders almost on chaos.
Impact of Memory Bandwidth [1]
Consider the following code fragment:
    for (i = 0; i < 1000; i++) {
      column_sum[i] = 0.0;
      for (j = 0; j < 1000; j++)
        column_sum[i] += b[j][i];
    }
The code fragment sums the columns of the matrix b into the vector column_sum.
Impact of Memory Bandwidth [2] • The vector column_sum is small and easily fits into the cache. • The matrix b is accessed in column order. • The strided access results in very poor performance. [figure: multiplying a matrix with a vector: (a) multiplying column-by-column, keeping a running sum; (b) computing each element of the result as a dot product of a row of the matrix with the vector]
Impact of Memory Bandwidth [3]
We can fix the above code as follows:
    for (i = 0; i < 1000; i++)
      column_sum[i] = 0.0;
    for (j = 0; j < 1000; j++)
      for (i = 0; i < 1000; i++)
        column_sum[i] += b[j][i];
In this case, the matrix is traversed in row order and performance can be expected to be significantly better.
Lesson • Choosing memory layouts and organizing computation appropriately can have a significant impact on spatial and temporal locality.
Typical Sequential Implementation • A : n x n • B : n x n • C = A * B : n x n
    for i = 1 to n
      for j = 1 to n
        C[i, j] = 0;
        for k = 1 to n
          C[i, j] += A[i, k] * B[k, j];
Using Submatrices • Improves data locality significantly, as sketched below.
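A sketch of what a blocked (submatrix) version might look like in C; the sizes N and BS, and the assumption that BS divides N, are illustrative choices, and BS should be tuned so that three BS x BS blocks fit comfortably in cache:

    #define N  1024
    #define BS 64          /* block (submatrix) size: tune to the cache size */

    void matmul_blocked(double A[N][N], double B[N][N], double C[N][N])
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                C[i][j] = 0.0;

        /* Loop over submatrices; each BS x BS block of A, B, and C is reused
           many times while it is still resident in cache. */
        for (int ii = 0; ii < N; ii += BS)
            for (int kk = 0; kk < N; kk += BS)
                for (int jj = 0; jj < N; jj += BS)
                    for (int i = ii; i < ii + BS; i++)
                        for (int k = kk; k < kk + BS; k++)
                            for (int j = jj; j < jj + BS; j++)
                                C[i][j] += A[i][k] * B[k][j];
    }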
Assignment 1 • Machine • the older, the better. • Myson offers his ancient notebook for you. • Pentium II, 600 MHz • no L1 cache • 64 KB L2 cache • running Linux • Prepare a presentation on your experimental results.