CSE-700 Parallel Programming: Introduction 박성우 POSTECH Sep 6, 2007
Common Features? • ... they all run faster on multi-core CPUs.
Multi-core CPUs • IBM Power4, dual-core, 2000 • Intel reaches thermal wall, 2004 ⇒ no more free lunch! • Intel Xeon, quad-core, 2006 • Sony PlayStation 3 Cell, eight cores enabled, 2006 • Intel, 80 cores, 2011 (prototype finished) (source: Herb Sutter, "Software and the concurrency revolution")
Parallel Programming Models • POSIX threads (API) • OpenMP (API) • HPF (High Performance Fortran) • Cray's Chapel • Nesl • Sun's Fortress • IBM's X10 • ... • and a lot more.
Parallelism • Data parallelism • ability to apply a function in parallel to each element of a collection of data • Thread parallelism • ability to run multiple threads concurrently • Each thread uses its own local state. • Shared memory parallelism
Data Parallelism = Data Separation • [diagram: the elements a1 ... an, an+1 ... an+m, an+m+1 ... an+m+l of a collection are split into contiguous chunks, one chunk per hardware thread (#1, #2, #3)]
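For instance, a minimal C/OpenMP sketch of this idea (OpenMP is covered later in the course); the array size and the element-wise function f are made up for illustration:

    #include <omp.h>

    #define N 1000000

    double f(double x) { return x * x + 1.0; }   /* any pure element-wise function */

    void map_f(const double *a, double *out)
    {
        /* Each iteration is independent, so OpenMP may hand disjoint
           chunks of the index range to different hardware threads. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            out[i] = f(a[i]);
    }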
Data Parallelism in Hardware • GeForce 8800 • 128 stream processors @ 1.3 GHz, 500+ GFLOPS
Data Parallelism in Programming Languages
• Fortress
  • parallelism is the default.
        for i ← 1:m, j ← 1:n do   // 1:n is a generator
          a[i, j] := b[i] c[j]
        end
• Nesl (1990s)
  • supports nested data parallelism
  • the function being applied itself can be parallel.
        {sum(a) : a in [[2, 3], [8, 3, 9], [7]]};
Data Parallel Haskell (DAMP '07) • Haskell + nested data parallelism • flattening (vectorization) • transforms a nested parallel program such that it manipulates only flat arrays. • fusion • eliminate many intermediate arrays • Ex: 10,000x10,000 sparse matrix multiplication with 1 million elements
Thread Parallelism • [diagram: two hardware threads, each with its own local state, exchanging messages over synchronous communication]
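A small C/MPI sketch of the same model (MPI processes rather than hardware threads, but the idea is the same: each party keeps its state local and communicates only through synchronous messages; the values sent are made up):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, local_state;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        local_state = rank * 100;   /* each process owns its own local state */

        if (rank == 0) {
            /* MPI_Ssend is a synchronous send: it completes only after the
               matching receive has started. */
            MPI_Ssend(&local_state, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            int msg;
            MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", msg);
        }

        MPI_Finalize();
        return 0;
    }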
Pure Functional Threads • Purely functional threads can run concurrently. • Effect-free computations can be executed in parallel with any other effect-free computations. • Example: collision detection [diagram: objects A and B move to new positions A' and B']
Manticore (DAMP '07)
• Three layers
  • sequential base language
    • functional language drawn from SML
    • no mutable references and arrays!
  • data-parallel programming
    • implicit: the compiler and runtime system manage thread creation.
    • e.g., parallel arrays of parallel arrays
          [: 2 * n | n in nums where n > 0 :]
          fun mapP f xs = [: f x | x in xs :]
  • concurrent programming
Concurrent Programming in Manticore (DAMP '07) • Based on Concurrent ML • threads and synchronous message passing • Threads do not share mutable state. • in fact, there are no mutable references or arrays • Explicit: • The programmer manages thread creation.
Data Parallelism / Thread Parallelism / Shared Memory Parallelism (Shared-State Concurrency)
Shared Memory Parallelism • [diagram: three hardware threads all reading from and writing to a single shared memory]
Company of Heroes • Interaction of a LOT of objects: • thousands of objects • Each object has its own mutable state. • Each object update affects several other objects. • All objects are updated 30+ times per second. • Problem: • How do we handle simultaneous updates to the same memory location?
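A tiny C/OpenMP sketch of the underlying problem: several threads update the same memory location without coordination, so updates can be lost (the counter is an invented stand-in for any shared object field):

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        long hit_points = 0;               /* shared state, like an object's field */

        #pragma omp parallel for
        for (int i = 0; i < 1000000; i++)
            hit_points += 1;               /* unsynchronized read-modify-write: a data race */

        /* Often prints less than 1000000 because concurrent updates overwrite each other. */
        printf("%ld\n", hit_points);
        return 0;
    }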
Manual Lock-based Synchronization
    pthread_mutex_lock(&mutex);
    mutate_variable();
    pthread_mutex_unlock(&mutex);
• Locks and condition variables ⇒ fundamentally flawed!
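Fleshing the fragment above out into a complete program (a sketch; the shared counter, thread count, and iteration count are invented):

    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;
    static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 1000000; i++) {
            pthread_mutex_lock(&mutex);    /* protect the shared variable */
            counter++;
            pthread_mutex_unlock(&mutex);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("%ld\n", counter);          /* always 2000000 with the lock in place */
        return 0;
    }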
Bank Accounts ("Beautiful Concurrency", Peyton Jones, 2007) • Invariant: atomicity • no thread observes a state in which the money has left one account but has not yet arrived in the other. [diagram: threads #1 ... #n issue transfer requests against accounts A and B held in shared memory]
Bank Accounts using Locks
• In an object-oriented language:
        class Account {
          Int balance;
          synchronized void deposit (Int n) {
            balance = balance + n;
          }
        }
• Code for transfer:
        void transfer (Account from, Account to, Int amount) {
          from.withdraw (amount);
          // an intermediate state!
          to.deposit (amount);
        }
A Quick Fix: Explicit Locking
        void transfer (Account from, Account to, Int amount) {
          from.lock(); to.lock();
          from.withdraw (amount);
          to.deposit (amount);
          from.unlock(); to.unlock();
        }
• Now, the program is prone to deadlock.
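The conventional workaround is to acquire the locks in a globally agreed order. A hedged C sketch of that idea; the Account type, its fields, and the address-based ordering are invented for illustration and are not the course's prescribed solution:

    #include <pthread.h>
    #include <stdint.h>

    typedef struct {
        pthread_mutex_t lock;
        int balance;
    } Account;

    /* Assumes from != to. Locks are always taken in address order, so two
       concurrent transfers between the same pair of accounts cannot deadlock. */
    void transfer(Account *from, Account *to, int amount)
    {
        Account *first  = ((uintptr_t)from < (uintptr_t)to) ? from : to;
        Account *second = ((uintptr_t)from < (uintptr_t)to) ? to   : from;

        pthread_mutex_lock(&first->lock);
        pthread_mutex_lock(&second->lock);
        from->balance -= amount;    /* withdraw */
        to->balance   += amount;    /* deposit  */
        pthread_mutex_unlock(&second->lock);
        pthread_mutex_unlock(&first->lock);
    }

Even so, the ordering discipline is a whole-program convention that every caller must respect, which is exactly the modularity problem the next slide points out.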
Locks are Bad • Taking too few locks ⇒ simultaneous updates • Taking too many locks ⇒ no concurrency, or deadlock • Taking the wrong locks ⇒ error-prone programming • Taking locks in the wrong order ⇒ error-prone programming • ... • Fundamental problem: no modular programming • Correct implementations of withdraw and deposit do not give a correct implementation of transfer.
Transactional Memory • An alternative to lock-based synchronization • eliminates many problems associated with lock-based synchronization • no deadlock • read sharing • safe modular programming • Hot research area • hardware transactional memory • software transactional memory • C, Java, functional languages, ...
Transactions in Haskell
        transfer :: Account -> Account -> Int -> IO ()
        -- transfer 'amount' from account 'from' to account 'to'
        transfer from to amount =
          atomically (do { deposit to amount
                         ; withdraw from amount })
• atomically act
  • atomicity: the effects become visible to other threads all at once.
  • isolation: the action act does not see any effects from other threads.
CSE-700 Parallel Programming Fall 2007
CSE-700 in a Nutshell • Scope • Parallel computing from the viewpoint of programmers and language designers • We will not talk about hardware for parallel computing • Audience • Anyone interested in learning parallel programming • Prerequisite • C programming • Desire to learn new programming languages
Material • Books • Introduction to Parallel Computing (2nd ed.), Ananth Grama et al. • Parallel Programming with MPI, Peter Pacheco. • Parallel Programming in OpenMP, Rohit Chandra et al. • Any textbook on MPI and OpenMP is fine. • Papers
Teaching Staff • Instructors • Gla • Myson • ... • and YOU! • We will lead this course TOGETHER.
Resources • Plquad • quad-core Linux • OpenMP and MPI already installed • Ask for an account if you need one.
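As a quick sanity check of the installation on plquad, a minimal OpenMP program (a sketch, not part of any assignment; assumes a gcc that accepts -fopenmp):

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        /* On a quad-core machine this typically prints four lines. */
        #pragma omp parallel
        printf("hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
        return 0;
    }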
Basic Plan - First Half • Goal • learn the basics of parallel programming through 5+ assignments on OpenMP and MPI • Each lecture consists of: • discussion on the previous assignment • Each of you is expected to give a presentation. • presentation on OpenMP and MPI by the instructors • discussion on the next assignment
Basic Plan - Second Half • Recent parallel languages • learn a recent parallel language • write a cool program in your parallel language • give a presentation on your experience • Topics in parallel language research • choose a topic • give a presentation on it
What Matters Most? • Spirit of adventure • Proactivity • Desire to provoke Happy Chaos • I want you to develop this course into a total, complete, yet happy chaos. • A truly inspirational course borders almost on chaos.
Impact of Memory Bandwidth [1]
Consider the following code fragment:
    for (i = 0; i < 1000; i++) {
      column_sum[i] = 0.0;
      for (j = 0; j < 1000; j++)
        column_sum[i] += b[j][i];
    }
The code fragment sums the columns of the matrix b into the vector column_sum.
Impact of Memory Bandwidth [2] • The vector column_sum is small and easily fits into the cache. • The matrix b is accessed in column order. • The strided access results in very poor performance. [figure: multiplying a matrix with a vector: (a) multiplying column-by-column, keeping a running sum; (b) computing each element of the result as a dot product of a row of the matrix with the vector]
Impact of Memory Bandwidth [3]
We can fix the above code as follows:
    for (i = 0; i < 1000; i++)
      column_sum[i] = 0.0;
    for (j = 0; j < 1000; j++)
      for (i = 0; i < 1000; i++)
        column_sum[i] += b[j][i];
In this case, the matrix is traversed in row order and performance can be expected to be significantly better.
Lesson • Choosing memory layouts and organizing computation appropriately can have a significant impact on spatial and temporal locality.
Typical Sequential Implementation • A : n x n • B : n x n • C = A * B : n x n
    for i = 1 to n
      for j = 1 to n
        C[i, j] = 0;
        for k = 1 to n
          C[i, j] += A[i, k] * B[k, j];
Using Submatrices • Improves data locality significantly, as sketched below.
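A sketch of what a blocked (submatrix) version might look like in C; the sizes N and BS, and the assumption that BS divides N, are illustrative choices, and BS should be tuned so that three BS x BS blocks fit comfortably in cache:

    #define N  1024
    #define BS 64          /* block (submatrix) size: tune to the cache size */

    void matmul_blocked(double A[N][N], double B[N][N], double C[N][N])
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                C[i][j] = 0.0;

        /* Loop over submatrices; each BS x BS block of A, B, and C is reused
           many times while it is still resident in cache. */
        for (int ii = 0; ii < N; ii += BS)
            for (int kk = 0; kk < N; kk += BS)
                for (int jj = 0; jj < N; jj += BS)
                    for (int i = ii; i < ii + BS; i++)
                        for (int k = kk; k < kk + BS; k++)
                            for (int j = jj; j < jj + BS; j++)
                                C[i][j] += A[i][k] * B[k][j];
    }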
Assignment 1 • Machine • the older, the better. • Myson offers his ancient notebook for you. • Pentium II, 600 MHz • no L1 cache • 64 KB L2 cache • running Linux • Prepare a presentation on your experimental results.