(How) Can Programmers Conquer the Multicore Menace? Saman Amarasinghe Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology
Outline • The Multicore Menace • Deterministic Multithreading via Kendo • Algorithmic Choices via PetaBricks • Conquering the Multicore Menace
Today: The Happily Oblivious Average Joe Programmer • Joe is oblivious about the processor • Moore's law brings Joe performance • Sufficient for Joe's requirements • Joe has built a solid boundary between hardware and software • High-level languages abstract away the processors • Ex: Java bytecode is machine independent • This abstraction has provided a lot of freedom for Joe • Parallel programming is practiced only by a few experts
Moore's Law [Chart: number of transistors per chip, from the 8086 (~10,000) through the 286, 386, 486, Pentium, P2, P3, P4, Itanium, and Itanium 2 (~1,000,000,000). From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006; slide from David Patterson]
Uniprocessor Performance (SPECint) [Chart: uniprocessor SPECint performance for the same processor lineage, 8086 through Itanium 2. From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006; slide from David Patterson]
Squandering of Moore's Dividend • 10,000x performance gain in 30 years! (~46% per year) • Where did this performance go? • In the last decade we concentrated on correctness and programmer productivity • Little to no emphasis on performance • This is reflected in: • Languages • Tools • Research • Education • Software engineering: the only engineering discipline where performance or efficiency is not a central theme
Matrix Multiply: An Example of Unchecked Excesses • Abstraction and software engineering: • Immutable types • Dynamic dispatch • Object oriented • High-level languages • Memory management • Performance engineering: • Transpose for unit stride • Tile for cache locality • Vectorization • Prefetching • Parallelization [Chart: speedup over the slowest version at each stage — 220x, 522x, 1,117x, 2,271x, 7,514x, 12,316x, 33,453x, 87,042x, up to 296,260x]
Matrix Multiply: An Example of Unchecked Excesses • Typical software engineering approach: • In Java • Object oriented • Immutable • Abstract types • No memory optimizations • No parallelization • Good performance engineering approach: • In C/assembly • Memory optimized (blocked) • BLAS libraries • Parallelized (to 4 cores) • The gap: 296,260x • In comparison: lowest to highest MPG in transportation [Chart labels: 14,700x, 294,000x]
Joe the Parallel Programmer • Moore's law is not bringing any more performance gains • If Joe needs performance he has to deal with multicores • Joe has to deal with performance • Joe has to deal with parallelism
Why Parallelism is Hard • A huge increase in complexity and work for the programmer • Programmer has to think about performance! • Parallelism has to be designed in at every level • Programmers are trained to think sequentially • Deconstructing problems into parallel tasks is hard for many of us • Parallelism is not easy to implement • Parallelism cannot be abstracted or layered away • Code and data have to be restructured in very different (non-intuitive) ways • Parallel programs are very hard to debug • Combinatorial explosion of possible execution orderings • Race condition and deadlock bugs are non-deterministic and elusive • Non-deterministic bugs go away in the lab environment and with instrumentation
Outline • The Multicore Menace • Deterministic Multithreading via Kendo • Joint work with Marek Olszewski and Jason Ansel • Algorithmic Choices via PetaBricks • Conquering the Multicore Menace
Racing for Lock Acquisition • Two threads • Start at the same time • 1st thread: 1,000 instructions to the lock acquisition • 2nd thread: 1,100 instructions to the lock acquisition [Diagram: instruction count vs. time for the two threads]
Non-Determinism • Inherent in parallel applications • Accesses to shared data can experience many possible interleavings • New! This was not the case for sequential applications • Almost never part of program specifications • Even the simplest parallel programs, e.g. a work queue, are non-deterministic • Non-determinism is undesirable • Hard to create programs with repeatable results • Difficult to perform cyclic debugging • Testing offers weaker guarantees
Deterministic Multithreading • Observation: • Non-determinism need not be a required property of threads • We can interleave thread communication in a deterministic manner • Call this deterministic multithreading • Deterministic multithreading: • Makes debugging easier • Tests offer guarantees again • Supports existing programming models/languages • Allows programmers to "determinize" computations that have previously been difficult to determinize with today's programming idioms • e.g.: Radiosity (Singh et al. 1994), LocusRoute (Rose 1988), and Delaunay triangulation (Kulkarni et al. 2008)
Deterministic Multithreading • Strong Determinism • Deterministic interleaving for all accesses to shared data for a given input • Attractive, but difficult to achieve efficiently without hardware support • Weak Determinism • Deterministic interleaving of all lock acquisitions for a given input • Cheaper to enforce • Offers same guarantees as strong determinism for data-race-free program executions • Can be checked with a dynamic race detector!
Kendo • A Prototype Deterministic Locking Framework • Provides Weak Determinism for C and C++ code • Runs on commodity hardware today! • Implements a subset of the pthreads API • Enforces determinism without sacrificing load balance • Tracks progress of threads to dynamically construct the deterministic interleaving: • Deterministic Logical Time • Incurs low performance overhead (16% geomean on Splash2)
Deterministic Logical Time • Abstract counterpart to physical time • Used to deterministically order events on an SMP machine • Necessary to construct the deterministic interleaving • Represented as P independently updated deterministic logical clocks • Not updated based on the progress of other threads (unlike Lamport clocks) • Event1 (on Thread 1) occurs before Event2 (on Thread 2) in deterministic logical time if: • Thread 1 has a lower deterministic logical clock than Thread 2 at the time of the events
Deterministic Logical Clocks • Requirements • Must be based on events that are deterministically reproducible from run to run • Must track the progress of threads in physical time as closely as possible (for better load balancing of the deterministic interleaving) • Must be cheap to compute • Must be portable across microarchitectures • Must be stored in memory for other threads to observe
Deterministic Logical Clocks • Some x86 performance counter events satisfy many of these requirements • Chose the "Retired Store Instructions" event • Required changes to the Linux kernel • Performance counters are accessible only at kernel level • Added an interrupt service routine • Increments each thread's deterministic logical clock (in memory) on every performance counter overflow • The frequency of overflows can be controlled
Locking Algorithm • Construct a deterministic interleaving of lock acquires from deterministic logical clocks • Simulate the interleaving that would occur if running in deterministic logical time • Uses the concept of a turn • It is a thread's turn when: • All threads with smaller IDs have greater deterministic logical clocks • All threads with larger IDs have greater or equal deterministic logical clocks
Locking Algorithm

function det_mutex_lock(l) {
  pause_logical_clock();
  wait_for_turn();
  lock(l);
  inc_logical_clock();
  enable_logical_clock();
}

function det_mutex_unlock(l) {
  unlock(l);
}
Example [animated slide sequence plotting two threads in physical time vs. deterministic logical time]: Thread 1 and Thread 2 advance their deterministic logical clocks independently, drifting apart in physical time (t=3 vs. t=5, then t=6 vs. t=11, then t=11 vs. t=20 — "it's a race!" in physical time). Both threads eventually call det_lock(a), but with different deterministic logical clocks (t=22 vs. t=25). The thread that is later in deterministic logical time spins in wait_for_turn() while the thread with the lower clock proceeds to lock(): Thread 2 will always acquire the lock first, in every run, no matter how the physical race resolves. The winner increments its clock (t=26), executes the critical section, and calls det_unlock(a); the other thread's turn then arrives and it acquires the lock (the clocks reaching t=28 and t=32 by the end of the sequence).
Locking Algorithm Improvements • Eliminate deadlocks with nested locks • Make a thread increment its deterministic logical clock while it spins on the lock • It must do so deterministically • Queuing for fairness • Lock priority boosting • See the ASPLOS 2009 paper on Kendo for details
Evaluation • Methodology • Converted the Splash2 benchmark suite to use the Kendo framework • Eliminated data races • Checked determinism by examining the output and the final deterministic logical clocks of each thread • Experimental framework • Processor: Intel Core 2 quad-core running at 2.66 GHz • OS: Linux 2.6.23 (modified for performance counter support)
Related Work • DMP – Deterministic Multiprocessing • Hardware design that provides Strong Determinism • StreamIt Language • Streaming programming model only allows one interleaving of inter-thread communication • Cilk Language • Fork/join programming model that can produce programs with semantics that always match a deterministic “serialization” of the code • Cannot be used with locks • Must be data-race free (can be checked with a Cilk race detector)
Outline • The Multicore Menace • Deterministic Multithreading via Kendo • Algorithmic Choices via PetaBricks • Joint work with Jason Ansel, Cy Chan, Yee Lok Wong, Qin Zhao, and Alan Edelman • Conquering the Multicore Menace
Observation 1: Algorithmic Choice • For many problems there are multiple algorithms • In most cases there is no single winner • An algorithm will be the best performing for a given: • Input size • Amount of parallelism • Communication bandwidth / synchronization cost • Data layout • Data itself (sparse data, convergence criteria, etc.) • Multicores expose many of these to the programmer • Exponential growth of cores (impact of Moore's law) • Wide variation of memory systems, types of cores, etc. • No single algorithm can be the best for all cases
Observation 2: Natural Parallelism • The world is a parallel place • It is natural to many, e.g. mathematicians • ∑, sets, simultaneous equations, etc. • It seems that computer scientists have a hard time thinking in parallel • We have unnecessarily imposed sequential ordering on the world • Statements executed in sequence • for i = 1 to n • Recursive decomposition (given f(n), find f(n+1)) • This was useful at one time to limit complexity, but it is a big problem in the era of multicores
Observation 3: Autotuning • The good old days: model-based optimization • Now: • Machines are too complex to model accurately • Compiler passes have many subtle interactions • Thousands of knobs and billions of choices • But… • Computers are cheap • We can do end-to-end execution of multiple runs • Then use machine learning to find the best choice
PetaBricks Language [slide built up over four steps; each rule is illustrated with a diagram of the regions of A, B, and AB that it reads and writes]

transform MatrixMultiply
from A[c,h], B[w,c]
to AB[w,h]
{
  // Base case, compute a single element
  to(AB.cell(x,y) out)
  from(A.row(y) a, B.column(x) b) {
    out = dot(a, b);
  }

  // Recursively decompose in c
  to(AB ab)
  from(A.region(0, 0, c/2, h) a1, A.region(c/2, 0, c, h) a2,
       B.region(0, 0, w, c/2) b1, B.region(0, c/2, w, c) b2) {
    ab = MatrixAdd(MatrixMultiply(a1, b1), MatrixMultiply(a2, b2));
  }

  // Recursively decompose in w
  to(AB.region(0, 0, w/2, h) ab1, AB.region(w/2, 0, w, h) ab2)
  from(A a, B.region(0, 0, w/2, c) b1, B.region(w/2, 0, w, c) b2) {
    ab1 = MatrixMultiply(a, b1);
    ab2 = MatrixMultiply(a, b2);
  }

  // Recursively decompose in h
  to(AB.region(0, 0, w, h/2) ab1, AB.region(0, h/2, w, h) ab2)
  from(A.region(0, 0, c, h/2) a1, A.region(0, h/2, c, h) a2, B b) {
    ab1 = MatrixMultiply(a1, b);
    ab2 = MatrixMultiply(a2, b);
  }
}

• Implicitly parallel description • Algorithmic choice
PetaBricks Compiler Internals [Diagram: PetaBricks source code is split into rule/transform headers and rule bodies; compiler passes lower the rule bodies to a rule-body IR and build per-region choice grids; the choice grids are combined into a choice dependency graph; code generation then emits C++ — sequential leaf code plus parallel, dynamically scheduled runtime code]
Choice Grids

transform RollingSum
from A[n] to B[n]
{
  Rule1: to(B.cell(i) b) from(B.cell(i-1) left, A.cell(i) a) { … }
  Rule2: to(B.cell(i) b) from(A.region(0, i) as) { … }
}

[Choice grid: for input A over 0..n, cell 0 of B can be computed only by Rule2; cells 1..n-1 can be computed by Rule1 or Rule2]