
(How) Can Programmers Conquer the Multicore Menace?

Saman Amarasinghe, Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology. Outline: The Multicore Menace; Deterministic Multithreading via Kendo; Algorithmic Choices via PetaBricks.


Presentation Transcript


  1. (How) Can Programmers Conquer the Multicore Menace? Saman Amarasinghe Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology

  2. Outline • The Multicore Menace • Deterministic Multithreading via Kendo • Algorithmic Choices via PetaBricks • Conquering the Multicore Menace

  3. Today: The Happily Oblivious Average Joe Programmer • Joe is oblivious about the processor • Moore’s law brings Joe performance • Sufficient for Joe’s requirements • Joe has built a solid boundary between hardware and software • High-level languages abstract away the processors • Ex: Java bytecode is machine independent • This abstraction has provided a lot of freedom for Joe • Parallel programming is practiced only by a few experts

  4. Moore’s Law [Figure: number of transistors per processor, log scale from 10,000 to 1,000,000,000; processors plotted: 8086, 286, 386, 486, Pentium, P2, P3, P4, Itanium, Itanium 2. From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006; transistor data from David Patterson]

  5. Uniprocessor Performance (SPECint) [Figure: SPECint performance shown against the transistor-count chart of the Moore’s Law slide. From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006]

  6. Uniprocessor Performance (SPECint) [Figure: next build of the same chart]

  7. Squandering of the Moore’s Dividend • 10,000x performance gain in 30 years! (~46% per year) • Where did this performance go? • Last decade we concentrated on correctness and programmer productivity • Little to no emphasis on performance • This is reflected in: • Languages • Tools • Research • Education • Software Engineering: Only engineering discipline where performance or efficiency is not a central theme

  8. Matrix Multiply: An Example of Unchecked Excesses • Abstraction and Software Engineering • Immutable Types • Dynamic Dispatch • Object Oriented • High Level Languages • Memory Management • Transpose for unit stride • Tile for cache locality • Vectorization • Prefetching • Parallelization [Chart labels, performance ratios: 220x, 522x, 1,117x, 2,271x, 7,514x, 12,316x, 33,453x, 87,042x, 296,260x]

  9. Matrix Multiply: An Example of Unchecked Excesses • Typical Software Engineering Approach • In Java • Object oriented • Immutable • Abstract types • No memory optimizations • No parallelization • Good Performance Engineering Approach • In C/Assembly • Memory optimized (blocked) • BLAS libraries • Parallelized (to 4 cores) • Performance gap between the two approaches: 296,260x • In comparison: lowest to highest MPG in transportation [Chart labels: 14,700x, 294,000x]
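
Two of the memory optimizations named on the matrix multiply slides, transposing for unit stride and tiling for cache locality, can be sketched as a loop restructuring. This is only an illustration of the loop shape: the function names and the `tile` parameter are hypothetical, and pure Python will not reproduce the cache effects or the speedups quoted on the slides.

```python
def matmul_naive(a, b):
    """Textbook triple loop; the inner loop walks down a column of b."""
    n, m, p = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]


def matmul_transposed_tiled(a, b, tile=2):
    """Same product, with b transposed and the loops blocked into tiles."""
    n, m, p = len(a), len(b), len(b[0])
    bt = [list(col) for col in zip(*b)]   # rows of bt are columns of b
    c = [[0] * p for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, p, tile):
            for k0 in range(0, m, tile):
                # Work on one tile at a time so its operands stay "hot".
                for i in range(i0, min(i0 + tile, n)):
                    row_a = a[i]
                    for j in range(j0, min(j0 + tile, p)):
                        row_bt = bt[j]    # unit-stride access after transpose
                        acc = 0
                        for k in range(k0, min(k0 + tile, m)):
                            acc += row_a[k] * row_bt[k]
                        c[i][j] += acc
    return c


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
print(matmul_transposed_tiled(a, b))   # [[58, 64], [139, 154]]
```

Both functions compute the same result; only the order in which memory is touched changes, which is the point of the two optimizations.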

  10. Joe the Parallel Programmer • Moore’s law is no longer bringing performance gains • If Joe needs performance, he has to deal with multicores • Joe has to deal with performance • Joe has to deal with parallelism

  11. Why Parallelism is Hard • A huge increase in complexity and work for the programmer • Programmer has to think about performance! • Parallelism has to be designed in at every level • Programmers are trained to think sequentially • Deconstructing problems into parallel tasks is hard for many of us • Parallelism is not easy to implement • Parallelism cannot be abstracted or layered away • Code and data have to be restructured in very different (non-intuitive) ways • Parallel programs are very hard to debug • Combinatorial explosion of possible execution orderings • Race-condition and deadlock bugs are non-deterministic and elusive • Non-deterministic bugs go away in a lab environment and under instrumentation

  12. Outline • The Multicore Menace • Deterministic Multithreading via Kendo • Joint work with Marek Olszewski and Jason Ansel • Algorithmic Choices via PetaBricks • Conquering the Multicore Menace

  13. Racing for Lock Acquisition • Two threads • Start at the same time • 1st thread: 1000 instructions to the lock acquisition • 2nd thread: 1100 instructions to the lock acquisition [Figure axes: Time vs. Instruction #]

  14. Non-Determinism • Inherent in parallel applications • Accesses to shared data can experience many possible interleavings • New! Was not the case for sequential applications! • Almost never part of program specifications • Even the simplest parallel programs, e.g. a work queue, are non-deterministic • Non-determinism is undesirable • Hard to create programs with repeatable results • Difficult to perform cyclic debugging • Testing offers weaker guarantees
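
To make "many possible interleavings" concrete, the sketch below enumerates every way two threads' steps can interleave and runs each schedule on a shared counter. It is a deterministic simulation rather than real threading, and the two-step read/write model and all names are assumptions chosen for illustration.

```python
from itertools import combinations


def interleavings(a, b):
    """Yield every merge of a and b that keeps each sequence's own order."""
    n = len(a) + len(b)
    for positions in combinations(range(n), len(a)):
        slots = set(positions)            # schedule slots given to thread a
        ia, ib = iter(a), iter(b)
        yield [next(ia) if k in slots else next(ib) for k in range(n)]


def run(schedule):
    """Execute one interleaving of read/write steps on a shared counter."""
    shared = 0
    local = {}                            # each thread's private register
    for thread, op in schedule:
        if op == "read":
            local[thread] = shared
        else:                             # "write": store private value + 1
            shared = local[thread] + 1
    return shared


t1 = [("T1", "read"), ("T1", "write")]
t2 = [("T2", "read"), ("T2", "write")]

schedules = list(interleavings(t1, t2))
outcomes = sorted({run(s) for s in schedules})
print(len(schedules), "interleavings, final counter values:", outcomes)
# 6 interleavings, final counter values: [1, 2]
```

Two threads with two steps each already admit 6 orderings and 2 distinct results; the count grows as a binomial coefficient in the number of steps, which is the combinatorial explosion that makes such bugs hard to reproduce.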

  15. Deterministic Multithreading • Observation: • Non-determinism need not be a required property of threads • We can interleave thread communication in a deterministic manner • Call this Deterministic Multithreading • Deterministic multithreading: • Makes debugging easier • Tests offer guarantees again • Supports existing programming models/languages • Allows programmers to “determinize” computations that have previously been difficult to make deterministic using today’s programming idioms • e.g.: Radiosity (Singh et al. 1994), LocusRoute (Rose 1988), and Delaunay Triangulation (Kulkarni et al. 2008)

  16. Deterministic Multithreading • Strong Determinism • Deterministic interleaving for all accesses to shared data for a given input • Attractive, but difficult to achieve efficiently without hardware support • Weak Determinism • Deterministic interleaving of all lock acquisitions for a given input • Cheaper to enforce • Offers same guarantees as strong determinism for data-race-free program executions • Can be checked with a dynamic race detector!

  17. Kendo • A Prototype Deterministic Locking Framework • Provides Weak Determinism for C and C++ code • Runs on commodity hardware today! • Implements a subset of the pthreads API • Enforces determinism without sacrificing load balance • Tracks progress of threads to dynamically construct the deterministic interleaving: • Deterministic Logical Time • Incurs low performance overhead (16% geomean on Splash2)

  18. Deterministic Logical Time • Abstract counterpart to physical time • Used to deterministically order events on an SMP machine • Necessary to construct the deterministic interleaving • Represented as P independently updated deterministic logical clocks • Not updated based on the progress of other threads (unlike Lamport clocks) • Event1 (on Thread 1) occurs before Event2 (on Thread 2) in Deterministic Logical Time if: • Thread 1 has lower deterministic logical clock than Thread 2 at time of events

  19. Deterministic Logical Clocks • Requirements • Must be based on events that are deterministically reproducible from run to run • Track progress of threads in physical time as closely as possible (for better load balancing of the deterministic interleaving) • Must be cheap to compute • Must be portable over micro-architecture • Must be stored in memory for other threads to observe

  20. Deterministic Logical Clocks • Some x86 performance counter events satisfy many of these requirements • Chose the “Retired Store Instructions” event • Required changes to the Linux kernel • Performance counters are accessible only at kernel level • Added an interrupt service routine • Increments each thread’s deterministic logical clock (in memory) on every performance counter overflow • Frequency of overflows can be controlled
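
The overflow-driven clock update can be modeled in a few lines. This is a user-level simulation only: Kendo does this in a kernel interrupt service routine, and the class name, method name, and `overflow_period` parameter here are hypothetical.

```python
class LogicalClock:
    """Deterministic logical clock that ticks once per counter overflow."""

    def __init__(self, overflow_period):
        self.overflow_period = overflow_period  # retired stores per overflow
        self.pending = 0                        # stores since last overflow
        self.ticks = 0                          # value other threads observe

    def retire_stores(self, count):
        """Account for `count` retired stores; tick once per full period."""
        self.pending += count
        overflows, self.pending = divmod(self.pending, self.overflow_period)
        self.ticks += overflows


coarse = LogicalClock(overflow_period=1000)
fine = LogicalClock(overflow_period=100)
for stores in (250, 900, 40, 1310):     # same store counts on every run
    coarse.retire_stores(stores)
    fine.retire_stores(stores)
print(coarse.ticks, fine.ticks)         # 2 25
```

Because the tick count depends only on how many stores the thread itself retired, it is reproducible from run to run. A shorter period tracks progress more closely at the cost of more frequent interrupts, which is the trade-off behind controlling the overflow frequency.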

  21. Locking Algorithm • Construct a deterministic interleaving of lock acquires from deterministic logical clocks • Simulate the interleaving that would occur if running in deterministic logical time • Uses concept of a turn • It’s a thread’s turn when: • All threads with smaller ID have greater deterministic logical clocks • All threads with larger ID have greater or equal deterministic logical clocks
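
The turn rule above translates directly into a predicate over the per-thread clocks. The sketch below assumes thread IDs are the indices of a list of clock values; `is_turn` is a hypothetical name, not part of Kendo's API.

```python
def is_turn(tid, clocks):
    """Return True if thread `tid` holds the turn for these clock values."""
    mine = clocks[tid]
    for other, value in enumerate(clocks):
        if other < tid and not value > mine:
            return False        # threads with smaller IDs must be strictly ahead
        if other > tid and not value >= mine:
            return False        # threads with larger IDs may tie
    return True


clocks = [25, 22, 22]           # clocks[i] is thread i's logical clock
holders = [tid for tid in range(len(clocks)) if is_turn(tid, clocks)]
print(holders)                  # [1]
```

The rule amounts to picking the smallest (clock, ID) pair, so ties on the clock are broken by thread ID and exactly one thread holds the turn at any moment.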

  22. Locking Algorithm

      function det_mutex_lock(l) {
        pause_logical_clock();
        wait_for_turn();
        lock(l);
        inc_logical_clock();
        enable_logical_clock();
      }

      function det_mutex_unlock(l) {
        unlock(l);
      }
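
The pseudocode can be exercised with ordinary threads. The following is a minimal sketch under stated assumptions, not Kendo itself: logical clocks are advanced by explicit `work()` calls instead of performance counters, a finished thread sets its clock to infinity so it never blocks the others, only a single lock is used (the simple algorithm can deadlock on nested locks), and every name is hypothetical.

```python
import threading
import time

INF = float("inf")


class DetLock:
    """Weakly deterministic mutex driven by per-thread logical clocks."""

    def __init__(self, num_threads):
        self.clocks = [0] * num_threads
        self.mutex = threading.Lock()

    def is_turn(self, tid):
        mine = self.clocks[tid]
        return all(
            (c > mine) if other < tid else (c >= mine)
            for other, c in enumerate(self.clocks) if other != tid
        )

    def lock(self, tid):
        # The clock only moves when its owner calls work(), so it is
        # implicitly paused for the duration of this call.
        while not self.is_turn(tid):      # wait_for_turn()
            time.sleep(0)                 # let other threads run
        self.mutex.acquire()              # lock(l)
        self.clocks[tid] += 1             # inc_logical_clock()

    def unlock(self):
        self.mutex.release()

    def work(self, tid, amount):
        self.clocks[tid] += amount        # deterministic progress

    def finish(self, tid):
        self.clocks[tid] = INF            # finished threads never hold others up


def run_once(plans):
    """Run one thread per plan; return the order of lock acquisitions."""
    det = DetLock(len(plans))
    order = []

    def body(tid):
        for amount in plans[tid]:
            det.work(tid, amount)
            det.lock(tid)
            order.append(tid)             # critical section
            det.unlock()
        det.finish(tid)

    threads = [threading.Thread(target=body, args=(t,))
               for t in range(len(plans))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return order


plans = [[5, 20, 3], [7, 1, 30], [2, 2, 2]]   # work done before each lock
runs = [run_once(plans) for _ in range(5)]
print(runs[0])                                # [2, 0, 2, 1, 2, 1, 0, 0, 1]
```

However the operating system schedules the threads, locks are granted in increasing (logical clock at request, thread ID) order, so every run records the same acquisition sequence.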

  23. Example [Animation frame: Thread 1 and Thread 2 plotted against physical time and deterministic logical time. Labels: t=3, t=5]

  24. Example [Frame labels: t=6, t=11. Annotation: It’s a race!]

  25. Example [Frame labels: t=11, t=20. Annotation: It’s a race!]

  26. Example [Frame labels: t=18, t=25, det_lock(a)]

  27. Example [Frame labels: t=18, t=25, det_lock(a), wait_for_turn()]

  28. Example [Frame labels: det_lock(a), t=22, t=25, det_lock(a), wait_for_turn()]

  29. Example [Frame labels: wait_for_turn(), det_lock(a), t=22, t=25, det_lock(a), wait_for_turn()]

  30. Example [Frame labels: lock(), det_lock(a), t=22, t=25, det_lock(a), wait_for_turn()]

  31. Example [Frame labels: det_lock(a), t=22, t=25, det_lock(a), wait_for_turn(). Annotation: Thread 2 will always acquire the lock first!]

  32. Example [Frame labels: det_lock(a), t=25, det_lock(a), wait_for_turn(), t=26]

  33. Example [Frame labels: det_lock(a), t=25, det_lock(a), lock(a), t=26]

  34. Example [Frame labels: det_lock(a), t=25, det_lock(a), lock(a), det_unlock(a), t=32]

  35. Example [Frame labels: det_lock(a), det_lock(a), t=28, det_unlock(a), t=32]

  36. Locking Algorithm Improvements • Eliminate deadlocks in nested locks • Make a thread increment its deterministic logical clock while it spins on the lock • Must do so deterministically • Queuing for fairness • Lock priority boosting • See the ASPLOS ’09 paper on Kendo for details

  37. Evaluation • Methodology • Converted the Splash2 benchmark suite to use the Kendo framework • Eliminated data races • Checked determinism by examining output and the final deterministic logical clocks of each thread • Experimental Framework • Processor: Intel Core2 Quad-core running at 2.66GHz • OS: Linux 2.6.23 (modified for performance counter support)

  38. Results

  39. Effect of interrupt frequency

  40. Related Work • DMP – Deterministic Multiprocessing • Hardware design that provides Strong Determinism • StreamIt Language • Streaming programming model only allows one interleaving of inter-thread communication • Cilk Language • Fork/join programming model that can produce programs with semantics that always match a deterministic “serialization” of the code • Cannot be used with locks • Must be data-race free (can be checked with a Cilk race detector)

  41. Outline • The Multicore Menace • Deterministic Multithreading via Kendo • Algorithmic Choices via PetaBricks • Joint work with Jason Ansel, Cy Chan, Yee Lok Wong, Qin Zhao, and Alan Edelman • Conquering the Multicore Menace

  42. Observation 1: Algorithmic Choice • For many problems there are multiple algorithms • In most cases there is no single winner • An algorithm will be the best performing for a given: • Input size • Amount of parallelism • Communication bandwidth / synchronization cost • Data layout • Data itself (sparse data, convergence criteria etc.) • Multicores expose many of these to the programmer • Exponential growth of cores (impact of Moore’s law) • Wide variation of memory systems, type of cores etc. • No single algorithm can be the best for all the cases

  43. Observation 2: Natural Parallelism • World is a parallel place • It is natural to many, e.g. mathematicians • ∑, sets, simultaneous equations, etc. • It seems that computer scientists have a hard time thinking in parallel • We have unnecessarily imposed sequential ordering on the world • Statements executed in sequence • for i= 1 to n • Recursive decomposition (given f(n) find f(n+1)) • This was useful at one time to limit the complexity…. But a big problem in the era of multicores

  44. Observation 3: Autotuning • Good old days: model-based optimization • Now • Machines are too complex to accurately model • Compiler passes have many subtle interactions • Thousands of knobs and billions of choices • But… • Computers are cheap • We can do end-to-end execution of multiple runs • Then use machine learning to find the best choice
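
The "run it and measure" idea can be sketched with two candidate sorting algorithms. This sketch swaps the machine-learning search mentioned on the slide for plain exhaustive timing of every candidate, and the candidate set, function names, and parameters are all assumptions made for illustration.

```python
import random
import time


def insertion_sort(xs):
    out = []
    for x in xs:
        i = len(out)
        while i > 0 and out[i - 1] > x:
            i -= 1
        out.insert(i, x)
    return out


def merge_sort(xs):
    if len(xs) <= 1:
        return list(xs)
    mid = len(xs) // 2
    left, right = merge_sort(xs[:mid]), merge_sort(xs[mid:])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    return out + left[i:] + right[j:]


CANDIDATES = {"insertion": insertion_sort, "merge": merge_sort}


def autotune(size, trials=3, seed=0):
    """Time every candidate end to end on sample inputs; return the fastest."""
    rng = random.Random(seed)
    inputs = [[rng.random() for _ in range(size)] for _ in range(trials)]
    timings = {}
    for name, fn in CANDIDATES.items():
        start = time.perf_counter()
        for data in inputs:
            fn(data)
        timings[name] = time.perf_counter() - start
    return min(timings, key=timings.get)


for size in (8, 1000):
    print(size, autotune(size))   # the winner depends on the machine and size
```

The winner is decided by measurement on the target machine rather than by a model of it, so the same program can select different algorithms for different input sizes or hardware.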

  45. PetaBricks Language • Implicitly parallel description

      transform MatrixMultiply
      from A[c,h], B[w,c]
      to AB[w,h]
      {
        // Base case, compute a single element
        to(AB.cell(x,y) out)
        from(A.row(y) a, B.column(x) b) {
          out = dot(a, b);
        }
      }

      [Figure: row y of A (width c, height h) and column x of B (width w, height c) produce cell (x,y) of AB (width w, height h)]

  46. PetaBricks Language • Implicitly parallel description • Algorithmic choice

      transform MatrixMultiply
      from A[c,h], B[w,c]
      to AB[w,h]
      {
        // Base case, compute a single element
        to(AB.cell(x,y) out)
        from(A.row(y) a, B.column(x) b) {
          out = dot(a, b);
        }

        // Recursively decompose in c
        to(AB ab)
        from(A.region(0,   0, c/2, h  ) a1,
             A.region(c/2, 0, c,   h  ) a2,
             B.region(0,   0, w,   c/2) b1,
             B.region(0, c/2, w,   c  ) b2) {
          ab = MatrixAdd(MatrixMultiply(a1, b1),
                         MatrixMultiply(a2, b2));
        }
      }

      [Figure: A split into a1 and a2, B split into b1 and b2, combined into AB]

  47. PetaBricks Language

      transform MatrixMultiply
      from A[c,h], B[w,c]
      to AB[w,h]
      {
        // Base case, compute a single element
        to(AB.cell(x,y) out)
        from(A.row(y) a, B.column(x) b) {
          out = dot(a, b);
        }

        // Recursively decompose in c
        to(AB ab)
        from(A.region(0,   0, c/2, h  ) a1,
             A.region(c/2, 0, c,   h  ) a2,
             B.region(0,   0, w,   c/2) b1,
             B.region(0, c/2, w,   c  ) b2) {
          ab = MatrixAdd(MatrixMultiply(a1, b1),
                         MatrixMultiply(a2, b2));
        }

        // Recursively decompose in w
        to(AB.region(0,   0, w/2, h) ab1,
           AB.region(w/2, 0, w,   h) ab2)
        from(A a,
             B.region(0,   0, w/2, c) b1,
             B.region(w/2, 0, w,   c) b2) {
          ab1 = MatrixMultiply(a, b1);
          ab2 = MatrixMultiply(a, b2);
        }
      }

      [Figure: A as a whole (a), B split into b1 and b2, AB split into ab1 and ab2]

  48. PetaBricks Language

      transform MatrixMultiply
      from A[c,h], B[w,c]
      to AB[w,h]
      {
        // Base case, compute a single element
        to(AB.cell(x,y) out)
        from(A.row(y) a, B.column(x) b) {
          out = dot(a, b);
        }

        // Recursively decompose in c
        to(AB ab)
        from(A.region(0,   0, c/2, h  ) a1,
             A.region(c/2, 0, c,   h  ) a2,
             B.region(0,   0, w,   c/2) b1,
             B.region(0, c/2, w,   c  ) b2) {
          ab = MatrixAdd(MatrixMultiply(a1, b1),
                         MatrixMultiply(a2, b2));
        }

        // Recursively decompose in w
        to(AB.region(0,   0, w/2, h) ab1,
           AB.region(w/2, 0, w,   h) ab2)
        from(A a,
             B.region(0,   0, w/2, c) b1,
             B.region(w/2, 0, w,   c) b2) {
          ab1 = MatrixMultiply(a, b1);
          ab2 = MatrixMultiply(a, b2);
        }

        // Recursively decompose in h
        to(AB.region(0, 0,   w, h/2) ab1,
           AB.region(0, h/2, w, h  ) ab2)
        from(A.region(0, 0,   c, h/2) a1,
             A.region(0, h/2, c, h  ) a2,
             B b) {
          ab1 = MatrixMultiply(a1, b);
          ab2 = MatrixMultiply(a2, b);
        }
      }
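
The four rules of the MatrixMultiply transform can be mirrored in plain Python to show that any mix of choices yields the same product. This is a sketch of the idea, not of the PetaBricks compiler or runtime: matrices are lists of rows (A is h by c, B is c by w), the `choose` callback stands in for the autotuner's decision, and all function names are hypothetical.

```python
import random


def base(a, b):
    """Base rule: each AB cell is the dot product of an A row and a B column."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def add(p, q):
    return [[u + v for u, v in zip(rp, rq)] for rp, rq in zip(p, q)]


def split_c(a, b, mm):
    """Decompose in c: two partial products that are added together."""
    c = len(b)
    a1 = [row[:c // 2] for row in a]
    a2 = [row[c // 2:] for row in a]
    return add(mm(a1, b[:c // 2]), mm(a2, b[c // 2:]))


def split_w(a, b, mm):
    """Decompose in w: left and right column blocks of AB."""
    w = len(b[0])
    left = mm(a, [row[:w // 2] for row in b])
    right = mm(a, [row[w // 2:] for row in b])
    return [l + r for l, r in zip(left, right)]


def split_h(a, b, mm):
    """Decompose in h: top and bottom row blocks of AB."""
    h = len(a)
    return mm(a[:h // 2], b) + mm(a[h // 2:], b)


def multiply(a, b, choose):
    """Apply whichever applicable rule `choose` selects, recursively."""
    h, c, w = len(a), len(b), len(b[0])
    rules = [base]
    if c > 1:
        rules.append(split_c)
    if w > 1:
        rules.append(split_w)
    if h > 1:
        rules.append(split_h)
    rule = choose(rules)
    if rule is base:
        return base(a, b)
    return rule(a, b, lambda x, y: multiply(x, y, choose))


rng = random.Random(1)
A = [[rng.randint(-9, 9) for _ in range(5)] for _ in range(4)]   # h=4, c=5
B = [[rng.randint(-9, 9) for _ in range(3)] for _ in range(5)]   # c=5, w=3
assert multiply(A, B, rng.choice) == base(A, B)
```

Since every rule is a correct way to compute the output, the choice affects only performance, which is what leaves the compiler and autotuner free to pick per input size and machine.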

  49. PetaBricks Compiler Internals [Diagram labels: PetaBricks Source Code; Compiler Passes; Rule/Transform Headers; Rule Bodies; Rule Body IR; ChoiceGrid; Choice Dependency Graph; Code Generation; Sequential Leaf Code; Parallel Dynamically Scheduled Runtime; C++]

  50. Choice Grids

      transform RollingSum
      from A[n]
      to B[n]
      {
        Rule1: to(B.cell(i) b) from(B.cell(i-1) left, A.cell(i) a) { … }
        Rule2: to(B.cell(i) b) from(A.region(0, i) as) { … }
      }

      [Figure: input A spans 0 to n; in output B, cell 0 can use only Rule2, and cells 1 to n can use Rule1 or Rule2]
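
A choice grid records, for each region of the output, which rules are applicable there. The sketch below builds that grid for RollingSum and evaluates it under different rule selections. The slide elides the rule bodies, so the bodies used here are assumptions: Rule1 adds `A[i]` to the previous output cell, and Rule2 sums `A[0]` through `A[i]` inclusive. All Python names are hypothetical.

```python
def choice_grid(n):
    """Return [((lo, hi), applicable_rules)] over half-open cell ranges."""
    if n == 0:
        return []
    grid = [((0, 1), ["Rule2"])]          # cell 0 has no left neighbor
    if n > 1:
        grid.append(((1, n), ["Rule1", "Rule2"]))
    return grid


def rolling_sum(a, pick):
    """Fill B cell by cell, asking `pick` which applicable rule to use."""
    b = [0] * len(a)
    for (lo, hi), rules in choice_grid(len(a)):
        for i in range(lo, hi):           # increasing i satisfies Rule1's dependency
            rule = pick(i, rules)
            if rule == "Rule1":
                b[i] = b[i - 1] + a[i]    # assumed body: left + a
            else:
                b[i] = sum(a[: i + 1])    # assumed body: sum of A[0..i]
    return b


a = [3, 1, 4, 1, 5]
always_rule2 = rolling_sum(a, lambda i, rules: "Rule2")
prefer_rule1 = rolling_sum(a, lambda i, rules: rules[0])
print(choice_grid(len(a)))   # [((0, 1), ['Rule2']), ((1, 5), ['Rule1', 'Rule2'])]
print(prefer_rule1)          # [3, 4, 8, 9, 14]
```

Rule1 does linear total work but forces a sequential sweep, while Rule2 does quadratic work with every cell independent, so the better choice for cells 1 to n depends on input size and the number of cores.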
