Deterministic Execution of Nondeterministic Shared-Memory Programs
Dan Grossman, University of Washington
Dagstuhl Seminar on Design and Validation of Concurrent Systems, August 2009
What if…
What if you could run the same multithreaded program on the same inputs twice and know you would get the same results?
• What exactly does that mean?
• Why might you want that?
• How can we do that (semi-efficiently)?
But first:
• Some background on me and “the talks I’m not giving”
• Key terminology and perspectives
  • More important than technical details at this event
Biography / group names
Me:
• “Programming-languages person”
• Type systems and compilers for a memory-safe C dialect, 2000–2004
• 30% → 80% focus on multithreading, 2005–
• Co-advising 3–4 students with computer architect Luis Ceze, 2007–
Two groups for “marketing purposes”:
• WASP, wasp.cs.washington.edu
• SAMPA, sampa.cs.washington.edu
The talk you won’t see

    void transferFrom(int amt, Acct other) {
      atomic {
        other.withdraw(amt);
        this.deposit(amt);
      }
    }

“Transactions are to shared-memory concurrency as garbage collection is to memory management” [OOPSLA 07]
Semantic problems with nontransactional accesses: worse than locks!
• Fix with stronger guarantees and compiler optimizations [PLDI07]
• Or a static type system, formal semantics, and proof [POPL08]
• Or a more dynamic approach, adapting it to Haskell [submitted]
• …
Prototypes for OCaml, Java, Scheme, and Haskell
This talk…
Take an arbitrary C/C++ program with POSIX threads
• Locks, barriers, condition variables, data races, whatever (see the example below)
Compile it funny, link it against a funny run-time system, and get deterministic behavior
• Well, as deterministic as a sequential C program
Joint work: Luis Ceze, Tom Bergan, Joe Devietti, Owen Anderson
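For concreteness, here is a minimal pthreads program of the kind this targets (my illustration, not from the talk): a racy shared counter whose output varies from run to run under an ordinary implementation, but which the approach described here would make repeatable.

    #include <pthread.h>
    #include <stdio.h>

    static int counter = 0;   /* shared, intentionally unsynchronized: a data race */

    static void *work(void *arg) {
        for (int i = 0; i < 100000; i++)
            counter++;        /* racy read-modify-write */
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, work, NULL);
        pthread_create(&t2, NULL, work, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        /* Under an ordinary implementation this prints a different value on
           different runs; compiled and linked “funny”, the same value appears
           on every run (though we don't know in advance which value). */
        printf("%d\n", counter);
        return 0;
    }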
Terminology
Essential perspectives, not just definitions:
• Parallelism vs. concurrency
  • Or different terms if you prefer
• Sequential semantics vs. determinism vs. nondeterminism
  • What is an input?
• Level of abstraction
  • Which one do you care about?
Concurrency
Working “definition”: Software is concurrent if a primary intellectual challenge is responding to external events from multiple sources in a timely manner.
Examples: operating system, shared hashtable, version control
Key challenge is responsiveness
• Often leads to threads or asynchrony
Correctness usually requires synchronization (e.g., locks)
Parallelism
Working “definition”: Software is parallel if a primary intellectual challenge is using extra computational resources to do more useful work per unit time.
Examples: scientific computing, most graphics, a lot of servers
Key challenge is Amdahl’s Law
• No sequential bottlenecks, no imbalanced load
When pure fork-join isn’t correct, need synchronization
The confusion
• First, this use of the terms isn’t standard
• Many systems are both
  • And it’s really a matter of degree
• Similar lower-level mechanisms, such as threads and locks
  • And similar errors (race conditions, deadlocks, etc.)
• Our work determinizes these lower-level mechanisms, so we determinize concurrent and parallel applications
  • But purely parallel ones probably benefit less
Terminology
Essential perspectives, not just definitions:
• Parallelism vs. concurrency
  • Or different terms if you prefer
• Sequential semantics vs. determinism vs. nondeterminism
  • What is an input?
• Level of abstraction
  • Which one do you care about?
Sequential semantics
• Some languages can have results defined purely sequentially, but are designed to have better parallel-performance guarantees (thanks to a cost model)
• Examples: DPJ, Cilk, NESL, …
• For correctness, reason sequentially
• For performance, reason in parallel
• Really designed for parallelism, not concurrency
• Not our work
Sequential isn’t always deterministic
[Surprisingly easy to forget this]

    int f1() { print("A"); print("B"); return 0; }
    int f2() { print("C"); print("D"); return 0; }
    int g()  { return f1() + f2(); }

Must g() print ABCD?
• Java: yes
• C/C++: no; CDAB is allowed, but not ACBD, ACDB, etc.
Another example
Dijkstra’s guarded-command conditionals:

    if x % 2 == 1 -> y := x - 1
    [] x < 10     -> y := 7
    [] x >= 10    -> y := 0
    fi

We might still expect a particular language implementation (compiler) to be deterministic
• May choose any deterministic result consistent with the nondeterministic semantics
• Presumably doesn’t change its choice across executions, but may across compiles (including “butterfly effects”)
• Our work does this
Why helpful?
So the programmer gets a deterministic executable, but doesn’t know which one
• Key degree of freedom the implementation can exploit for performance
Still helpful for:
• Whole-program testing and debugging
• Automated replicas
• In general, repeatability and reducing the space of possible executions
Define deterministic, part 1
Deterministic: “outputs depend only on inputs”
• That’s right, but it means we must clearly specify what is an input (and an output)
• You can define away anything you want
  • Example: if all syscall results are inputs, then seeding the pseudorandom number generator with the time of day is “deterministic” (sketched below)
• We mean what you think we mean
  • Inputs: command line, I/O, syscalls
  • Not inputs: cache state, hardware timing, thread scheduler
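A small sketch of the time-of-day example (my illustration): if the syscall result counts as an input, everything below is a deterministic function of the inputs, even though two runs a second apart print different numbers. This is why the choice of what counts as an input matters.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void) {
        /* time() is a syscall; if its result is declared an "input", then the
           PRNG seed, and hence the output, "depends only on inputs". */
        srand((unsigned) time(NULL));
        printf("%d\n", rand());
        return 0;
    }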
Terminology
Essential perspectives, not just definitions:
• Parallelism vs. concurrency
  • Or different terms if you prefer
• Sequential semantics vs. determinism vs. nondeterminism
  • What is an input?
• Level of abstraction
  • Which one do you care about?
Define deterministic, part 2
“Is it deterministic?” depends crucially on your abstraction level
• Another obvious, easy-to-forget thing
Examples:
• File systems
• Memory allocation (Java vs. C)
• Set implemented as a list
• Quantum mechanics
Our work:
• The “language level”: state of logical memory, program output
• Application may care only about a higher level (future work)
Okay… how?
Trade-off between complexity and performance
[diagram: a spectrum from low complexity to high performance]
Performance:
• Overhead (single-thread slowdown)
• Scalability (minimize extra synchronization and waiting)
Starting serial
Determinization is easy!
• Run one thread at a time in round-robin order
• Context-switch after N basic blocks for a deterministic N
  • Cannot use a timer; use the compiler and run-time instead (see the sketch below)
• Races in the source program are irrelevant; locks are still respected
[diagram: three threads T1, T2, T3; each executes one quantum of loads and stores (load A, store B, store C, …) in turn; one pass over all threads is one round]
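A sketch of what “compile it funny” might insert, under my assumptions about the instrumentation (the names det_quantum_left and det_end_quantum are hypothetical, not the system’s real API): the compiler prepends a counter decrement to each basic block, and the run-time ends the quantum deterministically when the budget runs out.

    /* Hypothetical run-time interface; names are illustrative. */
    extern __thread long det_quantum_left;   /* per-thread budget, reset each quantum */
    void det_end_quantum(void);              /* deterministic context switch */

    /* The compiler rewrites each basic block to begin with: */
    static inline void det_block_prologue(long cost) {
        det_quantum_left -= cost;            /* cost: rough instruction count of the block */
        if (det_quantum_left <= 0)
            det_end_quantum();               /* switch threads at a deterministic point */
    }

Because the counter depends only on the instructions executed, not on wall-clock time, every run switches threads at exactly the same program points.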
Parallel quanta
• The quanta in a round can start to run in parallel, provided they stop before any communication occurs (see how next)
• So each round has two stages, parallel then serial (see the round sketch below)
[diagram: threads T1, T2, T3; the parallel stage ends with a global barrier, then the serial stage runs; when it ends, the next round starts]
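One plausible shape for the round loop, as a sketch (the barrier and token-passing details are my assumptions, not the paper’s actual code):

    /* Sketch of one round of deterministic execution, per thread.
       All names here are illustrative, not the system's real API. */
    extern void run_parallel_stage(void);  /* stops before any communication */
    extern void run_serial_stage(void);    /* performs the communicating accesses */
    extern void global_barrier(void);      /* all-threads barrier */

    void det_round(int my_tid, int nthreads) {
        run_parallel_stage();        /* runs until this thread would communicate,
                                        or its quantum budget is exhausted */
        global_barrier();            /* everyone reaches the end of the parallel stage */

        /* Serial stage: threads take turns in a fixed order, so communicating
           operations happen one thread at a time, deterministically. */
        for (int t = 0; t < nthreads; t++) {
            if (t == my_tid)
                run_serial_stage();
            global_barrier();        /* hand the turn to the next thread */
        }
    }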
Is that legal?
• Can produce a different result than the serial execution
• In fact, the execution is not necessarily equivalent to any serialization of the quanta
But it doesn’t matter as long as we are deterministic! Just need:
• Parallel stages do no communication
• Parallel stages end at deterministic points
Performance
Keys to scalability:
1. Run almost everything in the parallel stage
2. Keep quanta balanced
  • Assuming (1), use rough instruction costs
Memory ownership
To avoid communication during the parallel stage:
• Every memory location is “shared” or “owned by one thread T”
• A dynamic table is checked and updated during execution (see the sketch below)
• A thread can read only memory that is shared or owned by it
• A thread can write only memory it owns
• Locks: just like memory locations, plus blocking ends the quantum
In our example, perhaps A is shared while B and C are owned by T2.
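A sketch of the per-access check the compiler could insert (the table layout and all names are assumptions for illustration):

    #include <stdint.h>

    #define SHARED (-1)                        /* sentinel: readable by everyone */

    /* Hypothetical ownership table: one entry per granule of memory. */
    extern int16_t owner_of(const void *addr); /* SHARED or a thread id */
    extern void det_end_quantum(void);         /* stop; finish the access in the serial stage */

    /* Inserted before each load: */
    static inline void check_load(const void *addr, int my_tid) {
        int16_t o = owner_of(addr);
        if (o != SHARED && o != my_tid)
            det_end_quantum();   /* reading another thread's data = communication */
    }

    /* Inserted before each store: */
    static inline void check_store(const void *addr, int my_tid) {
        if (owner_of(addr) != my_tid)
            det_end_quantum();   /* writing shared or foreign data = communication */
    }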
Changing ownership
Policy, for each location (any deterministic granularity is correct):
• The first owner is the first thread to allocate the location
• On a read in the serial stage, if owned by another thread, set it to shared
• On a write in the serial stage, set it to owned-by-self (sketched below)
Correctness:
• Ownership is immutable in parallel stages (so no communication)
• Serial-stage changes are deterministic
So many, many policies are correct
• We chose the obvious one for temporal locality plus read-sharing
• Must have good locality for scalability!
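Continuing the ownership sketch above (same hypothetical owner_of/SHARED helpers), the serial-stage transitions under this policy might look like:

    extern void set_owner(const void *addr, int16_t who);  /* updates the table */

    /* Serial stage only: ownership may change here, deterministically,
       because threads run one at a time in a fixed order. */
    static inline void serial_load(const void *addr, int my_tid) {
        int16_t o = owner_of(addr);
        if (o != SHARED && o != my_tid)
            set_owner(addr, SHARED);      /* read of foreign data: demote to shared */
    }

    static inline void serial_store(const void *addr, int my_tid) {
        if (owner_of(addr) != my_tid)
            set_owner(addr, my_tid);      /* write: take exclusive ownership */
    }

Since every policy choice happens at a deterministic point in a fixed thread order, any such policy preserves determinism; this one just favors locality.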
Overhead
Significant overhead:
• All reads and writes consult the ownership information
• All basic blocks subtract from a thread-local quantum counter
Reduce it via:
• Lots of run-time engineering and data structures (not too much magic, but most important)
• Obvious compiler optimizations, like escape analysis and hoisting counter subtractions
• Specialized compiler optimizations, like the Subsequent Access Optimization: don’t recheck the same ownership unless a quantum boundary might intervene (illustrated below)
  • Correctness of this is a subtle argument, and it slightly affects the ownership-change policy (deterministically!)
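An illustration of what the Subsequent Access Optimization might do, in before/after form, reusing the hypothetical check_load/check_store helpers from the earlier sketch (this is my reading of the slide, not the compiler’s actual output):

    /* Before the optimization: every access to p is checked. */
    int bump(int *p, int tid) {
        check_load(p, tid);
        int a = *p;
        check_store(p, tid);   /* redundant if no quantum boundary can intervene */
        *p = a + 1;
        return a;
    }

    /* After: one (stronger) check covers both accesses, because ownership
       cannot change without an intervening quantum boundary. */
    int bump_optimized(int *p, int tid) {
        check_store(p, tid);   /* write permission implies read permission */
        int a = *p;
        *p = a + 1;
        return a;
    }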
Brittle
Change any line of code, command-line argument, environment variable, etc., and you can get a different deterministic program.
We are mostly robust to memory-safety errors, except:
• Bounds errors that corrupt the ownership information
• Bounds errors that write to another thread’s allegedly thread-local data
Results
Overhead: varies a lot, but about 3x at 8 threads
Scalability: varies a lot, but on average over the PARSEC suite (*):
• nondet 8 threads vs. nondet 2 threads: 2.4 (linear would be 4)
• det 8 threads vs. det 2 threads: 2.0
• det 8 threads vs. nondet 2 threads: 0.91 (range 0.41–2.75)
“How do you want to spend Moore’s Dividend?”
(*) Runnable subset: no MPI, no C++ exceptions, no 32-bit assumptions
Buffering
Actually, ownership is only one approach.
A second approach relies on buffering and a commit stage (sketched below):
• Even higher overhead (to consult the buffers)
• Even better scalability (block only for synchronization and commits)
And a third, hybrid approach.
Hopefully more details soon.
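For intuition only, a buffered-store sketch (entirely my illustration; the talk does not give the design): each thread writes into a private buffer during the parallel stage, and the buffers are committed in a fixed thread order, so the final memory state is the same on every run.

    #include <stdint.h>

    /* Illustrative only: per-thread write buffer, committed deterministically. */
    typedef struct { void *addr; uint64_t val; } buffered_write_t;

    extern __thread buffered_write_t wbuf[4096];
    extern __thread int wbuf_len;

    static inline void buffered_store(void *addr, uint64_t val) {
        wbuf[wbuf_len].addr = addr;            /* defer the store */
        wbuf[wbuf_len].val  = val;
        wbuf_len++;
        /* Loads must first consult wbuf for the latest value (omitted here). */
    }

    /* Commit stage: threads drain their buffers in thread-id order. */
    void commit_buffer(void) {
        for (int i = 0; i < wbuf_len; i++)
            *(uint64_t *)wbuf[i].addr = wbuf[i].val;
        wbuf_len = 0;
    }

This explains the trade-off on the slide: every load pays to consult the buffer (higher overhead), but threads never block on ownership during the parallel stage (better scalability).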
Conclusion
The fundamental assumption that nondeterministic shared-memory programs must be run nondeterministically is false.
A fun problem to throw principled compiler and run-time optimizations at.
Could dramatically change how we test and debug parallel and concurrent programs.
Most-related work:
• Kendo from MIT: done concurrently (in parallel?), requires knowing about data races statically, different approach
• Colleagues in ASPLOS09: hardware support for ownership
• Record-and-replay systems: we can replay without the record