Deterministic Multiprocessing
Chris Fallin, David Lewis, Zongwei Zhou
What is Deterministic MP?
• A multiprocessor executes multiple threads
• Threads share resources (e.g., memory)
• Because of bus arbiters, memory controllers, etc., some orderings of accesses to shared resources are undefined
• This is a problem for debugging (reproducibility) and for thorough testing (many possible cases)
• Deterministic: same input → same output
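To make the "undefined orderings" point concrete, here is a minimal sketch that enumerates every interleaving of two threads incrementing a shared counter. The load/store split and the helper names (`interleavings`, `run`) are illustrative assumptions, not part of the original slides.

```python
from itertools import combinations

def interleavings(a, b):
    """Yield every merge of sequences a and b that preserves each one's order."""
    n = len(a) + len(b)
    for positions in combinations(range(n), len(a)):
        slots = set(positions)
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in slots else next(ib) for i in range(n)]

def run(schedule):
    """Execute a schedule of (thread id, op) steps on one shared counter."""
    shared = 0
    local = {}
    for tid, op in schedule:
        if op == "load":
            local[tid] = shared          # copy shared value into a private register
        else:                            # "store": write back private copy + 1
            shared = local[tid] + 1
    return shared

t0 = [(0, "load"), (0, "store")]
t1 = [(1, "load"), (1, "store")]
outcomes = {run(s) for s in interleavings(t0, t1)}
print(sorted(outcomes))  # [1, 2]
```

The same program and input yield two different results depending only on the interleaving, which is the behavior deterministic execution is meant to rule out.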
Types of Determinism
• Strong: same input → same output, regardless of race conditions
  - Must capture all communicating memory access pairs
• Weak: same input → same output, as long as locking is correct
  - Takes advantage of locks for low SW overhead
Types of Deterministic Execution
• Record/Replay: HW/SW keeps a log of program input
  - Single-program: system calls, memory interleavings
  - Full-system: interrupts, I/O, etc.
  - The log allows later replay of a bug
  - However, executions outside of replay may still differ from one another
• Full-time
  - Ordering of memory accesses follows a statically defined deterministic order: for the same program and the same input, the output is always the same
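The record/replay idea can be sketched with a toy scheduler: during recording, each scheduling decision is logged; during replay, the log drives the scheduler instead. Threads are modelled as Python generators, and all names here (`worker`, `execute`, `record`, `replay`) are hypothetical.

```python
import random

def worker(tid, shared, steps):
    """A thread modelled as a generator; each yield is a preemption point."""
    for _ in range(steps):
        shared.append(tid)               # visible effect on shared state
        yield

def execute(pick_next, n_threads=3, steps=2):
    """Run all threads to completion; pick_next chooses among runnable ids."""
    shared = []
    threads = {t: worker(t, shared, steps) for t in range(n_threads)}
    while threads:
        tid = pick_next(sorted(threads))
        try:
            next(threads[tid])
        except StopIteration:
            del threads[tid]             # thread finished
    return shared

def record(seed=None):
    """Run with a random scheduler and log every scheduling decision."""
    rng = random.Random(seed)
    log = []
    def pick(runnable):
        tid = rng.choice(runnable)
        log.append(tid)
        return tid
    return execute(pick), log

def replay(log):
    """Re-run the program, taking scheduling decisions from the log."""
    decisions = iter(log)
    return execute(lambda runnable: next(decisions))

output, log = record()
assert replay(log) == output             # replay reproduces the recorded run
```

Two separate calls to `record()` may still produce different outputs, which matches the slide's caveat that determinism holds only during replay.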
DMP: Deterministic Shared Memory Multiprocessing Devietti, Lucia, Ceze, Oskin
Central Idea
• To guarantee deterministic behavior:
  - The direct way is to preserve the same global interleaving of instructions in every execution of a parallel program
  - This is unnecessary and has a significant performance impact
• Insight: only communicating pairs matter
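The "direct way" can be illustrated with a small simulation: threads run one at a time in a fixed order, each executing a fixed-size quantum of instructions before passing a token to the next. This is a sketch under assumptions; the generator-based threads and the names `thread` and `run_serialized` are illustrative only.

```python
def thread(tid, n_instr, trace):
    """A thread that executes n_instr instructions, one per step."""
    for i in range(n_instr):
        trace.append((tid, i))           # "execute" instruction i of thread tid
        yield

def run_serialized(lengths, quantum):
    """Run threads in a fixed round-robin order, `quantum` instructions each."""
    trace = []
    threads = [thread(t, n, trace) for t, n in enumerate(lengths)]
    live = list(range(len(threads)))
    while live:
        for tid in list(live):           # the token passes in fixed thread order
            for _ in range(quantum):
                try:
                    next(threads[tid])
                except StopIteration:
                    live.remove(tid)     # finished threads drop out of the order
                    break
    return trace

print(run_serialized([3, 2], quantum=2))
# [(0, 0), (0, 1), (1, 0), (1, 1), (0, 2)]
```

The global interleaving is identical on every run, but no two threads ever execute at the same time, which is the performance cost the slide refers to.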
Improve a bit...
• Not all memory accesses communicate
• Can parallelize the communication-free portion of each quantum
• Need to know when communication happens!
• The MESI cache coherence protocol provides this for free
• DMP Sharing Table
  - Tracks information about memory ownership
  - Two ownership change possibilities:
    - reading data owned by others
    - writing data to shared memory
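A sharing table of this kind might be sketched as a map from address to owner. The policy below is an assumption that loosely follows the two ownership-change cases listed above (untracked addresses are treated as shared, a write makes the writer the owner, a remote read makes the data shared); the class and method names are hypothetical.

```python
SHARED = "shared"

class SharingTable:
    """Toy ownership table: each address is private to one thread or shared."""

    def __init__(self):
        self.owner = {}                  # address -> thread id or SHARED

    def needs_serial(self, tid, addr, is_write):
        """True if this access may communicate and must wait for serial mode."""
        owner = self.owner.get(addr, SHARED)   # assumption: untracked => shared
        if owner == tid:
            return False                 # private data: free to read and write
        if owner == SHARED:
            return is_write              # shared data: reads free, writes not
        return True                      # data owned by another thread

    def update(self, tid, addr, is_write):
        """Apply the ownership change caused by a completed access."""
        owner = self.owner.get(addr, SHARED)
        if is_write:
            self.owner[addr] = tid       # writer becomes the private owner
        elif owner != tid:
            self.owner[addr] = SHARED    # reading others' data makes it shared

table = SharingTable()
table.update(0, "x", is_write=True)                  # thread 0 now owns x
print(table.needs_serial(0, "x", is_write=True))     # False
print(table.needs_serial(1, "x", is_write=False))    # True
```

Accesses for which `needs_serial` returns False can proceed in parallel within a quantum; the others are the communicating accesses that must be ordered deterministically.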
Improve a bit more...
• Transactional Memory + deterministic commit order
• TM: atomicity and isolation of each quantum
• Speculation: find quanta not involved in communication
• If communication happens, squash + re-execute
• Potential optimization:
  - Forward uncommitted (speculative) data between quanta
  - Could save a large number of squashes
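The speculate, commit-in-order, squash-on-conflict cycle can be sketched in software. Here each quantum is modelled as a function that takes a `read` callback and returns a dictionary of buffered writes; this modelling and the names `execute` and `run_round` are assumptions for illustration, not the hardware mechanism.

```python
def execute(quantum, memory):
    """Run one quantum against `memory`, tracking its read set."""
    reads = set()
    def read(addr):
        reads.add(addr)
        return memory.get(addr, 0)
    writes = quantum(read)               # writes stay buffered until commit
    return reads, writes

def run_round(memory, quanta):
    """Speculate all quanta on a snapshot, then commit in list order."""
    snapshot = dict(memory)
    speculative = [execute(q, snapshot) for q in quanta]
    committed = set()                    # addresses written so far this round
    squashes = 0
    for q, (reads, writes) in zip(quanta, speculative):
        if reads & committed:            # read a value an earlier quantum changed
            squashes += 1
            reads, writes = execute(q, memory)   # squash and re-execute
        memory.update(writes)            # deterministic commit
        committed |= set(writes)
    return squashes

mem = {"x": 1}
quanta = [
    lambda read: {"x": read("x") + 1},   # communicates with the next quantum
    lambda read: {"y": read("x") * 10},  # reads x, so it gets squashed
    lambda read: {"z": 7},               # independent, commits speculatively
]
print(run_round(mem, quanta), mem)       # 1 {'x': 2, 'y': 20, 'z': 7}
```

The final memory state equals what serial execution in commit order would produce, while only the quantum that actually communicated pays for re-execution.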
Discussion
• Speculation
  - Similar idea to TLS (thread-level speculation), but used for the opposite purpose
  - Requires complex hardware
  - I/O and parts of the OS cannot execute speculatively
• Dealing with nondeterminism
  - Threads can use the OS to communicate
  - Nondeterministic OS API calls, e.g. read
• Better way of token-passing?
Kendo: Efficient Deterministic Multithreading in Software Olszewski, Ansel, Amarasinghe
Definitions
• Strong Determinism
  - Deterministic order of memory accesses to shared data for a particular program input
  - ALWAYS produces the same output for every run with a particular input
  - Not easily provided without hardware support
• Weak Determinism
  - Deterministic order of lock acquisitions for a given program input
  - Produces the same output for every run if the program is race-free
  - Can be guaranteed if all accesses to shared data are protected by locks
• If there are no data races, strong and weak determinism provide the same guarantees!
Introducing Kendo
• Software framework to enforce weak determinism of general lock-based C/C++ code on commodity shared-memory multiprocessors
  - No special hardware necessary!
• Deterministic Logical Time
  - Each thread has its own monotonically increasing deterministic logical clock
  - How to implement? Performance counter events?
• When is it thread T's turn to use a lock?
  - All threads with tid < T have greater logical clocks
  - All threads with tid > T have greater or equal logical clocks
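The turn rule above amounts to "smallest logical clock wins, ties broken by lowest thread id". A minimal sketch of that predicate follows; the function name `is_turn` and the list-of-clocks representation are assumptions.

```python
def is_turn(tid, clocks):
    """Return True if thread `tid` holds the turn under the rule above.

    clocks[i] is the deterministic logical clock of thread i.
    """
    mine = clocks[tid]
    return all(
        (clock > mine) if other < tid else (clock >= mine)
        for other, clock in enumerate(clocks)
        if other != tid
    )

clocks = [5, 3, 3]
print([is_turn(t, clocks) for t in range(3)])  # [False, True, False]
```

Because ties are broken by thread id, exactly one thread holds the turn for any assignment of clocks.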
Simple Locking Mechanism

    function det_mutex_lock(l) {
        pause_logical_clock();
        wait_for_turn();
        lock(l);
        inc_logical_clock();
        resume_logical_clock();
    }

    function det_mutex_unlock(l) {
        unlock(l);
    }

• Simple algorithm for implementing locks
  - Pause the logical clock during acquisition and wait for the turn to access the lock (using the turn rule from the previous slide)
  - Once in the critical section, resume the clock and continue
• Pros:
  - Easy to implement
• Problems?
Improved Lock

    function det_mutex_lock(l) {
        pause_logical_clock();
        while (true) {                  // Loop until we have successfully acquired the lock
            wait_for_turn();            // Wait for our deterministic logical clock to be the unique global minimum
            if (try_lock(l)) {          // Check the state of the lock, acquiring it if it is free
                if (l.released_logical_time >= get_logical_clock()) {
                    // Lock is free in physical time, but still acquired in
                    // deterministic logical time, so we cannot acquire it yet
                    unlock(l);          // Release the lock
                } else {
                    // Lock is free in both physical and deterministic logical
                    // time, so it is safe to exit the spin loop
                    break;
                }
            }
            inc_logical_clock();        // Increment our deterministic logical clock and start over
        }
        inc_logical_clock();            // Increment our deterministic logical clock before exiting
        resume_logical_clock();
    }

    function det_mutex_unlock(l) {
        pause_logical_clock();
        l.released_logical_time = get_logical_clock();
        unlock(l);
        inc_logical_clock();
        resume_logical_clock();
    }
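As a rough illustration, the improved lock can be ported to Python threads. This is a sketch under several assumptions: logical clocks advance only through explicit `tick` calls (standing in for performance counter events), so pause/resume is omitted; a single lock is modelled; and `retire`, which moves a finished thread's clock to infinity so it never blocks the others, is a hypothetical helper not shown on the slides.

```python
import threading

class Kendo:
    def __init__(self, n_threads):
        self.clocks = [0] * n_threads        # deterministic logical clocks
        self.cv = threading.Condition()      # guards and signals clock updates
        self.mutex = threading.Lock()        # the lock being made deterministic
        self.released_logical_time = -1

    def tick(self, tid, amount=1):
        with self.cv:
            self.clocks[tid] += amount
            self.cv.notify_all()

    def _is_turn(self, tid):
        mine = self.clocks[tid]
        return all(c > mine if other < tid else c >= mine
                   for other, c in enumerate(self.clocks) if other != tid)

    def wait_for_turn(self, tid):
        with self.cv:
            self.cv.wait_for(lambda: self._is_turn(tid))

    def lock(self, tid):
        while True:
            self.wait_for_turn(tid)
            if self.mutex.acquire(blocking=False):
                if self.released_logical_time >= self.clocks[tid]:
                    self.mutex.release()     # still held in logical time
                else:
                    break                    # free in physical and logical time
            self.tick(tid)                   # advance and try again
        self.tick(tid)

    def unlock(self, tid):
        self.released_logical_time = self.clocks[tid]
        self.mutex.release()
        self.tick(tid)

    def retire(self, tid):
        self.tick(tid, float("inf"))         # finished threads never hold the turn

def run_once(n_threads=3, rounds=4):
    kendo, order = Kendo(n_threads), []
    def body(tid):
        for i in range(rounds):
            kendo.tick(tid, 1 + (tid * 7 + i * 3) % 5)   # deterministic "work"
            kendo.lock(tid)
            order.append(tid)                # critical section
            kendo.unlock(tid)
        kendo.retire(tid)
    threads = [threading.Thread(target=body, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return order                             # lock acquisition order
```

Calling `run_once()` repeatedly should return the same acquisition order each time, because the order depends only on the logical clocks and not on how the OS happens to schedule the threads.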
Optimizations
• Queuing
  - A queue for each lock guarantees first-come, first-served ordering
• Fast-forwarding
  - While waiting for a lock, a thread can set its logical time to lock.released_logical_time (or +1 if queuing)
• Lazy reads
  - If the application can tolerate reading out-of-date shared data, there is no need to lock on read (e.g. finding a "best" value)
  - Provide a read window (in logical time); once all threads are past the earliest allowable logical time, the read can succeed
Capo: A Software-Hardware Interface for Practical Deterministic Multiprocessor Replay
Montesinos, Hicks, King, Torrellas
Capo: Motivation
• Record/replay system for debugging
  - Not intended to be deployed in the field
• Builds on DeLorean [1]
  - Chunk-based record/replay system
  - Terminate chunks at communicating pairs, record chunk commit order only
  - Only half the story
• Capo adds the software side as a Linux implementation:
  - Record syscall results
  - Provide infrastructure to record/replay multiple programs and multiplex hardware record/replay features
[1] P. Montesinos, L. Ceze, and J. Torrellas, “DeLorean: Recording and Deterministically Replaying Shared-Memory Multiprocessor Execution Efficiently,” in ISCA, June 2008.
Capo's Contributions
• Replay Spheres: distinct realms of record/replay
• Defining the hardware-software interface
• Simulated DeLorean hardware (chunk-based recording)
• Linux kernel modifications
Capo Architecture
• Replay Sphere: set of R-threads; isolated environment
  - An arbitrary set of processes is inside a sphere
• Replay Sphere Mgr: multiplexes HW support over spheres
• HW: records chunk commit order (DeLorean)
• SW: records system calls
• OS is not inside the sphere, except copy_to_user()
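The software half of this design, logging system call results at the sphere boundary and feeding them back during replay, can be sketched in user space. The `ReplaySphere` class and its `syscall` wrapper are hypothetical names for illustration; Capo itself does this inside the Linux kernel.

```python
import os

class ReplaySphere:
    """Toy sphere boundary: records results crossing it, or replays them."""

    def __init__(self, log=None):
        self.replaying = log is not None
        self.log = list(log) if log is not None else []
        self._pos = 0

    def syscall(self, fn, *args):
        if self.replaying:
            result = self.log[self._pos]     # return the recorded result
            self._pos += 1
            return result
        result = fn(*args)                   # really perform the call
        self.log.append(result)              # and record what it returned
        return result

def program(sphere):
    """Code inside the sphere; all nondeterministic inputs go through it."""
    nonce = sphere.syscall(os.urandom, 4)
    pid = sphere.syscall(os.getpid)
    return f"{nonce.hex()}:{pid % 2}"

recording = ReplaySphere()
original = program(recording)
replayed = program(ReplaySphere(recording.log))
assert replayed == original                  # same inputs, same output
```

Only the results are logged, not the kernel's internal behavior, which mirrors the slide's point that the OS itself stays outside the sphere.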
Performance
[Figure panels: Record, Replay]
Summary (Devietti et al)
• Systems compared: Capo (record/replay), Kendo, DMP
• Helps with…: debugging, testing, replicas, deployment
• Needs HW: Capo (record/replay): usually; Kendo: no; DMP: yes
Discussion
• Which is more useful: record/replay or full-time?
  - Debugging only vs. system design philosophy
  - Tradeoff: cost (log size, overhead) vs. utility
• Strong vs. weak determinism
  - Race conditions are an important class of bugs