Kendo: Efficient Deterministic Multithreading in Software

M. Olszewski, J. Ansel, S. Amarasinghe (MIT)
To be presented at ASPLOS 2009
Slides by Evangelos

Motivation
• Parallel applications
  • Non-determinism is inherent in threaded applications
  • Hard to develop, debug, test, maintain, etc.
• Modify the runtime environment to make the parallel application run deterministically
  • Make thread communication through shared memory deterministic
  • Deterministic interleaving of lock acquisitions

Deterministic Multithreading
• Strong Determinism
  • Same output for every run (too costly)
• Weak Determinism
  • Same output for all inputs that lead to a race-free execution under the deterministic scheduler

Benefits of Deterministic Multithreading
• Repeatability
  • Closest existing approach: record/replay systems, which provide determinism only for a single recorded run
• Debugging
  • Cyclic debugging methodology
• Testing
  • Check the output or intermediate states of a program to establish correctness
• Multithreaded Replicas
  • Replica-based fault tolerance
  • Give the same input to all replicas and expect the same behavior

Deterministic Logical Time
• ‘P’ monotonically increasing clocks, one for every thread
• Each clock counts arbitrary per-thread events that are repeatable across executions
  • e.g. writes performed, instructions committed
• Serves as a measure of progress for every thread
• The thread interleaving (lock acquisition order) is decided based on logical time

Simplified Locking Algorithm
• At any given point it is only one thread's turn to acquire a lock. It is a thread's turn when:
  • All threads with a smaller ID have greater deterministic logical clocks
  • All threads with a larger ID have greater or equal deterministic logical clocks
• Turn waiting enforces a first-come, first-served ordering of threads in logical time

Pseudocode for simplified locking algorithm
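
The turn rule from the previous slide can be sketched in Python. This is a single-threaded simulation, not the authors' pseudocode. The names `is_turn` and `acquisition_order` are hypothetical. It assumes every thread is either waiting at a lock request or finished (modelled as an infinite clock), and that a clock advances by one on each acquisition.

```python
import math


def is_turn(tid, clocks):
    """Turn rule: threads with a smaller ID must be strictly ahead,
    threads with a larger ID must be at least level with `tid`."""
    mine = clocks[tid]
    smaller_ahead = all(clocks[i] > mine for i in range(tid))
    larger_level = all(clocks[i] >= mine for i in range(tid + 1, len(clocks)))
    return smaller_ahead and larger_level


def acquisition_order(work):
    """work[t] lists the logical-time units thread t executes before each
    of its lock requests. Returns the (thread, clock) acquisition order."""
    pending = [list(w) for w in work]
    # Run every thread up to its first lock request (assumed model).
    clocks = [w.pop(0) if w else math.inf for w in pending]
    order = []
    while any(c != math.inf for c in clocks):
        # Exactly one waiting thread satisfies the turn rule.
        t = next(i for i in range(len(clocks))
                 if clocks[i] != math.inf and is_turn(i, clocks))
        order.append((t, clocks[t]))
        clocks[t] += 1  # assumed: the clock ticks once per acquisition
        if pending[t]:
            clocks[t] += pending[t].pop(0)  # run on to the next request
        else:
            clocks[t] = math.inf  # finished threads never block others
    return order
```

For `work = [[3, 2], [1, 5], [3]]` the order is `[(1, 1), (0, 3), (2, 3), (0, 6), (1, 7)]`. Threads 0 and 2 both request a lock at logical time 3, and the lower ID wins the tie regardless of how the threads are scheduled in real time.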

Improved Locking Algorithm

Optimizations
• Queueing for fairness
  • Queue structure in every lock
  • The thread at the head of the queue gets the lock; other threads spin, incrementing their logical clocks
• Deterministic logical clock fast-forwarding
  • A thread advances its clock to lock.released_logical_time to avoid time spent spinning
• Lock priority boosting (?)
  • If the next thread to acquire a lock can be predicted, decrease its clock to give it higher priority
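
A rough sketch of why fast-forwarding helps, in Python. It assumes (this is not stated on the slide) that a thread may take a lock only once its clock exceeds the lock's `released_logical_time`. Both function names are hypothetical.

```python
def spin_until_eligible(clock, released_logical_time):
    """Advance one tick at a time. Returns (final_clock, ticks_spent)."""
    ticks = 0
    while clock <= released_logical_time:
        clock += 1
        ticks += 1
    return clock, ticks


def fast_forward_until_eligible(clock, released_logical_time):
    """Jump straight to the release time, then take at most one tick."""
    ticks = 0
    if clock < released_logical_time:
        clock = released_logical_time  # fast-forward in a single step
    if clock <= released_logical_time:
        clock += 1
        ticks += 1
    return clock, ticks
```

Both variants leave the thread at the same logical time, so the acquisition order is unchanged. Starting from clock 10 with a release time of 500, spinning takes 491 ticks, while fast-forwarding takes one.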

Implementation
• Deterministic Logical Clocks
  • retire_stores hardware counter; on an overflow, increment the software counter maintained in shared memory
• Chunk size: number of stores needed to cause an overflow
  • A small chunk size means higher overhead due to interrupt handlers
• Increment amount: fidelity of the logical clock
  • The amount can differ between a counter overflow and a lock acquisition attempt
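
The chunk-size trade-off can be illustrated with a small Python model. This sketch only simulates the counter in software. The class `ChunkedClock` and its fields are hypothetical, and the real mechanism relies on a hardware performance counter and an interrupt handler.

```python
class ChunkedClock:
    """Software model of a logical clock driven by a store counter."""

    def __init__(self, chunk_size, increment=1):
        self.chunk_size = chunk_size  # stores needed to cause one overflow
        self.increment = increment    # clock ticks added per overflow
        self.pending_stores = 0       # stands in for the hardware counter
        self.logical_time = 0         # software clock seen by other threads
        self.interrupts = 0           # how often the handler would have run

    def on_stores_retired(self, n):
        """Account for n retired stores and apply any overflows."""
        self.pending_stores += n
        overflows, self.pending_stores = divmod(self.pending_stores,
                                                self.chunk_size)
        self.interrupts += overflows
        self.logical_time += overflows * self.increment


coarse = ChunkedClock(chunk_size=512)
fine = ChunkedClock(chunk_size=64)
for clock in (coarse, fine):
    clock.on_stores_retired(2000)
```

After 2000 stores, the coarse clock has taken 3 simulated interrupts and the fine clock 31. The smaller chunk tracks progress more closely but runs the handler about ten times as often.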

Implementation
• Thread Creation
  • Care is needed when creating new threads
  • The parent thread needs to wait for its turn before starting a new thread
• Lazy reads (unprotected reads)
  • Provide an API for deterministically reading unprotected data; writes are always done with a lock
  • Keep a table of all <value, logical time> pairs

Evaluation
• 2.66GHz Intel Core 2 Quad running Debian
• SPLASH-2 benchmark suite
  • also a parallel traveling salesperson (tsp) solver and a parallel quicksort

Evaluation

Conclusions
• Software-only solution that provides weak deterministic multithreading
• Controls the interleaving of lock acquisitions to make it deterministic
• Low overhead (16%) for up to four threads (?) in SPLASH benchmarks
