Determinism Is Not Enough: Making Parallel Programs Reliable with Stable Multithreading. Junfeng Yang Columbia University
One-slide overview • Despite major advances in tools, multithreading remains hard to get right • Why? Nondeterminism: too many thread interleavings, or schedules • Stable Multithreading (StableMT): a radical approach to reducing the set of schedules for reliability with low overhead [Tern OSDI 10] [Loom OSDI 10] [Peregrine SOSP 11] [Specialization PLDI 12] [Parrot SOSP 13] [StableMT HotPar 13] [StableMT CACM 14]
Multithreaded programs:pervasive and critical http://www.drdobbs.com/parallel/design-for-manycore-systems/219200099
But, extremely hard to get right • Plagued with concurrency bugs [Lu ASPLOS 09]: data races, atomicity violations, execution order violations, deadlocks, … • Concurrency bugs are bad: crashes, silent data corruptions, … • May be exploited by attackers to violate confidentiality, integrity, and availability of critical systems [Concurrency Attacks HotPar 12]
Concurrency bug example • Input: everything a program reads from its environment, e.g., main() arguments, data read from a file or socket • Schedule: a sequence of communication operations, e.g., a total order of synchronizations such as lock()/unlock() • Buggy schedule: a schedule that triggers a concurrency bug • Correct schedule: Thread 1 runs mutex_lock(M); *obj = …; mutex_unlock(M), then Thread 2 runs mutex_lock(M); free(obj); mutex_unlock(M) • Buggy schedule: Thread 2 frees obj first, so Thread 1's *obj = … writes to freed memory • Based on real bugs in PBZip2, fft, lu, barnes, Apache
Why hard? • Number of schedules: exponential in K and M • M threads, each with K critical sections (lock() … unlock()) • Each round of M critical sections can interleave in M! orders, so the total is >= (M!)^K schedules • Too many schedules to check! Testing? Model checking?
How to improve checking coverage? All possible schedules • Coverage = Checked/All • Traditionally: enlarge Checked exploiting equivalence • Equiv. is hard to find • [DIR SOSP 11] took us 3 yrs • First since [VeriSoft POPL 97] Checked schedules Can we increase coverage without enlarging Checked?
High coverage is simple with StableMT • StableMT: over-constrain the ordering of communication operations to greatly reduce the total set of schedules • Example: force a round-robin (RR) synchronization order; non-synchronization code still runs in parallel • Just one schedule to check! Simple enough that it feels like cheating
Feasible to reduce schedules? • Insight 1: for many programs, a wide range of inputs share the same set of schedules [Tern OSDI 10] [Peregrine SOSP 11] [Parrot SOSP 13] • Examples: PBZip2, C++ STL, MPlayer, ImageMagick, fft, barnes, lu, … • Insight 2: the overhead of enforcing a subset of schedules on all inputs is low (e.g., 15%) [Tern OSDI 10] [Peregrine SOSP 11] [Parrot SOSP 13] • More later
Conceptual view of StableMT • All inputs → a greatly reduced set of schedules • Benefits almost all reliability techniques • Provides anticipated robustness against input or program perturbations • Tool + runtime co-design • StableMT vs. traditional multithreading: "what you check is what you run"; "what you can't check is not run"
The stability measure • Measures how reduced the set of schedules is • One way to compute: stability = 1 - Reduced/All • Maximum stability is achieved when the set of schedules is reduced to the minimum • Split input bits into control and computational bits • Control bits determine the minimum # of schedules
Roadmap • StableMT: why and what • Common misperception about nondeterminism, or why StableMT is better than deterministic multithreading (DMT) • Can we build StableMT systems? • How well does StableMT work?
Deterministic multithreading (DMT): one input → one schedule • Same program + same input = same schedule • E.g., one testing execution validates all future executions on the same input • Reproducing a concurrency bug requires only the input
Deterministic + low stability: hard to make reliable • Each input (input 1, input 2, input 3, …) maps to its own schedule: still too many schedules to check! • Small input perturbations → drastically different schedules
Stability and determinism are often mistakenly conflated [StableMT HotPar 13] [StableMT CACM 14] • "I simplified my test case / changed my code. Now the bug can't be reproduced. Nondeterminism sucks!!! I want determinism!!!" • No, nondeterminism is not your problem; you want stability
Roadmap • StableMT: why and what • StableMT: better than DMT • Can we build StableMT systems? • Yes, we built three [Tern OSDI 10][Peregrine SOSP 11] [Parrot SOSP 13] • How well does StableMT work?
Implementation focus • Axes: # of control bits (few vs. many) × program type (parallel vs. concurrent) • Few control bits + parallel: best case! e.g., PBZip2, C++ STL, MPlayer, ImageMagick, fft, barnes, lu • Many control bits: degenerates to DMT-like behavior, e.g., fmm, pfscan; future work: higher-level schedules • Concurrent, e.g., Apache: windowing [Tern OSDI 10] to reduce input timing variations; future work: signals, async I/O, UI, …
The ultimate challenge: performance • Some schedules serialize intended parallel computations (e.g., 5x slowdown on 30% of programs) • Want schedules to (1) not change for each input (stability) and (2) cater to each individual input (speed) • Seems a hard problem, but it has simple solutions!
Insight • Empirical study of 100+ programs • Most threads spend the majority of their time in a small number of hotspots • Example: PBZip2's compress() • Another instance of the 80-20 rule • Keep hotspots parallel → small overhead!
1st sol'n: record and reuse [Tern OSDI 10] [Peregrine SOSP 11] • On a new input, run the program as is to record a reasonably fast synchronization schedule • Compute a relaxed, quickly checkable precondition of the schedule to capture its dependencies on the input • Reuse the schedule on inputs satisfying the precondition

if (x == 1) { lock(); unlock(); }
else …;          // no synch
if (y == 1) …;   // no synch
else …;          // no synch

• The precondition should constrain x, but not y • Solution: symbolic execution to track constraints, and precondition slicing to remove unnecessary constraints
Handling server programs • Challenges: run continuously → infinite schedule; nondeterministic input timing • Observation: server programs tend to go back to the same quiescent states • Solution: windowing to split the continuous request stream into epochs
Handling data races • May cause execution to deviate from schedule x = 1; x = 0; if(x) { lock(); unlock(); } a[x] = 1; a[x] = 0; if(a[y]) { lock(); unlock(); } Solution: custom race detector to detect such races, then custom instrumentation engine to deterministically resolve races at runtime
Pros and cons of 1st solution • Transparent • But sophisticated enough that it needed [Tern OSDI 10][Loom OSDI 10][Peregrine SOSP 11] to explain • Trade some transparency for simplicity?
2nd sol'n: performance hints [Parrot SOSP 13] • By default, the Parrot thread runtime runs synchronizations round-robin • When necessary, developers add performance hints to their code for speed • Soft barrier: "coschedule these computations" • Performance critical section: "get through this section of code fast" • Evaluation on 100+ programs shows the hints are easy to add and make executions fast • https://github.com/columbia/smt-mc/
Example based on PBZip2

main thread:
  create 2 consumer threads;
  for each file block {
    char *block = read_block();
    Add;
  }

consumer thread:
  while (1) {
    Wait or Get;
    compress(block);
  }

Add expands to:
  pthread_mutex_lock(&mu);
  add(worklist, block);
  pthread_cond_signal(&cv);
  pthread_mutex_unlock(&mu);

Wait or Get expands to:
  pthread_mutex_lock(&mu);
  // termination logic elided
  while (empty(worklist))
    pthread_cond_wait(&cv, &mu);
  char *block = get(worklist);
  pthread_mutex_unlock(&mu);
Example based on PBZip2 (cont.) • Schedules are independent of file data → the same schedule can compress any file • Hotspots: compress() • $ LD_PRELOAD=parrot.so ./a.out file_with_two_blocks
Parrot's scheduler • run queue: threads not blocked by synchronization • wait queue: threads blocked by synchronization

parrot_synchronization_wrapper() {
  while (self != head of run queue)
    yield;
  do actual synchronization;
  wake up threads if needed;
  move self to tail of run queue or wait queue;
}
Parrot's default round-robin schedule • run queue: main, consumer 1, consumer 2 • Round-robin timeline: main read_block, Add; consumer 1 Get, compress; main read_block, Add; consumer 2 Wait, then Get, compress • The two compress() calls end up serialized: a general problem!
Parrot schedule with perf hints • main thread additionally calls soba_init(2); each consumer calls soba_wait() before compress(block) • Timeline: main read_block, Add, read_block, Add; consumer 1's soba_wait blocks until consumer 2 also reaches soba_wait, then both return • The two compress() calls run in parallel! • 0.8% overhead
Performance hint API

// soft barrier; doesn't increase # of schedules
void soba_init(int count, void *chan = NULL, int deterministic_timeout = 20);
void soba_wait(void *chan = NULL);

// performance critical section; increases # of
// schedules, but can check!
void pcs_enter();
void pcs_exit();
Roadmap • StableMT: why and what • StableMT: better than DMT • Can we build StableMT systems? • Yes, we built a few [Tern OSDI 10][Peregrine SOSP 11] [Parrot SOSP 13] • How well does StableMT work?
Evaluation questions • How fast is StableMT? • How much can StableMT improve reliability?
Evaluation setup • A diverse set of 108 programs • 55 real-world programs: BerkeleyDB, OpenLDAP, Redis, MPlayer, ImageMagick, C++ STL, PBZip2, pfscan, aget • 53 programs from 4 complete synthetic benchmark suites: PARSEC, SPLASH-2x, PHOENIX, NPB • Diverse: Pthreads, OpenMP, data partition, fork-join, pipeline, map-reduce, and workpile • Maximum allowed cores (24-core Xeon) • Largest allowed or representative workloads
Overhead: small • (chart: per-program overhead for aget, redis, pfscan, openldap, berkeley db, pbzip2 compress/decompress, mplayer mencoder) • Mean overhead: 6.9% for real-world programs, 19.0% for synthetic, and 12.7% overall
Hints: easy to add, effective • Average of 1.2 lines per program • A few hints in common libraries benefit many programs • 0.5--2 hours per program, added by inexperienced students who didn't write the programs
Evaluation questions • How fast is StableMT? • How much can StableMT improve reliability?
Model checking: higher coverage [Parrot SOSP 13] • Integrated Parrot with dBug [dBug SPIN 11] because it's open-source, runs on Linux, implements dynamic partial order reduction [DPOR POPL 05], and can estimate the number of possible schedules [Knuth] • Parrot decreases the number of schedules by 10^6--10^19734 (not a typo ;) for 53 programs • Parrot increases the number of programs whose schedules dBug can exhaust from 43 to 99
Input coverage study • Can the schedules checked cover all, or at least a wide range of, inputs? • Yes! [Tern OSDI 10] [Peregrine SOSP 11] • Examples • Given the number of threads and the number of file blocks, PBZip2 needs only one schedule to compress all files (barring signals) • The schedules checked consistently avoided bugs in fft, barnes, lu for all inputs
Static analysis: higher precision [Specialization PLDI 12] • Specialize a program according to a schedule • The resultant program contains the schedule info, improving the precision of stock analyses • (diagram: C/C++ Pthreads program + schedule (a total order of synchronizations) → Schedule Specialization → specialized program)
Sound static race detector: # of false positives (chart)
Previously unknown harmful races detected • 4 in aget • 2 in radix • 1 in fft
Conclusion • Root cause of the multithreading difficulties: nondeterminism → too many schedules • Stable Multithreading (StableMT): a radical approach to vastly reducing schedules for reliability with low overhead [Tern OSDI 10] [Loom OSDI 10] [Peregrine SOSP 11] [Specialization PLDI 12] [Parrot SOSP 13] [StableMT HotPar 13] [StableMT CACM 14]
Future work • StableMT can hugely benefit almost all parallel programming languages, models, tools, etc.
Future work (cont.) • In general, does software need to be this complex? • Verify what we can; handle the rest with other, slower methods, or just give up …
Joint work with • Heming Cui, Jingyue Wu, Yang Tang, Gang Hu, Yi-Hong Lin, Hao Li, Xinan Xu, Huayang Guo, John Gallagher, Chia-che Tsai (Columbia) • Jiri Simsa, Ben Blum, Garth A. Gibson, Randal E. Bryant (CMU)