Determinism Is Not Enough: Making Parallel Programs Reliable with Stable Multithreading. Junfeng Yang Columbia University
One-slide overview • Despite major advances in tools, multithreading remains hard to get right • Why? Nondeterminism: too many thread interleavings, or schedules • Stable Multithreading (StableMT): a radical approach to reducing the set of schedules for reliability with low overhead [Tern OSDI 10] [Loom OSDI 10] [Peregrine SOSP 11] [Specialization PLDI 12] [Parrot SOSP 13] [StableMT HotPar 13] [StableMT CACM 14]
Multithreaded programs:pervasive and critical http://www.drdobbs.com/parallel/design-for-manycore-systems/219200099
But, extremely hard to get right • Plagued with concurrency bugs [Lu ASPLOS 09]: data races, atomicity violations, execution order violations, deadlocks, … • Concurrency bugs are bad: crashes, silent data corruptions, … • May be exploited by attackers to violate confidentiality, integrity, and availability of critical systems [Concurrency Attacks HotPar 12]
Concurrency bug example • Input: everything a program reads from its environment, e.g., main() arguments, data read from a file or socket • Schedule: a sequence of communication operations, e.g., a total order of synchronizations such as lock()/unlock() • Buggy schedule: a schedule that triggers a concurrency bug • Correct schedule: Thread 1 runs mutex_lock(M); *obj = …; mutex_unlock(M), then Thread 2 runs mutex_lock(M); free(obj); mutex_unlock(M) • Buggy schedule: Thread 2 frees obj first, so Thread 1's *obj = … writes to freed memory • Based on real bugs in PBZip2, fft, lu, barnes, Apache
Why hard? • Number of schedules: exponential in K and M • M threads, each with K critical sections (lock() … unlock()) • Each round of M critical sections can interleave in M! orders, so the total is >= (M!)^K schedules • Too many schedules to check! Testing? Model checking?
How to improve checking coverage? All possible schedules • Coverage = Checked/All • Traditionally: enlarge Checked exploiting equivalence • Equiv. is hard to find • [DIR SOSP 11] took us 3 yrs • First since [VeriSoft POPL 97] Checked schedules Can we increase coverage without enlarging Checked?
High coverage is simple with StableMT • StableMT: over-constrain the ordering of communication operations to greatly reduce the total set of schedules • Example: force a round-robin (RR) synchronization order; non-synchronization code still runs in parallel • Just one schedule to check! Simple enough that it feels like cheating
Feasible to reduce schedules? • Insight 1: for many programs, a wide range of inputs share the same set of schedules [Tern OSDI 10] [Peregrine SOSP 11] [Parrot SOSP 13] • Examples: PBZip2, C++ STL, MPlayer, ImageMagick, fft, barnes, lu, … • Insight 2: the overhead of enforcing a subset of schedules on all inputs is low (e.g., 15%) [Tern OSDI 10] [Peregrine SOSP 11] [Parrot SOSP 13] • More later
Conceptual view of StableMT • All inputs → a greatly reduced set of schedules • Benefits almost all reliability techniques • Provides anticipated robustness against input or program perturbations • Tool + runtime co-design • StableMT vs. traditional multithreading: "what you check is what you run"; "what you can't check is not run"
The stability measure • Measures how reduced the set of schedules is • One way to compute: stability = 1 - Reduced/All • Maximum stability is achieved when the set of schedules is reduced to the minimum • Split input bits into control and computational bits • Control bits determine the minimum # of schedules
Roadmap • StableMT: why and what • Common misperception about nondeterminism, or why StableMT is better than deterministic multithreading (DMT) • Can we build StableMT systems? • How well does StableMT work?
Deterministic multithreading (DMT): one input → one schedule • Same program + same input = same schedule • E.g., one testing execution validates all future executions on the same input • Reproducing a concurrency bug requires only the input
Deterministic + low stability: hard to make reliable • Each input (input 1, input 2, input 3, …) maps to its own schedule: still too many schedules to check! • Small input perturbations → drastically different schedules
Stability and determinism are often mistakenly conflated [StableMT HotPar 13] [StableMT CACM 14] • "I simplified my test case / changed my code. Now the bug can't be reproduced. Nondeterminism sucks!!! I want determinism!!!" • No, nondeterminism is not your problem; you want stability
Roadmap • StableMT: why and what • StableMT: better than DMT • Can we build StableMT systems? • Yes, we built three [Tern OSDI 10][Peregrine SOSP 11] [Parrot SOSP 13] • How well does StableMT work?
Implementation focus • Axes: # of control bits (few vs. many) × program type (parallel vs. concurrent) • Few control bits + parallel: best case! e.g., PBZip2, C++ STL, MPlayer, ImageMagick, fft, barnes, lu • Many control bits: degenerates to DMT-like behavior, e.g., fmm, pfscan; future work: higher-level schedules • Concurrent, e.g., Apache: windowing [Tern OSDI 10] to reduce input timing variations; future work: signals, async I/O, UI, …
The ultimate challenge: performance • Some schedules serialize intended parallel computations (e.g., 5x slowdown on 30% of programs) • Want schedules to (1) not change for each input (stability) and (2) cater to each individual input (speed) • Seems a hard problem, but it has simple solutions!
Insight • Empirical study of 100+ programs • Most threads spend the majority of their time in a small number of hotspots • Example: PBZip2's compress() • Another instance of the 80-20 rule • Keep hotspots parallel → small overhead!
1st sol'n: record and reuse [Tern OSDI 10] [Peregrine SOSP 11] • On a new input, run the program as is to record a reasonably fast synchronization schedule • Compute a relaxed, quickly checkable precondition of the schedule to capture its dependencies on the input • Reuse the schedule on inputs satisfying the precondition

if (x == 1) { lock(); unlock(); }
else …;          // no synch
if (y == 1) …;   // no synch
else …;          // no synch

• The precondition should constrain x, but not y • Solution: symbolic execution to track constraints, and precondition slicing to remove unnecessary constraints
Handling server programs • Challenges: run continuously → infinite schedule; nondeterministic input timing • Observation: server programs tend to go back to the same quiescent states • Solution: windowing to split the continuous request stream into epochs
Handling data races • May cause execution to deviate from schedule x = 1; x = 0; if(x) { lock(); unlock(); } a[x] = 1; a[x] = 0; if(a[y]) { lock(); unlock(); } Solution: custom race detector to detect such races, then custom instrumentation engine to deterministically resolve races at runtime
Pros and cons of 1st solution • Transparent • But sophisticated enough that it needed [Tern OSDI 10][Loom OSDI 10][Peregrine SOSP 11] to explain • Trade some transparency for simplicity?
2nd sol'n: performance hints [Parrot SOSP 13] • By default, the Parrot thread runtime runs synchronizations round-robin • When necessary, developers add performance hints to their code for speed • Soft barrier: "coschedule these computations" • Performance critical section: "get through this section of code fast" • Evaluation on 100+ programs shows the hints are easy to add and make executions fast • https://github.com/columbia/smt-mc/
Example based on PBZip2

main thread:
  create 2 consumer threads;
  for each file block {
    char *block = read_block();
    Add;
  }

consumer thread:
  while (1) {
    Wait or Get;
    compress(block);
  }

Add expands to:
  pthread_mutex_lock(&mu);
  add(worklist, block);
  pthread_cond_signal(&cv);
  pthread_mutex_unlock(&mu);

Wait or Get expands to:
  pthread_mutex_lock(&mu);
  // termination logic elided
  while (empty(worklist))
    pthread_cond_wait(&cv, &mu);
  char *block = get(worklist);
  pthread_mutex_unlock(&mu);
Example based on PBZip2 (cont.) • Schedules are independent of file data → the same schedule can compress any file • Hotspots: compress() • $ LD_PRELOAD=parrot.so ./a.out file_with_two_blocks
Parrot's scheduler • run queue: threads not blocked by synchronization • wait queue: threads blocked by synchronization

parrot_synchronization_wrapper() {
  while (self != head of run queue)
    yield;
  do actual synchronization;
  wake up threads if needed;
  move self to tail of run queue or wait queue;
}
Parrot's default round-robin schedule • run queue: main, consumer 1, consumer 2 • Round-robin timeline: main read_block, Add; consumer 1 Get, compress; main read_block, Add; consumer 2 Wait, then Get, compress • The two compress() calls end up serialized: a general problem!
Parrot schedule with perf hints • main thread additionally calls soba_init(2); each consumer calls soba_wait() before compress(block) • Timeline: main read_block, Add, read_block, Add; consumer 1's soba_wait blocks until consumer 2 also reaches soba_wait, then both return • The two compress() calls run in parallel! • 0.8% overhead
Performance hint API

// soft barrier; doesn't increase # of schedules
void soba_init(int count, void *chan = NULL, int deterministic_timeout = 20);
void soba_wait(void *chan = NULL);

// performance critical section; increases # of
// schedules, but can check!
void pcs_enter();
void pcs_exit();
Roadmap • StableMT: why and what • StableMT: better than DMT • Can we build StableMT systems? • Yes, we built a few [Tern OSDI 10][Peregrine SOSP 11] [Parrot SOSP 13] • How well does StableMT work?
Evaluation questions • How fast is StableMT? • How much can StableMT improve reliability?
Evaluation setup • A diverse set of 108 programs • 55 real-world programs: BerkeleyDB, OpenLDAP, Redis, MPlayer, ImageMagick, C++ STL, PBZip2, pfscan, aget • 53 programs from 4 complete synthetic benchmark suites: PARSEC, SPLASH-2x, PHOENIX, NPB • Diverse: Pthreads, OpenMP, data partition, fork-join, pipeline, map-reduce, and workpile • Maximum allowed cores (24-core Xeon) • Largest allowed or representative workloads
Overhead: small • (chart: per-program overhead for aget, redis, pfscan, openldap, berkeley db, pbzip2 compress/decompress, mplayer mencoder) • Mean overhead: 6.9% for real-world programs, 19.0% for synthetic, and 12.7% overall
Hints: easy to add, effective • Average of 1.2 lines per program • A few hints in common libraries benefit many programs • 0.5--2 hours per program, added by inexperienced students who didn't write the programs
Evaluation questions • How fast is StableMT? • How much can StableMT improve reliability?
Model checking: higher coverage [Parrot SOSP 13] • Integrated Parrot with dBug [dBug SPIN 11] because it's open-source, runs on Linux, implements dynamic partial order reduction [DPOR POPL 05], and can estimate the number of possible schedules [Knuth] • Parrot decreases the number of schedules by 10^6--10^19734 (not a typo ;) for 53 programs • Parrot increases the number of programs whose schedules dBug can exhaust from 43 to 99
Input coverage study • Can the schedules checked cover all, or at least a wide range of, inputs? • Yes! [Tern OSDI 10] [Peregrine SOSP 11] • Examples • Given the number of threads and the number of file blocks, PBZip2 needs only one schedule to compress all files (barring signals) • The schedules checked consistently avoided bugs in fft, barnes, lu for all inputs
Static analysis: higher precision [Specialization PLDI 12] • Specialize a program according to a schedule • The resultant program contains the schedule info, improving the precision of stock analyses • (diagram: C/C++ Pthreads program + schedule (a total order of synchronizations) → Schedule Specialization → specialized program)
Sound static race detector: # of false positives (chart)
Previously unknown harmful races detected • 4 in aget • 2 in radix • 1 in fft
Conclusion • Root cause of the multithreading difficulties: nondeterminism → too many schedules • Stable Multithreading (StableMT): a radical approach to vastly reducing schedules for reliability with low overhead [Tern OSDI 10] [Loom OSDI 10] [Peregrine SOSP 11] [Specialization PLDI 12] [Parrot SOSP 13] [StableMT HotPar 13] [StableMT CACM 14]
Future work • StableMT can hugely benefit almost all parallel programming languages, models, tools, etc.
Future work (cont.) • In general, does software need to be this complex? • Verify what we can; handle the rest with other, slower methods, or just give up …
Joint work with • Heming Cui, Jingyue Wu, Yang Tang, Gang Hu, Yi-Hong Lin, Hao Li, Xinan Xu, Huayang Guo, John Gallagher, Chia-che Tsai (Columbia) • Jiri Simsa, Ben Blum, Garth A. Gibson, Randal E. Bryant (CMU)