Dongyoon Lee, Peter Chen, Jason Flinn, Satish Narayanasamy

Chimera: Hybrid Program Analysis for Determinism
Dongyoon Lee, Peter Chen, Jason Flinn, Satish Narayanasamy
University of Michigan, Ann Arbor
* Chimera image from http://superpunch.blogspot.com/2009/02/chimera-sketch.html

Deterministic Replay
Goal: record and reproduce multithreaded execution
• Debugging concurrency bugs
• Offline heavyweight dynamic analysis
• Forensics and intrusion detection
• … and many more uses
Problem
• Multithreaded record-and-replay is too slow (more than 2x slowdown) or requires custom hardware

Multithreaded Record-and-Replay is Slow
• Checkpoint memory and register state
• Log non-deterministic program input: interrupts, I/O values, DMA, etc.
• Log shared memory dependencies
[Diagram: Thread 1, Thread 2, and Thread 3 issue two writes and a read to shared memory; the order of these accesses is what must be logged]

Replay for Data-Race-Free Programs is Cheap
Data-race-free programs
• Shared memory accesses are well ordered by synchronization operations
• Recording the happens-before order of synchronization operations is sufficient
Problem: programs with data races
[Diagram: threads T1, T2, and T3 write X, Y, and Z (X=0, Y=0, X=1, Y=1, Z=1, X=2, Y=2, Z=2); the order of these memory operations follows from the order of the synchronization operations Unlock(l)/Lock(l) and Signal(c)/Wait(c)]
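
The claim that logging the order of synchronization operations is enough for a data-race-free program can be illustrated with a small sketch. It covers a single lock only, and the names `sync_log`, `record_acquire`, and `try_replay_acquire` are hypothetical, not taken from Chimera.

```c
#include <assert.h>
#include <stddef.h>

#define LOG_CAPACITY 64

/* Hypothetical log of the order in which threads acquired one lock. */
struct sync_log {
    int    owner[LOG_CAPACITY]; /* thread id of the i-th acquisition */
    size_t recorded;            /* entries written while recording   */
    size_t replayed;            /* entries consumed while replaying  */
};

/* Recording: append the acquiring thread's id. Assumes the caller
 * already holds the lock, so the appends are themselves ordered. */
void record_acquire(struct sync_log *log, int tid)
{
    assert(log->recorded < LOG_CAPACITY);
    log->owner[log->recorded++] = tid;
}

/* Replay: thread `tid` may take the lock only when it is next in the
 * log. Returns 1 and advances the log if so, else 0 (caller waits). */
int try_replay_acquire(struct sync_log *log, int tid)
{
    if (log->replayed >= log->recorded || log->owner[log->replayed] != tid)
        return 0;
    log->replayed++;
    return 1;
}
```

During replay, a thread that gets 0 back would wait and retry, which forces the same acquisition order as the recorded run.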

Our Contribution: A Hybrid Analysis
Sound static data race analysis
• Add synchronization for potential data races
• Problem: too many false positives
Profiling non-concurrent code regions
Symbolic bounds analysis
[Diagram: Chimera transforms a potentially racy program P into a data-race-free program P’]

Roadmap
• Motivation
• Chimera Analysis
  • Static data race analysis
  • Profiling non-concurrent code regions
  • Symbolic bounds analysis
• Weak-lock Design
• Evaluation
• Conclusion

Static Data Race Analysis
• Find potential data races using a sound static data race detector, RELAY [Voung et al., FSE’07]
• Protect all potential data races using weak-locks
  • A new time-out lock which may be preempted (discussed later)
• Record and replay the happens-before order of weak-locks

Protect Potential Races using Weak-locks
Static analysis helps avoid instrumentation for the access to Z
    void foo() {
      X = 0;               /* potential racy pair with X = 1 in bar() */
      for(i = ... ){
        Y[tid][i] = 0;     /* potential racy pair with Y[tid][i] = 1 in bar() */
      }
    }
    void bar() {
      X = 1;
      for(i = … ){
        Y[tid][i] = 1;
        Z = 1;             /* no race report */
      }
    }

Sources of False Positives in RELAY
• A sound data race detector reports too many false data races
  • 53x overhead
• Source 1: Non-mutex synchronization is ignored
  • Lockset-based analysis ignores fork-join, barrier, signal-wait, etc.
  • May report a false data race between memory instructions that can never execute concurrently
  • Solution: Profiling non-concurrent code regions
• Source 2: Conservative pointer analysis
  • Overestimates the variables accessed by a memory instruction
  • May report a false data race between memory instructions that can never access the same location
  • Solution: Symbolic bounds analysis

Profiling Non-concurrent Code Regions
Problem
• Lockset-based analysis ignores non-mutex synchronization operations
Solution
• Profile non-concurrent code regions (e.g., functions)
• Increase the granularity of weak-locks to protect a larger code region instead of each potentially racy instruction
• Parallelism is preserved unless mis-profiled
[Diagram: threads T1 and T2; foo() runs before a barrier and bar() after it, so the race reported between them is a false race]

Function-Level Weak-Locks
Used if the profiler says foo() and bar() are not likely to run concurrently
    void foo() {
      X = 0;
      for(i = … ){
        Y[tid][i] = 0;
      }
    }
    void bar() {
      X = 1;
      for(i = … ){
        Y[tid][i] = 1;
        Z = 1;
      }
    }
[Diagram: foo() and bar() are separated by a barrier, so the race reported between them is a false race]
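
The profile-driven choice above might be sketched as follows. This is an illustration under assumptions: the `profile` structure, its fixed-size table, and both function names are hypothetical, and the profiler that would fill the table is not shown.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_FUNCS 8

/* Hypothetical profile: seen_concurrent[f][g] is true if any profiling
 * run observed functions f and g executing at the same time. */
struct profile {
    bool seen_concurrent[MAX_FUNCS][MAX_FUNCS];
};

/* Called by the (not shown) profiler when f and g overlap in time. */
void profile_note_overlap(struct profile *p, int f, int g)
{
    assert(f >= 0 && f < MAX_FUNCS && g >= 0 && g < MAX_FUNCS);
    p->seen_concurrent[f][g] = true;
    p->seen_concurrent[g][f] = true;
}

/* A statically reported racy pair (f, g) gets a function-level
 * weak-lock when the profile never saw f and g overlap; otherwise it
 * needs finer-grained weak-locks to keep parallelism. */
bool use_function_level_lock(const struct profile *p, int f, int g)
{
    return !p->seen_concurrent[f][g];
}
```

If the profile is wrong for some input, the function-level weak-lock still orders the two functions; the cost is lost parallelism, not lost correctness.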

Imprecision in Conservative Pointer Analysis
[Diagram: foo() and bar() run on different threads (T1, T2) between the same pair of barriers, so they may run concurrently]

Imprecision in Conservative Pointer Analysis
• RELAY uses Steensgaard’s and Andersen’s pointer analyses
• Flow-insensitive and context-insensitive (FICI) analysis
• Naming of heap objects is conservative
• Overestimates the variables accessed by a memory instruction
    void foo() {
      …
      for(i = 0 to N){
        Y[tid][i] = 0;     /* reported as a potential racy pair with bar() */
        …
      }
    }
    void bar() {
      …
      for(i = 0 to N){
        Y[tid][i] = 1;
        …
      }
    }
[Diagram: Thread 1 and Thread 2 access different parts of Y[][], so the reported race is a false race]

Symbolic Bounds Analysis
Our Solution
• Derive symbolic lower and upper bounds on the addresses that a racy code region (e.g., a loop) may access [Rugina and Rinard, PLDI’00]
• Increase the granularity of weak-locks to protect a larger code region for a set of addresses specified by a symbolic expression
• Parallelism is preserved if the bounds are precise enough
    void foo() {
      …
      for(i = 0 to N){
        Y[tid][i] = 0;
      }
      …
    }
Result of symbolic bounds analysis: &Y[tid][0] to &Y[tid][N]
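
Why precise bounds preserve parallelism can be shown with a range-overlap check. This is a sketch under assumptions: `addr_range`, `make_range`, and `ranges_conflict` are hypothetical names, and bounds are treated as inclusive addresses of the first and last element accessed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical address range guarded by a loop-level weak-lock.
 * Both bounds are inclusive. */
struct addr_range {
    uintptr_t lo;
    uintptr_t hi;
};

/* Build a range from the addresses of the first and last element. */
struct addr_range make_range(const void *first, const void *last)
{
    struct addr_range r = { (uintptr_t)first, (uintptr_t)last };
    assert(r.lo <= r.hi);
    return r;
}

/* Two holders conflict only if their ranges share an address. */
bool ranges_conflict(struct addr_range a, struct addr_range b)
{
    return a.lo <= b.hi && b.lo <= a.hi;
}
```

With bounds such as &Y[tid][0] to &Y[tid][N], two threads with different `tid` values get disjoint ranges and need not wait for each other. Bounds of -INF to +INF would conflict with every other range.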

Loop-level Weak-locks
Symbolic bounds: &Y[tid][0] ~ &Y[tid][N]
    void foo() {
      X = 0;
      for(i = 0 to N){
        Y[tid][i] = 0;
      }
    }
    void bar() {
      X = 1;
      for(i = 0 to N){
        Y[tid][i] = 1;
        Z = 1;
      }
    }
[Annotation: the loop in foo() and the loop in bar() are each bracketed by a weak-lock over the range (&Y[tid][0], &Y[tid][N])]

Imprecise Symbolic Bounds
Sources
• Bounds depend on a value computed inside the code region
• Bounds depend on arithmetic operations not supported by the analysis
  • e.g., modulo operations, logical AND/OR, etc.
Choosing the optimal granularity
• If the bounds are too imprecise and the loop body is long enough, resort to instruction (basic-block) level weak-locks to preserve parallelism
    void qux() {
      …
      for(i = 0 to N){
        prev = Z[prev];
      }
      …
    }
Result of symbolic bounds analysis: -INF to +INF

Deadlock due to Weak-locks
No deadlocks between weak-locks
• Ordering: function-level > loop-level > instruction-level
Deadlock between weak-locks and original synchronization operations is possible
[Diagram: T1 blocks in wait(cv) while T2 has yet to reach signal(cv); the situation is resolved by a weak-lock time-out]
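
The level ordering above can be checked with a small per-thread structure. This sketch models only the cross-level order stated on the slide; how weak-locks of the same level are ordered is not stated there and is not modeled. The names `weak_level`, `held_locks`, `acquire_in_order`, and `release_last` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Granularities from the slide, coarsest first. */
enum weak_level { LEVEL_FUNCTION = 0, LEVEL_LOOP = 1, LEVEL_INSTRUCTION = 2 };

#define MAX_HELD 16

/* Hypothetical per-thread stack of the levels of held weak-locks. */
struct held_locks {
    enum weak_level level[MAX_HELD];
    int count;
};

/* An acquisition respects the order if the new lock is not coarser
 * than the most recently acquired lock that is still held. */
bool acquire_in_order(struct held_locks *h, enum weak_level lvl)
{
    assert(h->count < MAX_HELD);
    if (h->count > 0 && lvl < h->level[h->count - 1])
        return false;   /* coarse after fine: rejected, nothing pushed */
    h->level[h->count++] = lvl;
    return true;
}

/* Release the most recently acquired lock. */
void release_last(struct held_locks *h)
{
    assert(h->count > 0);
    h->count--;
}
```

Because every thread acquires in the same coarse-to-fine direction, no cycle can form across levels.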

Weak-lock Time-out
A weak-lock might time out
• Invoke a special system call to handle it
Weak-lock guarantee
• Only one thread holds a given weak-lock at any given time
• Mutual exclusion may be compromised, but this is sufficient for replay
[Diagram: T1 executes wait(cv) and T2 executes signal(cv); after the time-out the current owner of the weak-lock changes, and the order of weak-lock ownership is logged]
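
The guarantee above (a single owner at any time, even though mutual exclusion of the critical section may be broken) can be illustrated with a toy, single-threaded model. It is not the kernel mechanism from the talk: the special system call is not modeled, and `weak_lock`, `weak_try_acquire`, `weak_timeout_takeover`, and `weak_release` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define NO_OWNER (-1)
#define LOG_CAPACITY 64

/* Toy model of one weak-lock: a single owner plus the logged order
 * in which threads became owner. */
struct weak_lock {
    int    owner;               /* thread id, or NO_OWNER */
    int    order[LOG_CAPACITY]; /* logged order of ownership */
    size_t logged;
};

void weak_lock_init(struct weak_lock *l)
{
    l->owner = NO_OWNER;
    l->logged = 0;
}

/* Make `tid` the owner and append it to the ownership log. */
static void log_owner(struct weak_lock *l, int tid)
{
    assert(l->logged < LOG_CAPACITY);
    l->order[l->logged++] = tid;
    l->owner = tid;
}

/* Normal path: succeeds only when the lock is free. Returns 1 on success. */
int weak_try_acquire(struct weak_lock *l, int tid)
{
    if (l->owner != NO_OWNER)
        return 0;
    log_owner(l, tid);
    return 1;
}

/* Time-out path: the waiter takes ownership from the current holder.
 * Returns the preempted thread id. There is still exactly one owner,
 * but the old holder's critical section is no longer exclusive. */
int weak_timeout_takeover(struct weak_lock *l, int waiter)
{
    int preempted = l->owner;
    log_owner(l, waiter);
    return preempted;
}

/* Release is a no-op for a thread that was already preempted. */
void weak_release(struct weak_lock *l, int tid)
{
    if (l->owner == tid)
        l->owner = NO_OWNER;
}
```

The logged order is what replay would need to reproduce; the model leaves out how the preempted thread's progress at the time of the takeover would be recorded.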

Implementation
Source-to-source instrumentation
• Implemented in OCaml using CIL as a front end
Static analysis
• Data race detection: RELAY [Voung et al., FSE’07]
  • Includes all library source code for soundness (uClibc’s libc, libm, etc.)
• Symbolic bounds analysis: [Rugina and Rinard, PLDI’00]
  • Intra-procedural analysis for racy loops only
Runtime system
• Modified Linux kernel to record/replay program input
• Modified pthread library to record/replay the happens-before order of original synchronization operations and weak-locks

Evaluation Setup
Test environment
• 2.66 GHz 8-core Xeon processor with 4 GB of RAM
• Different sets of inputs for profiling and performance evaluation
• Average of five trials with 4 worker threads
• 2, 4, and 8 threads for scalability results
Benchmarks
• Desktop applications: aget, pfscan, and pbzip2
• Server programs: knot and apache
• SPLASH-2 suite: ocean, water-nsq, fft, and radix

Record and Replay Performance
[Figure: record and replay overhead per benchmark; chart annotations: 86% slowdown, 39%, 2.4% slowdown]
• Recording: 39% on average
• Replay: similar to recording; much lower for I/O-intensive programs

Effectiveness of Coarse-grained Weak-locks
[Figure: recording overhead for different weak-lock granularities; chart annotations: 53x and 1.39x]
• Coarse-grained weak-locks reduce the cost of instrumentation
• Exception: control-flow dependency (e.g., pfscan)

Breakdown of Recording Overhead
[Figure: recording overhead per benchmark, broken down by legend entries: func wait, func log, loop wait, loop log, instr/bb wait, instr/bb log, sync op & system log]
• Weak-lock overhead = contention (waiting) cost + logging cost
• High loop-lock contention
• High instr/bb-lock contention

Scalability
• Scientific applications scale worse due to imprecise symbolic bounds analysis

Conclusion
Goal: software-only deterministic multiprocessor replay
Chimera analysis
• Static data race analysis
  • Find and protect potential data races with weak-locks
  • Instruction/basic-block-level weak-locks
• Profiling non-concurrent code regions
  • Address the inadequacy of the lockset-based algorithm
  • Function-level weak-locks
• Symbolic bounds analysis
  • Address the imprecision of conservative pointer analysis
  • Loop-level weak-locks
Low recording overhead
• 39% recording overhead for 4 worker threads

Thank you
