Lazy diagnosis is a hybrid dynamic/static technique for diagnosing in-production concurrency bugs, so that user-impacting issues can be fixed even under rapid release cycles.
Lazy Diagnosis of In-Production Concurrency Bugs Baris Kasikci, Weidong Cui, Xinyang Ge, Ben Niu
Why Does In-Production Bug Diagnosis Matter? • Potential to fix bugs that impact users • Short release cycles make in-house testing challenging • Release cycles can be as frequent as a few times a day [1] • [1] https://code.facebook.com/posts/270314900139291/rapid-release-at-massive-scale
Concurrency Bug Diagnosis • Atomicity violation example: one thread runs if (*x) { y = *x; } (two reads, R ... R) while another thread runs free(x); x = NULL; (a write, W) in between • Concurrency bug diagnosis requires knowing the order of key events (e.g., memory accesses)
Challenges of Concurrency Bug Diagnosis • Diagnosis requires reproducing bugs [PBI, ASPLOS’13] [Gist, SOSP’15] • Practitioners report that they can fix reproducible bugs [PLATEAU’14] • It may not be possible to reproduce in-production concurrency bugs • Inputs for reproducing bugs may not be available • Exposing bugs in production may incur high overhead [RaceMob, SOSP’13]
Record/Replay • Tracing fine-grained interleavings incurs high overhead • State-of-the-art record/replay has 28% overhead [DoublePlay, ASPLOS’11] • [Figure: atomicity violation timeline; one thread reads (R), the other thread writes (W) ΔT1 later, and the first thread reads again (R) ΔT2 later] • In theory, ΔT can be on the order of a nanosecond
Coarse Interleaving Hypothesis • Study with 54 bugs in 13 systems • Smallest ΔT is 91 microseconds • [Figure: atomicity violation timeline (R, W, R) with ΔT1 and ΔT2; the smallest observed ΔT of 91 µs is roughly 10^5 times the ~1 ns theoretical minimum] • A lightweight, coarse-grained time tracking mechanism can help infer ordering
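The hypothesis above can be sketched as a small ordering check. This is only an illustration, assuming each key event carries a coarse timestamp that is accurate to within a known granularity. The `Event` type, the `happens_before` helper, and the 30 µs granularity are hypothetical names and values, not part of Snorlax.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    thread: int      # id of the thread that performed the access
    kind: str        # "R" for a read, "W" for a write
    time_us: float   # coarse timestamp in microseconds (hypothetical field)


def happens_before(a: Event, b: Event, granularity_us: float) -> Optional[bool]:
    """Return True if a precedes b, False if b precedes a,
    and None when the two timestamps are too close to order reliably."""
    delta = b.time_us - a.time_us
    if abs(delta) <= granularity_us:
        return None
    return delta > 0


# Two accesses 91 us apart are easy to order with a 30 us clock.
r1 = Event(thread=1, kind="R", time_us=0.0)
w = Event(thread=2, kind="W", time_us=91.0)
too_close = Event(thread=2, kind="W", time_us=10.0)

print(happens_before(r1, w, granularity_us=30.0))
print(happens_before(r1, too_close, granularity_us=30.0))
```

If the gaps between key events were near a nanosecond, every query would fall into the "too close" case, which is why the measured 91 µs minimum matters.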
Lazy Diagnosis • Leverages the coarse interleaving hypothesis • Hybrid dynamic/static root cause diagnosis technique • Snorlax: Lazy Diagnosis prototype • Fully accurate concurrency bug diagnosis (11 bugs in 7 systems) • Low overhead (always below 2%)
Outline • Usage model • Design • Evaluation
Current Bug Diagnosis Model • [Figure: root cause diagnosis workflow]
Lazy Diagnosis Usage Model • [Figure: control-flow trace and timing info feed Lazy Diagnosis, which produces the root cause] • Control-flow trace speeds up static analysis • Coarse-grained timing information helps determine ordering
Outline • Usage model • Design • Evaluation
Lazy Diagnosis • Hybrid Points-to Analysis • Type-based Ranking • Bug Pattern Computation • Statistical Diagnosis
Hybrid Points-to Analysis • [Figure: failing instruction (crash) load %Queue*, %fifo, with candidate instructions store i32* %21, %bufSize and store %Queue* %1, %q; instruction labels I1, I2, IF] • Finds instructions with operands pointing to the same location as the failing instruction’s operand
Hybrid Points-To Analysis • Uses the control flow traces to limit the scope of static analysis • Runs fast, scales to large programs (e.g., httpd, MySQL) • Lazy: control flow traces trigger the analysis • Interprocedural: bug patterns may span multiple functions • Flow-insensitive: discards execution order of instructions for scalability
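A toy sketch of the idea, assuming a simplified instruction set with only address-of and copy constraints. It computes flow-insensitive, inclusion-based points-to sets, but only over instructions that appear in the control-flow trace. The tuple encoding and the names `points_to` and `may_alias` are hypothetical and far simpler than an analysis over real LLVM IR.

```python
from collections import defaultdict


def points_to(instructions, executed):
    """Flow-insensitive points-to sets restricted to executed instructions.

    instructions: dict mapping id -> ("addr", dst, obj)  meaning dst = &obj
                                  or ("copy", dst, src)  meaning dst = src
    executed:     set of instruction ids seen in the control-flow trace
    """
    pts = defaultdict(set)
    # The trace limits the scope: unexecuted instructions are never analyzed.
    scoped = [ins for i, ins in instructions.items() if i in executed]
    changed = True
    while changed:  # iterate to a fixed point; instruction order is ignored
        changed = False
        for kind, dst, src in scoped:
            new = {src} if kind == "addr" else pts[src]
            if not new <= pts[dst]:
                pts[dst] |= new
                changed = True
    return pts


def may_alias(pts, p, q):
    """Two pointers may alias if their points-to sets overlap."""
    return bool(pts[p] & pts[q])


program = {
    1: ("addr", "q", "queue_obj"),
    2: ("copy", "fifo", "q"),
    3: ("addr", "bufSize", "size_obj"),
    4: ("copy", "fifo", "bufSize"),  # present in the code, absent from the trace
}
traced = points_to(program, executed={1, 2, 3})
whole = points_to(program, executed={1, 2, 3, 4})
print(may_alias(traced, "fifo", "bufSize"), may_alias(whole, "fifo", "bufSize"))
```

In this example the whole-program run reports a spurious alias between `fifo` and `bufSize`, while the trace-scoped run does not, which illustrates how the trace can both shrink the analysis and sharpen its answer.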
Type-Based Ranking • Failing instruction (crash): load %Queue*, %fifo • Rank 1: store %Queue* %1, %q • Rank 2: store i32* %21, %bufSize • Highly ranks instructions operating on types that match the failing instruction's operand type
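The ranking step can be sketched as a stable sort on a type-match flag. This is a minimal illustration, assuming each candidate has already been annotated with its operand type as a string. The dictionary layout and the `rank_by_type` name are hypothetical.

```python
def rank_by_type(candidates, failing_type):
    """Put candidates whose operand type matches the failing operand's type first.

    sorted() is stable, so candidates with the same match status keep
    their original relative order.
    """
    return sorted(candidates, key=lambda c: c["type"] != failing_type)


candidates = [
    {"inst": "store i32* %21, %bufSize", "type": "i32*"},
    {"inst": "store %Queue* %1, %q", "type": "%Queue*"},
]

# The failing instruction is: load %Queue*, %fifo
ranked = rank_by_type(candidates, failing_type="%Queue*")
for position, cand in enumerate(ranked, start=1):
    print(position, cand["inst"])
```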
Bug Pattern Computation • [Figure: Bug Pattern I and Bug Pattern II, two candidate orderings across Thread 1 and Thread 2 that pair the failing load %Queue*, %fifo with the candidate stores store %Queue* %1, %q and store i32* %21, %bufSize]
Bug Pattern Computation • Our implementation uses timing packets in Intel Processor Trace • Granularity of a few tens of microseconds • We measured the smallest ΔT between key events as 91 microseconds • Leverages the coarse interleaving hypothesis to establish instruction orders
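A minimal sketch of turning timestamped accesses into an ordered pattern, assuming each access is a `(thread, kind, time_us)` tuple. It refuses to produce a pattern when two adjacent accesses are closer than the timer granularity. The helper names, the tuple layout, and the restriction to the R-W-R shape are assumptions made for illustration only.

```python
def bug_pattern(events, granularity_us):
    """Order accesses by coarse timestamp and return the (thread, kind) sequence.

    Returns None if any two adjacent accesses are too close in time
    for the coarse timestamps to order them.
    """
    ordered = sorted(events, key=lambda e: e[2])
    for earlier, later in zip(ordered, ordered[1:]):
        if later[2] - earlier[2] <= granularity_us:
            return None
    return tuple((thread, kind) for thread, kind, _ in ordered)


def is_rwr_atomicity_violation(pattern):
    """Check for a read, a remote write, then a read by the first thread."""
    if pattern is None or len(pattern) != 3:
        return False
    (t1, k1), (t2, k2), (t3, k3) = pattern
    return (k1, k2, k3) == ("R", "W", "R") and t1 == t3 and t1 != t2


# Accesses arrive unordered; timestamps are in microseconds.
accesses = [(1, "R", 300.0), (2, "W", 191.0), (1, "R", 100.0)]
pattern = bug_pattern(accesses, granularity_us=30.0)
print(pattern, is_rwr_atomicity_violation(pattern))
```

With a hypothetical 100 µs granularity the same accesses could not be ordered, since the first two are only 91 µs apart.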
Statistical Diagnosis • [Figure: six executions, each a different interleaving of store %Queue* %1, %q and load %Queue*, %fifo across Thread 1 and Thread 2; two end in FAILURE (CRASH) and four in SUCCESS] • Statistical identification of failure predicting patterns
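One way to identify failure-predicting patterns is to score each pattern by how well its presence predicts a failed run. The sketch below uses the F1 score (the harmonic mean of precision and recall). That metric is an assumption made for illustration, since the slides do not name the statistic used. The pattern labels "W->R" and "R->W" are hypothetical.

```python
def f1_scores(runs):
    """Score each pattern by how well it predicts failure.

    runs: list of (observed_patterns, failed) pairs, where observed_patterns
          is a set of pattern labels and failed is a bool.
    Returns a dict mapping pattern -> F1 score in [0, 1].
    """
    total_failures = sum(1 for _, failed in runs if failed)
    patterns = set().union(*(observed for observed, _ in runs))
    scores = {}
    for pat in patterns:
        seen = sum(1 for observed, _ in runs if pat in observed)
        seen_and_failed = sum(
            1 for observed, failed in runs if pat in observed and failed
        )
        precision = seen_and_failed / seen if seen else 0.0
        recall = seen_and_failed / total_failures if total_failures else 0.0
        denom = precision + recall
        scores[pat] = 2 * precision * recall / denom if denom else 0.0
    return scores


# Two failing runs and four successful runs, as in the slide's figure.
runs = [
    ({"W->R"}, True),
    ({"R->W"}, False),
    ({"R->W"}, False),
    ({"R->W"}, False),
    ({"R->W"}, False),
    ({"W->R"}, True),
]
scores = f1_scores(runs)
best = max(scores, key=scores.get)
print(best, scores[best])
```

In this toy data the "W->R" pattern appears in every failing run and in no successful run, so it gets the top score.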
Outline • Usage model • Design • Evaluation
Evaluation of Snorlax • Is Snorlax effective? • Is Snorlax accurate? • Is Snorlax efficient? • How does Snorlax compare to its competition?
Experimental Setup • Real-world C/C++ programs • 11 concurrency bugs • Workloads from the programs’ test cases and test cases by other researchers
Snorlax’s Effectiveness • Snorlax correctly identified the root causes of 11 bugs • Determined after manual investigation of developer fixes • A single failure recurrence is enough for root cause diagnosis • In practice, for concurrency bugs, “event orders” = “root cause” • Snorlax can effectively diagnose concurrency bugs
Snorlax’s Accuracy • [Figure: accuracy contribution of each stage] • All stages of Lazy Diagnosis are necessary for full accuracy
Snorlax’s Efficiency • [Figure: percentage overhead; labeled value 0.97%] • Snorlax has low runtime performance overhead (always below 2%)
Snorlax vs. Gist • [Figure: percentage overhead of Snorlax and Gist as the number of application threads grows; labeled values 39%, 3%, 1.9%, 0.9%] • Snorlax scales better than Gist with the increasing number of application threads
Lazy Diagnosis • Leverages the coarse interleaving hypothesis • Hybrid dynamic/static root cause diagnosis technique • Snorlax: Lazy Diagnosis prototype • Fully accurate concurrency bug diagnosis (11 bugs in 7 systems) • Low overhead (always below 2%) • Scales well with the increasing number of threads • Michigan is hiring!