Automatically Classifying Benign and Harmful Data Races Using Replay Analysis S. Narayanasamy, Z. Wang, J. Tigani, A. Edwards, B. Calder UCSD and Microsoft PLDI 2007
Motivation • Data races are hard to debug • Difficult to detect • Even more difficult to reproduce • Data race detectors help with detection • LockSet, Happens-Before and Atomicity Violation detectors • But they report too many races • Up to 90% false alarms • Especially with LockSet. We need a tool that detects and reliably classifies all harmful data races
Algorithm Overview • Offline dynamic Happens-Before data race detection • Step 1: Trace Capturing • Step 2: Offline Happens-Before Analysis • Step 3: Replay Critical Segments • Step 4: Automatically Classify Harmful vs. Benign Races
1. Trace Capturing & Replaying • iDNA captures the execution of an application • It records only: • The initial state • Registers and PC • Load values • Only those strictly needed for replay • e.g. the 1st load after a store, DMA, etc. • A global clock (sequencers) • Sequencers are inserted into each thread's replay log at • Synchronization events • System calls
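The logging rule above can be illustrated with a toy recorder. This is a minimal sketch, not iDNA's actual implementation: it assumes a per-thread shadow copy of memory (the hypothetical `ThreadLog.shadow`) and logs a load value only when the thread could not predict it during replay, i.e. on its first access to an address or when another thread or DMA changed the value in the meantime. A shared counter stands in for the global sequencer clock, stamped into the log at synchronization events and system calls.

```python
import itertools
from dataclasses import dataclass, field

# Hypothetical global clock: every sequencer gets a unique, increasing number.
_sequencer = itertools.count()

@dataclass
class ThreadLog:
    """Toy per-thread replay log (illustrative only, not iDNA's format)."""
    tid: int
    entries: list = field(default_factory=list)
    # The value this thread would "predict" for each address during replay.
    shadow: dict = field(default_factory=dict)

    def on_load(self, addr, value):
        # Log only values replay cannot reproduce by itself: the first access
        # to addr, or a value changed behind this thread's back.
        if addr not in self.shadow or self.shadow[addr] != value:
            self.entries.append(("load", addr, value))
        self.shadow[addr] = value

    def on_store(self, addr, value):
        # The thread's own stores are re-executed during replay,
        # so only the shadow copy is updated; nothing is logged.
        self.shadow[addr] = value

    def on_sync_or_syscall(self, kind):
        # Stamp a global sequencer so replay can order events across threads.
        self.entries.append(("seq", kind, next(_sequencer)))

# Usage: one thread's view of a shared location.
memory = {0x10: 7}
t0 = ThreadLog(tid=0)
t0.on_load(0x10, memory[0x10])   # first access: logged
t0.on_load(0x10, memory[0x10])   # predictable: not logged
memory[0x10] = 9                 # another thread (or DMA) writes
t0.on_load(0x10, memory[0x10])   # value changed: logged
t0.on_sync_or_syscall("lock_acquire")
print(t0.entries)
```

Logging only unpredictable loads is what keeps the trace small; everything else is recomputed by re-executing the thread's instructions.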
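2. Offline Happens-Before Analysis • Classic Happens-Before detection • A data race is two conflicting accesses to the same location from different threads • At least one is a write • Not ordered by happens-before • Detects only the data races that actually occurred in the recorded execution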
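The race definition above can be made concrete with vector clocks. This sketch swaps in a textbook vector-clock formulation rather than the paper's offline, sequencer-based analysis; the `Access` record, the `VCTracker` class and its lock acquire/release handling are illustrative assumptions. Two accesses race when they come from different threads, touch the same address, at least one is a write, and neither happens before the other.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Access:
    tid: int
    addr: int
    is_write: bool
    clock: tuple  # the thread's vector clock at the time of the access

def happens_before(a, b):
    """Clock a precedes clock b: component-wise <= and not identical."""
    return all(x <= y for x, y in zip(a, b)) and a != b

def is_race(a, b):
    return (a.tid != b.tid
            and a.addr == b.addr
            and (a.is_write or b.is_write)
            and not happens_before(a.clock, b.clock)
            and not happens_before(b.clock, a.clock))

class VCTracker:
    """Hypothetical offline tracker fed with events from a recorded trace."""

    def __init__(self, nthreads):
        self.n = nthreads
        self.vc = [[0] * nthreads for _ in range(nthreads)]
        self.lock_vc = {}
        self.accesses = []

    def _tick(self, t):
        self.vc[t][t] += 1

    def access(self, t, addr, is_write):
        self._tick(t)
        self.accesses.append(Access(t, addr, is_write, tuple(self.vc[t])))

    def release(self, t, lock):
        self._tick(t)
        self.lock_vc[lock] = list(self.vc[t])  # publish this thread's history

    def acquire(self, t, lock):
        released = self.lock_vc.get(lock, [0] * self.n)
        self.vc[t] = [max(x, y) for x, y in zip(self.vc[t], released)]
        self._tick(t)

    def races(self):
        acc = self.accesses
        return [(a, b) for i, a in enumerate(acc) for b in acc[i + 1:] if is_race(a, b)]

# Usage: x is accessed without synchronization, y is protected by lock "L".
tr = VCTracker(2)
tr.access(0, 0xA, True)    # T0 writes x
tr.access(1, 0xA, False)   # T1 reads x: unordered, so a race
tr.access(0, 0xB, True)    # T0 writes y ...
tr.release(0, "L")
tr.acquire(1, "L")
tr.access(1, 0xB, False)   # ... T1 reads y after acquiring L: ordered
print(len(tr.races()))
```

Because the analysis only sees accesses that were recorded, it can report only races that actually occurred in the traced run.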
3. Replay Critical Segments • When a data race is detected, replay the affected segments twice • First in the original order • Given by the recorded load values • Then with the racing accesses reversed • Compare the live-outs of the two replays and store the result • No-State-Change: all live-outs are identical • State-Change: at least one live-out differs • Replay-Failure: the alternative replay cannot continue • e.g. it loads from a null or previously unrecorded address • or a branch goes to code the trace never executed
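The replay-and-compare step can be sketched on a toy model. Everything here is hypothetical: a schedule is a list of `(tid, op)` pairs over a three-instruction toy ISA (`store`, `load`, `deref`); the racing pair is assumed to sit at adjacent positions `i` and `i + 1`, so swapping them preserves each thread's program order; and `ReplayFailure` is raised when a replay reads a null or unknown address, standing in for the replayer running off the recorded trace. The live-outs are simply the final memory and registers.

```python
class ReplayFailure(Exception):
    """The alternative replay left what the recorded trace can reproduce."""

def replay(schedule, memory):
    """Replay (tid, op) pairs in order and return the live-outs (memory, registers).

    Toy ops:
      ("store", addr, const)   write a constant
      ("load", reg, addr)      read memory into a register
      ("deref", reg, ptr_reg)  read memory at the address held in ptr_reg
    """
    mem = dict(memory)
    regs = {}
    for tid, op in schedule:
        kind = op[0]
        if kind == "store":
            _, addr, const = op
            mem[addr] = const
        elif kind == "load":
            _, reg, addr = op
            if addr not in mem:
                raise ReplayFailure(f"unrecorded address {addr!r}")
            regs[(tid, reg)] = mem[addr]
        elif kind == "deref":
            _, reg, ptr_reg = op
            addr = regs.get((tid, ptr_reg))
            if addr is None or addr not in mem:
                raise ReplayFailure(f"null or unrecorded address {addr!r}")
            regs[(tid, reg)] = mem[addr]
        else:
            raise ValueError(f"unknown op {kind!r}")
    return mem, regs

def replay_outcome(schedule, memory, i):
    """Replay in the original order, then with ops i and i + 1 (the racing pair) swapped."""
    original = replay(schedule, memory)
    swapped = list(schedule)
    swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
    try:
        alternative = replay(swapped, memory)
    except ReplayFailure:
        return "Replay-Failure"
    return "No-State-Change" if alternative == original else "State-Change"

# Usage: one race of each kind, racing pair at positions 0 and 1.
redundant = [(0, ("store", "flag", 1)), (1, ("store", "flag", 1))]
write_read = [(0, ("store", "x", 5)), (1, ("load", "r", "x"))]
publish = [(0, ("store", "p", "obj")), (1, ("load", "r1", "p")), (1, ("deref", "r2", "r1"))]
for name, sched, mem in [("redundant", redundant, {"flag": 0}),
                         ("write/read", write_read, {"x": 0}),
                         ("publish", publish, {"p": None, "obj": 42})]:
    print(name, replay_outcome(sched, mem, 0))
```

In the `publish` case, reversing the race makes thread 1 read the pointer before it is set and then dereference null, which is the kind of divergence the slide labels a replay failure.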
3. Replay Critical Segments cont. • [Figure: Replay Failure → Potentially Harmful Data Race]
4. Automatic Classification • Repeat step 3 for each instance of a data race • Potentially Benign Data Race • every replay results in No-State-Change • Potentially Harmful Data Race • ≥1 replay results in State-Change or Replay Failure • State-Change shows that the program state would differ if the racing accesses executed in the other order • Replay Failure indicates that the program diverged so much that the alternative execution cannot be simulated • Gives concrete evidence that something changed • Easier for programmers to accept than a bare race report
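The aggregation rule above reduces to an "all instances benign" check over the replay outcomes of one static race. A minimal sketch with hypothetical names (`Outcome`, `classify_race`); a race with no replayed instances is rejected rather than guessed at.

```python
from enum import Enum

class Outcome(Enum):
    NO_STATE_CHANGE = "No-State-Change"
    STATE_CHANGE = "State-Change"
    REPLAY_FAILURE = "Replay-Failure"

def classify_race(instance_outcomes):
    """Combine the replay outcomes of every dynamic instance of one race."""
    if not instance_outcomes:
        raise ValueError("need at least one replayed instance")
    # Benign only if no instance ever changed state or failed to replay.
    if all(o is Outcome.NO_STATE_CHANGE for o in instance_outcomes):
        return "Potentially Benign"
    return "Potentially Harmful"

print(classify_race([Outcome.NO_STATE_CHANGE] * 3))
print(classify_race([Outcome.NO_STATE_CHANGE, Outcome.REPLAY_FAILURE]))
```

The asymmetry is deliberate: a single non-benign instance is enough to flag the race, so more instances can only move a race from benign to harmful, never the reverse.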
Evaluation • 18 different executions of various services in Windows Vista and Internet Explorer • Happens-Before reports 16,642 dynamic data races • 68 unique races • Trace capture • 0.8 bits per instruction • 96 MB per 1,000,000,000 instructions • Only first loads and sequencers are captured • 0.3 bits per instruction if compressed with zip
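As a sanity check on the trace-size figures, converting bits per instruction into storage for one billion instructions (assuming "MB" here means binary megabytes, MiB):

```latex
\begin{align*}
0.8\,\tfrac{\text{bits}}{\text{instr}} \times 10^{9}\,\text{instr} &= 8 \times 10^{8}\,\text{bits} = 10^{8}\,\text{bytes} \approx 95.4\,\text{MiB} \\
0.3\,\tfrac{\text{bits}}{\text{instr}} \times 10^{9}\,\text{instr} &= 3 \times 10^{8}\,\text{bits} = 3.75 \times 10^{7}\,\text{bytes} \approx 35.8\,\text{MiB}
\end{align*}
```

The first line agrees with the reported 96 MB once rounding of the 0.8 figure is allowed for; the compressed size in the second line is derived from the 0.3 figure, not reported on the slide.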
Slowdowns • Results for Internet Explorer • P4 Xeon 2.2 GHz, 1 GB of RAM • Slowdowns add up: • 6x for capturing • 10x for replaying (unnecessary) • 45x for offline Happens-Before data race detection • 280x for replay analysis • 2,196 dynamic data races analyzed
Data Race Classifications • [Figure: automatically classified vs. manually classified races, with regions labeled "Impossible State"] • About half of the benign races are identified correctly; the other half are still flagged as potentially harmful • All harmful races are identified correctly • 0 false negatives
True Negatives • 32 real benign races correctly classified as benign • Every instance must return No-State-Change • The more instances, the more confidence in the classification
True Positives: Dangerous Zone • 7 real bugs, correctly identified • Classification requires at least one State-Change or Replay Failure
False Positives • 29 benign races incorrectly classified as potentially harmful • Approximate computation (23/29) • e.g. statistics • Replayer limitation (6/29) • At least one instance caused a replay failure • even though the final outcome is the same
False Positives (cont.) • The 23 false positives that were not caused by replay failure fall into these categories: • User-constructed synchronization • e.g. the garbage collector does not use locks • Double checks • if (a) { lock(…); if (a) { … } } • Both values valid • either value read is acceptable, e.g. a cache or a high-performance shortcut • Redundant writes • rewrite the value already stored • Disjoint bit manipulation • threads modify different bits of the same variable
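The double-check pattern is benign because the unsynchronized first read of `a` only decides whether to take the lock; the second read, under the lock, is the one that decides. A minimal Python sketch of the pattern follows; the `LazyValue` class is hypothetical, and Python's global interpreter lock hides the memory-ordering subtleties that make this pattern delicate in C or C++.

```python
import threading

class LazyValue:
    """Double-checked initialization: if (a) { lock(...); if (a) { ... } }."""

    def __init__(self, factory):
        self._factory = factory
        self._value = None
        self._needs_init = True        # plays the role of 'a' on the slide
        self._lock = threading.Lock()
        self.init_calls = 0            # how many times the factory actually ran

    def get(self):
        if self._needs_init:             # 1st check: unsynchronized (racy) read
            with self._lock:
                if self._needs_init:     # 2nd check: repeated under the lock
                    self._value = self._factory()
                    self.init_calls += 1
                    self._needs_init = False
        return self._value

# Usage: many threads race on the first check, but the factory runs once.
lazy = LazyValue(lambda: sum(range(10)))
threads = [threading.Thread(target=lazy.get) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(lazy.get(), lazy.init_calls)
```

Reversing the racing read of `_needs_init` only changes whether a thread takes the lock, not the final value it returns, which is why a replay that flags it as harmful is a false positive.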
Conclusion • An interesting approach to identifying benign races • It would be interesting to apply it to LockSet • LockSet has far more false positives • But it can detect bugs that did not occur in the recorded production runs • The grand total overhead is not reported
Questions? Thank You!!!