Effective and Inexpensive (Memory) Race Recording. Min Xu Thesis Defense 05/04/2006 Electrical and Computer Engineering Department, UW-Madison Advisors: Mark Hill, Rastislav Bodik Committee: Remzi Arpaci-Dusseau, Mikko Lipasti, Barton Miller, David Wood. Effective. Inexpensive.
Effective and Inexpensive (Memory) Race Recording • Min Xu • Thesis Defense, 05/04/2006 • Electrical and Computer Engineering Department, UW-Madison • Advisors: Mark Hill, Rastislav Bodik • Committee: Remzi Arpaci-Dusseau, Mikko Lipasti, Barton Miller, David Wood
Overview • Increasingly useful to replay multithreaded code • Race recording: key to dealing with nondeterminism • A Case Study • Long recording: 1 byte/kilo-instr • Always-on recording: less than 2% overhead • Low cost: 24 KB RAM/core • Supports both SC & TSO (x86-like) [Diagram labels: Race Recorder; Effective; Inexpensive; Long Recording; More Applicable; Low Overhead; Low Cost]
Thesis Contributions (Effective, Inexpensive) • RTR Algorithm: Small Log Size • Coherence Piggyback: Low Runtime Overhead • Set/LRU Approximation: Low Cost Hardware • Order-Value Hybrid: SC & TSO Applicability
Outline • Motivation & Problem (5 slides) • An Effective and Inexpensive Race Recorder (21 slides) • RTR Algorithm • Coherence Piggyback • Set/LRU Approximation • Order-Value Hybrid • Evaluation Method & Results (6 slides) • Conclusion & My Other Research (3 slides)
Multithreaded Debugging
• Sequential program (hash.c): the crash reproduces under gdb
  % gcc hash.c
  % a.out
  Segmentation fault
  % gdb a.out
  gdb> run
  Program received SIGSEGV.
  In get() at hash.c:45
  45 a = bucket->d;
• Multithreaded program (para-hash.c): the crash does not reproduce under gdb
  % gcc para-hash.c
  % a.out
  Segmentation fault
  % gdb a.out
  gdb> run
  Program exited normally.
• Multithreaded program with race recording: the crash reproduces from the log
  % gcc para-hash.c
  % a.out
  Segmentation fault
  Race recorded in “log”
  % gdb a.out log
  gdb> run
  Program received SIGSEGV.
  In get() at para-hash.c:67
  67 a = bucket->d;
Race Recording [Figure: Thread I runs X = 1; X++; print(X) while Thread J runs X = X*5. In the original run J's update falls between X = 1 and X++, so X=6; in a replay where J's update falls after X++, X=10. Recording the race in a log lets the replay reproduce X=6.]
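The example on this slide can be sketched in a few lines of Python. This is only an illustration of why the interleaving has to be recorded: the step names, the `run` helper, and the idea of storing the whole schedule as the "log" are assumptions made for the sketch, not the recorder's actual log format.

```python
def run(schedule):
    """Execute one interleaving of the two threads and return what thread I prints."""
    x = None
    printed = None
    for step in schedule:
        if step == "I: X = 1":
            x = 1
        elif step == "I: X++":
            x += 1
        elif step == "J: X = X*5":
            x *= 5
        elif step == "I: print(X)":
            printed = x
    return printed

# Original run: J's multiply lands between I's assignment and increment.
original = ["I: X = 1", "J: X = X*5", "I: X++", "I: print(X)"]
# A replay without a log may pick a different interleaving.
unrecorded_replay = ["I: X = 1", "I: X++", "J: X = X*5", "I: print(X)"]
# Hypothetical "log": here simply the recorded order of the racing steps.
log = list(original)

print(run(original))           # 6
print(run(unrecorded_replay))  # 10
print(run(log))                # 6, same as the original
```

A real recorder would log far less than the full schedule; the later slides are about keeping only the cross-thread dependences that matter.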
Recording for Multithreaded Replay • Race Recording (focus) • Not an issue for a single thread • Create the same general & data races • Checkpointing • Provide a snapshot of the program state • Many proposals (e.g., SafetyNet), not the focus • Input Recording • Provide repeatable inputs • Some proposals (e.g., part of FDR), not the focus
A Good Race Recorder • Low runtime overhead • Applicability • Low cost • Long recording: small log [Terminal example: % gcc para-hash.c • % a.out • Segmentation fault • Race recorded in “log” • % gdb a.out log • gdb> run • Program received SIGSEGV. • In get() at para-hash.c:67 • 67 a = bucket->d;]
Desired & Existing Race Recorders [Figure label: Our Recorder]
Small Log Size Coherence Piggyback RTR Algorithm Order-Value Hybrid Set/LRU Approximation
Problem Formulation • Reproduce exact same conflicts: no more, no less [Figure: recording and replay of two threads, linked by a log; dependences in black, conflicts in red. Thread I: ld A, st B, st C, ld D, sub, ld B. Thread J: add, st C, ld B, st A, st C, st D.]
Log All Conflicts • Detect conflicts • Write log • Assign IC (logical timestamps) • But too many conflicts [Figure: Thread I (1: ld A, 2: st B, 3: st C, 4: ld D, 5: sub, 6: ld B) and Thread J (1: add, 2: st C, 3: ld B, 4: st A, 5: st C, 6: st D), recorded and replayed.] Dependence log (source IC → destination IC, 16 bytes per entry): Log J: 2→3, 1→4, 3→5, 4→6 • Log I: 2→3 • Log size: 5*16 = 80 bytes (10 integers)
Netzer’s Transitive Reduction [Figure: Thread I (1: ld A, 2: st B, 3: st C, 4: ld D, 5: sub, 6: ld B) and Thread J (1: add, 2: st C, 3: ld B, 4: st A, 5: st C, 6: st D); one dependence is TR reduced.] TR reduced log: Log J: 2→3, 3→5, 4→6 • Log I: 2→3 • Log size: 64 bytes (8 integers)
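The reduction on this slide can be reproduced with a small sketch, reading each log entry as a pair (source IC, destination IC), so that the slide's "23 14 35 46" becomes (2,3), (1,4), (3,5), (4,6). The function names and the 8-bytes-per-integer size are assumptions for illustration; the rule shown is the two-thread special case, where a dependence is redundant if another dependence in the same direction starts no earlier and ends no later.

```python
def implied_by(dep, other):
    """True if `other` (s2 -> d2) makes `dep` (s -> d) redundant.

    Program order gives s <= s2 on the source thread and d2 <= d on the
    destination thread, so s happens before d whenever s2 happens before d2.
    """
    s, d = dep
    s2, d2 = other
    return other != dep and s <= s2 and d2 <= d

def transitive_reduction(deps):
    """Drop every dependence that some other dependence already implies."""
    return [dep for dep in deps if not any(implied_by(dep, o) for o in deps)]

def log_bytes(*logs, bytes_per_integer=8):
    """Log size if every dependence is stored as two integers (assumed 8 bytes each)."""
    return sum(len(log) for log in logs) * 2 * bytes_per_integer

# Dependences from the slide, as (source IC, destination IC).
log_j = [(2, 3), (1, 4), (3, 5), (4, 6)]   # thread I -> thread J
log_i = [(2, 3)]                           # thread J -> thread I

reduced_j = transitive_reduction(log_j)
reduced_i = transitive_reduction(log_i)

print(reduced_j)                        # [(2, 3), (3, 5), (4, 6)]
print(log_bytes(log_j, log_i))          # 80
print(log_bytes(reduced_j, reduced_i))  # 64
```

Only 1→4 is dropped: it is implied by 2→3 together with program order on both threads.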
The Intuition of the New RTR Algorithm • Regulate Replay (RTR) [Figure: dependences after reduction, grouped into vectors from I to J and vectors from J to I]
Stricter Dependences to Aid Vectorization [Figure: Thread I (1: ld A, 2: st B, 3: st C, 4: ld D, 5: sub, 6: ld B) and Thread J (1: add, 2: st C, 3: ld B, 4: st A, 5: st C, 6: st D), replayed with stricter dependences.] New reduced log: Log J: 2→3, 4→5 (stricter) • Log I: 2→3 • Reduced log size: 48 bytes (6 integers)
Compress Vectorized Dependencies • Reduce log size to KB/core/second [Figure: Thread I (1: ld A, 2: st B, 3: st C, 4: ld D, 5: sub, 6: ld B) and Thread J (1: add, 2: st C, 3: ld B, 4: st A, 5: st C, 6: st D), replayed from vector dependences.] Vectorized log: Log J: x=3,5, ∆=1 • Log I: x=3, ∆=1 • Log size: 40 bytes (5 integers)
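A rough sketch of the last two steps, under stated assumptions: the stricter dependence 4→5 stands in for 3→5 and 4→6, and dependences with the same ∆ (destination IC minus source IC) are stored as one ∆ plus a list of destination ICs x. This grouping is one reading of the slide's "x=3,5, ∆=1" notation, and the helper names are hypothetical. The sketch does not check that a stricter dependence is legal in the recorded execution, which a real RTR implementation must guarantee.

```python
def implies(strict, dep):
    """True if enforcing `strict` (s2 -> d2) also enforces `dep` (s -> d)."""
    s2, d2 = strict
    s, d = dep
    return s <= s2 and d2 <= d

def vectorize(deps):
    """Group dependences by delta = destination IC - source IC."""
    groups = {}
    for s, d in deps:
        groups.setdefault(d - s, []).append(d)
    return groups

def encoded_integers(groups):
    """One integer for each delta plus one per destination IC."""
    return sum(1 + len(xs) for xs in groups.values())

tr_log_j = [(2, 3), (3, 5), (4, 6)]    # after transitive reduction
rtr_log_j = [(2, 3), (4, 5)]           # stricter log from the slide
rtr_log_i = [(2, 3)]

# Every original dependence is still enforced by the stricter log.
covered = all(any(implies(st, dep) for st in rtr_log_j) for dep in tr_log_j)

vec_j = vectorize(rtr_log_j)
vec_i = vectorize(rtr_log_i)
total = encoded_integers(vec_j) + encoded_integers(vec_i)

print(covered)       # True
print(vec_j, vec_i)  # {1: [3, 5]} {1: [3]}
print(total * 8)     # 40 bytes, assuming 8-byte integers
```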
Low Runtime Overhead Coherence Piggyback RTR Algorithm Set/LRU Approximation Order-Value Hybrid
Detect Conflicts • Expensive in software [Figure: recording of Thread I (1: ld A, 2: st B, 3: st C, 4: st A) and Thread J (1: add, 2: st C, 3: ld B), with per-address metadata such as A.readers and A.writer]
Bookkeeping per access:
• I 1, ld A: A.readers.add(I, 1)
• I 2, st B: B.writer = (I, 2)
• J 2, st C: C.writer = (J, 2)
• I 3, st C: if (C.writer != I) log(WAW); foreach C.readers: if (reader != I) log(WAR); C.readers.clear(); C.writer = (I, 3)
• J 3, ld B: if (B.writer != J) log(RAW); B.readers.add(J, 3)
• …
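The bookkeeping on this slide can be written out as a minimal software sketch. The class and method names are hypothetical; the logic follows the slide's pseudocode: each address keeps its last writer and the readers since that write, and a conflict is logged whenever the previous accessor is a different thread.

```python
from collections import defaultdict

class ConflictDetector:
    """Software conflict detection with per-address reader/writer metadata."""

    def __init__(self):
        self.writer = {}                  # addr -> (thread, ic) of the last store
        self.readers = defaultdict(dict)  # addr -> {thread: ic} of loads since that store
        self.log = []                     # (kind, (src_thread, src_ic), (dst_thread, dst_ic))

    def load(self, thread, ic, addr):
        last = self.writer.get(addr)
        if last is not None and last[0] != thread:
            self.log.append(("RAW", last, (thread, ic)))
        self.readers[addr][thread] = ic

    def store(self, thread, ic, addr):
        last = self.writer.get(addr)
        if last is not None and last[0] != thread:
            self.log.append(("WAW", last, (thread, ic)))
        for reader, reader_ic in self.readers[addr].items():
            if reader != thread:
                self.log.append(("WAR", (reader, reader_ic), (thread, ic)))
        self.readers[addr].clear()
        self.writer[addr] = (thread, ic)

# Replay the first accesses from the slide, in the order shown there.
det = ConflictDetector()
det.load("I", 1, "A")
det.store("I", 2, "B")
det.store("J", 2, "C")
det.store("I", 3, "C")   # WAW with J's store at IC 2
det.load("J", 3, "B")    # RAW with I's store at IC 2

for entry in det.log:
    print(entry)
```

Every load and store pays for a metadata lookup and update here, which is why the slide calls the software approach expensive.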
Use Cache and Cache Coherence • Detect conflict in hardware with little runtime cost [Figure: Proc J executes ld B and sends a Get/S request; the data response from Proc I carries a timestamp; RAW detected & logged. The A.readers, A.writer, B.readers, B.writer metadata appear as a Timestamp column in each cache (Tag State Data Timestamp). Proc I: A S … 1, B M … 4. Proc J: A S … 3, B I … 2.]
Cache Evictions and Writebacks • OK with nonsilent eviction & directory eviction [Figure: a st A between Proc I and Proc J; messages Get/S, Inv, and Ack (Timestamp?); WAR detected & logged. Caches (Tag State Data Timestamp): A S … 1, B M … 4 and A S … 3, B I … 2, with updated entries M … 4 and C M … 3. Directory of A: Shared(I,J), Owner().]
Implement TR and RTR in Hardware • Ideal TR requires vector timestamps • Too expensive • New idea: Pairwise-TR (use scalar timestamp) • Enable pairwise transitive reduction • Optimal RTR algorithm is likely expensive • Implement a greedy RTR algorithm • One-pass, online algorithm • Keep a sliding window of vectorizable dependencies
Coherence Piggyback RTR Algorithm Low Cost Hardware Set/LRU Approximation Order-Value Hybrid
Timestamp Approximation • Correct, but more evictions lead to more logged conflicts [Figure: recording of Thread I (1: ld A, 2: st B, 3: st C, then ld D) and Thread J (1: add, 2: st C, 3: ld B, then st A). One set of I’s cache (Tag State Data Timestamp): A S … 1, B M … 2, and a new entry C M … 3. Directory of A: Shared(I). Use the current IC of thread I as the timestamp.]
[Figure: hardware cost vs. log size]
Set/LRU Approximation • Set/LRU better preserves reducibility • Small cache: more misses but still a small log [Figure: recording of Thread I (1: ld A, 2: st B, 3: st C, then ld D) and Thread J (1: add, 2: st C, 3: ld B, then st A). One set of I’s cache (Tag State Data Timestamp): A S … 1, B M … 2, and a new entry C M … 3. LRU guarantees B’s TS > A’s TS, in contrast to using the current IC of thread I.]
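One way to read the Set/LRU idea, sketched for a single cache set: with LRU replacement, any block that is no longer in the set was last touched before every block still in it, so the timestamp of the set's LRU entry is a safe upper bound for a missing block and is usually tighter than the thread's current IC. The class below is a hypothetical illustration of that reading, not the hardware design, and it assumes timestamps grow with every access.

```python
from collections import OrderedDict

class TimestampSet:
    """One set of a timestamp memory with LRU replacement."""

    def __init__(self, ways):
        self.ways = ways
        self.entries = OrderedDict()  # block -> timestamp, least recently used first

    def touch(self, block, timestamp):
        """Record an access; evict the LRU entry if the set overflows."""
        self.entries.pop(block, None)
        self.entries[block] = timestamp
        if len(self.entries) > self.ways:
            self.entries.popitem(last=False)

    def lookup(self, block, current_ic):
        """Exact timestamp if present, otherwise a conservative approximation."""
        if block in self.entries:
            return self.entries[block]
        if self.entries:
            # Set/LRU: the oldest surviving timestamp bounds any evicted one.
            return next(iter(self.entries.values()))
        return current_ic  # nothing to go on: fall back to the current IC

# Thread I from the slide: ld A (IC 1), st B (IC 2), st C (IC 3) in a 2-way set.
tsm = TimestampSet(ways=2)
tsm.touch("A", 1)
tsm.touch("B", 2)
tsm.touch("C", 3)   # evicts A, the LRU entry

print(tsm.lookup("A", current_ic=3))  # 2: B's timestamp, not the current IC 3
print(tsm.lookup("C", current_ic=3))  # 3: exact
```

The approximation (2) is closer to A's true timestamp (1) than the current IC (3), so the dependence it produces is less strict.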
Hardware Cost of Timestamps: Coupled Timestamp Memory • Coupled timestamp memory: overhead grows with cache size • Not flexible • 64B line + 64b (24b) timestamp: 12.5% (4.7%) overhead • 192 KB for a 4MB L2 • Need to modify cache [Figure: cache with a timestamp per line (Tag State Data Timestamp): A S … 1, B M … 2]
Decoupled Timestamp Memory • Decoupling allows a small timestamp memory (Set/LRU) • e.g., 32-set, 64-way: 99% transitive reduction • Timestamp memory: 24 KB • No need to modify cache • From 192 KB to 24 KB: 8x reduction [Figure: coupled timestamp memory (Tag State Data Timestamp: A S … 1, B M … 2) vs. an unmodified cache (Tag State Data: A S …, B M …) plus a separate timestamp memory (Tag Timestamp: A 1, B 2)]
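The cost figures on these two slides follow from simple arithmetic, checked below. The inputs (64-byte lines, a 4 MB L2, 64-bit or 24-bit timestamps, a 32-set 64-way timestamp memory of 24 KB) come from the slides; the variable names are assumptions, and the bytes-per-entry figure is only what the 24 KB budget implies, since the slides do not say how an entry is laid out.

```python
LINE_BYTES = 64
L2_BYTES = 4 * 1024 * 1024

def coupled_overhead(timestamp_bits):
    """Timestamp bits as a fraction of the data bits in one cache line."""
    return timestamp_bits / (LINE_BYTES * 8)

lines = L2_BYTES // LINE_BYTES                # cache lines in the L2
coupled_kb = lines * 24 // 8 // 1024          # one 24-bit timestamp per line

entries = 32 * 64                             # 32 sets x 64 ways
decoupled_kb = 24
bytes_per_entry = decoupled_kb * 1024 // entries

print(round(coupled_overhead(64) * 100, 1))   # 12.5 (%)
print(round(coupled_overhead(24) * 100, 1))   # 4.7 (%)
print(lines, coupled_kb)                      # 65536 192
print(entries, bytes_per_entry)               # 2048 12
print(coupled_kb // decoupled_kb)             # 8 (x reduction)
```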
Coherence Piggyback RTR Algorithm Set/LRU Approximation Order-Value Hybrid SC & TSO Applicability
Recording with Total Store Order (TSO) • Majority of existing multiprocessors are non-SC • TSO is well defined, x86-like [Figure: with A=B=0 initially, Thread I runs 1: st A,1; 2: ld B and Thread J runs 1: st B,1; 2: ld A. The SC and TSO panels show the possible orders of these four accesses and the values the loads can see (A=0 or A=1, B=0 or B=1).]
TSO Execution [Figure: Thread I (1: st A,1; 2: ld B) and Thread J (1: st B,1; 2: ld A), with A=B=0 initially. Each thread’s store (A=1 for I, B=1 for J) sits in its write buffer (WrBuf) while both loads read from the memory system and see A=0 and B=0.]
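The outcome on this slide can be checked with a toy model. The sketch below enumerates every sequentially consistent interleaving of the two-thread program, then runs one TSO-style execution in which both stores wait in per-thread write buffers while the loads execute. The program encoding and function names are assumptions for illustration, and the write-buffer model is deliberately minimal (FIFO buffers, loads forward from their own buffer).

```python
from itertools import permutations

PROGRAM = {
    "I": [("st", "A", 1), ("ld", "B")],
    "J": [("st", "B", 1), ("ld", "A")],
}

def sc_outcomes():
    """All (I's ld B, J's ld A) results under sequential consistency."""
    results = set()
    # A schedule says which thread steps next; program order is kept per thread.
    for schedule in set(permutations("IIJJ")):
        mem = {"A": 0, "B": 0}
        pc = {"I": 0, "J": 0}
        loaded = {}
        for t in schedule:
            op = PROGRAM[t][pc[t]]
            pc[t] += 1
            if op[0] == "st":
                mem[op[1]] = op[2]
            else:
                loaded[t] = mem[op[1]]
        results.add((loaded["I"], loaded["J"]))
    return results

def tso_buffered_run():
    """One TSO run: stores are buffered, loads execute, then buffers drain."""
    mem = {"A": 0, "B": 0}
    wrbuf = {"I": [], "J": []}
    loaded = {}
    for t in ("I", "J"):
        _, addr, val = PROGRAM[t][0]
        wrbuf[t].append((addr, val))          # store waits in the write buffer
    for t in ("I", "J"):
        _, addr = PROGRAM[t][1]
        own = [v for a, v in wrbuf[t] if a == addr]
        loaded[t] = own[-1] if own else mem[addr]
    for t in ("I", "J"):
        for addr, val in wrbuf[t]:
            mem[addr] = val                   # buffers drain to memory
    return (loaded["I"], loaded["J"]), mem

print(sorted(sc_outcomes()))   # [(0, 1), (1, 0), (1, 1)]
print(tso_buffered_run())      # ((0, 0), {'A': 1, 'B': 1})
```

Both loads returning 0 is impossible under SC but allowed under TSO, so an order-only log of an SC-style interleaving cannot reproduce it.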
Order-Value-Hybrid Recording [Figure: Thread I (1: st A,1; 2: ld B) and Thread J (1: st B,1; 2: ld A), with A=B=0 initially and stores held in write buffers (WrBuf). Recording panel: Start Monitor A, Start Monitor B, Stop Monitor B, A Changed!; the WAR is omitted and the value is logged (A=0). Replay panel: the logged value is used (A=0), with B=0.]
Hybrid Recording with TR and RTR • Hybrid recording • All loads get correct values • Hardware similar to OoO SC [Gharachorloo et al. ’91] • Hybrid + TR & RTR • TR will not use the omitted WAR in reduction • RTR vectorizes dependencies more conservatively
Put-it-together: Determinizer/CMP [Figure: 4-core CMP (Core 1 to Core 4) with a shared L2 cache (L1 Dir). Labeled components: L1_I$, L1_D$, L1 Coherence Controller, IC, TSM, TR Reg, RTR Reg, Log.]
Simulation Method • Commercial server hardware • GEMS: http://www.cs.wisc.edu/gems • Full-system (OS + application) executions • 4-core CMP (sequentially consistent) • 1-way in-order issue, 2 GHz • 64KB I/D L1, 4MB L2, 64-byte lines, MOSI directory • Commercial server software • Apache – static web serving • SpecJBB – middleware • OLTP – TPC-C like • Zeus – static web serving
Log Size: 1 byte/kilo-instr • Well within the capability of current machines • Long recording (days – months) needs improvement [Figure, two panels for Apache, JBB, OLTP, Zeus, and AVG: log size in KB/core/s (axis 0 to 200) and in byte/core/kilo-instr (axis 0.0 to 2.0)]
Runtime Overhead • Our recorder can be “always-on” [Figure, two panels for Apache, JBB, OLTP, and Zeus (axis 0 to 100): execution time and interconnection message bandwidth, baseline vs. with race recorder]
Benefits of RTR and Set/LRU (Log Size) [Figure, two panels of log size for Apache, JBB, OLTP, Zeus, and AVG (axis 0 to 100). Improvement by RTR: Pairwise-TR vs. our RTR. Effectiveness of Set/LRU: perfect TSM vs. 24KB Set/LRU TSM.]
Why Do RTR and Set/LRU Work Well? • RTR • Processors execute instructions at similar speeds • Therefore, we can find “vectorizable” dependencies • Set/LRU • Temporal locality makes the LRU timestamps old • We only need to know if a timestamp is “old enough”
Sensitivity and Scalability • A design space of the timestamp memory (TSM) • Size: smaller TSM -> larger log • Read/write timestamps: should be used when the TSM is large • Partial timestamp: 24 bits are enough • Associativity: higher is better for RTR • Scalability of the recorder • Studied with a modest number of processors (2p – 16p) • Commercial workloads, not scientific workloads • Log size increases slowly with the number of cores
Race Recording • Race recording: key to combating nondeterminism • My thesis: an effective & inexpensive recorder • RTR algorithm: small log size • Coherence piggyback: negligible slowdown • Timestamp approximation: low hardware cost • Order-value hybrid: support for SC & TSO • Future work • Improve race recording algorithm • Improve race recorder implementation • Study race replay
Serializability Violation Detector [PLDI’05] • Like a race detector • No a priori annotation requirement • “Critical sections” are inferred • Intended to detect bugs that “actually” happen • Check for a 2-Phase-Locking condition [Figure: a “critical section” over shared variables: Read in1, Read local, Write out1, Write local, Read in2, Write out2]
Publications • FDR (ISCA’03) • Adopted by UCSD BugNet (ISCA’05) • SVD (PLDI’05) • Cited by Vaziri et al. (POPL’06) • Influenced new data race definition • RTR, Set/LRU & Hybrid • Submitted for publication
Thank you! • % gcc para-hash.c • % a.out • Segmentation fault • Race recorded in “log” • % • % gdb a.out log • gdb> run • Program received SIGSEGV. • In get() at para-hash.c:67 • 67 a = bucket->d;
Acknowledgements • Joint work with my advisors • Mark Hill, Ras Bodik • Ph.D. Committee • David Wood, Mikko Lipasti, Remzi Arpaci-Dusseau, Barton Miller • Multifacet Group • Milo Martin, Dan Sorin, Carl Mauer, Brad Beckmann, Kevin Moore, Alaa Alameldeen, Mike Marty, Luke Yen • Affiliates & Companies • Joe Emer, CJ Newburn, Peter Hsu, Bob Zak, Eric Bach, Gang Luo, Alex Chow, IBM, Intel, Microsoft, Sun
Deterministic Replay is Useful • Deterministic Replay is logically recreating a program execution • Present applications • Cyclic Debugging ([Pancake & Netzer ‘93]) • Fault Tolerance (ExtraVirt [Lucchetti et al. ’05]) • Intrusion Analysis (ReVirt [Dunlap et al. ’02]) • Future applications • Data Recovery • Replay-based Synchronization
Multicore and Multithreading • Multicore is common • AMD X2 • IBM Power 5/6, Cell • Intel Pentium D, Core Duo • Sun SPARC T1 • Multithreading is common • Server: high throughput • Scientific: high performance • Desktop/embedded: low response time