Hardware Memory Race Recording for Deterministic Replay. Mark D. Hill University of Wisconsin—Madison August 10, 2007 Based on joint work with Min Xu & Ras Bodik: ISCA 2003, ASPLOS 2006, IEEE Micro Top Picks 2007, & Xu UW Ph.D. 5/2006 (slides updated from defense talk).
Wisconsin Multifacet Project • Seek improved architectures for (mostly) servers that are (mostly) chip multiprocessors (CMPs, multi-core) • Led by Mark Hill & David Wood • LogTM work w/ Ben Liblit & Mike Swift • Funding • Grants from U.S. National Science Foundation • Donations from Intel and Sun
Selected Multifacet Results (1 of 2) • Multiprocessor Flight Data Recorder • Records memory races for deterministic replay • Piggyback on coherence protocol & logs 0.001B/instrn • Supports SC & TSO • Adaptive L2 Cache & Memory Link Compression • Cache compression creates level 2½ cache (or 3½) • Adaptive so as “to do no harm” • Link compression husbands memory link bandwidth • Multifacet GEMS MP Simulation Infrastructure • Simics==Correctness; GEMS==Performance • GPL Distribution
Selected Multifacet Results (2 of 2) • Log-based Transactional Memory (LogTM) • Accelerates commit by writing new values in place (after saving old values in a per-thread log) • Gracefully handles cache eviction of TM data • LogTM Signature Edition (LogTM-SE) • Signatures summarize read/write sets • HW mechanisms: simple, policy-free, SW accessible • Forthcoming • Mechanisms to handle thread switching/migration & paging of transactions with OS or OS/VMM
Race Recorder Overview • Increasingly useful to replay multithreaded code • Race recording: key to dealing with nondeterminism • A Case Study • Long recording: 1 byte/kilo-instr • Always-on recording: less than 2% overhead • Low cost: 24 KB RAM/core • Support both SC & TSO (x86-like) • Diagram labels: Effective, Inexpensive, Long Recording, More Applicable, Low Overhead, Low Cost
Contributions Low Runtime Overhead Small Log Size Coherence Piggyback Transitive Reduction & Regulated TR Effective Inexpensive Order-Value Hybrid Set/LRU Approximation Low Cost Hardware SC & TSO Applicability
Outline • Motivation & Problem • An Effective and Inexpensive Race Recorder • TR & RTR Algorithms • Coherence Piggyback • Set/LRU Approximation • Order-Value Hybrid • Evaluation Method & Results • Conclusions, etc.
Multithreaded Debugging • Sequential program, the bug reproduces under the debugger: • % gcc hash.c • % a.out • Segmentation fault • % gdb a.out • gdb> run • Program received SIGSEGV. • In get() at hash.c:45 • 45 a = bucket->d; • Multithreaded program, the bug does not reproduce: • % gcc para-hash.c • % a.out • Segmentation fault • % gdb a.out • gdb> run • Program exited normally. • Multithreaded program with race recording, the bug reproduces from the log: • % gcc para-hash.c • % a.out • Segmentation fault • Race recorded in “log” • % gdb a.out log • gdb> run • Program received SIGSEGV. • In get() at para-hash.c:67 • 67 a = bucket->d;
Applications of Deterministic Replay • Deterministic Replay is logically recreating a program execution • Cyclic Debugging ([Pancake & Netzer ‘93]) • Fault Tolerance (ExtraVirt [Lucchetti et al. ’05]) • Intrusion Analysis (ReVirt [Dunlap et al. ’02]) • Data Recovery (Continuous Checkpointing)? • See VMware Workstation 6 • Replay included for single-processor guest VM
Race Recording • Thread I: X = 1; X++; print(X) • Thread J: X = X*5 • Original execution: X=6 • Replay without a race log: X=10 • Recording: the race is written to the log, so replay yields X=6
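The example on this slide can be reproduced with a small sketch. It assumes a hypothetical `run` helper that executes the two threads' operations in a given global order; the two schedules below are chosen to produce the two values shown on the slide and are not taken from the talk itself.

```python
def run(schedule):
    """Run the two threads' operations in the given global order.

    `schedule` is a list of thread names; each entry executes that
    thread's next operation. Returns the value printed by thread I.
    """
    state = {"X": 0, "printed": None}
    thread_i = [
        lambda s: s.update(X=1),               # X = 1
        lambda s: s.update(X=s["X"] + 1),      # X++
        lambda s: s.update(printed=s["X"]),    # print(X)
    ]
    thread_j = [
        lambda s: s.update(X=s["X"] * 5),      # X = X*5
    ]
    programs = {"I": iter(thread_i), "J": iter(thread_j)}
    for tid in schedule:
        next(programs[tid])(state)
    return state["printed"]


# Thread J's multiply lands between X = 1 and X++.
original = run(["I", "J", "I", "I"])
# Thread J's multiply lands after X++ instead: same code, different race outcome.
unlogged_replay = run(["I", "I", "J", "I"])
# A replay that follows the recorded order of the race matches the original.
logged_replay = run(["I", "J", "I", "I"])
```

The only difference between the runs is where thread J's write falls relative to thread I's writes, which is exactly the information a race recorder has to capture.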
Focus Recording for Multithreaded Replay • Race Recording • Not-an-issue for a single thread • Create the same general & data races • Checkpointing • Provide a snapshot of the program state • Many proposals (e.g., SafetyNet), not focus • Input Recording • Provide repeatable inputs • Some proposals (e.g., part of FDR), not focus
A Good Race Recorder • Low runtime overhead • Applicability • Low cost • Long recording: small log • Example session: • % gcc para-hash.c • % a.out • Segmentation fault • Race recorded in “log” • % gdb a.out log • gdb> run • Program received SIGSEGV. • In get() at para-hash.c:67 • 67 a = bucket->d;
Desired & Existing Race Recorders • Comparison table (cell layout lost in extraction); surviving entries: Our Recorder; Strata, ASPLOS ’06; marks V V V X V and "V, but global"
Small Log Size Coherence Piggyback Transitive Reduction & Regulated TR Order-Value Hybrid Set/LRU Approximation
Problem Formulation • Figure: threads I and J shown once during recording (with the log) and once during replay; dependences in black, conflicts in red • Thread I: ld A, st B, st C, ld D, sub, ld B • Thread J: add, st C, ld B, st A, st C, st D • Reproduce exact same conflicts: no more, no less
Log All Conflicts • Detect conflicts, write log • Assign IC (logical timestamps) • But too many conflicts • Thread I (IC 1 to 6): ld A, st B, st C, ld D, sub, ld B • Thread J (IC 1 to 6): add, st C, ld B, st A, st C, st D • Dependence log, Log J: 2→3, 1→4, 3→5, 4→6; Log I: 2→3 (16 bytes per entry) • Log Size: 5*16=80 bytes (10 integers)
Netzer’s Transitive Reduction • Thread I (IC 1 to 6): ld A, st B, st C, ld D, sub, ld B • Thread J (IC 1 to 6): add, st C, ld B, st A, st C, st D • TR reduced log, Log J: 2→3, 3→5, 4→6; Log I: 2→3 • Log Size: 64 bytes (8 integers)
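A minimal sketch of the reduction on this slide's example, under two assumptions: the log entries are read as source IC → destination IC pairs (for example "14" as 1→4), and only one scalar is kept per thread pair, which is the simplification a later slide calls Pairwise-TR rather than a full vector-timestamp implementation. The function name and tuple layout are hypothetical.

```python
BYTES_PER_ENTRY = 16  # two 8-byte integers per logged dependence, as on the slide


def pairwise_reduce(deps):
    """Drop dependences already implied by an earlier logged one.

    `deps` holds (src_thread, src_ic, dst_thread, dst_ic) tuples, listed in
    increasing dst_ic for each destination thread. A dependence s -> d is
    implied if a dependence s2 -> d2 with s2 >= s and d2 < d was already
    logged for the same thread pair: s precedes s2 and d2 precedes d in
    program order, so s -> d follows transitively.
    """
    newest_src = {}  # (src_thread, dst_thread) -> largest src_ic logged so far
    kept = []
    for src, src_ic, dst, dst_ic in deps:
        pair = (src, dst)
        if src_ic <= newest_src.get(pair, 0):
            continue  # implied by an earlier entry, so not logged
        newest_src[pair] = src_ic
        kept.append((src, src_ic, dst, dst_ic))
    return kept


# The five conflicts of the slide's example (instruction counts start at 1).
all_deps = [
    ("J", 2, "I", 3),  # st C -> st C (WAW)
    ("I", 2, "J", 3),  # st B -> ld B (RAW)
    ("I", 1, "J", 4),  # ld A -> st A (WAR), implied by I2 -> J3
    ("I", 3, "J", 5),  # st C -> st C (WAW)
    ("I", 4, "J", 6),  # ld D -> st D (WAR)
]
reduced = pairwise_reduce(all_deps)
full_size = BYTES_PER_ENTRY * len(all_deps)
reduced_size = BYTES_PER_ENTRY * len(reduced)
```

Only the 1→4 entry is dropped, which takes the log from 80 bytes to the 64 bytes quoted on the slide.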
The Intuition of the New RTR Algorithm • After reduction • Regulate Replay (RTR) • Vectors from I to J • Vectors from J to I
Stricter Dependences to Aid Vectorization • Thread I (IC 1 to 6): ld A, st B, st C, ld D, sub, ld B • Thread J (IC 1 to 6): add, st C, ld B, st A, st C, st D • New reduced log (stricter), Log J: 2→3, 4→5; Log I: 2→3 • Reduced Log Size: 48 bytes (6 integers)
Compress Vectorized Dependencies • Thread I (IC 1 to 6): ld A, st B, st C, ld D, sub, ld B • Thread J (IC 1 to 6): add, st C, ld B, st A, st C, st D • Vectorized log (vector deps.), Log J: x=3,5, ∆=1; Log I: x=3, ∆=1 • Log Size: 40 bytes (5 integers) • Reduce log size to KB/core/second
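One way to read the vectorized log, sketched under stated assumptions: each dependence is a (source IC, destination IC) pair, x is the destination IC, ∆ is destination minus source, and dependences sharing the same ∆ are stored as one ∆ plus a list of x values. This encoding is inferred from the numbers on the slide and is not the authors' exact format; `vectorize` and `integers_needed` are hypothetical names.

```python
from collections import defaultdict

BYTES_PER_INTEGER = 8  # the slides count 80 bytes as 10 integers


def vectorize(deps):
    """Group (src_ic, dst_ic) dependences by delta = dst_ic - src_ic."""
    groups = defaultdict(list)
    for src_ic, dst_ic in deps:
        groups[dst_ic - src_ic].append(dst_ic)
    return dict(groups)


def integers_needed(vectors):
    """One integer for each delta plus one per destination IC."""
    return sum(1 + len(dsts) for dsts in vectors.values())


# Regulated (stricter) logs from the previous slide.
log_j = vectorize([(2, 3), (4, 5)])  # delta 1, x = 3, 5
log_i = vectorize([(2, 3)])          # delta 1, x = 3
total_integers = integers_needed(log_j) + integers_needed(log_i)
total_bytes = total_integers * BYTES_PER_INTEGER
```

Under this reading the two logs need five integers in total, matching the 40 bytes on the slide.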
Low Runtime Overhead Coherence Piggyback Transitive Reduction & Regulated TR Set/LRU Approximation Order-Value Hybrid
Detect Conflicts • Per-block metadata: A.readers, A.writer • Thread I, IC 1, ld A: A.readers.add(I, 1) • Thread I, IC 2, st B: B.writer = (I, 2) • Thread J, IC 2, st C: C.writer = (J, 2) • Thread I, IC 3, st C: if (C.writer != I) log(WAW); foreach C.readers: if (reader != I) log(WAR); C.readers.clear( ); C.writer = (I, 3) • Thread J, IC 3, ld B: if (B.writer != J) log(RAW); B.readers.add(J, 3) • … • Recording done this way is expensive in software
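The bookkeeping on this slide can be sketched in software as follows. The class name, the tuple layout of log entries, and the global interleaving in `trace` are assumptions made for illustration; the interleaving is one order consistent with the dependences in the earlier example, with the non-memory operations (add, sub) left out.

```python
from collections import defaultdict


class ConflictDetector:
    """Tracks the last writer and the readers since that write for each address."""

    def __init__(self):
        self.writer = {}                  # addr -> (thread, ic) of the last store
        self.readers = defaultdict(dict)  # addr -> {thread: ic} since the last store
        self.log = []                     # (kind, src_thread, src_ic, dst_thread, dst_ic)

    def load(self, thread, ic, addr):
        last = self.writer.get(addr)
        if last is not None and last[0] != thread:
            self.log.append(("RAW", last[0], last[1], thread, ic))
        self.readers[addr][thread] = ic

    def store(self, thread, ic, addr):
        last = self.writer.get(addr)
        if last is not None and last[0] != thread:
            self.log.append(("WAW", last[0], last[1], thread, ic))
        for reader, reader_ic in self.readers[addr].items():
            if reader != thread:
                self.log.append(("WAR", reader, reader_ic, thread, ic))
        self.readers[addr].clear()
        self.writer[addr] = (thread, ic)


# Assumed interleaving of the memory operations in the running example.
trace = [
    ("I", 1, "ld", "A"), ("I", 2, "st", "B"), ("J", 2, "st", "C"),
    ("I", 3, "st", "C"), ("J", 3, "ld", "B"), ("I", 4, "ld", "D"),
    ("J", 4, "st", "A"), ("J", 5, "st", "C"), ("I", 6, "ld", "B"),
    ("J", 6, "st", "D"),
]
detector = ConflictDetector()
for thread, ic, op, addr in trace:
    if op == "ld":
        detector.load(thread, ic, addr)
    else:
        detector.store(thread, ic, addr)
```

Every load and store pays for a metadata lookup and update, which is the software cost the slide points at and the reason the following slides move detection into the coherence protocol.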
Use Cache and Cache Coherence • Per-block state (A.readers, A.writer, B.readers, B.writer) is kept as a timestamp next to each cache line • Proc I cache (Tag, State, Data, Timestamp): A S … 1; B M … 2 • Proc J cache (Tag, State, Data, Timestamp): A S … 3; B I … 2 • ld B: a Get/S request is answered by a data response that carries the timestamp • RAW detected & logged • Detect conflict in hardware with little runtime cost
Cache Evictions and Writebacks • Figure: Proc I and Proc J caches (Tag, State, Data, Timestamp) with entries A S … 1; B M … 2 and A S … 3; B I … 2; M … 3; C M … 3 • st A: messages shown are Get/S, Inv, and Ack (Timestamp?) • Directory of A: Shared(I,J) Owner() • WAR detected & logged • OK with nonsilent eviction & directory eviction
Implement TR and RTR in Hardware • Ideal TR requires vector timestamps • Too expensive • New idea: Pairwise-TR (use scalar timestamp) • Enable pairwise transitive reduction • Optimal RTR algorithm is likely expensive • Implement a greedy RTR algorithm • One-pass, online algorithm • Keep a sliding window of vectorizable dependencies
Coherence Piggyback Transitive Reduction & Regulated TR Low Cost Hardware Set/LRU Approximation Order-Value Hybrid
Timestamp Approximation • Figure: threads I and J during recording, with one set of I's cache (Tag, State, Data, Timestamp): A S … 1; B M … 2; incoming C M … 3 • Directory of A: Shared(I) • Use current IC of thread I • Correct, but more evictions → more logged conflicts
Hardware Cost Log Size
Set/LRU Approximation • Figure: threads I and J during recording, with one set of I's cache (Tag, State, Data, Timestamp): A S … 1; B M … 2; C M … 3 • LRU guarantees B's TS > A's TS • Use current IC of thread I • Set/LRU better preserves reducibility • Small $ → more misses but still small log
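A sketch of one plausible reading of the Set/LRU idea, not the authors' hardware design. It assumes each access stamps the block with the current instruction count, so LRU order within a set is also timestamp order, and that a block no longer resident is answered with the timestamp of the most recently evicted entry of its set, which is never older than the block's true timestamp. `SetLruTimestamps` and its methods are hypothetical names.

```python
from collections import OrderedDict


class SetLruTimestamps:
    """Small set-associative timestamp memory with LRU replacement."""

    def __init__(self, num_sets, ways):
        self.num_sets = num_sets
        self.ways = ways
        self.sets = [OrderedDict() for _ in range(num_sets)]
        # Newest timestamp evicted from each set; it bounds every absent block.
        self.evicted_bound = [0] * num_sets

    def access(self, block, ic):
        """Record that `block` was accessed at instruction count `ic`."""
        index = block % self.num_sets
        entries = self.sets[index]
        entries.pop(block, None)
        entries[block] = ic  # most recently used entry goes last
        if len(entries) > self.ways:
            _, old_ic = entries.popitem(last=False)  # evict the LRU entry
            self.evicted_bound[index] = old_ic

    def timestamp(self, block):
        """Exact timestamp if resident, otherwise a conservative bound."""
        index = block % self.num_sets
        return self.sets[index].get(block, self.evicted_bound[index])


A, B, C = 0, 1, 2  # block numbers standing in for the slide's A, B, C
tsm = SetLruTimestamps(num_sets=1, ways=2)
tsm.access(A, 1)
tsm.access(B, 2)
tsm.access(C, 3)  # the set is full, so A (the LRU entry) is evicted
```

In this run the evicted block A is reported with timestamp 1, where the plain approximation of the previous slide would have used the current IC of 3; a tighter bound leaves more dependences reducible.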
Hardware Cost of Timestamps • Coupled timestamp memory: overhead grows with cache size • Not flexible • 64B line + 64b (24b) timestamp → 12.5% (4.7%) overhead • 192 KB for a 4MB L2 • Need to modify cache • Figure, coupled timestamp memory (Tag, State, Data, Timestamp): A S … 1; B M … 2
Decoupled Timestamp Memory • Decoupling → small timestamp memory (Set/LRU) • e.g., 32-set, 64-way → 99% transitive reduction • Timestamp memory: 24 KB • No need to modify cache • From 192 KB to 24 KB: 8x reduction • Figure: cache (Tag, State, Data): A S …; B M …, beside a separate timestamp memory (Tag, Timestamp): A 1; B 2, compared with the coupled layout (Tag, State, Data, Timestamp): A S … 1; B M … 2
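The storage figures on these two slides can be checked with a few lines of arithmetic. The inputs are the slide's numbers; the bytes-per-entry value for the decoupled memory is derived here from 24 KB and 32 x 64 entries and is not stated in the talk.

```python
LINE_BITS = 64 * 8  # 64-byte cache line

# Coupled: one timestamp per cache line.
overhead_64b = 64 / LINE_BITS  # full 64-bit timestamp
overhead_24b = 24 / LINE_BITS  # partial 24-bit timestamp

l2_lines = (4 * 1024 * 1024) // 64            # lines in a 4 MB L2
coupled_kb = l2_lines * 24 / 8 / 1024         # 24-bit timestamps, in KB

# Decoupled: a 32-set, 64-way timestamp memory of 24 KB.
decoupled_kb = 24
entries = 32 * 64
bytes_per_entry = decoupled_kb * 1024 / entries  # derived: tag plus timestamp

reduction = coupled_kb / decoupled_kb

print(f"{overhead_64b:.1%} / {overhead_24b:.1%} overhead per line")
print(f"coupled: {coupled_kb:.0f} KB, decoupled: {decoupled_kb} KB, {reduction:.0f}x smaller")
```

The 4.7% on the slide is 24/512 = 4.6875% rounded, and the 8x figure is simply 192 KB divided by 24 KB.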
Coherence Piggyback Transitive Reduction & Regulated TR Set/LRU Approximation Order-Value Hybrid SC & TSO Applicability
Recording with Total Store Order (TSO) • Figure: starting from A=B=0, Thread I executes st A,1 then ld B and Thread J executes st B,1 then ld A; SC and TSO panels show the orderings and the values (0 or 1) each load can observe • The majority of existing MPs are non-SC • TSO is well defined, x86-like
TSO Execution • Initially A=B=0 • Thread I: 1 st A,1; 2 ld B • Thread J: 1 st B,1; 2 ld A • Figure: the stores A=1 and B=1 sit in each thread's write buffer (WrBuf) while the memory system still holds A=0 and B=0 • Loads observe A=0 and B=0
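The outcome on this slide can be reproduced with a toy model of per-thread write buffers. This is a sketch under simplifying assumptions (FIFO buffers, store-to-load forwarding, a caller-chosen drain point) and not a model of any specific processor; `run_tso` and its parameter are hypothetical.

```python
def run_tso(drain_before_loads):
    """Run st A,1; ld B on thread I and st B,1; ld A on thread J.

    Stores first go into a per-thread FIFO write buffer. Loads check the
    thread's own buffer, then memory. Returns (value I loads from B,
    value J loads from A).
    """
    memory = {"A": 0, "B": 0}
    wrbuf = {"I": [], "J": []}

    def store(thread, addr, value):
        wrbuf[thread].append((addr, value))

    def load(thread, addr):
        for buffered_addr, value in reversed(wrbuf[thread]):
            if buffered_addr == addr:
                return value  # forwarded from the thread's own buffer
        return memory[addr]

    def drain(thread):
        for addr, value in wrbuf[thread]:
            memory[addr] = value
        wrbuf[thread].clear()

    store("I", "A", 1)
    store("J", "B", 1)
    if drain_before_loads:
        drain("I")
        drain("J")
    i_sees_b = load("I", "B")
    j_sees_a = load("J", "A")
    drain("I")
    drain("J")
    return i_sees_b, j_sees_a


buffered = run_tso(drain_before_loads=False)  # both loads bypass the buffered stores
drained = run_tso(drain_before_loads=True)    # stores reach memory before the loads
```

When both stores are still buffered, each load reads the old value, so both threads see 0. No sequentially consistent interleaving of these four operations produces that result, which is why an order-only log is not enough under TSO.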
Order-Value-Hybrid Recording • Recording (figure): starting from A=B=0, Thread I executes 1 st A,1; 2 ld B and Thread J executes 1 st B,1; 2 ld A, with each store held in a write buffer (WrBuf) in front of the memory system • Annotations: I starts to monitor B; J starts to monitor A; I stops monitoring B; A changed! • Result: WAR omitted, value logged (A=0) • Replay (figure): value used (A=0)
Hybrid Recording with TR and RTR • Hybrid recording • All loads get correct values • Hardware similar to OoO SC [Gharachorloo et al. ’91] • Hybrid + TR & RTR • TR will not use the omitted WAR in reduction • RTR vectorizes dependencies more conservatively
Put-it-together: Determinizer/CMP • Figure components: Core 1 to Core 4, each with a TSM; within a core: L1_I$, L1_D$, IC, L1 Coherence Controller, TSM, Log, TR Reg, RTR Reg; Shared L2 Cache (L1 Dir) with its own TSM
Simulation Method • Commercial server hardware • GEMS: http://www.cs.wisc.edu/gems • Full-system (OS + application) executions • 4-core CMP (Sequentially Consistent) • 1-way in-order issue, 2 GHz • 64KB I/D L1, 4MB L2, 64-byte lines, MOSI directory • Commercial server software • Apache – static web serving • SpecJBB – middleware • OLTP – TPC-C like • Zeus – static web serving
Log Size: 1 byte/kilo-instr • Figure: log size in KB/core/s (axis 0 to 200) and in byte/core/kilo-instr (axis 0.0 to 2.0) for Apache, JBB, OLTP, Zeus, and AVG • Well within the capability of current machines • Long recording (days – months) needs improvement
Runtime Overhead • Figure: Execution Time and Interconnection Msg. B/W (axes 0 to 100) for Apache, JBB, OLTP, and Zeus; bars: Baseline vs. With race recorder • Our recorder can be “always-on”
Benefits of RTR and Set/LRU (Log Size) • Figure, Improvement by RTR: log size for Pairwise-TR vs. Our RTR • Figure, Effectiveness of Set/LRU: log size for Perfect TSM vs. 24KB Set/LRU TSM • Both panels: Apache, JBB, OLTP, Zeus, AVG (axes 0 to 100)
Why Do RTR and Set/LRU Work Well? • RTR • Processors execute instructions at similar speeds • Therefore, we can find “vectorizable” dependencies • Set/LRU • Temporal locality makes the LRU timestamps old • We only need to know if a timestamp is “old-enough”
Sensitivity and Scalability • A design space of the timestamp memory (TSM) • Size: smaller TSM -> larger log • Read/write timestamp: should be used when TSM is large • Partial timestamp: 24 bits are enough • Associativity: higher is better for RTR • Scalability of the recorder • Studied with modest processor counts (2p – 16p) • Commercial workloads, not scientific workloads • Log size increases slowly with number of cores
Conclusions & Future Work • Race recording: key to combating nondeterminism • Contributions: effective & inexpensive recorder • Transitive Reduction & RTR algorithm: small log size • Coherence piggyback: negligible slowdown • Timestamp approximation: low hardware cost • Order-value hybrid: supports SC & TSO • Future work • Operate with Hardware Transactional Memory • Seek to eliminate timestamps on acknowledgements
Toward Recording w/ Snooping Protocols • Key problem is combined/implicit response • Not a problem for AMD Hammer • Figure: Proc I and Proc J caches (Tag, State, Data, Timestamp) with entries A S … 1; B M … 4 and A S … 3; B I … 2 • st A: Get/X with "Pull Shared" response, timestamp + current IC • WAR detected & logged
Timestamp at L2-Directory or Memory? • Figure: Proc I and Proc J caches (Tag, State, Data, Timestamp) with entries A S … 1; B M … 4 and A S … 3; B I … 2; M … 4; C M … 3, plus a Timestamp Memory • st A: messages shown are Get/S, Eviction, and Ack with timestamp • Directory of A: Shared(J) Owner() StickyS(I,J) • Directory eviction: more false conflicts, like snooping