RTR: 1 Byte/Kilo- I nstruction Race Recording

RTR: 1 Byte/Kilo-InstructionRace Recording Rastislav Bodik Mark D. Hill Min Xu

Why Do You Need a Recorder? • % gdb a.out • gdb> run • Program received SIGSEGV. • In get() at hash.c:45 • 45 a = bucket->d; • % gcc sim.c • % a.out • Segmentation fault • % • % gcc para-sim.c • % a.out • Segmentation fault • % • % gdb a.out • gdb> run • Program exited normally. • gdb> • % gcc para-sim.c • % a.out • Segmentation fault • Race recorded in “log” • % • % gdb a.out log • gdb> run • Program received SIGSEGV. • In get() at para-hash.c:67 • 67 a = bucket->d;

Applicability: Programs – data race Systems – non-SC Ideally … Long recording: small log Low runtime overhead Low cost • % gcc para-sim.c • % a.out • Segmentation fault • Race recorded in “log” • % • % gdb a.out log • gdb> run • Program received SIGSEGV. • In get() at para-hash.c:67 • 67 a = bucket->d;

Better and Better Recorders

Hardware Acceleration [ISCA’03] Less Hardware [ASPLOS’06] SC & TSO [ASPLOS’06] A New Recorder 1 Byte/Kilo- Instruction [ASPLOS’06] • This talk covers only RTR • Regulated Transitive Reduction algorithm Result: One more step toward practical

Outline Race Recording RTR Algorithm Compress log during recording  replay more “regularly” Results with Commercial Workloads Conclusion

Technically, what’s race recording?

Log - X = X*5 - - Recording X= 6 Race Recording Thread I Thread J Thread I Thread J X = 1 X++ print(X) - - - X = X*5 - - X = X*5 - - X = 1 X++ print(X) Original Replay X=6 X=10

Terminologies and Assumptions Dependence (black) Conflicts (red) Thread I Thread J Thread I Thread J ld A add ld A add st B st B st C st C st C Log st C ld B ld B ld D ld D st A st A sub sub st C st C ld B ld B st D st D Recording Replay • Goal: Reproduce same conflicts with minimum log data

Regulated Transitive Reduction (RTR)

Dependence Log 1 1 Log J: 23 14 35 46 16 bytes 2 2 3 3 Log I: 23 4 4 5 5 Log Size: 5*16=80 bytes (10 integers) 6 6 Log All Conflicts Thread I Thread J ld A add st B st C st C ld B ld D st A sub st C ld B st D Replay • But too many conflicts

TR Reduced Log Log J: 23 35 46 Log I: 23 Log Size: 64 bytes (8 integers) Netzer’s Transitive Reduction (TR) Thread I Thread J TR reduced 1 ld A add 1 st B st C 2 2 st C ld B 3 3 ld D st A 4 4 sub st C 5 5 ld B st D 6 6 Replay • How to further reduce log size?

From I to J Vectors • “Regulate” Replay From J to I Vectors The Intuition of the RTR Algorithm After Reduction

New Reduced Log Log J: 23 45 Log I: 23 stricter sub st C 5 5 Reduced Log Size: 48 bytes (6 integers) ld B st D 6 6 Stricter Dependences to Aid Vectorization Thread I Thread J 1 ld A add 1 st B st C 2 2 st C ld B 3 3 ld D st A 4 4 Replay • Fewer dependencies to log

Vectorized Log Log J: x=3,5, ∆=1 Log I: x=3, ∆=1 Vector Deps. Log Size: 40 bytes (5 integers) Compress Vectorized Dependencies Thread I Thread J 1 ld A add 1 st B st C 2 2 st C ld B 3 3 ld D st A 4 4 sub st C 5 5 ld B st D 6 6 Replay • TRRTR: fewer deps + fewer byte/dep

Replay Cycle i:4j:1 j:2 i:3 i:4 Deadlock Avoidance of RTR Thread I Thread J 1 ld A add 1 st B st C 2 2 st C ld B 3 3 ld D st A 4 4 sub st C 5 5 ld B st D 6 6 Recording • Limit the strict dependencies (see paper)

Results with Commercial Workloads

C3 C0 L2 C2 C1 Full-system Simulation Method • Commercial server hardware • GEMS: http://www.cs.wisc.edu/gems • Full-system (OS + application) executions • 4-core CMP (Sequential Consistent) • 1-way in-order issue, 2 GHz, • 64KB I/D L1, 4MB L2, 64byte lines, MOSI directory • Commercial server software • Apache – static web serving • SpecJBB – middleware • OLTP – TPC-C like • Zeus – static web serving

KB/core/s byte/core/KI 200 2.0 150 1.5 100 1.0 50 0.5 0 0.0 Apache JBB OLTP Zeus AVG Apache JBB OLTP Zeus AVG Log Size: 1 byte/KI • Less buffer, longer recording, smaller logs

100 80 60 40 20 0 Apache JBB OLTP Zeus AVG RTR vs. Netzer’s TR • 28% smaller log • TR was “optimal” Log Size TR RTR

Why Does RTR WorkWell? • RTR • Instructions execute at similar speed • Dependencies are often “vectorizable”

Hardware Acceleration [ISCA’03] 1 Byte/Kilo- Instruction [ASPLOS’06] A New Recorder • “Less hardware” & “TSO” not covered • Equally important • More details in the paper Less Hardware [ASPLOS’06] SC & TSO [ASPLOS’06] Result: One more step toward practical

Conclusion • Race recording Counter nondeterminism • RTR1 byte/kilo-instruction • Based on Netzer’s transitive reduction • Create stricter dependencies • Vectorize dependencies to compress log • Avoid overly-strict hence no deadlock • Future work • Support snooping, SMT, replayer

RTR: 1 Byte/Kilo- I nstruction Race Recording