
Offline Symbolic Analysis for Multi-Processor Execution Replay Dongyoon Lee†, Mahmoud Said*, Satish Narayanasamy†, Zijiang James Yang*, and Cristiano L. Pereira‡ († University of Michigan, Ann Arbor; * Western Michigan University; ‡ Intel, Inc.)

Overview Goal: Deterministic replay for multi-threaded programs • Debug non-deterministic bugs Sources of non-determinism • Program input (interrupts, I/O, DMA, etc.) • Shared-memory dependencies Past solutions vs. our solution • Program input: past solutions log I/O, signals, DMA, etc.; our solution uses BugNet [ISCA'05] to log loads (cache miss data) • Shared-memory dependency: past solutions monitor memory operations (software is slow, hardware is complex); our solution uses an SMT constraint solver to determine dependencies offline before replay

Deterministic Replay Uses • Reproduce non-deterministic bugs • Dynamic program analysis: memory leaks, data races, dangling pointers • Debugging: step backward in time [Figure: a recorder at a remote site or in-house feeds a replayer at the developer site]

Traditional Record-N-Replay Systems • Checkpoint memory and register state • Log non-deterministic program input: interrupts, I/O values, DMA, etc. • Log shared memory dependencies [Figure: Threads 1, 2, and 3; a write in one thread is followed by reads in the others]

Recording Shared Memory Dependency Problem: need to monitor every memory operation • Software-based replay systems: PinSEL (UCSD/Intel), iDNA (Microsoft); slide annotations: x100, x10 • Hardware-based replay systems: FDR/ReRun (Wisconsin), Strata (UCSD), DeLorean (UIUC); slide annotation: complex hardware

Hardware Complexity Hardware-based solution • Detect shared memory dependencies by monitoring cache coherence messages • Transitive optimization to reduce log size Complexity • Requires changes to the coherence sub-system • Complex to design and verify • 9 design bugs in the coherence mechanism of AMD64 [Narayanasamy et al., ICCD’06] [Figure: accesses W(a), W(b), W(b), R(a) across processors]

New Direction to Hardware-based Solution Complexity-effective solution • Do NOT record shared-memory dependencies at all • Infer dependencies offline before replay using a Satisfiability Modulo Theory (SMT) solver

Our Approach • Checkpoint registers only (instead of memory and registers) • BugNet [ISCA’05] load-based hardware recorder (instead of logging non-deterministic program input: interrupts, I/O values, DMA, etc.) • Satisfiability-Modulo-Theory (SMT) solver reconstructs the interleaving offline (instead of logging shared memory dependencies) [Figure: three threads with a write followed by reads]

Roadmap • Motivation • BugNet for single-threaded programs [ISCA’05] • Recording cache miss data is sufficient • BugNet is sufficient for multi-threaded programs • Insight: BugNet can replay each thread in isolation • Offline SMT Analysis • Evaluation • Conclusion

BugNet [Narayanasamy et al., ISCA’05] Insight • Recording the initial register state and the values of loads is sufficient for deterministic replay • Implicitly captures the program input from I/O, DMA, interrupts, etc. • Input and output of other instructions are reproduced during replay Optimization • Record a load only if it is the first access to a memory location Our modification • Recording data fetched on a cache miss captures first loads • Any first access to a location results in a cache miss • May unnecessarily record data due to store misses, but that is OK

Recording Cache Miss Data (First Loads) • Checkpoint: register values, program counter • Record cache misses as (memory count, data); this implicitly captures first loads • Deterministic replay: input and output (including address) of all instructions are replayed • On a store miss: record the old value (data before the store update); the new value (data after the store update) can be reproduced deterministically [Figure: execution timeline and log file. Operations: Load A = 0, Load B = 5, Load A = 0, Store C = 1. Log entries: (cnt1, 0), (cnt2, 5), (cnt3, 0). Legend: first load, cache miss]
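The first-load logging idea on this slide can be illustrated with a small sketch. It is not the BugNet hardware design: the trace format, the function names `record` and `replay`, and the use of a set as an infinitely large cache are all assumptions made for illustration. A real replayer re-executes instructions to regenerate addresses; here the trace is simply given.

```python
def record(trace, initial_memory):
    """Run a single-threaded trace and log only first accesses (cache misses).

    trace: list of ("load", addr) or ("store", addr, value) tuples
           (hypothetical format, not from the talk).
    Returns (log, final_memory); each log entry is (memory_op_count, old_value).
    """
    memory = dict(initial_memory)
    cached = set()      # addresses already touched; stands in for an infinite cache
    log = []
    for count, op in enumerate(trace, start=1):
        addr = op[1]
        if addr not in cached:          # first access => cache miss => log the data
            log.append((count, memory.get(addr, 0)))
            cached.add(addr)
        if op[0] == "store":
            memory[addr] = op[2]
    return log, memory


def replay(trace, log):
    """Replay the trace from the log alone, without the original memory image."""
    memory = {}
    pending = dict(log)                 # memory_op_count -> logged value
    loaded = []
    for count, op in enumerate(trace, start=1):
        addr = op[1]
        if count in pending:            # restore the value observed at the miss
            memory[addr] = pending[count]
        if op[0] == "load":
            loaded.append(memory[addr])
        else:
            memory[addr] = op[2]        # new value is reproduced, not logged
    return loaded, memory
```

For a trace that loads A, loads B, loads A again, stores to C, and then loads C, only the first access to each location is logged; the store miss logs the old value of C, and the new value is regenerated during replay.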

BugNet Extension Self-modifying code • Consider instruction read as a load; so instructions are logged Full system Replay • Continue logging in kernel mode • See the paper for details on context switches, page faults, etc.

Roadmap • Motivation • BugNet for single-threaded programs [ISCA’05] • Recording cache miss data is sufficient • BugNet is sufficient for multi-threaded programs • Insight: BugNet can replay each thread in isolation • Offline SMT Analysis • Evaluation • Conclusion

BugNet for Multithreaded Programs Insight • The BugNet recorder (initial register state + loads) for each thread is sufficient for replaying that thread • Recording cache miss data is sufficient for multithreaded programs • No additional hardware support is required for recording dependencies Reason • A load that depends on a remote write causes a cache miss to ensure coherence • BugNet implicitly records load values that depend on remote writes Effect • Each thread can be replayed in isolation (independent of other threads) using BugNet logs

Replaying Each Thread Independently • Cache coherence: invalidate the cache block to gain exclusive permission • Log cache miss data: implicitly records loads dependent on remote writes; no change to the coherence mechanism • Replay each thread independently of the others [Figure: Proc 1 and Proc 2 with per-processor logs. Operations shown: Load A=0, Load A=0, Store A=1, Load A=1. Log entries shown: (1st, 0), (1st, 0), (3rd, 1). The store invalidates the remote cache block, so the later load is a cache miss]
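The effect of invalidation on the per-thread logs can be sketched in a few lines. This is a simplified model under stated assumptions: the schedule format, the function names, and the invalidate-on-store cache model are hypothetical, and real coherence protocols have more states than "present" and "absent".

```python
def record_interleaved(schedule, initial_memory, num_procs):
    """Record per-processor cache-miss logs for one global interleaving.

    schedule: list of (proc, kind, addr, value) in global execution order,
              where kind is "load" or "store" (hypothetical format).
    Returns one log per processor; entries are (local_memory_op_count, value).
    """
    memory = dict(initial_memory)
    caches = [set() for _ in range(num_procs)]   # addresses with a valid copy
    counts = [0] * num_procs
    logs = [[] for _ in range(num_procs)]
    for proc, kind, addr, value in schedule:
        counts[proc] += 1
        if addr not in caches[proc]:             # miss: log the fetched data
            logs[proc].append((counts[proc], memory[addr]))
            caches[proc].add(addr)
        if kind == "store":
            memory[addr] = value
            for other in range(num_procs):       # invalidate remote copies
                if other != proc:
                    caches[other].discard(addr)
    return logs


def replay_thread(ops, log):
    """Replay one thread's ops [(kind, addr, value)] using only its own log."""
    memory, pending, loaded = {}, dict(log), []
    for count, (kind, addr, value) in enumerate(ops, start=1):
        if count in pending:                     # value observed at the miss
            memory[addr] = pending[count]
        if kind == "load":
            loaded.append(memory[addr])
        else:
            memory[addr] = value
    return loaded
```

With processor 0 loading A twice, processor 1 storing A=1, and processor 0 loading A again, the third load misses because its block was invalidated, so the remote write's value lands in processor 0's log and that thread can be replayed without consulting processor 1.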

Shared Memory Dependency • SMT solver resolves shared memory dependencies • With billions of instructions, offline analysis would not scale • We need to bound the search space [Figure: Thread 1 and Thread 2 perform loads and stores to A, B, and C, each annotated with an old value and a new value; the order between them is unknown; final state: A, B, C]

Roadmap • Motivation • BugNet • Offline Symbolic Analysis • Encoding Ordering Constraints • Bounding Search Space • Evaluation • Conclusion

Encoding Ordering Constraints
Program Order Constraint (assume sequential consistency)
Proc 1: X1 < X2 AND
Proc 2: X3 < X4 < X5 AND
Load-Store Constraint (M→old == M→prev→new)
X1: X1 < X3 AND
X2: (X3 < X2 < X4 OR X5 < X2) AND
[Figure: Proc 1 executes x1, x2; Proc 2 executes x3, x4, x5; each access is annotated with its old and new value, followed by the final value of x]
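The rule M→old == M→prev→new can be checked directly on a candidate order. The sketch below uses brute-force enumeration in place of an SMT solver, and the concrete old/new values are hypothetical, chosen so that the search reproduces the two alternatives for X2 shown on the slide.

```python
from itertools import permutations


def valid_orders(accesses, program_order, initial, final):
    """Enumerate total orders of accesses to ONE memory location.

    accesses: {name: (old_value, new_value)}; a load has old == new.
    program_order: list of per-thread name lists, in program order.
    An order is valid if it respects program order, every access sees the
    value left by its predecessor, and the last access leaves `final`.
    """
    results = []
    for order in permutations(accesses):
        pos = {name: i for i, name in enumerate(order)}
        # Program order: each thread's accesses keep their relative order.
        if any(pos[a] > pos[b]
               for thread in program_order
               for a, b in zip(thread, thread[1:])):
            continue
        # Load-store semantics: M.old must equal M.prev.new along the chain.
        value, ok = initial, True
        for name in order:
            old, new = accesses[name]
            if old != value:
                ok = False
                break
            value = new
        if ok and value == final:
            results.append(order)
    return results


# Hypothetical values: x1 and x2 are loads; x3, x4, x5 are stores.
ACCESSES = {"x1": (0, 0), "x2": (1, 1), "x3": (0, 1), "x4": (1, 2), "x5": (2, 1)}
PROGRAM_ORDER = [["x1", "x2"], ["x3", "x4", "x5"]]
ORDERS = valid_orders(ACCESSES, PROGRAM_ORDER, initial=0, final=1)
```

With these values, x1 must precede x3, and x2 fits either between x3 and x4 or after x5, so exactly two orders survive.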

Multiple Memory Locations
Program Order Constraints (assume sequential consistency)
Proc 1: Y1 < X1 < X2 < Y2 AND
Proc 2: X3 < X4 < X5 < Y3 AND
Load-Store Constraints (M→old == M→prev→new)
X1: X1 < X3 AND
X2: (X3 < X2 < X4 OR X5 < X2) AND
...
Y1: Y1 < Y2 AND
Y2: Y1 < Y2 < Y3 AND
...
[Figure: Proc 1 executes y1, x1, x2, y2; Proc 2 executes x3, x4, x5, y3; each access is annotated with its old and new value, followed by the final values of x and y]

Satisfiability-Modulo-Theory (SMT) Solver Ordering constraints: (Program Order) ∧ (Load-Store Order for X) ∧ (Load-Store Order for Y) ∧ ... • The SMT solver finds one valid total order from multiple solutions • All solutions could be produced, if needed [Figure: the ordering constraints are fed to the SMT solver, which outputs a total order over x1 to x5 and y1 to y3]
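To make "one valid total order from multiple solutions" concrete, the constraints listed for locations X and Y on the previous slide can be written as a predicate over event positions and searched exhaustively. This is only a brute-force stand-in for an SMT solver such as Yices; the function names are assumptions, and exhaustive search is feasible here only because there are eight events.

```python
from itertools import permutations

EVENTS = ["x1", "x2", "x3", "x4", "x5", "y1", "y2", "y3"]


def satisfies(t):
    """t maps each event name to its position in a candidate total order."""
    program_order = (t["y1"] < t["x1"] < t["x2"] < t["y2"]
                     and t["x3"] < t["x4"] < t["x5"] < t["y3"])
    load_store_x = (t["x1"] < t["x3"]
                    and (t["x3"] < t["x2"] < t["x4"] or t["x5"] < t["x2"]))
    load_store_y = t["y1"] < t["y2"] < t["y3"]
    return program_order and load_store_x and load_store_y


def all_total_orders():
    """Yield every total order over EVENTS that satisfies the conjunction."""
    for order in permutations(EVENTS):
        positions = {event: i for i, event in enumerate(order)}
        if satisfies(positions):
            yield order


def first_total_order():
    """Return one satisfying order, as a solver would, or None if unsatisfiable."""
    return next(all_total_orders(), None)
```

Any order returned by `first_total_order` is sufficient for replay; `all_total_orders` corresponds to asking for every solution.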

Replay Guarantees • The replayed execution has the same final register and memory states • Each thread has exactly the same sequence of instructions, along with their inputs and outputs • Reconstructed shared memory dependencies obey program order and load-store semantics

Roadmap • Motivation • BugNet • Offline Symbolic Analysis • Encoding Ordering Constraints • Bounding Search Space • Evaluation • Conclusion

Bounding Search Space • Record “Strata hints”: each processor periodically records its memory operation count • Strata regions have a global order • The SMT solver analyzes one region at a time, starting from the last region • Final state of a region = initial state of the following region [Figure: Proc 1 and Proc 2 timelines divided every N cycles into Strata Regions 1, 2, and 3; counts cnt 1 to cnt 4 mark the region boundaries, and each region has an initial state and a final state]
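The region-by-region analysis can be sketched as two small helpers. The layout of the hints (a list of cumulative memory-operation counts per processor) and the `solve_region` callback are assumptions for illustration; the callback stands in for the per-region SMT query and is not an API from the talk.

```python
def split_into_regions(ops_per_proc, hints_per_proc):
    """Cut each processor's memory operations at its recorded counts.

    ops_per_proc[p]: memory ops of processor p, in program order.
    hints_per_proc[p]: cumulative memory-operation counts recorded by p at
                       each region boundary (hypothetical hint layout).
    Returns a list of regions; each region maps proc -> ops in that region.
    """
    num_regions = len(hints_per_proc[0]) + 1
    regions = [dict() for _ in range(num_regions)]
    for proc, ops in enumerate(ops_per_proc):
        bounds = [0] + list(hints_per_proc[proc]) + [len(ops)]
        for r in range(num_regions):
            regions[r][proc] = ops[bounds[r]:bounds[r + 1]]
    return regions


def analyze_backwards(regions, final_state, solve_region):
    """Analyze regions from last to first.

    solve_region(region, final_state) -> (order, initial_state) is a
    hypothetical callback. The initial state it returns for one region
    becomes the final state of the preceding region.
    """
    orders = []
    state = final_state
    for region in reversed(regions):
        order, state = solve_region(region, state)
        orders.append(order)
    orders.reverse()                    # report orders in execution order
    return orders, state
```

Because regions are globally ordered, each solver query only has to order the operations inside one region, which is what keeps the search space bounded.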

Strata Hints Cycle-bound • After N cycles, each core records its memory operation count • No communication is required between cores Problem • The size of a Strata region is not based on the number of shared memory dependencies • Can we bound regions by the number of shared memory dependencies? Downgrade-bound • Count coherence downgrade requests • Requires communication between cores, but reduces offline analysis overhead

Filtering Local & Read-only Accesses • Filter local accesses: no shared-memory dependency • Filter read-only accesses: any total order is valid • Effectiveness: < 1% of memory accesses remain to be analyzed [Figure: Thread 1 and Thread 2 perform loads and stores to A, B, and C within a Strata region]
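The two filters can be expressed as a single pass over the accesses of one Strata region. This is a minimal sketch; the tuple format and the function name are assumptions, not taken from the talk.

```python
from collections import defaultdict


def filter_accesses(accesses):
    """Drop accesses that need no ordering analysis within one region.

    accesses: list of (thread, kind, addr), kind is "load" or "store"
              (hypothetical format).
    Keeps only accesses to addresses that are touched by more than one
    thread AND written at least once in the region.
    """
    threads_by_addr = defaultdict(set)
    written = set()
    for thread, kind, addr in accesses:
        threads_by_addr[addr].add(thread)
        if kind == "store":
            written.add(addr)
    return [access for access in accesses
            if len(threads_by_addr[access[2]]) > 1   # not thread-local
            and access[2] in written]                # not read-only
```

In a region where A is touched by one thread only, C is only read, and B is written by one thread and read by another, only the accesses to B are passed on to the solver.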

Roadmap • Motivation • Record & Replay • Offline Symbolic Analysis • Evaluation • Strata Hint Size • Offline Symbolic Analysis Overhead • Conclusion

Evaluation • Simics + cycle-accurate simulator • Simulate multi-processor execution (2, 4, 8, 16 cores) • Fast-forward up to known synchronization points • Trace collected for 500 million instructions • Benchmarks: SPLASH2 (barnes, fmm, ocean); Parsec 2.0 (blackscholes, bodytrack, x264); SPEComp (wupwise, swim); Apache; MySQL • Yices SMT constraint solver [Dutertre and de Moura, CAV’06]

Strata Hints Size vs. Offline Analysis Overhead [Chart: Downgrade-bound (10), Downgrade-bound (25), and Cycle-bound (10,000) configurations; annotations: x100, 10%] • The downgrade-bound scheme is effective • Offline analysis overhead is a one-time cost (not paid for every replay)

Strata hints vs. ReRun log [Chart: ReRun [Hower and Hill, ISCA’08] vs. the proposed system; annotation: x4] • Strata hints are 4x smaller than the ReRun log • Significant reduction in hardware complexity

Recording Performance, etc. • Cache miss data log: 290 MB per second of program execution • Recording performance: 0.35% slowdown in IPC on average • Scalability results can be found in the paper

Conclusion • Deterministic replay for multi-threaded programs is critical • We proposed a complexity-effective solution • Use BugNet: record cache miss data • No need to record shared memory dependencies • Determine shared memory dependencies offline using an SMT constraint solver • Results: < 1% recording overhead; efficient log size (4x smaller than the state-of-the-art scheme, ReRun); one second of an 8-threaded program can be analyzed in less than 1000 seconds; offline analysis is a one-time cost (not paid for every replay)

Thank you
