  1. Offline Symbolic Analysis for Multi-Processor Execution Replay. Dongyoon Lee†, Mahmoud Said*, Satish Narayanasamy†, Zijiang James Yang*, and Cristiano L. Pereira‡. University of Michigan, Ann Arbor†; Western Michigan University*; Intel, Inc.‡

  2. Overview Goal: deterministic replay for multi-threaded programs, to debug non-deterministic bugs. Sources of non-determinism: • Program input (interrupts, I/O, DMA, etc.) • Shared-memory dependencies. For program input, past solutions log I/O, signals, DMA, etc.; our solution uses BugNet [ISCA'05] to log loads (cache-miss data). For shared-memory dependencies, past solutions monitor every memory operation (software is slow; hardware is complex); our solution uses an SMT constraint solver to determine the dependencies offline, before replay.

  3. Deterministic Replay Uses • Reproduce non-deterministic bugs: data races, memory leaks, dangling pointers • Dynamic program analysis • Debugging: step backward in time. The recorder runs at a remote site or in-house; the replayer runs at the developer site.

  4. Traditional Record-N-Replay Systems • Checkpoint memory and register state • Log non-deterministic program input: interrupts, I/O values, DMA, etc. • Log shared-memory dependencies (e.g., a read in one thread that observes a write from another thread)

  5. Recording Shared Memory Dependency Problem: need to monitor every memory operation • Software-based replay systems: PinSEL (UCSD/Intel), iDNA (Microsoft); around 100x slowdown • Hardware-based replay systems: FDR/ReRun (Wisconsin), Strata (UCSD), DeLorean (UIUC); around 10x slowdown, but complex hardware

  6. Hardware Complexity Hardware-based solutions • Detect shared-memory dependencies by monitoring cache-coherence messages (e.g., a remote R(a) that follows a local W(a)) • Use transitive optimization to reduce log size. Complexity • Requires changes to the coherence sub-system • Complex to design and verify • 9 design bugs in the coherence mechanism of AMD64 [Narayanasamy et al. ICCD'06]

  7. New Direction for Hardware-based Solutions Complexity-effective solution • Do NOT record shared-memory dependencies at all • Infer dependencies offline, before replay, using a Satisfiability Modulo Theories (SMT) solver

  8. Our Approach • Checkpoint registers only (BugNet [ISCA'05]), rather than checkpointing memory and registers • A load-based hardware recorder logs non-deterministic program input: interrupts, I/O values, DMA, etc. • Shared-memory dependencies are not logged; a Satisfiability-Modulo-Theories (SMT) solver reconstructs the interleaving offline

  9. Roadmap • Motivation • BugNet for single-threaded programs [ISCA’05] • Recording cache miss data is sufficient • BugNet is sufficient for multi-threaded programs • Insight: BugNet can replay each thread in isolation • Offline SMT Analysis • Evaluation • Conclusion

  10. BugNet [Narayanasamy et al., ISCA'05] Insight • Recording the initial register state and the values of loads is sufficient for deterministic replay • Implicitly captures the program input from I/O, DMA, interrupts, etc. • Inputs and outputs of all other instructions are reproduced during replay. Optimization • Record a load only if it is the first access to a memory location. Our modification • Recording the data fetched on a cache miss captures first loads • Any first access to a location results in a cache miss • May unnecessarily record data due to store misses, but that is OK

  11. Recording Cache Miss Data (First Loads) • At a checkpoint, record register values and the program counter • Record each cache miss in the log file as a (memory count, data) pair; this implicitly captures first loads. Example: Load A = 0 logs (cnt1, 0); Load B = 5 logs (cnt2, 5); a later Load A = 0 that misses again logs (cnt3, 0) • On a store miss (e.g., Store C = 1), record the old value (the data before the store update); the new value (the data after the store) can be reproduced deterministically • During deterministic replay, the inputs and outputs (including addresses) of all instructions are replayed
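The recording rule above can be illustrated with a toy single-thread sketch (purely illustrative, not the paper's hardware): the cache is modeled as an unbounded set of addresses, and a (memory-op count, old data) pair is logged on every miss, so first loads are captured implicitly and store misses log the pre-store value.

```python
# Toy sketch of BugNet-style cache-miss logging (illustrative names).
def record(trace):
    """trace: list of ('load'|'store', addr, value) in execution order.
    Returns the log as (memory_op_count, data_before_update) pairs."""
    cache = {}    # addr -> value currently held in the cache
    memory = {}   # backing memory, assumed zero-initialized
    log = []
    for count, (op, addr, value) in enumerate(trace, start=1):
        if addr not in cache:             # cache miss: first access
            old = memory.get(addr, 0)
            log.append((count, old))      # record data before the update
            cache[addr] = old
        if op == 'store':
            cache[addr] = value           # the new value is reproduced
            memory[addr] = value          # deterministically on replay
    return log

trace = [('load', 'A', 0), ('load', 'B', 0),
         ('load', 'A', 0), ('store', 'C', 1)]
print(record(trace))   # the second Load A hits and is not logged
```

Note how the store miss logs the old value of C, matching the slide: the stored value itself never needs to be recorded.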

  12. BugNet Extensions Self-modifying code • Treat an instruction fetch as a load, so instructions are logged as well. Full-system replay • Continue logging in kernel mode • See the paper for details on context switches, page faults, etc.

  13. Roadmap • Motivation • BugNet for single-threaded programs [ISCA’05] • Recording cache miss data is sufficient • BugNet is sufficient for multi-threaded programs • Insight: BugNet can replay each thread in isolation • Offline SMT Analysis • Evaluation • Conclusion

  14. BugNet for Multithreaded Programs Insight • The BugNet recorder (initial register state + loads) for each thread is sufficient for replaying that thread • Recording cache-miss data is sufficient for multithreaded programs • No additional hardware support is required for recording dependencies. Reason • A load dependent on a remote write causes a cache miss, to ensure coherence • BugNet therefore implicitly records load values that depend on remote writes. Effect • Each thread can be replayed in isolation (independent of other threads) using the BugNet logs

  15. Replaying Each Thread Independently Example: Proc 1 executes Load A = 0 and logs (1st, 0). Proc 2 then executes Store A = 1; its store miss logs the old value (1st, 0), and cache coherence invalidates Proc 1's cache block to gain exclusive permission. Proc 1's next Load A therefore misses and logs (3rd, 1). • Logging cache-miss data implicitly records loads that depend on remote writes • No change to the coherence mechanism • Each thread can be replayed independently of the others, from its own log
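Isolated replay of one thread from such a log can be sketched as follows (an illustrative toy, not the paper's replayer): whenever a memory operation's count appears in the log, the logged data is installed into the replayed cache before the operation executes, so loads that missed due to remote writes observe the correct value without consulting other threads.

```python
# Toy sketch: replay one thread from its own (count -> data) miss log.
def replay_thread(trace, log):
    """trace: list of ('load'|'store', addr, store_value_or_None)
    for one thread, in program order.
    log: dict mapping memory_op_count -> data recorded at that miss.
    Returns the value each load observed."""
    cache = {}
    observed = []
    for count, (op, addr, val) in enumerate(trace, start=1):
        if count in log:          # this op missed: install logged data
            cache[addr] = log[count]
        if op == 'load':
            observed.append(cache[addr])
        else:                     # store: update the replayed cache
            cache[addr] = val
    return observed

# Mirrors the slide's Proc 1: three loads of A; the 1st and 3rd missed.
trace = [('load', 'A', None), ('load', 'A', None), ('load', 'A', None)]
print(replay_thread(trace, {1: 0, 3: 1}))
```

The second load hits in the replayed cache and reuses the value 0; the third picks up the remotely written 1 from the log, exactly as in the slide's invalidation example.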

  16. Shared Memory Dependency The SMT solver resolves shared-memory dependencies: given each thread's loads and stores to locations A, B, and C (each with an old value and a new value) and the final state of A, B, and C, it reconstructs a valid interleaving. Problem: executions contain billions of instructions, so offline analysis over the whole trace would not scale. We need to bound the search space.

  17. Roadmap • Motivation • BugNet • Offline Symbolic Analysis • Encoding Ordering Constraints • Bounding Search Space • Evaluation • Conclusion

  18. Encoding Ordering Constraints (example with one location x; Proc 1 executes x1, x2 and Proc 2 executes x3, x4, x5; each operation has an old value and a new value). Program-order constraints (assuming sequential consistency): Proc 1: X1 < X2, AND Proc 2: X3 < X4 < X5. Load-store constraints (each operation's old value must equal the previous operation on that location's new value, M.old == M.prev.new): for X1: X1 < X3, AND for X2: (X3 < X2 < X4) OR (X5 < X2)

  19. Multiple Memory Locations (Proc 1 executes y1, x1, x2, y2; Proc 2 executes x3, x4, x5, y3). Program-order constraints (assuming sequential consistency): Proc 1: Y1 < X1 < X2 < Y2, AND Proc 2: X3 < X4 < X5 < Y3. Load-store constraints (M.old == M.prev.new): for X1: X1 < X3, AND for X2: (X3 < X2 < X4) OR (X5 < X2), AND for Y1: Y1 < Y2, AND for Y2: Y1 < Y2 < Y3, AND so on, with a constraint tying each location's last operation to its final state.

  20. Satisfiability-Modulo-Theories (SMT) Solver The ordering constraints, (Program Order) ∧ (Load-Store Order for X) ∧ (Load-Store Order for Y) ∧ ..., are handed to the SMT solver, which outputs a total order. • The SMT solver finds one valid total order out of possibly many solutions • All solutions could be produced, if needed
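What the solver computes can be demonstrated on a tiny scale with brute-force enumeration instead of SMT (the paper uses Yices; this sketch just checks the same two constraint families against every interleaving, which only works for a handful of operations):

```python
from itertools import permutations

# Each op is (addr, old_value, new_value); loads have old == new.
def find_total_order(threads, init):
    """threads: list of per-thread op lists; init: addr -> initial value.
    Returns one valid interleaving as (thread, index) pairs, or None."""
    ops = [(t, i) for t, th in enumerate(threads) for i in range(len(th))]
    for order in permutations(ops):
        next_idx = [0] * len(threads)   # program-order tracking
        mem = dict(init)
        ok = True
        for t, i in order:
            if i != next_idx[t]:        # violates program order
                ok = False
                break
            next_idx[t] += 1
            addr, old, new = threads[t][i]
            if mem[addr] != old:        # violates load-store constraint
                ok = False
                break
            mem[addr] = new
        if ok:
            return list(order)
    return None

# Thread 0 loads x twice (seeing 0, then 1); Thread 1 stores x: 0 -> 1.
threads = [[('x', 0, 0), ('x', 1, 1)], [('x', 0, 1)]]
print(find_total_order(threads, {'x': 0}))
```

The only valid interleaving places the store between the two loads, which is exactly the kind of ordering fact the SMT solver reconstructs without any recorded dependency.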

  21. Replay Guarantees • The replayed execution has the same final register and memory states • Each thread executes exactly the same sequence of instructions, with the same inputs and outputs • The reconstructed shared-memory dependencies obey program order and load-store semantics

  22. Roadmap • Motivation • BugNet • Offline Symbolic Analysis • Encoding Ordering Constraints • Bounding Search Space • Evaluation • Conclusion

  23. Bounding Search Space • Record "Strata hints": every N cycles, each processor records its memory operation count (e.g., cnt1 and cnt2 at one region boundary, cnt3 and cnt4 at the next) • Strata regions have a global order • The SMT solver analyzes one region at a time, starting from the last region • The final state of a region equals the initial state of the following region

  24. Strata Hints Cycle-bound • After N cycles, each core records its memory operation count • No communication is required between cores. Problem • The size of a Strata region is not tied to the number of shared-memory dependencies • Can we bound regions by the number of shared-memory dependencies instead? Downgrade-bound • Count coherence downgrade requests • Requires communication between cores, but reduces offline analysis overhead

  25. Filtering Local & Read-only Accesses Within each Strata region, filter out: • Local accesses (touched by only one thread): no shared-memory dependency • Read-only accesses: any total order is valid. In the slide's example, the loads of C issued by both threads are read-only and can be dropped, while the accesses to A and B each stay within a single thread. Effectiveness: < 1% of memory accesses remain to be analyzed
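The filtering step above is simple enough to sketch directly (an illustrative toy; names are assumptions): an access survives only if its address is touched by more than one thread and written by at least one of them.

```python
# Toy sketch of the local / read-only access filter for one region.
def filter_accesses(region):
    """region: list of (thread_id, 'load'|'store', addr) tuples.
    Returns only the accesses that still need offline analysis."""
    touchers = {}    # addr -> set of threads touching it
    written = set()  # addresses written by some thread
    for tid, op, addr in region:
        touchers.setdefault(addr, set()).add(tid)
        if op == 'store':
            written.add(addr)
    # keep only shared AND written addresses
    return [acc for acc in region
            if len(touchers[acc[2]]) > 1 and acc[2] in written]

region = [(1, 'store', 'A'), (2, 'load', 'A'),   # shared, written: kept
          (1, 'load', 'C'), (2, 'load', 'C'),    # shared, read-only
          (2, 'store', 'B')]                     # local to thread 2
print(filter_accesses(region))
```

Only the accesses to A remain, which is the intuition behind the slide's claim that under 1% of accesses survive the filter.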

  26. Roadmap • Motivation • Record & Replay • Offline Symbolic Analysis • Evaluation • Strata Hint Size • Offline Symbolic Analysis Overhead • Conclusion

  27. Evaluation • Simics + cycle-accurate simulator • Simulated multi-processor execution (2, 4, 8, 16 cores) • Fast-forward up to known synchronization points • Traces collected for 500 million instructions • Benchmarks • SPLASH-2: barnes, fmm, ocean • PARSEC 2.0: blackscholes, bodytrack, x264 • SPEComp: wupwise, swim • Apache • MySQL • Yices SMT constraint solver [Dutertre and de Moura, CAV'06]

  28. Strata Hints Size vs. Offline Analysis Overhead (Chart compares downgrade-bound with thresholds 10 and 25 against cycle-bound with threshold 10,000.) • The downgrade-bound scheme is effective • Offline analysis overhead is a one-time cost (not paid for every replay)

  29. Strata Hints vs. ReRun Log (Chart compares the proposed system against ReRun [Hower and Hill, ISCA'08].) • Strata hints are 4x smaller than the ReRun log • Significant reduction in hardware complexity

  30. Recording Performance, etc. • Cache-miss data log: 290 MB per second of program execution • Recording performance: on average, 0.35% slowdown in IPC • Scalability results can be found in the paper

  31. Conclusion • Deterministic replay for multi-threaded programs is critical • We proposed a complexity-effective solution • Use BugNet: record cache-miss data • No need to record shared-memory dependencies • Determine shared-memory dependencies offline, using an SMT constraint solver • Results • < 1% recording overhead • Efficient log size (4x smaller than the state-of-the-art scheme, ReRun) • Can analyze one second of an 8-threaded program's execution in under 1000 seconds • One-time offline analysis cost (not paid for every replay)

  32. Thank you
