
Offline Symbolic Analysis to Infer Total Store Order

Dongyoon Lee†, Mahmoud Said*, Satish Narayanasamy†, and Zijiang James Yang*. †University of Michigan, Ann Arbor; *Western Michigan University.


Presentation Transcript


  1. Offline Symbolic Analysis to Infer Total Store Order. Dongyoon Lee†, Mahmoud Said*, Satish Narayanasamy†, and Zijiang James Yang*. †University of Michigan, Ann Arbor; *Western Michigan University

  2. Deterministic Replay What is deterministic replay? • Record and reproduce non-deterministic events • Program input (interrupt, I/O, DMA, etc.) • Shared-memory dependencies Deterministic replay uses 1) Debugging • Reproducing concurrency bugs • Time-travel debugging 2) Heavyweight dynamic analysis 3) Forensics

  3. Traditional Deterministic Replay Systems • Checkpoint memory and register state • Log non-deterministic program input (interrupts, I/O values, DMA, etc.) • Log shared-memory dependencies [Figure: Thread 1, Thread 2, and Thread 3 timelines with Write, Read, Read accesses]

  4. Recording Shared-Memory Dependencies Problem • Need to monitor every memory operation Software-based replay systems • PinSEL (Intel), iDNA (Microsoft) • ODR (Berkeley), PRES (UCSD) • Respec/DoublePlay (Michigan) Hardware-based replay systems • RTR/ReRun (Wisconsin) • Strata (UCSD) • DeLorean (UIUC) • LReplay (CAS) • Timetraveler (Purdue) Callouts: → 10-100x → Replay is not guaranteed → 2x throughput overhead: Total Store Order → Complex because of the precise recorder → Only a few systems support a relaxed consistency model

  5. Hardware Complexity in Previous Solutions Support for precisely logging shared-memory dependencies • Monitor and log cache coherence messages Support for Total Store Order (TSO) model • Detect sequential consistency (SC) violation • Log SC-violating loads Complexity • Require invasive changes to coherence sub-system • Complex to design and verify • 9 design bugs in coherence mechanism of AMD64 [Narayanasamy et al. ICCD’06] RTR [Xu et al. ASPLOS’06]

  6. Our Approach Complexity-effective solution [MICRO’09] • Do NOT record shared-memory dependencies at all • Infer shared-memory dependencies offline • Using a Satisfiability Modulo Theory (SMT) solver Contribution • Find a causal order compliant with Total Store Order • Bound search space under TSO

  7. System Overview • Checkpoint Memory and Registers • Checkpoint Registers • BugNet [ISCA’05] Load-based Hardware Recorder • Log non-deterministic program input (interrupts, I/O values, DMA, etc.) • Satisfiability-Modulo-Theory (SMT) solver reconstructs thread interleaving offline • Log shared memory dependency [Figure: Thread 1, Thread 2, and Thread 3 timelines with Write, Read, Read accesses]

  8. Roadmap • Motivation • Background: Load-based logging architecture [ISCA’05] • Offline Symbolic Analysis • Bounding Search Space • Evaluation • Conclusion

  9. Load-based Logging Architecture [Narayanasamy et al, ISCA’05] [Lee et al. MICRO’09] Insight • Recording initial register state and values of loads is sufficient for deterministic replay • Implicitly captures the program input from I/O, DMA, interrupts, etc. • Input and output of other instructions are reproduced during replay Optimization • Record a load only if it is the first access to a memory location Processor support • Recording data fetched on a cache miss captures first loads • Any first access to a location would result in a cache miss • May unnecessarily record data due to store misses, but that is OK

  10. Recording Cache Miss Data (First Loads) Checkpoint • Register values • Program counter Record cache misses • <Memory count, Data> • Implicitly captures first loads Deterministic replay • Input and output (including address) of all instructions are replayed On a store miss • Record old value (data before the store update) • New value (data after the store update) can be reproduced deterministically [Figure: execution timeline and log file starting at a checkpoint, with first loads and cache misses marked: Load A = 0 <cnt1, 0>, Load B = 5 <cnt2, 5>, Load A = 0 <cnt3, 0>, Store C = 1]
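
As a rough illustration of the record/replay idea on this slide, the Python sketch below logs a `(memory count, data)` pair on the first access to each location and then replays the thread from the initial registers and the log alone. The three-operation mini instruction set, the names `record` and `replay`, and the simplification that exactly the first access to a location misses in the cache are assumptions made for the example, not details from the slides.

```python
def record(program, memory, regs):
    """Run one thread on real memory; log (memory_count, old_data) on each first access."""
    memory, regs = dict(memory), dict(regs)
    touched = set()      # locations already brought into the modelled cache
    log = []
    mem_count = 0        # counts every load and store, hit or miss
    for op, a, b in program:
        if op == "addi":             # register-only instruction: addi reg, imm
            regs[a] += b
            continue
        mem_count += 1
        if a not in touched:         # first access, modelled as a cache miss
            log.append((mem_count, memory.get(a, 0)))
            touched.add(a)
        if op == "load":             # load addr -> reg
            regs[b] = memory.get(a, 0)
        else:                        # store reg -> addr
            memory[a] = regs[b]
    return log, regs


def replay(program, log, regs):
    """Re-execute the thread using only the initial registers and the log."""
    regs = dict(regs)
    view = {}                        # thread-private memory image rebuilt from the log
    logged = dict(log)
    mem_count = 0
    for op, a, b in program:
        if op == "addi":
            regs[a] += b
            continue
        mem_count += 1
        if mem_count in logged:      # this access missed during recording
            view[a] = logged[mem_count]
        if op == "load":
            regs[b] = view[a]
        else:
            view[a] = regs[b]
    return regs


# Hypothetical program: two first loads, one repeated load, one store.
PROGRAM = [("load", "A", "r1"), ("load", "B", "r2"), ("load", "A", "r3"),
           ("addi", "r1", 1), ("store", "C", "r1")]
INITIAL_REGS = {"r1": 0, "r2": 0, "r3": 0}
LOG, RECORDED_REGS = record(PROGRAM, {"A": 0, "B": 5}, INITIAL_REGS)
REPLAYED_REGS = replay(PROGRAM, LOG, INITIAL_REGS)
```

In this sketch the repeated load of A is not logged, and the store to C logs only the old value; the stored value is recomputed during replay.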

  11. Cache Miss Logging for Multithreaded Programs Insight • Load-based recorder (initial register state + loads) for each thread is sufficient for replaying that thread => Recording cache miss data is sufficient for multithreaded programs => No additional hardware support required for recording dependencies Reason • Load dependent on a remote write will cause a cache miss to ensure coherence => Implicitly records load values dependent on remote writes Effect • Can replay each thread in isolation (independent of other threads) regardless of underlying consistency model

  12. Replaying Each Thread Independently Cache coherence • Invalidates cache block to gain exclusive permission Log cache miss data • Implicitly records loads dependent on remote writes • No change to coherence mechanism Replay each thread independent of others [Figure: Proc 1 and Proc 2 with per-processor logs. One processor logs Load A=0 as (1st, 0) and then repeats Load A=0; the other logs Store A=1 as (1st, 0), and the resulting invalidation makes the next Load A=1 a cache miss, logged as (3rd, 1)]
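
The effect of invalidation on the per-processor logs can be sketched with a toy coherence model. This is a minimal sketch under stated assumptions: the function name `record_threads`, the schedule format, and the rule that a store simply discards the block from every other cache are all hypothetical simplifications (a real protocol also has upgrade requests, evictions, and multi-word blocks).

```python
def record_threads(schedule, initial):
    """schedule: list of (proc, op, addr, value) in global execution order."""
    memory = dict(initial)
    procs = sorted({step[0] for step in schedule})
    valid = {p: set() for p in procs}    # blocks currently valid in each private cache
    count = {p: 0 for p in procs}        # per-processor memory count
    logs = {p: [] for p in procs}
    for proc, op, addr, value in schedule:
        count[proc] += 1
        if addr not in valid[proc]:      # cache miss: log the data that is fetched
            logs[proc].append((count[proc], memory[addr]))
            valid[proc].add(addr)
        if op == "store":
            memory[addr] = value
            for other in procs:          # writer gains exclusive permission
                if other != proc:
                    valid[other].discard(addr)
    return logs


# Hypothetical schedule: P2 reads A twice, P1 overwrites A, P2 reads A again.
SCHEDULE = [
    ("P2", "load", "A", None),
    ("P2", "load", "A", None),
    ("P1", "store", "A", 1),
    ("P2", "load", "A", None),
]
LOGS = record_threads(SCHEDULE, {"A": 0})
```

P2's third load misses only because P1's store invalidated the block, so the remote write's value lands in P2's log without any dependency being recorded explicitly.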

  13. Shared Memory Dependency • Billions of instructions • Offline analysis may not scale • SMT solver for finding shared-memory dependencies • Strata for bounding search space [Figure: loads and stores to addresses A, B, and C in Thread 1 and Thread 2, each annotated with address, old value, and new value; the order between the threads' accesses is unknown (?); Final State: A, B, C]

  14. Roadmap • Motivation • Load-based Logging Architecture • Offline Symbolic Analysis • Bounding Search Space • Evaluation • Conclusion

  15. Offline Symbolic Analysis Step 1) Encode Ordering Constraints • Coherence Constraints • Memory Model Constraints (SC, TSO) Coherence Constraints: (X1 < X2) AND (X2 < X4 OR X5 < X2) AND … Memory Model Constraints: (Y1 < X1 < X2 < Y2) AND (X3 < X4 < X5 < Y3) AND … Step 2) Find a valid causal order • Satisfiability Modulo Theory (SMT) Solver • Yices [Dutertre and Moura CAV’06] [Figure: per-processor access sequences over x and y, and the final state]

  16. Offline Symbolic Analysis Step 1) Encode Ordering Constraints • Coherence Constraints • Memory Model Constraints (SC, TSO) Coherence Constraints: (X1 < X2) AND (X2 < X4 OR X5 < X2) AND … Memory Model Constraints: (Y1 < X1 < X2 < Y2) AND (X3 < X4 < X5 < Y3) AND … Step 2) Find a valid causal order • Satisfiability Modulo Theory (SMT) Solver • Yices [Dutertre and Moura CAV’06] [Figure: per-processor access sequences over x and y, and the final state]

  17. Encoding Coherence Constraints Coherence Constraints (M→old == M→prev→new) X1: X1 < X3 AND X2: (X3 < X2 < X4 OR X5 < X2) AND … [Figure: Proc 1 issues x1, x2; Proc 2 issues x3, x4, x5; each access is annotated with address, old value, and new value; final value of x]

  18. Multiple Memory Locations Coherence Constraints (M→old == M→prev→new) X1: X1 < X3 AND X2: (X3 < X2 < X4 OR X5 < X2) AND … Y1: Y1 < Y2 AND Y2: Y2 < Y3 AND … [Figure: Proc 1 and Proc 2 accesses x1-x5 and y1-y3, each annotated with address, old value, and new value; final values of x and y]
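
The coherence rule M→old == M→prev→new can be checked directly on a candidate order. The sketch below is an illustration under stated assumptions: it replaces the SMT solver with a brute-force search over permutations, and the `Access` record, the function names, and the three example accesses (with their values) are invented for the example and do not correspond to the slide's figure.

```python
from itertools import permutations
from typing import NamedTuple


class Access(NamedTuple):
    name: str
    addr: str
    old: int   # value at addr just before the access
    new: int   # value at addr just after the access (old == new for a load)


def is_coherent(order, initial, final):
    """Check M.old == M.prev.new for every access M, per memory location."""
    current = dict(initial)
    for acc in order:
        if acc.old != current[acc.addr]:
            return False
        current[acc.addr] = acc.new
    return current == final


def coherent_orders(accesses, initial, final):
    """Brute-force stand-in for the solver: keep every order that is coherent."""
    return [tuple(a.name for a in order)
            for order in permutations(accesses)
            if is_coherent(order, initial, final)]


# Hypothetical accesses to a single location x.
ACCESSES = [
    Access("x1", "x", 0, 1),   # store: 0 -> 1
    Access("x2", "x", 1, 1),   # load observing 1
    Access("x3", "x", 1, 2),   # store: 1 -> 2
]
ORDERS = coherent_orders(ACCESSES, initial={"x": 0}, final={"x": 2})
```

Only one order survives here because the load must fall between the two stores. Exhaustive enumeration grows factorially with the number of accesses, which is why a constraint solver and a bounded search space matter in practice.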

  19. Offline Symbolic Analysis Step 1) Encode Ordering Constraints • Coherence Constraints • Memory Model Constraints (SC, TSO) Coherence Constraints: (X1 < X2) AND (X2 < X4 OR X5 < X2) AND … Memory Model Constraints: (Y1 < X1 < X2 < Y2) AND (X3 < X4 < X5 < Y3) AND … Step 2) Find a valid causal order • Satisfiability Modulo Theory (SMT) Solver • Yices [Dutertre and Moura CAV’06] [Figure: per-processor access sequences over x and y, and the final state]

  20. Encoding Sequential Consistency Constraints Coherence Constraints (M→old == M→prev→new) X1: X1 < X3 AND X2: (X3 < X2 < X4 OR X5 < X2) AND … Y1: Y1 < Y2 AND Y2: Y1 < Y2 < Y3 AND … Memory Model Constraints (under Sequential Consistency) P1: Y1 < X1 < X2 < Y2 AND P2: X3 < X4 < X5 < Y3 AND … [Figure: Proc 1 issues y1, x1, x2, y2; Proc 2 issues x3, x4, x5, y3; final values of x and y]
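
Under sequential consistency the memory model constraint is simply each processor's program order, added on top of the coherence constraint. A minimal sketch of the combined check follows, again using brute-force enumeration as a stand-in for the SMT solver; the record layout, function names, and the four example accesses are hypothetical and are not the accesses drawn on the slide.

```python
from itertools import permutations
from typing import NamedTuple


class Access(NamedTuple):
    name: str
    proc: str
    addr: str
    old: int   # value before the access
    new: int   # value after the access


def respects_program_order(order, program_order):
    """Every processor's accesses must appear in their original order."""
    position = {acc.name: i for i, acc in enumerate(order)}
    return all(position[a] < position[b]
               for thread in program_order.values()
               for a, b in zip(thread, thread[1:]))


def respects_coherence(order, initial, final):
    """Each access must observe the value left by its predecessor on that location."""
    current = dict(initial)
    for acc in order:
        if acc.old != current[acc.addr]:
            return False
        current[acc.addr] = acc.new
    return current == final


def sc_orders(accesses, initial, final):
    program_order = {}
    for acc in accesses:                       # accesses are listed in program order
        program_order.setdefault(acc.proc, []).append(acc.name)
    return [tuple(a.name for a in order)
            for order in permutations(accesses)
            if respects_program_order(order, program_order)
            and respects_coherence(order, initial, final)]


# Hypothetical two-processor trace.
ACCESSES = [
    Access("y1", "P1", "y", 0, 1),   # P1: store y = 1
    Access("x1", "P1", "x", 0, 0),   # P1: load x, sees 0
    Access("x2", "P2", "x", 0, 1),   # P2: store x = 1
    Access("y2", "P2", "y", 1, 1),   # P2: load y, sees 1
]
SC_ORDERS = sc_orders(ACCESSES, {"x": 0, "y": 0}, {"x": 1, "y": 1})
```

Here the coherence constraints force x1 before x2 and y1 before y2, and program order chains them into the single schedule y1 < x1 < x2 < y2.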

  21. Total Store Order Relaxations Stores wait in the store buffer, and loads may hit in it. Three two-processor examples (P1, P2) contrast SC with TSO: • Store→Load (initially Alice = Bob = 0): Alice = 1, Bob = 1, r1 = Bob, r2 = Alice; outcome under TSO: r1 = r2 = 0 • Load→Load (initially Alice = Bob = Charlie = 0): Alice = 1, Bob = 1, r3 = Charlie, Charlie = 1, r1 = Bob, r2 = Alice; outcome under TSO: r1 = r2 = 0, r3 = 1 • LoadSB-Hit→Load (initially Alice = Bob = 0): Alice = 1, Bob = 1, r3 = Alice, r4 = Bob, r1 = Bob, r2 = Alice; outcome under TSO: r1 = r2 = 0, r3 = r4 = 1 • Log memory counts of LoadSB-Hit
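
The Store→Load relaxation can be reproduced with a small operational model. The sketch below is an assumption-laden illustration, not the simulator used in the talk: `outcomes` enumerates every interleaving of a two-processor program, and with `tso=True` it gives each processor a FIFO store buffer that drains to memory at arbitrary points and forwards buffered values to that processor's own loads.

```python
def outcomes(programs, initial, tso):
    """Enumerate final register values over all interleavings.

    programs: one list per processor of ("store", addr, value) or ("load", addr, reg).
    With tso=True each processor has a FIFO store buffer; with tso=False
    stores update memory immediately (sequential consistency).
    """
    results = set()

    def explore(pcs, buffers, memory, regs):
        progressed = False
        for p, prog in enumerate(programs):
            if buffers[p]:                       # drain the oldest buffered store
                progressed = True
                (addr, value), rest = buffers[p][0], buffers[p][1:]
                explore(pcs, buffers[:p] + (rest,) + buffers[p + 1:],
                        {**memory, addr: value}, regs)
            if pcs[p] < len(prog):               # execute the next instruction
                progressed = True
                op, addr, arg = prog[pcs[p]]
                next_pcs = pcs[:p] + (pcs[p] + 1,) + pcs[p + 1:]
                if op == "store" and tso:
                    queued = buffers[p] + ((addr, arg),)
                    explore(next_pcs, buffers[:p] + (queued,) + buffers[p + 1:],
                            memory, regs)
                elif op == "store":
                    explore(next_pcs, buffers, {**memory, addr: arg}, regs)
                else:
                    # A load first checks its own store buffer (a store-buffer hit).
                    hits = [v for a, v in buffers[p] if a == addr]
                    value = hits[-1] if hits else memory[addr]
                    explore(next_pcs, buffers, memory, {**regs, arg: value})
        if not progressed:                       # all done and all buffers empty
            results.add(tuple(sorted(regs.items())))

    empty = tuple(() for _ in programs)
    explore(tuple(0 for _ in programs), empty, dict(initial), {})
    return results


LITMUS = [
    [("store", "Alice", 1), ("load", "Bob", "r1")],     # P1
    [("store", "Bob", 1), ("load", "Alice", "r2")],     # P2
]
SC = outcomes(LITMUS, {"Alice": 0, "Bob": 0}, tso=False)
TSO = outcomes(LITMUS, {"Alice": 0, "Bob": 0}, tso=True)
```

The only outcome in `TSO` that is missing from `SC` is r1 = r2 = 0, where both stores are still buffered when the loads read memory.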

  22. Encoding Total Store Order Constraints The SMT solver takes the encoded TSO constraints and derives a TSO schedule. Constraints shown: P1: Y1 < X1 < X2 < Y2 AND X1 < Y2; P2: X3 < X4 < X5 < Y3 AND X3 < Y3 [Figure: stores (held in the store buffer), LoadSB-Hit loads, and ordinary loads on x and y for two processors, shown before and after solving, with the final values of x and y]

  23. Roadmap • Motivation • BugNet • Offline Symbolic Analysis • Bounding Search Space • Evaluation • Conclusion

  24. Bounding Search Space using Strata Strata • After every N broadcast coherence messages, each processor records “Strata Hints” • Strata regions are ordered SMT Solver • Analyzes one Strata region at a time • Starts from the last Strata region • Initial state of a Strata region = Final state of preceding region [Figure: Proc 1 and Proc 2 timelines split into Strata Regions 1-3 by strata (hints 1-4) recorded every N coherence messages; each region has an initial state and a final state]
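
A back-of-the-envelope count shows why splitting the execution into regions helps. The sketch below counts the program-order-preserving interleavings of per-thread access sequences (the SC case; TSO admits more orders) for one unbounded region and for the same accesses split into smaller regions. The function names and the access counts are assumptions chosen for illustration only.

```python
from math import factorial


def interleavings(counts):
    """Number of ways to merge per-thread sequences while keeping each thread's order."""
    total = factorial(sum(counts))
    for n in counts:
        total //= factorial(n)   # exact: multinomial coefficients are integers
    return total


def bounded_candidates(regions):
    """Interleaving count of each region when regions are analyzed one at a time."""
    return [interleavings(counts) for counts in regions]


UNBOUNDED = interleavings([6, 6])                            # two threads, 6 accesses each
PER_REGION = bounded_candidates([[2, 2], [2, 2], [2, 2]])    # same accesses in 3 regions
```

With these made-up numbers a single region has 924 candidate interleavings, while each of the three smaller regions has only 6.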

  25. Strata Hint = Mem. Count + I-Store Count Strata hint • Every processor records • memory count • number of in-flight stores in the store buffer (called I-Store) • Reconstruct Stratum offline based on the memory count • Move the pending stores and dependent loads to the next Strata region [Figure: recording Strata hints vs. reconstructing Strata offline, for Alice = Bob = 0 with P1: Alice = 1, r1 = Bob and P2: Bob = 1, r2 = Alice, giving r1 = r2 = 0. At the hint, each store remains in the store buffer and each load has executed but not yet committed; MemCnt and I-Store Cnt are shown per processor, and the logged hint is {1,1}]

  26. Filtering Local, Read-only & Cache-hit Accesses Local accesses • No shared-memory dependencies Read-only accesses • Any order is valid Intermediate cache-hit accesses • No interleaved accesses on successive cache hits Effectiveness • < 1% of memory accesses remain [Figure: Thread 1 and Thread 2 loads and stores to A, B, C, and D within a Strata region; accesses to D show one miss followed by successive hits]
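
The first two filters can be sketched as a pass over the access trace. This is a simplified illustration: the function name `filter_accesses` and the trace are hypothetical, the classification is done over the whole trace rather than per Strata region, and the third filter (intermediate cache hits) is left out.

```python
from collections import defaultdict


def filter_accesses(accesses):
    """Drop accesses to thread-local and read-only locations.

    accesses: list of (thread, op, addr) with op in {"load", "store"}.
    """
    threads = defaultdict(set)   # addr -> threads that touch it
    written = set()              # addrs with at least one store
    for thread, op, addr in accesses:
        threads[addr].add(thread)
        if op == "store":
            written.add(addr)
    return [(thread, op, addr) for thread, op, addr in accesses
            if len(threads[addr]) > 1 and addr in written]


# Hypothetical trace.
TRACE = [
    ("T1", "load", "A"), ("T1", "store", "A"),    # A: only T1 touches it
    ("T2", "load", "B"), ("T2", "store", "B"),    # B: only T2 touches it
    ("T1", "load", "C"), ("T2", "load", "C"),     # C: shared but never written
    ("T1", "store", "D"), ("T2", "load", "D"),    # D: shared and written
]
REMAINING = filter_accesses(TRACE)
```

Only the accesses to D need ordering constraints in this example; everything else can be placed in any order consistent with each thread's own log.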

  27. Roadmap • Motivation • Load-based Logging Architecture • Offline Symbolic Analysis • Evaluation • Conclusion

  28. Evaluation Simulation Framework • Simics: simulate multi-processor execution (2, 4, 8, 16 cores) • Modified FeS2: simulate Total Store Order • Fast-forward up to known synchronization points • Simulated for 500 million instructions Benchmarks • SPLASH2: barnes, fmm, ocean • PARSEC2.0: blackscholes, bodytrack, x264 • SPEComp: wupwise, swim • Servers: apache, mysql Offline Symbolic Analysis • Yices SMT solver [Dutertre and Moura CAV’06]

  29. Program Input and LoadSB-Hit Log Size • On average, 192 MB/sec (8 threads, TSO) • Dominated by cache miss logging • 2.45% of load instructions read their values from the store buffer [Chart: log size split into Program Input (Data & Instr. Cache Misses) and LoadSB-Hit; average 192 MB/sec]

  30. Strata Hint Log Size • On average, 1.2 MB/sec (8 threads, TSO, b-bound 10) • 15% increase over SC to log the number of pending stores (I-Store) • No hardware support for precise shared-memory dependency logging [Chart: Strata hint log size for SC and for TSO (with I-Stores); average 1.2 MB/sec]

  31. Offline Analysis Overhead • On average, 260 seconds/sec (8 threads, TSO, b-bound 10) • Overall, 30% more efficient than SC • One-time cost before replay [Chart: offline analysis time for SC and TSO; average 260 secs]

  32. Performance Overhead etc. Performance Overhead • Performance is mainly dominated by cache miss logging • Additional logging overhead for LoadSB-Hit and I-Store is small • On average, <1% slowdown in IPC Paper includes more results on • Comparisons between different Strata bounding schemes • Effectiveness of cache hit filtering • Scalability results on different numbers of processors • …

  33. Conclusion Reproducing concurrent executions is a huge challenge Our proposal: a complexity-effective hardware solution for TSO • Record cache miss data and store buffer hits • Record Stratum (memory count + store buffer count) to bound search space • Determine shared memory dependencies using offline symbolic analysis Result (8 threads, TSO) • Performance overhead: less than 1% • Total log size: 193 MB/sec (Program input + Strata hints) • Offline analysis: 260 seconds/sec (30% more efficient than SC)

  34. Thank you
