Deterministic Logging/Replaying of Applications
Motivation
Run-time framework goals
• Collect a complete trace of a program's user-mode execution
• Keep the tracing overhead for both space and time low
• Re-simulate the traced execution deterministically, based on the collected trace, with full fidelity down to the instruction level
• Full fidelity: user mode only; the kernel is not traced, only user-mode I/O callbacks
Advantages
• A complete program trace that can be analyzed from multiple perspectives (replay analyzers: debuggers, locality, etc.)
• The trace can be collected on one machine and replayed on other machines (or analyzed live by streaming)
Challenges: trace size and performance
Original Record-Replay Approaches
• InstantReplay '87: record the order of memory accesses; the overhead may affect program behavior
• RecPlay '00: record only synchronizations; not deterministic in the presence of data races
• Netzer '93: record the optimal trace; too expensive to track all memory locations
• Bacon & Goldstein '91: record memory bus transactions in hardware; high logging bandwidth
Motivation
Increasing use of and development for multi-core processors
• Multi-threaded program behavior is non-deterministic: shared-memory updates can happen in a different order from run to run
• To effectively debug software, developers must be able to replay executions that exhibit concurrency bugs
Related Concepts
• Runtime interpretation/translation of binary instructions
  • Requires no static instrumentation or special symbol information
  • Handles dynamically generated code and self-modifying code
  • Recording/logging overhead: ~100-200x
• More recent logging: proposed hardware support (for the multi-threaded domain)
  • FDR (Flight Data Recorder)
  • BugNet (cache bits set on first load)
  • RTR (Regulated Transitive Reduction)
  • DeLorean (ISCA 2008; chunks of instructions)
  • Strata (a time layer across all the logs of the running threads)
  • iDNA (Diagnostic infrastructure using Nirvana; Microsoft)
Deterministic Replay
Re-execute the exact same sequence of instructions as recorded in a previous run
• Single-threaded programs
  • Load values needed to reproduce the behavior of a run (Load Log)
  • Registers updated by system calls and signal handlers (Reg Log)
  • Output of special instructions: RDTSC, CPUID (Reg Log)
  • System call virtualization (cloning arguments, updates)
  • Checkpointing (log summary, ~10 million)
• Multi-threaded programs
  • Additionally log the interleaving among threads (shared memory update ordering: SMO Log)
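As a rough illustration of what these logs contain, here is a minimal sketch in C of the three record types described above; the field names and layouts are assumptions for illustration, not PinSEL/PinPLAY's actual formats.

    #include <stdint.h>

    /* Load Log: value returned by a load that the replayer cannot reproduce on its own. */
    typedef struct {
        uint64_t icount;   /* instruction count at which the load executed */
        uint64_t address;  /* effective address of the load */
        uint64_t value;    /* value to inject on replay */
    } load_log_entry;

    /* Reg Log: register state injected after system calls, signal handlers,
       and non-deterministic instructions such as RDTSC and CPUID. */
    typedef struct {
        uint64_t icount;   /* where in the instruction stream to apply it */
        uint16_t reg_id;   /* which architectural register */
        uint64_t value;    /* recorded register value */
    } reg_log_entry;

    /* SMO Log: shared-memory ordering constraint between two threads:
       "thread waiter_tid cannot execute its memory reference waiter_ref
        until thread source_tid has executed its memory reference source_ref". */
    typedef struct {
        uint32_t waiter_tid, source_tid;
        uint64_t waiter_ref, source_ref;
    } smo_log_entry;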
PinSEL – System Effect Log (SEL)
Logging the program load values needed for deterministic replay:
• The first access to a memory location
• Values modified by the system (system effect) and read by the program
• Machine- and time-sensitive instructions (cpuid, rdtsc)
Example program execution:
    Store A  (A ← 111)
    Store B  (B ← 55)
    Load C   (C = 9)    logged (first access)
    Load D   (D = 10)   logged (first access)
    system call modifies B (B ← 0) and C (C ← 99)
    Load A   (A = 111)  not logged (value produced by the program's own store)
    Load B   (B = 0)    logged (modified by the system call)
    Load C   (C = 99)   logged (modified by the system call)
    Load D   (D = 10)   not logged (unchanged since its first, already-logged access)
• Trace size is ~4-5 bytes per instruction
Optimization: Trace Select Reads
• Observation: hardware caches eliminate most off-chip reads
• Optimize logging:
  • Logger and replayer simulate identical cache memories
  • A simple cache (the memory-copy structure) decides which values to log; there are no tags or valid bits to check. If the values mismatch, the load is logged.
  • Average trace size is <1 bit per instruction
Example: the only read that is not predicted, and is therefore logged, is the one following the system call
    i = 1;
    for (j = 0; j < 10; j++) { i = i + j; }
    k = i; // value read is 46
    System_call();
    k = i; // value read is 0 (not predicted)
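The logging decision can be sketched as follows: the logger keeps a software "memory copy" of what the replayer will predict each location holds, and writes a load to the log only when the actual value differs from that prediction. This is a minimal sketch, assuming a flat byte array rather than PinSEL's real cache-like structure; on_load, on_store, and emit_load_log are hypothetical hooks.

    #include <stdint.h>
    #include <string.h>

    #define SHADOW_BYTES (1u << 20)        /* toy "memory copy" structure, 1 MiB */

    static uint8_t shadow[SHADOW_BYTES];   /* the value the replayer would predict */

    /* Hypothetical logger hook for a load of up to 8 bytes: compare the actual
       value against the prediction and log only on a mismatch. */
    void on_load(uint64_t addr, const void *actual, size_t size,
                 void (*emit_load_log)(uint64_t, const void *, size_t)) {
        uint8_t *predicted = &shadow[addr % (SHADOW_BYTES - 8)];  /* no tags, no valid bits */
        if (memcmp(predicted, actual, size) != 0) {
            emit_load_log(addr, actual, size);   /* not predicted: goes into the Load Log */
            memcpy(predicted, actual, size);     /* resynchronize logger and replayer state */
        }
    }

    /* Stores update the prediction so later loads of program-written data
       are predicted correctly and need not be logged. */
    void on_store(uint64_t addr, const void *value, size_t size) {
        memcpy(&shadow[addr % (SHADOW_BYTES - 8)], value, size);
    }

Because the replayer simulates the identical structure, aliasing in this tagless array only causes extra logging, never an incorrect replayed value.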
Example Overhead
• PinSEL and PinPLAY
• Initial work (2006) with single-threaded programs:
  • SPEC2000 ref runs: 130x slowdown for PinSEL and ~80x for PinPLAY (without in-lining)
  • A subset of the SPLASH2 benchmarks: 230x slowdown for PinSEL
• Now (geometric mean over SPEC2006):
  • Pin alone: 1.4x
  • Logger: 83.6x
  • Replayer: 1.4x
Example: Microsoft iDNA Trace Writer Performance
• Memchecker and Valgrind are in the 30-40x range on SPEC CPU 2006
• iDNA is ~11x (it does not log shared-memory dependences explicitly)
• Instead, it uses a sequence number for every lock-prefixed memory operation, enabling offline data-race analysis
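One way to realize the last point is to tag every lock-prefixed (atomic) memory operation with a globally increasing sequence number at record time and reconstruct cross-thread orderings offline. This is a minimal sketch assuming a single shared counter; it is not iDNA's actual mechanism, and record_atomic_op is a hypothetical instrumentation hook.

    #include <stdatomic.h>
    #include <stdint.h>

    static atomic_uint_fast64_t global_seq;   /* shared across all traced threads */

    typedef struct {
        uint32_t tid;      /* thread performing the atomic operation */
        uint64_t icount;   /* position in that thread's instruction stream */
        uint64_t seq;      /* global order among lock-prefixed operations */
    } atomic_op_record;

    /* Hypothetical hook invoked just before each lock-prefixed instruction.
       The (tid, icount, seq) triples are enough for an offline tool to order
       conflicting atomics and flag potential data races. */
    atomic_op_record record_atomic_op(uint32_t tid, uint64_t icount) {
        atomic_op_record r;
        r.tid = tid;
        r.icount = icount;
        r.seq = atomic_fetch_add(&global_seq, 1);
        return r;
    }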
Logging Shared Memory Ordering
• Emulation of directory-based cache coherence
• Identifies RAW, WAR, and WAW dependences
• The directory is indexed by hashing the effective address
• Each directory entry represents an address range
Figure: loads and stores from the program execution (e.g., Store A, Load B) hash into entries of the directory
Directory Entries
Every DirEntry maintains:
• The thread id of the last_writer
• A vector of timestamps of the last access by each thread to that entry (a timestamp is the number of memory references the thread has executed)
• On loads: update the timestamp for the thread in the entry
• On stores: update the timestamp and the last_writer field
Figure: Thread T1 executes 1: Store A, 2: Store A, 3: Store F; Thread T2 executes 1: Load F, 2: Load A, 3: Load F. Afterwards, DirEntry [A:D] has last_writer = T1 and timestamp vector (T1: 2, T2: 2); DirEntry [E:H] has last_writer = T1 and timestamp vector (T1: 3, T2: 3).
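A directory entry as described above might look like the following; this is a sketch with a fixed thread count, and the names (dir_entry, dir_lookup, NO_WRITER) and the 64-byte range granularity are illustrative assumptions.

    #include <stdint.h>

    #define MAX_THREADS 64
    #define NO_WRITER   (-1)
    #define DIR_ENTRIES 4096

    /* One directory entry covers an address range selected by hashing. */
    typedef struct {
        int      last_writer;                /* thread id of the last store, or NO_WRITER */
        uint64_t last_access[MAX_THREADS];   /* per-thread timestamp (number of memory
                                                references that thread has executed) of its
                                                most recent access to this entry; 0 = never */
    } dir_entry;

    static dir_entry directory[DIR_ENTRIES];  /* initialization to NO_WRITER omitted */

    /* Hash an effective address to its entry; here each entry covers a 64-byte range. */
    static dir_entry *dir_lookup(uint64_t addr) {
        return &directory[(addr >> 6) % DIR_ENTRIES];
    }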
Detecting Dependences
• A RAW dependence between threads T and T' is established if:
  • T executes a load that maps to directory entry A, and
  • T' is the last_writer for that entry
• A WAW dependence between T and T' is established if:
  • T executes a store that maps to directory entry A, and
  • T' is the last_writer for that entry
• A WAR dependence between T and T' is established if:
  • T executes a store that maps to directory entry A, and
  • T' has accessed that entry in the past and T' is not the last_writer (that case is already captured as WAW)
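Putting the rules together, the per-access bookkeeping and dependence checks can be sketched as below, building on the hypothetical dir_entry above; emit_smo(waiter, waiter_ref, source, source_ref) stands in for appending one SMO log entry and is not a real API.

    /* Called when thread 'tid', having executed 'ts' memory references so far,
       performs a load from 'addr'. */
    void on_shared_load(int tid, uint64_t ts, uint64_t addr,
                        void (*emit_smo)(int, uint64_t, int, uint64_t)) {
        dir_entry *e = dir_lookup(addr);
        /* RAW: some other thread was the last writer of this entry. */
        if (e->last_writer != NO_WRITER && e->last_writer != tid)
            emit_smo(tid, ts, e->last_writer, e->last_access[e->last_writer]);
        e->last_access[tid] = ts;                  /* loads update the timestamp */
    }

    /* Called when thread 'tid' performs a store to 'addr'. */
    void on_shared_store(int tid, uint64_t ts, uint64_t addr,
                         void (*emit_smo)(int, uint64_t, int, uint64_t)) {
        dir_entry *e = dir_lookup(addr);
        /* WAW: a different thread was the last writer of this entry. */
        if (e->last_writer != NO_WRITER && e->last_writer != tid)
            emit_smo(tid, ts, e->last_writer, e->last_access[e->last_writer]);
        /* WAR: any other thread that accessed this entry and is not the last writer. */
        for (int t = 0; t < MAX_THREADS; t++)
            if (t != tid && t != e->last_writer && e->last_access[t] != 0)
                emit_smo(tid, ts, t, e->last_access[t]);
        e->last_writer = tid;                      /* stores update the last_writer ... */
        e->last_access[tid] = ts;                  /* ... and the timestamp */
    }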
Example
Same execution as above: Thread T1 executes 1: Store A, 2: Store A, 3: Store F; Thread T2 executes 1: Load F, 2: Load A, 3: Load F. The directory entries end up as before ([A:D]: last_writer T1, vector T1: 2, T2: 2; [E:H]: last_writer T1, vector T1: 3, T2: 3), and the figure marks the detected WAW, RAW, and WAR dependences between the two threads.
SMO log entries:
• Thread T2 cannot execute memory reference 2 until T1 executes its memory reference 1 (T2: 2 after T1: 1)
• Thread T1 cannot execute memory reference 2 until T2 executes its memory reference 2 (T1: 2 after T2: 2)
• Thread T1 cannot execute memory reference 3 until T2 executes its memory reference 3 (T1: 3 after T2: 3)
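On the replay side, each SMO entry acts as a wait condition. Below is a minimal sketch of how a replayer might enforce "T2 cannot execute reference 2 until T1 has executed reference 1"; the progress array and the smo_wait/smo_advance hooks are illustrative assumptions, not PinPLAY's actual replay mechanism.

    #include <stdatomic.h>
    #include <stdint.h>
    #include <sched.h>

    #define MAX_THREADS 64

    /* Number of memory references each replayed thread has completed so far. */
    static atomic_uint_fast64_t progress[MAX_THREADS];

    /* Before executing the memory reference named as the waiter in an SMO entry,
       block until the source thread has reached the required reference count. */
    void smo_wait(int source_tid, uint64_t source_ref) {
        while (atomic_load(&progress[source_tid]) < source_ref)
            sched_yield();                 /* spin politely until the order is satisfied */
    }

    /* After every memory reference, a replayed thread publishes its progress. */
    void smo_advance(int tid) {
        atomic_fetch_add(&progress[tid], 1);
    }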
Ordering Memory Accesses (Reducing Log Size)
• Preserving the order of conflicting accesses will reproduce the execution
• a→b: "a happens-before b"
• Ordering is transitive: a→b and b→c imply a→c
• Two instructions must be ordered if:
  • they both access the same memory location, and
  • one of them is a write
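The "must be ordered" test is simply a conflict check; a tiny sketch, where mem_access and must_order are illustrative names:

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        uint64_t addr;     /* effective address (or address range) touched */
        bool     is_write;
    } mem_access;

    /* Two accesses from different threads conflict, and therefore must be ordered
       in the log, iff they touch the same location and at least one is a write. */
    bool must_order(const mem_access *a, const mem_access *b) {
        return a->addr == b->addr && (a->is_write || b->is_write);
    }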
Constraints: Enforcing Order
Figure: processor P1 executes a then b; processor P2 executes c then d.
• To guarantee a→d, it is enough to enforce any one of a→d, b→d, a→c, or b→c (the latter three overconstrain the schedule, but each implies a→d through program order)
• Suppose we also need b→c: then b→c is necessary anyway, and a→d becomes redundant
Problem Formulation
Figure: Thread I executes ld A, st B, st C, ld D, sub, ld B while Thread J executes add, st C, ld B, st A, st C, st D. The recording run and the replay run show the same instruction streams, with dependences drawn in black and conflicts in red, and the log carrying the ordering from recording to replay.
• Goal: reproduce exactly the same conflicts during replay: no more, no less
Log All Conflicts
• Detect conflicts, assign instruction counts (ICs) as logical timestamps, and write the log
Thread I (ICs 1-6): ld A, st B, st C, ld D, sub, ld B
Thread J (ICs 1-6): add, st C, ld B, st A, st C, st D
• Resulting logs (each entry s→d means this thread's instruction d waits for the other thread's instruction s; each entry is 16 bytes):
  • Log J: 2→3, 1→4, 3→5, 4→6
  • Log I: 2→3
• Log size: 5 × 16 = 80 bytes (10 integers)
• But there are too many conflicts to log them all
Netzer's Transitive Reduction
• TR-reduced log: an ordering already implied by a logged ordering plus program order is dropped
• Log J: 2→3, 3→5, 4→6 (the edge 1→4 is implied by 2→3 and program order)
• Log I: 2→3
• Log size: 4 × 16 = 64 bytes (8 integers)
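With two threads the transitive-reduction test is simple: an edge s→d is redundant if some already-kept edge s'→d' has s' ≥ s and d' ≤ d, because program order supplies s→s' on the source thread and d'→d on the destination thread. A sketch of that check, using the hypothetical names order_edge and edge_is_redundant and assuming the two-thread case shown on these slides:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* A logged ordering edge: instruction src_ic of the source thread must
       execute before instruction dst_ic of the destination thread. */
    typedef struct {
        uint64_t src_ic;
        uint64_t dst_ic;
    } order_edge;

    /* An edge is redundant under transitive reduction if a kept edge with a later
       (or equal) source and an earlier (or equal) destination already implies it. */
    bool edge_is_redundant(order_edge candidate,
                           const order_edge *kept, size_t n_kept) {
        for (size_t i = 0; i < n_kept; i++)
            if (kept[i].src_ic >= candidate.src_ic &&
                kept[i].dst_ic <= candidate.dst_ic)
                return true;
        return false;
    }

In the example, the edge 1→4 in Log J is redundant given 2→3 (2 ≥ 1 and 3 ≤ 4), which is exactly the edge the reduction drops.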
RTR (Regulated Transitive Reduction): Stricter Dependences to Aid Vectorization
• RTR may replace several dependences with a single, stricter (artificial) one
• Log J: 2→3, 4→5 (the stricter edge 4→5 implies both 3→5 and 4→6 through program order)
• Log I: 2→3
• Reduced log size: 48 bytes (6 integers)
• Result: 4% overhead and 0.2 MB/core/second of logging (Apache)