R2: An application-level kernel for record and replay. Z. Guo, X. Wang, J. Tang, X. Liu, Z. Xu, M. Wu, M. F. Kaashoek, Z. Zhang (MSR Asia, Tsinghua, MIT), OSDI’08. Shimin Chen, LBA Reading Group.
What is R2? • Library-based record & replay • Intercept calls, record in log, replay from log • Novel features: • Allows users (app developers) to decide at which interface to record and replay • A set of annotations for the interface calls • Implementation: Windows • Supports the Win32, MPI, and SQLite APIs
Outline • Introduction • Design overview • Execution orders • Annotations for optimization • Implementation • Evaluation • Summary
Choosing an Interface for Record & Replay • Must choose a “cut” in the call graph • Above the interface: executed during both record and replay • Below the interface: executed during record only; during replay, results are read from the log
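The cut can be illustrated with a minimal sketch. This is not R2's API: the `Interposer` class, `intercept`, `read_clock`, and `app` are hypothetical names chosen for illustration. Code above the interface (`app`) runs in both modes, while the function below it (`read_clock`) runs only while recording.

```python
import time


class Interposer:
    """Hypothetical record/replay layer (an illustration, not R2's actual API)."""

    def __init__(self, mode, log=None):
        assert mode in ("record", "replay")
        self.mode = mode
        self.log = [] if log is None else list(log)
        self._cursor = 0

    def intercept(self, func):
        """Wrap a function that sits on the interposed interface."""
        def stub(*args):
            if self.mode == "record":
                result = func(*args)  # code below the interface runs
                self.log.append((func.__name__, result))
                return result
            # Replay: code below the interface is skipped; the log supplies the result.
            _name, result = self.log[self._cursor]
            self._cursor += 1
            return result
        return stub


def read_clock():
    # Below the interface: a source of non-determinism.
    return time.time()


def app(now):
    # Above the interface: executed during both record and replay.
    return f"stamp={now()}"


recorder = Interposer("record")
recorded_output = app(recorder.intercept(read_clock))

replayer = Interposer("replay", log=recorder.log)
replayed_output = app(replayer.intercept(read_clock))
```

Because the replayed run never calls `time.time()`, it produces the same output as the recorded run.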
Isolation Rule • RULE 1 (ISOLATION) All instances of unrecorded reads and writes to a variable should be either below or above the interposed interface. • Isolate variables above the interface and variables below the interface • Can hold for Windows • For example, as long as R2 intercepts the complete set of file functions, file descriptors can be recorded
Non-Determinism Rule • Any source of non-determinism should be below the interposed interface. • Sources of non-determinism: (1) calls that receive external data; (2) shared-memory inter-process communication; (3) variables shared by multiple threads • R2 can handle (1) directly • For (2) and (3), a higher-level interface must be chosen that hides their effects (e.g., lock and unlock for spinlocks)
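Terminology • The interposed interface separates replay space (above) from system space (below) • R2 syscalls: calls from replay space into system space • R2 upcalls: calls from system space into replay space • R2 records the output of R2 syscalls, the input of R2 upcalls, and their ordering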
Execution Control • R2 tracks the state of every thread with a replay/system mode bit • Mode bit is updated when crossing the interface • Recording is avoided for R2 syscalls made from R2 system space • When a user invokes R2 with an application • R2’s initial state is in system space • The “main” is treated as an upcall (recorded, going into replay space)
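A minimal sketch of the mode bit, assuming a per-thread flag and decorator-style stubs. The names `r2_syscall`, `r2_upcall`, and `in_replay_space` are hypothetical, not R2's implementation. A syscall made from system space (here, `inner` called by `outer`) is not recorded.

```python
import threading

_state = threading.local()  # holds the per-thread mode bit
log = []


def in_replay_space():
    # Threads start in system space, matching R2's initial state.
    return getattr(_state, "replay_space", False)


def r2_syscall(func):
    def stub(*args):
        if not in_replay_space():
            return func(*args)  # called from system space: not recorded
        _state.replay_space = False  # crossing into system space
        try:
            result = func(*args)
        finally:
            _state.replay_space = True  # crossing back into replay space
        log.append(("syscall", func.__name__, result))
        return result
    return stub


def r2_upcall(func):
    def stub(*args):
        previous = in_replay_space()
        if not previous:
            log.append(("upcall", func.__name__, args))  # record upcall input
        _state.replay_space = True  # crossing into replay space
        try:
            return func(*args)
        finally:
            _state.replay_space = previous
    return stub


@r2_syscall
def inner():
    return 7


@r2_syscall
def outer():
    return inner() + 1  # nested syscall runs in system space


@r2_upcall
def main():
    return outer()


result = main()
```

After the run, the log holds one upcall entry for `main` and one syscall entry for `outer`; the nested `inner` call leaves no entry.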
Memory Management • R2 ensures the following in the replay space • malloc/free return the same address • R2 replay space uses a dedicated memory pool • Stack locations are the same • R2 replay space uses a separate stack per thread • R2 system space uses different stacks • R2 syscalls, e.g., getcwd(NULL,0), return memory buffers at the same locations • Returned buffer is copied to space allocated from the replay pool
Annotation and Code Generation • Developers annotate interface calls. R2 then automatically generates stub code for record and replay. • [Figure: annotated function prototype] • Direction: in/out • Buffer size: bsize(return) • The annotated buffer will be recorded • If a C++ object is to be recorded, serialization & deserialization should be provided via operator overloading on streams
Annotation for Asynchronous Operation • [Figure: annotated ReadFileEx prototype and its completion callback; labels: start asynchronous file read, key to identify the call, callback] • prepare indicates that ReadFileEx issues an asynchronous I/O request keyed by lpOverlapped • commit indicates that the request keyed by lpOverlapped is completed and that the transferred data size is cbTransferred
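One way to picture the prepare/commit pairing is a table of pending requests indexed by the key. The `AsyncTracker` class below is a hypothetical illustration, not code generated by R2; the string key stands in for lpOverlapped and the integer for cbTransferred.

```python
class AsyncTracker:
    """Hypothetical bookkeeping for keyed asynchronous requests."""

    def __init__(self):
        self.pending = {}  # key -> buffer awaiting completion
        self.log = []

    def prepare(self, key, buffer):
        # The annotated call issues a request; remember its buffer by key.
        if key in self.pending:
            raise ValueError(f"request {key!r} is already pending")
        self.pending[key] = buffer
        self.log.append(("prepare", key))

    def commit(self, key, transferred):
        # The completion callback names the same key and the transferred size.
        buffer = self.pending.pop(key)  # KeyError if there was no prepare
        data = bytes(buffer[:transferred])
        self.log.append(("commit", key, data))
        return data


tracker = AsyncTracker()
buf = bytearray(8)
tracker.prepare("overlapped-1", buf)
buf[:3] = b"abc"  # stand-in for the I/O filling three bytes
data = tracker.commit("overlapped-1", 3)
```

Only the bytes actually transferred are logged at commit time, which is why the completion needs both the key and the size.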
How to track execution orders? • Tracking causality • R2 syscall – R2 upcall causality: callback • See previous example • R2 syscall – R2 syscall causality: sync(key)
Recording Event Order (Lamport Clock) • Thread t’s clock c(t); event e’s clock c(e)
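The slide's clock-update rules are not preserved in this text. As a sketch, the textbook Lamport rules look like this; R2's exact rules may differ, and `LamportThread` is a hypothetical name.

```python
class LamportThread:
    """Textbook Lamport clock for one thread (assumed, not R2's exact rules)."""

    def __init__(self, name):
        self.name = name
        self.clock = 0  # c(t)

    def local_event(self):
        # An event with no incoming causal edge: advance c(t) and stamp the event.
        self.clock += 1
        return self.clock  # c(e)

    def event_after(self, source_clock):
        # An event caused by an event with clock source_clock (possibly on
        # another thread): jump past both clocks.
        self.clock = max(self.clock, source_clock) + 1
        return self.clock  # c(e)


t1 = LamportThread("t1")
t2 = LamportThread("t2")

a = t1.local_event()    # t1: first event
b = t1.local_event()    # t1: e.g., an unlock
c = t2.local_event()    # t2: unrelated event
d = t2.event_after(b)   # t2: e.g., the lock that follows t1's unlock
```

The causally later event always receives the larger clock (`d > b`), while unrelated events such as `a` and `c` may share a clock value.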
Replaying Event Order • Total-order recording + total-order replaying • Use a token to serialize execution • Causal-order recording + total-order replaying • Before replay, generate a total order based on the causal order recorded
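Generating a total order from a causal order can be sketched as a sort on the recorded clocks. This assumes each logged event carries a Lamport clock and a thread identifier; the event records and the tie-break on thread name are illustrative assumptions, not R2's log format.

```python
# Hypothetical log entries, in the arbitrary order they were written.
events = [
    {"thread": "t2", "clock": 3, "call": "lock"},
    {"thread": "t1", "clock": 1, "call": "lock"},
    {"thread": "t2", "clock": 1, "call": "recv"},
    {"thread": "t1", "clock": 2, "call": "unlock"},
]

# Sorting by clock respects causality; the thread name breaks ties
# deterministically between causally unrelated events.
total_order = sorted(events, key=lambda e: (e["clock"], e["thread"]))
schedule = [(e["thread"], e["call"]) for e in total_order]
```

Replay can then hand a token to threads in `schedule` order, so that t2's `lock` runs only after t1's `unlock`.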
Outline • Introduction • Design overview • Data transfers • Execution orders • Annotations for optimization • Implementation • Evaluation • Summary
Reducing Log Size for Frequent Calls • Some calls return the same value most of the time (e.g., GetLastError on Windows returns 0 in most cases) • “cache” annotation • R2 will cache the last return value • R2 will avoid recording the return value for subsequent calls until there is a change
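The effect of the cache annotation can be sketched as change-only logging. `CachedCallLog` is a hypothetical illustration of the idea, not R2's log format: a value is written only when it differs from the previous one, and replay fills in the gaps.

```python
class CachedCallLog:
    """Hypothetical change-only log for one frequently called function."""

    def __init__(self):
        self.entries = []       # (call_index, value), written only on change
        self._last = object()   # sentinel: no value cached yet
        self._calls = 0

    def record(self, value):
        if value != self._last:
            self.entries.append((self._calls, value))
            self._last = value
        self._calls += 1
        return value

    def replay(self):
        # Yield one value per recorded call, reusing the cached value
        # wherever the log has no entry.
        changes = dict(self.entries)
        current = None
        for index in range(self._calls):
            if index in changes:
                current = changes[index]
            yield current


cache_log = CachedCallLog()
returns = [0, 0, 0, 5, 0, 0]  # e.g., error codes that are mostly 0
for value in returns:
    cache_log.record(value)

replayed = list(cache_log.replay())
```

Six calls produce three log entries here; the saving grows with the length of each run of identical return values.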
Reproduce annotation • Some data can be reproduced at replay time without recording • For example, read file data from local disk • Can annotate with “reproduce” • R2 will execute the call during replay
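A minimal sketch of the reproduce idea, assuming the file on disk is unchanged between record and replay. `make_stub` and its `reproduce` flag are hypothetical names, not R2's generated code: the log keeps the call's name but not its data, and replay runs the call again.

```python
import os
import tempfile


def read_file(path):
    with open(path, "rb") as f:
        return f.read()


def make_stub(func, mode, log, reproduce=False):
    """Hypothetical stub generator for one intercepted function."""
    def stub(*args):
        if mode == "record":
            result = func(*args)
            # With reproduce, only the call name is logged, not the data.
            log.append((func.__name__, None if reproduce else result))
            return result
        name, logged = log.pop(0)
        assert name == func.__name__
        # With reproduce, the call is executed again instead of read from the log.
        return func(*args) if reproduce else logged
    return stub


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.bin")
    with open(path, "wb") as f:
        f.write(b"x" * 1024)

    log = []
    recorded = make_stub(read_file, "record", log, reproduce=True)(path)
    logged_payloads = [payload for _, payload in log]
    replayed = make_stub(read_file, "replay", log, reproduce=True)(path)
```

The 1024 bytes never enter the log, yet the replayed call returns the same data because it reads the same file.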
Outline • Introduction • Design overview • Data transfers • Execution orders • Defining your own syscalls • Annotations for optimization • Implementation • Evaluation • Summary
Detecting Un-recorded Non-Determinism • R2 records R2 syscall signature (e.g., name) and checks it during replay • Detect mismatch and report
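The check can be sketched as comparing each replayed call against the signature logged at the same position. Here the signature is just the call name; `check_call` and `ReplayMismatch` are hypothetical names used for illustration.

```python
class ReplayMismatch(Exception):
    """Raised when replay diverges from the recorded call sequence."""


def check_call(log, position, name):
    # Compare the call made during replay with the signature recorded
    # at the same position in the log.
    expected = log[position] if position < len(log) else "<end of log>"
    if expected != name:
        raise ReplayMismatch(f"call #{position}: expected {expected}, got {name}")


recorded = ["recv", "GetLastError", "send"]

# A faithful replay passes every check.
for index, name in enumerate(["recv", "GetLastError", "send"]):
    check_call(recorded, index, name)

# A divergent replay (e.g., caused by an unrecorded data race) is reported.
try:
    check_call(recorded, 1, "send")
    mismatch = None
except ReplayMismatch as exc:
    mismatch = str(exc)
```

A mismatch tells the developer where replay first diverged, which is a hint that some non-determinism sits above the interposed interface.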
Questions to be Answered: • How much effort is required to annotate the syscall/upcall interface? • How important are annotations to successful replay of applications? • How much does R2 slow down applications during recording? • How effective are custom syscall layers and annotations (cache and reproduce) in reducing log size and optimizing performance? • Replay is not evaluated: • “However, the replayed application without any debugging interaction runs much faster than when recording (e.g., a replay run of BitTorrent file downloading is 13x faster).”
Experimental Setup • All machines: • 2.0 GHz Xeon dual-core CPU, • 4 GB memory • two 250 GB, 7200 rpm disks • running Windows Server 2003 Service Pack 2 • interconnected via a 1 Gbps switch. • Unless explicitly specified: • the application data and R2 log files are kept on the same disk • total-order recording & execution • all optimizations (i.e., cache and reproduce) are turned off.
Annotation Effort • The paper says: • Win32 syscall interface (500+ functions): one person-week • MPI and SQLite: two person-days each
Performance without optimization • Apache is configured with 250 threads. • ApacheBench mimics 50 concurrent clients downloading 64 KB web pages. • Each configuration executes 500,000 requests.
Customized R2 Syscall Layers • Query: compute vertex degrees in a social network: SELECT COUNT(*) FROM edge GROUP BY src_uid; • The data set is about 3 MB. • FILE / MEM chooses where SQLite stores temporary data
Cache Annotation for Apache • Profiling shows that 5 R2 syscalls account for more than 50% of all syscalls • Using the cache annotation reduces the log size from 21.99 MB to 18.1 MB.
Reproduced File I/O (BitTorrent) 1 machine seeds a 4GB file, upload bandwidth is limited to 8MB/s. 10 machines download the file concurrently. Average log size is reduced from 17.1GB to 5.4GB by reproduce.
Reproduced Network I/O • GE and PU are two MPI benchmarks. • MPI functions are annotated with reproduce, so messages are not recorded but reproduced during replay.
Summary • Library-based record and replay in software • Annotation and automatic generation of stub code for record and replay • Supports an impressive range of Win32 applications • But cannot handle un-recorded non-determinism • e.g., data races in the replay space