Cyrus: Unintrusive Application-Level Record-Replay for Replay Parallelism

Cyrus: Unintrusive Application-Level Record-Replay for Replay Parallelism Nima Honarmand, Nathan Dautenhahn, Josep Torrellas and Samuel T. King (UIUC) Gilles Pokam and Cristiano Pereira (Intel) iacoma.cs.uiuc.edu

Record-and-Replay (RnR) • Record execution of a parallel program or a whole machine • Save non-deterministic events in a log • During replay, use the recoded log to enforce the same execution • Each thread follows the same sequence of instructions • Use cases • Debugging • Security • High availability

Contribution: Cyrus RnR System • Application-level RnR • RnR one or more programs in isolation • What users typically need • Fast replay • Replay-time parallelism • Flexibly trade off parallelism for log size • Unintrusive HW • No changes to snoopy cache coherence protocol

Capturing Non-determinism • Sources of non-determinism • Program inputs • Memory access interleavings • How to capture? • OS kernel extension to capture program inputs • HW support to capture memory interleavings (HW-assisted RnR) • This talk: recording memory interleavings

Recording Interleaving as Chunks P0 P1 add ... store A ... mul P0 P1 add …. store A …. mul …. sub div …. load A …. add div …. load A …. add …. Req Time Resp …. sub Inter-processor data dependences manifest as coherence messages Capture interleavings as ordered chunks of instructions

Restriction: Unintrusive HW P0 P1 P0 P0 P1 P1 invl invl data data P1 rd P1 wr P1 wr RAW WAW WAR • Requirements for HW-assisted RnR: • Do not augment or add coherence messages • Do not rely on explicit replies Only source is always aware → Use source-only recording • Unmodified snoopy protocols • In some coherence transactions, there is no reply

Challenge 1: Enable Replay Parallelism • Key to fast replay • Overlapped replay of chunks from diff. threads • Previous work: • DAG-based ordering (Karma [ICS’2011]) • Requires explicit replies • Augments coherence messages

Challenge 1: Enable Replay Parallelism P0 P1 P0 P0 P1 P1 Successor Predecessor P0→P1 P0→P1 ?

Challenge 2: Application-Level RnR P0 P1 P2 (1) (3) (2) (4) Monitored Application Non-monitored Communication Turn hardware on only when a recorded application runs. Four cases: (1) src=monitoring, dst=monitoring (2) src=monitoring, dst=not monitoring (3) src=not monitoring, dst=monitoring (4) src=not monitoring, dst=not monitoring Issues of source-only recording: Cannot distinguish between (1) and (2) (2) may result in a dependence later Not recording in (3) and (4)

Challenge 2: Application-Level RnR P0 P1 P2 (1) ser ser (3) (2) (2) (4) Monitored Application Non-monitored Dependence • Treat (2) as an Early Dependence • Defer and assign it to the next chunk of the target processor • (3) and (4) superseded by context switches • At context switch, record a Serialization Dependence to all other processors

Key: On-the-Fly Backend Software Pass Source-only Log DAG of Chunks Recording Processors Replaying Processors On-the-fly Backend … … P P P P P P • Transforms source-only log to DAG (for parallelism) • Fixes the Early and Serialization dependences • To support app-level RnR • Can trade replay parallelism for log size

Memory Race Recording Unit (RRU) P Mem Refs Evictions Cache RRU Snoops Bus • HW module that observes coherence transactions and cache evictions • Tracks loads/stores of the chunk in a signature • Keeps signatures for multiple recent chunks • Records for each chunk • # of instructions • Timestamp (# of coh. transactions) • Dependences for which the chunk is source • Dumps recorded chunks into a log in memory

RRUs Record Source-Only Log P0 P1 P2 Chunk TS Successor Vector TimeStamp Rd A Wr B P0 C00 … P0 P1 P2 Rd B C00 100 - 150 100 … 100 … … C01 200 - 200 200 … Wr A 150 … … … … C01 … P1 … 300 200 250 Rd D C10 250 - - 250 … … C10 Wr D … … … … … P2 C20 … … C20 300 - - -

Backend Pass Creates DAG C00 C00 100 - 150 100 Chunks of P0 C01 200 - 200 200 C01 Chunks of P1 C10 250 - - 250 C10 Chunks of P2 C20 C20 300 - - - • Finds the target chunk for each recorded dependency • Creates bidirectional links between src and dst chunks • This algorithm is called MaxPar

Trading Replay Parallelism for Log Size C00 C00 + C01 Serial C00 + C01 C00 Stitched C01 C01 C10 C10 C10 C10 C20 C20 StSerial • No Parallelism • Smallest log C20 C20 • Less Parallelism • Smaller log • No Parallelism • Even Smaller log

Evaluation • Using Simics • Full-system simulation with OS • Wrote a Linux kernel module to • Records application inputs • Controls RRUs • Model 8 + 1 processors • 8 processors for the app • 1 processor for the backend • 10 SPLASH-2 benchmarks

Replay Time Normalized to Recording • Large difference between MaxPar and Serial replay • On 8 processors, unoptimizedMaxPar replay is only 50% slower than recording

Conclusions • Cyrus: RnR system that supports • Application-level RnR • Unintrusive hardware • Flexible replay parallelism • Key idea: On-the-fly software backend pass • On 8 processors: • Large difference between MaxPar and Serial replay • Unoptimized replay of MaxPar is only 50% slower than recording • Negligible recording overhead • Upcoming ISCA’13 paper describes our FPGA RnR prototype

Cyrus: Unintrusive Application-Level Record-Replay for Replay Parallelism

Cyrus: Unintrusive Application-Level Record-Replay for Replay Parallelism

Presentation Transcript

Digital Interactive Audio Video Recorder

A “Flight Data Recorder” for Enabling Full-system Multiprocessor Deterministic Replay

Offline Untrusted Storage with Immediate Detection of Forking and Replay Attacks

Chapter 6 Multiprocessors and Thread-Level Parallelism

Recording Inter-Thread Data Dependencies for Deterministic Replay

Efficient Checkpointing of Java Software using Context-Sensitive Capture and Replay

Theta-Coupled Periodic Replay in Working Memory

PinPlay : A Framework for Deterministic Replay and Reproducible Analysis of Parallel Programs

Language-Based Replay via Data Flow Cut

SRTP Replay Protection

Replay

Vocabulary Instant Replay

Instant replay in baseball

Talk contents:

BDO SEIDMAN, LLP’S December 22, 2005 FINANCIAL REPORTING UPDATE

BDO SEIDMAN, LLP’S June 2007 FINANCIAL REPORTING UPDATE

Protocol-Independent Adaptive Replay of Application Dialog

Efficient Checkpointing of Java Software using Context-Sensitive Capture and Replay

EndlessYouTube : Repeat/Loop/Replay YouTube

IVB7 HD Webcasting Equipment

AUV Workbench : Integrated 3D for Interoperable Mission Rehearsal, Reality and Replay

Automatically Classifying Benign and Harmful Data Races Using Replay Analysis