Cyrus: Unintrusive Application-Level Record-Replay for Replay Parallelism

Cyrus: Unintrusive Application-Level Record-Replay for Replay Parallelism. Nima Honarmand, Nathan Dautenhahn, Josep Torrellas and Samuel T. King (UIUC); Gilles Pokam and Cristiano Pereira (Intel). iacoma.cs.uiuc.edu

Presentation Transcript


  1. Cyrus: Unintrusive Application-Level Record-Replay for Replay Parallelism Nima Honarmand, Nathan Dautenhahn, Josep Torrellas and Samuel T. King (UIUC) Gilles Pokam and Cristiano Pereira (Intel) iacoma.cs.uiuc.edu

  2. Record-and-Replay (RnR)
     • Record execution of a parallel program or a whole machine
       • Save non-deterministic events in a log
     • During replay, use the recorded log to enforce the same execution
       • Each thread follows the same sequence of instructions
     • Use cases
       • Debugging
       • Security
       • High availability
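The record-then-replay idea on this slide can be illustrated with a toy model. This is only a sketch of the general concept, not the Cyrus mechanism: the "threads" are lists of hypothetical operations on one shared value, and the only non-deterministic event is which thread runs next.

```python
import random

def record(threads, rng):
    """Run the threads under a random interleaving and log each scheduling choice."""
    pcs = [0] * len(threads)              # per-thread program counters
    shared, log = 0, []
    while any(pc < len(t) for pc, t in zip(pcs, threads)):
        runnable = [i for i, t in enumerate(threads) if pcs[i] < len(t)]
        tid = rng.choice(runnable)        # the non-deterministic event
        shared = threads[tid][pcs[tid]](shared)
        pcs[tid] += 1
        log.append(tid)                   # save it in the log
    return shared, log

def replay(threads, log):
    """Re-run the threads, using the log to enforce the recorded interleaving."""
    pcs = [0] * len(threads)
    shared = 0
    for tid in log:
        shared = threads[tid][pcs[tid]](shared)
        pcs[tid] += 1
    return shared

# Two toy threads whose final result depends on the interleaving.
threads = [
    [lambda x: x + 1, lambda x: x * 2],
    [lambda x: x + 10, lambda x: x - 3],
]
result, log = record(threads, random.Random())
assert replay(threads, log) == result     # replay reproduces the recorded run
```

Each thread executes the same sequence of operations in both runs; only the logged order makes the outcome reproducible.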

  3. Contribution: Cyrus RnR System
     • Application-level RnR
       • RnR one or more programs in isolation
       • What users typically need
     • Fast replay
       • Replay-time parallelism
       • Flexibly trade off parallelism for log size
     • Unintrusive HW
       • No changes to snoopy cache coherence protocol

  4. Capturing Non-determinism
     • Sources of non-determinism
       • Program inputs
       • Memory access interleavings
     • How to capture?
       • OS kernel extension to capture program inputs
       • HW support to capture memory interleavings (HW-assisted RnR)
     • This talk: recording memory interleavings

  5. Recording Interleaving as Chunks
     [Figure: timelines of P0 and P1; a store to A on one processor and a load of A on the other are connected by a coherence request (Req) and response (Resp).]
     • Inter-processor data dependences manifest as coherence messages
     • Capture interleavings as ordered chunks of instructions

  6. Restriction: Unintrusive HW
     [Figure: RAW, WAW and WAR dependences between P0 and P1, showing the invalidation and data messages triggered by P1's read or write.]
     • Requirements for HW-assisted RnR:
       • Do not augment or add coherence messages
       • Do not rely on explicit replies
     • Unmodified snoopy protocols
     • In some coherence transactions, there is no reply
     • Only the source is always aware → use source-only recording

  7. Challenge 1: Enable Replay Parallelism
     • Key to fast replay
       • Overlapped replay of chunks from different threads
     • Previous work:
       • DAG-based ordering (Karma [ICS’2011])
         • Requires explicit replies
         • Augments coherence messages

  8. Challenge 1: Enable Replay Parallelism
     [Figure: chunks of P0 and P1 with a P0→P1 dependence, labeled with Successor and Predecessor links and a "?" annotation.]

  9. Challenge 2: Application-Level RnR
     [Figure: processors P0, P1 and P2 running monitored and non-monitored applications, with communications labeled (1) to (4).]
     • Turn hardware on only when a recorded application runs
     • Four cases:
       (1) src=monitoring, dst=monitoring
       (2) src=monitoring, dst=not monitoring
       (3) src=not monitoring, dst=monitoring
       (4) src=not monitoring, dst=not monitoring
     • Issues of source-only recording:
       • Cannot distinguish between (1) and (2)
       • (2) may result in a dependence later
       • Not recording in (3) and (4)

  10. Challenge 2: Application-Level RnR
      [Figure: the same P0, P1, P2 scenario, now with case (2) drawn as a deferred dependence and "ser" serialization dependences at context switches.]
      • Treat (2) as an Early Dependence
        • Defer and assign it to the next chunk of the target processor
      • (3) and (4) superseded by context switches
        • At context switch, record a Serialization Dependence to all other processors
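The two fixes on this slide can be sketched as follows. This is an illustrative model under stated assumptions, not the Cyrus implementation: `resolve_target`, `serialization_vector`, the `(name, start, end)` interval layout and the sample numbers are all hypothetical.

```python
def resolve_target(dep_time, monitored_chunks):
    """Pick the chunk on the target processor that a dependence lands on.

    monitored_chunks: (name, start, end) tuples for the monitored application's
    chunks on the target processor, sorted by time. If the processor was not
    running the monitored application at dep_time (an Early Dependence), the
    dependence is deferred to the next monitored chunk.
    """
    for name, start, end in monitored_chunks:
        if end >= dep_time:       # running at dep_time, or the first chunk after it
            return name
    return None                   # no later chunk has been logged yet

def serialization_vector(pid, n_procs, now):
    """At a context switch on pid, record a dependence to every other processor."""
    return {p: now for p in range(n_procs) if p != pid}

# P1 runs the monitored application during [0, 100] and again during [180, 260].
p1_chunks = [("C10", 0, 100), ("C11", 180, 260)]
print(resolve_target(50, p1_chunks))    # dependence while C10 is running
print(resolve_target(150, p1_chunks))   # early dependence, deferred to C11
print(serialization_vector(0, 3, 400))  # context switch on P0 at time 400
```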

  11. Key: On-the-Fly Backend Software Pass
      [Figure: recording processors → source-only log → on-the-fly backend → DAG of chunks → replaying processors.]
      • Transforms source-only log to DAG (for parallelism)
      • Fixes the Early and Serialization dependences
        • To support app-level RnR
      • Can trade replay parallelism for log size

  12. Memory Race Recording Unit (RRU)
      [Figure: the RRU sits beside the processor's cache, observing memory references, evictions and bus snoops.]
      • HW module that observes coherence transactions and cache evictions
      • Tracks loads/stores of the chunk in a signature
      • Keeps signatures for multiple recent chunks
      • Records for each chunk
        • Number of instructions
        • Timestamp (number of coherence transactions)
        • Dependences for which the chunk is source
      • Dumps recorded chunks into a log in memory
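A software model can make the RRU's bookkeeping concrete. The sketch below is a simplification with several assumptions that are not from the slides: exact Python sets stand in for hardware signatures, `n_instr` counts only memory references, chunk boundaries are set by an explicit `end_chunk` call, and only the earliest dependence timestamp per destination is kept. The class and method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Chunk:
    reads: set = field(default_factory=set)         # stand-in for the read signature
    writes: set = field(default_factory=set)        # stand-in for the write signature
    n_instr: int = 0                                # only memory references are counted here
    timestamp: Optional[int] = None                 # set when the chunk ends
    successors: dict = field(default_factory=dict)  # destination processor -> timestamp

class RRU:
    def __init__(self, pid, keep=4):
        self.pid, self.keep = pid, keep
        self.current = Chunk()
        self.recent = []   # ended chunks whose signatures are still kept
        self.log = []      # chunks dumped to the in-memory log

    def access(self, addr, is_write):
        """Local load/store: add the address to the current chunk's signature."""
        target = self.current.writes if is_write else self.current.reads
        target.add(addr)
        self.current.n_instr += 1

    def snoop(self, requester, addr, is_write, now):
        """Remote request seen on the bus: record dependences this processor sources."""
        for chunk in self.recent + [self.current]:
            raw_or_waw = addr in chunk.writes           # remote read or write after local write
            war = is_write and addr in chunk.reads      # remote write after local read
            if raw_or_waw or war:
                chunk.successors.setdefault(requester, now)

    def end_chunk(self, now):
        self.current.timestamp = now
        self.recent.append(self.current)
        self.current = Chunk()
        while len(self.recent) > self.keep:             # oldest signature is dropped
            self.log.append(self.recent.pop(0))

    def flush(self):
        self.log.extend(self.recent)
        self.recent.clear()

rru = RRU(pid=0)
rru.access("A", is_write=True)
rru.snoop(requester=1, addr="A", is_write=False, now=150)  # P1 reads A: RAW, source is P0
rru.end_chunk(now=200)
rru.flush()
print(rru.log[0].successors)
```

Only the source side records anything, which mirrors the source-only recording described earlier in the talk.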

  13. RRUs Record Source-Only Log
      [Figure: timelines of P0, P1 and P2 split into chunks C00, C01, C10 and C20, with accesses Rd A, Wr B, Rd B, Wr A, Rd D and Wr D.]

      Chunk   TS (TimeStamp)   Successor Vector
                               P0    P1    P2
      C00     100              -     150   100
      C01     200              -     200   200
      C10     250              -     -     250
      C20     300              -     -     -

  14. Backend Pass Creates DAG
      [Figure: the source-only log entries for C00, C01, C10 and C20 converted into a DAG over the chunks of P0, P1 and P2.]
      • Finds the target chunk for each recorded dependency
      • Creates bidirectional links between src and dst chunks
      • This algorithm is called MaxPar
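The pass can be sketched on the four chunks shown on the slides. The rule used to find the target chunk is an assumption made for this sketch: a chunk's TS is taken to be the time it ended, and a dependence recorded at time t is assigned to the first chunk of the destination processor that ends at or after t. The function name `build_dag` and the log layout are hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict

# Source-only log per processor: (chunk, TS, successor vector {dst processor: timestamp}).
log = {
    0: [("C00", 100, {1: 150, 2: 100}), ("C01", 200, {1: 200, 2: 200})],
    1: [("C10", 250, {2: 250})],
    2: [("C20", 300, {})],
}

def build_dag(log):
    """Turn source-only entries into bidirectional src <-> dst links."""
    succs, preds = defaultdict(set), defaultdict(set)
    end_times = {p: [ts for _, ts, _ in chunks] for p, chunks in log.items()}
    for chunks in log.values():
        for name, _, vector in chunks:
            for dst, t in vector.items():
                i = bisect_left(end_times[dst], t)   # first chunk ending at or after t
                if i == len(log[dst]):
                    continue                         # target chunk not logged yet
                target = log[dst][i][0]
                succs[name].add(target)              # forward link
                preds[target].add(name)              # backward link
    return succs, preds

succs, preds = build_dag(log)
for chunk in ("C00", "C01", "C10"):
    print(chunk, "->", sorted(succs[chunk]))
```

With predecessor links in place, a replaying processor can start a chunk as soon as all of its predecessors have finished, which is what allows chunks from different threads to overlap.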

  15. Trading Replay Parallelism for Log Size
      [Figure: the example chunks C00, C01, C10 and C20 under three alternative orderings; C00 and C01 appear merged as "C00 + C01".]
      • Stitched
        • Less parallelism
        • Smaller log
      • Serial
        • No parallelism
        • Even smaller log
      • StSerial
        • No parallelism
        • Smallest log
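A rough sketch of the serial end of this trade-off, using the example chunks. It assumes that a serial log is simply a total order consistent with the DAG, and that StSerial additionally merges neighbouring chunks of the same processor in that order; both are illustrative readings of the slide rather than the talk's exact algorithms. The dependence edges and the `proc` mapping below are written out by hand for the example.

```python
from graphlib import TopologicalSorter   # Python 3.9+
from itertools import groupby

# Predecessors of each chunk: cross-processor dependences plus program order (C00 before C01).
preds = {
    "C00": set(),
    "C01": {"C00"},
    "C10": {"C00", "C01"},
    "C20": {"C00", "C01", "C10"},
}
proc = {"C00": 0, "C01": 0, "C10": 1, "C20": 2}   # which processor ran each chunk

# Serial: one total order, so no per-chunk dependence information needs to be kept.
serial = list(TopologicalSorter(preds).static_order())

# StSerial (as assumed here): stitch adjacent chunks of the same processor together.
st_serial = [list(group) for _, group in groupby(serial, key=proc.get)]

print(serial)
print(st_serial)
```

Fewer, larger entries shrink the log, at the cost of replaying the chunks one after another.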

  16. Evaluation
      • Using Simics
        • Full-system simulation with OS
      • Wrote a Linux kernel module that
        • Records application inputs
        • Controls RRUs
      • Modeled 8 + 1 processors
        • 8 processors for the app
        • 1 processor for the backend
      • 10 SPLASH-2 benchmarks

  17. Replay Time Normalized to Recording
      • Large difference between MaxPar and Serial replay
      • On 8 processors, unoptimized MaxPar replay is only 50% slower than recording

  18. Conclusions
      • Cyrus: RnR system that supports
        • Application-level RnR
        • Unintrusive hardware
        • Flexible replay parallelism
      • Key idea: On-the-fly software backend pass
      • On 8 processors:
        • Large difference between MaxPar and Serial replay
        • Unoptimized replay of MaxPar is only 50% slower than recording
        • Negligible recording overhead
      • Upcoming ISCA’13 paper describes our FPGA RnR prototype
