Rebound: Scalable Checkpointing for Coherent Shared Memory
Rishi Agarwal, Pranav Garg, and Josep Torrellas
Department of Computer Science, University of Illinois at Urbana-Champaign
http://iacoma.cs.uiuc.edu
Checkpointing in Shared-Memory MPs
[Figure: timeline of processors P1–P4 taking periodic global checkpoints and rolling back to the last checkpoint on a fault]
• HW-based schemes for small CMPs use global checkpointing
  • All procs participate in system-wide checkpoints
• Global checkpointing is not scalable
  • Synchronization, bursty movement of data, loss in rollback…
Alternative: Coordinated Local Checkpointing
[Figure: processors P1–P5 under global checkpointing vs. local checkpointing in groups]
• Idea: threads coordinate their checkpointing in groups
• Rationale:
  • Faults propagate only through communication
  • Interleaving between non-communicating threads is irrelevant
+ Scalable: checkpoint and rollback in processor groups
- Complexity: record inter-thread dependences dynamically
Contributions
Rebound: first HW-based scheme for scalable, coordinated local checkpointing in coherent shared memory
• Leverages the directory protocol to track inter-thread dependences
• Optimizations to boost checkpointing efficiency:
  • Delaying the write-back of data to safe memory at checkpoints
  • Supporting multiple checkpoints
  • Optimizing checkpointing at barrier synchronization
• Avg. performance overhead for 64 procs: 2%
  • Compared to 15% for global checkpointing
Background: In-Memory Checkpointing with ReVive [Prvulovic-02]
[Figure: at a checkpoint, the application stalls, registers are dumped, and dirty cache lines are written back to memory; the old memory values (and those of displaced dirty lines during execution) are saved in a memory-resident log]
Background: In-Memory Checkpointing with ReVive [Prvulovic-02]
[Figure: on a fault, registers are restored from the checkpoint, caches are invalidated, and memory lines are reverted from the log]
• ReVive: global checkpointing over a broadcast protocol; Rebound: local, coordinated checkpointing over a scalable (directory) protocol
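The logging above amounts to an undo log: before a written-back dirty line overwrites memory, the old memory value is saved, and rollback replays the log to revert memory. A minimal sketch in C with made-up names and sizes, not the actual ReVive implementation:

#include <stdint.h>
#include <string.h>

#define LINE_BYTES  64
#define LOG_ENTRIES 4096            /* assumed log capacity for the sketch */

typedef struct {                    /* one undo-log record */
    uintptr_t addr;                 /* memory address of the logged line       */
    uint8_t   old_data[LINE_BYTES]; /* pre-checkpoint contents of that line    */
} LogEntry;

typedef struct {
    LogEntry entries[LOG_ENTRIES];
    int      top;
} UndoLog;

/* Before a written-back dirty line overwrites memory, save the old value. */
void log_before_writeback(UndoLog *log, const void *mem_line, uintptr_t addr) {
    LogEntry *e = &log->entries[log->top++];
    e->addr = addr;
    memcpy(e->old_data, mem_line, LINE_BYTES);
}

/* On rollback, replay the log backwards to revert memory to the checkpoint. */
void rollback_memory(UndoLog *log) {
    while (log->top > 0) {
        LogEntry *e = &log->entries[--log->top];
        memcpy((void *)e->addr, e->old_data, LINE_BYTES);
    }
}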
Coordinated Local Checkpointing Rules
[Figure: P1 writes x, P2 reads x, creating a producer (P1) → consumer (P2) dependence across a checkpoint]
• P checkpoints → P's producers checkpoint
• P rolls back → P's consumers roll back
• Banatre et al. used coordinated local checkpointing for bus-based machines [Banatre96]
Rebound Fault Model
[Figure: chip multiprocessor connected to main memory, which holds the log (in SW)]
• Any part of the chip can suffer transient or permanent faults
• A fault can occur even during checkpointing
• Off-chip memory and logs suffer no faults on their own (e.g., NVM)
• Fault detection is outside our scope:
  • Fault-detection latency has an upper bound of L cycles
Rebound Architecture
[Figure: chip multiprocessor with per-node P + L1, L2 cache and Dep registers, a directory with per-line LW-ID, and main memory]
• Dependence (Dep) registers in the L2 cache controller:
  • MyProducers: bitmap of processors that produced data consumed by the local processor
  • MyConsumers: bitmap of processors that consumed data produced by the local processor
• Processor ID in each directory entry:
  • LW-ID: last writer to the line in the current checkpoint interval
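As a rough data-structure sketch (C, assuming 64 processors so each bitmap fits in a 64-bit word; field sizes are illustrative, not the hardware encoding):

#include <stdint.h>

#define NPROC 64

/* Per-node dependence registers, kept at the L2 cache controller. */
typedef struct {
    uint64_t my_producers;  /* bit i set: proc i produced data this node consumed */
    uint64_t my_consumers;  /* bit i set: proc i consumed data this node produced */
} DepRegs;

/* Per-line directory entry, extended by Rebound with the last writer. */
typedef struct {
    uint64_t sharers;       /* ordinary sharer bit-vector                     */
    uint8_t  state;         /* coherence state (e.g., MESI)                   */
    uint8_t  lw_id;         /* last writer in the current checkpoint interval */
} DirEntry;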
Recording Inter-Thread Dependences (example; MESI protocol assumed)
[Figure sequence: P1 and P2 with their Dep registers; directory with LW-ID; log in memory]
1. P1 writes a line: the directory marks it Dirty and sets LW-ID = P1
2. P2 reads the line: P1 writes it back, the old memory value is logged, the line becomes Shared, and the dependence is recorded: P2's MyProducers gains P1, P1's MyConsumers gains P2
3. P1 writes the line again: the line returns to Dirty in P1's cache; LW-ID stays P1
4. P1 checkpoints: P1's dirty lines are written back (old values logged), its Dep registers are cleared, and its LW-IDs are cleared; an LW-ID must remain set until the line is checkpointed
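Conceptually, step 2 above is a directory-side action: on a read of a line last written by another processor in this interval, record the dependence in both nodes' Dep registers. A minimal C sketch reusing simplified versions of the structures sketched earlier (the coherence actions themselves are elided):

#include <stdint.h>

#define NPROC     64
#define NO_WRITER NPROC                      /* "no last writer recorded" */

typedef struct { uint64_t my_producers, my_consumers; } DepRegs;
typedef struct { uint8_t lw_id; } DirEntry;  /* only the Rebound field shown */

static DepRegs dep[NPROC];                   /* one set per node */

/* Directory action when 'reader' reads a line. */
void on_read(DirEntry *line, int reader) {
    int w = line->lw_id;
    if (w != NO_WRITER && w != reader) {
        dep[reader].my_producers |= 1ull << w;       /* reader consumed w's data */
        dep[w].my_consumers      |= 1ull << reader;  /* w produced for reader    */
    }
    /* ...plus the usual coherence work: write back the dirty copy,
       log the old memory value, downgrade the line to Shared.       */
}

/* Directory action when 'writer' obtains exclusive access and writes. */
void on_write(DirEntry *line, int writer) {
    line->lw_id = (uint8_t)writer;           /* remember the interval's last writer */
}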
Lazily Clearing Last Writers
• Eagerly clearing all LW-IDs in the directory at a checkpoint is an expensive process
• Instead, each processor keeps a Write Signature that encodes all line addresses it has written (or read exclusively) in the current interval
• At a checkpoint, the processor simply clears its Write Signature
• Consequence: the directory may hold potentially stale LW-IDs
Lazily Clearing Last Writers
[Figure: P2 reads a line whose directory entry still holds LW-ID = P1 from a previous interval; the address misses in P1's Write Signature]
• The signature miss reveals the stale LW-ID, so it is cleared and no dependence is recorded
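The Write Signature can be pictured as a small hashed (Bloom-filter-style) set of line addresses; the exact hardware encoding is not shown in the talk, so the sketch below is only illustrative. Clearing it at a checkpoint is cheap, and a later read that hits a stale LW-ID simply misses in the signature:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SIG_BITS 1024                 /* assumed signature size */

typedef struct { uint64_t bits[SIG_BITS / 64]; } WriteSig;

/* Two illustrative hash functions over the line address. */
static unsigned h1(uintptr_t a) { return (unsigned)(((a >> 6) * 2654435761u) % SIG_BITS); }
static unsigned h2(uintptr_t a) { return (unsigned)(((a >> 6) * 40503u + 7) % SIG_BITS); }

/* Record a line written (or read exclusively) in the current interval. */
void sig_insert(WriteSig *s, uintptr_t addr) {
    s->bits[h1(addr) / 64] |= 1ull << (h1(addr) % 64);
    s->bits[h2(addr) / 64] |= 1ull << (h2(addr) % 64);
}

/* Membership test: a miss proves the line was NOT written this interval, so an
 * LW-ID pointing at this processor is stale and no dependence is recorded.     */
bool sig_may_contain(const WriteSig *s, uintptr_t addr) {
    return ((s->bits[h1(addr) / 64] >> (h1(addr) % 64)) & 1) &&
           ((s->bits[h2(addr) / 64] >> (h2(addr) % 64)) & 1);
}

/* At a checkpoint, the whole signature is cleared instead of walking the directory. */
void sig_clear(WriteSig *s) { memset(s->bits, 0, sizeof s->bits); }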
Distributed Checkpointing Protocol in SW
[Figure: P1 initiates a checkpoint and sends Ck? to its producers P2 and P3, which accept; P3 forwards Ck? to P4, which declines; final Interaction Set = {P1, P2, P3}]
• Interaction Set[Pi]: set of producer processors (transitively) for Pi
  • Built using MyProducers
• Checkpointing is a two-phase commit protocol
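The interaction set is essentially a transitive closure over MyProducers, discovered by the Ck? exchange. A minimal software sketch of that closure (it abstracts away the messages, the two-phase commit, and the decline case shown above):

#include <stdint.h>

#define NPROC 64

/* my_producers[p]: bitmap of processors that produced data p consumed
 * in the current interval (p's MyProducers register).                  */
uint64_t build_interaction_set(int initiator, const uint64_t my_producers[NPROC]) {
    uint64_t set      = 1ull << initiator;   /* the initiator always checkpoints */
    uint64_t frontier = set;
    while (frontier) {                       /* expand until no new producers    */
        uint64_t next = 0;
        for (int p = 0; p < NPROC; p++)
            if (frontier & (1ull << p))
                next |= my_producers[p];     /* p's producers join (Ck? is sent)  */
        frontier = next & ~set;              /* only processors not yet contacted */
        set |= next;
    }
    return set;                              /* these processors checkpoint together */
}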
Distributed Rollback Protocol in SW
• Rollback is handled similarly to the checkpointing protocol:
  • The interaction set is built transitively using MyConsumers
• Rollback involves:
  • Clearing the Dep registers and Write Signature
  • Invalidating the processor caches
  • Restoring the data and register context from the logs, up to the latest checkpoint
• No domino effect
Optimization 1: Delayed Writebacks
[Figure: timeline comparing a stall-and-write-back checkpoint with a delayed-writeback checkpoint that overlaps the writebacks with interval I2]
• Checkpointing overhead is dominated by data writebacks
• Delayed-writeback optimization:
  • Processors synchronize and resume execution immediately
  • Hardware automatically writes back dirty lines in the background
  • The checkpoint completes only when all delayed data has been written back
  • Inter-thread dependences on delayed data must still be recorded
Delayed Writebacks: Pros and Cons
+ Significant reduction in checkpoint overhead
- Additional support:
  • Each processor has two sets of Dep registers and Write Signatures
  • Each cache line has a delayed bit
- Increased vulnerability: a rollback event forces both intervals to roll back
Delayed Writeback Protocol
[Figure: P2 reads a line whose LW-ID is P1; the line address is checked against both of P1's Write Signatures to decide which interval the write belongs to; it hits in one set's signature and misses in the other, so the dependence is recorded in that set's Dep registers; the line is written back and the old value logged]
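With delayed writebacks each node keeps two Dep register sets and two Write Signatures, and a recorded dependence must be charged to the right interval. A hedged sketch of that bookkeeping (the choice of which signature to test first, and the names, are assumptions):

#include <stdbool.h>
#include <stdint.h>

#define NPROC 64

typedef struct { uint64_t my_producers, my_consumers; } DepRegs;

/* Set 0: previous interval (delayed writebacks still draining); set 1: current. */
static DepRegs dep[NPROC][2];

/* hit_prev / hit_cur: results of testing the line address against the writer's
 * two Write Signatures (done in hardware; passed in here for illustration).     */
void record_dependence(int reader, int writer, bool hit_prev, bool hit_cur) {
    int set;
    if (hit_cur)       set = 1;   /* the line was (re)written in the current interval */
    else if (hit_prev) set = 0;   /* delayed data produced in the previous interval    */
    else               return;    /* stale LW-ID: no dependence recorded               */

    dep[reader][set].my_producers |= 1ull << writer;
    dep[writer][set].my_consumers |= 1ull << reader;
}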
Optimization 2: Multiple Checkpoints
• Problem: fault detection is not instantaneous
  • A checkpoint is safe only after the maximum fault-detection latency (L)
[Figure: timeline with Ckpt 1 and Ckpt 2, each with its own Dep registers; a fault at tf is detected within the detection latency and forces a rollback past Ckpt 2]
• Solution: keep multiple checkpoints
  • On a fault, roll back interacting processors to safe checkpoints
  • No domino effect
Multiple Checkpoints: Pros and Cons
+ Realistic system: supports non-instantaneous fault detection
- Additional support:
  • Each checkpoint has its own Dep registers
  • Dep registers can be recycled only after the fault-detection latency
  • Communication must be tracked across checkpoints
  • Combined with delayed writebacks: one more Dep register set
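The constraint that Dep registers (and older checkpoint state) can only be recycled after the fault-detection latency can be expressed as a simple safety test; L below is an assumed constant and the slot layout is made up:

#include <stdbool.h>
#include <stdint.h>

#define MAX_CKPTS 4                       /* assumed number of kept checkpoints */

typedef struct {
    uint64_t commit_cycle;                /* cycle at which this checkpoint committed */
    bool     in_use;
} CkptSlot;

static const uint64_t L = 1000000;        /* assumed max fault-detection latency */

/* A committed checkpoint becomes safe once L cycles pass with no fault reported. */
bool checkpoint_is_safe(const CkptSlot *s, uint64_t now) {
    return s->in_use && (now - s->commit_cycle) > L;
}

/* Once some checkpoint is safe, everything older than it (logs and Dep register
 * sets) can be recycled; slots are assumed ordered oldest (0) to newest.         */
void recycle_old_checkpoints(CkptSlot slots[MAX_CKPTS], uint64_t now) {
    for (int newest_safe = MAX_CKPTS - 1; newest_safe >= 0; newest_safe--) {
        if (checkpoint_is_safe(&slots[newest_safe], now)) {
            for (int i = 0; i < newest_safe; i++)
                slots[i].in_use = false;  /* free log space and Dep registers */
            return;
        }
    }
}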
Optimization 3: Hiding Checkpoints behind Global Barriers
• A global barrier requires that all processors communicate
  • Leads to global checkpoints
• Optimization:
  • Proactively trigger a global checkpoint at a global barrier
  • Hide the checkpoint overhead behind the barrier's imbalance spins
Hiding Checkpoints behind Global Barriers
  lock(bar)
  count++
  if (count == numProc)
      i_am_last = TRUE        /* local variable */
  unlock(bar)
  if (i_am_last) {
      count = 0
      flag = TRUE …
  } else {
      while (!flag) {}        /* spin until the last processor arrives */
  }
Hiding Checkpoints behind Global Barriers
[Figure: processors P1–P3 arrive at the barrier and exchange BarCK? and Notify messages; each ends up with a small checkpoint interaction set (e.g., {P3}, {P1, P3}, {P2, P3}) while spinning on the flag]
• The first-arriving processor initiates the checkpoint
• Others: hardware writes back data as execution proceeds to the barrier
• The checkpoint commits as the last processor arrives
• After the barrier: few interacting processors
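Purely as an illustration (the hook names below are invented, and the real trigger is in the hardware/runtime), the barrier can be augmented so the first arrival starts a proactive checkpoint and the last arrival commits it, hiding the writeback behind the spin:

#include <stdbool.h>

/* Hypothetical hooks into Rebound's checkpoint machinery. */
void rebound_start_global_checkpoint(void);  /* begin background writebacks      */
void rebound_commit_checkpoint(void);        /* commit once everyone has arrived */

typedef struct {
    int           count;
    int           num_proc;
    volatile bool flag;
    /* ...plus a lock protecting count... */
} Barrier;

void barrier_wait(Barrier *b) {
    bool i_am_first = false, i_am_last = false;

    /* lock(b) */
    if (b->count == 0)           i_am_first = true;
    b->count++;
    if (b->count == b->num_proc) i_am_last  = true;
    /* unlock(b) */

    if (i_am_first)
        rebound_start_global_checkpoint();   /* others keep executing toward the
                                                barrier while HW drains dirty data */
    if (i_am_last) {
        rebound_commit_checkpoint();         /* per the slide: commit as the last
                                                processor arrives                  */
        b->count = 0;
        b->flag  = true;
    } else {
        while (!b->flag) { /* spin: checkpoint overhead is hidden here */ }
    }
}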
Evaluation Setup
• Analysis tool using Pin + SESC cycle-accurate simulator + DRAMsim
• Applications: SPLASH-2, some PARSEC, Apache
• Simulated CMP architecture with up to 64 threads
• Checkpoint interval: 5–8 ms
• Modeled several environments:
  • Global: baseline global checkpointing
  • Rebound: local checkpointing scheme with delayed writebacks
  • Rebound_NoDWB: Rebound without the delayed writebacks
Average Interaction Set: Set of Producer Processors
[Chart: average interaction-set size per application, 64 processors]
• For most apps, the interaction set is small
  • Justifies coordinated local checkpointing
• Averages are brought up by global barriers
Checkpoint Execution Overhead
[Chart: checkpoint execution overhead per application, Global vs. Rebound]
• Rebound's average checkpoint execution overhead is 2%
  • Compared to 15% for Global
• Delayed writebacks complement local checkpointing
Rebound Scalability
[Chart: checkpoint overhead vs. processor count, constant problem size]
• Rebound is scalable in checkpoint overhead
• Delayed writebacks help scalability
Also in the Paper
• Delayed writebacks are also useful in global checkpointing
• The barrier optimization is effective but not universally applicable
• Power increase due to the hardware additions: < 2%
• Rebound increases coherence traffic by only 4%
Conclusions
Rebound: first HW-based scheme for scalable, coordinated local checkpointing in coherent shared memory
• Leverages the directory protocol
• Boosts checkpointing efficiency:
  • Delayed writebacks
  • Multiple checkpoints
  • Barrier optimization
• Avg. execution overhead for 64 procs: 2%
• Future work:
  • Apply Rebound to non-hardware-coherent machines
  • Scalability to hierarchical directories