ExtraVirt: Detecting and recovering from transient processor faults
Dominic Lucchetti, Steve Reinhardt, Peter Chen
University of Michigan
Flips Happen
Similar die area + Decreasing transition energy = Increasing risk of transient failure
Multi-Processors & Virtual Machine
• Multi-Processor
  • Ensure error independence
  • Enable fault detection
  • Efficient resource sharing
• Virtual Machine
  • No changes to OS or applications
  • VM replay
  • Synchronize replicas
  • Recover correct state
[Figure labels: Replica 1, Replica 2, Device Drivers, Hypervisor, Replication Management Layer (RML)]
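
The idea of using replay to synchronize replicas can be illustrated with a toy model. This is only a sketch under stated assumptions, not ExtraVirt's mechanism: the `Guest` class, the `record`, `replay`, and `states_match` helpers, and the arithmetic used for the state update are all hypothetical. It assumes a guest whose state depends only on the sequence of external events it receives, so logging those events on one replica and feeding the same log to another should bring both to the same state, and any mismatch signals a fault.

```python
import random


class Guest:
    """Hypothetical deterministic guest: its state depends only on its inputs."""

    def __init__(self):
        self.state = 0

    def step(self, event):
        # Stand-in for executing guest code in response to one external event.
        self.state = (self.state * 31 + event) % 1_000_003


def record(guest, source, n):
    """Run the guest on n events from source and log each event."""
    log = []
    for _ in range(n):
        event = source()
        log.append(event)
        guest.step(event)
    return log


def replay(guest, log):
    """Feed a previously recorded event log to another replica."""
    for event in log:
        guest.step(event)


def states_match(a, b):
    """Fault detection in this toy model: replicas must agree exactly."""
    return a.state == b.state
```

A possible use: record ten events on one `Guest`, replay the log on a second, and check `states_match`. Flipping a bit in one replica's state partway through the replay makes the comparison fail, which is the signal a management layer could act on.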
Example: Memory Verify
• Copy on write
  • Reduces overhead
  • Protects checkpoints
• Merge on checkpoint
  • Verify correctness
  • Re-execute on deviation
• Memory Fault Protection
  • ECC against RAM faults
  • MMU against CPU faults
[Figure: memory pages A through E shown for Checkpoint, Replica 1, Replica 2, and Replica 3; one D entry is marked X]
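
The copy-on-write, merge-on-checkpoint, and re-execute-on-deviation steps listed above can be sketched roughly as follows. This is a simplified illustration, not the ExtraVirt implementation: the `Replica` class, `verify_and_merge`, `run_epoch`, and the `max_retries` parameter are hypothetical names, pages are modeled as dictionary entries, and the MMU and ECC protection from the slide is not modeled at all.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical replica that keeps only pages written since the checkpoint."""
    name: str
    dirty: dict = field(default_factory=dict)  # page id -> page contents

    def write(self, page, data):
        # Copy on write: the shared checkpoint is never modified directly.
        self.dirty[page] = data

    def read(self, page, checkpoint):
        # Prefer the private copy, fall back to the checkpointed page.
        return self.dirty.get(page, checkpoint[page])


def verify_and_merge(checkpoint, replicas):
    """Merge dirty pages into the checkpoint only if all replicas agree."""
    first = replicas[0].dirty
    agree = all(r.dirty == first for r in replicas[1:])
    if agree:
        checkpoint.update(first)  # merge on checkpoint
    for r in replicas:
        r.dirty.clear()  # on deviation this rolls back to the checkpoint
    return agree


def run_epoch(checkpoint, replicas, work, max_retries=3):
    """Run one epoch on every replica, re-executing if the replicas deviate."""
    for attempt in range(max_retries):
        for r in replicas:
            work(r, checkpoint)
        if verify_and_merge(checkpoint, replicas):
            return attempt  # number of re-executions that were needed
    raise RuntimeError("replicas kept deviating")
```

In this sketch only the dirty pages are compared, which is where the reduced overhead would come from, and the checkpoint changes only after the replicas agree, so a corrupted page in one replica is discarded and the epoch is run again from the last verified state.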
Status
• Present
  • VM Replay
  • Beginnings of Replication Management Layer (RML)
  • Still much to do…
• Future
  • Replicate the un-replicated
  • Handle faults in device drivers
  • Expanded fault model
[Figure labels: Replica 1, Replica 2, Device Drivers, Hypervisor/RML]