Co-designed Virtual Machines for Reliable Computer Systems

University of Wisconsin – Madison, September 2002

Outline
• Overview of VMFTC – Dual-Mode Execution Virtual Machine for Fault-Tolerant Computing
• Hardware: Processor Micro-Architecture Hooks
• Hardware: Memory System Design
• Hardware: Interconnection, I/O Channels, Disks and Networking
• Software: Virtual Machine Monitor
• VMFTC Simulation Design
• Problems and Feedback Needed
vmftc

Principles
• Hardware resource replication is a must for both fault tolerance and performance/throughput. Build simple HW hooks on replicated hardware resources, then export either a performance VM or an FTC VM.
• Maintain the conventional architectural interface between HW and SW, e.g. between HW and OS. Exploit COTS components where possible.
  – Is this strategy problematic?

Architecture - End Users' Perspective:
• Two modes of virtual machine run on the same hardware platform, plus a hidden mini VMM software layer.
• Performance Mode VM: Fully architected COTS processor resources, wider interconnection bandwidth, larger memory, wider I/O channel bandwidth, larger or more disks, and lower latency.
• Reliable Mode VM: Ultra-reliable architected processor resources; ultra-reliable and available interconnections, memory systems, I/O channels and storage.
• Positive synergistic effects: The reliable mode VM provides self-monitoring and self-healing via error detection and recovery mechanisms. The performance mode VM provides higher system throughput to relieve workload pressure on the whole server system.

[Figure: Architecture, end users' perspective]

Proposed Contributions:
• Flexible and cost-effective use of replicated COTS hardware resources via virtual machine technology, while maintaining the conventional architecture.
• An ultra-reliable architectural interface for the software community: separate hardware RAS from software RAS for easier hierarchical solutions.
• Simple architectural support for software RAS mechanisms, when needed, for more effective whole-system solutions.

Processor Micro-Architecture Hooks
• Performance Mode VM: More architected processing capacity.
• Reliable Mode VM: RAS provided by lock-stepped processor pairs. Can monitor runtime hardware status.
• Dynamic switching between the two modes.
• Bootstrap: Start in reliable mode for better self-testing.
• Power-off: The VMM gets final control of the system and enters reliable mode to verify that everything is OK before power-off.
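The lock-stepped pair and the mode switch can be illustrated with a small software model. This is only a sketch under stated assumptions, not the VMFTC hardware design: the names `Mode`, `CorePair`, `LockstepMismatch` and the `fault_in_b` hook are hypothetical, and real lock-stepping compares signals cycle by cycle in hardware rather than comparing Python return values.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    PERFORMANCE = "performance"
    RELIABLE = "reliable"


class LockstepMismatch(Exception):
    """Raised when the two cores of a pair disagree on a result."""


@dataclass
class CorePair:
    # Assumption: the pair boots in reliable mode, as the Bootstrap bullet suggests.
    mode: Mode = Mode.RELIABLE

    def architected_cores(self) -> int:
        # Performance mode exposes both cores; reliable mode exposes one checked core.
        return 2 if self.mode is Mode.PERFORMANCE else 1

    def switch(self, mode: Mode) -> None:
        # Dynamic switching between the two modes.
        self.mode = mode

    def execute(self, op, *args, fault_in_b=None):
        """Run op once in performance mode, twice with comparison in reliable mode."""
        result_a = op(*args)
        if self.mode is Mode.PERFORMANCE:
            return result_a
        result_b = op(*args)
        if fault_in_b is not None:
            # Hypothetical hook that corrupts the second core's result.
            result_b = fault_in_b(result_b)
        if result_a != result_b:
            raise LockstepMismatch(f"{result_a!r} != {result_b!r}")
        return result_a
```

In this model, performance mode trades the comparison away for a second architected core, which mirrors the capacity versus reliability trade-off in the bullets above.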

[Figure: Processors, lock-stepped UP or SMP/CMP]

Memory System Design
• Performance Mode VM: More architected memory and interconnection bandwidth, slightly lower latency.
• Reliable Mode VM: Less memory, but a more reliable and available memory system. Exploit log bits/parity bits for each memory block to perform memory transaction logging.
• Storage and communications are protected by ECC.
• Optional mirrored memory images.
• All logic modules, such as cache coherence processors, could also exploit dual-mode processing.
• Dynamic switching between the two modes.
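The slide does not spell out how the per-block log bits are used. One plausible reading, sketched below as an assumption, is an undo log in which the old contents of a block are saved only on the first write after a checkpoint, with the log bit suppressing further logging for that block. The class `LoggedMemory` and its methods are hypothetical names.

```python
class LoggedMemory:
    """Toy block memory with one log bit per block and an undo log."""

    def __init__(self, num_blocks):
        self.blocks = [0] * num_blocks
        self.log_bit = [False] * num_blocks
        self.undo_log = []  # entries are (block index, old value)

    def write(self, index, value):
        # Log the old value only on the first write since the last checkpoint.
        if not self.log_bit[index]:
            self.undo_log.append((index, self.blocks[index]))
            self.log_bit[index] = True
        self.blocks[index] = value

    def checkpoint(self):
        # Current contents become the recovery point; discard the old log.
        self.undo_log.clear()
        self.log_bit = [False] * len(self.blocks)

    def rollback(self):
        # Restore logged blocks in reverse order, then reset the log bits.
        while self.undo_log:
            index, old_value = self.undo_log.pop()
            self.blocks[index] = old_value
        self.log_bit = [False] * len(self.blocks)
```

With this scheme, repeated writes to a hot block cost one log entry per checkpoint interval, which is presumably what "log bits to reduce work" refers to.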

[Figure: Memory System Design]

Interconnection & I/O Channels
• Dual-mode I/O controllers and channels. Multiple interconnections for both availability and performance.
• Performance Mode VM: More architected resource capacity due to physical resource replication.
• Reliable Mode VM: Less capacity for both communication bandwidth and storage, due to controller cross-checking or checkpointing/logging overhead from hidden VMM activity. However, it monitors component fault rates for fault forecasting.
• Dynamic switching between the two modes.
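Fault-rate monitoring for fault forecasting could look roughly like the sliding-window counter below. The slides give no forecasting policy, so the window length, the threshold and the name `FaultRateMonitor` are all assumptions made for illustration.

```python
from collections import deque


class FaultRateMonitor:
    """Counts detected faults of one component over a sliding time window."""

    def __init__(self, window, threshold):
        self.window = window        # window length, in the caller's time unit
        self.threshold = threshold  # faults per time unit considered suspicious
        self.events = deque()       # timestamps of detected faults, oldest first

    def record_fault(self, time):
        self.events.append(time)

    def rate(self, now):
        # Drop faults that fell out of the window, then average over the window.
        while self.events and self.events[0] <= now - self.window:
            self.events.popleft()
        return len(self.events) / self.window

    def failure_suspected(self, now):
        return self.rate(now) > self.threshold
```

A VMM could feed such a monitor from cross-check mismatches on a controller and retire the component once `failure_suspected` returns true.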

[Figure: I/O & Interconnection Network]

VMM Issues
• Dynamic configuration/switching of the VMs.
• VMM intercepts certain interrupts in reliable mode: timer, machine check interrupts, I/O interrupts.
• Timer-triggered checkpointing. Checkpoint state: processor state, cache state and memory state. Communication states via QS.
• Memory checkpoints and memory transaction logging in main memory storage.
  – Log bits to reduce work.
• I/O event logging for replay during recovery.
• Rollback recovery: roll back the memory image, reload system state, replay I/O events…
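The checkpoint, I/O logging and rollback steps listed above can be sketched as follows. This is a deliberately simplified model, not the VMFTC monitor: state is a plain dictionary, an I/O event is a hypothetical `(key, delta)` pair, and the class and method names are invented for the example.

```python
import copy


class CheckpointingVMM:
    """Toy VMM: timer-triggered checkpoints plus an I/O event log for replay."""

    def __init__(self, initial_state):
        self.state = dict(initial_state)
        self._checkpoint = copy.deepcopy(self.state)
        self._io_log = []

    def on_timer_interrupt(self):
        # Take a new checkpoint; events before it no longer need replaying.
        self._checkpoint = copy.deepcopy(self.state)
        self._io_log.clear()

    def on_io_event(self, event):
        # Log the event first so it can be replayed, then let it take effect.
        self._io_log.append(event)
        self._apply(event)

    def _apply(self, event):
        key, delta = event
        self.state[key] = self.state.get(key, 0) + delta

    def recover(self):
        # Roll back to the checkpoint, then replay the logged I/O events in order.
        self.state = copy.deepcopy(self._checkpoint)
        for event in self._io_log:
            self._apply(event)
```

Replaying logged events after rollback only reproduces the pre-fault state if event handling is deterministic, which is why the model applies events through a single `_apply` path.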

VMFTC Simulation Design
• Simulator infrastructure: PHARMsim (SimOS-PPC + SimpleMP). Precise but slow.
• Fault injection.
• Fault detection.
• Execution mode switch in the simulator.
• Checkpointing/logging and recovery with full consideration of precise I/O event handling in the PHARMsim simulator.
• Co-designed VMM  Classical OS VMM
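As a rough illustration of the fault injection and detection steps, the sketch below flips one random bit in each word and checks whether a stored parity bit catches it. It does not use any PHARMsim interface. The function names, the 32-bit width and the single-bit-flip fault model are assumptions chosen for the example.

```python
import random


def flip_bit(word, bit, width=32):
    """Return word with the given bit position inverted."""
    if not 0 <= bit < width:
        raise ValueError("bit position out of range")
    return word ^ (1 << bit)


def parity(word):
    """Even/odd parity of the set bits in word (0 or 1)."""
    return bin(word).count("1") & 1


def inject_random_fault(word, rng, width=32):
    """Flip one uniformly chosen bit; return the faulty word and the bit index."""
    bit = rng.randrange(width)
    return flip_bit(word, bit, width), bit


def run_campaign(words, seed=0):
    """Inject one single-bit fault per word and count parity detections."""
    rng = random.Random(seed)  # seeded so a campaign is repeatable
    detected = 0
    for word in words:
        stored_parity = parity(word)
        faulty, _ = inject_random_fault(word, rng)
        if parity(faulty) != stored_parity:
            detected += 1
    return detected
```

Because a single bit flip always changes parity, this campaign detects every injected fault. Multi-bit faults would need a stronger code such as ECC, or the cross-checking of replicated units.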

Problems
• Potential applications: Servers? PCs/workstations? Mobile computing? Embedded systems?
• Whole-system-level fault models: What are the common faults, their frequency, cost, etc.?
• Cost models for building those hooks into the whole system. Cost of redundant resources.
• How to evaluate fault-tolerant computing? How to perform the evaluation for a research project?
• Can HW help recover from SW Heisenbugs? Or can HW help SW fault tolerance in a co-designed style?
