This research paper discusses the design and implementation of co-designed virtual machines for fault-tolerant and high-performance computing systems. It covers hardware and software aspects of the virtual machine monitor, processor micro-architecture hooks, memory system design, interconnection and I/O channels, and networking. The paper also explores the benefits and challenges of using replicated hardware resources to achieve fault tolerance and improve system throughput.
Co-designed Virtual Machines for Reliable Computer Systems
University of Wisconsin-Madison, September 2002
Outline
• Overview of VMFTC: Dual-Mode Execution Virtual Machine for Fault-Tolerant Computing
• Hardware: Processor Micro-Architecture Hooks
• Hardware: Memory System Design
• Hardware: Interconnection, I/O Channels, Disks, and Networking
• Software: Virtual Machine Monitor
• VMFTC Simulation Design
• Problems and Feedback Needed
vmftc
Principles
• Hardware resource replication is a must for both fault tolerance and performance/throughput. Build simple hardware hooks on top of the replicated hardware resources, then export either a performance VM or an FTC VM.
• Maintain the conventional architectural interface between hardware and software (e.g., between hardware and the OS), and exploit COTS components where possible. Is this strategy problematic?
Architecture: End Users' Perspective
• Two modes of virtual machine run on the same hardware platform, plus hidden mini-VMM software.
• Performance Mode VM: fully architected COTS processor resources, wider interconnection bandwidth, larger memory, wider I/O channel bandwidth, larger or more disks, and lower latency.
• Reliable Mode VM: ultra-reliable architected processor resources; ultra-reliable and highly available interconnections, memory systems, I/O channels, and storage.
• Positive synergistic effects: self-monitoring and self-healing via error detection and recovery mechanisms in the Reliable Mode VM; higher system throughput from the Performance Mode VM, which relieves workload pressure on the whole server system.
Proposed Contributions
• Flexible and cost-effective use of replicated COTS hardware resources via virtual machine technology, while maintaining the conventional architecture.
• An ultra-reliable architectural interface for the software community: separating hardware RAS from software RAS allows easier hierarchical solutions.
• Simple architectural support for software RAS mechanisms, where needed, for more effective whole-system solutions.
Processor Micro-Architecture Hooks
• Performance Mode VM: more architected processing capacity.
• Reliable Mode VM: RAS provided by lock-stepped processor pairs; this mode can also monitor the system's hardware status at run time.
• Dynamic switching between the two modes.
• Bootstrap: start in reliable mode for better self-testing.
• Power-off: the VMM takes final control of the system and enters reliable mode to verify that everything is in order before powering off.
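The dual-mode idea on this slide can be illustrated with a small sketch. It is not the VMFTC hardware design: `CorePair`, `Mode`, `LockstepMismatch`, and the two toy "cores" are hypothetical names, and a core is modeled as a plain Python function. In performance mode the two cores handle different inputs, so capacity doubles; in reliable mode both cores handle every input and the results are compared.

```python
from enum import Enum
from typing import Callable, List

Core = Callable[[int], int]  # assumption: a "core" is just a function here


class Mode(Enum):
    PERFORMANCE = "performance"
    RELIABLE = "reliable"


class LockstepMismatch(Exception):
    """Raised when the two cores of a pair disagree on a result."""


class CorePair:
    """Two replicated cores that run independently or in lockstep."""

    def __init__(self, core_a: Core, core_b: Core, mode: Mode = Mode.RELIABLE):
        # Starting in reliable mode mirrors the "bootstrap" bullet above.
        self.core_a, self.core_b, self.mode = core_a, core_b, mode

    def switch(self, mode: Mode) -> None:
        """Dynamic switching between the two modes."""
        self.mode = mode

    def run(self, inputs: List[int]) -> List[int]:
        if self.mode is Mode.PERFORMANCE:
            # Inputs alternate between the cores; nothing is cross-checked.
            cores = (self.core_a, self.core_b)
            return [cores[i % 2](x) for i, x in enumerate(inputs)]
        results = []
        for x in inputs:
            a, b = self.core_a(x), self.core_b(x)
            if a != b:  # lockstep comparison detects the divergence
                raise LockstepMismatch(f"input {x}: {a} != {b}")
            results.append(a)
        return results


def healthy_core(x: int) -> int:
    return x * x


def faulty_core(x: int) -> int:
    return (x * x) ^ 1  # models a stuck low-order bit
```

With one faulty core, reliable mode raises `LockstepMismatch` on the first input, while performance mode silently returns a wrong value for every input routed to that core. That is the trade-off the two modes expose.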
Memory System Design
• Performance Mode VM: more architected memory and interconnection bandwidth, and slightly lower latency.
• Reliable Mode VM: less memory, but a more reliable and available memory system. Log bits/parity bits on each memory block are used to perform memory transaction logging.
• Storage and communications are protected by ECC.
• Optional mirrored memory images.
• All logic modules, such as cache coherence processors, could also exploit dual-mode processing.
• Dynamic switching between the two modes.
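One plausible reading of the per-block log bit is an undo log that records a block's old contents only on the first write after a checkpoint. The sketch below works under that assumption; the slides do not spell out the mechanism, and `LoggedMemory` and its methods are hypothetical names.

```python
class LoggedMemory:
    """Block memory with one log bit per block (illustrative only)."""

    def __init__(self, num_blocks: int):
        self.blocks = [0] * num_blocks
        self.log_bits = [False] * num_blocks
        self.undo_log = []  # (block index, old value) pairs

    def checkpoint(self) -> None:
        """Start a new epoch: discard the old log and clear all log bits."""
        self.undo_log.clear()
        self.log_bits = [False] * len(self.blocks)

    def write(self, index: int, value: int) -> None:
        # Assumption: the log bit marks "already logged in this epoch",
        # so only the first write to a block pays the logging cost.
        if not self.log_bits[index]:
            self.undo_log.append((index, self.blocks[index]))
            self.log_bits[index] = True
        self.blocks[index] = value

    def rollback(self) -> None:
        """Restore the memory image as of the last checkpoint."""
        while self.undo_log:
            index, old = self.undo_log.pop()
            self.blocks[index] = old
        self.log_bits = [False] * len(self.blocks)
```

Repeated writes to a hot block add a single log entry per checkpoint interval, which is one way the log bit could reduce logging work.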
Interconnection & I/O Channels
• Dual-mode I/O controllers and channels; multiple interconnections for both availability and performance.
• Performance Mode VM: more architected resource capacity, owing to physical resource replication.
• Reliable Mode VM: less capacity for both communication bandwidth and storage, owing to controller cross-checking or to the checkpointing/logging overhead of hidden VMM activity. However, this mode monitors component fault rates for fault forecasting.
• Dynamic switching between the two modes.
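Fault-rate monitoring for forecasting could be as simple as a sliding-window counter per component. The sketch below is only an illustration: `FaultRateMonitor`, the window length, and the threshold are assumptions, not parameters taken from the slides.

```python
from collections import deque


class FaultRateMonitor:
    """Sliding-window fault counter for one component (illustrative)."""

    def __init__(self, window_seconds: float, threshold: float):
        self.window = window_seconds
        self.threshold = threshold  # faults per second considered risky
        self.events = deque()       # timestamps of recent faults

    def record_fault(self, timestamp: float) -> None:
        self.events.append(timestamp)
        self._expire(timestamp)

    def _expire(self, now: float) -> None:
        # Drop faults that have fallen out of the observation window.
        while self.events and self.events[0] <= now - self.window:
            self.events.popleft()

    def rate(self, now: float) -> float:
        self._expire(now)
        return len(self.events) / self.window

    def at_risk(self, now: float) -> bool:
        """True when the recent fault rate reaches the threshold."""
        return self.rate(now) >= self.threshold
```

A VMM could poll `at_risk` for each controller or link and retire or cross-check a component whose corrected-error rate is climbing, before it fails outright.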
VMM Issues
• Dynamic configuration/switching of the VMs.
• The VMM intercepts certain interrupts in reliable mode: timer, machine-check, and I/O interrupts.
• Timer-triggered checkpointing. Checkpoint state: processor state, cache state, and memory state; communication state via QS.
• Memory checkpoints and memory transaction logging in main memory storage; log bits reduce the logging work.
• I/O event logging for replay during recovery.
• Rollback recovery: roll back the memory image, reload the system state, replay I/O events…
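The checkpoint, log, and replay cycle listed above can be sketched as follows. This is a toy model under stated assumptions: `ReliableVM` is a hypothetical name, the guest state is a single accumulator, and guest execution is deterministic given its inputs, which is what makes replay valid.

```python
import copy


class ReliableVM:
    """Toy model of timer-triggered checkpointing with I/O replay."""

    def __init__(self, checkpoint_interval: int):
        self.state = {"acc": 0}              # stands in for CPU/memory state
        self.interval = checkpoint_interval  # timer ticks between checkpoints
        self.ticks = 0
        self._checkpoint = copy.deepcopy(self.state)
        self.io_log = []                     # inputs since the last checkpoint

    def _apply(self, value: int) -> None:
        self.state["acc"] += value           # deterministic guest "work"

    def deliver_input(self, value: int) -> None:
        """The VMM intercepts the I/O event, logs it, then delivers it."""
        self.io_log.append(value)
        self._apply(value)

    def timer_tick(self) -> None:
        """The VMM intercepts the timer; every N ticks it checkpoints."""
        self.ticks += 1
        if self.ticks % self.interval == 0:
            self._checkpoint = copy.deepcopy(self.state)
            self.io_log.clear()

    def recover(self) -> None:
        """Rollback recovery: restore the checkpoint, replay logged I/O."""
        self.state = copy.deepcopy(self._checkpoint)
        for value in self.io_log:
            self._apply(value)
```

After a detected fault, `recover()` brings the guest back to the state it would have reached without the fault, provided every nondeterministic input since the checkpoint was logged.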
VMFTC Simulation Design
• Simulator infrastructure: PHARMsim (SimOS-PPC + SimpleMP). Precise but slow.
• Fault injection.
• Fault detection.
• Execution mode switch in the simulator.
• Checkpointing/logging and recovery, with full consideration of precise I/O event handling in the PHARMsim simulator.
• Co-designed VMM Classical OS VMM
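A fault-injection campaign can be prototyped outside the simulator before hooks are added to it. The sketch below does not use any PHARMsim interface: `inject_bit_flip` and `run_campaign` are hypothetical helpers that flip one random bit in a stored word and check whether a single parity bit detects the flip.

```python
import random


def parity(word: int) -> int:
    """Even/odd parity of a non-negative integer word."""
    return bin(word).count("1") % 2


def inject_bit_flip(word: int, rng: random.Random, width: int = 32) -> int:
    """Flip one randomly chosen bit (a single-bit transient fault model)."""
    bit = rng.randrange(width)
    return word ^ (1 << bit)


def run_campaign(words, seed: int = 0, width: int = 32) -> int:
    """Inject one fault per word; return how many faults parity detects."""
    rng = random.Random(seed)  # seeded so a campaign is reproducible
    detected = 0
    for word in words:
        stored_parity = parity(word)          # computed before the fault
        faulty = inject_bit_flip(word, rng, width)
        if parity(faulty) != stored_parity:   # detection check
            detected += 1
    return detected
```

Every single-bit flip changes the parity, so this campaign detects all injected faults. Double-bit flips would escape parity, which is the kind of coverage gap such a campaign is meant to expose.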
Problems
• Potential applications: servers? PCs/workstations? Mobile computing? Embedded systems?
• Whole-system-level fault models: what are the common faults, and what are their frequency, cost, etc.?
• Cost models for building these hooks into the whole system; cost of the redundant resources.
• How should fault-tolerant computing be evaluated, and how should that evaluation be performed for a research project?
• Can hardware do anything to help recover from software Heisenberg faults ("Heisenbugs")? Or can hardware do anything to help software fault tolerance in a co-designed style?