
Co-designed Virtual Machines for Reliable Computer Systems

This presentation discusses the design and implementation of co-designed virtual machines for fault-tolerant and high-performance computing systems. It covers the hardware side (processor micro-architecture hooks, memory system design, interconnection, I/O channels and networking) and the software side (the virtual machine monitor). It also explores the benefits and challenges of using replicated hardware resources to achieve fault tolerance and improve system throughput.

dmagee

Presentation Transcript


  1. Co-designed Virtual Machines for Reliable Computer Systems
     University of Wisconsin – Madison, September 2002

  2. Outline
     • Overview of VMFTC: Dual-Mode Execution Virtual Machine for Fault-Tolerant Computing
     • Hardware: Processor Micro-Architecture Hooks
     • Hardware: Memory System Design
     • Hardware: Interconnection, I/O Channels, Disks and Networking
     • Software: Virtual Machine Monitor
     • VMFTC Simulation Design
     • Problems and Feedback Needed
     vmftc

  3. Principles
     • Hardware resource replication is a must for both fault tolerance and performance/throughput. Build simple HW hooks on top of replicated hardware resources, then export either a performance VM or an FTC VM.
     • Maintain the conventional architectural interface between HW and SW, e.g. between HW and the OS. Exploit COTS components where possible.
       – Is this strategy problematic?

  4. Architecture - End Users’ Perspective:
     • Two modes of virtual machine run on the same hardware platform, together with a hidden mini VMM software layer.
     • Performance Mode VM: fully architected COTS processor resources, wider interconnection bandwidth, larger memory, wider I/O channel bandwidth, larger or more disks, and lower latency.
     • Reliable Mode VM: ultra-reliable architected processor resources; ultra-reliable and highly available interconnections, memory systems, I/O channels and storage.
     • Positive synergistic effects: self-monitoring and self-healing via error detection and recovery mechanisms in the Reliable Mode VM; higher system throughput from the Performance Mode VM, which relieves workload pressure on the whole server system.

  5. Architecture - End Users’ Perspective:

  6. Proposed Contributions:
     • Flexible and cost-effective use of replicated COTS hardware resources via virtual machine technology, while maintaining the conventional architecture.
     • An ultra-reliable architectural interface for the software community: it separates hardware RAS from software RAS, which makes hierarchical solutions easier.
     • Simple architectural support for software RAS mechanisms when needed, for more effective whole-system solutions.

  7. Processor Micro-Architecture Hooks
     • Performance Mode VM: more architected processing capacity.
     • Reliable Mode VM: RAS provided by lock-stepped processor pairs; can monitor runtime hardware status.
     • Dynamic switching between the two modes.
     • Bootstrap: reliable mode, for better self-testing.
     • Power-off: the VMM gets final control of the system and enters reliable mode to verify that everything is OK before powering off.
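The lock-stepped pair in Reliable Mode can be illustrated with a toy model. This is only a sketch under stated assumptions: the `Core` class, its single accumulator register, and the `fault_at` fault model are hypothetical and are not taken from the slides. Two cores execute the same operand stream and a checker compares their results after every step.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Core:
    """Toy core with one 32-bit accumulator (hypothetical, for illustration)."""
    acc: int = 0
    fault_at: Optional[int] = None  # step at which a bit is flipped (assumed fault model)

    def step(self, step_no: int, operand: int) -> int:
        self.acc = (self.acc + operand) & 0xFFFFFFFF
        if self.fault_at == step_no:
            self.acc ^= 1  # simulate a single-bit upset in the result
        return self.acc


def run_lockstep(operands: Iterable[int], primary: Core, shadow: Core) -> Optional[int]:
    """Run both cores in lock step; return the first diverging step, or None."""
    for i, op in enumerate(operands):
        if primary.step(i, op) != shadow.step(i, op):
            return i  # mismatch: a fault has been detected at this step
    return None
```

A mismatch only says that one of the two cores is wrong, not which one, so a real design would pair detection with a recovery mechanism such as rollback.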

  8. Processors: Lock-stepped UP or SMP/CMP

  9. Memory System Design
     • Performance Mode VM: more architected memory and interconnection bandwidth, slightly lower latency.
     • Reliable Mode VM: less memory, but a more reliable and available memory system. Exploits log bits/parity bits for each memory block to perform memory transaction logging.
     • Storage and communications are protected by ECC.
     • Optional mirrored memory images.
     • All logic modules, such as cache coherence processors, could also exploit dual-mode processing.
     • Dynamic switching between the two modes.
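One plausible reading of the per-block log bit is sketched below. The slides do not specify the mechanism, so this is an assumption: the bit marks blocks whose pre-checkpoint value has already been saved, so that only the first write to a block after a checkpoint pays the logging cost. `LoggedMemory` and its method names are hypothetical.

```python
class LoggedMemory:
    """Toy block memory with one log bit per block (assumed interpretation)."""

    def __init__(self, n_blocks: int) -> None:
        self.blocks = [0] * n_blocks
        self.log_bit = [False] * n_blocks
        self.undo_log = []  # list of (block index, value at last checkpoint)

    def write(self, idx: int, value: int) -> None:
        # Log the old value only on the first write since the last checkpoint.
        if not self.log_bit[idx]:
            self.undo_log.append((idx, self.blocks[idx]))
            self.log_bit[idx] = True
        self.blocks[idx] = value

    def checkpoint(self) -> None:
        # The current image becomes the new recovery point.
        self.undo_log.clear()
        self.log_bit = [False] * len(self.blocks)

    def rollback(self) -> None:
        # Undo logged writes in reverse order to restore the checkpoint image.
        while self.undo_log:
            idx, old = self.undo_log.pop()
            self.blocks[idx] = old
        self.log_bit = [False] * len(self.blocks)
```

With this scheme the log grows with the number of distinct blocks touched in a checkpoint interval, not with the number of writes.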

  10. Memory System Design

  11. Interconnection & I/O Channels
     • Dual-mode I/O controllers and channels. Multiple interconnections for both availability and performance.
     • Performance Mode VM: more architected resource capacity, thanks to physical resource replication.
     • Reliable Mode VM: less capacity for both communication bandwidth and storage, because of controller cross-checking and the checkpointing/logging overhead of hidden VMM activity. In return, it monitors component fault rates for fault forecasting.
     • Dynamic switching between the two modes.
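Fault-rate monitoring for fault forecasting could look roughly like the sliding-window counter below. The slides do not describe the forecasting method, so the window length, the threshold, and the `FaultRateMonitor` name are all illustrative assumptions.

```python
from collections import deque


class FaultRateMonitor:
    """Flags a component once it reports too many faults within a time window."""

    def __init__(self, window: float, threshold: int) -> None:
        self.window = window        # window length in seconds (assumed unit)
        self.threshold = threshold  # fault count that triggers a warning
        self.events = deque()       # timestamps of recent faults, oldest first

    def record(self, t: float) -> bool:
        """Record a fault at time t; return True if the component looks suspect."""
        self.events.append(t)
        # Drop faults that have fallen out of the window.
        while self.events and self.events[0] <= t - self.window:
            self.events.popleft()
        return len(self.events) >= self.threshold
```

A VMM could keep one such monitor per controller or channel and move work away from a component once `record` returns True.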

  12. I/O & Interconnection Network

  13. VMM issues
     • Dynamic configuration/switching of the VMs.
     • The VMM intercepts certain interrupts in reliable mode: timer, machine check and I/O interrupts.
     • Timer-triggered checkpointing. Checkpoint state: processor state, cache state and memory state. Communication states via QS.
     • Memory checkpoints and memory transaction logging in main memory storage; log bits reduce the work.
     • I/O event logging for replay during recovery.
     • Rollback recovery: roll back the memory image, reload system state, replay I/O events…
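The checkpoint, I/O logging and rollback steps can be sketched with a toy state machine. Everything here is an assumption made for illustration: the `ReliableVM` class, the dictionary standing in for processor and memory state, and the use of a step count in place of a timer interrupt.

```python
import copy


class ReliableVM:
    """Toy VM whose state is rebuilt from a checkpoint plus logged inputs."""

    def __init__(self, checkpoint_interval: int) -> None:
        self.state = {"total": 0, "steps": 0}  # stand-in for processor/memory state
        self.interval = checkpoint_interval    # stand-in for the checkpoint timer
        self._checkpoint = copy.deepcopy(self.state)
        self._io_log = []                      # inputs received since the checkpoint

    def _apply(self, value: int) -> None:
        self.state["total"] += value
        self.state["steps"] += 1

    def handle_input(self, value: int) -> None:
        self._io_log.append(value)  # log the I/O event before acting on it
        self._apply(value)
        if self.state["steps"] % self.interval == 0:
            self.take_checkpoint()

    def take_checkpoint(self) -> None:
        self._checkpoint = copy.deepcopy(self.state)
        self._io_log.clear()        # older events are no longer needed for replay

    def recover(self) -> None:
        # Roll back to the checkpoint, then replay the logged I/O events.
        self.state = copy.deepcopy(self._checkpoint)
        for value in list(self._io_log):
            self._apply(value)
```

Replay works here only because `_apply` is deterministic; a real system also has to make interrupt timing and other nondeterministic events reproducible.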

  14. VMFTC Simulation Design
     • Simulator infrastructure: PHARMsim (SimOS-PPC + SimpleMP). Precise but slow.
     • Fault injection
     • Fault detection
     • Execution mode switch in the simulator.
     • Checkpointing/logging and recovery with full consideration of precise I/O event handling in the PHARMsim simulator.
     • Co-designed VMM  Classical OS VMM
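A fault-injection experiment of the kind listed above can be sketched independently of PHARMsim. The code below does not use any PHARMsim interface; it assumes a single-bit-flip fault model and a parity check as the detector, and all function names are hypothetical.

```python
import random


def parity(word: int) -> int:
    """Even-parity bit of a 32-bit word."""
    return bin(word & 0xFFFFFFFF).count("1") & 1


def inject_bit_flip(word: int, rng: random.Random, width: int = 32) -> int:
    """Flip one randomly chosen bit (assumed single-bit fault model)."""
    return word ^ (1 << rng.randrange(width))


def run_campaign(trials: int, seed: int = 0) -> int:
    """Inject one fault per trial and count how many the parity check detects."""
    rng = random.Random(seed)  # seeded so that a campaign is repeatable
    detected = 0
    for _ in range(trials):
        word = rng.getrandbits(32)
        stored_parity = parity(word)  # parity recorded before the fault
        faulty = inject_bit_flip(word, rng)
        if parity(faulty) != stored_parity:
            detected += 1
    return detected
```

Parity catches every single-bit flip, so this campaign reports full coverage; an even number of flips in one word would go undetected, which is the sort of gap such a campaign is meant to expose.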

  15. Problems
     • Potential applications: servers? PCs/workstations? Mobile computing? Embedded systems?
     • Whole-system-level fault models: what are the common faults, and what are their frequency, cost, etc.?
     • Cost models for building these hooks into the whole system, and the cost of redundant resources.
     • How should fault-tolerant computing be evaluated, and how can such an evaluation be carried out within a research project?
     • Can HW do anything to help recover from SW "Heisenbug" faults? Or can HW do anything to help SW fault tolerance in a co-designed style?
