
Proactive Fault Tolerance Using Xen Virtualization

This solution addresses the problem of fault tolerance in high-performance computing (HPC) systems by implementing proactive measures. It anticipates node failures and migrates the operating system to a healthier node, preserving the application state with minimal overhead.


Presentation Transcript


  1. Proactive Fault Tolerance for HPC using Xen Virtualization • Arun Babu Nagarajan, Frank Mueller • North Carolina State University

  2. Problem Statement • Trends in HPC: high-end systems with thousands of processors • Increased probability of a node failure: MTBF becomes shorter • MPI widely accepted in scientific computing • Problem with MPI: the standard provides no recovery from faults • FT schemes currently exist, but… • only reactive: process checkpoint/restart • must restart entire job • inefficient if only one (or a few) node(s) fail • overhead due to redoing some of the work • issue: checkpoint at what frequency? • a 100 hr job will run for an additional 150 hrs on a petaflop machine (w/o failure) [I. Philp, 2005]

  3. Our Solution • Proactive FT • anticipates node failure • takes preventive action instead of reacting to a failure • migrates the whole OS to a healthier physical node • entirely transparent to the application (or rather, to the OS itself) • hence avoids the high overhead of reactive schemes (the overhead associated with our scheme is very small)

  4. Design space • 1. A mechanism to predict/anticipate the failure of a node • OpenIPMI • lm_sensors (more system-specific: x86 Linux) • 2. A mechanism to identify the best target node • Custom centralized approaches – don't scale + unreliable • Scalable distributed approach – Ganglia • 3. More importantly, a mechanism (for preventive action) that supports relocating the running application with • its state preserved • minimum overhead on the application itself • Xen virtualization with live migration support [C. Clark et al., May 2005] • Open source

  5. Mechanisms explained • 1. Health Monitoring with OpenIPMI • Baseboard Mgmt Controller (BMC) equipped with sensors to monitor different properties of each node, such as temperature, fan speed, and voltage • IPMI (Intelligent Platform Management Interface) • increasingly common in HPC • std. message-based interface to monitor H/W • raw messaging is harder to use and debug • OpenIPMI: open source, higher-level abstraction over the raw IPMI message-response system used to communicate w/ the BMC (i.e., to read sensors) • We use OpenIPMI to gather health information of nodes

  6. Mechanisms explained • 2. Ganglia • widely used, scalable distributed load monitoring tool • All the nodes in the cluster run a Ganglia daemon, and each node has an approximate view of the entire cluster • UDP used to transfer messages • Measures • CPU usage, memory usage, n/w usage by default • We use Ganglia to identify the least loaded node → migration target • Also extended to distribute IPMI sensor data

  7. Mechanisms explained • 3. Fault Tolerance w/ Xen • para-virtualized environment • OS modified • application unchanged • Privileged VM & Guest VM run on the Xen hypervisor/VMM • Guest VMs can live migrate to other hosts → little overhead • State of the VM preserved • VM halted for an insignificant period of time • Migration phases: • phase 1: send guest image → dst node, app running • phase 2: repeated diffs → dst node, app still running • phase 3: commit final diffs → dst node, OS/app frozen • phase 4: activate guest on dst, app running again • [Diagram: MPI Task in a Guest VM, next to the Privileged VM, on top of the Xen VMM and H/w]
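
The four migration phases above can be sketched as a toy model of iterative pre-copy. This is only an illustration: `precopy_migrate`, the page count, the dirty percentage, the stop threshold, and the round limit are all hypothetical parameters chosen for the example, not values taken from Xen or from the slides.

```python
def precopy_migrate(num_pages=1000, dirty_percent=20, stop_threshold=20, max_rounds=10):
    """Toy model of pre-copy live migration (all parameters are illustrative).

    Returns how many transfer rounds ran while the guest was live, how many
    pages were sent live, and how many were sent while the guest was frozen.
    """
    pending = num_pages      # phase 1 starts by sending the whole guest image
    sent_live = 0
    rounds = 0
    while True:
        rounds += 1
        sent_live += pending  # pages copied while the app keeps running
        # Assumption: while a round is in flight, the app dirties a fixed
        # percentage of the pages that were just sent (phase 2 diffs).
        pending = pending * dirty_percent // 100
        if pending <= stop_threshold or rounds >= max_rounds:
            break
    # Phase 3: the guest is frozen and the final diff is committed.
    sent_frozen = pending
    # Phase 4: the guest is activated on the destination (no further copy).
    return {"rounds": rounds, "sent_live": sent_live, "sent_frozen": sent_frozen}


if __name__ == "__main__":
    print(precopy_migrate())
```

In this model the frozen phase only has to move the last small diff, which is why the halt time stays short compared with the total migration time.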

  8. Overall set-up of the components • Stand-by Xen host, no guest • Deteriorating health → migrate guest (w/ MPI app) to stand-by host • [Diagram: each host consists of H/w with a BMC (Baseboard Management Controller), the Xen VMM, a Privileged VM running Ganglia and the PFT Daemon, and a Guest VM running an MPI task; the stand-by host has no Guest VM]

  9. Overall set-up of the components • Stand-by Xen host, no guest • Deteriorating health → migrate guest (w/ MPI app) to stand-by host • The destination host generates an unsolicited ARP reply advertising that the Guest VM IP has moved to a new location [C. Clark et al., 2005], so peers resend packets to the new host • [Diagram: the same set of hosts (H/w with BMC, Xen VMM, Privileged VM with Ganglia and PFT Daemon, Guest VM with MPI task), shown after migration]

  10. Proactive Fault Tolerance (PFT) Daemon • Runs on privileged VM (host) • Initialize • Read safe thresholds from config file: <Sensor name> <Low Thr> <Hi Thr> • CPU temperature, fan speeds; extensible (corrupt sectors, network, voltage fluctuations, …) • Init connection w/ IPMI BMC using authentication parameters and hostname • Gather a listing of available sensors in the system and validate it against our list • [Flowchart: Initialize → Health Monitor (IPMI Baseboard Mgmt Controller) → Threshold Breach? (N / Y) → Load Balance (Ganglia) → Raise Alarm / Maintenance of the system]
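
The threshold file and the breach check can be sketched as follows. This is a minimal sketch, assuming one `<Sensor name> <Low Thr> <Hi Thr>` entry per line; the `#` comment syntax, the sensor names, the sample values, and the function names `parse_thresholds` and `breached_sensors` are hypothetical and not taken from the PFT daemon itself.

```python
def parse_thresholds(text):
    """Parse lines of the form '<Sensor name> <Low Thr> <Hi Thr>'.

    Assumption: '#' starts a comment and blank lines are ignored.
    """
    thresholds = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        # Split from the right so that sensor names may contain spaces.
        name, low, high = line.rsplit(None, 2)
        thresholds[name] = (float(low), float(high))
    return thresholds


def breached_sensors(readings, thresholds):
    """Return the sorted names of sensors whose reading is outside [low, high]."""
    out = []
    for name, value in readings.items():
        if name not in thresholds:
            continue  # sensor is not on our list, so it is not monitored
        low, high = thresholds[name]
        if not (low <= value <= high):
            out.append(name)
    return sorted(out)


if __name__ == "__main__":
    config = """
    # hypothetical example values
    CPU Temp   10  70
    Fan1       2000 9000
    """
    limits = parse_thresholds(config)
    print(breached_sensors({"CPU Temp": 82.0, "Fan1": 4500.0}, limits))
```

A non-empty result corresponds to the "Y" branch of the threshold check, where control would pass to load balancing.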

  11. PFT Daemon • Health Monitoring • interacts w/ IPMI BMC (via OpenIPMI) to read sensors • Periodic sampling of data (event-driven mode is also supported) • threshold exceeded → control handed over to load balancing • PFTd determines migration target by contacting Ganglia • Load-based selection (lowest load) • Load obtained from the /proc file system • Invokes Xen live migration for guest VM • Xen user-land tools (at VM/host) • command line interface for live migration • PFT Daemon initiates migration for guest VM
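
The load-based target selection and the migration call can be sketched as below. This is a sketch under stated assumptions: the slides do not say which load metric is used, so the 1-minute load average from `/proc/loadavg` is assumed here; the host names, the domain name, and the helper names are hypothetical. The command is only built, not executed, and uses the `xm migrate --live` form of the Xen 3.x user-land tools.

```python
def parse_loadavg(text):
    """Return the 1-minute load average from the contents of /proc/loadavg."""
    return float(text.split()[0])


def pick_target(loads, exclude=()):
    """Pick the host with the lowest load, skipping excluded hosts.

    Ties are broken by host name so that the choice is deterministic.
    """
    candidates = {h: l for h, l in loads.items() if h not in exclude}
    if not candidates:
        raise ValueError("no candidate host available for migration")
    return min(candidates, key=lambda h: (candidates[h], h))


def migrate_command(domain, target_host):
    """Build (but do not run) the Xen live-migration command line."""
    return ["xm", "migrate", "--live", domain, target_host]


if __name__ == "__main__":
    # Hypothetical cluster view, as it might be assembled from Ganglia data.
    loads = {"node01": 1.92, "node02": 0.05, "node03": 0.40}
    target = pick_target(loads, exclude={"node01"})  # node01 is the failing host
    print(migrate_command("guest-vm1", target))
```

Passing the result to `subprocess.run` on the privileged VM would start the migration; keeping command construction separate makes the selection logic easy to test without a Xen host.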

  12. Experimental Framework • Cluster of 16 nodes (dual-core, dual Opteron 265, 1 Gbps Ethernet) • Xen-3.0.2-3 VMM • Privileged and guest VMs run a ported Linux kernel, version 2.6.16 • Guest VM: • Very same configuration as privileged VM • Has 1GB RAM • Booted on VMM w/ PXE netboot via NFS • Has access to NFS (same as the privileged VM) • Ganglia on Privileged VM (and also Guest VM) on all nodes • Node sensors obtained via OpenIPMI

  13. Experimental Framework • NAS Parallel Benchmarks run on Guest Virtual Machines • MPICH-2 w/ MPD ring on n Guest VMs (no job-pause required!) • Process on Privileged domain • monitors MPI task runs • issues migration command (NFS used for synchronization) • Measured: • wallclock time with and w/o migration • actual downtime + migration overhead (modified Xen migration) • benchmarks run 10 times, results report the average • NPB V3.2.1: BT, CG, EP, LU and SP benchmarks • IS run is too short • MG requires > 1GB for class C

  14. Experimental Results • 1. Single node failure • 2. Double node failure • [Charts: NPB Class C / 4 nodes; NPB Class B / 4 nodes] • Single node failure: overhead of 1-4% over total wall clock time • Double node failure: overhead of 2-8% over total wall clock time

  15. Experimental Results • 3. Behavior of Problem Scaling • [Chart: NPB, 4 nodes] • Chart depicts only the overhead section • The dark region represents the part for which the VM was halted • The light region represents the delay incurred due to migration (diff operations, etc.) • Generally, overhead increases with problem size (CG is an exception)

  16. Experimental Results • 4. Behavior of Task Scaling • [Chart: NPB Class C] • Generally we expect a decrease in overhead on increasing the # of nodes • Some discrepancies observed for BT and LU (migration duration is 40s but here we have 60s)

  17. Experimental Results • 5. Migration duration • [Charts: NPB 4 nodes; NPB 4/8/16 nodes] • Min 13s needed to transfer a 1GB VM w/o any active processes • Max 40 seconds needed before migration is initiated • Depends on the n/w bandwidth, RAM size & on the application
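
As a rough sanity check on the 13s minimum, the raw wire time for the guest's memory can be estimated from the RAM size and the link speed given in the experimental framework. This back-of-the-envelope sketch assumes 1GB means 2^30 bytes and that the full 1 Gbps is usable with no protocol overhead; both are simplifying assumptions, and `min_transfer_seconds` is a hypothetical helper.

```python
def min_transfer_seconds(ram_bytes, link_bits_per_sec):
    """Lower bound on the time to push ram_bytes over the link once."""
    return ram_bytes * 8 / link_bits_per_sec


if __name__ == "__main__":
    ram = 2 ** 30   # assumed size of a 1GB guest, in bytes
    link = 1e9      # 1 Gbps Ethernet, in bits per second
    print(f"{min_transfer_seconds(ram, link):.2f} s")
```

Under these assumptions the bound is roughly 8.6 seconds, so the measured 13s for an idle guest is within about a factor of 1.5 of the raw wire time; a guest that dirties memory during migration would add further diff rounds on top.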

  18. Experimental Results • 6. Scalability (Total execution time) • [Chart: NPB Class C] • Speedup is not significantly affected

  19. Related Work • FT – Reactive approach is more common • Automatic • Checkpoint/restart (e.g., BLCR – Berkeley Lab Checkpoint/Restart) [S. Sankaran et al., LACSI '03], [G. Stellner, IPPS '96] • Log-based (log messages + temporal ordering) [G. Bosilca, Supercomputing, 2002] • Non-automatic • Explicit invocation of checkpoint routines [R. T. Aulwes et al., IPDPS 2004], [G. E. Fagg and J. J. Dongarra, 2000] • Virtualization in HPC incurs little/no overhead [W. Huang et al., ICS '06] • To make virtualization competitive for MP environments, VMM-bypass I/O in VMs has been explored [J. Liu et al., USENIX '06] • n/w virtualization can be optimized [A. Menon et al., USENIX '06]

  20. Conclusion • In contrast to the currently available reactive FT schemes, we have developed a proactive system with much less overhead • Transparent and automatic FT for arbitrary MPI applications • Ideally complements long-running MPI jobs • A proactive system will complement reactive systems well (it will help to reduce the high overhead associated with reactive schemes)
