“Proactive Fault Tolerance for HPC with Xen Virtualization”
Nagarajan, Mueller, Engelmann, Scott – NCSU and Oak Ridge National Laboratory
Stephen Orchowski – 11/15/2008
CSE 520 – Advanced Computer Architecture
Agenda
• Background and motivations
• Monitoring of system health
• Role of Xen in fault tolerance
• Migration process
• Management of the FT mechanism
• Experimental setup and results
Motivations
• What is HPC? High-performance computing: large clusters running long, tightly coupled parallel jobs, where the failure of a single node can take down the entire application.
• Checkpoints are used to save the state of the system and of program execution.
• Restarts are issued after a fault occurs and the failing component is removed or isolated.
• Checkpointing adds overhead – it can “prolong a 100 hour job (without failure) by an additional 151 hours in petaflop systems.”
Motivations cont.
• Current techniques rely on reactive mechanisms.
• What if we could predict when a failure is going to occur?
• Measure the health of a system by monitoring fan speeds, component temperatures, voltages, and disk error logs.
• Checkpoints are still necessary as a fallback, but with reliable prediction they can be taken far less frequently – reactive recovery becomes the exception rather than the norm.
Proactive Fault Tolerance
• The system must provide the following constructs to carry out proactive fault tolerance:
  - Node health monitoring
  - Failure prediction
  - Load balancing
  - A migration mechanism
Xen Review
• Paravirtualization – recall that the hosted VM’s kernel must be modified to run on the VMM; applications do not need to be modified.
• Xen provides facilities for live migration (invoked from the privileged domain, dom0 – see the sketch below).
• All state information is transferred before the guest is activated on the target node.
• Migration preserves the state of all processes running on the guest.
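As a point of reference, a minimal sketch of how a live migration can be requested from dom0 using Xen’s xm toolstack, wrapped in Python; the guest name and target hostname are placeholders:

```python
import subprocess

def live_migrate(guest, target_host):
    """Request a live migration of `guest` to `target_host` from dom0.

    Wraps Xen's `xm migrate --live` command; "guest1" and "node07"
    below are placeholder names, not values from the paper.
    """
    subprocess.run(["xm", "migrate", "--live", guest, target_host], check=True)

if __name__ == "__main__":
    live_migrate("guest1", "node07")
```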
Migration Process
• The source node asks whether the target has sufficient resources for the new guest; if so, the target reserves them.
• The source sends all pages of the guest VM to the destination node and write-protects the guest’s page-table entries, so that subsequent writes trap and dirtied pages can be tracked.
• The source then repeatedly sends the pages dirtied since the previous round to the destination node.
• Finally, the guest VM is briefly stopped, the last dirty pages are sent, and the guest resumes execution on the destination node.
• What is the point of these steps? Why not just stop the guest OS, transfer it, and restart it? (The sketch below addresses this.)
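These steps are Xen’s iterative pre-copy algorithm: the guest – and the MPI task inside it – keeps running during most of the transfer, so downtime shrinks to the final stop-and-copy of a small dirty set. Below is a simplified, illustrative control-flow sketch; the page-transfer and dirty-tracking helpers are hypothetical stand-ins for hypervisor internals, not the authors’ code.

```python
# Illustrative only: guest/target objects and the helpers reserve_resources,
# send_pages, and dirtied_pages are hypothetical stand-ins for what the Xen
# hypervisor does internally during live migration.

SMALL_WORKING_SET = 64   # stop iterating once this few pages remain dirty
MAX_ROUNDS = 30          # bound the number of pre-copy rounds

def precopy_migrate(guest, target):
    """Control flow of iterative pre-copy live migration."""
    # 1. Ask the target to reserve memory/CPUs for the incoming guest.
    if not target.reserve_resources(guest.memory, guest.vcpus):
        raise RuntimeError("target node cannot host the guest")

    # 2. First round: push every page while the guest keeps running;
    #    pages are write-protected so that writes trap and are recorded as dirty.
    send_pages(guest.all_pages(), target)

    # 3. Iterative rounds: resend only pages dirtied since the previous round,
    #    until the remaining working set is small (or a round limit is hit).
    dirty = dirtied_pages(guest)
    rounds = 0
    while len(dirty) > SMALL_WORKING_SET and rounds < MAX_ROUNDS:
        send_pages(dirty, target)
        dirty = dirtied_pages(guest)
        rounds += 1

    # 4. Stop-and-copy: pause the guest briefly, send the last dirty pages
    #    and CPU state, then resume the guest on the target node.
    guest.pause()
    send_pages(dirty, target)
    target.resume(guest)
```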
Load Balancing
• Ganglia – a scalable distributed monitoring system.
• Every node runs a daemon that monitors local resources.
• Each node sends out multicast packets containing its current status.
• As a result, every node has a global view of the current state of the entire system.
• Health information is not part of this mechanism.
• A target node is selected if it does not yet host a guest VM and has the lowest CPU utilization (see the sketch below).
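To make the selection concrete, here is a rough sketch of choosing a target from Ganglia’s data. gmond serves an XML snapshot of the cluster on TCP port 8649; the `cpu_idle` metric name and the way guest-hosting nodes are tracked are assumptions for illustration, not details from the paper.

```python
import socket
import xml.etree.ElementTree as ET

GMOND_PORT = 8649  # default port on which gmond serves its XML cluster report

def cluster_state(host="localhost", port=GMOND_PORT):
    """Fetch Ganglia's XML snapshot and return {hostname: {metric: value}}."""
    with socket.create_connection((host, port)) as sock:
        chunks = []
        while True:
            data = sock.recv(8192)
            if not data:
                break
            chunks.append(data)
    root = ET.fromstring(b"".join(chunks))
    state = {}
    for h in root.iter("HOST"):
        metrics = {m.get("NAME"): m.get("VAL") for m in h.iter("METRIC")}
        state[h.get("NAME")] = metrics
    return state

def pick_target(state, hosts_with_guests):
    """Pick the node with no guest VM and the lowest CPU utilization.

    Uses the standard `cpu_idle` metric; `hosts_with_guests` is assumed to be
    maintained elsewhere (e.g., by the PFT daemon).
    """
    candidates = {
        name: 100.0 - float(m.get("cpu_idle", 0.0))
        for name, m in state.items()
        if name not in hosts_with_guests
    }
    return min(candidates, key=candidates.get) if candidates else None
```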
Health Monitoring
• Intelligent Platform Management Interface (IPMI) – provides a standardized, message-based mechanism for monitoring and managing hardware.
• Baseboard Management Controller (BMC) – contains sensors that monitor different system components and properties.
• Periodic sampling is accomplished by means of the OpenIPMI API, which communicates with the BMC (a polling sketch follows).
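The paper uses the OpenIPMI C API; as a stand-in, the sketch below polls the BMC through the ipmitool command-line utility and parses its pipe-separated sensor listing. The sensor names and threshold values are placeholders.

```python
import subprocess

# Placeholder thresholds; real values would come from the PFTd configuration.
THRESHOLDS = {"CPU Temp": 70.0, "Fan1": 2000.0}

def read_sensors():
    """Return {sensor_name: reading} parsed from `ipmitool sensor`.

    ipmitool prints one pipe-separated line per sensor; the first field is
    the sensor name and the second its current reading (or "na").
    """
    out = subprocess.run(["ipmitool", "sensor"], capture_output=True,
                         text=True, check=True).stdout
    readings = {}
    for line in out.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) >= 2:
            try:
                readings[fields[0]] = float(fields[1])
            except ValueError:
                pass  # skip non-numeric readings
    return readings

def exceeded_thresholds(readings):
    """Return the sensors whose readings violate their configured thresholds."""
    bad = []
    for name, limit in THRESHOLDS.items():
        value = readings.get(name)
        if value is None:
            continue
        # Simplification: treat fan speeds as lower bounds, everything else as upper bounds.
        if name.startswith("Fan"):
            if value < limit:
                bad.append(name)
        elif value > limit:
            bad.append(name)
    return bad
```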
Putting It All Together – The PFT Daemon
• The PFT (Proactive Fault Tolerance) daemon centralizes and controls the three main components:
  - Health monitoring
  - Decision making
  - Load balancing
• Initialization is lengthy: it loads threshold values and the specific parameters to monitor.
• After initialization, it begins sampling the various sensors via the BMC.
• Readings are compared against the thresholds; if any are exceeded, control is transferred to the Ganglia component, which selects a target node for migration.
• The PFTd then issues the migration command, which starts the live migration of the guest VM from the current node to a healthier one (schematic below).
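Putting the pieces together, a schematic of the PFTd control loop – not the authors’ code – reusing the hypothetical helpers sketched in the previous sections; the interval, guest name, and guest-tracking set are placeholders.

```python
import time

# Reuses the sketches above: read_sensors()/exceeded_thresholds() for IPMI,
# cluster_state()/pick_target() for Ganglia, live_migrate() for Xen.

SAMPLE_INTERVAL = 5          # seconds between BMC samples (placeholder)
LOCAL_GUEST = "guest1"       # guest VM hosted on this node (placeholder)
HOSTS_WITH_GUESTS = set()    # assumed to be maintained from Ganglia/Xen state

def pft_daemon():
    """Schematic PFTd loop: sample health, detect threshold violations,
    ask Ganglia for a lightly loaded target, then live-migrate the guest."""
    while True:
        readings = read_sensors()                  # health monitoring (IPMI/BMC)
        if exceeded_thresholds(readings):          # decision making
            state = cluster_state()                # load balancing (Ganglia)
            target = pick_target(state, HOSTS_WITH_GUESTS)
            if target is not None:
                live_migrate(LOCAL_GUEST, target)  # Xen live migration
                break                              # guest has left this node
        time.sleep(SAMPLE_INTERVAL)
```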
Experimental Setup
• 16-node cluster; each node has 2 GB of main memory and two dual-core AMD Opteron 265 processors, all interconnected by a 1 Gbps Ethernet switch.
• NAS Parallel Benchmarks (NPB).
• How to simulate failures?
What Is the Experiment Testing?
• Recall that HPC clusters experience faults, so checks have to be built into the overall system; this overhead reduces total performance.
• Measure wall-clock time of the system with and without failures.
• Measure performance for various scenarios:
  - Single-node failure – 4 nodes
  - Double-node failure – 4 nodes
  - Scaling of the test system (i.e., the first two scenarios over a larger network and system) – 16 nodes
Multi-Node Tests
• Measure performance as the problem size and network scale up.
• Speedup measured with and without migration.
• One node failure for each test.
Live Migration vs. Stop & Copy
• Comparison of wall-clock execution time.
Conclusions
• Node failures can be reasonably well predicted from health statistics.
• Restarts from checkpoints are avoided for predicted failures.
• Larger problem sizes do not necessarily increase the overhead of migration.
• Live migration has higher overhead than the stop-and-copy scheme but is faster overall, because the application continues to execute during the migration.
• Live migration helps hide the cost of relocating a guest OS and its associated MPI task.
• Checkpoint frequency can be reduced.