“Proactive Fault Tolerance for HPC with Xen Virtualization”
Nagarajan, Mueller, Engelmann, Scott – NCSU and Oak Ridge National Laboratory
Stephen Orchowski – 11/15/2008
CSE 520 – Advanced Computer Architecture
Agenda
• Background and motivations
• Monitoring of system health
• Role of Xen in fault tolerance
• Migration process
• Management of the FT mechanism
• Experimental setup and results
Motivations
• What is HPC? High-performance computing: large clusters running long, tightly coupled parallel jobs, where the failure of a single node can take down the entire application.
• Checkpoints are used to save the state of the system and of program execution.
• Restarts are issued after a fault occurs and the failing component is removed or isolated.
• Checkpointing adds overhead – it can “prolong a 100 hour job (without failure) by an additional 151 hours in petaflop systems.”
Motivations cont.
• Current techniques rely on reactive mechanisms.
• What if we could predict when a failure is going to occur?
• Measure the health of a system by monitoring fan speeds, component temperatures, voltages, and disk error logs.
• Checkpoints are still necessary as a fallback, but with reliable prediction they can be taken far less frequently – reactive recovery becomes the exception rather than the norm.
Proactive Fault Tolerance
• The system must provide the following constructs to carry out proactive fault tolerance:
  - Node health monitoring
  - Failure prediction
  - Load balancing
  - A migration mechanism
Xen Review
• Paravirtualization – recall that the hosted VM’s kernel must be modified to run on the VMM; applications do not need to be modified.
• Xen provides facilities for live migration (invoked from the privileged domain, dom0 – see the sketch below).
• All state information is transferred before the guest is activated on the target node.
• Migration preserves the state of all processes running on the guest.
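As a point of reference, a minimal sketch of how a live migration can be requested from dom0 using Xen’s xm toolstack, wrapped in Python; the guest name and target hostname are placeholders:

```python
import subprocess

def live_migrate(guest, target_host):
    """Request a live migration of `guest` to `target_host` from dom0.

    Wraps Xen's `xm migrate --live` command; "guest1" and "node07"
    below are placeholder names, not values from the paper.
    """
    subprocess.run(["xm", "migrate", "--live", guest, target_host], check=True)

if __name__ == "__main__":
    live_migrate("guest1", "node07")
```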
Migration Process
• The source node asks whether the target has sufficient resources for the new guest; if so, the target reserves them.
• The source sends all pages of the guest VM to the destination node and write-protects the guest’s page-table entries, so that subsequent writes trap and dirtied pages can be tracked.
• The source then repeatedly sends the pages dirtied since the previous round to the destination node.
• Finally, the guest VM is briefly stopped, the last dirty pages are sent, and the guest resumes execution on the destination node.
• What is the point of these steps? Why not just stop the guest OS, transfer it, and restart it? (The sketch below addresses this.)
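These steps are Xen’s iterative pre-copy algorithm: the guest – and the MPI task inside it – keeps running during most of the transfer, so downtime shrinks to the final stop-and-copy of a small dirty set. Below is a simplified, illustrative control-flow sketch; the page-transfer and dirty-tracking helpers are hypothetical stand-ins for hypervisor internals, not the authors’ code.

```python
# Illustrative only: guest/target objects and the helpers reserve_resources,
# send_pages, and dirtied_pages are hypothetical stand-ins for what the Xen
# hypervisor does internally during live migration.

SMALL_WORKING_SET = 64   # stop iterating once this few pages remain dirty
MAX_ROUNDS = 30          # bound the number of pre-copy rounds

def precopy_migrate(guest, target):
    """Control flow of iterative pre-copy live migration."""
    # 1. Ask the target to reserve memory/CPUs for the incoming guest.
    if not target.reserve_resources(guest.memory, guest.vcpus):
        raise RuntimeError("target node cannot host the guest")

    # 2. First round: push every page while the guest keeps running;
    #    pages are write-protected so that writes trap and are recorded as dirty.
    send_pages(guest.all_pages(), target)

    # 3. Iterative rounds: resend only pages dirtied since the previous round,
    #    until the remaining working set is small (or a round limit is hit).
    dirty = dirtied_pages(guest)
    rounds = 0
    while len(dirty) > SMALL_WORKING_SET and rounds < MAX_ROUNDS:
        send_pages(dirty, target)
        dirty = dirtied_pages(guest)
        rounds += 1

    # 4. Stop-and-copy: pause the guest briefly, send the last dirty pages
    #    and CPU state, then resume the guest on the target node.
    guest.pause()
    send_pages(dirty, target)
    target.resume(guest)
```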
Load Balancing
• Ganglia – a scalable distributed monitoring system.
• Every node runs a daemon that monitors local resources.
• Each node sends out multicast packets containing its current status.
• As a result, every node has a global view of the current state of the entire system.
• Health information is not part of this mechanism.
• A target node is selected if it does not yet host a guest VM and has the lowest CPU utilization (see the sketch below).
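To make the selection concrete, here is a rough sketch of choosing a target from Ganglia’s data. gmond serves an XML snapshot of the cluster on TCP port 8649; the `cpu_idle` metric name and the way guest-hosting nodes are tracked are assumptions for illustration, not details from the paper.

```python
import socket
import xml.etree.ElementTree as ET

GMOND_PORT = 8649  # default port on which gmond serves its XML cluster report

def cluster_state(host="localhost", port=GMOND_PORT):
    """Fetch Ganglia's XML snapshot and return {hostname: {metric: value}}."""
    with socket.create_connection((host, port)) as sock:
        chunks = []
        while True:
            data = sock.recv(8192)
            if not data:
                break
            chunks.append(data)
    root = ET.fromstring(b"".join(chunks))
    state = {}
    for h in root.iter("HOST"):
        metrics = {m.get("NAME"): m.get("VAL") for m in h.iter("METRIC")}
        state[h.get("NAME")] = metrics
    return state

def pick_target(state, hosts_with_guests):
    """Pick the node with no guest VM and the lowest CPU utilization.

    Uses the standard `cpu_idle` metric; `hosts_with_guests` is assumed to be
    maintained elsewhere (e.g., by the PFT daemon).
    """
    candidates = {
        name: 100.0 - float(m.get("cpu_idle", 0.0))
        for name, m in state.items()
        if name not in hosts_with_guests
    }
    return min(candidates, key=candidates.get) if candidates else None
```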
Health Monitoring
• Intelligent Platform Management Interface (IPMI) – provides a standardized, message-based mechanism for monitoring and managing hardware.
• Baseboard Management Controller (BMC) – contains sensors that monitor different system components and properties.
• Periodic sampling is accomplished by means of the OpenIPMI API, which communicates with the BMC (a polling sketch follows).
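The paper uses the OpenIPMI C API; as a stand-in, the sketch below polls the BMC through the ipmitool command-line utility and parses its pipe-separated sensor listing. The sensor names and threshold values are placeholders.

```python
import subprocess

# Placeholder thresholds; real values would come from the PFTd configuration.
THRESHOLDS = {"CPU Temp": 70.0, "Fan1": 2000.0}

def read_sensors():
    """Return {sensor_name: reading} parsed from `ipmitool sensor`.

    ipmitool prints one pipe-separated line per sensor; the first field is
    the sensor name and the second its current reading (or "na").
    """
    out = subprocess.run(["ipmitool", "sensor"], capture_output=True,
                         text=True, check=True).stdout
    readings = {}
    for line in out.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) >= 2:
            try:
                readings[fields[0]] = float(fields[1])
            except ValueError:
                pass  # skip non-numeric readings
    return readings

def exceeded_thresholds(readings):
    """Return the sensors whose readings violate their configured thresholds."""
    bad = []
    for name, limit in THRESHOLDS.items():
        value = readings.get(name)
        if value is None:
            continue
        # Simplification: treat fan speeds as lower bounds, everything else as upper bounds.
        if name.startswith("Fan"):
            if value < limit:
                bad.append(name)
        elif value > limit:
            bad.append(name)
    return bad
```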
Putting It All Together – The PFT Daemon
• The PFT (Proactive Fault Tolerance) daemon centralizes and controls the three main components:
  - Health monitoring
  - Decision making
  - Load balancing
• Initialization is lengthy: it loads threshold values and the specific parameters to monitor.
• After initialization, it begins sampling the various sensors via the BMC.
• Readings are compared against the thresholds; if any are exceeded, control is transferred to the Ganglia component, which selects a target node for migration.
• The PFTd then issues the migration command, which starts the live migration of the guest VM from the current node to a healthier one (schematic below).
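Putting the pieces together, a schematic of the PFTd control loop – not the authors’ code – reusing the hypothetical helpers sketched in the previous sections; the interval, guest name, and guest-tracking set are placeholders.

```python
import time

# Reuses the sketches above: read_sensors()/exceeded_thresholds() for IPMI,
# cluster_state()/pick_target() for Ganglia, live_migrate() for Xen.

SAMPLE_INTERVAL = 5          # seconds between BMC samples (placeholder)
LOCAL_GUEST = "guest1"       # guest VM hosted on this node (placeholder)
HOSTS_WITH_GUESTS = set()    # assumed to be maintained from Ganglia/Xen state

def pft_daemon():
    """Schematic PFTd loop: sample health, detect threshold violations,
    ask Ganglia for a lightly loaded target, then live-migrate the guest."""
    while True:
        readings = read_sensors()                  # health monitoring (IPMI/BMC)
        if exceeded_thresholds(readings):          # decision making
            state = cluster_state()                # load balancing (Ganglia)
            target = pick_target(state, HOSTS_WITH_GUESTS)
            if target is not None:
                live_migrate(LOCAL_GUEST, target)  # Xen live migration
                break                              # guest has left this node
        time.sleep(SAMPLE_INTERVAL)
```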
Experimental Setup
• 16-node cluster; each node has 2 GB of main memory and two dual-core AMD Opteron 265 processors, all interconnected by a 1 Gbps Ethernet switch.
• NAS Parallel Benchmarks (NPB).
• How to simulate failures?
What Is the Experiment Testing?
• Recall that HPC clusters experience faults, so checks have to be built into the overall system; this overhead reduces total performance.
• Measure wall-clock time of the system with and without failures.
• Measure performance for various scenarios:
  - Single-node failure – 4 nodes
  - Double-node failure – 4 nodes
  - Scaling of the test system (i.e., the first two scenarios over a larger network and system) – 16 nodes
Multi-Node Tests
• Measure performance as the problem size and network scale up.
• Speedup measured with and without migration.
• One node failure for each test.
Live Migration vs. Stop & Copy
• Comparison of wall-clock execution time.
Conclusions
• Node failures can be reasonably well predicted from health statistics.
• Restarts from checkpoints are avoided for predicted failures.
• Larger problem sizes do not necessarily increase the overhead of migration.
• Live migration has higher overhead than the stop-and-copy scheme but is faster overall, because the application continues to execute during the migration.
• Live migration helps hide the cost of relocating a guest OS and its associated MPI task.
• Checkpoint frequency can be reduced.