Selective Recovery From Failures In A Task Parallel Programming Model
James Dinan*, Sriram Krishnamoorthy#, Arjun Singri*, P. Sadayappan*
*The Ohio State University   #Pacific Northwest National Laboratory
Faults at Scale
• Future systems built from a large number of components
• MTBF inversely proportional to #components
• Faults will be frequent
• Checkpoint-restart too expensive with numerous faults
  • Strain on system components, notably the file system
• Assumption of fault-free operation infeasible
• Applications need to think about faults
Programming Models
• SPMD ties computation to a process
  • Fixed machine model
  • Applications need to change with major architectural shifts
• Fault handling involves non-local design changes
  • Rely on p processes: what if one goes away?
• Message passing makes it harder
  • Consistent cuts are challenging
  • Message logging, etc. is expensive
• Fault management requires a lot of user involvement
Problem Statement
• Fault management framework
  • Minimize user effort
• Components
  • Data state
    • Application data
    • Communication operations
  • Control state
    • What work is each process doing?
• Continue to completion despite faults
Approach
• One-sided communication model
  • Easy to derive consistent cuts
• Task parallel control model
  • Computation decoupled from processes
• User specifies computation
  • Collection of tasks on global data
• Runtime schedules computation
  • Load balancing
  • Fault management
Global Arrays (GA)
[Figure: a global array X[M][M][N] in a logically shared global address space spanning Proc0..Procn, alongside each process's private memory; a patch X[1..9][1..9][1..9] is highlighted]
• PGAS family: UPC (C), CAF (Fortran), Titanium (Java), GA (library)
• Aggregate memory from multiple nodes into a global address space
• Data access via one-sided get(..), put(..), acc(..) operations
• Programmer controls data distribution and locality
• Fully inter-operable with MPI and ARMCI
• Support for higher-level collectives – DGEMM, etc.
• Widely used – chemistry, sub-surface transport, bioinformatics, CFD
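A minimal sketch of this one-sided access style, assuming the standard C bindings of Global Arrays and MPI (MA initialization and error checking omitted for brevity):

    #include <mpi.h>
    #include "ga.h"         /* Global Arrays C interface */
    #include "macdecls.h"   /* C_DBL type constant       */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        GA_Initialize();

        int dims[2]  = {1024, 1024};
        int chunk[2] = {-1, -1};                 /* let GA pick the distribution */
        int g_a = NGA_Create(C_DBL, 2, dims, "X", chunk);

        /* One-sided get of a 2x2 patch into a local buffer; the owning
           process posts no matching receive and does no tag matching.   */
        double buf[4] = {1.0, 2.0, 3.0, 4.0};
        int lo[2] = {0, 0}, hi[2] = {1, 1}, ld[1] = {2};
        NGA_Get(g_a, lo, hi, buf, ld);

        /* One-sided accumulate: X[lo..hi] += alpha * buf (non-idempotent) */
        double alpha = 1.0;
        NGA_Acc(g_a, lo, hi, buf, ld, &alpha);

        GA_Sync();                               /* collective completion point */
        GA_Destroy(g_a);
        GA_Terminate();
        MPI_Finalize();
        return 0;
    }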
GA Memory Model
• Remote memory access
  • Dominant communication in GA programs
  • Destination known in advance
  • No receive operation or tag matching
• Remote progress
  • Ensure overlap
• Atomics and collectives
  • Blocking
  • Few outstanding at any time
Saving Data State
• Data state = communication state + memory state
• Communication state
  • “Flush” pending RMA operations (single call)
  • Save atomic and collective ops (small state)
• Memory state
  • Force other processes to flush their pending ops
• Used in virtualized execution of GA apps (Comp. Frontiers ’09)
• Also enables pre-emptive migration
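A minimal sketch of the "flush, then save" idea, assuming the ARMCI fence/barrier calls that GA is layered on; the helper name and buffers are illustrative, and the real runtime also records the small atomic/collective state noted above:

    #include <string.h>
    #include "armci.h"   /* ARMCI one-sided layer underneath GA */

    /* Checkpoint the locally owned portion of a global array.  local_patch
       and ckpt_buf are illustrative, application-managed buffers. */
    void checkpoint_data_state(const void *local_patch, void *ckpt_buf,
                               size_t nbytes) {
        ARMCI_AllFence();   /* complete this process's pending RMA operations  */
        ARMCI_Barrier();    /* every other process has flushed its pending ops */
        memcpy(ckpt_buf, local_patch, nbytes);   /* memory state is consistent */
    }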
The Asynchronous Gap
• The PGAS memory model simplifies managing data
• The computation model is still regular, process-centric SPMD
• Irregularity in the data can lead to load imbalance
• Extend the PGAS model to bridge the asynchronous gap
  • Dynamic, irregular view of the computation
  • Runtime system should perform load balancing
  • Allow computation movement to exploit locality
Control State – Task Model
[Figure: execution alternates SPMD → task parallel → termination → SPMD]
• Express computation as a collection of tasks
• Tasks operate on data stored in Global Arrays
• Executed in collective task-parallel phases
• Runtime system manages task execution
Task Model
• Inputs: global data, immediates, CLOs
• Outputs: global data, CLOs, child tasks
• Strict dependence: only parent → child (for now)
[Figure: a task in the partitioned global address space reads inputs (e.g., an immediate 5 and Y[0]), runs f(...) via a CLO (CLO1) present on every process, and writes outputs (e.g., X[1]) to shared arrays X and Y distributed over Proc0..Procn]
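One way to picture such a task descriptor in C; the struct and field names below are illustrative, not the actual Scioto data structures:

    #define MAX_REFS 8

    /* Reference to a patch of a Global Array */
    typedef struct {
        int g_a;                /* GA handle    */
        int lo[3], hi[3];       /* patch bounds */
    } ga_ref_t;

    /* Illustrative task descriptor */
    typedef struct task {
        int      id;                  /* task identifier                        */
        int      clo_handle;          /* CLO handle selecting the task body,    */
                                      /* replicated on every process            */
        ga_ref_t in[MAX_REFS];        /* global data read by the task           */
        ga_ref_t out[MAX_REFS];       /* global data written / accumulated      */
        int      n_in, n_out;
        double   immediates[4];       /* small by-value arguments               */
    } task_t;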
Scioto Programming Interface
• High-level interface: shared global task collection
• Low-level interface: set of distributed task queues
• Queues are prioritized by affinity
• Use work-first principle (LIFO)
• Load balancing via work stealing (FIFO)
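A hypothetical sketch of how the high-level interface is used, reusing the illustrative task_t above; the tc_t type and the task_collection_*() / make_task() names are stand-ins, not the actual Scioto API:

    /* Seed tasks into a shared global task collection, then run the
       collective task-parallel phase. */
    void run_task_parallel_phase(int my_first, int my_last) {
        tc_t *tc = task_collection_create();      /* collective call */

        for (int i = my_first; i < my_last; i++) {
            task_t t = make_task(i);              /* application-defined */
            task_collection_add(tc, &t);          /* lands in my local queue */
        }

        /* Runs until global termination; idle processes steal from the
           FIFO end of random victims' queues. */
        task_collection_process(tc);

        task_collection_destroy(tc);
    }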
Work Stealing Runtime System
• ARMCI task queue on each processor
• Steals don’t interrupt the remote process
• When a process runs out of work
  • Select a victim at random and steal work from it
• Scaled to 8192 cores (SC ’09)
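A sketch of the steal path when the local queue runs dry, with hypothetical queue helpers (task_queue_t, queue_empty(), steal_half_from(), termination_detected()); in the real runtime the transfer uses ARMCI one-sided operations, so the victim is never interrupted:

    #include <stdlib.h>

    /* Returns 1 if work was obtained, 0 if the phase has terminated. */
    int acquire_work(task_queue_t *my_q, int nproc, int me) {
        while (queue_empty(my_q)) {
            int victim = rand() % nproc;
            if (victim == me) continue;

            /* One-sided steal: remotely lock the victim's queue metadata,
               copy tasks from the FIFO (oldest) end, unlock. */
            if (steal_half_from(victim, my_q) > 0)
                return 1;

            if (termination_detected())
                return 0;               /* no work left anywhere */
        }
        return 1;                       /* local queue already had work */
    }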
Communication Markers
• Communication initiated by a failed process
  • Handling partial completions
• Get(), Put() are idempotent – ignore
• Acc() non-idempotent
  • Mark beginning and end of acc() ops
• Overhead
  • Memory usage – proportional to #tasks
  • Communication – additional small messages
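A sketch of how a non-idempotent accumulate can be bracketed by markers, assuming hypothetical mark_started()/mark_contributed() helpers that update small per-block metadata with one-sided operations:

    #include "ga.h"
    #include "armci.h"

    /* Bracket a non-idempotent update so recovery can tell whether it
       completed.  block_id identifies the output block's marker entry. */
    void guarded_acc(int g_a, int lo[], int hi[], double *buf, int ld[],
                     double *alpha, int task_id, int block_id) {
        mark_started(block_id, task_id);       /* "may be partially updated" */
        NGA_Acc(g_a, lo, hi, buf, ld, alpha);  /* X[lo..hi] += alpha * buf   */
        ARMCI_AllFence();                      /* ensure remote completion   */
        mark_contributed(block_id, task_id);   /* "the update fully landed"  */
    }

    /* Get()/Put() need no markers: re-executing them is harmless. */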
Fault Tolerant Task Pool
• Re-execute incomplete tasks until a round completes without failures
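A sketch of this outer recovery loop, with hypothetical helpers for the failure flag and for re-seeding the pool with incomplete tasks:

    /* Keep re-processing the tasks whose outputs are missing until an
       entire round finishes with no new failures. */
    void fault_tolerant_phase(tc_t *tc) {
        do {
            clear_failure_flag();
            add_incomplete_tasks(tc);        /* tasks not marked 'contributed' */
            task_collection_process(tc);     /* ordinary work-stealing round   */
        } while (failure_detected_this_round());
    }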
Task Execution
• Update result only if it has not already been modified
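A sketch of the guarded write-back at task completion, again with hypothetical marker helpers: a re-executed task publishes its result only if no earlier execution already did.

    /* Called when a (possibly re-executed) task finishes computing a block. */
    void complete_output(task_t *t, int block_id) {
        if (!already_contributed(block_id)) {
            publish_output(t, block_id);         /* put/acc into the Global Array */
            mark_contributed(block_id, t->id);
        }
        /* else: an earlier execution already modified this block - drop it */
    }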
Detecting Incomplete Communication
• Data with ‘started’ set but not ‘contributed’
• Approach 1: “naïve” scheme
  • Check all markers for any that remain ‘started’
  • Not scalable
• Approach 2: “home-based” scheme
  • Invert the task-to-data mapping
  • Distributed metadata check + all-to-all
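A sketch of the home-based check, assuming an inverted block-to-task marker table distributed so that each process scans only the blocks it owns; marker_t, num_home_blocks() and home_marker() are hypothetical stand-ins, and the final exchange could use MPI_Alltoallv on the gathered counts and ids:

    /* Scan local ('home') marker entries for updates that started but
       never contributed. */
    int find_incomplete_blocks(int *redo_task_ids, int max_redo) {
        int n = 0;
        for (int b = 0; b < num_home_blocks() && n < max_redo; b++) {
            marker_t m = home_marker(b);          /* purely local read */
            if (m.started && !m.contributed)
                redo_task_ids[n++] = m.task_id;   /* inverted mapping: block -> task */
        }
        return n;   /* followed by an all-to-all exchange of ids to re-execute */
    }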
Algorithm Characteristics
• Tolerance to an arbitrary number of failures
• Low overhead in the absence of failures
  • Small messages for markers
  • Can be optimized through pre-issue/speculation
• Space overhead proportional to task pool size
  • Storage for markers
• Recovery cost proportional to #failures
  • Redo work to produce data held by failed processes
Bounding Cascading Failures
• A process with “corrupted” data
  • Incomplete comm. from a failed process
  • Marking it as failed would cascade failures
• Instead, a process with “corrupted” data
  • Flushes its communication, then recovers its data
• Each task computes only a few data blocks
  • Each process has pending comm. to only a few blocks at a time
• Total recovery cost
  • Data on failed processes + a small additional number of blocks
Experimental Setup
• Linux cluster
• Each node
  • Dual quad-core 2.5 GHz Opterons
  • 24 GB RAM
• InfiniBand interconnection network
• Self-Consistent Field (SCF) kernel – 48 Be atoms
• Worst-case fault – at the end of a task pool
Cost of Failure – Strong Scaling
• The number of tasks re-executed goes down as the process count increases
Relative Performance
• Less than 10% cost for one worst-case fault
Related Work
• Checkpoint-restart
  • Continues to handle the SPMD portion of an app
  • Finer-grain recoverability using our approach
• BOINC – client-server
• CilkNOW – single-assignment form
• Linda – requires transactions
• CHARM++
  • Processor-virtualization based
  • Needs message logging
• Efforts on fault-tolerant runtimes
  • Complement this work
Conclusions
• Fault tolerance through
  • PGAS memory model
  • Task parallel computation model
• Fine-grain recoverability through markers
• Cost of failure proportional to #failures
• Demonstrated low-cost recovery for an SCF kernel