
Selective Recovery From Failures In A Task Parallel Programming Model



Presentation Transcript


  1. Selective Recovery From Failures In A Task Parallel Programming Model James Dinan*, Sriram Krishnamoorthy#, Arjun Singri*, P. Sadayappan* (*The Ohio State University, #Pacific Northwest National Laboratory)

  2. Faults at Scale • Future systems built with a large number of components • MTBF inversely proportional to the number of components • Faults will be frequent • Checkpoint-restart too expensive with numerous faults • Strain on system components, notably the file system • Assumption of fault-free operation infeasible • Applications need to think about faults

  3. Programming Models • SPMD ties computation to a process • Fixed machine model • Applications need to change with major architectural shifts • Fault handling involves non-local design changes • Rely on p processes: what if one goes away? • Message passing makes it harder • Consistent cuts are challenging • Message logging, etc. expensive • Fault management requires a lot of user involvement

  4. Problem Statement • Fault management framework • Minimize user effort • Components • Data state • Application data • Communication operations • Control state • What work is each process doing? • Continue to completion despite faults

  5. Approach • One-sided communication model • Easy to derive consistent cuts • Task parallel control model • Computation decoupled from processes • User specifies computation • Collection of tasks on global data • Runtime schedules computation • Load balancing • Fault management

  6. Global Arrays (GA) [Figure: global address space for X[M][M][N] spanning Proc0…Procn, with a shared portion and a private portion on each process] • PGAS Family: UPC (C), CAF (Fortran), Titanium (Java), GA (library) • Aggregate memory from multiple nodes into global address space • Data access via one-sided get(..), put(..), acc(..) operations • Programmer controls data distribution and locality • Fully inter-operable with MPI and ARMCI • Support for higher-level collectives – DGEMM, etc. • Widely used – chemistry, sub-surface transport, bioinformatics, CFD
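To make the one-sided access pattern concrete, here is a minimal sketch using the standard GA C bindings (ga.h/macdecls.h); the array shape and patch bounds are illustrative, not taken from the talk.

```c
#include <mpi.h>
#include "ga.h"
#include "macdecls.h"

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    GA_Initialize();

    int dims[2]  = {1024, 1024};
    int chunk[2] = {-1, -1};                  /* let GA choose the distribution */
    int g_x = NGA_Create(C_DBL, 2, dims, "X", chunk);
    GA_Zero(g_x);

    /* One-sided read of a 10x10 patch: no receive or tag matching on the owner */
    double buf[10][10];
    int lo[2] = {0, 0}, hi[2] = {9, 9}, ld[1] = {10};
    NGA_Get(g_x, lo, hi, buf, ld);

    /* One-sided accumulate: X[lo..hi] += alpha * buf (non-idempotent update) */
    double alpha = 1.0;
    NGA_Acc(g_x, lo, hi, buf, ld, &alpha);

    GA_Sync();                                /* collective: completes outstanding ops */
    GA_Destroy(g_x);
    GA_Terminate();
    MPI_Finalize();
    return 0;
}
```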

  7. GA Memory Model • Remote memory access • Dominant communication in GA programs • Destination known in advance • No receive operation or tag matching • Remote Progress • Ensure overlap • Atomics and collectives • Blocking • Few outstanding at any time

  8. Saving Data State • Data state = communication state + memory state • Communication state • “Flush” pending RMA operations (single call) • Save atomic and collective ops (small state) • Memory state • Force other processes to flush their pending ops • Used in virtualized execution of GA apps (Comp. Frontiers’09) • Also enables pre-emptive migration
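A hedged sketch of the flush-then-snapshot idea: the calling process completes its pending one-sided operations with a single fence, a barrier ensures every other process has done the same, and only then is local memory copied. ARMCI_AllFence and GA_Sync are the usual ARMCI/GA calls; the checkpoint_region helper and its arguments are hypothetical.

```c
#include <string.h>
#include "armci.h"
#include "ga.h"

/* Hypothetical helper: snapshot this process's share of the global data.
 * 'owned' points at the locally owned portion of a Global Array. */
static void checkpoint_region(const void *owned, void *shadow, size_t nbytes) {
    /* 1. Flush all one-sided operations this process has issued (single call). */
    ARMCI_AllFence();

    /* 2. Barrier: every other process must also have flushed, so no put/acc
     *    is still in flight toward 'owned' while we copy it.                  */
    GA_Sync();

    /* 3. The memory state is now consistent: copy it into the shadow buffer.  */
    memcpy(shadow, owned, nbytes);
}
```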

  9. The Asynchronous Gap • The PGAS memory model simplifies managing data • Computation model is still regular, process-centric SPMD • Irregularity in the data can lead to load imbalance • Extend the PGAS model to bridge the asynchronous gap • Dynamic, irregular view of the computation • Runtime system should perform load balancing • Allow for computation movement to exploit locality [Figure: one-sided get(…) of a patch of X[M][M][N] from the global address space]

  10. Control State – Task Model [Figure: execution alternates between SPMD phases and task-parallel phases that end with termination detection] • Express computation as collection of tasks • Tasks operate on data stored in Global Arrays • Executed in collective task parallel phases • Runtime system manages task execution

  11. Task Model • Inputs: global data, immediates, CLOs • Outputs: global data, CLOs, child tasks • Strict dependence: only parent → child (for now) [Figure: a task in the partitioned global address space reads shared data (In: 5, Y[0], …), applies f(...) via a common local object (CLO), and writes its output (Out: X[1])]
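As a rough illustration of such a task descriptor, the C struct below names the pieces listed on this slide; the field names and sizes are invented for illustration and are not Scioto's actual layout.

```c
#define MAX_REGIONS 4

/* Reference to a patch of a Global Array: a task's global input or output. */
typedef struct {
    int g_a;            /* GA handle    */
    int lo[3], hi[3];   /* patch bounds */
} ga_region_t;

/* Illustrative task descriptor: inputs are global data, immediates and a
 * common local object (CLO); outputs are global data, CLOs and child tasks. */
typedef struct {
    int         clo_id;              /* handle of the registered task body/CLO */
    ga_region_t in[MAX_REGIONS];     /* global data read by the task           */
    ga_region_t out[MAX_REGIONS];    /* global data written by the task        */
    int         n_in, n_out;
    double      immediates[8];       /* small by-value arguments               */
} task_desc_t;
```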

  12. Scioto Programming Interface • High-level interface: shared global task collection • Low-level interface: set of distributed task queues • Queues are prioritized by affinity • Use work-first principle (LIFO) • Load balancing via work stealing (FIFO)
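The high-level interface can be pictured roughly as the prototypes below; the tc_* names and signatures are hypothetical stand-ins for illustration, not the published Scioto API.

```c
#include <stddef.h>

/* Hypothetical shared-task-collection interface, loosely in the spirit of the
 * high-level model on this slide (interface sketch only, no implementation). */
typedef struct task_collection task_collection_t;
typedef void (*task_fn_t)(task_collection_t *tc, void *args);

task_collection_t *tc_create(size_t max_task_size);      /* collective        */
int  tc_register(task_collection_t *tc, task_fn_t body); /* register task/CLO */
void tc_add(task_collection_t *tc, int body, const void *args, size_t len);
void tc_process(task_collection_t *tc);  /* collective phase: run to quiescence */
void tc_destroy(task_collection_t *tc);

/* Typical usage: every process seeds its own tasks, then enters the phase. */
void run_phase(task_collection_t *tc, int body, int my_first, int my_last) {
    for (int i = my_first; i <= my_last; i++)
        tc_add(tc, body, &i, sizeof i);  /* enqueued locally, executed LIFO (work first) */
    tc_process(tc);                      /* load balanced by work stealing (FIFO steals) */
}
```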

  13. Work Stealing Runtime System • ARMCI-based task queue on each processor • Steals don’t interrupt the remote process • When a process runs out of work, it selects a victim at random and steals work from it • Scaled to 8192 cores (SC’09)
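A minimal sketch of the steal loop described here, assuming illustrative queue operations (pop_local, steal_from, have_terminated are invented names); in the real runtime the steal is a one-sided ARMCI operation on the victim's queue, which is why the victim is never interrupted.

```c
#include <stdlib.h>

typedef struct task task_t;

/* Illustrative queue operations: local work is taken LIFO (work first),
 * steals take tasks from the opposite, FIFO end of the victim's queue. */
task_t *pop_local(void);
task_t *steal_from(int victim);   /* one-sided: does not interrupt the victim */
int     have_terminated(void);    /* global termination detection             */
void    execute(task_t *t);

void worker_loop(int nproc, int me) {
    while (!have_terminated()) {
        task_t *t = pop_local();            /* run own work first (LIFO) */
        if (t) { execute(t); continue; }

        /* Out of work: pick a victim at random and try to steal from it. */
        int victim = rand() % nproc;
        if (victim == me) continue;
        t = steal_from(victim);
        if (t) execute(t);
    }
}
```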

  14. Communication Markers • Communication initiated by a failed process • Handling partial completions • Get(), Put() are idempotent – ignore • Acc() non-idempotent • Mark beginning and end of acc() ops • Overhead • Memory usage – proportional to # tasks • Communication – additional small messages
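One way to realize the markers, sketched with an extra integer Global Array holding one flag per output block; the flag values, set_flag helper, and guarded_acc wrapper are illustrative, not the paper's exact implementation.

```c
#include "ga.h"

enum { FLAG_CLEAR = 0, FLAG_STARTED = 1, FLAG_CONTRIBUTED = 2 };

/* g_flags: an integer Global Array with one marker per output block. */
static void set_flag(int g_flags, int blk, int value) {
    int lo[1] = { blk }, hi[1] = { blk }, ld[1] = { 1 };
    NGA_Put(g_flags, lo, hi, &value, ld);
}

/* Bracket a non-idempotent accumulate into block 'blk' of g_x with markers. */
static void guarded_acc(int g_x, int g_flags, int blk,
                        int lo[], int hi[], double *buf, int ld[]) {
    double alpha = 1.0;

    GA_Init_fence();
    set_flag(g_flags, blk, FLAG_STARTED);     /* "started": an acc may be in flight */
    GA_Fence();                               /* marker is visible before the acc   */

    GA_Init_fence();
    NGA_Acc(g_x, lo, hi, buf, ld, &alpha);    /* the non-idempotent update          */
    GA_Fence();                               /* acc has completed at the target    */

    set_flag(g_flags, blk, FLAG_CONTRIBUTED); /* "contributed": data can be trusted */
}
```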

  15. Fault Tolerant Task Pool Re-execute incomplete tasks until a round completes without failures
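The recovery driver can be sketched as below, reusing the hypothetical tc_* interface from the earlier sketch; rebuild_task_pool, failures_detected_this_round and recover_corrupted_blocks are invented names for the steps this slide describes.

```c
/* Hypothetical helpers named after the steps on this slide. */
void rebuild_task_pool(task_collection_t *tc); /* re-add tasks with incomplete outputs */
int  failures_detected_this_round(void);
void recover_corrupted_blocks(void);           /* flush, then restore affected data    */

/* Keep re-executing incomplete tasks until one round finishes with no failure. */
void ft_process(task_collection_t *tc) {
    int failures;
    do {
        rebuild_task_pool(tc);
        tc_process(tc);                        /* one collective task-parallel round */
        failures = failures_detected_this_round();
        if (failures)
            recover_corrupted_blocks();
    } while (failures);
}
```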

  16. Task Execution Update result only if it has not already been modified
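Read together with the marker sketch above, "update only if it has not already been modified" could look like the following; a real implementation would need an atomic check-and-set on the flag, so this only shows the control flow.

```c
/* Commit a re-executed task's output block only if no earlier execution
 * already contributed it (flags and guarded_acc from the marker sketch). */
static void commit_if_fresh(int g_x, int g_flags, int blk,
                            int lo[], int hi[], double *buf, int ld[]) {
    int flag, flo[1] = { blk }, fhi[1] = { blk }, fld[1] = { 1 };
    NGA_Get(g_flags, flo, fhi, &flag, fld);

    if (flag == FLAG_CONTRIBUTED)
        return;                    /* result already produced: skip the update */

    guarded_acc(g_x, g_flags, blk, lo, hi, buf, ld);   /* marked accumulate */
}
```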

  17. Detecting Incomplete Communication • Data with ‘started’ set but not ‘contributed’ • Approach 1: “Naïve” scheme • Check all markers for any that remain ‘started’ • Not scalable • Approach 2: “Home-based” scheme • Invert the task-to-data mapping • Distributed meta-data check + all-to-all
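A sketch of the home-based check: each process scans only the marker flags for the blocks whose metadata it owns, collects those stuck at 'started', and the resulting lists are exchanged (e.g. via an all-to-all) so everyone knows which blocks to recompute. The function and variable names are illustrative.

```c
#include "ga.h"

/* Home-based scan: each process checks only the markers for the blocks whose
 * metadata it owns, so the check is distributed rather than a global sweep.  */
static int find_corrupted_blocks(int g_flags, int my_first_blk, int my_last_blk,
                                 int *corrupted, int max_out) {
    int n = 0;
    for (int blk = my_first_blk; blk <= my_last_blk && n < max_out; blk++) {
        int flag, lo[1] = { blk }, hi[1] = { blk }, ld[1] = { 1 };
        NGA_Get(g_flags, lo, hi, &flag, ld);   /* a local access for the home process */
        if (flag == FLAG_STARTED)              /* started but never contributed       */
            corrupted[n++] = blk;
    }
    /* The per-process lists are then exchanged (all-to-all) so every process
     * knows which blocks' producing tasks must be re-executed.               */
    return n;
}
```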

  18. Algorithm Characteristics • Tolerance to arbitrary number of failures • Low overhead in absence of failures • Small messages for markers • Can be optimized through pre-issue/speculation • Space overhead proportional to task pool size • Storage for markers • Recovery cost proportional to #failures • Redo work to produce data in failed processes

  19. Bounding Cascading Failures • A process with “corrupted” data: incomplete communication from a failed process • Marking it as failed would cascade failures • Instead, it flushes its communication, then recovers its data • Each task computes only a few data blocks • Each process has pending communication to only a few blocks at a time • Total recovery cost: data in failed processes + a small additional amount

  20. Experimental Setup • Linux cluster • Each node: dual quad-core 2.5 GHz Opterons, 24 GB RAM • InfiniBand interconnection network • Self-Consistent Field (SCF) kernel – 48 Be atoms • Worst-case fault – at the end of a task pool

  21. Cost of Failure – Strong Scaling Number of tasks re-executed goes down as the process count increases

  22. Worst Case Failure Cost

  23. Relative Performance Less than 10% cost for one worst-case fault

  24. Related Work • Checkpoint restart • Continues to handle the SPMD portion of an app • Finer-grain recoverability using our approach • BOINC – client-server • CilkNOW – single-assignment form • Linda – requires transactions • CHARM++ • Based on processor virtualization • Needs message logging • Efforts on fault-tolerant runtimes • Complement this work

  25. Conclusions • Fault tolerance through • PGAS memory model • Task parallel computation model • Fine-grain recoverability through markers • Cost of failure proportional to #failures • Demonstrated low cost recovery for an SCF kernel

  26. Thank You!
