Object-based Over-Decomposition Can Enable Powerful Fault Tolerance Schemes
Laxmikant (Sanjay) Kale
http://charm.cs.uiuc.edu
Parallel Programming Laboratory, Department of Computer Science
University of Illinois at Urbana-Champaign
Approaches to Fault Tolerance
• Level:
  • Application-level, system-level, runtime-level
  • Role of compiler support?
• Categorization by:
  • Objectives: increase progress rate, eliminate failure
  • {Coordinated, uncoordinated} x {user-selected time, any-time}
  • Rollback, redundancy, migrate to avoid…
  • Proactive vs. reactive
• Is it important that the non-failed nodes don’t have to go back?
• Utility of message logging
• I/O system design
  • How to exploit the special nature of checkpoint traffic
• OS interaction/specialization
  • Page size
  • I/O threads, SMT
• How much forward-path overhead are the apps folks willing to live with?
• When/where is FT “needed”?
Dagstuhl Fault Tolerance Workshop, 9/1/2014
LEVEL
• App PRO:
  • “Good” apps already do it (may need it anyway for short queue slots, viz/debug, etc.)
  • “Minimal” data
  • Portable and multi-purpose (e.g. viz/debug)
• App CON:
  • Not preemptive in general
  • Library or module stacking a problem?
  • Return to end of queue (at least currently)
• Sys PRO:
  • Preemptive
  • Transparent, including lack of sources
• Sys CON:
  • “Bounding box” of data
  • Non-portable and single-purpose
• Runtime PRO:
  • Preemptive, adaptive
  • “Stacking” problem may be less of an issue
• Runtime CON:
  • Non-transparent if you want minimal data
Objectives
• HPC
  • App must make progress (produce a result)
  • End-to-end or “total time to solution” (efficiency)
  • Cost in terms of a finite allocation
• Distributed systems
  • Must not “stop/fail” (degrade instead)
• Grids [varying viewpoints]
  • Optimize throughput
  • Preserve the workflow, perhaps without any process-level FT
  • May use FT in the presence of deadlines and/or “large” tasks
  • Realtime? (experimental data is non-reproducible)
• P2P (volunteer)
  • Progress/service in the presence of a dynamic PE pool
  • Protect against loss/corruption of data
Coordinated or not?
• Historically we have thought coordinated was too expensive
• We see data here that suggests the complexity of uncoordinated is prohibitive
• “Scalability” is a dirty word
  • We don’t all think it means the same thing
• Uncoordinated may contribute more to “noise” (but background I/O is still “noise” in any scheme)
Final comments/thoughts
• FT via checkpoint/restart (rollback-recovery)
  • Avoid “all go back” (in degrees)
  • Really a good thing if it avoids the requeue (job pause)
  • If you don’t even need a job pause, avoid waste of power, etc.
• OS interaction/specialization
  • Sanjay concerned about large pages. BLCR can/will use 4 KB for “differential” checkpoints regardless of the “real” page size.
  • Frank wants better “hooks” w.r.t. page tables, etc.
  • Higher-reliability kernels == better? (at what cost?)
  • People are still writing new kernels
  • Linux w/ microkernels
• Filesystem state (snapshots, message logging, VMs do block-device snapshots, transactional filesystems, what else?)