Object-based Over-Decomposition Can Enable Powerful Fault Tolerance Schemes
Laxmikant (Sanjay) Kale
http://charm.cs.uiuc.edu
Parallel Programming Laboratory, Department of Computer Science
University of Illinois at Urbana-Champaign
Approaches to Fault Tolerance
• Level:
  • Application-level, system-level, runtime-level
  • Role of compiler support?
• Categorization by:
  • Objectives: increase progress rate, eliminate failure
  • {Coordinated, uncoordinated} x {user-selected time, any-time}
  • Rollback, redundancy, migrate to avoid…
  • Proactive vs. reactive
• Is it important that the non-failed nodes don’t have to go back?
• Utility of message logging
• I/O system design
  • How to exploit the special nature of checkpoint traffic
• OS interaction/specialization
  • Page size
  • I/O threads, SMT
• How much forward-path overhead are the apps folks willing to live with?
• When/where is FT “needed”?
Dagstuhl Fault Tolerance Workshop, 9/1/2014
LEVEL
• App PRO:
  • “Good” apps already do it (may need it anyway for short queue slots, viz/debug, etc.)
  • “Minimal” data
  • Portable and multi-purpose (e.g. viz/debug)
• App CON:
  • Not preemptive in general
  • Library or module stacking a problem?
  • Return to end of queue (at least currently)
• Sys PRO:
  • Preemptive
  • Transparent, including lack of sources
• Sys CON:
  • “Bounding box” of data
  • Non-portable and single-purpose
• Runtime PRO:
  • Preemptive, adaptive
  • “Stacking” problem may be less of an issue
• Runtime CON:
  • Non-transparent if you want minimal data
Objectives
• HPC
  • App must make progress (produce a result)
  • End-to-end or “total time to solution” (efficiency)
  • Cost in terms of a finite allocation
• Distributed systems
  • Must not “stop/fail” (degrade instead)
• Grids [varying viewpoints]
  • Optimize throughput
  • Preserve the workflow, perhaps without any process-level FT
  • May use FT in the presence of deadlines and/or “large” tasks
  • Realtime? (experimental data is non-reproducible)
• P2P (volunteer)
  • Progress/service in the presence of a dynamic PE pool
  • Protect against loss/corruption of data
Coordinated or not?
• Historically we have thought coordinated was too expensive
• We see data here that suggests the complexity of uncoordinated is prohibitive
• “Scalability” is a dirty word
  • We don’t all think it means the same thing
• Uncoordinated may contribute more to “noise” (but background I/O is still “noise” in any scheme)
Final comments/thoughts
• FT via checkpoint/restart (rollback-recovery)
  • Avoid “all go back” (in degrees)
  • Really a good thing if it avoids the requeue (job pause)
  • If you don’t even need a job pause, avoid waste of power, etc.
• OS interaction/specialization
  • Sanjay concerned about large pages. BLCR can/will use 4 KB for “differential” checkpoints regardless of the “real” page size.
  • Frank wants better “hooks” w.r.t. page tables, etc.
  • Higher-reliability kernels == better? (at what cost?)
  • People are still writing new kernels
  • Linux w/ microkernels
• Filesystem state (snapshots, message logging, VMs do block-device snapshots, transactional filesystems, what else?)