
Object-based Over-Decomposition Can Enable Powerful Fault Tolerance Schemes

Laxmikant (Sanjay) Kale, http://charm.cs.uiuc.edu, Parallel Programming Laboratory, Department of Computer Science, University of Illinois at Urbana-Champaign.


Presentation Transcript


  1. Object-based Over-Decomposition Can Enable Powerful Fault Tolerance Schemes
     Laxmikant (Sanjay) Kale
     http://charm.cs.uiuc.edu
     Parallel Programming Laboratory
     Department of Computer Science
     University of Illinois at Urbana-Champaign

  2. Approaches to Fault tolerance
     • Level:
       • Application-level, system-level, runtime-level
       • Role of compiler support?
     • Categorization by
       • Objectives: increase progress rate, eliminate failure
       • {Coordinated, uncoordinated} x {user-selected time, any-time}
       • Rollback, redundancy, migrate to avoid…
       • Proactive vs. reactive
     • Is it important that the non-failed nodes don’t have to go back?
       • Utility of message logging
     • I/O system design
       • How to exploit the special nature of checkpoint traffic
     • OS interaction/specialization
       • Page size
       • I/O threads, SMT
     • How much forward-path overhead are the apps folk willing to live with?
     • When/where is FT “needed”?
     Dagstuhl Fault Tolerance Workshop, 9/1/2014

  3. LEVEL
     • App PRO:
       • “Good” apps already do it (may need it anyway for short queue slots, viz/debug, etc.)
       • “Minimal” data
       • Portable and multi-purpose (e.g. viz/debug)
     • App CON:
       • Not preemptive in general
       • Lib or module stacking a problem?
       • Return to end of queue (at least currently)
     • Sys PRO:
       • Preemptive
       • Transparent, including lack of sources
     • Sys CON:
       • “Bounding box” of data
       • Non-portable and single-purpose
     • Runtime PRO:
       • Preemptive, adaptive
       • “Stacking” problem may be less of an issue
     • Runtime CON:
       • Non-transparent if you want minimal data

  4. Objectives
     • HPC
       • App must make progress (produce a result)
       • End-to-end or “total time to solution” (efficiency)
       • Cost in terms of finite allocation
     • Distributed systems
       • Must not “stop/fail” (degrade instead)
     • Grids [varying viewpoints]
       • Optimize throughput
       • Preserve the workflow, perhaps w/o any process-level FT
       • May use FT in presence of deadlines and/or “large” tasks
     • Realtime? (experimental data is non-reproducible)
     • P2P (volunteer)
       • Progress/service in presence of dynamic PE pool
       • Protect against loss/corruption of data
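     The “total time to solution” objective can be made concrete with a small back-of-the-envelope model. The sketch below is not from the slides: it assumes periodic checkpoints of fixed cost, failures arriving at a constant rate, and the usual first-order regime where checkpoint cost is much smaller than the interval and the interval is much smaller than the mean time between failures. Restart cost is ignored. The function names and the example numbers are illustrative only; the square-root rule is the classic first-order approximation often credited to Young.

```python
import math


def efficiency(interval: float, checkpoint_cost: float, mtbf: float) -> float:
    """First-order fraction of wall-clock time spent on useful work.

    Assumes checkpoint_cost << interval << mtbf and ignores restart cost.
    """
    # Forward-path overhead: one checkpoint per interval of work.
    overhead = checkpoint_cost / interval
    # Rollback loss: failures arrive at rate 1/mtbf and each one throws away,
    # on average, half an interval of work.
    rework = interval / (2.0 * mtbf)
    return 1.0 - overhead - rework


def first_order_optimal_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Interval that minimizes overhead + rework in the model above."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


if __name__ == "__main__":
    # Hypothetical numbers: 60 s per checkpoint, one failure per day.
    cost, mtbf = 60.0, 24 * 3600.0
    best = first_order_optimal_interval(cost, mtbf)
    for interval in (best / 2, best, best * 2):
        print(f"interval={interval:8.0f} s  "
              f"efficiency={efficiency(interval, cost, mtbf):.4f}")
```

     Under these assumptions, checkpointing either more or less often than the square-root interval lowers the useful-work fraction, which is one way to frame the forward-path overhead question raised on slide 2.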

  5. Coordinated or not?
     • Historically we have thought coordinated was too expensive
     • We see data here that suggests that the complexity of uncoordinated is prohibitive
     • “Scalability” is a dirty word
       • We don’t all think it means the same thing
     • Uncoordinated may contribute more to “noise” (but background I/O is still “noise” in any scheme)

  6. Final comments/thoughts
     • FT via checkpoint/restart (rollback-recovery)
       • Avoid “all go back” (degrees)
       • Really a good thing if it avoids the requeue (job pause)
       • If we don’t even need a job pause, we avoid waste of power, etc.
     • OS interaction/specialization
       • Sanjay is concerned about large pages. BLCR can/will use 4 KB for “differential” regardless of the “real” size.
       • Frank wants better “hooks” w.r.t. page tables, etc.
       • Higher-reliability kernels == better? (at what cost)
       • People are still writing new kernels
       • Linux w/i microkernels
     • Filesystem state (snapshots, message logging, VMs do block device snapshots, transactional fs, what else?)
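     The large-page concern on slide 6 is easier to see with a toy example: if change tracking happens at 4 KB granularity, a one-byte update inside a 2 MB region costs one 4 KB block in the differential checkpoint rather than the whole region. The sketch below is only an illustration of that idea and is not BLCR's mechanism; it detects changed blocks by comparing SHA-256 digests, whereas a real checkpointer would more plausibly rely on kernel dirty-page tracking. All names are hypothetical, and the region is assumed to keep a fixed size between checkpoints.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed 4 KB tracking granularity, independent of page size


def block_digests(memory: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Digest of every fixed-size block in the region."""
    return [hashlib.sha256(memory[i:i + block_size]).digest()
            for i in range(0, len(memory), block_size)]


def differential_checkpoint(previous_digests: list, memory: bytes,
                            block_size: int = BLOCK_SIZE):
    """Return (delta, digests); delta maps block index -> changed block bytes."""
    digests = block_digests(memory, block_size)
    delta = {}
    for idx, digest in enumerate(digests):
        # A block is saved if it is new or its digest differs from last time.
        if idx >= len(previous_digests) or previous_digests[idx] != digest:
            start = idx * block_size
            delta[idx] = memory[start:start + block_size]
    return delta, digests


def apply_delta(base: bytes, delta: dict, block_size: int = BLOCK_SIZE) -> bytes:
    """Rebuild the region from a full base image plus one differential."""
    restored = bytearray(base)
    for idx, block in delta.items():
        start = idx * block_size
        restored[start:start + len(block)] = block
    return bytes(restored)


if __name__ == "__main__":
    region = bytearray(2 * 1024 * 1024)        # stands in for one 2 MB large page
    base = bytes(region)
    _, digests = differential_checkpoint([], base)  # first checkpoint is full
    region[5000] = 0xFF                        # touch a single byte
    delta, _ = differential_checkpoint(digests, bytes(region))
    print(f"blocks saved: {sorted(delta)}; "
          f"bytes saved: {sum(len(b) for b in delta.values())}")
    assert apply_delta(base, delta) == bytes(region)
```

     Hashing every block still reads the whole region on each checkpoint, so this toy only reduces the amount of data written, not the scan cost.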
