30 likes | 195 Views
Priority Research Direction (use one slide for each). Key challenges. Summary of research direction. barriers and gaps: -Cope with a continuous flow of different errors-faults including soft errors (silent or not) -Current techniques (ckpt/rest) will not scale
E N D
Priority Research Direction (use one slide for each) Key challenges Summary of research direction barriers and gaps:-Cope with a continuous flow of different errors-faults including soft errors (silent or not) -Current techniques (ckpt/rest) will not scale -Limits of Storage and file systems -Software stack is not fault aware -Provide Verification of large, long time scale simulation -Fault understanding (RAS), modeling, prediction -Fault isolation/confinement + local management -Resilience <--> new tech (Flash mem, virtualization) -Resilience <--> power consumption |heterogeneity -Resilient Storage and file systems -Extending applicability of checkpoint -Replication (backup core) <--> Rollback recovery -Fault recovery <--> Fault avoidance (migration) -Transparent <--> Application guided -Resilience and programming/execution models -Resilient apps. & algo. possibly with OS support -Language / compiler support for resilience -Experimental env. to stress & compare solutions Potential impact on software component Better consistency in error/fault management across software layers [S] System + Application interactions to manage errors and fault [M] Naturally Resilient system [M] and application software [L] Potential impact on usability, capability, and breadth of community -Resilience is a key issue for the Exascale community -Enable tightly coupled applications to run longer -Better resilience will provide better efficiency (full system)
4.x Resilience Resilience is a critical issue to achieve high apps. throughput MTBF=day MTBF=<10h (based on DARPA report) MTBF=<1h MTBF=<10m MTBF=<1h MTBF=10h Terms: Short Medium Long Fault oblivious Applications Fault Repair Fault Avoidance Application should be able to dynamically handle errors Improved hardware and software reliability -- better RAS collection and analysis (root cause) -- Integration Extend applicability of checkpointing -- IO caching (e.g., NAND) -- New FT protocols System level fault-tolerance -- prediction for time optimal checkpointing and migration -- isolation and local recovery/management Net Throughput All software shouldbe fault aware and consistent 10 Peta 100 Peta 1 Exa 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
4.x Resilience • Technology drivers -Increase of the number of errors, variety of errors. -Huge increase of components and threads, Power management, New hardware (Flash Mem., Accel., ) -Increase of the data size, limit of centralized I/O, higher potential bandwidth of local storage. • Alternative R&D strategies -Fault recovery <--> Fault avoidance (migration)-Transparent <--> Application directed-Replication (backup core) <--> Rollback Recovery (replicate locally and restart globally?) • Recommended research agenda -Fault understanding (RAS analysis), modeling, prediction [S-M] -Fault isolation/confinement + local management [M] -Virtualization [S] -Extending the applicability of Rollback recovery (reducing ckpt size, caching, scalable FT protocols) [S] -Resilient Storage and file systems [S-L] -Resilience and programming/execution models (MW, Map Reduce, Transactions) [M-L] -Language / compiler support for resilience [M] -Resilient apps & algorithms (forward recovery, NFTA, ABFT) possibly with OS support [L] -Experimental environment to stress envisioned solutions [M] • Crosscutting considerations -Resilience <--> power management, performance (fault free situation and when faults occur) -scalability, programmability,