Fault Tolerance and the Common Component Architecture
David E. Bernholdt, ORNL
Center for Improvement of Fault Tolerance in Systems (CIFTS)
• Participants: Institution – PI
  • ANL – Beckman (Lead PI)
  • Indiana U – Lumsdaine
  • LBNL – Hargrove
  • Ohio State U – Panda
  • ORNL – Geist
  • U Tennessee – Dongarra
• Submitted as SciDAC 2 CET
• Funded as base program
• Also known as FOBAWS, Faulty
Fault Tolerance Backplane (FTB)
• The core idea of CIFTS
• Event service to convey fault information throughout the software stack and the machine
  • Hardware sensors to OS/runtime to libraries to applications
• FTB components may generate or consume fault-related events
  • Prediction, adaptation, response
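The generate/consume model on this slide can be illustrated with a small publish/subscribe sketch. This is not the FTB API: the `Backplane` class, the `FaultEvent` fields, and the event name `node.temperature.high` are all hypothetical, chosen only to show one event from a low layer reaching consumers in higher layers.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, DefaultDict, List


@dataclass(frozen=True)
class FaultEvent:
    source: str    # layer that generated the event (hypothetical label)
    name: str      # hypothetical event name
    severity: str  # hypothetical levels such as "info", "warning", "fatal"


class Backplane:
    """Toy in-process stand-in for an event backplane (not the real FTB)."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[
            str, List[Callable[[FaultEvent], None]]
        ] = defaultdict(list)

    def subscribe(self, name: str, handler: Callable[[FaultEvent], None]) -> None:
        # Register a consumer for one event name.
        self._subscribers[name].append(handler)

    def publish(self, event: FaultEvent) -> int:
        # Deliver the event to every consumer of that name; return the count.
        handlers = list(self._subscribers.get(event.name, []))
        for handler in handlers:
            handler(event)
        return len(handlers)


backplane = Backplane()
seen_by_library = []
seen_by_application = []

# Two consumers at different levels of the stack listen for the same event.
backplane.subscribe("node.temperature.high", seen_by_library.append)
backplane.subscribe("node.temperature.high", seen_by_application.append)

# A lower layer generates the event once; both consumers receive it.
delivered = backplane.publish(
    FaultEvent("hardware-sensor", "node.temperature.high", "warning")
)
print(delivered, seen_by_application[0].source)
```

In this sketch a component can be a generator, a consumer, or both, which is the property the slide attributes to FTB components; a real backplane would also need to cross process and node boundaries.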
Planned Areas of Activity
• ANL (Beckman): Parallel file systems; MPI, MPI-IO; Linux; Scheduler/resource manager
• Indiana U (Lumsdaine): MPI, MPI-IO
• LBNL (Hargrove): Checkpoint/restart
• Ohio State U (Panda): Interconnect
• ORNL (Geist): CCA integration; Applications (chemistry, fusion)
• U Tennessee (Dongarra): ScaLAPACK, math libraries
CCA Integration
• CCA components should be able to consume or generate FTB events
  • Adapter between FTB and CCA event service
• Main focus
  • CCA applications will be consumers of FTB information
• Interesting secondary possibilities
  • Allow CCA components to plug into FTB to provide adaptation/response services
  • FTB could be derived from CCA event service
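The adapter idea can be sketched as a relay that subscribes on one event service and republishes on the other. Everything here is an assumption made for illustration: `SimpleEventService` stands in for both FTB and the CCA event service, and neither its methods nor the `ftb.` topic prefix come from either project. Only the FTB-to-CCA direction is shown, matching the main focus of CCA applications as consumers.

```python
from typing import Callable, Dict, List

Handler = Callable[[dict], None]


class SimpleEventService:
    """Toy topic-based event service (hypothetical, not FTB or CCA code)."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Handler]] = {}

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._handlers.setdefault(topic, []).append(handler)

    def publish(self, topic: str, payload: dict) -> None:
        for handler in list(self._handlers.get(topic, [])):
            handler(payload)


class FtbToCcaAdapter:
    """Relays selected topics from the FTB side to the CCA side."""

    def __init__(self, ftb: SimpleEventService, cca: SimpleEventService,
                 prefix: str = "ftb.") -> None:
        self._ftb = ftb
        self._cca = cca
        self._prefix = prefix  # hypothetical naming scheme on the CCA side

    def forward(self, topic: str) -> None:
        def relay(payload: dict) -> None:
            # Copy the payload and tag its origin before republishing.
            self._cca.publish(self._prefix + topic, dict(payload, origin="ftb"))

        self._ftb.subscribe(topic, relay)


ftb = SimpleEventService()
cca = SimpleEventService()
adapter = FtbToCcaAdapter(ftb, cca)
adapter.forward("node.failed")

# A CCA component only talks to the CCA event service.
received = []
cca.subscribe("ftb.node.failed", received.append)

ftb.publish("node.failed", {"node": "n042"})
print(received)
```

Letting CCA components generate FTB events would need a second relay in the opposite direction, plus some guard against an event looping between the two services.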
Summer Plans
• MCMD architecture for SWIM Integrated Plasma Simulator (IPS) to be developed this summer by Samantha Foley (IU)
  • Not CCA-compliant, but will provide use case and prototype implementation for CCA MCMD discussions
• Demonstration of FT in MCMD IPS to be developed by Aniruddha Shet
  • Focus on a few simple MCMD-relevant events
  • Use IPS event service (FTB not designed)
Discussion: What FT Events are Relevant to CCA?
• Presumably many FT events are interesting to CCA applications and utility components
• What FT events are relevant to CCA itself?
  • framework
  • parallel
  • distributed
  • services
  • …