Report on 2002 Fault Tolerance Workshop
Patricia D. Hough
Computational Sciences and Mathematics Research Department
Sandia National Laboratories

Motivation
• Large COTS systems are prone to failures
  • Lots of parts; complex configurations
  • Applications stress the systems
• Few options for application survival
• University resources are untapped
  • DOE researchers unfamiliar with fault tolerance experts
  • University researchers unfamiliar with DOE problem domain
Goal: Bring laboratory and university researchers together to educate each other and discuss issues associated with scalable fault tolerance.

Basic Info
• June 10-11, 2002 in Albuquerque, NM
• ~40 attendees
  • Cornell, Denison, Florida, Houston, Indiana, LANL, LLNL, MSTI, SNL, Tennessee, UT Austin
• Interest exceeded capacity
• Organized by Patty Hough (SNL), Tom Bressoud (Denison), and Lee Ward (SNL)
• Sponsored by the CSRI

Agenda
• 11 invited talks + 2 hours focused discussion on:
  • Application descriptions and needs
  • System monitoring
  • MPI fault tolerance
  • Traditional approaches with a twist
• Topics not covered
  • Checkpoint-free algorithms
  • Preventative measures
  • System services
  • Migration
  • Redistribution
  • Validation
  • Run-time environments

Conclusions
• MPI support is needed
• Programming model needs to be considered
• Balance research with timely delivery of capabilities
• New ideas are needed
  • Leverage hardware
  • More systematic, integrated approach
• There are still outstanding issues
  • Transparency vs. intrusiveness
  • Can traditional approaches be made scalable?
• Workshop was a great success!

For more information…
http://csmr.ca.sandia.gov/projects/ftalgs.html