
Report on 2002 Fault Tolerance Workshop



Presentation Transcript


  1. Report on 2002 Fault Tolerance Workshop
     Patricia D. Hough, Computational Sciences and Mathematics Research Department, Sandia National Laboratories

  2. Motivation
     • Large COTS systems are prone to failures
     • Lots of parts; complex configurations
     • Applications stress the systems
     • Few options for application survival
     • University resources are untapped
     • DOE researchers unfamiliar with fault tolerance experts
     • University researchers unfamiliar with DOE problem domain
     Goal: Bring laboratory and university researchers together to educate each other and discuss issues associated with scalable fault tolerance.

  3. Basic Info
     • June 10-11, 2002 in Albuquerque, NM
     • ~40 attendees
     • Cornell, Denison, Florida, Houston, Indiana, LANL, LLNL, MSTI, SNL, Tennessee, UT Austin
     • Interest exceeded capacity
     • Organized by Patty Hough (SNL), Tom Bressoud (Denison), and Lee Ward (SNL)
     • Sponsored by the CSRI

  4. Agenda
     • 11 invited talks + 2 hours focused discussion on:
       • Application descriptions and needs
       • System monitoring
       • MPI fault tolerance
       • Traditional approaches with a twist
     • Topics not covered
       • Checkpoint-free algorithms
       • Preventative measures
       • System services
       • Migration
       • Redistribution
       • Validation
       • Run-time environments

  5. Conclusions
     • MPI support is needed
     • Programming model needs to be considered
     • Balance research with timely delivery of capabilities
     • New ideas are needed
     • Leverage hardware
     • More systematic, integrated approach
     • There are still outstanding issues
     • Transparency vs. intrusiveness
     • Can traditional approaches be made scalable?
     • Workshop was a great success!

  6. For more information… http://csmr.ca.sandia.gov/projects/ftalgs.html
