A 100,000 Ways to Fail

Advanced Scientific Computing Research. A 100,000 Ways to Fail. Al Geist, Computer Science and Mathematics Division, Oak Ridge National Laboratory. July 9, 2002, Fast-OS Workshop. Research sponsored by the Mathematics, Information and Computational Sciences Office, U.S. Department of Energy.

Presentation Transcript


  1. Advanced Scientific Computing Research. A 100,000 Ways to Fail. Al Geist, Computer Science and Mathematics Division, Oak Ridge National Laboratory. July 9, 2002, Fast-OS Workshop. Research sponsored by the Mathematics, Information and Computational Sciences Office, U.S. Department of Energy.

  2. Fault Tolerance on 100,000 processors
     • Mean time to failure (MTTF) of a 100,000-CPU system will be measured in minutes.
        • A problem when MTTF is O(synch time)
        • or less than application startup time
     • How long do I wait to find out if something has gone wrong?
        • Memory is a million cycles away.
     • Validation of applications, when...
        • it has a subtle synchronization error at p > 78,347
        • a HW failure loses my asynchronous message or changes its bits (Myrinet)
     It worked fine on the validation test sizes.
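The claim on slide 2 that system MTTF shrinks to minutes can be checked with simple arithmetic. The sketch below assumes independent node failures at a constant rate, so the system failure rate is the sum of the node rates and the system MTTF is the node MTTF divided by the node count. The per-node MTTF values are hypothetical and do not come from the slides.

```python
# Assumption (not from the slides): node failures are independent with a
# constant failure rate, so system MTTF = node MTTF / number of nodes.

MINUTES_PER_YEAR = 365 * 24 * 60


def system_mttf_minutes(node_mttf_years: float, num_nodes: int) -> float:
    """Expected time, in minutes, until the first failure anywhere in the system."""
    return node_mttf_years * MINUTES_PER_YEAR / num_nodes


if __name__ == "__main__":
    # Hypothetical per-node MTTF values, chosen only for illustration.
    for years in (1, 5, 10):
        minutes = system_mttf_minutes(years, 100_000)
        print(f"node MTTF {years:>2} y -> system MTTF {minutes:.1f} min")
```

Under these assumptions, even a node that fails only once every five years yields a 100,000-node system that fails roughly every 26 minutes.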

  3. When you are a Failure and don’t know it
     The Uncertainty Principle of Large-scale Computing
     • Speed is not a problem if the answer doesn’t have to be right.
        • Do you care? Numerical accuracy (pgon)
        • Do you know? Validation issues
     • Scientists seem overly concerned about getting the right answer.
        • Concern for their reputation and integrity
        • Some even feel the answer is the product
     • Such is not the case for those who report how fast their computers are (or who run Enron).
     Do I have to beat the right answer out of you?

  4. Fault Tolerance – today’s system approach
     • There are three main steps in traditional fault tolerance
     • Detection that something has gone wrong
        • System – detection in hardware
        • Framework – detection by runtime environment
        • Library – detection in math or communication library
     • Notification of the application
        • Interrupt – signal sent to job
        • Error code returned by application routine
     • Recovery of the application from the fault
        • Restart – from checkpoint or from beginning
        • Migration of task to other hardware
        • Reassignment of work to remaining tasks
     Now we are cooking!
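The three steps on slide 4 can be sketched in a few lines. This is a minimal illustration, not the mechanism of any particular system: detection is a heartbeat timeout, notification is a callback, and recovery reassigns the failed task's work to the survivors. All names (`HeartbeatMonitor`, `reassign`, `run_step`) are hypothetical.

```python
from typing import Callable, Dict, List


class HeartbeatMonitor:
    """Detection: a task is considered failed once its heartbeat is too old."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout
        self.last_seen: Dict[int, float] = {}

    def beat(self, task: int, now: float) -> None:
        self.last_seen[task] = now

    def failed(self, now: float) -> List[int]:
        return sorted(t for t, seen in self.last_seen.items()
                      if now - seen > self.timeout)


def reassign(work: Dict[int, List[str]], dead: int) -> None:
    """Recovery: hand the dead task's work items to the survivors round-robin."""
    orphaned = work.pop(dead)
    survivors = sorted(work)
    if not survivors:
        raise RuntimeError("no surviving tasks to take over the work")
    for i, item in enumerate(orphaned):
        work[survivors[i % len(survivors)]].append(item)


def run_step(monitor: HeartbeatMonitor, work: Dict[int, List[str]],
             now: float, notify: Callable[[int], None]) -> None:
    for task in monitor.failed(now):
        if task in work:          # skip tasks already recovered
            notify(task)          # Notification of the application
            reassign(work, task)  # Recovery by reassignment


if __name__ == "__main__":
    work = {0: ["a", "b"], 1: ["c"], 2: ["d", "e"]}
    monitor = HeartbeatMonitor(timeout=5.0)
    for t in work:
        monitor.beat(t, now=0.0)
    monitor.beat(0, now=10.0)
    monitor.beat(2, now=10.0)     # task 1 has gone silent
    run_step(monitor, work, now=12.0, notify=lambda t: print(f"task {t} failed"))
    print(work)
```

The timeout is exactly the "how long do I wait" question from slide 2: a short timeout risks false alarms, a long one delays recovery.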

  5. Fault Tolerance – application recovery
     • There are three main steps in traditional fault tolerance
     • Detection that something has gone wrong
        • Application depends on runtime to do this
     • Notification of the application
        • Interrupt – application gets a signal (limited info)
        • Error code returned by library (got to check for it)
     • Recovery of the application from the fault
        • Restart – app typically needs to include a restart routine
        • Run through failure – app needs a fault tolerant programming model (e.g. FT-MPI)
           - Reassignment of work to remaining tasks
           - Lost information/state/messages a big concern for run through
     Not another hurdle!
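The "Restart" path on slide 5 can be sketched as an application that checks an error code and resumes from its last checkpoint. This is an illustrative sketch only: the JSON checkpoint format, the `fail_at` parameter that simulates a fault, and all function names are assumptions, and no FT-MPI calls are shown.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, state: dict) -> None:
    """Write to a temp file, then rename, so a crash never leaves a partial file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path: str) -> dict:
    """Restart routine: resume from the checkpoint if one exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def compute(path: str, n_steps: int, every: int, fail_at=None) -> int:
    """Sum 0..n_steps-1. Returns 0 on success, 1 on a (simulated) fault."""
    state = load_checkpoint(path)
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            return 1                      # error code; the caller must check it
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return 0


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        ckpt = os.path.join(d, "ckpt.json")
        if compute(ckpt, n_steps=10, every=3, fail_at=7) != 0:
            print("fault reported, restarting from", load_checkpoint(ckpt))
            compute(ckpt, n_steps=10, every=3)
        print("final state:", load_checkpoint(ckpt))
```

Work done after the last checkpoint (step 6 here) is lost and redone on restart, which is the cost the next slide argues becomes unaffordable at 100,000 processors.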

  6. Fault Tolerance – a new perspective
     Checkpointing and restarting a 100,000 processor system could take longer than the time to the next failure.
     It isn’t a good use of the resource to restart 99,999 nodes just because one failed.
     A new perspective on fault tolerance is needed.
     Development of algorithms that can be scale invariant and naturally fault tolerant, i.e. failure anywhere can be ignored?
     • ORNL has developed a few naturally fault tolerant algorithms.
     • Many such algorithms exist.
     • Approach can also address validation
     Finite difference example
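To make "failure anywhere can be ignored" on slide 6 concrete, here is a small finite difference sketch. It is not the ORNL algorithm, which the slides do not describe. It is a generic relaxation for the 1D Laplace equation in which each cell update is randomly dropped, standing in for a failed node or lost message, and the iteration still converges because a stale value is simply reused. The grid size, loss probability, and function names are hypothetical.

```python
import random
from typing import List


def relax(n_cells: int = 20, sweeps: int = 5000,
          p_loss: float = 0.2, seed: int = 0) -> List[float]:
    """Jacobi-style relaxation of u'' = 0 with u(0) = 0 and u(1) = 1.

    Each interior update is skipped with probability p_loss, which simulates
    a failure that is ignored rather than recovered from.
    """
    rng = random.Random(seed)
    u = [0.0] * (n_cells + 2)
    u[-1] = 1.0                       # right boundary condition
    for _ in range(sweeps):
        old = u[:]                    # neighbours are read from the previous sweep
        for i in range(1, n_cells + 1):
            if rng.random() < p_loss:
                continue              # update lost: the cell keeps its stale value
            u[i] = 0.5 * (old[i - 1] + old[i + 1])
    return u


def max_error(u: List[float]) -> float:
    """Distance from the exact solution, the straight line u(x) = x."""
    n = len(u) - 1
    return max(abs(v - i / n) for i, v in enumerate(u))


if __name__ == "__main__":
    print("no failures :", max_error(relax(p_loss=0.0)))
    print("20% failures:", max_error(relax(p_loss=0.2)))
```

Lost updates slow convergence but do not change the answer, so no detection, notification, or restart is needed for this class of fault.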

  7. Harness Distributed Control
     Need for Adaptive System Software
     • KISS petaflop system software. Do we need 100,000 copies of Linux? Can a microkernel do the job?
        • Dynamically configure environment to app needs
        • Less to break, less to watch
     • Needs to automatically detect and adapt to “changes” in the system. Note problems happen at petaflop speeds!
        • Cost of hardware support for detection
        • Migration of tasks away from bad spots.
        • Reroute messages around failures.
        • Distributed control for fault tolerance
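The "reroute messages around failures" item on slide 7 can be illustrated with a breadth-first search over a 2D mesh that skips failed nodes. The mesh topology and the `route` function are assumptions for illustration. The slides do not say how Harness routes messages, and a real system would make this decision in a distributed way rather than with a global view.

```python
from collections import deque
from typing import List, Optional, Set, Tuple

Node = Tuple[int, int]


def route(width: int, height: int, src: Node, dst: Node,
          failed: Set[Node]) -> Optional[List[Node]]:
    """Shortest path from src to dst on a 2D mesh that avoids failed nodes.

    Returns None when the failures disconnect src from dst.
    """
    if src in failed or dst in failed:
        return None
    prev = {src: None}                # also serves as the visited set
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:   # walk predecessors back to src
                path.append(node)
                node = prev[node]
            return path[::-1]
        x, y = node
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            nx, ny = nxt
            if (0 <= nx < width and 0 <= ny < height
                    and nxt not in failed and nxt not in prev):
                prev[nxt] = node
                queue.append(nxt)
    return None


if __name__ == "__main__":
    print(route(3, 3, (0, 0), (2, 0), failed=set()))
    print(route(3, 3, (0, 0), (2, 0), failed={(1, 0)}))
    print(route(3, 3, (0, 0), (2, 0), failed={(1, 0), (0, 1)}))
```

With no failures the message takes the direct three-node path; with node (1, 0) down it detours through the next row; with both neighbours of the source down there is no route and the function returns `None`.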

  8. System Software Environments Breakout Report June 27, 2002

  9. Anemic Areas of Existing Research
     • What are the underfunded critical research issues?
     • Security – protecting systems from compromise
     • Fault tolerance – being able to run through failure. Three steps: detection, notification, system recovery
     • Linkage between projects and groups for the fault tolerance chain of events through recovery.
     • Validation of result – how to know that the app got the right answer, given that the system runs through failure.

  10. Potential Gaps in Research
     • What are the gaps in the MICS research portfolio related to peta-scale computing?
     • What happens after Linux?
     • OS that supports the programming model – if it is going to change from distributed memory message passing then…
     • Alternate to microkernel approach: “Concrete” holds everything together but is really heavy.
     • Local address space vs more expansive address support
     • Runtime & programming models need to give feedback to the OS so it can reconfigure to optimize needs
     • Experimental architecture research testbed
     DOS, OS/2, Linux, IRIX, next?
