Advanced Scientific Computing Research. A 100,000 Ways to Fail. Al Geist, Computer Science and Mathematics Division, Oak Ridge National Laboratory. July 9, 2002, Fast-OS Workshop. Research sponsored by the Mathematics, Information and Computational Sciences Office, U.S. Department of Energy.
Fault Tolerance on 100,000 processors
• Mean time to failure (MTTF) of a 100,000-CPU system will be measured in minutes.
  • A problem when MTTF is O(synch time)
  • Less than application startup time
• How long do I wait to find out if something has gone wrong?
  • Memory is a million cycles away.
• Validation of applications when...
  • it has a subtle synchronization error at p > 78,347
  • a HW failure loses my asynchronous message or changes its bits (Myrinet)
It worked fine on the validation test sizes
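The "measured in minutes" claim can be checked with a back-of-the-envelope calculation. The sketch below assumes node failures are independent with a constant failure rate, so the system MTTF is roughly the per-node MTTF divided by the node count. The five-year per-node MTTF is an illustrative assumption, not a figure from the talk; only the node count comes from the slide.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def system_mttf_minutes(node_mttf_years: float, nodes: int) -> float:
    """System MTTF in minutes, assuming independent nodes with constant failure rates."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return node_mttf_years * MINUTES_PER_YEAR / nodes


if __name__ == "__main__":
    # Hypothetical per-node MTTF of 5 years (an assumption for illustration).
    for n in (1_000, 10_000, 100_000):
        print(f"{n:>7} nodes: {system_mttf_minutes(5.0, n):10.1f} minutes")
```

Under these assumptions a 100,000-node system fails about every 26 minutes, which is already comparable to the startup or checkpoint time of a large job.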
When you are a Failure and don’t know it
The Uncertainty Principle of Large-scale Computing
• Speed is not a problem if the answer doesn’t have to be right.
  • Do you care? Numerical accuracy (pgon)
  • Do you know? Validation issues
• Scientists seem overly concerned about getting the right answer.
  • Concern for their reputation and integrity
  • Some even feel the answer is the product
• Such is not the case for those who report how fast their computers are (or who run Enron).
Do I have to beat the right answer out of you?
Fault Tolerance – today’s system approach
There are three main steps in traditional fault tolerance:
• Detection that something has gone wrong
  • System – detection in hardware
  • Framework – detection by runtime environment
  • Library – detection in math or communication library
• Notification of the application
  • Interrupt – signal sent to job
  • Error code returned by application routine
• Recovery of the application from the fault
  • Restart – from checkpoint or from beginning
  • Migration of task to other hardware
  • Reassignment of work to remaining tasks
Now we are cooking!
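The three steps can be sketched end to end in a few lines. The example below is a minimal illustration, not any particular runtime's implementation: detection is a heartbeat timeout (a framework-level detector), notification is a callback, and recovery is reassignment of work to the remaining tasks. `HeartbeatMonitor`, `reassign`, the task names, and the timeout value are all hypothetical.

```python
from typing import Callable, Dict, List, Set


class HeartbeatMonitor:
    """Detection: a task is presumed failed once its last heartbeat is too old."""

    def __init__(self, timeout: float, on_failure: Callable[[str], None]):
        self.timeout = timeout
        self.on_failure = on_failure          # notification hook
        self.last_seen: Dict[str, float] = {}
        self.failed: Set[str] = set()

    def heartbeat(self, task: str, now: float) -> None:
        self.last_seen[task] = now

    def check(self, now: float) -> None:
        for task, seen in self.last_seen.items():
            if task not in self.failed and now - seen > self.timeout:
                self.failed.add(task)
                self.on_failure(task)         # notify the application


def reassign(work: Dict[str, List[int]], dead: str) -> None:
    """Recovery: spread the dead task's work items over the surviving tasks."""
    orphaned = work.pop(dead)
    survivors = sorted(work)
    if not survivors:
        raise RuntimeError("no surviving tasks to take over the work")
    for i, item in enumerate(orphaned):
        work[survivors[i % len(survivors)]].append(item)


def demo():
    work = {"t0": [0, 1], "t1": [2, 3], "t2": [4, 5]}
    monitor = HeartbeatMonitor(timeout=3.0, on_failure=lambda t: reassign(work, t))
    for task in list(work):
        monitor.heartbeat(task, now=0.0)
    # t1 goes silent; t0 and t2 keep reporting.
    monitor.heartbeat("t0", now=4.0)
    monitor.heartbeat("t2", now=4.0)
    monitor.check(now=5.0)
    return work, monitor.failed


if __name__ == "__main__":
    print(demo())
```

Time is passed in explicitly so the sketch stays deterministic. A real detector would also have to decide how long to wait before declaring a failure.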
Fault Tolerance – application recovery
There are three main steps in traditional fault tolerance:
• Detection that something has gone wrong
  • Application depends on runtime to do this
• Notification of the application
  • Interrupt – application gets a signal (limited info)
  • Error code returned by library (got to check for it)
• Recovery of the application from the fault
  • Restart – app typically needs to include a restart routine
  • Run through failure – app needs a fault tolerant programming model (e.g. FT-MPI)
    • Reassignment of work to remaining tasks
    • Lost information/state/messages a big concern for run through
Not another hurdle!
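The "restart routine" and "got to check for it" points can be illustrated with a small checkpoint/restart loop. This is a minimal single-process sketch, not FT-MPI or any library's API. The toy computation, the JSON checkpoint format, and the `fail_at` parameter (which simulates a fault reported as a nonzero error code) are all assumptions made for illustration.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write atomically so a crash mid-write never damages the last good checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def load_checkpoint(path):
    """Restart routine: resume from the checkpoint if present, else from the beginning."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def run(path, n_steps, every, fail_at=None):
    """Sum 0..n_steps-1, checkpointing every `every` steps.

    Returns (error_code, total). `fail_at` simulates a fault at that step,
    reported as a nonzero error code that the caller has to check.
    """
    state = load_checkpoint(path)
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            return 1, None
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % every == 0:
            save_checkpoint(path, state)
    return 0, state["total"]


def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "ckpt.json")
        rc, total = run(path, 10, every=4, fail_at=6)
        if rc != 0:
            # Fault reported: restart, which resumes from the step-4 checkpoint.
            rc, total = run(path, 10, every=4)
        return rc, total


if __name__ == "__main__":
    print(demo())
```

The work done in steps 4 and 5 before the simulated fault is lost and redone after the restart, which is the cost the slide's "run through failure" alternative tries to avoid.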
Fault Tolerance – a new perspective
Checkpointing and restarting a 100,000-processor system could take longer than the time to the next failure. It isn’t a good use of the resource to restart 99,999 nodes just because one failed.
A new perspective on fault tolerance is needed: development of algorithms that can be scale invariant and naturally fault tolerant (i.e. failure anywhere can be ignored?).
ORNL has developed a few naturally fault tolerant algorithms. Many such algorithms exist. The approach can also address validation.
Finite difference example
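To give a feel for what "failure can be ignored" might mean for a finite difference problem, the sketch below relaxes a 1-D Laplace equation while randomly dropping point updates, mimicking lost messages. This is a generic chaotic-relaxation illustration, not the ORNL algorithm the slide refers to. The grid size, drop probability, and the function name `relax_with_losses` are assumptions.

```python
import random


def relax_with_losses(n_interior=20, sweeps=4000, drop_prob=0.1, seed=0):
    """Jacobi relaxation for u'' = 0 on a 1-D grid with u(left) = 0, u(right) = 1.

    Each point update is skipped with probability `drop_prob`, so that point
    keeps its stale value for that sweep. Returns the maximum deviation from
    the exact (linear) solution after the given number of sweeps.
    """
    rng = random.Random(seed)
    n = n_interior + 2
    u = [0.0] * n
    u[-1] = 1.0
    for _ in range(sweeps):
        new = u[:]
        for i in range(1, n - 1):
            if rng.random() < drop_prob:
                continue  # update lost: nothing is retried or rolled back
            new[i] = 0.5 * (u[i - 1] + u[i + 1])
        u = new
    exact = [i / (n - 1) for i in range(n)]
    return max(abs(a - b) for a, b in zip(u, exact))


if __name__ == "__main__":
    for p in (0.0, 0.1, 0.5):
        print(f"drop_prob={p}: max error {relax_with_losses(drop_prob=p, sweeps=8000):.2e}")
```

The iteration still reaches the right answer with no detection, notification, or recovery step. Losses only slow convergence. The sketch covers transient losses only; a node that fails permanently would need a different treatment.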
Harness Distributed Control
Need for Adaptive System Software
• KISS petaflop system software. Do we need 100,000 copies of Linux? Can a microkernel do the job?
  • Dynamically configure the environment to app needs
  • Less to break, less to watch
• Needs to automatically detect and adapt to “changes” in the system. Note: problems happen at petaflop speeds!
  • Cost of hardware support for detection
  • Migration of tasks away from bad spots
  • Reroute messages around failures
  • Distributed control for fault tolerance
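"Reroute messages around failures" can be sketched as a shortest-path search that skips failed nodes. The example is a centralized breadth-first search over a hypothetical ring topology. It is not how Harness implements distributed control, and `route` and `ring` are illustrative names.

```python
from collections import deque


def route(links, src, dst, failed=frozenset()):
    """Shortest hop path from src to dst that avoids failed nodes, or None."""
    if src in failed or dst in failed:
        return None
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in links.get(node, ()):
            if nxt not in failed and nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None  # destination unreachable with the current set of failures


def ring(n):
    """Hypothetical topology: n nodes, each linked to its two neighbours."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


if __name__ == "__main__":
    links = ring(6)
    print(route(links, 0, 2))                 # direct path
    print(route(links, 0, 2, failed={1}))     # detour the long way round
    print(route(links, 0, 2, failed={1, 4}))  # partitioned: no route
```

A distributed version would have each node make this decision from local knowledge of its neighbours, which is where the detection cost noted in the bullets comes in.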
System Software Environments Breakout Report June 27, 2002
Anemic Areas of Existing Research
What are the underfunded critical research issues?
• Security – protecting systems from compromise
• Fault tolerance – being able to run through failure. Three steps: detection, notification, system recovery
• Linkage between projects and groups for the fault tolerance chain of events through recovery
• Validation of results – how do we know that the app got the right answer, given that the system runs through failure?
Potential Gaps in Research
What are the gaps in the MICS research portfolio related to peta-scale computing?
• What happens after Linux?
• OS that supports the programming model – if it is going to change from distributed memory message passing then…
• Alternate to microkernel approach. “Concrete” holds everything together but is really heavy.
• Local address space vs. more expansive address support
• Runtime and programming models need to give feedback to the OS so it can reconfigure to optimize for their needs
• Experimental architecture research testbed
DOS, OS/2, Linux, IRIX, next?