Explore the concept of self-stabilizing systems as a base for autonomic computing, emphasizing their fault-tolerance and ability to overcome transient errors.
Self-Stabilizing Systems as a Base for Autonomic Computing • Shlomi Dolev, Yinnon Haviv, Reuven Yagel, Olga Brukman
Trustworthy Systems: Why Is It So Hard? • Corbató ’91: "It almost goes without saying that ambitious systems never quite work as expected" http://larch-www.lcs.mit.edu:8001/~corbato/turing91/ • "You must pay extreme attention to detail here. One wrong bit will make things fail…" http://my.execpc.com/~geezer/os/pm.htm • From the Pentium manual: "… if the ESP or SP register is 1 when the PUSH instruction is executed, the processor shuts down due to a lack of stack space. No exception is generated to indicate this condition"
Mars Rover - Spirit • …The Spirit rover has a radiation-hardened R6000 CPU from Lockheed-Martin Federal Systems… The operating system is Wind River Systems' Vx-Works… • …attempted to allocate more files than the RAM-based directory structure could accommodate. That caused an exception, which caused the task that had attempted the allocation to be suspended… • …Spirit fell silent, alone on the emptiness of Mars, trying and trying to reboot • http://www.eetimes.com/sys/news/OEG20040220S0046
Self-Stabilization • Self-healing, Self-managing, Self-* • Recovery Oriented Computing [Berkeley, Stanford] • Autonomic Computing [IBM] • Self-Stabilization • Self-Stabilizing algorithm for mutual exclusion in a ring topology [Dijkstra’74]
Self-Stabilizing Systems • Elegant fault-tolerant approach. Started in any state, the system converges to a desired behavior. • Generally used in distributed systems. • Routing, clock synchronization, leader election, etc. • Overcome transient faults in the system. • Transient faults: soft errors (“98% of RAM errors are soft errors”), wrong CRC during communication, etc.
Self-Stabilization • The combination and type of faults cannot be totally anticipated in on-going systems • Any on-going system must be self-stabilizing (or manually monitored)
Token Passing
P1: do forever
    if x1 = xn then
        x1 := (x1 + 1) mod (n + 1)
Pi (i ≠ 1): do forever
    if xi ≠ xi-1 then
        xi := xi-1
Token Passing Cont. {0; 0; 0; 0; 0}; {1; 0; 0; 0; 0}; {1; 1; 0; 0; 0}; {1; 1; 1; 0; 0}; {1; 1; 1; 1; 0}; {1; 1; 1; 1; 1}; {2; 1; 1; 1; 1}; {2; 2; 1; 1; 1}; {2; 2; 2; 1; 1}; {2; 2; 2; 2; 1}; {2; 2; 2; 2; 2} … • Surely works when we start in x1 = x2 = … = xn = 0. • One processor may change a state at a time.
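The trace above can be reproduced with a short simulation. This is a minimal Python sketch, not code from the talk: the function names `has_token`, `step` and `run` are hypothetical, and processors are indexed from 0, so p1 is index 0.

```python
def has_token(x, i):
    """p1 (index 0) holds the token when x1 = xn; pi holds it when xi != xi-1."""
    n = len(x)
    if i == 0:
        return x[0] == x[n - 1]
    return x[i] != x[i - 1]


def step(x, i):
    """Return the configuration after processor i executes its loop body once."""
    n = len(x)
    y = list(x)
    if i == 0:
        if x[0] == x[n - 1]:
            y[0] = (x[0] + 1) % (n + 1)
    elif x[i] != x[i - 1]:
        y[i] = x[i - 1]
    return y


def run(x, steps):
    """Starting from a legal configuration, let the token holder move each step."""
    trace = [list(x)]
    for _ in range(steps):
        holder = [i for i in range(len(x)) if has_token(x, i)][0]
        x = step(x, holder)
        trace.append(x)
    return trace


trace = run([0, 0, 0, 0, 0], 10)
for config in trace:
    print(config)
```

Every configuration in this trace has exactly one token holder, which is the property the faulty configurations later in the talk violate.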
Token Passing: Faults • Transient faults: soft errors, wrong CRC, unexpected temporary severe conditions, etc. • A transient fault may leave each processor in an arbitrary state (in the range of its state space). • For example {3; 4; 4; 1; 0}. • p2, p4, and p5 have tokens! • Will the system ever recover?
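The token count in the faulty configuration can be checked mechanically. A small Python sketch, assuming the token rule of the algorithm (p1 is privileged when x1 = xn, every other pi when xi ≠ xi-1); the helper name `token_holders` is hypothetical.

```python
def token_holders(x):
    """Return the names of all processors that currently hold a token."""
    n = len(x)
    holders = []
    for i in range(n):
        if i == 0:
            privileged = x[0] == x[n - 1]   # rule for p1
        else:
            privileged = x[i] != x[i - 1]   # rule for pi, i != 1
        if privileged:
            holders.append(f"p{i + 1}")
    return holders


print(token_holders([3, 4, 4, 1, 0]))  # ['p2', 'p4', 'p5']
print(token_holders([2, 2, 1, 1, 1]))  # ['p3']
```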
Token Passing: Automatic Recovery • p1 changes state infinitely often: • Otherwise, let s1 be the fixed state of p1, • p2 eventually copies s1 from p1, then • p3 eventually copies s1 from p2, then ... • pn eventually copies s1 from pn-1, so x1 = xn, and then • p1 changes state, a contradiction. • For n = 5, p1 changes state in the order 4; 5; 0; 1; 2; 3; 4; 5; 0; ...
Token Passing: Automatic Recovery Cont. • In any initial state at least one state is missing, since there are n + 1 possible values but only n processors. E.g., in {4; 4; 1; 0; 2}, 3 and 5 are missing. • Once p1 reaches a missing state, e.g., 5, all the processors must copy 5 before p1 reads 5 from pn and changes state to 0.
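The recovery argument can be exercised empirically. The Python sketch below is an illustration under assumptions, not a proof and not code from the talk: it starts the ring in every one of the 6^5 configurations for n = 5, lets a randomly chosen token holder move at each step (one processor at a time), and counts steps until exactly one token remains. The names `holders`, `move` and `steps_to_single_token`, the random scheduler and the step limit are all hypothetical choices.

```python
import itertools
import random


def holders(x):
    """Indices of processors holding a token (index 0 is p1)."""
    n = len(x)
    return [i for i in range(n)
            if (x[0] == x[n - 1] if i == 0 else x[i] != x[i - 1])]


def move(x, i):
    """Token holder i executes its assignment."""
    y = list(x)
    y[i] = (x[0] + 1) % (len(x) + 1) if i == 0 else x[i - 1]
    return y


def steps_to_single_token(x, rng, limit=1000):
    """Steps until exactly one token exists, or None if the limit is hit."""
    x = list(x)
    for count in range(limit + 1):
        h = holders(x)
        if len(h) == 1:
            return count
        x = move(x, rng.choice(h))
    return None


rng = random.Random(0)
n = 5
results = [steps_to_single_token(s, rng)
           for s in itertools.product(range(n + 1), repeat=n)]
print("from {4; 4; 1; 0; 2}:", steps_to_single_token([4, 4, 1, 0, 2], rng))
print("all configurations converged:", all(r is not None for r in results))
print("most steps needed in this run:", max(r for r in results if r is not None))
```

A random scheduler only samples executions; the argument on the slides covers every schedule in which one processor moves at a time.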
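Will It Stabilize With mod (n - 2)? • n = 5, mod 3: {0,0,2,1,0}, p1 moves: {1,0,2,1,0}, p5 moves: {1,0,2,1,1}, p4 moves: {1,0,2,2,1}, p3 moves: {1,0,0,2,1}, p2 moves: {1,1,0,2,1} • The result is the initial configuration with +1 mod 3 added to every entry, so the same schedule can repeat forever and the system never stabilizes!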
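The mod 3 counterexample can be replayed step by step. A Python sketch under the same 0-based indexing assumption (p1 is index 0); `step`, `count_tokens` and the variable names are hypothetical. It repeats the schedule p1, p5, p4, p3, p2 three times and records how many tokens exist before each move.

```python
def step(x, i, k):
    """Processor i executes its loop body once; p1 counts modulo k."""
    y = list(x)
    if i == 0:
        if x[0] == x[-1]:
            y[0] = (x[0] + 1) % k
    elif x[i] != x[i - 1]:
        y[i] = x[i - 1]
    return y


def count_tokens(x):
    """Number of processors that currently hold a token."""
    n = len(x)
    return sum((x[0] == x[n - 1]) if i == 0 else (x[i] != x[i - 1])
               for i in range(n))


start = [0, 0, 2, 1, 0]
schedule = [0, 4, 3, 2, 1]          # p1, p5, p4, p3, p2
x = start
after_rounds = []
token_counts = []
for _ in range(3):
    for i in schedule:
        token_counts.append(count_tokens(x))
        x = step(x, i, 3)
    after_rounds.append(x)

print(after_rounds)        # each round shifts every entry by +1 mod 3
print(set(token_counts))   # the number of tokens never drops to one
```

After three rounds the ring is back in its starting configuration, so this schedule is a cycle that never passes through a legal state.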
Talk Outline • Self-Stabilizing Microprocessor [DH04] • Self-Stabilizing Operating System [DY04] • Self-Stabilization Preserving Compiler [DH05] • Self-Stabilizing Automatic Recoverer For Eventual Byzantine Software [BDK03] • Recovery Oriented Programming [BD05]
Self-Stabilizing Microprocessor: Overcoming Soft-Errors • Shlomi Dolev and Yinnon A. Haviv • 17th International Conference on Architecture of Computing Systems (ARCS)
Motivation • Soft-Errors: Single Event Upsets (SEU) • Caused by cosmic ray / other disruptions. • Cause a logical gate to flip its content. • Currently handled only in memories. • Significant impact on the microprocessors.
Soft-Errors - Current Solutions • Obtaining masking using probabilistic approaches: • Information redundancy (ECC / Parity) • Space redundancy • Time redundancy • Failure detection / recovery. • Known solutions: • IBM S-390 • Compaq NonStop Himalaya • IROC
Self-Stabilizing Microprocessor • Self-stabilizing algorithms assume that the microprocessor executes them. • Soft-errors may cause the microprocessor to be stuck in a faulty state. • A microprocessor is self-stabilizing if: • Started in any internal state, it converges in a finite number of steps into the set of safe states. • Microprocessor’s safe state: one in which it performs the “fetch-decode-execute” cycle
Proving Convergence • [Figure: transition graph; node labels a–l and A–F] • Proving that there exists no “bad” cycle in the transition graph of the microprocessor. • Too large! (we must explore the entire graph) • Using an abstraction: group together states in which the micro-code program counter is the same.
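The "no bad cycle" condition can be phrased as a graph check: the transition graph restricted to unsafe states must be acyclic, so every execution eventually enters a safe state. Below is a minimal Python sketch on a toy graph. The state names, the `has_bad_cycle` helper and both example graphs are hypothetical and are not the microprocessor model or the validation tool from [DH04].

```python
def has_bad_cycle(transitions, safe):
    """True if some cycle lies entirely within the unsafe states.

    transitions maps each state to the list of its successor states.
    """
    WHITE, GREY, BLACK = 0, 1, 2
    color = {s: WHITE for s in transitions}

    def visit(s):
        color[s] = GREY
        for t in transitions[s]:
            if t in safe:
                continue                      # leaving the unsafe region is fine
            if color[t] == GREY:
                return True                   # back edge: cycle among unsafe states
            if color[t] == WHITE and visit(t):
                return True
        color[s] = BLACK
        return False

    return any(color[s] == WHITE and s not in safe and visit(s)
               for s in transitions)


safe = {"fetch"}
converging = {"a": ["b"], "b": ["c"], "c": ["fetch"], "d": ["b"],
              "fetch": ["fetch"]}
stuck = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["b"],
         "fetch": ["fetch"]}

print(has_bad_cycle(converging, safe))  # False
print(has_bad_cycle(stuck, safe))       # True
```

The abstraction on the slide reduces the graph before a check of this kind is run: states sharing a micro-code program counter value collapse into one node, so far fewer nodes need to be explored.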
Self-Stabilizing Microprocessor: Summary • Soft-errors are here to stay, we should: • Design our systems to mask them. • Self-stabilize following a non-masked error. • We provide methodology for validating self-stabilization property of microprocessors.
Talk Outline • Self-Stabilizing Microprocessor [DH04] • Self-Stabilizing Operating System [DY04] • Self-Stabilization Preserving Compiler [DH05] • Self-Stabilizing Automatic Recoverer For Eventual Byzantine Software [BDK03] • Recovery Oriented Programming [BD05]
Toward Self-Stabilizing Operating System (SOS) Shlomi Dolev and Reuven Yagel, SAACS’04 Workshop, Zaragoza
Talk Outline • The first self-stabilizing algorithm (of Dijkstra) • Self-Stabilizing Microprocessor [DH04] • Self-Stabilizing Operating System [DY04] • Self-Stabilization Preserving Compiler [DH05] • Self-Stabilizing Automatic Recoverer For Eventual Byzantine Software [BDK03] • Recovery Oriented Programming [BD05]
Basic Directions • Black-box • Take existing OS (Unix, Windows, RTOS) • Add stabilization layer • Carefully tailoring a tiny kernel • Processor scheduling • Memory management • Device allocation
Assumptions • Every configuration (processor/memory) is possible • At least some program code is hardwired (in ROM) and is correct – Harvard Model • Processor: • Instruction manual (e.g. x86\IA-32) defines a transition function. • Self-stabilizing [DH04]
Black Box • Requirements: • Defining a legal execution is usually impractical • At least: restore the original state (variables + code) infinitely often • Periodic Reset, Re-install and Execute (RRE) • Watchdog timer (self-stabilizing) • Periodic processor reset • During bootstrap, the OS is reinstalled from ROM • Weak self-stabilization • E = (ci, ai, ci+1, …, RRE, c1, a1, c2, a2, …, ci, ai, ci+1, …, RRE, c1, a1, c2, a2, …) • Is it always acceptable? • Alternative: periodically re-install the code only, and add consistency checks and enforcement
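The periodic reset idea can be illustrated with a watchdog whose own counter may be corrupted. This Python sketch is an assumption-laden model, not the SOS implementation: `WATCHDOG_MAX`, `ROM_IMAGE`, `tick` and `run` are hypothetical names, and the "OS" is reduced to a dictionary copied from a read-only image. An out-of-range counter is treated as expired, so any starting state leads to a re-install within `WATCHDOG_MAX + 1` ticks.

```python
WATCHDOG_MAX = 8                              # hypothetical period, in ticks
ROM_IMAGE = {"code": "os-v1", "vars": 0}      # hypothetical hardwired, correct image


def tick(counter, memory):
    """One clock tick: count down, or reset and re-install from ROM."""
    if counter <= 0 or counter > WATCHDOG_MAX:
        # Expired or corrupted counter: reset and restore code + variables.
        return WATCHDOG_MAX, dict(ROM_IMAGE)
    return counter - 1, memory


def run(counter, memory, ticks):
    """Apply the given number of ticks starting from an arbitrary state."""
    for _ in range(ticks):
        counter, memory = tick(counter, memory)
    return counter, memory


# Arbitrary (possibly corrupted) counter values and memory contents.
for corrupted_counter in (0, 3, 8, 200):
    _, memory = run(corrupted_counter, {"code": "garbage"}, WATCHDOG_MAX + 1)
    print(corrupted_counter, memory == ROM_IMAGE)
```

This models only the weak guarantee from the slide: the original state is restored infinitely often, while the behaviour between two resets is not constrained.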
Tailored Kernel • Tiny Scheduler Tiny Memory Manager • Requirements: • Self-stabilizing • Fair • Process stabilization preserving (e.g. validity of P.C. value)
Tiny SOS Scheduler • ~70 lines of real machine assembly code • 16-bit real mode & 32-bit protected mode. • Standard build and emulation tools (Nasm, ld, Bochs) • Detailed proof of requirement preservation
    ; increase task
    mov word ax, [currentProc]
    and ax, PROC_MASK
    ...
    ; load task state
    ...
    ; restore ip
    mov ax, [bx+4]
    ; validate ip
    and ax, IP_MASK
    mov word [ss:STACK_TOP], ax
    ; restore general registers
    mov cx, word [bx+12]
    mov dx, word [bx+14]
    mov si, word [bx+16]
    mov di, word [bx+18]
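The `and` instructions in the listing force possibly corrupted values back into a valid range. The Python sketch below models only that masking idea. The constants are hypothetical (the talk does not give the values of `PROC_MASK` or `IP_MASK`), and the sketch assumes a power-of-two number of processes so that a bit mask is enough.

```python
NUM_PROCS = 4                 # hypothetical; must be a power of two for masking
PROC_MASK = NUM_PROCS - 1
IP_MASK = 0x0FFF              # hypothetical: each process owns a 4 KiB code area


def next_process(current_proc):
    """Advance to the next process; any 16-bit garbage maps to a valid index."""
    return (current_proc + 1) & PROC_MASK


def validated_ip(saved_ip):
    """Force a saved instruction pointer into the process's code area."""
    return saved_ip & IP_MASK


def visited_in_one_round(start):
    """Process indices scheduled during NUM_PROCS consecutive ticks."""
    seen, proc = set(), start
    for _ in range(NUM_PROCS):
        proc = next_process(proc)
        seen.add(proc)
    return seen


print(next_process(0xFFFF))            # 0: a corrupted index is masked back
print(hex(validated_ip(0xBEEF)))       # 0xeef
print(visited_in_one_round(0x1234))    # every process index appears
```

Masking gives a valid value rather than the original one, which is all a self-stabilizing scheduler needs: after a fault it is fair again within one round, whatever value was corrupted.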
Tiny SOS Memory Manager • Requirements: • Consistency of memory hierarchy • Self-stabilization preservation
Tiny SOS Scheduler • [State diagram; labels: Any State; Process(ing); Clock tick / execute next; Establish Scheduler Consistency; Next Process Validated & Ready; Some Error (×3)]
Tiny SOS Scheduler • [State diagram; labels: Any State; Process(ing); Clock tick / execute next; Establish Scheduler Consistency; Next Process Validated & Ready; NMI / load PC with scheduler handler]
Sketch of Proof • In every execution E, the scheduler code is entered and executed from its first instruction to its last instruction infinitely often • In every execution E of the scheduler, each process is executed infinitely often • The self-stabilizing scheduler preserves the stabilization of processes.
Talk Outline • Self-Stabilizing Microprocessor [DH04] • Self-Stabilizing Operating System [DY04] • Self-Stabilization Preserving Compiler [DH05] • Self-Stabilizing Automatic Recoverer For Eventual Byzantine Software [BDK03] • Recovery Oriented Programming [BD05]
Self-Stabilization Preserving Compiler Shlomi Dolev, Yinnon A. Haviv, Department of Computer Science Ben-Gurion University, Israel Mooly Sagiv, Department of Computer Science Tel Aviv University, Israel
Motivation • Transient malfunctions. • Single processor: • Hardware glitches. • Soft-Errors. • Distributed environment: • Processor crashes / recoveries. • Link errors. • Resulting in an unpredictable system state.
Coping with Transient Errors • Masking (safety factor) achieved by: • Information redundancy (e.g., ECC). • Time/Space redundancy. (e.g., TMR) • Self-Stabilization [Dijkstra74]: • Assuming any system state (caused by errors). • Recovering by converging into legal behavior. • Existing algorithms for distributed tasks: • Routing, leader election, mutual exclusion, etc.
Self-Stabilizing Algorithms – a Solution to Soft-Errors? • Self-Stabilizing algorithm assumes that the microprocessor executes it. • Soft-Errors may cause the microprocessor to be stuck in a faulty state. • Composition of self-stabilizing algorithms creates a self-stabilizing system. • Make the microprocessor eventually fetch-decode-execute machine code.
The Gap • Need a transformation between: • Input program P written in a high-abstraction language, e.g., (D)ASM. • Output program Q in a machine language, say, JVM. • Existing compilers? • P and Q behave the same when started in the initial state. • What if Q reaches an unexpected state due to a soft error experienced by the microprocessor?
Trivial Example • A statement of the form: For each i in {0..9} do f(i) • May be compiled to:
    mov ax, 10
    mov cx, 0
loop1:
    push cx
    call f
    inc cx
    cmp cx, ax
    jne loop1
• Start with cx=12 inside the loop: cx is already past 10, so the equality test fails until the 16-bit counter wraps around. • Moreover: Any runtime mechanism can get stuck / inconsistent.
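The effect of a corrupted loop counter can be counted in a small model. This Python sketch simulates the compiled loop with a 16-bit `cx`; the function name `run_loop` and the `robust` variant are hypothetical. The variant exits as soon as `cx` is not below `ax`, which is one possible hardening and not necessarily what the compiler in [DH05] emits.

```python
def run_loop(cx, ax=10, robust=False):
    """Count calls to f when execution starts inside the loop body."""
    calls = 0
    while True:
        calls += 1                    # push cx / call f
        cx = (cx + 1) & 0xFFFF        # inc cx on a 16-bit register
        if robust:
            if not cx < ax:           # exit unless cx is still below the bound
                break
        elif cx == ax:                # cmp cx, ax / jne loop1
            break
    return calls


print(run_loop(0))                    # 10: the intended behaviour
print(run_loop(12))                   # 65534: runs until cx wraps around to 10
print(run_loop(12, robust=True))      # 1: leaves the loop right away
```

Both versions behave identically when started from the initial state, which is the only case an ordinary compiler is required to preserve.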
Stabilization Preserving Compiler – a closer look Ensuring that Q eventually behaves as P: • State space of P • State space of Q
Self-Stabilization Preserving Compiler: Summary • Front end of compiler established. • Typed version of ASM. • JavaCC as a parser generator. • Interpreter (used as a model). • Fast stabilization vs. optimizations. • Self Stabilization preserving compiler. • Language with clear semantics from any state. • Innovative demands from compiler.
Talk Outline • Self-Stabilizing Microprocessor [DH04] • Self-Stabilizing Operating System [DY04] • Self-Stabilization Preserving Compiler [DH05] • Self-Stabilizing Automatic Recoverer For Eventual Byzantine Software [BDK03] • Recovery Oriented Programming [BD05]
Self-Stabilization and Evolving Systems • Real-world systems cannot be verified exhaustively… • We enforce safety and liveness specifications • A contract between the client, project manager and programmers that is checked on line! • Make sure that the additional (thin) monitoring and recovering layer is self-stabilizing • A change can be made to the implementation/specification to support evolving environments
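A thin monitoring and recovering layer can be pictured as a wrapper that checks a safety predicate on every call and restarts the component when the predicate fails. The Python sketch below is purely illustrative: `BoundedQueue`, `safe` and `monitored_put` are hypothetical and do not come from [BDK03] or [BD05]. The monitor keeps no state of its own, so a transient fault has nothing in it to corrupt.

```python
class BoundedQueue:
    """Hypothetical monitored component, standing in for real software."""
    CAPACITY = 4

    def __init__(self):
        self.items = []

    def put(self, item):
        if len(self.items) < self.CAPACITY:
            self.items.append(item)


def safe(queue):
    """Safety part of the contract: the queue never exceeds its capacity."""
    return isinstance(queue.items, list) and len(queue.items) <= BoundedQueue.CAPACITY


def monitored_put(queue, item):
    """Check the contract on line; on violation, restart before proceeding."""
    if not safe(queue):
        queue.__init__()              # restart: back to the initial state
    queue.put(item)
    return queue


q = BoundedQueue()
q.items = list(range(100))            # simulate a corrupted state
monitored_put(q, "job")
print(q.items)                        # ['job']
```

The sketch relies on the assumption that the component behaves as required after a restart, and it checks only a safety property; a liveness property would need a bound on how long a request may stay unserved.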
Self-Stabilizing Recoverer for Eventual Byzantine Software Olga Brukman, Shlomi Dolev Department of Computer Science Ben-Gurion University, Israel Hillel Kolodner, Haifa Research Labs IBM, Israel
Software Contains Bugs • Heisenbugs, corrupt states, leaked resources are common… • Correct and faultless SW is hard • Long-lived running programs, e.g., OS • Usually software is tested when starting from initial state and considering limited time scenarios.
Fault Model Reflecting Reality • Software packages can be trusted to work as required after restart. • Eventual Byzantine software. • System administrators and users use reboot to deal with faults.