Stabilization Enabling Technology Shlomi Dolev
Trustworthy Systems: Why is it So Hard? • Corbató’91: "It almost goes without saying that ambitious systems never quite work as expected" http://larch-www.lcs.mit.edu:8001/~corbato/turing91/ • "You must pay extreme attention to detail here. One wrong bit will make things fail…" http://my.execpc.com/~geezer/os/pm.htm • From the Pentium manual: "… if the ESP or SP register is 1 when the PUSH instruction is executed, the processor shuts down due to a lack of stack space. No exception is generated to indicate this condition"
Mars Rover - Spirit • …The Spirit rover has a radiation-hardened R6000 CPU from Lockheed-Martin Federal Systems…The operating system is Wind River Systems' Vx-Works.. • …attempted to allocate more files than the RAM-based directory structure could accommodate. That caused an exception, which caused the task that had attempted the allocation to be suspended… • …Spirit fell silent, alone on the emptiness of Mars, trying and trying to reboot http://www.eetimes.com/sys/news/OEG20040220S0046
Self-Stabilization • Self-healing, Self-managing, Self-* • Recovery Oriented Computing [Berkeley, Stanford] • Autonomic Computing [IBM] • Self-Stabilization • Self-Stabilizing algorithm for mutual exclusion in a ring topology [Dijkstra’74]
Self-Stabilization • The combination and type of faults cannot be totally anticipated in on-going systems • Any on-going system must be self-stabilizing (or manually monitored)
Token Passing
P1: do forever
      if x1 = xn then
        x1 := (x1 + 1) mod (n + 1)
Pi (i ≠ 1): do forever
      if xi ≠ xi-1 then
        xi := xi-1
Token Passing Cont. {0; 0; 0; 0; 0}; {1; 0; 0; 0; 0}; {1; 1; 0; 0; 0}; {1; 1; 1; 0; 0}; {1; 1; 1; 1; 0}; {1; 1; 1; 1; 1}; {2; 1; 1; 1; 1}; {2; 2; 1; 1; 1}; {2; 2; 2; 1; 1}; {2; 2; 2; 2; 1}; {2; 2; 2; 2; 2} … • Surely works when we start in x1 = x2 = … = xn = 0. • One processor may change a state at a time.
Token Passing: Faults • Transient faults, soft errors, wrong CRC, unexpected temporary severe conditions, etc. • A fault may assign each processor an arbitrary state (within the range of its state space). • For example {3; 4; 4; 1; 0}. • p2, p4, and p5 have tokens! • Will the system ever recover?
Token Passing: Automatic Recovery • Claim: p1 changes state infinitely often. • Otherwise, let s1 be the fixed state of p1, • p2 eventually copies s1 from p1, then • p3 eventually copies s1 from p2, then • ... • pn eventually copies s1 from pn-1, and then • x1 = xn, so p1 changes state, a contradiction. • p1 changes state in the cyclic order 4; 5; 0; 1; 2; 3; 4; 5; 0; ... (here n = 5, so values are taken mod 6)
Token Passing: Automatic Recovery Cont. • In any initial state at least one of the n + 1 values is missing; for example, in {4; 4; 1; 0; 2}, the values 3 and 5 are missing. • Once p1 reaches a missing value, e.g., 5, all the processors must copy 5 before p1 reads 5 from pn and changes state to 0, so the system reaches a configuration with a single token.
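The token-passing rules and the recovery argument can be tried out with a small simulation. This is a minimal sketch, not code from the talk: the function names, the random choice of which privileged processor moves next, and the `max_steps` bound are all assumptions made for illustration. Processors are indexed from 0, so p1 is index 0.

```python
import random


def privileged(x):
    """Return the 0-based indices of processors that currently hold a token."""
    n = len(x)
    holders = []
    if x[0] == x[n - 1]:          # rule for p1: x1 = xn
        holders.append(0)
    for i in range(1, n):         # rule for pi, i != 1: xi != x(i-1)
        if x[i] != x[i - 1]:
            holders.append(i)
    return holders


def step(x, i, k):
    """Let processor i make its move in place (i is assumed to be privileged)."""
    if i == 0:
        x[0] = (x[0] + 1) % k
    else:
        x[i] = x[i - 1]


def run_until_single_token(x, k, rng, max_steps=10_000):
    """Move one privileged processor at a time until exactly one token remains."""
    x = list(x)
    for steps in range(max_steps):
        holders = privileged(x)
        if len(holders) == 1:
            return x, steps
        step(x, rng.choice(holders), k)
    raise RuntimeError("did not stabilize within max_steps")


if __name__ == "__main__":
    faulty = [3, 4, 4, 1, 0]                      # the faulty state from the slides
    print("token holders:", privileged(faulty))   # indices of p2, p4, p5
    final, steps = run_until_single_token(faulty, k=6, rng=random.Random(0))
    print("stabilized to", final, "after", steps, "moves")
```

With n = 5 the counter runs mod n + 1 = 6, matching the slides; the scheduler here is random, whereas the proof holds for any order of moves.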
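Will It Stabilize With mod (n - 2)? • Example with n = 5, i.e., mod 3: {0,0,2,1,0} → p1 moves → {1,0,2,1,0} → p5 moves → {1,0,2,1,1} → p4 moves → {1,0,2,2,1} → p3 moves → {1,0,0,2,1} → p2 moves → {1,1,0,2,1} • The last configuration is the first one with every value +1 mod 3, so the same schedule can repeat forever without ever reaching a single-token configuration!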
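The mod 3 counterexample can be replayed mechanically. The sketch below is illustrative only (the helper names `tokens`, `move` and `replay` are made up here); it applies the schedule p1, p5, p4, p3, p2 three times and records every configuration along the way.

```python
def tokens(x):
    """0-based indices of privileged processors (p1 is index 0)."""
    return [i for i in range(len(x))
            if (x[i] == x[-1] if i == 0 else x[i] != x[i - 1])]


def move(x, i, k):
    """Return the configuration after processor i moves; i must hold a token."""
    if i not in tokens(x):
        raise ValueError(f"processor {i} is not privileged in {x}")
    y = list(x)
    y[i] = (y[0] + 1) % k if i == 0 else y[i - 1]
    return y


def replay(x, k, schedule, rounds):
    """Apply the schedule `rounds` times and return every configuration seen."""
    trace = [list(x)]
    for _ in range(rounds):
        for i in schedule:
            x = move(x, i, k)
            trace.append(x)
    return trace


# Schedule from the slide: p1, p5, p4, p3, p2 (0-based: 0, 4, 3, 2, 1).
trace = replay([0, 0, 2, 1, 0], k=3, schedule=[0, 4, 3, 2, 1], rounds=3)

if __name__ == "__main__":
    for config in trace:
        print(config, "tokens:", len(tokens(config)))
```

After one round the configuration is shifted by +1 mod 3, and after three rounds it is back at the start, while every configuration in between has more than one token.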
Stabilization Stack • Self Stabilizing Microprocessor [DH04,DH06] • Self Stabilizing Operating System [DY04] • Self-Stabilization Preserving Compiler[DH05] • Self-Stabilizing Automatic Recoverer For Eventual Byzantine Software [BDK03] • Recovery Oriented Programming[BD05]
Implementation Bottleneck • Ask Intel, AMD, IBM to design a self-stabilizing microprocessor… • Technology for converting an off-the-shelf processor to be self-stabilizing [DH06] • Ask Microsoft, IBM, Red Hat to convert existing OS code to be self-stabilizing… • Stabilizing Virtual Machine [DY07]
Enforcing stabilization by resetting • Processors behave correctly after reset • Periodic reset ensures correct behavior • But damages closure… • Need careful solutions
Periodic Reset Monitor • Find a location P in OS code reached at least every T time • At P: • Save necessary information to RAM • Request a reset and loop forever. • Stabilizing watchdog accepts request and resets processor • Upon reset: restore information and continue • Stabilizing watchdog verifies that a reset is performed at least every T+ epsilon time
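One way to picture the watchdog side of this scheme is a bounded tick counter that forces a reset whenever a reset is requested or the counter leaves its legal range. This is a sketch under assumptions, not the design from [DH06]: the `Watchdog` class, the tick granularity, and the use of `limit` as a stand-in for T + epsilon are all hypothetical.

```python
class Watchdog:
    """Hypothetical watchdog with a bounded tick counter."""

    def __init__(self, limit, counter=0):
        self.limit = limit        # stands in for T + epsilon, measured in ticks
        self.counter = counter    # may hold any value after a transient fault

    def tick(self, reset_requested=False):
        """Advance one tick; return True when the processor is reset."""
        in_range = 0 <= self.counter < self.limit
        if reset_requested or not in_range:
            self.counter = 0
            return True
        self.counter += 1
        return False


def ticks_until_reset(counter, limit, request_at=None):
    """Count ticks until the first reset, starting from an arbitrary counter value."""
    wd = Watchdog(limit, counter)
    tick = 0
    while True:
        tick += 1
        if wd.tick(reset_requested=(tick == request_at)):
            return tick


if __name__ == "__main__":
    print(ticks_until_reset(counter=0, limit=10))                # no request at all
    print(ticks_until_reset(counter=9999, limit=10))             # corrupted counter
    print(ticks_until_reset(counter=0, limit=10, request_at=3))  # OS asks at tick 3
```

Because an out-of-range counter is treated the same as an expired one, a reset follows within `limit + 1` ticks from any starting value, which is the property the last bullet asks of the watchdog.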
Implementation using Intel XScale core • Used in numerous processors • Network, I/O, Handheld, Cellular etc. • RISC architecture (ARMv5 compatible) • Debug interface • Allows interaction between WD and OS • External debug break used for notifying the upcoming reset
Up to now • Virtual Self-stabilizing processor on top of commercial quality processor • Towards repeating the concept in OSs and VMMs (enforcing configuration and protecting critical operations)
Toward Self-Stabilizing Operating System (SOS) Shlomi Dolev and Reuven Yagel, SAACS’04 Workshop, Zaragoza
Basic Directions • Black box: take an existing OS (Unix, Windows, RTOS) and add a stabilization layer • Carefully tailored tiny kernel: processor scheduling, memory management, device drivers, hosting Byzantine processes
Assumptions • Every configuration (processor/memory) is possible • At least some program code is hardwired (in ROM) and is correct – Harvard Model • Processor: • Instruction manual (e.g. x86/IA-32) defines a transition function. • Self-stabilizing [DH04]
Black Box: Periodic Reset, Re-install and Execute • Watchdog timer (self-stabilizing) • Periodic processor reset • During bootstrap, the OS is re-installed from ROM • Weak self-stabilization: E = (ci, ai, ci+1, …, RRE, c1, a1, c2, a2, …, ci, ai, ci+1, …, RRE, c1, a1, c2, a2, …), where RRE stands for a reset, re-install and execute step • Is it always acceptable? • Alternative: periodically re-install the code only, and add consistency checks and enforcement
Tailored Kernel • Tiny Scheduler • Tiny Memory Manager • Requirements: • Self-stabilizing • Fair • Process stabilization preserving (e.g. validity of the PC value)
Tiny SOS Scheduler • ~70 lines of real machine assembly code • 16-bit real mode and 32-bit protected mode • Standard build and emulation tools (NASM, ld, Bochs) • Detailed proof of requirement preservation
    ; increase task
    mov word ax, [currentProc]
    and ax, PROC_MASK
    ...
    ; load task state
    ...
    ; restore ip
    mov ax, [bx+4]
    ; validate ip
    and ax, IP_MASK
    mov word [ss:STACK_TOP], ax
    ; restore general registers
    mov cx, word [bx+12]
    mov dx, word [bx+14]
    mov si, word [bx+16]
    mov di, word [bx+18]
Sketch of Proof • In every execution E, the scheduler code starts executing, and runs from its first instruction to its last, infinitely often • In every execution E, the scheduler executes each process infinitely often • The self-stabilizing scheduler preserves the stabilization of processes.
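The masking idea behind the scheduler (validate the process index and the instruction pointer on every pass instead of trusting stored values) can be sketched in a few lines. This is an illustration loosely modelled on the assembly fragment, not a translation of it: `NUM_PROCS`, `CODE_SIZE` and `schedule_next` are hypothetical, and both sizes are assumed to be powers of two so that a bit mask always yields an in-range value.

```python
NUM_PROCS = 4                 # assumed number of processes (power of two)
PROC_MASK = NUM_PROCS - 1
CODE_SIZE = 256               # assumed size of a process code segment (power of two)
IP_MASK = CODE_SIZE - 1


def schedule_next(current_proc, saved_ip):
    """Pick the next process and a validated instruction pointer.

    `current_proc` and the entries of `saved_ip` may hold arbitrary values
    after a transient fault; masking forces them back into range.
    """
    next_proc = (current_proc + 1) & PROC_MASK
    ip = saved_ip[next_proc] & IP_MASK
    return next_proc, ip


if __name__ == "__main__":
    current = 0xBEEF                                  # corrupted process index
    saved_ip = [0x1234, 0xFFFF, 0x0042, 0xABCD]       # partly corrupted saved IPs
    for _ in range(NUM_PROCS):
        current, ip = schedule_next(current, saved_ip)
        print("run process", current, "from ip", ip)
```

Starting from any value of `current_proc`, four consecutive passes visit all four processes, which mirrors the fairness claim; the masked `ip` mirrors the "validity of the PC value" requirement.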
Talk Outline • Self Stabilizing Microprocessor [DH06] • Self Stabilizing Operating System [DY04] • Self-Stabilization Preserving Compiler [DH05] • Self-Stabilizing Automatic Recoverer For Eventual Byzantine Software [BDK03] • Recovery Oriented Programming [BD05]
Self-Stabilization Preserving Compiler Shlomi Dolev, Yinnon A. Haviv, Department of Computer Science Ben-Gurion University, Israel Mooly Sagiv, Department of Computer Science Tel Aviv University, Israel
The Gap • Need a transformation between: • Input program P written in a high-abstraction language, e.g., (D)ASM. • Output program Q in a machine language, say, JVM. • Existing compilers? • P and Q behave the same when started in the initial state. • What if Q reaches an unexpected state due to a soft error experienced by the microprocessor?
Trivial Example • A statement of the form: For each i in {0..9} do f(i) • May be compiled to:
    mov ax, 10
    mov cx, 0
loop1:
    push cx
    call f
    inc cx
    cmp cx, ax
    jne loop1
• Start with cx=12 inside the loop… • Moreover: any runtime mechanism can get stuck or become inconsistent.
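What happens when the compiled loop is entered with cx=12 can be checked by simulating the 16-bit counter. The sketch below is an assumption-laden model, not output of any compiler: it assumes a 16-bit cx that wraps on `inc`, records the argument of every call to f, and contrasts the `jne` exit test with a range-checked loop as one possible hardening.

```python
def run_loop_jne(cx, ax=10, width=16, max_calls=100_000):
    """Simulate the loop entered at `push cx` with an arbitrary cx value."""
    mask = (1 << width) - 1
    calls = []
    while len(calls) < max_calls:
        calls.append(cx)          # push cx / call f
        cx = (cx + 1) & mask      # inc cx, wrapping at the register width
        if cx == ax:              # cmp cx, ax / jne falls through when equal
            break
    return calls


def run_loop_guarded(cx, ax=10):
    """Hypothetical hardened loop: continue only while cx is inside 0..ax-1."""
    calls = []
    while 0 <= cx < ax:
        calls.append(cx)
        cx += 1
    return calls


if __name__ == "__main__":
    print(len(run_loop_jne(0)))       # intended behaviour: f(0) .. f(9)
    bad = run_loop_jne(12)
    print(len(bad), "calls,", sum(v > 9 for v in bad), "with an out-of-range argument")
    print(run_loop_guarded(12))       # guarded loop exits without calling f
```

With the equality test, the loop only terminates after cx wraps all the way around to 10, calling f on tens of thousands of values outside {0..9}; the guarded variant leaves the loop at once.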
Stabilization Preserving Compiler – a closer look Ensuring that Q eventually behaves as P: • State space of P • State space of Q
The Transformation [diagram] • Variable declarations → Enforce invariants • upon <condition_1> do <statement_1> … upon <condition_n> do <statement_n> → Scheduler (condition_1 … condition_n) and Statement_1 … Statement_n
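One possible reading of this transformation is a loop that first forces every declared variable back into its declared range and then evaluates each `upon <condition> do <statement>` rule in turn. The sketch below is a guess at the structure, not the compiler of [DH05]: `enforce_invariants`, `run`, the `(low, high, default)` declaration format and the example rule are all invented for illustration.

```python
def enforce_invariants(state, declarations):
    """Reset any variable whose value is outside its declared range."""
    for name, (low, high, default) in declarations.items():
        value = state.get(name)
        if not isinstance(value, int) or not low <= value <= high:
            state[name] = default


def run(state, declarations, rules, rounds):
    """Each round: enforce invariants, then let the scheduler try every rule."""
    for _ in range(rounds):
        enforce_invariants(state, declarations)
        for condition, statement in rules:
            if condition(state):
                statement(state)
    return state


# Hypothetical program: one variable i in 0..9 and one rule that advances it.
declarations = {"i": (0, 9, 0)}
rules = [
    (lambda s: True, lambda s: s.update(i=(s["i"] + 1) % 10)),
]

if __name__ == "__main__":
    corrupted = {"i": 12}                      # state outside the declared range
    print(run(corrupted, declarations, rules, rounds=3))
```

Because the invariants are enforced at the top of every round, the program behaves as specified from the first round after a fault, whatever value `i` held.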
Self-Stabilization Preserving Compiler: Summary • Front end of compiler for ASM. • Self Stabilization preserving compiler. • Language with clear semantics from any state. • New demands for a compiler.
Talk Outline • Self Stabilizing Microprocessor [DH04] • Self Stabilizing Operating System [DY04] • Self-Stabilization Preserving Compiler [DH05] • Self-Stabilizing Automatic Recoverer For Eventual Byzantine Software [BDK03] • Recovery Oriented Programming [BD05]
Self-Stabilization and Evolving Systems • Real-world systems cannot be verified exhaustively… • We enforce safety and liveness specifications • A contract between the client, project manager and programmers that is checked online! • Make sure that the additional (thin) monitoring and recovering layer is self-stabilizing • A change can be made to the implementation/specification to support evolving environments
Self-Stabilizing Recoverer for Eventual Byzantine Software Olga Brukman, Shlomi Dolev Department of Computer Science Ben-Gurion University, Israel Hillel Kolodner, Haifa Research Labs IBM, Israel
Software Contains Bugs • Heisenbugs, corrupt states, leaked resources are common… • Correct and faultless SW is hard • Long-lived running programs, e.g., OS • Usually software is tested when starting from initial state and considering limited time scenarios.
Fault Model Reflecting Reality • Software packages can be trusted to work as required after restart. • Eventual Byzantine software. • System administrators and users use reboot to deal with faults.
Architecture [figure: OS Kernel, Middleware and OMR, annotated with <Preds,RActs>1, <Preds,RActs>2, …, <Preds,RActs>n pairs]
Monitor-Restarter for Process and Subsystem • <Pred,RActs>1 • <Pred,RActs>2 • …
Restart Actions – Mature Approach • A subsystem waits for its components to complete their restart. • The restart action may vary, depending on the component's internal state: • Reschedule • Roll-back • Kill & Restart • A few restart attempts are made, with increasingly drastic restart actions.
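The pairing of a predicate with a list of restart actions, tried from mild to drastic, could look roughly like the sketch below. It is an illustration under assumptions rather than the design of [BDK03]: the `MonitorRestarter` class, the dictionary used as component state, and the three toy actions are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class MonitorRestarter:
    """Hypothetical <Pred, RActs> pair for one component."""
    predicate: Callable[[dict], bool]
    actions: List[Callable[[dict], None]]       # ordered from mild to drastic
    log: List[str] = field(default_factory=list)

    def check(self, component):
        """Return True once the predicate holds, escalating actions as needed."""
        if self.predicate(component):
            return True
        for action in self.actions:
            action(component)
            self.log.append(action.__name__)
            if self.predicate(component):
                return True
        return False


def reschedule(component):
    component["priority"] = "high"              # does not touch the queue


def roll_back(component):
    component["queue_len"] = component["checkpoint"]


def kill_and_restart(component):
    component.update(queue_len=0, checkpoint=0, priority="normal")


if __name__ == "__main__":
    monitor = MonitorRestarter(
        predicate=lambda c: c["queue_len"] < 100,
        actions=[reschedule, roll_back, kill_and_restart],
    )
    component = {"queue_len": 500, "checkpoint": 300, "priority": "normal"}
    print(monitor.check(component), monitor.log)
```

In this toy run the first two actions do not restore the predicate, so the monitor escalates to the most drastic one.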
Computational Model: rsf-execution • An execution E is an rsf (restart supporting fair) execution iff E is a fair execution in which every subsystem sub_i that is initialised during E respects its specification function ss_i. • Requirement: every rsf-execution E has a suffix in which the system respects its specification function ss.
Tools for Implementation – Black Box Approach • The software package is a black box. • The package is monitored by recording its I/O (e.g., strace in Linux). • Monitors are independent of the specific implementation.
Tools for Implementation – Transparent Box Approach • Software package implementation tool is known. • Run-Time Reflection tools are used to monitor and restart the package. • Possible in Java, C++, CORBA, COM.
Practical Experience: Printers Problem • A corrupted pdf, doc or ps file is sent to the printing server. • The printer can’t print the file. • This causes retries by the printing server. • The printer is “stuck” on one job. • Predicate for the printing server: • Restrict the number of retries, try format conversions, send an error message to the user.
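The printing-server predicate (bounded retries, then format conversions, then an error message) can be sketched as a small function. Everything here is hypothetical: `MAX_RETRIES`, `handle_job`, the callable `try_print` standing in for the printer, and the converter functions are illustrative names, not part of any real print server.

```python
MAX_RETRIES = 3   # assumed bound on plain retries


def handle_job(job, try_print, converters, max_retries=MAX_RETRIES):
    """Try to print a job without letting it block the printer forever."""
    for _ in range(max_retries):
        if try_print(job):
            return "printed"
    for convert in converters:                  # fall back to format conversions
        converted = convert(job)
        if converted is not None and try_print(converted):
            return "printed after conversion"
    return "error sent to user"


if __name__ == "__main__":
    def only_ps(job):
        """Toy printer that accepts PostScript files only."""
        return job.endswith(".ps")

    def to_ps(job):
        """Toy converter that pretends to produce a PostScript file."""
        return job + ".ps"

    print(handle_job("report.pdf", only_ps, [to_ps]))
    print(handle_job("report.pdf", lambda job: False, [to_ps]))
```

The number of print attempts is bounded by `max_retries` plus the number of converters, so a corrupted file cannot keep the printer stuck on one job.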
Recovery Oriented Programming Olga Brukman and Shlomi Dolev Department of Computer Science Ben-Gurion University, Israel
Towards Robust Software • Programming • Structured programming, OOD, Design Patterns… • Testing and debugging • Unit testing [JUnit, CppUnit]… • Design By Contract (Eiffel) … • Formal specification languages • ASM, IO Automata, NURPL • Model checking • Online recovery • ROC [PBB02]. • Self-Stabilizing Autonomic Recoverer for Eventual Byzantine Software [BDK03] - black box software packages.