650 likes | 735 Views
On Cosmic Rays, Bat Droppings, and what to do about them. David Walker Princeton University with Jay Ligatti, Lester Mackey, George Reis and David August. A Little-Publicized Fact. 1 + 1 =. 2. 3. How do Soft Faults Happen?. “Galactic Particles” Are high-energy particles that
E N D
On Cosmic Rays, Bat Droppings, and what to do about them David Walker Princeton University with Jay Ligatti, Lester Mackey, George Reis and David August
A Little-Publicized Fact 1 + 1 = 2 3
How do Soft Faults Happen? “Galactic Particles” Are high-energy particles that penetrate to Earth’s surface, through buildings and walls • High-energy particles pass through devices and collides with silicon atom • Collision generates an electric charge that can flip a single bit “Solar Particles” Affect Satellites; Cause < 5% of Terrestrial problems Alpha particles from bat droppings
How Often do Soft Faults Happen? IBM Soft Fail Rate Study; Mainframes; 83-86 Leadville, CO Denver, CO Tucson, AZ NYC
How Often do Soft Faults Happen? IBM Soft Fail Rate Study; Mainframes; 83-86 [Zeiger-Puchner 2004] Leadville, CO Denver, CO Tucson, AZ NYC • Some Data Points: • 83-86: Leadville (highest incorporated city in the US): 1 fail/2 days • 83-86: Subterrean experiment: under 50ft of rock: no fails in 9 months • 2004: 1 fail/year for laptop with 1GB ram at sea-level • 2004: 1 fail/trans-pacific roundtrip [Zeiger-Puchner 2004]
How Often do Soft Faults Happen? Soft Error Rate Trends [Shenkhar Borkar, Intel, 2004] 6 years from now we are approximately here
How Often do Soft Faults Happen? Soft Error Rate Trends [Shenkhar Borkar, Intel, 2004] 6 years from now we are approximately here • Soft error rates go up as: • Voltages decrease • Feature sizes decrease • Transistor density increases • Clock rates increase all future manufacturing trends
How Often do Soft Faults Happen? • In 1948, Presper Eckert notes that cascading effects of a single-bit error destroyed hours of Eniac’s work. [Zeiger-Puchner 2004] • In 2000, Sun server systems deployed to America Online, eBay, and others crashed due to cosmic rays [Baumann 2002] • “The wake-up call came in the end of 2001 ... billion-dollar factory ground to a halt every month due to ... a single bit flip” [Zeiger-Puchner 2004] • Los Alamos National Lab Hewlett-Packard ASC Q 2048-node supercomputer was crashing regularly from soft faults due to cosmic radiation [Michalak 2005]
What Problems do Soft Faults Cause? • a single bit in memory gets flipped • a single bit in the processor logic gets flipped and • there’s no difference in external observable behavior • the processor locks up • the computation is silently corrupted • register value corrupted (simple data fault) • control-flow transfer goes to wrong place (control-flow fault) • different opcode interpreted (instruction fault)
FT Solutions • Redundancy in Information • eg: Error correcting codes (ECC) • pros: protects stored values efficiently • cons: difficult to design for arithmetic units and control logic • Redundancy in Space • multiple redundant hardware devices • eg: Compaq Non-stop Himalaya runs two identical programs on two processors, comparing pins on every cycle • pros: efficient in time • cons: expensive in hardware (double the space) • Redundancy in Time • perform the same computations at different times (eg: in sequence) • pros: efficient in hardware (space is reused) • cons: expensive in time (slower --- but not twice as slow)
Solutions in Time Compiler generates code containing replicated computations, fault detection checks and recovery routines • eg: Rebaudengo 01, CFCSS [Oh et al. 02], SWIFT or CRAFT [Reis et al. 05], ... • pros: software-controlled --- new code with better reliability properties may be deployed whenever, wherever needed • cons: for fixed reliability policy, slower than specialized hardware solutions
Solutions in Time Compiler generates code containing replicated computations, fault detection checks and recovery routines • eg: Rebaudengo 01, CFCSS [Oh et al. 02], SWIFT or CRAFT [Reis et al. 05], ... • pros: flexibility --- new code with better reliability properties may be deployed whenever, wherever needed • cons: for fixed reliability policy, slower than specialized hardware solutions • cons: it might not actually work
Agenda • Answer basic scientific questions about software-controlled fault tolerance: • Do software-only or hybrid SW/HW techniques actually work? • For what fault models? How do we specify them? • How can we prove it? • Build compilers that produce software that runs reliably on faulty hardware • Moreover: Let’s not replace faulty hardware with faulty software. • Let’s prove every binary we produce is fault tolerant relative to the specified fault model
A Possible Compiler Architecture compiler front end ordinary program reliability transform fault tolerant program optimization optimized FT program
A Possible Compiler Architecture compiler front end ordinary program reliability transform Testing Requirements: all combinations of features multiplied by all combinations of faults fault tolerant program optimization optimized FT program
A More Reliable Compiler Architecture compiler front end ordinary program reliability transform fault tolerant program reliability proof optimization optimized FT program modified proof proof checker
A More Reliable Compiler Architecture compiler front end ordinary program reliability transform Testing Requirements: all combinations of features multiplied by all combinations of faults fault tolerant program reliability proof optimization optimized FT program modified proof proof checker
Central Technical Challenges • Designing
Step 1: Lambda Zap • Lambda Zap [ICFP 06] • a lambda calculus that exhibits intermittent data faults + operators to detect and correct them • a type system that guarantees observable outputs of well-typed programs do not change in the presence of a single fault • types act as the “proofs” of fault tolerance • expressive enough to implement an ordinary typed lambda calculus • End result: • the foundation for a fault-tolerant typed intermediate language
The Fault Model • Lambda zap models simple data faults only ( M, F[ v1 ] ) ---> ( M, F[ v2 ] ) • Not modelled: • memory faults (better protected using ECC hardware) • control-flow faults (ie: faults during control-flow transfer) • instruction faults (ie: faults in instruction opcodes) • Goal: to construct programs that tolerate 1 fault • observers cannot distinguish between fault-free and 1-fault runs
Lambda to Lambda Zap: The main idea let x = 2 in let y = x + x in out y
Lambda to Lambda Zap: The main idea let x1 = 2 in let x2 = 2 in let x3 = 2 in let y1 = x1 + x1 in let y2 = x2 + x2 in let y3 = x3 + x3 in out [y1, y2, y3] replicate instructions let x = 2 in let y = x + x in out y atomic majority vote + output
Lambda to Lambda Zap: The main idea let x1 = 2 in let x2 = 2 in let x3 = 7 in let y1 = x1 + x1 in let y2 = x2 + x2 in let y3 = x3 + x3 in out [y1, y2, y3] let x = 2 in let y = x + x in out y
Lambda to Lambda Zap: The main idea let x1 = 2 in let x2 = 2 in let x3 = 7 in let y1 = x1 + x1 in let y2 = x2 + x2 in let y3 = x3 + x3 in out [y1, y2, y3] let x = 2 in let y = x + x in out y corrupted values copied and percolate through computation but final output unchanged
Lambda to Lambda Zap: Control-flow recursively translate subexpressions let x1 = 2 in let x2 = 2 in let x3 = 2 in if [x1, x2, x3] then [[ e1 ]] else [[ e2 ]] let x = 2 in if x then e1 else e2 majority vote on control-flow transfer
Lambda to Lambda Zap: Control-flow recursively translate subexpressions let x1 = 2 in let x2 = 2 in let x3 = 2 in if [x1, x2, x3] then [[ e1 ]] else [[ e2 ]] let x = 2 in if x then e1 else e2 majority vote on control-flow transfer (function calls replicate arguments, results and function itself)
Almost too easy, can anything go wrong?... yes! optimization reduces replication overhead dramatically (eg: ~ 43% for 2 copies), but can be unsound! original implementation of SWIFT [Reis et al.] optimized away all redundancy leaving them with an unreliable implementation!!
Faulty Optimizations let x1 = 2 in let x2 = 2 in let x3 = 2 in let y1 = x1 + x1 in let y2 = x2 + x2 in let y3 = x3 + x3 in out [y1, y2, y3] let x1 = 2 in let y1 = x1 + x1 in out [y1, y1, y1] CSE In general, optimizations eliminate redundancy, fault-tolerance requires redundancy.
The Essential Problem bad code: let x1 = 2 in let y1 = x1 + x1 in out [y1, y1, y1] voters depend on common value x1
The Essential Problem good code: bad code: let x1 = 2 in let x2 = 2 in let x3 = 2 in let y1 = x1 + x1 in let y2 = x2 + x2 in let y3 = x3 + x3 in out [y1, y2, y3] let x1 = 2 in let y1 = x1 + x1 in out [y1, y1, y1] voters depend on common value x1 voters do not depend on a common value
The Essential Problem good code: bad code: let x1 = 2 in let x2 = 2 in let x3 = 2 in let y1 = x1 + x1 in let y2 = x2 + x2 in let y3 = x3 + x3 in out [y1, y2, y3] let x1 = 2 in let y1 = x1 + x1 in out [y1, y1, y1] voters depend on a common value voters do not depend on a common value (red on red; green on green; blue on blue)
A Type System for Lambda Zap • Key idea: types track the “color” of the underlying value & prevents interference between colors Colors C ::= R | G | B Types T ::= C int | C bool | C (T1,T2,T3) (T1’,T2’,T3’)
Sample Typing Rules Judgement Form: G |--z e : T where z ::= C | . simple value typing rules: (x : T) in G --------------- G |--z x : T ------------------------ G |--z C n : C int ------------------------------ G |--z C true : C bool
Sample Typing Rules Judgement Form: G |--z e : T where z ::= C | . sample expression typing rules: G |--z e1 : C int G |--z e2 : C int ------------------------------------------------- G |--z e1 + e2 : C int G |--z e1 : R bool G |--z e2 : G bool G |--z e3 : B bool G |--z e4 : T G |--z e5 : T ----------------------------------------------------- G |--z if [e1, e2, e3] then e4 else e5 : T G |--z e1 : R int G |--z e2 : G int G |--z e3 : B int G |--z e4 : T ------------------------------------ G |--z out [e1, e2, e3]; e4 : T
Sample Typing Rules Judgement Form: G |--z e : T where z ::= C | . recall “zap rule” from operational semantics: ( M, F[ v1 ] ) ---> ( M, F[ v2 ] ) before: |-- v1 : T after: |-- v2 ?? T ==> how will we obtain type preservation?
Sample Typing Rules Judgement Form: G |--z e : T where z ::= C | . recall “zap rule” from operational semantics: ( M, F[ v1 ] ) ---> ( M, F[ v2 ] ) before: no conditions |-- v1 : C U “faulty typing” occurs within a single color only. after: ---------------------- G |--CC v : C U |--C v2 : C U by rule:
Theorems • Theorem 1: Well-typed programs are safe, even when there is a single error. • Theorem 2: Well-typed programs executing with a single error simulate the output of well-typed programs with no errors [with a caveat]. • Theorem 3: There is a correct, type-preserving translation from the simply-typed lambda calculus into lambda zap [that satisfies the caveat]. ICFP 06
The Caveat Goal: 0-fault and 1-fault executions should be indistinguishable bad, but well-typed code: out [2, 3, 3] outputs 3 after no faults out [2, 3, 3] out [2, 2, 3] outputs 2 after 1 fault More importantly: out [2, 3, 3] is obviously a symptom of a compiler bug out [2, 3, 4] is even worse – good runs never come to consensus Solution: computations must independent, but equivalent
The Caveat modified typing: G |--z e1 : R U G |--z e2 : G U G |--z e3 : B U G |--z e4 : T G |--z e1 ~~ e2 G |--z e2 ~~ e3 ---------------------------------------------------------------------------- G |-- out [e1, e2, e3]; e4 : T
The Caveat More generally, programmers may form “triples” of equivalent values Introduction form: Elimination form: [e1, e2, e3] let [x1, x2, x3] = e1 in e2 • a collection of 3 items • each of 3 stored in separate register • single fault effects at most one
The Caveat More generally, programmers may form “triples” of equivalent values Introduction form: Elimination form: G |--z e1 : R U G |--z e2 : G U G |--z e3 : B U G |--z e1 ~~ e2 G |--z e2 ~~ e3 --------------------------------------------- G |-- [e1, e2, e3] : [R U, G U, B U] G |--z e1 : [R U, G U, B U] G, x1:R U, x2:G U, x3:B U, x1 ~ x2, x2 ~ x3 |--z e2 : T --------------------------------------------- G |-- let [x1, x2, x3] = e1 in e2 : T
Theorems* • Theorem 1: Well-typed programs are safe, even when there is a single error. • Theorem 2: Well-typed programs executing with a single error simulate the output of well-typed programs with no errors. • Theorem 3: There is a correct, type-preserving translation from the simply-typed lambda calculus into lambda zap. * There is still one “i” to be dotted in the proofs of these theorems. Lester Mackey, brilliant Princeton undergrad, has proven all key theorems modulo the dotted “i”.
Step 2: Fault Tolerant Typed Assembly Language (TAL/FT) • Lambda zap is playground for studying the principles of fault tolerance in an idealized setting • TAL/FT is a more realistic assembly-level, hybrid HW/SW, fault tolerance scheme with • a formal fault model • a formal definition of fault tolerance relative to memory-mapped I/O • a sound type system for proving compiled programs are fault tolerant
A More Reliable Compiler Architecture compiler front end ordinary program reliability transform types reliability proof TAL/FT optimization types optimized TAL/FT modified proof type proof checker
TAL/FT: Key Ideas (Fault Model) • Fault model: • registers may incur arbitrary faults in between execution of any two instructions • memory (including code) is protected by ECC • fault model formalized as part of hardware operational semantics
TAL/FT: Key Ideas (Properties) store read Mem-mapped I/O device ECC-protected memory Processor • Primary Goal: if there is one fault then either • Mem-mapped I/O device sees exactly the same sequence • of stores as a fault-free execution, or • (2) Hardware detects and signals a fault and mem-mapped I/O • sees a prefix of the stores from a fault-free execution • Secondary Goal: no false positives
TAL/FT: Key Ideas (Mechanisms) • Compiler strategy • create two redundant computations as lambda zap • two copies ==> fault detection • fault recovery handled by a higher-level process • Hardware support • special instructions & modified store buffer for implementing reliable stores • special instructions for reliable control-flow transfers • Type system • Simple typing mechanisms based on original TAL [Morrisett, Walker, et al.] • Redundant values with separate colors like in lambda zap • Value identities needed for equivalence checking tracked using singleton types combined with some ideas drawn from traditional Hoare logics