Dependability and real-time • If a system is to produce results within time constraints, it needs to produce results at all! • How to justify and measure how well computer systems do their jobs?
Dependable systems • How do things go wrong and why? • What can we do about it? • This lecture: Basic notions of dependability and replication in fault-tolerant systems • Next lecture: Designing Dependable Real-time systems
Early computer systems • 1944: Real-time computer system in the Whirlwind project at MIT, used in a military air traffic control system in 1951 • The short life of vacuum tubes gave a mean time to failure of 20 minutes
Engineers: Fool me once, shame on you – fool me twice, shame on me
Software developers: Fool me N times, who cares, this is complex and anyway no one expects software to work...
FT - June 16, 2004 • "If you have a problem with your Volkswagen, the likelihood that it was a software problem is very high. Software technology is not something that we as car manufacturers feel comfortable with." Bernd Pischetsrieder, chief executive of Volkswagen
October 2005 • “Automaker Toyota announced a recall of 160,000 of its Prius hybrid vehicles following reports of vehicle warning lights illuminating for no reason, and cars' gasoline engines stalling unexpectedly.” Wired 05-11-08 • The problem was found to be an embedded software bug
Early space and avionics • During 1955 there were 18 air carrier accidents in the USA (at a time when only 20% of the public was willing to fly!) • Today's complexity is many times higher
Airbus A380 • Integrated modular avionics (IMA), with safety-critical digital components, e.g. • Power-by-wire: complementing the hydraulically powered flight control surfaces • Cabin pressure control (implemented with a TTP operated bus)
Dependability • How do things go wrong? • Why? • What can we do about it? • Basic notions in dependable systems and replication for fault tolerance (sv. Pålitliga datorsystem)
What is dependability? • Property of a computing system which allows reliance to be justifiably placed on the service it delivers [Avizienis et al.] • The ability to avoid service failures that are more frequent or more severe than is acceptable (sv. Pålitliga datorsystem)
Attributes of dependability IFIP WG 10.4 definitions: • Safety: absence of harm to people and environment • Availability: the readiness for correct service • Integrity: absence of improper system alterations • Reliability: continuity of correct service • Maintainability: ability to undergo modifications and repairs
Reliability • [Sv. Tillförlitlighet] • Means that the system (functionally) behaves as specified, and does so continually over measured intervals of time • Typical measure in aerospace: a failure rate of 10⁻⁹ per hour • Another way of putting it: an MTTF of 10⁹ flight hours, i.e. one failure in 10⁹ flight hours
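To make the 10⁻⁹ figure concrete, here is a small worked example under the standard exponential failure model (a modelling assumption on my part, not something stated in the lecture):

```latex
% Constant failure rate \lambda gives reliability R(t) and MTTF:
R(t) = e^{-\lambda t}, \qquad \mathrm{MTTF} = 1/\lambda
% With \lambda = 10^{-9}\,\mathrm{h}^{-1}, the probability of a failure
% during a single 10-hour flight is
1 - R(10) = 1 - e^{-10 \cdot 10^{-9}} \approx 10^{-8}
```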
Faults, Errors & Failures • Fault: a defect within the system or a situation that can lead to failure • Error: manifestation (symptom) of the fault - an unexpected behaviour • Failure: system not performing its intended function
Examples • Year 2000 bug • Bit flips in hardware due to cosmic radiation in space • Loose wire • Aircraft retracting its landing gear while on the ground • Effects in time: permanent / transient / intermittent
Fault ⇒ Error ⇒ Failure • Goal of system verification and validation is to eliminate faults • Goal of hazard/risk analysis is to focus on the important faults • Goal of fault tolerance is to reduce the effects of errors if they appear, i.e. to eliminate or delay failures • Some will remain…
More on dependability Four approaches [IFIP 10.4]: • Fault avoidance • Fault removal • Fault tolerance • Fault forecasting (next lecture)
Google's 100 min outage September 2, 2009: A small fraction of Gmail's servers were taken offline to perform routine upgrades. "We had slightly underestimated the load which some recent changes (ironically, some designed to improve service availability) placed on the request routers." Fault forecasting?
Fault tolerance • Means that a system provides a degraded (but acceptable) function • even in the presence of faults • during a period defined by certain model assumptions • Foreseen or unforeseen? • The fault model describes the foreseen faults
Fault models • Faults leading to node failures • Crash • Omission • Timing • Byzantine • Faults leading to channel failures • Crash (and potential partitions) • Message loss • Message delay • Erroneous/arbitrary messages
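As a concrete illustration, the taxonomy above can be written down as plain enumerations; the type and member names below are my own, not terminology from the slides:

```python
from enum import Enum, auto

class NodeFailure(Enum):
    CRASH = auto()      # node halts and stays silent
    OMISSION = auto()   # node skips some send/receive steps
    TIMING = auto()     # node responds outside its time window
    BYZANTINE = auto()  # node behaves arbitrarily, possibly maliciously

class ChannelFailure(Enum):
    CRASH = auto()          # channel goes down (may partition the network)
    MESSAGE_LOSS = auto()   # some messages never arrive
    MESSAGE_DELAY = auto()  # messages arrive late
    ARBITRARY = auto()      # messages are corrupted or forged
```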
On-line fault management • Fault detection • by the program or by its environment • Fault tolerance using redundancy • software • hardware • data • Fault containment by architectural choices
Redundancy From D. Lardner, Edinburgh Review, 1834: "The most certain and effectual check upon errors which arise in the process of computation is to cause the same computations to be made by separate and independent computers*; and this check is rendered still more decisive if their computations are carried out by different methods." * people who compute
Static Redundancy Used all the time (whether an error has appeared or not), just in case… • SW: N-version programming • HW: Voting systems • Data: parity bits, checksums
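A minimal sketch of the voting idea in software, in the spirit of triple modular redundancy (TMR); the function names and the majority rule are illustrative assumptions, not code from the lecture:

```python
from collections import Counter

def majority(results):
    """Return the value produced by a majority of replicas.

    Raises if no value reaches a majority (e.g. all three disagree)."""
    value, votes = Counter(results).most_common(1)[0]
    if votes <= len(results) // 2:
        raise RuntimeError("no majority among replicas")
    return value

# Static redundancy: run three (ideally independent) versions every time,
# whether or not anything has gone wrong, and vote on the outputs.
def tmr(compute_a, compute_b, compute_c, x):
    return majority([compute_a(x), compute_b(x), compute_c(x)])
```

With N-version programming, the three functions would be independently developed implementations of the same specification, so that a common design fault is less likely to outvote the correct result.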
Dynamic Redundancy Used when an error appears, and specifically aids its treatment • SW: Recovery methods, exceptions • HW: Switching to a back-up module • Data: Self-correcting codes • Time: Re-computing a result
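Time redundancy in its simplest form is just recomputing on error. A sketch, assuming transient faults surface as exceptions; compute and attempts are illustrative names:

```python
def recompute_on_error(compute, x, attempts=3):
    """Dynamic redundancy in time: retry a computation that may fail
    transiently. This only helps against transient faults; a permanent
    fault will fail on every attempt."""
    last = None
    for _ in range(attempts):
        try:
            return compute(x)
        except Exception as e:   # error detected
            last = e             # treat it by trying again
    raise last
```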
Server replication models • Passive replication • Active replication • [Figure: the two models; a highlighted symbol in the diagram denotes a replica group]
Increasing availability • Active replication • Group membership: who is up, who is down? • Passive replication • Primary – backup: bring the secondary server up to date • What do we need to implement to support transition to operational mode? • Message ordering • Agreement among replicas
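A toy sketch of the passive (primary-backup) model: the primary executes each update and forwards it so the backup stays current. The class structure is an illustrative assumption:

```python
class Replica:
    def __init__(self):
        self.state = {}

    def apply(self, update):
        key, value = update
        self.state[key] = value

class Primary(Replica):
    def __init__(self, backup):
        super().__init__()
        self.backup = backup

    def handle(self, update):
        self.apply(update)          # execute the request
        self.backup.apply(update)   # keep the backup up to date, so that
        # if the primary crashes, the backup can take over with a state
        # at most one in-flight update behind
```

Active replication instead has every replica execute every request, which is why message ordering and agreement among replicas become the central problems.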
Relating states Needs a notion of time (precedence) in distributed systems Recall from lecture 4: • If clocks are synchronised • The rate of processing at each node can be related • Timeouts can be used to detect faults as opposed to message delays • Asynchronous systems: • Abstract time via order of events
Distributed snapshot • Vector clocks help to synchronise at event level • Consistent snapshots • But reasoning about response times and fault tolerance needs quantitative bounds
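A compact sketch of vector clocks for n processes identified by indices 0..n-1; the method names here are assumptions for illustration:

```python
class VectorClock:
    """Vector clock for n processes, identified by indices 0..n-1."""

    def __init__(self, n, me):
        self.v = [0] * n   # one counter per process
        self.me = me       # index of the local process

    def local_event(self):
        self.v[self.me] += 1

    def on_send(self):
        """Tick, then return the timestamp to piggyback on the message."""
        self.local_event()
        return list(self.v)

    def on_receive(self, ts):
        """Merge the incoming timestamp, then tick for the receive event."""
        self.v = [max(a, b) for a, b in zip(self.v, ts)]
        self.local_event()

def happened_before(a, b):
    """a -> b iff a <= b componentwise and a != b (else concurrent or after)."""
    return all(x <= y for x, y in zip(a, b)) and a != b
```

Note the limitation flagged on the slide: vector clocks give order, not quantitative time, so they cannot by themselves bound response times or distinguish a slow message from a failure.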
”Chicken and egg” problem • Replication is useful in the presence of failures only if there is a consistent common state among replicas • To get consistency, processes need to communicate their state via broadcast • But broadcast algorithms are distributed algorithms that run on every node • and can be affected by failures...
Desirable broadcast properties • Reliable broadcast • all non-crashed processes agree on messages delivered (agreement) • no spurious messages (integrity) • all messages broadcast by non-crashed processes are delivered (validity) All or none!
How to implement? • The first step is to separate the underlying network (transport) and the broadcast mechanism • Distinguish between receipt and delivery of a message
[Figure: layering. The application calls Broadcast and gets messages via Deliver; the broadcast protocol implements these on top of the transport layer's Send and Receive.]
Example implementation At every node n:
• To execute broadcast(m):
  • add sender(m) and a unique ID as a header to the message m (building m)
  • send(m) to all neighbours, including n itself
• On receive(m):
  • if deliver(m) has not previously been executed:
    • if sender(m) ≠ n then send(m) to all neighbours
    • deliver(m)
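A sketch of this eager relaying in Python. The transport (send_to_all) and the application callback (deliver) are stand-in assumptions, and crashes are not modelled here:

```python
import itertools

class Broadcast:
    _ids = itertools.count()   # source of unique message IDs

    def __init__(self, node_id, send_to_all, deliver):
        self.node_id = node_id
        self.send_to_all = send_to_all   # transport: send to every neighbour
        self.deliver = deliver           # hand message up to the application
        self.seen = set()                # (sender, id) pairs already delivered

    def broadcast(self, payload):
        m = (self.node_id, next(self._ids), payload)  # header: sender, unique ID
        self.send_to_all(m)              # assumed to include ourselves

    def receive(self, m):
        sender, mid, payload = m
        if (sender, mid) in self.seen:   # already delivered: drop duplicate
            return
        self.seen.add((sender, mid))
        if sender != self.node_id:       # relay before delivering
            self.send_to_all(m)
        self.deliver(payload)
```

The questions on the next slide are exactly about the gaps in this sketch: a node can crash after relaying to some but not all neighbours, so the guarantees depend on the failure model.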
What if n fails? Directly after a receipt? While relaying? After sending to some but not all neighbours? This is where failure models come in...
Algorithms for broadcast • Correctness: prove validity, integrity, agreement, order, in the presence of a given failure model! • Typical assumptions: • no link failures leading to partition • send does not duplicate or change messages • receive does not ”invent” messages
The consensus problem Assume that we have a reliable broadcast • Processes p1, …, pn take part in a decision • Each pi proposes a value vi • e.g. application state info • All correct processes decide on a common value v that is equal to one of the proposed values
Desired properties • Every non-faulty process eventually decides (Termination) • No two non-faulty processes decide differently (Agreement) • If a process decides v, then the value v was proposed by some process (Validity) • Algorithms for consensus have to be proven to have these properties in the presence of a given failure model
Basic impossibility result [Fischer, Lynch and Paterson 1985] There is no deterministic algorithm solving the consensus problem in an asynchronous distributed system with a single crash failure
A way around it Assume Synchrony: • Distributed computations proceed in rounds initiated by pulses • Pulses implemented using local physical clocks, synchronised assuming bounded message delays
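Under such synchrony, consensus becomes solvable for crash faults. Below is a sketch in the spirit of the classic FloodSet algorithm (crash faults only, not the Byzantine case discussed later); the round-simulation harness and the simplification that nodes crash only between rounds are my assumptions:

```python
def flood_consensus(proposals, crash_round, t):
    """FloodSet-style consensus in a synchronous system with crash faults.

    proposals:   dict node -> proposed value
    crash_round: dict node -> round in which it crashes (absent = never)
    Simplification: a node crashes cleanly between rounds; real proofs
    must handle crashes mid-broadcast, which is why t+1 rounds are needed.
    """
    known = {p: {v} for p, v in proposals.items()}
    for rnd in range(1, t + 2):                       # t+1 rounds suffice
        alive = [p for p in known if crash_round.get(p, t + 2) > rnd]
        heard = {p: set() for p in alive}
        for s in alive:                               # everyone floods all it knows
            for r in alive:
                heard[r] |= known[s]
        for p in alive:
            known[p] |= heard[p]
    # every survivor now holds the same set; pick a deterministic element
    return {p: min(known[p]) for p in known
            if crash_round.get(p, t + 2) > t + 1}
```

For example, flood_consensus({1: 5, 2: 3, 3: 7}, {2: 1}, t=1) makes the survivors 1 and 3 decide the same value, satisfying termination, agreement and validity under this fault model.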
Can I prove any protocol correct? NO! It depends on the fault model: the assumptions on how the nodes/channels can fail
Byzantine agreement Problem: • Nodes need to agree on a decision, but some nodes are faulty and act in an arbitrary way (they can even be malicious)
Byzantine agreement protocol • Proposed in 1980 by Pease, Shostak and Lamport How many faulty nodes can the algorithm tolerate?
Result from 1980 • Theorem: There is an upper bound t on the number of Byzantine node failures that can be tolerated, relative to the size N of the network: N ≥ 3t + 1 • Gives a t+1 round algorithm for solving consensus in a synchronous network • Here: we just demonstrate that 3t nodes (3 nodes with t = 1 faulty) would not be enough!
Scenario 1 • G and L1 are correct, L2 is faulty [Figure: G sends the value 1 to both lieutenants; L1 truthfully relays "G said 1" to L2, but the faulty L2 tells L1 "G said 0"]
Scenario 2 • G and L2 are correct, L1 is faulty [Figure: the faulty G sends 1 to L1 and 0 to L2; L2 truthfully relays "G said 0" to L1]
L1 receives exactly the same messages in both scenarios (1 from G, "G said 0" from L2), yet it must decide 1 in Scenario 1 (follow the correct general) and agree with L2 on 0 in Scenario 2, so no protocol for 3 nodes can tolerate one Byzantine fault.
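The heart of the argument can even be checked mechanically: build L1's view in each scenario and observe that the views coincide, so no deterministic rule at L1 can decide correctly in both. This little harness is purely illustrative:

```python
# L1's view = (value received directly from G, what L2 claims G said)
scenario_1 = {"from_G": 1, "L2_says_G_said": 0}  # L2 is lying (faulty)
scenario_2 = {"from_G": 1, "L2_says_G_said": 0}  # G is lying (faulty)

assert scenario_1 == scenario_2
# Identical inputs, but correctness demands decision 1 in scenario 1
# (obey the correct general) and 0 in scenario 2 (agree with the
# correct L2). No deterministic protocol can do both, so N = 3 cannot
# tolerate t = 1 Byzantine fault; N >= 3t + 1 is required.
```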