EEC 693/793 Special Topics in Electrical Engineering
Secure and Dependable Computing
Lecture 11
Wenbing Zhao
Department of Electrical and Computer Engineering
Cleveland State University
wenbing@ieee.org

Outline
• Reminder
  • Midterm #2: April 7, Monday
• Dependability concepts (some review)
• Fault, error and failure (some review)
• Fault/failure detection in distributed systems
• Consensus in asynchronous distributed systems
EEC693: Secure & Dependable Computing

Dependable System
• Dependability:
  • Ability to deliver service that can justifiably be trusted
  • Ability to avoid service failures that are more frequent or more severe than is acceptable
• When service failures are more frequent or more severe than acceptable, we say there is a dependability failure
• For a system to be dependable, it must be
  • Available - e.g., ready for use when we need it
  • Reliable - e.g., able to provide continuity of service while we are using it
  • Safe - e.g., does not have a catastrophic consequence on the environment
  • Secure - e.g., able to preserve confidentiality

Approaches to Achieving Dependability
• Fault Avoidance - how to prevent, by construction, the fault occurrence or introduction
• Fault Removal - how to minimize, by verification, the presence of faults
• Fault Tolerance - how to provide, by redundancy, a service complying with the specification in spite of faults
• Fault Forecasting - how to estimate, by evaluation, the presence, the creation, and the consequence of faults

Graceful Degradation
• If a specified fault scenario develops, the system must still provide a specified level of service. Ideally, the performance of the system degrades gracefully
• The system must not suddenly collapse when a fault occurs, or as the size of the faults increases
• Rather, it should continue to execute part of the workload correctly

Quantitative Dependability Measures
• Reliability - a measure of continuous delivery of proper service or, equivalently, of the time to failure
• It is the probability of surviving (potentially despite failures) over an interval of time
• For example, the reliability requirement might be stated as a 0.999999 reliability for a 10-hour mission. In other words, the probability of failure during the mission may be at most 10^-6
• Hard real-time systems such as flight control and process control, in which a failure could mean loss of life, demand high reliability

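To make the 10-hour mission figure concrete, here is a minimal sketch that assumes a constant failure rate (the exponential model R(t) = exp(-lambda * t)). That model is an assumption for illustration and is not stated on the slide; the function names are hypothetical.

```python
import math


def reliability(failure_rate_per_hour: float, hours: float) -> float:
    """Probability of surviving `hours`, assuming a constant failure rate."""
    return math.exp(-failure_rate_per_hour * hours)


def max_failure_rate(target_reliability: float, hours: float) -> float:
    """Largest constant failure rate that still meets the target over the mission."""
    return -math.log(target_reliability) / hours


mission_hours = 10.0
target = 0.999999

rate = max_failure_rate(target, mission_hours)
failure_probability = 1.0 - reliability(rate, mission_hours)

print(f"maximum failure rate: {rate:.3e} per hour")
print(f"failure probability over the mission: {failure_probability:.3e}")
```

Under this assumed model, the requirement translates into a bound on the failure rate per hour, which is a quantity that can be compared against component data.
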
Quantitative Dependability Measures
• Availability - a measure of the delivery of correct service with respect to the alternation of correct service and out-of-service
• It is the probability of being operational at a given instant of time
• A 0.999999 availability means that the system is not operational at most one hour in a million hours
• A system with high availability may in fact fail. However, failure frequency and recovery time should be small enough to achieve the desired availability
• Soft real-time systems such as telephone switching and airline reservation require high availability

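The "one hour in a million hours" figure can be checked with a few lines of arithmetic. The sketch below also uses the common steady-state relation availability = MTTF / (MTTF + MTTR); MTTF and MTTR (mean time to failure and mean time to repair) are not named on the slide and are introduced here as assumptions to show how failure frequency and recovery time combine.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time to failure and mean time to repair."""
    return mttf_hours / (mttf_hours + mttr_hours)


def expected_downtime_hours(avail: float, period_hours: float) -> float:
    """Expected out-of-service time over a period at the given availability."""
    return (1.0 - avail) * period_hours


# A system that fails about once per 999,999 hours and takes 1 hour to recover.
six_nines = availability(mttf_hours=999_999, mttr_hours=1)
downtime = expected_downtime_hours(six_nines, period_hours=1_000_000)

print(f"availability: {six_nines:.6f}")
print(f"expected downtime per million hours: {downtime:.3f} h")
```

The same availability could also be reached by a system that fails far more often but recovers in seconds, which is why availability alone says little about reliability.
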
Fault, Error, and Failure
• The adjudged or hypothesized cause of an error is called a fault
• An error is a manifestation of a fault in a system, in which the logical state of an element differs from its intended value
• A service failure occurs if the error propagates to the service interface and causes the service delivered by the system to deviate from correct service
• The failure of a component causes a permanent or transient fault in the system that contains the component
• Service failure of a system causes a permanent or transient external fault for the other system(s) that receive service from the given system

Fault
• Faults can arise during all stages in a computer system's evolution - specification, design, development, manufacturing, assembly, and installation - and throughout its operational life
• Most faults that occur before full system deployment are discovered through testing and eliminated
• Faults that are not removed can reduce a system's dependability when it is in the field
• A fault can be classified by its duration, nature of output, and correlation to other faults

Fault Types - Based on Duration
• Permanent faults are caused by irreversible device/software failures within a component due to damage, fatigue, improper manufacturing, or bad design and implementation
  • Permanent software faults are also called Bohrbugs
  • Easier to detect
• Transient/intermittent faults are triggered by environmental disturbances or incorrect design
  • Transient software faults are also referred to as Heisenbugs
  • Studies show that Heisenbugs make up the majority of software faults
  • Harder to detect

Fault Types - Based on Nature of Output
• Malicious fault: a fault that causes a unit to behave arbitrarily or maliciously. Also referred to as a Byzantine fault
  • A sensor sending conflicting outputs to different processors
  • A compromised software system that attempts to cause service failure
• Non-malicious faults: the opposite of malicious faults
  • Faults that are not caused with malicious intention
  • Faults that exhibit themselves consistently to all observers, e.g., fail-stop
• Malicious faults are much harder to detect than non-malicious faults

Fail-Stop System
• A system is said to be fail-stop if it responds to up to a certain maximum number of faults by simply stopping, rather than producing incorrect output
• A fail-stop system typically has many processors running the same tasks and comparing the outputs. If the outputs do not agree, the whole unit turns itself off
• A system is said to be fail-safe if one or more safe states can be identified that can be accessed in case of a system failure, in order to avoid catastrophe

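The compare-and-stop behaviour described above can be sketched in a few lines. This is only an illustration under simplifying assumptions: the replicas are plain Python callables standing in for processors, and `FailStopUnit` is a hypothetical name, not a component from the lecture.

```python
class FailStopUnit:
    """Runs the same task on several replicas and stops on any disagreement."""

    def __init__(self, replicas):
        self.replicas = list(replicas)
        self.stopped = False

    def run(self, task_input):
        if self.stopped:
            raise RuntimeError("unit has stopped")
        outputs = [replica(task_input) for replica in self.replicas]
        if any(out != outputs[0] for out in outputs[1:]):
            # Stop instead of delivering a possibly incorrect output.
            self.stopped = True
            raise RuntimeError("replica outputs disagree; unit stopped")
        return outputs[0]


# Two replicas that agree, and a pair in which one replica is faulty.
healthy = FailStopUnit([lambda x: x * 2, lambda x: x + x])
faulty = FailStopUnit([lambda x: x * 2, lambda x: x * 2 + 1])

print(healthy.run(3))
try:
    faulty.run(3)
except RuntimeError as err:
    print("stopped:", err)
```

Note that comparison only detects faults that make the replicas disagree; replicas that fail in the same way (a correlated fault) would pass the check.
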
Fault Types - Based on Correlation
• Component faults may be independent of one another or correlated
• A fault is said to be independent if it does not directly or indirectly cause another fault
• Faults are said to be correlated if they are related. Faults could be correlated due to physical or electrical coupling of components
• Correlated faults are more difficult to detect than independent faults

Fail Fast to Reduce Heisenbugs
• The bugs that software developers hate most:
  • The ones that show up only after hours of successful operation, under unusual circumstances
  • The stack trace usually does not provide useful information
• Bugs of this kind can have many causes, such as
  • Not checking the boundary of an array
  • Invalid defensive programming (this is what fail fast addresses)
• Reference
  • http://www.martinfowler.com/ieeeSoftware/failFast.pdf

Fail Fast to Reduce Heisenbugs
• Invalid defensive programming
  • Making your software robust by working around problems automatically
  • This results in the software "failing slowly"
  • That is, it facilitates error propagation - the program continues working right after an error but fails in strange ways later on
• Example:
    public int maxConnections() {
        string property = getProperty("maxConnections");
        if (property == null) {
            // Silently falls back to a default, hiding the missing setting
            return 10;
        } else {
            return int.Parse(property);
        }
    }

Fail Fast to Reduce Heisenbugs
• Fail fast programming
  • When a problem occurs, the software fails immediately and visibly
  • It may sound like this would make your software more fragile, but it actually makes it more robust
  • Bugs are easier to find and fix, so fewer go into production
• Example:
    public int maxConnections() {
        string property = getProperty("maxConnections");
        if (property == null) {
            // Fails at the point of the problem and names the file to check
            throw new NullReferenceException("maxConnections property not found in " + this.configFilePath);
        } else {
            return int.Parse(property);
        }
    }

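The same fail-fast idea can be sketched in Python. The configuration is assumed to be a plain dictionary, and `max_connections` and `config_path` are hypothetical names chosen for this illustration; the point is only that a missing or malformed setting is reported where it is read, with enough context to fix it.

```python
def max_connections(config: dict, config_path: str = "<unknown>") -> int:
    """Return the configured connection limit, failing immediately if it is unusable."""
    raw = config.get("maxConnections")
    if raw is None:
        # Fail fast: no silent default that would hide the misconfiguration.
        raise KeyError(f"maxConnections property not found in {config_path}")
    try:
        value = int(raw)
    except ValueError as exc:
        raise ValueError(
            f"maxConnections must be an integer, got {raw!r} in {config_path}"
        ) from exc
    if value <= 0:
        raise ValueError(f"maxConnections must be positive, got {value} in {config_path}")
    return value


print(max_connections({"maxConnections": "25"}, "app.properties"))
try:
    max_connections({}, "app.properties")
except KeyError as err:
    print("configuration error:", err)
```

A caller that really does want a default can still supply one explicitly, which keeps the decision visible in the code instead of buried in the accessor.
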
Failure Detection in Distributed Systems
• Consider the failure detection problem in an asynchronous distributed system, where there is
  • No upper bound on processing time
  • No upper bound on clock drift rate
  • No upper bound on network delay
• In an asynchronous distributed system, you cannot tell a crashed process from a slow one, even if you can assume that messages are sequenced and retransmitted (an arbitrary number of times), so that they eventually get through
• This observation led Fischer, Lynch, and Paterson to prove that no deterministic protocol can guarantee consensus in a fully asynchronous distributed system if even one process may crash

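A timeout-based heartbeat detector shows why a slow process and a crashed one look the same. The sketch below is a simplified illustration, not a protocol from the lecture: time is passed in explicitly instead of being read from a clock, and `TimeoutFailureDetector` is a hypothetical name.

```python
class TimeoutFailureDetector:
    """Suspects any process whose last heartbeat is older than `timeout`."""

    def __init__(self, timeout: float):
        self.timeout = timeout
        self.last_heartbeat = {}

    def heartbeat(self, process_id: str, now: float) -> None:
        self.last_heartbeat[process_id] = now

    def suspects(self, now: float) -> set:
        return {
            pid for pid, seen in self.last_heartbeat.items()
            if now - seen > self.timeout
        }


detector = TimeoutFailureDetector(timeout=3.0)
detector.heartbeat("p1", now=0.0)
detector.heartbeat("p2", now=0.0)
detector.heartbeat("p1", now=2.0)

# At time 4, p2 has been silent for longer than the timeout.
suspected_at_4 = detector.suspects(now=4.0)

# p2 was only slow: its delayed heartbeat arrives and the suspicion is withdrawn.
detector.heartbeat("p2", now=5.0)
suspected_at_5 = detector.suspects(now=5.0)

print(suspected_at_4, suspected_at_5)
```

Because no timeout is large enough when delays are unbounded, a detector like this can always be wrong about a slow process, which is why it is called unreliable.
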
Consensus Problem
• Safety:
  • Only a value that has been proposed may be chosen
  • Only a single value is chosen, and
  • A process never learns that a value has been chosen unless it actually has been
• Liveness:
  • Some proposed value is eventually chosen and, if a value has been chosen, then a process can eventually learn the value

Impossibility Results
• FLP impossibility of consensus
  • A single faulty process can prevent consensus
  • Because a slow process is indistinguishable from a crashed one
• Chandra/Toueg showed that FLP impossibility applies to many problems, not just consensus
  • In particular, they show that FLP applies to group membership and reliable multicast
  • So these practical problems are also unsolvable in fully asynchronous systems
  • They also look at the weakest condition under which consensus can be solved
• Ways to bypass the impossibility result
  • Use an unreliable failure detector
  • Use a randomized consensus algorithm

The Paxos Algorithm
• Contribution: separately consider safety and liveness issues. Safety is always guaranteed, and liveness is ensured during periods of synchrony
• Participants in the algorithm are divided into three categories
  • Proposers: those who propose values
  • Acceptors: those who decide which value to choose
  • Learners: those who are interested in learning the value chosen

The Paxos Algorithm
• How to choose a value
  • Use a single acceptor: straightforward but not fault tolerant
  • Use a number of acceptors: a value is chosen if a majority of the acceptors have accepted it

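The majority rule works because any two majorities of the same acceptor set share at least one acceptor, so two different values cannot both gather a majority unless some acceptor accepts both. The brute-force check below illustrates this for five acceptors; the acceptor names and helper function are made up for the example.

```python
from itertools import combinations


def majorities(acceptors):
    """All subsets of `acceptors` that contain more than half of them."""
    needed = len(acceptors) // 2 + 1
    return [
        set(group)
        for size in range(needed, len(acceptors) + 1)
        for group in combinations(acceptors, size)
    ]


acceptors = ["a1", "a2", "a3", "a4", "a5"]
quorums = majorities(acceptors)

# Every pair of majorities has a non-empty intersection.
all_intersect = all(q1 & q2 for q1, q2 in combinations(quorums, 2))

# A majority can still be formed after this many acceptor crashes.
tolerated_crashes = (len(acceptors) - 1) // 2

print(len(quorums), all_intersect, tolerated_crashes)
```
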
The Paxos Algorithm
• Requirements for choosing a value
  • P1. An acceptor must accept the first proposal that it receives
  • P2. If a proposal with value v is chosen, then every higher-numbered proposal that is chosen has value v
• Since the proposal numbers are totally ordered, P2 guarantees the safety property

The Paxos Algorithm
• How to guarantee P2?
  • P2a: If a proposal with value v is chosen, then every higher-numbered proposal accepted by any acceptor has value v
  • But what if an acceptor that has never accepted v accepted a proposal with v'?
  • P2b: If a proposal with value v is chosen, then every higher-numbered proposal issued by any proposer has value v
• P2b implies P2a, which implies P2

The Paxos Algorithm
• How to ensure P2b?
  • P2c: For any v and n, if a proposal with value v and number n is issued, then there is a set S consisting of a majority of acceptors such that either
    • (a) no acceptor in S has accepted any proposal numbered less than n, or
    • (b) v is the value of the highest-numbered proposal among all proposals numbered less than n accepted by the acceptors in S

The Paxos Algorithm
• To ensure P2c, an acceptor must promise:
  • It will not accept any more proposals numbered less than n, once it has responded to a prepare request numbered n

The Paxos Algorithm
• Phase 1.
  • (a) A proposer selects a proposal number n and sends a prepare request with number n to a majority of acceptors.
  • (b) If an acceptor receives a prepare request with number n greater than that of any prepare request to which it has already responded, then it responds to the request with a promise not to accept any more proposals numbered less than n and with the highest-numbered proposal (if any) that it has accepted.

The Paxos Algorithm
• Phase 2.
  • (a) If the proposer receives a response to its prepare requests (numbered n) from a majority of acceptors, then it sends an accept request to each of those acceptors for a proposal numbered n with a value v, where v is the value of the highest-numbered proposal among the responses, or is any value if the responses reported no proposals.
  • (b) If an acceptor receives an accept request for a proposal numbered n, it accepts the proposal unless it has already responded to a prepare request having a number greater than n.

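The two phases can be traced with a small single-process simulation. This is a sketch under strong simplifying assumptions: requests are direct method calls, so there is no message loss, no crash, and no concurrency; the `reachable` argument stands in for whichever acceptors a proposer manages to contact; and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Acceptor:
    promised: int = -1                           # highest prepare number responded to
    accepted: Optional[Tuple[int, str]] = None   # highest-numbered (number, value) accepted

    def on_prepare(self, n: int):
        """Phase 1(b): promise to ignore lower numbers and report any accepted proposal."""
        if n > self.promised:
            self.promised = n
            return True, self.accepted
        return False, None

    def on_accept(self, n: int, value: str) -> bool:
        """Phase 2(b): accept unless a higher-numbered prepare was already answered."""
        if n >= self.promised:
            # Raising the promise here is an implementation choice of this sketch;
            # it keeps `accepted` the highest-numbered proposal.
            self.promised = n
            self.accepted = (n, value)
            return True
        return False


def propose(reachable: List[Acceptor], total: int, n: int, value: str) -> Optional[str]:
    """Run both phases for proposal n; return the chosen value, or None if abandoned."""
    majority = total // 2 + 1
    # Phase 1(a): send prepare(n) and collect the promises.
    promises = []
    for acceptor in reachable:
        ok, previous = acceptor.on_prepare(n)
        if ok:
            promises.append((acceptor, previous))
    if len(promises) < majority:
        return None
    # Phase 2(a): adopt the value of the highest-numbered reported proposal, if any.
    reported = [previous for _, previous in promises if previous is not None]
    if reported:
        value = max(reported, key=lambda proposal: proposal[0])[1]
    accepts = sum(acceptor.on_accept(n, value) for acceptor, _ in promises)
    return value if accepts >= majority else None


acceptors = [Acceptor() for _ in range(3)]
first = propose(acceptors[:2], total=3, n=1, value="A")   # reaches acceptors 0 and 1
second = propose(acceptors[1:], total=3, n=2, value="B")  # reaches acceptors 1 and 2
stale = propose(acceptors, total=3, n=1, value="C")       # reuses an old number
print(first, second, stale)
```

In this run the second proposer wants "B" but learns from the overlapping acceptor that "A" was already accepted, so it re-proposes "A"; the stale proposal is rejected by every acceptor because each has already answered a prepare with an equal or higher number.
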