ABCSG Dependable Systems ABCSG - Dependable Systems - 01/06/2006
Agenda
• Dependable Computing
  • Basic concepts
  • Definitions
  • Attributes
  • Threats
  • Means to attain dependability
    • Fault prevention
    • Fault removal
    • Fault forecasting
    • Fault tolerance -> Branch into techniques -> Branch into Coordinated Atomic Actions
Dependable Computing - Definition
• Ability to deliver service that can justifiably be trusted
or
• Ability of a system to avoid service failures that are more frequent or more severe than is acceptable
Dependable Computing - Attributes
Dependable Computing - Threats
• Anything that can influence the system so that it no longer meets the definition of dependable
• Development phase
  • Physical world
  • Human developers
  • Development tools
  • Production and test facilities
• Use phase
  • Physical world
  • Administrators
  • Users of services
  • Providers of services
  • Infrastructure
  • Intruders
Means - Fault prevention
• A failure is the result of an error
• An error is the result of a fault
=> Preventing faults prevents failures
• Basically we all know how (right?)
  • Information hiding
  • Modularization
  • Strongly typed languages
  • ...
Means - Fault removal
• During development (this is also where fault tolerance is tested, by fault injection)
• During use
  • Corrective maintenance
  • Preventive maintenance
Means - Fault forecasting
• Evaluating the system behavior with respect to fault occurrence or activation
• Qualitative evaluation
  • Identify the failure modes or the event combinations that would lead to system failure
• Quantitative evaluation
  • Identify, in terms of probabilities, the extent to which some of the attributes of dependability are satisfied
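As a rough illustration of quantitative evaluation, the sketch below computes the probability that a redundant system still delivers service when at least k of its n replicas work. The independence assumption, the function name `k_out_of_n_reliability`, and the 0.9 figure are illustrative assumptions, not taken from the slides.

```python
from math import comb


def k_out_of_n_reliability(n: int, k: int, p: float) -> float:
    """Probability that at least k of n independent replicas work,
    when each replica works with probability p (assumed model)."""
    if not 0 <= k <= n:
        raise ValueError("need 0 <= k <= n")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    # Sum the binomial probabilities of exactly i working replicas.
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical numbers: a single replica versus a 2-out-of-3 arrangement.
single = k_out_of_n_reliability(1, 1, 0.9)
two_of_three = k_out_of_n_reliability(3, 2, 0.9)
print(f"single: {single:.3f}, 2-out-of-3: {two_of_three:.3f}")
```

Under these assumptions the redundant arrangement is more likely to work than a single replica, but only because the replicas are assumed to fail independently.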
Means - Fault tolerance
• Fault prevention includes human activities and is thus imperfect => We need fault removal
• Fault removal includes human activities and is thus imperfect => We need fault forecasting
• Fault forecasting includes human activities and is thus imperfect => We need fault tolerance
• Fault tolerance includes human activities and is thus imperfect => Systems will fail
... but a combination of all the aforementioned techniques gives the best chance of attaining dependable computing
... so let's have a look at fault tolerance
Fault tolerance
• Recall that fault tolerance is one of the means to attain dependable systems
• Terminology and key concept
  • Fault -> Error -> Failure
  • Failure semantics
  • Redundancy
• Techniques
  • Sequential
  • Independent concurrent systems
  • Competitive concurrent systems
  • Cooperative concurrent systems
  • Hybrid systems
Fault tolerance - Terminology and key concept
• A failure is the observation of an erroneous system state
• An error is an erroneous system state, which might lead to a failure
• A fault is a system defect, which might lead to an error
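The three terms can be told apart in a tiny sketch. The `EventCounter` class and its deliberate defect are hypothetical and only serve to show when a fault is dormant, when it becomes an error, and when the error is observed as a failure.

```python
class EventCounter:
    """Counts events; report() is the service the user observes."""

    def __init__(self) -> None:
        self.count = 0

    def record(self, n: int) -> None:
        # FAULT (deliberate defect): negative batches are accepted,
        # although a count of events should never go below zero.
        self.count += n

    def report(self) -> str:
        return f"{self.count} events"


counter = EventCounter()

counter.record(3)                       # fault is present but dormant
state_ok_before = counter.count >= 0    # state is still correct

counter.record(-5)                      # fault is activated
state_ok_after = counter.count >= 0     # count is -2: an ERROR (erroneous state)

observed = counter.report()             # the error reaches the user: a FAILURE
print(state_ok_before, state_ok_after, observed)
```

Between the second `record` call and the `report` call the error exists but has not yet been observed, so nothing has failed from the user's point of view.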
Fault tolerance - Terminology and key concept
English
• A failure is a consequence of an error that is the consequence of a fault
• Fault => Error => Failure
Danish
• In Danish all three terms translate to the same word, "fejl": a fejl is the consequence of a fejl that is the consequence of a fejl
• Fejl => Fejl => Fejl (think about that one for a moment)
Fault tolerance - Terminology and key concept
• There is a window of opportunity between an error and a failure: an error can be detected and handled before it is observed as a failure
• Redundancy is the key concept
Fault tolerance - Sequential systems
• Recovery blocks - redundant algorithms
• Retry blocks - redundant data
An acceptance test examines the system state to verify that the behavior is acceptable
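Both techniques can be sketched in a few lines. This is a minimal sketch that assumes the alternates are pure functions, so no state has to be restored between attempts. All names (`recovery_block`, `retry_block`, `dedup_sort`, `fragile_max`, and the acceptance tests) are hypothetical, and the two faulty routines are deliberately broken for the demonstration.

```python
class AllAlternatesFailed(Exception):
    """No attempt produced a result that passed the acceptance test."""


def recovery_block(x, alternates, acceptance_test):
    """Redundant algorithms: try each alternate on the same input."""
    for alternate in alternates:
        try:
            result = alternate(x)
        except Exception:
            continue  # a crash counts as a failed alternate
        if acceptance_test(x, result):
            return result
    raise AllAlternatesFailed("recovery block exhausted")


def retry_block(x, algorithm, re_expressions, acceptance_test):
    """Redundant data: rerun one algorithm on re-expressed input."""
    for re_express in re_expressions:
        try:
            result = algorithm(re_express(x))
        except Exception:
            continue
        if acceptance_test(x, result):
            return result
    raise AllAlternatesFailed("retry block exhausted")


def dedup_sort(xs):
    # Deliberately faulty primary: drops duplicates.
    return sorted(set(xs))


def ordered_same_length(xs, ys):
    # Cheap acceptance test: right length and ascending order.
    return len(xs) == len(ys) and all(a <= b for a, b in zip(ys, ys[1:]))


def fragile_max(xs):
    # Deliberately faulty algorithm: ignores the first element.
    return max(xs[1:])


def is_maximum(xs, m):
    return m in xs and all(m >= v for v in xs)


sorted_result = recovery_block([3, 1, 3], [dedup_sort, sorted], ordered_same_length)
max_result = retry_block(
    [9, 2, 5],
    fragile_max,
    [lambda xs: xs, lambda xs: xs[::-1]],  # identity, then reversed input
    is_maximum,
)
print(sorted_result, max_result)
```

In the first call the primary fails the acceptance test and the second algorithm takes over. In the second call the same algorithm succeeds once the input is re-expressed in reverse order.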
Fault tolerance - Independent concurrent systems
• N-Version programming - the parallel version of recovery blocks
• N-Copy programming - the parallel version of retry blocks
The decision mechanism must decide whether one of the results can be considered correct ... and this is not an easy task!
- Multiple correct results, floating point precision ...
- Exact majority voter, mean voter, consensus voter, etc.
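The sketch below shows why the decision mechanism is hard. It runs the versions in parallel, applies an exact majority voter, and then a tolerance-based majority voter for floating point results. The tolerance voter, the function names, and the example versions are assumptions made for illustration and are not taken from the slides.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


class NoMajority(Exception):
    """The decision mechanism could not select a result."""


def run_versions(versions, x):
    """Run every version on the same input; crashed versions give no result."""
    def attempt(version):
        try:
            return True, version(x)
        except Exception:
            return False, None

    with ThreadPoolExecutor(max_workers=len(versions)) as pool:
        outcomes = list(pool.map(attempt, versions))
    return [value for ok, value in outcomes if ok]


def exact_majority_vote(results, n_versions):
    """Accept a value only if more than half of all versions produced it."""
    if results:
        value, votes = Counter(results).most_common(1)[0]
        if 2 * votes > n_versions:
            return value
    raise NoMajority("no exact majority")


def tolerance_majority_vote(results, n_versions, tol):
    """Accept the mean of a majority of results that agree within tol."""
    for candidate in results:
        agreeing = [r for r in results if abs(r - candidate) <= tol]
        if 2 * len(agreeing) > n_versions:
            return sum(agreeing) / len(agreeing)
    raise NoMajority("no majority within tolerance")


# Three versions of "square"; the third is deliberately faulty.
square_versions = [lambda x: x * x, lambda x: x ** 2, lambda x: x + x]
squared = exact_majority_vote(run_versions(square_versions, 3), 3)

# Two correct float results that differ in the last bit, plus one wrong result.
float_results = [0.1 + 0.2, 0.3, 0.5]
try:
    exact_majority_vote(float_results, 3)
    exact_agreed = True
except NoMajority:
    exact_agreed = False
approx = tolerance_majority_vote(float_results, 3, tol=1e-9)
print(squared, exact_agreed, approx)
```

The exact voter rejects the float results even though two of them are correct, which is the floating point precision problem named on the slide.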
Fault tolerance - Competitive concurrent systems
• Two or more processes are not aware of each other, but share some resources
• Each process wants to live in its own environment, and a fault in one process should not affect the other processes
• Transactions
  • Atomicity / Consistency / Isolation / Durability
  • Provide backward error recovery
  • Together with exception handling, transactions can be used to provide forward error recovery
  • In self-checking transactional objects, methods are decorated with a precondition and a postcondition
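A self-checking transactional object can be approximated with a decorator. This is a minimal single-threaded sketch: it shows atomicity and backward error recovery only, not isolation or durability. The `transactional` decorator, `ConditionViolated`, and the `Account` class are hypothetical names.

```python
import copy
from functools import wraps


class ConditionViolated(Exception):
    """A precondition or postcondition did not hold."""


def transactional(pre, post):
    """Check pre before the call and post after it; roll back on any error."""
    def decorate(method):
        @wraps(method)
        def wrapper(self, *args, **kwargs):
            if not pre(self, *args, **kwargs):
                raise ConditionViolated(f"precondition of {method.__name__}")
            checkpoint = copy.deepcopy(self.__dict__)  # state to roll back to
            try:
                result = method(self, *args, **kwargs)
                if not post(self, *args, **kwargs):
                    raise ConditionViolated(f"postcondition of {method.__name__}")
                return result
            except Exception:
                # Backward error recovery: restore the checkpointed state.
                self.__dict__.clear()
                self.__dict__.update(checkpoint)
                raise
        return wrapper
    return decorate


class Account:
    def __init__(self, balance):
        self.balance = balance

    @transactional(
        pre=lambda self, amount: amount > 0,
        post=lambda self, amount: self.balance >= 0,
    )
    def withdraw(self, amount):
        self.balance -= amount


account = Account(100)
account.withdraw(30)          # commits: balance becomes 70
try:
    account.withdraw(100)     # postcondition fails: balance is restored
    rolled_back = False
except ConditionViolated:
    rolled_back = True
print(account.balance, rolled_back)
```

A caller that catches `ConditionViolated` and then takes a corrective step, instead of only giving up, would be using the exception for forward error recovery on top of the rollback.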
Fault tolerance - Cooperative concurrent systems
• Several processes cooperate in executing a common job, and they are aware of each other
• Conversation
  • Works like a transaction involving several processes
  • It is an isolated environment for the participating processes: they are not allowed to communicate outside the conversation (no information smuggling)
  • Ultimately everybody commits, or everybody rolls back to the state from the beginning of the conversation - backward error recovery
• Atomic actions
  • An atomic action is a conversation, but with the ability to do forward error recovery
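The all-or-nothing exit of a conversation can be sketched as follows. For brevity this sketch runs the participants sequentially in one thread, whereas a real conversation involves concurrent processes. `Participant`, `run_conversation`, and the stock example are hypothetical.

```python
import copy


class Participant:
    def __init__(self, name, state):
        self.name = name
        self.state = state


def run_conversation(participants, joint_work, acceptance_test):
    """Return True if everybody commits, False if everybody rolled back."""
    # Entering the conversation: every participant saves a checkpoint.
    checkpoints = [copy.deepcopy(p.state) for p in participants]
    try:
        joint_work(participants)  # cooperation happens only in here
        accepted = all(acceptance_test(p) for p in participants)
    except Exception:
        accepted = False
    if not accepted:
        # Backward error recovery: all participants roll back together,
        # including those whose own acceptance test passed.
        for participant, saved in zip(participants, checkpoints):
            participant.state = saved
    return accepted


def move(units):
    """Joint job: the producer hands `units` items to the consumer."""
    def work(participants):
        producer, consumer = participants
        producer.state["stock"] -= units
        consumer.state["received"] += units
    return work


def non_negative(participant):
    return all(value >= 0 for value in participant.state.values())


producer = Participant("producer", {"stock": 5})
consumer = Participant("consumer", {"received": 0})

first = run_conversation([producer, consumer], move(3), non_negative)
second = run_conversation([producer, consumer], move(4), non_negative)
print(first, second, producer.state, consumer.state)
```

In the second conversation only the producer's acceptance test fails, yet the consumer is rolled back as well, so no partial result of the joint job survives.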
Fault tolerance - Hybrid systems
• Models that support both competitive and cooperative concurrency
• Coordinated atomic actions
  • An atomic action, but with the possibility for the participants to access external objects
  • Atomic actions control cooperative concurrency and coordinated error recovery
  • Transactions control competitive concurrency and maintain the consistency of the shared resources in case of failures
Coordinated Atomic Actions ... must wait for another day, I think time is up!