Dependability Considerations in Distributed Control Systems
Klemen Žagar, Cosylab
ICALEPCS 2005, Geneva, Switzerland
Dependability • A dependable system is one which the users may trust. • Examples of dependable distributed systems: • The Internet • Power distribution grid • Water supply • Dependability is a very general term. Among others, it covers: • Availability: it is there when needed. • Reliability: it can work autonomously for a long period of time. • Maintainability: easily fixed when broken. • Safety: will not harm other equipment or personnel. • Security: unauthorized, possibly malicious, users cannot gain control.
Motivation • Nodes of a distributed system are like dominoes • The domino effect: one falls, and all may go down • This may happen often, and it takes a long time to rebuild • Thus, fault tolerance is important: • Improved mean time to failure of the system as a whole • Lower mean time to repair • Improved availability • Reduced maintenance effort • How can fault tolerance be achieved in distributed control systems?
Research Objectives • Dependable Distributed Systems (DeDiSys): a research project funded by the European Union. • What are the most frequent causes of faults in distributed control systems? • What mitigation mechanisms are available? • How can availability be improved by trading it against constraint consistency? • What is constraint consistency in control systems?
Reliability • Reliability, $R(t)$, is the probability that a system will perform as specified for a given period of time $t$. • Typically exponential: $R(t) = e^{-\lambda t}$, where $\lambda$ is the (constant) failure rate. • An alternative measure is the mean time to failure (MTTF; MTBF for repairable systems): $\mathrm{MTTF} = 1/\lambda$.
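A short worked example (the failure rate below is assumed for illustration, not taken from the talk):

```latex
% Exponential reliability model with an assumed failure rate lambda = 1e-4 / h
\[
R(t) = e^{-\lambda t}, \qquad
\mathrm{MTTF} = \int_0^\infty R(t)\,dt = \frac{1}{\lambda}
\]
% With lambda = 1e-4 per hour:
\[
\mathrm{MTTF} = 10^4\ \mathrm{h} \approx 1.14\ \text{years}, \qquad
R(8760\ \mathrm{h}) = e^{-0.876} \approx 0.42
\]
```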
Reliability of Composed Systems • Weakest link: the reliability of a serially coupled composed system is at most the reliability of its least reliable constituent: $R_{\mathrm{sys}}(t) = \prod_i R_i(t) \le \min_i R_i(t)$ • Redundancy: the reliability of a redundant (parallel) subsystem is at least the reliability of its most reliable constituent: $R_{\mathrm{sys}}(t) = 1 - \prod_i \bigl(1 - R_i(t)\bigr) \ge \max_i R_i(t)$
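For instance, with two components of reliability 0.9 each (illustrative values, not from the slides), a series coupling drops the system reliability to 0.81, while a redundant pair raises it to 0.99:

```latex
\[
R_{\text{series}} = 0.9 \times 0.9 = 0.81, \qquad
R_{\text{redundant}} = 1 - (1 - 0.9)^2 = 0.99
\]
```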
Maintainability and Availability • Maintainability: how long it takes to repair a system after a failure. • The measure is mean time to repair (MTTR) • Availability: percentage of time the system is actually available during periods when it should be available. • Directly experienced by users! • Expressed in percent. In marketing, also with a number of nines (e.g., 99.999% availability ≈ 5 min of unavailability per year). • Example: a gas station (working hours 6AM to 10PM – 16 hours) • Ran out of gas at 10AM (2h) • Pump malfunction at 2PM (2h) • Availability: 12h/16h = 75%
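These two measures combine into the standard steady-state availability formula (not shown on the slide, but standard reliability engineering; the numbers below are assumed for illustration):

```latex
\[
A = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}}
\]
% e.g., MTTF = 1000 h and MTTR = 1 h:
\[
A = \frac{1000}{1000 + 1} \approx 99.9\%
\]
```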
Research Methodology • Research in the context of the DeDiSys project • Collection of requirements from: • DeDiSys project's interest group members • Cosylab's customers (e.g., ANKA, SLS, ...) • Identification of scenarios: • ALMA Common Software (ACS) • EPICS • Geographical Information Systems • Definition of the architecture for a fault-tolerant naming service (FTNS)
Faults in Distributed Systems • Node failures: • A host crashes or a process dies • Volatile state is lost • Link failures: • A network link is broken • Results in two or more partitions • Difficult to distinguish from a host crash • Consequences: • Affected services are lost • Dependent systems malfunction • The user interface does not show the actual status
Improving Hardware MTTF • Reduce the number of mechanical parts: • Solid-state storage instead of hard disks • Passive cooling of power supplies and CPUs (no fans) • High-quality or redundant power supplies • Replication: • network links • CPU boards • Remote reset (e.g., via power cycling)
Improving Software MTTF • Ensure that overflows of variables that constantly increase (handle IDs, timers, counters, ...) are properly handled. • Ensure all resources are properly released when no longer needed (memory leaks, ...) • Use a managed platform (Java, .NET) • Use auto-pointers (C++) • Avoid using heap storage on a per-transaction basis (may result in memory fragmentation); e.g., use free-lists • Restart a process in a controllable fashion (rejuvenation) • Isolate processes through inter-process communication • Recovery: • Recover state after a crash • Effective for host and process crashes • Automated repair
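To make two of these points concrete, here is a minimal C++ sketch (illustrative, not from the talk; the class and names are hypothetical). It combines ownership-tracking pointers (std::unique_ptr, the modern form of the auto-pointers mentioned above) with a simple free-list, so per-transaction buffers are recycled rather than repeatedly heap-allocated:

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// A tiny fixed-size free-list: transaction buffers are recycled instead of
// being allocated and freed on the heap for every transaction, avoiding
// long-term heap fragmentation.
class BufferPool {
public:
    BufferPool(std::size_t count, std::size_t size) {
        for (std::size_t i = 0; i < count; ++i)
            free_.push_back(std::make_unique<std::vector<char>>(size));
    }

    std::unique_ptr<std::vector<char>> acquire() {
        if (free_.empty()) return nullptr;  // pool exhausted; caller must cope
        auto buf = std::move(free_.back());
        free_.pop_back();
        return buf;
    }

    void release(std::unique_ptr<std::vector<char>> buf) {
        free_.push_back(std::move(buf));    // recycle, never delete
    }

private:
    std::vector<std::unique_ptr<std::vector<char>>> free_;
};

int main() {
    BufferPool pool(16, 4096);
    // unique_ptr makes ownership explicit: the buffer is released exactly
    // once, even on early returns or exceptions.
    auto buf = pool.acquire();
    if (buf) pool.release(std::move(buf));
}
```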
Decreasing MTTR • Foresee failures during design • The major difference between a thing that might go wrong and a thing that cannot possibly go wrong is that when a thing that cannot possibly go wrong goes wrong it usually turns out to be impossible to get at or repair. – Douglas Adams: Mostly Harmless • Provide good diagnostics • Alarms • Detailed description of where and when an error occurred • Logs • State dump at failure • ADC buffers after a beam dump • Status of synchronization primitives • Memory dump • Automated fail-over • In combination with redundancy • The passive replica must have an up-to-date copy of the primary's state • Fault detection (network ping, analog signal, ...)
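As a sketch of the fault-detection side of automated fail-over, assuming a simple heartbeat protocol (class and method names are hypothetical, not from the talk): a timeout-based detector like this is necessarily heuristic, since a missed heartbeat cannot distinguish a crashed host from a broken link, as noted on the faults slide above.

```cpp
#include <chrono>
#include <iostream>

using Clock = std::chrono::steady_clock;

// Timeout-based fault detector: if no heartbeat arrives within the window,
// declare the primary suspect so the passive replica can be promoted.
class FailoverMonitor {
public:
    explicit FailoverMonitor(std::chrono::milliseconds timeout)
        : timeout_(timeout), last_(Clock::now()) {}

    void onHeartbeat() { last_ = Clock::now(); }

    // Called periodically; returns true when fail-over should be triggered.
    bool primarySuspect() const {
        return Clock::now() - last_ > timeout_;
    }

private:
    std::chrono::milliseconds timeout_;
    Clock::time_point last_;
};

int main() {
    FailoverMonitor monitor(std::chrono::milliseconds(500));
    if (monitor.primarySuspect()) {
        // Promote the replica only if its state is known to be up to date.
        std::cout << "failing over to passive replica\n";
    }
}
```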
Consistency/Availability Trade-Off • [Diagram: a spectrum of application domains between the two extremes] • Favoring consistency: finance, banking, access control, corporate databases • Favoring availability: control systems, air-traffic control, fly-by-wire, drive-by-wire
Constraint Consistency in Control Systems • Constraints: rules that one or more objects must satisfy, for example: • If and only if serverChannel.monitors.contains(client) then client.isSubscribedTo(serverChannel) • serverChannel.value == clientChannel.value • server.getFromDatabase('x') == database.get('x') • If client.referencesComponent(component) then component.isReferencedBy(client) • Can some constraints be temporarily relaxed in the presence of faults? • If so, how can the system be reconciled into a consistent state once the faults are removed?
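A minimal C++ sketch of the relaxation idea (this is not the DeDiSys design, just an illustration; all names are hypothetical): in degraded mode, violated constraints are recorded rather than rejected, and are re-checked during reconciliation once the fault is removed.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// A named predicate over one or more objects.
struct Constraint {
    std::string name;
    std::function<bool()> holds;
};

class ConstraintManager {
public:
    void add(Constraint c) { constraints_.push_back(std::move(c)); }
    void setDegraded(bool d) { degraded_ = d; }

    // In normal mode a violated constraint is an error; in degraded mode it
    // is merely recorded so the operation can still proceed (trading
    // consistency for availability).
    bool check() {
        bool ok = true;
        for (const auto& c : constraints_) {
            if (!c.holds()) {
                if (degraded_) pending_.push_back(c.name);
                else ok = false;
            }
        }
        return ok || degraded_;
    }

    // After the fault is removed: re-evaluate everything that was relaxed.
    std::vector<std::string> reconcile() {
        std::vector<std::string> stillViolated;
        for (const auto& name : pending_)
            for (const auto& c : constraints_)
                if (c.name == name && !c.holds())
                    stillViolated.push_back(name);
        pending_.clear();
        return stillViolated;
    }

private:
    std::vector<Constraint> constraints_;
    std::vector<std::string> pending_;
    bool degraded_ = false;
};

int main() {
    ConstraintManager mgr;
    int serverValue = 42, clientValue = 42;
    mgr.add({"values-equal", [&] { return serverValue == clientValue; }});

    mgr.setDegraded(true);      // a fault is present: relax constraints
    clientValue = 0;            // the update proceeds despite the violation
    mgr.check();

    mgr.setDegraded(false);     // fault removed: reconcile
    clientValue = serverValue;  // e.g., replay the missed update
    auto bad = mgr.reconcile(); // empty => consistent again
    return bad.empty() ? 0 : 1;
}
```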
Future Work • DeDiSys: • Design and implementation (due: January 2007) • Validation (due: June 2007) • Possible inclusion of research findings in control system infrastructures: • ACS (e.g., replication of the manager and components) • EPICS (e.g., V4 fault-tolerance efforts of the EPICS community) • Inclusion in products: • The microIOC platform • Servers for Geographical Information Systems • Other high-availability products (telecommunications, automotive) • Know-how for consulting and development services
Conclusion • Distributed systems are inherently fragile • Fault tolerance is difficult to program • It should be addressed by the infrastructure/middleware, but frequently isn't • Comments/questions/contributions: klemen.zagar@cosylab.com