Dependability Considerations in Distributed Control Systems

Presentation Transcript


  1. Dependability Considerations in Distributed Control Systems
  Klemen Žagar, Cosylab
  ICALEPCS 2005, Geneva, Switzerland

  2. Dependability
  • A dependable system is one which its users may trust.
  • Examples of dependable distributed systems:
    • The Internet
    • Power distribution grid
    • Water supply
  • Dependability is a very general term. Among others, it covers:
    • Availability: it is there when needed.
    • Reliability: it can work autonomously for a long period of time.
    • Maintainability: it is easily fixed when broken.
    • Safety: it will not harm other equipment or personnel.
    • Security: unauthorized, possibly malicious, users cannot gain control.

  3. Motivation
  • Nodes of a distributed system are like dominoes:
    • The domino effect: one falls, all may go down.
    • This may happen often, and recovery takes a long time.
  • Thus, fault tolerance is important:
    • Improved mean time to failure of the system as a whole
    • Lower mean time to repair
    • Improved availability
    • Reduced maintenance effort
  • How can fault tolerance be achieved in distributed control systems?

  4. Research Objectives
  • Dependable Distributed Systems (DeDiSys), a research project funded by the European Union.
  • What are the most frequent causes of faults in distributed control systems?
  • What mitigation mechanisms are available?
  • How can availability be improved by trading it against constraint consistency?
  • What is constraint consistency in control systems?

  5. Reliability
  • Reliability, $R(t)$, is the probability that a system will perform as specified for a given period of time $t$.
  • Typically exponential: $R(t) = e^{-\lambda t}$, where $\lambda$ is the constant failure rate.
  • An alternative measure is the mean time to failure (MTTF/MTBF): $\mathrm{MTTF} = \int_0^{\infty} R(t)\,dt = 1/\lambda$.
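A minimal numerical sketch of these two formulas (Python; the failure rate is an assumed illustrative value, not from the talk):

```python
import math

FAILURE_RATE = 1e-4  # lambda, failures per hour (illustrative assumption)

def reliability(t_hours: float) -> float:
    """R(t) = e^(-lambda * t): probability of surviving t hours."""
    return math.exp(-FAILURE_RATE * t_hours)

mttf = 1.0 / FAILURE_RATE                      # MTTF = 1/lambda = 10,000 h
print(f"R(1000 h) = {reliability(1000):.3f}")  # ~0.905
print(f"MTTF = {mttf:.0f} h")
```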

  6. Reliability of Composed Systems
  • Weakest link: the reliability of a serially coupled composed system is lower than the reliability of its least reliable constituent:
    $R_{\mathrm{series}}(t) = \prod_i R_i(t) \le \min_i R_i(t)$
  • Redundancy: the reliability of a redundant (parallel) subsystem is greater than the reliability of its most reliable constituent:
    $R_{\mathrm{parallel}}(t) = 1 - \prod_i \left(1 - R_i(t)\right) \ge \max_i R_i(t)$
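The same composition rules in a short Python sketch (the example reliabilities are made up):

```python
from math import prod

def series_reliability(parts: list[float]) -> float:
    """Weakest link: every part must work, so the product drags the
    result below even the least reliable constituent."""
    return prod(parts)

def parallel_reliability(replicas: list[float]) -> float:
    """Redundancy: the subsystem fails only if all replicas fail."""
    return 1.0 - prod(1.0 - r for r in replicas)

print(series_reliability([0.99, 0.95, 0.90]))  # ~0.846 (below 0.90)
print(parallel_reliability([0.90, 0.90]))      # 0.99  (above 0.90)
```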

  7. Maintainability and Availability
  • Maintainability: how long it takes to repair a system after a failure.
    • The measure is the mean time to repair (MTTR).
  • Availability: the percentage of time the system is actually available during periods when it should be available.
    • Directly experienced by users!
    • Expressed in percent; in marketing, also as a number of nines (e.g., 99.999% availability → unavailable for about 5 minutes per year).
  • Example: a gas station (working hours 6 AM to 10 PM – 16 hours)
    • Ran out of gas at 10 AM (2 h)
    • Pump malfunction at 2 PM (2 h)
    • Availability: 12 h / 16 h = 75%
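The slide's numbers can be checked against the standard steady-state relation A = MTTF / (MTTF + MTTR); the formula itself is not on the slide, so this is a sketch:

```python
def availability(mttf: float, mttr: float) -> float:
    """Steady-state availability: fraction of required time the system is up."""
    return mttf / (mttf + mttr)

# Gas station from the slide: 12 of 16 working hours available.
print(12 / 16)                          # 0.75 -> 75%
# Five nines: yearly downtime in minutes.
print((1 - 0.99999) * 365 * 24 * 60)    # ~5.26 min/year
```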

  8. Research Methodology
  • Research in the context of the DeDiSys project.
  • Collection of requirements from:
    • DeDiSys project's interest group members
    • Cosylab's customers (e.g., ANKA, SLS, ...)
  • Identification of scenarios:
    • ALMA Common Software (ACS)
    • EPICS
    • Geographical Information Systems
  • Definition of the architecture for a fault-tolerant naming service (FTNS).

  9. Faults in Distributed Systems
  • Consequences:
    • Affected services are lost.
    • Dependent systems malfunction.
    • The user interface doesn't show the actual status.
  • Node failures:
    • A host crashes or a process dies.
    • Volatile state is lost.
  • Link failures:
    • A network link is broken.
    • Results in two or more partitions.
    • Difficult to distinguish from a host crash.
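A minimal sketch of why the last point holds: from one node's perspective, all a failure detector observes is a timeout, and a crashed host and a broken link produce the same timeout (Python; host and port are placeholders):

```python
import socket

def peer_looks_alive(host: str, port: int, timeout_s: float = 1.0) -> bool:
    """Probe a peer with a TCP connect. On timeout we only know that
    *something* failed: the host may have crashed, or the link may be
    partitioned -- the two cases are indistinguishable from here."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False

if not peer_looks_alive("ioc-1.example.org", 5064):  # hypothetical peer
    print("peer unreachable: crash or partition?")
```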

  10. Improving Hardware MTTF
  • Reduce the number of mechanical parts:
    • Solid-state storage instead of hard disks
    • Passive cooling of power supplies and CPUs (no fans)
  • High-quality or redundant power supplies
  • Replication:
    • Network links
    • CPU boards
  • Remote reset (e.g., via power cycling)

  11. Improving Software MTTF
  • Ensure that overflows of variables that constantly increase (handle IDs, timers, counters, ...) are properly handled (see the sketch below).
  • Ensure all resources are properly released when no longer needed (memory leaks, ...):
    • Use a managed platform (Java, .NET)
    • Use auto-pointers (C++)
  • Avoid using heap storage on a per-transaction basis (it may cause memory fragmentation); e.g., use free-lists.
  • Restart processes in a controllable fashion (rejuvenation):
    • Isolate processes through inter-process communication
  • Recovery:
    • Recover state after a crash
    • Effective for host and process crashes
    • Automated repair
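One common way to handle the counter-overflow point above is serial-number arithmetic (in the spirit of RFC 1982), sketched here in Python; the talk does not prescribe a specific technique, so this is an assumption:

```python
BITS = 32
MASK = (1 << BITS) - 1
HALF = 1 << (BITS - 1)

def seq_newer(a: int, b: int) -> bool:
    """True if 32-bit sequence number a comes after b, even across
    wraparound. A naive a > b breaks once the counter overflows."""
    return 0 < ((a - b) & MASK) < HALF

print(seq_newer(5, 3))         # True
print(seq_newer(2, MASK - 1))  # True: 2 follows 4294967294 after the wrap
```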

  12. Decreasing MTTR
  • Foresee failures during design:
    • "The major difference between a thing that might go wrong and a thing that cannot possibly go wrong is that when a thing that cannot possibly go wrong goes wrong it usually turns out to be impossible to get at or repair." – Douglas Adams, Mostly Harmless
  • Provide good diagnostics:
    • Alarms
    • Detailed description of where and when an error occurred
    • Logs
    • State dump at failures:
      • ADC buffers after a beam dump
      • Status of synchronization primitives
      • Memory dump
  • Automated fail-over:
    • In combination with redundancy
    • The passive replica must have the up-to-date state of the primary copy
    • Fault detection (network ping, analog signal, ...)
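A minimal sketch of the fail-over idea (warm standby), assuming a primary that pushes every state change to the replica; the class and method names are illustrative, not from the talk:

```python
import time

class PassiveReplica:
    """The primary pushes each state change here, so the replica holds
    up-to-date state and can be promoted when the primary is suspected down."""

    def __init__(self) -> None:
        self.state: dict[str, float] = {}
        self.last_heartbeat = time.monotonic()

    def replicate(self, key: str, value: float) -> None:
        self.state[key] = value                 # mirror the primary's state
        self.last_heartbeat = time.monotonic()  # doubles as a heartbeat

    def primary_suspected_down(self, timeout_s: float = 3.0) -> bool:
        return time.monotonic() - self.last_heartbeat > timeout_s

replica = PassiveReplica()
replica.replicate("setpoint", 42.0)
if replica.primary_suspected_down():
    print("promote replica to primary")         # automated fail-over
```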

  13. Consistency/Availability Trade-Off
  • Application domains sit on a spectrum between consistency and availability:
    • Consistency end – finance: banking, access control, corporate databases.
    • Availability end – control systems: air-traffic control, fly-by-wire, drive-by-wire.

  14. Constraint Consistency in Control Systems
  • Constraints: rules that one or more objects must satisfy, for example:
    • If and only if serverChannel.monitors.contains(client), then client.isSubscribedTo(serverChannel)
    • serverChannel.value == clientChannel.value
    • server.getFromDatabase('x') == database.get('x')
    • If client.referencesComponent(component), then component.isReferencedBy(client)
  • Can some constraints be temporarily relaxed in the presence of faults? (See the sketch below.)
  • If so, how can the system be reconciled into a consistent state once the faults are removed?
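A minimal sketch of what "relax, then reconcile" could look like; the talk leaves the mechanism open, so everything here (the names, the violation log) is an assumption:

```python
class RelaxableConstraint:
    """While a fault is present, record violations instead of rejecting the
    operation; once the fault is removed, repair them (reconciliation)."""

    def __init__(self, check, repair):
        self.check, self.repair = check, repair
        self.violations = []

    def enforce(self, obj, fault_present: bool) -> bool:
        if self.check(obj):
            return True
        if fault_present:                # degraded mode: stay available,
            self.violations.append(obj)  # remember the inconsistency
            return True
        return False                     # normal mode: reject the operation

    def reconcile(self) -> None:
        for obj in self.violations:      # fault removed: restore consistency
            self.repair(obj)
        self.violations.clear()
```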

  15. Future Work
  • DeDiSys:
    • Design and implementation (due: January 2007)
    • Validation (due: June 2007)
  • Possible inclusion of research findings in control system infrastructures:
    • ACS (e.g., replication of the manager and components)
    • EPICS (e.g., V4 fault-tolerance efforts of the EPICS community)
  • Inclusion in products:
    • The microIOC platform
    • Servers for Geographical Information Systems
    • Other high-availability products (telecommunications, automotive)
  • Know-how for consulting and development services

  16. Conclusion
  • Distributed systems are inherently fragile.
  • Fault tolerance is difficult to program.
  • It should be addressed by the infrastructure/middleware, but frequently isn't.
  • Comments/questions/contributions: klemen.zagar@cosylab.com
