Design of Self-Managing Dependable Systems with UML and Fault Tolerance Patterns

Design of Self-Managing Dependable Systems with UML and Fault Tolerance Patterns Matthias Tichy, Daniela Schilling, Holger Giese

Application Example - RailCab

Application Example - RailCab • Vision: Combine • individual transport • cost-effectiveness of public transport • Autonomously operating shuttles using linear drive (maglev train) • Build contact-free convoys for energy savings • Information about the exact position necessary • Computation based on the stator waves

Example service structure: Problem: position calculation service must be highlyreliable Motivation :StatorWaveSensor :PositionCalculation :ConvoyController

Problem: position calculation service must be highlyreliable Solution: self-healing, fault-tolerant software Application of software fault tolerance patterns, which capture: Abstract service structure of fault tolerance techniques Deployment restrictions (explicitly taking fault tolerance into account) Automatic deployment Initial deployment Self-healing by deployment reconfiguration during runtime Contents

Fault Tolerance Patterns • Capture the abstract structure of a well known fault tolerance technique • Example: Triple Modular Redundancy (TMR) Triple Modular Redundancy :Service1 :Provider :Multiplier :Service2 :Voter :User :Service3

Fault Tolerance Patterns - Deployment Arthur • Naïve, unrestricted deployment may yield unwanted results • Deployment must respect restrictions imposed by the fault tolerance technique (redundancy, heterogeneity) • Instead of manual deployment, enrich fault tolerance patterns by deployment restrictions <<deploy>> <<deploy>> <<deploy>> <<Service3>> :PositionCalculation <<Service1>> :PositionCalculation <<Service2>> :PositionCalculation

Deployment restrictions for the TMR pattern: Deployment restrictions Avoid single-point-of-failure of voter / multiplier -> Deploy voter and user to same node (if the user fails, the failure of the voter is no problem) Node1: Node2: :Provider :Multiplier :Voter :User Avoid crash failures -> Deploy redundant services to distinct nodes :Service1 :Service2 :Service3 Node3: Node4: Node5: Heterogeneous hardware platform -> require different CPU { Node3.CPU  Node4.CPU  Node4.CPU  Node5.CPU  Node3.CPU  Node 5.CPU }

Fault Tolerance Patterns :StatorWaveSensor :PositionCalculation :ConvoyController Application of TMR pattern <<Service1>> pc1:Position Calculation <<Provider>> sen: StatorWaveSensor <<Multiplier>> mult: Multiplier <<Service2>> pc2:Position Calculation <<Voter>> vot:Voter <<User>> cc:Convoy Controller <<Service3>> pc3:Position Calculation

Use a standard constraint solver <<Service1>> pc1:Position Calculation <<Provider>> sen:Sensor <<Multiplier>> mult: Multiplier <<Service2>> pc2:Position Calculation <<Voter>> vot:Voter <<User>> cc:Convoy <<Service3>> pc3:Position Calculation Avalon Uther Gorlois Taliesin Gareth Arthur Compute a correct / reliable deployment Deployment

Mapping to a standard constraint solver Compute a correct / reliable deployment Services (i) 1: Service pc1 is deployed to node Gorlois Nodes (j) • Variables xi,jє{0,1} • Constraint: (each service is deployed to exactly one node)  i  Services:∑xi,j = 1 j

Restriction: Services must be executed on same node Graphically: Constraint: Compute a correct / reliable deployment <<deploys>> <<deploys>> <<Voter>> vot:Voter <<User>> cc: ConvoyController jNodes: (xvot,j+xcc,j) =2 or (xvot,j+xcc,j) =0

Initial deployment: Graphically: Compute a correct / reliable deployment pc1 pc2 pc3 Avalon Gareth Taliesin sen mul vot cc Gorlois Uther Arthur

Due to TMR, one crashed redundant service is tolerable Second crash is not tolerable Therefore: Self-heal by restarting the crashed service on a working node. But on which node? Compute it timely! Repair Gorlois <<Service1>> pc1:PositionCalculation <<deploys>> Uther <<Service2>> pc2:PositionCalculation <<deploys>> Arthur <<Service3>> pc3:PositionCalculation <<deploys>>

Solve restricted deployment problem Repair Only consider changed values Gorlois <<deploys>> Crash Failure <<Service1>> pc1:PositionCalculation • Fix variables xi,jof unaffected service/node combinations • Free variables xi,jfor affected service/node combinations Service migration should be avoided!

No solution found -> relax the problem First round  Service migration necessary Second round  Third round  Repair • Fixed variables • Free variables

Current work Tool Support

Fault tolerance patterns capture fault tolerance techniques Contain: Structure + Deployment restrictions Graphical specification Easy application Automatic deployment Initial deployment Self-healing by deployment reconfiguration during runtime Future Work Heterogeneous service restrictions in pattern application Synthesize behavior of voter and multiplier services Use fault tolerance application knowledge for automatic fault tree analysis Conclusions / Future Work .de www.

Questions? .de www.

Restriction: Services must be deployed on distinct nodes Graphically: Constraint: Compute a correct / reliable deployment <<Service1>> pc1: PositionCalculation <<deploys>> <<Service2>> pc2: PositionCalculation <<deploys>>  j  Nodes: (xpc1,j+xpc2,j) <=1

Restriction: CPU of nodes for redundant service must be distinct Constraint: Compute a correct / reliable deployment xs1,j2 xs2,j2  xs3,j3  j1.cpu  j2.cpu  j3.cpu

Multiple applications of TMR Transform to a multiple stage arrangement with tripled voter and multiplier Deployment restrictions are transformed too Multistage <<Multiplier>> :PCMultiplier <<Voter>> :Voter <<User>> <<Service1>> cc:Convoy <<Service1>> <<Provider>> pc1:Position Calculation <<Multiplier>> :PCMultiplier <<Voter>> :Voter <<Provider>> sen:Sensor <<Multiplier>> mult: Multiplier <<Service2>> <<Provider>> pc2:Position Calculation <<User>> <<Service2>> cc:Convoy <<Service3>> <<Provider>> pc3:Position Calculation <<Multiplier>> :PCMultiplier <<Voter>> :Voter <<User>> <<Service3>> cc:Convoy

Already working: Graphical specification of fault tolerance patterns and deployment restrictions Automatic mapping to ILOG solver software Next: Runtime support Tool Support

System without fault-tolerance, node crash (1) fault tolerance pattern Introduce redundancy (TMR) but deploy two redundant service on one node, crash of that node Better deploy redundant services to distinct nodes (2) deployment restrictions for fault tolerance pattern Example with distinct nodes, crash failure, everything ok (3) which nodes Now self-heal the system by restarting a crashed failure on a new node (4) which node Motivation

Goals: Easy application of fault tolerance techniques Deployment issues taken into account Easy deployment specification Reusable deployment specification Motivation

Design of Self-Managing Dependable Systems with UML and Fault Tolerance Patterns

Design of Self-Managing Dependable Systems with UML and Fault Tolerance Patterns

Presentation Transcript

Fault Tolerance in Distributed Systems

Fault Tolerance

Fault Tolerance in Distributed Systems

Fault Tolerance in Distributed Systems

Developing Dependable Systems by Maximizing Component Diversity and Fault Tolerance

Fault Tolerance in Embedded Systems

Fault Tolerance

Fault tolerance made easy Patterns for fault tolerant software design applied

Fault tolerance

Fault Tolerance

Fault Tolerance

Dependable Intrusion Tolerance

Impact: Fault Tolerance and High Confidence Embedded Systems Design

Fault Tolerance in Distributed Systems

Fault Tolerance

Fault Tolerance

Fault Tolerance

Fault Tolerance

Impact: Fault Tolerance and High Confidence Embedded Systems Design

Fault Tolerance in Distributed Systems

Fault Tolerance in Distributed Systems