Membership and Clique Avoidance in TTP/C. Gunther Bauer, Michael Paulitsch Presented by Michael Sirivianos 02/01/2005. Overview. Membership in hard Real Time systems. What is it and why? Objectives TTP/C Overview Group membership. Clique Avoidance and Implicit Acks
Membership and Clique Avoidance in TTP/C Gunther Bauer, Michael Paulitsch Presented by Michael Sirivianos 02/01/2005
Overview • Membership in hard Real Time systems. What is it and why? • Objectives • TTP/C Overview • Group membership. • Clique Avoidance and Implicit Acks • Cluster Model-Fault Model • General Properties • Analysis • Conclusions
What is a RT Membership Service? • Safety-critical RT systems use a bus system for communication. • A class C system offers the required fault tolerance (FT). • A membership service gives timely and consistent information on the state of all nodes.
Why do we need it? • Membership service • Establishes replica-deterministic agreement on all messages. • Prevents clique formation and certain classes of arbitrary faults. • Allows global knowledge, and thus a consistent and timely reaction to faults. • Membership is a critical function for the correct operation of the communication system. It should be placed below the application layer, within the TTP layer.
TTP/C Overview • Services: • Message transport at specific time instances, with minimal jitter. • Fault-tolerant clock synchronization. • Fault-tolerant membership management. • TDMA media access • Time slots are not necessarily of equal size. • The MEssage Descriptor List (MEDL) contains the TDMA schedule and groups several TDMA rounds into a cluster cycle. It is statically assigned to all nodes.
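To make the TDMA schedule with unequal slot sizes concrete, here is a minimal sketch of one round. The `Slot` type, the node names, and the microsecond durations are all hypothetical; a real MEDL holds much more than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slot:
    sender: str       # node that owns the slot (hypothetical names)
    duration_us: int  # slot length; slots need not be equal


def slot_start_times(round_slots):
    """Return ([(sender, start offset)], round length) for one TDMA round."""
    starts = []
    offset = 0
    for slot in round_slots:
        starts.append((slot.sender, offset))
        offset += slot.duration_us
    return starts, offset


# Hypothetical four-node round with unequal slot sizes.
ROUND = [Slot("A", 100), Slot("B", 250), Slot("C", 100), Slot("D", 150)]
starts, round_length = slot_start_times(ROUND)
```

Because the schedule is static, every node can compute the same offsets locally and knows who is expected to send in each slot without any arbitration on the bus.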
TTP/C Overview, cont. • State of the distributed system (C-state). It comprises: • Membership • The global time at which the last frame broadcast (B/C) started. • Number of the current TDMA slot • I (protocol state info) and X (protocol + app. data info) frames are transmitted periodically and carry the C-state. • N (app. data info) frames do not carry the C-state explicitly. Consistency of the C-state is determined by calculating the CRC over both the app. data and the C-state.
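The CRC check over app. data and C-state can be sketched as follows. This is only an illustration: `zlib.crc32` stands in for the protocol's own CRC, and the C-state layout used by `pack_c_state` (32-bit time, 16-bit slot number, 16-bit membership vector) is an assumption made for the example.

```python
import struct
import zlib


def pack_c_state(global_time, slot_number, membership):
    """Serialize a toy C-state: time, slot number, membership bit vector."""
    return struct.pack(">IHH", global_time, slot_number, membership)


def n_frame_crc(app_data: bytes, c_state: bytes) -> int:
    # CRC over the application data followed by the sender's C-state.
    return zlib.crc32(app_data + c_state)


def receiver_accepts(app_data, received_crc, local_c_state):
    # The receiver recomputes the CRC with its own C-state. A mismatch
    # means corrupted data or disagreement on the C-state.
    return n_frame_crc(app_data, local_c_state) == received_crc


sender_state = pack_c_state(1000, 3, 0b1111)
crc = n_frame_crc(b"speed=42", sender_state)
agrees = receiver_accepts(b"speed=42", crc, pack_c_state(1000, 3, 0b1111))
disagrees = receiver_accepts(b"speed=42", crc, pack_c_state(1000, 3, 0b0111))
```

Here `agrees` is true and `disagrees` is false: a receiver whose membership vector differs from the sender's rejects the frame even though the C-state itself was never put on the wire.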
TTP/C Overview, cont. • A node in the cluster, which is included in the schedule but has been inactive, can be integrated using global time and C-state info from the I/X frames.
TTP Protocol Stack (top to bottom) • Host Layer: application software in the host • FTU CNI • FTU Layer: FTU membership • Basic CNI • RM Layer: redundancy management • SRU Layer: SRU membership, clock synchronization • Data Link/Physical Layer: media access (TDMA)
TTP Protocol Stack (cont.) • Data Link/Physical Layer: provides the means to exchange frames between the nodes. • SRU Layer: stores the data fields of the received frames. • RM Layer: provides the mechanisms for the cold start of a TTP/C cluster. • FTU Layer: groups two or more nodes into FTUs. • Host Layer: provides the application software. • Basic CNI: a data-sharing interface between the RM layer and the FTU layer. • FTU CNI: the interface between the FTU layer and the Host Layer.
Timeline in TTP/C • TDMA Cycle • One FTU sends message twice • The pattern is repeated when TDMA round ends • Cluster Cycle • Cluster cycle involves scheduling all possible messages and tasks
TTP/C Frame Structure • N-Frame (figure)
Paper Objective • Investigate properties of the Clique Avoidance algorithm. Performance analysis and study of interaction with Implicit Acks mechanism. • Study ability to resolve and detect conflicts in membership views of nodes within a cluster. • Provide time bounds for detecting and removing faulty members. • For their analysis, they assume arbitrary failures with bounded frequency.
Initial TTP/C Fault Hypothesis: Nodes • Only one faulty node within the duration of a TDMA round. • A node may become faulty only after any previously faulty node has either shut down or operates correctly again. • A transmission fault is consistent (all nodes will consistently consider the respective frame faulty or correct). • A node does not send data, faulty or correct, outside its assigned sending slots. • A node never hides its identity when sending frames.
Initial TTP/C Fault Hypothesis: Network • Only one channel can be faulty during a TDMA slot. • A channel does not spontaneously create correct frames. • A channel will deliver a frame either within some known time bounds or never. • A Bus Guardian transforms node errors so that they comply with the hypothesis. • A Central Guardian is a more cost-effective solution and handles several arbitrary faults.
Cluster Model - Extended Fault Hypothesis • Besides the failure that causes a cluster partition, no other failure can occur within two TDMA rounds before or after it. Thus, initially there is a single clique to which all nodes are assigned. • A partition failure should leave both partitions with more than one member. It should affect both channels and be inconsistent, contrary to the initial hypothesis. • TTP/C can handle faults that violate the hypothesis, but in that case there is no guarantee that it selects the correct clique.
Group Membership Protocol • Clique Avoidance algorithm • Removes faulty nodes from the cluster. • Prevents several cliques from coexisting. • Implicit Acknowledgement • The sending node inspects the membership lists sent by the receiving nodes to determine whether its message was correctly received.
Cluster Model - Slots • n slots per TDMA round
Clique Avoidance • A reception is considered correct if the received C-state matches the local C-state and the data are not corrupted, i.e. the transmission time is correct and the memberships match after adding the sender. • After a successful reception, the sender is added to the receiver's membership list (ML). • After an incorrect reception, the sender is removed from the ML. • If the ML of the receiver differs only by the sender, the reception is still successful. • The Accept Counter (AC) is increased for every successful reception. • The Failed Counter (FC) is increased for every incorrect reception. • If FC >= AC, the node raises an Ack Error and shuts down (freezes). • FC and AC are reset to 0 in each TDMA round.
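The counter rules above can be sketched as a small class. The class and method names are hypothetical, and the sketch assumes the check runs once per TDMA round just before the counters are reset; the slide only states that the counters are reset each round. The node's own successful transmission, which also increases the accept counter, is left out for brevity.

```python
class CliqueAvoidance:
    """Toy model of one node's accept/failed counters."""

    def __init__(self, membership):
        self.ml = set(membership)  # local membership list (ML)
        self.accept = 0            # accept counter (AC)
        self.failed = 0            # failed counter (FC)
        self.frozen = False

    def on_reception(self, sender, correct):
        if correct:
            self.ml.add(sender)
            self.accept += 1
        else:
            self.ml.discard(sender)
            self.failed += 1

    def check(self):
        """Freeze if FC >= AC, then reset both counters for the next round."""
        if self.failed >= self.accept:
            self.frozen = True
        self.accept = 0
        self.failed = 0
        return not self.frozen


majority = CliqueAvoidance({"A", "B", "C", "D"})
for sender, ok in [("B", True), ("C", False), ("D", True)]:
    majority.on_reception(sender, ok)
majority_alive = majority.check()

minority = CliqueAvoidance({"A", "B", "C", "D"})
for sender, ok in [("B", False), ("C", False), ("D", True)]:
    minority.on_reception(sender, ok)
minority_alive = minority.check()
```

The first node agrees with most of what it heard and keeps running with C removed from its ML; the second disagrees with the majority of senders and freezes, which is how a minority clique dissolves.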
Clique formation under the extended fault hypothesis • Prior to the failure, there is consensus on membership. • A transient failure (an asymmetric send fault) occurs at slot 0, while node A is transmitting. • As a result, some nodes in the cluster receive A's transmission correctly and the rest do not. • Two cliques are formed: the nodes whose membership includes A, and the nodes whose membership does not include A.
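The split described above can be illustrated with a small simulation. The function names, the six-node cluster, and the choice of which nodes receive A's frame are all hypothetical.

```python
def views_after_asymmetric_fault(members, sender, receivers_ok):
    """Return each node's membership view after a faulty send by `sender`."""
    full = frozenset(members)
    views = {}
    for node in members:
        if node == sender or node in receivers_ok:
            views[node] = full             # view still contains the sender
        else:
            views[node] = full - {sender}  # sender was removed
    return views


def cliques(views):
    """Group nodes that hold identical membership views."""
    groups = {}
    for node, view in views.items():
        groups.setdefault(view, set()).add(node)
    return list(groups.values())


members = ["A", "B", "C", "D", "E", "F"]
views = views_after_asymmetric_fault(members, "A", receivers_ok={"B", "C"})
result = sorted(sorted(group) for group in cliques(views))
```

With B and C receiving the frame and D, E, F missing it, `result` contains the two cliques A, B, C and D, E, F, each with more than one member as the extended fault hypothesis requires.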
Implicit Acks - Successors • After a successful transmission, A increases AC. B checks the frame for correctness. • A waits for the expected message from B. • If the reception was successful, B adds A to its ML and transmits a non-corrupted message. • If the MLs are the same, or B's differs only by A, then A considers B its successor. • If the MLs are the same, then A is acknowledged (case 1). It increases its AC and adds B to its ML. • If B's ML differs by A, then B's reception was not successful and B removed A. A increases FC and removes B (case 2). • Otherwise, A removes B from its ML and increases FC, unless B did not transmit at all. A then goes back to step 1 (case 3).
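Cases 1 to 3 can be written as one decision function from sender A's point of view. This is a sketch, not the protocol's actual procedure: the function name and return shape are hypothetical, and it assumes that B's own entry is ignored when the two membership lists are compared.

```python
def first_successor_case(a, b, ml_a, ml_b, b_transmitted=True):
    """Return (case, new ML of A, AC increment, FC increment)."""
    ml_a = set(ml_a)
    if b_transmitted:
        own = ml_a - {b}         # assumption: ignore B's own entry
        theirs = set(ml_b) - {b}
        if theirs == own:
            return 1, ml_a | {b}, 1, 0  # case 1: A is acknowledged
        if theirs == own - {a}:
            return 2, ml_a - {b}, 0, 1  # case 2: B did not receive A's frame
    # Case 3: B is not a valid successor; FC grows only if B sent something.
    return 3, ml_a - {b}, 0, (1 if b_transmitted else 0)


ml_a = {"A", "B", "C", "D"}
acked = first_successor_case("A", "B", ml_a, {"A", "B", "C", "D"})
rejected = first_successor_case("A", "B", ml_a, {"B", "C", "D"})
other = first_successor_case("A", "B", ml_a, {"A", "B", "C"})
silent = first_successor_case("A", "B", ml_a, set(), b_transmitted=False)
```

Only case 2 leaves the question open: A cannot yet tell whether B or A itself was at fault, so the decision is deferred to the next sender.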
Implicit Acks - Successors, cont. • After case 2, A waits for the expected message from the subsequent node C. • If A finds a successor C whose ML contains A, then A is acknowledged. • B is assumed faulty, and both FC and ML were updated correctly. • A increases AC and adds C to its ML (case 4). • However, if C's ML does not include A, A considers itself erroneous. A removes itself from its local list, adds both B and C, and increases AC. It now has the same ML as B and C (case 5).
Implicit Acks - Defector • In case 5, A changes clique membership and becomes a defector. • Other nodes become aware of a defector only in its next sending round, through the transmitted ML. • If the defector becomes implicitly acknowledged, it is no longer a defector. If not, it freezes due to clique avoidance (CA).
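The second-successor decision (cases 4 and 5) can be sketched in the same style. The function name and return shape are hypothetical, and `ml_a` is assumed to be A's membership list after B has already been removed.

```python
def second_successor_case(a, b, c, ml_a, ml_c):
    """Return (case, new ML of A, AC increment) from A's point of view."""
    ml_a = set(ml_a)
    if a in ml_c:
        # Case 4: C saw A's frame, so B was the faulty one.
        return 4, ml_a | {c}, 1
    # Case 5: A concludes it was at fault and adopts B's and C's view.
    return 5, (ml_a - {a}) | {b, c}, 1


ml_a = {"A", "C", "D"}  # B already removed
case4 = second_successor_case("A", "B", "C", ml_a, {"A", "C", "D"})
case5 = second_successor_case("A", "B", "C", ml_a, {"B", "C", "D"})
```

In the case 5 result, A's list no longer contains A itself, which is exactly the state of a node that has switched cliques.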
Partition failure • Slot 4, preparation phase: FC > AC, node A4 freezes!
Partition failure • Slot 6, preparation phase: FC > AC, node A1 freezes!