
Split Brain Detection Version 00

Split Brain Detection, Version 00. Nigel Bragg, September 4th, 2012. Introduction from new-haddock-RNNI-split-brain-avoidance-1210-v1.pdf.


Presentation Transcript


  1. Split Brain Detection, Version 00. Nigel Bragg, September 4th, 2012

  2. Introduction (from new-haddock-RNNI-split-brain-avoidance-1210-v1.pdf)
     A “split-brain” situation arises when:
     1. In normal operation, two (or more) devices depend upon a control path to coordinate their operation such that they function as a single virtual entity with a single identity; and
     2. Upon failure of the common control path, the two (or more) devices operate independently, but
        a. each assumes the full functionality of the single virtual entity; and/or
        b. each continues to use the identity of the single virtual entity.
     • Split-brain issues are avoided if the solution is designed so that conditions 2a and 2b do not occur.
     • There are two general approaches to achieving this.

  3. Approach A: Easy Split-Brain Avoidance
     • Prevent condition 2b by:
       • Assuring that all devices, or all but one pre-determined device, always switch to a unique identity (different from the identity of the single virtual device) upon failure of the control path.
     • Prevent condition 2a by either:
       • Assuring one and only one device assumes the full functionality of the single virtual device upon failure of the control path; or
       • Assuring that each device deterministically assumes a subset of the functionality that does not overlap or conflict with the subset assumed by another device.
     • Link Aggregation, using the standard protocol without any changes running across the NNI, achieves this.
     • Characterized as “easy” because this approach does not require distinguishing whether a node failure or a link failure resulted in the loss of the control path.
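The two rules of Approach A can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the `keeper` parameter, and the round-robin split are hypothetical and do not come from the slides or from any standard.

```python
def identities_after_failure(device_ids, virtual_id, keeper=None):
    """Condition 2b: on control-path failure every device falls back to its
    own unique identity, except an optional pre-determined 'keeper' device
    (hypothetical name) that retains the virtual identity."""
    if keeper is not None and keeper not in device_ids:
        raise ValueError("keeper must be one of the devices")
    return {d: (virtual_id if d == keeper else d) for d in device_ids}


def partition_functions(device_ids, functions):
    """Condition 2a, second option: a deterministic, non-overlapping split.
    Both inputs are sorted so that each device can compute the same result
    locally, without talking to its peer. Round-robin is an assumption."""
    ordered = sorted(device_ids)
    shares = {d: [] for d in ordered}
    for i, f in enumerate(sorted(functions)):
        shares[ordered[i % len(ordered)]].append(f)
    return shares
```

Neither function needs to know whether the peer node or only the link failed, which is the property the slide calls “easy”.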

  4. Approach B: Hard Split-Brain Avoidance
     • Prevent condition 2b by:
       • Assuring that one and only one device continues to operate with the identity of the single virtual device upon failure of the control path.
       • Note that with hard split-brain avoidance there is always one device continuing to operate with the identity of the single virtual device, whereas with easy split-brain avoidance there may or may not be a device that continues to operate with the identity of the single virtual device.
     • Prevention of condition 2a:
       • The options for prevention of condition 2a are the same for both easy and hard split-brain avoidance. This is because once the identity issue is resolved, there are many possible ways to resolve the division of functionality.
     • Characterized as “hard” because this approach requires distinguishing whether a node failure or a link failure resulted in the loss of the control path.

  5. The reference model: Two Systems with Distributed Aggregation
     [Figure: System A and System B, each with ports and (possible) Network Links, connected by an Intra-Portal Link (could be virtual); a virtual Gateway Link from each of A and B to Emulated System C.]
     • Each Network Port on System A advertises: Actor_System = A; Actor_Key = Ax; a Port ID for each port, unique within A.
     • Each Network Port on System B advertises: Actor_System = B; Actor_Key = Bx; a Port ID for each port, unique within B.
     • Each (non-Gateway) Port on System C advertises: Actor_System = C; Actor_Key = Cn; a Port ID for each port, unique within C, where Cn is the same value on all of the ports,
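As a rough illustration of what the ports in the reference model advertise, the sketch below models one advertisement per port. The class and function names and the concrete Port ID values are hypothetical. The aggregation test is deliberately simplified to "same Actor_System and same Actor_Key", which is only part of what Link Aggregation actually checks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PortAdvert:
    actor_system: str  # "A", "B", or the emulated system "C"
    actor_key: str     # "Ax", "Bx", or "Cn"
    port_id: int       # unique within the advertising system


def advertise(system, key, port_ids):
    """Build one advertisement per port; Port IDs must be unique per system."""
    if len(set(port_ids)) != len(port_ids):
        raise ValueError("Port IDs must be unique within a system")
    return [PortAdvert(system, key, p) for p in port_ids]


def can_aggregate(a, b):
    """Simplified check: a partner may bundle two ports only if they claim
    the same system identity and the same key."""
    return (a.actor_system, a.actor_key) == (b.actor_system, b.actor_key)
```

Because every port of C carries the same key Cn, a partner sees them as candidates for one aggregate, even though the ports physically sit on A and B.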

  6. Split Brain Detection (1)
     [Figure: nodes A and B emulating C; path (1) is the inter-DAS link, path (2) is network connectivity between A and B, path (3) is the DRNI links towards W. Other figure labels: X, Y, Z.]
     It is desirable to solve the “hard” split brain problem to ensure that a portal continues to operate as a single virtual device whichever node within it might fail,
     • which in turn requires that we have a robust way of determining that a node has failed, and not just been partially disconnected.
     Assertion:
     • it is necessary to check for node reachability by all possible paths before being entitled to regard it as dead.
     So:
     • normal “keep-alive” can be limited to run on the inter-DAS link (1), but if that fails (e.g. from the PoV of A seeking to establish the reachability of B above)
       • we need to probe for network connectivity between A and B (2), and
       • we need to ascertain reachability of B via the DRNI (3).
     If B is unreachable by all routes, it doesn’t matter if it has failed or not.
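The assertion on this slide (check every path before declaring the peer dead) might look like the sketch below. The `Path` enum, the `probes` mapping, and the probe callables are assumptions made for illustration; the slides do not define an API.

```python
from enum import Enum


class Path(Enum):
    INTER_DAS = 1  # keep-alive on the inter-DAS link
    NETWORK = 2    # probe through the node's own network
    DRNI = 3       # reachability learned via the DRNI


def peer_unreachable(probes):
    """probes maps each Path to a zero-argument callable that returns True
    when the peer answered on that path. Paths (2) and (3) are probed only
    after the keep-alive on (1) has failed."""
    if probes[Path.INTER_DAS]():
        return False
    # The peer counts as unreachable only if every remaining path also fails.
    return not (probes[Path.NETWORK]() or probes[Path.DRNI]())
```

Note that a result of `True` does not say whether B has failed or is merely cut off on all routes; as the slide observes, that distinction no longer matters.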

  7. Split Brain Detection (2)
     [Figure: nodes A and B emulating C; path (1) is the inter-DAS link, path (2) is network connectivity between A and B, path (3) is the DRNI links towards W. Other figure labels: X, Y, Z.]
     If the inter-DAS link (1) fails (e.g. from the PoV of A seeking to establish the reachability of B above):
     • We need to probe for network connectivity between A and B (2); this should be straightforward:
       • LBM from MEP(Sys ID A) → MEP(Sys ID B)?
     • We need to ascertain the reachability of B via the DRNI (3):
       • it is not clear how to probe B directly from A (and be sure to use all the links (3) of the DRNI),
       • W may believe all links are a distributed LAG (poisoned reverse), so propose:
         • A could harvest from W the full list of Port IDs being offered by C,
         • and would need to request that this information is “fresh”,
       • but the mechanism must also handle a dual-homed legacy real node W:
         • is there a mechanism to allow this?
     What then?
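The Port ID harvesting idea is only a proposal on this slide, so the following is a speculative sketch rather than a defined mechanism. It assumes that A knows which Port IDs of C are hosted on B, that A can obtain W's view together with its age, and that a freshness limit `max_age_s` exists; all of these names and values are hypothetical.

```python
def b_reachable_via_drni(harvested_port_ids, ports_hosted_on_b, age_s,
                         max_age_s=1.0):
    """Return True or False when W's view is usable, or None when the
    information is missing or not fresh enough to trust."""
    if harvested_port_ids is None or age_s > max_age_s:
        return None
    # B is still attached to W if W sees at least one of C's ports on B.
    return bool(set(harvested_port_ids) & set(ports_hosted_on_b))
```

The three-valued result keeps "unknown" distinct from "unreachable", which matters when deciding whether connectivity must be assumed.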

  8. Split Brain Detection (3)
     [Figure: nodes A and B emulating C; path (1) is the inter-DAS link, path (2) is network connectivity between A and B, path (3) is the DRNI links towards W. Other figure labels: X, Y, Z.]
     What then?
     • “Assure that one and only one device continues to operate with the identity of the single virtual device on failure of the control path”.
     • If a Node sees zero connectivity to its “mate” Node, it picks up the DRNI identity C;
     • If a Node has lost the inter-DAS link (1) and connectivity via its own network (2),
       • but some physical connectivity to its “mate” is advertised by W over the DRNI (3),
       • or that information is not available (and so we must assume that connectivity exists),
       then the network behind A and B is severed:
       • Node A reverts to its “real” LAG parameters as A,
       • or would it be less disruptive to run its part of C using “last agreed parameters”?
     • Else use own network (2) to negotiate roles, or exchange DRCP messages.
     [Table: columns 1, 2, 3 and outcome] 000 → a); 001 → b); 010 → c); 011 → c)
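The rules on this slide can be read as a small decision function. The sketch below is one interpretation, not the author's stated algorithm: it assumes the table's three digits are the states of paths (1), (2), and (3), and that outcomes a), b), c) correspond to the three rules in the order they appear. The return strings and the "normal" case for a working inter-DAS link are additions for illustration.

```python
def decide(link1_up, net2_up, drni3_up):
    """drni3_up is True, False, or None (information not available)."""
    if link1_up:
        return "normal"                   # not covered by the slide's table
    if net2_up:
        return "negotiate_via_network"    # c) rows 010 and 011
    if drni3_up is False:
        return "assume_identity_C"        # a) row 000: zero connectivity
    # b) row 001, or unknown: connectivity via the DRNI must be assumed
    return "revert_to_own_identity"


# The four rows of the table under this reading.
for row in [(False, False, False), (False, False, True),
            (False, True, False), (False, True, True)]:
    print(row, decide(*row))
```

Treating `None` like `True` follows the slide's remark that missing information means connectivity must be assumed to exist.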
