Automatic Protection Switching

Automatic Protection Switching. Yaakov (J) Stein CTO RAD Data Communications. Mar 2012. Course Outline General protection switching principles Examples of protection mechanisms SONET/SDH Ethernet linear protection Ethernet ring protection MPLS fast reroute

Presentation Transcript


  1. Automatic Protection Switching. Yaakov (J) Stein, CTO, RAD Data Communications. Mar 2012

  2. Course Outline • General protection switching principles • Examples of protection mechanisms • SONET/SDH • Ethernet linear protection • Ethernet ring protection • MPLS fast reroute • MPLS-TP APS

  3. General principles: • Definition • References • Traffic types • Network topologies • Triggers • Protection classes • Entities • Protection types • Signaling

  4. Definition: Automatic Protection Switching (APS) • is a functionality of carrier-grade transport networks • is often called resilience, since it enables service to recover quickly from failures • is required to ensure high reliability and availability. APS includes: • detection of failures (signal fail or signal degrade) on a working channel • switching traffic transmission to a protection channel • selecting traffic reception from the protection channel • (optionally) reverting to the working channel once the failure is repaired. Automatic means that APS uses (at most) control plane protocols; no management layer or manual operations are needed.

  5. Some useful references: • G.808.1: generic linear protection • G.808.2: generic ring protection (not yet written) • G.841 and G.842: SDH • G.774.3/4/9/10: SDH protection management • G.870 and G.873.1: OTN • G.8031: Ethernet linear protection • G.8032: Ethernet ring protection • G.8131: T-MPLS APS • Y.1720: MPLS • I.630: ATM • M.495: analog signal protection • G.781: clock selection (can be used to protect synchronization) • RFC 4090: MPLS Fast ReRoute • RFC 6372: MPLS-TP Survivability Framework • RFC 6378: MPLS-TP Linear Protection

  6. Traffic types: In a network with APS capabilities, there are three types of traffic: • protected traffic: traffic that may be rapidly switched to the protection channel; at any time it may be on the working channel or the protection channel • Nonpreemptible Unprotected Traffic (NUT): noncritical traffic that does not require a protection mechanism; it is not affected by the protection mechanism and is somewhat less expensive to the customer • extra (preemptible) traffic: best-effort background traffic that runs on the protection channel; it is preempted (blocked) when the protection channel is needed and is very inexpensive to the customer

  7. Network topologies: APS can be defined for any topology with redundant links (e.g., for tree topologies no protection is possible). We will often discuss protection of individual links. However, there are two topologies that are of particular interest: • rings: protection is natural for rings, although there are other reasons for using rings as well; rings are so important that protection for other topologies is often called linear protection • dense meshes: for this topology multiple local bypasses can be preconfigured; protection switching is similar to a routing change, but faster; often called “Fast ReRoute” (FRR)

  8. Triggers: Protection switching is usually triggered by a failure, although the operator may manually force a protection switch. A failure is declared when a fault condition persists long enough for the ability to perform the required function to be considered terminated. Failures are Signal Fail (SF) or Signal Degrade (SD) (of various types) and may be: • detected by the physical layer • indicated by signaling (e.g. AIS) • detected by OAM mechanisms. When there is no SF or SD, the state is called No Request (NR).

  9. Switching time (1): SONET/SDH protection switching takes place in under 50 ms. Regarding multiplex section shared protection rings, G.841 states: The following network objectives apply: 1) Switch time – In a ring with no extra traffic, all nodes in the idle state (no detected failures, no active automatic or external commands, and receiving only Idle K-bytes), and with less than 1200 km of fibre, the switch (ring and span) completion time for a failure on a single span shall be less than 50 ms. On rings under all other conditions, the switch completion time can exceed 50 ms (the specific interval is under study) to allow time to remove extra traffic, or to negotiate and accommodate coexisting APS requests. For linear VC trail protection, it says: The following network objectives apply: 1) Switch time – The APS algorithm for LO/HO VC trail protection shall operate as fast as possible. A value of 50 ms has been proposed as a target time. Concerns have been expressed over this proposed target time when many VCs are involved. This is for further study. Protection switch completion time excludes the detection time necessary to initiate the protection switch, and the hold-off time. There are similar statements in other clauses as well.

  10. Switching time (2): This 50 ms time has become the gold standard, and new protection schemes are expected to meet this objective. However, studying the literature that led up to the SONET/SDH standards shows that the objective was to attain the minimum possible time for the sum of • persistent (i.e. non-transient) failure detection • speed of light propagation • signaling protocol time • regaining sync alignment, and 50 ms was the minimum that was considered practical! Many modern standards have “built in” 50 ms, and much marketing literature boasts “faster than 50 ms”. But there is really nothing special about 50 ms: • 50 ms gaps in voiced speech are noticeable, but not fatal if infrequent • 50 ms of data at high rates cannot be stored and later forwarded • timing circuits can withstand much more than 50 ms without a clock
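
To put rough numbers on two of the items above, here is a back-of-envelope sketch. The figure of about 5 microseconds per km of fibre and the 10 Gb/s line rate are assumptions chosen for illustration; neither appears in the slides.

```python
# Rough arithmetic only. ASSUMPTIONS (not from the slides):
#   - light in fibre takes roughly 5 microseconds per km
#   - 10 Gb/s is just an example line rate
FIBRE_DELAY_US_PER_KM = 5.0


def propagation_ms(fibre_km: float) -> float:
    """One-way propagation delay over the given fibre length, in ms."""
    return fibre_km * FIBRE_DELAY_US_PER_KM / 1000.0


def megabytes_in_outage(rate_bps: float, outage_ms: float) -> float:
    """Data arriving during an outage, i.e. what a buffer would have to hold, in MB."""
    return rate_bps * outage_ms / 8000.0 / 1e6


print(propagation_ms(1200))           # delay over the 1200 km quoted from G.841
print(megabytes_in_outage(10e9, 50))  # data arriving in 50 ms at the example rate
```

Under these assumptions propagation uses only a small part of the 50 ms budget, while the amount of data arriving in 50 ms at a high line rate is large, which is the point of the "cannot be stored and later forwarded" bullet.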

  11. Protection classes: It is useful to distinguish two different protection classes: • path protection (AKA trail protection, end-to-end protection): when a failure is detected on the end-to-end path we switch to an alternative end-to-end path; the failure is usually detected by end-to-end OAM • local protection (AKA local restoration, SNC protection, bypass, detour): we protect individual network elements, links, or groups of same; when such an entity fails only that local entity is bypassed; the failure may be detected by link OAM or physical layer means

  12. APS entities (1): The following entities are important in APS: • working channel: channel used when no failure exists • protection channel: channel used when a failure exists • head-end: entity transmitting data to the working/protection channel • tail-end: entity receiving data from the working/protection channel. Note: we will usually consider traffic to be bidirectional, so that the head-end for one direction is the tail-end for the opposite direction. [Figure: head-end and tail-end connected by a working channel and a protection channel]

  13. APS entities (2): • Bridge: function at head-end that connects traffic (including extra traffic) to the working and protection channels • Selector: function at tail-end that extracts traffic (perhaps extra traffic) from the working or protection channel • APS signaling channel: channel used to communicate between head-end and tail-end for APS purposes • Trail termination: function responsible for failure detection, including injection and extraction of OAM. [Figure: head-end (bridge) and tail-end (selector) connected by working channel, protection channel, and signaling channel]

  14. Revertive operation: Reversion means returning to use the working channel after the failure has been rectified. Protection mechanisms can be revertive or nonrevertive. Revertive mechanisms may be preferable • when the working channel has better performance (free BW, BER, delay) • when there are frequent switches (easier to manage) • when there is extra traffic. But nonrevertive also has advantages: • only one service disruption due to protection switching • may be simpler to implement

  15. Uni/bi-directional: We will usually consider bidirectional traffic, but even then the failures can be uni- or bi-directional, and for unidirectional failures there can be uni- or bi-directional switching. [Figure: working and protection channels for a unidirectional failure with unidirectional protection, a unidirectional failure with bidirectional protection, and a bidirectional failure]

  16. Uni-/bi-directional switching: Unidirectional switching may be advantageous: • for 1+1 it is faster and no signaling channel is needed • no unnecessary service disruption for the direction without failure • higher chance of protection under multiple failures • easier to implement for local protection • maintains extra traffic in the direction without failure. But bidirectional may be preferable: • easier management, since both directions traverse the same network elements • does not disrupt delay balance between directions • may simplify repair, since failed spans are unused

  17. Protection types: We distinguish several different protection types: • 1+1 • 1:1 • 1:n • m:n • (1:1)n. Each type has its applicability, advantages, and disadvantages, and there are trade-offs between • simplicity • BW consumption • protection switch time • signaling requirements

  18. 1+1 protection: Simplest and fastest form of protection, but wasteful: only 50% of the actual physical capacity is used. The head-end bridge always sends data on both channels. The tail-end selector chooses which channel to use (based on BER, dLOS, etc.). For unidirectional 1+1 switching there is no need for APS signaling. If non-revertive, there is no distinction between working and protection channels. [Figure: channel A and channel B]
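
The tail-end behaviour described above can be sketched as follows. This is a minimal illustration of a nonrevertive 1+1 selector, not an implementation of any standard; the `ChannelStatus` type, the `SF_BER_THRESHOLD` value, and the rule "stay put unless the current channel is unusable and the other is usable" are assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical threshold: a channel with BER at or above this is treated as failed.
SF_BER_THRESHOLD = 1e-3


@dataclass
class ChannelStatus:
    los: bool   # loss-of-signal defect (dLOS) currently raised
    ber: float  # measured bit error rate


def usable(ch: ChannelStatus) -> bool:
    """A channel is usable if it has signal and an acceptable BER."""
    return not ch.los and ch.ber < SF_BER_THRESHOLD


def select(current: str, a: ChannelStatus, b: ChannelStatus) -> str:
    """Nonrevertive 1+1 selector: both channels carry the traffic, so the
    tail-end just picks one. It leaves the current channel only when that
    channel is unusable and the other one is usable."""
    status = {"A": a, "B": b}
    other = "B" if current == "A" else "A"
    if not usable(status[current]) and usable(status[other]):
        return other
    return current
```

Because the head-end always bridges to both channels, the decision is purely local to the tail-end, which is why no APS signaling is needed in the unidirectional case.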

  19. 1:1 protection: The head-end bridge usually sends data on the working channel. When a failure is detected it starts sending data over the protection channel, and the tail-end needs to select the protection channel. When not in use, the protection channel can be used for extra traffic. However, since the failure is detected by the tail-end, APS signaling is needed. The protection channel should have OAM running to ensure its functionality. [Figure: working channel, protection channel carrying extra traffic, and APS signaling]

  20. 1:n protection: One protection channel is allocated for n working channels. It can protect only one working channel at a time, but it is improbable that more than 1 working channel will fail simultaneously. Only 1/(n+1) of the total capacity is reserved for protection. [Figure: n working channels and one protection channel]

  21. m:n protection: To enable protection of more than 1 channel, m protection channels are allocated for n working channels (m < n). m simultaneous failures can be protected. Less protection capacity is dedicated than for n times 1:1. When a failure is detected, 1 of the m protection channels needs to be assigned and signaled. High complexity, but conserves resources. [Figure: n working channels and m protection channels]
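
The capacity trade-off between these protection types can be made concrete with a little arithmetic. The sketch below assumes all channels have equal capacity (an assumption made here, not stated on the slides) and generalizes the 1/(n+1) figure given for 1:n to m/(m+n) for m:n.

```python
from fractions import Fraction


def reserved_fraction(m: int, n: int) -> Fraction:
    """Share of total capacity set aside for protection when m protection
    channels guard n working channels of equal capacity (assumed)."""
    if m < 1 or n < 1:
        raise ValueError("need at least one protection and one working channel")
    return Fraction(m, m + n)


# 1+1 and 1:1 both dedicate one protection channel per working channel.
print(reserved_fraction(1, 1))
# 1:n with n = 4 working channels.
print(reserved_fraction(1, 4))
# m:n with m = 2, n = 8, compared with eight separate 1:1 pairs.
print(reserved_fraction(2, 8), "versus", reserved_fraction(8, 8))
```

Exact fractions are used so that the results can be compared directly with the 50% and 1/(n+1) figures quoted above.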

  22. (1:1)n protection: This is like n times 1:1, but the n protection channels share bandwidth. Only 1 failed working channel can be protected. This is different from 1:n since • n protection channels are preconfigured • the n working channels need not be of the same type. The protection bandwidth must be at least that of the largest working channel.

  23. APS algorithm: We have seen that protection switching is a tricky business, so it is not surprising that network elements that support APS run an APS algorithm. This algorithm takes as inputs: • configuration (protection type, revertive?, available channels, …) • failure indications (NR, SF, SD) • operator commands • APS signaling (more on that soon), and makes switching decisions. The algorithm maintains state information for head-end and tail-end. APS algorithms are detailed in standards documents.

  24. Priority: Not every failure event / operator command results in a protection switch. For example, in 1:n protection the protection channel may already be in use! Conflicts are resolved by assigning priorities to events/commands. When an event is detected or a command received, the APS algorithm will not act if an event/command of equal or higher priority is already in effect. True failure conditions usually have higher priority than manual commands.
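
The rule above (do nothing if a request of equal or higher priority is already in effect) can be sketched as a small arbiter. The priority ordering in `Request` is illustrative only and is not taken from any standard; the class and method names are likewise hypothetical.

```python
from enum import IntEnum
from typing import Optional


class Request(IntEnum):
    # Illustrative ordering only (higher value = higher priority);
    # real standards define their own, more detailed tables.
    NR = 0             # No Request
    MANUAL_SWITCH = 1  # operator command
    SD = 2             # Signal Degrade
    SF = 3             # Signal Fail


class ProtectionArbiter:
    """Tracks which request currently owns the single protection channel."""

    def __init__(self) -> None:
        self.active = Request.NR
        self.channel: Optional[int] = None

    def submit(self, request: Request, channel: int) -> bool:
        """Return True if the request wins the protection channel."""
        if request <= self.active:
            return False  # equal or higher priority already in effect
        self.active, self.channel = request, channel
        return True

    def clear(self) -> None:
        """The active condition has cleared; fall back to No Request."""
        self.active, self.channel = Request.NR, None
```

For example, once an SD on one working channel holds the protection channel, a later manual switch or a second SD is ignored, while an SF on another channel preempts it.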

  25. Timers: Even failure events with priority are not acted upon immediately; to do so would cause unnecessary switches after transient defects. The APS algorithm may maintain several timers, such as • Holdoff timers: the time between detection of an SF or SD event and the APS algorithm acting upon this event; the algorithm usually used is called “peek twice”, i.e., the condition is checked again after the timer expires • Wait To Restore timer: for revertive switching, the time between detection of the failure being cleared and the APS algorithm acting upon this event; also used in SDH optimized bidirectional 1+1 (nonrevertive) • Guard timer: for rings, a blockout time during which APS messages are ignored (since they may be old and outdated)
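
A minimal sketch of the hold-off ("peek twice") and Wait To Restore checks follows. Time is passed in explicitly so the logic can be exercised without real clocks; the function names, the `is_failed` callback, and the example timer values are assumptions made for illustration.

```python
from typing import Callable


def holdoff_confirms(is_failed: Callable[[float], bool],
                     detect_ms: float, holdoff_ms: float) -> bool:
    """'Peek twice': the condition must be present at detection time and
    still present when the hold-off timer expires."""
    if not is_failed(detect_ms):
        return False
    return is_failed(detect_ms + holdoff_ms)


def may_revert(cleared_ms: float, now_ms: float, wtr_ms: float,
               failed_again: bool) -> bool:
    """Revertive switching: go back to the working channel only after it
    has stayed clear for the whole Wait To Restore period."""
    return (not failed_again) and (now_ms - cleared_ms >= wtr_ms)


# A 20 ms glitch does not survive a 50 ms hold-off; a persistent failure does.
glitch = lambda t: 0 <= t < 20
persistent = lambda t: t >= 0
print(holdoff_confirms(glitch, 0, 50), holdoff_confirms(persistent, 0, 50))
```

The sketch samples the condition at only two instants, which matches the "peek twice" description but would miss a defect that clears and reappears within the hold-off period.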

  26. APS signaling: In all types except unidirectional 1+1, some APS signaling is needed. APS signaling is used to synchronize between head-end and tail-end. It is critical that head-end and tail-end always be in the same state. Example messages include: • No Request (NR) • Signal Failure (SF), sent by the tail-end to inform the head-end • confirmation of the event’s priority, sent by the head-end • report of the particular protection channel, sent by the head-end • Reverse (bidirectional) Request (RR), sent by the head-end to inform the tail-end • Wait To Restore (WTR), sent by the tail-end after the failure has cleared • Do Not Revert (DNR), sent by the tail-end after the failure has cleared, for nonrevertive operation

  27. APS signaling phases: When APS signaling is used, it needs to be as rapid as possible. Depending on the scenario it may be • 1-phase, tail→head (fastest): the tail-end informs the head-end of the failure; both ends uniquely know the protection channel to be used; only for 1+1 and unidirectional (1:1)n (including 1:1) • 2-phase, 1) tail→head 2) head→tail: the tail-end informs the head-end of the failure; the head-end signals that it has switched to the protection channel; not for bidirectional 1:n or m:n • 3-phase, 1) tail→head 2) head→tail 3) tail→head (slowest): works for all protection types (including m:n)

  28. Examples of 1-phase: An example of when 1-phase signaling is possible is 1:1 or (1:1)n: 1. upon detection of failure the tail-end sends SF to the head-end and immediately changes its selector (blind switch); upon receipt the head-end changes the bridge setting (no priority is checked). 1-phase can also be used for bidirectional 1:1: 1. upon detection of failure the tail-end sends SF to the head-end and immediately changes both its selector and bridge; upon receipt the head-end changes its bridge and selector.
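
The two 1-phase exchanges above can be traced with a toy model. The `End` type and the message tuples are hypothetical; the sketch only records who sends what and which settings change, and does not model any real message format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class End:
    bridge: str = "working"    # channel this end transmits on
    selector: str = "working"  # channel this end receives from


def one_phase_unidirectional(head: End, tail: End) -> List[Tuple[str, str]]:
    """Tail-end switches blindly, then the head-end follows on receipt of SF."""
    messages = []
    tail.selector = "protection"           # blind switch, before any reply
    messages.append(("tail->head", "SF"))  # the only message exchanged
    head.bridge = "protection"             # no priority check at the head-end
    return messages


def one_phase_bidirectional(head: End, tail: End) -> List[Tuple[str, str]]:
    """Same single message, but both directions are moved to protection."""
    messages = []
    tail.selector = tail.bridge = "protection"
    messages.append(("tail->head", "SF"))
    head.bridge = head.selector = "protection"
    return messages
```

In both cases a single message suffices because each end already knows which protection channel to use.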

  29. Example of 2-phase: 2-phase is useful for unidirectional 1:n with priority checking: 1. upon detection of failure the tail-end sends SF to the head-end but does not change its selector 2. the head-end checks priority, sends confirmation to the tail-end (with the identity of the working channel), and changes the bridge setting 3. the tail-end changes its selector
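
The 2-phase steps above can be sketched as follows. The data types, the integer priorities, and in particular the behaviour when the priority check fails (the head-end simply sends no confirmation) are assumptions made for the example; the slide does not say what happens in that case.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class HeadEnd:
    bridged: Optional[int] = None  # working channel bridged onto protection
    priority: int = 0              # priority of the request being served (0 = NR)


@dataclass
class TailEnd:
    selected: Optional[int] = None  # working channel taken from protection


def two_phase(head: HeadEnd, tail: TailEnd,
              channel: int, priority: int) -> List[Tuple[str, str, int]]:
    # Phase 1: tail-end reports the failure but leaves its selector alone.
    messages = [("tail->head", "SF", channel)]
    # Head-end priority check (refusal behaviour is an assumption).
    if priority <= head.priority:
        return messages
    head.bridged, head.priority = channel, priority
    # Phase 2: confirmation carries the identity of the working channel.
    messages.append(("head->tail", "CONFIRM", channel))
    # Tail-end switches only after the confirmation arrives.
    tail.selected = channel
    return messages
```

Unlike the blind switch of the 1-phase case, the tail-end here never selects the protection channel unless the head-end has agreed to bridge that particular working channel onto it.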

  30. Example of 3-phase: 3-phase signaling is imperative for bidirectional 1:n: 1. upon detection of failure the tail-end sends SF to the head-end but does not change its selector 2. the head-end checks priority and sends confirmation to the tail-end; the head-end changes its bridge setting and also sends a reverse request 3. the tail-end changes its selector, checks priority, and sends confirmation to the head-end; the tail-end changes its bridge setting (as head-end of the opposite direction); the head-end receives the confirmation and changes its selector
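
The order of events in the 3-phase exchange can be traced with a small model. The `Node` type and message labels are hypothetical, and both priority checks are assumed to pass; the sketch only shows which setting changes after which message.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    name: str
    bridge: Optional[int] = None    # working channel bridged onto protection
    selector: Optional[int] = None  # working channel taken from protection


def three_phase(head: Node, tail: Node, channel: int) -> List[Tuple[str, str, str]]:
    """Bidirectional 1:n switch; priority checks are assumed to succeed."""
    log = []
    # Phase 1: tail-end reports SF, selector unchanged.
    log.append((tail.name, head.name, "SF"))
    # Phase 2: head-end confirms, bridges, and asks for the reverse direction.
    head.bridge = channel
    log.append((head.name, tail.name, "CONFIRM + REVERSE REQUEST"))
    # Phase 3: tail-end selects, bridges the reverse direction, and confirms.
    tail.selector = channel
    tail.bridge = channel
    log.append((tail.name, head.name, "CONFIRM"))
    # Head-end completes the reverse direction on receipt.
    head.selector = channel
    return log
```

After the third message both nodes have bridge and selector set for the same working channel, which is the synchronized state the signaling is meant to guarantee.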

  31. For G.805 buffs: to add 1+1 trail protection to a trail • expand a trail termination function • we use a special transport processing function, the protection switch • the unprotected TTs report status to the protection switch. [Figure: unprotected trail and protected trail]

  32. SONET/SDH APS

  33. SONET protection? SONET/SDH networks need to be highly reliable (five nines). Down-time should be minimal (less than 50 msec), so systems must repair themselves (no time for manual intervention). Upon detection of a failure (dLOS, dLOF, high BER) the network must reroute traffic (protection switching) from the working channel to the protection channel. SDH APS is unidirectional. SDH APS may be revertive. [Figure: head-end NE and tail-end NE connected by working channel and protection channel]

  34. SONET/SDH layers: Between regenerators there are sections (regenerator sections). Between ADMs there are lines (multiplex sections). Between path terminations there are paths. Protection can be at OC-n level (different physical fibers), at STM/VC level, or end-to-end path (trail protection). [Figure: Path Termination, ADM (Line Termination), regenerator (Section Termination), ADM (Line Termination), Path Termination, with the path, line (MS), and section layers spanning them]

  35. Line APS: TOH consists of • 3 rows of section overhead: frame sync, trace, EOC, … • 6 rows of line overhead: pointers, SSM, FEBE, and line APS. Line APS signaling uses bytes K1 and K2. [Figure: frame of 90 columns by 9 rows, with the TOH (3 rows above 6 rows) beside the Synchronous Payload Envelope]
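
The frame dimensions on this slide can be turned into byte counts and a line rate. The slide gives 90 columns, 9 rows, and the 3/6 row split; the TOH width of 3 columns and the rate of 8000 frames per second are standard STS-1 figures added here as assumptions, since the slide does not state them.

```python
ROWS, COLUMNS = 9, 90           # from the slide
SECTION_ROWS, LINE_ROWS = 3, 6  # from the slide
TOH_COLUMNS = 3                 # assumption: standard STS-1 transport overhead width
FRAMES_PER_SECOND = 8000        # assumption: one frame every 125 microseconds

frame_bytes = ROWS * COLUMNS
toh_bytes = ROWS * TOH_COLUMNS
section_overhead_bytes = SECTION_ROWS * TOH_COLUMNS
line_overhead_bytes = LINE_ROWS * TOH_COLUMNS
envelope_bytes = frame_bytes - toh_bytes
line_rate_bps = frame_bytes * 8 * FRAMES_PER_SECOND

print(frame_bytes, toh_bytes, section_overhead_bytes, line_overhead_bytes)
print(envelope_bytes, line_rate_bps)
```

K1 and K2 are two of the line overhead bytes, so each is carried once per frame under these assumptions.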

  36. HO Path APS: POH is responsible for type, status, path performance monitoring, VCAT, trace. HO Path APS signaling uses the 4 MSBs of byte K3. [Figure: POH bytes J1, B3, C2, G1, F2, H4, F3, K3, N1]

  37. LO Path APS: VC OH is responsible for timing, PM, REI, … LO Path APS signaling uses the 4 MSBs of byte K4. [Figure: VC OH bytes V5, J2, V1, V2, N2, V3, K4, V4, with column positions 1, 30, 59, 87]

  38. How does it work? Head-end and tail-end NEs have bridges (muxes). Head-end and tail-end NEs maintain a bidirectional signaling channel. Signaling is contained in the K bytes of the protection channel. For line APS: • K1: tail-end status and requests • K2: head-end status. [Figure: head-end bridge and tail-end bridge connected by working channel, protection channel, and signaling channel]

  39. Linear 1+1 protection: Can be at OC-n level (different physical fibers), at STM/VC level (SubNetwork Connection Protection), or end-to-end path (called trail protection). The head-end bridge always sends data on both channels. The tail-end chooses which channel to use based on BER, dLOS, etc. No need for signaling. If non-revertive, there is no distinction between working and protection channels. [Figure: head-end NE and tail-end NE connected by working channel and protection channel]

  40. Linear 1:1 protection: The head-end bridge usually sends data on the working channel. When the tail-end detects a failure it signals (using K1) to the head-end. The head-end then starts sending data over the protection channel. When not in use, the protection channel can be used for (discounted) extra traffic (pre-emptible unprotected traffic). May be at any layer (but only OC-n level protects against fiber cuts). [Figure: working channel and protection channel carrying extra traffic]

  41. Linear 1:N protection: In order to save BW we allocate 1 protection channel for every N working channels. N is limited to 14, since 4 bits in the K1 byte from tail-end to head-end identify the channel: • 0: protection channel • 1-14: working channels • 15: extra traffic channel. [Figure: N working channels and one protection channel]
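
The 4-bit channel field explains the limit of 14 working channels, and it can be sketched as simple bit packing. The slide only says that 4 bits of K1 carry the channel number; placing it in the low-order nibble with a 4-bit request code in the high-order nibble is an assumption of this sketch, and the request value used in the example is arbitrary.

```python
from typing import Tuple


def channel_kind(channel: int) -> str:
    """Interpret the 4-bit channel number as listed on the slide."""
    if not 0 <= channel <= 15:
        raise ValueError("channel number must fit in 4 bits")
    if channel == 0:
        return "protection"
    if channel == 15:
        return "extra traffic"
    return "working"


def pack_k1(request: int, channel: int) -> int:
    """Assumed layout: request code in the high nibble, channel in the low nibble."""
    if not 0 <= request <= 15 or not 0 <= channel <= 15:
        raise ValueError("both fields must fit in 4 bits")
    return (request << 4) | channel


def unpack_k1(k1: int) -> Tuple[int, int]:
    """Split a K1 byte into (request, channel) under the same assumed layout."""
    return k1 >> 4, k1 & 0x0F


k1 = pack_k1(0b1100, 5)  # arbitrary request code, working channel 5
print(hex(k1), unpack_k1(k1), channel_kind(5))
```

With values 0 and 15 reserved, only 1 through 14 remain for working channels, which is where the limit on N comes from.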

  42. Two-fiber vs. four-fiber rings: Ring based protection is popular in North America (100K+ rings): • full protection against physical fiber cuts • simpler and less expensive than mesh topologies • protection at line (multiplex section) or path layer. Four-fiber rings: • fully redundant at OC level • can support bidirectional routing at line layer. Two-fiber rings: • support unidirectional routing at line layer • 2 fibers in opposite directions

  43. Unidirectional vs. bidirectional: Unidirectional routing: • working channel B-A runs in the same direction (e.g. clockwise) as A-B • management simplicity: A-B and B-A can occupy the same timeslots • inefficient: waste of ring BW and excessive delay in one direction. Bidirectional routing: • A-B and B-A are opposite in direction, both using the shortest route • spatial reuse: timeslots can be reused in other sections. [Figure: rings with nodes A, B, C showing the routes A-B, B-A, B-C, C-B]

  44. UPSR vs. BLSR (MS-SPRing): Of all the possible combinations, only a few are in use. Unidirectional (routing) Path Switched Rings: • protects tributaries • extension of 1+1 to ring topology. Bidirectional (routing) Line Switched Rings (two-fiber and four-fiber versions): • called Multiplex Section Shared Protection Ring in SDH • simultaneously protects all tributaries in STM • extension of 1:1 to ring topology. [Figure: combinations of unidirectional/bidirectional routing, path/line switching, and two-fiber/four-fiber rings, with UPSR and BLSR marked]

  45. UPSR: The working channel is in one direction, the protection channel in the opposite direction. All path traffic is “added” in both directions (1+1); the decision as to which to use is made at the drop point (no signaling). Normally non-revertive, so effectively two diversity paths. Good match for access networks: • 1 resilient access ring is less expensive than a fiber pair per customer. Inefficient for core networks: • no spatial reuse: every signal is in every span in both directions • a node needs to continuously monitor every tributary to be “dropped”. [Figure: SONET ADMs on 2 rings]

  46. BLSR: Switch at line level, so less monitoring. When a failure is detected the tail-end NE signals the head-end NE. Works for unidirectional/bidirectional fiber cuts, and NE failures. Two-fiber version: • half of OC-N capacity devoted to protection • only half capacity available for traffic. Four-fiber version: • full redundant OC-N devoted to protection • twice as many NEs as compared to two-fiber. [Figure: example recovery from a unidirectional fiber cut by wrap-around on 2 rings]
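
The wrap-around recovery pictured on this slide can be illustrated with a toy route calculation. The sketch assumes a ring given as a list of nodes in clockwise order, a working route that runs clockwise, and a failed span named as a clockwise (from, to) pair; these conventions and the function names are assumptions for illustration, not part of any standard.

```python
from typing import List, Sequence, Tuple


def clockwise(ring: Sequence[str], a: str, b: str) -> List[str]:
    """Nodes visited going around `ring` in list order from a to b."""
    i, n = ring.index(a), len(ring)
    out = [a]
    while out[-1] != b:
        i = (i + 1) % n
        out.append(ring[i])
    return out


def wrapped_route(ring: Sequence[str], src: str, dst: str,
                  failed: Tuple[str, str]) -> List[str]:
    """Route after line switching: the node before the failed span loops
    traffic back the other way around the ring to the node after it."""
    u, v = failed
    route = clockwise(ring, src, dst)
    if (u, v) not in zip(route, route[1:]):
        return route                          # failure is not on this route
    k = route.index(u)
    detour = clockwise(ring[::-1], u, v)      # counterclockwise from u to v
    return route[:k] + detour + route[k + 2:]


ring = ["A", "B", "C", "D"]
print(wrapped_route(ring, "A", "C", ("B", "C")))
```

The wrapped route is longer than the original because traffic still travels up to the failure before being looped back, which is the price of switching at line level without per-tributary decisions.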

  47. Ethernet linear APS STP LAG G.8031

  48. STP: The original Spanning Tree Protocol automatically removed loops from arbitrary networks (with loops). However, its convergence was very slow (about a minute). STP cannot be used as a protection mechanism, since its reconvergence time is very long due to a cumbersome protocol and long holdoff timer settings. An evolutionary update called Rapid STP (802.1w) was incorporated into 802.1D-2004 clause 17; it converges in about the same time as STP, but can reconverge after a topology change in less than 1 second. RSTP can be used to detect failures and reconverge, and thus can be used as a primitive protection mechanism. However, the switching time will be many tens of ms to 100s of ms.

  49. Use of LAG: Ethernet “link aggregation” (AKA bonding, Ethernet trunk, inverse mux, NIC teaming) enables bonding several ports together as a single uplink. Defined by the 802.3ad task force and folded into 802.3-2000 as clause 43. Binding of ports to Link Aggregation Groups (LAGs) is distributed via the Link Aggregation Control Protocol (LACP). LACP uses slow protocol frames (up to 5 per second). Links may be dynamically added to or removed from a LAG, and LACP continuously monitors to detect whether changes are needed. Upon link failure the LAG delivers traffic at a reduced rate. Thus LAG can be used as a primitive protection mechanism. When used this way it is called worker/standby or N+N mode. The restoration time will be on the order of 1 second.

  50. G.8031: Q9 of SG15 in the ITU-T is responsible for protection switching. In 2006 it produced G.8031, Linear Ethernet Protection Switching. G.8031 uses standard Ethernet formats, but is incompatible with STP. The standard addresses • point-to-point VLAN connections • SNC (local) protection class • 1+1 and 1:1 protection types • unidirectional and bidirectional switching for 1+1 • bidirectional switching for 1:1 • revertive and nonrevertive modes • 1-phase signaling protocol. G.8031 uses Y.1731 OAM CCM messages in order to detect failures. G.8031 defines a new OAM opcode (39) for APS signaling messages. Switching times should be under 50 ms (only holdoff timers when groups).
