Explore definitions, applications, and strategies for MPLS protection and restoration in network failures, including failure modes, backup resources, failure analysis, and IP protection methods. Improve network reliability and minimize disruptions.
Protection and Restoration • Definitions • A major application for MPLS
The problem • Network resources will fail • Nodes and links • IGP will re-converge • But this may take some time • Tens of seconds • Fast convergence has a price • May make IGP more sensitive/unstable • I may have sensitive traffic that cannot afford interruptions • Voice, Consumer TV • Do something for the time until IGP re-converges
Terminology • Restoration • Bring traffic back to normal • Backup • Alternative resources to be used when there is a failure • Protection • Determine and allocate the backup resources before the failure • When there is a failure just activate them • Can be very fast • Repair • Determine, allocate and activate the backup resources after the failure • Will be slower
Failure Modes • Single vs. multiple link failures • If duration of link failure is short, can assume that there will be only a single link failure • Much harder to deal with multiple link failures • Node vs. link failures • Can assume that links will fail more frequently than nodes • Node failures are harder to handle
Backup resources • Can be multiple types • Links • Paths • Trees • Cycles • Whole topologies • To avoid network overload after a failure, I need some extra capacity for backup resources • Problem is how to engineer them so as not to make the network too expensive • Minimize the amount of backup capacity that is reserved
More jargon • 1:1 • 1 working, 1 backup • Wastes a lot of bandwidth for the backups • 1:N • N working and 1 backup • Assume that only 1 working will fail • Then 1 backup is enough – save bandwidth • Revertive: • when the failure is fixed, revert to the primary • SRLG: Shared Risk Link Group • A set of network links that fail together • E.g., fibers that are in the same conduit • A bulldozer will cut all of them together
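A small sketch of how an SRLG table could be used when picking a backup: a backup that avoids the failed link but sits in the same conduit is not really a backup. The link names and conduit identifiers are hypothetical, and the table is assumed to be configured by hand.

```python
# Hypothetical SRLG table: link -> set of shared-risk groups it belongs to.
srlg = {
    ("A", "B"): {"conduit-1"},
    ("A", "C"): {"conduit-1"},   # runs in the same conduit as A-B
    ("C", "B"): {"conduit-2"},
    ("A", "D"): {"conduit-3"},
    ("D", "B"): {"conduit-4"},
}

def risk_groups(path, srlg):
    """Union of the risk groups of every link on the path."""
    groups = set()
    for link in path:
        groups |= srlg.get(link, set())
    return groups

def srlg_disjoint(primary, backup, srlg):
    """True if no single shared risk can take down both paths."""
    return not (risk_groups(primary, srlg) & risk_groups(backup, srlg))

primary = [("A", "B")]
backup_same_conduit = [("A", "C"), ("C", "B")]
backup_other_conduit = [("A", "D"), ("D", "B")]

print(srlg_disjoint(primary, backup_same_conduit, srlg))   # shares conduit-1
print(srlg_disjoint(primary, backup_other_conduit, srlg))  # no shared risk
```

Both candidate backups are link-disjoint from the primary, but only the second one survives the bulldozer.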
Other issues • How to detect the failure fast • BFD is one general solution • There are medium-specific solutions • OAM for ATM • Alarms for SONET • Preferable if they exist • Protocol mechanisms (RSVP HELLOs, OSPF HELLOs, etc.) • How to activate the backup • I.e., how to make traffic use an alternate path, or a tree
Backbone failure analysis • Sprint backbone ca. March 2002 • Link in class website • Monitor IS-IS traffic • Data only for link failures, not node failures • Failure Duration • 50% of failures last less than 1 min • 40% of failures last between 1 and 20 min • Maintenance • 50% of failures during maintenance windows • Mean time between failures (MTBF) • Mean time between failures varies a lot across links • “good” and “bad” links • 3 bad links account for 25% of the failures
More analysis • Unplanned failure breakdown • Shared link failures = 30% • Router related = 16.5% • Optical related = 11.5% • Individual link failures = 70% • Node failures less common than single link failures • About 16.5% of failures affect more than 1 link
Handling failures with IP • Easy case • ECMP, no need to do anything extra during failure • But it may not repair all failures • Coverage: what percentage of the possible failures can be repaired • In general activating backup resources is hard with IP • Packets will follow the IP route table/FIB • Forwarding is hop-by-hop • Even if I compute a backup link for a failure, I have no control over what will happen after the next hop • May have routing loops
IP protection • Backup next-hop • Each node computes a backup next-hop for each destination • so that I will not have routing loops • It may not have 100% coverage • For more general solutions I need tunneling • Must force packets to reach their destination • Without crossing the failed resource • Tunnel to the node after the failed link • Tunnel to an intermediate node • IP tunneling is an expensive operation • It is packet encapsulation
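A minimal sketch of the backup next-hop idea, assuming the basic loop-free condition from the IP fast-reroute literature: a neighbor N of source S is a safe backup for destination D if dist(N, D) < dist(N, S) + dist(S, D), i.e. N's own shortest path to D does not come back through S. The two topologies and their costs are made up for illustration; the second one shows why coverage may be less than 100%.

```python
import heapq

INF = float("inf")

def spf(graph, src):
    """Dijkstra: shortest distance from src to every reachable node."""
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, INF):
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, INF):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def backup_next_hops(graph, s, dest):
    """Return (primary next hop, list of loop-free backup next hops)."""
    dist = {n: spf(graph, n) for n in graph}
    primary = min(graph[s], key=lambda n: graph[s][n] + dist[n].get(dest, INF))
    backups = [
        n for n in graph[s]
        if n != primary
        and dist[n].get(dest, INF) < dist[n].get(s, INF) + dist[s].get(dest, INF)
    ]
    return primary, backups

# Hypothetical triangle with equal costs: N is a loop-free backup for S -> D.
triangle = {
    "S": {"D": 1, "N": 1},
    "N": {"S": 1, "D": 1},
    "D": {"S": 1, "N": 1},
}
# Same triangle, but N-D is expensive: N's best path to D goes back through S.
skewed = {
    "S": {"D": 1, "N": 1},
    "N": {"S": 1, "D": 3},
    "D": {"S": 1, "N": 3},
}

print(backup_next_hops(triangle, "S", "D"))
print(backup_next_hops(skewed, "S", "D"))
```

In the skewed case S has no loop-free backup for D even though an alternate physical path exists, which is the situation where tunneling is needed.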
Not-Via addresses • Consider router A, with interfaces A1, A2, A3 • A1 connects to interface B1 of router B • A2 connects to interface C2 of router C • B1 has a second address B1-not-via-A • All routers compute paths to B1-not-via-A by removing router A from topology and running SPF • When router A fails, if C wants to reach B, it sends packets to address B1-not-via-A • Encapsulates the packets • 100% coverage • Can handle node and link failures • Still needs encapsulation
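The not-via computation can be sketched as an ordinary SPF run on a pruned topology. This is an illustration only: router D (behind A3), the extra links C-D and D-B, and all link costs are assumptions added so that a path around A exists.

```python
import heapq

INF = float("inf")

def shortest_path(graph, src, dst, removed=frozenset()):
    """Dijkstra that ignores every node in `removed`; returns a node list or None."""
    dist = {src: 0}
    prev = {}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, INF):
            continue  # stale heap entry
        for v, w in graph[u].items():
            if v in removed:
                continue  # pretend the removed router does not exist
            nd = d + w
            if nd < dist.get(v, INF):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if dst not in dist:
        return None
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]

# Hypothetical topology: A sits between B, C and D; C-D and D-B go around A.
graph = {
    "A": {"B": 1, "C": 1, "D": 1},
    "B": {"A": 1, "D": 2},
    "C": {"A": 1, "D": 2},
    "D": {"A": 1, "B": 2, "C": 2},
}

normal_path = shortest_path(graph, "C", "B")                    # path to B1
not_via_path = shortest_path(graph, "C", "B", removed={"A"})    # path to B1-not-via-A

print(normal_path)
print(not_via_path)
```

Every router would run the pruned SPF once per protected router, which is where the extra computation cost of not-via addresses comes from.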
Multi-topology protection • New approach • Have multiple subsets of the topology • IGP protocols already support multi-topology routing • Switch to a different topology when there is a failure • By modifying the header of the packet • Or even using an MPLS label • Allows for more flexible routing of traffic after a failure
Using MPLS • MPLS can conveniently direct traffic where I want • Ideal for setting up backup resources • Mostly backup paths • Can be used to repair both IP and MPLS failures (i.e., LSP failures) • LSP protection can be • Path • Local
Path protection • For each LSP (primary) have a backup LSP • It is already established (with RSVP) but it is not carrying any traffic • Primary and backup LSPs should be link and node disjoint • When there is a failure the source of the LSP will start sending traffic to the backup • Source needs to be notified of the failure • May take some time for the repair of the traffic • Can work in both 1:1 and 1:N modes
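One simple way to look for a link- and node-disjoint backup LSP is to remove the primary's links and interior nodes and search again. This is a greedy sketch on a hypothetical unweighted topology, not what any particular router does; it can fail to find a disjoint pair even when one exists (Suurballe's algorithm handles those cases).

```python
from collections import deque

def bfs_path(adj, src, dst, banned_nodes=(), banned_links=()):
    """Fewest-hops path from src to dst avoiding the banned nodes and links."""
    banned_nodes = set(banned_nodes)
    banned_links = {frozenset(link) for link in banned_links}
    prev = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            break
        for v in sorted(adj[u]):
            if v in prev or v in banned_nodes:
                continue
            if frozenset((u, v)) in banned_links:
                continue
            prev[v] = u
            queue.append(v)
    if dst not in prev:
        return None
    path, node = [], dst
    while node is not None:
        path.append(node)
        node = prev[node]
    return path[::-1]

def disjoint_backup(adj, primary):
    """Backup path sharing no link and no interior node with the primary."""
    interior = primary[1:-1]
    links = list(zip(primary, primary[1:]))
    return bfs_path(adj, primary[0], primary[-1], interior, links)

# Hypothetical topology: ingress I, egress E, two ways across.
adj = {
    "I": {"A", "B"},
    "A": {"I", "E"},
    "B": {"I", "C"},
    "C": {"B", "E"},
    "E": {"A", "C"},
}

primary = bfs_path(adj, "I", "E")
backup = disjoint_backup(adj, primary)
print(primary, backup)
```

The ingress and egress are shared by construction, which matches the observation that path protection cannot survive a failure of either end of the LSP.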
Local protection • When a link or node fails the node upstream from the failure repairs the traffic • Traffic is put into a backup LSP that does not go over the failed resource • Backup LSP merges with the primary LSP • Repairing router does not send a PATHerr upstream • Instead notify upstream nodes that it is repairing the failure • It is very fast • Can work in 1:1 and 1:N modes • Can be • Node • Bypass a failed node • Link • Bypass a failed link
Link local protection • The node upstream of the failed link initiates the protection • Point of local repair (PLR) • Backup LSP will merge back to the primary one • At the next-hop (NHop) of the PLR • Can work in 1:1 and 1:N modes • Usually a single backup LSP protects multiple primary LSPs • Else scalability is not good
Node local protection • When a node fails, assume its links have failed too • The node upstream of the failed node initiates the protection • Point of local repair (PLR) • Backup LSP will merge back to the primary one • At the next-next-hop (NNHop) of the PLR • What label does the NNHop use for the primary LSP? • Need RSVP’s help to find out • Will need multiple backup LSPs for each node • At least one for each NNHop • Can optionally configure more
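The difference between the two local-protection flavors is only where the backup merges and what it must avoid. A sketch, assuming a hypothetical five-router topology with primary LSP R1-R2-R3-R4 and R1 as the PLR: the link bypass goes to the NHop avoiding the protected link, the node bypass goes to the NNHop avoiding the NHop itself.

```python
from collections import deque

def bfs_path(adj, src, dst, avoid_nodes=(), avoid_link=None):
    """Fewest-hops path avoiding some nodes and, optionally, one link."""
    avoid_nodes = set(avoid_nodes)
    avoid = frozenset(avoid_link) if avoid_link else None
    prev = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            break
        for v in sorted(adj[u]):
            if v in prev or v in avoid_nodes:
                continue
            if avoid is not None and frozenset((u, v)) == avoid:
                continue
            prev[v] = u
            queue.append(v)
    if dst not in prev:
        return None
    path, node = [], dst
    while node is not None:
        path.append(node)
        node = prev[node]
    return path[::-1]

def link_bypass(adj, primary, plr):
    """Backup from the PLR to its NHop that does not use the PLR-NHop link."""
    nhop = primary[primary.index(plr) + 1]
    return bfs_path(adj, plr, nhop, avoid_link=(plr, nhop))

def node_bypass(adj, primary, plr):
    """Backup from the PLR to its NNHop that does not touch the NHop."""
    i = primary.index(plr)
    nhop, nnhop = primary[i + 1], primary[i + 2]
    return bfs_path(adj, plr, nnhop, avoid_nodes={nhop})

# Hypothetical topology: R5 provides the detour around R2.
adj = {
    "R1": {"R2", "R5"},
    "R2": {"R1", "R3", "R5"},
    "R3": {"R2", "R4", "R5"},
    "R4": {"R3"},
    "R5": {"R1", "R2", "R3"},
}
primary = ["R1", "R2", "R3", "R4"]

print(link_bypass(adj, primary, "R1"))   # merges at the NHop (R2)
print(node_bypass(adj, primary, "R1"))   # merges at the NNHop (R3)
```

A different primary LSP through R2 with another NNHop would need its own node bypass, which is why node protection needs at least one backup LSP per NNHop.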
Label stacking • Each time I send traffic into an LSP I push a label on the packets • Packets in the primary LSP already have a label • I create a label stack • Top label is popped by the router just before the merge point • A catch • At the merge point, packet arrives from an interface different than the expected one • Must have global (platform) label space
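A toy walk-through of the label operations during a link repair. All label values (17, 22, 31, 900) and interface names are hypothetical; the top of the stack is the last list element. The PLR first swaps the primary label to the one the merge point expects, then pushes the bypass label on top.

```python
def plr_reroute(stack, merge_label, bypass_label):
    """At the PLR: swap the top label for the primary LSP, then push the bypass label."""
    rerouted = stack[:-1] + [merge_label]   # normal swap for the primary LSP
    rerouted.append(bypass_label)           # push: enter the backup LSP
    return rerouted

def pop_top(stack):
    """At the router just before the merge point: pop the bypass label."""
    return stack[:-1]

packet = [17]                                    # as received by the PLR
in_bypass = plr_reroute(packet, merge_label=22, bypass_label=900)
at_merge = pop_top(in_bypass)                    # what the merge point sees

# The catch: the packet arrives on a different interface than usual.
arrival = ("from-bypass", at_merge[-1])

# Platform-wide label space: lookup by label only, the interface does not matter.
platform_table = {22: "swap to 31, forward toward egress"}
# Per-interface label space: label 22 is only known on the usual interface.
per_interface_table = {("from-PLR", 22): "swap to 31, forward toward egress"}

platform_hit = platform_table.get(arrival[1])
per_interface_hit = per_interface_table.get(arrival)

print(in_bypass, at_merge)
print(platform_hit, per_interface_hit)
```

With the per-interface table the repaired packet finds no entry and would be dropped, which is the reason for requiring a global (platform) label space.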
Need some RSVP support • If the LSP is protected do not send errors upstream/downstream when there is a failure • Instead notify upstream nodes that repair is in progress • During failure the PATH and RESV messages for the primary LSP must continue • Send them through the backup LSP • For node protection need to know the label the NNHop is using for the primary • Use the record label option for the LSP • All the labels used in all the hops are recorded in the RESV message
LSP protecting IP • Can use the above techniques to also protect IP traffic • If a link fails all the traffic that would go through the link is sent over the backup LSP • Similar for node failures • But in this case, do I know the NNHop for IP? • In general, if I have MPLS in my network all my traffic will be inside MPLS tunnels anyway
Observations • If node degree is d and I have N nodes then • I need at least O(Nd) tunnels for link protection • And at least O(Nd^2) for node protection • Of course I cannot protect from failures of the ingress or egress node • The assumption is that failures will be short lived • Traffic may be unbalanced during the failure • Links can get overloaded
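A quick count that reproduces the O(Nd) and O(Nd^2) figures, assuming one bypass tunnel per directed link for link protection and one tunnel per (PLR, protected neighbor, NNHop) triple for node protection. The ring and full-mesh generators are just convenient regular topologies for checking the arithmetic.

```python
def tunnel_counts(adj):
    """Return (link-protection tunnels, node-protection tunnels) for a topology."""
    # Each node is the PLR for each of its outgoing links: N * d in total.
    link_tunnels = sum(len(neighbors) for neighbors in adj.values())
    # For each PLR u and protected neighbor v, one tunnel per other neighbor of v.
    node_tunnels = sum(len(adj[v] - {u}) for u in adj for v in adj[u])
    return link_tunnels, node_tunnels

def ring(n):
    """Ring of n >= 3 nodes, degree d = 2."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}

def full_mesh(n):
    """Full mesh of n nodes, degree d = n - 1."""
    return {i: set(range(n)) - {i} for i in range(n)}

print(tunnel_counts(ring(6)))        # N=6, d=2: N*d and N*d*(d-1)
print(tunnel_counts(full_mesh(5)))   # N=5, d=4: N*d and N*d*(d-1)
```

For a regular topology the node-protection count is exactly N·d·(d-1), so it grows with the square of the degree while link protection grows linearly.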
The resource allocation problem • How do I set up the backup tunnels so that • I do not overload any link after a failure • I minimize the amount of extra bandwidth that will need to be reserved for the backups • It is a form of traffic engineering (TE) • We will see more on TE later on • Has been studied a lot • In optical and telephone networks • And recently in MPLS type networks • Solutions can be • On-line (as the requests arrive) • Off-line
Example • Kodialam, Lakshman, 2001 • Local link and node protection • Assume I know the b/w demands of all LSPs • Assume that only one link or node can fail at a time • Find a set of backup paths that minimizes the amount of bandwidth for both primary and backup LSPs • Backup LSPs can share bandwidth on some links • What do I know about the links? • How much bandwidth is used by each LSP • Complete but expensive to maintain • How much bandwidth is available • Almost zero information • How much bandwidth is used by backup LSPs • Little bit better than zero
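To see why backup LSPs can share bandwidth, here is a sketch of the bookkeeping under the single-failure assumption and with complete per-LSP information. It is not the algorithm from the paper, only the sharing principle: backups that protect different resources are never active together, so a link needs to reserve only the worst case over failures, not the sum. Link names and demands are made up.

```python
from collections import defaultdict

def backup_reservation(backups):
    """backups: list of (protected resource, demand, links used by the backup LSP).

    Returns (shared, dedicated): per-link backup bandwidth with and without sharing.
    """
    # load[link][failure] = bandwidth that lands on `link` if `failure` happens
    load = defaultdict(lambda: defaultdict(int))
    for failed, demand, links in backups:
        for link in links:
            load[link][failed] += demand
    # Only one failure at a time: reserve for the worst single failure.
    shared = {link: max(by_failure.values()) for link, by_failure in load.items()}
    # No sharing: every backup gets its own reservation.
    dedicated = {link: sum(by_failure.values()) for link, by_failure in load.items()}
    return shared, dedicated

# Hypothetical backups: two protect link A-B, one protects link D-B.
backups = [
    ("A-B", 10, ["A-C", "C-B"]),
    ("A-B", 4, ["A-C", "C-B"]),
    ("D-B", 6, ["D-C", "C-B"]),
]

shared, dedicated = backup_reservation(backups)
print(shared["C-B"], dedicated["C-B"])
```

On link C-B the worst single failure (A-B) activates 14 units, so 14 is enough instead of 20. With less information about the links (only available bandwidth, or only total backup bandwidth) this per-failure table cannot be built and the sharing has to be estimated more conservatively.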