Reliability and Resilience in Data Communication Networks

Weidong Cui, EECS, UC Berkeley (wdc@eecs.berkeley.edu). CS294-3, Spring 2002

Outline • Overview • Resilience in Optical Layer • Resilience in IP Layer • Resilience in Application Layer • Resilience in Multilayer Networks • Case Study: Sprint Long Distance Network Reliability • Summary

Overview • Network Reliability • Networks should be able to detect faults and somehow repair themselves before end-users perceive any problem with their communications. • Technical Concerns • Robustness • Efficiency • Expedition • Management Concerns • Cost • Configurability • Interoperability

Terminology • Protection • Uses preassigned capacity to ensure survivability • Restoration • Reroutes the affected traffic after failure occurrence by using available capacity • Survivability • Property of a network to be resilient to failures • Proactive: protection • Reactive: restoration • Path-based vs. Link-based • Dedicated Backup vs. Backup Multiplexing • 1+1 protection vs. 1:1 protection vs. 1:N protection

Link-based vs. Path-based • Link-based • Shorter restoration time • Less efficient • Can only fix link failures • Path-based • Longer restoration time • More efficient. Mohan and Murthy, “Lightpath Restoration in WDM Optical Networks”, 2000.

Dedicated Backup vs. Backup Multiplexing • Dedicated backup • More robust • Less efficient • Backup multiplexing • Less robust • More efficient. Mohan and Murthy, “Lightpath Restoration in WDM Optical Networks”, 2000.
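
The efficiency gap between the two schemes can be illustrated with a small calculation. The sketch below is not taken from the cited paper: it assumes a single-link-failure model, and the function name, link labels, and bandwidth figures are made up for illustration. With dedicated backup, every connection reserves its own backup bandwidth. With backup multiplexing, connections whose primary paths cannot fail together share the reservation, so each link only needs enough backup bandwidth for the worst single failure.

```python
from collections import defaultdict

def backup_bandwidth(connections, share=True):
    """Return the backup bandwidth to reserve on each link.

    connections: list of (bandwidth, primary_links, backup_links).
    share=False models dedicated backup; share=True models backup
    multiplexing under an assumed single-link-failure model.
    """
    if not share:
        dedicated = defaultdict(float)
        for bw, _primary, backup in connections:
            for link in backup:
                dedicated[link] += bw
        return dict(dedicated)

    # need[b][f]: bandwidth that moves onto backup link b when link f fails
    need = defaultdict(lambda: defaultdict(float))
    for bw, primary, backup in connections:
        for b in backup:
            for f in primary:
                need[b][f] += bw
    # Only one link fails at a time, so reserve for the worst single failure.
    return {b: max(per_failure.values()) for b, per_failure in need.items()}

# Two connections with link-disjoint primaries whose backups both cross C-B.
conns = [
    (10.0, ["A-B"], ["A-C", "C-B"]),
    (10.0, ["D-E"], ["D-C", "C-B"]),
]
print(backup_bandwidth(conns, share=False))
print(backup_bandwidth(conns, share=True))
```

In this example link C-B needs 20 units with dedicated backup but only 10 with multiplexing. The saving disappears, and recovery can fail, if both primaries break at once, which is the robustness cost named on the slide.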

Resilience in Optical Layer • Why do we want resilience in the optical layer? • Ever-increasing bit rates make optical-layer failures a significant loss for network operators. • Cable cuts are very frequent? • Fast restoration time, invisible to end-users. • Easy to achieve “physical diversity”

Resilience in Optical Layer • Linear Systems • 1+1 protection • 1:1 protection • 1:N protection • Ring-based • UPSR: Uni-directional Path Switched Rings • BLSR: Bi-directional Line Switched Rings • Mesh-based • Optical mesh networks connected by optical crossconnects (OXCs) or optical add/drop multiplexers (OADMs) • Link-based/path-based protection/restoration • Hybrid Mesh Rings • Physical: mesh • Logical: ring

UPSR vs. BLSR • UPSR • Simple: automatically bridging all traffic counterclockwise at entry nodes • Not efficient • No communication between entry and exit nodes • Switching time is not affected by the number of nodes in the ring. • BLSR • Traffic is routed in both directions • More efficient • Use APS messaging • The requirement of 50ms switching time restricts the BLSR to 16 nodes. www.acterna.com, “The Fundamentals of SONET”.

Resilience in IP Layer • Why do we want resilience in the IP layer? • Survivability in the optical layer is not enough. • Less efficient and expensive • Node failures within a service layer can only be dealt with by the actions of peer-level network elements. • Some network operating contexts need IP restoration rather than physical-layer restoration. • No rigid distinction between working and spare capacity (More Efficient!) • Spare capacity can be used by best-effort traffic during normal operation.

Virtual Protection Cycles: p-cycle (I) • Protection for link failures (on-cycle and straddling failures) • Protection for node failures (encircling p-cycles) Stamatelakis and Grover, “IP Layer Restoration and Network Planning Based on Virtual Protection Cycles”.

Virtual Protection Cycles: p-cycle (II) • p-cycle can be implemented using IP tunneling or label switching. • A few more routing entries for p-cycle routing in routing tables • Encapsulating original IP packets • Decapsulating p-cycle packets • When the original route cost is larger than the local route cost. • Formulation • Objective: minimize worst-case restoration-induced oversubscription • Mixed integer programming • Performance • Max oversubscription is close to 1.0 • Restoration time is dependent on failure detection time. Stamatelakis and Grover, “IP Layer Restoration and Network Planning Based on Virtual Protection Cycles”.
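
The difference between on-cycle and straddling link failures can be sketched in a few lines. This is an illustrative model, not the authors' implementation: the cycle is an ordered list of node names, the function name is hypothetical, and a failed link between two adjacent cycle nodes is assumed to be the cycle's own edge. An on-cycle failure leaves one restoration path (the rest of the cycle); a straddling failure leaves two (both arcs of the cycle).

```python
def pcycle_restoration_paths(cycle, failed_link):
    """Return the restoration paths a p-cycle offers for a failed link.

    cycle: nodes in ring order; the edge from the last node back to
           the first is implied.
    failed_link: (u, v) pair of node names.
    """
    u, v = failed_link
    if u == v or u not in cycle or v not in cycle:
        return []  # this cycle cannot protect the link

    n = len(cycle)
    i = cycle.index(u)
    rotated = cycle[i:] + cycle[:i]      # ring order starting at u
    j = rotated.index(v)
    forward = rotated[: j + 1]           # u -> v in ring order
    backward = [u] + rotated[j:][::-1]   # u -> v in the opposite direction

    if j == 1:          # on-cycle: the forward arc is the failed edge itself
        return [backward]
    if j == n - 1:      # on-cycle: the backward arc is the failed edge itself
        return [forward]
    return [forward, backward]           # straddling: both arcs survive

cycle = ["A", "B", "C", "D", "E"]
print(pcycle_restoration_paths(cycle, ("A", "B")))  # on-cycle failure
print(pcycle_restoration_paths(cycle, ("A", "C")))  # straddling failure
```

In a tunneling-based realization, a node adjacent to the failure would encapsulate affected packets and send them along one of the returned paths, and the node at the far end of the failed link would decapsulate them.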

Restoration in Different Information Scenarios (I) • No information scenario • Residual bandwidth on each link • Complete information scenario • Full routing information of all the connections currently in progress • Partial information scenario • Residual bandwidth on each link • Total bandwidth used by active (primary) paths on each link • Total bandwidth used by backup paths on each link Kodialam and Lakshman, “Dynamic Routing of Bandwidth Guaranteed Tunnels with Restoration”, Infocom’00. Kodialam and Lakshman, “Dynamic Routing of Locally Restorable Bandwidth Guaranteed Tunnels using Aggregated Link Usage Information”, Infocom’01.

Restoration in Different Information Scenarios (II) • Objective: optimize active and backup path (or bypass path, if local restorability is desired) routing to accommodate more path requests. • Formulate as integer programming problems • Heuristics to solve the sharing problem with partial information • Performance • Efficiency of restorable routing under the partial information scenario is close to the performance under the complete information scenario. Kodialam and Lakshman, “Dynamic Routing of Bandwidth Guaranteed Tunnels with Restoration”, Infocom’00. Kodialam and Lakshman, “Dynamic Routing of Locally Restorable Bandwidth Guaranteed Tunnels using Aggregated Link Usage Information”, Infocom’01.
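
To see why the three per-link aggregates of the partial information scenario are enough to share backup bandwidth, consider a conservative estimate of the extra reservation a new request needs on a candidate backup link. The sketch below is written in the spirit of the cited work, not copied from it: the function and parameter names are hypothetical, and a single-link-failure model is assumed. Without per-connection routing, the worst case is that all active traffic on the most loaded primary link is backed up over the same candidate link.

```python
import math

def extra_backup_bandwidth(demand, primary_active_loads, backup_reserved, residual):
    """Estimate extra backup bandwidth needed on one candidate backup link.

    demand: bandwidth of the new request.
    primary_active_loads: active (primary) bandwidth already carried on
        each link of the request's primary path.
    backup_reserved: backup bandwidth already reserved on the candidate link.
    residual: unreserved bandwidth left on the candidate link.
    Returns math.inf when the link cannot carry the backup.
    """
    # Worst case: everything on the busiest primary link, plus the new
    # demand, is rerouted onto this candidate link after a failure.
    worst_case = max(primary_active_loads) + demand
    # Existing backup reservations can be shared, so only the excess is new.
    extra = max(0.0, worst_case - backup_reserved)
    return extra if extra <= residual else math.inf

print(extra_backup_bandwidth(5, [20, 30], backup_reserved=40, residual=10))
print(extra_backup_bandwidth(5, [20, 30], backup_reserved=25, residual=15))
print(extra_backup_bandwidth(5, [20, 30], backup_reserved=25, residual=5))
```

A value like this can serve as the link cost when searching for a backup path; links that return infinity are excluded.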

Resilience in Application Layer • Why do we want resilience in the application layer? • Survivability in the IP layer is not widely deployed. • BGP’s fault recovery mechanisms may take many minutes before routes converge to a consistent form. • Application layer restoration has more knowledge of application requirements. • Application service providers can do restoration only in the application layer.

Resilient Overlay Networks (RONs) (I) • Goal: • Failure detection and recovery in less than 20 seconds • Tighter integration of routing and path selection with the application • Expressive policy routing • Basic Idea: • Detect problems by aggressively probing and monitoring the paths connecting the nodes • RON nodes exchange information about the quality of the paths among themselves via a routing protocol and building forwarding tables based on a variety of path metrics (latency/loss rate/throughput) • Route packets over the RON rather than the underlying Internet path if the latter is not the best one. D. Andersen, et al., “Resilient Overlay Networks”, SOSP’01.
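
The routing decision in the last bullet can be sketched as follows. This is a simplified illustration, not RON's actual code: it uses latency as the only metric, considers at most one intermediate overlay node (an assumption of this sketch), and treats a missing table entry as a path that probing found to be down. The function name and node labels are hypothetical.

```python
def best_path(src, dst, latency):
    """Pick the direct Internet path or a one-hop overlay detour.

    latency: dict mapping (a, b) to probed latency in ms; a missing
             entry means the path is currently considered down.
    Returns (path, total_latency).
    """
    inf = float("inf")
    best = ([src, dst], latency.get((src, dst), inf))
    nodes = {a for a, _ in latency} | {b for _, b in latency}
    for hop in nodes - {src, dst}:
        cost = latency.get((src, hop), inf) + latency.get((hop, dst), inf)
        if cost < best[1]:
            best = ([src, hop, dst], cost)
    return best

# Probed latencies: the direct A -> B path is slow, the detour via C is not.
probes = {("A", "B"): 200, ("A", "C"): 40, ("C", "B"): 50}
print(best_path("A", "B", probes))
```

Swapping the latency table for loss rate or throughput estimates gives the other path metrics mentioned on the slide.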

Resilient Overlay Networks (RONs) (II) • Performance • Average fault detection and recovery time is 18 seconds • 100% in RON1 and 60% in RON2 of the hundred significant observed outages are overcome by RON. • Limitations • Not scalable, RON size is 2 ~ 50. D. Andersen, et al., “Resilient Overlay Networks”, SOSP’01.

Fault Detection and Recovery for Wide-Area Service Composition • Goals • Availability: detect and recover from failures quickly • Performance: choose set of service instances • Scalability: Internet-scale operation • Design • Leverage an overlay network of service clusters • Link-state propagation • Need full network topology information • Quick propagation of failure information • Link-state flooding is acceptable • Evaluation • Good recovery time for real-time applications: O(3 seconds) • Good scalability: minimal additional provisioning for cluster managers B. Raman, “Wide-Area Service Composition: Availability, Performance and Scalability”, 2002.

Overlay Restoration based on Correlated Link Failures (I) • Motivation • Overlay link failures are correlated! • Goal • Robustness: reserve two paths with minimum joint path failure probability based on correlated overlay link failures • Efficiency: leverage backup bandwidth sharing W. Cui et al., “Backup Path Allocation based on a Link Failure Probability Model in Overlay Networks”, 2002.

Overlay Restoration based on Correlated Link Failures (II) • Backup Path Routing • Optimal Backup Path Routing (OPR) • Integer quadratic programming • Failure Probability Cost Backup Path Routing (FPR) • Decouple primary path routing and backup path routing • Primary path: shortest path based on default metric • Backup path: shortest path based on failure probability cost • Secondary Shortest Backup Path Routing (SSR) • Same idea as FPR, but • Backup path: (secondary) shortest path link-disjoint from the primary path • Backup Path Bandwidth Allocation • No backup bandwidth sharing • Backup bandwidth sharing for single-link-failure recovery • Backup bandwidth sharing for double-link-failure recovery W. Cui et al., “Backup Path Allocation based on a Link Failure Probability Model in Overlay Networks”, 2002.
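
The decoupled routing used by SSR can be sketched as two shortest-path searches. This is a minimal illustration, not the paper's code: the graph is an undirected adjacency dict, and all names are hypothetical. The primary path is the shortest path on the default metric; the backup path is the shortest path after the primary's links are removed. FPR would keep the same two-step structure but use a failure probability cost as the weight in the second search; that cost is not defined on the slides, so it is not modeled here.

```python
import heapq

def shortest_path(graph, src, dst, banned=frozenset()):
    """Dijkstra over graph[u][v] = weight, skipping links in `banned`."""
    inf = float("inf")
    dist, prev = {src: 0.0}, {}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, inf):
            continue  # stale heap entry
        for v, w in graph.get(u, {}).items():
            if frozenset((u, v)) in banned:
                continue
            nd = d + w
            if nd < dist.get(v, inf):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    if dst not in dist:
        return None
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]

def primary_and_backup(graph, src, dst):
    """Return (primary, link-disjoint backup); either may be None."""
    primary = shortest_path(graph, src, dst)
    if primary is None:
        return None, None
    used = {frozenset(link) for link in zip(primary, primary[1:])}
    return primary, shortest_path(graph, src, dst, banned=used)

overlay = {
    "A": {"B": 1, "C": 2},
    "B": {"A": 1, "D": 1},
    "C": {"A": 2, "D": 2},
    "D": {"B": 1, "C": 2},
}
print(primary_and_backup(overlay, "A", "D"))
```

Routing the two paths separately is simpler than the joint optimization in OPR, at the price that the first choice can block a better pair of paths.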

Overlay Restoration based on Correlated Link Failures (III) • Evaluation • Metric • Fatal path failure probability (robustness) • Number of path requests admitted (efficiency) • Main conclusions • FPR is 15% ~ 25% better than SSR and close to OPR on robustness • The overlay network can admit 100% more path requests by using backup bandwidth sharing than without backup bandwidth sharing. • FPR is tolerant to inaccurate overlay link failure estimates. W. Cui et al., “Backup Path Allocation based on a Link Failure Probability Model in Overlay Networks”, 2002.

Resilience in Multilayer Networks • Why do we want resilience in multilayer networks? • Avoid contention between different single-layer recovery schemes. • Promote cooperation and sharing of spare capacity

PANEL: Protection Across Network Layers (I) P. Demeester et al., “Resilience in Multilayer Networks”, 1999.

PANEL Guidelines • Recovery in the highest layer is recommended when: • Multiple reliability grades need to be provided with fine granularity • Recovery interworking cannot be implemented • Survivability schemes in the highest layer are more mature than in the lowest layer • Recovery in the lowest layer is recommended when: • The number of entities to recover has to be limited/reduced • The lowest layer supports multiple client layers and it is appropriate to provide survivability to all services in a homogeneous way • Survivability schemes in the lowest layer are more mature than in the highest layer • It is difficult to ensure the physical diversity of working and backup paths in the higher layer • Using unprotected or preemptible server (lower) paths to carry the client (upper) layer spare capacity is recommended to alleviate redundant protection and remain cost-effective. P. Demeester et al., “Resilience in Multilayer Networks”, 1999.

Protection/Restoration in IP over WDM Networks (I) • Goal • Deliver services reliably among border LSRs (Label Switched Routers) • IP Layer Protection • One physical cut can expand to tens of thousands of simultaneous logical link failures at the IP layer. • WDM Layer Protection • Very low network utilization Y. Ye et al., “A Simple Dynamic Integrated Provisioning/Protection Scheme in IP over WDM Networks”, 2001.

Protection/Restoration in IP over WDM Networks (II) • Dynamic Integrated Approach • Periodically (globally) optimize the network by using offline computation, and then use online dynamic path selection to fine-tune between offline calculations • If the source LSR cannot locate a link-disjoint backup path from existing lightpaths, or finds one that has no available bandwidth, the LSR will • Run the path selection algorithm with constraints • If that fails, request a new lightpath from the WDM layer and check whether it can locate a link-disjoint backup path • If not, drop the flow request and release all the reserved resources. Y. Ye et al., “A Simple Dynamic Integrated Provisioning/Protection Scheme in IP over WDM Networks”, 2001.
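
The fallback sequence at the source LSR can be written as a short control-flow sketch. Everything here is hypothetical scaffolding rather than the scheme's real interface: the four callables stand in for the IP-layer path search, the constrained path selection algorithm, the lightpath request to the WDM layer, and the resource release step.

```python
def provision_backup(find_backup, find_backup_constrained,
                     request_lightpath, release_resources):
    """Try each fallback in order; return a backup path or None.

    find_backup():             link-disjoint path over existing lightpaths,
                               or None if there is none with bandwidth.
    find_backup_constrained(): constrained path selection, or None.
    request_lightpath():       True if the WDM layer set up a new lightpath.
    release_resources():       frees whatever the flow had reserved.
    """
    path = find_backup()
    if path is None:
        path = find_backup_constrained()
    if path is None and request_lightpath():
        path = find_backup()          # re-check with the new lightpath
    if path is None:
        release_resources()           # drop the flow request
    return path

# Toy scenario: a backup path exists only once an extra lightpath is set up.
state = {"extra_lightpath": False, "released": False}

def find_backup():
    return ["LSR1", "LSR3", "LSR2"] if state["extra_lightpath"] else None

def request_lightpath():
    state["extra_lightpath"] = True
    return True

def release_resources():
    state["released"] = True

result = provision_backup(find_backup, lambda: None,
                          request_lightpath, release_resources)
print(result, state)
```

The ordering reflects the slide: cheaper IP-layer options are tried before new capacity is requested from the WDM layer.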

Case Study: Sprint Long Distance Network Reliability • Network reliability factors • Transport architecture: SONET 4F BLSR • Redundant equipment • Internal redundancy of ADMs • 2 independent WDM systems in SONET 4F BLSR • Redundant IP routers • Conservative synchronization • Primary reference sources recover clocking from GPS and Loran-C receivers. • Protected power M.L. Jones et al., “Sprint Long Distance Network Survivability: Today and Tomorrow”, 1999.

Summary • Protection/Restoration in the physical layer has a shorter fault switching time (~50ms) but worse network utilization. • Protection/Restoration in the IP layer or application layer may take from several seconds to several minutes but has higher network utilization. • Protection/Restoration in one layer cannot be completely replaced by protection/restoration in another layer. • Integrated protection/restoration across multiple layers needs extensive study.
