A Study of Multiple IP Link Failure Fang Yu fyu@eecs.berkeley.edu
Motivation
• Baltimore Tunnel Fire, 18 July 2001
• “… Keynote Systems … says the July 19 Internet slowdown was not caused by the spreading of Code Red. Rather, a train wreck in a Baltimore tunnel that knocked out a major UUNet cable caused it.”
• “The fire severed two OC-192 links between Vienna, VA and New York, NY as well as an OC-48 link from D.C. to Chicago. … Metromedia routed traffic around the fiber break, relying heavily on switching centers in Chicago, Dallas, and D.C.”
• “Traffic slowdowns were also seen in Seattle, Los Angeles and Atlanta, possibly resulting from re-routing around the affected backbones.”
• “The accident caused certain connections 10 times slower than normal, such as the ones between Washington, D.C., and San Diego.”
R. Katz, “CS294-3: Distributed Service Architectures in Converged Networks”
Why Multiple IP Link Failure?
• Multiple link failure does occur!
[Figure: IP layer and transport layer; from J. Strand, “Optical Network”, with modification]
Methodology
• Use the SSFNet network simulator with the OSPF v2 extension (RFC 2328)
• Use a 24-node, 54-link IP network from SSFNet
• Evaluation metrics
  • Convergence time
  • Number of route changes
  • Loops:
    • record the duration of each loop
    • sum up as total loop time
  • Invalid routes:
    • record the duration of each route containing a failed link
    • sum up as total invalid route time
  • Unreachable routes:
    • record the total number of routes that fail due to network partitioning
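The loop and invalid-route metrics above can be sketched as follows. This is a minimal illustration, not the simulator's actual instrumentation: the snapshot format (a list of `(time, forwarding_tables)` pairs), the function names `trace_route` and `pathology_time`, and the treatment of a missing next hop as "unreachable" are all assumptions made for the example.

```python
from typing import Dict, Iterable, List, Set, Tuple

Link = Tuple[str, str]
# tables[node][dest] -> neighbor that `node` forwards to for `dest` (assumed format)
Tables = Dict[str, Dict[str, str]]


def norm(a: str, b: str) -> Link:
    """Return an undirected link as an ordered tuple."""
    return (a, b) if a <= b else (b, a)


def trace_route(tables: Tables, src: str, dst: str, failed: Set[Link]) -> str:
    """Follow next hops from src to dst and classify the route."""
    visited = {src}
    node = src
    while node != dst:
        nxt = tables.get(node, {}).get(dst)
        if nxt is None:
            return "unreachable"   # no forwarding entry for dst
        if norm(node, nxt) in failed:
            return "invalid"       # route still uses a failed link
        if nxt in visited:
            return "loop"          # route revisits a node
        visited.add(nxt)
        node = nxt
    return "ok"


def pathology_time(
    snapshots: List[Tuple[float, Tables]],
    pairs: Iterable[Tuple[str, str]],
    failed: Set[Link],
    end_time: float,
) -> Dict[str, float]:
    """Sum, over all routes, how long each route was a loop or invalid.

    Each snapshot is assumed valid from its own time until the next
    snapshot's time (or end_time for the last one).
    """
    pairs = list(pairs)
    totals = {"loop": 0.0, "invalid": 0.0}
    next_times = [t for t, _ in snapshots[1:]] + [end_time]
    for (t, tables), t_next in zip(snapshots, next_times):
        for src, dst in pairs:
            state = trace_route(tables, src, dst, failed)
            if state in totals:
                totals[state] += t_next - t
    return totals
```

With per-route classification in hand, counting routes in the "unreachable" state after convergence would give the third metric under the same assumptions.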
A Case Study of Two-Link Failure
• Two links fail simultaneously from time 50 s to 150 s.
• OSPF detects the first link failure at 81.03 s and converges at 86.04 s.
• OSPF detects the first link coming back up at 156.08 s and converges at 161.1 s.
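The timestamps in this case study imply a detection delay and a convergence duration for each event. A small sketch of that arithmetic, using only the times stated above (the dictionary keys and the `intervals` helper are names introduced for this example):

```python
# Times in seconds, as reported for the two-link failure case.
EVENTS = {
    "links_down": 50.0,
    "down_detected": 81.03,
    "down_converged": 86.04,
    "links_up": 150.0,
    "up_detected": 156.08,
    "up_converged": 161.1,
}


def intervals(ev):
    """Derive detection delays and convergence durations (seconds)."""
    return {
        # time from the failure until OSPF first notices it
        "down_detection_delay": round(ev["down_detected"] - ev["links_down"], 2),
        # time from first detection until routing has converged
        "down_convergence": round(ev["down_converged"] - ev["down_detected"], 2),
        "up_detection_delay": round(ev["up_detected"] - ev["links_up"], 2),
        "up_convergence": round(ev["up_converged"] - ev["up_detected"], 2),
    }


for name, seconds in intervals(EVENTS).items():
    print(f"{name}: {seconds} s")
```

Under these numbers, convergence takes about 5 s in both directions, while detecting a link going down takes much longer than detecting it coming back up.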
LSA Update Messages and Route Pathology Caused at Link Down Time
OSPF Convergence Time and Number of Route Changes
• There are dramatic differences between failure cases
• 3-link failure converges more slowly than node failure, although a node failure brings down an average of 4.5 links!
• Neighboring nodes have somewhat synchronized clocks and detect multiple link failures at almost the same time
• In 55% of the cases, node failure generates fewer route changes than multiple link failure
Route Pathology
• Node failure causes many routing loops
• Multiple link failure causes more invalid routes
Summary of Observations
• Multiple link failure is more problematic than node failure
• Each node failure will bring down
  • An average of 4.5 links
  • All the connections originating from it or destined to it
• 2-link and 3-link failures
  • bring down on average about 1/2 or 2/3 as many links as a node failure
  • in 55% of the cases, create more route changes and 10 times more invalid routes during OSPF re-route time
• Different combinations of multiple link failures have dramatically different impacts on OSPF
• Reason: a multiple link failure does not bring down nodes, so OSPF has to re-route a larger number of connections than after a node failure
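The 4.5-link figure is consistent with the average node degree of the 24-node, 54-link topology used in the study, since each link touches two nodes. The sketch below works through that arithmetic and compares a k-link failure against an average node failure; reading 4.5 as the average degree is an assumption, and the function names are introduced here for illustration.

```python
NODES = 24  # nodes in the simulated topology
LINKS = 54  # links in the simulated topology


def avg_links_per_node(nodes: int, links: int) -> float:
    """Average node degree: every link is attached to two nodes."""
    return 2 * links / nodes


def relative_link_loss(k: int, nodes: int = NODES, links: int = LINKS) -> float:
    """Links lost in a k-link failure relative to an average node failure."""
    return k / avg_links_per_node(nodes, links)


print(avg_links_per_node(NODES, LINKS))   # average links lost per node failure
print(round(relative_link_loss(2), 2))    # 2-link failure vs. node failure
print(round(relative_link_loss(3), 2))    # 3-link failure vs. node failure
```

So a 2-link failure removes a bit under half as many links as an average node failure, and a 3-link failure removes two thirds as many.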
Future Work
• Study a real IP network
  • E.g. AT&T Common IP Backbone
• Study correlated IP link failures based on the optical topology
  • IP link failures are correlated, not random
  • A single fiber cut can cause multiple IP link failures
• Propose a multi-layer routing scheme to effectively deploy the IP layer on the optical network
  • Avoid severe multiple IP link failure scenarios
  • Minimize the re-route duration and route pathology under failure
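One way to derive correlated IP link failures from an optical topology is to record which fiber segments each IP link rides on, then fail every IP link that shares a cut fiber. The sketch below illustrates that idea only; it is not a method from the slides, and the mapping format, link labels, and fiber names are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def links_by_fiber(ip_link_fibers: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Group IP links by the fiber segments they traverse."""
    groups: Dict[str, Set[str]] = defaultdict(set)
    for ip_link, fibers in ip_link_fibers.items():
        for fiber in fibers:
            groups[fiber].add(ip_link)
    return dict(groups)


def failed_ip_links(
    ip_link_fibers: Dict[str, List[str]], cut_fibers: Iterable[str]
) -> Set[str]:
    """Return every IP link that uses at least one cut fiber segment."""
    cut = set(cut_fibers)
    return {
        ip_link
        for ip_link, fibers in ip_link_fibers.items()
        if cut & set(fibers)
    }


# Hypothetical mapping: three IP links, two of which share fiber "f1".
mapping = {
    "ip-link-1": ["f1", "f2"],
    "ip-link-2": ["f1"],
    "ip-link-3": ["f3"],
}
print(sorted(failed_ip_links(mapping, ["f1"])))
```

The resulting failure sets could then replace randomly chosen link combinations as simulation input.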