Evaluating Potential Routing Diversity for Internet Failure Recovery *Chengchen Hu, +Kai Chen, +Yan Chen, *Bin Liu *Tsinghua University, +Northwestern University
Internet Failures • Failure is part of everyday life in IP networks • e.g., 675,000 excavation accidents in 2004 [Common Ground Alliance] • Network cable cuts every few days • … • Real-world emergencies or disasters can lead to substantial Internet disruption • Earthquakes • Storms • Terrorist incident: 9/11 event • …
Example: Taiwan earthquake incident • Large earthquakes hit south of Taiwan on 26 December 2006 • Only two of nine cross-sea cables were not affected • Abundant physical-level connectivity remained, but it took ISPs too long to find and use it • Figures cited from "Aftershocks from the Taiwan Earthquakes: Shaking up Internet transit in Asia", NANOG 42
How reliable is the Internet? • The Internet is not as reliable as people expect [Wu, CoNEXT’07] • 32% of ASes are vulnerable to a single critical customer-provider link cut • 93.7% of a Tier-1 ISP’s single-homed customers become unreachable from the formerly peered ISP after a Tier-1 depeering • Our question: can we find more resources to increase Internet reliability, especially during an Internet emergency?
Basic Idea • Two places where we can find more routing diversity: • Internet eXchange Points (IXPs) • Co-location facilities where multiple ASes exchange their traffic • Two participant ASes in the same IXP may not have a BGP session with each other • Internet valley-free routing policy • AS relationships: customer-provider, peering, sibling • Peering relaxation (PR): allow one AS to carry traffic from its peer to its provider • Mentioned in [Wu, CoNEXT’07], but without evaluation • Our main focus: • How much can we gain from these two potential resources, i.e., IXP and PR?
Dataset for Evaluation • Most complete AS topology graph • BGP data • Route Views, RIPE/RIS, Abilene, CERNET BGP View • P2P traceroute • Traceroute data from 992,000 IPs in over 3,700 ASes • In total, 120K AS links with AS relationships • http://aqualab.cs.northwestern.edu/projects/SidewalkEnds.html [Chen et al, CoNEXT’09] • IXP data • PCH + PeeringDB + Euro-IX (~200 IXPs) • 3,468 participant ASes
Failure Models • Tier-1 depeering • Real example: Cogent and Level 3 depeering • Tier-1 provider-customer link teardown • Reported in NANOG forum • Mixed types of link breakdown • 9/11 event, Taiwan earthquakes, 2003 Northeast blackout
Evaluation Metrics • Recovery Ratio • # of recovered <src-dst> AS pairs versus total # of affected <src-dst> AS pairs • Path Diversity • # of increased link-disjoint AS paths between affected <src-dst> AS pairs • Shifted Path • # of link-disjoint AS paths shifted onto a normal link after we use IXP or PR resources
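The first two metrics can be illustrated with a minimal sketch. It assumes the AS topology is given as a list of undirected AS links and, unlike the evaluation in the slides, it ignores routing policy entirely (no valley-free check). The function names `recovery_ratio` and `link_disjoint_paths` are hypothetical, chosen only for this illustration.

```python
from collections import defaultdict, deque


def recovery_ratio(affected_pairs, recovered_pairs):
    """Fraction of affected <src, dst> AS pairs that regain a path."""
    affected = set(affected_pairs)
    if not affected:
        return 0.0
    return len(affected & set(recovered_pairs)) / len(affected)


def link_disjoint_paths(links, src, dst):
    """Count link-disjoint paths between two ASes in an undirected graph.

    Assumption: routing policy is ignored. The count is a unit-capacity
    max-flow computed with BFS augmenting paths.
    """
    if src == dst:
        raise ValueError("src and dst must differ")
    cap = defaultdict(int)   # residual capacity per directed arc
    adj = defaultdict(set)   # neighbours per AS
    for a, b in links:
        cap[(a, b)] += 1
        cap[(b, a)] += 1
        adj[a].add(b)
        adj[b].add(a)
    flow = 0
    while True:
        # BFS for an augmenting path in the residual graph.
        parent = {src: None}
        queue = deque([src])
        while queue and dst not in parent:
            u = queue.popleft()
            for v in adj[u]:
                if v not in parent and cap[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if dst not in parent:
            return flow
        # Push one unit of flow back along the path that was found.
        v = dst
        while parent[v] is not None:
            u = parent[v]
            cap[(u, v)] -= 1
            cap[(v, u)] += 1
            v = u
        flow += 1
```

Under these assumptions, the path-diversity gain for an affected pair would be the difference between the count on the topology with the extra IXP or PR links and the count without them.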
Results: Tier-1 Depeering • 36 experiments for 9 Tier-1 ASes • Recovery ratio: most of the lost AS pairs can be recovered
Results: Tier-1 Depeering • Path diversity: multiple AS paths between lost AS pairs
Results: Tier-1 Depeering • Shifted path • On average, between 3.75 and 17.2 across the 36 experiments • Moderate traffic load shifted onto the unaffected links
Economic model • B pays A for recovery • Business model • Risk alliance (like airlines): price is determined beforehand • Pay by bandwidth and duration, or by bits (95th percentile) • [Figure: ASes A and B connected through peering, IXP, and provider-customer (P-C) links]
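The slide names 95th-percentile charging without giving details. The sketch below shows one common way such a charge could be computed, using the nearest-rank percentile over periodic rate samples. The sampling scheme, the function names, and the flat price per Mbps are all assumptions made for illustration and are not taken from the slides.

```python
def percentile_95(samples_mbps):
    """Nearest-rank 95th percentile of a list of rate samples (Mbps)."""
    if not samples_mbps:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_mbps)
    # ceil(0.95 * n) in integer arithmetic to avoid floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def recovery_charge(samples_mbps, price_per_mbps):
    """Charge B for recovery traffic carried by A (hypothetical pricing)."""
    return percentile_95(samples_mbps) * price_per_mbps
```

With this rule, short bursts in the top 5% of samples do not raise the bill, which is the usual motivation for percentile-based charging.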
Communication channel • Search for peers • Have direct connections to peers • Search for co-located ASes in the same IXP • ASes are connected by switches in modern IXPs • Messages are broadcast with the help of the switches • Message confidentiality with public-key cryptography
Automatic communication • Query message (failed AS) • Who is connected to specific destination ASes? • Reply message (surviving AS) • I can provide BW1 bandwidth to the destination AS • ACK (failed AS) • I would like to buy BW2 (<= BW1) • Set up BGP sessions • Withdraw BGP sessions
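The query, reply, and ACK exchange above can be sketched as plain data records. This is only an illustration: the field names, the rule that BW2 is the smaller of the demand and BW1, and the AS numbers in the test (taken from the private-use range) are assumptions, and the BGP session setup and withdrawal steps are not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    failed_as: int
    destinations: tuple   # destination ASes the failed AS can no longer reach


@dataclass(frozen=True)
class Reply:
    helper_as: int
    destination: int
    offered_bw: float     # BW1 in the slide


@dataclass(frozen=True)
class Ack:
    failed_as: int
    helper_as: int
    destination: int
    requested_bw: float   # BW2 in the slide, never more than BW1


def answer_query(query, helper_as, spare_bw):
    """Surviving AS: reply for each queried destination it has spare bandwidth to."""
    return [
        Reply(helper_as, dest, spare_bw[dest])
        for dest in query.destinations
        if spare_bw.get(dest, 0) > 0
    ]


def acknowledge(query, reply, demand_bw):
    """Failed AS: ask for BW2 = min(demand, BW1)."""
    bw2 = min(demand_bw, reply.offered_bw)
    return Ack(query.failed_as, reply.helper_as, reply.destination, bw2)
```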
Check available connectivity & bandwidth • Connectivity • traceroute • Available bandwidth • Maximum capacity is already known • Estimate the amount already in use • Y. Zhang, M. Roughan, N. Duffield, and A. Greenberg, “Fast Accurate Computation of Large-Scale IP Traffic Matrices from Link Loads,” ACM SIGMETRICS, 2003. • Subtract the used amount from the maximum capacity
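The subtraction step can be written out as a short sketch. It assumes that per-link capacity and an estimate of the load already in use are available as numbers; taking the minimum over the links of a path is an additional assumption made here for illustration, and both function names are hypothetical.

```python
def available_bandwidth(capacity_mbps, used_mbps):
    """Known maximum capacity minus estimated load, never below zero."""
    return max(0.0, capacity_mbps - used_mbps)


def path_available_bandwidth(links):
    """Bottleneck of a path given (capacity, estimated_used) per link."""
    return min(available_bandwidth(cap, used) for cap, used in links)
```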
Optimal selection of helper ISPs • From a single victim ISP perspective • Buy transit from a minimal number of ASes • Recover all the (prioritized) traffic • Least cost
Selection heuristic • Lost connectivity to {Di}, with bandwidth demand {Bi} • For each helper AS j, determine how much bandwidth it could provide to Di
Selection heuristic • Lost connectivity to {Di}, with bandwidth demand {Bi} • Score each (helper) AS j • Select the AS with the largest score (on a tie, select the one with the lowest price) • [Figure: example scores of four candidate helper ASes: 3, 2.3, 5, 2.1]
Selection heuristic: update • Lost connectivity to {Di}, with bandwidth demand {Bi} • Update {Di} by deleting the recovered AS • Update {Bi} by subtracting the recovered bandwidth
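Selection heuristic: rescore and select • Lost connectivity to {Di}, with bandwidth demand {Bi} • Rescore and select the next helper AS • [Figure: updated scores of the four candidate helper ASes: 0.3, 1, 0, 0.1]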
Summary • First work to evaluate the potential routing diversity via IXP and PR with the most complete AS topology graph • 40%-80% of affected <src, dst> AS pairs can be recovered via IXP and PR, with multiple paths and a moderate number of shifted paths • Points out a new avenue for Internet failure recovery • Possible and practical mechanisms to utilize potential routing diversity • We look forward to feedback and collaboration from IXPs/ISPs!
Thank you! Q&A
Results: Tier-1 provider-customer link teardown • Recovery ratio • Path diversity • 4.64 for 10 Tier-1 provider-customer link teardowns • 4.54 for 20 Tier-1 provider-customer link teardowns • Shifted path • The average number of shifted paths when 10, 20, and 30 links are damaged is 3.4, 4.0, and 4.2, respectively
Results: Mixed types of links breakdown • Taiwan earthquake, 9 big victim ASes • Recovery ratio
Results: Mixed types of links breakdown • Path diversity
Results: Mixed types of links breakdown • Shifted path
System framework • Adding an Emergency Recovery (ER) module in a router’s control plane • Setting up the communications between ER and the Intra-TE Resource Management modules.
Building communication channel • An example
Optimal selection of ISPs to help • From a global view • Min. shifted paths or tuned AS links • s.t. recover all the (prioritized) traffic we could • or • Max. recovery ratio • s.t. a limit on shifted paths or tuned AS links • From a single ISP • Min. cost for the ISP • s.t. recover all the (prioritized) traffic we could • or • Max. recovery ratio • s.t. a limit on cost for the ISP
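One way the single-ISP, minimum-cost variant could be written down is as a 0/1 program. The slides do not give a formulation, so the symbols below are assumptions: x_j = 1 if helper AS j is selected, c_j is the price charged by helper j, and b_{ij} is the bandwidth helper j can provide toward destination D_i. B_i is the bandwidth demand from the slides. The sketch also assumes a helper's offer is bought as a whole.

```latex
\begin{aligned}
\min_{x}\quad & \sum_{j} c_j\, x_j \\
\text{s.t.}\quad & \sum_{j} b_{ij}\, x_j \ge B_i \quad \text{for every lost destination } D_i, \\
& x_j \in \{0, 1\} \quad \text{for every candidate helper AS } j.
\end{aligned}
```

Problems of this covering type are hard to solve exactly in general, which is one possible reason to prefer a greedy heuristic in practice.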
Selection heuristic • Lost connectivity to {Di}, with bandwidth demand {Bi} • For each helper AS j, determine how much bandwidth it could provide to Di • Score each (helper) AS j • Select the helper AS with the largest score (on a tie, select the one with the lowest price) • Update {Di} by deleting the recovered AS • Update {Bi} by subtracting the recovered bandwidth • Rescore and select the next helper AS • Iterate until all are recovered
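The iteration above can be sketched as a greedy loop. The score formula did not survive in this transcript, so the code substitutes an assumed score: for helper j, the sum over remaining destinations of min(offer, demand) / demand, i.e. the fraction of each remaining demand the helper could cover. This stand-in, the function name `select_helpers`, and the dictionary layout are all hypothetical and should not be read as the authors' exact method.

```python
def select_helpers(demand, offers, price):
    """Greedy helper selection.

    demand: {destination: B_i}            bandwidth demand per lost destination
    offers: {helper: {destination: bw}}   bandwidth each helper could provide
    price:  {helper: cost}                used only to break score ties
    Returns (chosen helpers in order, demand still unrecovered).
    """
    remaining = {d: b for d, b in demand.items() if b > 0}
    candidates = set(offers)
    chosen = []

    def score(j):
        # ASSUMED score: total fraction of remaining demand helper j covers.
        return sum(min(offers[j].get(d, 0.0), b) / b for d, b in remaining.items())

    while remaining and candidates:
        # Largest score wins; on equal scores the lowest price wins.
        best = max(sorted(candidates), key=lambda j: (score(j), -price[j]))
        if score(best) <= 0:
            break  # no remaining helper can recover anything
        chosen.append(best)
        candidates.discard(best)
        # Update {Bi} by subtracting recovered bandwidth, and {Di} by
        # deleting destinations whose demand is fully recovered.
        for d in list(remaining):
            remaining[d] -= min(offers[best].get(d, 0.0), remaining[d])
            if remaining[d] <= 0:
                del remaining[d]
    return chosen, remaining
```

A greedy loop like this does not guarantee the minimum-cost set of helpers; it trades optimality for a simple procedure that can run quickly during an emergency.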