Self-healing in Routing: Failure Analysis, and Improvements Qi Li Tsinghua University Aug. 28, 2008
Outline • Problem Statement • Analysis of Self-Healing Routing • Existing Improvement Solutions • Our Self-Healing Solution • Conclusion and Future work AsiaFI, Student Workshop
Problem Statement • Routing (intra- and inter-domain) is a critical element of the Internet infrastructure • How robust is it against large-scale failures and attacks? • Cisco routers caused a major outage in Japan in 2007 • An earthquake in Taiwan damaged undersea cables in 2006 • We need to improve it, but how?
Internet Routing • Not a homogeneous network • A network of autonomous systems (ASes) • Each AS is under the control of an ISP. • Large variation in AS sizes – typical heavy tail. • Inter-AS routing • Border Gateway Protocol (BGP), a path-vector algorithm. • Serious scalability/recovery issues. • Intra-AS routing • Several algorithms; they usually work fine • Central control, smaller network, …
Measurements – Prefix Growth • Table sizes grow 2x faster than real growth • One (conservative) analysis predicts 2M entries in 10 years
Measurements – BGP Updates
Distribution of Updates – Main Observation • Most of the network is very stable • Parts of the network are very unstable • Everybody pays for the instability • The problem is getting worse
Routing Failure Causes • Large-area router/link damage (e.g., earthquake) • Large-scale failure due to a buggy software update • High-bandwidth cable cuts • Router configuration errors • Aggregation of large un-owned IP blocks • Happens when prefixes are aggregated for efficiency • Incorrect policy settings resulting in large-scale delivery failures • Network-wide congestion (DoS attack) • Malicious route advertisements via worms
Outline • Problem Statement • Analysis of Self-Healing Routing • Existing Improvement Solutions • Our Self-Healing Solution • Conclusion and Future Work
Existing Routing Protocols • Normal process of IP-based self-healing routing • Failure Detection • Failure Notification • Forwarding Path Re-computation • Existing routing protocols … • RIP: hundreds of seconds; suffers from count-to-infinity • OSPF: tens of seconds • BGP: several minutes or longer; may fail to converge because of policy conflicts.
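The three steps above can be sketched for a link-state protocol such as OSPF. The Python below is a minimal illustration, not any router's implementation: the four-node topology, the link costs, and the helper names `shortest_paths` and `without_link` are assumptions made for this example, and detection and notification are reduced to removing the failed link from the node's copy of the topology before re-running Dijkstra.

```python
import heapq

def shortest_paths(graph, source):
    """Dijkstra from source. Returns (cost, next_hop) dictionaries."""
    cost = {source: 0}
    next_hop = {}
    heap = [(0, source)]
    while heap:
        c, node = heapq.heappop(heap)
        if c > cost[node]:
            continue  # stale heap entry, a cheaper path was already found
        for nbr, weight in graph[node].items():
            new_cost = c + weight
            if new_cost < cost.get(nbr, float("inf")):
                cost[nbr] = new_cost
                # The first hop is inherited from the node we came through.
                next_hop[nbr] = nbr if node == source else next_hop[node]
                heapq.heappush(heap, (new_cost, nbr))
    return cost, next_hop

def without_link(graph, a, b):
    """Copy of the topology with the (bidirectional) link a-b removed."""
    return {u: {v: w for v, w in nbrs.items() if {u, v} != {a, b}}
            for u, nbrs in graph.items()}

# Hypothetical topology: A-B-D is the cheap path, A-C-D the expensive one.
TOPOLOGY = {
    "A": {"B": 1, "C": 2},
    "B": {"A": 1, "D": 1},
    "C": {"A": 2, "D": 2},
    "D": {"B": 1, "C": 2},
}

cost_before, hops_before = shortest_paths(TOPOLOGY, "A")

# Detection + notification: A learns that link A-B is down.
failed_topology = without_link(TOPOLOGY, "A", "B")

# Re-computation: A rebuilds its forwarding entries.
cost_after, hops_after = shortest_paths(failed_topology, "A")

print("before:", hops_before["D"], cost_before["D"])
print("after: ", hops_after["D"], cost_after["D"])
```

Until every router has finished the last step, forwarding tables across the network can disagree, which is where the transient problems on the following slides come from.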
The State Transition under Failure • A simple state-transition model for analyzing routing convergence.
The Problems of Transient Failures • Routing Blackhole • Traffic is silently dropped without informing the source that the data did not reach its intended recipient. • Routing Loop • The path to a particular destination forms a loop.
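Both transient problems can be seen by following next hops through per-router forwarding tables. The sketch below is illustrative only: the `trace` function, the table layout (router → destination → next hop), and the three example tables are assumptions for this example, not part of the presented solution.

```python
def trace(fibs, source, dest):
    """Follow next hops from source toward dest.

    Returns "delivered", "blackhole" (a router has no entry for dest),
    or "loop" (a router is visited twice).
    """
    visited = set()
    node = source
    while node != dest:
        if node in visited:
            return "loop"
        visited.add(node)
        entries = fibs.get(node, {})
        if dest not in entries:
            return "blackhole"
        node = entries[dest]
    return "delivered"

# Converged state: A forwards to B, B forwards to D.
converged = {"A": {"D": "B"}, "B": {"D": "D"}}

# Link B-D failed and B withdrew its route, but A has not heard yet.
blackhole = {"A": {"D": "B"}, "B": {}}

# B already switched to the detour via A, but A still points at B.
loop = {"A": {"D": "B"}, "B": {"D": "A"}}

for name, fibs in [("converged", converged), ("blackhole", blackhole), ("loop", loop)]:
    print(name, "->", trace(fibs, "A", "D"))
```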
Outline • Problem Statement • Analysis of Self-Healing Routing • Existing Improvement Solutions • Our Self-Healing Solution • Conclusion and Future Work
Traditional Fast Reroute Solutions • The major improvement in intra-domain routing is fast reroute (FRR). • SONET rings significantly reduce recovery time, but they are expensive. • FRR with MPLS-TE is hard to deploy because it adds considerable complexity to the core network. • IP-FRR, developed by the IETF, still has shortcomings; e.g., LFA needs a neighbor whose shortest path to the destination does not traverse the failed node. • Layer 3 tunnels provide pre-computed path protection, but may not eliminate the routing loops introduced by tunneling.
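The LFA shortcoming can be made concrete with the basic loop-free condition from RFC 5286: a neighbor N of router S is a loop-free alternate for destination D if dist(N, D) < dist(N, S) + dist(S, D), i.e. N's own shortest path to D does not come back through S. The Python below is a minimal sketch under that condition only (it does not check node protection); the topologies, costs, and function names are assumptions for the example.

```python
from itertools import product

def build(edges):
    """Undirected weighted graph from (a, b, cost) triples."""
    graph = {}
    for a, b, w in edges:
        graph.setdefault(a, {})[b] = w
        graph.setdefault(b, {})[a] = w
    return graph

def all_pairs(graph):
    """Floyd-Warshall shortest-path distances, keyed by (u, v)."""
    nodes = list(graph)
    inf = float("inf")
    d = {(u, v): 0 if u == v else graph[u].get(v, inf)
         for u, v in product(nodes, nodes)}
    for k, i, j in product(nodes, nodes, nodes):  # k varies slowest
        d[i, j] = min(d[i, j], d[i, k] + d[k, j])
    return d

def loop_free_alternates(graph, s, primary, dest):
    """Neighbors of s (other than the primary next hop) satisfying
    dist(N, D) < dist(N, S) + dist(S, D)."""
    d = all_pairs(graph)
    return [n for n in graph[s]
            if n != primary and d[n, dest] < d[n, s] + d[s, dest]]

# S reaches D via E (cost 2). N is S's only other neighbor.
covered = build([("S", "E", 1), ("E", "D", 1), ("S", "N", 1), ("N", "D", 2)])
uncovered = build([("S", "E", 1), ("E", "D", 1), ("S", "N", 1), ("N", "D", 3)])

print(loop_free_alternates(covered, "S", "E", "D"))
print(loop_free_alternates(uncovered, "S", "E", "D"))
```

In the second topology N's path back through S is as short as its direct path, so N fails the strict inequality and S has no alternate. Coverage therefore depends on the topology and its link costs.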
State Transition of Improved Solution • State transition with protection and damping: improving availability and stability.
BGP Fast Convergence Solutions • Major Problem in BGP • Theoretical analysis and measurement results indicate that the path exploration of path-vector protocols prolongs routing convergence • Several solutions address this problem: • RCN carries root-cause information in BGP updates, so it can eliminate all obsolete routes and ensure that only valid alternative routes are chosen and propagated. • Ghost Flushing improves BGP convergence by expediting the removal of outdated “ghost” information in the Internet. • Drawbacks … • Network fail-over events in GF, transient routing problems.…
Outline • Problem Statement • Analysis of Self-Healing Routing • Existing Improvement Solutions • Our Self-Healing Solution • Requirements of Solution • Routing Protection • Evaluation Metrics • Conclusion and Future Work
Self-healing Routing • The goal of self-healing routing • After a link or a node fails, the network can restore or repair routes by itself • Self-healing routing approaches • Routing Restoration (Fast Routing Convergence): attempt to find a new path on demand to restore connectivity when a failure occurs. • Routing Protection: based on fixed, predetermined failure recovery, set up a working path for traffic forwarding together with an alternate protection path.
Requirements of Solution • Simplicity • The solution should be simple and not add much complexity to core networks, whereas MPLS requires its own fundamental infrastructure. • Easy Deployment and Management • An MPLS-based solution is not a good candidate because it is hard to pre-compute a backup path for every node. • Efficiency • Protection should not be deployed to cover 100% of the network, especially when multiple failures happen. • Incremental Deployment Support • This is an important factor when designing a novel routing protocol, because no one can guarantee that it will be deployed everywhere at once.
Requirements of Solution (cont.) • Business Model Support • The solution should take into account the business model of applying path protection in production networks. • To protect unstable network areas and backbone areas, contracts should be signed between ISPs to guarantee routing availability in these areas. • Low Cost • The path protection solution should provide routes without heavy computation or additional computing power on routers, and should guarantee packet delivery performance with low packet loss. • The solution should cover both short-term and long-term network failures.
Principle of Our Solution • The key idea of routing protection is a tradeoff between the additional cost introduced by tunneling and the packet loss caused by failures. • Fast Failure Detection • Requirements: simplicity, fast detection, easy implementation, and no change to existing routing protocols. • Bidirectional Forwarding Detection (BFD) is applied directly. • Path Protection Technique • Although two types of routing must be considered, intra-domain and inter-domain, there is no need to provide separate path protection techniques for different routing instances. • To eliminate the problems introduced by L3 tunnels, we choose L2TP as the protection technique.
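To show why BFD suits fast failure detection, the sketch below computes the detection time for BFD's asynchronous mode as defined in RFC 5880: the Detect Mult received from the remote system multiplied by the agreed remote transmit interval, which is the larger of the local Required Min RX Interval and the remote Desired Min TX Interval. The function names and the 50 ms / multiplier 3 values are assumptions chosen for illustration, and transmit jitter is ignored.

```python
def bfd_detection_time_ms(remote_detect_mult,
                          local_required_min_rx_ms,
                          remote_desired_min_tx_ms):
    """Local detection time in asynchronous mode (RFC 5880, section 6.8.4)."""
    agreed_interval_ms = max(local_required_min_rx_ms, remote_desired_min_tx_ms)
    return remote_detect_mult * agreed_interval_ms

def session_timed_out(last_packet_ms, now_ms, detection_time_ms):
    """True once a full detection time has passed without a control packet."""
    return now_ms - last_packet_ms >= detection_time_ms

# Hypothetical timers: both ends ask for 50 ms, multiplier 3.
detection_ms = bfd_detection_time_ms(3, 50, 50)
print(detection_ms, "ms")

# The slower side wins: a 100 ms receive limit stretches detection.
print(bfd_detection_time_ms(3, 100, 50), "ms")

print(session_timed_out(last_packet_ms=1000, now_ms=1120, detection_time_ms=detection_ms))
print(session_timed_out(last_packet_ms=1000, now_ms=1150, detection_time_ms=detection_ms))
```

With timers of this order, a failure is detected in a fraction of a second, compared with the tens of seconds to minutes quoted earlier for convergence in existing routing protocols.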
Principle of Our Solution (cont.) • Tunnel Deactivation • Tunnels should be deactivated when a short-term failure recovers or when routes converge again after a long-term failure, e.g., for loop avoidance or performance. A tunnel deactivation mechanism is therefore essential to guarantee normal data forwarding. LAC: L2TP Access Concentrator LNS: L2TP Network Server
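The activation and deactivation rules can be read as a small state machine per protected route. The Python below is a hypothetical sketch, not the authors' implementation and not an L2TP API: the class `ProtectedRoute`, its event methods, and the router names are all invented for illustration. It only models which path a packet would take after each event.

```python
class ProtectedRoute:
    """One destination with a normal next hop and a pre-established tunnel."""

    def __init__(self, primary_next_hop, tunnel_endpoint):
        self.primary_next_hop = primary_next_hop
        self.tunnel_endpoint = tunnel_endpoint
        self.tunnel_active = False

    def on_failure_detected(self):
        # E.g. triggered by the failure detector: divert traffic at once.
        self.tunnel_active = True

    def on_failure_recovered(self):
        # Short-term failure: the old path is back, so stop tunneling.
        self.tunnel_active = False

    def on_route_converged(self, new_next_hop):
        # Long-term failure: routing found a new path, so stop tunneling.
        self.primary_next_hop = new_next_hop
        self.tunnel_active = False

    def forward_via(self):
        if self.tunnel_active:
            return ("tunnel", self.tunnel_endpoint)
        return ("ip", self.primary_next_hop)

route = ProtectedRoute(primary_next_hop="R2", tunnel_endpoint="R5")
print(route.forward_via())

route.on_failure_detected()
print(route.forward_via())

route.on_failure_recovered()        # short-term case
print(route.forward_via())

route.on_failure_detected()
route.on_route_converged("R3")      # long-term case
print(route.forward_via())
```

In the short-term case no route update needs to leave the router at all, which is what the stability metric on the next slide rewards.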
Evaluation Metrics of the Routing System • Two metrics to evaluate a routing system • Availability refers to the ability of the routing system to deliver packets normally whether or not network failures happen. • Stability refers to the routing dynamics of the routing system whether or not network failures happen. • Routing paths provided by tunnels guarantee routing availability, while delaying route updates during long-term failures and eliminating them during short-term failures improves the stability of the routing system.
Outline • Problem Statement • Analysis of Self-Healing Routing • Existing Improvement Solutions • Our Self-Healing Solution • Conclusion and Future Work
Conclusion and Future Work • There are many interesting problems in the Internet • Routing issues in the Internet are being actively addressed. • Many of the problems are hard – no easy solutions, tradeoffs have to be made. • Our solution addresses the self-healing problems of routing well. • Further study and measurement of our solution • Development of a prototype and experimental analysis on CERNET2
Thanks Q&A liqi@csnet1.cs.tsinghua.edu.cn