February 2008

Would Diversity Really Increase the Robustness of the Routing Infrastructure against Software Defects? The answer is: Yes • February 2008 • Juan Caballero, Theocharis Kampouris (Carnegie Mellon) • Dawn Song (Carnegie Mellon & UC Berkeley) • Jia Wang (AT&T Labs)

Software defects in routers • Defects in router software not uncommon • Multiple vulnerabilities in routers uncovered • DoS: maliciously crafted packets cause reload [CERT5] • DoS: maliciously crafted packets cause excessive resource consumption [CERT2,CERT4] • Remote execution of system-level commands [CERT3] • Unauthorized privileged access [CERT1] • Possible remote shell execution [CERT7]

Simultaneous router failure • Routing infrastructure highly homogeneous • What if a software defect makes it possible to simultaneously take down many routers? • Worst-case scenario. Rare. • But huge impact → highly damaging to ISP’s reputation • Diversity • Multiple implementations from different code bases • Reduces number of nodes affected by a bug [Zhang01, Junqueira05, O’Donnell04] • But how well would it work on routers?

Scope • We focus on the effect on network connectivity • Impact on higher layers left as future work • Includes: routing convergence, packet loss, delay… • Why? • Because no connectivity means no communication • What about fundamental limitations of diversity? • Vulnerabilities that are shared among vendors • General problem with no good solution • Deployment cost • Depends on how much diversity is already available

Statement • This paper does not claim: • that diversity can protect against all software defects • that we should redesign all networks to accommodate diversity • Rather, we show: • that diversity greatly helps with simultaneous router failures • that networks might already have a surprising amount of diversity • But it is not used to increase the robustness!

Contributions • Answering four fundamental questions: • How do we measure robustness of a network against simultaneous router failures? • How to best use the diversity? • How much diversity is needed to guarantee a certain degree of robustness? • Is there enough diversity already in the network or do we need to introduce more?

Problem definition • Graph theoretic approach G = (V,E) • Nodes are routers (V), Edges are links (E) • A version of a graph coloring problem where: • Colors represent implementations • A failure is a color removal • Different from well-known optimal coloring problem • Network Robustness = Resilience to simultaneous router failure • How connected is the network when multiple nodes fail? • The goal is to assign a color to each router from a set of k available colors such that the network robustness (Φ) is maximized

Determining the best coloring • Abilene network with 2 colors (k = 2) • [Figure: alternative colorings of the Abilene network, with Φ = 0.23, Φ = 0.18, Φ = 0.05, and Φ = 0.42] • We want to automatically select the best coloring
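To make "select the best coloring" concrete, here is a minimal sketch that simply tries every coloring of a tiny graph and keeps the one with the highest score. It is not one of the paper's algorithms: exhaustive search costs k^n evaluations, and the score used here (worst-case normalized size of the largest component after removing one color) is only an assumed stand-in for Φ. The function names and the 4-node path topology are hypothetical.

```python
from itertools import product

def largest_component(survivors, adj):
    """Size of the largest connected component among the surviving nodes."""
    seen, best = set(), 0
    for start in survivors:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            u = stack.pop()
            size += 1
            for w in adj[u]:
                if w in survivors and w not in seen:
                    seen.add(w)
                    stack.append(w)
        best = max(best, size)
    return best

def worst_case_phi(adj, coloring, k):
    """Assumed score: smallest largest-component fraction over all color removals."""
    n = len(adj)
    return min(
        largest_component({v for v in adj if coloring[v] != c}, adj) / n
        for c in range(k)
    )

def best_coloring(adj, k):
    """Try all k**n colorings; only feasible for very small graphs."""
    nodes = sorted(adj)
    best_score, best = -1.0, None
    for assignment in product(range(k), repeat=len(nodes)):
        coloring = dict(zip(nodes, assignment))
        score = worst_case_phi(adj, coloring, k)
        if score > best_score:
            best_score, best = score, coloring
    return best_score, best

# Hypothetical 4-node path topology: 0 - 1 - 2 - 3
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
score, coloring = best_coloring(path, k=2)
```

On this toy path the search settles on two contiguous halves (nodes 0 and 1 in one color, nodes 2 and 3 in the other) with a score of 0.5. For realistic topologies the search space is far too large, which is why heuristic coloring algorithms are needed.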

Outline • Introduction • Metrics • Connectivity • Robustness • Algorithms • Evaluation

Metrics • Need metrics to quantify the robustness of the colored graph, i.e., its resilience to failure • We need two types of metrics: • Connectivity metrics: Given a graph, determine how connected it is • Many graph connectivity metrics already proposed • We select some existing ones • Robustness metrics: Given a colored graph, determine how robust it is • We propose new ones • The robustness metrics will be a function of the connectivity metrics

Connectivity metrics: NSLC • Given a graph, determine how connected it is • Normalized size of largest component (NSLC) [Albert00] • [Figure: graph A, 1 component, NSLC = 1; graph B, 2 components, NSLC = 0.66]
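A minimal sketch of NSLC, assuming it is the number of nodes in the largest connected component divided by the total number of nodes. The function names and the three-node example graph are hypothetical; the actual graphs in the slide's figure are not recoverable, but a three-node graph with one link missing gives 2/3, in line with the 0.66 shown.

```python
from collections import deque

def components(nodes, edges):
    """Connected components of an undirected graph, as a list of node sets."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        comp, queue = {start}, deque([start])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    comp.add(w)
                    queue.append(w)
        comps.append(comp)
    return comps

def nslc(nodes, edges):
    """Normalized size of the largest component: |largest component| / |V|."""
    if not nodes:
        return 0.0
    return max(len(c) for c in components(nodes, edges)) / len(nodes)

# Hypothetical three-node examples
connected = nslc(["a", "b", "c"], [("a", "b"), ("b", "c")])  # one component
split = nslc(["a", "b", "c"], [("a", "b")])                  # two components
```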

Connectivity metrics: PC • Pair Connectivity (PC) [Park03] • [Figure: graph A, 1 component, PC = 1; graph B, 2 components, PC = 0.33] • We have versions of the metrics that support node weights
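A minimal sketch of pair connectivity, assuming it is the fraction of node pairs that can still reach each other. It uses a small union-find structure to get component sizes; the function name and the three-node example are hypothetical, and every edge endpoint is assumed to appear in the node list. The unweighted version is shown; the node-weighted variant mentioned above is not covered.

```python
def pair_connectivity(nodes, edges):
    """Fraction of unordered node pairs that lie in the same component."""
    parent = {v: v for v in nodes}

    def find(x):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)

    # Count how many nodes end up under each root.
    sizes = {}
    for v in parent:
        root = find(v)
        sizes[root] = sizes.get(root, 0) + 1

    n = len(parent)
    total_pairs = n * (n - 1) // 2
    if total_pairs == 0:
        return 0.0
    connected_pairs = sum(s * (s - 1) // 2 for s in sizes.values())
    return connected_pairs / total_pairs

# Hypothetical three-node examples
pc_connected = pair_connectivity(["a", "b", "c"], [("a", "b"), ("b", "c")])
pc_split = pair_connectivity(["a", "b", "c"], [("a", "b")])
```

With three nodes there are three pairs; when one node is cut off, only one pair remains connected, giving 1/3, in line with the 0.33 shown on the slide.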

Robustness metrics • Robustness of a colored graph measures the remaining connectivity when a color is removed • Remove a color => Disconnect all nodes using the color • Robustness is a function of the connectivity metric f applied over the different color-removal subgraphs • Probability of failure of each color is unknown • Two metrics: average and minimum (worst-case)

Minimum and average robustness • [Figure: colored graph G2 and its color-removal subgraphs G2red (NSLC = 0.18) and G2blue (NSLC = 0.82)] • Average robustness good • Minimum robustness bad • Average robustness can be misleading by itself
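The average and minimum robustness can be sketched as follows, using NSLC as the connectivity metric f. Two choices here are assumptions rather than details stated on the slides: each color-removal subgraph is normalized by the size of the original graph, and all colors are weighted equally in the average. The function names and the 4-node path topology are hypothetical.

```python
def largest_component(survivors, adj):
    """Size of the largest connected component among the surviving nodes."""
    seen, best = set(), 0
    for start in survivors:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            u = stack.pop()
            size += 1
            for w in adj[u]:
                if w in survivors and w not in seen:
                    seen.add(w)
                    stack.append(w)
        best = max(best, size)
    return best

def robustness(adj, coloring):
    """Return (average, minimum) NSLC over the removal of each color in use."""
    n = len(adj)
    scores = []
    for color in set(coloring.values()):
        # Removing a color disconnects every node that runs that implementation.
        survivors = {v for v in adj if coloring[v] != color}
        scores.append(largest_component(survivors, adj) / n)
    return sum(scores) / len(scores), min(scores)

# Hypothetical 4-node path topology: 0 - 1 - 2 - 3
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
balanced = robustness(path, {0: "red", 1: "red", 2: "blue", 3: "blue"})
skewed = robustness(path, {0: "red", 1: "blue", 2: "blue", 3: "blue"})
```

Both colorings have an average of 0.5, but the skewed one drops to 0.25 in the worst case while the balanced one stays at 0.5. This is the same effect as in the G2 example: the average alone hides a bad worst case.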

Outline • Introduction • Metrics • Algorithms • Evaluation

Algorithms • We have devised a total of 9 algorithms, which can be classified into 4 families • Only the Region coloring algorithms are presented in the paper • The rest are in the extended version [ColoringTR] • Region coloring algorithms outperform the others in the evaluation

Region coloring algorithms • Divide the network into contiguous regions • Regions are automatically found • Includes 2 algorithms: Cluster & Partition • Algorithms accept number of regions (k) as input • Graph partitioning algorithms try to balance the number of nodes in each partition (i.e., region) • [Figure: network divided into Region 1 and Region 2]
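As a rough illustration of what "contiguous, roughly balanced regions" means, the sketch below grows k regions from k seed nodes, letting each region claim one neighboring node per round. This is a simple stand-in, not the Cluster or Partition algorithm used in the paper; the function name, the seed choice, and the 6-node path topology are all hypothetical.

```python
from collections import deque

def grow_regions(adj, seeds):
    """Assign nodes to regions by round-robin growth from one seed per region.

    Every region stays contiguous because a node is only claimed when it is
    adjacent to a node already in that region.
    """
    region = {s: i for i, s in enumerate(seeds)}
    frontiers = [deque([s]) for s in seeds]
    progress = True
    while progress:
        progress = False
        for i, frontier in enumerate(frontiers):
            # Each region claims at most one new node per round to keep sizes close.
            while frontier:
                u = frontier[0]
                free = [w for w in adj[u] if w not in region]
                if not free:
                    frontier.popleft()  # u has no unclaimed neighbors left
                    continue
                region[free[0]] = i
                frontier.append(free[0])
                progress = True
                break
    return region

# Hypothetical 6-node path topology: 0 - 1 - 2 - 3 - 4 - 5, seeded at both ends
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
regions = grow_regions(path, seeds=[0, 5])
```

On this toy path the two regions meet in the middle, with nodes 0 to 2 in one region and nodes 3 to 5 in the other. A real partitioner optimizes balance and cut size far more carefully; this sketch only shows the shape of the output a region coloring works with.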

Results overview • There is usually a trade-off between perfectly balanced partitions and contiguous partitions • Results will show that: • Balanced regions are better • Slightly imbalanced but contiguous partitions are better than perfectly balanced but discontiguous partitions • [Figure: good partition (Region 1, Region 2) and bad partition (Region 1, Region 2, Region 1)]

Roles and Replicated nodes • Roles: • Not all routers can use all implementations • Two roles: Access / Backbone • One color-set for each role • Nodes have roles and can only use implementations from the color-set of their role • Replicated nodes: • ISPs usually replicate important nodes • Increases resilience against single node failures • Load-sharing • In real networks, replicas are colored identically • For robustness, replicas need to be colored differently

Extended Partition Algorithm • Color all backbone routers • Create backbone graph by removing all access routers • First color replicas with different colors • Then color rest using partition algorithm • Color the access routers • Create the access graph by collapsing all backbone nodes into a single node • Two cases depending on independence of access / backbone implementations
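The replica-coloring step above can be sketched as follows: each group of replicated routers gets pairwise different colors, and the starting color rotates from group to group so that no single implementation is favored. This covers only that one step, under assumed inputs; the function name, the router names, and the color labels are hypothetical, and the remaining nodes would still need to be colored by the partition algorithm.

```python
def color_replicas(replica_groups, colors):
    """Give the replicas in each group pairwise different colors."""
    coloring = {}
    for offset, group in enumerate(replica_groups):
        if len(group) > len(colors):
            raise ValueError("more replicas in a group than available colors")
        for i, node in enumerate(group):
            # Rotate the starting color per group to spread colors evenly.
            coloring[node] = colors[(offset + i) % len(colors)]
    return coloring

# Hypothetical backbone with two replicated sites and two implementations
groups = [["nyc-1", "nyc-2"], ["chi-1", "chi-2"]]
replica_colors = color_replicas(groups, ["A", "B"])
```

With two colors and replica pairs, every site keeps one router running when either implementation fails, which fits the later observation that two colors are enough for a mostly replicated backbone.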

Outline • Introduction • Metrics • Algorithms • Evaluation

Evaluation Setup • [Table: evaluation topologies, grouped as Real, Rocketfuel, and Synth.] • Metrics + algorithms implemented using the JUNG graph library [JUNG] • Graph clustering algorithm from Wu et al. [Wu04] • Graph partition algorithm from Karypis et al. [Karypis00]

Coloring Algorithms: Setup • Same topology (Tier-1 ISP) colored using different algorithms • Random as “lower bound” • Max as “upper bound”

Coloring Algorithms: Results • Partition/Cluster best on average • Region coloring minimizes impact • Partition best on worst case • More balanced coloring than Cluster • Partition performs close to Max in both average/worst cases • Non-contiguous partitions are bad (dip at k=5)

Redistributing the existing diversity • Tier-1 ISP contains 8 implementations (2 backbone, 6 access) • Due to: legacy routers, vendor change, budget constraints • Two implementations used by 90% of the nodes • What happens if we redistribute the same diversity using our algorithms? • Number of nodes in largest component goes from 5% to 76% • Requires: • Changing the number of nodes that use each implementation • Changing the geographical distribution of the implementations

Minimal diversity for decent robustness • Two colors are enough for the backbone • Most backbone routers are replicated • Decent robustness starts with 3 colors for access routers • More than 5 colors for access routers does not buy much

Related Work • Diversity as solution against software defects • Diversity in all network layers [Zhang01] • Diversity in distributed systems [Junqueira05] • Diversity to slow malware propagation [O’Donnell04] • Analysis of the Internet robustness [Albert00, Faloutsos99, Li04, Magoni03, Palmer01, Park03, Tangmunarunkit02, Zegura97] • Analysis of failures in networks [Markopoulou04, NIST02] • Router-level topologies [Spring02] • Node importance metrics [Freeman77, Lorrain71, Newman02, Tauro01] • Clustering and Partitioning [Karypis00, Wu04, etc.]

Conclusions • How do we measure robustness of a network against simultaneous router failures? • Proposed robustness metrics • How to use the diversity best? • Proposed coloring algorithms that achieve robustness close to the one obtained by a fully connected network • How much diversity is needed to guarantee a certain degree of robustness? • Not much. 2 backbone + 3 access for Tier-1 ISP • Is there enough diversity already in the network or do we need to introduce more? • Amount of diversity surprisingly high • Redistributing the diversity can increase the number of nodes surviving a failure from 5% to 76%

Questions?
