Network Resilience: Exploring Cascading Failures
Vishal Misra, Columbia University in the City of New York
Joint work with Ed Coffman, Zihui Ge and Don Towsley (UMass Amherst)
Prologue
On Tuesday, September 18, simultaneous with the onset of the propagation phase of the Nimda worm, we observed a BGP storm. This one came on faster, rode the trend higher, and then, just as mysteriously, turned itself off, though much more slowly. Over a period of roughly two hours, starting at about 13:00 GMT (9am EDT), aggregate BGP announcement rates exponentially ramped up by a factor of 25, from 400 per minute to 10,000 per minute, with sustained "gusts" to more than 200,000 per minute. The advertisement rate then decayed gradually over many days, reaching pre-Nimda levels by September 24th.
Similar events were observed on July 19th, the day CODE RED spread.
http://www.renesys.com/projects/bgp_instability
Cascading Failures? Conjecture
• The viruses started random IP port scanning
• Most of these random IP addresses were not in the cached entries of the routing table, causing frequent cache misses and, in the case of invalid IP addresses, generation of ICMP (router error) messages
• Both effects led to router CPU overload, causing routers to crash
• Router failures led to withdrawal announcements by the peers, generating a high level of advertisement traffic
• When a router came back up, it required a full state update from its peers, creating a large spike in the load of the peers that provided the state dump
• Once the restarted router had obtained all the dumps, it dumped its full state to all its peers, creating another spike in the load
• Frequent full state dumps led to more CPU overload, more crashes, and the propagation of the cycle
Outline
• Background
• Modeling interactions
• A Fluid model
• Phase transitions
• A Birth-Death model
• More phase transitions
• Insights
• Future work
Studies in Cascading Failures
• Cascading failures have been studied extensively in power networks (Zaborsky et al.)
• Coupling between nodes in power networks is well understood: e.g. differential equations describe voltage-phasor-load relationships
• Coupling in data networks (routing, traffic engineering, policy routing, DNS…) is difficult to model!
Modeling interactions
• We model coupling at the BGP level
• Study the interaction of a clique of BGP routers
• Model three different kinds of phenomena: router crash, router repair and full state updates
• System essentially forms a mutual aid collective
Clique of routers
• Routers form a fully connected graph
• All routers are peers of each other
• At the AS level, BGP routers form a clique on the order of 540 nodes
A fluid model for interactions
• We consider a clique of $N$ nodes
• Study the process of nodes that are down, $D$
• $k_s$: rate at which a single up node brings up down nodes
• $k_l$: rate at which full state updates bring down up nodes
• Typically, expect $k_s \gg k_l$
Drift equations
• $a(t)$ = number of arrivals (nodes going down) in $[0,t)$: $da(t) = (N-D)\,D\,k_l\,dt$
• $d(t)$ = number of departures (nodes brought back up) in $[0,t)$: $dd(t) = D\,\frac{N-D}{D}\,k_s\,dt = (N-D)\,k_s\,dt$
• Now consider the drift in down nodes $D$: $dD(t) = da(t) - dd(t) = (N-D)\,(k_l D - k_s)\,dt$
Dynamics of D
System shows a phase transition:
• If $D(0) > k_s/k_l$, then $D(t) \to N$ (all nodes go down)
• Else $D(t) \to 0$ (the system recovers)
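The threshold behaviour can be checked numerically. The sketch below assumes the drift takes the form $dD/dt = (N-D)(k_l D - k_s)$, which is the form consistent with the stated threshold $k_s/k_l$. The function name `simulate_fluid`, the forward-Euler step size, the clamping of $D$ to $[0, N]$ and the specific rate values are illustrative choices, not part of the talk.

```python
def simulate_fluid(n, k_s, k_l, d0, dt=1e-3, t_end=50.0):
    """Forward-Euler integration of dD/dt = (n - D) * (k_l * D - k_s).

    n   : clique size
    k_s : repair (service) rate, k_l : load rate (assumed drift form)
    d0  : initial number of down nodes
    Returns the value of D at time t_end.
    """
    d = float(d0)
    for _ in range(int(t_end / dt)):
        drift = (n - d) * (k_l * d - k_s)
        d += drift * dt
        # D counts down nodes, so keep it inside [0, n].
        d = min(max(d, 0.0), float(n))
    return d


if __name__ == "__main__":
    # Hypothetical rates chosen so that k_s / k_l = 20 with N = 100.
    for start in (15, 25):
        print(start, simulate_fluid(100, 1.0, 0.05, start))
```

Starting just below the threshold of 20 down nodes the trajectory should decay to 0, and starting just above it the trajectory should run to $N$.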
Phase transitions
[Figure: $N = 100$, $k_s/k_l = 20$]
Properties of phase transition
• Threshold is an absolute quantity rather than a fraction
• Cliques with “powerful” (i.e., high $k_s/k_l$) nodes do not exhibit cascading failures
• Smaller cliques are more resistant to phase transitions
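A Birth-Death model
• Again consider a clique of $N$ nodes
• The system state $i$ is the number of down nodes
• Transition rates are state dependent
[State diagram: states $0, 1, \dots, i, i+1, \dots, N-1, N$; upward rates $\lambda_0, \lambda_1, \dots, \lambda_i, \dots, \lambda_{N-1}$; downward rates labeled $\mu_0, \mu_1, \dots, \mu_i$]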
Transient model
• Since $\mu_N = 0$, state $N$ is an absorbing state
• System ends up in $N$ with probability 1
• Perform transient analysis: compute the mean time to absorption, $W_i$, starting from state $i$
• $W_i$ is a good indicator of the stability of the system; a low value indicates a propensity to collapse to state $N$ (where all nodes are down)
• Physically, interpret $W_i$ as the ability of the system to recover if it ends up in state $i$ through some exogenous process (e.g. attacks)
Solution for $W_i$
[Equations: relation for $W_i$ with two boundary conditions]
Solution (cont.)
[Equations: two further relations that yield a way to compute $W_i$]
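One standard way to obtain $W_i$, offered here as a sketch rather than as the talk's exact derivation, is first-step analysis on a birth-death chain. It assumes $\lambda_i$ is the rate from state $i$ to $i+1$, $\mu_i$ is the rate from state $i$ to $i-1$, state $N$ is absorbing and state $0$ can only move up. The helper quantity $T_i$ (the mean first-passage time from $i$ to $i+1$) is introduced for illustration.

```latex
\begin{align}
W_i &= \frac{1}{\lambda_i + \mu_i}
      + \frac{\lambda_i}{\lambda_i + \mu_i}\, W_{i+1}
      + \frac{\mu_i}{\lambda_i + \mu_i}\, W_{i-1}, \qquad 0 < i < N, \\
W_N &= 0, \qquad W_0 = \frac{1}{\lambda_0} + W_1 . \\
\intertext{Writing $T_i = W_i - W_{i+1}$ turns this into a forward recursion:}
T_0 &= \frac{1}{\lambda_0}, \qquad
T_i = \frac{1}{\lambda_i} + \frac{\mu_i}{\lambda_i}\, T_{i-1}, \qquad
W_i = \sum_{j=i}^{N-1} T_j .
\end{align}
```

Every term in the recursion for $T_i$ is positive, so it can be evaluated without cancellation even when $W_i$ spans many orders of magnitude.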
Modeling transition rates
• $\lambda_i = (N-i)\,i\,k_l + k_a$, where $k_a$ = ambient traffic load and $k_l$ is similar to the fluid model
• $\mu_i = (N-i)\,k_s$, where $k_s$ is similar to the fluid model
The mean time to absorption
$N = 20$, $k_s = 1$, $k_l = 0.01$
System stable: mean time to absorption of the order of $10^{26}$, even if only one node is up
A larger clique
$N = 100$, $k_s = 1$, $k_l = 0.01$
System still stable: mean time to absorption of the order of $10^{48}$ if only one node is up
The appearance of phase transitions
$N = 200$, $k_s = 1$, $k_l = 0.01$
Mean time to absorption goes down from $10^{47}$ to about 0 within a few states
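A short worked comparison of the rates suggests why the transition appears only for the largest clique. It uses the birth-death rates $\lambda_i = (N-i)\,i\,k_l + k_a$ and $\mu_i = (N-i)\,k_s$ and neglects the ambient term $k_a$, which is an assumption since its value is not given.

```latex
\begin{align}
\lambda_i \ge \mu_i
  &\iff (N-i)\,i\,k_l + k_a \ge (N-i)\,k_s
   \iff i \ge \frac{k_s}{k_l} - \frac{k_a}{(N-i)\,k_l}, \qquad i < N, \\
i^{*} &\approx \frac{k_s}{k_l} = \frac{1}{0.01} = 100 \qquad (k_a \to 0).
\end{align}
```

For $N = 20$ and $N = 100$ every state $i < N$ lies below $i^{*}$, so repair outpaces failure everywhere. For $N = 200$ the states $100 < i < 200$ have $\lambda_i > \mu_i$, so once the system is pushed past roughly 100 down nodes it drifts toward total collapse.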
Dependence on service rate/load
Transition point shifts right as the ratio $k_s/k_l$ goes up
Dependence on clique size
Transition point remains roughly the same; relative stability goes down as $N$ goes up
Early conclusions
• Cascading failures are possible in mutual support systems like a BGP clique
• The presence of phase transitions depends strongly on system parameters
• Clique size is an important factor: larger cliques are more likely to undergo cascading failures
Future work
• Refine model, plug in numbers for parameters
• Look at different topologies
• Do more detailed modeling of a single router (fixed point solutions)