Worms, Viruses, and Cascading Failures in Networks • D. Towsley, U. Massachusetts • Collaborators: W. Gong, C. Zou (UMass); A. Ganesh, L. Massoulie (Microsoft)
Internet as enabler of terrific apps • … but also of malicious behavior • worms, viruses • Internet as a complex system • critical DNS, BGP infrastructures
Worms and failures • Code Red worm • more than 360,000 infected in less than one day • disrupted parts of BGP infrastructure • SQL Slammer • less than 15 minutes to infect 75,000 hosts • congested parts of Internet • BGP errors in one network → cascade of faults in BGP in another network
Goals • what are appropriate models? • deterministic • stochastic • what makes worm/virus/failure virulent? • how does topology affect virulence?
Outline • worms, deterministic models • cascading failures, stochastic models • summary
Worm spreading behavior • scan for vulnerable hosts • sequential, random, topological • uniform, local preference • virulence sensitive to • scanning strategy • host speed, bandwidth • protocol • …
Worm spreading model • address space, size W • N vulnerable hosts • scan rate (per host), h
Simple worm spreading model • $I(t)$: number of infected hosts at time $t$ • epidemic model: $\frac{dI(t)}{dt} = \frac{h}{W}\, I(t)\,[N - I(t)]$, with initial condition $I(0)$
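As a worked sketch, assume the classic epidemic model takes the logistic form below (W, N and h as defined on the previous slide; the growth rate r and the target fraction f are names introduced here for illustration). The equation then has a closed-form solution, and the time to infect a fraction f of the vulnerable hosts follows directly:

```latex
\frac{dI}{dt} = \frac{h}{W}\, I(t)\,\bigl[N - I(t)\bigr]
\quad\Longrightarrow\quad
I(t) = \frac{N\, I(0)\, e^{rt}}{N - I(0) + I(0)\, e^{rt}},
\qquad r = \frac{hN}{W},
\\[6pt]
I(t_f) = fN
\quad\Longrightarrow\quad
t_f = \frac{1}{r}\,\ln\!\left(\frac{f\,[N - I(0)]}{(1 - f)\, I(0)}\right).
```

While I(t) is much smaller than N the growth is close to exponential at rate r, which is why raising h or I(0), or lowering W, speeds up the spread.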
Code Red: model • measurements from two Class A networks (data: D. Goldsmith, K. Eichman) • [Figure: observed scan rate vs. time] • scan rate ∝ I(t) • epidemic model matches increasing part of observed Code Red data (Staniford) • what about decrease? • human countermeasures • congestion • (Zou et al., 2002)
Assumptions • classic epidemic model • ignore countermeasures • ignore congestion • Code Red parameters • h = 358/min • N = 360,000 • uniform scan, $W = 2^{32}$ • I(0) = 10 • hundreds of minutes to spread
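The "hundreds of minutes" figure can be checked with a minimal numerical sketch. It assumes the epidemic model has the logistic form dI/dt = (h/W) I (N - I) and integrates it with forward Euler; the function names, step size and time horizon are illustrative choices, not part of the slides.

```python
def simulate_epidemic(h, N, W, I0, dt=0.1, t_end=800.0):
    """Forward-Euler integration of dI/dt = (h/W) * I * (N - I).

    The logistic form is an assumption about the slide's epidemic model.
    Times are in minutes. Returns a list of (t, I) samples.
    """
    I = float(I0)
    path = [(0.0, I)]
    steps = int(round(t_end / dt))
    for k in range(1, steps + 1):
        I += dt * (h / W) * I * (N - I)
        path.append((k * dt, I))
    return path


def first_time_above(path, level):
    """First sampled time at which I(t) >= level, or None if never reached."""
    for t, I in path:
        if I >= level:
            return t
    return None


# Code Red parameters from the slide.
H, N, W, I0 = 358.0, 360_000, 2**32, 10
path = simulate_epidemic(H, N, W, I0)
for frac in (0.5, 0.9, 0.99):
    print(f"{frac:.0%} infected after about {first_time_above(path, frac * N):.0f} min")
```

The step size is small relative to 1/r = W/(hN), about 33 minutes, so the Euler error stays well below the resolution the slide claims.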
Worm virulence • increase h • increase I(0) • decrease W • smarter scanning
The perfect worm • perfect worm • scan vulnerable nodes exactly once • flash worm (Staniford,…) • uniform scan of vulnerable nodes (W = N)
Perfect Code Red worm • I(0) = 10 • h = 358/min • N = 360,000 • all hosts infected within 2 sec. • add 2 sec. infection delay -> six-fold slowdown • random scan almost perfect!
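A back-of-the-envelope sketch of these numbers, under two stated assumptions: a perfect worm never wastes a scan, so the infected population grows as dI/dt = h I until it reaches N; and a uniform random scan of the vulnerable addresses follows the logistic model with W = N. The function names are hypothetical.

```python
import math


def perfect_worm_minutes(h, N, I0):
    """Minutes to infect all N hosts when every scan hits a new vulnerable
    host, i.e. pure exponential growth dI/dt = h * I (assumed idealisation)."""
    return math.log(N / I0) / h


def random_scan_minutes(frac, h, N, I0):
    """Minutes to infect frac * N hosts when scanning the N vulnerable
    addresses uniformly at random (assumed logistic model with W = N)."""
    return math.log(frac * (N - I0) / ((1.0 - frac) * I0)) / h


h, N, I0 = 358.0, 360_000, 10
print(f"perfect worm, all hosts: {60 * perfect_worm_minutes(h, N, I0):.2f} s")
print(f"random scan, 99% of hosts: {60 * random_scan_minutes(0.99, h, N, I0):.2f} s")
```

Neither function models the 2-second infection delay mentioned on the slide.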
Hitlist, routing worms • hitlist worm • increases I(0) • routing worm • decreases W • BGP table information: $W = 0.29 \times 2^{32}$ • 29% of IP address space
Hitlist, routing worms • Code Red style worm • h = 358/min • N = 360,000 • hitlist, I(0) = 10,000 • routing worm as effective as hitlist worm • hitlist/routing worm extremely virulent
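To compare the two strategies, the sketch below evaluates the closed-form time to 90% infection, assuming the logistic epidemic model dI/dt = (h/W) I (N - I). The routing-worm address space 0.29 × 2^32 is taken from the BGP figure on the earlier slide; the 90% target and the function name are illustrative choices.

```python
import math


def minutes_to_fraction(frac, h, N, W, I0):
    """Closed-form time for the assumed logistic model to reach frac * N."""
    r = h * N / W  # early exponential growth rate, per minute
    return math.log(frac * (N - I0) / ((1.0 - frac) * I0)) / r


h, N = 358.0, 360_000
scenarios = {
    "uniform scan": {"W": 2**32, "I0": 10},
    "hitlist": {"W": 2**32, "I0": 10_000},        # larger I(0)
    "routing": {"W": 0.29 * 2**32, "I0": 10},     # smaller W
}
t90 = {name: minutes_to_fraction(0.9, h, N, **s) for name, s in scenarios.items()}
for name, minutes in t90.items():
    print(f"{name:>12}: 90% infected after about {minutes:.0f} min")
```

A hitlist only removes the slow start-up phase, while a smaller W raises the growth rate itself, so the two changes act on different parts of the curve.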
Local preference worm • K subnetworks • p: probability of scanning the local subnet • (1-p): probability of scanning outside the local subnet
Local preference worm • $N_k$: no. of vulnerable hosts in subnet k • $I_k(t)$: no. of infected hosts in subnet k • fits epidemic model for interacting groups: a set of coupled ODEs
Local preference worm • K = 116 • $N_k = 360{,}000/K$ • $I_1(0) = 10$; $I_k(0) = 0$, k > 1 • h = 358/min • provides some of the locality of a routing worm
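The slides do not spell out the coupled ODEs, so the following is one plausible form rather than the authors' exact model. It assumes every subnet is a /8 with 2^24 addresses, that an infected host scans its own subnet with probability p and otherwise picks an address uniformly among the other K - 1 subnets, and that vulnerable hosts are spread evenly. The value p = 0.5, the step size and the function name are hypothetical.

```python
def local_preference_worm(K=116, N=360_000, h=358.0, p=0.5,
                          subnet_size=2**24, I1_0=10.0,
                          dt=0.1, t_end=300.0):
    """Forward-Euler integration of an assumed coupled-ODE model (needs K >= 2):

        dI_k/dt = (N_k - I_k) * h * [ p * I_k / subnet_size
                                      + (1 - p) * sum_{j != k} I_j / ((K - 1) * subnet_size) ]

    Returns (times, totals), where totals[i] is the total infected at times[i].
    """
    Nk = N / K
    I = [0.0] * K
    I[0] = I1_0                       # the infection starts in subnet 1 only
    outside_size = (K - 1) * subnet_size
    times, totals = [0.0], [sum(I)]
    for step in range(1, int(round(t_end / dt)) + 1):
        total = sum(I)
        updated = []
        for Ik in I:
            local = p * h * Ik / subnet_size
            remote = (1.0 - p) * h * (total - Ik) / outside_size
            updated.append(min(Nk, Ik + dt * (Nk - Ik) * (local + remote)))
        I = updated
        times.append(step * dt)
        totals.append(sum(I))
    return times, totals


times, totals = local_preference_worm()
half = next(t for t, tot in zip(times, totals) if tot >= 180_000)
print(f"half of the vulnerable hosts infected after about {half:.0f} min")
```

With evenly spread vulnerable hosts, N_k / subnet_size is the same in every subnet, so in this sketch the early growth rate of the total does not depend on p; the benefit comes from scanning only the 116 populated prefixes.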
Questions • topological worms • sequential scan • bandwidth constraints
topology? • failure recovery?
Topology and fast/slow recovery • model description • general network topologies • conditions for fast-slow recovery • specific network topologies • complete graphs (BGP routers) • hypercubes (peer-to-peer networks) • power-law graphs (Internet AS graph; E-mail address book graph)
Susceptible-Infective-Susceptible (SIS) epidemic model • also known as contact process; see [Liggett] • topology: undirected, finite, connected graph G = (V, E) • $X_v = 1$ if node v is down (infected); $X_v = 0$ if node v is up (healthy)
Model • $\{X_v\}_{v \in V}$: Markov process on $\{0,1\}^V$ with jump rates: • $X_v \to 1$ with rate $\beta \sum_{w \sim v} X_w$ • $X_v \to 0$ with rate $\delta$ • unique absorbing state at 0 • all other states communicate, 0 is reachable
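The jump rates above translate directly into a Gillespie-style simulation. This is a minimal sketch assuming an infection rate beta per infected neighbour and a recovery rate delta > 0 per infected node; the function name, the adjacency-dict representation and the `t_max` cut-off are illustrative choices.

```python
import random


def simulate_sis(adj, beta, delta, infected, rng, t_max=float("inf")):
    """Simulate the SIS contact process on an undirected graph.

    adj: dict mapping each node to a list of its neighbours.
    infected: iterable of initially infected nodes.
    Returns the time at which the all-healthy state is hit, or t_max if the
    epidemic is still alive at that time. Assumes delta > 0.
    """
    x = {v: 0 for v in adj}
    for v in infected:
        x[v] = 1
    t = 0.0
    while any(x.values()):
        # Jump rate of every node in the current configuration:
        # infected nodes recover at rate delta, healthy nodes become
        # infected at rate beta times the number of infected neighbours.
        rates = [delta if x[v] else beta * sum(x[w] for w in adj[v]) for v in adj]
        t += rng.expovariate(sum(rates))
        if t >= t_max:
            return t_max
        v = rng.choices(list(adj), weights=rates)[0]
        x[v] = 1 - x[v]
    return t


# Example: ring of 10 nodes, every node initially down.
ring = {v: [(v - 1) % 10, (v + 1) % 10] for v in range(10)}
T = simulate_sis(ring, beta=0.3, delta=1.0, infected=range(10),
                 rng=random.Random(0), t_max=1000.0)
print(f"absorption time on the ring: {T:.2f}")
```

Recomputing all rates after every jump costs O(|E|) per event, which is fine for small graphs but would need incremental updates for large ones.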
Time to absorption • system eventually recovers • how long does this take? • T = time to hit 0 (from a given initial condition) • how does E[T] depend on $\beta$, $\delta$, G?
Example • G = line segment or ring with n nodes • fix $\delta = 1$ • Theorem (Durrett and Liu): there is a critical $\beta_c > 0$ such that • if $\beta < \beta_c$, then E[T] = O(log n) • if $\beta > \beta_c$, then log E[T] ≈ na • signature of phase transition in infinite 1-D lattice
Fast recovery, spectral radius • $\rho$: spectral radius of the graph adjacency matrix A; n = |V| • then $P(X(t) \neq 0) \le c\, n^{1/2} \exp([\beta\rho - \delta]\, t)$ • hence, when $\beta\rho < \delta$, the survival time T satisfies $E[T] \le [\log(n) + 1] / [\delta - \beta\rho]$
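A small sketch of how this condition could be evaluated on a concrete graph. It reads the bound as E[T] ≤ (log n + 1) / (δ − βρ), with β the infection rate, δ the recovery rate and ρ the spectral radius of the adjacency matrix; the power-iteration routine and both function names are illustrative, and the graph is assumed to be connected.

```python
import math


def spectral_radius(adj, iters=500):
    """Largest adjacency eigenvalue of a connected undirected graph.

    Power iteration is run on A + I; the shift avoids the oscillation that
    plain power iteration shows on bipartite graphs such as stars or even rings.
    """
    nodes = list(adj)
    y = {v: 1.0 / math.sqrt(len(nodes)) for v in nodes}  # unit-norm start vector
    shifted = 1.0
    for _ in range(iters):
        z = {v: y[v] + sum(y[w] for w in adj[v]) for v in nodes}  # z = (A + I) y
        shifted = sum(z[v] * y[v] for v in nodes)                 # Rayleigh quotient
        norm = math.sqrt(sum(val * val for val in z.values()))
        y = {v: z[v] / norm for v in nodes}
    return shifted - 1.0


def fast_recovery_bound(adj, beta, delta):
    """(log n + 1) / (delta - beta * rho), or None when beta * rho >= delta."""
    gap = delta - beta * spectral_radius(adj)
    if gap <= 0:
        return None
    return (math.log(len(adj)) + 1.0) / gap


complete5 = {v: [w for w in range(5) if w != v] for v in range(5)}
print("rho =", spectral_radius(complete5))
print("bound on E[T]:", fast_recovery_bound(complete5, beta=0.1, delta=1.0))
```

Returning None outside the fast-recovery regime keeps the caller from mistaking a meaningless negative value for a bound.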
Coupling proof • consider a “branching random walk”, i.e. a Markov process $\{Y_v\}_{v \in V}$ with • $Y_v \to Y_v + 1$ with rate $\beta \sum_{w \sim v} Y_w = \beta (AY)_v$ • $Y_v \to Y_v - 1$ with rate $\delta Y_v$ • can couple the processes so that, for all t, $X(t) \le Y(t)$
Branching random walk bound • by “linearity” of Y, $dE[Y(t)]/dt = (\beta A - \delta I)\, E[Y(t)]$, so $E[Y(t)] = \exp(t\,[\beta A - \delta I])\, Y(0)$ • use $P(X(t) \neq 0) \le \sum_{v \in V} E[Y_v(t)]$
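Slow recovery • graph isoperimetric constant: $\eta_m = \min_{S \subset V,\ 0 < |S| \le m} |E(S, \bar{S})| / |S|$ • “perimeter”: $|E(S, \bar{S})|$, the number of edges between S and its complement • “area”: $|S|$
Slow die-out and isoperimetric constant • suppose for some m ≤ n/2, $r := \beta \eta_m / \delta > 1$ • then, with positive probability, epidemics survive for time at least $r^m/[2m]$ • hence, if $m = n^a$, survival time T satisfies $\log(E[T]) = \Omega(n^a)$
Coupling proof • let $|X| = \sum_v X_v$ • then |X| dominates a process Z on {0,…,m} with transition rates: $z \to z+1$ at rate $\beta \eta_m z$, $z \to z-1$ at rate $\delta z$ • then study the absorption time for Z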
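The absorption time of the dominated chain can be computed exactly, which makes the exponential growth in m visible. The sketch assumes Z moves up at rate β η_m z and down at rate δ z, and additionally assumes there is no upward jump out of state m; that boundary convention and the function name are choices made here, not taken from the slides.

```python
def mean_absorption_time(m, beta, eta_m, delta, start=1):
    """Expected time for the birth-death chain Z on {0, ..., m} to hit 0.

    Rates: z -> z+1 at beta * eta_m * z (for z < m), z -> z-1 at delta * z.
    Uses the first-step recursion T_z = 1/down_z + (up_z / down_z) * T_{z+1},
    where T_z is the expected time to move from z to z - 1.
    """
    step_down = [0.0] * (m + 2)
    for z in range(m, 0, -1):
        up = beta * eta_m * z if z < m else 0.0   # assumed: no jump above m
        down = delta * z
        step_down[z] = 1.0 / down + (up / down) * step_down[z + 1]
    return sum(step_down[1:start + 1])


for m in (5, 10, 20, 40):
    print(m, mean_absorption_time(m, beta=1.0, eta_m=2.0, delta=1.0))
```

Unrolling the recursion from state 1 gives (1/δ) Σ_{k=1..m} r^{k-1}/k with r = β η_m / δ, so for r > 1 the mean grows roughly like r^m / m, the same order as the survival bound on the previous slide.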
Complete graph • here, $\rho = n - 1$, $\eta_m = n - m$ • by picking $m = n^a$, a < 1, thresholds: • fast recovery if $\beta/\delta < 1/(n-1)$ • slow recovery if $\beta/\delta > 1/(n - n^a)$
Hypercube $\{0,1\}^d$ • here, $d = \log_2(n)$ and $\rho = d$ • for $m = 2^k$, k < d, $\eta_m = d - k$ • hence, for $k = \alpha d$, thresholds: • fast recovery if $\beta/\delta < 1/d$ • slow recovery if $\beta/\delta > 1/[d(1-\alpha)]$
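The isoperimetric constants quoted for the complete graph and the hypercube can be checked by brute force on small instances. The sketch assumes the constant is the minimum, over non-empty node sets S with at most m nodes, of the number of edges leaving S divided by |S|; the helper names are hypothetical and the enumeration is only practical for very small graphs.

```python
from itertools import combinations


def isoperimetric_constant(adj, m):
    """Brute-force minimum of (edges leaving S) / |S| over sets with 1 <= |S| <= m."""
    nodes = list(adj)
    best = float("inf")
    for size in range(1, m + 1):
        for subset in combinations(nodes, size):
            s = set(subset)
            boundary = sum(1 for v in s for w in adj[v] if w not in s)
            best = min(best, boundary / size)
    return best


def hypercube(d):
    """Nodes are the integers 0 .. 2^d - 1; neighbours differ in exactly one bit."""
    return {v: [v ^ (1 << i) for i in range(d)] for v in range(2 ** d)}


def complete_graph(n):
    return {v: [w for w in range(n) if w != v] for v in range(n)}


print("hypercube d=4, m=4:", isoperimetric_constant(hypercube(4), 4))
print("complete graph n=6, m=3:", isoperimetric_constant(complete_graph(6), 3))
```

On the hypercube the minimum is attained by a sub-cube of dimension k, where each node keeps k of its d edges inside the set.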
Erdős-Rényi random graph • edge between each pair of nodes present with probability $p_n$, independently of the others • dense: $d_n := n p_n = \Omega(\log n)$ • then $\rho \sim \eta \sim d_n$ with high probability
Star network • spectral radius: $n^{1/2}$ • isoperimetric constant: $\eta_m = 1$ for all m < n/2 • general results not useful • specialized analysis yields: • for arbitrary constant c > 0, if $\beta/\delta < c/n^{1/2}$, fast recovery, E[T] = O(log(n)) • if $\beta/\delta > n^{a-1/2}$, for a > 0, slow recovery, $\log(E[T]) = \Omega(n^a)$
Power-law random graph • power-law graph with exponent $\gamma$: number of degree-k vertices $\propto k^{-\gamma}$ • e.g. Internet AS graph with $\gamma = 2.1$ • expected-degree PLRG [Chung et al]: • expected degrees $w_1 > \cdots > w_n$: edge (i, j) present w.p. $w_i w_j / \sum_k w_k$ • particular choice: $w_i = c_1 (i + c_2)^{-1/(\gamma-1)}$
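A minimal sketch of sampling such a graph, assuming edge (i, j) is present independently with probability w_i w_j / Σ_k w_k and weights w_i = c_1 (i + c_2)^(-1/(γ-1)). The values of n, c_1 and c_2 below are hypothetical, the probability is capped at 1, and self-loops are left out for simplicity.

```python
import random


def plrg_weights(n, gamma, c1, c2):
    """Expected degrees w_i = c1 * (i + c2) ** (-1 / (gamma - 1)), i = 1..n."""
    return [c1 * (i + c2) ** (-1.0 / (gamma - 1.0)) for i in range(1, n + 1)]


def expected_degree_graph(weights, rng):
    """Sample an undirected graph with independent edges.

    Edge (i, j), i != j, is present with probability
    min(1, w_i * w_j / sum(w)); the cap at 1 is an added safeguard.
    Returns a dict mapping node index -> set of neighbours.
    """
    n = len(weights)
    total = sum(weights)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < min(1.0, weights[i] * weights[j] / total):
                adj[i].add(j)
                adj[j].add(i)
    return adj


w = plrg_weights(n=500, gamma=2.1, c1=40.0, c2=5.0)   # hypothetical constants
graph = expected_degree_graph(w, random.Random(0))
mean_degree = sum(len(nb) for nb in graph.values()) / len(graph)
print(f"max expected degree {w[0]:.1f}, sampled mean degree {mean_degree:.2f}")
```

The double loop is quadratic in n, which is acceptable for illustration but too slow for graphs of Internet scale.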
Power-law random graph (2) Spectral radius of PLRG [Chung et al.,03]: Denote by m max. expected degree (m=w1), and by d average of expected degrees. Then:
PLRG, $\gamma > 2.5$ • epidemics on the full graph live longer than on a sub-graph • look at the star induced by node 1: slow die-out for $\beta/\delta > m^{\alpha - 1/2}$, $\alpha > 0$ • compare to the spectral radius condition: fast die-out for $\beta/\delta < m^{-1/2}$ • the two thresholds differ by a factor $m^{\alpha}$; same gap as for the star
PLRG, $2 < \gamma < 2.5$ • consider the top N nodes, for suitable N • Erdős-Rényi core, whose isoperimetric constant involves a factor $F(\gamma)$ • gap between the fast- and slow-recovery thresholds: constant factor, $F(\gamma)$
Open problems • gap between upper and lower bounds in • sparse ER graphs • power-law random graphs for $\gamma < 2.5$ • spectral radius bound tight in examples, always true? • conditioned on slow recovery, how many nodes are down at intermediate times? • extensions to other graphs and to SIR epidemics
Observations • neither parameter tight • gap for topologies with diverse degrees • spectral radius “seems” to be right • nothing between log n and exp(n)?
Hitlist, routing worms • hitlist worm • increase I(0) • routing worm • decrease W • BGP table information: $W = 0.29 \times 2^{32}$ • 29% of IP address space • /8 aggregation: $W = 0.45 \times 2^{32}$ • 116 out of 256 possible 8-bit prefixes
The appearance of phase transitions • N = 200, $k_s = 1$, $k_l = 0.01$ • mean time to absorption drops from $10^{47}$ to about 0 within a few states
Accuracy of fluid model • population: 360,000 • scan rate $h \sim \mathcal{N}(358/\text{min},\ 100^2)$ (normal distribution) • scanning space: $2^{32}$ • I(0) = 1 • 100 simulations
Accuracy of fluid model • population: 360,000 • scan rate $h \sim \mathcal{N}(358/\text{min},\ 100^2)$ (normal distribution) • scanning space: $2^{32}$ • I(0) = 10 • 100 simulations