Sources of Unreliability in Networks James Newell CS 598IG – Scattered Systems October 21, 2004
Papers • The Synchronization of Periodic Messages • Internet Routing Instability • Characterising the Use of a Campus Wireless Network
The Synchronization of Periodic Routing Messages Sally Floyd and Van Jacobson IEEE/ACM Transactions on Networking, 1994
Overview • Many sources of periodic network traffic • Router updates • Streaming media applications • Over time, periodic traffic can become synchronized! • Synchronization leads to unbalanced traffic • Packet loss • Increased latency
Examples from Internet • DECnet’s DNA Phase IV on LBL (1988) • NEARnet core routers (1992)
Background • Synchronization results from weakly coupled interactions • Examples • Thai fireflies • Wall clocks • TCP window cycles • External clock synchronization • Client-server models
Router Synchronization • Router updates are periodic • Random fluctuations in period • Internal fluctuations cause routers to synchronize • External fluctuations break apart synchronized routers Easy to overlook!
Periodic Messages Model • Algorithm • Router A takes Tc seconds to process an outgoing update • Router B receives the first update packet from A after Td seconds • If A or B receives the first packet of an update, it processes it in Tc2 seconds • After processing (Tc + Tc2 in all), the router resets its timer to a uniform random value between Tp − Tr and Tp + Tr [Figure: timeline — A's timer expires and A processes for Tc; the first packet arrives at B at time Td; B processes for an additional Tc2; after Tc + Tc2, the timer is reset within Tp ± Tr]
More on Periodic Message Model • Triggered updates on major changes • Assumptions • No collisions • No lost or retransmitted packets • Similar to real protocols • RIP • IGRP • EGP
Simulations • Initially unsynchronized • Parameters • N = 20 • Tp = 121 sec • Tc = 0.11 sec • Tc2 = 0.11 sec • Tr = 0.1 sec • Td = 0 sec
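A minimal event-driven sketch of this model in Python, using the slide's parameters. This is not the authors' code: the event ordering is approximate, the sender's own Tc2 is folded into its busy period, and the final cluster metric is a crude heuristic.

```python
import heapq
import random

# Parameters taken from the slide (all in seconds)
N, Tp, Tc, Tc2, Tr, Td = 20, 121.0, 0.11, 0.11, 0.1, 0.0
SIM_TIME = 500_000.0  # assumed run length, long enough to observe clustering

random.seed(0)

busy_until = [0.0] * N            # end of any processing occupying router i
fires = []                        # (time, router) log of sent updates
heap = [(random.uniform(0, Tp), i) for i in range(N)]  # initially unsynchronized
heapq.heapify(heap)

while heap:
    t, i = heapq.heappop(heap)
    if t > SIM_TIME:
        break
    # A timer that expires while the router is busy is deferred until the
    # router goes idle -- this deferral is the weak coupling mechanism.
    start = max(t, busy_until[i])
    busy_until[i] = start + Tc    # i spends Tc sending its own update
    fires.append((start, i))
    # The first packet of i's update reaches every other router Td later;
    # an idle receiver spends Tc2 processing it.
    arrival = start + Td
    for j in range(N):
        if j != i and busy_until[j] <= arrival:
            busy_until[j] = arrival + Tc2
    # Reset the timer after processing, uniformly in [Tp - Tr, Tp + Tr].
    heapq.heappush(heap, (busy_until[i] + random.uniform(Tp - Tr, Tp + Tr), i))

# Crude synchronization metric: cluster sizes among the last N fire times,
# splitting clusters wherever consecutive fires are more than 2*Tc apart.
last = sorted(t for t, _ in fires[-N:])
clusters, size = [], 1
for a, b in zip(last, last[1:]):
    if b - a <= 2 * Tc:
        size += 1
    else:
        clusters.append(size)
        size = 1
clusters.append(size)
print("cluster sizes in final round:", clusters)
```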
Analysis • Clusters of synchronized routers form • The largest cluster (size i) dominates and characterizes the state of the system • Processes for iTc seconds after its first timer expires • "Jumps" (i − 1)Tc each round relative to unsynchronized routers • Synchronized groups can merge
Variation of Tr [Figure: simulations varying Tr — with Tr = 0.6, 1.0, and 1.4 Tc the system moves from unsynchronized to synchronized; with Tr = 2.3, 2.5, and 2.8 Tc it moves from synchronized to unsynchronized]
Markov Chain Model • N nodes that implement the Periodic Messages Model • Each state signifies the size of the largest cluster • Smaller clusters are all of size one • The chain moves at most one state per round
Cluster Breakup • Assume Tc < 2Tr + Td • Pi,i−1 = P(Tc < L + Td) = (1 − (Tc − Td)/(2Tr))^i, for 1 < i ≤ N [Figure: timeline for a cluster of size i = 3 — timers M1, M2, M3 expire within Tc of each other; each resets uniformly within [Tp − Tr, Tp + Tr]]
Cluster Growth • Cluster_i has a processing time of iTc • Its first timer expires at Tp − Tr(i−1)/(i+1) on average • Cluster_i "jumps" (i−1)Tc − Tr(i−1)/(i+1) each round compared to single routers • Assumes the mean distance to the next unsynchronized router is Tp/(N − i + 1) • Pi,i+1 = 1 − e^(−((N−i+1)/Tp)((i−1)Tc − Tr(i−1)/(i+1))), for 1 < i < N • P1,2 is left as a variable (dependent on Tr)
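A sketch of the two transition probabilities as Python functions, transcribed from the formulas above (P1,2 is treated as a free parameter, as the slide notes):

```python
import math

def p_break(i, Tc, Td, Tr):
    """P[i -> i-1]: the cluster of size i loses a member
    (slide formula, valid for 1 < i <= N and Tc < 2*Tr + Td)."""
    return (1 - (Tc - Td) / (2 * Tr)) ** i

def p_grow(i, N, Tp, Tc, Tr):
    """P[i -> i+1]: the cluster of size i absorbs one of the N - i
    unsynchronized routers during its jump (slide formula, 1 < i < N)."""
    jump = (i - 1) * Tc - Tr * (i - 1) / (i + 1)
    return 1 - math.exp(-(N - i + 1) / Tp * jump)

# Example with the simulation parameters:
N, Tp, Tc, Tr, Td = 20, 121.0, 0.11, 0.1, 0.0
print(p_break(3, Tc, Td, Tr))    # chance a 3-cluster shrinks this round
print(p_grow(3, N, Tp, Tc, Tr))  # chance a 3-cluster grows this round
```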
Synchronization • Define f(i) as the expected number of rounds until the cluster size first reaches i, starting from 1 See Appendix A for the derivation
Breakup • Define g(i) as the expected number of rounds until the cluster size first reaches i, starting from N
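Quantities like f(N) can be computed numerically from the chain by first-step analysis. A self-contained sketch (it reuses the slide's transition formulas; the default value for the free parameter P1,2 is an assumption, and g(i) could be computed symmetrically from the reversed boundary condition):

```python
import math
import numpy as np

def p_break(i, Tc, Td, Tr):
    return (1 - (Tc - Td) / (2 * Tr)) ** i            # slide formula, i > 1

def p_grow(i, N, Tp, Tc, Tr):
    jump = (i - 1) * Tc - Tr * (i - 1) / (i + 1)      # slide formula, 1 < i < N
    return 1 - math.exp(-(N - i + 1) / Tp * jump)

def rounds_to_sync(N, Tp, Tc, Tr, Td, p12=0.01):
    """f(N): expected rounds for the largest cluster to grow from 1 to N.
    First-step analysis: h(i) = 1 + p_i*h(i+1) + q_i*h(i-1) + r_i*h(i),
    with h(N) = 0. p12 is an assumed value for the free parameter P_{1,2}."""
    A = np.zeros((N, N))                  # states 1..N map to indices 0..N-1
    b = np.ones(N)
    A[N - 1, N - 1], b[N - 1] = 1.0, 0.0  # state N is the absorbing target
    for i in range(1, N):
        p = p12 if i == 1 else p_grow(i, N, Tp, Tc, Tr)
        q = 0.0 if i == 1 else p_break(i, Tc, Td, Tr)
        A[i - 1, i - 1] = p + q           # (1 - r_i) moved to the left side
        A[i - 1, i] = -p
        if i > 1:
            A[i - 1, i - 2] = -q
    return np.linalg.solve(A, b)[0]       # h(1)

print(rounds_to_sync(20, 121.0, 0.11, 0.1, 0.0))
```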
Evaluation of Analysis • Markov model estimates are 2–3 orders of magnitude larger than simulation results • A rough approximation (not predictive) • Captures the qualitative behavior (explanatory) • Grossly overestimates for large values of N and Tc
Analysis Results • Choosing Tr as a small multiple of Tc is usually effective at preventing synchronization • The transition to synchronization is abrupt • The paper recommends Tr = Tp/2 to cover all parameter ranges [Figure: expected time to synchronize spans from ~3000 years to ~16 minutes across a small parameter change]
Group Size • Steady-state behavior is bimodal • Almost always unsynchronized, or • Almost always synchronized • The addition of just one node can flip the system between modes [Figure: group-size distributions for Tr = 0.30 and Tr = 0.11]
Delayed Transmission • In reality, Td ≠ 0 even in low-latency networks • If Td > Tc, little coupling takes place • When 0 < Td < Tc, coupling weakens as Td approaches Tc: the breakup probability grows as Tc − Td shrinks • Recall Pi,i−1 = (1 − (Tc − Td)/(2Tr))^i
Topologies • Assumed mesh model • Model applies to some topologies • Ring • Model breaks for other topologies • Star
Conclusions • Periodic messages from routers can inadvertently synchronize • Emergent behavior with an abrupt transition • Synchronization can be overcome • Add an external random component (Tr = Tp/2) • Keep the routing timer independent of incoming events • No triggered updates • Account for random bursts of traffic
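The recommended fix is easy to state in code: reset the timer from a fixed base plus a fresh random component of width Tp/2, drawn independently of any event the router has just processed. A minimal sketch:

```python
import random

def next_timer(Tp):
    """Timer reset with a large external random component (Tr = Tp/2),
    drawn independently of incoming events, as the paper recommends."""
    return Tp + random.uniform(-Tp / 2, Tp / 2)
```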
Discussion • How could the problem have evolved in today's Internet? • Are there better solutions than adding a large random component? • How would the random component affect the performance of the protocol? • How does synchronization happen on WANs where Td can be very large?
Internet Routing Instability Craig Labovitz G. Robert Malan Farnam Jahanian SIGCOMM ‘97
Overview • Message analysis of inter-domain traffic at major Internet backbones • Rapid changes in node reachability cause network instability • Packet loss • Increased latency • Slower convergence • Connectivity loss!
Internet Background • The Internet is composed of various autonomous systems (ASes) connected by backbones • Each AS has its own administrative and routing policies • AS boundary routers exchange routing information with their peers about the reachability of IP blocks (prefixes)
BGP • The Border Gateway Protocol (BGP) is used by ASes to exchange updates • Uses incremental updates • Topology changes • Policy changes • Routes are identified by their ASPATH and prefix • Peer links run over TCP -> congestion back-off!
BGP Example • Each AS appends itself to the path when propagating an update • ASes need to keep a default-free routing table of all visible prefixes [Figure: five ASes, with 110.10.0.0/16 at AS3, 128.10.0.0/16 at AS4, and 155.10.0.0/16 at AS5; AS1's table — 110.10.0.0/16 via AS2 AS3; 128.10.0.0/16 via AS2 AS4; 155.10.0.0/16 via AS2 AS4 AS5]
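A toy illustration of the prepend-on-update rule over a default-free table keyed by prefix. The prefixes and paths come from the slide's example; the data structure itself is an assumption for illustration, not from the paper:

```python
# Toy default-free routing table: prefix -> ASPATH (list of AS numbers),
# matching the slide's example as seen from AS1.
routing_table = {
    "110.10.0.0/16": ["AS2", "AS3"],
    "128.10.0.0/16": ["AS2", "AS4"],
    "155.10.0.0/16": ["AS2", "AS4", "AS5"],
}

def announce(my_as, prefix, path):
    """Re-announce a route to a peer, prepending our own AS number."""
    return prefix, [my_as] + path

# AS1 passing its route for 155.10.0.0/16 on to a peer:
print(announce("AS1", "155.10.0.0/16", routing_table["155.10.0.0/16"]))
# -> ('155.10.0.0/16', ['AS1', 'AS2', 'AS4', 'AS5'])
```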
BGP Routing Updates • Two forms of updates • Announcements of a new path or destination • Withdrawals of earlier announcements • Explicit – via a "withdrawal" message • Implicit – via an announcement of a replacement route • During steady state, updates should occur only after • Local policy changes • Network additions
Major Findings • The volume of updates is much higher than expected • Pathological and redundant updates dominate routing traffic • Redundant messages are periodic and high-frequency • Update traffic correlates with network usage • Instability cannot be attributed to a small group of ASes or routers • A significant amount of forwarding instability occurs
Gauging Internet Instability • Collected BGP messages at various Internet backbones • Taxonomy of BGP updates • Forwarding instability • Routing policy instability • Redundant updates
Methodology • Logged BGP updates at 5 major US exchange points (e.g., Mae-East) between Jan '96 and Jan '97 • The routing servers peer with over 90% of ISPs
Problems with Instability • Non-convergence of routes • Dropped and out-of-order packets • Increased latency • Increased memory and CPU demands for packet queues • Invalid route caches • Route flapping • BGP keep-alive messages are dropped or delayed • Overloaded routers oscillate between being detected as up and down • This causes further instability through repeated topology updates -> more route flapping
Instability Mitigation • Route dampening (see the sketch below) • Ignore updates from routes that exceed a defined instability threshold • Wait a period T before processing them again • Legitimate updates can be lost during T • Aggregation • Combine smaller prefixes into a super-prefix • Effective only in planned, cooperative networks • Multi-homed stubs cannot be aggregated well
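A minimal sketch of the dampening idea: a per-route penalty counter with exponential decay and suppress/reuse thresholds, in the spirit of BGP route-flap damping. The constants are illustrative assumptions, not values from the paper:

```python
import math

class DampenedRoute:
    """Track a flap penalty for one route; suppress updates while the
    penalty exceeds a threshold. Constants are illustrative only."""
    SUPPRESS, REUSE, HALF_LIFE = 3.0, 1.0, 900.0  # penalty units, seconds

    def __init__(self):
        self.penalty, self.last, self.suppressed = 0.0, 0.0, False

    def update(self, now):
        """Called on each flap (withdrawal or re-announcement).
        Returns True if the update should be processed."""
        # Decay the accumulated penalty, then charge one unit for this flap.
        dt = now - self.last
        self.penalty *= math.exp(-math.log(2) * dt / self.HALF_LIFE)
        self.penalty += 1.0
        self.last = now
        if self.penalty > self.SUPPRESS:
            self.suppressed = True        # ignored until the penalty decays
        elif self.penalty < self.REUSE:
            self.suppressed = False
        return not self.suppressed
```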
BGP Update Analysis • Taxonomy of routing events (stated operationally in the sketch below) • WADiff: route explicitly withdrawn, then a different route is announced • AADiff: route implicitly withdrawn, a different route is announced • WADup: route explicitly withdrawn, then re-announced later • AADup: route implicitly withdrawn, then re-announced later • WWDup: repeated explicit withdrawals • AADiff, WADiff, and WADup are instability • WWDup is pathological instability • AADup can be either
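The taxonomy amounts to comparing each update for a prefix against the previous state. A sketch (event names from the slide; the matching rules are my reading of the definitions):

```python
def classify(prev_path, prev_event, event, path=None):
    """Label a BGP update for one prefix given the previous state.
    prev_path:  last announced ASPATH (kept across a withdrawal)
    prev_event: 'A' or 'W' for the previous update
    event:      'A' (announcement, with `path`) or 'W' (withdrawal)"""
    if event == "W":
        if prev_event == "W":
            return "WWDup"          # repeated explicit withdrawal
        return "withdrawal"         # pending: the next announcement decides
    if prev_event == "W":           # explicit withdrawal, then announcement
        return "WADiff" if path != prev_path else "WADup"
    return "AADiff" if path != prev_path else "AADup"

# Example: an announcement replacing a live route with a different path
print(classify(["AS2", "AS3"], "A", "A", ["AS2", "AS4"]))  # -> AADiff
```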
BGP Update Analysis • A typical routing table consists of 45,000 prefixes across 1,300 ASes • Monitored 3 to 6 million updates exchanged each day • Avg. 125 updates per network per day • Bursty: sometimes hundreds per second (WWDup not shown in the chart)
Pathology Analysis • The majority of updates are pathological WWDups (0.5 to 6 million per day) • Transmitted by routers that never announced the path ("stateless" BGP) • This problem may be due to a specific type of router or provider • Updates exhibit a period of 30 or 60 seconds • Vendors subsequently fixed stateless BGP • Not the main source of additional updates
Possible Pathology Origins • Misconfigured CSU clocks • Clocks can drift • Oscillation between valid and corrupted data • Jittered timers with stateless BGP • Synchronization • Unjittered timers [see the first paper] • Improper interaction with interior gateway protocols
Instability Analysis • Focus only on AADiff, WADiff, and WADup • Temporal trends • Highest during normal business hours • High during weekends • Low during summer break
Fine-grained Analysis • Focus on the Mae-East exchange for the month of August 1996 • Result: no single AS is solely responsible for the instability statistics • ISP A was responsible for a high amount of international traffic • ISP E was going through an infrastructure transition
Fine-grained Analysis • Now focus on a per-route analysis (ASPATH + prefix) • Result: no single route consistently dominates the instability statistics • 20 to 90% of routes (median 75%) had fewer than 10 instability events • 80 to 100% had fewer than 50
Fine-grained Analysis • Temporal properties of update arrivals • Measured the frequency distribution of instability events • Found that the majority (~50%) arrived on either a 30-second or a 60-second interval • Consistent even for legitimate updates
Conclusion • Instability continues to be a major problem • Over 99% of update events are redundant • Good: doesn't affect routing caches • Bad: the sheer volume can cause outages and delays • Instability cannot be attributed to a few guilty ISPs, routers, or prefix paths • Exhibits temporal properties • Correlates with network usage • High-frequency periodicity
Follow-up • From Origins of Internet Routing Instability – INFOCOM '99 • June 1996 – 2 million packets per day • June 1998 • Several hundred thousand packets per day • More announcements than withdrawals • The majority are still duplicate announcements • Oscillating routing announcements still occur
Characterising the Use of a Campus Wireless Network David Schwab Rick Bunt INFOCOM 2004
Overview • Analysis of wireless usage at the University of Saskatchewan • Where • When • How much • Trace allows evaluation of network design principles and plans for future development
Campus Characteristics • About 40 buildings spread over 363 acres • 18,000 students attend the university
Wireless Network Environment • Initial deployment in 2001 with 18 APs • Dispersed across various buildings • Not well advertised • Wireless traffic is routed over a virtual private network with a unique subnet • Cisco LEAP authentication is used to control wireless access
Trace Methodology • Mirrored wireless packets to a monitoring port on a dedicated computer • Used EtherPeek to log packet data • Used the LEAP server to track authentication data • The trace began Jan 22, 2003 and lasted one week • Data analyzed with Perl scripts
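The kind of post-processing the slide describes can be illustrated with a small aggregation sketch (Python here, for consistency with the earlier examples; the file name and field names are assumptions, since the trace format is not given):

```python
import csv
from collections import defaultdict

# Hypothetical trace format: one row per packet with
# timestamp, access_point, mac_address, bytes
bytes_per_ap = defaultdict(int)
users_per_ap = defaultdict(set)

with open("wireless_trace.csv") as f:
    for row in csv.DictReader(f):
        ap = row["access_point"]
        bytes_per_ap[ap] += int(row["bytes"])       # traffic volume per AP
        users_per_ap[ap].add(row["mac_address"])    # distinct users per AP

# Rank APs by traffic to answer "where" and "how much"
for ap in sorted(bytes_per_ap, key=bytes_per_ap.get, reverse=True):
    print(f"{ap}: {bytes_per_ap[ap]} bytes, {len(users_per_ap[ap])} users")
```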