Sources of Unreliability in Networks James Newell CS 598IG – Scattered Systems October 21, 2004
Papers • The Synchronization of Periodic Messages • Internet Routing Instability • Characterising the Use of a Campus Wireless Network
The Synchronization of Periodic Routing Messages Sally Floyd and Van Jacobson IEEE/ACM Transactions on Networking, 1994
Overview • Many sources of periodic network traffic • Router updates • Streaming media applications • Over time, periodic traffic can become synchronized! • Synchronization leads to unbalanced traffic • Packet loss • Increased latency
Examples from Internet • DECnet’s DNA Phase IV on LBL (1988) • NEARnet core routers (1992)
Background • Synchronization results from weakly coupled interactions • Examples • Thai fireflies • Wall clocks • TCP window cycles • External clock synchronization • Client-server models
Router Synchronization • Router updates are periodic • Random fluctuations in period • Internal fluctuations cause routers to synchronize • External fluctuations break apart synchronized routers Easy to overlook!
Periodic Messages Model • Algorithm • Router A takes Tc seconds to process an outgoing update • Router B receives the first update packet from A after Td seconds • If A or B receives the first packet of an update, it processes it in Tc2 seconds • After processing (Tc + Tc2 in all), the router resets its timer to a uniform random value between Tp − Tr and Tp + Tr [Figure: timeline — A's timer expires and A processes for Tc; the first packet arrives at B at time Td; B processes for an additional Tc2; after Tc + Tc2, the timer is reset within Tp ± Tr]
More on Periodic Message Model • Triggered updates on major changes • Assumptions • No collisions • No lost or retransmitted packets • Similar to real protocols • RIP • IGRP • EGP
Simulations • Initially unsynchronized • Parameters • N = 20 • Tp = 121 sec • Tc = 0.11 sec • Tc2 = 0.11 sec • Tr = 0.1 sec • Td = 0 sec
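A minimal event-driven sketch of this model in Python, using the slide's parameters. This is not the authors' code: the event ordering is approximate, the sender's own Tc2 is folded into its busy period, and the final cluster metric is a crude heuristic.

```python
import heapq
import random

# Parameters taken from the slide (all in seconds)
N, Tp, Tc, Tc2, Tr, Td = 20, 121.0, 0.11, 0.11, 0.1, 0.0
SIM_TIME = 500_000.0  # assumed run length, long enough to observe clustering

random.seed(0)

busy_until = [0.0] * N            # end of any processing occupying router i
fires = []                        # (time, router) log of sent updates
heap = [(random.uniform(0, Tp), i) for i in range(N)]  # initially unsynchronized
heapq.heapify(heap)

while heap:
    t, i = heapq.heappop(heap)
    if t > SIM_TIME:
        break
    # A timer that expires while the router is busy is deferred until the
    # router goes idle -- this deferral is the weak coupling mechanism.
    start = max(t, busy_until[i])
    busy_until[i] = start + Tc    # i spends Tc sending its own update
    fires.append((start, i))
    # The first packet of i's update reaches every other router Td later;
    # an idle receiver spends Tc2 processing it.
    arrival = start + Td
    for j in range(N):
        if j != i and busy_until[j] <= arrival:
            busy_until[j] = arrival + Tc2
    # Reset the timer after processing, uniformly in [Tp - Tr, Tp + Tr].
    heapq.heappush(heap, (busy_until[i] + random.uniform(Tp - Tr, Tp + Tr), i))

# Crude synchronization metric: cluster sizes among the last N fire times,
# splitting clusters wherever consecutive fires are more than 2*Tc apart.
last = sorted(t for t, _ in fires[-N:])
clusters, size = [], 1
for a, b in zip(last, last[1:]):
    if b - a <= 2 * Tc:
        size += 1
    else:
        clusters.append(size)
        size = 1
clusters.append(size)
print("cluster sizes in final round:", clusters)
```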
Analysis • Clusters of synchronized routers form • The largest cluster (size i) dominates and characterizes the state of the system • Processes for iTc seconds after its first timer expires • "Jumps" (i − 1)Tc each round relative to unsynchronized routers • Synchronized groups can merge
Variation of Tr [Figure: simulations varying Tr — with Tr = 0.6, 1.0, and 1.4 Tc the system moves from unsynchronized to synchronized; with Tr = 2.3, 2.5, and 2.8 Tc it moves from synchronized to unsynchronized]
Markov Chain Model • N nodes that implement the Periodic Messages Model • Each state signifies the size of the largest cluster • Smaller clusters are all of size one • The chain moves at most one state per round
Cluster Breakup • Assume Tc < 2Tr + Td • Pi,i−1 = P(Tc < L + Td) = (1 − (Tc − Td)/(2Tr))^i, for 1 < i ≤ N [Figure: timeline for a cluster of size i = 3 — timers M1, M2, M3 expire within Tc of each other; each resets uniformly within [Tp − Tr, Tp + Tr]]
Cluster Growth • Cluster_i has a processing time of iTc • Its first timer expires at Tp − Tr(i−1)/(i+1) on average • Cluster_i "jumps" (i−1)Tc − Tr(i−1)/(i+1) each round compared to single routers • Assumes the mean distance to the next unsynchronized router is Tp/(N − i + 1) • Pi,i+1 = 1 − e^(−((N−i+1)/Tp)((i−1)Tc − Tr(i−1)/(i+1))), for 1 < i < N • P1,2 is left as a variable (dependent on Tr)
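A sketch of the two transition probabilities as Python functions, transcribed from the formulas above (P1,2 is treated as a free parameter, as the slide notes):

```python
import math

def p_break(i, Tc, Td, Tr):
    """P[i -> i-1]: the cluster of size i loses a member
    (slide formula, valid for 1 < i <= N and Tc < 2*Tr + Td)."""
    return (1 - (Tc - Td) / (2 * Tr)) ** i

def p_grow(i, N, Tp, Tc, Tr):
    """P[i -> i+1]: the cluster of size i absorbs one of the N - i
    unsynchronized routers during its jump (slide formula, 1 < i < N)."""
    jump = (i - 1) * Tc - Tr * (i - 1) / (i + 1)
    return 1 - math.exp(-(N - i + 1) / Tp * jump)

# Example with the simulation parameters:
N, Tp, Tc, Tr, Td = 20, 121.0, 0.11, 0.1, 0.0
print(p_break(3, Tc, Td, Tr))    # chance a 3-cluster shrinks this round
print(p_grow(3, N, Tp, Tc, Tr))  # chance a 3-cluster grows this round
```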
Synchronization • Define f(i) as the expected number of rounds until the cluster size first reaches i, starting from 1 See Appendix A for the derivation
Breakup • Define g(i) as the expected number of rounds until the cluster size first reaches i, starting from N
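Quantities like f(N) can be computed numerically from the chain by first-step analysis. A self-contained sketch (it reuses the slide's transition formulas; the default value for the free parameter P1,2 is an assumption, and g(i) could be computed symmetrically from the reversed boundary condition):

```python
import math
import numpy as np

def p_break(i, Tc, Td, Tr):
    return (1 - (Tc - Td) / (2 * Tr)) ** i            # slide formula, i > 1

def p_grow(i, N, Tp, Tc, Tr):
    jump = (i - 1) * Tc - Tr * (i - 1) / (i + 1)      # slide formula, 1 < i < N
    return 1 - math.exp(-(N - i + 1) / Tp * jump)

def rounds_to_sync(N, Tp, Tc, Tr, Td, p12=0.01):
    """f(N): expected rounds for the largest cluster to grow from 1 to N.
    First-step analysis: h(i) = 1 + p_i*h(i+1) + q_i*h(i-1) + r_i*h(i),
    with h(N) = 0. p12 is an assumed value for the free parameter P_{1,2}."""
    A = np.zeros((N, N))                  # states 1..N map to indices 0..N-1
    b = np.ones(N)
    A[N - 1, N - 1], b[N - 1] = 1.0, 0.0  # state N is the absorbing target
    for i in range(1, N):
        p = p12 if i == 1 else p_grow(i, N, Tp, Tc, Tr)
        q = 0.0 if i == 1 else p_break(i, Tc, Td, Tr)
        A[i - 1, i - 1] = p + q           # (1 - r_i) moved to the left side
        A[i - 1, i] = -p
        if i > 1:
            A[i - 1, i - 2] = -q
    return np.linalg.solve(A, b)[0]       # h(1)

print(rounds_to_sync(20, 121.0, 0.11, 0.1, 0.0))
```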
Evaluation of Analysis • Markov model estimates are 2–3 orders of magnitude larger than simulation results • A rough approximation (not predictive) • Captures the qualitative behavior (explanatory) • Grossly overestimates for large values of N and Tc
Analysis Results • Choosing Tr as a small multiple of Tc is usually effective at preventing synchronization • The transition to synchronization is abrupt • The paper recommends Tr = Tp/2 to cover all parameter ranges [Figure: expected time to synchronize spans from ~3000 years to ~16 minutes across a small parameter change]
Group Size • Steady-state behavior is bimodal • Almost always unsynchronized, or • Almost always synchronized • The addition of just one node can flip the system between modes [Figure: group-size distributions for Tr = 0.30 and Tr = 0.11]
Delayed Transmission • In reality, Td ≠ 0 even in low-latency networks • If Td > Tc, little coupling takes place • When 0 < Td < Tc, coupling weakens as Td approaches Tc: the breakup probability grows as Tc − Td shrinks • Recall Pi,i−1 = (1 − (Tc − Td)/(2Tr))^i
Topologies • Assumed mesh model • Model applies to some topologies • Ring • Model breaks for other topologies • Star
Conclusions • Periodic messages from routers can inadvertently synchronize • Emergent behavior with an abrupt transition • Synchronization can be overcome • Add an external random component (Tr = Tp/2) • Keep the routing timer independent of incoming events • No triggered updates • Account for random bursts of traffic
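The recommended fix is easy to state in code: reset the timer from a fixed base plus a fresh random component of width Tp/2, drawn independently of any event the router has just processed. A minimal sketch:

```python
import random

def next_timer(Tp):
    """Timer reset with a large external random component (Tr = Tp/2),
    drawn independently of incoming events, as the paper recommends."""
    return Tp + random.uniform(-Tp / 2, Tp / 2)
```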
Discussion • How could the problem have evolved in today's Internet? • Are there better solutions than adding a large random component? • How would the random component affect the performance of the protocol? • How does synchronization happen on WANs where Td can be very large?
Internet Routing Instability Craig Labovitz G. Robert Malan Farnam Jahanian SIGCOMM ‘97
Overview • Message analysis of inter-domain traffic at major Internet backbones • Rapid changes in node reachability cause network instability • Packet loss • Increased latency • Slower convergence • Connectivity loss!
Internet Background • The Internet is composed of various autonomous systems (ASes) connected by backbones • Each AS has its own administrative and routing policies • AS boundary routers exchange routing information with their peers about the reachability of IP blocks (prefixes)
BGP • The Border Gateway Protocol (BGP) is used by ASes to exchange updates • Uses incremental updates • Topology changes • Policy changes • Routes are identified by their ASPATH and prefix • Peer links run over TCP -> congestion back-off!
BGP Example • Each AS appends itself to the path when propagating an update • ASes need to keep a default-free routing table of all visible prefixes [Figure: five ASes, with 110.10.0.0/16 at AS3, 128.10.0.0/16 at AS4, and 155.10.0.0/16 at AS5; AS1's table — 110.10.0.0/16 via AS2 AS3; 128.10.0.0/16 via AS2 AS4; 155.10.0.0/16 via AS2 AS4 AS5]
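A toy illustration of the prepend-on-update rule over a default-free table keyed by prefix. The prefixes and paths come from the slide's example; the data structure itself is an assumption for illustration, not from the paper:

```python
# Toy default-free routing table: prefix -> ASPATH (list of AS numbers),
# matching the slide's example as seen from AS1.
routing_table = {
    "110.10.0.0/16": ["AS2", "AS3"],
    "128.10.0.0/16": ["AS2", "AS4"],
    "155.10.0.0/16": ["AS2", "AS4", "AS5"],
}

def announce(my_as, prefix, path):
    """Re-announce a route to a peer, prepending our own AS number."""
    return prefix, [my_as] + path

# AS1 passing its route for 155.10.0.0/16 on to a peer:
print(announce("AS1", "155.10.0.0/16", routing_table["155.10.0.0/16"]))
# -> ('155.10.0.0/16', ['AS1', 'AS2', 'AS4', 'AS5'])
```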
BGP Routing Updates • Two forms of updates • Announcements of a new path or destination • Withdrawals of earlier announcements • Explicit – via a "withdrawal" message • Implicit – via an announcement of a replacement route • During steady state, updates should occur only after • Local policy changes • Network additions
Major Findings • The volume of updates is much higher than expected • Pathological and redundant updates dominate routing traffic • Redundant messages are periodic and high-frequency • Update traffic correlates with network usage • Instability cannot be attributed to a small group of ASes or routers • A significant amount of forwarding instability occurs
Gauging Internet Instability • Collected BGP messages at various Internet backbones • Taxonomy of BGP updates • Forwarding instability • Routing policy instability • Redundant updates
Methodology • Logged BGP updates at 5 major US exchange points (e.g., Mae-East) between Jan '96 and Jan '97 • The routing servers peer with over 90% of ISPs
Problems with Instability • Non-convergence of routes • Dropped and out-of-order packets • Increased latency • Increased memory and CPU demands for packet queues • Invalid route caches • Route flapping • BGP keep-alive messages are dropped or delayed • Overloaded routers oscillate between being detected as up and down • This causes further instability through repeated topology updates -> more route flapping
Instability Mitigation • Route dampening (see the sketch below) • Ignore updates from routes that exceed a defined instability threshold • Wait a period T before processing them again • Legitimate updates can be lost during T • Aggregation • Combine smaller prefixes into a super-prefix • Effective only in planned, cooperative networks • Multi-homed stubs cannot be aggregated well
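A minimal sketch of the dampening idea: a per-route penalty counter with exponential decay and suppress/reuse thresholds, in the spirit of BGP route-flap damping. The constants are illustrative assumptions, not values from the paper:

```python
import math

class DampenedRoute:
    """Track a flap penalty for one route; suppress updates while the
    penalty exceeds a threshold. Constants are illustrative only."""
    SUPPRESS, REUSE, HALF_LIFE = 3.0, 1.0, 900.0  # penalty units, seconds

    def __init__(self):
        self.penalty, self.last, self.suppressed = 0.0, 0.0, False

    def update(self, now):
        """Called on each flap (withdrawal or re-announcement).
        Returns True if the update should be processed."""
        # Decay the accumulated penalty, then charge one unit for this flap.
        dt = now - self.last
        self.penalty *= math.exp(-math.log(2) * dt / self.HALF_LIFE)
        self.penalty += 1.0
        self.last = now
        if self.penalty > self.SUPPRESS:
            self.suppressed = True        # ignored until the penalty decays
        elif self.penalty < self.REUSE:
            self.suppressed = False
        return not self.suppressed
```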
BGP Update Analysis • Taxonomy of routing events (stated operationally in the sketch below) • WADiff: route explicitly withdrawn, then a different route is announced • AADiff: route implicitly withdrawn, a different route is announced • WADup: route explicitly withdrawn, then re-announced later • AADup: route implicitly withdrawn, then re-announced later • WWDup: repeated explicit withdrawals • AADiff, WADiff, and WADup are instability • WWDup is pathological instability • AADup can be either
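The taxonomy amounts to comparing each update for a prefix against the previous state. A sketch (event names from the slide; the matching rules are my reading of the definitions):

```python
def classify(prev_path, prev_event, event, path=None):
    """Label a BGP update for one prefix given the previous state.
    prev_path:  last announced ASPATH (kept across a withdrawal)
    prev_event: 'A' or 'W' for the previous update
    event:      'A' (announcement, with `path`) or 'W' (withdrawal)"""
    if event == "W":
        if prev_event == "W":
            return "WWDup"          # repeated explicit withdrawal
        return "withdrawal"         # pending: the next announcement decides
    if prev_event == "W":           # explicit withdrawal, then announcement
        return "WADiff" if path != prev_path else "WADup"
    return "AADiff" if path != prev_path else "AADup"

# Example: an announcement replacing a live route with a different path
print(classify(["AS2", "AS3"], "A", "A", ["AS2", "AS4"]))  # -> AADiff
```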
BGP Update Analysis • A typical routing table consists of 45,000 prefixes across 1,300 ASes • Monitored 3 to 6 million updates exchanged each day • Avg. 125 updates per network per day • Bursty: sometimes hundreds per second (WWDup not shown in the chart)
Pathology Analysis • The majority of updates are pathological WWDups (0.5 to 6 million per day) • Transmitted by routers that never announced the path ("stateless" BGP) • This problem may be due to a specific type of router or provider • Updates exhibit a period of 30 or 60 seconds • Vendors subsequently fixed stateless BGP • Not the main source of additional updates
Possible Pathology Origins • Misconfigured CSU clocks • Clocks can drift • Oscillation between valid and corrupted data • Jittered timers with stateless BGP • Synchronization • Unjittered timers [see the first paper] • Improper interaction with interior gateway protocols
Instability Analysis • Focus only on AADiff, WADiff, and WADup • Temporal trends • Highest during normal business hours • High during weekends • Low during summer break
Fine-grained Analysis • Focus on the Mae-East exchange for the month of August 1996 • Result: no single AS is solely responsible for the instability statistics • ISP A was responsible for a high amount of international traffic • ISP E was going through an infrastructure transition
Fine-grained Analysis • Now focus on a per-route analysis (ASPATH + prefix) • Result: no single route consistently dominates the instability statistics • 20 to 90% of routes (median 75%) had fewer than 10 instability events • 80 to 100% had fewer than 50
Fine-grained Analysis • Temporal properties of update arrivals • Measured the frequency distribution of instability events • Found that the majority (~50%) arrived on either a 30-second or a 60-second interval • Consistent even for legitimate updates
Conclusion • Instability continues to be a major problem • Over 99% of update events are redundant • Good: doesn't affect routing caches • Bad: the sheer volume can cause outages and delays • Instability cannot be attributed to a few guilty ISPs, routers, or prefix paths • Exhibits temporal properties • Correlates with network usage • High-frequency periodicity
Follow-up • From Origins of Internet Routing Instability – INFOCOM '99 • June 1996 – 2 million packets per day • June 1998 • Several hundred thousand packets per day • More announcements than withdrawals • The majority are still duplicate announcements • Oscillating routing announcements still occur
Characterising the Use of a Campus Wireless Network David Schwab Rick Bunt INFOCOM 2004
Overview • Analysis of wireless usage at the University of Saskatchewan • Where • When • How much • Trace allows evaluation of network design principles and plans for future development
Campus Characteristics • About 40 buildings spread over 363 acres • 18,000 students attend the university
Wireless Network Environment • Initial deployment in 2001 with 18 APs • Dispersed across various buildings • Not well advertised • Wireless traffic is routed over a virtual private network with a unique subnet • Cisco LEAP authentication is used to control wireless access
Trace Methodology • Mirrored wireless packets to a monitoring port on a dedicated computer • Used EtherPeek to log packet data • Used the LEAP server to track authentication data • The trace began Jan 22, 2003 and lasted one week • Data analyzed with Perl scripts
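The kind of post-processing the slide describes can be illustrated with a small aggregation sketch (Python here, for consistency with the earlier examples; the file name and field names are assumptions, since the trace format is not given):

```python
import csv
from collections import defaultdict

# Hypothetical trace format: one row per packet with
# timestamp, access_point, mac_address, bytes
bytes_per_ap = defaultdict(int)
users_per_ap = defaultdict(set)

with open("wireless_trace.csv") as f:
    for row in csv.DictReader(f):
        ap = row["access_point"]
        bytes_per_ap[ap] += int(row["bytes"])       # traffic volume per AP
        users_per_ap[ap].add(row["mac_address"])    # distinct users per AP

# Rank APs by traffic to answer "where" and "how much"
for ap in sorted(bytes_per_ap, key=bytes_per_ap.get, reverse=True):
    print(f"{ap}: {bytes_per_ap[ap]} bytes, {len(users_per_ap[ap])} users")
```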