RRAPID : Real-time Recovery based on Active Probing, Introspection, and Decentralization

Takashi Suzuki, Matthew Caesar

Motivation
• Today’s Internet core has bursty losses
  • Backbones have low average loss rates (<0.2%) but experience large bursts of loss
  • Loss durations vary from 10ms to 33.72s
  • 6 out of 7 providers experienced large outages of 10-220s, 1-2 times per day
  • Difficult for multimedia applications to recover from repeated loss (e.g. with FEC)
• Commonly used restoration techniques are insufficient
  • Link-layer recovery and MPLS are not yet uniformly deployed
  • RON is too slow (20s) and not scalable
• → Real-time recovery is desired
• Source: “Assessment of VoIP Quality over Internet Backbones,” Markopoulou, Tobagi, Karam (INFOCOM 2002)

Approach
• RRAPID: Real-time Recovery based on Adaptive Probing, Introspection, and Dampening
• Technique: overlay-based, real-time recovery
  • Use link-state routing
  • Determine link cost from packet receipt delay
  • Adaptively dampen route advertisements
• Desirable properties:
  • Speed: low end-to-end failure time
  • Stability: few route oscillations
  • Accuracy: avoid reacting to transient failures
  • Scalability: low probing/communication overhead
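To make the link-state step concrete, here is a minimal sketch of route selection over an overlay, assuming each node already holds a map of per-link costs. The slides do not say how RRAPID computes routes beyond "link-state routing", so the plain Dijkstra search, the function name `shortest_path`, and the node names `src`, `relay`, and `dst` are illustrative assumptions.

```python
import heapq


def shortest_path(graph, source, target):
    """Dijkstra over an overlay graph given as {node: {neighbor: cost}}.

    Hypothetical helper: the cost values stand in for whatever metric
    the link-cost layer advertises.
    """
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, cost in graph.get(node, {}).items():
            nd = d + cost
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                prev[neighbor] = node
                heapq.heappush(heap, (nd, neighbor))
    if target not in dist:
        return None  # target unreachable
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]


# Two-path overlay: a direct link and a detour through one relay node.
overlay = {"src": {"dst": 1.0, "relay": 1.0}, "relay": {"dst": 1.0}}
print(shortest_path(overlay, "src", "dst"))  # ['src', 'dst']

# When the advertised cost of the direct link rises, traffic moves to the relay.
overlay["src"]["dst"] = 10.0
print(shortest_path(overlay, "src", "dst"))  # ['src', 'relay', 'dst']
```

In this picture, recovery speed depends on how quickly and how reliably the advertised link costs change, which is what the reaction mechanism controls.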

System Architecture: Reaction Mechanism
[Diagram: layers RS, AT, LCE]
• Route Stabilization (RS):
  • Dampens route flaps
• Adaptive Tracking (AT):
  • Filters noise
  • Reacts quickly to changes
• Link Cost Estimation (LCE):
  • Estimates failure probability from packet loss
  • “Delay-deficit algorithm”
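The layering can be sketched as a three-stage pipeline in which each probe result flows through LCE, then AT, then RS. The slides do not specify the algorithms, so everything below is a stand-in chosen for illustration: a consecutive-miss counter instead of the delay-deficit algorithm, an exponentially weighted moving average with a jump rule for AT, and simple hysteresis for RS. All class names, parameters, and thresholds are assumptions.

```python
from dataclasses import dataclass


@dataclass
class LinkCostEstimator:
    """LCE stand-in: maps consecutive missed probes to a score in [0, 1]."""
    misses_for_certainty: int = 3  # assumed value
    consecutive_misses: int = 0

    def update(self, probe_received: bool) -> float:
        self.consecutive_misses = 0 if probe_received else self.consecutive_misses + 1
        return min(1.0, self.consecutive_misses / self.misses_for_certainty)


@dataclass
class AdaptiveTracker:
    """AT stand-in: smooths small changes, jumps immediately on large ones."""
    alpha: float = 0.2           # assumed smoothing weight
    jump_threshold: float = 0.7  # assumed size of a "large" change
    value: float = 0.0

    def update(self, sample: float) -> float:
        if abs(sample - self.value) >= self.jump_threshold:
            self.value = sample  # react quickly to a large change
        else:
            self.value += self.alpha * (sample - self.value)  # filter noise
        return self.value


@dataclass
class RouteStabilizer:
    """RS stand-in: hysteresis so the advertised state does not flap."""
    fail_above: float = 0.8     # assumed
    recover_below: float = 0.2  # assumed
    advertised_down: bool = False

    def update(self, filtered: float) -> bool:
        if not self.advertised_down and filtered >= self.fail_above:
            self.advertised_down = True
        elif self.advertised_down and filtered <= self.recover_below:
            self.advertised_down = False
        return self.advertised_down


def run_pipeline(probe_results):
    """Feed probe outcomes (True = received) through LCE -> AT -> RS."""
    lce, at, rs = LinkCostEstimator(), AdaptiveTracker(), RouteStabilizer()
    return [rs.update(at.update(lce.update(ok))) for ok in probe_results]


# An isolated loss is absorbed; three back-to-back losses flip the advertisement.
print(run_pipeline([True, False, True, True]))
print(run_pipeline([True, False, False, False]))
```

With these assumed thresholds, a single missed probe never reaches the RS trigger level, while the third consecutive miss does. That is the separation between transient loss and failure the three layers are meant to provide.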

Simulation Results: Layered Control
[Plot legend: LCE output, AT output, RS output]
• Show detailed actions of layers
  • LCE output: metric representing the probability that the link has failed
  • AT output: metric with noise filtered
  • RS output: advertised value for the link
  • Red spikes result from back-to-back packet losses
• Setup
  • Link failure at t=[150s-170s]
  • Probe every 300ms, 10% loss
• Results
  • First detection in 0.92s, next at 5.42s
  • Several false positives due to cold start; stabilizes in 100s
  • 0.92s corresponds to 3 lost probes plus propagation delay of 0.02s
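The 0.92s figure can be checked with a short calculation: three probe intervals of 300ms plus 0.02s of propagation delay. The second function is an added illustration rather than a result from the slides: assuming probe losses are independent (a simplification, since the motivation slide stresses bursty loss), it gives the chance that three specific consecutive probes are lost on a healthy link with 10% loss. Function names are hypothetical.

```python
def detection_time(lost_probes, probe_interval_s, propagation_delay_s):
    """Time to detect a failure that requires `lost_probes` consecutive misses."""
    return lost_probes * probe_interval_s + propagation_delay_s


def false_alarm_probability(lost_probes, loss_rate):
    """Chance that `lost_probes` consecutive probes are all lost on a healthy link.

    Assumes independent losses, which the slides do not claim.
    """
    return loss_rate ** lost_probes


print(round(detection_time(3, 0.300, 0.02), 2))       # 0.92
print(round(false_alarm_probability(3, 0.10), 6))     # 0.001
```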

Simulation Results: Reaction Speed
• Reaction speed
  • Probing faster improves speed
  • Probing at intervals below 400ms can give ~1s reaction times
  • Loss decreases reaction time
• Overhead
  • Probing at intervals above 50ms gives reasonable overhead
• Effect of packet loss
  • Increasing packet loss decreases accuracy
  • Advertisements and probes are dropped
  • Sub-second reactions even at 5% loss

Simulation Results: Comparison
• Compared RRAPID, RON, and “Oracle-based” routing
• Results:
  • RON requires 4 to 10x more advertisements than RRAPID
  • RON’s overhead increases exponentially with probe speed; RRAPID’s overhead increases linearly
  • Packet loss has an extreme effect on RON and a moderate effect on RRAPID

Emulation Results: Real Internet Workload
[Plots: Overlay path 1 and Overlay path 2; legend: number of flows on link #1, number of flows on link #2]
• Method
  • Measured performance on a real Internet workload
  • Traces acquired between UIUC and Stanford
  • Emulated a 2-path overlay topology, one trace for each path
  • One natural failure at t=[123.4s to 133.7s]; two introduced failures at t=[40s to 50s] and t=[60s to 70s]
• Result
  • Stable, sub-second reactions

Analysis
• Simplified model of system
  • Modeled RS layer as MIAD
    • Increase by 1, decrease by 1/k
    • Advertisement threshold limited to n
  • Ignored AT layer effects
  • → n*k-state Markov chain
• Given:
  • Probe loss probability p
  • Number of paths N
  • Probe interval I
• We can determine:
  • Speed: average reaction time
  • Overhead: average advertisement rate
• Found best-case expected overhead and reaction time for variable transient loss rates
• Results
  • Can react quickly and stably for fairly large amounts of transient packet loss
  • Overhead and reaction time increase super-linearly with loss rate
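One way to picture the n*k-state chain is sketched below. It is an interpretation under stated assumptions, not the authors' model: the RS layer is represented by a counter stored in units of 1/k and clamped to n*k states, each lost probe (probability p) raises it by 1, and each received probe lowers it by 1/k. The path count N and probe interval I are left out, and the function names are hypothetical. The stationary distribution is found by power iteration.

```python
def build_transition_matrix(n, k, p):
    """Transition matrix of a clamped counter with n*k states.

    State s represents counter value s/k. Assumed dynamics: a lost probe
    (probability p) adds 1 (k states up); a received probe subtracts 1/k
    (one state down).
    """
    size = n * k
    matrix = [[0.0] * size for _ in range(size)]
    for s in range(size):
        up = min(s + k, size - 1)
        down = max(s - 1, 0)
        matrix[s][up] += p
        matrix[s][down] += 1.0 - p
    return matrix


def stationary_distribution(matrix, iterations=5000):
    """Approximate the stationary distribution by repeated multiplication."""
    size = len(matrix)
    dist = [1.0 / size] * size
    for _ in range(iterations):
        nxt = [0.0] * size
        for s, mass in enumerate(dist):
            for t, prob in enumerate(matrix[s]):
                if prob:
                    nxt[t] += mass * prob
        dist = nxt
    return dist


# Long-run fraction of probes that find the counter in its top state,
# for a few transient loss rates (n=3 and k=4 are arbitrary example values).
for loss in (0.05, 0.10, 0.30):
    dist = stationary_distribution(build_transition_matrix(3, 4, loss))
    print(loss, dist[-1])
```

From a chain like this, quantities such as the expected number of probes before the counter first reaches the top state, or the rate of crossings of an advertisement threshold, could be derived. How those map to the reaction time and overhead reported on the slide is not specified there.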

Conclusions
• Can achieve sub-second reactions on most links with reasonable stability
  • Congested links increase reaction time
• Can react well on most Internet links
  • Trade-off between overhead and reaction speed
  • Lossy links worsen reaction time
  • Hard to react quickly and stably if all paths have >10% loss
• Future work:
  • Improve scalability with route aggregation
  • Extend evaluation of system parameters
  • Consider a wider range of topologies, cross traffic, and offered loads
