
End-to-End Fault Tolerance Using Transport Layer Multihoming

Armando L. Caro, Jr. Dissertation Proposal, April 8, 2003.



Presentation Transcript


  1. End-to-End Fault Tolerance Using Transport Layer Multihoming
     Armando L. Caro, Jr.
     Dissertation Proposal, April 8, 2003

  2. [Figure: Host A (A1, A2) and Host B (B1, B2), connected through four ISPs and the Internet]
     Propose to investigate transport layer multihoming for
     • end-to-end fault tolerance (primary goal)
     • improved application performance (secondary goal)

  3. Why Investigate Transport Layer Multihoming?
     • Many applications (e.g., mission-critical) require uninterrupted service
     • Internet path outages are common
       • link failures
       • overloaded links
     • Multiple network interfaces provide network layer redundancy
       • interfaces today are relatively cheap
     Need transport layer support to increase connection resilience during path outages

  4. Can’t Routing Handle Path Outages?
     • Routing does not recover fast enough from link failures
       • [Labovitz 00] measure failure detection and recovery
         • minimum: 3 minutes
         • often: tens of minutes
         • 40% required > 30 minutes
       • [Chandra 01] (using probes)
         • 5% required 2.75 – 27.75 hours!
       • [Paxson 97] (using probes)
         • 1.5 – 3.3% of routes had “serious pathologies”
       • [Labovitz 98] (examining routing table logs)
         • 10% of routes available < 95% of time
         • 65% of routes available < 99.99% of time
     • Routing does not recover at all from overloaded links
       • Flash crowds
       • DoS attacks - statistics in [Moore 01]

  5. SCTP Multihoming
     [Figure: Host A (A1, A2) and Host B (B1, B2), connected through four ISPs and the Internet]
     • hosts choose 1 of 4 possible TCP connections:
       • (A1,B1) or (A1,B2) or (A2,B1) or (A2,B2)
     • 1 SCTP association
       • ({A1,A2}, {B1,B2})
     • concept of “primary” destination
       • Host A → B1
       • Host B → A1
     • network state (RTT, cwnd, ssthresh, …) maintained per destination
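The contrast between the four possible TCP connections and the single SCTP association can be made concrete with a small sketch. The address labels come from the slide's figure; the variable names are illustrative only and do not correspond to any real socket API.

```python
from itertools import product

# Address labels taken from the slide's figure (illustrative strings, not real IPs).
host_a = ["A1", "A2"]
host_b = ["B1", "B2"]

# A TCP connection is bound to exactly one (source, destination) address pair,
# so two dual-homed hosts must pick one of the pairs below.
tcp_choices = list(product(host_a, host_b))

# An SCTP association is identified by both address sets together.
sctp_association = (frozenset(host_a), frozenset(host_b))

print(len(tcp_choices), tcp_choices)
print(sctp_association)
```

With two interfaces per host, TCP has 2 × 2 = 4 candidate pairs, while SCTP keeps all of them available inside one association.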

  6. Current SCTP Failover Mechanism
     • Reachability probes
       • Explicitly with heartbeats
       • Implicitly with data
     [Figure: failover timeline between A and B. Sender: Host A; Primary: B1; Alternate: B2; i = 1, j = 2]
     • Phase I (D_i times out): D_i primary; D_i errors; new => D_i; rtx => D_j
     • Phase II (Path.Max.Retrans exceeded): D_i primary; D_i failed; new => D_j; rtx => D_j
     • Phase III (D_i responds): D_i primary; D_i active; new => D_i; rtx => D_j
     A. Caro, J. Iyengar, P. Amer, G. Heinz, R. Stewart. Using SCTP Multihoming for Fault Tolerance & Load Balancing. SIGCOMM 2002 Poster, August 2002.
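The three phases above can be sketched as a toy model of how the sender picks a destination. This is a simplified illustration, not a real SCTP stack: the class and method names are hypothetical, and it assumes a destination is marked failed once its consecutive-timeout count exceeds Path.Max.Retrans (the value 5 is the recommended setting quoted later in the talk).

```python
from dataclasses import dataclass

PATH_MAX_RETRANS = 5  # recommended setting quoted later in the talk


@dataclass
class Destination:
    name: str
    errors: int = 0      # consecutive timeouts without any response
    active: bool = True


class FailoverSender:
    """Toy model of destination selection; names here are hypothetical."""

    def __init__(self, primary: Destination, alternate: Destination):
        self.primary = primary
        self.alternate = alternate

    def on_timeout(self, dest: Destination) -> None:
        dest.errors += 1
        if dest.errors > PATH_MAX_RETRANS:
            dest.active = False          # Phase II: destination marked failed

    def on_response(self, dest: Destination) -> None:
        # A heartbeat ack or data ack clears the error count (Phase III).
        dest.errors = 0
        dest.active = True

    def new_data_target(self) -> Destination:
        # New data follows the primary unless the primary is marked failed.
        return self.primary if self.primary.active else self.alternate

    def rtx_target(self) -> Destination:
        # Current policy: retransmit to the alternate while it is usable.
        return self.alternate if self.alternate.active else self.primary


b1, b2 = Destination("B1"), Destination("B2")
sender = FailoverSender(primary=b1, alternate=b2)
for _ in range(PATH_MAX_RETRANS + 1):    # six consecutive timeouts on B1
    sender.on_timeout(b1)
print(sender.new_data_target().name)     # B2
sender.on_response(b1)                   # a single response from B1
print(sender.new_data_target().name)     # B1
```

Note that the primary itself never changes in this model; traffic returns to B1 as soon as it responds once, which is the "temporary" failover discussed in Issue 1.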

  7. SCTP Failover:
     • Issue 1 - Failover is “temporary”
     • Issue 2 - Retransmission Policy
     • Issue 3 - Failure Detection Time
     • Issue 4 - No Source Interface Selection

  8. SCTP Failover: Issue 1 - Failover is “temporary”

  9. SCTP Failover: Issue 1 - Failover is “temporary”
     • Current failover policy
       • Traffic is redirected back to primary when primary responds to a single heartbeat
       • i.e., primary destination is never changed
     • Why keep the primary destination?
       • Assumes application has a preferred destination
     • We found returning to the primary may be inefficient*
       • at time of return
         • primary’s cwnd = 1 MTU & ssthresh = 2 MTU
         • alternate’s cwnd > 1 MTU & ssthresh > 2 MTU
     • We propose to investigate “permanent” failover when no destination is preferred
     • One successful heartbeat may not reliably indicate that a path outage has ended
       • overloaded links may need more probing
     • We propose to investigate other probing techniques
     * A. Caro, J. Iyengar, P. Amer, G. Heinz, R. Stewart. A Two-level Threshold Recovery Mechanism for SCTP. SCI 2002, July 2002.

  10. Two-level Threshold Failover
      • α = temporary failover
      • β = auto change primary
      [Figure: four-phase failover timeline between A and B; i = 1, j = 2; transition labels: D_i times out, α, β, D_i responds, i ↔ j. The rtx row of the figure is not recoverable from this transcript.]
      • Phase I: D_i primary; D_i errors; new => D_i
      • Phase II: D_i primary; D_i failed; new => D_j
      • Phase III: D_j primary; D_i failed; new => D_j
      • Phase IV: D_i primary; D_i active; new => D_i
      A. Caro, J. Iyengar, P. Amer, G. Heinz, R. Stewart. A Two-level Threshold Recovery Mechanism for SCTP. SCI 2002, July 2002.
      A. Caro, J. Iyengar, P. Amer, G. Heinz, R. Stewart. Using SCTP Multihoming for Fault Tolerance & Load Balancing. SIGCOMM 2002 Poster, August 2002.
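One possible reading of the two thresholds is sketched below. The transcript does not define what α and β count, so this sketch assumes both are limits on the primary's consecutive-timeout count with α ≤ β; the class name, method names, and the example values 5 and 8 are hypothetical and not taken from the talk.

```python
class TwoLevelFailover:
    """Toy two-threshold model; alpha/beta semantics are an assumption."""

    def __init__(self, alpha: int, beta: int):
        if alpha > beta:
            raise ValueError("assumed ordering: alpha <= beta")
        self.alpha = alpha
        self.beta = beta
        self.primary, self.alternate = "D1", "D2"
        self.errors = {"D1": 0, "D2": 0}   # consecutive timeouts per destination

    def on_timeout(self, dest: str) -> None:
        self.errors[dest] += 1
        if dest == self.primary and self.errors[dest] > self.beta:
            # Second threshold: the alternate is promoted to primary.
            self.primary, self.alternate = self.alternate, self.primary

    def on_response(self, dest: str) -> None:
        self.errors[dest] = 0

    def new_data_target(self) -> str:
        # First threshold: temporary failover while the primary keeps its role.
        if self.errors[self.primary] > self.alpha:
            return self.alternate
        return self.primary


# Outage that ends after alpha is exceeded but before beta: traffic returns.
short = TwoLevelFailover(alpha=5, beta=8)
for _ in range(6):
    short.on_timeout("D1")
print(short.primary, short.new_data_target())   # D1 D2
short.on_response("D1")
print(short.primary, short.new_data_target())   # D1 D1

# Outage that also exceeds beta: the primary changes and stays changed.
long_outage = TwoLevelFailover(alpha=5, beta=8)
for _ in range(9):
    long_outage.on_timeout("D1")
long_outage.on_response("D1")
print(long_outage.primary, long_outage.new_data_target())   # D2 D2
```

Under these assumptions a short outage behaves like the current temporary failover, while a long outage makes the change of primary stick even after the old primary responds.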

  11. SCTP Failover: Issue 2 – Retransmission Policy

  12. SCTP Failover: Issue 2 – Retransmission Policy
      • Current retransmission policy
        • If peer is multihomed, retransmit to an alternate destination
      • Why the alternate destination?
        • Attempts to improve chances of success
        • No prior research to demonstrate benefits
      • We found that this policy degrades performance in many circumstances*
        • Not enough traffic on the alternate path to accurately measure RTT
          …so timeouts are LONG!*
      • We propose to investigate alternative policies
      * A. Caro, P. Amer, R. Stewart. Transport Layer Multihoming for Fault Tolerance in FCS Networks. CTA 2003, April 2003. (Submitted to MILCOM 2003)

  13. Potential Solutions
      • Solution 1: Retransmissions to Same Destination
        • Pro: uses destination with accurate RTT; cwnd benefits for primary
        • Con: fewer successful transmits if primary failed
      • Solution 2: Heartbeat After RTO (Randall Stewart’s idea)
        • Pro: immediate opportunity to measure RTT after RTO backoff
        • Con: still few samples to estimate alternate RTT
      • Solution 3: Timestamps
        • Pro: Karn’s Algorithm not needed; more RTT samples on alternate
        • Con: 12-byte overhead in each packet
      • Solution 4: Our Multiple Fast Retransmit Algorithm
        • Pro: minimizes number of timeouts
        • Con: no extra RTT samples on alternate
      • Solution 5: Rtx to Same Destination & Multiple Fast Rtx
      A. Caro, P. Amer, J. Iyengar, R. Stewart. Retransmission Policies with Transport Layer Multihoming. UD CIS TR2003-05, March 2003. (submitted to ICON 2003)

  14. Simulation Topology

  15. Methodology
      • A→B traffic
        • 4MB file transfer
        • Packet sizes: 100% @ 1500B
      • Cross-traffic
        • Self-similar (aggregation of Pareto sources)
        • Packet sizes: 50% @ 40B, 25% @ 576B, 25% @ 1500B
        • Load: 5Mbps – 11Mbps (producing varying loss rates)
      • Simulation parameters (60 runs per combo)
        • Cross-traffic on primary destination path
        • Cross-traffic on alternate destination path
        • Retransmission policy (current policy, or 1 of 5 solutions)
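As a quick sanity check on the cross-traffic description, the stated packet-size mix implies a mean packet size, and from it a rough mean packet rate for a given load. The sketch below assumes 1 Mbps = 10^6 bits/s and treats the listed sizes as full packet sizes; the function name is hypothetical.

```python
# Cross-traffic packet-size mix as stated on the slide: (size in bytes, share).
size_mix = [(40, 0.50), (576, 0.25), (1500, 0.25)]

# Mean packet size is the share-weighted sum of the sizes.
mean_size_bytes = sum(size * share for size, share in size_mix)


def packets_per_second(load_mbps: float) -> float:
    """Mean packet rate for an offered load, assuming 1 Mbps = 10**6 bits/s."""
    return load_mbps * 1_000_000 / (mean_size_bytes * 8)


print(mean_size_bytes)                 # 539.0
for load in (5, 11):                   # the two ends of the stated load range
    print(load, round(packets_per_second(load)))
```

The mean works out to 0.5 × 40 + 0.25 × 576 + 0.25 × 1500 = 539 bytes, so most cross-traffic packets are far smaller than the 1500-byte packets of the measured transfer.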

  16. SCTP Failover: Issue 3 - Failure Detection Time

  17. SCTP Failover: Issue 3 - Failure Detection Time
      • Current SCTP recommends static parameter settings:
        • RTO (min, max): (1, 60) seconds
        • Path.Max.Retrans: 5 attempts per destination
        • Heartbeat Interval: 30 seconds
      • Best case failure detection is 1+2+4+8+16+32 = 63 seconds!*
      • [Jungmaier 02] improves performance by lowering parameter settings, but
        • their experimental network had
          • fixed delays (i.e., no delay spikes)
          • no cross-traffic (i.e., no congestion)
        • RTO.Min < 1 second goes against the recommendation in [Allman 99]
      • We propose to
        • further investigate static parameter settings in a more realistic environment
        • investigate dynamically changing parameters based on
          • path metrics (RTT, loss rate)
          • application requirements (high throughput, low delay, low loss)
      * A. Caro, J. Iyengar, P. Amer, G. Heinz, R. Stewart. A Two-level Threshold Recovery Mechanism for SCTP. SCI 2002, July 2002.
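The 63-second figure follows from the recommended settings, and the same arithmetic shows how the detection time depends on the starting RTO. The sketch below assumes the RTO doubles after every timeout, is capped at RTO.Max, and that failure is declared on the timeout that pushes the error count past Path.Max.Retrans (six consecutive timeouts with the default of 5). The function name is hypothetical, and the second printed value is derived under these assumptions rather than quoted from the slides.

```python
def detection_time(initial_rto: float,
                   rto_max: float = 60.0,
                   path_max_retrans: int = 5) -> float:
    """Seconds of consecutive timeouts before a destination is marked failed."""
    total = 0.0
    rto = initial_rto
    for _ in range(path_max_retrans + 1):   # PMR + 1 consecutive timeouts
        total += rto
        rto = min(rto * 2, rto_max)         # exponential backoff, capped
    return total


print(detection_time(1.0))    # 63.0  (1+2+4+8+16+32, RTO starts at RTO.Min)
print(detection_time(60.0))   # 360.0 (RTO already at RTO.Max)
```

Lowering Path.Max.Retrans or RTO.Min shrinks the sum quickly, which is why the static settings are a natural first target for tuning.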

  18. Congestion Control Improvement
      • Introduce Fast Recovery mechanism
        • Avoids multiple cwnd reductions in a single RTT
        • Similar to New-Reno TCP’s Fast Recovery
      • Introduce new policy which restricts cwnd increasing during Fast Recovery
        • Maintains conservative behavior
      • Modify SCTP’s Fast Retransmit
        • Avoids unnecessary delays of retransmissions
      A. Caro, K. Shah, J. Iyengar, P. Amer, R. Stewart. SCTP and TCP Variants: Congestion Control Under Multiple Losses. UD CIS TR2003-04, February 2003. (submitted to ACM CCR)
      R. Stewart, L. Ong, I. Arias-Rodriguez, K. Poon, P. Conrad, A. Caro, M. Tuexen. SCTP Implementer’s Guide. draft-ietf-tsvwg-sctpimpguide-08.txt, March 2003.

  19. Drop Scenarios
      • One drop
      • Two drops
      • Three drops
      • Four drops
      Scenarios from: Kevin Fall and Sally Floyd. Simulation-based Comparisons of Tahoe, Reno, and SACK TCP. In ACM Computer Communications Review, 26(3):5-21, July 1996.

  20. SCTP Failover: Issue 4 - No Source Interface Selection

  21. SCTP Failover: Issue 4 - No Source Interface Selection
      [Figure: A and B. Sender: Host A; Primary: B1; Alternate: B2]
      • Current SCTP
        • transport sender only specifies destination IP address
        • but network layer determines outgoing source IP address/interface
      • Why is this a problem?
        • Suppose A’s network layer routes packets to B1 & B2 via A1

  22. SCTP Failover: Issue 4 (cont’d)
      • For full multihoming flexibility
        • endpoint’s IP stack should support multiple default routes
        • SCTP should specify the source-destination pair for sending traffic
      • Stewart and Lei’s KAME implementation
        • supports experimental options for source interface selection
        • maintains network state per destination
        • varies source address to same destination until destination failure detected
      • [Kubo 03] propose a failover scheme that
        • maintains network state per source-destination pair
        • detects failures per source-destination pair
      • We propose to further investigate source interface selection

  23. Plan of Study

  24. Plan of Study (in progress)
      • Retransmission policies with multihoming (issue 2)
        • other file sizes
        • other cross traffic types (Exponential aggregate, etc)
      • SCTP vs TCP variants under multiple losses (issue 3)
        • more extensive loss scenarios
      • Analytic SCTP model (issues 1 & 3)
        • build on TCP models in [Padhye 98] and [Cardwell 00]
        • use to investigate static failover parameter settings

  25. Plan of Study (future)
      • Adaptive failover algorithm (issue 3)
        • dynamically adjust thresholds based on
          • path metrics (RTT, loss rate)
          • app requirements (high throughput, low delay, low loss)
      • Probing mechanism (issue 1)
        • investigate use of packet pairs or small packet trains
      • Source interface selection (issue 4)
        • evaluate proposed solutions by [Stewart KAME] and [Kubo 03]
        • investigate other possible solutions
      • Final failover mechanism evaluation
        • simulation
        • empirical study

  26. Any Questions?
