

  1. A Survey on High Availability Mechanisms for IP Services, 11 October 2005 • N. Ayari, FT R&D • D. Barbaron, FT R&D • L. Lefevre, INRIA • P. Primet, INRIA • 2005 High Availability and Performance Computing Workshop (HAPCW'2005) • Santa Fe, USA

  2. Introduction: Different types of clusters • MPP and SMP clusters • Scalability via CPU and memory interconnects • Use special-purpose hardware and/or software • High availability through • Job scheduling and migration • Fault detection and checkpointing • Clusters of independently working nodes • An attractive alternative based on commodity hardware and general-purpose operating systems • Scalability achieved by efficiently distributing incoming requests across the available nodes • High availability? • Service continuity (non-interruption) and service integrity

  3. Introduction: Scalability issues in clusters of commodity hw/sw nodes • The request distribution should • Increase performance by • Improving system responsiveness • Supporting more concurrent connections per unit of time • Keeping response times reasonable • When is the bottleneck observed? • Support upper-layer session integrity • Integrity depends on the switching granularity • Distribution on a per-datagram, per-connection, or per-session basis

  4. Switch designs • Can be • Stateless or stateful • Applies to • Layer 4 switching • Uses layer 2-4 packet information (TCP/IP model) • Layer 5 switching • Uses layer 2-5 packet information (TCP/IP model)

  5. Stateless vs stateful switch designs: Stateless switch design • Stateless switch design • Achieves better latency by • Processing each datagram independently of its predecessors • Does not maintain any state information • Implements service integrity • On a per-connection basis in layer 4 switching • Uses hashing to select the same cluster node for all datagrams originating from the same client, identified by <IP address, port number, protocol> (see the sketch below) • On a per-session basis in layer 5 switching • Depends on the application carried over IP • - Cookie-based persistency for web traffic • - Cookie switching • - Cookie-based hashing • What about other data applications?
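A minimal sketch of this stateless, hash-based node selection, assuming a fixed pool of back-end nodes (the addresses, the SHA-1 digest, and the function names are illustrative, not from the survey):

    import hashlib

    NODES = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # hypothetical back-end pool

    def select_node(src_ip: str, src_port: int, protocol: str) -> str:
        """Map a client flow <IP address, port, protocol> to one node.

        No per-connection state is kept: the same tuple always hashes
        to the same node, which yields per-connection integrity."""
        key = f"{src_ip}:{src_port}:{protocol}".encode()
        digest = hashlib.sha1(key).digest()
        return NODES[int.from_bytes(digest[:4], "big") % len(NODES)]

    # Every datagram of the same connection lands on the same node.
    assert select_node("192.0.2.7", 5060, "tcp") == select_node("192.0.2.7", 5060, "tcp")

Note the modulo over len(NODES): this is precisely the dependency on the number of active nodes that the next slide lists as a limitation.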

  6. Stateless vs stateful switch designs: Stateless switch design limitations • Upper-layer session integrity • A request belonging to one session may go to the wrong server • Hash collisions require robust hash functions • Faulty node handling • When the hash function depends on the number of active nodes • All sessions must be replayed when one or more nodes crash (see the sketch below) • Fair load distribution • The stateless nature imposes static load balancing • Source hashing • While requests have varying service times and resource demands • Long SIP sessions, bandwidth-consuming FTP transfers, etc.
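The fault-handling point above is the modulo dependency of the previous sketch. One common remedy, not covered in the survey itself, is consistent hashing: a node failure only remaps the flows that hashed to that node's positions on the ring. A minimal sketch:

    import bisect
    import hashlib

    def _h(value: str) -> int:
        return int.from_bytes(hashlib.sha1(value.encode()).digest()[:4], "big")

    class ConsistentHashRing:
        """Hash ring with virtual nodes: the mapping of most flows
        survives the removal of a single server."""

        def __init__(self, nodes, vnodes=64):
            self._ring = sorted((_h(f"{n}#{i}"), n)
                                for n in nodes for i in range(vnodes))
            self._keys = [k for k, _ in self._ring]

        def select(self, flow_key: str) -> str:
            # First ring position clockwise of the flow's hash value.
            i = bisect.bisect(self._keys, _h(flow_key)) % len(self._ring)
            return self._ring[i][1]

    ring = ConsistentHashRing(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
    print(ring.select("192.0.2.7:5060:tcp"))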

  7. Stateless vs stateful switch designs: Stateful switch designs • Aim to improve both • Upper-layer session integrity • By maintaining connection/session state • Source and destination IP addresses, port numbers, transport protocol • - No semantics to delimit a UDP 'connection' • Maintain multi-purpose timers • Avoid keeping inactive sessions/connections • - A DDoS countermeasure • Compute statistics such as the client's average session duration • Need to speed up the lookup for each datagram • Use index hashing • Load distribution fairness • Using service-state-aware load distribution policies • (A connection-table sketch follows.)
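A minimal sketch of a stateful connection table with an idle timer, along the lines the slide describes (the field names, the scheduling callback, and the 30-second timeout are illustrative assumptions):

    import time

    IDLE_TIMEOUT = 30.0  # seconds; illustrative, not from the survey

    class ConnectionTable:
        """Maps a 5-tuple to a back-end node and expires idle entries,
        so inactive sessions do not accumulate (a DDoS countermeasure)."""

        def __init__(self):
            self._table = {}  # 5-tuple -> (node, last_seen)

        def lookup_or_bind(self, five_tuple, schedule):
            """Return the bound node, binding via schedule() on first sight."""
            now = time.monotonic()
            entry = self._table.get(five_tuple)
            if entry and now - entry[1] < IDLE_TIMEOUT:
                node = entry[0]          # existing connection: keep affinity
            else:
                node = schedule()        # new (or expired) connection
            self._table[five_tuple] = (node, now)
            return node

        def expire(self):
            """Drop entries idle longer than IDLE_TIMEOUT."""
            now = time.monotonic()
            self._table = {k: v for k, v in self._table.items()
                           if now - v[1] < IDLE_TIMEOUT}

    table = ConnectionTable()
    node = table.lookup_or_bind(("192.0.2.7", 5060, "203.0.113.10", 80, "tcp"),
                                schedule=lambda: "10.0.0.2")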

  8. Stateless vs stateful switch designs: Stateful design limitations • Cost • Overhead of distributing server state • Efficiency depends on the granularity of the switching operation • Layer 4 or layer 5? • Does layer 4 scale for all IP services? • Load distribution fairness? • The decision is taken on the first datagram of a session/connection • New mechanisms are needed

  9. Fair scheduling • How to measure load? • Using a robust, simple, quickly adapting summary metric • CPU, memory, and disk I/O utilization • Number of active application processes and connections • Availability of network protocol buffers • Number of active users • Policies? • Static • Randomization, (weighted) round robin, source/destination hashing • Dynamic (server/client state aware) • (Weighted) least connections, shortest expected delay, minimum misses (see the sketch below)
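A minimal sketch of the weighted least-connections policy mentioned above; the pool, weights, and connection counts are illustrative:

    from dataclasses import dataclass

    @dataclass
    class Server:
        addr: str
        weight: int          # static capacity hint set by the operator
        active: int = 0      # current connection count (dynamic state)

    def weighted_least_connections(servers):
        """Pick the server with the smallest active/weight ratio,
        i.e. the least loaded relative to its declared capacity."""
        candidates = [s for s in servers if s.weight > 0]
        return min(candidates, key=lambda s: s.active / s.weight)

    pool = [Server("10.0.0.1", weight=3, active=12),
            Server("10.0.0.2", weight=1, active=3),
            Server("10.0.0.3", weight=2, active=5)]
    chosen = weighted_least_connections(pool)   # 10.0.0.3: 5/2 is the smallest ratio
    chosen.active += 1                          # bind the new connection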

  10. Fair scheduling (cont.) • [Diagram: server utilization thresholds, from under-utilized below TLow, to fully utilized at THigh, to unacceptable above TLimit] • Policies? • Dynamic (server/client state aware) (cont.) • Cache affinity • The file space is partitioned among the nodes • SITA-E (Size Interval Task Assignment with Equal load), ... • The node is determined by the 'size' of the request • CAP (Client-Aware Policy) • Consecutive connections from the same client are assigned to the same node • Admission control policies • Locality-Based Least-Connection, Locality-Based Least-Connection with Replication

  11. Fair scheduling (cont.) • Policies? • Network-traffic-based balancing • Predicts the volume of incoming traffic from a source based on past history • Priority-based balancing • Assigns higher priority to some data traffic • Topology-based redirection • Redirects traffic to the cluster nearest the client in terms of • Hop count (static) • Network latency (dynamic) • Application-specific redirection • Layer 5 load balancing specializes back-end servers for particular contents or services • Etc.

  12. Layer 4 switching: How? • [Diagram: layer 4 switch architectures (client / switch / servers). Two-way architectures: packet double rewriting. One-way architectures: packet single rewriting, packet forwarding, packet tunnelling] • Works at the TCP/IP level • Content-blind switching • Layer 4 switches (a rewriting sketch follows)
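A minimal sketch of the two-way (NAT-style) double-rewriting idea: the switch rewrites the destination of inbound packets to the chosen real server and the source of outbound replies back to the virtual address, so the client only ever sees the virtual IP. The packet fields and addresses are illustrative assumptions:

    from dataclasses import dataclass

    @dataclass
    class Packet:
        src: str
        sport: int
        dst: str
        dport: int

    VIP = ("203.0.113.10", 80)   # virtual service address (illustrative)
    REAL = ("10.0.0.2", 8080)    # chosen real server (illustrative)

    def rewrite_inbound(p: Packet) -> Packet:
        """Client -> switch: rewrite the destination VIP to the real server."""
        assert (p.dst, p.dport) == VIP
        return Packet(p.src, p.sport, REAL[0], REAL[1])

    def rewrite_outbound(p: Packet) -> Packet:
        """Server -> switch: rewrite the source back to the VIP; replies
        transit the switch too, hence 'two-way'."""
        assert (p.src, p.sport) == REAL
        return Packet(VIP[0], VIP[1], p.dst, p.dport)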

  13. Layer 4 switching: A kernel implementation • [Diagram: an LVS director stacking cluster management, KTCPVS, and IPVS components] • The IP Virtual Server (IPVS) implementation • Supports NAT, DR, and tunnelling • As add-on modules in the networking layer of the kernel • Based on the Linux packet filtering and routing capabilities • The Linux Virtual Server (LVS) • A cluster of independently working nodes • Using the IPVS load balancer • Some recommendations [WZ]

  14. Layer 4 switching: Performance [Rou2001] • [Chart: LVS-NAT vs. LVS-DR scaling on a single-CPU Linux 2.2 load balancer]

  15. Layer 4 switching: Some layer 4 switching products

  16. Layer 4 switching: The Netfilter capabilities and return codes • [Diagram: the Netfilter packet path. Incoming packets pass sanity checks, then NF_IP_PREROUTING (DNAT); the routing decision sends them to NF_IP_LOCAL_IN (INPUT, toward a local process) or NF_IP_FORWARD (FORWARD); locally generated packets traverse NF_IP_LOCAL_OUT (OUTPUT); all outgoing packets pass NF_IP_POSTROUTING (SNAT) before the layer 2 functions]

  17. Layer 4 switching: The IPVS architecture • [Diagram: packet flow through PREROUTING, the routing decision, LOCAL_IN / FORWARD, LOCAL_OUT, and POSTROUTING; IPVS intercepts packets destined to the virtual service and redirects them toward the real servers]

  18. Layer 4 switching: Persistency handling • [Diagram: the same Netfilter path as above, with packets marked so that datagrams from the same client are recognized and directed to the same real server]

  19. Layer 4 switching: Issues • [Diagram: SIP signalling (INVITE / 100 / 180 / 200 / ACK, RTP media, then BYE / 200) traversing two record-route stateful proxies and a stateless proxy, with separate client and server transactions per hop] • The persistence template for layer 4 switching may not scale • Example: VoIP data exchange using SIP • Different transport connections for different transactions within the same SIP session • Session corruption implies datagram losses • More latency (TCP AIMD)

  20. Layer 5 switching: The solution? • The switch remains the single view of the cluster • The request distribution is done on the basis of • The load estimation of the cluster's nodes • The connection identifiers of the request • <source and destination IP addresses, source and destination port numbers, protocol> • The session identifiers of the request and the content type • Layer 5 header information • Additional delay • The switch must complete the connection before it can parse the data

  21. Layer 5 switching: The solution? Layer 5 switches • [Diagram: layer 5 switch architectures. Two-way architectures: TCP gateway, TCP splicing and its variants. One-way architectures: TCP handoff and its variants]

  22. Layer 5 switching: TCP gateway, the problems • [Diagram: content-based switch relaying between client and server; application-layer forwarding in user space copies data between the receive and send buffers of two kernel-space TCP connections] • Costly • Multiple data copies and context switches • The proxy rapidly becomes the bottleneck because it is a two-way architecture

  23. Layer 5 switching: TCP splicing, the packet mapping operations • [Diagram: the content-based switch splices the two connections in the kernel; packets are forwarded at the network layer with header translation of the TCP fields (source/destination ports, sequence and ACK numbers, flags, advertised window, checksum, urgent pointer, options, padding)] • Modifications also affect • The IP pseudo-header • Socket options • (A sequence-number mapping sketch follows.)
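A minimal sketch of the core mapping: once the switch has completed one handshake with the client and another with the server, splicing reduces to adding fixed offsets to the sequence and ACK numbers in each direction (notation borrowed from the timeline on the next slide; the class and names are illustrative):

    class SpliceMapping:
        """Translate sequence/ACK numbers between the client-side and
        server-side connections of a spliced TCP session.
        CSEQ/SSEQ: client/server initial sequence numbers;
        PrSEQ/PsSEQ: the proxy's ISNs toward client and server."""

        MOD = 2 ** 32  # TCP sequence numbers wrap at 32 bits

        def __init__(self, cseq, pr_seq, ps_seq, sseq):
            self.c2s = (ps_seq - cseq) % self.MOD   # client->server seq offset
            self.s2c = (pr_seq - sseq) % self.MOD   # server->client seq offset

        def to_server(self, seq, ack):
            """Rewrite a client packet for the server-side connection."""
            return (seq + self.c2s) % self.MOD, (ack - self.s2c) % self.MOD

        def to_client(self, seq, ack):
            """Rewrite a server packet for the client-side connection."""
            return (seq + self.s2c) % self.MOD, (ack - self.c2s) % self.MOD

After each rewrite the TCP checksum (and hence the IP pseudo-header computation) must be updated as well, as the slide notes.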

  24. Layer 5 switching: TCP splicing message timeline, the delayed binding • [Message timeline: 1. Client -> switch: SYN (CSEQ) • 2. Switch -> client: SYN (PrSEQ), ACK (CSEQ+1) • 3. Client -> switch: ACK (PrSEQ+1) • 4. Client -> switch: DATA (CSEQ+1); the switch now performs scheduling and packet rewriting • 5. Switch -> server: SYN (PsSEQ) • 6. Server -> switch: SYN (SSEQ), ACK (PsSEQ+1) • 7. Switch -> server: ACK (SSEQ+1) • 8. Switch -> server: DATA (PsSEQ+1) • 9. Server -> switch: DATA (SSEQ+1), ACK (PsSEQ+1+len) • 10. Switch -> client, after packet rewriting: DATA (PrSEQ+1), ACK (CSEQ+1+len)]

  25. Layer 5 switching: TCP splicing, the issues • Delayed binding • Double processing overhead • Two-way switch mechanism • Buffer sizing for large-scale forwarders • The transition between control mode and forwarder mode • Either delay the activation of the spliced connection until the buffers have drained • Or forward data concurrently with draining the buffers • End-to-end flow control • The advertised window may change abruptly across the splice (from small/big AdvWin to big/small AdvWin)

  26. Layer 5 switching: TCP splice improvements • Pre-forking TCP splice • Reduces the three-way handshake cost • Pre-allocated server scheme • Guesses the real server on receipt of the TCP SYN • Etc.

  27. Layer 5 switching: TCP handoff • [Diagram: client, switch, and server TCP/IP stacks; the switch sends Conn_Info and Handoff messages (carrying MagicNber and ConnMagic) to the server, which replies with an Ack; a forward module then relays client packets] • One-way mechanism • Migrates the TCP connection from the front end to the back-end server using the handoff protocol's Msg/Ack exchange • MagicNber = handoff protocol identifier, ConnMagic = next sequence number; the Ack message reports the handoff result • The connection is re-created on the back end without going through the three-way handshake (see the sketch below)
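A minimal sketch of what a handoff message has to carry: enough TCP state for the back end to re-create the established connection without a new handshake. The wire format below is hypothetical, not the actual TCPHA protocol:

    import socket
    import struct

    # Hypothetical handoff message: magic identifier, client endpoint,
    # and the sequence state of the already-established connection.
    HANDOFF_FMT = "!I4sHIII"  # magic, client IP, client port, snd_nxt, rcv_nxt, adv_win

    def pack_handoff(magic, client_ip, client_port, snd_nxt, rcv_nxt, adv_win):
        """Serialize the connection state to ship to the back-end server."""
        return struct.pack(HANDOFF_FMT, magic,
                           socket.inet_aton(client_ip), client_port,
                           snd_nxt, rcv_nxt, adv_win)

    def unpack_handoff(payload):
        """Recover the state on the back end to rebuild the TCP socket."""
        magic, ip_raw, port, snd_nxt, rcv_nxt, adv_win = struct.unpack(HANDOFF_FMT, payload)
        return magic, socket.inet_ntoa(ip_raw), port, snd_nxt, rcv_nxt, adv_win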

  28. Layer 5 switching: TCP handoff message timeline • [Message timeline: 1. Client -> switch: SYN (CSEQ) • 2. Switch -> client: SYN (PrSEQ), ACK (CSEQ+1) • 3. Client -> switch: ACK (PrSEQ+1) • 4. Client -> switch: DATA (CSEQ+1); the switch performs scheduling and connection migration • 5. Switch -> server: Migrate Request (DATA, CSEQ, PrSEQ) • 6. Server -> client: DATA (PrSEQ+1), ACK (CSEQ+1+len), sent directly (one-way) • 7. Client ACKs and further DATA pass through the switch's forward module with packet rewriting • 8. The FIN/ACK teardown follows the same path]

  29. TCP handoff vs TCP splice • Based on the LVS TCPSP and TCPHA implementations for the 2.4 kernel • [Chart: Apache throughput (conn/sec) vs. number of back-end nodes in the cluster, for a 13 KB file] • The overhead of L7 processing makes the front end the bottleneck, hence low scalability

  30. Layer 5 switching: The limitations • Highly available connections? • Connection failover • One-way vs two-way architectures • Improvements on TCP handoff • Current implementations do not cover all data traffic

  31. Layer 5 switching: Some layer 5 switching products

  32. High availability • How to detect that a member has failed? • Pings, timeouts • Heartbeat message exchange • Status, cluster transition, and retransmission messages • TCPHA includes state message exchange • The accuracy of the failure detection • Timeouts with multiple retries detect failures with high probability (see the sketch below) • How to recover by failover • Load balancer failover • State synchronization • Subsystem failover • IP takeover through channel bonding • Application failover • The Linux watchdog timer interface, etc.
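A minimal sketch of timeout-based failure detection with multiple retries, in the spirit of the heartbeat exchange described above (the interval and retry count are illustrative assumptions):

    import time

    HEARTBEAT_INTERVAL = 1.0   # seconds between expected heartbeats (illustrative)
    MAX_MISSES = 3             # missed heartbeats before declaring failure (illustrative)

    class FailureDetector:
        """Declares a peer failed only after several missed heartbeats,
        trading detection latency for a low false-positive rate."""

        def __init__(self):
            self._last_seen = {}   # peer -> timestamp of last heartbeat

        def on_heartbeat(self, peer):
            self._last_seen[peer] = time.monotonic()

        def failed_peers(self):
            now = time.monotonic()
            deadline = MAX_MISSES * HEARTBEAT_INTERVAL
            return [p for p, t in self._last_seen.items() if now - t > deadline]

    detector = FailureDetector()
    detector.on_heartbeat("lb-backup")      # heartbeat received over the network
    assert detector.failed_peers() == []    # within the deadline, still alive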

  33. High availability • More on connection failover • Through connection migration and reliable sockets • Different from TCP handoff • Approaches include • Migratory TCP • Fault-tolerant TCP • Connection passing

  34. High availability: The accuracy in distributed architectures • DNS: scalability through site redundancy • DNS SRV RRs used in service location • Locating available SIP proxies (see the lookup sketch below) • The effectiveness of DNS-based scalability and failover is undermined by the DNS cache update frequency
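A minimal sketch of SRV-based SIP proxy location, using the third-party dnspython package (the domain name is a placeholder):

    import dns.resolver  # third-party package: dnspython

    def locate_sip_proxies(domain: str):
        """Return (priority, weight, host, port) for each SIP-over-UDP proxy
        advertised in the domain's SRV records, preferred proxies first."""
        answers = dns.resolver.resolve(f"_sip._udp.{domain}", "SRV")
        records = [(r.priority, r.weight, str(r.target).rstrip("."), r.port)
                   for r in answers]
        return sorted(records)  # lower priority value = preferred proxy

    # for prio, weight, host, port in locate_sip_proxies("example.com"):
    #     print(prio, weight, host, port)

Even with correct SRV records, stale resolver caches can keep sending clients to a failed proxy, which is the accuracy problem the slide points out.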

  35. High availability: The accuracy in distributed architectures • RSerPool • [Diagram: pool elements (PEs) in server pools register with an ENRP server, which propagates PE state updates to a redundant ENRP server; a pool user (PU) performs name resolution against the ENRP server to reach a PE]

  36. High availability: Other tips for distributed architectures • Multicast • Needs explicit support from all routers on the client-server path • IP anycast route redundancy • Different servers running the same service can all carry the same anycast address on one of their interfaces • If a server fails, the routers update their routes toward the nearest available node • Depends on the routers' update frequency

  37. Conclusion and future directions • Further work will address • A kernel implementation of layer 5 switching to handle session-oriented data transfers • Improvements to the forwarder kernel component • Fair load distribution for session-oriented data transfers • IPv6 compliance? • Security concerns in connection failover

  38. THANKS
