High Availability Almost Everywhere? Ramon Kagan Computing and Network Services York University
Agenda • Analysis of previous pseudo-high availability solutions • Additions to those solutions to create high availability or higher availability • Look at what we have done at YorkU • What this achieves over and above high availability • Technical in the middle
The Situation • Lack of financial resources for a large midrange/mainframe true clustering server • Lack of financial resources for overtime hours for regular maintenance • SLAs – understood or written
Past Solutions • Purchase multiple small systems with combined computing power • Create service “clusters” • Use of load balancing techniques • DNS Shuffle records • Proxies • Switch/Router load balancing • Linux Virtual Server (LVS)
Past Solutions – DNS Shuffle Advantages • Simple to set up • Simple RR scheme • No special requirements for servers Disadvantages • RR doesn't account for different system types • A faulty system still receives 1 in N of the requests • Sys admin may not be in control of DNS • External resolvers only see DNS updates after the TTL expires
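For illustration only (the name and addresses below are made up), a shuffle-record setup is just multiple A records for the same name; BIND rotates the answer order on each query, and the short TTL is the only lever for limiting how long a dead server keeps receiving hits:

    ; hypothetical zone fragment: round-robin A records
    www    300  IN  A  192.0.2.11
    www    300  IN  A  192.0.2.12
    www    300  IN  A  192.0.2.13
    ; caching resolvers keep handing out a failed address until the 300 s TTL expires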
Past Solutions – Proxies Advantages • Sys admin in full control • Easily updated on the fly • No special requirements for servers • Ability to log activity Disadvantages • Single point of failure • Proxy must handle the entire service bandwidth • A faulty system keeps receiving traffic until the redirection rules are modified
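As a rough sketch of the proxy approach, assuming Apache with mod_rewrite and mod_proxy (the map file and pool members are hypothetical), a random-choice rewrite map spreads requests across the real servers:

    # /etc/apache/pool.map (hypothetical): "rnd" picks one member at random
    pool  www1.example.yorku.ca|www2.example.yorku.ca|www3.example.yorku.ca

    # httpd.conf excerpt
    RewriteEngine On
    RewriteMap  servers  rnd:/etc/apache/pool.map
    RewriteRule ^/(.*)$  http://${servers:pool}/$1  [P,L]
    # removing a faulty server means editing the map by hand; nothing does it for you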
Past Solutions – Switch/Router Load Balancing Advantages • No additional hardware • Less "bouncing" around • No special requirements for servers Disadvantages • Expensive licensing • Sys admin may not be in control • A faulty system is still redirected to – health checking is not fully reliable and the failover delay may exceed 30 seconds
Past Solutions – LVS - NAT Advantages • No special requirements for servers • Load balancing across VLANs, multiple scheduling algorithms • Easily updated on the fly • Sys admin in control Disadvantages • Single point of failure • Director must handle the entire service bandwidth • A faulty server still receives traffic • Issues with persistent connections • A good command of iptables is a must
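A minimal LVS-NAT sketch built by hand with ipvsadm (addresses are illustrative); this is what the director ends up doing under the hood, and it shows why forwarding and NAT knowledge on the director is unavoidable:

    # virtual service on the VIP, weighted round robin
    ipvsadm -A -t 192.0.2.100:80 -s wrr
    # real servers; "-m" selects masquerading (NAT)
    ipvsadm -a -t 192.0.2.100:80 -r 10.0.0.11:80 -m -w 1
    ipvsadm -a -t 192.0.2.100:80 -r 10.0.0.12:80 -m -w 1
    # the director must forward packets, and replies must route back through it
    echo 1 > /proc/sys/net/ipv4/ip_forward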
Past Solutions – LVS - DR Advantages • Easy setup • Minimal bandwidth through the director (replies return directly to clients) • Director requirements minimal • Load balancing algorithms Disadvantages • Single point of failure • Load balancing limited to a single VLAN • Extra ethernet card for Windows servers • A faulty server still receives traffic • Issues with persistent connections
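On the Linux real-server side of LVS-DR, a common sketch (VIP illustrative, assuming a kernel with the arp_ignore/arp_announce sysctls) is to bind the VIP to the loopback and suppress ARP for it, so the director keeps owning the VIP on the wire; Windows real servers need the extra adapter noted above instead:

    # accept packets addressed to the VIP without answering ARP for it
    ip addr add 192.0.2.100/32 dev lo label lo:0
    echo 1 > /proc/sys/net/ipv4/conf/all/arp_ignore
    echo 2 > /proc/sys/net/ipv4/conf/all/arp_announce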
Past Solutions - Summary • None meet all the requirements • Need to address single points of failure for proxies and LVS • Need to address faulty-system redirection for all • High-bandwidth services not practical for proxies and LVS – NAT • DNS solutions have TTL issues – especially during failures
KeepaliveD • Addresses the single point of failure • Director availability using failover protocols – VRRPv2 (RFC 2338) • Addresses faulty-system redirection • Real-server availability using health checking • Designed for LVS • Can be manipulated to increase availability for non-LVS Linux-based services
Terminology • VIP – virtual IP, aka service IP e.g. webmail.yorku.ca • Real Server – actual host of the service • Server Pool – farm of real servers • Virtual Server – access point to server pool (load balancer or director) • Virtual Service – service being served by virtual server under VIP
Health Checking Framework • 4 avenues for health checking • TCP_CHECK – layer 4, basic vanilla TCP connection attempt • HTTP_GET – layer 5, performs an HTTP GET, computes the MD5 sum of the page and validates it against the expected digest • SSL_GET – same as HTTP_GET but uses SSL connections • MISC_CHECK – the kitchen sink – define your own test for the service; the check script returns 0 (healthy) or 1 (failed)
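A hedged MISC_CHECK sketch (the script path, address and service are hypothetical); any executable that exits 0 for healthy and non-zero for failed can drive the check:

    real_server 192.0.2.10 25 {
        weight 1
        MISC_CHECK {
            # hypothetical script: exits 0 if the MTA answers with a 220 banner
            misc_path "/usr/local/etc/check_smtp_banner 192.0.2.10 25"
            misc_timeout 10
        }
    }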
Failover – VRRP Framework • Election for control of the VIP addresses • Dynamic failover of IPs on failures Main functionalities are: • Failover • VRRP instance synchronization • Nice fallback • Advert packet integrity – via IPSEC-AH multicasts • System call capabilities
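VRRP instance synchronization is normally expressed as a sync group, so that when one instance loses mastership its partners fail over with it; a minimal sketch (instance names match the examples that follow):

    vrrp_sync_group VG_1 {
        group {
            VI_1   # VRRP instances defined elsewhere in keepalived.conf
            VI_2
        }
    }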
KeepaliveD & LVS – DR (diagram: user traffic reaches the Service 1, 2 and 3 clusters through a redundant pair of directors)
KeepaliveD & LVS – DR Configuration Example
Section 1 – Global definitions
• whom to notify, how to notify, and whom to send notifications as

    global_defs {
        notification_email {
            unixteam@yorku.ca
        }
        notification_email_from root@orite.ccs.yorku.ca
        smtp_server 130.63.236.104
        smtp_connect_timeout 30
        lvs_id CNSLB
    }
KeepaliveD & LVS – DR Configuration Example
Section 2 – VRRP instance definition

    vrrp_instance VI_1 {
        state MASTER
        interface eth0
        virtual_router_id 51
        priority 250
        smtp_alert
        advert_int 1
        authentication {
            auth_type AH
            auth_pass passwrd1
        }
        virtual_ipaddress {
            130.63.236.146
            130.63.236.223
            130.63.236.212
        }
    }

    vrrp_instance VI_2 {
        state BACKUP
        interface eth0
        virtual_router_id 91
        priority 200
        smtp_alert
        advert_int 1
        authentication {
            auth_type AH
            auth_pass passwrd2
        }
        virtual_ipaddress {
            130.63.236.140
            130.63.236.137
        }
    }
KeepaliveD & LVS – DR Configuration Example
Section 3 – Virtual Service Definition

    # OPTERA.CCS.YORKU.CA - HTTP (Port 80)
    virtual_server 130.63.236.137 80 {
        delay_loop 10
        lb_algo wrr
        lb_kind DR
        protocol TCP

        # estrela.ccs.yorku.ca
        real_server 130.63.236.224 80 {
            weight 1
            HTTP_GET {
                url {
                    path /index.html
                    digest 254440db00e00a3eb49b266de0d457c9
                }
                connect_timeout 20
                nb_get_retry 3
                delay_before_retry 15
            }
        }

        # etoile.ccs.yorku.ca
        real_server 130.63.236.225 80 {
            weight 1
            HTTP_GET {
                url {
                    path /index.html
                    digest bd32b6a8c221083362c056c88c2ccb87
                }
                connect_timeout 20
                nb_get_retry 3
                delay_before_retry 15
            }
        }
    }
KeepaliveD & LVS – DR Configuration Example
IPVSADM output

    orite:~# ipvsadm
    IP Virtual Server version 1.0.10 (size=4096)
    Prot LocalAddress:Port Scheduler Flags
      -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
    TCP  optera.ccs.yorku.ca:www wrr
      -> estrela.ccs.yorku.ca:www     Route   1      0          0
      -> etoile.ccs.yorku.ca:www      Route   1      0          0
KeepaliveD with a Twist • Some services don't run seamlessly under LVS • Long-term connection-based services like IMAP • TCP timeout issues exist whether the timeout over- or under-compensates • Not a real error, but an annoyance for users as clients needlessly pop up a message • Need a different way to get closer to HA and load balancing
KeepaliveD with a Twist (diagram: DNS shuffle records in front of the IMAP cluster)
KeepaliveD with a Twist • Achievements • Automatic failover when a system goes down • TTL problems resolved • Deficiencies • Health checking would need to be done at the DNS shuffle record level • The full load of a failed server is transferred to a single other server • Additional IP address needed PER server in cluster
KeepaliveD with a Twist - Configuration

    vrrp_instance VI_1 {
        state BACKUP
        interface eth0
        virtual_router_id 201
        priority 55
        …
        virtual_ipaddress {
            130.63.236.201
        }
    }

    vrrp_instance VI_2 {
        state MASTER
        interface eth0
        virtual_router_id 202
        priority 60
        …
        virtual_ipaddress {
            130.63.236.202
        }
    }

    vrrp_instance VI_3 {
        state BACKUP
        interface eth0
        virtual_router_id 203
        priority 45
        …
        virtual_ipaddress {
            130.63.236.203
        }
    }

    vrrp_instance VI_4 {
        state BACKUP
        interface eth0
        virtual_router_id 204
        priority 25
        …
        virtual_ipaddress {
            130.63.236.204
        }
    }

    vrrp_instance VI_5 {
        state BACKUP
        interface eth0
        virtual_router_id 205
        priority 15
        …
        virtual_ipaddress {
            130.63.236.205
        }
    }
(KeepaliveD with a Twist)² • For some services LVS & KeepaliveD, and KeepaliveD with a Twist, are not enough • Databases are a prime example (MySQL) • In a replicated environment only the master may be written to • LVS & KeepaliveD are excellent for the read-only operations across the replicated environment • Health checking is only sufficient to validate the service, not to take corrective actions
(KeepaliveD with a Twist)² • How do you deal with a master failure? • Apply LVS & KeepaliveD and KeepaliveD with a Twist simultaneously • LVS & KeepaliveD for read-only operations • KeepaliveD with a Twist for master failover, using the system call capabilities to make the necessary changes
(KeepaliveD with a Twist)² (diagram: director pair with write operations directed to the master, M)
(KeepaliveD with a Twist)²

    vrrp_instance VI_1 {
        state BACKUP
        interface eth0
        virtual_router_id 96
        priority 100
        advert_int 1
        smtp_alert
        authentication {
            auth_type AH
            auth_pass yourpass
        }
        virtual_ipaddress {
            # mysql.yorku.ca
            130.63.236.230
        }
        notify_master "/usr/local/etc/notify_takeover"
    }

/usr/local/etc/notify_takeover:

    #!/bin/sh
    /usr/bin/mysql -e "stop slave;"
    sleep 2
    /usr/bin/mysql -e "reset master;"
    /usr/bin/mailx -s "`hostname` has taken over as master for mysql.yorku.ca" unixteam@yorku.ca < /dev/null
(KeepaliveD with a Twist)² – Future Consideration (diagram: director pair with multiple masters, M M M)
Where we are at YorkU – LVS & KeepaliveD • Public subnet • 3 directors balancing: • Mail delivery services (3 x Debian Linux) • ClamAV, DCC, MIMEDefang, SpamAssassin, Bogofilter, Procmail, Sendmail • Web-based email for students (3 x Debian Linux) • Apache, PHP, Horde, IMP, Turba, Mnemo • Web-based email for staff (2 x Debian Linux) • Apache, PHP, Horde, IMP, Turba, Mnemo • Web Registration and Enrolment (2 x Solaris) • Apache, WebObjects
Where we are at YorkU – LVS & KeepaliveD • 3 directors (cont'd): • Central web proxy service (2 x Debian Linux) • Apache2 – mod_proxy & mod_rewrite • Central web services (2 x Debian Linux) • Apache & the kitchen sink • LDAP (3 x Debian Linux) • OpenLDAP • Private subnet • 2 directors balancing: • SSL proxy service (2 x Debian Linux) • Apache2 – mod_proxy
Where we are at YorkU – KeepaliveD with a Twist • Staff Postoffice – IMAP/POP • 5 servers (Debian Linux) • UW-IMAP • Student Postoffice – IMAP/POP • 3 servers (Debian Linux) • Courier – only the IMAP and POP components
Where we are at YorkU – (KeepaliveD with a Twist)² • Project for 2004 • MySQL (3 x Debian Linux) • Health checking to be conducted by the 3 public-subnet directors • Investigation into multiple masters still pending results
So where's the balance for maintenance? • LVS & KeepaliveD • Remove a real server from the service midday with little to no effect on the service • KeepaliveD with a Twist • Remove a real server during off hours (turn off KeepaliveD), work on the server the next day, add the server back into service after maintenance during off hours • Off-hours work can be cron'd
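A sketch of both cases (paths and schedule are illustrative): under LVS & KeepaliveD it is enough to stop the service on the real server, since the failing health check pulls it out of the pool within delay_loop seconds and re-adds it once the check passes again; for the Twist setup the per-server VIP can be released from root's crontab off hours:

    # on the LVS real server being serviced: stop the service, the director drops it
    /etc/init.d/apache stop

    # KeepaliveD with a Twist: release this server's VIP overnight (root crontab)
    0 2 * * 1  /etc/init.d/keepalived stop
    0 5 * * 2  /etc/init.d/keepalived start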
Summary • KeepaliveD allows us to achieve high availability, or something close to it, for many services • KeepaliveD can be manipulated into a failover mechanism for Linux systems • In many cases it is possible to balance the high-uptime and maintenance paradox