High Availability Almost Everywhere? Ramon Kagan Computing and Network Services York University
Agenda • Analysis of previous pseudo-high availability solutions • Additions to those solutions to create high availability or higher availability • Look at what we have done at YorkU • What this achieves over and above high availability • Technical in the middle
The Situation • Lack of financial resources for a large midrange/mainframe true clustering server • Lack of financial resources for overtime hours for regular maintenance • SLAs – understood or written
Past Solutions • Purchase multiple small systems with combined computing power • Create service “clusters” • Use of load balancing techniques • DNS Shuffle records • Proxies • Switch/Router load balancing • Linux Virtual Server (LVS)
Past Solutions – DNS Shuffle Advantages • Simple to set up • Simple RR scheme • No special requirements for servers Disadvantages • RR doesn't account for different system types • A faulty system still receives 1 in N of the requests • Sys admin may not be in control of DNS • External resolvers only see DNS updates after the TTL expires
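For illustration only (the name and addresses below are made up), a shuffle-record setup is just multiple A records for the same name; BIND rotates the answer order on each query, and the short TTL is the only lever for limiting how long a dead server keeps receiving hits:

    ; hypothetical zone fragment: round-robin A records
    www    300  IN  A  192.0.2.11
    www    300  IN  A  192.0.2.12
    www    300  IN  A  192.0.2.13
    ; caching resolvers keep handing out a failed address until the 300 s TTL expires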
Past Solutions – Proxies Advantages • Sys admin in full control • Easily updated on the fly • No special requirements for servers • Ability to log activity Disadvantages • Single point of failure • Proxy must handle the entire service bandwidth • A faulty system keeps receiving traffic until the redirection rules are modified
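As a rough sketch of the proxy approach, assuming Apache with mod_rewrite and mod_proxy (the map file and pool members are hypothetical), a random-choice rewrite map spreads requests across the real servers:

    # /etc/apache/pool.map (hypothetical): "rnd" picks one member at random
    pool  www1.example.yorku.ca|www2.example.yorku.ca|www3.example.yorku.ca

    # httpd.conf excerpt
    RewriteEngine On
    RewriteMap  servers  rnd:/etc/apache/pool.map
    RewriteRule ^/(.*)$  http://${servers:pool}/$1  [P,L]
    # removing a faulty server means editing the map by hand; nothing does it for you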
Past Solutions – Switch/Router Load Balancing Advantages • No additional hardware • Less "bouncing" around • No special requirements for servers Disadvantages • Expensive licensing • Sys admin may not be in control • A faulty system is still redirected to – health checking is not fully reliable and the failover delay may exceed 30 seconds
Past Solutions – LVS - NAT Advantages • No special requirements for servers • Load balancing across VLANs, multiple scheduling algorithms • Easily updated on the fly • Sys admin in control Disadvantages • Single point of failure • Director must handle the entire service bandwidth • A faulty server still receives traffic • Issues with persistent connections • A good command of iptables is a must
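A minimal LVS-NAT sketch built by hand with ipvsadm (addresses are illustrative); this is what the director ends up doing under the hood, and it shows why forwarding and NAT knowledge on the director is unavoidable:

    # virtual service on the VIP, weighted round robin
    ipvsadm -A -t 192.0.2.100:80 -s wrr
    # real servers; "-m" selects masquerading (NAT)
    ipvsadm -a -t 192.0.2.100:80 -r 10.0.0.11:80 -m -w 1
    ipvsadm -a -t 192.0.2.100:80 -r 10.0.0.12:80 -m -w 1
    # the director must forward packets, and replies must route back through it
    echo 1 > /proc/sys/net/ipv4/ip_forward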
Past Solutions – LVS - DR Advantages • Easy setup • Minimal bandwidth through the director (replies return directly to clients) • Director requirements minimal • Load balancing algorithms Disadvantages • Single point of failure • Load balancing limited to a single VLAN • Extra ethernet card for Windows servers • A faulty server still receives traffic • Issues with persistent connections
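On the Linux real-server side of LVS-DR, a common sketch (VIP illustrative, assuming a kernel with the arp_ignore/arp_announce sysctls) is to bind the VIP to the loopback and suppress ARP for it, so the director keeps owning the VIP on the wire; Windows real servers need the extra adapter noted above instead:

    # accept packets addressed to the VIP without answering ARP for it
    ip addr add 192.0.2.100/32 dev lo label lo:0
    echo 1 > /proc/sys/net/ipv4/conf/all/arp_ignore
    echo 2 > /proc/sys/net/ipv4/conf/all/arp_announce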
Past Solutions - Summary • None meet all the requirements • Need to address single points of failure for proxies and LVS • Need to address faulty-system redirection for all • High-bandwidth services not practical for proxies and LVS – NAT • DNS solutions have TTL issues – especially during failures
KeepaliveD • Addresses the single point of failure • Director availability using failover protocols – VRRPv2 (RFC 2338) • Addresses faulty-system redirection • Real-server availability using health checking • Designed for LVS • Can be manipulated to increase availability for non-LVS Linux-based services
Terminology • VIP – virtual IP, aka service IP e.g. webmail.yorku.ca • Real Server – actual host of the service • Server Pool – farm of real servers • Virtual Server – access point to server pool (load balancer or director) • Virtual Service – service being served by virtual server under VIP
Health Checking Framework • 4 avenues for health checking • TCP_CHECK – layer 4, basic vanilla TCP connection attempt • HTTP_GET – layer 5, performs an HTTP GET, computes the MD5 sum of the page and validates it against the expected digest • SSL_GET – same as HTTP_GET but uses SSL connections • MISC_CHECK – the kitchen sink – define your own test for the service; the check script returns 0 (healthy) or 1 (failed)
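A hedged MISC_CHECK sketch (the script path, address and service are hypothetical); any executable that exits 0 for healthy and non-zero for failed can drive the check:

    real_server 192.0.2.10 25 {
        weight 1
        MISC_CHECK {
            # hypothetical script: exits 0 if the MTA answers with a 220 banner
            misc_path "/usr/local/etc/check_smtp_banner 192.0.2.10 25"
            misc_timeout 10
        }
    }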
Failover – VRRP Framework • Election for control of the VIP addresses • Dynamic failover of IPs on failures Main functionalities are: • Failover • VRRP instance synchronization • Nice fallback • Advert packet integrity – via IPSEC-AH multicasts • System call capabilities
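VRRP instance synchronization is normally expressed as a sync group, so that when one instance loses mastership its partners fail over with it; a minimal sketch (instance names match the examples that follow):

    vrrp_sync_group VG_1 {
        group {
            VI_1   # VRRP instances defined elsewhere in keepalived.conf
            VI_2
        }
    }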
KeepaliveD & LVS – DR (diagram: user traffic reaches the Service 1, 2 and 3 clusters through a redundant pair of directors)
KeepaliveD & LVS – DR Configuration Example
Section 1 – Global definitions
• whom to notify, how to notify, and whom to send notifications as

    global_defs {
        notification_email {
            unixteam@yorku.ca
        }
        notification_email_from root@orite.ccs.yorku.ca
        smtp_server 130.63.236.104
        smtp_connect_timeout 30
        lvs_id CNSLB
    }
KeepaliveD & LVS – DR Configuration Example
Section 2 – VRRP instance definition

    vrrp_instance VI_1 {
        state MASTER
        interface eth0
        virtual_router_id 51
        priority 250
        smtp_alert
        advert_int 1
        authentication {
            auth_type AH
            auth_pass passwrd1
        }
        virtual_ipaddress {
            130.63.236.146
            130.63.236.223
            130.63.236.212
        }
    }

    vrrp_instance VI_2 {
        state BACKUP
        interface eth0
        virtual_router_id 91
        priority 200
        smtp_alert
        advert_int 1
        authentication {
            auth_type AH
            auth_pass passwrd2
        }
        virtual_ipaddress {
            130.63.236.140
            130.63.236.137
        }
    }
KeepaliveD & LVS – DR Configuration Example
Section 3 – Virtual Service Definition

    # OPTERA.CCS.YORKU.CA - HTTP (Port 80)
    virtual_server 130.63.236.137 80 {
        delay_loop 10
        lb_algo wrr
        lb_kind DR
        protocol TCP

        # estrela.ccs.yorku.ca
        real_server 130.63.236.224 80 {
            weight 1
            HTTP_GET {
                url {
                    path /index.html
                    digest 254440db00e00a3eb49b266de0d457c9
                }
                connect_timeout 20
                nb_get_retry 3
                delay_before_retry 15
            }
        }

        # etoile.ccs.yorku.ca
        real_server 130.63.236.225 80 {
            weight 1
            HTTP_GET {
                url {
                    path /index.html
                    digest bd32b6a8c221083362c056c88c2ccb87
                }
                connect_timeout 20
                nb_get_retry 3
                delay_before_retry 15
            }
        }
    }
KeepaliveD & LVS – DR Configuration Example
IPVSADM output

    orite:~# ipvsadm
    IP Virtual Server version 1.0.10 (size=4096)
    Prot LocalAddress:Port Scheduler Flags
      -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
    TCP  optera.ccs.yorku.ca:www wrr
      -> estrela.ccs.yorku.ca:www     Route   1      0          0
      -> etoile.ccs.yorku.ca:www      Route   1      0          0
KeepaliveD with a Twist • Some services don't run seamlessly under LVS • Long-term connection-based services like IMAP • TCP timeout issues exist whether the timeout over- or under-compensates • Not a real error, but an annoyance for users as clients needlessly pop up a message • Need a different way to get closer to HA and load balancing
KeepaliveD with a Twist (diagram: DNS shuffle records in front of the IMAP cluster)
KeepaliveD with a Twist • Achievements • Automatic failover when a system goes down • TTL problems resolved • Deficiencies • Health checking would need to be done at the DNS shuffle record level • The full load of a failed server is transferred to a single other server • Additional IP address needed PER server in cluster
KeepaliveD with a Twist - Configuration

    vrrp_instance VI_1 {
        state BACKUP
        interface eth0
        virtual_router_id 201
        priority 55
        …
        virtual_ipaddress {
            130.63.236.201
        }
    }

    vrrp_instance VI_2 {
        state MASTER
        interface eth0
        virtual_router_id 202
        priority 60
        …
        virtual_ipaddress {
            130.63.236.202
        }
    }

    vrrp_instance VI_3 {
        state BACKUP
        interface eth0
        virtual_router_id 203
        priority 45
        …
        virtual_ipaddress {
            130.63.236.203
        }
    }

    vrrp_instance VI_4 {
        state BACKUP
        interface eth0
        virtual_router_id 204
        priority 25
        …
        virtual_ipaddress {
            130.63.236.204
        }
    }

    vrrp_instance VI_5 {
        state BACKUP
        interface eth0
        virtual_router_id 205
        priority 15
        …
        virtual_ipaddress {
            130.63.236.205
        }
    }
(KeepaliveD with a Twist)² • For some services LVS & KeepaliveD, and KeepaliveD with a Twist, are not enough • Databases are a prime example (MySQL) • In a replicated environment only the master may be written to • LVS & KeepaliveD are excellent for the read-only operations across the replicated environment • Health checking is only sufficient to validate the service, not to take corrective actions
(KeepaliveD with a Twist)² • How do you deal with a master failure? • Apply LVS & KeepaliveD and KeepaliveD with a Twist simultaneously • LVS & KeepaliveD for read-only operations • KeepaliveD with a Twist for master failover, using the system call capabilities to make the necessary changes
(KeepaliveD with a Twist)² (diagram: director pair with write operations directed to the master, M)
(KeepaliveD with a Twist)²

    vrrp_instance VI_1 {
        state BACKUP
        interface eth0
        virtual_router_id 96
        priority 100
        advert_int 1
        smtp_alert
        authentication {
            auth_type AH
            auth_pass yourpass
        }
        virtual_ipaddress {
            # mysql.yorku.ca
            130.63.236.230
        }
        notify_master "/usr/local/etc/notify_takeover"
    }

/usr/local/etc/notify_takeover:

    #!/bin/sh
    /usr/bin/mysql -e "stop slave;"
    sleep 2
    /usr/bin/mysql -e "reset master;"
    /usr/bin/mailx -s "`hostname` has taken over as master for mysql.yorku.ca" unixteam@yorku.ca < /dev/null
(KeepaliveD with a Twist)² – Future Consideration (diagram: director pair with multiple masters, M M M)
Where we are at YorkU – LVS & KeepaliveD • Public subnet • 3 directors balancing: • Mail delivery services (3 x Debian Linux) • ClamAV, DCC, MIMEDefang, SpamAssassin, Bogofilter, Procmail, Sendmail • Web-based email for students (3 x Debian Linux) • Apache, PHP, Horde, IMP, Turba, Mnemo • Web-based email for staff (2 x Debian Linux) • Apache, PHP, Horde, IMP, Turba, Mnemo • Web Registration and Enrolment (2 x Solaris) • Apache, WebObjects
Where we are at YorkU – LVS & KeepaliveD • 3 directors (cont'd): • Central web proxy service (2 x Debian Linux) • Apache2 – mod_proxy & mod_rewrite • Central web services (2 x Debian Linux) • Apache & the kitchen sink • LDAP (3 x Debian Linux) • OpenLDAP • Private subnet • 2 directors balancing: • SSL proxy service (2 x Debian Linux) • Apache2 – mod_proxy
Where we are at YorkU – KeepaliveD with a Twist • Staff Postoffice – IMAP/POP • 5 servers (Debian Linux) • UW-IMAP • Student Postoffice – IMAP/POP • 3 servers (Debian Linux) • Courier – only the IMAP and POP components
Where we are at YorkU – (KeepaliveD with a Twist)² • Project for 2004 • MySQL (3 x Debian Linux) • Health checking to be conducted by the 3 public-subnet directors • Investigation into multiple masters still pending results
So where's the balance for maintenance? • LVS & KeepaliveD • Remove a real server from the service midday with little to no effect on the service • KeepaliveD with a Twist • Remove a real server during off hours (turn off KeepaliveD), work on the server the next day, add the server back into service after maintenance during off hours • Off-hours work can be cron'd
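A sketch of both cases (paths and schedule are illustrative): under LVS & KeepaliveD it is enough to stop the service on the real server, since the failing health check pulls it out of the pool within delay_loop seconds and re-adds it once the check passes again; for the Twist setup the per-server VIP can be released from root's crontab off hours:

    # on the LVS real server being serviced: stop the service, the director drops it
    /etc/init.d/apache stop

    # KeepaliveD with a Twist: release this server's VIP overnight (root crontab)
    0 2 * * 1  /etc/init.d/keepalived stop
    0 5 * * 2  /etc/init.d/keepalived start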
Summary • KeepaliveD allows us to achieve high availability, or something close to it, for many services • KeepaliveD can be manipulated into a failover mechanism for Linux systems • In many cases it is possible to balance the high-uptime and maintenance paradox