An Empirical Examination of Current High-Availability Clustering Solutions’ Performance

Jeffrey Absher, DePaul University Research Symposium Presentation, November 2003. See the actual paper for bibliographical and procedural information and for appropriate academic references.

HA and Related Technology
• Distributed OS
• Load Balancing
• Disaster Recovery
• Fault Tolerance
• HA clustering

HA’s defining traits
• Single points of failure (SPOF) avoided through redundancy
• Single image presented to the outside world through a single virtual IP address and hostname
• Automated fault management and recovery
• Multiple access paths from each cluster node to each resource group (set of HA services)
• Simple abstraction for applications and administrators
• Undisrupted (or minimally disrupted) services during failover
“If a computer breaks down, the functions performed by that computer will be handled by some other computer in the cluster.”
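The failover behavior described in the quotation can be sketched in a few lines of Python. This is a toy model for illustration only, not the mechanism of any clustering product examined in the paper: the `Node` and `Cluster` classes, the node names, the `web` resource group, and the address 192.0.2.10 are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    healthy: bool = True


@dataclass
class Cluster:
    """Toy cluster: one virtual IP, resource groups owned by healthy nodes."""
    nodes: list
    virtual_ip: str  # the single address clients see (hypothetical value)
    owner: dict = field(default_factory=dict)  # resource group -> node name

    def place(self, group):
        """Assign a resource group to the first healthy node, or None."""
        for node in self.nodes:
            if node.healthy:
                self.owner[group] = node.name
                return node.name
        self.owner[group] = None
        return None

    def fail(self, node_name):
        """Mark a node as failed and move its resource groups elsewhere."""
        for node in self.nodes:
            if node.name == node_name:
                node.healthy = False
        # Copy the items so reassignment does not disturb the iteration.
        for group, current in list(self.owner.items()):
            if current == node_name:
                self.place(group)


# Hypothetical two-node cluster behind one virtual IP address.
cluster = Cluster([Node("node-a"), Node("node-b")], "192.0.2.10")
first_owner = cluster.place("web")
cluster.fail("node-a")
second_owner = cluster.owner["web"]
print(first_owner, "->", second_owner)
```

Clients keep using the same virtual IP address throughout; only the owner of the resource group changes. A real cluster must also detect the fault and restart the services on the new node, and the time that takes is the failover time.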

A cluster and tester topology

Inter OS Comparison

Subjective Observations
• HA clustering is difficult to configure properly, and the available documentation is lacking.
• Multiple machines must be configured simultaneously, and packages and software must often be installed and configured in a specific order.
• For what should be a loosely coupled system, there are many interdependencies.
• Youn et al. suggest that the design of “administration of clusters…needs improvement.” I agree.
• Vogels et al. state, “Users find it difficult to configure clusters with the desired management … properties. It is difficult to configure applications to be automatically launched in an appropriate order. Lacking solutions to these problems, clusters will remain awkward and time-consuming tools.” I agree.

Objective Conclusions Based on Empirical Evidence
• HA is not a perfect solution for every environment and may be a bad solution for some, depending on the expected faults.
• High failover time on some systems contributes to lower-than-expected performance of HA systems compared with non-HA systems.
• Failover times need to be significantly smaller than the time required for a reboot, or even for a restart of a slow-to-start process.
• Primary-node negotiation time at boot contributes to poor performance during power outages.
• In some cases, clustering was shown to actually decrease the uptime of a service or site.
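The arithmetic behind these conclusions can be illustrated with a short Python sketch. Every number below is hypothetical and chosen only to show the effect. None of them is a measurement from the paper, and the function names are assumptions introduced for this example.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def downtime_per_year(faults_per_year, recovery_seconds):
    """Total seconds of downtime if every fault costs one recovery period."""
    return faults_per_year * recovery_seconds


def availability(downtime_seconds, period_seconds=SECONDS_PER_YEAR):
    """Fraction of the period during which the service is up."""
    return 1 - downtime_seconds / period_seconds


# Hypothetical inputs (NOT results from the paper):
faults_per_year = 12     # assumed number of service faults per year
restart_seconds = 120    # assumed restart time on a standalone server
failover_seconds = 300   # assumed failover time on a slow cluster

standalone_downtime = downtime_per_year(faults_per_year, restart_seconds)
cluster_downtime = downtime_per_year(faults_per_year, failover_seconds)

print("standalone availability:", availability(standalone_downtime))
print("clustered availability: ", availability(cluster_downtime))
print("clustering helps:", cluster_downtime < standalone_downtime)
```

Under these assumed values the cluster accumulates more downtime than the standalone server, because each failover takes longer than the restart it replaces. The comparison reverses only when the failover time falls below the restart time.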

Q & A
