Building and managing production bioclusters • Chris Dagdigian, BIOSILICO Vol. 2, No. 5, September 2004 • Presented by Ankur Dhanik
Computer Cluster • A computer cluster consists of connected computers, servers and other resources that act as a single system. • Such a system can perform tasks previously delegated to machines costing hundreds of thousands to millions of dollars. • More efficient resource utilization. • The general benefit of a flexible research-computing infrastructure that can be tuned and adapted to meet changing research and user demands. • Areas of scientific inquiry previously dismissed as impossible are now feasible.
Biology and cluster configuration • Cluster configuration is influenced by the intended application mix. • In bioinformatics, the most common workloads are serial computing problems, also referred to as “embarrassingly parallel” problems. • These types of problems can be broken down into a series of independent steps, each of which can be completed in any order without affecting the result. • Examples include large-scale bioinformatics sequence analysis and experimentation.
Biology and cluster configuration • Sequence analysis • compare one sequence against a database of many sequences. • performance can be vastly increased by simply dividing the query sequences up and running multiple searches at the same time on separate machines (see the sketch below). • Experimentation • run slight variations of the same program thousands or millions of times in a row. • Both can be handled by loosely coupled compute clusters, also known as compute farms. • High-performance clusters such as Beowulf systems, which are suited to tightly coupled problems, are not needed. • The large number of embarrassingly parallel problems is the primary driver for the widespread adoption of clusters.
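A minimal sketch of the query-splitting approach in Python, assuming a multi-FASTA query file; the `run_search` command and all file names are placeholders standing in for whatever BLAST-style search binary a site actually runs.

```python
# Sketch of an "embarrassingly parallel" sequence search: split a
# multi-FASTA query file into chunks and search each chunk
# independently, in any order.
import subprocess
from concurrent.futures import ProcessPoolExecutor

def split_fasta(path, n_chunks):
    """Split a multi-FASTA file into n_chunks smaller files."""
    with open(path) as f:
        records = f.read().split(">")[1:]          # one entry per sequence
    chunks = [records[i::n_chunks] for i in range(n_chunks)]
    paths = []
    for i, chunk in enumerate(chunks):
        chunk_path = f"query_chunk_{i}.fa"
        with open(chunk_path, "w") as out:
            out.write("".join(">" + rec for rec in chunk))
        paths.append(chunk_path)
    return paths

def search(chunk_path):
    """Run one independent search; completion order does not matter."""
    out_path = chunk_path + ".hits"
    # 'run_search' is a placeholder for the site's search binary.
    subprocess.run(["run_search", "-db", "nr.fa", "-query", chunk_path,
                    "-out", out_path], check=True)
    return out_path

if __name__ == "__main__":
    chunk_paths = split_fasta("queries.fa", n_chunks=8)
    with ProcessPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(search, chunk_paths))
    print("Finished:", results)
```

Because each chunk is searched independently, the same pattern scales from one multi-core machine to a whole compute farm: the per-chunk `search` call simply becomes a DRM job submission.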
A typical biocluster [Diagram: users 1, 2, …, N submit work through a software-based distributed resource management (DRM) layer, which dispatches it over an Ethernet network to small, inexpensive servers.]
Portal architecture [Diagram: a portal machine (aka the ‘master’, ‘head’ or ‘login’ node) and a file server sit between the public local area network and a private cluster network containing the cluster compute elements.]
Design considerations • Reliability, availability and security • compute nodes should be anonymous and interchangeable to support non-disruptive troubleshooting, maintenance and upgrade activities. • critical failure points such as file servers and portal machines need to be duplicated or made resilient to failure. • Flexibility and scalability • multiple competing users, workflows and projects should be supported simultaneously. • Manageability • administrative overhead should be minimized, which requires methodologies for automating or reducing administration tasks. • the software DRM layer needs to ensure that business and scientific priorities can dynamically alter the allocation of computing resources.
Pre-purchase decisions • DRM • simplifies interaction with the cluster at both the user and administration levels. • an important decision. • the most commonly seen DRM software suites in the life sciences are Sun Grid Engine (SGE) and Platform LSF. • for flexible yet sophisticated resource sharing and job scheduling, especially among many different groups or projects, LSF still has an edge in functionality and ease of configuration. • installing a sophisticated Grid Engine configuration can be an adventure. • experience suggests that LSF requires the least amount of resources to install, configure and maintain over time (a submission sketch follows below).
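As a rough illustration of how the DRM layer hides the cluster from end users, here is a hedged Python sketch built around SGE's qsub. The flags shown (-t for array jobs, -P for project accounting) are standard SGE options, but the script name and project are assumptions and should be adapted to the local installation.

```python
# Thin Python wrapper around DRM job submission (SGE flavor).
# Check the flags against the local installation before relying
# on this; LSF would use bsub with different options.
import subprocess

def submit_array_job(script, n_tasks, project="bioinformatics"):
    """Submit a script as an SGE array job of n_tasks independent tasks."""
    cmd = ["qsub",
           "-t", f"1-{n_tasks}",    # array job: tasks 1..n_tasks
           "-P", project,           # project name, used in share policies
           script]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.strip()    # qsub echoes the assigned job id

# Each task reads the SGE_TASK_ID environment variable to pick its
# own slice of the work, e.g. which query chunk to search.
print(submit_array_job("search_chunk.sh", n_tasks=100))
```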
Choosing hardware • Science and scientific application demands should drive the hardware configuration. • In the absence of specific application benchmarks, dual-processor Intel Xeon-based servers are a sensible default compute-node configuration. • Networking technology • switched Gigabit Ethernet is affordable and should be the default interconnect for cluster systems (a quick calculation follows below). • alternative cluster interconnects such as InfiniBand and Myrinet offer higher performance, but at a large cost, and few existing life-science application codes are capable of benefiting from such technologies.
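A quick back-of-envelope calculation suggests why Gigabit Ethernet is usually sufficient for embarrassingly parallel work; the 5 GB dataset size and 80% effective link utilization are illustrative assumptions, not measurements.

```python
# How long does it take to stage a reference dataset over switched
# Gigabit Ethernet? Figures below are illustrative assumptions.
DATASET_GB = 5
GIGE_MB_PER_S = 1000 / 8 * 0.8   # ~1 Gbit/s at ~80% utilization = 100 MB/s

seconds = DATASET_GB * 1024 / GIGE_MB_PER_S
print(f"~{seconds:.0f} s to stage {DATASET_GB} GB over GigE")  # ~51 s
```

Staging data in under a minute is negligible next to search jobs that run for minutes or hours, which is why low-latency interconnects pay off mainly for tightly coupled parallel codes.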
Choosing hardware • Storage • the speed of cluster storage is usually the performance bottleneck. • network storage: storage devices that are accessed over a computer network rather than being directly connected to a computer, e.g. NAS (network-attached storage), SAN (storage area network) or hybrid architectures. • use large internal disk drives within each compute node to cache data needed for data-intensive cluster jobs (a staging sketch follows below).
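A minimal sketch of the local-disk caching idea, assuming hypothetical paths for the shared and local copies of a reference database: stage the data onto the node's internal drive once, so repeated reads hit local disk instead of the network file server.

```python
# Stage a shared reference database onto a compute node's internal
# disk before a data-intensive job. Paths are illustrative.
import os
import shutil

SHARED_DB = "/shared/db/nr.fa"    # on the NAS/SAN file server
LOCAL_DB = "/scratch/nr.fa"       # on the node's internal drive

def stage_database(shared=SHARED_DB, local=LOCAL_DB):
    """Copy the database to local scratch unless a fresh copy exists."""
    if (not os.path.exists(local)
            or os.path.getmtime(local) < os.path.getmtime(shared)):
        os.makedirs(os.path.dirname(local), exist_ok=True)
        shutil.copy2(shared, local)   # copy2 preserves the timestamp
    return local

db_path = stage_database()
# ... subsequent jobs on this node read db_path from local disk.
```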
Deploying, monitoring and management • Maintenance methodology • if a cluster node enters a faulted state, the power to the host is cycled and the node is wiped and reinstalled via the network. • if the node fails to successfully rejoin the cluster, it is disabled and considered failed. • it is replaced later at the operator’s convenience (a sketch of this policy follows below). • Prepackaged methods exist for handling remote unattended operating system installations and rebuilds, e.g. SystemImager for Linux clusters and NetBoot for Apple hardware running Mac OS X.
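The maintenance methodology can be sketched as a simple control loop. The power-control, reinstall and disable helpers below are site-specific placeholders (IPMI power control, SystemImager, DRM admin commands), not real APIs.

```python
# Sketch of the wipe-and-reinstall maintenance policy above.
import subprocess
import time

def power_cycle(node):
    print(f"power-cycling {node}")         # placeholder: IPMI / remote PDU

def network_reinstall(node):
    print(f"reinstalling {node} via PXE")  # placeholder: SystemImager etc.

def disable_node(node):
    print(f"disabling {node} in the DRM")  # placeholder: DRM admin command

def node_is_healthy(node):
    """Crude probe: is the node reachable again?"""
    return subprocess.run(["ping", "-c", "1", node],
                          capture_output=True).returncode == 0

def handle_faulted_node(node, rejoin_timeout=900):
    """Apply the maintenance methodology to one faulted node."""
    power_cycle(node)
    network_reinstall(node)                # unattended rebuild over the network
    deadline = time.time() + rejoin_timeout
    while time.time() < deadline:
        if node_is_healthy(node):
            return True                    # node rejoined the cluster
        time.sleep(30)
    disable_node(node)                     # considered failed; replace later
    return False
```

Because the nodes are anonymous and interchangeable, nothing on a failed node is worth saving, which is what makes this blunt wipe-and-reinstall policy viable.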
Conclusions • Building high-quality clusters for use in computational biology is a non-trivial task. • It is important to • understand user and application requirements. • actively participate in the DRM selection process. • avoid fixation on raw price/performance figures that might not reflect the true costs of deploying, managing and supporting distributed systems. • beware of “total solutions”.
Questions & discussion • How difficult or easy is it to detect failure modes (hardware, code, process)? • How difficult is it to run a cluster with mixed-architecture nodes?