Single System Abstractions for Clusters of Workstations Bienvenido Vélez
What is a cluster?
A collection of loosely connected self-contained computers cooperating to provide the abstraction of a single one
Possible System Abstractions (System Abstraction: Characterized by)
• Massively parallel processor: fine grain parallelism
• Multi-programmed system: coarse grain concurrency
• Independent nodes: fast interconnects
Transparency is a goal
Question Compare three approaches to providing the abstraction of a single system for clusters of workstations, using the following criteria: • Transparency • Availability • Scalability
Contributions • Improvements to the Microsoft Cluster Service • better availability and scalability • Adaptive Replication • Automatically adapting replication levels to maintain availability as the cluster grows
Outline • Comparison of approaches • Transparent remote execution (GLUnix) • Preemptive load balancing (MOSIX) • Highly available servers (Microsoft Cluster Service) • Contributions • Improvements to the MS Cluster Service • Adaptive Replication • Conclusions
GLUnix Transparent Remote Execution
[Figure, startup (glurun): the user types "glurun make" on the home node; the master daemon on the master node selects a remote node and asks its node daemon to execute (make, env); that daemon forks and execs make, while stdin, stdout, stderr, and signals are forwarded between the home node and the remote job.]
• Dynamic load balancing
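A rough sketch of this startup path in Python (class and function names are illustrative, not the actual GLUnix interfaces): the home-node stub asks the master daemon for the least-loaded node, and that node's daemon forks/execs the job and relays its output.

```python
# Hypothetical sketch of the glurun startup path (names are illustrative,
# not the actual GLUnix interfaces).
import os
import subprocess

class MasterDaemon:
    """Tracks per-node load and picks the least-loaded node (dynamic load balancing)."""
    def __init__(self, loads):
        self.loads = loads                           # node name -> current load

    def select_node(self):
        return min(self.loads, key=self.loads.get)

class NodeDaemon:
    """Runs on the remote node: forks/execs the job with the caller's environment."""
    def execute(self, argv, env):
        proc = subprocess.run(argv, env=env, capture_output=True, text=True)
        return proc.returncode, proc.stdout, proc.stderr

def glurun(argv, env, master, daemons):
    """Home-node stub: forwards the command to the remote node chosen by the master."""
    target = master.select_node()                    # master selects the remote node
    rc, out, err = daemons[target].execute(argv, env)
    print(out, end="")                               # relay stdout back to the user's terminal
    return rc

# Usage: node "b" is less loaded, so the job lands there.
master = MasterDaemon({"a": 2.0, "b": 0.5})
daemons = {"a": NodeDaemon(), "b": NodeDaemon()}
glurun(["echo", "hello from the cluster"], dict(os.environ), master, daemons)
```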
GLUnix Virtues and Limitations • Transparency • home node transparency limited by user-level implementation • interactive jobs supported • special commands for running cluster jobs • Availability • detects and masks node failures • master process is single point of failure • Scalability • master process performance bottleneck
MOSIX Preemptive Load Balancing
[Figure: processes distributed across nodes 1–5.]
• probabilistic diffusion of load information
• redirects system calls to home node
MOSIX Preemptive Load Balancing
Each node repeats: exchange local load with a random node, delay, then consider migrating a process to the node with minimal cost
• Keeps load information from a fixed number of nodes
• load = average size of ready queue
• cost = f(cpu time) + f(communication) + f(migration time)
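A toy sketch of the probabilistic diffusion of load information (the window size and round structure are assumed parameters, not MOSIX internals): every round each node sends its load to one randomly chosen peer and keeps load values from only a fixed number of nodes.

```python
# Illustrative sketch (not MOSIX source) of probabilistic load diffusion.
import random

WINDOW = 4                                   # fixed number of remote loads remembered

class Node:
    def __init__(self, name, ready_queue_len):
        self.name = name
        self.load = ready_queue_len          # load = average size of the ready queue
        self.known = {}                      # node name -> last reported load

    def receive(self, sender, load):
        self.known[sender] = load
        if len(self.known) > WINDOW:         # forget the oldest entry
            self.known.pop(next(iter(self.known)))

    def diffuse(self, peers):
        target = random.choice(peers)        # exchange load with one random node
        target.receive(self.name, self.load)
        self.receive(target.name, target.load)

nodes = [Node(f"n{i}", load) for i, load in enumerate([3, 1, 5, 2])]
for _ in range(10):                          # a few gossip rounds
    for n in nodes:
        n.diffuse([p for p in nodes if p is not n])
print({n.name: n.known for n in nodes})
```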
MOSIX Virtues and Limitations • Transparency • limited home node transparency • Availability • masks node failures • no process restart • preemptive load balancing limits portability and performance • Scalability • flooding and swinging possible • low communication overhead
Microsoft Cluster Service (MSCS): Highly available server processes
[Figure: clients connect to a Web server and an SQL server, each hosted on a node running MSCS; the MSCS instances exchange status.]
• replicated consistent node/server status database
• migrates servers from failed nodes
Microsoft Cluster Service Hardware Configuration
[Figure: the nodes running the Web and SQL servers exchange status over ethernet and share a SCSI bus holding the quorum, HTML, and RDB disks; the shared bus is a performance bottleneck and the shared disks are single points of failure.]
MSCS Virtues and Limitations Transparency • server migration transparent to clients Availability • servers migrated from failed nodes • shared disks are single points of failure Scalability • manual static configuration • manual static load balancing • shared disk bus is performance bottleneck
Summary of Approaches (System: Transparency / Availability / Scalability)
• GLUnix: home node transparency limited / masks failures, no fail-over, single point of failure / load balancing, bottleneck
• MOSIX: home node transparent / masks failures, no fail-over / load balancing
• MSCS: transparent to clients / server fail-over, single point of failure / bottleneck
Transaction-based Replication
[Figure: a transaction operates on an object (write[x]); replication maps this onto operations on its copies on nodes 1 through n ({ write[x1], …, write[xn] }).]
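A minimal sketch, assuming a simple in-memory replica object, of how a logical write[x] expands into { write[x1], …, write[xn] } inside one transaction: either every copy applies the write or none does.

```python
# Minimal sketch (illustrative, not MSCS code) of transaction-based replication.
class Replica:
    def __init__(self):
        self.committed = {}                      # object -> value
        self.pending = {}                        # writes staged for the open transaction

    def prepare(self, obj, value):
        self.pending[obj] = value

    def commit(self):
        self.committed.update(self.pending)
        self.pending.clear()

    def abort(self):
        self.pending.clear()

def replicated_write(obj, value, replicas):
    """Stage write[x_i] on every copy, then commit atomically; abort all on any failure."""
    try:
        for r in replicas:
            r.prepare(obj, value)
        for r in replicas:
            r.commit()
        return True
    except Exception:
        for r in replicas:
            r.abort()
        return False

replicas = [Replica() for _ in range(3)]
replicated_write("x", 42, replicas)
print([r.committed for r in replicas])           # all three copies agree on x = 42
```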
Re-designing MSCS • Idea: New core resource group fixed on every node • special disk resource • distributed transaction processing resource • transactional replicated file storage resource • Implement consensus with transactions (El-Abbadi-Toueg algorithm) • changes to configuration DB • cluster membership service • Improvements • eliminates complex global update and regroup protocols • switchover not required for application data • provides new generally useful service • Transactional replicated object storage
Re-designed MSCS with transactional replicated object storage
[Figure: within each node, the Cluster Service (node manager, resource manager) talks over RPC to a Resource Monitor hosting resource DLLs, and over RPC to the Transaction Service and the Replicated Storage Service; nodes communicate over the network.]
Adaptive Replication: Problem
What should a replication service do when nodes are added to the cluster?
Goal: maintain availability
Hypothesis (replication vs. migration):
• must alternate migration (M) with replication (R)
• replication should happen significantly less often than migration
Replication increases the number of copies of objects
[Figure: 2 nodes each hold copies of x and y; after 2 nodes are added and the objects are replicated, all 4 nodes hold copies of x and y.]
Migration re-distributes objects across all nodes
[Figure: 2 nodes each hold copies of x and y; after 2 nodes are added and the objects are migrated, the two copies of x sit on the first two nodes and the two copies of y on the last two.]
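The two pictures above can be phrased as operations on a placement map from each object to the set of nodes holding a copy; the sketch below is illustrative, not the service's actual data structure.

```python
# Sketch (assumed representation) of the two responses to cluster growth.
def migrate(placement, new_nodes):
    """Re-distribute existing copies across all nodes; the copy count k stays the same."""
    all_nodes = sorted({n for nodes in placement.values() for n in nodes} | set(new_nodes))
    out, i = {}, 0
    for obj, nodes in placement.items():
        k = len(nodes)
        out[obj] = {all_nodes[(i + j) % len(all_nodes)] for j in range(k)}
        i += k
    return out

def replicate(placement, new_nodes):
    """Add one copy of every object on each new node; the copy count k grows."""
    return {obj: nodes | set(new_nodes) for obj, nodes in placement.items()}

start = {"x": {1, 2}, "y": {1, 2}}               # 2 nodes, k = 2
print(migrate(start, [3, 4]))                    # still 2 copies each, spread over 4 nodes
print(replicate(start, [3, 4]))                  # 4 copies of each object
```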
Simplifying Assumptions • System keeps the same number of copies k of each object • System has n nodes • Initially n = k • n increases k nodes at a time • ignore partitions in computing availability
Conjecture
Highest availability is obtained if the objects are partitioned into q = n / k groups living on disjoint sets of nodes
[Example: k = 3, n = 6, q = 2: one group of objects (x') keeps its 3 copies on the first 3 nodes, the other group (x") on the remaining 3.]
Let's call this optimal migration
Adaptive Replication Necessary
Let each node fail independently with probability p (availability 1 - p)
The availability of the system is approximately: A(k,n) = 1 - q * p^k
Since optimal migration always increases q, migration decreases availability (albeit slowly)
Adaptive replication may be necessary to maintain availability
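A short derivation of the formula above under the stated assumptions (q = n/k disjoint groups of k copies each, nodes failing independently with probability p); the closing approximation step, valid when q·p^k is small, is the only addition.

```latex
% Each group is lost only if all k of its copies fail; the system is
% available when every group survives.
\begin{align*}
  \Pr[\text{a given group is lost}] &= p^{k},\\
  A(k, n) = \Pr[\text{all } q \text{ groups survive}]
      &= \left(1 - p^{k}\right)^{q}
      \;\approx\; 1 - q\,p^{k} \qquad (q\,p^{k} \ll 1).
\end{align*}
% With k fixed, optimal migration grows q = n/k as nodes are added, so
% A(k, n) decreases; raising k (adaptive replication) restores it.
```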
Adaptive ReplicationFurther Work • determine when it matters in real situations • relax assumptions • formalize arguments
Talk focuses on the Coarse Grain Layer (System: LCM layers supported / Mechanisms used)
• Berkeley NOW: NET, CGP, FGP / Active Messages, transparent remote execution, message passing API
• MOSIX: NET, CGP / preemptive load balancing, kernel-to-kernel RPC
• MSCS: CGP / node regroup, resource failover, switchover
• ParaStation: NET, FGP / user-level protocol stack with semaphores
GLUnix Characteristics • Provides special user commands for managing cluster jobs • Both batch and interactive jobs can be executed remotely • Supports dynamic load balancing
MOSIX: preemptive load balancing (flowchart)
load balance:
• if no less-loaded node exists, return
• select candidate process p with maximal impact on local load
• if p cannot migrate, return
• signal p to consider migration
consider (in p):
• select the target node N that minimizes the cost C[N] of running p there
• if migration to N is OK, migrate to N; otherwise return
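The flowchart above, written out as a Python sketch; the cost function is an assumed stand-in for f(cpu time) + f(communication) + f(migration time), not the MOSIX cost model.

```python
# Sketch of the migration decision (illustrative weights, not MOSIX internals).
def cost(proc, node_load, remote, cpu_w=1.0, comm_w=2.0, mig_w=0.5):
    """cost = f(cpu time) + f(communication) + f(migration time)"""
    c = cpu_w * proc["cpu"] * (1 + node_load)     # cpu time stretches on a loaded node
    if remote:
        c += comm_w * proc["comm"]                # system calls redirected to the home node
        c += mig_w * proc["size"]                 # one-time cost of moving the process
    return c

def consider_migration(local_load, known_loads, processes):
    """Return (pid, target node) to migrate, or None if staying is cheaper."""
    lighter = {n: l for n, l in known_loads.items() if l < local_load}
    if not lighter or not processes:
        return None                               # no less-loaded node is known
    p = max(processes, key=lambda pr: pr["cpu"])  # candidate with maximal local impact
    target = min(lighter, key=lambda n: cost(p, lighter[n], remote=True))
    if cost(p, lighter[target], remote=True) < cost(p, local_load, remote=False):
        return p["pid"], target
    return None

procs = [{"pid": 1, "cpu": 8.0, "comm": 0.1, "size": 2.0},
         {"pid": 2, "cpu": 1.0, "comm": 0.0, "size": 0.5}]
print(consider_migration(local_load=5, known_loads={"n2": 1, "n3": 4}, processes=procs))
```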
xFS: distributed log-based file system
[Figure: dirty data blocks accumulate into a log segment, which is split into data stripes 1–3 plus a parity stripe and written across a stripe group; client writes are always sequential.]
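A small sketch of the idea (illustrative, not xFS code): dirty blocks are batched into a log segment, cut into fixed-size data stripes, and an XOR parity stripe is added so any single lost stripe of the group can be reconstructed.

```python
# Sketch of log-based striping with parity.
STRIPE = 4                                           # bytes per stripe fragment (tiny for the example)

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def make_segment(dirty_blocks, stripe=STRIPE):
    """Batch dirty blocks into one log segment, split it into data stripes, add parity."""
    data = b"".join(dirty_blocks)
    data += b"\x00" * (-len(data) % stripe)          # pad to a whole number of stripes
    stripes = [data[i:i + stripe] for i in range(0, len(data), stripe)]
    parity = bytes(stripe)                           # all-zero parity to start
    for s in stripes:
        parity = xor(parity, s)
    return stripes, parity

def reconstruct(stripes, parity, lost):
    """Rebuild one missing data stripe by XOR-ing the parity with the survivors."""
    rebuilt = parity
    for i, s in enumerate(stripes):
        if i != lost:
            rebuilt = xor(rebuilt, s)
    return rebuilt

# The whole segment is written sequentially across the stripe group: data stripes, then parity.
stripes, parity = make_segment([b"hello", b"world!!"])   # 12 bytes -> 3 data stripes of 4
assert reconstruct(stripes, parity, 1) == stripes[1]      # survives the loss of one stripe
```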
xFS Virtues and Limitations • Exploits aggregate bandwidth of all disks • No need to buy expensive RAIDs • No single point of failure • Reliability: relies on accumulating dirty blocks to generate large sequential writes • Adaptive replication potentially more difficult
Microsoft Cluster Service (MSCS)
Goal: make an off-the-shelf server application highly available by wrapping it as a cluster-aware server application
MSCS Abstractions • Node • Resource • e.g. disks, IP addresses, servers • Resource dependency • e.g. a DBMS depends on the disk holding its data • Resource group • e.g. a server and its IP address • Quorum resource • logs configuration data • breaks ties during membership changes
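The abstractions above, modeled as plain data structures in a Python sketch (field names and the example resources are illustrative, not the real MSCS object model; the quorum resource is omitted): a resource group is the unit of fail-over, and its resources come online in dependency order.

```python
# Sketch of the MSCS abstractions as plain data structures.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Resource:
    name: str                                        # e.g. a disk, an IP address, a server
    depends_on: List["Resource"] = field(default_factory=list)
    online: bool = False

@dataclass
class ResourceGroup:
    """Unit of fail-over, e.g. a server together with the IP address clients use."""
    name: str
    resources: List[Resource]

    def bring_online(self):
        def start(r):
            for dep in r.depends_on:                 # dependencies first: DBMS after its disk
                start(dep)
            r.online = True
        for r in self.resources:
            start(r)

disk = Resource("disk holding SQL data")
ip = Resource("IP address used by clients")
sql = Resource("SQL server", depends_on=[disk, ip])
group = ResourceGroup("SQL group", [disk, ip, sql])
group.bring_online()
print([(r.name, r.online) for r in group.resources])
```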
MSCSGeneral Characteristics • Global state of all nodes and resources consistently replicated across all nodes (write all using atomic multicast protocol) • Node and resource failures detected • Resources of failed nodes migrated to surviving nodes • Failed resources restarted
MSCS System Architecture
[Figure: within each node, the Cluster Service (node manager, resource manager) communicates over RPC with a Resource Monitor, which loads resource DLLs that manage the underlying resources; nodes are connected by the network.]
MSCS virtually synchronous regroup operation
Activate:
• determine the nodes in its connected component
• determine if its component is the primary
• elect a new tie-breaker
• if this node is the new tie-breaker, broadcast its component as the new membership
Closing / Pruning:
• if not in the new membership, halt
Cleanup 1:
• install the new membership from the new tie-breaker
• acknowledge "ready to commit"
Cleanup 2:
• if this node owns the quorum disk, log the membership change
MSCS Primary Component Determination Rule
A node is in the primary component if one of the following holds:
• the node is connected to a majority of the previous membership
• the node is connected to half (>= 2) of the previous members and one of those is a tie-breaker
• the node is isolated, the previous membership had two nodes, and the node owned the quorum resource during the previous membership
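The three-clause rule, written as a predicate (a sketch; parameter names are illustrative, not MSCS identifiers).

```python
# The rule above as a predicate.
def in_primary_component(connected, previous_members, tie_breaker, owned_quorum):
    """connected: the previous members this node can still reach, including itself."""
    n_prev = len(previous_members)
    n_conn = len(connected)
    # 1. connected to a majority of the previous membership
    if n_conn > n_prev / 2:
        return True
    # 2. connected to exactly half (at least 2 nodes), one of which is the tie-breaker
    if 2 * n_conn == n_prev and n_conn >= 2 and tie_breaker in connected:
        return True
    # 3. isolated, previous membership had two nodes, and this node owned the quorum resource
    if n_conn == 1 and n_prev == 2 and owned_quorum:
        return True
    return False

# A two-node cluster partitions: the node that owned the quorum disk stays primary.
print(in_primary_component({"A"}, {"A", "B"}, tie_breaker="B", owned_quorum=True))  # True
```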
MSCS switchover
[Figure: two nodes share disks over SCSI buses; on a node failure, the surviving node takes over the shared disks.]
Every disk is a single point of failure! Alternative: replication
Summary of Approaches (System: Transparency / Availability / Performance)
• Berkeley NOW: home node transparency limited / single point of failure, no fail-over / load balancing, bottleneck
• MOSIX: home node transparent / masks failures, no fail-over, tolerates partitions / load balancing, low msg overhead
• MSCS: server migration transparent to clients / single point of failure, low MTTR, tolerates partitions / bottleneck
Comparing Approaches: Design Goals (System: LCM layers supported / Mechanisms used)
• Berkeley NOW: NET, CGP, FGP / Active Messages, transparent remote execution, message passing API
• MOSIX: NET, CGP / preemptive load balancing, kernel-to-kernel RPC
• MSCS: CGP / cluster membership services, resource fail-over
• ParaStation: NET, FGP / user-level protocol stack, network interface hardware
Comparing Approaches: Global Information Management (System: Approach)
• Berkeley NOW: centralized (the master daemon holds the global state)
• MOSIX: distributed, probabilistic (probabilistic diffusion of load information)
• MSCS: replicated, consistent (global state replicated across all nodes via atomic multicast)
Comparing Approaches: Fault Tolerance
(System: Failure detection / Recovery action)
• Berkeley NOW: detected by the master daemon via timeouts / failed nodes removed from the central configuration DB
• MOSIX: detected by individual nodes via timeouts / failed nodes removed from the local configuration DB
• MSCS: detected by individual nodes via heartbeats / failed nodes removed from the replicated configuration DB; resources restarted/migrated
(System: Single points of failure / Possible solution)
• Berkeley NOW: master process / process pairs
• MOSIX: none / N.A.
• MSCS: quorum resource, shared disks / virtual partitions replication algorithm
Comparing Approaches: Load Balancing (System: Approach / Description)
• MSCS: manual, static / the sys admin manually assigns processes to nodes; processes are statically assigned to processors
• Berkeley NOW: dynamic / uses dynamic load information to assign processes to processors
• MOSIX: preemptive / migrates processes in the middle of their execution
Comparing Approaches: Process Migration (System: Approach / Description)
• Berkeley NOW: none / processes run to completion once assigned to a processor
• MSCS: cooperative shutdown/restart / processes brought offline at the source and online at the destination
• MOSIX: transparent / process migrated at any point during execution
Example: k = 3, n = 3
[Figure: three nodes, each holding one copy of x.]
Each letter (e.g. x above) represents a group of objects with copies on the same subset of nodes
[Taxonomy figure relating availability techniques: fail-over/failback and switch-over (MSCS); redundancy via error-correcting codes (RAID, xFS) and via replication, using primary copy (HARP), voting (quorum consensus), or voting w/ views (virtual partitions).]