The Design and Architecture of the Microsoft Cluster Service (MSCS) - W. Vogels et al. ECE 845 Presentation By Sandeep Tamboli April 18, 2000
Outline • Prerequisites • Introduction • Design Goals • Cluster Abstractions • Cluster Operation • Cluster Architecture • Implementation Examples • Summary
Prerequisites • Availability = MTTF / (MTTF + MTTR) • MTTF: Mean Time To Failure • MTTR: Mean Time To Repair • High Availability: • Modern taxonomy of High Availability: • A system that has sufficient redundancy in its components to mask certain defined faults has High Availability (HA). • IBM High Availability Services: • The goals of high availability solutions are to minimize both the number of service interruptions and the time needed to recover when an outage does occur. • High availability is not a specific technology nor a quantifiable attribute; it is a goal to be reached. • This goal is different for each system and is based on the specific needs of the business the system supports. • The presenter's note: • The system may run with degraded performance while a component is down
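To make the availability formula concrete, here is a minimal Python sketch; the MTTF/MTTR figures are made-up examples, not numbers from the paper.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Availability = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)

# Hypothetical figures: a node that fails once a year (8760 h MTTF)
# and takes 2 hours to repair is roughly 99.98% available.
print(f"{availability(8760, 2):.4%}")   # -> 99.9772%
```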
MSCS (a.k.a. Wolfpack) • Extension of Windows NT to improve availability • First phase of implementation • Scalability limited to 2 nodes • MSCS features: • Failover • Migration • Automated restart • Differences from previous HA solutions: • Simpler user interface • More sophisticated modeling of applications • Tighter integration with the OS (NT)
MSCS(2) Shared-nothing cluster model: • Each node owns a subset of the cluster resources • Only one node may own a resource at a time • On failure, another node may take ownership of the resource
Design Goals • Commodity • Commercial off-the-shelf nodes • Windows NT Server • Standard Internet protocols • Scalability • Transparency • Presented as a single system to clients • System management tools manage the cluster as if it were a single server • Service and system execution information is available in a single cluster-wide log
Design Goals(2) • Availability • On failure detection: • Restart the application on another node • Migrate ownership of the other resources • The restart policy can specify the availability requirements of the application • Hardware/software upgrades are possible in a phased manner
Cluster Abstractions • Node: Runs an instance of the Cluster Service • Can be defined or active • Resource • Functionality offered at a node • Physical: printer • Logical: IP address • Applications implement logical resources • Exchange mail database • SAP applications • Quorum Resource • Persistent storage for the Cluster Configuration Database • Arbitration mechanism to control membership • Partition on a fault-tolerant shared SCSI disk
Cluster Abstractions(2) • Resource Dependencies • Dependency trees: the sequence in which to bring resources online • Resource Groups • Unit of migration • Virtual servers • An application runs within a virtual server environment • Gives applications, administrators, and clients the illusion of a single, stable environment • A client connects using the virtual server name • Enables many application instances to run on the same physical node
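Dependency trees boil down to bringing resources online in dependency order. A minimal sketch, assuming a hypothetical virtual-server group (disk, IP address, network name, SQL Server); the resource names are illustrative, not taken from the paper.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency tree for a virtual-server resource group:
# each resource lists the resources it depends on (which must be
# online first).
depends_on = {
    "physical disk": [],
    "IP address": [],
    "network name": ["IP address"],
    "SQL Server": ["physical disk", "network name"],
}

# TopologicalSorter yields an order in which every dependency comes
# before its dependents -- the sequence used to bring the group online.
online_order = list(TopologicalSorter(depends_on).static_order())
print(online_order)
# e.g. ['physical disk', 'IP address', 'network name', 'SQL Server']
# Taking the group offline walks the same tree in reverse.
```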
Cluster Abstractions(3) • Cluster Configuration Database • Replicated at each node • Accessed through the NT registry • Updates are applied using the Global Update Protocol
Member Join • The sponsor broadcasts the identity of the joining node • The sponsor informs the joining node about: • The current membership • The cluster configuration database • The joining member's heartbeats start • The sponsor waits for the first heartbeat • The sponsor signals the other nodes to consider the joining node a full member • An acknowledgement is sent to the joining node • On failure: • The join operation is aborted • The joining node is removed from the membership
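A minimal in-process sketch of the sponsor-side join sequence above; the message names and the single-process "network" are illustrative, not the actual MSCS wire protocol.

```python
def sponsor_join(members, config_db, joiner, heartbeat_received):
    """Return (new membership, message log), or raise if the join fails."""
    log = []

    # 1. Sponsor broadcasts the identity of the joining node.
    log.append(("broadcast", "JOIN_ANNOUNCE", joiner))

    # 2. Sponsor sends the joiner the current membership and a copy
    #    of the cluster configuration database.
    log.append(("send", joiner, ("MEMBERSHIP", list(members))))
    log.append(("send", joiner, ("CONFIG_DB", dict(config_db))))

    # 3. Sponsor waits for the joiner's first heartbeat.
    if not heartbeat_received:
        # On failure the join is aborted and the joiner is dropped.
        log.append(("broadcast", "JOIN_ABORT", joiner))
        raise RuntimeError(f"join aborted: no heartbeat from {joiner}")

    # 4. Other nodes are told to treat the joiner as a full member,
    #    and the joiner receives an acknowledgement.
    log.append(("broadcast", "JOIN_COMMIT", joiner))
    log.append(("send", joiner, ("JOIN_ACK",)))
    return members + [joiner], log

members, msgs = sponsor_join(["node1"], {"version": 42}, "node2", True)
print(members)   # ['node1', 'node2']
```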
Member Regroup • Upon suspicion that an active node has failed, a member regroup operation is executed to detect any membership changes • Reasons for suspicion: • Missing heartbeats • Power failures • The regroup algorithm moves each node through 6 stages • Each node sends periodic messages to all other nodes, indicating which stage it has finished (barrier synchronization)
Regroup Algorithm • Activate: • After a local clock tick, each node sends and collects status messages • A node advances if all responses have been collected or a timeout occurs • Closing: Determines whether partitions exist and whether the current node's partition should survive • Pruning: All nodes that are pruned for lack of connectivity halt • Cleanup phase one: All surviving nodes • Install the new membership • Mark the halted nodes as inactive • Inform the cluster network manager to filter out the halted nodes' messages • Have the event manager invoke local callback handlers announcing the node failures • Cleanup phase two: A second cleanup callback is invoked to allow a coordinated two-phase cleanup • Stabilized: The regroup has finished
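A sketch of the barrier synchronization that drives these stages: a node advances only when every surviving peer reports having finished the current stage, or when the stage times out. The stage names come from the slide; the message handling shown is illustrative, not MSCS code.

```python
STAGES = ["activate", "closing", "pruning",
          "cleanup_1", "cleanup_2", "stabilized"]

class RegroupParticipant:
    def __init__(self, node_id, peers):
        self.node_id = node_id
        self.peers = set(peers)                  # other nodes still considered alive
        self.stage = 0                           # index into STAGES
        self.finished = {p: -1 for p in peers}   # last stage each peer finished

    def on_status_message(self, peer, finished_stage):
        # Each node periodically tells the others which stage it has finished.
        self.finished[peer] = max(self.finished[peer], finished_stage)

    def try_advance(self, timed_out=False):
        # Barrier: advance only when every surviving peer has finished the
        # current stage, or when the stage timer expires (nodes suspected of
        # failure are then dropped in the "pruning" stage).
        if timed_out or all(self.finished[p] >= self.stage for p in self.peers):
            self.stage += 1
            return STAGES[self.stage] if self.stage < len(STAGES) else "done"
        return STAGES[self.stage]

node = RegroupParticipant("n1", ["n2", "n3"])
node.on_status_message("n2", 0)
node.on_status_message("n3", 0)
print(node.try_advance())    # -> 'closing'
```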
Partition Survival A partition survives if any of the following is satisfied: • n(new membership) > 1/2 * n(original membership) • The following three conditions hold together: • n(new membership) = 1/2 * n(original membership) • n(new membership) > 2 • tiebreaker node ∈ new membership • The following three conditions hold together: • n(original membership) = 2 • n(new membership) = 1 • quorum disk ∈ new membership
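These rules translate directly into a predicate each node can evaluate locally. A minimal sketch; the node ids and the quorum_owner parameter are illustrative.

```python
def partition_survives(new_members, original_size, tiebreaker, quorum_owner):
    """Return True if the partition formed by new_members may survive.

    new_members   -- set of node ids in the candidate partition
    original_size -- size of the membership before the regroup
    tiebreaker    -- id of the designated tiebreaker node
    quorum_owner  -- id of the node currently owning the quorum disk
    """
    n = len(new_members)

    # Rule 1: a strict majority of the original membership survives.
    if n > original_size / 2:
        return True

    # Rule 2: exactly half survives, with more than two nodes,
    # and the partition holds the tiebreaker node.
    if n == original_size / 2 and n > 2 and tiebreaker in new_members:
        return True

    # Rule 3: a 2-node cluster split down to one node survives
    # only if that node owns the quorum disk.
    if original_size == 2 and n == 1 and quorum_owner in new_members:
        return True

    return False

# Example: a 6-node cluster splits 3/3; the half holding the
# tiebreaker node survives, the other half halts.
print(partition_survives({"n1", "n2", "n3"}, 6, tiebreaker="n1", quorum_owner="n1"))  # True
print(partition_survives({"n4", "n5", "n6"}, 6, tiebreaker="n1", quorum_owner="n1"))  # False
```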
Resource Management • Resource control DLL for each type of resource • Polymorphic design allows easy management of varied resource types • Resource state transition diagram (states: Offline, Offline-pending, Online-pending, Online, Failed; transitions: request to online, init complete, init failed, request to offline, shutdown complete)
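One plausible reading of the state diagram, written as a transition table; the event names are paraphrased from the slide and the exact edges are an assumption, not the resource-control DLL's actual interface.

```python
# Resource state machine as a transition table: (state, event) -> next state.
TRANSITIONS = {
    ("offline",         "request_online"):    "online_pending",
    ("online_pending",  "init_complete"):     "online",
    ("online_pending",  "init_failed"):       "failed",
    ("online",          "request_offline"):   "offline_pending",
    ("offline_pending", "shutdown_complete"): "offline",
    ("failed",          "request_offline"):   "offline",
}

def next_state(state, event):
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {event} in state {state}")

# Bringing a resource online, then taking it offline again.
s = "offline"
for ev in ["request_online", "init_complete",
           "request_offline", "shutdown_complete"]:
    s = next_state(s, ev)
    print(ev, "->", s)
```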
Resource Migration: Pushing a group • Executed when: • A resource fails at the original node • The resource group prefers to execute on another node • The administrator moves the group • Steps involved: • All resources are taken to the offline state • A new active host node is selected • The group is brought online at the new node
Resource Migration: Pulling a group • Executed when: • The original node fails • Steps involved: • A new active host node is selected • The group is brought online at the new node • Nodes can determine the new owner hosts • Without communicating with each other • With the help of the replicated cluster database
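Pulling works without inter-node communication because every survivor runs the same deterministic rule over the same replicated cluster database. A minimal sketch; the preferred-owner rule and the names shown are illustrative assumptions.

```python
def choose_new_owner(group, active_nodes, cluster_db):
    """Every node computes the same answer from the replicated state."""
    preferred = cluster_db["groups"][group]["preferred_owners"]
    # First preferred owner that is still active wins; fall back to the
    # lowest node id so the choice stays deterministic.
    for node in preferred:
        if node in active_nodes:
            return node
    return min(active_nodes)

cluster_db = {"groups": {"sql-vs": {"preferred_owners": ["node2", "node1"]}}}
# node2 (the original owner) has failed; both survivors compute "node1".
print(choose_new_owner("sql-vs", {"node1", "node3"}, cluster_db))  # node1
```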
Resource Migration: Fail-back • Migration back to the preferred owner is not automatic • Constrained by a fail-back window: • How long the node must have been up and running • Blackout periods • Fail-back can be deferred for cost or availability reasons
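A minimal sketch of such a fail-back check, assuming a minimum-uptime requirement plus an hour-of-day window; the parameter names and values are illustrative, not actual MSCS group properties.

```python
from datetime import datetime, timedelta

def may_fail_back(node_up_since, now, min_uptime, window_start_hour, window_end_hour):
    """Allow fail-back only if the preferred node has been up long enough
    and the current time falls inside the fail-back window (i.e. outside
    any blackout period)."""
    if now - node_up_since < min_uptime:
        return False
    return window_start_hour <= now.hour < window_end_hour

now = datetime(2000, 4, 18, 3, 30)                  # 03:30, low-traffic hours
up_since = now - timedelta(hours=2)
print(may_fail_back(up_since, now, timedelta(hours=1), 2, 5))   # True
print(may_fail_back(up_since, now, timedelta(hours=4), 2, 5))   # False: not up long enough
```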
Global Update Management • Atomic broadcast protocol • If one surviving member receives an update, all surviving members eventually receive the update • The locker node has a central role • Steps in normal execution: • A node wanting to start a global update contacts the locker • When accepted by the locker, the sender RPCs to each active node to install the update, in node-ID order starting with the node immediately after the locker • Once the global update is over, the sender sends the locker an unlock request to indicate successful termination
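A minimal sketch of the normal GUP flow, with in-memory dictionaries standing in for each node's copy of the configuration database; the helper names are illustrative, and the note that the lock request carries the update (so the locker can re-drive it) is inferred from the failure handling on the next slide.

```python
def global_update(locker, nodes, update, registries):
    """Install `update` (a key/value pair) on every active node.

    nodes      -- active node ids, sorted by node id
    registries -- node id -> that node's local configuration database
    """
    # 1. The sender asks the locker for permission to start the update;
    #    the lock request carries the update itself, which is what lets
    #    the locker re-drive it if the sender later fails.
    key, value = update

    # 2. Once the locker accepts, the sender installs the update on each
    #    active node in node-id order, starting with the node immediately
    #    after the locker and wrapping around.
    start = (nodes.index(locker) + 1) % len(nodes)
    for node in nodes[start:] + nodes[:start]:
        registries[node][key] = value      # stand-in for the install RPC

    # 3. The sender sends the locker an unlock request to signal
    #    successful termination.
    return registries

regs = {n: {} for n in ["n1", "n2", "n3"]}
global_update("n1", ["n1", "n2", "n3"], ("quorum_owner", "n2"), regs)
print(regs)   # every node now holds the update
```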
Failure Conditions • If all the nodes that received the update fail => the update never occurred • If the sender fails during the update operation: • The locker reconstructs the update and sends it to each active node • Nodes ignore the duplicate update • If the sender and the locker both fail after the sender installed the update at any node beyond the locker: • The next node in the update list is assigned as the new locker • The new locker completes the update
Support Components • Cluster Network: Extension to the base OS • Heartbeat management • Cluster Disk Driver: Extension to the base OS • Shared SCSI bus • Cluster-wide Event Logging • Events are sent via RPC to all other nodes (periodically) • Time Service • Clock synchronization
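Heartbeat management feeds the failure suspicion that triggers a regroup. A minimal sketch; the interval and missed-beat threshold are illustrative defaults, not MSCS's actual tuning.

```python
import time

HEARTBEAT_INTERVAL = 1.2      # seconds between heartbeats (illustrative)
MISSED_LIMIT = 2              # missed beats before a node is suspected (illustrative)

def suspected_nodes(last_heartbeat, now=None):
    """Return the set of nodes whose heartbeats have gone missing."""
    now = time.monotonic() if now is None else now
    deadline = MISSED_LIMIT * HEARTBEAT_INTERVAL
    return {node for node, ts in last_heartbeat.items() if now - ts > deadline}

beats = {"n1": 10.0, "n2": 6.0}          # last heartbeat timestamps
print(suspected_nodes(beats, now=10.5))  # {'n2'} -> start a regroup
```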
Implementation Examples • MS SQL Server • A SQL Server resource group configured as a virtual server • A 2-node cluster can host two or more HA SQL Servers • Oracle servers • Oracle Parallel Server • Shared-disk model • Uses MSCS to track cluster organization and membership notifications • Oracle Fail-Safe server • Each instance of a Fail-Safe database is a virtual server • Upon failure: • The virtual server migrates to the other node • Clients reconnect under the same name and address
Implementation Examples(2) • SAP R/3 • Three-tier client/server system • Normal operation: • One node hosts the database virtual server • The other hosts the application components combined into an application server • Upon failure: • The failed virtual server migrates to the surviving node • The application servers are 'failover aware' • Migration of the application server requires a new login session
Scalability Issues: Join Latency, Regroup Messages, GUP Latency, GUP Throughput
Summary • A highly available 2-node cluster design using commodity components • Cluster is managed in 3 tiers • Cluster abstractions • Cluster operation • Cluster Service components (interaction with OS) • Design not scalable beyond about 16 nodes
Relevant URLs • A Modern Taxonomy of High Availability • http://www.interlog.com/~resnick/HA.htm • An overview of Clustering in Windows NT Server 4.0, Enterprise Edition • http://www.microsoft.com/ntserver/ntserverenterprise/exec/overview/clustering.asp • Scalability of MSCS • http://www.cs.cornell.edu/rdc/mscs/nt98/ • IBM High Availability Services • http://www.as.ibm.com/asus/highavail2.html • High-Availability Linux Project • http://linux-ha.org/
Discussion Questions • Is clustering the only choice for HA systems? • Why is MSCS in use today despite its scalability concerns? • Does performance suffer because of HA provisions? Why? • Are geographical HA solutions needed (in order to handle site disasters)? • This is good for transaction-oriented services. What about, say, scientific computing? • Hierarchical clustering?
Glossary • NetBIOS: Short for Network Basic Input Output System, an application programming interface (API) that augments the DOS BIOS by adding special functions for local-area networks (LANs). Almost all LANs for PCs are based on NetBIOS. Some LAN manufacturers have even extended it, adding additional network capabilities. NetBIOS relies on a message format called Server Message Block (SMB). • SMB: Short for Server Message Block, a message format used by DOS and Windows to share files, directories, and devices. NetBIOS is based on the SMB format, and many network products use SMB. These SMB-based networks include LAN Manager, Windows for Workgroups, Windows NT, and LAN Server. There are also a number of products that use SMB to enable file sharing among different operating system platforms. A product called Samba, for example, enables UNIX and Windows machines to share directories and files.