The Design and Architecture of the Microsoft Cluster Service (MSCS) - W. Vogels et al. ECE 845 Presentation By Sandeep Tamboli April 18, 2000
Outline • Prerequisites • Introduction • Design Goals • Cluster Abstractions • Cluster Operation • Cluster Architecture • Implementation Examples • Summary
Prerequisites • Availability = MTTF / (MTTF + MTTR) • MTTF: Mean Time To Failure • MTTR: Mean Time To Repair • High Availability: • Modern taxonomy of High Availability: • A system that has sufficient redundancy in its components to mask certain defined faults has High Availability (HA). • IBM High Availability Services: • The goals of high availability solutions are to minimize both the number of service interruptions and the time needed to recover when an outage does occur. • High availability is not a specific technology nor a quantifiable attribute; it is a goal to be reached. • This goal is different for each system and is based on the specific needs of the business the system supports. • The presenter's note: • The system may run with degraded performance while a component is down
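To make the availability formula concrete, here is a minimal Python sketch; the MTTF/MTTR figures are made-up examples, not numbers from the paper.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Availability = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)

# Hypothetical figures: a node that fails once a year (8760 h MTTF)
# and takes 2 hours to repair is roughly 99.98% available.
print(f"{availability(8760, 2):.4%}")   # -> 99.9772%
```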
MSCS (a.k.a. Wolfpack) • Extension of Windows NT to improve availability • First phase of implementation • Scalability limited to 2 nodes • MSCS features: • Failover • Migration • Automated restart • Differences from previous HA solutions: • Simpler user interface • More sophisticated modeling of applications • Tighter integration with the OS (NT)
MSCS(2) Shared-nothing cluster model: • Each node owns a subset of the cluster resources • Only one node may own a resource at a time • On failure, another node may take ownership of the resource
Design Goals • Commodity • Commercial off-the-shelf nodes • Windows NT Server • Standard Internet protocols • Scalability • Transparency • Presented as a single system to clients • System management tools manage the cluster as if it were a single server • Service and system execution information is available in a single cluster-wide log
Design Goals(2) • Availability • On failure detection: • Restart the application on another node • Migrate ownership of the other resources • The restart policy can specify the availability requirements of the application • Hardware/software upgrades are possible in a phased manner
Cluster Abstractions • Node: Runs an instance of the Cluster Service • Can be defined or active • Resource • Functionality offered at a node • Physical: printer • Logical: IP address • Applications implement logical resources • Exchange mail database • SAP applications • Quorum Resource • Persistent storage for the Cluster Configuration Database • Arbitration mechanism to control membership • Partition on a fault-tolerant shared SCSI disk
Cluster Abstractions(2) • Resource Dependencies • Dependency trees: the sequence in which to bring resources online • Resource Groups • Unit of migration • Virtual servers • An application runs within a virtual server environment • Gives applications, administrators, and clients the illusion of a single, stable environment • A client connects using the virtual server name • Enables many application instances to run on the same physical node
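Dependency trees boil down to bringing resources online in dependency order. A minimal sketch, assuming a hypothetical virtual-server group (disk, IP address, network name, SQL Server); the resource names are illustrative, not taken from the paper.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency tree for a virtual-server resource group:
# each resource lists the resources it depends on (which must be
# online first).
depends_on = {
    "physical disk": [],
    "IP address": [],
    "network name": ["IP address"],
    "SQL Server": ["physical disk", "network name"],
}

# TopologicalSorter yields an order in which every dependency comes
# before its dependents -- the sequence used to bring the group online.
online_order = list(TopologicalSorter(depends_on).static_order())
print(online_order)
# e.g. ['physical disk', 'IP address', 'network name', 'SQL Server']
# Taking the group offline walks the same tree in reverse.
```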
Cluster Abstractions(3) • Cluster Configuration Database • Replicated at each node • Accessed through the NT registry • Updates are applied using the Global Update Protocol
Member Join • The sponsor broadcasts the identity of the joining node • The sponsor informs the joining node about: • The current membership • The cluster configuration database • The joining member's heartbeats start • The sponsor waits for the first heartbeat • The sponsor signals the other nodes to consider the joining node a full member • An acknowledgement is sent to the joining node • On failure: • The join operation is aborted • The joining node is removed from the membership
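A minimal in-process sketch of the sponsor-side join sequence above; the message names and the single-process "network" are illustrative, not the actual MSCS wire protocol.

```python
def sponsor_join(members, config_db, joiner, heartbeat_received):
    """Return (new membership, message log), or raise if the join fails."""
    log = []

    # 1. Sponsor broadcasts the identity of the joining node.
    log.append(("broadcast", "JOIN_ANNOUNCE", joiner))

    # 2. Sponsor sends the joiner the current membership and a copy
    #    of the cluster configuration database.
    log.append(("send", joiner, ("MEMBERSHIP", list(members))))
    log.append(("send", joiner, ("CONFIG_DB", dict(config_db))))

    # 3. Sponsor waits for the joiner's first heartbeat.
    if not heartbeat_received:
        # On failure the join is aborted and the joiner is dropped.
        log.append(("broadcast", "JOIN_ABORT", joiner))
        raise RuntimeError(f"join aborted: no heartbeat from {joiner}")

    # 4. Other nodes are told to treat the joiner as a full member,
    #    and the joiner receives an acknowledgement.
    log.append(("broadcast", "JOIN_COMMIT", joiner))
    log.append(("send", joiner, ("JOIN_ACK",)))
    return members + [joiner], log

members, msgs = sponsor_join(["node1"], {"version": 42}, "node2", True)
print(members)   # ['node1', 'node2']
```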
Member Regroup • Upon suspicion that an active node has failed, a member regroup operation is executed to detect any membership changes • Reasons for suspicion: • Missing heartbeats • Power failures • The regroup algorithm moves each node through 6 stages • Each node sends periodic messages to all other nodes, indicating which stage it has finished (barrier synchronization)
Regroup Algorithm • Activate: • After a local clock tick, each node sends and collects status messages • A node advances if all responses have been collected or a timeout occurs • Closing: Determines whether partitions exist and whether the current node's partition should survive • Pruning: All nodes that are pruned for lack of connectivity halt • Cleanup phase one: All surviving nodes • Install the new membership • Mark the halted nodes as inactive • Inform the cluster network manager to filter out the halted nodes' messages • Have the event manager invoke local callback handlers announcing the node failures • Cleanup phase two: A second cleanup callback is invoked to allow a coordinated two-phase cleanup • Stabilized: The regroup has finished
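A sketch of the barrier synchronization that drives these stages: a node advances only when every surviving peer reports having finished the current stage, or when the stage times out. The stage names come from the slide; the message handling shown is illustrative, not MSCS code.

```python
STAGES = ["activate", "closing", "pruning",
          "cleanup_1", "cleanup_2", "stabilized"]

class RegroupParticipant:
    def __init__(self, node_id, peers):
        self.node_id = node_id
        self.peers = set(peers)                  # other nodes still considered alive
        self.stage = 0                           # index into STAGES
        self.finished = {p: -1 for p in peers}   # last stage each peer finished

    def on_status_message(self, peer, finished_stage):
        # Each node periodically tells the others which stage it has finished.
        self.finished[peer] = max(self.finished[peer], finished_stage)

    def try_advance(self, timed_out=False):
        # Barrier: advance only when every surviving peer has finished the
        # current stage, or when the stage timer expires (nodes suspected of
        # failure are then dropped in the "pruning" stage).
        if timed_out or all(self.finished[p] >= self.stage for p in self.peers):
            self.stage += 1
            return STAGES[self.stage] if self.stage < len(STAGES) else "done"
        return STAGES[self.stage]

node = RegroupParticipant("n1", ["n2", "n3"])
node.on_status_message("n2", 0)
node.on_status_message("n3", 0)
print(node.try_advance())    # -> 'closing'
```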
Partition Survival A partition survives if any of the following is satisfied: • n(new membership) > 1/2 * n(original membership) • The following three conditions hold together: • n(new membership) = 1/2 * n(original membership) • n(new membership) > 2 • tiebreaker node ∈ new membership • The following three conditions hold together: • n(original membership) = 2 • n(new membership) = 1 • quorum disk ∈ new membership
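These rules translate directly into a predicate each node can evaluate locally. A minimal sketch; the node ids and the quorum_owner parameter are illustrative.

```python
def partition_survives(new_members, original_size, tiebreaker, quorum_owner):
    """Return True if the partition formed by new_members may survive.

    new_members   -- set of node ids in the candidate partition
    original_size -- size of the membership before the regroup
    tiebreaker    -- id of the designated tiebreaker node
    quorum_owner  -- id of the node currently owning the quorum disk
    """
    n = len(new_members)

    # Rule 1: a strict majority of the original membership survives.
    if n > original_size / 2:
        return True

    # Rule 2: exactly half survives, with more than two nodes,
    # and the partition holds the tiebreaker node.
    if n == original_size / 2 and n > 2 and tiebreaker in new_members:
        return True

    # Rule 3: a 2-node cluster split down to one node survives
    # only if that node owns the quorum disk.
    if original_size == 2 and n == 1 and quorum_owner in new_members:
        return True

    return False

# Example: a 6-node cluster splits 3/3; the half holding the
# tiebreaker node survives, the other half halts.
print(partition_survives({"n1", "n2", "n3"}, 6, tiebreaker="n1", quorum_owner="n1"))  # True
print(partition_survives({"n4", "n5", "n6"}, 6, tiebreaker="n1", quorum_owner="n1"))  # False
```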
Resource Management • Resource control DLL for each type of resource • Polymorphic design allows easy management of varied resource types • Resource state transition diagram (states: Offline, Offline-pending, Online-pending, Online, Failed; transitions: request to online, init complete, init failed, request to offline, shutdown complete)
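One plausible reading of the state diagram, written as a transition table; the event names are paraphrased from the slide and the exact edges are an assumption, not the resource-control DLL's actual interface.

```python
# Resource state machine as a transition table: (state, event) -> next state.
TRANSITIONS = {
    ("offline",         "request_online"):    "online_pending",
    ("online_pending",  "init_complete"):     "online",
    ("online_pending",  "init_failed"):       "failed",
    ("online",          "request_offline"):   "offline_pending",
    ("offline_pending", "shutdown_complete"): "offline",
    ("failed",          "request_offline"):   "offline",
}

def next_state(state, event):
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {event} in state {state}")

# Bringing a resource online, then taking it offline again.
s = "offline"
for ev in ["request_online", "init_complete",
           "request_offline", "shutdown_complete"]:
    s = next_state(s, ev)
    print(ev, "->", s)
```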
Resource Migration: Pushing a group • Executed when: • A resource fails at the original node • The resource group prefers to execute on another node • The administrator moves the group • Steps involved: • All resources are taken to the offline state • A new active host node is selected • The group is brought online at the new node
Resource Migration: Pulling a group • Executed when: • The original node fails • Steps involved: • A new active host node is selected • The group is brought online at the new node • Nodes can determine the new owner hosts • Without communicating with each other • With the help of the replicated cluster database
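Pulling works without inter-node communication because every survivor runs the same deterministic rule over the same replicated cluster database. A minimal sketch; the preferred-owner rule and the names shown are illustrative assumptions.

```python
def choose_new_owner(group, active_nodes, cluster_db):
    """Every node computes the same answer from the replicated state."""
    preferred = cluster_db["groups"][group]["preferred_owners"]
    # First preferred owner that is still active wins; fall back to the
    # lowest node id so the choice stays deterministic.
    for node in preferred:
        if node in active_nodes:
            return node
    return min(active_nodes)

cluster_db = {"groups": {"sql-vs": {"preferred_owners": ["node2", "node1"]}}}
# node2 (the original owner) has failed; both survivors compute "node1".
print(choose_new_owner("sql-vs", {"node1", "node3"}, cluster_db))  # node1
```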
Resource Migration: Fail-back • Migration back to the preferred owner is not automatic • Constrained by a fail-back window: • How long the node must have been up and running • Blackout periods • Fail-back can be deferred for cost or availability reasons
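A minimal sketch of such a fail-back check, assuming a minimum-uptime requirement plus an hour-of-day window; the parameter names and values are illustrative, not actual MSCS group properties.

```python
from datetime import datetime, timedelta

def may_fail_back(node_up_since, now, min_uptime, window_start_hour, window_end_hour):
    """Allow fail-back only if the preferred node has been up long enough
    and the current time falls inside the fail-back window (i.e. outside
    any blackout period)."""
    if now - node_up_since < min_uptime:
        return False
    return window_start_hour <= now.hour < window_end_hour

now = datetime(2000, 4, 18, 3, 30)                  # 03:30, low-traffic hours
up_since = now - timedelta(hours=2)
print(may_fail_back(up_since, now, timedelta(hours=1), 2, 5))   # True
print(may_fail_back(up_since, now, timedelta(hours=4), 2, 5))   # False: not up long enough
```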
Global Update Management • Atomic broadcast protocol • If one surviving member receives an update, all surviving members eventually receive the update • The locker node has a central role • Steps in normal execution: • A node wanting to start a global update contacts the locker • When accepted by the locker, the sender RPCs to each active node to install the update, in node-ID order starting with the node immediately after the locker • Once the global update is over, the sender sends the locker an unlock request to indicate successful termination
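A minimal sketch of the normal GUP flow, with in-memory dictionaries standing in for each node's copy of the configuration database; the helper names are illustrative, and the note that the lock request carries the update (so the locker can re-drive it) is inferred from the failure handling on the next slide.

```python
def global_update(locker, nodes, update, registries):
    """Install `update` (a key/value pair) on every active node.

    nodes      -- active node ids, sorted by node id
    registries -- node id -> that node's local configuration database
    """
    # 1. The sender asks the locker for permission to start the update;
    #    the lock request carries the update itself, which is what lets
    #    the locker re-drive it if the sender later fails.
    key, value = update

    # 2. Once the locker accepts, the sender installs the update on each
    #    active node in node-id order, starting with the node immediately
    #    after the locker and wrapping around.
    start = (nodes.index(locker) + 1) % len(nodes)
    for node in nodes[start:] + nodes[:start]:
        registries[node][key] = value      # stand-in for the install RPC

    # 3. The sender sends the locker an unlock request to signal
    #    successful termination.
    return registries

regs = {n: {} for n in ["n1", "n2", "n3"]}
global_update("n1", ["n1", "n2", "n3"], ("quorum_owner", "n2"), regs)
print(regs)   # every node now holds the update
```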
Failure Conditions • If all the nodes that received the update fail => the update never occurred • If the sender fails during the update operation: • The locker reconstructs the update and sends it to each active node • Nodes ignore the duplicate update • If the sender and the locker both fail after the sender installed the update at any node beyond the locker: • The next node in the update list is assigned as the new locker • The new locker completes the update
Support Components • Cluster Network: Extension to the base OS • Heartbeat management • Cluster Disk Driver: Extension to the base OS • Shared SCSI bus • Cluster-wide Event Logging • Events are sent via RPC to all other nodes (periodically) • Time Service • Clock synchronization
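Heartbeat management feeds the failure suspicion that triggers a regroup. A minimal sketch; the interval and missed-beat threshold are illustrative defaults, not MSCS's actual tuning.

```python
import time

HEARTBEAT_INTERVAL = 1.2      # seconds between heartbeats (illustrative)
MISSED_LIMIT = 2              # missed beats before a node is suspected (illustrative)

def suspected_nodes(last_heartbeat, now=None):
    """Return the set of nodes whose heartbeats have gone missing."""
    now = time.monotonic() if now is None else now
    deadline = MISSED_LIMIT * HEARTBEAT_INTERVAL
    return {node for node, ts in last_heartbeat.items() if now - ts > deadline}

beats = {"n1": 10.0, "n2": 6.0}          # last heartbeat timestamps
print(suspected_nodes(beats, now=10.5))  # {'n2'} -> start a regroup
```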
Implementation Examples • MS SQL Server • A SQL Server resource group configured as a virtual server • A 2-node cluster can host two or more HA SQL Servers • Oracle servers • Oracle Parallel Server • Shared-disk model • Uses MSCS to track cluster organization and membership notifications • Oracle Fail-Safe server • Each instance of a Fail-Safe database is a virtual server • Upon failure: • The virtual server migrates to the other node • Clients reconnect under the same name and address
Implementation Examples(2) • SAP R/3 • Three-tier client/server system • Normal operation: • One node hosts the database virtual server • The other hosts the application components combined into an application server • Upon failure: • The failed virtual server migrates to the surviving node • The application servers are 'failover aware' • Migration of the application server requires a new login session
Scalability Issues: Join Latency, Regroup Messages, GUP Latency, GUP Throughput
Summary • A highly available 2-node cluster design using commodity components • Cluster is managed in 3 tiers • Cluster abstractions • Cluster operation • Cluster Service components (interaction with OS) • Design not scalable beyond about 16 nodes
Relevant URLs • A Modern Taxonomy of High Availability • http://www.interlog.com/~resnick/HA.htm • An overview of Clustering in Windows NT Server 4.0, Enterprise Edition • http://www.microsoft.com/ntserver/ntserverenterprise/exec/overview/clustering.asp • Scalability of MSCS • http://www.cs.cornell.edu/rdc/mscs/nt98/ • IBM High Availability Services • http://www.as.ibm.com/asus/highavail2.html • High-Availability Linux Project • http://linux-ha.org/
Discussion Questions • Is clustering the only choice for HA systems? • Why is MSCS in use today despite its scalability concerns? • Does performance suffer because of HA provisions? Why? • Are geographical HA solutions needed (in order to handle site disasters)? • This is good for transaction-oriented services. What about, say, scientific computing? • Hierarchical clustering?
Glossary • NetBIOS: Short for Network Basic Input Output System, an application programming interface (API) that augments the DOS BIOS by adding special functions for local-area networks (LANs). Almost all LANs for PCs are based on NetBIOS. Some LAN manufacturers have even extended it, adding additional network capabilities. NetBIOS relies on a message format called Server Message Block (SMB). • SMB: Short for Server Message Block, a message format used by DOS and Windows to share files, directories, and devices. NetBIOS is based on the SMB format, and many network products use SMB. These SMB-based networks include LAN Manager, Windows for Workgroups, Windows NT, and LAN Server. There are also a number of products that use SMB to enable file sharing among different operating system platforms. A product called Samba, for example, enables UNIX and Windows machines to share directories and files.