MSCS Clustering Implementation

MSCS Clustering Implementation: Mylex eXtremeRAID 1100 PCI-to-Ultra2 SCSI RAID Controllers

Clustering: Basics

What Are Clusters? • Group of independent systems that • Function as a single system • Appear to users as a single system • And are managed as a single system • Clusters are “virtual servers” Mylex Confidential

Why Clusters? • Clusters Improve System Availability • This is the primary value in Wolfpack-I clusters • Clusters Enable Application Scaling • Clusters Simplify System Management • Clusters (with Intel servers) Are Cheap

System Availability • Clusters Improve System Availability • When a networked server fails, the service it provided is down • When a clustered server fails, the service it provided “fails over” to the surviving server and downtime is avoided [Figure labels: Networked Servers (Mail Server, Internet Server); Clustered Servers (Mail & Internet)]

Application Scaling • Clusters Enable Application Scaling • With networked SMP servers, application scaling is limited to a single server • With clusters, applications scale across multiple SMP servers (typically up to 16 servers)

Simple Systems Management • Clusters Simplify System Management • Clusters present a Single System Image; the cluster looks like a single server to management applications • Hence, clusters reduce system management costs [Figure labels: Three Management Domains; One Management Domain]

Inexpensive • Clusters (with Intel servers) Are Cheap • Essentially no additional hardware costs -- readily available hardware (high-volume servers) • Microsoft charges an extra $3K per node • Windows NT Server: $1,000 • Windows NT Server, Enterprise Edition: $4,000 • Note: Proprietary Unix cluster software costs $10K to $25K per node.
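The per-node arithmetic on this slide can be checked with a short Python sketch. It uses only the list prices quoted above; the function names are illustrative and do not come from any Microsoft or Mylex tool.

```python
# List prices quoted on the slide, in US dollars per node.
NT_SERVER = 1_000
NT_SERVER_ENTERPRISE = 4_000
UNIX_CLUSTER_SW_RANGE = (10_000, 25_000)  # proprietary Unix cluster software


def mscs_premium_per_node() -> int:
    """Extra software cost of Enterprise Edition over plain NT Server."""
    return NT_SERVER_ENTERPRISE - NT_SERVER


def mscs_cluster_premium(nodes: int) -> int:
    """Extra software cost for a cluster of `nodes` servers."""
    return nodes * mscs_premium_per_node()


def unix_cluster_software_cost(nodes: int) -> tuple:
    """Low and high estimate for the Unix cluster software on `nodes` servers."""
    low, high = UNIX_CLUSTER_SW_RANGE
    return nodes * low, nodes * high


if __name__ == "__main__":
    print(mscs_cluster_premium(2))
    print(unix_cluster_software_cost(2))
```

For a two-node cluster this compares a $6K premium against $20K to $50K.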

An Analogy to RAID • RAID Makes Disks Fault Tolerant • Clusters make servers fault tolerant • RAID Increases I/O Performance • Clusters increase compute performance • RAID Makes Disks Easier to Manage • Clusters make servers easier to manage

Two Flavors of Clusters • High Availability Clusters • Microsoft’s Wolfpack-I • Compaq’s Recovery Server • Load Balancing Clusters (a.k.a. Parallel Application Clusters) • Microsoft’s Wolfpack-II • Digital’s VAXClusters • Note: Load balancing clusters are a superset of high availability clusters.

High Availability Clusters • Two-node clusters (node = server) • During normal operations, both servers do useful work • Failover • When a node fails, applications fail over to the surviving node, which assumes the workload of both nodes [Figure labels: Mail; Web; Mail & Web]

High Availability Clusters (Contd.) • Failback • When the failed node is returned to service, the applications fail back to it [Figure labels: Mail; Web]

Load Balancing Clusters • Multi-node clusters (two or more nodes) • Load balancing clusters typically run a single application, e.g. a database, distributed across all nodes • Cluster capacity is increased by adding nodes (but as with SMP servers, scaling is less than linear) [Figure labels: 3,000 TPM; 3,600 TPM]

Load Balancing Clusters (Contd.) • Cluster rebalances the workload when a node dies • If different apps are running on each server, they fail over to the least busy server or as directed by predefined failover policies
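The target-selection rule on this slide can be sketched in Python. This is an illustration under stated assumptions: the `load` metric (a utilization figure per node) and the `policy` table (an ordered list of preferred targets per failed node) are hypothetical structures, not part of any Microsoft cluster API.

```python
def pick_failover_target(failed, load, policy=None):
    """Choose the node that takes over the work of `failed`.

    load:   dict of node name -> current utilization (hypothetical metric)
    policy: optional dict of failed node -> ordered list of preferred targets
    """
    survivors = {node: used for node, used in load.items() if node != failed}
    if not survivors:
        raise RuntimeError("no surviving node to fail over to")

    # A predefined failover policy, if present, overrides load-based selection.
    for preferred in (policy or {}).get(failed, []):
        if preferred in survivors:
            return preferred

    # Otherwise pick the least busy surviving node.
    return min(survivors, key=survivors.get)


if __name__ == "__main__":
    load = {"A": 0.5, "B": 0.7, "C": 0.2}
    print(pick_failover_target("A", load))
    print(pick_failover_target("A", load, policy={"A": ["B"]}))
```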

Two Cluster Models • “Shared Nothing” Model • Microsoft’s Wolfpack Cluster • “Shared Disk” Model • VAXClusters

“Shared Nothing” Model • At any moment in time, each disk is owned and addressable by only one server • “Shared nothing” terminology is confusing • Access to disks is shared -- on the same bus • But at any moment in time, disks are not shared

“Shared Nothing” Model (Contd.) • When a server fails, the disks that it owns “fail over” to the surviving server, transparently to the clients
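The ownership rule of the “shared nothing” model can be sketched in Python as a table mapping each disk to exactly one owning server. The class and method names are hypothetical and only illustrate the idea; this is not how MSCS stores ownership internally.

```python
class SharedNothingCluster:
    """Each disk has exactly one owner at any moment in time."""

    def __init__(self, ownership):
        # disk name -> owning server; both servers sit on the same bus,
        # but only the owner may address the disk.
        self.owner = dict(ownership)

    def can_access(self, server, disk):
        return self.owner[disk] == server

    def fail_over(self, failed, survivor):
        """Move every disk owned by `failed` to `survivor`; return moved disks."""
        moved = [disk for disk, srv in self.owner.items() if srv == failed]
        for disk in moved:
            self.owner[disk] = survivor
        return moved


if __name__ == "__main__":
    cluster = SharedNothingCluster({"disk1": "A", "disk2": "B"})
    print(cluster.can_access("B", "disk1"))
    print(cluster.fail_over("A", "B"))
    print(cluster.can_access("B", "disk1"))
```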

“Shared Disk” Model • Disks are not owned by servers but shared by all servers • At any moment in time, any server can access any disk • Distributed Lock Manager arbitrates disk access so apps on different servers don’t step on one another (corrupt data)
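The arbitration role of a lock manager can be sketched in Python. This is a single-process toy with hypothetical names: a real Distributed Lock Manager coordinates across nodes over the cluster interconnect and supports several lock modes, none of which is modeled here.

```python
class LockManager:
    """Grants each resource (for example a disk block) to one server at a time."""

    def __init__(self):
        self._holder = {}  # resource -> server currently holding the lock

    def acquire(self, server, resource):
        """Return True if `server` holds the lock after this call."""
        holder = self._holder.setdefault(resource, server)
        return holder == server

    def release(self, server, resource):
        """Release the lock; a request from a non-holder is ignored."""
        if self._holder.get(resource) == server:
            del self._holder[resource]


if __name__ == "__main__":
    dlm = LockManager()
    print(dlm.acquire("A", "disk1/block7"))
    print(dlm.acquire("B", "disk1/block7"))
    dlm.release("A", "disk1/block7")
    print(dlm.acquire("B", "disk1/block7"))
```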

Cluster Interconnect • This is about how servers are tied together and how disks are physically connected to the cluster • Clustered servers always have a client network interconnect, typically Ethernet, to talk to users • And at least one cluster interconnect to talk to other nodes and to disks [Figure labels: Client Network; Cluster Interconnect; HBA; RAID]

Cluster Interconnect (Contd.) • Or They Can Have Two Cluster Interconnects • One for nodes to talk to each other -- “Heartbeat Interconnect” • Typically Ethernet • And one for nodes to talk to disks -- “Shared Disk Interconnect” • Typically SCSI or Fibre Channel [Figure labels: Cluster Interconnect; Shared Disk Interconnect; NIC; HBA; RAID]

Microsoft Cluster Server (MSCS): Wolfpack

Clusters Are Not New • Clusters Have Been Around Since 1985 • Most UNIX Systems Are Clustered • What’s New Is Microsoft Clusters • Code named “Wolfpack” • Named Microsoft Cluster Server (MSCS) • Software that provides clustering • MSCS is part of Windows NT Server, Enterprise Edition 4.0

Microsoft Cluster Rollout • Wolfpack-I • In Windows NT Server, Enterprise Edition 4.0 (NT/E, 4.0) [Also includes Transaction Server and Reliable Message Queue] • Two-node “failover cluster” • Shipped October 1997 • Wolfpack-II • In (or after) Windows 2000, Advanced Server • Borrows components from more robust Tandem and Digital cluster technology (Compaq technology sharing) • “N” node (probably up to 16) “load balancing cluster” • Beta in 1998 and ship in 1999?

MSCS (NT/E, 4.0) Overview • Two Node “Failover” Cluster • “Shared Nothing” Model • At any moment in time, each disk is owned and addressable by only one server • Two Cluster Interconnects • “Heartbeat” cluster interconnect • Ethernet • Shared disk interconnect • SCSI (any flavor) • Fibre Channel (SCSI protocol over Fibre Channel) • Each Node Has a “Private System Disk” • Boot disk

MSCS (NT/E, 4.0) Topologies • Host-based (PCI) RAID Arrays • External RAID Arrays

NT Cluster With Host-Based RAID Array • Each node has • Ethernet NIC -- Heartbeat • Private system disk (generally on an HBA) • PCI-based RAID controller -- SCSI or Fibre • Nodes share access to data disks but do not share data [Figure labels: “Heartbeat” Interconnect; Shared Disk Interconnect; NIC; HBA; RAID]

NT Cluster With External RAID Array • Each node has • Ethernet NIC -- Heartbeat • Multi-channel HBAs that connect the boot disk and the external array • Shared external RAID controller on the SCSI or FC bus -- Mylex’s DAC-SX, DAC-FL, DAC-FF products [Figure labels: “Heartbeat” Interconnect; Shared Disk Interconnect; NIC; HBA; RAID]

Cluster Interconnect and Heartbeats • Cluster Interconnect • Private Ethernet between nodes • Used to transmit “I’m alive” heartbeat messages • Heartbeat Messages • When a node stops getting heartbeats, it assumes the other node has died and initiates failover • In some failure modes both nodes stop getting heartbeats (NIC dies or someone trips over the cluster cable) • Both nodes are still alive • But each thinks the other is dead • Split brain syndrome • Both nodes initiate failover • Who wins?
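The heartbeat timeout and the split-brain case above can be sketched in Python. The timeout and heartbeat interval are made-up values for illustration; the slides do not state the values MSCS uses.

```python
class HeartbeatMonitor:
    """Tracks when the peer node was last heard from."""

    def __init__(self, timeout, start=0.0):
        self.timeout = timeout          # seconds of silence tolerated (assumed value)
        self.last_heartbeat = start

    def heartbeat_received(self, now):
        self.last_heartbeat = now

    def peer_presumed_dead(self, now):
        return now - self.last_heartbeat > self.timeout


def simulate_cable_cut(cut_time, check_time, timeout=3.0, interval=1.0):
    """Both nodes exchange heartbeats until the cable is cut at `cut_time`.

    Returns what each node concludes about its peer at `check_time`.
    Both nodes stay alive throughout; only the interconnect fails.
    """
    node_a = HeartbeatMonitor(timeout)
    node_b = HeartbeatMonitor(timeout)
    t = 0.0
    while t < cut_time:
        node_a.heartbeat_received(t)
        node_b.heartbeat_received(t)
        t += interval
    return node_a.peer_presumed_dead(check_time), node_b.peer_presumed_dead(check_time)


if __name__ == "__main__":
    # Split brain: each live node presumes the other is dead.
    print(simulate_cable_cut(cut_time=5.0, check_time=10.0))
```

When both values are true, both nodes would initiate failover, which is the situation the quorum disk resolves.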

Quorum Disk • Special cluster resource that stores the cluster log • When a node joins a cluster, it attempts to reserve the quorum disk (the purple disk in the figure) • If the quorum disk does not have an owner, the node takes ownership and forms a cluster • If the quorum disk has an owner, the node joins the cluster [Figure labels: Cluster “Heartbeat” Interconnect; Disk Interconnect; Quorum Disk; HBA; RAID]

Quorum Disk (Contd.) • If Nodes Cannot Communicate (no heartbeats) • Then only one is allowed to continue operating • They use the quorum disk to decide which one lives • Each node waits, then tries to reserve the quorum disk • Last owner waits the shortest time and, if it’s still alive, will take ownership of the quorum disk • When the other node attempts to reserve the quorum disk, it will find that it’s already owned • The node that doesn’t own the quorum disk then fails over • This is called the Challenge / Defense Protocol
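The arbitration steps above can be sketched in Python. This follows the simplified description on the slide only; the wait times are invented, the reservation is assumed to be cleared when arbitration starts, and the real protocol's SCSI-level details are not modeled.

```python
class QuorumDisk:
    """A disk that at most one node can reserve."""

    def __init__(self):
        self.owner = None

    def try_reserve(self, node):
        """Reserve the disk if it is free; return True if `node` owns it."""
        if self.owner is None:
            self.owner = node
        return self.owner == node


def arbitrate(last_owner, alive_nodes, owner_wait=1.0, challenger_wait=3.0):
    """Decide which node keeps running when heartbeats stop.

    The last owner waits the shortest time (hypothetical values), so if it
    is still alive it reserves the quorum disk first.
    Returns (winner, losers).
    """
    disk = QuorumDisk()
    waits = {n: owner_wait if n == last_owner else challenger_wait
             for n in alive_nodes}
    winner, losers = None, []
    for node in sorted(alive_nodes, key=lambda n: waits[n]):
        if disk.try_reserve(node):
            winner = node
        else:
            losers.append(node)  # finds the disk already owned and fails over
    return winner, losers


if __name__ == "__main__":
    print(arbitrate("A", ["B", "A"]))  # both alive: last owner defends
    print(arbitrate("A", ["B"]))       # owner really died: challenger wins
```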

Microsoft Cluster Server (MSCS) • MSCS Objects • Lots of MSCS objects but only two we care about • Resources and Groups • Resources • Applications, data files, disks, IP addresses, ... • Groups • Application and related resources like data on disks

Microsoft Cluster Server (MSCS) • When a server dies, groups fail over • When a server is repaired and returned to service, groups fail back • Since data on disks is included in groups, disks fail over and fail back [Figure labels: Group: Mail; Group: Web; Resource]

Groups Failover • Groups are the entities that fail over • And they take their disks with them [Figure labels: Group: Mail; Group: Web; Resource]
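Groups as the unit of failover and failback can be sketched in Python. The `Group` fields and the example resource names are hypothetical; they only mirror the slide's idea that a group bundles an application with its related resources, disks included.

```python
from dataclasses import dataclass


@dataclass
class Group:
    name: str
    resources: list        # application, IP address, disks, ...
    preferred_node: str    # node the group normally runs on
    current_node: str      # node the group runs on right now


def fail_over(groups, failed_node, survivor):
    """Move every group hosted on `failed_node` to `survivor`."""
    moved = []
    for group in groups:
        if group.current_node == failed_node:
            group.current_node = survivor  # resources, disks included, move too
            moved.append(group.name)
    return moved


def fail_back(groups, restored_node):
    """Return groups to `restored_node` if that is their preferred node."""
    moved = []
    for group in groups:
        if group.preferred_node == restored_node and group.current_node != restored_node:
            group.current_node = restored_node
            moved.append(group.name)
    return moved


if __name__ == "__main__":
    mail = Group("Mail", ["mail app", "mail IP", "mail disk"], "A", "A")
    web = Group("Web", ["web app", "web IP", "web disk"], "B", "B")
    print(fail_over([mail, web], "A", "B"), mail.current_node)
    print(fail_back([mail, web], "A"), mail.current_node)
```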

Microsoft Cluster Certification • Two Levels of Certification • Cluster Component Certification • HBAs and RAID controllers must be certified • When they pass: • They’re listed on the Microsoft web site www.microsoft.com/hwtest/hcl/ • They’re eligible for inclusion in cluster system certification • Cluster System Certification • Complete two-node cluster • When they pass: • They’re listed on the Microsoft web site • They’ll be supported by Microsoft • Each Certification Takes 30 - 60 Days

Mylex’s Clustering Implementation: eXtremeRAID 1100 PCI-to-Ultra2 SCSI RAID

NT Cluster With Host-Based RAID Array • Nodes have: • Ethernet NIC -- Heartbeat • Private system disks (HBA) • PCI-based RAID controller • Nodes share access to data disks but do not share data [Figure labels: “Heartbeat” Interconnect; 3 Shared Ultra2 Interconnects; NIC; HBA; eXtremeRAID]

MSCS Requirement for Shared Storage Bus • A local drive is needed for the boot OS and file system • At any time, only one node has sole ownership of a shared drive • MSCS only supports the SCSI protocol for the shared bus • SCSI commands are required for clustered shared devices • Reserve, Release, Test Unit Ready, Inquiry • Support of DPO (Disable Page Out) and FUA (Force Unit Access) in read/write commands • Support of multiple initiators, and ability to handle SCSI Bus Reset and Bus Device Reset • Controller ability to handle cluster partner node shutdown, removal -- SCSI bus transition, reset and termination control • Operating System Control Access

Mylex RAID Products for MSCS Clustering • Controllers supported -- LVD based • eXtremeRAID - DAC1164P • LVD mode is recommended for long cabling distances (12 m). Single-ended mode is limited to 3 m and requires a SCSI bus extender for longer distances [Figure labels: “Heartbeat” Interconnect; Shared Disk Interconnect; NIC; HBA; eXtremeRAID]

eXtremeRAID 1100: Technology • Ultra2 SCSI • 233 MHz SA-110 • 64-bit PCI

eXtremeRAID 1100: Architecture [Block diagram labels: RISC CPU; SDRAM; Flash; NVRAM; CPU Bridge; 40 MHz; three SCSI ASICs, each with a 16-bit LVD SCSI channel at 80 MB/sec; 32-bit Secondary PCI Bus, 33 MHz; Host P2P Bridge; 64-bit Host PCI, 33 MHz]

Mylex PCI RAID’s Two-node Cluster • Emulates the SCSI shared bus requirement through the NT miniport driver and RAID firmware • Treats RAID volumes as physical disk drives • Supports reserve/release and other cluster-related SCSI commands in the firmware through a volume reservation table • Honors DPO, FUA and flush operations in the firmware • RAID configuration, fault management, enclosure management, and volume reserve/release are administered by a master/slave mechanism • Establishes communication between the RAID controllers in the two nodes through the back-end SCSI bus -- heartbeat, cluster commands, RAID configuration and fault management
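The idea of a volume reservation table can be sketched in Python. This is a guess at the general shape only: the class, its methods, and the rule that a bus reset clears all reservations are assumptions modeled on SCSI-2 Reserve/Release behavior, not a description of the Mylex firmware.

```python
class VolumeReservationTable:
    """Tracks which initiator, if any, has reserved each RAID volume."""

    def __init__(self):
        self._reserved_by = {}  # volume id -> initiator id

    def reserve(self, volume, initiator):
        """Return True on success, False on a reservation conflict."""
        holder = self._reserved_by.get(volume)
        if holder is not None and holder != initiator:
            return False
        self._reserved_by[volume] = initiator
        return True

    def release(self, volume, initiator):
        """Release a reservation; a request from a non-holder has no effect."""
        if self._reserved_by.get(volume) == initiator:
            del self._reserved_by[volume]

    def may_access(self, volume, initiator):
        holder = self._reserved_by.get(volume)
        return holder is None or holder == initiator

    def bus_reset(self):
        """Assumption: a bus reset drops every reservation."""
        self._reserved_by.clear()


if __name__ == "__main__":
    table = VolumeReservationTable()
    print(table.reserve(0, initiator=7))
    print(table.reserve(0, initiator=6))
    table.bus_reset()
    print(table.reserve(0, initiator=6))
```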

Master-Slave Concept • Master/slave is a controller concept and is transparent to the host system and OS • Master/slave is independent of the server cluster-node status • The first established node acts as master, the later one as slave • If one node fails or goes offline, the surviving node becomes master • Node discovery is initiated by a SCSI bus reset and kept alive by heartbeat communication through the back-end shared SCSI bus [Figure labels: Node A; Node B; Master; Slave; Heartbeat & Communication; Back-end SCSI Buses; eXtremeRAID]
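The role assignment rules above can be sketched in Python. The class and method names are hypothetical, and discovery and heartbeats over the back-end SCSI bus are reduced to plain `join` and `leave` calls.

```python
class ControllerPair:
    """Assigns master/slave roles to the two RAID controllers."""

    def __init__(self):
        self.roles = {}  # controller name -> "master" or "slave"

    def join(self, controller):
        """The first established controller is master, the later one slave."""
        has_master = "master" in self.roles.values()
        self.roles[controller] = "slave" if has_master else "master"
        return self.roles[controller]

    def leave(self, controller):
        """On failure or removal, the survivor (if any) becomes master."""
        self.roles.pop(controller, None)
        for survivor in self.roles:
            self.roles[survivor] = "master"


if __name__ == "__main__":
    pair = ControllerPair()
    pair.join("A")
    pair.join("B")
    print(pair.roles)
    pair.leave("A")
    print(pair.roles)
    pair.join("A")
    print(pair.roles)
```

Note that a controller that returns after a failure rejoins as slave, which matches the slide's point that the role is independent of the server's cluster-node status.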

Master/Slave Perspective • Only the master manages RAID configuration changes and the fault rebuild process • RAID configuration and fault management can be initiated from either node or invoked from DACCF/GAM • COD updates are done by the master, which informs the slave to update its NVRAM information • The master manages the rebuild process and can delegate tasks to the slave • Enclosure management (SAF-TE) is administered by the master • Logical volume reserve/release requests are communicated between master and slave through the back-end shared SCSI bus

Termination Control and Bus Isolation • In a cluster setup, one server node could be powered on, shut down or removed for upgrade or maintenance • Mylex-Supplied Terminator Switch Box • Contains an LVD/SE terminator and fast silicon switches • When server node power is on • Terminator is off and the SCSI signal passes through • When server node power is off or the node is removed • Terminator is on and the SCSI signal is isolated from the server node [Figure labels: Server Node A; Server Node B; 1164P; Terminator Switch; Disk Box]
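The switch box behavior described above is a two-state rule, sketched here in Python. The function and key names are hypothetical; the real box does this in hardware with silicon switches.

```python
def terminator_switch(node_power_on: bool) -> dict:
    """State of one terminator switch box for a given server node power state.

    Power on:           terminator off, SCSI signal passes through to the node.
    Power off/removed:  terminator on, SCSI signal isolated from the node.
    """
    return {
        "terminator_enabled": not node_power_on,
        "signal_passes_to_node": node_power_on,
    }


if __name__ == "__main__":
    # Node A running, node B removed for maintenance.
    print(terminator_switch(True))
    print(terminator_switch(False))
```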

Mylex’s Clustering Support Elements • Two-node NT 4.0 clustering only (MSCS) • FW 5.07C for eXtremeRAID • BIOS support for cluster nexus establishment message • DACCF/BCU modification for initiator ID and clustering support • NT miniport driver modification to support cluster-related SCSI commands • GAM driver, server, clients: no changes [Figure labels: GAM Client; TCP/IP; GAM Server; GAM Driver; MiniPort; BCU; DACCF; FW; BIOS]

Global Array Manager (GAM) • GAM: client/server RAID management tool over TCP/IP • Uses a virtual IP to view a single RAID subsystem image (physical IPs can be used to view the two physical nodes if needed) • Either the master or the slave will be viewed, depending on the current cluster group. GAM task requests are communicated through the back-end SCSI bus and administered by the master controller [Figure labels: GAM Client; GAM Server; TCP/IP; Virtual IP, Single System Image; Shared Disk Interconnect; eXtremeRAID]

Mylex Clustering Approach • Same FW, BIOS, driver and utilities for clustering and non-clustering support • Supports full-featured Mylex RAID controller functions • Full RAID configuration through DACCF and GAM • Hot swap, hot spare, RAID rebuild • Background consistency check • Background initialization • SAF-TE enclosure management • MORE -- Mylex Online Capacity Expansion and RAID migration are not supported in a cluster configuration • Maintains TPC-C world-record-leading performance • Minimal impact from master/slave heartbeat monitoring • Write-back is disabled for cluster data availability and integrity

WHQL Clustering Certification • Passed Microsoft SDG 1.0 (Server Design Guide); to be submitted to the WHQL certification queue • Passed MSCS HCT 8.0 and the Clustering Certification pre-submission test • MSCS System Validation -- Phases 1 - 3 tested • Tested on Intel Madrona, Nightshade, and Sitka based systems • Will submit test log to Microsoft in early Dec. 1998 [Figure labels: Cluster Admin; Clients; “Heartbeat” Interconnect; Shared Disk Interconnect; NIC; HBA; eXtremeRAID]

Mylex Clustering Restrictions • Only 2-node MSCS clustering is supported • Boot and file system need to be on a local drive, separate from the shared bus -- per MSCS requirement • The shared bus includes all SCSI channels in both controllers. All shared devices should be on the same channel of both clustered controllers • Only SCSI hard disks and SAF-TE devices are allowed on the shared bus • Write-back caching is disabled • MORE is not supported • SCSI devices must be capable of supporting multiple initiators, SCSI bus reset and device reset

Mylex: Recommended Installation • Set the controller initiator ID and enable cluster support on each node through DACCF while the two nodes are still separate • Disable the RAID controller BIOS on both nodes, since the RAID controller does not control the boot device • Run RAID configuration using DACCF on one node • Connect the two nodes together using the Mylex terminator switch box and cabling • Ready to go -- just follow the Microsoft Cluster Server Administrator’s Guide for clustering installation
