Clustering

Clustering: Next Wave in PC Computing

Cluster Concepts 101 • This section is about clusters in general; we’ll get to Microsoft’s Wolfpack cluster implementation in the next section.

Why Learn About Clusters • Today clusters are a niche Unix market • But Microsoft will bring clusters to the masses • Last October, Microsoft announced NT clusters • SCO announced UnixWare clusters • Sun announced Solaris / Intel clusters • Novell announced Wolf Mountain clusters • In 1998, 2M Intel servers will ship • 100K in clusters • In 2001, 3M Intel servers will ship • 1M in clusters (IDC’s forecast) • Clusters will be a huge market and RAID is essential to clusters

What Are Clusters? • Group of independent systems that • Function as a single system • Appear to users as a single system • And are managed as a single system • Clusters are “virtual servers”

Why Clusters • #1. Clusters Improve System Availability • This is the primary value in Wolfpack-I clusters • #2. Clusters Enable Application Scaling • #3. Clusters Simplify System Management • #4. Clusters (with Intel servers) Are Cheap

Why Clusters - #1 • #1. Clusters Improve System Availability • When a networked server fails, the service it provided is down • When a clustered server fails, the service it provided “fails over” and downtime is avoided • [Figure: Networked Servers (Mail Server, Internet Server) vs. Clustered Servers (Mail & Internet)]

Why Clusters - #2 • #2. Clusters Enable Application Scaling • With networked SMP servers, application scaling is limited to a single server • With clusters, applications scale across multiple SMP servers (typically up to 16 servers)

Why Clusters - #3 • #3. Clusters Simplify System Management • Clusters present a Single System Image; the cluster looks like a single server to management applications • Hence, clusters reduce system management costs • [Figure: Three Management Domains vs. One Management Domain]

Why Clusters - #4 • #4. Clusters (with Intel servers) Are Cheap • Essentially no additional hardware costs • Microsoft charges an extra $3K per node • Windows NT Server $1,000 • Windows NT Server, Enterprise Edition $4,000 Note: Proprietary Unix cluster software costs $10K to $25K per node.
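
The per-node premium quoted on this slide can be checked with simple arithmetic. The sketch below uses only the prices given above; the function names are illustrative, and the two-node cluster size in the usage lines is an assumption taken from the Wolfpack-I description later in the deck.

```python
# Prices quoted on the slide (US dollars per node).
NT_SERVER = 1_000
NT_SERVER_ENTERPRISE = 4_000
UNIX_CLUSTER_SOFTWARE = (10_000, 25_000)  # quoted low/high range per node


def cluster_premium(nodes: int) -> int:
    """Extra software cost of clustering `nodes` NT servers."""
    return nodes * (NT_SERVER_ENTERPRISE - NT_SERVER)


def unix_cluster_software_cost(nodes: int) -> tuple[int, int]:
    """Low and high cluster software cost for `nodes` proprietary Unix servers."""
    low, high = UNIX_CLUSTER_SOFTWARE
    return nodes * low, nodes * high


if __name__ == "__main__":
    print(cluster_premium(2))             # 6000
    print(unix_cluster_software_cost(2))  # (20000, 50000)
```

For a two-node cluster this gives a $6K premium for NT against $20K to $50K for proprietary Unix cluster software.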

An Analogy to RAID • RAID Makes Disks Fault Tolerant • Clusters make servers fault tolerant • RAID Increases I/O Performance • Clusters increase compute performance • RAID Makes Disks Easier to Manage • Clusters make servers easier to manage

Two Flavors of Clusters • #1. High Availability Clusters • Microsoft’s Wolfpack 1 • Compaq’s Recovery Server • #2. Load Balancing Clusters (a.k.a. Parallel Application Clusters) • Microsoft’s Wolfpack 2 • Digital’s VAXClusters • Note: Load balancing clusters are a superset of high availability clusters.

High Availability Clusters • Two node clusters (node = server) • During normal operations, both servers do useful work • Failover • When a node fails, applications fail over to the surviving node and it assumes the workload of both nodes • [Figure labels: Mail, Web, Mail & Web]

High Availability Clusters • Failback • When the failed node is returned to service, the applications fail back • [Figure labels: Mail, Web]

Load Balancing Clusters • Multi-node clusters (two or more nodes) • Load balancing clusters typically run a single application (e.g. a database) distributed across all nodes • Cluster capacity is increased by adding nodes (but like SMP servers, scaling is less than linear) • [Figure labels: 3,000 TPM, 3,600 TPM]

Load Balancing Clusters • Cluster rebalances the workload when a node dies • If different apps are running on each server, they failover to the least busy server or as directed by predefined failover policies
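
The failover rule on this slide (follow a predefined policy if there is one, otherwise pick the least busy server) can be sketched as below. This is a minimal illustration, not any vendor's implementation: the function name, the load figures, and the shape of the policy table are all assumptions.

```python
from typing import Optional


def pick_failover_target(
    app: str,
    load: dict[str, float],
    policy: Optional[dict[str, list[str]]] = None,
) -> str:
    """Choose the surviving node that should take over `app`.

    `load` maps each surviving node to its current load (hypothetical metric).
    `policy` optionally maps an app to an ordered list of preferred nodes.
    """
    if not load:
        raise ValueError("no surviving nodes")
    # A predefined policy wins if one of its preferred nodes is still alive.
    for node in (policy or {}).get(app, []):
        if node in load:
            return node
    # Otherwise fall back to the least busy surviving node.
    return min(load, key=load.get)


if __name__ == "__main__":
    survivors = {"B": 0.7, "C": 0.2}
    print(pick_failover_target("mail", survivors))                         # C
    print(pick_failover_target("mail", survivors, {"mail": ["A", "B"]}))   # B
```

In the second call the policy prefers node A, but A is the failed node, so the next preferred survivor (B) is chosen even though C is less busy.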

Two Cluster Models • #1. “Shared Nothing” Model • Microsoft’s Wolfpack Cluster • #2. “Shared Disk” Model • VAXClusters

#1. “Shared Nothing” Model • At any moment in time, each disk is owned and addressable by only one server • “Shared nothing” terminology is confusing • Access to disks is shared -- on the same bus • But at any moment in time, disks are not shared

#1. “Shared Nothing” Model • When a server fails, the disks that it owns “fail over” to the surviving server, transparently to the clients
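
The ownership rule of the shared nothing model can be sketched as a table mapping each disk to exactly one server. The class and method names below are hypothetical and chosen only for illustration; the sketch models ownership bookkeeping, not the bus-level mechanism.

```python
class SharedNothingCluster:
    """Each disk has exactly one owning server at any moment."""

    def __init__(self, ownership: dict[str, str]):
        # disk name -> name of the server that currently owns it
        self.owner = dict(ownership)

    def can_access(self, server: str, disk: str) -> bool:
        # Only the current owner may address the disk.
        return self.owner[disk] == server

    def fail_over(self, failed: str, survivor: str) -> list[str]:
        """Transfer every disk owned by `failed` to `survivor`."""
        moved = [disk for disk, srv in self.owner.items() if srv == failed]
        for disk in moved:
            self.owner[disk] = survivor
        return moved


if __name__ == "__main__":
    cluster = SharedNothingCluster({"disk1": "A", "disk2": "B"})
    print(cluster.can_access("B", "disk1"))  # False
    print(cluster.fail_over("A", "B"))       # ['disk1']
    print(cluster.can_access("B", "disk1"))  # True
```

Both servers sit on the same bus, but the table never lists two owners for one disk, which is the sense in which nothing is shared.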

#2. “Shared Disk” Model • Disks are not owned by servers but shared by all servers • At any moment in time, any server can access any disk • Distributed Lock Manager arbitrates disk access so apps on different servers don’t step on one another (corrupt data)
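
The role of the lock manager can be illustrated with a deliberately simplified, single-process sketch. It grants one exclusive lock per resource; a real distributed lock manager runs across the nodes and typically offers more than one lock mode, so treat the class below (all names hypothetical) as a teaching aid only.

```python
class LockManager:
    """Grants at most one exclusive lock per resource (simplified)."""

    def __init__(self) -> None:
        self.holder: dict[str, str] = {}  # resource -> server holding the lock

    def acquire(self, server: str, resource: str) -> bool:
        current = self.holder.get(resource)
        if current is None or current == server:
            self.holder[resource] = server
            return True
        # Another server holds the lock; the caller must wait and retry.
        return False

    def release(self, server: str, resource: str) -> None:
        # Only the holder can release; other callers are ignored.
        if self.holder.get(resource) == server:
            del self.holder[resource]


if __name__ == "__main__":
    dlm = LockManager()
    print(dlm.acquire("A", "disk1:block42"))  # True
    print(dlm.acquire("B", "disk1:block42"))  # False
    dlm.release("A", "disk1:block42")
    print(dlm.acquire("B", "disk1:block42"))  # True
```

Any server may reach any disk, but writes to the same resource are serialized through the lock, which is what keeps applications from corrupting each other's data.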

Cluster Interconnect • This is about how servers are tied together and how disks are physically connected to the cluster

Cluster Interconnect • Clustered servers always have a client network interconnect, typically Ethernet, to talk to users • And at least one cluster interconnect to talk to other nodes and to disks • [Figure labels: Client Network, Cluster Interconnect, HBA, RAID]

Cluster Interconnects (cont’d) • Or They Can Have Two Cluster Interconnects • One for nodes to talk to each other -- “Heartbeat Interconnect” • Typically Ethernet • And one for nodes to talk to disks -- “Shared Disk Interconnect” • Typically SCSI or Fibre Channel • [Figure labels: Cluster Interconnect, NIC, Shared Disk Interconnect, HBA, RAID]

Microsoft’s Wolfpack Clusters

Clusters Are Not New • Clusters Have Been Around Since 1985 • Most UNIX Systems are Clustered • What’s New is Microsoft Clusters • Code named “Wolfpack” • Named Microsoft Cluster Server (MSCS) • Software that provides clustering • MSCS is part of Windows NT, Enterprise Server

Microsoft Cluster Rollout • Wolfpack-I • In Windows NT, Enterprise Server, 4.0 (NT/E, 4.0) [Also includes Transaction Server and Reliable Message Queue] • Two node “failover cluster” • Shipped October, 1997 • Wolfpack-II • In Windows NT, Enterprise Server 5.0 (NT/E 5.0) • “N” node (probably up to 16) “load balancing cluster” • Beta in 1998 and ship in 1999

MSCS (NT/E, 4.0) Overview • Two Node “Failover” Cluster • “Shared Nothing” Model • At any moment in time, each disk is owned and addressable by only one server • Two Cluster Interconnects • “Heartbeat” cluster interconnect • Ethernet • Shared disk interconnect • SCSI (any flavor) • Fibre Channel (SCSI protocol over Fibre Channel) • Each Node Has a “Private System Disk” • Boot disk

MSCS (NT/E, 4.0) Topologies • #1. Host-based (PCI) RAID Arrays • #2. External RAID Arrays

NT Cluster with Host-Based RAID Array • Each node has • Ethernet NIC -- Heartbeat • Private system disk (generally on an HBA) • PCI-based RAID controller -- SCSI or Fibre • Nodes share access to data disks but do not share data • [Figure labels: NIC, HBA, RAID, “Heartbeat” Interconnect, Shared Disk Interconnect]

NT Cluster with SCSI External RAID Array • Each node has • Ethernet NIC -- Heartbeat • Multi-channel HBAs connect boot disk and external array • Shared external RAID controller on the SCSI Bus -- DAC SX • [Figure labels: NIC, HBA, RAID, “Heartbeat” Interconnect, Shared Disk Interconnect]

NT Cluster with Fibre External RAID Array • DAC SF or DAC FL (SCSI to disks) • DAC FF (Fibre to disks) • [Figure labels: NIC, HBA, RAID, “Heartbeat” Interconnect, Shared Disk Interconnect]

MSCS -- A Few of the Details

Cluster Interconnect & Heartbeats • Cluster Interconnect • Private Ethernet between nodes • Used to transmit “I’m alive” heartbeat messages • Heartbeat Messages • When a node stops getting heartbeats, it assumes the other node has died and initiates failover • In some failure modes both nodes stop getting heartbeats (NIC dies or someone trips over the cluster cable) • Both nodes are still alive • But each thinks the other is dead • Split brain syndrome • Both nodes initiate failover • Who wins?
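
The split brain scenario above can be reproduced with a small simulation. This is a sketch under stated assumptions: the 5-second timeout and all names are hypothetical, and time is passed in explicitly so the example stays deterministic.

```python
from typing import Optional


class HeartbeatMonitor:
    """Tracks "I'm alive" messages received from the peer node."""

    def __init__(self, timeout: float):
        self.timeout = timeout  # seconds of silence tolerated (assumed value)
        self.last_seen: Optional[float] = None

    def record_heartbeat(self, now: float) -> None:
        self.last_seen = now

    def peer_presumed_dead(self, now: float) -> bool:
        # Silence longer than the timeout is read as "the other node has died".
        if self.last_seen is None:
            return False
        return now - self.last_seen > self.timeout


def simulate_cable_cut(cut_at: float, check_at: float,
                       timeout: float = 5.0) -> tuple[bool, bool]:
    """Both nodes stay alive, but no heartbeat crosses the link after `cut_at`."""
    a_view_of_b = HeartbeatMonitor(timeout)
    b_view_of_a = HeartbeatMonitor(timeout)
    a_view_of_b.record_heartbeat(cut_at)  # last heartbeat A received from B
    b_view_of_a.record_heartbeat(cut_at)  # last heartbeat B received from A
    return (a_view_of_b.peer_presumed_dead(check_at),
            b_view_of_a.peer_presumed_dead(check_at))


if __name__ == "__main__":
    print(simulate_cable_cut(cut_at=10.0, check_at=12.0))  # (False, False)
    print(simulate_cable_cut(cut_at=10.0, check_at=20.0))  # (True, True)
```

The second result is the split brain case: each node concludes the other is dead, so heartbeats alone cannot decide which node should continue.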

Quorum Disk • Special cluster resource that stores the cluster log • When a node joins a cluster, it attempts to reserve the quorum disk (purple disk) • If the quorum disk does not have an owner, the node takes ownership and forms a cluster • If the quorum disk has an owner, the node joins the cluster • [Figure labels: Cluster “Heartbeat” Interconnect, Disk Interconnect, HBA, RAID]
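
The form-or-join decision described on this slide reduces to a single check on the quorum disk's owner. The function below is an illustrative sketch with hypothetical names; it models the decision only, not the disk reservation itself.

```python
from typing import Optional


def join_or_form(quorum_owner: Optional[str], node: str) -> tuple[str, str]:
    """Return (action, quorum owner after the attempt) for a starting node."""
    if quorum_owner is None:
        # Nobody holds the quorum disk: take ownership and form a cluster.
        return "form", node
    # The quorum disk is already owned: join the existing cluster.
    return "join", quorum_owner


if __name__ == "__main__":
    print(join_or_form(None, "A"))  # ('form', 'A')
    print(join_or_form("A", "B"))   # ('join', 'A')
```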

Quorum Disk • If Nodes Cannot Communicate (no heartbeats) • Then only one is allowed to continue operating • They use the quorum disk to decide which one lives • Each node waits, then tries to reserve the quorum disk • Last owner waits the shortest time and if it’s still alive will take ownership of the quorum disk • When the other node attempts to reserve the quorum disk, it will find that it’s already owned • The node that doesn’t own the quorum disk then fails over • This is called the Challenge / Defense Protocol
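
The arbitration steps above can be sketched as a race in which the last owner gets a head start. The delay values and names are assumptions made for illustration; the slide only says that the last owner waits the shortest time, and this sketch does not model how the reservation is made on the disk.

```python
from typing import Optional


def arbitrate(
    nodes: list[str],
    last_owner: str,
    alive: set[str],
    owner_delay: float = 1.0,       # assumed: last owner waits the shortest time
    challenger_delay: float = 3.0,  # assumed: the other node waits longer
) -> Optional[str]:
    """Return the node that ends up owning the quorum disk, if any."""
    # Each live node schedules one reservation attempt after its delay.
    attempts = sorted(
        (owner_delay if node == last_owner else challenger_delay, node)
        for node in nodes
        if node in alive
    )
    owner: Optional[str] = None
    for _delay, node in attempts:
        if owner is None:
            owner = node  # first attempt finds the disk free and reserves it
        # Later attempts find the disk already owned and lose the arbitration.
    return owner


if __name__ == "__main__":
    print(arbitrate(["A", "B"], last_owner="A", alive={"A", "B"}))  # A
    print(arbitrate(["A", "B"], last_owner="A", alive={"B"}))       # B
```

When both nodes are alive but cannot exchange heartbeats, the last owner keeps the quorum disk; when the last owner has really died, the challenger finds the disk free and takes over.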

Microsoft Cluster Server (MSCS) • MSCS Objects • Lots of MSCS objects but only two we care about • Resources and Groups • Resources • Applications, data files, disks, IP addresses, ... • Groups • Application and related resources like data on disks

Microsoft Cluster Server (MSCS) • When a server dies, groups fail over • When a server is repaired and returned to service, groups fail back • Since data on disks is included in groups, disks fail over and fail back • [Figure labels: Group: Mail, Group: Web, Resource]

Groups Failover • Groups are the entities that fail over • And they take their disks with them • [Figure labels: Group: Mail, Group: Web, Resource]
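
The relationship between groups, resources, failover, and failback can be sketched with a small data model. This is not the MSCS API: the `Group` class, its `preferred_node` field, and the helper functions are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Group:
    """An application plus its related resources, moved as one unit."""
    name: str
    resources: list[str]
    preferred_node: str      # hypothetical field: where the group normally runs
    current_node: str = ""

    def __post_init__(self) -> None:
        if not self.current_node:
            self.current_node = self.preferred_node


def fail_over(groups: list[Group], failed: str, survivor: str) -> list[str]:
    """Move every group hosted on the failed node to the survivor."""
    moved = []
    for group in groups:
        if group.current_node == failed:
            group.current_node = survivor  # disks in the group move with it
            moved.append(group.name)
    return moved


def fail_back(groups: list[Group], repaired: str) -> list[str]:
    """Return groups to their preferred node once it is back in service."""
    moved = []
    for group in groups:
        if group.preferred_node == repaired and group.current_node != repaired:
            group.current_node = repaired
            moved.append(group.name)
    return moved


if __name__ == "__main__":
    groups = [
        Group("Mail", ["mail app", "IP address", "disk 1"], preferred_node="A"),
        Group("Web", ["web app", "IP address", "disk 2"], preferred_node="B"),
    ]
    print(fail_over(groups, failed="A", survivor="B"))  # ['Mail']
    print(fail_back(groups, repaired="A"))              # ['Mail']
```

Because the disk is listed among the group's resources, moving the group is what moves the disk; there is no separate disk failover step in this model.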

Microsoft Cluster Certification • Two Levels of Certification • Cluster Component Certification • HBAs and RAID controllers must be certified • When they pass: • They’re listed on the Microsoft web site www.microsoft.com/hwtest/hcl/ • They’re eligible for inclusion in cluster system certification • Cluster System Certification • Complete two node cluster • When they pass: • They’re listed on the Microsoft web site • They’ll be supported by Microsoft • Each Certification Takes 30 - 60 Days

Mylex NT Cluster Solutions

Internal vs External RAID Positioning • Internal RAID • Lower cost solution • Higher performance in read-intensive applications • Proven TPC-C performance enhances cluster performance • External RAID • Higher performance in write-intensive applications • Write-back cache is turned off in PCI-RAID controllers • Higher connectivity • Attach more disk drives • Greater footprint flexibility • Until PCI-RAID implements fibre

Why We’re Better -- External RAID • Robust Active - Active Fibre Implementation • Shipping active - active for over a year • It works in NT (certified) and Unix environments • Will have Fibre on the back end soon • Mirrored Cache Architecture • Without mirrored cache, data is inaccessible or dropped on the floor when a controller fails • Unless you turn off the write-back cache, which degrades write performance by 5x to 30x • Four to Six Disk Channels • I/O bandwidth and capacity scaling • Dual Fibre Host Ports • NT expects to access data over pre-configured paths • If it doesn’t find the data over the expected path, then I/Os don’t complete and applications fail

[Figure: Active / Active Duplex. Labels: Cluster Interconnect, Ultra SCSI Disk Interconnect, HBA, SX]

[Figure: Active / Active Duplex. Labels: Single FC Array Interconnect, HBA, FC HBA, SF (or FL)]

[Figure: Active / Active Duplex. Labels: FC Disk Interconnect, Dual FC Array Interconnect, HBA, FC HBA, SF (or FL)]

[Figure: Active / Active Duplex. Labels: Single FC Array Interconnect, HBA, FC HBA, FF]

[Figure: Active / Active Duplex. Labels: Dual FC Array Interconnect, HBA, FC HBA, FF]

Why We’ll Be Better -- Internal RAID • Deliver Auto-Rebuild • Deliver RAID Expansion • MORE-I Add Logical Units On-line • MORE-II Add or Expand Logical Units On-line • Deliver RAID Level Migration • 0 ---> 1 • 1 ---> 0 • 0 ---> 5 • 5 ---> 0 • 1 ---> 5 • 5 ---> 1 • And (of course) Award Winning Performance

NT Cluster with Host-Based RAID Array • Nodes have: • Ethernet NIC -- Heartbeat • Private system disks (HBA) • PCI-based RAID controller • [Figure labels: NIC, HBA, eXtreme RAID, “Heartbeat” Interconnect, Shared Disk Interconnect]

Why eXtremeRAID & DAC960PJ Clusters • Typically four or fewer processors • Offers a less expensive, integrated RAID solution • Can combine clustered and non-clustered applications in the same enclosure • Uses today’s readily available hardware

TPC-C Performance for Clusters • DAC960PJ • Three internal Ultra channels at 40 MB/sec • 66 MHz i960 processor off-loads RAID management from the host CPU • Two external Ultra channels at 40 MB/sec • 32-bit PCI bus between the controller and the server, providing burst data transfer rates up to 132 MB/sec
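
The 132 MB/sec burst figure follows from the bus width and clock rate. The slide gives only the 32-bit width; the 33 MHz clock used below is an assumption (the standard clock for conventional PCI), and the function name is illustrative.

```python
def burst_rate_mb_per_sec(bus_width_bits: int, clock_mhz: int) -> int:
    """Peak rate: bytes moved per clock times millions of clocks per second."""
    bytes_per_transfer = bus_width_bits // 8
    return bytes_per_transfer * clock_mhz


if __name__ == "__main__":
    # 32-bit bus (from the slide) at an assumed 33 MHz clock.
    print(burst_rate_mb_per_sec(32, 33))  # 132
```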
