A Tutorial on Microsoft Cluster Server ™

Presentation Transcript


  1. A Tutorial on Microsoft Cluster Server™

  2. Outline • Cluster Abstractions • Cluster Architecture • Cluster Implementation • Application Support • Q&A

  3. Cluster Goals • Manageability • Manage nodes as a single system • Perform server maintenance without affecting users • Mask faults, so repair is non-disruptive • Availability • Restart failed applications & servers • Unavailability ≈ MTTR / MTBF, so repair quickly • Detect failures and warn administrators • Scalability • Add nodes for incremental processing, storage, and bandwidth
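The availability bullet on this slide rests on one piece of arithmetic: unavailability is roughly MTTR / MTBF, so halving repair time halves downtime. A minimal sketch of that calculation follows; the function names and the sample figures (a 1,000-hour MTBF, 4-hour versus 6-minute repair) are illustrative assumptions, not numbers from the talk. The approximation holds when MTTR is much smaller than MTBF.

```python
HOURS_PER_YEAR = 24 * 365


def unavailability(mttr_hours: float, mtbf_hours: float) -> float:
    """Approximate fraction of time the service is down (MTTR / MTBF)."""
    if mttr_hours < 0 or mtbf_hours <= 0:
        raise ValueError("MTTR must be >= 0 and MTBF must be > 0")
    return mttr_hours / mtbf_hours


def downtime_hours_per_year(mttr_hours: float, mtbf_hours: float) -> float:
    """Expected downtime per year under the same approximation."""
    return unavailability(mttr_hours, mtbf_hours) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical figures: same failure rate, manual repair vs. automatic restart.
    for label, mttr in (("manual repair", 4.0), ("automatic restart", 0.1)):
        hours = downtime_hours_per_year(mttr, 1000.0)
        print(f"{label}: {hours:.2f} hours of downtime per year")
```

The failure rate is unchanged between the two cases; only the repair time differs, which is the point of the "repair quickly" bullet.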

  4. Fault Model • Failures are independent, so single-fault tolerance is a big win • Hardware fails fast (blue screen) • Software fails fast (or goes to sleep) • Software is often repaired by reboot: Heisenbugs • Operations tasks are a major source of outage • Utility operations • Software upgrades

  5. Cluster: Servers Combined to Improve Availability & Scalability • Cluster: A group of independent systems working together as a single system. Clients see scalable & FT services (single system image). • Node: A server in a cluster. May be an SMP server. • Interconnect: Communications link used for intra-cluster status info such as “heartbeats”. Can be Ethernet. (Diagram labels: Client PCs, Printers, Server A, Server B, Interconnect, Disk array A, Disk array B)

  6. Microsoft Cluster Server™ • 2-node availability Summer 97 (20,000 Beta Testers now) • Commoditize fault-tolerance (high availability) • Commodity hardware (no special hardware) • Easy to set up and manage • Lots of applications work out of the box. • 16-node scalability later (next year?)

  7. Failover Example (Diagram labels: Browser; Server 1 and Server 2, each with Web site and Database; Web site files; Database files)

  8. MS Press Failover Demo • Client/Server • Software failure • Admin shutdown • Server failure • Resource states: Pending, Partial, Failed, Offline

  9. Demo Configuration • Server “Alice”: SMP Pentium® Pro processors; Windows NT Server with Wolfpack; Microsoft Internet Information Server; Microsoft SQL Server; local disks • Server “Betty”: SMP Pentium® Pro processors; Windows NT Server with Wolfpack; Microsoft Internet Information Server; Microsoft SQL Server; local disks • Interconnect: standard Ethernet • Shared disks: SCSI disk cabinet • Client: Windows NT Workstation; Internet Explorer; MS Press OLTP app • Administrator: Windows NT Workstation; Cluster Admin; SQL Enterprise Mgr (Diagram label: Windows NT Server Cluster)

  10. Demo Administration • Cluster Admin Console • Windows GUI • Shows cluster resource status • Replicates status to all servers • Define apps & related resources • Define resource dependencies • Orchestrates recovery order • SQL Enterprise Mgr • Windows GUI • Shows server status • Manages many servers • Start, stop, manage DBs (Diagram labels: Server “Alice” runs SQL Trace and Globe; Server “Betty” runs SQL Trace; local disks; shared disks in SCSI disk cabinet; Windows NT Server Cluster; Client)

  11. Generic Stateless Application: Rotating Globe • Mplay32 is a generic app • Registered with MSCS • MSCS restarts it on failure • Move/restart takes ~2 seconds • Fail over if there are 4 failures (= process exits) in 3 minutes (settable default)
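The restart-then-fail-over rule on this slide (4 failures in 3 minutes) can be pictured as a sliding-window counter. The sketch below is an illustration under stated assumptions, not MSCS code: the class name `RestartPolicy`, the string results, and the choice to reset the counter after a failover are all hypothetical.

```python
from collections import deque


class RestartPolicy:
    """Restart locally until `threshold` failures fall inside one time window."""

    def __init__(self, threshold: int = 4, window_seconds: float = 180.0):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._failures = deque()  # timestamps of recent failures, oldest first

    def record_failure(self, now: float) -> str:
        """Return "failover" when the threshold is reached, else "restart"."""
        self._failures.append(now)
        # Forget failures that are older than the window.
        while now - self._failures[0] > self.window_seconds:
            self._failures.popleft()
        if len(self._failures) >= self.threshold:
            self._failures.clear()  # assumption: start counting afresh on the new node
            return "failover"
        return "restart"


if __name__ == "__main__":
    policy = RestartPolicy()
    for t in (0.0, 10.0, 20.0, 30.0):
        print(t, policy.record_failure(t))
```

Failures that are spread out (for example, one every 100 seconds) never accumulate to four inside the window, so the application keeps restarting in place.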

  12. Demo: Moving or Failing Over an Application • Alice fails or operator requests move (Diagram labels: AVI Application on each server; local disks; shared disks in SCSI disk cabinet; Windows NT Server Cluster)

  13. Generic Stateful Application: NotePad • Notepad saves state on shared disk • Failure before save => lost changes • Failover or move (disk & state move)

  14. Demo Step 1: Alice Delivering Service (Diagram labels: SQL Activity / No SQL Activity; SQL, ODBC, IIS on each server; IP; HTTP; local disks; shared disks in SCSI disk cabinet; Windows NT Server Cluster)

  15. Demo Step 2: Request Move to Betty (Diagram labels: No SQL Activity / SQL Activity; SQL, ODBC, IIS, IP on each server; HTTP; local disks; shared disks in SCSI disk cabinet; Windows NT Server Cluster)

  16. Demo Step 3: Betty Delivering Service (Diagram labels: No SQL Activity / SQL Activity; SQL, ODBC, IIS on each server; IP; local disks; shared disks in SCSI disk cabinet; Windows NT Server Cluster)

  17. Demo Step 4: Power Fail Betty, Alice Takeover (Diagram labels: No SQL Activity / SQL Activity; SQL, ODBC, IIS, IP on each server; local disks; shared disks in SCSI disk cabinet; Windows NT Server Cluster)

  18. Demo Step 5: Alice Delivering Service (Diagram labels: SQL Activity / No SQL Activity; SQL, ODBC, IIS; IP; HTTP; local disks; shared disks in SCSI disk cabinet; Windows NT Server Cluster)

  19. Demo Step 6: Reboot Betty, Now Can Takeover (Diagram labels: SQL Activity / No SQL Activity; SQL, ODBC, IIS on each server; IP; HTTP; local disks; shared disks in SCSI disk cabinet; Windows NT Server Cluster)

  20. Outline • Cluster Abstractions • Cluster Architecture • Cluster Implementation • Application Support • Q&A

  21. Cluster and NT Abstractions • Cluster abstractions: Resource, Group, Cluster • NT abstractions: Service, Node, Domain

  22. Basic NT Abstractions • Service: program or device managed by a node • e.g., file service, print service, database server • can depend on other services (startup ordering) • can be started, stopped, paused, failed • Node: a single (tightly coupled) NT system • hosts services; belongs to a domain • services on a node always remain co-located • unit of service co-location; involved in naming services • Domain: a collection of nodes • cooperation for authentication, administration, naming

  23. Cluster Abstractions • Resource: program or device managed by a cluster • e.g., file service, print service, database server • can depend on other resources (startup ordering) • can be online, offline, paused, failed • Resource Group: a collection of related resources • hosts resources; belongs to a cluster • unit of co-location; involved in naming resources • Cluster: a collection of nodes, resources, and groups • cooperation for authentication, administration, naming

  24. Resources • Resources have... • Type: what it does (file, DB, print, web…) • An operational state (online/offline/failed) • Current and possible nodes • Containing Resource Group • Dependencies on other resources • Restart parameters (in case of resource failure)

  25. Resource Types • Built-in types: Generic Application; Generic Service; Internet Information Server (IIS) Virtual Root; Network Name; TCP/IP Address; Physical Disk; FT Disk (Software RAID); Print Spooler; File Share • Added by others: Microsoft SQL Server, Message Queues, Exchange Mail Server, Oracle, SAP R/3 • Your application? (use developer kit wizard)

  26. Physical Disk

  27. TCP/IP Address

  28. Network Name

  29. File Share

  30. IIS (WWW/FTP) Server

  31. Print Spooler

  32. Resource States • Resource states: • Offline: exists, not offering service • Online: offering service • Failed: not able to offer service • Resource failure may cause: • local restart • other resources to go offline • resource group to move • (all subject to group and resource parameters) • Resource failure detected by: • polling failure • node failure (State diagram labels: Offline, Online Pending, Online, Offline Pending, Failed; messages: “Go Online!”, “I’m Online!”, “Go Off-line!”, “I’m Off-line!”, “I’m here!”)
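The state names and messages on this slide suggest a small state machine. The sketch below is one plausible reading, with loud assumptions: the slide's diagram did not survive extraction, so the transition table, the event names (`go_online`, `im_online`, and so on), and the rule that a failed resource can be asked to go online again are guesses for illustration only.

```python
from enum import Enum


class State(Enum):
    OFFLINE = "Offline"
    ONLINE_PENDING = "Online Pending"
    ONLINE = "Online"
    OFFLINE_PENDING = "Offline Pending"
    FAILED = "Failed"


# Assumed transitions: (current state, event) -> next state.
TRANSITIONS = {
    (State.OFFLINE, "go_online"): State.ONLINE_PENDING,      # "Go Online!"
    (State.ONLINE_PENDING, "im_online"): State.ONLINE,       # "I'm Online!"
    (State.ONLINE, "go_offline"): State.OFFLINE_PENDING,     # "Go Off-line!"
    (State.OFFLINE_PENDING, "im_offline"): State.OFFLINE,    # "I'm Off-line!"
    (State.ONLINE_PENDING, "failure"): State.FAILED,
    (State.ONLINE, "failure"): State.FAILED,
    (State.OFFLINE_PENDING, "failure"): State.FAILED,
    (State.FAILED, "go_online"): State.ONLINE_PENDING,       # assumed local restart
}


def step(state: State, event: str) -> State:
    """Apply one event, rejecting transitions the table does not list."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} not allowed in state {state.value}") from None


if __name__ == "__main__":
    state = State.OFFLINE
    for event in ("go_online", "im_online", "failure", "go_online", "im_online"):
        state = step(state, event)
        print(event, "->", state.value)
```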

  33. Resource Dependencies • Similar to NT service dependencies • Orderly startup & shutdown • A resource is brought online after any resources it depends on are online • A resource is taken offline before any resources it depends on • Interdependent resources • form dependency trees • move among nodes together • fail over together • as per resource group (Diagram labels: File Share, Network Name, IIS Virtual Root, IP Address, Resource DLL)
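The two ordering rules on this slide amount to a topological sort of the dependency graph: bring resources online in dependency order, and take them offline in the reverse order. A minimal sketch using Python's standard `graphlib` follows. The resource names come from the slide's diagram labels, but the specific edges between them are an assumption made for illustration.

```python
from graphlib import TopologicalSorter


def online_order(depends_on: dict) -> list:
    """Order in which resources can be brought online (dependencies first)."""
    # TopologicalSorter takes {node: predecessors}; static_order() yields
    # predecessors before the nodes that depend on them.
    return list(TopologicalSorter(depends_on).static_order())


def offline_order(depends_on: dict) -> list:
    """Order in which resources are taken offline (dependents first)."""
    return list(reversed(online_order(depends_on)))


if __name__ == "__main__":
    # Hypothetical dependency edges: resource -> resources it depends on.
    deps = {
        "Network Name": {"IP Address"},
        "File Share": {"Network Name"},
        "IIS Virtual Root": {"IP Address"},
    }
    print("online: ", online_order(deps))
    print("offline:", offline_order(deps))
```

A dependency cycle makes `static_order()` raise `graphlib.CycleError`, which matches the slide's claim that dependencies form trees.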

  34. Dependencies Tab

  35. NT Registry • Stores all configuration information • Software • Hardware • Hierarchical (name, value) map • Has an open, documented interface • Is secure • Is visible across the net (RPC interface) • Typical entry: \Software\Microsoft\MSSQLServer\MSSQLServer\ DefaultLogin = “GUEST” DefaultDomain = “REDMOND”

  36. Cluster Registry • Separate from local NT Registry • Replicated at each node • Algorithms explained later • Maintains configuration information: • Cluster members • Cluster resources • Resource and group parameters (e.g. restart) • Stable storage • Refreshed from “master” copy when node joins cluster

  37. Other Resource Properties • Name • Restart policy (restart N times, failover…) • Startup parameters • Private configuration info (resource-type specific) • Per-node as well, if necessary • Poll intervals (LooksAlive, IsAlive, Timeout) • These properties are all kept in the Cluster Registry

  38. General Resource Tab

  39. Advanced Resource Tab

  40. Resource Groups • Every resource belongs to a resource group. • Resource groups move (fail over) as a unit. • Dependencies NEVER cross groups. (Dependency trees are contained within groups.) • A group may contain a forest of dependency trees. (Diagram labels: Cluster Group; Payroll Group; Web Server; SQL Server; IP Address; Drive E:; Drive F:)
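The rule that dependencies never cross groups is easy to check mechanically. The sketch below is a hypothetical validator, not part of MSCS: the function name is invented, and the assignment of the slide's resource names to the "Payroll Group" and a separate "Web Group" is an assumption made to produce one valid and one invalid edge.

```python
def cross_group_dependencies(group_of: dict, depends_on: dict) -> list:
    """Return (resource, dependency) pairs whose two ends are in different groups."""
    violations = []
    for resource, deps in depends_on.items():
        for dep in deps:
            if group_of[resource] != group_of[dep]:
                violations.append((resource, dep))
    return sorted(violations)


if __name__ == "__main__":
    # Hypothetical layout: every resource belongs to exactly one group.
    group_of = {
        "SQL Server": "Payroll Group",
        "Drive E:": "Payroll Group",
        "Web Server": "Web Group",
        "IP Address": "Web Group",
        "Drive F:": "Web Group",
    }
    depends_on = {
        "SQL Server": ["Drive E:"],                  # stays inside Payroll Group
        "Web Server": ["IP Address", "Drive E:"],    # "Drive E:" crosses groups
    }
    for resource, dep in cross_group_dependencies(group_of, depends_on):
        print(f"{resource} -> {dep} crosses a group boundary")
```

Because a group moves as a unit, a dependency that crossed groups could be left pointing at a resource on another node, which is why the slide forbids it.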

  41. Moving a Resource Group

  42. Group Properties • CurrentState: Online, Partially Online, Offline • Members: resources that belong to the group • members determine which nodes can host the group • Preferred Owners: ordered list of host nodes • FailoverThreshold: how many faults cause failover • FailoverPeriod: time window for the failover threshold • FailbackWindowStart: when can failback begin? • FailbackWindowEnd: when must failback end? • Everything (except CurrentState) is stored in the registry
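Two of these properties interact when a group needs a host: the members restrict which nodes are eligible, and Preferred Owners ranks them. The sketch below illustrates that selection under stated assumptions. The function names are hypothetical, and the fallback to an arbitrary eligible node when no preferred owner is available is a guess, not documented behaviour from the talk.

```python
from typing import Optional


def possible_hosts(resource_possible_nodes: dict) -> set:
    """Nodes that can host every member resource (intersection of per-resource sets)."""
    node_sets = [set(nodes) for nodes in resource_possible_nodes.values()]
    if not node_sets:
        return set()
    return set.intersection(*node_sets)


def choose_host(preferred_owners: list,
                resource_possible_nodes: dict,
                active_nodes: set) -> Optional[str]:
    """Pick the first preferred owner that is active and can host all members."""
    candidates = possible_hosts(resource_possible_nodes) & set(active_nodes)
    for node in preferred_owners:
        if node in candidates:
            return node
    # Assumption: fall back to any eligible node, chosen deterministically here.
    return min(candidates) if candidates else None


if __name__ == "__main__":
    members = {"SQL Server": {"Alice", "Betty"}, "Drive E:": {"Alice", "Betty"}}
    print(choose_host(["Betty", "Alice"], members, {"Alice", "Betty"}))
    print(choose_host(["Betty", "Alice"], members, {"Alice"}))
```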

  43. Failover and Failback • Failover parameters • timeout on LooksAlive, IsAlive • # local restarts in failure window; after this, offline • Failback to preferred node (during failback window) • Do resource failures affect group? (Diagram labels: Failover, Failback; Node \\Betty and Node \\Alice, each running Cluster Service; IPaddr, name)

  44. Cluster Concepts: Clusters (Diagram labels: Cluster; Group; Resource; three Resource Groups)

  45. Cluster Properties • Defined Members: nodes that can join the cluster • Active Members: nodes currently joined to the cluster • Resource Groups: groups in a cluster • Quorum Resource: • Stores a copy of the cluster registry • Used to form quorum • Network: which network is used for communication • All properties are kept in the Cluster Registry

  46. Cluster API Functions (operations on nodes & groups) • Find and communicate with Cluster • Query/Set Cluster properties • Enumerate Cluster objects • Nodes • Groups • Resources and Resource Types • Cluster Event Notifications • Node state and property changes • Group state and property changes • Resource state and property changes

  47. Cluster Management

  48. Outline • Cluster Abstractions • Cluster Architecture • Cluster Implementation • Application Support • Q&A

  49. Architecture • Top tier provides cluster abstractions • Middle tier provides distributed operations • Bottom tier is NT and drivers (Diagram labels, in order: Failover Manager, Resource Monitor, Cluster Registry; Global Update, Quorum, Membership; Windows NT Server, Cluster Disk Driver, Cluster Net Drivers)

  50. Membership and Regroup • Membership: used for orderly addition to and removal from { active nodes } • Regroup: used for failure detection (via heartbeat messages) and forceful eviction from { active nodes } (Diagram labels: Failover Manager, Resource Monitor, Cluster Registry; Global Update, Membership, Regroup; Windows NT Server, Cluster Disk Driver, Cluster Net Drivers)
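The split between orderly membership changes and heartbeat-driven eviction can be illustrated with a toy, single-observer model. This is a sketch under heavy assumptions: the class name, method names, and timeout rule are hypothetical, and a real regroup is an agreement protocol among all surviving nodes rather than one node's local timer.

```python
class HeartbeatMonitor:
    """Tracks { active nodes } from one observer's point of view."""

    def __init__(self, timeout_seconds: float):
        self.timeout_seconds = timeout_seconds
        self._last_seen = {}  # node name -> time of its most recent heartbeat

    def join(self, node: str, now: float) -> None:
        """Orderly addition to the active set (the Membership role)."""
        self._last_seen[node] = now

    def leave(self, node: str) -> None:
        """Orderly removal from the active set (the Membership role)."""
        self._last_seen.pop(node, None)

    def heartbeat(self, node: str, now: float) -> None:
        """Record a heartbeat; heartbeats from non-members are ignored."""
        if node in self._last_seen:
            self._last_seen[node] = now

    def regroup(self, now: float) -> set:
        """Evict nodes whose heartbeats have timed out; return the evicted set."""
        evicted = {node for node, seen in self._last_seen.items()
                   if now - seen > self.timeout_seconds}
        for node in evicted:
            del self._last_seen[node]
        return evicted

    def active_nodes(self) -> set:
        return set(self._last_seen)


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout_seconds=5.0)
    monitor.join("Alice", now=0.0)
    monitor.join("Betty", now=0.0)
    monitor.heartbeat("Alice", now=4.0)  # Betty stays silent
    print("evicted:", monitor.regroup(now=6.0))
    print("active: ", monitor.active_nodes())
```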
