High Availability (HA)

Agenda
• High Availability Introduction
• Front-End High Availability
• User Experience
• Server Management
• SQL Back-End High Availability

High Availability Introduction

Understanding Availability
• What is availability? By definition: the level of redundancy applied to a system to ensure an absolute degree of operational continuity during planned and unplanned outages
• The focus of this feature is to provide service availability, i.e. to keep the service up and running
• Define the degree of availability
• The secondary objective is to minimize the impact on the user experience when any of the Lync assets fails or the infrastructure glitches (AD, DNS, network connectivity, etc.)

Determining Availability Requirements
• Designs should be based on business requirements and Service Level Agreements (SLAs)
  - What are the business drivers?
  - What SLAs are in place?
  - Are these SLAs shared with other teams (network, hardware, Active Directory, SQL)?
  - How do they define “availability”?
  - How are “RPO” and “RTO” defined?
• The level of availability has “potential” implications
  - Cost
  - Complexity
  - Management

Availability 9s Considerations
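As a rough illustration of what each additional 9 implies, the following Python sketch converts an availability target into the downtime it allows per year. The function name and the 365-day year are assumptions made for this example; they are not taken from the deck.

```python
# Illustrative arithmetic only: allowed downtime for a given availability target.
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a 365-day year


def allowed_downtime_minutes(availability_percent):
    """Return the maximum downtime per year, in minutes, for an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% -> {minutes:.1f} minutes/year ({minutes / 60:.2f} hours)")
```

Each extra 9 cuts the downtime budget by a factor of ten, which is why the cost, complexity and management implications grow so quickly with the target.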

HA Capabilities within Skype for Business
HA: Server failure
• Server clustering via hardware load balancer (HLB) and Domain Name Service (DNS) load balancing
• Mechanism built into Skype for Business Server to automatically distribute services and groups of users across the Front End servers within an Enterprise Pool
HA: Back-end failure
• Support for choice of technology: SQL Failover Clustering, SQL Always On, SQL Mirroring
• Supports automatic failover (FO)/failback (FB) (with witness) and manual FO/FB
• Integrated into the core product tools such as Topology Builder, Skype Server Control Panel and Skype Server Management Shell

Front End High Availability Overview

The Availability Brick Model Evolution

Fabric v3 Within Skype for Business
Availability Model – Services
• Supports MCU Factory, Conference Directory, Routing Group, LYSS
• Fast failover with full service availability
• Automatic scaling and load balancing
• Failover management (Activate/Deactivate API during patching)
• Performs primary/secondary node election
• Replication between primary and secondary nodes
Availability Model – Users
• Users are mapped to groups
• Each group is a persisted stateful service with up to 3 replicas
• User requests are serviced by the primary replica
• User Location Routing

Server OS Fabric Considerations
• Operating system selection determines the version of Windows Fabric installed during setup
• Recommended OS: Windows Server 2012 R2
• Note: If migrating from Windows Server 2008 R2, the recommendation is to deploy side-by-side on Windows Server 2012 R2 instead
• The latest fixes for Windows Fabric may not be available for older operating systems

Pool Quorum
When servers detect another server or cluster to be down based on their own state, they consult the Arbitrator before committing that decision.
• Voter system
• A minimum number of voters is required to prevent service startup failures and provide for pool failover, as shown in the following table:

Example: Pool Quorum – Voter In-Depth
• Two Server Pool
• Three Server Pool
• Six Server Pool
• File: C:\ProgramData\Windows Fabric\FabricHostSettings.xml

Pool Startup Fabric Behavior Scenarios
At cluster boot-up
• A Primary member for each Routing Group service is created
• The Primary member synchronizes data available in the blob store to the local database
• The elected Secondary member for each Routing Group is synchronized with the Primary
When a Front End restarts
• Windows Fabric automatically load balances appropriate services to this Front End once the restart is complete
• The Front End is first made an idle Secondary for services, and subsequently an active Secondary
To manage any service, only (3) nodes with synchronized communication are required

Fabric Group Based Routing Scenarios
• All users assigned to a group are homed on the same Front End (FE)
• Groups fail over to other registrars within a pool when the Primary member fails
• Groups are rebalanced dynamically when FEs are added/removed
• Routing Groups are assigned to a dedicated Replica Set

Example: Fabric Routing Group Assignment

Intra-Pool Load Balancing & Replication
Persistent User Data
• Synchronous replication to two more FEs (Backup / Replicas)
• Presence, Contacts/Groups, User Voice Setting, Conferences
• Synchronous writes to the back end for conference state
Transient User Data
• Not replicated across Front End servers
• Presence changes due to user activity, including:
  - Calendar
  - Inactivity
  - Phone call
Minimal portions of conference data replicated to include:
  - Active Conference Roster
  - Active Conference MCUs
Limited usage of “Shared” Storage Blob
• Data rehydration of client endpoints
• Disaster recovery

Replica Set Behavior
• Three replicas: 1 Primary, 2 Secondaries (quorum)
• If one replica goes down, another one takes over as the Primary
• For 15-30 minutes, Fabric will not attempt to build another replica*
• If one of the two remaining replicas goes down during this time, the replica set is in quorum loss
• Fabric will wait indefinitely for the two replicas to come up again
*The user count affects this interval
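The behavior above can be sketched as a toy Python model. It assumes quorum means a strict majority of the configured replicas, and it leaves out the 15-30 minute rebuild window altogether. The class, its methods and the node names FE1 to FE3 are hypothetical and are not part of Windows Fabric or Skype for Business.

```python
class ReplicaSet:
    """Toy model of one routing group's replica set (1 Primary + Secondaries)."""

    def __init__(self, primary, secondaries):
        self.primary = primary
        self.secondaries = list(secondaries)
        self.size = 1 + len(self.secondaries)  # configured replica count

    def live_replicas(self):
        """Return the nodes that currently hold a live replica."""
        current_primary = [self.primary] if self.primary is not None else []
        return current_primary + self.secondaries

    def has_quorum(self):
        # Assumption: quorum is a strict majority of the configured replicas.
        return len(self.live_replicas()) > self.size // 2

    def fail(self, node):
        """Mark a node as down and promote a Secondary if quorum still holds."""
        if node == self.primary:
            self.primary = None
        elif node in self.secondaries:
            self.secondaries.remove(node)
        else:
            return  # the node does not host a replica of this group
        if self.primary is None and self.has_quorum():
            self.primary = self.secondaries.pop(0)


if __name__ == "__main__":
    group = ReplicaSet("FE1", ["FE2", "FE3"])
    group.fail("FE1")  # first failure: a Secondary is promoted
    print(group.primary, group.has_quorum())
    group.fail("FE2")  # second failure before a rebuild: quorum loss
    print(group.primary, group.has_quorum())
```

In this model the second failure leaves the group without a Primary, which corresponds to the quorum loss state described above.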

Replica Set Stateful Service Failover
[Diagram: Nodes 1 to 5, each running an OS and hosting Primary or Secondary stateful service replicas, with replication between the replicas]

Survivable Branch Routing Group Scenarios
• What about SBA/SBS-homed users?
• SBA/SBS will have a pool defined for User Services
• This pool will contain the Routing Groups for the users assigned to the SBS/SBA
• One pool can service multiple SBA/SBS
• Each SBS/SBA gets its own unique Routing Group (RG)
• All users homed on an SBS/SBA are in the same RG
• This can include up to 5000 users based on current sizing guidelines
• This Routing Group will have up to 3 copies, like any other Routing Group
Note: Since (1) SBA can be associated with only (1) pool, in large environments SBAs should be staggered across the pools to provide the highest level of availability possible

Survivable Branch Routing Group Scenarios Let’s check out some SBS users…

Survivable Branch Routing Group Scenarios

Survivable Branch Routing Group Scenarios
• Let’s add a new SBS to the topology. First, we’ll check the Routing Group distribution.
• Now, after publishing the new SBA, let’s look again.

Survivable Branch Routing Group Scenarios
• After creating users on the new SBS, let’s check the routing group ID.
• Look familiar?

High Availability User Experience: Primary Copy Offline

Example: User Experience
• Now, stop services on POOLA2.

Example: User Experience
• Notice that one of the secondary copies was promoted to primary
• Server restored

Example: User Experience

Example: User Experience
• Amy’s client logs show her client trying to REGISTER and receiving a 301 redirect to POOLA3 (which is up)

Example: User Experience
• But what about a 2-FE pool? Is it different because we don’t have 3 copies?
• Nope, it still works fine.

High Availability User Experience: All Copies Offline

Example: User Experience
• Now, stop VMs POOLA4, POOLA5 and POOLA2.

Example: User Experience
• Amy’s Routing Group is in quorum loss (no Primaries)

Example: User Experience
• How do I get out of this?
• Perform a QuorumLossRecovery on the affected pool (Reset-CsPoolRegistrarState with the QuorumLossRecovery reset type).

Example: User Experience

High Availability Server Management: Patching

Server Grouping – Upgrade Domains
• A logical grouping of servers on which software maintenance, such as upgrades and security updates, is performed at the same time
• You cannot lose more than one Upgrade Domain at a time
• Loss of multiple Upgrade Domains = quorum loss
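To see why losing more than one Upgrade Domain leads to quorum loss, consider the following Python sketch. It assumes that each routing group has three replicas placed in three different Upgrade Domains and that quorum is a strict majority. The node names, routing group names and placement are invented for illustration and do not come from a real pool.

```python
def groups_in_quorum_loss(replica_nodes, node_to_ud, offline_uds):
    """Return the routing groups that lose majority quorum when the
    given upgrade domains are offline."""
    lost = []
    for group, nodes in replica_nodes.items():
        live = [n for n in nodes if node_to_ud[n] not in offline_uds]
        if len(live) <= len(nodes) // 2:  # no strict majority left
            lost.append(group)
    return sorted(lost)


# Hypothetical six-server pool spread over three upgrade domains.
NODE_TO_UD = {
    "FE1": "UD1", "FE4": "UD1",
    "FE2": "UD2", "FE5": "UD2",
    "FE3": "UD3", "FE6": "UD3",
}

# Hypothetical placement: every group has one replica per upgrade domain.
REPLICAS = {
    "RG-A": ["FE1", "FE2", "FE3"],
    "RG-B": ["FE4", "FE5", "FE6"],
    "RG-C": ["FE1", "FE5", "FE3"],
}

if __name__ == "__main__":
    print(groups_in_quorum_loss(REPLICAS, NODE_TO_UD, {"UD1"}))
    print(groups_in_quorum_loss(REPLICAS, NODE_TO_UD, {"UD1", "UD2"}))
```

With one Upgrade Domain offline every group keeps two of its three replicas. With two offline every group is down to a single replica, so no group in this model retains quorum.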

Upgrade domains and service placements
[Diagram: Nodes 1 to 6 spread across three upgrade domains (UD:/UpgradeDomain1, UD:/UpgradeDomain2, UD:/UpgradeDomain3), each node hosting a mix of Primary (P) and Secondary (S) replicas]

Upgrade Domains
• Related to the number of FEs in the pool at creation time (Topology Builder logic)
• How can I tell?
  Get-CsPoolUpgradeReadinessState | Select-Object -ExpandProperty UpgradeDomains
• What if I add more FEs to the pool? Depending on the initial creation state, more UDs may be created, or more servers may be placed into existing UDs

Example: Topology Builder Upgrade Domain
Within this example we see:
• (1) Upgrade Domain for a Standard Edition Pool
• (1) Upgrade Domain for the Monitoring Server Role

Cmdlets
• Get-CsUserPoolInfo -Identity <user>
  - Primary pool/FEs, secondary pool/FEs, routing group

More Cmdlets
• Get-CsPoolFabricState
  - Detailed information about all the fabric services running in a pool
• Get-CsPoolUpgradeReadinessState
  - Returns information indicating whether or not your Lync Registrar pools are ready to be upgraded/patched
• Reset-CsRoutingGroup
  - Administrators can reset Windows Fabric routing groups that are missing or are otherwise not working correctly
  - Missing routing groups can be identified by using the Get-CsPoolFabricState cmdlet with the FilterOnMissingReplicas parameter
• Skip-CsRoutingGroupAtServiceStartup

Resetting the Pool
Reset-CsPoolRegistrarState
• FullReset – cluster changes (1->Any, 2->Any, Any->2, Any->1), Upgrade Domain changes
• QuorumLossRecovery – force Fabric to rebuild services that lost quorum
• ServiceReset – voter change (default if no ResetType is specified)
• MachineStateRemoved – removes the specified server from the pool

Windows Fabric v3 Changes
• Better load balancing that considers replica movement costs
• Performance improvements, bug fixes (including issues reported by Skype), and improved debuggability
• Lease-level performance improvements; lease expiration in the cluster manifest is honored
• Support for the Deactivate API
  - Slowly drains replicas out instead of causing a sudden spike that puts heavy load on the secondary
  - Prevents Windows Fabric from placing replicas on FEs that are shutting down
• Safe upgrade checks
• Official support for Server 2012 R2

Skype for Business Patching Process Evolution
• Lync 2013 vs. Skype for Business: from (8) steps to (4)

Skype for Business Server patching
Simplified workflow leverages Windows Fabric v2/v3 APIs. 4 steps:
1. Invoke-CsComputerFailOver to fail over (stop) a Front End: take the FE out of rotation and move the replicas out
2. Perform patching/upgrade
3. Invoke-CsComputerFailBack to fail back (start) a Front End: bring the FE into the active state and move the replicas in
4. Repeat for all Front Ends in the pool
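The ordering of the four steps can be sketched as a small Python loop. This is only a model of the sequence: the three callables are hypothetical placeholders for Invoke-CsComputerFailOver, the actual patch installation and Invoke-CsComputerFailBack, which in a real deployment are run from the Skype Server Management Shell.

```python
def patch_pool(front_ends, fail_over, apply_patch, fail_back):
    """Patch one Front End at a time: fail over, patch, fail back, repeat.

    An exception from any step stops the rollout, so the remaining
    Front Ends are left untouched.
    """
    patched = []
    for fe in front_ends:
        fail_over(fe)    # placeholder for Invoke-CsComputerFailOver
        apply_patch(fe)  # placeholder for the patching/upgrade step
        fail_back(fe)    # placeholder for Invoke-CsComputerFailBack
        patched.append(fe)
    return patched


if __name__ == "__main__":
    log = []
    patch_pool(
        ["FE1", "FE2", "FE3"],  # hypothetical server names
        fail_over=lambda fe: log.append(("failover", fe)),
        apply_patch=lambda fe: log.append(("patch", fe)),
        fail_back=lambda fe: log.append(("failback", fe)),
    )
    for entry in log:
        print(entry)
```

Handling one server per iteration keeps the rest of the pool in service while a Front End is out of rotation.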

Invoke-CsComputerFailOver
• Checks for availability of a sufficient number of servers
• Waits for replica stability across the pool
  - Confirms all replicas exist before taking the server down
• Initiates deactivation of the node and waits for success/failure from Windows Fabric
• Stops services after successfully deactivating the node
