Overview of high availability in Microsoft SQL Server

Overview of high availability in Microsoft SQL Server Szymon Wójcik

Agenda • Introduction • What is availability? • What is and why to choose high availability? • Key factors to consider for high availability scenario • High availability techniques in Microsoft SQL Server • Replication • Log shipping • Mirroring • Failover clustering • Discussion

PLSSUG Cracow Partners

Introduction • Szymon Wójcik • Experience with MS SQL Server since 2000 (dev/admin) • MCITP: DBA SQL Server 2005 • Interests: • Performance tuning • High availability • Blog – sqlphobosq.wordpress.com • Twitter - @phobosq

Availability [1/5] • One of the concepts defined within ITIL • Ability to perform its agreed function when required • Determined by: • Reliability – how long (MTBF) • Maintainability – how quickly restored (MTRS) • Serviceability – contract conditions • Performance • Security • Confidentiality • Integrity • Availability

Availability [2/5] • Best practice – measure in %: • Agreed Service Time – defined in SLA (Service Level Agreement) • Downtime – duration of service unavailability during Agreed Service Time • Important when planning/deploying a service to understand availability concept

Availability [3/5] – figures for one week Allowed downtime duration per week [hh:mm:ss format]

Availability [4/5] – figures for one year Allowed downtime duration per year [DD.hh:mm:ss format]

Availability [5/5] – important notes • Availability != Uptime (service may be up but unavailable) • Scheduled downtime does not have to cause unavailability (up to definition in SLA)

High availability - definition • System design approach and service implementation that ensures a certain level of operational performance (Wikipedia) • Masks the effects of hardware or software failure • Maintains availability of applications so that perceived downtime is minimized (Microsoft)

High availability != disaster recovery • High availability is used for ensuring for meeting Service Level Target for availability • Disaster recovery is ensuring operational continuity • They can be used complementary – HA can minimize the need of invoking DR, but never replace it

Why to choose high availability • For users: • Minimizes downtime probability • Allows to sustain a failure if properly designed • For administrators: • Simplifies migration effort • Minimizes risk of continuity

Single point of failure • A whole system is as strong as the weakest link

Hardware redundancy • Introduce additional hardware to minimize risk of failure

Hardware redundancy • Not only whole machines may be multiplicated to become fault tolerant • Also components: • Power supplies • CPUs • Hard disks • Network interface cards • Storage controllers

Standby node • A standby node is a machine in a HA system that takes over in case of primary server failure • Three types: • Cold standby – Unplugged, needs to be prepared before use • Warm standby – Ready to use, but requires manual switch • Hot standby – Ready to use, takes over automatically • Fail over = switching from primary to standby • Fail back = return to primary • There may be more than one standby in HA scenario!

Load balancing vs failover • Load balancing – distributing of workload between several peer servers • If one goes down, others take over • Workload distributed by load balancer • Failover – automatic switch to standby • Standby is not active • Switch initiated upon loss of heartbeat

Other points • High availability requires additional costs – multiple components must be present according to design in order to meet requirements • It may become complex to maintain – additional CIs present in environment that need to be kept up-to-date • Hardware design must be followed by software to fully benefit from HA • KISS – Keep It Simple and Stupid

High availability in Microsoft SQL Server • SQL Server, as a RDBMS, provides means for failover scenario • Load balancing is difficult and must be properly designed in order to work • High availability in SQL Server does not prevent logical data corruption – periodic DBCC checks are advised

HA methods overview in SQL Server

Replication • Three server roles in replication: • Publisher • Distributor • Subscriber • Three types: • Snapshot • Transactional • Merge • Two subscription methods: • Push – Distributor pushes articles to Subscribers • Pull – Subscribers downloads from Distributor

Replication topology

Possible application of replication • Create a second copy of data to be used in case of emergency (DR) • Create a copy of data to offload the server (load balancing) • Allow offline users to work with data and upload their changes later (high availability)

Replication agents • External programs which are used to implement replication: • Snapshot Agent: • creates snapshots • Log Reader Agent: • Reads transaction log • Marks transactions for replication • Distribution Agent: • Dispatches transactions to Subscriber • Merge Agent: • Downloads remote and uploads local changes • Resolves conflicts in merge replication

Snapshot replication • Publisher makes a copy of a database which is applied at Subscriber • Good for small, static data: • Whole snapshot is applied every time – the changes which appear after snapshot will be applied with next snapshot • Requires sufficient bandwidth

Transactional replication • Starts with a snapshot • Transactions are recorded at Publisher and replayed at Subscriber • May allow for updatable subcriptions • If Subscriber is offline, records are stored at the Distributor

Merge replication • Starts with a snapshot • Merges changes between Publisher and Subscribers • Allows synchronization via HTTPS (since SQL Server 2008) • Allows the most autonomous design – e.g. mobile users, multiple branch offices working on the same data

Replication how-to • Configure Distributor • Configure Publisher: • Select replication type • Select articles to be published • [Optional] Set up article filtering • Set up security • Configure Subscribers: • Connect to Distributor • Select subscription method • Apply snapshot • [Transactional/merge] Synchronize changes

Failover in replication • Stop subscription • Direct all traffic from Publisher to Subscriber: • Change application connection strings • Change DNS aliases, if required, or • Change IP addresses

Failback in replication • After restoring Publisher, restore a copy of database from Subscriber • Direct all traffic from Subscriber to Publisher • Reestablish the replication

Log shipping • Keeps a standby by automating backup, copy and restore process • Three server roles in log shipping: • Primary • Secondary • Monitor

How it works? [1/2] • Restore a full backup from Primary to Secondary and then: • A job runs on Primary which backs up transaction log • Second job copies the log backup to Secondary • Third job on Secondary restores the log after it’s copied • [Optional] Monitor server tracks performance and incidents

How it works? [2/2]

Failover in log shipping • Copy transaction log backups from primary to secondary • Backup tail of the log on primary • Restore all backups except tail-log with NORECOVERY • Restore tail-log with RECOVERY • Disable log shipping jobs • Redirect client traffic to secondary

Drawbacks of log shipping • You can’t miss a transaction log backup • Network traffic generated has to be considered • You are always behind on Secondary • Secondary is read-only

Database mirroring • Allows to keep your standby up-to-date • Allows automatic failover • Cost-effective alternative to clustering • Available in Standard Edition (2005 – 2008 R2) • Does not require cluster capable hardware • Might be in implemented when Windows Authentication mode is not possible (using certificates)

How it works?

Database mirroring modes • High availability (with witness) • Automatic failover • Synchronous transaction commit (principal commits after mirror confirms it’s commit) • High protection (without witness) • Manual failover • Synchronous transaction commit • High performance (without witness) • Manual failover • Asynchronous transaction commit

Manual failover in database mirroring • Can be done with one mouse click in SSMS • Requires client traffic redirection: • Possible within connection string using Failover Partner command

Automatic failover in database mirroring • Initiated automatically by witness if there is no quorum: • If principal is unavailable, fails over to mirror • Does nothing if mirror becomes unavailable • Fails over also if principal is up but unreachable from network! • Requires client traffic redirection: • Possible within connection string using Failover Partner command

Failover clustering • Provides protection on a server level: • Automatic failover in case of server failure • Fails over logins, endpoints and jobs • Combines multiple machines (nodes) in a single virtual server • Requires cluster-capable hardware: • Shared or common storage • Certified server hardware

Clustering

Failover in a cluster

Summary

Discussion

Overview of high availability in Microsoft SQL Server