SQL Server Disaster Recovery 101 Dean Vitner dvitner@gmail.com SQL Server UG Zagreb, 16.4.2014.
Transaction log
• Every data modification is first written to the transaction log (tlog)
  • Before the transaction commits
  • Delayed durability in 2014!
• Data pages are updated in memory
• CHECKPOINT writes dirty pages to disk
  • Both committed and uncommitted
  • Frequency depends on the recovery interval
• When a transaction rolls back, the state of the data prior to the update is read from the tlog
• Space in the transaction log is freed and reused
  • When exactly this happens depends on the recovery model used
Transaction log and recovery
• If the SQL Server service shuts down, recovery will happen on the next startup
• These can happen:
  • Dirty pages from uncommitted transactions not written to disk
  • Dirty pages from committed transactions written to disk
  • Dirty pages from uncommitted transactions written to disk
    • Have to be rolled back (undo)
  • Dirty pages from committed transactions not written to disk
    • Have to be rolled forward (redo)
• Without the tlog, transactional consistency of the database cannot be guaranteed
Recovery models
• Three recovery models in SQL Server
  • SIMPLE
  • FULL
  • BULK-LOGGED
• The model database is in the FULL recovery model
  • If you want newly created databases to be in SIMPLE, change the recovery model of the model database
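The change to the model database mentioned above can be made with ALTER DATABASE; a minimal sketch (requires sysadmin rights):

```sql
-- Newly created databases inherit the recovery model of 'model'
ALTER DATABASE model SET RECOVERY SIMPLE;

-- Verify the current recovery model of each database
SELECT name, recovery_model_desc
FROM sys.databases;
```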
SIMPLE recovery model
• Full and differential backups only
  • The transaction log is not backed up
• Space in the tlog is freed when the data is committed and checkpointed to disk
• Eliminates the need for managing tlog space
• Restore is possible only to the most recent backup
• Consider not using the SIMPLE model on a production system
FULL recovery model
• Full, differential, and log backups
  • Log backups are required
• Space in the tlog is freed when the data is committed and checkpointed to disk and the log is backed up
• Provides full recoverability
  • Up to the point of failure
  • Point-in-time restore
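A point-in-time restore under the FULL model can look roughly like this; database name, file paths, and the STOPAT timestamp are made-up placeholders:

```sql
-- Restore the full backup but leave the database in the restoring state
RESTORE DATABASE AdventureWorks2012
    FROM DISK = 'D:\backup\aw_full.bak'
    WITH NORECOVERY;

-- Apply the log backup only up to the chosen point in time, then recover
RESTORE LOG AdventureWorks2012
    FROM DISK = 'D:\backup\aw_log.trn'
    WITH STOPAT = '2014-04-16 10:30:00', RECOVERY;
```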
BULK-LOGGED recovery model
• Same as the FULL model, but:
  • Bulk operations are only minimally logged
  • In general, only allocation and deallocation of extents is logged
  • Increased performance and lower log space consumption
• Use FULL, but switch to BULK-LOGGED for bulk operations
  • Bulk insert and bcp
  • SELECT INTO
  • ALTER INDEX … REBUILD
  • etc.
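The switch pattern described above can be sketched as follows; the database name and paths are placeholders, and the log backups before and after are there to keep the log chain intact:

```sql
BACKUP LOG MyDb TO DISK = 'D:\backup\mydb_pre_bulk.trn';

ALTER DATABASE MyDb SET RECOVERY BULK_LOGGED;

-- ... run the bulk operation here (BULK INSERT, SELECT INTO, index rebuild) ...

ALTER DATABASE MyDb SET RECOVERY FULL;

BACKUP LOG MyDb TO DISK = 'D:\backup\mydb_post_bulk.trn';
```

Note that a log backup containing minimally logged operations cannot be used for a point-in-time restore, which is why the switch should be kept as short as possible.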
Backup and restore
• Full, differential, and transaction log backups
  • Log backups in the FULL and BULK-LOGGED models
• Backup strategy
  • How much data am I permitted to lose?
  • How much time am I permitted to be down?
  • Use differential backups to minimize the restore time
• Check your backups
  • If you can’t restore from backup…
  • Use the WITH CHECKSUM option
• Backup chain
  • Use COPY_ONLY for out-of-band backups
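The CHECKSUM and COPY_ONLY options above can be sketched like this; names and paths are placeholders:

```sql
-- Regular full backup, with page checksums verified and a backup checksum written
BACKUP DATABASE MyDb
    TO DISK = 'D:\backup\mydb_full.bak'
    WITH CHECKSUM;

-- Ad hoc copy (e.g. for a developer) that does not reset the differential base
BACKUP DATABASE MyDb
    TO DISK = 'D:\backup\mydb_copy.bak'
    WITH COPY_ONLY, CHECKSUM;

-- Verify that the backup is readable without actually restoring it
RESTORE VERIFYONLY FROM DISK = 'D:\backup\mydb_full.bak' WITH CHECKSUM;
```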
Full backups
• Create the starting point for all other backups
• Contain all used pages in the database
• Can run concurrently with transactions
  • ~2% processor overhead
• Completely non-blocking backup, with exceptions:
  • Transaction log (truncation is deferred while the backup runs)
  • Database growth/shrink operations
Full backups – operation
(Timeline diagram: transaction log with marks)
• Mark the log (start of backup)
• Dump all used pages while data changes and checkpoints continue
• Mark the log (end of the data copy)
• Copy the portion of the log between the marks
Differential backups
(Diagram: differential bitmap marking changed extents)
• Contains only pages changed since the last full backup
  • Tracked in the differential bitmap
• NOT an incremental backup
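A differential backup and its restore order can be sketched as follows; names and paths are placeholders:

```sql
BACKUP DATABASE MyDb
    TO DISK = 'D:\backup\mydb_diff.bak'
    WITH DIFFERENTIAL;

-- Restore: the full backup first (NORECOVERY), then only the latest differential
RESTORE DATABASE MyDb FROM DISK = 'D:\backup\mydb_full.bak' WITH NORECOVERY;
RESTORE DATABASE MyDb FROM DISK = 'D:\backup\mydb_diff.bak' WITH RECOVERY;
```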
Transaction log backups
• Back up the inactive portion of the log
  • Committed transactions
• Clears the backed-up portion on completion
• Only allowed in the FULL or BULK-LOGGED recovery model
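A routine log backup, plus the tail-log backup you would attempt after a disaster, can be sketched like this; names and paths are placeholders:

```sql
-- Routine log backup (clears the backed-up portion of the log)
BACKUP LOG MyDb TO DISK = 'D:\backup\mydb_log.trn';

-- After a failure: capture the tail of the log even if the
-- data files are damaged or inaccessible
BACKUP LOG MyDb TO DISK = 'D:\backup\mydb_tail.trn'
    WITH NO_TRUNCATE;
```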
Partial backups
• Partial database backup
  • Primary filegroup
  • Read/write filegroups
• Partial differential backup
  • Primary filegroup differential
  • Read/write filegroups differential
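A partial backup covering the primary and all read/write filegroups (skipping read-only ones) can be sketched as; names and paths are placeholders:

```sql
BACKUP DATABASE MyDb READ_WRITE_FILEGROUPS
    TO DISK = 'D:\backup\mydb_partial.bak';

-- Partial differential
BACKUP DATABASE MyDb READ_WRITE_FILEGROUPS
    TO DISK = 'D:\backup\mydb_partial_diff.bak'
    WITH DIFFERENTIAL;
```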
2014 backup and restore
• Backup an on-premises database to the cloud

  BACKUP DATABASE AdventureWorks2012
  TO URL = 'https://mystorageaccount.blob.core.windows.net/mycontainer/AdventureWorks2012.bak'
  WITH CREDENTIAL = 'mycredential'

• Supports:
  • DATABASE, LOG, DIFFERENTIAL, COPY_ONLY, PARTIAL, COMPRESSION, CHECKSUM
• Restore from cloud backup
• SQL Server Managed Backup to Windows Azure
  • smart_admin
• Backup encryption
I/O errors
• Three types
  • 823 (a hard I/O error)
    • The page can’t be read from disk
  • 824 (a soft I/O error)
    • The page was read, but is torn (inconsistent)
  • 825 (a read-retry error)
    • The page was read, but only after retries
• Logged in msdb..suspect_pages
  • Input into single-page restore operations
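Inspecting suspect_pages and restoring a single damaged page (FULL recovery model required) can be sketched like this; the database name, file:page ID, and path are placeholders:

```sql
SELECT database_id, file_id, page_id, event_type, error_count, last_update_date
FROM msdb..suspect_pages;

-- Restore only the damaged page from the last full backup
RESTORE DATABASE MyDb PAGE = '1:4567'
    FROM DISK = 'D:\backup\mydb_full.bak'
    WITH NORECOVERY;
-- ...then apply subsequent log backups and recover to bring the page current
```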
PAGE_VERIFY
• A page is composed of 16 512-byte sectors
  • It is possible for a page not to be written entirely (a torn write)
• Allows fast detection of problems
• Three options
  • NONE
  • TORN_PAGE_DETECTION
    • Writes two control bits in each sector
  • CHECKSUM
    • Calculates and writes a checksum of the data on the page
    • Written last, checked first
• Failures cause an 824 error
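Setting and checking the page verification option is a one-liner; the database name is a placeholder:

```sql
ALTER DATABASE MyDb SET PAGE_VERIFY CHECKSUM;

-- Verify the option per database
SELECT name, page_verify_option_desc
FROM sys.databases;
```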
DBCC CHECKDB
• Checks the logical and physical integrity of all the objects in the specified database

  DBCC CHECKDB (db_name) WITH ALL_ERRORMSGS, NO_INFOMSGS

• How often? It depends.
• DBCC CHECKDB can take a long time to run
  • Depends on:
    • Size of the database
    • Load
    • Number and type of corruptions
• Consider running DBCC CHECKDB on a restored copy of the database
DBCC CHECKDB, contd.
• Did CHECKDB fail?
  • If it stops before completing successfully, something bad has happened that is preventing CHECKDB from running
  • This means there is no choice but to restore from a backup, as CHECKDB cannot be forced to run (and hence repair)
• Examples of errors fatal to CHECKDB:
  • 7984–7988: corruption in critical system tables
  • 8967: invalid states within CHECKDB itself
  • 8930: corrupt metadata in the database, such that CHECKDB could not run
Are the corruptions only in non-clustered indexes?
• If the recommended repair level is REPAIR_REBUILD, then YES!
• Otherwise, check all the index IDs in the errors – if they’re all greater than 1, then YES!
• If YES, you *could* just rebuild the corrupt indexes
  • Depends on the error, and the size of the index
• But what caused the corruption?
  • If you just rebuild the indexes, the corruption will probably happen again (especially if caused by the I/O subsystem)
  • Make sure you do root-cause analysis and take preventative measures
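Rebuilding a corrupt non-clustered index can be sketched as below; the index and table names are placeholders. The disable step is there because a plain REBUILD may read from the existing (corrupt) index, while rebuilding a disabled index reads the underlying table data instead:

```sql
ALTER INDEX IX_MyTable_MyCol ON dbo.MyTable DISABLE;
ALTER INDEX IX_MyTable_MyCol ON dbo.MyTable REBUILD;
```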
Was there an un-repairable error found?
• 8909, 8938, 8939 (page header corruption) errors where the page type is ‘PFS’
• 8970 error: invalid data for the column type
• 8992 error: CHECKCATALOG (metadata mismatch) error
• Plus a few more obscure ones
  • E.g. an 8904 error (extent is allocated to two objects). This is usually repairable, except in the case where the extent is marked as both mixed and dedicated and has pages allocated to multiple objects.
  • The repair is too complicated and/or destructive, so it is not attempted.
• None of these can be automatically repaired
Restore or repair?
• REPAIR_ALLOW_DATA_LOSS is destructive
• A lot of errors cannot be fixed automatically
• Restore is better if you have a good backup and a proper backup strategy
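For completeness, the last-resort repair looks roughly like this; the database name is a placeholder, single-user mode is required, and data loss should be expected:

```sql
ALTER DATABASE MyDb SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
DBCC CHECKDB (MyDb, REPAIR_ALLOW_DATA_LOSS);
ALTER DATABASE MyDb SET MULTI_USER;
```

Run DBCC CHECKDB again afterwards; repair can itself deallocate data, which is exactly why restoring from a good backup is the preferred path.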