Log-based Media Recovery Techniques for Transactional Information Systems

Chapter 16: Media Recovery
• 16.2 Log-based Method
  • 16.2.1 Database Backup and Archive Logging
  • 16.2.2 Database Restore
  • 16.2.3 Analysis of MTTDL
• 16.3 Storage Redundancy
  • 16.3.1 Techniques Based on Mirroring
  • 16.3.2 Techniques Based on Error-Correcting Codes
• 16.4 Disaster Recovery
• 16.5 Lessons Learned

"More than any time in history mankind faces a crossroads. One path leads to despair and utter hopelessness, the other to total extinction. Let us pray that we have the wisdom to choose correctly." (Woody Allen)

Transactional Information Systems

Failure Model and Assessment Criteria
• Failures whose repair requires media recovery:
  • disk failures (damaged media)
  • corrupted pages on disk (single-block read error)
  • environmental failures
  • fire, water damage, disasters
  • serious bugs in operational server software
  • erroneous user input
• Assessment criteria:
  • availability: MTTF / (MTTF + MTTR)
  • survivability level: number of simultaneous failures that can be repaired
  • mean time to data loss (MTTDL)
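As a small illustration of the availability criterion, the following sketch evaluates MTTF / (MTTF + MTTR). The function name and the sample figures are hypothetical and chosen only for illustration.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is operational: MTTF / (MTTF + MTTR)."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical figures: one failure every 999 hours, one hour to repair.
example = availability(mttf_hours=999.0, mttr_hours=1.0)
print(f"availability = {example:.3f}")
```

Shortening the repair time raises availability even if the failure rate stays the same, which is why MTTR appears next to MTTF in the criterion.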

Log-based Media Recovery
• 2-step recovery:
  • replace the failed disk (or remap corrupted disk blocks) and reload data from the backup copy
  • redo history using the archive log from the beginning of the last completed complete backup; this can cope with rollbacks and crashes/restarts between the time of the backup and media recovery, using CLEs (compensation log entries); then undo losers (as in crash recovery)
• offers a limited, pragmatic form of environmental recovery by selectively skipping log entries

Components for Log-based Media Recovery
[Figure: stable database; backups of June 20, June 27, and July 4; shadow database; files for the stable log and the archive log, containing begin-backup and end-backup entries (June 27, July 4) and transaction entries such as begin (Ti), write (x, ...), end (Ti), begin (Tk). MediaRecoveryLSN, two soft crashes, and a database disk failure are marked on the log, together with the redo pass and the undo pass.]

Database Backup and Archive Logging
• complete or incremental (modified pages only)
• online backup of selected tablespaces:
  • creates "fuzzy" copy on backup disk(s) or tape, containing updates of active transactions
  • scans page-mapping table and resets modified flags
  • scan position saved in checkpoint log entries
  • may copy (stale) pages directly from disk (bypassing cache)
• archive log copies (replicates) all log entries from the stable log since the beginning of the last completed complete backup
• can garbage-collect log entries older than
  MediaRecoveryLSN := min {begin-backup log entry of most recent completed backup,
                           SystemRedoLSN as of begin-backup,
                           current OldestUndoLSN}
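The MediaRecoveryLSN rule can be sketched as follows. This is a minimal illustration, assuming that LSNs are plain integers and that the archive log is a list of (LSN, payload) pairs; the function names and sample values are hypothetical.

```python
from typing import List, Tuple


def media_recovery_lsn(begin_backup_lsn: int,
                       system_redo_lsn_at_begin_backup: int,
                       oldest_undo_lsn: int) -> int:
    """Minimum of the three LSNs named in the MediaRecoveryLSN definition."""
    return min(begin_backup_lsn, system_redo_lsn_at_begin_backup, oldest_undo_lsn)


def garbage_collect(archive_log: List[Tuple[int, str]],
                    recovery_lsn: int) -> List[Tuple[int, str]]:
    """Drop entries older than recovery_lsn; keep everything from it onward."""
    return [(lsn, payload) for lsn, payload in archive_log if lsn >= recovery_lsn]


# Hypothetical LSNs for illustration only.
lsn = media_recovery_lsn(begin_backup_lsn=500,
                         system_redo_lsn_at_begin_backup=470,
                         oldest_undo_lsn=480)
log = [(460, "write(x)"), (470, "write(y)"), (505, "write(z)")]
trimmed = garbage_collect(log, lsn)
```

In this example the SystemRedoLSN is the smallest of the three values, so it determines how far back the archive log must be kept.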

Database Restore
restore (pageset):
   for each page in pageset do
      identify the most recent (incremental or complete) backup that contains a copy of the page;
      copy the page onto the replaced disk;
   end /*for*/;
   perform redo pass on the archive log using the redo-history algorithm, starting from MediaRecoveryLSN and ignoring all log entries not referring to pageset;
   perform analysis pass on the log, starting from most recent checkpoint, to identify loser transactions;
   perform undo pass on the log for loser transactions;

Restore can be accelerated by parallelizing redo, offline merging of multiple incremental backups into a complete backup, and/or applying redo offline to a backup copy ("shadow database").
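The copy and redo steps of the restore procedure can be sketched as below. This is a simplified illustration, not the full algorithm: it assumes that each backup is a dictionary from page id to a (page LSN, contents) pair, that backups are ordered from oldest to newest, that log entries carry full after-images, and that the archive log is sorted by LSN. The analysis and undo passes are omitted. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    lsn: int
    page_id: str
    new_value: str  # assumed after-image of the page


def restore(pageset, backups, archive_log, media_recovery_lsn):
    """Reload pages from backups, then redo the archive log for pageset."""
    restored = {}
    for page_id in pageset:
        # The most recent backup containing the page wins.
        for backup in reversed(backups):
            if page_id in backup:
                restored[page_id] = backup[page_id]
                break
    for entry in archive_log:
        if entry.lsn < media_recovery_lsn or entry.page_id not in pageset:
            continue  # before the redo starting point or not in pageset
        page_lsn, _ = restored.get(entry.page_id, (0, None))
        if entry.lsn > page_lsn:  # LSN test: redo only missing updates
            restored[entry.page_id] = (entry.lsn, entry.new_value)
    return restored


complete = {"p": (10, "p@10"), "q": (12, "q@12")}
incremental = {"q": (25, "q@25")}
log = [LogEntry(20, "p", "p@20"), LogEntry(25, "q", "q@25"),
       LogEntry(30, "q", "q@30"), LogEntry(31, "r", "r@31")]
result = restore({"p", "q"}, [complete, incremental], log, media_recovery_lsn=15)
```

Page q is taken from the incremental backup, so only the entry with LSN 30 is redone for it, while page p needs the entry with LSN 20. The entry for page r is ignored because r is not in the page set.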

Correctness and Quality of Log-based Media Recovery
Theorem 16.1: The backup/log-based media recovery algorithm provides correct recovery after media failures by reconstructing the data such that it captures exactly all winner transactions in the original serialization order.

Analysis of MTTDL (1)
Markov chain model:
[Figure: Markov chain with four states: 1 = db ok, backup and log ok; 2 = db failed, backup and log ok; 3 = db ok, backup or log failed; 4 = db failed, backup or log failed. The transitions are labeled with the rates 1 / MTTF, 2 / MTTF, 1 / MTTR (recovery), and 1 / MTTRbackup.]

Analysis of MTTDL (2)
$r_{ij}$: transition rate from state i to state j
$E_{ij}$ = E[time from entering state i until entering state j]
$H_i$ = E[time between entering and leaving state i] = $1 / \sum_j r_{ij}$
$p_{ik}$ = P[transition from i to k | state i is left] = $r_{ik} / \sum_j r_{ij}$
Solve for the given Markov chain:
$E_{12} = H_1 + p_{13} E_{32}$
$E_{13} = H_1 + p_{12} E_{23}$
$E_{14} = H_1 + p_{12} E_{24} + p_{13} E_{34}$
$E_{21} = H_2$
$E_{23} = H_2 + p_{21} E_{13}$
$E_{24} = H_2 + p_{21} E_{14}$
$E_{31} = H_3$
$E_{32} = H_3 + p_{31} E_{12}$
$E_{34} = H_3 + p_{31} E_{14}$
yielding the MTTDL as $E_{14}$, the expected time from state 1 until state 4 is entered.
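The equations for E14, E24, and E34 form a closed system: substituting the last two into the first gives E14 = (H1 + p12 H2 + p13 H3) / (1 - p12 p21 - p13 p31). The sketch below evaluates this expression numerically. It relies on two assumptions that the transcript does not state explicitly: the standard Markov chain relations H_i = 1 / (sum of rates out of state i) and p_ik = r_ik times H_i, and an assignment of the rates in the figure to transitions (1 to 2 at 1/MTTF, 1 to 3 at 2/MTTF, 2 to 1 at 1/MTTR, 2 to 4 at 2/MTTF, 3 to 1 at 1/MTTRbackup, 3 to 4 at 1/MTTF). The function name and sample values are hypothetical.

```python
def mttdl(mttf: float, mttr: float, mttr_backup: float) -> float:
    """E14 for the four-state chain, under the assumed rate assignment."""
    # Assumed transition rates r_ij (our reading of the Markov chain figure).
    r12, r13 = 1.0 / mttf, 2.0 / mttf
    r21, r24 = 1.0 / mttr, 2.0 / mttf
    r31, r34 = 1.0 / mttr_backup, 1.0 / mttf

    # Mean holding times H_i and branching probabilities p_ik.
    h1 = 1.0 / (r12 + r13)
    h2 = 1.0 / (r21 + r24)
    h3 = 1.0 / (r31 + r34)
    p12, p13 = r12 * h1, r13 * h1
    p21 = r21 * h2
    p31 = r31 * h3

    # E14 = H1 + p12 E24 + p13 E34, with E24 and E34 substituted.
    return (h1 + p12 * h2 + p13 * h3) / (1.0 - p12 * p21 - p13 * p31)


# Hypothetical values in hours.
e14 = mttdl(mttf=100_000.0, mttr=10.0, mttr_backup=10.0)
print(f"MTTDL = {e14:.3e} hours")
```

The result is only as good as the assumed rates; a different reading of the figure changes the numbers but not the way the equations are solved.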

Chapter 16: Media Recovery
• 16.2 Log-based Method
• 16.3 Storage Redundancy
  • 16.3.1 Techniques Based on Mirroring
  • 16.3.2 Techniques Based on Error-Correcting Codes
• 16.4 Disaster Recovery
• 16.5 Lessons Learned

Mirrored Disk Pairs
Storage redundancy techniques provide protection against disk failure with continuous availability; recovery rebuilds the contents of the failed disk on a hot spare.
Writes are routed to both disks of a pair; reads are optimized for seek time or load balance.

disk 1             disk 2             disk 3             disk 4
block 1.1 = 2.1'   block 2.1 = 1.1'   block 3.1 = 4.1'   block 4.1 = 3.1'
block 1.2 = 2.2'   block 2.2 = 1.2'   block 3.2 = 4.2'   block 4.2 = 3.2'
block 1.3 = 2.3'   block 2.3 = 1.3'   block 3.3 = 4.3'   block 4.3 = 3.3'
block 1.4 = 2.4'   block 2.4 = 1.4'   block 3.4 = 4.4'   block 4.4 = 3.4'
...                ...                ...                ...
(disks 1 and 2 form one mirrored disk pair, disks 3 and 4 another)

Declustered Mirroring
For group size G, replicas of the blocks on disk j are placed round-robin on disks j+1, ..., G, 1, ..., j-1
⇒ the copy of block j.k of disk j is on disk (j + ((k-1) mod (G-1))) mod G + 1, with disks and blocks numbered from 1
⇒ less performance degradation during rebuild from G-1 disks

disk 1          disk 2          disk 3          disk 4
1.1 = 2.m+1'    2.1 = 3.m+1'    3.1 = 4.m+1'    4.1 = 1.m+1'
1.2 = 3.m+2'    2.2 = 4.m+2'    3.2 = 1.m+2'    4.2 = 2.m+2'
1.3 = 4.m+3'    2.3 = 1.m+3'    3.3 = 2.m+3'    4.3 = 3.m+3'
1.4 = 2.m+4'    2.4 = 3.m+4'    3.4 = 4.m+4'    4.4 = 1.m+4'
...             ...             ...             ...
1.m+1 = 4.1'    2.m+1 = 1.1'    3.m+1 = 2.1'    4.m+1 = 3.1'
1.m+2 = 3.2'    2.m+2 = 4.2'    3.m+2 = 1.2'    4.m+2 = 2.2'
1.m+3 = 2.3'    2.m+3 = 3.3'    3.m+3 = 4.3'    4.m+3 = 1.3'
1.m+4 = 4.4'    2.m+4 = 1.4'    3.m+4 = 2.4'    4.m+4 = 3.4'
...             ...             ...             ...
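The round-robin placement can be checked with a short sketch. It assumes that disks and blocks are numbered from 1 and uses the mapping (j + ((k-1) mod (G-1))) mod G + 1, which reproduces the four-disk layout of the figure; the function name is hypothetical.

```python
def replica_disk(j: int, k: int, group_size: int) -> int:
    """Disk holding the copy of block k of disk j (1-based numbering assumed)."""
    return (j + (k - 1) % (group_size - 1)) % group_size + 1


G = 4
# Replica disks for the first four blocks of each disk, as in the figure.
placement = {j: [replica_disk(j, k, G) for k in range(1, 5)]
             for j in range(1, G + 1)}
for j, disks in placement.items():
    print(f"disk {j}: copies of blocks 1..4 on disks {disks}")
```

For every disk j, G-1 consecutive blocks have their copies on all G-1 other disks, so the read load of a rebuild is spread over the whole group instead of a single mirror partner.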

RAID-4: Parity Groups
RAID (redundant arrays of independent disks): lower storage overhead than mirroring, but higher write cost
• for each block k of disks 1, ..., G maintain a parity block on a dedicated parity disk G+1
• upon write to block k of disk j:
  new parity (1.k, ..., G.k) on parity disk G+1 := old parity (1.k, ..., G.k) ⊕ old contents (j.k) ⊕ new contents (j.k)
• upon failure of disk j, block j.k can be reconstructed from blocks 1.k, ..., (j-1).k, (j+1).k, ..., G.k and the parity block (G+1).k
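The parity update and the reconstruction rule can be illustrated with bytewise XOR. This is a minimal sketch that models blocks as equal-length byte strings; the helper names and sample data are hypothetical.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """Bytewise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def updated_parity(old_parity: bytes, old_contents: bytes, new_contents: bytes) -> bytes:
    """new parity := old parity XOR old contents XOR new contents."""
    return xor_blocks(old_parity, old_contents, new_contents)


def reconstruct(surviving_blocks, parity: bytes) -> bytes:
    """Rebuild the lost block from the surviving data blocks and the parity."""
    return xor_blocks(*surviving_blocks, parity)


data = [b"\x0f\xf0", b"\x33\x55", b"\xaa\x01"]   # blocks 1.k, 2.k, 3.k
parity = xor_blocks(*data)

new_block = b"\x00\xff"                          # write to block 2.k
parity = updated_parity(parity, data[1], new_block)
data[1] = new_block

recovered = reconstruct([data[0], data[2]], parity)  # disk 2 fails
```

The write touches only the modified data block and the parity block, which is the source of the small-write cost: two reads and two writes instead of one write.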

Illustration of RAID-4 (Parity Groups)
During normal operation:

disk 1      disk 2      ...   disk N      parity disk        spare disk
block 1.1   block 2.1   ...   block N.1   ⊕ (1.1 ... N.1)
block 1.2   block 2.2   ...   block N.2   ⊕ (1.2 ... N.2)
block 1.3   block 2.3   ...   block N.3   ⊕ (1.3 ... N.3)
block 1.4   block 2.4   ...   block N.4   ⊕ (1.4 ... N.4)
...         ...               ...         ...

During repair:
[Figure: the same arrangement of disks 1 to N, parity disk, and spare disk during repair.]

RAID-5: Parity Striping
• eliminates the bottleneck of a single parity disk by placing the parity blocks of a group round-robin across the group's disks (striping):
  the parity block for N blocks with number k resides on disk (k+N-1) mod (N+1) + 1

[Figure: disks 1, 2, 3, ..., N+1 with rotating parity:
 row 1: parity ⊕ (1.1 ... N.1) on disk N+1
 row 2: parity ⊕ (2.2 ... N+1.2) on disk 1
 row 3: parity ⊕ (3.3 ... 1.3) on disk 2
 row 4: parity ⊕ (4.4 ... 2.4) on disk 3
 all other positions hold data blocks]
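The placement formula can be tried out with a few lines of code. The sketch assumes 1-based block and disk numbers, as in the formula; the function names are hypothetical.

```python
def parity_disk(k: int, n: int) -> int:
    """Disk holding the parity block for the N data blocks with number k."""
    return (k + n - 1) % (n + 1) + 1


def stripe_layout(k: int, n: int) -> list:
    """Role of each of the N+1 disks for block number k."""
    p = parity_disk(k, n)
    return ["parity" if disk == p else "data" for disk in range(1, n + 2)]


N = 4
rotation = [parity_disk(k, N) for k in range(1, 7)]
for k in range(1, 5):
    print(k, stripe_layout(k, N))
```

With N = 4 the parity moves from disk 5 to disk 1, then 2, 3, 4, and back to 5, so every disk carries the same share of parity updates.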

Extended RAID Systems
• Reducing the small-write penalty:
  • parity logging (possibly in safe RAM) to defer and batch parity writes
  • floating parity blocks written to convenient tracks (with dynamically adjusted block-mapping table)
• parity block declustering (clustered RAID):
  construct parity blocks for groups of G blocks and spread them uniformly across C > G+1 disks
  ⇒ shorter rebuild because of lower per-disk extra load in degraded mode
• Coping with multiple disk failures:
  use an appropriate error-correcting code (e.g., Reed-Solomon code) (RAID-6) to mask two disk failures within a disk group

Parity-Block Declustering (Clustered RAID)
[Figure: example with C = 5 disks and G = 3: parity groups 1 to 5, each consisting of three data blocks and one parity block, spread across disks 1 to 5.]
• Requirements for placement of n parity block groups:
  • for each group of G+1 blocks, the blocks must be on different disks
  • each disk holds n/C parity blocks
  • for the m = n(G+1)/C groups represented by the blocks of a given disk, the mG blocks that belong to these groups are evenly distributed across all other C-1 disks
  ⇒ combinatorial block design
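The three placement requirements can be turned into a small checker. This is a sketch under stated assumptions: a layout is a list of (data disks, parity disk) pairs with disks numbered from 0, and the sample layout for C = 5, G = 3 is a hypothetical one constructed for illustration, not necessarily the layout of the figure.

```python
from collections import Counter


def is_valid_declustering(groups, num_disks: int) -> bool:
    """Check the three placement requirements for parity-block declustering."""
    n = len(groups)
    # 1. All G+1 blocks of a group are on different disks.
    for data_disks, parity in groups:
        members = list(data_disks) + [parity]
        if len(set(members)) != len(members):
            return False
    # 2. Each disk holds n/C parity blocks.
    parity_count = Counter(parity for _, parity in groups)
    if any(parity_count[d] * num_disks != n for d in range(num_disks)):
        return False
    # 3. For each disk, the other blocks of its groups are spread evenly
    #    over all other C-1 disks.
    for d in range(num_disks):
        shared = Counter()
        for data_disks, parity in groups:
            members = set(data_disks) | {parity}
            if d in members:
                shared.update(members - {d})
        if len({shared[o] for o in range(num_disks) if o != d}) != 1:
            return False
    return True


C = 5
# Hypothetical layout: group g skips disk g and keeps its parity on disk g+1.
good = [([d for d in range(C) if d not in (g, (g + 1) % C)], (g + 1) % C)
        for g in range(C)]
# Counterexample: dedicated parity disk 4, disk 3 unused.
bad = [([0, 1, 2], 4) for _ in range(C)]
print(is_valid_declustering(good, C), is_valid_declustering(bad, C))
```

For larger C and G, layouts that pass all three checks are typically obtained from combinatorial block designs rather than by trial and error.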

Rebuild Algorithms
• rebuild the failed disk online without interrupting accesses to the data that resided on the failed disk
• reconstruct blocks of the failed disk on demand
• optimizations:
  • redirect disk-reads to the new disk for blocks that are already rebuilt
  • maintain parity as during normal operation for blocks that are already rebuilt
  • cache blocks that are reconstructed for regular accesses and write them to the new disk when convenient (piggyback rebuilding work on regular disk-reads, thus rebuilding popular blocks early)

Disk-Read Optimization in Degraded Mode
disk-read (block (N+1).k):
   if block (N+1).k has already been rebuilt
   then
      fetch (block (N+1).k);
   else
      fetch (block 1.k); ...; fetch (block N.k) using the algorithm as during normal operation;
      contents of block (N+1).k := 1.k XOR 2.k XOR ... XOR N.k;
      return the contents of block (N+1).k;
      flush (block (N+1).k) at the discretion of the disk scheduling for disk N+1;
      mark block (N+1).k as rebuilt;
   end /*if*/;
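A minimal in-memory sketch of the degraded-mode read is given below. It assumes that the N surviving disks (data and parity) are lists of equal-length byte strings and that the spare disk replaces the failed disk N+1. Unlike the pseudocode, the reconstructed block is written to the spare immediately instead of being flushed at the scheduler's discretion. The class and method names are hypothetical.

```python
from functools import reduce


class DegradedArray:
    def __init__(self, surviving_disks, num_blocks: int):
        self.surviving = surviving_disks     # N disks: data plus parity
        self.spare = [None] * num_blocks     # replacement for failed disk N+1
        self.rebuilt = set()                 # block numbers already rebuilt

    def _reconstruct(self, k: int) -> bytes:
        """XOR of blocks 1.k ... N.k yields the lost block (N+1).k."""
        blocks = [disk[k] for disk in self.surviving]
        return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

    def read_failed_disk_block(self, k: int) -> bytes:
        if k in self.rebuilt:
            return self.spare[k]             # redirect read to the new disk
        contents = self._reconstruct(k)
        self.spare[k] = contents             # piggyback rebuild on the read
        self.rebuilt.add(k)
        return contents


d0 = [b"\x01", b"\x10"]
d1 = [b"\x02", b"\x20"]
lost = [b"\x04", b"\x40"]                    # contents of the failed disk
parity = [b"\x07", b"\x70"]                  # XOR of d0, d1, and lost
array = DegradedArray([d0, d1, parity], num_blocks=2)
first_read = array.read_failed_disk_block(0)
```

After the first read, block 0 is marked as rebuilt, so a second read of the same block is served from the spare without touching the other disks.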

Disk-Write Optimization in Degraded Mode
disk-write (block (N+1).k):
   if block (N+1).k has already been rebuilt
   then
      fetch (block (N+1).k) unless the block is still available in RAM;
      fetch (parity block j.k of the parity group to which (N+1).k belongs);
   else
      fetch (block 1.k); ...; fetch (block N.k);
      old contents of block (N+1).k := 1.k XOR 2.k XOR ... XOR N.k;
      let j.k be the parity block of this parity group;
   end /*if*/;
   compute new parity block j.k := old contents of block j.k XOR old contents of block (N+1).k XOR new contents of block (N+1).k;
   flush (block (N+1).k) using the block's new contents;
   flush (block j.k) using new parity as block contents;
   mark block (N+1).k as rebuilt;

Optimized Online Rebuild Algorithm
rebuild (disk N+1) on spare disk:
   for each block k of the failed disk N+1 do
      if the block has not yet been rebuilt
      then
         disk-write (block (N+1).k) using the algorithm for disk-writes in degraded mode, with low priority for the resulting fetch and flush I/O requests;
      end /*if*/;
   end /*for*/;
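The degraded-mode write and the rebuild loop built on it can be sketched together. The sketch makes several assumptions beyond the pseudocode: disks are in-memory lists of equal-length byte strings, the parity sits on one fixed surviving disk (RAID-4 style), I/O priorities are not modeled, and a rebuild without new data writes back the reconstructed contents unchanged. All names are hypothetical.

```python
from functools import reduce


def xor(*blocks: bytes) -> bytes:
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


class DegradedArray:
    def __init__(self, surviving_disks, parity_disk: int, num_blocks: int):
        self.surviving = surviving_disks   # N disks: data plus parity
        self.parity_disk = parity_disk     # index of the parity disk (assumed fixed)
        self.spare = [None] * num_blocks   # replacement for failed disk N+1
        self.rebuilt = set()

    def write_failed_disk_block(self, k: int, new_contents: bytes = None) -> None:
        if k in self.rebuilt:
            old = self.spare[k]
        else:
            # Old contents must be reconstructed from the N surviving blocks.
            old = xor(*(disk[k] for disk in self.surviving))
        if new_contents is None:           # pure rebuild: contents unchanged
            new_contents = old
        j = self.parity_disk
        # new parity := old parity XOR old contents XOR new contents
        self.surviving[j][k] = xor(self.surviving[j][k], old, new_contents)
        self.spare[k] = new_contents
        self.rebuilt.add(k)

    def rebuild(self) -> None:
        for k in range(len(self.spare)):
            if k not in self.rebuilt:
                self.write_failed_disk_block(k)


d0 = [b"\x01", b"\x10"]
d1 = [b"\x02", b"\x20"]
parity = [b"\x07", b"\x70"]                # failed disk held \x04 and \x40
array = DegradedArray([d0, d1, parity], parity_disk=2, num_blocks=2)
array.write_failed_disk_block(0, b"\x08")  # regular write during rebuild
array.rebuild()                            # rebuilds the remaining block 1
```

The regular write to block 0 already rebuilds that block, so the background loop only has to handle block 1, which illustrates how rebuild work piggybacks on normal accesses.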

Specific Considerations for Disaster Recovery
• Backup resides at remote site
• Maintain archive log at remote site by log shipping:
  • within distributed transactions (or even replicate the database remotely)
  • without transactional control, but preserving the serialization order of log entries (with the risk of losing the tail of the log)
• Backup server could even be "hot standby" (with failover similar to data-sharing cluster architecture)

Lessons Learned
• The redo-history recovery algorithm is also appropriate for media recovery, based on a backup database and an archive log: MediaRecoveryLSN marks the log-truncation and redo starting point
• Log-based media recovery is the most versatile method; storage-redundancy techniques are attractive for continuous availability
• Mirroring (with declustering) and RAID-5 are commodities; clustered RAID is the best technique in terms of MTTDL and MTTR, but complex to implement (needs block design)
• Disaster recovery can adopt media recovery techniques with a remote backup/replication site
