Database Recovery – Heisenbugs, Houdini's Tricks, and the Paranoia of Science
(Rationale, State of the Art, and Research Directions)

Gerhard Weikum
weikum@cs.uni-sb.de
http://www-dbs.cs.uni-sb.de

Outline:
• What and Why?
• Correct and Efficient Redo
• Correct Undo
• Quantitative Guarantees
• Where and How?
Why This Topic?
Recovery: repair data after software failures
• exotic and underrated
• fits well with the theme of the graduate studies program
• critical for dependable Internet-based e-services
• I wrote a textbook on the topic
• Paranoia of science: hope (and strive) for the best, but prepare for the worst

"All hardware and software systems must be verified." (Wolfgang Paul, U Saarland)
"Software testing (and verification) doesn't scale." (James Hamilton, MS and ex-IBM)
"If you can't make a problem go away, make it invisible." (David Patterson, UC Berkeley)
→ Recovery Oriented Computing
Why Exactly is Recovery Needed?
Atomic transactions (consistent state transitions on the db)
• Log entries:
  • physical (before- and after-images)
  • logical (record ops)
  • physiological (page transitions)
  • timestamped by LSNs
[Figure: volatile memory (database cache, log buffer) above stable storage
(stable database, stable log), connected by fetch, flush, and force actions;
cached pages and log entries carry LSNs, e.g. 217 write(z,t1), 218 commit(t2),
219 write(q,t1), 220 begin(t3)]
How Does Recovery Work?
• Analysis pass: determine winners vs. losers (by scanning the stable log)
• Redo pass: redo winner writes (by applying the logged op)
• Undo pass: undo loser writes (by applying the inverse op)
The LSN in the page header implements a testable state.
[Figure: stable database and stable log after a crash; losers: {t1}, winners: {t2};
redo(216, write(q,t2)), undo(217, write(z,t1)), undo(215, write(b,t1))]
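To make the analysis pass concrete, here is a minimal C sketch of the winner/loser classification from the example above (t1 loser, t2 winner). The types and the hard-coded toy log are illustrative assumptions, not the deck's actual data structures.

/* Analysis pass sketch: one forward scan over the stable log;
   transactions with a commit record are winners, the rest are losers. */
#include <stdio.h>

enum Action { BEGIN, WRITE, COMMIT, ROLLBACK };
struct LogEntry { int lsn; int trans; enum Action type; };

int main(void) {
    struct LogEntry stable_log[] = {          /* toy log from the example */
        {215, 1, WRITE}, {216, 2, WRITE}, {217, 1, WRITE}, {218, 2, COMMIT}
    };
    int n = sizeof stable_log / sizeof stable_log[0];
    int seen[10] = {0}, committed[10] = {0};

    for (int i = 0; i < n; i++) {             /* single forward scan */
        seen[stable_log[i].trans] = 1;
        if (stable_log[i].type == COMMIT)
            committed[stable_log[i].trans] = 1;
    }
    for (int t = 0; t < 10; t++)
        if (seen[t])
            printf("t%d: %s\n", t, committed[t] ? "winner" : "loser");
    return 0;
}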
Heisenbugs and the Recovery Rationale
Transient software failures are the main problem: Heisenbugs
(non-repeatable exceptions caused by stochastic confluence of very rare events)
• Failure model for crash recovery:
  • fail-stop (no dynamic salvation/resurrection code)
  • soft failures (stable storage survives)
• Failure causes (MTTF):
  • power: 2,000 h, or ∞ with UPS
  • chips: 100,000 h
  • system software: 200 – 1,000 h
  • telecomm lines: 4,000 h
  • admin error: ???, or ∞ with auto-admin
  • disks: 800,000 h
  • environment: > 20,000 h
"Self-healing" recovery = amnesia + data repair (from "trusted" store) + re-init
Goal: Continuous Availability
Business apps and e-services demand 24 x 7 availability.
99.999 % availability would be acceptable (5 min outage/year).

Stationary availability = MTTF / (MTTF + MTTR)
[Figure: two-state availability model with states "up" and "down" and
transition rates 1/MTTF and 1/MTTR]

Downtime costs (per hour):
  Brokerage operations: $6.4 million
  Credit card authorization: $2.6 million
  Ebay: $225,000
  Amazon: $180,000
  Airline reservation: $89,000
  Cell phone activation: $41,000
(Source: Internet Week, 4/3/2000)

State of the art:
  DB servers: 99.99 to 99.999 %
  Internet e-services (e.g., Ebay): 99 %
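As a quick sanity check of the formula (with assumed numbers): with MTTF = 1,000 h and MTTR = 0.1 h (6 minutes), availability = 1000 / 1000.1 ≈ 99.99 %, i.e. about 53 minutes of downtime per year; reaching 99.999 % at the same MTTF requires cutting MTTR to roughly 36 seconds. This is why restart (recovery) time, not just failure frequency, dominates availability.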
Tradeoffs & Tensions: What Do We Need to Optimize for?
• Fast restart (for high availability) by bounding the log and minimizing page fetches
• Low overhead during normal operation by minimizing forced log writes (and page flushes)
• High transaction concurrency during normal operation
• Correctness (and simplicity) in the presence of many subtleties
State of the Art: Houdini's Tricks
• tricks, patents, and recovery gurus
• very little work on verification of (toy) algorithms:
  • C. Wallace / N. Soparkar / Y. Gurevich
  • D. Kuo / A. Fekete / N. Lynch
  • C. Martin / K. Ramamritham
Outline:
✓ What and Why?
• Correct and Efficient Redo
• Correct Undo
• Quantitative Guarantees
• Where and How?

Murphy's Law: Whatever can go wrong, will go wrong.
Data Structures for Logging & Recovery

type Page: record of
  PageNo: id;
  PageSeqNo: id;
  Status: (clean, dirty);
  Contents: array [PageSize] of char;
end;
persistent var StableDatabase: set[PageNo] of Page;
var DatabaseCache: set[PageNo] of Page;

type LogEntry: record of
  LogSeqNo: id;
  TransId: id;
  PageNo: id;
  ActionType: (write, full-write, begin, commit, rollback);
  UndoInfo: array of char;
  RedoInfo: array of char;
  PreviousSeqNo: id;
end;
persistent var StableLog: list[LogSeqNo] of LogEntry;
var LogBuffer: list[LogSeqNo] of LogEntry;

Modeled in a functional manner, with a test op on the state:
  write s on page p ∈ StableDatabase ⇔ StableDatabase[p].PageSeqNo ≥ s
  write s on page p ∈ CachedDatabase ⇔
    (p ∈ DatabaseCache ∧ DatabaseCache[p].PageSeqNo ≥ s) ∨
    (p ∉ DatabaseCache ∧ StableDatabase[p].PageSeqNo ≥ s)
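The record types above translate almost one-to-one into C. The following sketch spells out the testable-state predicates as functions; field sizes and anything beyond the slide's names are assumptions.

/* Minimal C rendering of the slide's data structures and of the
   "testable state" predicates. Sizes (4096, 256) are illustrative. */
#include <stdbool.h>

typedef struct {
    int  PageNo;
    int  PageSeqNo;                 /* LSN of the last applied write */
    bool dirty;
    char Contents[4096];
} Page;

typedef struct {
    int  LogSeqNo, TransId, PageNo, PreviousSeqNo;
    enum { WRITE, FULL_WRITE, BEGIN, COMMIT, ROLLBACK } ActionType;
    char UndoInfo[256], RedoInfo[256];
} LogEntry;

/* Is the write with sequence number s contained in the stable database? */
bool in_stable_db(const Page *stable_p, int s) {
    return stable_p->PageSeqNo >= s;
}

/* ... and in the cached database (the cached copy wins if present)? */
bool in_cached_db(const Page *cache_p,          /* NULL if not cached */
                  const Page *stable_p, int s) {
    return cache_p ? cache_p->PageSeqNo >= s : stable_p->PageSeqNo >= s;
}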
Correctness Criterion A crash recovery algorithm is correct if it guarantees that, after a system failure, the cached database will eventually be equivalent (i.e., reducible) to a serial order of the committed transactions that coincides with the serialization order of the history.
Simple Redo Pass

redo pass ( ):
  min := LogSeqNo of oldest log entry in StableLog;
  max := LogSeqNo of most recent log entry in StableLog;
  for i := min to max do
    if StableLog[i].TransId not in losers then
      pageno := StableLog[i].PageNo;
      fetch (pageno);
      case StableLog[i].ActionType of
        full-write:
          full-write (pageno) with contents from StableLog[i].RedoInfo;
        write:
          if DatabaseCache[pageno].PageSeqNo < i then
            read and write (pageno) according to StableLog[i].RedoInfo;
            DatabaseCache[pageno].PageSeqNo := i;
          end /*if*/;
      end /*case*/;
    end /*if*/;
  end /*for*/;
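The following compact C model (a toy stand-in, not the book's algorithm) shows the heart of the redo pass: the LSN test makes redo idempotent, so a repeated crash during restart is harmless.

/* Toy executable model of the simple redo pass: replay winner writes in
   LSN order, skipping a write when the page's PageSeqNo shows it is
   already applied. All data below is made up for illustration. */
#include <stdio.h>

struct Pg  { int lsn; char val; };
struct Rec { int lsn; int trans; int page; char redo; };

int main(void) {
    struct Pg  db[3]  = {{215, 'b'}, {88, 'q'}, {158, 'z'}};  /* stable pages */
    struct Rec log[3] = {{215, 1, 0, 'B'}, {216, 2, 1, 'Q'}, {217, 1, 2, 'Z'}};
    int loser[3] = {0, 1, 0};                  /* t1 is a loser, t2 a winner */

    for (int i = 0; i < 3; i++) {
        if (loser[log[i].trans]) continue;     /* redo winners only */
        struct Pg *p = &db[log[i].page];
        if (p->lsn < log[i].lsn) {             /* LSN test: not yet applied */
            p->val = log[i].redo;              /* apply the logged write */
            p->lsn = log[i].lsn;               /* keep the state testable */
        }
    }
    for (int j = 0; j < 3; j++)
        printf("page %d: lsn=%d val=%c\n", j, db[j].lsn, db[j].val);
    return 0;
}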
Correctness of Simple Redo & Undo
• Invariants during the redo pass (compatible with serialization)
• Invariant of the undo pass (based on serializability / reducibility of the history)
[Formal invariants shown as formulas on the original slide]
Redo Optimization 1: Log Truncation & Checkpoints

Continuously advance the log begin (garbage collection):
• for redo, all entries for page p that precede the last flush action for p
  can be dropped; the LSN of that flush is RedoLSN(p)
• SystemRedoLSN := min {RedoLSN(p) | dirty page p} (see the sketch below)

Track dirty cache pages and periodically write a DirtyPageTable (DPT)
into a checkpoint log entry:

type DPTEntry: record of
  PageNo: id;
  RedoSeqNo: id;
end;
var DirtyPages: set[PageNo] of DPTEntry;

+ add potentially dirty pages during the analysis pass
+ flush-behind demon flushes dirty pages in the background
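A minimal sketch of the truncation computation: SystemRedoLSN is the minimum RedoLSN over the DPT, and everything before it can be garbage-collected. The DPT values are the hypothetical ones used in the example that follows.

/* Compute SystemRedoLSN from a (toy) dirty page table. */
#include <stdio.h>

struct DPTEntry { int PageNo; int RedoSeqNo; };

int main(void) {
    struct DPTEntry dpt[] = {{1, 20}, {2, 30}, {3, 70}};  /* b, c, d */
    int n = sizeof dpt / sizeof dpt[0];
    int system_redo_lsn = dpt[0].RedoSeqNo;
    for (int i = 1; i < n; i++)                  /* min over dirty pages */
        if (dpt[i].RedoSeqNo < system_redo_lsn)
            system_redo_lsn = dpt[i].RedoSeqNo;
    printf("SystemRedoLSN = %d (log before this LSN can be truncated)\n",
           system_redo_lsn);
    return 0;
}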
Redo Pass with CP and DPT

redo pass ( ):
  cp := MasterRecord.LastCP;
  SystemRedoLSN := min {cp.DirtyPages[p].RedoSeqNo | p in cp.DirtyPages};
  max := LogSeqNo of most recent log entry in StableLog;
  for i := SystemRedoLSN to max do
    if StableLog[i].ActionType in {write, full-write}
       and StableLog[i].TransId not in losers then
      pageno := StableLog[i].PageNo;
      if pageno in DirtyPages and i >= DirtyPages[pageno].RedoSeqNo then
        fetch (pageno);
        if DatabaseCache[pageno].PageSeqNo < i then
          read and write (pageno) according to StableLog[i].RedoInfo;
          DatabaseCache[pageno].PageSeqNo := i;
        else
          DirtyPages[pageno].RedoSeqNo := DatabaseCache[pageno].PageSeqNo + 1;
        end;
      end;
    end;
  end;
Example for Redo with CP and DPT
[Figure: transactions t1–t5 with writes 10:w(a) ... 80:w(b), flush(d), flush(a),
flush(b), and a checkpoint before the first crash; at the first restart, the
analysis pass rebuilds the DPT (b: RedoLSN=20, c: RedoLSN=30, d: RedoLSN=70),
so the redo pass can skip log entries before these RedoLSNs]
Benefits of Redo Optimization
• can save page fetches
• can plan a prefetching schedule for dirty pages (order and timing of page fetches)
  for minimization of disk-arm seek time and rotational latency
• global optimization: TSP based on t_seek + t_rot
Redo Optimization 2: Flush Log Entries

During normal operation: log flush actions (without forcing the log buffer).
During the analysis pass of restart: construct a more recent DPT.

analysis pass ( ) returns losers, DirtyPages:
  ...
  if StableLog[i].ActionType in {write, full-write}
     and StableLog[i].PageNo not in DirtyPages then
    DirtyPages += StableLog[i].PageNo;
    DirtyPages[StableLog[i].PageNo].RedoSeqNo := i;
  end;
  if StableLog[i].ActionType = flush then
    DirtyPages -= StableLog[i].PageNo;
  end;
  ...

Advantages:
• can save many page fetches
• allows better prefetch scheduling
Example for Redo with Flush Log Entries
[Figure: the same schedule as before, but with flush(d), flush(a), flush(b)
recorded as log entries; the analysis pass derives the DPT
(b: RedoLSN=20, c: RedoLSN=30, d: RedoLSN=70) from the log alone, and
the redo pass skips writes that are known to be flushed]
Redo Optimization 3: Full-write Log Entries

During normal operation: "occasionally" log the full after-image of a page p
(absorbs all previous writes to that page).
During the analysis pass of restart: construct an enhanced DPT with
DirtyPages[p].RedoSeqNo := LogSeqNo of the full-write.
During the redo pass of restart: all log entries of a page that precede its
most recent full-write log entry can be ignored.

Advantages:
• can skip many log entries
• can plan a better schedule for page fetches
• allows better log truncation
Correctness of Optimized Redo
• Invariant during the optimized redo pass: builds on correct DPT construction
  during the analysis pass
[Formal invariant shown as a formula on the original slide]
Why is Page-based Redo Logging Good Anyway?
• testable state with very low overhead
• much faster than logical redo:
  logical redo would require replaying high-level operations
  with many page fetches (random I/O)
• can be applied to selected pages independently:
  • can exploit perfectly scalable parallelism for redo of a large multi-disk db
  • can efficiently reconstruct corrupted pages when disk tracks have errors
    (without full media recovery for the entire db)
Outline:
✓ What and Why?
✓ Correct and Efficient Redo
• Correct Undo
• Quantitative Guarantees
• Where and How?

Corollary to Murphy's Law: Murphy was an optimist.
Winners Follow Losers
[Figure: schedule of t1–t5 on pages a and b (10:w(a) ... 60:w(a)), with t2's
rollback, a crash after 60:w(a), a completed restart, and a second crash]
Occurs because of:
• rollbacks during normal operation
• repeated crashes
• soft crashes between backup and media failure
• concurrency control with commutative ops on the same page
Winners Follow Losers (cont.)
[Figure: the same schedule, annotated with the LSNs in page a's header over
time (2, 10, 30, 49, 60, ...); restart involves redo(10: w(a), t1),
redo(30: w(a), t3), undo(50: w(a), t4), redo(60: w(a), t5)]
Problems:
• old losers prevent log truncation
• LSN-based testable state becomes infeasible
Treat Completed Losers as Winners and Redo History
[Figure: the same schedule; rolling back t2 logs 25: w^-1(a), and rolling back
t4 logs 55: w^-1(a) and 57: w^-1(b)]
Solution:
• undo generates compensation log entries (CLEs)
• undo steps are redone during restart
CLE Backward Chaining for Bounded Log Growth (& More Goodies)
Each CLE points to the predecessor of the write it inverts.
Log of t4: begin t4, 40: w(b), 50: w(a), CLE 55: w^-1(a), CLE 57: w^-1(b), commit t4
(CLE 55 points back to 40; CLE 57 points back to begin t4)
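A small C sketch of t4's chain above: after a crash, restart resumes undo at the most recent entry's next-undo pointer and never re-compensates a write. For brevity the struct conflates PreviousSeqNo (for writes) and NextUndoSeqNo (for CLEs) into one field, which is an illustrative simplification.

/* Walk the backward undo chain of transaction t4. */
#include <stdio.h>

struct Entry { int lsn; const char *what; int next_undo; /* -1 = none */ };

int main(void) {
    struct Entry log[] = {
        {30, "begin t4",    -1},
        {40, "w(b)",        30},   /* a write's next-undo = its PreviousSeqNo */
        {50, "w(a)",        40},
        {55, "CLE w^-1(a)", 40},   /* predecessor of the inverted write 50 */
        {57, "CLE w^-1(b)", 30},   /* predecessor of the inverted write 40 */
    };
    /* Suppose the crash hit right after LSN 55 reached the stable log:
       undo resumes at 55's pointer (40), skipping the compensated 50. */
    int cur = 55;
    while (cur != -1) {
        int i = 0;
        while (log[i].lsn != cur) i++;
        printf("%d: %s (next undo: %d)\n", cur, log[i].what, log[i].next_undo);
        cur = log[i].next_undo;
    }
    return 0;
}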
Undo Algorithm

undo pass ( ):
  ...
  while exists t in losers with losers[t].LastSeqNo <> nil do
    nexttrans := TransNo in losers such that
      losers[nexttrans].LastSeqNo = max {losers[x].LastSeqNo | x in losers};
    nextentry := losers[nexttrans].LastSeqNo;
    case StableLog[nextentry].ActionType of
      compensation:
        losers[nexttrans].LastSeqNo := StableLog[nextentry].NextUndoSeqNo;
      write:
        ...
        newentry.LogSeqNo := new sequence number;
        newentry.ActionType := compensation;
        newentry.PreviousSeqNo := ActiveTrans[transid].LastSeqNo;
        newentry.NextUndoSeqNo := StableLog[nextentry].PreviousSeqNo;
        ActiveTrans[transid].LastSeqNo := newentry.LogSeqNo;
        LogBuffer += newentry;
        ...
    end /*case*/;
  end /*while*/;
Correctness of Undo Pass
• Invariant of the undo pass (assuming tree reducibility of the history)
[Formal invariant shown as a formula on the original slide]
Outline:
✓ What and Why?
✓ Correct and Efficient Redo
✓ Correct Undo
• Quantitative Guarantees
• Where and How?

There are three kinds of lies: lies, damned lies, and statistics. (Benjamin Disraeli)
97.3 % of all statistics are made up. (Anonymous)
Towards Quantitative Guarantees

Online control for a given system configuration (a control-loop sketch follows below):
• observe the length of the stable log and the dirty-page fraction
• estimate redo time based on these observations
• take countermeasures if the redo time becomes too long
  (log truncation, flush-behind demon priority)

Planning of a system configuration:
• (stochastically) predict the (worst-case) redo time as a function of
  workload and system characteristics
• configure the system (cache size, caching policy, #disks,
  flush-behind demon, etc.) to guarantee an upper bound
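A minimal sketch of the online control idea. The linear cost model (log scan plus one random I/O per dirty page) and all constants are assumptions; a real controller would calibrate them from measurements.

/* Estimate restart redo time from observed log length and dirty pages,
   and trigger countermeasures when the estimate exceeds a bound. */
#include <stdio.h>

int main(void) {
    double log_entries      = 5.0e6;   /* observed stable-log length      */
    double dirty_pages      = 2.0e5;   /* observed dirty cache pages      */
    double scan_per_entry_s = 2.0e-7;  /* assumed log-scan cost per entry */
    double fetch_per_page_s = 5.0e-3;  /* assumed random I/O per page     */
    double bound_s          = 300.0;   /* availability goal: 5 min redo   */

    double redo_est = log_entries * scan_per_entry_s
                    + dirty_pages * fetch_per_page_s;
    printf("estimated redo time: %.1f s\n", redo_est);
    if (redo_est > bound_s)
        printf("-> truncate log / raise flush-behind demon priority\n");
    return 0;
}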
Key Factors for Bounded Redo Time
1) # dirty pages → # db I/Os for redo
2) longest dirty lifetime → log scan time
3) # log entries for dirty pages → # redo steps
[Figure: timeline for pages 1–4 between checkpoint and crash, marking for each
page whether it is not in cache, in cache & clean, or in cache & dirty, and
where write accesses occur]
Analysis of DB Page Fetches During Redo

N pages with:
  reference probabilities beta_1 ... beta_N (beta_i >= beta_i+1)
  cache residence probabilities c_1 ... c_N
  write probabilities omega_1 ... omega_N
  page reference arrival rate lambda
  flush-behind demon I/O rate per disk
  P[flush within time 1/lambda]: f_1 ... f_N

Per-page Markov model with states 1:out, 2:clean, 3:dirty and transitions:
  p_11 = 1 - c_i    p_12 = c_i
  p_21 = 1 - c_i    p_22 = c_i (1 - omega_i)    p_23 = c_i omega_i
  p_31 = 1 - c_i    p_32 = c_i f_i              p_33 = c_i (1 - f_i)

Solve the stationary equations pi_k = sum_j pi_j p_jk, k, j in {1,2,3};
then d_i := P[page i is dirty] = pi_3, with pages permuted such that d_i >= d_i+1.
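To see the model in action, the sketch below computes d_i = pi_3 by power iteration; the transition matrix follows the slide's structure, but the numeric values of c_i, omega_i, and f_i are made up for illustration.

/* Stationary distribution of the per-page 3-state chain (out/clean/dirty). */
#include <stdio.h>

int main(void) {
    double c = 0.8, w = 0.3, f = 0.5;          /* assumed c_i, omega_i, f_i */
    double P[3][3] = {
        {1 - c, c,             0          },   /* out   -> out/clean        */
        {1 - c, c * (1 - w),   c * w      },   /* clean -> out/clean/dirty  */
        {1 - c, c * f,         c * (1 - f)},   /* dirty -> out/clean/dirty  */
    };
    double pi[3] = {1.0, 0.0, 0.0}, nxt[3];

    for (int it = 0; it < 1000; it++) {        /* pi := pi * P until fixed */
        for (int j = 0; j < 3; j++) {
            nxt[j] = 0.0;
            for (int k = 0; k < 3; k++) nxt[j] += pi[k] * P[k][j];
        }
        for (int j = 0; j < 3; j++) pi[j] = nxt[j];
    }
    printf("d_i = P[page dirty] = %.4f\n", pi[2]);
    return 0;
}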
Subproblem: Analysis of Cache Residence

With LRU-K cache replacement (e.g., K = 2) and cache size M:
[Figure: reference string ... 3 2 3 1 2 2 3 3 2 1 along the time axis, with the
backward K-distance b_K measured from "now"]
N pages with:
  reference probabilities beta_1 ... beta_N (beta_i >= beta_i+1)
  cache residence probabilities c_1 ... c_N
  write probabilities omega_1 ... omega_N
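For intuition, here is a toy C sketch of the LRU-2 bookkeeping on the reference string above: the eviction victim is the page whose second-most-recent reference is oldest (largest backward K-distance). This is a simplification of the full LRU-K algorithm, not the cache-residence analysis itself.

/* LRU-2 bookkeeping over a toy reference string. */
#include <stdio.h>

#define NPAGES 4                       /* pages 1..3 used below */
int last[NPAGES], prev[NPAGES];        /* last and second-to-last ref times */

int main(void) {
    int trace[] = {3, 2, 3, 1, 2, 2, 3, 3, 2, 1};   /* reference string */
    int n = sizeof trace / sizeof trace[0];

    for (int t = 0; t < n; t++) {      /* times 1..n; 0 means "never" */
        int p = trace[t];
        prev[p] = last[p];
        last[p] = t + 1;
    }
    int v = 1;                         /* victim: oldest penultimate ref */
    for (int p = 2; p < NPAGES; p++)
        if (prev[p] < prev[v]) v = p;
    printf("LRU-2 would evict page %d (2nd-last ref at time %d)\n", v, prev[v]);
    return 0;
}

Note that page 1 is evicted although it was referenced most recently: only its second-to-last reference counts, which is the point of LRU-K.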
Subproblem: Analysis of Flush Probability

M/G/1 vacation-server model for each disk:
request arrivals → queue → service time S; response time R = wait time + service time.
Exhaustive-service vacations: whenever the disk queue is empty,
invoke the flush-behind demon for a time period T (vacation time).

Wait and response times via Laplace transforms (see Takagi 1991),
with utilization rho = lambda E[S];
choose the maximum T such that P[R > t] <= epsilon (Chernoff bound),
e.g., t = 10 ms, epsilon = 0.05.
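As a simpler stand-in for the Laplace-transform/Chernoff analysis, the sketch below uses the standard mean-value decomposition for M/G/1 with multiple vacations, E[W] = lambda E[S^2] / (2(1 - rho)) + E[V^2] / (2 E[V]); with deterministic vacations of length T the second term is T/2. It picks the largest T whose mean response time stays under a target; all rates are assumed numbers, and a mean bound is of course weaker than the slide's tail bound.

/* Pick the longest flush-behind vacation T that keeps E[R] under target. */
#include <stdio.h>

int main(void) {
    double lambda = 50.0;               /* disk requests per second (assumed) */
    double ES = 0.008, ES2 = 1e-4;      /* service time moments E[S], E[S^2]  */
    double target = 0.030;              /* mean response-time goal: 30 ms     */
    double rho = lambda * ES;           /* utilization, must be < 1           */
    double w_mg1 = lambda * ES2 / (2.0 * (1.0 - rho));  /* plain M/G/1 wait   */

    for (int ms = 50; ms >= 0; ms--) {  /* try the longest vacation first */
        double T = ms / 1000.0;
        double ER = ES + w_mg1 + T / 2.0;   /* deterministic-vacation term */
        if (ER <= target) {
            printf("max vacation T = %d ms (E[R] = %.1f ms, rho = %.2f)\n",
                   ms, ER * 1000.0, rho);
            break;
        }
    }
    return 0;
}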
So What?
Predictability: workload + configuration → performance!
This, e.g., allows us to auto-tune the system: workload + performance goals → configuration!
Outline:
✓ What and Why?
✓ Correct and Efficient Redo
✓ Correct Undo
✓ Quantitative Guarantees
• Where and How?

More than any time in history mankind faces a crossroads. One path leads to despair
and utter hopelessness, the other to total extinction. Let us pray that we have the
wisdom to choose correctly. (Woody Allen)
Where Can We Contribute?
• bigger picture of dependability and quality guarantees
• turn the sketch of correctness reasoning into verified recovery code
• turn the sketch of restart-time analysis into a quality guarantee
• address all other aspects of recovery
L1LogRecMaster *InitL1LogIOMgr (Segment *logsegment, int LogSize,
                                int LogPageSize, int newlog)
{
  L1LogRecMaster *recmgr = NULL;
  BitVector p = 1;
  int i;

  NEW (recmgr, 1);
  recmgr->Log = logsegment;
  A_LATCH_INIT (&(recmgr->BufferLatch));
  A_LATCH_INIT (&(recmgr->UsedListLatch));
  A_LATCH_INIT (&(recmgr->FixCountLatch));
  recmgr->FixCount = 0;
  recmgr->StoreCounter = 0;
  recmgr->PendCount = 0;
  A_SEM_INIT (&(recmgr->PendSem), 0);
  NEW (recmgr->Frame, BUFFSIZE);
  recmgr->Frame[0] = (LogPageCB *) MALLOC (LogPageSize * BUFFSIZE);
  for (i = 1; i < BUFFSIZE; i++)
    recmgr->Frame[i] = (LogPageCB *)(((char *) recmgr->Frame[i - 1]) + LogPageSize);
  recmgr->InfoPage = (InfoPageCB *) MALLOC (LogPageSize);
  recmgr->UsedList = InitHash (HASHSIZE, HashPageNo, NULL, NULL, ComparePageNo, 4);
  recmgr->ErasedFragList = InitHash (HASHSIZE, HashPageNo, NULL, NULL, ComparePageNo, 4);
  for (i = 0; i < sizeof (BitVector) * 8; i++) {
    recmgr->Mask[i] = p;
    p = p * 2;
  }
  if (newlog) {
    recmgr->InfoPage->LogSize = LogSize;
    recmgr->InfoPage->PageSize = LogPageSize;
    recmgr->InfoPage->Key = 0;
    recmgr->InfoPage->ActLogPos = 1;
    recmgr->InfoPage->ActBufNr = 0;
    recmgr->InfoPage->ActSlotNr = 0;
    InitLogFile (recmgr, logsegment);
    LogIOWriteInfoPage (recmgr);
  } else
    ReadInfoPage (recmgr);
  return (recmgr);
} /* end InitL1LogIOMgr */
static int
log_fill(dblp, lsn, addr, len)
	DB_LOG *dblp;
	DB_LSN *lsn;
	void *addr;
	u_int32_t len;
{
	...
	while (len > 0) {
		/* Copy out the data. */
		if (lp->b_off == 0)
			lp->f_lsn = *lsn;
		if (lp->b_off == 0 && len >= bsize) {
			nrec = len / bsize;
			if ((ret = __log_write(dblp, addr, nrec * bsize)) != 0)
				return (ret);
			addr = (u_int8_t *)addr + nrec * bsize;
			len -= nrec * bsize;
			++lp->stat.st_wcount_fill;
			continue;
		}
		/* Figure out how many bytes we can copy this time. */
		remain = bsize - lp->b_off;
		nw = remain > len ? len : remain;
		memcpy(dblp->bufp + lp->b_off, addr, nw);
		addr = (u_int8_t *)addr + nw;
		len -= nw;
		lp->b_off += nw;
		/* If we fill the buffer, flush it. */
		if (lp->b_off == bsize) {
			if ((ret = __log_write(dblp, dblp->bufp, bsize)) != 0)
				return (ret);
			lp->b_off = 0;
			++lp->stat.st_wcount_fill;
		}
	}
	return (0);
}
What Else is in the Jungle?
• more optimizations for logging and recovery of specific data structures
  (e.g., B+ tree indexes)
• incremental restart (early admission of new transactions while redo or undo
  is still in progress)
• failover recovery for clusters with shared disks and distributed memory
• media recovery, remote replication, and configuration for availability goals
  (M. Gillmann)
• process and message recovery for failure masking to applications and users
  (G. Shegalov, collaboration with MSR)
Where Can We Contribute?
• bigger picture of dependability and quality guarantees
• turn the sketch of correctness reasoning into verified recovery code:
  sw engineering → sw engineering for verification
• turn the sketch of restart-time analysis into a quality guarantee:
  sys architecture → sys architecture for predictability
• address all other aspects of recovery