
Database Recovery – Heisenbugs, Houdini’s Tricks, and the Paranoia of Science

(Rationale, State of the Art, and Research Directions)
Gerhard Weikum, weikum@cs.uni-sb.de


Presentation Transcript


1. Database Recovery – Heisenbugs, Houdini’s Tricks, and the Paranoia of Science
(Rationale, State of the Art, and Research Directions)
Gerhard Weikum, weikum@cs.uni-sb.de, http://www-dbs.cs.uni-sb.de

Outline:
• What and Why?
• Correct and Efficient Redo
• Correct Undo
• Quantitative Guarantees
• Where and How?

2. Why This Topic?
Recovery: repair data after software failures
• exotic and underrated
• fits well with the theme of the graduate studies program
• critical for dependable Internet-based e-services
• I wrote a textbook on the topic
• Paranoia of science: hope (and strive) for the best, but prepare for the worst

“All hardware and software systems must be verified.” (Wolfgang Paul, U Saarland)
“Software testing (and verification) doesn’t scale.” (James Hamilton, MS and ex-IBM)
“If you can’t make a problem go away, make it invisible.” (David Patterson, UC Berkeley) → Recovery-Oriented Computing

3. Why Exactly is Recovery Needed?
• Atomic transactions (consistent state transitions on the db)
• Log entries:
  • physical (before- and after-images)
  • logical (record ops)
  • physiological (page transitions)
  • timestamped by LSNs
[Figure: volatile memory (database cache, log buffer) above stable storage (stable database, stable log), connected by fetch, flush, and force arrows; pages and log entries carry LSNs, e.g. page q with LSN 215 in the cache vs. 88 in the stable database, and log entries 215 write(b,t1), 216 write(q,t2), 217 write(z,t1), 218 commit(t2), 219 write(q,t1), 220 begin(t3).]

4. How Does Recovery Work?
• Analysis pass: determine winners vs. losers (by scanning the stable log)
• Redo pass: redo winner writes (by applying the logged op)
• Undo pass: undo loser writes (by applying the inverse op)
The LSN in the page header implements a testable state.
[Figure: stable log with entries 215 write(b,t1), 216 write(q,t2), 217 write(z,t1), 218 commit(t2); losers: {t1}, winners: {t2}; hence redo(216, write(q,t2)), undo(217, write(z,t1)), undo(215, write(b,t1)).]
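The LSN test that makes redo idempotent can be sketched in a few lines of C (a minimal illustration; the page layout and all field names are assumptions, not from the slides):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t LSN;

/* Hypothetical page layout: the header carries the LSN of the last
 * applied write, as on the slide ("LSN in page header"). */
typedef struct {
    LSN  page_seq_no;
    char contents[8192];
} Page;

/* The page state is testable: comparing the page LSN with the log
 * entry's LSN tells whether the logged write is already reflected. */
static bool needs_redo(const Page *p, LSN entry_lsn) {
    return p->page_seq_no < entry_lsn;
}

/* Applying a logged write at most once makes redo idempotent:
 * repeating the redo pass after another crash is harmless. */
static void redo_write(Page *p, LSN entry_lsn,
                       const char *after_image, size_t n) {
    if (needs_redo(p, entry_lsn)) {
        memcpy(p->contents, after_image, n);
        p->page_seq_no = entry_lsn;   /* stamp the page */
    }
}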

5. Heisenbugs and the Recovery Rationale
Transient software failures are the main problem: Heisenbugs (non-repeatable exceptions caused by the stochastic confluence of very rare events)
• Failure causes (MTTF):
  • power: 2 000 h, or practically unlimited with UPS
  • chips: 100 000 h
  • system software: 200 – 1 000 h
  • telecomm lines: 4 000 h
  • admin error: ??? (reduced by auto-admin)
  • disks: 800 000 h
  • environment: > 20 000 h
• Failure model for crash recovery:
  • fail-stop (no dynamic salvation/resurrection code)
  • soft failures (stable storage survives)
“Self-healing” recovery = amnesia + data repair (from “trusted” store) + re-init

6. Goal: Continuous Availability
Business apps and e-services demand 24 x 7 availability;
99.999 % availability would be acceptable (5 min outage/year).

Stationary availability = MTTF / (MTTF + MTTR)
[Figure: two-state up/down model with failure rate 1/MTTF and repair rate 1/MTTR.]

Downtime costs (per hour):
• Brokerage operations: $6.4 million
• Credit card authorization: $2.6 million
• Ebay: $225,000
• Amazon: $180,000
• Airline reservation: $89,000
• Cell phone activation: $41,000
(Source: Internet Week, 4/3/2000)

State of the art:
• DB servers: ≈ 99.99 to 99.999 %
• Internet e-services (e.g., Ebay): ≈ 99 %
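As a quick plausibility check of the formula, a few lines of C relate MTTF/MTTR to downtime per year (the numbers are illustrative assumptions, not from the slide):

#include <stdio.h>

/* Stationary availability = MTTF / (MTTF + MTTR). */
int main(void) {
    double mttf_h = 1000.0;   /* mean time to failure, hours */
    double mttr_h = 0.05;     /* mean time to repair, hours (3 min restart) */
    double avail  = mttf_h / (mttf_h + mttr_h);
    double downtime_min_per_year = (1.0 - avail) * 365.0 * 24.0 * 60.0;
    printf("availability = %.5f %%\n", avail * 100.0);
    printf("expected downtime = %.1f min/year\n", downtime_min_per_year);
    return 0;
}

With MTTF = 1 000 h, even a 3-minute mean repair time only reaches about 99.995 %, i.e., roughly 26 min of downtime per year; pushing toward 99.999 % is exactly why minimizing restart (redo) time matters.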

7. Tradeoffs & Tensions: What Do We Need to Optimize for?
• Fast restart (for high availability) by bounding the log and minimizing page fetches
• Low overhead during normal operation by minimizing forced log writes (and page flushes)
• High transaction concurrency during normal operation
• Correctness (and simplicity) in the presence of many subtleties

8. State of the Art: Houdini’s Tricks
• tricks, patents, and recovery gurus
• very little work on verification of (toy) algorithms:
  • C. Wallace / N. Soparkar / Y. Gurevich
  • D. Kuo / A. Fekete / N. Lynch
  • C. Martin / K. Ramamritham

9. Outline
✓ What and Why?
• Correct and Efficient Redo
• Correct Undo
• Quantitative Guarantees
• Where and How?
Murphy’s Law: Whatever can go wrong, will go wrong.

10. Data Structures for Logging & Recovery

type Page: record of
  PageNo: id;
  PageSeqNo: id;
  Status: (clean, dirty);
  Contents: array [PageSize] of char;
end;
persistent var StableDatabase: set[PageNo] of Page;
var DatabaseCache: set[PageNo] of Page;

type LogEntry: record of
  LogSeqNo: id;
  TransId: id;
  PageNo: id;
  ActionType: (write, full-write, begin, commit, rollback);
  UndoInfo: array of char;
  RedoInfo: array of char;
  PreviousSeqNo: id;
end;
persistent var StableLog: list[LogSeqNo] of LogEntry;
var LogBuffer: list[LogSeqNo] of LogEntry;

Modeled in a functional manner with a test on the state:
write s on page p ∈ StableDatabase ⇔ StableDatabase[p].PageSeqNo ≥ s
write s on page p ∈ CachedDatabase ⇔
  ( (p ∈ DatabaseCache ∧ DatabaseCache[p].PageSeqNo ≥ s)
  ∨ (p ∉ DatabaseCache ∧ StableDatabase[p].PageSeqNo ≥ s) )
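The two test predicates at the bottom of the slide translate almost literally into C; a minimal sketch with an array-based toy "database" (all names are illustrative):

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t LSN;

#define NPAGES 16

typedef struct {
    bool present;       /* page exists in this store */
    LSN  page_seq_no;   /* LSN of the last write applied to this copy */
} PageHeader;

static PageHeader stable_db[NPAGES];   /* StableDatabase */
static PageHeader cache[NPAGES];       /* DatabaseCache  */

/* write s on page p is in StableDatabase
 * iff StableDatabase[p].PageSeqNo >= s */
static bool write_in_stable_db(int p, LSN s) {
    return stable_db[p].present && stable_db[p].page_seq_no >= s;
}

/* CachedDatabase = cache where present, stable database otherwise --
 * the disjunction on the slide. */
static bool write_in_cached_db(int p, LSN s) {
    if (cache[p].present)
        return cache[p].page_seq_no >= s;
    return write_in_stable_db(p, s);
}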

11. Correctness Criterion
A crash recovery algorithm is correct if it guarantees that, after a system failure, the cached database will eventually be equivalent (i.e., reducible) to a serial order of the committed transactions that coincides with the serialization order of the history.

12. Simple Redo Pass

redo pass ( ):
  min := LogSeqNo of oldest log entry in StableLog;
  max := LogSeqNo of most recent log entry in StableLog;
  for i := min to max do
    if StableLog[i].TransId not in losers then
      pageno := StableLog[i].PageNo;
      fetch (pageno);
      case StableLog[i].ActionType of
        full-write:
          full-write (pageno) with contents from StableLog[i].RedoInfo;
          DatabaseCache[pageno].PageSeqNo := i;
        write:
          if DatabaseCache[pageno].PageSeqNo < i then
            read and write (pageno) according to StableLog[i].RedoInfo;
            DatabaseCache[pageno].PageSeqNo := i;
          end /*if*/;
      end /*case*/;
    end /*if*/;
  end /*for*/;
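The same pass, condensed into C over an in-memory log array (a sketch under simplifying assumptions: after-images are full page images, and the analysis pass has already produced the loser set):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t LSN;

enum Action { A_WRITE, A_FULL_WRITE, A_BEGIN, A_COMMIT, A_ROLLBACK };

typedef struct {
    LSN  lsn;
    int  trans_id;
    int  page_no;
    enum Action action;
    char redo_image[64];   /* simplified: always a full after-image */
} LogEntry;

typedef struct {
    LSN  page_seq_no;
    char contents[64];
} Page;

/* Scan the stable log in LSN order; re-apply every winner write whose
 * effect is not yet in the page (the LSN test keeps this idempotent). */
static void redo_pass(const LogEntry *log, int n,
                      Page *pages, const bool *is_loser) {
    for (int i = 0; i < n; i++) {
        const LogEntry *e = &log[i];
        if (e->action != A_WRITE && e->action != A_FULL_WRITE)
            continue;
        if (is_loser[e->trans_id])
            continue;
        Page *p = &pages[e->page_no];              /* "fetch" */
        if (e->action == A_FULL_WRITE || p->page_seq_no < e->lsn) {
            memcpy(p->contents, e->redo_image, sizeof p->contents);
            p->page_seq_no = e->lsn;
        }
    }
}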

13. Correctness of Simple Redo & Undo
• Invariants during the redo pass (compatible with serialization): [formula lost in transcript]
• Invariant of the undo pass (based on serializability / reducibility of the history): [formula lost in transcript]

14. Redo Optimization 1: Log Truncation & Checkpoints
Continuously advance the log begin (garbage collection):
• for redo, all entries for page p that precede the last flush action for p can be dropped; call that position RedoLSN(p)
• SystemRedoLSN := min {RedoLSN(p) | dirty page p}
Track dirty cache pages and periodically write a DirtyPageTable (DPT) into a checkpoint log entry:

type DPTEntry: record of
  PageNo: id;
  RedoSeqNo: id;
end;
var DirtyPages: set[PageNo] of DPTEntry;

+ add potentially dirty pages during the analysis pass
+ flush-behind demon flushes dirty pages in the background
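The truncation point is just a minimum over the DPT; a minimal C sketch (names assumed):

#include <stdint.h>

typedef uint64_t LSN;

typedef struct {
    int page_no;
    LSN redo_seq_no;   /* LSN of the first write since the last flush */
} DPTEntry;

/* SystemRedoLSN = min { RedoLSN(p) | dirty page p }: everything before
 * it is redo-irrelevant, so the log can be truncated there. */
static LSN system_redo_lsn(const DPTEntry *dpt, int n_dirty, LSN log_end) {
    LSN min = log_end;        /* no dirty pages: redo starts at log end */
    for (int i = 0; i < n_dirty; i++)
        if (dpt[i].redo_seq_no < min)
            min = dpt[i].redo_seq_no;
    return min;
}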

15. Redo Pass with CP and DPT

redo pass ( ):
  cp := MasterRecord.LastCP;
  SystemRedoLSN := min {cp.DirtyPages[p].RedoSeqNo};
  max := LogSeqNo of most recent log entry in StableLog;
  for i := SystemRedoLSN to max do
    if StableLog[i].ActionType = write or full-write
        and StableLog[i].TransId not in losers then
      pageno := StableLog[i].PageNo;
      if pageno in DirtyPages and i >= DirtyPages[pageno].RedoSeqNo then
        fetch (pageno);
        if DatabaseCache[pageno].PageSeqNo < i then
          read and write (pageno) according to StableLog[i].RedoInfo;
          DatabaseCache[pageno].PageSeqNo := i;
        else
          DirtyPages[pageno].RedoSeqNo := DatabaseCache[pageno].PageSeqNo + 1;
        end;
      end;
    end;
  end;

16. Example for Redo with CP and DPT
[Figure: timeline with transactions t1 … t5 issuing writes 10:w(a), 20:w(b), 30:w(c), 40:w(d), 50:w(d), 60:w(a), 70:w(d), 80:w(b), interleaved with flush(d), flush(a), flush(b) and a checkpoint; after the 1st crash, the analysis pass builds the DPT (b: RedoLSN=20, c: RedoLSN=30, d: RedoLSN=70), so redo is skipped for log entries whose effects are already in the stable database.]

17. Benefits of Redo Optimization
• can save page fetches
• can plan a prefetching schedule for dirty pages (order and timing of page fetches) for minimization of disk-arm seek time and rotational latency:
  • seek optimization, rotational optimization
  • global optimization: TSP based on t_seek + t_rot

18. Redo Optimization 2: Flush Log Entries
During normal operation: log flush actions (without forcing the log buffer).
During the analysis pass of restart: construct a more recent DPT.

analysis pass ( ) returns losers, DirtyPages:
  ...
  if StableLog[i].ActionType = write or full-write
      and StableLog[i].PageNo not in DirtyPages then
    DirtyPages += StableLog[i].PageNo;
    DirtyPages[StableLog[i].PageNo].RedoSeqNo := i;
  end;
  if StableLog[i].ActionType = flush then
    DirtyPages -= StableLog[i].PageNo;
  end;
  ...

Advantages:
• can save many page fetches
• allows better prefetch scheduling
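The DPT bookkeeping of this analysis pass fits in a few lines of C (a sketch; a zero RedoSeqNo encodes "not in the DPT"):

#include <stdint.h>

typedef uint64_t LSN;

enum Action { A_WRITE, A_FULL_WRITE, A_FLUSH, A_BEGIN, A_COMMIT, A_ROLLBACK };

typedef struct { enum Action action; int page_no; LSN lsn; } LogEntry;

#define NPAGES 1024
static LSN redo_seq_no[NPAGES];   /* 0 = page not in the DPT */

/* A write to a page not yet in the DPT inserts it with RedoSeqNo = i;
 * a flush entry removes it, since the stable copy is then current. */
static void analysis_dpt(const LogEntry *log, int n) {
    for (int i = 0; i < n; i++) {
        const LogEntry *e = &log[i];
        if (e->action == A_WRITE || e->action == A_FULL_WRITE) {
            if (redo_seq_no[e->page_no] == 0)
                redo_seq_no[e->page_no] = e->lsn;
        } else if (e->action == A_FLUSH) {
            redo_seq_no[e->page_no] = 0;
        }
    }
}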

19. Example for Redo with Flush Log Entries
[Figure: the same schedule as in the previous example; with flush(d), flush(a), flush(b) recorded as log entries, the analysis pass reconstructs the DPT (b: RedoLSN=20, c: RedoLSN=30, d: RedoLSN=70) from the log alone, and redo is skipped for pages whose stable versions are already current.]

20. Redo Optimization 3: Full-write Log Entries
During normal operation: “occasionally” log the full after-image of a page p (absorbs all previous writes to that page).
During the analysis pass of restart: construct an enhanced DPT with DirtyPages[p].RedoSeqNo := LogSeqNo of the full-write.
During the redo pass of restart: all log entries of a page that precede its most recent full-write log entry can be ignored.
Advantages:
• can skip many log entries
• can plan a better schedule for page fetches
• allows better log truncation
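The redo-pass consequence is a one-line filter; sketched in C (names assumed, with the full-write LSNs collected during the analysis pass):

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t LSN;

#define NPAGES 1024
static LSN last_full_write[NPAGES];   /* 0 if page has no full-write entry */

/* Log entries older than the page's most recent full after-image are
 * absorbed by it and can be skipped during redo. */
static bool absorbed_by_full_write(int page_no, LSN entry_lsn) {
    return entry_lsn < last_full_write[page_no];
}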

21. Correctness of Optimized Redo
Invariant during the optimized redo pass: [formula lost in transcript]
Builds on correct DPT construction during the analysis pass.

22. Why is Page-based Redo Logging Good Anyway?
• testable state with very low overhead
• much faster than logical redo: logical redo would require replaying high-level operations with many page fetches (random I/O)
• can be applied to selected pages independently
• can exploit perfectly scalable parallelism for redo of a large multi-disk db
• can efficiently reconstruct corrupted pages when disk tracks have errors (without full media recovery for the entire db)

23. Outline
✓ What and Why?
✓ Correct and Efficient Redo
• Correct Undo
• Quantitative Guarantees
• Where and How?
Corollary to Murphy’s Law: Murphy was an optimist.

24. Winners Follow Losers
[Figure: history with t1: 10:w(a); t2: 20:w(a), rollback; t3: 30:w(a); t4: 40:w(b), 50:w(a); t5: 60:w(a); a crash and a completed restart occur in between.]
Occurs because of:
• rollbacks during normal operation
• repeated crashes
• soft crashes between a backup and a media failure
• concurrency control with commutative ops on the same page

25. Winners Follow Losers (cont.)
[Figure: successive versions of page a with LSNs 2, 10, 30, 49, 60 as redo(10: w(a), t1), redo(30: w(a), t3), undo(50: w(a), t4), redo(60: w(a), t5) are applied across crashes.]
Problems:
• old losers prevent log truncation
• an LSN-based testable state becomes infeasible

26. Treat Completed Losers as Winners and Redo History
[Figure: the same history, now with inverse writes logged as compensation entries 25: w⁻¹(a), 55: w⁻¹(a), 57: w⁻¹(b).]
Solution:
• Undo generates compensation log entries (CLEs)
• Undo steps are redone during restart

27. CLE Backward Chaining for Bounded Log Growth (& More Goodies)
A CLE points to the predecessor of the inverted write:
... begin t4 → 40: w(b) → 50: w(a) → CLE 55: w⁻¹(a) → CLE 57: w⁻¹(b) → commit t4
[Figure: t4’s backward chain; CLE 55 compensates the write at 50 and points back past it, CLE 57 compensates the write at 40, so each loser write is compensated exactly once even across repeated crashes.]

28. CLE Backward Chaining for Bounded Log Growth (cont.)
[Figure: the same chain in mid-rollback, before the second CLE is written: ... begin t4 → 40: w(b) → 50: w(a) → CLE 55: w⁻¹(a); after a crash at this point, undo resumes at CLE 55’s NextUndoSeqNo, i.e., at 40: w(b).]

29. Undo Algorithm

...
while exists t in losers with losers[t].LastSeqNo <> nil do
  nexttrans := TransNo in losers such that
    losers[nexttrans].LastSeqNo = max {losers[x].LastSeqNo | x in losers};
  nextentry := losers[nexttrans].LastSeqNo;
  case StableLog[nextentry].ActionType of
    compensation:
      losers[nexttrans].LastSeqNo := StableLog[nextentry].NextUndoSeqNo;
    write:
      ...
      newentry.LogSeqNo := new sequence number;
      newentry.ActionType := compensation;
      newentry.PreviousSeqNo := losers[nexttrans].LastSeqNo;
      newentry.NextUndoSeqNo := StableLog[nextentry].PreviousSeqNo;
      losers[nexttrans].LastSeqNo := newentry.LogSeqNo;
      LogBuffer += newentry;
      ...
  end;
end;
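The essential CLE bookkeeping, sketched in C (the log-append helper is hypothetical): the CLE’s NextUndoSeqNo points to the predecessor of the compensated write, so a crash during rollback never causes a write to be undone twice, and the log cannot grow without bound under repeated crashes:

#include <stdint.h>

typedef uint64_t LSN;

typedef struct {
    LSN lsn;
    LSN prev_seq_no;        /* backward chain within the transaction */
    LSN next_undo_seq_no;   /* CLEs only: where undo resumes */
    int is_cle;
} LogEntry;

LSN append_to_log(LogEntry e);   /* hypothetical helper */

/* Write a CLE compensating log entry `undone`; returns the CLE's LSN. */
static LSN write_cle(const LogEntry *undone, LSN trans_last_lsn) {
    LogEntry cle = {0};
    cle.is_cle = 1;
    cle.prev_seq_no = trans_last_lsn;
    /* skip past the compensated write on any later undo attempt */
    cle.next_undo_seq_no = undone->prev_seq_no;
    return append_to_log(cle);
}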

30. Correctness of Undo Pass
Invariant of the undo pass (assuming tree reducibility of the history): [formula lost in transcript]

31. Outline
✓ What and Why?
✓ Correct and Efficient Redo
✓ Correct Undo
• Quantitative Guarantees
• Where and How?
There are three kinds of lies: lies, damned lies, and statistics. (Benjamin Disraeli)
97.3 % of all statistics are made up. (Anonymous)

32. Towards Quantitative Guarantees
Online control for a given system configuration:
• observe the length of the stable log and the dirty-page fraction
• estimate redo time based on these observations
• take countermeasures if the redo time becomes too long (log truncation, flush-behind demon priority)
Planning of the system configuration:
• (stochastically) predict the (worst-case) redo time as a function of workload and system characteristics
• configure the system (cache size, caching policy, #disks, flush-behind demon, etc.) to guarantee an upper bound
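A minimal C sketch of the online control rule (all constants and field names are illustrative assumptions):

#include <stdbool.h>

typedef struct {
    long   log_entries;   /* entries between SystemRedoLSN and log end */
    long   dirty_pages;   /* current number of dirty cache pages */
    double t_apply_s;     /* estimated time to apply one log entry */
    double t_fetch_s;     /* estimated time to fetch one page */
} RestartStats;

/* Redo time ~ log scan/apply work + page fetches for dirty pages,
 * mirroring the key factors on the next slide. */
static double estimated_redo_time(const RestartStats *s) {
    return s->log_entries * s->t_apply_s + s->dirty_pages * s->t_fetch_s;
}

/* Countermeasures: truncate the log via a checkpoint, or raise the
 * flush-behind demon's priority to shrink the dirty-page set. */
static bool need_countermeasures(const RestartStats *s, double bound_s) {
    return estimated_redo_time(s) > bound_s;
}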

33. Key Factors for Bounded Redo Time
1) # dirty pages → # db I/Os for redo
2) longest dirty lifetime → log scan time
3) # log entries for dirty pages → # redo steps
[Figure: per-page timelines between checkpoint and crash, marking write accesses and whether each page is out of cache, in cache & clean, or in cache & dirty at crash time.]

34. Analysis of DB Page Fetches During Redo
N pages with:
• reference prob. β1 … βN (βi ≥ βi+1)
• cache residence prob. c1 … cN
• write prob. ω1 … ωN
• page reference arrival rate λ
• flush-behind demon I/O rate λ_f per disk, disk I/O rate λ_disk
• P[flush in time 1/λ]: f1 … fN
Per-page Markov model with states 1: out, 2: clean, 3: dirty and transition probabilities:
p11 = 1 − ci, p12 = ci
p21 = 1 − ci, p22 = ci (1 − ωi), p23 = ci ωi
p31 = 1 − ci, p32 = ci fi, p33 = ci (1 − fi)
Solve the stationary equations π_j = Σ_k π_k p_kj (k, j ∈ {1, 2, 3});
then di := P[page i is dirty] = π_3, with pages permuted such that di ≥ di+1.
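The stationary probabilities of this per-page chain can be computed directly; a small C sketch using power iteration, with illustrative parameter values (the transition matrix is the one on the slide):

#include <stdio.h>

/* Per-page Markov chain: states 0=out, 1=clean, 2=dirty.
 * c = cache residence prob., w = write prob., f = flush prob.
 * d_i = P[page i is dirty] is the stationary probability of state 2. */
int main(void) {
    double c = 0.9, w = 0.2, f = 0.5;   /* illustrative values */
    double P[3][3] = {
        { 1 - c, c,             0           },
        { 1 - c, c * (1 - w),   c * w       },
        { 1 - c, c * f,         c * (1 - f) }
    };
    double pi[3] = { 1.0, 0.0, 0.0 }, next[3];
    for (int it = 0; it < 10000; it++) {   /* power iteration: pi := pi P */
        for (int j = 0; j < 3; j++) {
            next[j] = 0.0;
            for (int k = 0; k < 3; k++)
                next[j] += pi[k] * P[k][j];
        }
        for (int j = 0; j < 3; j++)
            pi[j] = next[j];
    }
    printf("P[out]=%.4f  P[clean]=%.4f  P[dirty]=%.4f\n",
           pi[0], pi[1], pi[2]);
    return 0;
}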

35. Subproblem: Analysis of Cache Residence
With LRU-K cache replacement (e.g., K = 2) and cache size M:
[Figure: reference string on a time axis, with the backward K-distance b_K measured from “now”.]
N pages with:
• reference prob. β1 … βN (βi ≥ βi+1)
• cache residence prob. c1 … cN
• write prob. ω1 … ωN
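For intuition, here is a minimal LRU-2 victim selection in C; it illustrates the backward-K-distance idea behind the analysis, not the analytical model itself (all names are illustrative):

#include <stdint.h>

#define K      2
#define NPAGES 8

/* hist[p][0] = time of most recent reference to page p,
 * hist[p][1] = time of the one before (0 = fewer than K references). */
static uint64_t hist[NPAGES][K];

static void reference(int p, uint64_t now) {
    hist[p][1] = hist[p][0];
    hist[p][0] = now;
}

/* Evict the cached page whose K-th most recent reference is oldest,
 * i.e., whose backward K-distance b_K is largest (0 counts as infinite). */
static int pick_victim(const int *in_cache) {
    int victim = -1;
    uint64_t oldest = UINT64_MAX;
    for (int p = 0; p < NPAGES; p++) {
        if (!in_cache[p])
            continue;
        if (hist[p][K - 1] < oldest) {
            oldest = hist[p][K - 1];
            victim = p;
        }
    }
    return victim;
}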

36. Subproblem: Analysis of Flush Probability
M/G/1 vacation server model for each disk:
[Figure: request arrivals and departures on a time axis; the response time R spans wait time, service time S, and vacation time T.]
Exhaustive-service vacations: whenever the disk queue is empty, invoke the flush-behind demon for a time period T.
Laplace transforms for the response time are known (see Takagi 1991), with utilization ρ = λ E[S];
choose the maximum T s.t. P[R > t] ≤ ε (Chernoff bound), e.g., t = 10 ms, ε = 0.05.
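Instead of evaluating the Laplace transforms, a candidate T can be checked by simulation; the following C sketch assumes exponential service for simplicity, so it illustrates the tuning rule "choose the maximum T with P[R > t] ≤ ε", not the M/G/1 analysis itself (all parameter values are made up):

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* Single server with exhaustive-service multiple vacations of fixed
 * length T: whenever the queue empties, the flush-behind demon runs
 * for T; arrivals during a vacation wait until it ends. */
static double expo(double rate) {
    double u = (rand() + 1.0) / ((double)RAND_MAX + 2.0);  /* in (0,1) */
    return -log(u) / rate;
}

int main(void) {
    const double lambda = 50.0;    /* request arrivals per second */
    const double mu     = 100.0;   /* service rate, 1/E[S] */
    const double T      = 0.004;   /* vacation (flush) period, seconds */
    const double t      = 0.010;   /* response-time target: 10 ms */
    const long   N      = 1000000; /* simulated requests */

    srand(42);
    double now = 0.0, server_free = 0.0;
    long late = 0;
    for (long i = 0; i < N; i++) {
        now += expo(lambda);                       /* next arrival */
        if (server_free < now) {
            /* queue was empty: back-to-back vacations are in progress;
             * service can start only at the end of the current one */
            double idle = now - server_free;
            server_free += ceil(idle / T) * T;
        }
        double start = server_free > now ? server_free : now;
        double done  = start + expo(mu);
        if (done - now > t)
            late++;
        server_free = done;
    }
    printf("rho = %.2f, P[R > %.0f ms] ~ %.4f\n",
           lambda / mu, t * 1e3, (double)late / N);
    return 0;
}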

37. So What?
Predictability: workload + configuration → performance
e.g., allows us to auto-tune the system: workload + performance goals → configuration

38. Outline
✓ What and Why?
✓ Correct and Efficient Redo
✓ Correct Undo
✓ Quantitative Guarantees
• Where and How?
More than any time in history mankind faces a crossroads. One path leads to despair and utter hopelessness, the other to total extinction. Let us pray that we have the wisdom to choose correctly. (Woody Allen)

39. Where Can We Contribute?
• bigger picture of dependability and quality guarantees
• turn the sketch of correctness reasoning into verified recovery code
• turn the sketch of restart-time analysis into a quality guarantee
• address all other aspects of recovery

40. (Real log-manager code, for scale:)

L1LogRecMaster *InitL1LogIOMgr (Segment *logsegment, int LogSize,
                                int LogPageSize, int newlog)
{
  L1LogRecMaster *recmgr = NULL;
  BitVector p = 1;
  int i;
  NEW (recmgr, 1);
  recmgr->Log = logsegment;
  A_LATCH_INIT (&(recmgr->BufferLatch));
  A_LATCH_INIT (&(recmgr->UsedListLatch));
  A_LATCH_INIT (&(recmgr->FixCountLatch));
  recmgr->FixCount = 0;
  recmgr->StoreCounter = 0;
  recmgr->PendCount = 0;
  A_SEM_INIT (&(recmgr->PendSem), 0);
  NEW (recmgr->Frame, BUFFSIZE);
  recmgr->Frame[0] = (LogPageCB *) MALLOC (LogPageSize * BUFFSIZE);
  for (i = 1; i < BUFFSIZE; i++)
    recmgr->Frame[i] = (LogPageCB *)(((char *) recmgr->Frame[i - 1]) + LogPageSize);
  recmgr->InfoPage = (InfoPageCB *) MALLOC (LogPageSize);
  recmgr->UsedList = InitHash (HASHSIZE, HashPageNo, NULL, NULL, ComparePageNo, 4);
  recmgr->ErasedFragList = InitHash (HASHSIZE, HashPageNo, NULL, NULL, ComparePageNo, 4);
  for (i = 0; i < sizeof (BitVector) * 8; i++) {
    recmgr->Mask[i] = p;
    p = p * 2;
  }
  if (newlog) {
    recmgr->InfoPage->LogSize = LogSize;
    recmgr->InfoPage->PageSize = LogPageSize;
    recmgr->InfoPage->Key = 0;
    recmgr->InfoPage->ActLogPos = 1;
    recmgr->InfoPage->ActBufNr = 0;
    recmgr->InfoPage->ActSlotNr = 0;
    InitLogFile (recmgr, logsegment);
    LogIOWriteInfoPage (recmgr);
  } else
    ReadInfoPage (recmgr);
  return (recmgr);
} /* end LogIOInit */

41.

static int
log_fill(dblp, lsn, addr, len)
    DB_LOG *dblp;
    DB_LSN *lsn;
    void *addr;
    u_int32_t len;
{
    ...
    while (len > 0) {
        /* Copy out the data. */
        if (lp->b_off == 0)
            lp->f_lsn = *lsn;
        if (lp->b_off == 0 && len >= bsize) {
            nrec = len / bsize;
            if ((ret = __log_write(dblp, addr, nrec * bsize)) != 0)
                return (ret);
            addr = (u_int8_t *)addr + nrec * bsize;
            len -= nrec * bsize;
            ++lp->stat.st_wcount_fill;
            continue;
        }
        /* Figure out how many bytes we can copy this time. */
        remain = bsize - lp->b_off;
        nw = remain > len ? len : remain;
        memcpy(dblp->bufp + lp->b_off, addr, nw);
        addr = (u_int8_t *)addr + nw;
        len -= nw;
        lp->b_off += nw;
        /* If we fill the buffer, flush it. */
        if (lp->b_off == bsize) {
            if ((ret = __log_write(dblp, dblp->bufp, bsize)) != 0)
                return (ret);
            lp->b_off = 0;
            ++lp->stat.st_wcount_fill;
        }
    }
    return (0);
}

42. Where Can We Contribute? (reprise)
• bigger picture of dependability and quality guarantees
• turn the sketch of correctness reasoning into verified recovery code
• turn the sketch of restart-time analysis into a quality guarantee
• address all other aspects of recovery

43. What Else is in the Jungle?
• more optimizations for logging and recovery of specific data structures (e.g., B+-tree indexes)
• incremental restart (early admission of new transactions while redo or undo is still in progress)
• failover recovery for clusters with shared disks and distributed memory
• media recovery, remote replication, and configuration for availability goals → M. Gillmann
• process and message recovery for failure masking to applications and users → G. Shegalov, collab. with MSR

44. Where Can We Contribute?
• bigger picture of dependability and quality guarantees
• turn the sketch of correctness reasoning into verified recovery code: sw engineering → sw engineering for verification
• turn the sketch of restart-time analysis into a quality guarantee: sys architecture → sys architecture for predictability
• address all other aspects of recovery
