Final Exam
• Time: May 8th, Thursday, 7-10 PM
• Format
  • Open book / open notes
  • Problems and short answer
  • Difficulty similar to homework problems (less time consuming)
ECE 569

Recovery
Chapters 9, 10, 11.1-11.4 in Gray and Reuter
Adapted from slides by J. Gray & A. Reuter

Failure Types
• Transaction Failure
  • Transaction issues abort
• System Failure
  • Volatile memory is corrupted
• Media Failure
  • Stable storage is corrupted

Failure Model (Assumptions)
• Failures can always be detected (failstop)
• System
  • Defensive programming
  • Error detecting codes (parity, checksums, etc.)
• Media
  • Redundant information (e.g., description of block in header)
  • Checksums

System Failure Recovery
• Goal
  • At point of failure, history is H.
  • Recovery must restore DB to final state defined by C(H).
  • All information needed to accomplish this must be in stable storage.

Normal (no failure) Transaction Execution
• TM generates the TRID at Begin_Work() and coordinates commit.
• RM joins work, generates log records, allows commit.

The Resource Manager View
    Boolean Prepare(LSN *);              /* invoked at phase 1; returns vote on commit */
    void    Commit();                    /* called at commit phase 2 */
    void    Abort();                     /* called at failed commit phase 2 or abort */
    void    UNDO(LSN);                   /* undo the log record with this LSN */
    void    REDO(LSN);                   /* redo the log record with this LSN */
    void    TM_Startup(LSN);             /* TM restarting; passes RM ckpt LSN */
    LSN     Checkpoint(LSN * low_water); /* TM checkpointing; returns RM ckpt LSN, sets low water LSN */

The Transaction Manager
• Transaction rollback.
  • Coordinates transaction rollback to a savepoint or abort.
  • Rollbacks can be initiated by any participant.
• Resource manager restart.
  • If an RM fails and restarts, TM presents checkpoint anchor & RM undo/redo log.
• System restart.
  • TM drives local RM recovery (like RM restart).
  • TM resolves any in-doubt distributed transactions.
• Media recovery.
  • TM helps RM reconstruct damaged objects by providing archive copies of the object + the log of the object since it was archived.
• Node restart.
  • Transaction commit among independent TMs when a TM fails.

When a Transaction Aborts
• At transaction rollback
  • TM drives undo of each RM joined to the transaction.
  • Can be to savepoint 0 (abort) or partial rollback.

The Transaction Manager at Restart/Recovery
• At restart, the TM reads the log and drives RM recovery.
  • Single log scan.
  • Single resolver of transactions.
  • Multiple logs possible, but more complex/more work.

Resource Manager Concepts: Transaction UNDO Protocol
    declare cursor for transaction_log
        select rmid, lsn                  /* a cursor on the transaction's log */
        from log                          /* it returns the resource manager name */
        where trid = :trid                /* and record id (log sequence number) */
        descending lsn;                   /* and returns records in LIFO order */

    void transaction_undo(TRID trid)      /* Undo the specified transaction. */
    {   int sqlcode;                      /* event variables set by sql */
        open cursor transaction_log;      /* open an sql cursor on the trans log */
        while (TRUE)                      /* scan trans log backwards & undo each */
        {                                 /* fetch the next most recent log rec */
            fetch transaction_log into :rmid, :lsn;
            if (sqlcode != 0) break;      /* if no more, trans is undone, end loop */
            rmid.undo(lsn);               /* tell RM to undo that record */
        }
        close cursor transaction_log;     /* Undo scan is complete, close cursor */
    };                                    /* return to caller */

Resource Manager Concepts: Restart REDO Protocol
• Note: REDO forwards, UNDO backwards
    void log_redo(void)
    {   declare cursor for the_log        /* declare cursor from log start forward */
            select rmid, lsn              /* gets RM id and log record id (lsn) */
            from log                      /* of all log records. */
            ascending lsn;                /* in FIFO order */
        open cursor the_log;              /* open an sql cursor on the log table */
        while (TRUE)                      /* Scan log forward & redo each record. */
        {   fetch the_log into :rmid, :lsn;   /* fetch the next log record */
            if (sqlcode != 0) break;      /* if no more, then all redone, end loop */
            rmid.redo(lsn);               /* tell RM to redo that record */
        }
        close cursor the_log;             /* Redo scan complete, close cursor */
    };                                    /* return to caller */

Idempotence
[Figure: old state, new state, and the undo and redo log records that connect them]
• F(F(X)) == F(X): Needed in case restart fails (and restarts)
• Redo(Redo(old_state,log), log) = Redo(new_state,log) = new_state
• Undo(Undo(new_state,log), log) = Undo(old_state,log) = old_state

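The idempotence rule above can be illustrated with a minimal C sketch. The names `value_log_rec`, `redo_value`, `undo_value`, and `redo_delta` are hypothetical and do not come from the slides; the sketch assumes a log record that stores the old and new value of a single integer field.

```c
#include <assert.h>

/* Hypothetical log record: old and new value of one integer field. */
typedef struct {
    int old_value;
    int new_value;
} value_log_rec;

/* Idempotent REDO: installs the logged new value, so repeating it
   after a failed restart yields the same state. */
int redo_value(int state, const value_log_rec *rec)
{
    (void)state;              /* the result does not depend on the current state */
    assert(rec != 0);
    return rec->new_value;
}

/* Idempotent UNDO: installs the logged old value. */
int undo_value(int state, const value_log_rec *rec)
{
    (void)state;
    assert(rec != 0);
    return rec->old_value;
}

/* NOT idempotent: a logged delta is applied again on every repetition. */
int redo_delta(int state, int delta)
{
    return state + delta;
}
```

Applying `redo_value` twice gives the same result as applying it once, while applying `redo_delta` twice adds the delta twice. That difference is why an operation of the second kind needs a testable state before it can be redone safely.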
Testable State: Can Tell If It Happened.
IF operation not idempotent AND state not testable THEN recovery is impossible

Kinds of Logging
• Physical
  • Keep old and new value of container (page, file, ...)
  • Pro: Simple
  • Allows recovery of physical object (e.g. broken page)
  • Con: Generates LOTS of log data
• Logical
  • Keep call params such that you can compute F(x), F^-1(x)
  • Pro: Sounds simple
  • Compact log.
  • Con: Doesn't work (wrong failure model).
  • Operations do not fail cleanly.

Sample Physical LOG RECORD
    struct compressed_log_record_for_page_update
    {   int      opcode;              /* opcode will say compressed page update */
        filename fname;               /* name of file that was updated */
        long     pageno;              /* page that was updated */
        long     offset;              /* offset within page that was updated */
        long     length;              /* length of field that was updated */
        char     old_value[length];   /* old value of field */
        char     new_value[length];   /* new value of field */
    };
• Ordinary sequential insert is OK.
• Update of sorted (B-tree) page:
  • update LSN
  • update page space map
  • update pointer to record
  • insert record at correct spot (move 1/2 the others)
• Essentially writes whole page (old and new).
  • 16KB log records for 100-byte updates.

Sample Logical LOG RECORD
    struct logical_log_record_for_insert
    {   int      opcode;           /* opcode will say insert */
        filename fname;            /* name of file that was updated */
        long     length;           /* length of record that was updated */
        char     record[length];   /* value of record */
    };
• Very compact.
• Implies page update(s) for record (may be many pages long).
• Implies index updates (may be many indices on base table).

The Trouble with Logical Logging
• Logical logging needs to start UNDO/REDO with an action-consistent state.
• Partial Actions
  • If an action runs to completion, we can use the inverse action to UNDO the operation.
  • What if an action fails part of the way through? How do we put the system in a consistent state?
  • For example: insert (table, record)
    • ALL or NONE of the indices should be updated when logical UNDO/REDO is invoked.
• Action Consistency
  • After a system failure, the state of persistent storage may not be action consistent.
  • How can we restore an action-consistent state?

Making Logical Logging Work: Shadows
• Keep old copy of each page
• Reset page to old copy at abort (no undo log)
• Discard old copy at commit.
• Handles all online failures due to:
  • Logic: e.g. duplicate key.
  • Limit: ran out of space
  • Contention: deadlock
• Problem: forces page locking, only one updater per page.
• What about restart?
  • Need to atomically write out all changed pages.

Making Logical Logging Work: Shadows
• Perform same shadow trick at disc level.
• Keep shadow copy of old pages.
• Write out new pages.
• In one careful write, write out new page root.
• Makes update atomic

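The disc-level shadow trick above can be sketched in C as a toy in-memory model. Everything here (`shadow_disk`, `shadow_init`, `shadow_begin`, `shadow_write`, `shadow_commit`, `shadow_read`, and the fixed sizes) is a hypothetical illustration, not code from the slides. It assumes two page maps and a root flag that selects the current one, so that flipping the root stands in for the single careful write.

```c
#include <assert.h>
#include <string.h>

#define N_PAGES 4   /* logical pages */
#define N_SLOTS 8   /* simulated disc blocks */

typedef struct {
    int slots[N_SLOTS];     /* simulated disc blocks */
    int map[2][N_PAGES];    /* two page maps: current and new */
    int root;               /* which map is current (0 or 1) */
    int next_free;          /* next unused slot (no reuse in this sketch) */
} shadow_disk;

/* Lay out page p in slot p with initial value 100 + p. */
void shadow_init(shadow_disk *d)
{
    memset(d, 0, sizeof *d);
    for (int p = 0; p < N_PAGES; p++) {
        d->slots[p] = 100 + p;
        d->map[0][p] = p;
    }
    d->next_free = N_PAGES;
}

/* Start an update: the new map begins as a copy of the current one. */
void shadow_begin(shadow_disk *d)
{
    memcpy(d->map[1 - d->root], d->map[d->root], sizeof d->map[0]);
}

/* Write the new page to a fresh slot; the old (shadow) copy is untouched. */
void shadow_write(shadow_disk *d, int page, int value)
{
    assert(page >= 0 && page < N_PAGES);
    assert(d->next_free < N_SLOTS);
    int slot = d->next_free++;
    d->slots[slot] = value;
    d->map[1 - d->root][page] = slot;
}

/* The one careful write: switching the root makes all new pages visible. */
void shadow_commit(shadow_disk *d)
{
    d->root = 1 - d->root;
}

/* Readers always go through the current root. */
int shadow_read(const shadow_disk *d, int page)
{
    assert(page >= 0 && page < N_PAGES);
    return d->slots[d->map[d->root][page]];
}
```

Until `shadow_commit` runs, a reader (or a restart) following the root still sees the old pages, which is the atomicity the slide refers to. Abort simply never flips the root.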
Shadows
• Pro: Simple
  • Not such a bad deal with non-volatile RAM
• Con: page locking
  • extra space
  • extra overhead (for page maps)
  • extra IO
  • declusters sequential data

Logical vs Physiological Logging
Note: physical log records would be bigger for sorted pages.

Physiological Logging Rules
• Complex operations are a sequence of simple operations on pages.
• Each operation is constructed as a mini-transaction:
  • lock the object in exclusive mode
  • transform the object
  • generate an UNDO-REDO log record
  • record log LSN in object
  • unlock the object.
• Action Consistent Object
  • When object semaphore free, no ops in progress.
• Log-Consistency
  • Log contains log records of all complete page actions.

Physiological Logging Rules - Online Operation
• Each operation is structured as a mini-transaction.
• Each operation generates an UNDO record.
• No page operation fails with the semaphore set.
  • (exception handler must clean up state and UNFIX any pages).
• Then Rollback can be physical to a page and logical within page.

Physiological Logging Rules - Restart Operation
• Need Page-Action consistent persistent state.
  • Pages are action consistent.
  • Committed actions can be redone from log.
  • Uncommitted actions can be undone from log.
• WAL: Write Ahead Log
  • Write undo/redo log records before overwriting disk page
  • Only write action-consistent pages
• Force-Log-At-Commit
  • Make transaction log records durable at commit.

WAL and Force at Commit
• WAL: Write Ahead Log
  • write page:
    • get page semaphore
    • copy page to buffer
    • give page semaphore /* avoids holding semaphore during IO */
    • Force_log(Page(LSN)) /* WAL logic, probably already flushed */
    • Write buffer to disk.
  • WAL gives idempotence and testability.
• Force-Log-At-Commit
  • At commit phase 1:
    • Force_log(transaction.max_lsn)

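The write-page steps above can be sketched as a small C simulation. The types and functions (`sim_log`, `sim_page`, `sim_log_insert`, `sim_force_log`, `sim_write_page`) are hypothetical stand-ins rather than the slides' API, and the page semaphore is omitted because the sketch is single-threaded.

```c
#include <assert.h>

typedef long LSN;

/* Simulated log: records the highest LSN that has reached stable storage. */
typedef struct {
    LSN next_lsn;      /* LSN the next inserted record will receive */
    LSN durable_lsn;   /* highest LSN already forced to stable storage */
    int forces;        /* number of forces that actually did work */
} sim_log;

typedef struct {
    LSN lsn;           /* LSN of the most recent log record for this page */
    int data;          /* page contents, reduced to a single int */
} sim_page;

/* Append a log record (volatile only) and return its LSN. */
LSN sim_log_insert(sim_log *log)
{
    return log->next_lsn++;
}

/* Make all log records up to and including 'upto' durable. */
void sim_force_log(sim_log *log, LSN upto)
{
    if (upto > log->durable_lsn) {   /* often already flushed: then a no-op */
        log->durable_lsn = upto;
        log->forces++;
    }
}

/* WAL: the log must cover the page before the page overwrites its disk copy. */
void sim_write_page(sim_log *log, const sim_page *cached, sim_page *disk)
{
    sim_page copy = *cached;              /* copy page to buffer (semaphore omitted) */
    sim_force_log(log, copy.lsn);         /* WAL logic: force log up to the page LSN */
    assert(log->durable_lsn >= copy.lsn); /* invariant that WAL guarantees */
    *disk = copy;                         /* write buffer to disk */
}
```

A commit would call the same `sim_force_log` with the transaction's highest LSN, which is the Force-Log-At-Commit rule in this simplified model.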
The One Bit Resource Manager
• Manages an array of transactional bits (the free space bit map).
    i = get_bit();    /* gets a free bit and marks it busy */
    give_bit(i);      /* returns bit i to the free pool */

The Bitmap and Its Log Records
• The Data Structure
    struct {                      /* layout of the one-bit RM data structure */
        LSN        lsn;           /* page LSN for WAL protocol */
        xsemaphore sem;           /* semaphore regulates access to the page */
        Boolean    bit[BITS];     /* page.bit[i] = TRUE => bit[i] is free */
    } page;                       /* allocates the page structure */
• The Log Records
    struct                        /* log record format for the one-bit RM */
    {   int     index;            /* index of bit that was updated */
        Boolean value;            /* new value of bit[index] */
    } log_rec;                    /* log record used by the one-bit RM */
    const int rec_size = sizeof(log_rec);   /* size of the log record body */

Page and Log Consistency for 1-Bit RM
• Data is dirty if it reflects an uncommitted transaction update. Otherwise, data is clean.
• Page Consistency:
  • No clean free bit has been given to any transaction.
  • Every clean busy bit was given to exactly one transaction.
  • Dirty bits locked in X mode by updating transactions.
  • The page.lsn reflects most recent log record for page.
• Log Consistency:
  • Log contains a record for every completed mini-transaction update to the page.

give_bit()
• get_bit() & give_bit(i) temporarily violate page consistency.
• Mini-transaction holds semaphore while violating consistency.
• Makes page & log mutually consistent before releasing sem.
• Each mini-transaction observes a consistent page state.
    void give_bit(int i)                                     /* free a bit */
    {   if (LOCK_GRANTED == lock(i, LOCK_X, LOCK_LONG, 0))   /* lock bit */
        {   Xsem_get(&page.sem);                             /* get page sem */
            page.bit[i] = TRUE;                              /* free the bit */
            log_rec.index = i;                               /* generate log rec */
            log_rec.value = TRUE;                            /* saying bit is free */
            page.lsn = log_insert(log_rec, rec_size);        /* write log rec & update lsn */
            Xsem_give(&page.sem);                            /* page consistent */
        }
        else                                                 /* if lock failed, caller doesn't own bit, */
            Abort_Work();                                    /* in that case abort caller's trans */
        return;
    };

get_bit()
    int get_bit(void)                              /* allocates a bit and returns its index */
    {   int i;                                     /* loop variable */
        Xsem_get(&page.sem);                       /* get the page semaphore */
        for (i = 0; i < BITS; i++)                 /* loop looking for a free bit */
        {   if (page.bit[i])                       /* if bit is free, may be dirty (so locked) */
            {   if (LOCK_GRANTED == lock(i, LOCK_X, LOCK_LONG, 0))   /* lock bit */
                {   page.bit[i] = FALSE;           /* got lock on it */
                    log_rec.value = FALSE;         /* generate log rec describing update */
                    log_rec.index = i;
                    page.lsn = log_insert(log_rec, rec_size);   /* write log rec & update lsn */
                    Xsem_give(&page.sem);          /* page now consistent, give up sem */
                    return i;                      /* return to caller */
                };
            };
        };                                         /* try next free bit */
        Xsem_give(&page.sem);                      /* if no free bits, give up semaphore */
        Abort_Work();                              /* abort transaction */
        return -1;                                 /* returns -1 if no bits are available */
    };

Compensation Logging
• Undo may generate a log record recording the undo step
• Makes page LSN monotonic

1-bit RM UNDO Callback
    void undo(LSN lsn)                             /* undo a one-bit RM operation */
    {   int i;                                     /* bit index */
        Boolean value;                             /* old bit value from log rec to be undone */
        log_rec_header header;                     /* buffer to hold log record header */
        rec_size = log_read_lsn(lsn, header, 0, log_rec, big);   /* read log rec */
        Xsem_get(&page.sem);                       /* get the page semaphore */
        i = log_rec.index;                         /* get bit index from log record */
        value = !log_rec.value;                    /* get complement of new bit value */
        page.bit[i] = value;                       /* update bit to old value */
        log_rec.value = value;                     /* make a compensation log record */
        page.lsn = log_insert(log_rec, rec_size);  /* log it and bump page lsn */
        Xsem_give(&page.sem);                      /* free the page semaphore */
        return;
    }

1-bit RM REDO Callback
    void redo(LSN lsn)                             /* redo a free space operation */
    {   int i;                                     /* bit index */
        Boolean value;                             /* new bit value from log rec to be redone */
        log_rec_header header;                     /* buffer to hold log record header */
        rec_size = log_read_lsn(lsn, header, 0, log_rec, big);   /* read log record */
        i = log_rec.index;                         /* get bit index */
        lock(i, LOCK_X, LOCK_LONG, 0);             /* get lock on the bit (often not needed) */
        Xsem_get(&page.sem);                       /* get the page semaphore */
        if (page.lsn < lsn)                        /* if bit version older than log record */
        {   value = log_rec.value;                 /* then redo the op: get new bit value */
            page.bit[i] = value;                   /* apply new bit value to bit */
            page.lsn = lsn;                        /* advance the page lsn */
        }
        Xsem_give(&page.sem);                      /* free the page semaphore */
        return;
    }

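The page LSN test in the REDO callback is what makes the page state testable. The following self-contained C sketch shows this with a deliberately non-idempotent logical operation ("flip bit"). The names `bit_page`, `flip_log_rec`, and `redo_flip` are hypothetical, and locking and semaphores are left out because the sketch is single-threaded.

```c
#include <assert.h>
#include <stdbool.h>

#define N_BITS 8

typedef long LSN;

typedef struct {
    LSN  lsn;             /* LSN of the last log record applied to the page */
    bool bit[N_BITS];
} bit_page;

/* Logical record for a "flip bit" operation; flipping twice would undo it. */
typedef struct {
    LSN lsn;              /* LSN of this log record */
    int index;            /* bit to flip */
} flip_log_rec;

/* Redo guarded by the page LSN. Returns true if the record was applied,
   false if the page already reflects it. */
bool redo_flip(bit_page *p, const flip_log_rec *rec)
{
    assert(rec->index >= 0 && rec->index < N_BITS);
    if (p->lsn >= rec->lsn)
        return false;                       /* already applied: skip */
    p->bit[rec->index] = !p->bit[rec->index];
    p->lsn = rec->lsn;                      /* advance the page LSN */
    return true;
}
```

If restart fails and runs again, the second `redo_flip` call with the same record is skipped, so the flip is applied exactly once even though the operation itself is not idempotent.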
1-Bit RM Noise Callbacks
    Boolean prepare(LSN * lsn)                        /* 1-bit RM has no phase 1 work */
    {   *lsn = NULLlsn; return TRUE; };
    void Commit(void)                                 /* Commit: release locks & return */
    {   unlock_class(LOCK_LONG, TRUE, MyRMID()); };
    void Abort(void)                                  /* Abort: release all locks & return */
    {   unlock_class(LOCK_LONG, TRUE, MyRMID()); };
    Boolean savepoint(LSN * lsn)                      /* no work to do at savepoint */
    {   *lsn = NULLlsn; return TRUE; };
    void UNDO_savepoint(LSN lsn)                      /* rollback work or abort transaction */
    {   if (savepoint == 0)                           /* if at savepoint zero (abort) */
            unlock_class(LOCK_LONG, TRUE, MyRMID());  /* release all locks */
    };

Summary
• Model: Complex actions are a page action sequence.
• LSN: Each page carries an LSN and a semaphore.
• ReadFix: Read acquires semaphore in shared mode.
• WriteFix: Update actions (1) get semaphore in exclusive mode, (2) generate one or more log records covering the page, (3) advance the page LSN to match the highest LSN, (4) give semaphore.
• WAL: log_flush(page.LSN) before overwriting persistent page
• FORCE AT COMMIT: force all log records up to the commit LSN at commit
• Compensation Logging: Invalidate undone log record with a compensating log record.
• Idempotence via LSN: page LSN makes REDO idempotent

Two Phase Commit
• Getting two or more logs to agree
• Getting two or more RMs to agree
• Atomically and durably
• Even in case one of them fails and restarts.
• The TM phases
  • Prepare. Invoke each joined RM asking for its vote.
  • Decide. If all vote yes, durably write commit log record.
  • Commit. Invoke each joined RM, telling it commit decision.
  • Complete. Write commit completion when all RMs ACK.

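The four TM phases above can be sketched in C as follows. This is a simplified single-process model, not the slides' implementation: `sim_rm`, `tx_outcome`, and `two_phase_commit` are hypothetical names, each RM's vote is a stored flag rather than a callback, and the durable commit record is modeled as a counter of log forces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical resource manager: a fixed vote plus counters for the
   phase-2 calls it receives. */
typedef struct {
    bool vote_yes;   /* answer this RM gives at prepare */
    int  commits;    /* times its Commit() callback was invoked */
    int  aborts;     /* times its Abort() callback was invoked */
} sim_rm;

typedef enum { TX_COMMITTED, TX_ABORTED } tx_outcome;

tx_outcome two_phase_commit(sim_rm *rms, size_t n, int *log_forces)
{
    bool all_yes = true;

    assert(log_forces != NULL);

    /* Prepare: ask each joined RM for its vote; one "no" decides abort. */
    for (size_t i = 0; i < n; i++) {
        if (!rms[i].vote_yes) {
            all_yes = false;
            break;
        }
    }

    /* Decide: the transaction commits at the instant the commit record
       is durable. Modeled here as one forced log write. */
    if (all_yes)
        (*log_forces)++;

    /* Commit (or abort): tell each joined RM the decision. */
    for (size_t i = 0; i < n; i++) {
        if (all_yes)
            rms[i].commits++;
        else
            rms[i].aborts++;
    }

    /* Complete: a completion record would be written once all RMs ACK;
       it needs no force and is omitted from this sketch. */
    return all_yes ? TX_COMMITTED : TX_ABORTED;
}
```

In this model the abort path forces nothing to the log. Timeouts, restart, and the undo scan that precedes the Abort() callbacks are all outside the sketch.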
Centralized Case of Two Phase Commit
• Each participant (TM & RM) goes through a sequence of states
[Figure: state diagram with states Null, Active, Prepared, Committing, Committed, Aborting, Aborted]

Transitions in Case of Restart
• Active state not persistent, others are persistent.
• For both TM and RM.
• Log records make them persistent (redo).
• TM tries to drive states to the right (to committed, aborted).
[Figure: state diagram with states Null, Active, Prepared, Committing, Committed, Aborting, Aborted]

Successful Two Phase Commit
• Call flow from TM to each RM joined to transaction
• If TM and RM share the same log,
  • the RM FORCE can piggyback on the TM FORCE
  • One IO to commit a transaction (less if commit is grouped)

Abort Two Phase Commit
• If an RM sends "NO" or gives no response (timeout), TM starts abort.
• Calls UNDO of each trans log record
• May stop at a savepoint.
• When the undo scan reaches begin_trans, it calls the ABORT() callback of each joined RM

Full Transaction State Diagram

CHECKPOINTING
• Commit consistent checkpoints
  • Stop admitting new transactions and wait until all active transactions complete (abort or commit)
  • Flush all dirty cache slots
  • Write checkpoint record to log
• During recovery, begin forward scan at last checkpoint record.
  • At the last checkpoint, every element in the DB contained its last committed value.
  • If an element does not contain its last committed value, it must have been updated after the checkpoint.

Fuzzy Checkpointing
• Commit consistent checkpoint has two drawbacks:
  • A lot of disk I/O is needed (1000 pages @ 5 ms/page = 5 sec)
  • Must wait until all active transactions terminate (~2 sec.)
• Protocol
  • Stop processing new operations. Wait until all active ones complete.
  • Flush every cache slot that has not been flushed since the last checkpoint (stable-LSN < checkpoint-LSN).
  • Update stable-LSN of all buffers flushed.
  • Write checkpoint record including:
    • Active transaction list
    • List of data items and stable-LSNs of all dirty slots

Restart Algorithm
• Locate penultimate checkpoint (the checkpoint preceding the last one)
• Add all transactions in checkpoint record to active transaction list.
• Forward scan of log (starting at penultimate checkpoint)
  • Call rm_redo() for each log record
  • On BEGIN_TRANSACTION log record, add transaction to active transaction list.
  • On COMMIT or ABORT log record, remove transaction from active transaction list.

Restart Algorithm (Cont.)
• For each transaction in active transaction list
  • Retrieve the transaction’s log records in reverse order (last record of transaction retrieved first)
  • For each log record, call rm_undo()

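The two restart passes (forward redo scan, then undo of the transactions still in the active list) can be sketched in C over an in-memory log. This is a simplified model under stated assumptions: `log_rec`, `rec_kind`, `restart`, and `MAX_TX` are hypothetical names, the checkpoint record's transaction list is treated as empty, updates carry full before and after values, and an aborted transaction is assumed to have logged its undo steps as ordinary (compensation) updates before its ABORT record.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_TX 8

enum rec_kind { REC_BEGIN, REC_UPDATE, REC_COMMIT, REC_ABORT };

typedef struct {
    enum rec_kind kind;
    int trid;            /* transaction id, 0 .. MAX_TX-1 */
    int item;            /* index into db[] (updates only) */
    int before;          /* old value (updates only) */
    int after;           /* new value (updates only) */
} log_rec;

/* Recover db[] from log[ckpt .. n-1]. */
void restart(const log_rec *log, size_t n, size_t ckpt, int *db)
{
    bool active[MAX_TX] = { false };   /* a real checkpoint record would seed this */

    /* Pass 1: forward scan, redo every update, track active transactions. */
    for (size_t i = ckpt; i < n; i++) {
        const log_rec *r = &log[i];
        assert(r->trid >= 0 && r->trid < MAX_TX);
        switch (r->kind) {
        case REC_BEGIN:  active[r->trid] = true;  break;
        case REC_UPDATE: db[r->item] = r->after;  break;
        case REC_COMMIT:
        case REC_ABORT:  active[r->trid] = false; break;
        }
    }

    /* Pass 2: backward scan, undo updates of transactions still active. */
    for (size_t i = n; i-- > ckpt; ) {
        const log_rec *r = &log[i];
        if (r->kind == REC_UPDATE && active[r->trid])
            db[r->item] = r->before;
    }
}
```

The slide retrieves each loser's records separately in reverse order; the sketch uses one backward scan over the whole log instead, which visits each loser's records in the same relative order.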
Simple Recovery Method (Bernstein)
• Restart Algorithm
  • redone = undone = ∅
  • Scan log from last record to first. For each log record [Ti, x, vbefore, vafter] do
    • If x ∉ (redone ∪ undone) then
      • If Ti is committed then
        • restore x’s cache slot to vafter
        • redone = redone ∪ {x}
      • Otherwise
        • restore x’s cache slot to vbefore
        • undone = undone ∪ {x}

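The backward scan above can be sketched in C. The names `update_rec`, `simple_restart`, `N_ITEMS`, and `N_TX` are hypothetical, data items are reduced to integers, and the set of committed transactions is assumed to be known already (for instance from an earlier pass over the commit records).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define N_ITEMS 4
#define N_TX    4

/* Log record [Ti, x, vbefore, vafter]. */
typedef struct {
    int trid;     /* Ti */
    int item;     /* x  */
    int before;   /* vbefore */
    int after;    /* vafter  */
} update_rec;

/* Restore every item in cache[] to its last committed value. */
void simple_restart(const update_rec *log, size_t n,
                    const bool *committed, int *cache)
{
    bool redone[N_ITEMS] = { false };
    bool undone[N_ITEMS] = { false };

    /* Scan from the last log record to the first. */
    for (size_t i = n; i-- > 0; ) {
        const update_rec *r = &log[i];
        assert(r->item >= 0 && r->item < N_ITEMS);
        assert(r->trid >= 0 && r->trid < N_TX);

        if (redone[r->item] || undone[r->item])
            continue;                       /* x already restored */

        if (committed[r->trid]) {
            cache[r->item] = r->after;      /* last writer committed */
            redone[r->item] = true;
        } else {
            cache[r->item] = r->before;     /* last writer aborted or active */
            undone[r->item] = true;
        }
    }
}
```

Only the most recent record for each item has any effect, which is correct only under the strictness assumption the method relies on.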
Simple Recovery
• Assumptions
  • Strict 2PL - locking at granularity of page
  • Before- and after-images are complete pages
• Every element x is restored to its last committed value by Restart.
  • If the last update to x is by a committed transaction, the value it wrote is restored and no further changes are made. (x ∈ redone)
  • If the last update to x is by an aborted or active transaction Ti, the before image of x wrt Ti is restored and no further change to x is made. (x ∈ undone)
  • Because histories are strict, this value was written by the last transaction to commit and write x.

Record Level Locking
• History is not strict with respect to pages anymore. It is, however, strict with respect to individual tuples.
• A single page LSN is not enough. Consider the following example:
  • r1, r2, and r3 are records on the same page.
    Log:       wi[r1]  ci  wj[r2]  wk[r3]  ck  aj
    Page LSN:  1           3       4           ?
  • The abort of Tj is processed by restoring the before image of r2 in the page.
  • What should we do to the LSN?
    • If we leave it at 4 and a system failure occurs, what will happen?
    • If we set it to 3, what happens?