210 likes | 337 Views
MVCC on Flash Memory. Fan Yulei, Lab of WAMDM, School of Information, Renmin University of China, Beijing, China, 2009-06-13. Outline. Motivation. MVCC. Berkeley DB. PostgreSQL. Future work. Motivation. Characteristics Not In-Place Update. HDD. Flash. Motivation. Kinds of Lock.
E N D
MVCC on Flash Memory Fan Yulei, Lab of WAMDM, School of Information, Renmin University of China, Beijing, China, 2009-06-13
Outline Motivation MVCC Berkeley DB PostgreSQL Future work
Motivation • Characteristics • Not In-Place Update HDD Flash
Motivation Kinds of Lock 1st : Lock 2nd : Release Lock Log File & Data File Checkpoint: D & S 2PL Read Log file Undo & Redo Multiple Version Read Log file Undo & Redo Backup Database Hot-standby : mirrored media Directed Acycling Graph • CC • 2PL • MVCC • Conflict graph • Timestamp • Index CC • Recovery • Log • Transaction • Media Timestamp Ordering Index : B+-Tree Transaction MVCC Snapshot Isolation
MVCC Conflict cycle: t1,t2 • Monoversion Schedule • s = r1(x) w1(x) r2(x) w2(y) r1(y) w1(z) c1 c2 • s’ = r1(x) w1(x) r2(x) r1(y) w2(y) w1(z) c1 c2 • Multiversion Schedule & Monoversion Schedule • Multiversion Schedule • m = r1(x0) w1(x1) r2(x1) w2(y2) r1(y0) w1(z1) c1 c2 • h(ri(x))=wj(x) & h(wi(x))=wi(x): version function • Monoversion Schedule • m= r1(x0) w1(x1) r2(x1) w2(y2) r1(y2) w1(z1) c1 c2 • s = r1(x) w1(x) r2(x) w2(y) r1(y) w1(z) c1 c2 • Monoversion Schedule is a special case of Multiversion Schedule
MVCC • Traditional Conflict • s = w0(x) c0 w1(x) c1 r2(x) w2(y) c2 • m = w0(x0) c0 w1(x1) c1 r2(x0) w2(y2) c2 • View Equivalent • Reads-From Relationship • RF(m) := {(ti, x, tj) | rj(xi) ∈OP(m) & ti, tj ∈trans(m)} • View Equivalent • trans(m) = trans(m’) and RF(m) = RF(m’) • Example • m = w0(x0) w0 (y0) c0 r3(x0) w3(x3) c3 w1(x1) c1 r2(x1) w2(y2)c2 • m’ = w0(x0) w0 (y0) c0 w1(x1) c1 r2(x1) r3(x0) w2(y2)w3(x3) c3 c2
MVCC • Multiversion View Serializability • Serializable but not View Equivalent • m = w0(x0) w0(y0)c0 r1(x0) r1(y0) w1(x1) w1(y1)c1 r2(x0) r2(y1)c2 • s = w0(x) w0(y)c0 r1(x) r1(y) w1(x) w1(y)c1 r2(x) r2(y)c2 • MVSR • m’ is a serialized monoversion schedule • trans(m) = trans(m’) • m and m’ are view equivalent • Example • m = w0(x0) w0 (y0) c0 w1(x1) c1 r2(x1) r3(x0) w3(x3) c3 w2(y2)c2 • m’ = w0(x0) w0 (y0) c0 r3(x0) w3(x3) c3 w1(x1) c1 r2(x1) w2(y2)c2 • s = w0(x) w0 (y) c0 r3(x) w3(x) c3 w1(x) c1 r2(x) w2(y)c2
MVCC r2(x0)r2(y1) r2(x1)r2(y0) • Conflict Graph G(m) = (V , E) • V = trans(m) ; • E = {(ti, tj) | rj(xi) ∈OP(m) & ti, tj ∈trans(m)}} • m and m’ are View Equivalent => G(m) = G(m’) • Version Oder • m = w0(x0) w0 (y0) w0 (z0) c0 r1(x0) r2(x0) r2(z0) r3(z0) w1(y1) w2(x2) w3(y3) w3(z3) c1c2c3 r4(x2)r4(y3) r4(z3)c4 • Version Oder = {x0«x2, y0«y1«y3, z0«z3} • MVSG • MVSG = G(m) + Version Order • rk(xj) and wi(xi), k≠i≠j • If xi « xj then (ti, tj) ∈E; else (tk, ti) ∈E • M ∈ MVSR iff MVSG(m, «) have no cycle T1 T3 T4 T0 T2
MVCC • Multiversion Conflict • ri(xj) and wk(xk) and ri(xj) < wk(xk) • Multiversion Conflict Serializability • m’ is a serialized monoversion schedule • trans(m) = trans(m’) • Pair of operations with conflict: same ordering • Multiversion Conflict Graph • E={(ti, tk) | ri(xj) < wk(xk) } • M ∈ MVCR iff MSVG(m, «) have no cycle all MVSR MCSR VSR CSR
MVCC • Limit the number of version: k=2 • w0(x0) c0 r1(x0) w3(x3) c3 w1(x1) c1 r2(x1) w2(x2) c2 • w0(x0) c0 r1(x0) w1(x1) c1 r2(x1) w2(x2) c2 w3(x3) c3 • w0(x0) c0 r1(x0) w1(x1) c1 w3(x3) c3 r2(x3) w2(x2) c2 • w0(x0) c0 r2(x0) w2(x2) c2 r1(x2) w1(x1) c1 w3(x3) c3 • w0(x0) c0 r2(x0) w2(x2) c2 w3(x3) c3 r1(x3) w1(x1) c1 • w0(x0) c0 w3(x3) c3 r1(x3) w1(x1) c1 r2(x1) w2(x2) c2 • w0(x0) c0 w3(y3) c3 r2(x3) w2(x2) c2 r1(x2) w1(x1) c1 • K-version view serializability (kVSR): • Serializable • View equivalent • k newest/nearest version • Hierarchy Relationship x1,x2 x2,x3 x2,x3 x1,x3 x1,x3 x1,x2 x1,x2
MVCC • MVCC Protocol • MVTO (multiversion timestamp ordering) • MV2PL : 2VPL • three kinds of kinds: rl, wl, cl • MVSGT • ROMV • Read-only transaction
Berkeley DB a standalone utility • Five components • Deadlock detection • db_deadlock • DB_ENV->lock_detect, DB_ENV->set_lk_detect • Checkpoints • db_checkpoint • DB_ENV->txn_checkpoint • Database and log file archival • db_archive • DB_ENV->log_archive • Log file removal • db_archive • DB_ENV->log_archive • Recovery procedures • db_recover • DB_ENV->open one or more library interfaces
Berkeley DB • Transaction API • Transaction Subsystem and Related Methods Description • DB_ENV->txn_checkpoint, DB_ENV->txn_recover DB_ENV->txn_stat • DB_ENV->open DB_ENV->close DB_ENV->remove • Transaction Subsystem Configuration • DB_ENV->set_timeout DB_ENV->set_tx_max DB_ENV->set_tx_timestamp • Transaction Operations • DB_ENV->txn_begin DB_TXN->abort DB_TXN->commit DB_TXN->discard DB_TXN->id DB_TXN->prepare DB_TXN->set_name DB_TXN->set_timeout
Berkeley DB • 2PL In Berkeley DB • Locks are released • during DB_TXN->abort or DB_TXN->commit. • Guidelines: • If possible, use nested transactions to protect the parts of your transaction most likely to deadlock • Transaction limits • Transaction IDs: 31-bit unsigned integer (OX80000000) • Cursors: can not span more transactions, must be opened and closed within a single transaction • Multiple Threads of Control:
Berkeley DB • Several filesystem operations on Berkeley DB • Disk seek to database file, Database file read, Disk seek to log file, Log file write, Disk seek to update log file metadata, Log metadata write, Flush log file information to disk, Flush log file metadata to disk • Ways to increase transactional throughput • Berkeley DB software support group commit • Additional tuning parameters • Tune the size of the database cache • Put the database and the log files on different disks • Set the filesystem configuration • Upgrade your hardware • Turn on DB_TXN_WRITE_NOSYNC or DB_TXN_NOSYNC flags • ACI, but not D
PostgreSQL • PG: a sanpshot of data • Reading never blocks writing • Writing never blocks reading • Three undesirable phenomena • dirty reads, non-repeatable reads, phantom read • SQL Transaction Isolation Levels
PostgreSQL • Read Committed Isolation Level • the default isolation level • A SELECT query sees only data committed • The SELECT does see the effects of previous updates executed within this same transaction • Two successive SELECTs can see different data • Other transactions commit changes during executions • NOT adequate for many applications that do complex queries and updates • Serializable Isolation Level • This level emulates serial transaction execution.
PostgreSQL • Data consistency checks at the application level • Readers in PostgreSQL don't lock data • To ensure the current existence of a row and protect it against concurrent updates one must use SELECT FOR UPDATE or an appropriate LOCK TABLE statement. (SELECT FOR UPDATE locks just the returned rows against concurrent updates, while LOCK TABLE protects the whole table.) • Lock and Tables • Table-level Lock • Row-level : when rows are being updated • Lock and Index • Gist and R-tree : released after statement is done • Hash Index : released after page is processed • B-Tree : released immediately after each index tuple is fetched/inserted
Future work • Experiment • BDB & PG Code • Transaction on Flash Memory • Concurrency Control • MVCC • Recovery • Log