LG Electronics Seminar, Mar 18, 2014. Paper presented at USENIX FAST ’14, Santa Clara, CA. Resolving Journaling of Journal Anomaly in Android I/O: Multi-Version B-tree with Lazy Split. Wook-Hee Kim 1, Beomseok Nam 1, Dongil Park 2, Youjip Won 2
Affiliations: 1 Ulsan National Institute of Science and Technology, Korea; 2 Hanyang University, Korea
Outline • Motivation • Journaling of Journal Anomaly • How to resolve Journaling of Journal anomaly • Multi-Version B-Tree (MVBT) • Optimizations of Multi-Version B-Tree • Lazy Split • Reserved Buffer Space • Lazy Garbage Collection • Metadata Embedding • Disabling Sibling Redistribution • Evaluation • Demo
Storage in Android platform: a performance bottleneck and a major cause of device breakdown. Cause = excessive I/O.
Android I/O Stack: Insert/Update/Delete/Select → SQLite → Read/Write → EXT4 → Read/Write → Block Device Driver. Misaligned interaction between the layers.
SQLite insert (Truncate mode): one insert issues O W fsync W fsync W T. O = open rollback journal; W = record the data to journal; fsync; W = put commit mark to journal; fsync; W = insert entry to DB; T = truncate journal.
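The truncate mode named on this slide can be selected from Python's standard sqlite3 module. The following is a minimal sketch, assuming a scratch database file in a temporary directory; the table name kv and the 100-character payload are hypothetical.

```python
import os
import sqlite3
import tempfile

# Hypothetical scratch database. Rollback-journal modes such as TRUNCATE
# need an on-disk file; an in-memory database cannot use them.
db_path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = sqlite3.connect(db_path)

# Ask SQLite to truncate the rollback journal at commit instead of deleting it.
journal_mode = conn.execute("PRAGMA journal_mode=TRUNCATE").fetchone()[0]

conn.execute("CREATE TABLE kv (k INTEGER PRIMARY KEY, v TEXT)")

# One write transaction: per the slide, SQLite journals the old data,
# writes the commit mark, updates the database, then truncates the journal.
with conn:
    conn.execute("INSERT INTO kv VALUES (?, ?)", (10, "x" * 100))

row_count = conn.execute("SELECT COUNT(*) FROM kv").fetchone()[0]
conn.close()
```

The PRAGMA returns the journal mode that is actually in effect, so checking its result is a cheap way to confirm the setting took hold.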
EXT4 Journaling (ordered mode): each SQLite write() (W) becomes D J M C in EXT4. D = write data block; J = write EXT4 journal; M = write journal metadata; C = write journal commit.
Journaling of Journal Anomaly: SQLite issues O W fsync W fsync W for one insert, and EXT4 turns this into D J M C, D, D J M C. One insert of 100 bytes → 9 random writes of 4 KB. (D = write data block; J = write EXT4 journal; M = write journal metadata; C = write journal commit.)
How to Resolve Journaling of Journal? The SQLite writes fall into two groups, database journaling and database insertion, and EXT4 journals both.
How to Resolve Journaling of Journal? Reduce the number of fsync() calls by merging database journaling and database insertion. Journaling + Database = Version-based B-Tree (MVBT).
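The write amplification implied by this slide can be worked out directly. A small sketch, assuming 4 KB means 4096 bytes:

```python
PAGE_SIZE = 4 * 1024       # one 4 KB block write (assumed to be 4096 bytes)
WRITES_PER_INSERT = 9      # D J M C + D + D J M C, as counted on the slide
PAYLOAD_BYTES = 100        # size of the inserted record

bytes_written = WRITES_PER_INSERT * PAGE_SIZE
amplification = bytes_written / PAYLOAD_BYTES

print(bytes_written, round(amplification, 1))  # 36864 368.6
```

Under that assumption, a 100-byte insert costs roughly 369 times its own size in block writes.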
Version-based B-Tree (MVBT): at time T1, insert 10 → 10 [T1, ∞). At time T2, update 10 with 20 → 10 [T1, T2) and 20 [T2, ∞). Old-version data is not overwritten, so no rollback journal is needed.
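The insert/update example on this slide can be sketched as a toy leaf that stores a version range with every key. The class and method names below are hypothetical, and a real MVBT node also handles ordering, capacity, and splits.

```python
import math

INF = math.inf  # open-ended version range, written ∞ on the slide


class VersionedLeaf:
    """Toy leaf holding [key, start_version, end_version] entries."""

    def __init__(self):
        self.entries = []

    def insert(self, key, version):
        # A new entry is live from `version` onward.
        self.entries.append([key, version, INF])

    def update(self, old_key, new_key, version):
        # Close the live entry instead of overwriting it ...
        for entry in self.entries:
            if entry[0] == old_key and entry[2] == INF:
                entry[2] = version
        # ... and add the new value as a fresh live entry.
        self.insert(new_key, version)

    def visible(self, version):
        # An entry [start, end) is visible to a reader at `version`
        # when start <= version < end.
        return sorted(k for k, s, e in self.entries if s <= version < e)


leaf = VersionedLeaf()
leaf.insert(10, version=1)       # T1: 10 [T1, ∞)
leaf.update(10, 20, version=2)   # T2: 10 [T1, T2) and 20 [T2, ∞)
```

A reader at T1 still sees 10 while a reader at T2 sees 20, which is why the old version can stand in for a rollback journal.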
Node Split in Multi-Version B-Tree (step 1): insert Key = 25, Version = 5 into a full Page 1. Page 1: 5 [3~∞), 10 [2~∞), 12 [4~∞), 40 [1~∞).
Node Split in Multi-Version B-Tree (step 2): Page 1 becomes a dead node; every entry is closed at version 5. Page 1: 5 [3~5), 10 [2~5), 12 [4~5), 40 [1~5).
Node Split in Multi-Version B-Tree (step 3): the live entries are copied into two new pages starting at version 5. Page 3: 5 [5~∞), 10 [5~∞). Page 2: 12 [5~∞), 40 [5~∞).
Node Split in Multi-Version B-Tree (step 4): key 25 goes into Page 2, and a parent, Page 4, indexes all three pages. Page 4: ∞ [0~5) : Page 1, 10 [5~∞) : Page 3, ∞ [5~∞) : Page 2. Page 2: 12 [5~∞), 25 [5~∞), 40 [5~∞).
Node Split in Multi-Version B-Tree (step 5): Pages 1, 2, 3, and 4 are all dirty. One more dirty page than the original B-Tree.
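The split walked through on these slides can be sketched with plain tuples. This is an illustrative model rather than the authors' implementation; the function name legacy_split and the (key, start, end) tuple layout are assumptions.

```python
import math

INF = math.inf


def legacy_split(node, new_key, version):
    """Return (dead_node, left, right) for an MVBT-style split."""
    # Close every live entry at `version`: the old node becomes dead.
    dead = [(k, s, version if e == INF else e) for k, s, e in node]
    # Copy the live keys plus the new key into two fresh nodes.
    keys = sorted([k for k, s, e in node if e == INF] + [new_key])
    mid = len(keys) // 2
    left = [(k, version, INF) for k in keys[:mid]]
    right = [(k, version, INF) for k in keys[mid:]]
    return dead, left, right


page1 = [(5, 3, INF), (10, 2, INF), (12, 4, INF), (40, 1, INF)]
dead, page3, page2 = legacy_split(page1, new_key=25, version=5)

# The dead node, both new nodes, and the parent must all be flushed.
dirty_pages = len([dead, page3, page2]) + 1
```

With the slide's numbers this yields the same three leaves and the four dirty pages counted in the last step.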
I/O Traffic in MVBT: fsync() flushes the dirty pages in the DB buffer cache. Goal: reduce the number of dirty pages!
I/O Traffic in MVBT vs. LS-MVBT: dirty pages flushed from the DB buffer cache by fsync(), MVBT compared with LS-MVBT.
Optimizations in Android I/O: tread >> twrite (Read vs. Write). Write transactions (Write Transaction 1, 2, 3) run one after another. Multiple-insert transaction vs. single-insert transaction.
Lazy Split Multi-Version B-Tree (LS-MVBT) Optimizations • Lazy Split • Reserved Buffer Space • Lazy Garbage Collection • Metadata Embedding • Disabling Sibling Redistribution
Optimization 1: Lazy Split • Legacy split in MVBT: 4 dirty pages. Page 4 (parent): ∞ [0~5) : Page 1, 10 [5~∞) : Page 3, ∞ [5~∞) : Page 2. Page 1 (dead): 5 [3~5), 10 [2~5), 12 [4~5), 40 [1~5). Page 3: 5 [5~∞), 10 [5~∞). Page 2: 12 [5~∞), 25 [5~∞), 40 [5~∞).
Optimization 1: Lazy Split (step 1): insert Key = 25, Version = 5 into a full Page 1. Page 1: 5 [3~∞), 10 [2~∞), 12 [4~∞), 40 [1~∞).
Optimization 1: Lazy Split (step 2): Page 1 becomes a lazy node, half-dead and half-live. Live: 5 [3~∞), 10 [2~∞). Dead: 12 [4~5), 40 [1~5).
Optimization 1: Lazy Split (step 3): only the upper half is copied to a new page. Page 2: 12 [5~∞), 40 [5~∞).
Optimization 1: Lazy Split (step 4): key 25 goes into Page 2, and the parent, Page 3, points at Page 1 twice. Page 3: 10 [5~∞) : Page 1, ∞ [0~5) : Page 1, ∞ [5~∞) : Page 2. Page 1: 5 [3~∞), 10 [2~∞), 12 [4~5), 40 [1~5). Page 2: 12 [5~∞), 25 [5~∞), 40 [5~∞).
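The lazy split shown on these slides can be sketched the same way. This is a simplified model, not the authors' code: lazy_split and the tuple layout are hypothetical, and the sketch only covers the slide's case where the new key belongs in the new node.

```python
import math

INF = math.inf


def lazy_split(node, new_key, version):
    """Return (lazy_node, new_node) for an LS-MVBT-style split."""
    live = sorted(k for k, s, e in node if e == INF)
    mid = len(live) // 2
    moved = set(live[mid:])  # upper half of the live keys
    if new_key < live[mid]:
        # Assumption: a key that belongs in the lower half is out of scope.
        raise NotImplementedError("sketch covers keys landing in the new node")
    # Lower half stays live in place; upper half is marked dead in place.
    lazy = [(k, s, version if (e == INF and k in moved) else e)
            for k, s, e in node]
    # Upper half plus the new key start a fresh node at `version`.
    new_node = sorted((k, version, INF) for k in moved | {new_key})
    return lazy, new_node


page1 = [(5, 3, INF), (10, 2, INF), (12, 4, INF), (40, 1, INF)]
lazy_page1, page2 = lazy_split(page1, new_key=25, version=5)

# Only the lazy node, the new node, and the parent are dirtied.
dirty_pages = len([lazy_page1, page2]) + 1
```

Because the lower half never moves, no third leaf is allocated, which is where the saved dirty page comes from.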
Optimization 1: Lazy Split • Lazy Node Overflow: insert Key = 8, Version = 6. Page 3 (parent): 10 [5~∞) : Page 1, ∞ [0~5) : Page 1, ∞ [5~∞) : Page 2. Page 1 (lazy node, full): 5 [3~∞), 10 [2~∞), and dead entries 12 [4~5), 40 [1~5). Page 2: 12 [5~∞), 25 [5~∞), 40 [5~∞).
Optimization 1: Lazy Split • Lazy Node Overflow → Garbage collect dead entries: 12 [4~5) and 40 [1~5) are removed from Page 1.
Optimization 1: Lazy Split • Lazy Node Overflow → Garbage collect dead entries. After garbage collection, Page 1: 5 [3~∞), 8 [6~∞), 10 [2~∞). But what if the dead entries are being accessed by other transactions?
Optimization 2: Reserved Buffer Space • Option 1. Wait for read transactions to finish. Insert Key = 8, Version = 6 is pending on Page 1: 5 [3~∞), 10 [2~∞), 12 [4~5), 40 [1~5).
Optimization 2: Reserved Buffer Space • Option 2. Split as in legacy MVBT split. Page 1 becomes dead: 5 [3~6), 10 [2~6), 12 [4~5), 40 [1~5). New Page 4: 5 [6~∞), 10 [6~∞).
Optimization 2: Reserved Buffer Space • Option 2. Split as in legacy MVBT split. Page 3 (parent): 10 [5~6) : Page 1, 10 [6~∞) : Page 4, ∞ [0~5) : Page 1, ∞ [5~∞) : Page 2. Page 4: 5 [6~∞), 8 [6~∞), 10 [6~∞). Page 1 (dead): 5 [3~6), 10 [2~6), 12 [4~5), 40 [1~5). Page 2: 12 [5~∞), 25 [5~∞), 40 [5~∞).
Optimization 2: Reserved Buffer Space • Option 3. Pad some buffer space in tree nodes. Buffer space is used when the lazy node is full. If the buffer space is also full, split as in legacy MVBT. Page 1: 5 [3~∞), 10 [2~∞), 12 [4~5), 40 [1~5), with 8 [6~∞) placed in the buffer space. Page 2: 12 [5~∞), 25 [5~∞), 40 [5~∞). Page 3 (parent): 10 [5~∞) : Page 1, ∞ [0~5) : Page 1, ∞ [5~∞) : Page 2.
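The overflow handling described across these slides can be sketched as one decision function. This is a rough model under stated assumptions: insert_into_lazy_node, capacity, reserve, and oldest_reader are hypothetical names, and oldest_reader stands for the lowest version any active read transaction still uses.

```python
import math

INF = math.inf


def insert_into_lazy_node(node, key, version, capacity, reserve, oldest_reader):
    """Insert in place if possible; return the action that was taken."""
    action = "insert"
    if len(node) >= capacity:
        # Dead entries whose range ended at or before the oldest reader's
        # version are invisible to every reader and can be reclaimed.
        survivors = [entry for entry in node if entry[2] > oldest_reader]
        if len(survivors) < len(node):
            node[:] = survivors
            action = "gc"
    if len(node) < capacity:
        node.append((key, version, INF))
        return action
    if len(node) < capacity + reserve:
        # Readers still need the dead entries: use the reserved space.
        node.append((key, version, INF))
        return "buffer"
    return "legacy_split"  # caller falls back to the legacy MVBT split


full = [(5, 3, INF), (10, 2, INF), (12, 4, 5), (40, 1, 5)]

no_readers = list(full)
first = insert_into_lazy_node(no_readers, 8, 6, capacity=4, reserve=1,
                              oldest_reader=6)

old_reader = list(full)
second = insert_into_lazy_node(old_reader, 8, 6, capacity=4, reserve=1,
                               oldest_reader=4)
third = insert_into_lazy_node(old_reader, 9, 7, capacity=4, reserve=1,
                              oldest_reader=4)
```

With no old readers the dead entries are collected on the spot; with a reader at version 4 the first insert lands in the buffer space and the next one forces a legacy split.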
Rollback in LS-MVBT • Similar to rollback in MVBT • The number of dirty nodes touched by rollback in LS-MVBT is also smaller than in MVBT. Example: Txn 5 crashes. Before rollback, Page 3 (parent): 10 [5~∞) : Page 1, ∞ [0~5) : Page 1, ∞ [5~∞) : Page 2. Page 1: 5 [3~∞), 10 [2~∞), 12 [4~5), 40 [1~5). Page 2: 12 [5~∞), 25 [5~∞), 40 [5~∞). After rollback, Page 3: ∞ [0~∞) : Page 1. Page 1: 5 [3~∞), 10 [2~∞), 12 [4~∞), 40 [1~∞).
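The before and after states on this slide are consistent with a simple rule: drop entries created by the crashed version and reopen entries it closed. A minimal sketch of that rule follows; it is inferred from the slide's example, and the function name rollback and the tuple layout are assumptions.

```python
import math

INF = math.inf


def rollback(node, crashed):
    """Undo the effects of transaction `crashed` on one node."""
    restored = []
    for key, start, end, *rest in node:
        if start == crashed:
            continue        # entry was created by the crashed transaction
        if end == crashed:
            end = INF       # entry was closed by the crashed transaction
        restored.append((key, start, end, *rest))
    return restored


page1 = [(5, 3, INF), (10, 2, INF), (12, 4, 5), (40, 1, 5)]
page2 = [(12, 5, INF), (25, 5, INF), (40, 5, INF)]
parent = [(10, 5, INF, "Page 1"), (INF, 0, 5, "Page 1"), (INF, 5, INF, "Page 2")]

page1_after = rollback(page1, crashed=5)
page2_after = rollback(page2, crashed=5)
parent_after = rollback(parent, crashed=5)
```

Page 2 ends up empty and the parent keeps a single pointer to Page 1, matching the state shown after the rollback.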
Optimization 3: Lazy Garbage Collection. Periodic garbage collection: Transaction 1 works on pages P1, P2, P3 in the transaction buffer cache.
Optimization 3: Lazy Garbage Collection. Periodic garbage collection: Transaction 1 is stopped while garbage collection runs. The GC buffer cache holds extra dirty pages (P1, P2, P3) that fsync() must flush.
Optimization 3: Lazy Garbage Collection. Lazy garbage collection: do not garbage collect if no space is needed. Dead entries are collected when Transaction 1 inserts into a page (P1) that it dirties anyway, so fsync() flushes no extra dirty page because of GC.
Optimization 4: Metadata Embedding. Version = “File Change Counter” in the DB header page. Write Transaction 6 increments the counter from 5 to 6 in the header (Page 1, dirty) and inserts 15 [6~∞) into Page N (dirty), next to 25 [5~∞) and 40 [1~∞). Each commit flushes 2 dirty pages. Read Transaction 5 runs concurrently.
Optimization 4: Metadata Embedding. Flush the “File Change Counter” to the last modified page and to RAMDISK. When Write Transaction 6 commits, the header (Page 1) still holds 5, while Page N and RAMDISK hold 6. Page N: 15 [6~∞), 25 [5~∞), 40 [1~∞). Read Transaction 5 runs concurrently.
Optimization 4: Metadata Embedding. Flush the “File Change Counter” to the last modified page and to RAMDISK. Only Page N is dirty at commit: 1 dirty page. RAMDISK is volatile, so after a CRASH its copy is lost, but the counter value 6 is also stored in Page N.
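The idea on these slides can be modeled with two dictionaries, one persistent and one volatile. This is a loose sketch under stated assumptions: the Store class, its method names, and page number 7 are hypothetical, and the recovery step (scanning pages for the largest embedded counter) is an assumption made for illustration, since the slides do not describe recovery.

```python
import math

INF = math.inf


class Store:
    def __init__(self):
        self.disk = {}      # page number -> page contents (persistent)
        self.ramdisk = {}   # volatile copy of the counter

    def commit(self, page_no, entry, version):
        page = self.disk.setdefault(page_no, {"entries": [], "counter": 0})
        page["entries"].append(entry)
        page["counter"] = version          # embedded in the page being written
        self.ramdisk["counter"] = version  # fast lookup for readers
        return 1                           # dirty pages flushed: header untouched

    def crash(self):
        self.ramdisk.clear()               # RAMDISK content does not survive

    def recover(self):
        # Assumption: rebuild the counter from the pages themselves.
        latest = max((p["counter"] for p in self.disk.values()), default=0)
        self.ramdisk["counter"] = latest
        return latest


store = Store()
flushed = store.commit(page_no=7, entry=(15, 6, INF), version=6)
store.crash()
lost = "counter" not in store.ramdisk
recovered = store.recover()
```

The point of the sketch is only that the commit touches a single page while the counter remains recoverable after the volatile copy disappears.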
Optimization 5: Disable Sibling Redistribution • Sibling redistribution hurts insertion performance • Disable sibling redistribution → avoid dirtying 4 pages. Example: with sibling redistribution, insertion of key 20 dirties 4 pages (the parent, Page 5, and leaf Pages 2, 3, and 4).
Optimization 5: Disable Sibling Redistribution • Disabled sibling redistribution. Example: insertion of key 20 dirties only 3 pages.
Summary: LS-MVBT Optimizations • Lazy Split: avoid dirtying an extra page when split occurs • Reserved Buffer Space: reduce the probability of node split • Disabling Sibling Redistribution: do not touch siblings to make search faster • Lazy Garbage Collection: delete dead entries on the next mutation • Metadata Embedding: avoid dirtying header page
Performance Evaluation • Implemented LS-MVBT in SQLite 3.7.12 • Testbed: Samsung Galaxy-S4 (JellyBean 4.3) • Exynos 5 Octa 5410 Quad Core 1.6GHz • 2GB Memory • 16GB eMMC • Performance Analysis Tools • Mobibench • MOST
Analysis of insert Performance: SQLite Insert (Response Time). [Figure: response time split into fsync() and B-tree computation for LS-MVBT, MVBT, and WAL; x-axis: WAL checkpointing interval.] LS-MVBT: 30~78% faster than WAL mode.
Analysis of insert Performance: SQLite Insert (# of Dirty Pages per Insert). LS-MVBT flushes only one dirty page.
I/O Traffic at Block Device Level: SQLite Insert (I/O Traffic per 10 msec). [Figure: 1000 insertions with LS-MVBT vs. 1000 insertions with WAL.] LS-MVBT saves 67% I/O traffic: 9.9 MB (LS-MVBT) vs 31 MB (WAL). x3 longer lifetime of NAND flash.
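The two headline numbers on this slide can be cross-checked from the reported totals. A quick calculation, assuming flash lifetime scales inversely with the volume written (an assumption, not a claim from the slide):

```python
ls_mvbt_mb = 9.9   # total traffic for 1000 insertions with LS-MVBT
wal_mb = 31.0      # total traffic for 1000 insertions with WAL

saving = 1 - ls_mvbt_mb / wal_mb       # fraction of traffic avoided
lifetime_ratio = wal_mb / ls_mvbt_mb   # how many times less is written

print(round(saving, 2), round(lifetime_ratio, 1))  # 0.68 3.1
```

The rounded totals give about 68%, close to the 67% on the slide, which presumably comes from unrounded measurements, and a write-volume ratio of about 3.1, in line with the x3 lifetime estimate.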
How about Search performance? • LS-MVBT is optimized for write performance.
Search Performance: Transaction Throughput with varying Search/Insert ratio. LS-MVBT wins unless more than 93% of transactions are searches.