Design Patterns for Tunable and Efficient SSD-based Indexes Ashok Anand, Aaron Gember-Jacobson, Collin Engstrom, Aditya Akella
Large hash-based indexes • ≈20K lookups and inserts per second (1 Gbps link) • ≥32GB hash table • WAN optimizers [Anand et al. SIGCOMM ’08] • Video proxy [Anand et al. HotNets ’12] • De-duplication systems [Quinlan et al. FAST ’02]
Use of large hash-based indexes • WAN optimizers • Video proxy • De-duplication systems • Where to store the indexes?
Where to store the indexes? • SSD (figure: cost/performance comparison of storage options, annotated “8x less” and “25x less”)
What’s the problem? • Need domain/workload-specific optimizations for an SSD-based index with ↑ performance and ↓ overhead (false assumption!) • Existing designs have… • Poor flexibility – target a specific point in the cost-performance spectrum • Poor generality – only apply to specific workloads or data structures
Our contributions • Design patterns that ensure: • High performance • Flexibility • Generality • Indexes based on these principles: • SliceHash • SliceBloom • SliceLSH
Outline • Problem statement • Limitations of state-of-the-art • SSD architecture • Parallelism-friendly design patterns • SliceHash (streaming hash table) • Evaluation
State-of-the-art SSD-based index • BufferHash [Anand et al. NSDI ’10] • Designed for high throughput • Inserts go to an in-memory incarnation; when full, it is flushed to flash as a new incarnation, with a Bloom filter kept in memory per incarnation • ≈4 bytes of memory per K/V pair! • 16 page reads in worst case! (average: ≈1)
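As a rough illustration of that lookup path (a minimal sketch, not the authors' code; the set/dict stand-ins and the page_of helper are assumptions), a BufferHash-style lookup checks the in-memory incarnation first, then probes each flushed incarnation whose Bloom filter reports a possible hit, paying one flash page read per positive:

    # Sketch of a BufferHash-style lookup; a Python set stands in for each
    # incarnation's Bloom filter, a dict keyed by page number for its flash pages.
    PAGE_SLOTS = 4  # hypothetical number of hash slots per flash page

    def page_of(key, num_slots):
        """Flash page holding this key's hash slot."""
        return (hash(key) % num_slots) // PAGE_SLOTS

    def bufferhash_lookup(key, in_memory_table, incarnations, num_slots):
        """incarnations: newest-to-oldest list of (bloom_set, pages) pairs."""
        if key in in_memory_table:              # in-memory incarnation: no flash I/O
            return in_memory_table[key]
        for bloom_set, pages in incarnations:   # worst case: one read per incarnation
            if key in bloom_set:                # may be a false positive
                page = pages.get(page_of(key, num_slots), {})   # one page read
                if key in page:
                    return page[key]
        return None

With 16 incarnations and unlucky false positives, the loop issues 16 page reads, which is the worst case quoted above.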
State-of-the-art SSD-based index • SILT [Lim et al. SOSP ‘11] • Designed for low memory + high throughput • Combines a log store, hash stores, and a sorted store, each indexed in memory • ≈0.7 bytes of memory per K/V pair • 33 page reads in worst case! (average: 1) • High CPU usage! • Takeaway: existing designs target specific workloads and objectives → poor flexibility and generality, and do not leverage internal parallelism
SSD architecture • An SSD controller drives many flash memory packages over multiple channels (here, 32 channels and 128 packages, 4 per channel) • Each package contains dies; each die contains planes with a data register; each plane contains blocks; each block contains pages • How does the SSD architecture inform our design patterns?
Four design principles • I. Store related entries on the same page • II. Write to the SSD at block granularity • III. Issue large reads and large writes • IV. Spread small reads across channels
I. Store related entries on the same page • Many hash table incarnations, like BufferHash • Each flash page holds sequential slots from a specific incarnation, so a key’s slot must be checked in every incarnation • Multiple page reads per lookup!
I. Store related entries on the same page • Many hash table incarnations, like BufferHash • Slicing: store the same hash slot from all incarnations on the same page • A slice holds a specific slot from all incarnations • Only 1 page read per lookup!
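A minimal sketch of why slicing turns a lookup into a single page read (the constants, the read_page callback, and the data layout are assumptions for illustration, not the paper's implementation):

    # Sketch of a sliced lookup. A page is modeled as a list of slices; each slice
    # is a dict holding one hash slot's entries across all incarnations.
    NUM_INCARNATIONS = 32
    ENTRY_BYTES      = 16                      # 8B key + 8B value
    PAGE_BYTES       = 2048                    # assumed flash page size
    SLICES_PER_PAGE  = PAGE_BYTES // (NUM_INCARNATIONS * ENTRY_BYTES)   # = 4 here

    def sliced_lookup(key, read_page, num_slots):
        """read_page(page_no) -> list of SLICES_PER_PAGE slice dicts."""
        slot = hash(key) % num_slots
        page = read_page(slot // SLICES_PER_PAGE)     # the only flash read needed
        return page[slot % SLICES_PER_PAGE].get(key)  # all incarnations of this slot

Every incarnation's copy of the key's slot lives on that one page, so no further flash reads are required.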
II. Write to the SSD at block granularity • Insert into a hash table incarnation in RAM • Divide the hash table so that all slices of a SliceTable fit into one block
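One way to make that constraint concrete, as a back-of-envelope sketch (the block size and entry size below are assumptions, not values from the slide):

    # Sizing a SliceTable so all of its slices fill exactly one erase block.
    BLOCK_BYTES      = 128 * 1024              # assumed SSD erase-block size
    NUM_INCARNATIONS = 32
    ENTRY_BYTES      = 16                      # 8B key + 8B value

    slice_bytes = NUM_INCARNATIONS * ENTRY_BYTES         # one slot across all incarnations
    slots_per_slicetable = BLOCK_BYTES // slice_bytes     # = 256 slots per block here

    # Flushing the in-memory incarnation for those 256 slots rewrites exactly one
    # block, so every SSD write happens at block granularity.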
III. Issue large reads and large writes • Package parallelism: each flash package has its own data register, so page operations on different packages proceed in parallel • Channel parallelism: transfers on different channels proceed concurrently
III. Issue large reads and large writes • The SSD assigns consecutive chunks (4 pages / 8KB) of a block to different channels • Block-size reads and writes therefore exploit channel parallelism
III. Issue large reads and large writes • Read the entire SliceTable (one block) into RAM • Merge the in-memory incarnation’s entries into their slices • Write the entire SliceTable back onto the SSD as one block
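A sketch of that update path under the same assumed layout as above (read_block and write_block are hypothetical block-size I/O helpers, not API from the paper):

    # Sketch of a SliceTable flush. Each slice is a list of per-incarnation dicts
    # for one hash slot, ordered newest to oldest.
    def flush_slicetable(table_no, buffered, read_block, write_block):
        """buffered maps slot index -> {key: value} from the in-memory incarnation."""
        slicetable = read_block(table_no)               # one large (block-size) read
        for slot, slice_ in enumerate(slicetable):
            slice_.pop()                                # age out the oldest incarnation
            slice_.insert(0, buffered.get(slot, {}))    # buffered entries become newest
        write_block(table_no, slicetable)               # one large (block-size) write

Both SSD accesses are block-sized, matching principles II and III.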
IV. Spread small reads across channels • Recall: the SSD writes consecutive chunks (4 pages) of a block to different channels • Use existing techniques to reverse engineer the mapping [Chen et al. HPCA ‘11] • The SSD uses write-order mapping: channel for chunk i = i modulo (# channels)
IV. Spread small reads across channels • Estimate the channel for each lookup from its slot # and the chunk size: (slot # × pages per slot) modulo (# channels × pages per chunk) • Attempt to schedule 1 read per channel
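A sketch of that scheduler, as one interpretation of the formula above (the constants, and the extra step dividing by pages per chunk to pick a channel, are assumptions):

    # Channel-aware scheduling of small page reads (illustrative interpretation).
    NUM_CHANNELS    = 32
    PAGES_PER_CHUNK = 4                      # chunks striped across channels in write order
    PAGES_PER_SLOT  = 1                      # assumed pages read per slot lookup

    def estimated_channel(slot):
        stripe_offset = (slot * PAGES_PER_SLOT) % (NUM_CHANNELS * PAGES_PER_CHUNK)
        return stripe_offset // PAGES_PER_CHUNK    # channel holding that chunk

    def schedule_batch(pending_slots):
        """Issue at most one pending lookup per estimated channel."""
        batch, used = [], set()
        for slot in pending_slots:
            channel = estimated_channel(slot)
            if channel not in used:
                used.add(channel)
                batch.append(slot)
        return batch    # read these concurrently; the rest wait for the next round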
SliceHash summary • SliceTables stored on the SSD: each slice holds a specific slot from all incarnations, and all slices of a SliceTable fit in one block • One in-memory incarnation buffers inserts • The whole SliceTable is read and written when updating
Evaluation: throughput vs. overhead • Setup: 128GB Crucial M4 SSD, 2.26GHz 4-core machine, 8B keys and 8B values, 50% insert / 50% lookup workload • (Figure: comparison against prior designs, annotated ↑15%, ↑2.8x, ↑6.6x, ↓12%) • See paper for theoretical analysis
Evaluation: flexibility • Trade off memory for throughput (50% insert / 50% lookup workload) • Use multiple SSDs for even ↓ memory use and ↑ throughput
Evaluation: generality • Workload may change • Memory (bytes/entry): constantly low! • CPU utilization (%): decreasing!
Summary • Present design practices for low-cost and high-performance SSD-based indexes • Introduce slicing to co-locate related entries and leverage multiple levels of SSD parallelism • SliceHash achieves 69K lookups/sec (≈12% better than prior works), with consistently low memory (0.6 B/entry) and CPU (12%) overhead
Evaluation: theoretical analysis • Parameters: 16B key/value pairs, 80% table utilization, 32 incarnations, 4GB of memory, 128GB SSD, 0.31ms to read a block, 0.83ms to write a block, 0.15ms to read a page • SliceHash: memory overhead 0.6 B/entry; insert cost avg ≈5.7μs, worst 1.14ms; lookup cost avg & worst 0.15ms
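These figures can be sanity-checked from the listed parameters; the 128KB erase-block size used below is an assumption, not stated on the slide.

    % Memory overhead: 4GB of RAM indexing the entries held on the SSD.
    \frac{4\,\mathrm{GB}}{0.8 \times 128\,\mathrm{GB} / 16\,\mathrm{B}} \approx 0.6\ \mathrm{B/entry}
    % Worst-case insert triggers a SliceTable flush: one block read plus one block write.
    0.31\,\mathrm{ms} + 0.83\,\mathrm{ms} = 1.14\,\mathrm{ms}
    % Average insert amortizes that flush over the entries buffered per block
    % (assuming 128KB blocks), close to the quoted average of about 5.7 microseconds.
    \frac{1.14\,\mathrm{ms}}{0.8 \times 128\,\mathrm{KB} / (32 \times 16\,\mathrm{B})} \approx 5.6\,\mu\mathrm{s}
    % A lookup always reads exactly one page: 0.15 ms in both the average and worst case.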
Evaluation: theoretical analysis • BufferHash: memory overhead 4 B/entry; insert cost avg ≈0.2μs, worst 0.83ms; lookup cost avg ≈0.15ms, worst 4.8ms • SliceHash: memory overhead 0.6 B/entry; insert cost avg ≈5.7μs, worst 1.14ms; lookup cost avg & worst 0.15ms