Design Patterns for Tunable and Efficient SSD-based Indexes Ashok Anand, Aaron Gember-Jacobson, Collin Engstrom, Aditya Akella
Large hash-based indexes • ≈20K lookups and inserts per second (1 Gbps link) • ≥32GB hash table • WAN optimizers [Anand et al. SIGCOMM ’08] • Video proxy [Anand et al. HotNets ’12] • De-duplication systems [Quinlan et al. FAST ’02]
Use of large hash-based indexes • WAN optimizers • Video proxy • De-duplication systems • Where to store the indexes?
Where to store the indexes? • SSD (figure: cost/performance comparison of storage options, annotated “8x less” and “25x less”)
What’s the problem? • Need domain/workload-specific optimizations for an SSD-based index with ↑ performance and ↓ overhead (false assumption!) • Existing designs have… • Poor flexibility – target a specific point in the cost-performance spectrum • Poor generality – only apply to specific workloads or data structures
Our contributions • Design patterns that ensure: • High performance • Flexibility • Generality • Indexes based on these principles: • SliceHash • SliceBloom • SliceLSH
Outline • Problem statement • Limitations of state-of-the-art • SSD architecture • Parallelism-friendly design patterns • SliceHash (streaming hash table) • Evaluation
State-of-the-art SSD-based index • BufferHash [Anand et al. NSDI ’10] • Designed for high throughput • Inserts go to an in-memory incarnation; when full, it is flushed to flash as a new incarnation, with a Bloom filter kept in memory per incarnation • ≈4 bytes of memory per K/V pair! • 16 page reads in worst case! (average: ≈1)
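As a rough illustration of that lookup path (a minimal sketch, not the authors' code; the set/dict stand-ins and the page_of helper are assumptions), a BufferHash-style lookup checks the in-memory incarnation first, then probes each flushed incarnation whose Bloom filter reports a possible hit, paying one flash page read per positive:

    # Sketch of a BufferHash-style lookup; a Python set stands in for each
    # incarnation's Bloom filter, a dict keyed by page number for its flash pages.
    PAGE_SLOTS = 4  # hypothetical number of hash slots per flash page

    def page_of(key, num_slots):
        """Flash page holding this key's hash slot."""
        return (hash(key) % num_slots) // PAGE_SLOTS

    def bufferhash_lookup(key, in_memory_table, incarnations, num_slots):
        """incarnations: newest-to-oldest list of (bloom_set, pages) pairs."""
        if key in in_memory_table:              # in-memory incarnation: no flash I/O
            return in_memory_table[key]
        for bloom_set, pages in incarnations:   # worst case: one read per incarnation
            if key in bloom_set:                # may be a false positive
                page = pages.get(page_of(key, num_slots), {})   # one page read
                if key in page:
                    return page[key]
        return None

With 16 incarnations and unlucky false positives, the loop issues 16 page reads, which is the worst case quoted above.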
State-of-the-art SSD-based index • SILT [Lim et al. SOSP ‘11] • Designed for low memory + high throughput • Combines a log store, hash stores, and a sorted store, each indexed in memory • ≈0.7 bytes of memory per K/V pair • 33 page reads in worst case! (average: 1) • High CPU usage! • Takeaway: existing designs target specific workloads and objectives → poor flexibility and generality, and do not leverage internal parallelism
SSD architecture • An SSD controller drives many flash memory packages over multiple channels (here, 32 channels and 128 packages, 4 per channel) • Each package contains dies; each die contains planes with a data register; each plane contains blocks; each block contains pages • How does the SSD architecture inform our design patterns?
Four design principles • I. Store related entries on the same page • II. Write to the SSD at block granularity • III. Issue large reads and large writes • IV. Spread small reads across channels
I. Store related entries on the same page • Many hash table incarnations, like BufferHash • Each flash page holds sequential slots from a specific incarnation, so a key’s slot must be checked in every incarnation • Multiple page reads per lookup!
I. Store related entries on the same page • Many hash table incarnations, like BufferHash • Slicing: store the same hash slot from all incarnations on the same page • A slice holds a specific slot from all incarnations • Only 1 page read per lookup!
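A minimal sketch of why slicing turns a lookup into a single page read (the constants, the read_page callback, and the data layout are assumptions for illustration, not the paper's implementation):

    # Sketch of a sliced lookup. A page is modeled as a list of slices; each slice
    # is a dict holding one hash slot's entries across all incarnations.
    NUM_INCARNATIONS = 32
    ENTRY_BYTES      = 16                      # 8B key + 8B value
    PAGE_BYTES       = 2048                    # assumed flash page size
    SLICES_PER_PAGE  = PAGE_BYTES // (NUM_INCARNATIONS * ENTRY_BYTES)   # = 4 here

    def sliced_lookup(key, read_page, num_slots):
        """read_page(page_no) -> list of SLICES_PER_PAGE slice dicts."""
        slot = hash(key) % num_slots
        page = read_page(slot // SLICES_PER_PAGE)     # the only flash read needed
        return page[slot % SLICES_PER_PAGE].get(key)  # all incarnations of this slot

Every incarnation's copy of the key's slot lives on that one page, so no further flash reads are required.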
II. Write to the SSD at block granularity • Insert into a hash table incarnation in RAM • Divide the hash table so that all slices of a SliceTable fit into one block
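One way to make that constraint concrete, as a back-of-envelope sketch (the block size and entry size below are assumptions, not values from the slide):

    # Sizing a SliceTable so all of its slices fill exactly one erase block.
    BLOCK_BYTES      = 128 * 1024              # assumed SSD erase-block size
    NUM_INCARNATIONS = 32
    ENTRY_BYTES      = 16                      # 8B key + 8B value

    slice_bytes = NUM_INCARNATIONS * ENTRY_BYTES         # one slot across all incarnations
    slots_per_slicetable = BLOCK_BYTES // slice_bytes     # = 256 slots per block here

    # Flushing the in-memory incarnation for those 256 slots rewrites exactly one
    # block, so every SSD write happens at block granularity.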
III. Issue large reads and large writes • Package parallelism: each flash package has its own data register, so page operations on different packages proceed in parallel • Channel parallelism: transfers on different channels proceed concurrently
III. Issue large reads and large writes • The SSD assigns consecutive chunks (4 pages / 8KB) of a block to different channels • Block-size reads and writes therefore exploit channel parallelism
III. Issue large reads and large writes • Read the entire SliceTable (one block) into RAM • Merge the in-memory incarnation’s entries into their slices • Write the entire SliceTable back onto the SSD as one block
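A sketch of that update path under the same assumed layout as above (read_block and write_block are hypothetical block-size I/O helpers, not API from the paper):

    # Sketch of a SliceTable flush. Each slice is a list of per-incarnation dicts
    # for one hash slot, ordered newest to oldest.
    def flush_slicetable(table_no, buffered, read_block, write_block):
        """buffered maps slot index -> {key: value} from the in-memory incarnation."""
        slicetable = read_block(table_no)               # one large (block-size) read
        for slot, slice_ in enumerate(slicetable):
            slice_.pop()                                # age out the oldest incarnation
            slice_.insert(0, buffered.get(slot, {}))    # buffered entries become newest
        write_block(table_no, slicetable)               # one large (block-size) write

Both SSD accesses are block-sized, matching principles II and III.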
IV. Spread small reads across channels • Recall: the SSD writes consecutive chunks (4 pages) of a block to different channels • Use existing techniques to reverse engineer the mapping [Chen et al. HPCA ‘11] • The SSD uses write-order mapping: channel for chunk i = i modulo (# channels)
IV. Spread small reads across channels • Estimate the channel for each lookup from its slot # and the chunk size: (slot # × pages per slot) modulo (# channels × pages per chunk) • Attempt to schedule 1 read per channel
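A sketch of that scheduler, as one interpretation of the formula above (the constants, and the extra step dividing by pages per chunk to pick a channel, are assumptions):

    # Channel-aware scheduling of small page reads (illustrative interpretation).
    NUM_CHANNELS    = 32
    PAGES_PER_CHUNK = 4                      # chunks striped across channels in write order
    PAGES_PER_SLOT  = 1                      # assumed pages read per slot lookup

    def estimated_channel(slot):
        stripe_offset = (slot * PAGES_PER_SLOT) % (NUM_CHANNELS * PAGES_PER_CHUNK)
        return stripe_offset // PAGES_PER_CHUNK    # channel holding that chunk

    def schedule_batch(pending_slots):
        """Issue at most one pending lookup per estimated channel."""
        batch, used = [], set()
        for slot in pending_slots:
            channel = estimated_channel(slot)
            if channel not in used:
                used.add(channel)
                batch.append(slot)
        return batch    # read these concurrently; the rest wait for the next round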
SliceHash summary • SliceTables stored on the SSD: each slice holds a specific slot from all incarnations, and all slices of a SliceTable fit in one block • One in-memory incarnation buffers inserts • The whole SliceTable is read and written when updating
Evaluation: throughput vs. overhead • Setup: 128GB Crucial M4 SSD, 2.26GHz 4-core machine, 8B keys and 8B values, 50% insert / 50% lookup workload • (Figure: comparison against prior designs, annotated ↑15%, ↑2.8x, ↑6.6x, ↓12%) • See paper for theoretical analysis
Evaluation: flexibility • Trade off memory for throughput (50% insert / 50% lookup workload) • Use multiple SSDs for even ↓ memory use and ↑ throughput
Evaluation: generality • Workload may change • Memory (bytes/entry): constantly low! • CPU utilization (%): decreasing!
Summary • Present design practices for low-cost and high-performance SSD-based indexes • Introduce slicing to co-locate related entries and leverage multiple levels of SSD parallelism • SliceHash achieves 69K lookups/sec (≈12% better than prior works), with consistently low memory (0.6 B/entry) and CPU (12%) overhead
Evaluation: theoretical analysis • Parameters: 16B key/value pairs, 80% table utilization, 32 incarnations, 4GB of memory, 128GB SSD, 0.31ms to read a block, 0.83ms to write a block, 0.15ms to read a page • SliceHash: memory overhead 0.6 B/entry; insert cost avg ≈5.7μs, worst 1.14ms; lookup cost avg & worst 0.15ms
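These figures can be sanity-checked from the listed parameters; the 128KB erase-block size used below is an assumption, not stated on the slide.

    % Memory overhead: 4GB of RAM indexing the entries held on the SSD.
    \frac{4\,\mathrm{GB}}{0.8 \times 128\,\mathrm{GB} / 16\,\mathrm{B}} \approx 0.6\ \mathrm{B/entry}
    % Worst-case insert triggers a SliceTable flush: one block read plus one block write.
    0.31\,\mathrm{ms} + 0.83\,\mathrm{ms} = 1.14\,\mathrm{ms}
    % Average insert amortizes that flush over the entries buffered per block
    % (assuming 128KB blocks), close to the quoted average of about 5.7 microseconds.
    \frac{1.14\,\mathrm{ms}}{0.8 \times 128\,\mathrm{KB} / (32 \times 16\,\mathrm{B})} \approx 5.6\,\mu\mathrm{s}
    % A lookup always reads exactly one page: 0.15 ms in both the average and worst case.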
Evaluation: theoretical analysis • BufferHash: memory overhead 4 B/entry; insert cost avg ≈0.2μs, worst 0.83ms; lookup cost avg ≈0.15ms, worst 4.8ms • SliceHash: memory overhead 0.6 B/entry; insert cost avg ≈5.7μs, worst 1.14ms; lookup cost avg & worst 0.15ms