LightStore: Software-defined Network-attached Key-value Drives
Chanwoo Chung, Jinhyung Koo*, Junsu Im*, Arvind, and Sungjin Lee*
Massachusetts Institute of Technology (MIT), *Daegu Gyeongbuk Institute of Science & Technology (DGIST)
The 24th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2019), Providence, RI
Datacenter Storage Systems
• This talk presents a new storage architecture that is 2.0-5.2x more power-efficient and 2.3-8.1x more floor-space-efficient for flash storage
• Diagram: applications 1..N on application servers reach storage services (e.g., SQL, NFS, RADOS) over the datacenter network (e.g., 10G/40G Ethernet, InfiniBand) via SQL, file, or KV access; each storage node pairs Xeon CPUs and large DRAM with many SSDs
• Significant capital & operating cost!
How do we achieve cost reduction?
• One SSD per network port
• KV interface served by embedded-class storage nodes
• Adapters in application servers for interoperability
• Diagram: in the current architecture, Xeon-based storage nodes with large DRAM front many SSDs and become the bottleneck, since SSD bandwidth is 2~10 GB/s (more than 10GbE); LightStore instead attaches embedded KV nodes, one per SSD, directly to the datacenter network, with adapters in the application servers
Which KVS on embedded systems?
• Hash-based KVS (e.g., KV-SSDs mounted on a host: Samsung KV-SSD, KAML [Jin et al., HPCA 2017]): simple implementation, but unordered keys, limited RANGE & SCAN, random == sequential access, and unbounded tail latency
• LSM-tree-based KVS (LightStore's choice): a multi-level search tree with sorted keys, RANGE & SCAN, and fast sequential access (adapter-friendly); bounded tail latency and append-only batched writes (flash-friendly)
Performance on ARM?
• RocksDB on a 4-core ARM + Samsung 960 PRO SSD
• Excessive memcpy overhead
• High locking / context-switching overhead
• RocksDB runs on a filesystem – deep I/O stack
• Chart (S-: sequential, R-: random): 3.6x-4.2x slowdown, utilizing only 10% of the read bandwidth
Our plan: software optimization
• System optimization: (1) a specialized memory allocator to minimize data copies; (2) lock-free queues between threads instead of locks (see the sketch below)
• LSM-tree-specific optimization: (1) decouple keys from KV pairs (keytables) for faster compaction; (2) Bloom filters and caching for keytables
• Wrote the entire software from scratch to run it on an embedded system
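To make the "lock-free queues instead of locks" point concrete, here is a minimal single-producer/single-consumer ring buffer of the kind such a design could use. It is our own illustration under that SPSC assumption, with invented names, not LightStore's actual code.

```cpp
#include <array>
#include <atomic>
#include <cstddef>

// Illustrative SPSC (single-producer/single-consumer) lock-free queue:
// one thread only pushes, another only pops, and no mutex is needed.
template <typename T, std::size_t N>            // N must be a power of two
class SpscQueue {
    static_assert((N & (N - 1)) == 0, "N must be a power of two");
    std::array<T, N> buf_;
    std::atomic<std::size_t> head_{0};          // next slot the consumer reads
    std::atomic<std::size_t> tail_{0};          // next slot the producer writes

public:
    bool push(const T& v) {                     // producer thread only
        std::size_t t = tail_.load(std::memory_order_relaxed);
        if (t - head_.load(std::memory_order_acquire) == N) return false;  // full
        buf_[t & (N - 1)] = v;
        tail_.store(t + 1, std::memory_order_release);
        return true;
    }
    bool pop(T& out) {                          // consumer thread only
        std::size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire)) return false;      // empty
        out = buf_[h & (N - 1)];
        head_.store(h + 1, std::memory_order_release);
        return true;
    }
};
```

In a pipeline like LightStore's, a request-handler thread would push KV requests and an LSM-tree thread would pop them, avoiding lock contention and context switches between cores.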
SSD controller
• Typically, the Flash Translation Layer (FTL) requires an embedded-class multicore plus a few GB of DRAM
• We implemented the FTL in hardware (practical thanks to the LSM-tree's append-only writes), and
• Used the freed cores and DRAM to implement the key-value store & network manager
• Diagram: a conventional controller puts ARM cores (~1 GHz), >4 GB DRAM, vendor-specific accelerators, flash management, and NAND I/O controllers behind a host interface (SATA, PCIe); LightStore replaces the host interface with a network interface card (10Gb Ethernet) and moves flash management into the HW FTL
LightStore Overview
• Clients (datacenter applications) issue SQL, file-system, KVS, or block requests (e.g., INSERT, fwrite(), get(), read()); SQL, FS, and block adapters on the application servers translate them into KV requests (GET, SET, DELETE, …)
• KV requests are hashed to different nodes by the adapters with consistent hashing (sketched below)
• The LightStore cluster (storage pool) attaches directly to the datacenter network: each LightStore node is an SSD-sized drive with a NIC, a KV store, flash, and an expansion-card network
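As a rough sketch of the adapter-side routing just described (our own illustration with invented names, not the adapters' real code), a consistent-hashing ring maps each key to one LightStore node; adding or removing a node only remaps the keys on the affected arc of the ring.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <string>

// Minimal consistent-hashing ring: hash points on a ring map to node addresses.
class Ring {
    std::map<std::uint64_t, std::string> ring_;  // hash point -> node address
    static constexpr int kVnodes = 64;           // virtual nodes per physical node

public:
    void addNode(const std::string& node) {
        for (int i = 0; i < kVnodes; ++i)
            ring_[std::hash<std::string>{}(node + "#" + std::to_string(i))] = node;
    }
    // Route a key to the first node clockwise from its hash point.
    // Assumes at least one node has been added.
    const std::string& nodeFor(const std::string& key) const {
        auto it = ring_.lower_bound(std::hash<std::string>{}(key));
        if (it == ring_.end()) it = ring_.begin();   // wrap around the ring
        return it->second;
    }
};
```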
Outline: Introduction • LightStore Software • KVS Performance • LightStore HW FTL • Applications and Adapters • Conclusion
LightStore software
• LightStore-Engine (figure): the KV Protocol Server (threads #1-#2, KV request handler and KV reply handler) faces the datacenter network; lock-free queues connect it to the LSM-Tree Engine (threads #3-#4: memtable, LSM-tree manager, writer & compaction), which issues READ/WRITE/TRIM I/O completed by a poller thread (#5)
• A zero-copy memory allocator and a userspace Direct-IO Engine minimize copies; the Direct-IO Engine reaches the LightStore HW FTL through a thin kernel device driver (poll() and an interrupt handler)
LightStore software, continued
• The KV Protocol Server supports the RESP commands SET, MSET, GET, MGET, DELETE, and SCAN
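Because the protocol server accepts standard RESP commands, any Redis-style client can talk to a node. Below is a small sketch (the function name is ours) of how a client could encode one of the commands above into RESP wire format; sending the bytes over a TCP socket is omitted.

```cpp
#include <string>
#include <vector>

// Encode a command as a RESP array of bulk strings.
std::string respEncode(const std::vector<std::string>& args) {
    std::string out = "*" + std::to_string(args.size()) + "\r\n";
    for (const auto& a : args)
        out += "$" + std::to_string(a.size()) + "\r\n" + a + "\r\n";
    return out;
}

// Example: respEncode({"SET", "key1", "value1"}) yields
//   "*3\r\n$3\r\nSET\r\n$4\r\nkey1\r\n$6\r\nvalue1\r\n"
```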
LSM-tree Basics
• KV writes go to the in-memory memtable (L0), which is flushed to storage on a threshold or timer
• On storage, data lives in immutable Sorted String Tables (SSTs) of key-value pairs, organized into levels (L1, L2, …)
• Compaction merges multiple L1 SSTs and writes an L2 SST – a HUGE OVERHEAD (see the sketch below)
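A minimal sketch of the memtable-and-SST flow just described; this is our own simplification (in-memory only, invented names and threshold), not a real LSM-tree implementation, which would persist SSTs and maintain many levels.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// An "SST" here is just an immutable sorted run of key-value pairs.
using SST = std::vector<std::pair<std::string, std::string>>;

class LsmTree {
    std::map<std::string, std::string> memtable_;          // L0, kept in DRAM
    std::vector<SST> level1_;                               // flushed, sorted runs
    static constexpr std::size_t kFlushThreshold = 4096;    // entries, illustrative

public:
    void put(const std::string& k, const std::string& v) {
        memtable_[k] = v;
        if (memtable_.size() >= kFlushThreshold) flush();
    }

private:
    void flush() {                                          // memtable -> one L1 SST
        level1_.emplace_back(memtable_.begin(), memtable_.end());
        memtable_.clear();
    }
    // Compaction (not shown) would k-way merge several L1 SSTs into one L2 SST,
    // rewriting all of their key-value data -- the overhead noted on the slide.
};
```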
LightStore LSM-tree Engine
• Decouple keys and values [Lu et al., WiscKey, FAST 2016]: leveled keytables hold sorted keys, while values go to persistent value-tables
• Compaction operates on keys only; a Bloom filter per keytable and cached keytables in memory speed up lookups
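The idea can be sketched as follows: a keytable entry carries only a key and a pointer into a value-table, and a small Bloom filter per keytable lets a lookup skip keytables that cannot contain the key. This is our own illustration (invented names and sizes), not LightStore's on-flash layout.

```cpp
#include <bitset>
#include <cstdint>
#include <functional>
#include <string>

struct ValueLocation {      // where the value lives in the append-only value-table
    std::uint32_t segment;
    std::uint32_t offset;
    std::uint32_t length;
};

struct KeytableEntry {      // entries are kept sorted by key inside each keytable
    std::string key;
    ValueLocation loc;
};

// Tiny Bloom filter: consult it before searching a keytable on flash.
struct Bloom {
    std::bitset<1 << 16> bits;
    void add(const std::string& k) {
        for (int i = 0; i < 3; ++i)
            bits.set(std::hash<std::string>{}(k + char('a' + i)) % bits.size());
    }
    bool maybeContains(const std::string& k) const {
        for (int i = 0; i < 3; ++i)
            if (!bits.test(std::hash<std::string>{}(k + char('a' + i)) % bits.size()))
                return false;    // definitely absent: skip this keytable
        return true;             // possibly present: search the keytable
    }
};
```

Because compaction now merges only these small key entries, far less data is rewritten than when full key-value pairs are compacted.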
Outline: Introduction • LightStore Software • KVS Performance • LightStore HW FTL • Applications and Adapters • Conclusion
LightStore Prototype
• Each LightStore prototype node is implemented using a Xilinx ZCU102 evaluation board and a custom flash card
• Photo: the ZCU102 carries a Zynq UltraScale+ SoC (quad-core ARM Cortex-A53 with FPGA) and 4 GB DRAM; the custom flash card holds an Artix-7 FPGA, raw NAND flash chips (512 GB), and expansion-card connectors
KVS Experimental Setup • Clients and storage nodes are connected to the same 10GbE switch
Experimental KVS Workloads
• 5 synthetic workloads to evaluate KVS performance
• YCSB for real-world workloads (in the paper)
• A value size of 8 KB is used to match the flash page size
KVS Throughput (Local)
• Throughput seen locally, i.e., without the network
• Chart: x86 falls well short of the SSD read bandwidth (3.2 GB/s) and write bandwidth due to metadata fetching, tree traversal, and search overhead; compaction shows up in the mixed workload (10% writes)
• LightStore almost saturates its flash read and write bandwidth on both sequential and random workloads
KVS Throughput (Local), continued
• Same chart, adding a projected configuration: LightStore with flash as fast as the x86 SSD
KVS Throughput (Local), continued
• Further projected configurations: a 20% faster LightStore core, and LightStore with flash as fast as the x86 SSD
• LightStore almost saturates the device bandwidth
KVS Throughput (Network)
• Throughput seen by clients over the network
• Chart: x86 is bounded by its Ethernet bandwidth; LightStore almost saturates either its flash read/write bandwidth or its own Ethernet bandwidth, with network + compaction overhead visible on writes
KVS Throughput (Network), continued
• Given the same flash/NIC, LightStore outperforms x86!
• Projected configurations: LightStore with an x86-class NIC, with a 20%+ faster core, and with NIC/flash the same as x86
• LightStore almost saturates the device and network bandwidth
KVS Scalability
• Random reads (R-GET) as nodes are added
• Chart: LightStore throughput scales linearly with the number of nodes; x86-ARDB with 1 or 2 NICs flattens at the x86 Ethernet bandwidth – the network is the bottleneck (as expected)
KVS IOPS-per-Watt
• Assume that x86-ARDB scales to 4 SSDs, i.e., 4 times the performance seen previously
• Peak power: x86-ARDB – 400 W, LightStore prototype – 25 W
Outline: Introduction • LightStore Software • KVS Performance • LightStore HW FTL • Applications and Adapters • Conclusion
HW FTL and LSM-tree
• The LSM-tree's compaction always appends data, which is what makes a simple hardware FTL feasible
LightStore HW FTL: data structures
• Based on Application-Managed Flash (AMF) [FAST 2016]
• Segment mapping table: coarse-grained mapping translation
• Block management table: wear-leveling & bad-block management
• Each table is ~1 MB per 1 TB of flash; commercial SSD FTLs require >1 GB per 1 TB of flash
• Adds very small latency: 4 cycles (mapped) or 140 cycles (not mapped) – at most 0.7 us @ 200 MHz (<1% of NAND latency)
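To illustrate the coarse-grained translation, the sketch below assumes a hypothetical 4 MB segment size (the slide does not state the actual size). Under that assumption, 1 TB of flash needs about 256K four-byte entries, roughly 1 MB, consistent with the figure above.

```cpp
#include <cstdint>
#include <vector>

// Coarse-grained segment mapping in the spirit of AMF: one table entry per
// segment, instead of one per flash page as in a conventional page-mapped FTL.
constexpr std::uint64_t kSegmentBytes = 4ull << 20;          // assumed 4 MB segments

std::uint64_t translate(std::uint64_t logical_byte_addr,
                        const std::vector<std::uint32_t>& segment_map) {
    std::uint64_t lseg = logical_byte_addr / kSegmentBytes;  // logical segment number
    std::uint64_t off  = logical_byte_addr % kSegmentBytes;  // offset within the segment
    std::uint32_t pseg = segment_map[lseg];                  // physical segment number
    return static_cast<std::uint64_t>(pseg) * kSegmentBytes + off;  // physical address
}
```

In LightStore the equivalent lookup runs in FPGA logic, which is why the mapped case costs only a few cycles.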
Effects of HW FTL
• HW FTL > lightweight SW FTL > full SW FTL
• Full SW FTL: page mapping and garbage-collection copying overhead
• Reads: 7-10% degradation; writes: 28-50% degradation (the compaction thread is very active, creating more SW FTL work)
• Without the FPGA (HW FTL), we would need an extra set of cores (a trade-off between cost and design effort)
Outline: Introduction • LightStore Software • KVS Performance • LightStore HW FTL • Applications and Adapters • Conclusion
Application: Block and File Stores
• LightStore: block adapter implemented in user mode (BUSE); file adapter = filesystem + block adapter (see the sketch below)
• Compared against Ceph block and Ceph filesystem; Ceph is known to work much better with large (>1 MB) objects
• Charts (block store, file store): throughput shown against the Ceph Ethernet BW, LightStore Ethernet BW, and LightStore flash write BW reference lines
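A block adapter of the kind described above only has to map fixed-size blocks onto KV pairs. The sketch below is our own, with kv_get/kv_set as invented placeholders for the client's KV calls, and uses 8 KB blocks to match the value size used in the evaluation.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Assumed client calls; in a real adapter these would issue RESP GET/SET
// to the LightStore node chosen by consistent hashing.
std::string kv_get(const std::string& key);
void kv_set(const std::string& key, const std::string& value);

constexpr std::size_t kBlockBytes = 8 * 1024;   // one block = one 8 KB value

// Derive a key from the block number.
std::string blockKey(std::uint64_t block_no) {
    return "blk:" + std::to_string(block_no);
}

void blockWrite(std::uint64_t block_no, const std::string& data) {
    kv_set(blockKey(block_no), data);           // data is assumed to be 8 KB
}

std::string blockRead(std::uint64_t block_no) {
    return kv_get(blockKey(block_no));
}
```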
Outline: Introduction • LightStore Software • KVS Performance • LightStore HW FTL • Applications and Adapters • Conclusion
Summary
• Current storage servers are costly and prevent applications from exploiting the full performance of SSDs over the network
• LightStore: networked KV drives instead of an x86-based storage system, with thin software KV adapters on clients; delivers the full NAND flash bandwidth to the network
• Benefits: 2.0x power- and 2.3x space-efficient (conservative); up to 5.2x power- and 8.1x space-efficient; the 4-node prototype is 7.4x power-efficient
Thank You ! Chanwoo Chung (cwchung@csail.mit.edu)