
Cache Craftiness for Fast Multicore Key-Value Storage


Presentation Transcript


  1. Cache Craftiness for Fast Multicore Key-Value Storage Yandong Mao (MIT), Eddie Kohler (Harvard), Robert Morris (MIT)

  2. Let’s build a fast key-value store • KV store systems are important • Google Bigtable, Amazon Dynamo, Yahoo! PNUTS • Single-server KV performance matters • Reduce cost • Easier management • Goal: fast KV store for single multi-core server • Assume all data fits in memory • Redis, VoltDB

  3. Feature wish list • Clients send queries over network • Persist data across crashes • Range query • Perform well on various workloads • Including hard ones!

  4. Hard workloads • Skewed key popularity • Hard! (Load imbalance) • Small key-value pairs • Hard! • Many puts • Hard! • Arbitrary keys • String (e.g. www.wikipedia.org/...) or integer • Hard!

  5. First try: fast binary tree [Chart: 140M short KV, put-only, 16 cores; throughput in millions of req/sec] • Network/disk are not bottlenecks • High-bandwidth NIC • Multiple disks • 3.7 million queries/second! • Better? What bottleneck remains? • DRAM!

  6. Cache craftiness goes 1.5X farther [Chart: 140M short KV, put-only, 16 cores; throughput in millions of req/sec] • Cache-craftiness: careful use of cache and memory

  7. Contributions • Masstree achieves millions of queries per second across various hard workloads • Skewed key popularity • Various read/write ratios • Variable, relatively long keys • Data >> on-chip cache • New ideas • Trie of B+trees, permuter, etc. • Full system • New ideas + best practices (network, disk, etc.)

  8. Experiment environment • A 16-core server • three active DRAM nodes • Single 10Gb Network Interface Card (NIC) • Four SSDs • 64 GB DRAM • A cluster of load generators

  9. Potential bottlenecks in Masstree [Diagram: a single multi-core server; the potential bottlenecks are the network, DRAM, and the disks holding the logs]

  10. NIC bottleneck can be avoided • Single 10Gb NIC • Multiple queues, scales to many cores • Target: 100B KV pairs => 10M req/sec • Use the network stack efficiently • Pipeline requests • Avoid copying costs

  11. Disk bottleneck can be avoided • 10M puts/sec => 1 GB of logs/sec! • Too much for a single disk • Multiple disks: split the log • See paper for details [Diagram: single multi-core server logging to multiple disks]

  12. DRAM bottleneck – hard to avoid [Chart: 140M short KV, put-only, 16 cores; throughput in millions of req/sec] • Cache-craftiness goes 1.5X farther, including the cost of: • Network • Disk

  13. DRAM bottleneck – w/o network/disk [Chart: 140M short KV, put-only, 16 cores; throughput in millions of req/sec] • Cache-craftiness goes 1.7X farther!

  14. DRAM latency – binary tree [Chart: 140M short KV, put-only, 16 cores; throughput in millions of req/sec] [Diagram: a lookup chases a chain of tree nodes, one serial DRAM latency per level] • 2.7 us/lookup => 380K lookups/core/sec
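Where the 2.7 us comes from: every level of the tree is a dependent pointer dereference, as in the sketch below (illustrative, not the measured code). A balanced binary tree over 140M keys has about log2(140e6) ≈ 27 levels; assuming roughly 100 ns per DRAM miss, that is ~2.7 us per lookup, or ~370K lookups/core/sec, consistent with the slide's numbers.

```cpp
// Pointer-chasing lookup in a binary tree (a sketch, not the measured code).
// Every level is a dependent load, so each level costs one serial DRAM miss.
#include <cstdint>

struct BinNode {
    uint64_t key;
    BinNode* left;
    BinNode* right;
};

const BinNode* lookup(const BinNode* n, uint64_t k) {
    while (n && n->key != k)
        n = (k < n->key) ? n->left : n->right;  // dependent load: the next
                                                // level cannot start until
                                                // this miss completes
    return n;
}
```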

  15. DRAM latency – Lock-free 4-way tree • Concurrency: same as binary tree • One cache line per node => 3 KV / 4 children • ½ the levels of a binary tree => ½ the serial DRAM latencies [Diagram: a 4-way tree node holding three keys and four children]
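A sketch of the one-cache-line node this implies (the field layout and names are illustrative, not the paper's code): three 8-byte keys plus four child pointers fit in 64 bytes, so a single cache-line fetch resolves a 4-way branch.

```cpp
// Sketch of a one-cache-line 4-way tree node (illustrative layout):
// 3 keys + 4 children + a count fit in 64 bytes.
#include <cstdint>
#include <cstdio>

struct alignas(64) Node4 {
    uint64_t key[3];      // up to 3 keys, kept sorted
    Node4*   child[4];    // child[i] covers keys between key[i-1] and key[i]
    uint8_t  nkeys;       // remaining bytes are padding
};

// Descend one level: one cache-line fetch resolves a 4-way branch, so the
// tree needs roughly half the levels (and half the serial DRAM latencies)
// of a binary tree over the same keys.
static Node4* descend(const Node4* n, uint64_t k) {
    int i = 0;
    while (i < n->nkeys && k >= n->key[i])
        ++i;
    return n->child[i];
}

int main() {
    static_assert(sizeof(Node4) == 64, "node should occupy one cache line");
    std::printf("sizeof(Node4) = %zu\n", sizeof(Node4));
}
```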

  16. 4-tree beats binary tree by 40% [Chart: 140M short KV, put-only, 16 cores; throughput in millions of req/sec]

  17. 4-tree may perform terribly! • Unbalanced => O(N) levels of serial DRAM latencies • e.g. sequential inserts [Diagram: sequential inserts degenerate the 4-tree into a long chain] • Want a balanced tree w/ wide fanout

  18. B+tree – Wide and balanced • Balanced! • Concurrent main memory B+tree [OLFIT] • Optimistic concurrency control: version technique • Lookup/scan is lock-free • Puts hold ≤ 3 per-node locks
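A minimal sketch of the version technique for lock-free lookups, in the spirit of OLFIT (the field names and layout here are assumptions): a reader snapshots a node's version, reads its fields, and retries if a concurrent writer changed the version in between; a writer locks the node by setting the version's low bit and bumps the version on unlock.

```cpp
// Sketch of version-validated (optimistic) node reads; illustrative only.
#include <atomic>
#include <cstdint>

struct VersionedNode {
    std::atomic<uint64_t> version{0};   // low bit set while a writer holds the lock
    uint64_t keys[15];
    void*    children[16];

    // Reader: wait for a stable (unlocked) version before reading fields.
    uint64_t stable_version() const {
        uint64_t v;
        do { v = version.load(std::memory_order_acquire); } while (v & 1);
        return v;
    }
    // Reader: after reading, check that no writer intervened; retry if so.
    bool validate(uint64_t v) const {
        std::atomic_thread_fence(std::memory_order_acquire);
        return version.load(std::memory_order_relaxed) == v;
    }
    // Writer: lock by setting the low bit, increment the version on unlock.
    void lock()   { uint64_t v; do { v = version.load() & ~1ull; }
                    while (!version.compare_exchange_weak(v, v | 1)); }
    void unlock() { version.store((version.load() | 1) + 1,
                                  std::memory_order_release); }
};

// A lock-free lookup step: read optimistically, retry on conflict.
void* find_child(const VersionedNode& n, uint64_t key, int nkeys) {
    for (;;) {
        uint64_t v = n.stable_version();
        int i = 0;
        while (i < nkeys && key >= n.keys[i]) ++i;
        void* c = n.children[i];
        if (n.validate(v)) return c;    // no concurrent write: result is safe
    }
}
```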

  19. Wide-fanout B+tree is 11% slower! [Chart: 140M short KV, put-only; throughput in millions of req/sec] • Fanout = 15, fewer levels than the 4-tree, but • # cache lines fetched from DRAM >= 4-tree • 4-tree: each internal node is full • B+tree: nodes are ~75% full • Serial DRAM latencies >= 4-tree

  20. B+tree – Software prefetch • Same as [pB+-trees] • Masstree: B+tree w/ fanout 15 => 4 cache lines per node • Always prefetch the whole node when it is accessed • Result: one DRAM latency per node vs. 2, 3, or 4 (fetching 4 lines costs about the same as 1)
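A small sketch of that prefetch (assuming a 64-byte cache line and GCC/Clang's __builtin_prefetch): issue fetches for all four lines of a node up front, so the misses overlap in the memory system instead of happening one after another.

```cpp
// Sketch of prefetching an entire multi-cache-line B+tree node before use,
// in the style of pB+-trees. Assumes 64-byte lines and GCC/Clang builtins.
#include <cstddef>

constexpr size_t kCacheLine = 64;
constexpr size_t kNodeBytes = 4 * kCacheLine;   // fanout-15 node ~= 4 lines

inline void prefetch_node(const void* node) {
    const char* p = static_cast<const char*>(node);
    // Issue all four line fetches up front so they overlap: the node then
    // costs roughly one DRAM latency instead of 2, 3, or 4.
    for (size_t off = 0; off < kNodeBytes; off += kCacheLine)
        __builtin_prefetch(p + off);
}
```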

  21. B+tree with prefetch [Chart: 140M short KV, put-only, 16 cores; throughput in millions of req/sec] • Beats 4-tree by 9% • Balanced beats unbalanced!

  22. Concurrent B+tree problem • Lookups retry in case of a concurrent insert • Lock-free 4-tree: not a problem – keys do not move around (but the tree is unbalanced) • B+tree: insert(B) shifts keys within a node, exposing an intermediate state to concurrent lookups [Diagram: inserting B into a node holding A C D passes through an intermediate state before reaching A B C D]

  23. B+tree optimization - Permuter • Keys stored unsorted; a permuter in each tree node defines their order • A concurrent lookup does not need to retry • Lookup uses the permuter to search keys • Insert appears atomic to lookups • Permuter: a 64-bit integer [Diagram: insert(B) writes B into a free slot of the node holding A C D, then changes the permuter from (0 1 2) to (0 3 1 2)]
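A sketch of how such a 64-bit permuter could work (the nibble layout is an assumption for illustration, not necessarily Masstree's exact encoding): the low 4 bits hold the key count, and each following 4-bit field maps a sorted position to a physical key slot, so a writer can publish the new ordering with a single atomic 8-byte store.

```cpp
// Sketch of a 64-bit permuter: the low nibble holds the key count and the
// next 15 nibbles map logical (sorted) positions to physical key slots.
// Illustrative layout; may not match Masstree's exact encoding.
#include <cstdint>

struct Permuter {
    uint64_t x;

    int size() const { return x & 0xf; }
    int slot(int logical_pos) const {            // physical slot of i-th key
        return (x >> (4 * (logical_pos + 1))) & 0xf;
    }
    // Build a new permuter with `phys_slot` inserted at logical position i
    // (caller ensures the node is not full). A writer publishes the result
    // with one atomic 64-bit store, so concurrent lookups see either the old
    // or the new ordering - never an intermediate state.
    Permuter insert(int i, int phys_slot) const {
        uint64_t lo   = x & ((1ull << (4 * (i + 1))) - 1);   // count + first i slots
        uint64_t hi   = x & ~((1ull << (4 * (i + 1))) - 1);  // remaining slots
        uint64_t next = (lo + 1)                              // bump count
                      | (uint64_t(phys_slot) << (4 * (i + 1)))
                      | (hi << 4);                            // shift later slots up
        return Permuter{next};
    }
};
```

For the slide's example, inserting B (written into free slot 3) at logical position 1 of a node holding A C D turns the permutation (0 1 2) into (0 3 1 2).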

  24. B+tree with permuter [Chart: 140M short KV, put-only, 16 cores; throughput in millions of req/sec] • Improves throughput by 4%

  25. Performance drops dramatically when key length increases [Chart: throughput in millions of req/sec vs. key length; short values, 50% updates, 16 cores, no logging; keys differ only in their last 8B] • Why? The B+tree stores key suffixes indirectly, so each key comparison • compares the full key • incurs an extra DRAM fetch

  26. Masstree – Trie of B+trees • Trie: a tree where each level is indexed by a fixed-length key fragment • Masstree: a trie with fanout 2^64, where each trie node is a B+tree • Compresses key prefixes! [Diagram: the root B+tree is indexed by k[0:7], the next layer by k[8:15], the next by k[16:23], ...]
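A sketch of the descent this implies (all names are illustrative; the per-layer B+tree is stubbed with std::map for brevity, and GCC/Clang's __builtin_bswap64 plus a little-endian host are assumed): a lookup consumes the key 8 bytes at a time, and each slot holds either the record or a pointer to the B+tree for the next 8 bytes.

```cpp
// Sketch of the trie-of-B+trees descent: each layer is indexed by one
// 8-byte key slice; a slot holds either a record or a link to the next
// layer. Names and types are illustrative stand-ins, not Masstree's API.
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <map>
#include <string>

struct Layer;
struct Value {
    std::string* record = nullptr;   // leaf value, or...
    Layer*       next   = nullptr;   // ...link to the layer for the next 8 bytes
};
struct Layer {
    std::map<uint64_t, Value> slots; // stand-in for a B+tree keyed by 8B slices
};

// Big-endian slice so integer comparison matches lexicographic byte order
// (assumes a little-endian host).
static uint64_t slice_at(const std::string& key, size_t off) {
    uint64_t s = 0;
    if (off < key.size()) {
        size_t n = std::min<size_t>(8, key.size() - off);
        std::memcpy(&s, key.data() + off, n);
    }
    return __builtin_bswap64(s);
}

std::string* get(Layer* root, const std::string& key) {
    Layer* layer = root;
    for (size_t off = 0; layer != nullptr; off += 8) {
        auto it = layer->slots.find(slice_at(key, off));
        if (it == layer->slots.end()) return nullptr;    // no such key
        if (it->second.record)        return it->second.record;
        layer = it->second.next;     // keys share this slice: descend and
                                     // compare only the next 8 bytes
    }
    return nullptr;
}
```

This is what removes the long-key penalty from the previous slide: comparisons touch one 8-byte slice per layer instead of the full, indirectly stored key.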

  27. Case Study: Keys share a P-byte prefix – better than a single B+tree • ~P/8 trie levels, each with only one node • Below them, a single B+tree indexed by 8B keys

  28. Masstree performs better for long keys with prefixes [Chart: throughput in millions of req/sec vs. key length; short values, 50% updates, 16 cores, no logging] • 8B key comparisons vs. full-key comparisons

  29. Does the trie of B+trees hurt short-key performance? [Chart: 140M short KV, put-only, 16 cores; throughput in millions of req/sec] • No – 8% faster! • More efficient code: internal nodes handle 8B keys only

  30. Evaluation • How does Masstree compare to other systems? • How does Masstree compare to partitioned trees? • How much do we pay for handling skewed workloads? • How does Masstree compare with a hash table? • How much do we pay for supporting range queries? • Does Masstree scale on many cores?

  31. Masstree performs well even with persistence and range queries [Chart: 20M short KV, uniform dist., read-only, 16 cores, w/ network; throughput in millions of req/sec; two bars annotated 0.04 and 0.22] • Memcached: not persistent and no range queries • Redis: no range queries • Unfair comparison: both have a richer data and query model

  32. Multi-core – Partition among cores? • Multiple instances, one unique set of keys per instance • e.g. Memcached, Redis, VoltDB • Masstree: a single shared tree • each core can access all keys • reduced imbalance [Diagram: per-core partitioned trees vs. one tree shared by all cores]

  33. A single Masstree performs better for skewed workloads [Chart: 140M short KV, read-only, 16 cores, w/ network; throughput in millions of req/sec vs. δ, where one partition receives δ times more queries than the others] • Partitioned trees: no remote DRAM access, no concurrency control • But skew leaves partitioned cores idle: ~80% idle time when one partition receives 40% of the queries and each of the other 15 receives 4%

  34. Cost of supporting range queries • Without range queries, one could use a hash table instead • No resize cost: pre-allocate a large hash table • Lock-free: update with cmpxchg • Supports 8B keys only: efficient code • At 30% full, each lookup ≈ 1.1 hash probes • Measured in the Masstree framework: 2.5X the throughput of Masstree • So range-query support costs ~2.5X in performance
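For concreteness, a minimal sketch of the kind of table described here (pre-allocated, open addressing, 8-byte keys, inserts published with compare-and-swap); everything about it is an illustrative assumption rather than the measured implementation.

```cpp
// Sketch of a pre-allocated, lock-free open-addressing hash table for 8-byte
// keys, with slots claimed via compare-and-swap. Illustrative only.
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

class FixedHash {
public:
    explicit FixedHash(size_t capacity)
        : slots_(capacity) {}                    // sized up front: never resized

    // Key 0 is reserved as "empty". Assumes the table never fills
    // (the comparison keeps it ~30% full). Note: a concurrent reader may
    // briefly see a claimed key with a default value; a real table would
    // publish key and value together.
    bool put(uint64_t key, uint64_t value) {
        for (size_t i = hash(key); ; ++i) {
            Slot& s = slots_[i % slots_.size()];
            uint64_t k = s.key.load(std::memory_order_acquire);
            if (k == key) { s.value.store(value); return true; }      // update
            if (k == 0) {                        // claim the empty slot with CAS
                uint64_t expected = 0;
                if (s.key.compare_exchange_strong(expected, key)) {
                    s.value.store(value);
                    return true;
                }
                if (expected == key) { s.value.store(value); return true; }
            }
        }
    }

    bool get(uint64_t key, uint64_t* value) const {
        for (size_t i = hash(key); ; ++i) {
            const Slot& s = slots_[i % slots_.size()];
            uint64_t k = s.key.load(std::memory_order_acquire);
            if (k == key) { *value = s.value.load(); return true; }
            if (k == 0)   return false;          // empty slot: key not present
        }
    }

private:
    struct Slot {
        std::atomic<uint64_t> key{0};
        std::atomic<uint64_t> value{0};
    };
    size_t hash(uint64_t k) const { return (k * 0x9E3779B97F4A7C15ull) % slots_.size(); }
    std::vector<Slot> slots_;
};
```

Pre-allocation and 8-byte-only keys are what make this fast and simple; they are also exactly what Masstree gives up in exchange for range queries and arbitrary key lengths.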

  35. Scale to 12X on 16 cores [Chart: per-core throughput in millions of req/sec vs. number of cores, with a perfect-scalability reference line; short KV, w/o logging] • Gets scale to 12X on 16 cores; puts scale similarly • Limited by the shared memory system

  36. Related work • [OLFIT]: optimistic concurrency control • [pB+-trees]: B+tree with software prefetch • [pkB-tree]: stores a fixed # of differing bits inline • [PALM]: lock-free B+tree, 2.3X as fast as [OLFIT] • Masstree: first system to combine these techniques, with new optimizations • Trie of B+trees, permuter

  37. Summary • Masstree: a general-purpose, high-performance, persistent KV store • 5.8 million puts/sec, 8 million gets/sec • More comparisons with other systems in the paper • Cache-craftiness improves performance by 1.5X

  38. Thank you!
