Explore a parallel implementation of cache-oblivious B-trees for disk-resident data employing transactional memory. Learn about challenges, potential pitfalls, and the solution using Libxac, a page-based transactional memory system. Discover the implementation sketch, practicality, and performance experiment towards achieving efficiency.
Concurrent Cache-Oblivious B-trees Using Transactional Memory Jim Sukha Bradley Kuszmaul MIT CSAIL June 10, 2006
Thought Experiment Imagine that, one day, you are assigned the following task: Enclosed is code for a serial, cache-oblivious B-tree. We want a reasonably efficient parallel implementation that works for disk-resident data. Attach:COB-tree.tar.gz PS. We want to be able to restore the data to a consistent state after a crash too. PPS. Our deadline is next week. Good luck!
Concurrent COB-tree? Question: How can one program a concurrent, cache-oblivious B-tree? Approach: We employ transactional memory. What complications does I/O introduce?
Potential Pitfalls Involving I/O Suppose our data structure resides on disk. • We might need to make explicit I/O calls to transfer blocks between memory and disk. But a cache-oblivious algorithm doesn’t know the block size B! • We might need buffer management code if the data doesn’t fit into main memory. • We might need to unroll I/O if we abort a transaction that has already written to disk.
Our Solution: Libxac • We have implemented Libxac, a page-based transactional memory system that operates on disk-resident data. Libxac supports ACID transactions on a memory-mapped file. • Using Libxac, we are able to implement a complex data structure that operates on disk-resident data, e.g. a cache-oblivious B-tree.
Libxac Handles Transaction I/O • We might need to make explicit I/O calls to transfer blocks between memory and disk. Similar to mmap, Libxac provides a function xMmap. Thus, we can operate on disk-resident data without knowing block size. • We might need buffer management code if the data doesn’t fit into main memory. Like mmap, the OS automatically buffers pages in memory. • We might need to unroll I/O if we abort a transaction that has already written to disk. Since Libxac implements multiversion concurrency control, we still have the original version of a page even if a transaction aborts.
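For comparison, here is a minimal sketch of the plain POSIX mmap pattern that xMmap is said to resemble. It uses only standard mmap calls, not Libxac, so it has no transactional semantics. The helper name bump_counter and the 4096-byte length are assumptions chosen for illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define REGION_LEN 4096  /* assumed mapping length, matching the slide's example */

/* Hypothetical helper: increments the first int stored in `path` through a
 * shared memory mapping and returns the new value, or -1 on error.
 * The code never names a block size; the OS pages data in and out. */
static int bump_counter(const char *path) {
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) return -1;
    /* Make sure the file is long enough to back the mapping. */
    if (ftruncate(fd, REGION_LEN) != 0) { close(fd); return -1; }

    int *x = mmap(NULL, REGION_LEN, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (x == MAP_FAILED) { close(fd); return -1; }

    int value = ++x[0];             /* an ordinary memory write */
    msync(x, REGION_LEN, MS_SYNC);  /* push the dirty page back to the file */
    munmap(x, REGION_LEN);
    close(fd);
    return value;
}
```

Calling bump_counter twice on a fresh file should yield 1 and then 2, because the second call sees the value the first call wrote to disk.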
Outline • Programming with Libxac • Cache-Oblivious B-trees
Example Program with Libxac
    int main(void) {
        int* x;
        int status = FAILURE;
        xInit("/logs", DURABLE);
        x = xMmap("input.db", 4096);
        while (status != SUCCESS) {
            xbegin();
            x[0]++;
            status = xend();
        }
        xMunmap(x);
        xShutdown();
        return 0;
    }
• xInit: Runtime initialization function. For durable transactions, logs are stored in the specified directory.*
• xMmap: Transactionally maps the first page of the input file.
• xbegin ... xend: Transaction body. The body can be a complex function (e.g., a cache-oblivious B-tree insert!).
• xMunmap: Unmap the region.
• xShutdown: Shut down the runtime.
* Currently Libxac logs the transaction commits, but we haven’t implemented the recovery program yet.
Libxac Memory Model • Aborted transactions are visible to the programmer (thus, the programmer must explicitly retry the transaction). Control flow always proceeds from xbegin() to xend(). Thus, the transaction body can contain system/library calls. • At xend(), all changes to the xMmap’ed region are discarded on FAILURE, or committed on SUCCESS. • Aborted transactions always see a consistent state. Read-only transactions can always succeed. * Libxac supports concurrent transactions on multiple processes, not threads.
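Because aborts are visible to the programmer, the retry loop can be factored into a helper. The sketch below is an illustration only: mock_xbegin and mock_xend are stand-ins written for this example (not Libxac functions), they simulate the discard-on-FAILURE rule with a snapshot, and the numeric values of SUCCESS and FAILURE are assumed.

```c
#include <assert.h>
#include <string.h>

enum { SUCCESS = 0, FAILURE = 1 };  /* assumed values, for this mock only */

static int region[4];         /* stands in for the xMmap'ed region */
static int snapshot[4];       /* mock: copy taken when a transaction begins */
static int mock_aborts_left;  /* mock: how many upcoming commits should fail */

/* Mock stand-in for xbegin(): remember the state before the body runs. */
static void mock_xbegin(void) { memcpy(snapshot, region, sizeof region); }

/* Mock stand-in for xend(): on FAILURE all changes are discarded. */
static int mock_xend(void) {
    if (mock_aborts_left > 0) {
        mock_aborts_left--;
        memcpy(region, snapshot, sizeof region);
        return FAILURE;
    }
    return SUCCESS;
}

typedef void (*xact_body)(void *arg);

/* Runs `body` as a transaction, retrying after each abort.
 * Returns the number of attempts used, or -1 if max_attempts was exhausted. */
static int run_transaction(xact_body body, void *arg, int max_attempts) {
    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        mock_xbegin();
        body(arg);  /* control always flows from begin to end */
        if (mock_xend() == SUCCESS) return attempt;
    }
    return -1;
}

/* Example body: the x[0]++ from the slide. */
static void increment_first(void *arg) { ((int *)arg)[0]++; }
```

Bounding the number of attempts is a design choice of this sketch; the slide's loop retries until the commit succeeds.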
Implementation Sketch • Libxac detects memory accesses by using a SIGSEGV handler to catch a memory protection violation on a page that has been mmap’ed. • This mechanism is slow for normal transactions: • Time for mmap, SIGSEGV handler: ~ 10 μs • Efficient if we must perform disk I/O to log transaction commits. • Time to access disk: ~ 10 ms
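The detection mechanism can be sketched with standard POSIX calls. This is not Libxac's code: it is a minimal illustration, assuming Linux behavior, of mapping pages with no access rights, catching the resulting SIGSEGV, recording the page, and unprotecting it so the access can proceed. All names (track_region, touched_pages, and so on) are made up for this example.

```c
#define _DEFAULT_SOURCE  /* for MAP_ANONYMOUS and sigaction under strict modes */
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAX_TOUCHED 64

static char *region_base;  /* start of the tracked mapping */
static size_t region_len;
static long page_size;
static volatile sig_atomic_t touched_count;
static volatile size_t touched_pages[MAX_TOUCHED];

/* SIGSEGV handler: record which page was accessed, then unprotect it so the
 * faulting instruction succeeds when it is retried. */
static void on_segv(int sig, siginfo_t *info, void *ctx) {
    (void)sig; (void)ctx;
    char *addr = (char *)info->si_addr;
    if (addr < region_base || addr >= region_base + region_len ||
        touched_count >= MAX_TOUCHED) {
        _exit(1);  /* a genuine fault outside the tracked region */
    }
    size_t page = (size_t)(addr - region_base) / (size_t)page_size;
    touched_pages[touched_count++] = page;
    /* mprotect is not on the POSIX async-signal-safe list; calling it here
     * relies on Linux behavior and is common practice for this technique. */
    mprotect(region_base + page * (size_t)page_size, (size_t)page_size,
             PROT_READ | PROT_WRITE);
}

/* Maps `npages` inaccessible pages and installs the handler. Returns 0 on success. */
static int track_region(size_t npages) {
    page_size = sysconf(_SC_PAGESIZE);
    region_len = npages * (size_t)page_size;
    region_base = mmap(NULL, region_len, PROT_NONE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region_base == MAP_FAILED) return -1;

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_segv;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_SIGINFO;
    touched_count = 0;
    return sigaction(SIGSEGV, &sa, NULL);
}
```

After track_region(4), the first access to each page takes one fault and later accesses to the same page take none. A transactional system could use the recorded page list as a transaction's read/write set.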
Is xMmap practical? Experiment on a 4-processor AMD Opteron, performing 100,000 insertions of elements with random keys into a B-tree. Each insert is a separate transaction. Libxac and BDB (Berkeley DB) both implement group commit. B-tree and COB-tree both use Libxac. Note that none of the three data structures has been properly tuned. Conclusion: We should achieve good performance.
Outline • Programming with Libxac • Cache-Oblivious B-trees
What is a Cache-Oblivious B-tree? • A cache-oblivious B-tree (e.g. [BDFC00]) is a dynamic dictionary data structure that supports searches, insertions/deletions, and range queries. • A cache-oblivious algorithm/data structure does not know system parameters (e.g., the block size B). • Theorem [FLPR99]: a cache-oblivious algorithm that is optimal for a two-level memory hierarchy is also optimal for a multi-level hierarchy.
Cache-Oblivious B-Tree Example [Figure: a static cache-oblivious tree indexing a packed memory array (PMA)] • The COB-tree can be divided into two pieces: • A packed memory array that stores the data in order, but contains gaps. • A static cache-oblivious binary tree that indexes the packed memory array.
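One common way to lay out a static cache-oblivious tree is the van Emde Boas layout: split the tree at half its height, store the top subtree first, then each bottom subtree contiguously, and recurse. The slide does not spell out the layout, so treat this as an assumed illustration. Nodes are named by 1-based breadth-first index, and the function name veb_layout is made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Writes the van Emde Boas order of the complete binary subtree rooted at
 * `root` (1-based breadth-first index) with `height` levels into out[pos...].
 * Returns the next free position in `out`.
 * This sketch gives the top subtree floor(height / 2) levels; some
 * presentations round the other way. */
static size_t veb_layout(size_t root, int height, size_t *out, size_t pos) {
    assert(height >= 1);
    if (height == 1) {
        out[pos] = root;
        return pos + 1;
    }
    int top = height / 2;
    int bottom = height - top;

    /* 1. Lay out the top subtree contiguously. */
    pos = veb_layout(root, top, out, pos);

    /* 2. Lay out each bottom subtree contiguously, left to right. The
     *    descendants of `root` that are `top` levels down have breadth-first
     *    indices root * 2^top ... root * 2^top + 2^top - 1. */
    size_t first = root << top;
    size_t nsub = (size_t)1 << top;
    for (size_t i = 0; i < nsub; i++)
        pos = veb_layout(first + i, bottom, out, pos);
    return pos;
}
```

For a tree of height 4 (15 nodes) this produces the order 1 2 3, 4 8 9, 5 10 11, 6 12 13, 7 14 15, so every root-to-leaf path touches only two contiguous groups of three nodes.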
Cache-Oblivious B-Tree Insert [Figure: static cache-oblivious tree and packed memory array (PMA), before and after inserting 37] • To insert a key of 37: • Find the correct section of the PMA using the static tree. • Insert into the PMA. This step may cause a rebalance of the PMA. • Fix the static tree.
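The PMA part of an insert can be sketched as follows. This is a deliberately simplified illustration, not the authors' implementation: a linear scan stands in for the static tree search, a rebalance respreads the whole array instead of a density-bounded window, and fixing the static tree is omitted. The type pma, the EMPTY sentinel, and the fixed capacity are assumptions of this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define PMA_CAP 16
#define EMPTY (-1)  /* assumed gap marker; keys must be non-negative */

typedef struct { int slot[PMA_CAP]; size_t count; } pma;

static void pma_init(pma *p) {
    for (size_t i = 0; i < PMA_CAP; i++) p->slot[i] = EMPTY;
    p->count = 0;
}

/* Spread the stored keys evenly over the array, preserving their order. */
static void pma_rebalance(pma *p) {
    int keys[PMA_CAP];
    size_t n = 0;
    for (size_t i = 0; i < PMA_CAP; i++)
        if (p->slot[i] != EMPTY) keys[n++] = p->slot[i];
    for (size_t i = 0; i < PMA_CAP; i++) p->slot[i] = EMPTY;
    for (size_t k = 0; k < n; k++) p->slot[k * PMA_CAP / n] = keys[k];
}

/* Inserts `key`, keeping keys in sorted order with gaps.
 * Returns 1 if a rebalance was triggered, 0 if not, -1 if the array is full. */
static int pma_insert(pma *p, int key) {
    assert(key >= 0);
    if (p->count == PMA_CAP) return -1;

    /* Step 1 (the static tree's job in a COB-tree): find the first stored
     * key greater than `key`; the new key belongs just before it. */
    size_t target = 0;
    while (target < PMA_CAP &&
           (p->slot[target] == EMPTY || p->slot[target] <= key))
        target++;

    int shifted = 0;
    if (target > 0 && p->slot[target - 1] == EMPTY) {
        p->slot[target - 1] = key;  /* a gap is already in place */
    } else {
        size_t gap = target;
        while (gap < PMA_CAP && p->slot[gap] != EMPTY) gap++;
        if (gap < PMA_CAP) {        /* shift right into the nearest gap */
            for (size_t i = gap; i > target; i--) p->slot[i] = p->slot[i - 1];
            p->slot[target] = key;
        } else {                    /* no gap to the right: shift left */
            gap = target - 1;
            while (p->slot[gap] != EMPTY) gap--;
            for (size_t i = gap; i + 1 < target; i++) p->slot[i] = p->slot[i + 1];
            p->slot[target - 1] = key;
        }
        shifted = 1;
    }
    p->count++;

    /* Step 2: crude stand-in for density thresholds. If elements had to be
     * shifted, the neighborhood was dense, so respread everything. */
    if (shifted) pma_rebalance(p);
    return shifted;
}

/* Returns 1 if the stored keys appear in non-decreasing order. */
static int pma_is_sorted(const pma *p) {
    int prev = EMPTY;
    for (size_t i = 0; i < PMA_CAP; i++) {
        if (p->slot[i] == EMPTY) continue;
        if (p->slot[i] < prev) return 0;
        prev = p->slot[i];
    }
    return 1;
}
```

Even in this reduced form, a single insert may touch many slots, which hints at why choosing a lock granularity for the full structure is awkward.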
Cache-Oblivious B-Tree Insert Insert is a complex operation. If we wanted to use locks, what is the locking protocol? What is the right (cache-oblivious?) lock granularity?
Conclusions A page-based TM system such as Libxac: • Represents a good match for disk-resident data structures. • The per-page overheads of TM are small compared to the cost of I/O. • Is easy to program with. • Libxac allows us to program a concurrent, disk-resident data structure with ACID properties, as though it were stored in memory.