
Aether : A Scalable Approach to Logging



Presentation Transcript


  1. Databases @ Carnegie Mellon. Ryan Johnson†‡, Ippokratis Pandis†‡, Radu Stoica‡, Manos Athanassoulis‡, Anastasia Ailamaki†‡. †Carnegie Mellon University, ‡École Polytechnique Fédérale de Lausanne. Aether: A Scalable Approach to Logging. VLDB 2010

  2. Scalability is key! • Modern hardware needs software parallelism • OLTP is inherently parallel at the request level • Very good at providing high concurrency • But internal serializations limit execution parallelism. Need for scalable OLTP components

  3. Logging is crucial for OLTP (e.g., the Amazon outage*) • Fault tolerance • Crash recovery • Transaction abort/rollback • Performance • Log changes for durability (no in-place updates) • Write dirty pages back asynchronously. Need an efficient and scalable logging solution • * http://www.datacenterknowledge.com/archives/2010/05/13/car-crash-triggers-amazon-power-outage/

  4. Logging is a bottleneck for scalability. [Figure: multicore CPUs (CPU-1 … CPU-N, private L1s, shared L2) with data and a centralized log in RAM, flushed to disk] (1) At commit, the transaction must yield for the log flush • synchronous I/O on the critical path • locks held for a long time • two context switches per commit. (2) Every transaction must insert records into the log buffer • a centralized main-memory structure • a source of contention. Working around the bottlenecks: • Asynchronous commit • Replacing logging with replication and fail-over. Workarounds compromise durability
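To make the two serialization points concrete, here is a minimal C++ sketch of the baseline behavior described on this slide (hypothetical names, not the Shore-MT code): every insert funnels through one latch, and every commit blocks its worker thread on the flush I/O.

    #include <cstddef>
    #include <cstdint>
    #include <mutex>
    #include <vector>

    struct NaiveLog {
        std::mutex        latch;       // single point of contention
        std::vector<char> buffer;      // centralized log buffer
        uint64_t          tail_lsn = 0;

        // Bottleneck (2): every thread serializes on the same latch
        // and holds it for the whole copy.
        uint64_t insert(const char* rec, std::size_t len) {
            std::lock_guard<std::mutex> g(latch);
            buffer.insert(buffer.end(), rec, rec + len);
            return tail_lsn += len;
        }

        // Bottleneck (1): the worker blocks on synchronous I/O while its
        // transaction still holds locks, paying two context switches.
        void commit(uint64_t lsn) { flush_to(lsn); }

        void flush_to(uint64_t) { /* write() + fsync() up to lsn */ }
    };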

  5. Does “correct” logging have to be so slow? • Locks are held for a long time • Not actually used during the flush • An indirect way to enforce isolation • Two context switches per commit • Transactions are nearly stateless at commit time • Easy to migrate transactions between threads • The log buffer is a source of contention • The log orders incoming requests, not threads • Log records can be combined. No! Aether: uncompromised, yet scalable logging

  6. Agenda • Logging-related problems • Aether logging • Reducing lock contention • Reducing context switching • Scalable log buffer implementation • Conclusions

  7. Bottleneck 1: Amplified lock contention. [Figure: timeline of Xct 1 (working, commit, flush I/O in the log manager) and Xct 2 (waiting in the lock manager until Xct 1 is done)] Other transactions wait for locks while the log flush I/O completes

  8. Early Lock Release (ELR), in the case of a single log • Finish the transaction • Release locks before commit • Insert the transaction's commit record • Wait until the log record is flushed • Dependent xcts are serialized at the log buffer • No extra overhead; the idea has been around for 30 years, but so far nobody uses it. With ELR, other transactions do not wait for locks held during log flushes
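A minimal C++ sketch of the ELR commit path, assuming hypothetical Transaction / LockManager / Log interfaces (this is not the Shore-MT API); the point is only the ordering of the steps above.

    #include <cstdint>

    struct Transaction { void report_committed() { /* ack the client */ } };
    struct LockManager { void release_all(Transaction&) {} };
    struct Log {
        uint64_t insert_commit_record(Transaction&) { return 0; /* dummy LSN */ }
        void     flush_to(uint64_t) { /* wait for durability */ }
    };

    void commit_with_elr(Transaction& xct, LockManager& locks, Log& log) {
        uint64_t commit_lsn = log.insert_commit_record(xct);

        // ELR: drop locks *before* waiting for the flush. Any dependent
        // transaction that saw our updates gets a higher commit LSN, so
        // the log buffer itself serializes it behind us.
        locks.release_all(xct);

        log.flush_to(commit_lsn);  // durability point
        xct.report_committed();    // only now acknowledge the client
    }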

  9. ELR benefits. Setup: Sun Niagara T2 (64 HW contexts), 64 GB RAM; memory-resident TPC-B in Shore-MT; Zipfian distribution on transaction inputs. [Figure: throughput results] ELR is simple and sometimes very useful

  10. Agenda • Logging-related problems • Aether logging • Reducing lock contention • Reducing context switching • Scalable log buffer implementation • Conclusions

  11. Bottleneck 2: Excessive context switching. Setup: Sun Niagara T2 (64 HW contexts), memory-resident TPC-B in Shore-MT. [Figure: timeline of Xct 1 and Xct 2, each commit incurring a context switch around the flush I/O] • One context switch per log flush → pressure on the OS scheduler. Must decouple thread scheduling from log flushes

  12. Flush Pipelining • The scheduler is in the critical path and wastes CPU • Multi-core HW only amplifies the problem • But a transaction is nearly stateless at commit • Detach the transaction state from the worker thread • Pass it to a log writer • Worker threads do not block at commit time [Figure: timeline of Xct 1 and Xct 2 on worker threads 1 and 2]

  13. Flush Pipelining (cont.) [Figure: detached transactions Xct 1–Xct 4 flow from the two worker threads through the dedicated log writer] A staged-like mechanism = low scheduling costs; a sketch of the handoff follows.
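A minimal C++ sketch of Flush Pipelining, again with invented types rather than the Shore-MT implementation: workers hand off nearly stateless committing transactions and move on, while one log writer thread flushes and completes them in batches (group commit).

    #include <condition_variable>
    #include <cstdint>
    #include <deque>
    #include <mutex>

    struct PendingCommit { uint64_t commit_lsn; /* detached xct state */ };

    class FlushPipeline {
        std::mutex mtx;
        std::condition_variable cv;
        std::deque<PendingCommit> queue;   // commits awaiting the next flush

    public:
        // Worker thread: enqueue and return immediately to new work;
        // no blocking and no context switch at commit time.
        void submit(PendingCommit pc) {
            { std::lock_guard<std::mutex> g(mtx); queue.push_back(pc); }
            cv.notify_one();
        }

        // Dedicated log writer: one I/O covers every queued commit
        // (LSNs arrive in increasing order, so the last one suffices).
        void writer_loop() {
            for (;;) {
                std::unique_lock<std::mutex> g(mtx);
                cv.wait(g, [this] { return !queue.empty(); });
                std::deque<PendingCommit> batch;
                batch.swap(queue);
                g.unlock();
                flush_log_up_to(batch.back().commit_lsn);  // single flush
                for (auto& pc : batch) finish(pc);         // ack the clients
            }
        }

        void flush_log_up_to(uint64_t) { /* write() + fsync() */ }
        void finish(PendingCommit&)    { /* report committed */ }
    };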

  14. Impact of Flush Pipelining. Setup: Sun Niagara T2 (64 HW contexts), memory-resident TPC-B in Shore-MT. [Figure: throughput results] Matches asynchronous-commit throughput without compromising durability

  15. Agenda • Logging-related problems • Aether logging • Reducing lock contention • Reducing context switching • Scalable log buffer implementation • Conclusions

  16. Bottleneck 3: Log buffer contention • A centralized log buffer → contention, which depends on • the number of participating threads • the size of the modifications (KiBs per record in the case of physical logging) [Figure: timeline of Xct 1–Xct 3 waiting on the log-buffer latch and on flush I/O]

  17. Eliminating critical sections • Inspiration: elimination-based backoff* • Critical sections can cancel each other out • E.g., stack push/pop operations: • Attempt to acquire the mutex • If that fails, back off and wait in an elimination array • If an opposite request is already waiting there, the two eliminate each other without acquiring the mutex [Figure: pushes and pops meeting in the "station area" beside the stack] Adapt elimination-based backoff for DB logging; a toy sketch follows. • * D. Hendler, N. Shavit, and L. Yerushalmi. “A Scalable Lock-free Stack Algorithm.” In Proc. SPAA, 2004
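As a toy illustration of the stack example only (the log adaptation is on the next slide): one exchange slot in C++, loosely after Hendler et al., with invented names. A push and a pop that collide here trade the value directly, and neither ever touches the stack or its mutex.

    #include <atomic>
    #include <chrono>
    #include <optional>
    #include <thread>

    struct EliminationSlot {
        enum State { EMPTY, CLAIMED, PUSH_WAITING, MATCHED };
        std::atomic<State> state{EMPTY};
        int value = 0;

        // A push that lost the stack mutex parks here; true = eliminated.
        bool try_push(int v) {
            State e = EMPTY;
            if (!state.compare_exchange_strong(e, CLAIMED)) return false;
            value = v;                               // slot privately owned
            state.store(PUSH_WAITING, std::memory_order_release);
            std::this_thread::sleep_for(std::chrono::microseconds(10));
            e = PUSH_WAITING;
            if (state.compare_exchange_strong(e, EMPTY))
                return false;                        // timed out; retry on stack
            while (state.load(std::memory_order_acquire) != MATCHED) { }
            state.store(EMPTY);                      // hand the slot back
            return true;
        }

        // A pop that lost the mutex looks for a waiting push to cancel.
        std::optional<int> try_pop() {
            State e = PUSH_WAITING;
            if (!state.compare_exchange_strong(e, CLAIMED)) return std::nullopt;
            int v = value;
            state.store(MATCHED, std::memory_order_release);
            return v;
        }
    };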

  18. Accessing the log buffer • Break the log insert into three logical steps: (a) Reserve space by updating the head LSN (b) Copy the log record (memcpy) (c) Make the insert visible by updating the tail LSN, in LSN order • Steps (a) + (c) can be consolidated: • Accumulate requests off the critical path • Send only the group leader to fight for the critical section • Move (b) out of the critical section (see the sketch below)
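A minimal C++ sketch of the consolidated insert, with invented names and one fresh group object per batch (the real consolidation array recycles a few slots, as slide 21 shows). One leader performs the (a) reservation once per group while every member does its memcpy (b) outside the latch; step (c) is noted but omitted.

    #include <atomic>
    #include <cstdint>
    #include <cstring>
    #include <mutex>

    class ConsolidatedLog {
        static constexpr uint64_t CLOSED = uint64_t(1) << 63;

        struct Group {
            std::atomic<uint64_t> size{0};  // pooled bytes; CLOSED once sealed
            std::atomic<uint64_t> base{0};  // 0 = leader has not reserved yet
        };

        std::mutex latch;     // the contended critical section
        char*      buf;
        uint64_t   head = 1;  // next free LSN (starts at 1 so base==0 is "unset")

    public:
        explicit ConsolidatedLog(char* backing) : buf(backing) {}

        uint64_t insert(Group& g, const char* rec, uint64_t len) {
            // Join the group: my_off is this record's offset within it.
            uint64_t my_off = g.size.fetch_add(len);
            if (my_off & CLOSED) return 0;  // sealed; caller retries elsewhere

            if (my_off == 0) {
                // (a) Leader: one latch acquisition reserves space for the
                // whole group; sealing stops late joiners.
                std::lock_guard<std::mutex> lk(latch);
                uint64_t total = g.size.fetch_or(CLOSED);  // bytes pooled so far
                g.base.store(head, std::memory_order_release);
                head += total;
            } else {
                // Followers never touch the latch; wait for the reservation.
                while (g.base.load(std::memory_order_acquire) == 0) { }
            }

            uint64_t lsn = g.base.load(std::memory_order_relaxed) + my_off;
            std::memcpy(buf + (lsn - 1), rec, len);  // (b) outside the latch
            // (c) Publishing the group's tail LSN is likewise done once per
            // group, by the last member to finish; omitted in this sketch.
            return lsn;
        }
    };

The latch is acquired once per group instead of once per record, so contention stops growing with the thread count, which is the (C) design on the next slide.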

  19. Design evolution. [Figure: per-design timelines showing mutex held, start/finish, waiting, and copy-into-buffer phases] • (B) Baseline: the entire insert happens inside the critical section • (C) Consolidation array: contention(# threads) = O(1) • (D) Decoupled buffer insert: contention(work) = O(1) • (CD) Hybrid design: decouples contention from both the number of threads and the average log entry size

  20. Performance as contention increases. Microbenchmark with a bimodal distribution of record sizes (48 B and 160 B; 120 B average). [Figure: throughput vs. contention for each design] The hybrid solution combines the benefits of both

  21. Sensitivity to slot count. [Figure: heat map of throughput (MB/s, roughly 400–1700, shown as color/height) over # threads (0–60) and # slots (1–10)] Relatively insensitive to slot count (3 or 4 slots are good enough for most cases)

  22. The case against distributed logging • Distributing TPC-C log records over 8 logs • 1 ms of wall time, ~200 in-flight transactions, 30 commits • Horizontal blue line = 1 log • Diagonal lines = dependencies (new = black, older = grey). Large overhead from tracking dependencies and from over-flushing

  23. Agenda • Logging-related problems • Aether logging • Reducing lock contention • Reducing context switching • Scalable log buffer implementation • Conclusions

  24. Putting it all together. Setup: Sun Niagara T2 (64 HW contexts), memory-resident TPC-B. [Figure: throughput vs. # threads; +15% to +60% over the baseline, with the gap increasing with the number of threads] Eliminates the current log bottlenecks and future-proofs the system against contention

  25. Conclusions • Logging is an essential component of OLTP • It simplifies recovery and improves performance without the need to physically partition the data • ...but all lurking bottlenecks must be addressed • Aether is a holistic approach to logging • Leverages existing techniques (Early Lock Release) • Reduces context switches (Flush Pipelining) • Eliminates log contention (consolidation-based backoff) • Can achieve 2 GB/s of log throughput per node. Thank you!
