Highly Available ACID Memory

Vijayshankar Raman

Introduction
• Why ACID memory?
• Non-database applications:
  • want updates to critical data to be atomic and persistent
  • need synchronization when multiple threads access critical data
• Databases:
  • concurrency control and recovery logic runs through most of the database code
  • this logic is extremely complicated and hard to get right
  • bugs lead to data loss, which is disastrous

Project goal
• Take recovery logic out of applications
• Build a simple user-level library that provides recoverable, transactional memory
  • all the logic is in one place, so it is easy to debug and maintain
  • easy to make use of hardware advances
• Use replication and persistent memory for recovery instead of writing logs
  • simpler to implement
  • simpler for applications to use?

Questions to answer
• Program simplicity vs. performance
  • how much do we lose by replicating instead of logging?
• On a cluster, can we use replication directly for availability?
  • traditionally, availability has been handled on top of the recovery system

Outline
• Introduction
• Acid Memory API
• Single-node design and implementation
• Evaluation
• High availability: multiple-node design and implementation
• Evaluation
• Conclusion

Acid Memory API
• Transaction manager interface
  • TransactionManager(database name, acid memory area)
• Transaction interface
  • beginTransaction()
  • getLock(memory region1, READ/WRITE)
  • getLock(memory region2, READ/WRITE)
  • ...
  • memory region = virtual address prefix
  • commit() / abort(): all locks are released
• Combines concurrency control with recovery
  • recovery is done on write-locked regions
  • supports fine-granularity locking, so VM (which works at page granularity) cannot be used for recovery
  • applications can modify data directly
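The slide does not say how "memory region = virtual address prefix" is encoded. The sketch below assumes one reading and shows how a lock manager could then detect READ/WRITE conflicts between regions. Under that reading, a region is every address that shares the top `prefixBits` bits of a base address, so regions are power-of-two sized and aligned. `Region`, `LockTable`, `tryLock` and `releaseAll` are hypothetical names, not the library's API. The sketch is also non-blocking: it refuses a conflicting request instead of queuing it.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

enum class Mode { READ, WRITE };

// Assumed encoding: a region is every address whose top `prefixBits`
// bits equal those of `base` (so regions are aligned powers of two).
struct Region {
    std::uint64_t base;
    unsigned prefixBits;  // 0..64; 64 names a single byte
};

// Mask that keeps the top `bits` bits of a 64-bit address.
inline std::uint64_t prefixMask(unsigned bits) {
    return bits == 0 ? 0 : ~std::uint64_t{0} << (64 - bits);
}

// Prefix regions are either nested or disjoint, so two regions overlap
// exactly when they agree on the shorter of the two prefixes.
inline bool overlaps(const Region& x, const Region& y) {
    unsigned bits = x.prefixBits < y.prefixBits ? x.prefixBits : y.prefixBits;
    std::uint64_t m = prefixMask(bits);
    return (x.base & m) == (y.base & m);
}

struct Lock {
    int txn;
    Region region;
    Mode mode;
};

class LockTable {
public:
    // Grants the lock unless another transaction holds an overlapping
    // lock and at least one of the two is a WRITE lock.
    bool tryLock(int txn, Region r, Mode m) {
        assert(r.prefixBits <= 64);
        for (const Lock& held : locks_) {
            if (held.txn == txn) continue;                             // own locks never conflict
            if (held.mode == Mode::READ && m == Mode::READ) continue;  // shared readers
            if (overlaps(held.region, r)) return false;
        }
        locks_.push_back({txn, r, m});
        return true;
    }

    // commit() and abort() both release every lock the transaction holds.
    void releaseAll(int txn) {
        std::vector<Lock> kept;
        for (const Lock& held : locks_)
            if (held.txn != txn) kept.push_back(held);
        locks_.swap(kept);
    }

private:
    std::vector<Lock> locks_;
};
```

The prefix encoding keeps the conflict test to a single masked comparison. The cost is that a lock can only cover an aligned, power-of-two-sized range.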

Implementation
• Assume non-volatile memory (NVRAM or battery backup)
• Assume a persistent file cache
• The acid memory area (master copy) is mmap'd from a disk file
  • persistence means writes are permanent as soon as they are made
• getLock(WRITE): copy the region into a mirror area
• Transaction abort or system crash:
  • undo changes on all write-locked regions using the copies in the mirror area
• The only recovery overhead is a memcpy on each write lock
[Figure: disk file mmap'd as the acid memory area (master copy), alongside a mirror area]
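Below is a minimal sketch of the undo-by-mirror idea, under two assumptions:

• a plain byte buffer stands in for the mmap'd acid memory area;
• a volatile `std::vector` stands in for the mirror area.

In the design above, both would live in persistent memory, so the same restore step would also serve crash recovery. Here it only covers abort. `MirrorTxn` and its methods are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <utility>
#include <vector>

// Undo by mirroring: before a region is written, its bytes are copied
// aside; abort copies them back. Writes go directly to the area.
class MirrorTxn {
public:
    MirrorTxn(unsigned char* area, std::size_t size) : area_(area), size_(size) {}

    // getLock(WRITE): the only recovery overhead is this copy.
    void writeLock(std::size_t offset, std::size_t len) {
        assert(offset <= size_ && len <= size_ - offset);
        if (len == 0) return;
        Saved s{offset, std::vector<unsigned char>(area_ + offset, area_ + offset + len)};
        saved_.push_back(std::move(s));
    }

    // Commit: in-place writes are already persistent; drop the copies.
    void commit() { saved_.clear(); }

    // Abort: restore every write-locked region from its mirror copy,
    // newest first, so a region locked twice ends at its oldest image.
    void abort() {
        for (auto it = saved_.rbegin(); it != saved_.rend(); ++it)
            std::memcpy(area_ + it->offset, it->bytes.data(), it->bytes.size());
        saved_.clear();
    }

private:
    struct Saved {
        std::size_t offset;
        std::vector<unsigned char> bytes;
    };
    unsigned char* area_;
    std::size_t size_;
    std::vector<Saved> saved_;
};
```

Because the application writes the master copy directly, commit does no work in this sketch.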

Evaluation
• Overhead of acid memory
  • read lock: 35 µs (lock manager overhead)
  • write lock: 35 µs + 5.5 µs/KB (memcpy cost)
  • much lower than methods that write a log to disk
• Ease of programming
  • an application only needs to acquire locks to become recoverable
  • it can manipulate the data directly, without calling a special function on every update
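The two measured overheads give a simple cost model in the size s (in KB) of the write-locked region. This is just the linear relation implied by the numbers above, not an additional measurement. The last two lines work it out for a 1 KB region and a 4 KB region.

```latex
\begin{aligned}
t_{\mathrm{read}} &= 35\,\mu\mathrm{s} \\
t_{\mathrm{write}}(s) &= 35\,\mu\mathrm{s} + 5.5\,\mu\mathrm{s/KB} \cdot s \\
t_{\mathrm{write}}(1\,\mathrm{KB}) &= 35 + 5.5 = 40.5\,\mu\mathrm{s} \\
t_{\mathrm{write}}(4\,\mathrm{KB}) &= 35 + 22 = 57\,\mu\mathrm{s}
\end{aligned}
```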

Example: suppose I want to transfer $1M from A’s account to B’s.

With ACID memory:
    /* a points to A's account */
    /* b points to B's account */
    trans = new Transaction(transMgr);
    trans->getLock(a, WRITE);
    trans->getLock(b, WRITE);
    *a = *a - 1000000;   /* update the data in place */
    *b = *b + 1000000;
    trans->commit();

Using logging (Update() creates the needed logs):
    BeginTransaction();
    getLock(A's account, WRITE);
    getLock(B's account, WRITE);
    read(A's account, a);
    read(B's account, b);
    a = a - 1000000;
    b = b + 1000000;
    Update(A's account, a);
    Update(B's account, b);
    commit();

Performance comparison: acid memory vs. logging
• Consider a transaction updating integers in a 1 KB data structure
  • acid memory: write-lock the data structure
  • logging: write-lock the structure and log each integer update separately
• Logging each individual update is somewhat faster, up to a point
• Acid memory gives acceptable performance with much easier programming
[Figure: time (in microseconds) vs. number of integer writes, acid memory and logging]
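The phrase "faster, up to a point" can be made concrete with two assumptions:

• both approaches pay the same 35 µs lock overhead;
• logging costs a per-update amount c_u for each of n integer writes.

c_u is a hypothetical parameter; its value is not given on the slide. Acid memory pays one fixed 1 KB copy, so logging wins only while the total per-update cost stays below the 5.5 µs memcpy.

```latex
\begin{aligned}
t_{\mathrm{acid}} &= 35 + 5.5 \cdot 1 = 40.5\,\mu\mathrm{s} \quad (\text{independent of } n) \\
t_{\mathrm{log}}(n) &= 35 + n\,c_u \\
t_{\mathrm{log}}(n) < t_{\mathrm{acid}} &\iff n\,c_u < 5.5\,\mu\mathrm{s} \iff n < \frac{5.5\,\mu\mathrm{s}}{c_u}
\end{aligned}
```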


Replication for availability
• Traditionally, availability has been handled in a separate layer above recovery: a transaction processing monitor replicates work across several DBMSs
• Can we handle both recovery and availability with the same mechanism?
[Figure: a transaction processing monitor replicating across three DBMS instances]

Architecture
• Transactions are run by a transaction handler
• All lock requests must go to the owner, which runs the lock manager
• Data in all replicas must be kept in sync
• Balance load by partitioning data
  • a different owner for each partition
• Failure model
  • fail-stop: nodes never send incorrect messages to others
  • failed nodes never recover data after a crash
  • the network never fails
[Figure: client and transaction handler; owner with lock manager and data; replicas holding copies of the data]
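The slide does not specify how partitions map to nodes. The sketch below shows one way to route requests under stated assumptions:

• data is split into fixed-size address ranges assigned round-robin to partitions;
• each partition has an ordered replica list, and the first live node in that list acts as owner;
• under the fail-stop model, a crashed owner is simply skipped, so fail-over to the next replica is implicit.

`ClusterMap` and its methods are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <utility>
#include <vector>

class ClusterMap {
public:
    // replicasPerPartition[p] lists the nodes holding partition p,
    // in fail-over order (the first live one is the owner).
    ClusterMap(std::vector<std::vector<int>> replicasPerPartition, int numNodes)
        : replicas_(std::move(replicasPerPartition)), alive_(numNodes, true) {
        assert(!replicas_.empty());
    }

    // Assumed partitioning rule: fixed-size address ranges, round-robin.
    std::size_t partitionOf(std::uint64_t addr, std::uint64_t partitionBytes) const {
        return static_cast<std::size_t>((addr / partitionBytes) % replicas_.size());
    }

    // All lock requests for the partition go to this node.
    int ownerOf(std::size_t partition) const {
        for (int node : replicas_[partition])
            if (alive_[node]) return node;
        return -1;  // every replica has failed
    }

    // Reads may be served by any live replica, chosen at random.
    int replicaForRead(std::size_t partition, std::mt19937& rng) const {
        std::vector<int> live;
        for (int node : replicas_[partition])
            if (alive_[node]) live.push_back(node);
        if (live.empty()) return -1;
        std::uniform_int_distribution<std::size_t> pick(0, live.size() - 1);
        return live[pick(rng)];
    }

    // Fail-stop: a crashed node never comes back with its old data.
    void markFailed(int node) {
        assert(node >= 0 && static_cast<std::size_t>(node) < alive_.size());
        alive_[node] = false;
    }

private:
    std::vector<std::vector<int>> replicas_;
    std::vector<bool> alive_;
};
```

Giving each partition a different first node spreads lock-manager load across the cluster, which is the load-balancing point on the slide.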

[Figure: architecture diagram, as on the previous slide]
• Reads: the client gets data from a random replica
• Writes: must update all replicas
  • on commit, the transaction sends the new data to the owner
  • the owner propagates the update atomically to all replicas
  • a three-phase non-blocking commit protocol always ensures that someone can take over the propagation if the propagating node crashes
• If the owner crashes, fail over to a replica
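The slide names a three-phase non-blocking commit but gives no details. The sketch below shows the textbook three-phase commit termination rule, which is what lets a surviving replica finish or cancel the propagation without waiting for a crashed owner. It is not necessarily the exact protocol used here. The rule relies on the fail-stop, reliable-network assumptions listed earlier: the coordinator can only have committed if some replica already reached the pre-committed state. `State`, `Decision` and `decideAfterOwnerFailure` are hypothetical names.

```cpp
#include <cassert>
#include <vector>

// Replica states in a textbook three-phase commit.
enum class State { Initial, Prepared, PreCommitted, Committed, Aborted };

enum class Decision { Commit, Abort };

// Run by whichever surviving replica takes over after the owner fails.
Decision decideAfterOwnerFailure(const std::vector<State>& survivors) {
    assert(!survivors.empty());
    bool anyPreCommitted = false;
    for (State s : survivors) {
        if (s == State::Committed) return Decision::Commit;
        // A replica that never voted (Initial) means the owner could not
        // have started pre-commit, so nobody can have committed.
        if (s == State::Aborted || s == State::Initial) return Decision::Abort;
        if (s == State::PreCommitted) anyPreCommitted = true;
    }
    // Some replica pre-committed: the owner may have committed, so finish.
    // All merely Prepared: no one can have committed, so abort is safe.
    return anyPreCommitted ? Decision::Commit : Decision::Abort;
}
```

The extra pre-commit phase is what makes the rule safe. No replica commits while another survivor could still be uncertain, so the survivors alone can always reach a decision.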

Evaluation
• Very fast recovery: 424 µs
• Fast transactions without non-volatile memory
• Writes are slower
  • 4n messages at commit with n replicas
  • still faster than logging to disk
• Homogeneous software: since every replica runs the same code, all replicas are susceptible to the same bugs

Conclusions
• Acid memory is easier to use
• Performance relative to logging is not too bad
• Replication gives fast recovery

Future Work
• Using the cache for replication
• When and how much to replicate?

Additional Slides

Evaluation with respect to a logging-based approach
• Ease of implementation
  • very little to code, mostly lock manager code
  • whereas a traditional DBMS needs:
    • a specialized buffer manager
    • a log manager
    • a complex recovery mechanism

How to make the file cache persistent
• Rio (Chen et al., 1996)
  • place the file cache in non-volatile memory
  • protect it against OS crashes using VM protection
  • on reboot, flush pages in the file cache to the disk files
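The sketch below illustrates the VM-protection half of the Rio idea at user level. Pages are kept read-only, so a stray write faults instead of silently corrupting data, and the one sanctioned write path briefly re-enables writing. This is a simplification that assumes POSIX `mmap`/`mprotect` on Linux. Rio itself protects the kernel's file cache. Here an anonymous mapping stands in for the cache and is not persistent. `ProtectedCache` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <sys/mman.h>
#include <unistd.h>

class ProtectedCache {
public:
    explicit ProtectedCache(std::size_t pages)
        : size_(pages * static_cast<std::size_t>(sysconf(_SC_PAGESIZE))) {
        // Pages start read-only: any unsanctioned store raises SIGSEGV.
        void* p = mmap(nullptr, size_, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) throw std::runtime_error("mmap failed");
        base_ = static_cast<unsigned char*>(p);
    }
    ~ProtectedCache() {
        int rc = munmap(base_, size_);
        assert(rc == 0);
        (void)rc;
    }
    ProtectedCache(const ProtectedCache&) = delete;
    ProtectedCache& operator=(const ProtectedCache&) = delete;

    // Sanctioned write path: unprotect, copy, re-protect.
    // (A real system would unprotect only the affected pages.)
    bool write(std::size_t offset, const void* src, std::size_t len) {
        if (offset > size_ || len > size_ - offset) return false;
        if (mprotect(base_, size_, PROT_READ | PROT_WRITE) != 0) return false;
        std::memcpy(base_ + offset, src, len);
        return mprotect(base_, size_, PROT_READ) == 0;
    }

    const unsigned char* data() const { return base_; }
    std::size_t size() const { return size_; }

private:
    std::size_t size_;
    unsigned char* base_ = nullptr;
};
```

Rio's reasoning is that most OS crashes corrupt memory through wild stores rather than through the legitimate write path. Keeping the cache read-only between sanctioned writes turns those stores into faults that stop before reaching persistent data.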
