Building a Distributed Database with Device Served Leases, or, Distributed ARIES • Ohad Rodeh
Presentation Structure • Motivation • A single node database, DBs • The database uses object-disks instead of regular disks • A distributed database, DBm • Based on DBs • Summary • Acknowledgments
Motivation • Object-Disks (OSDs) are a novel storage appliance • They allow adding new functionality to disks • Clustered databases have been built using group-services • Adding leases to OSDs allows constructing database clusters without group-services • Group services limit scalability and complicate programming • Who is alive? • Who should be fenced out? • A fencing mechanism is needed
A Clustered Database • Shared everything • The disks are shared • Shared nothing • Disks are local to database servers • DB2 uses (mostly) shared nothing • The mainframe version uses shared-disks • Oracle uses shared disks • This paper is focused on shared disks • [Diagram: client, database compute nodes, network, object disks]
Building it the Old Way • A GCS connects the compute-nodes • If a compute-node is declared dead it is fenced out • Fencing is supported by the switch • Result: complexity • To build a clustered database one needed to stitch together • GCS • Database • Switch • [Diagram: client, database compute nodes, GCS, fencing inside switch, disks]
An Object Disk • An object disk is • An appliance connected to the network • Talks a standard protocol • Implements a flat file-system • SNIA has a working group on standardization • Participating companies: • Panasas, IBM, HP, Veritas, Seagate, …
A Single Node Database, DBs • Mapping tables to objects • A database table is realized as an object on an OSD • ARIES • DBs is assumed to use ARIES • Each journal entry refers to a single page • Locking • Transactions can get into deadlocks • Deadlock detection is used • After detection the database chooses a victim and aborts it.
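As a rough illustration (not the paper's code), a single-page ARIES-style log record in DBs might look like the following Python sketch; the field names are assumptions:

    # Illustrative sketch only: an ARIES-style log record that refers to a
    # single page, as DBs is assumed to use. Field names are hypothetical.
    from dataclasses import dataclass

    @dataclass
    class LogRecord:
        lsn: int          # log sequence number, monotonically increasing
        txn_id: int       # the transaction that made the change
        page_id: int      # exactly one page per journal entry
        redo_info: bytes  # how to re-apply the change
        undo_info: bytes  # how to roll the change back

    # Example: a record describing an update of page 42 by transaction 7.
    rec = LogRecord(lsn=8, txn_id=7, page_id=42, redo_info=b"new", undo_info=b"old")
    print(rec)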
Distributed Locking I • DBm is based on DBs • DBm requires distributed locking • Lease support in the OSD • Each OSD provides a major-lease • The lease is valid for, say, 30 seconds • The holder of the major-lease can perform operations on the OSD • The major-lease can be delegated
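A minimal Python sketch of the major-lease idea, using invented class and method names rather than a real OSD interface:

    # Hypothetical sketch of an OSD major-lease: time-limited, renewable,
    # and delegatable by its holder. Names are assumptions for illustration.
    import time

    LEASE_SECONDS = 30

    class MajorLease:
        def __init__(self, holder, duration=LEASE_SECONDS):
            self.holder = holder
            self.expires = time.time() + duration

        def valid(self):
            return time.time() < self.expires

        def renew(self, duration=LEASE_SECONDS):
            self.expires = time.time() + duration

        def delegate(self, new_holder):
            # The major-lease holder may hand a sub-lease to another node,
            # bounded by the remaining lifetime of its own lease.
            return MajorLease(new_holder, self.expires - time.time())

    lease = MajorLease("XLKM")
    client_lease = lease.delegate("node-A")
    print(lease.valid(), client_lease.valid())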
Locking for records, pages, tables • A table is composed of records • A database provides locking on a record basis • Here we assume distributed locking is done per page • Within a node, per-record locking is provided
Using the OSD Lease • For each OSD a lock server is run on a compute-node • For OSD X it is XLKM • XLKM takes the major lease for X and provides page-level/object-level locking • Requests with outdated leases are rejected • [Diagram: XLKM holds lease L = Lease(30); clients send {Read(OID), L} to OSD X]
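The lease check at the OSD could look roughly like this sketch; grant_major_lease, read, and LeaseExpired are hypothetical names, not part of any OSD standard:

    # Sketch of an OSD refusing requests that carry an outdated lease.
    import time

    class LeaseExpired(Exception):
        pass

    class OSD:
        def __init__(self):
            self.current_lease_id = 0
            self.lease_expires = 0.0

        def grant_major_lease(self, seconds=30):
            self.current_lease_id += 1
            self.lease_expires = time.time() + seconds
            return self.current_lease_id

        def read(self, oid, lease_id):
            # Requests tagged with an old or expired lease are rejected.
            if lease_id != self.current_lease_id or time.time() > self.lease_expires:
                raise LeaseExpired(f"stale lease {lease_id} for object {oid}")
            return f"data of object {oid}"

    osd_x = OSD()
    lease = osd_x.grant_major_lease()
    print(osd_x.read("OID-1", lease))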
An OSD lock-server • Locks are hardened to an object on X, Xlocks • If the compute-node fails, the locks are recoverable • [Diagram: XLKM holds lease L = Lease(30); {Read(OID), L} requests go to OSD X, which stores the Xlocks object]
Connecting to a lock-manager • Compute-nodes connect to XLKM • Can take and release locks on pages and tables on X • XLKM gives the client the major-lease for X • This allows the client direct access to the OSD
Connecting to a lock-manager II • The client takes a lease on XLKM • The lease protects locks taken • If the lease is not renewed in time, the locks are broken • The client provides the location of its log to XLKM • When the lease is broken, the locks are revoked and the pages are marked to-recover • The next client to take a lock on a to-recover page is provided with the log and needs to perform recovery
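A sketch of this lock-manager behaviour under assumed interfaces: when a client's lease expires its locks are broken, its pages are marked to-recover, and the next node to lock such a page is handed the failed client's log:

    # Illustrative lock manager; lock() and lease_expired() are invented names.
    class LockManager:
        def __init__(self):
            self.locks = {}        # page -> (client, location of client's log)
            self.to_recover = {}   # page -> log that must be replayed first

        def lock(self, page, client, client_log):
            if page in self.to_recover:
                # The caller must first replay this log (ARIES recovery).
                return ("recover-first", self.to_recover.pop(page))
            self.locks[page] = (client, client_log)
            return ("granted", None)

        def lease_expired(self, client):
            # Break the client's locks and mark its pages to-recover.
            for page, (holder, log) in list(self.locks.items()):
                if holder == client:
                    del self.locks[page]
                    self.to_recover[page] = log

    xlkm = LockManager()
    xlkm.lock("P1", "A", "logA")
    xlkm.lease_expired("A")
    print(xlkm.lock("P1", "B", "logB"))   # ('recover-first', 'logA')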
Deadlocks • Deadlocks can happen in DBm • For local deadlocks the DBs algorithm is sufficient • For distributed deadlocks there is known literature • For example: • Once in a while each compute node requests the set of locks from other compute nodes • Search for cycles • For each cycle kill a victim transaction
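For illustration, a toy wait-for-graph cycle search of the kind such a detector would run; the graph contents below are invented:

    # Each compute node periodically gathers lock/wait information from the
    # others, builds a wait-for graph, and kills a victim per cycle found.
    def find_cycle(wait_for):
        # wait_for maps a transaction to the transactions it waits on.
        def dfs(node, path):
            if node in path:                       # revisited a transaction: cycle
                return path[path.index(node):] + [node]
            for nxt in wait_for.get(node, []):
                cycle = dfs(nxt, path + [node])
                if cycle:
                    return cycle
            return None

        for txn in wait_for:
            cycle = dfs(txn, [])
            if cycle:
                return cycle
        return None

    # T1 (on node A) waits on T2 (on node B) and vice versa: a distributed deadlock.
    cycle = find_cycle({"T1": ["T2"], "T2": ["T1"]})
    print("kill victim:", cycle[0] if cycle else None)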
Tables are implemented as B+-trees • The table is physically allocated on an OSD object • The internal nodes contain keys • The leaf nodes contain keys and data • Each node is represented as an 8K page • Each page (and key) can be locked separately • [Diagram: example B+-tree with keys in the internal nodes and key/data pairs in the leaves]
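A toy lookup over such a B+-tree, with an invented node layout and the 8K pages and per-page locks omitted:

    # Sketch: descend internal nodes by key, then search the leaf.
    import bisect

    class Node:
        def __init__(self, keys, children=None, values=None):
            self.keys = keys
            self.children = children   # internal node: child pages
            self.values = values       # leaf node: data for each key

    def lookup(node, key):
        while node.children is not None:            # descend internal nodes
            node = node.children[bisect.bisect_right(node.keys, key)]
        i = bisect.bisect_left(node.keys, key)      # search the leaf
        return node.values[i] if i < len(node.keys) and node.keys[i] == key else None

    leaf1 = Node(keys=[4, 10], values=["D4", "D10"])
    leaf2 = Node(keys=[20, 40], values=["D20", "D40"])
    root = Node(keys=[20], children=[leaf1, leaf2])
    print(lookup(root, 40))   # D40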
Transactions on DBm I • Each client has a log object • If A is a compute-node then logA is the log • logA contains the write-ahead log for A • Normally, each node accesses only its own log • If A fails another node will recover logA
Transactions on DBm II • Take locks on pages • From appropriate lock-manager • Write open-transaction to logA • Add log-records and modify pages in memory • Write close-transaction to logA • Release locks • Modified pages need to be written to disk prior to releasing locks • A node can do write-back caching as long as other nodes do not request a page
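The same sequence as a small Python sketch, with in-memory dictionaries standing in for DBm's lock manager, write-ahead log, buffer pool, and disk:

    # Sketch of the DBm transaction sequence; data structures are stand-ins.
    locks, wal, buffer_pool, disk = set(), [], {}, {}

    def run_transaction(txn_id, updates):
        pages = [page for page, _ in updates]
        for page in pages:                            # 1. take page locks
            locks.add(page)
        wal.append(("open-transaction", txn_id))      # 2. open-transaction record
        for page, value in updates:                   # 3. log records + in-memory updates
            wal.append(("update", txn_id, page, value))
            buffer_pool[page] = value
        wal.append(("close-transaction", txn_id))     # 4. close-transaction record
        for page in pages:                            # 5. flush modified pages to disk...
            disk[page] = buffer_pool[page]
        for page in pages:                            # ...before the locks are released
            locks.discard(page)

    run_transaction("T1", [("P1", "D2"), ("P2", "D1")])
    print(disk)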
Example Transaction • Assume table T contains • Keys K1 and K2 • Data D1 and D2 respectively • Node A wishes to swap the values D1 and D2 • Node A • Takes read-locks for K1 and K2 • Reads values D1 and D2 • Takes write-locks for K1 and K2 • Modifies value of K1 to D2 • Adds a log entry • Modifies value of K2 to D1 • Adds a log entry • Releases locks
DBm: Rollback • Basically, DBm uses the DBs solution • Assume node A is performing transaction T • Initially A holds a lock on logA and a set of pages • To rollback T, A needs to • Perform the set of log-entries in undo mode and add a CLR to logA for each modification • Read the pages from disk, modify, write back to disk • Release all locks • Since A initially holds the set of locks, no deadlocks
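A sketch of this rollback, with an invented log-entry layout of (LSN, page, old value, new value) and a CLR appended for every undone modification:

    # Undo the transaction's log entries in reverse order, writing a
    # compensation log record (CLR) to logA for each undone change.
    def rollback(txn_log, disk, wal, next_lsn):
        for lsn, page, old_value, new_value in reversed(txn_log):
            disk[page] = old_value                         # undo the change on disk
            wal.append(("CLR", next_lsn, page, old_value)) # compensation record
            next_lsn += 1
        return next_lsn
        # locks are released by the caller only after all undos are on disk

    disk = {"P1": "D2", "P2": "D1"}
    txn_log = [(8, "P1", "D1", "D2"), (9, "P2", "D2", "D1")]
    wal = []
    rollback(txn_log, disk, wal, next_lsn=10)
    print(disk, wal)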
Recovery: Lease Expiration • Node A loses the lease to a lock-manager on OSD X • If logA is on X • A breaks all connections to lock-managers • All A's pages are marked to-recover in XLKM • Full recovery is needed • If logA is not on X • A reconnects to XLKM • If B attempts to lock page P, it sees the to-recover mark • B attempts to lock logA, fails (A still holds it), and releases the lock on P
Recovery: Compute Node Failure • Scenario: node A fails and recovers • LogA needs to be replayed • Recovery is done ARIES style • Take exclusive lock on logA • Perform redo scan, then undo scan • A log entry E that applies to record R in page P is replayed by the following sequence: • Take lock on P • Check if the page LSN (PLSN) is lower than the entry LSN (ELSN) • If so, apply the update • Node B can recover logA if A does not recover
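The LSN test in that sequence, sketched with assumed page and entry layouts:

    # Redo a log entry only if the page is older than the entry (ARIES rule).
    def replay(entry, pages):
        # entry: (lsn, page_id, new_value); pages: page_id -> {"lsn", "value"}
        lsn, page_id, new_value = entry
        page = pages[page_id]                 # the lock on the page is assumed held
        if page["lsn"] < lsn:                 # only redo if the page is older
            page["value"] = new_value
            page["lsn"] = lsn

    pages = {"P": {"lsn": 6, "value": "old"}}
    replay((10, "P", "new"), pages)           # applied: 6 < 10
    replay((10, "P", "new"), pages)           # skipped on a second pass: 10 is not < 10
    print(pages)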
Applications • Possible applications: • Databases with low sharing • Search-mostly databases • Storage-Tank meta-data? • [Diagram: clients, an MDS cluster (S1, S2, S3) holding the directory structure, and object disks storing Log(S1), Log(S2), Log(S3)]
Summary • We have shown a method to construct clustered databases without group-services • Pros • Good scalability • Good performance on low-sharing workloads • Cons • Bad performance on high-sharing workloads
Acknowledgments • Avi Teperman • Gary Valentin • Effi Offer • Mark Hayden
Log Sequence Number (LSN) • Each page is stamped with an LSN • LSNs are monotonically increasing and chosen by the log-manager component • Node A performs • Take read-locks for K1 and K2 • Read values D1 and D2 • Take write-locks for K1 and K2 • Modify value of K1 to D2 • Add a log entry to LogA with LSN=8 • The page where K1 is located is marked with LSN=8 • Modify value of K2 to D1 • Add a log entry to LogA with LSN=11 • The page where K2 is located is marked with LSN=11 • Release locks
Synchronizing LSNs • There needs to be a way to synchronize the LSNs across nodes; this is required for ARIES to work • Example • Node A takes the write-lock for page P • Node A modifies P and marks it with LSN 10 • Node A writes P to disk and releases the lock • Node B takes the write-lock for P • Node B modifies P and marks it with LSN 6 • Node B writes P to disk and releases the lock • After this sequence • If A fails • During recovery the log entry with LSN 10 will be redone a second time, because the page LSN on disk (6) is lower than 10, overwriting B's later update
LSN Solution • When reading a page from disk a node will update its maximal LSN to the maximum between its current LSN and the page LSN • This ensures a monotonically increasing LSN per page. • This is sufficient. • It is relatively cheap as no cluster-wide synchronization is needed
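A sketch of this rule, with an invented LogManager: every page read raises the node's LSN counter to at least the page's LSN, so any LSN it issues afterwards is higher:

    # On each page read, lift the local maximal LSN to the page's LSN;
    # per-page LSNs then stay monotonically increasing across nodes.
    class LogManager:
        def __init__(self):
            self.max_lsn = 0

        def on_page_read(self, page_lsn):
            self.max_lsn = max(self.max_lsn, page_lsn)

        def next_lsn(self):
            self.max_lsn += 1
            return self.max_lsn

    lm = LogManager()                # this node is currently at LSN 5
    lm.max_lsn = 5
    lm.on_page_read(page_lsn=10)     # reads a page last stamped elsewhere with LSN 10
    print(lm.next_lsn())             # 11, never below the page's existing LSN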