Building a Distributed Database with Device Served Leases, or, Distributed ARIES • Ohad Rodeh
Presentation Structure • Motivation • A single node database, DBs • The database uses object-disks instead of regular disks • A distributed database, DBm • Based on DBs • Summary • Acknowledgments
Motivation • Object-Disks (OSDs) are a novel storage appliance • They allow adding new functionality to disks • Clustered databases have been built using group-services • Adding leases to OSDs allows constructing database clusters without group-services • Group services limit scalability and complicate programming • Who is alive? • Who should be fenced out? • A fencing mechanism is needed
A Clustered Database • Shared everything • The disks are shared • Shared nothing • Disks are local to database servers • DB2 uses (mostly) shared nothing • The mainframe version uses shared-disks • Oracle uses shared disks • This paper is focused on shared disks • [Diagram: client, database compute nodes, network, object disks]
Building it the Old Way • A GCS connects the compute-nodes • If a compute-node is declared dead it is fenced out • Fencing is supported by the switch • Result: complexity • To build a clustered database one needed to stitch together • GCS • Database • Switch • [Diagram: client, database compute nodes, GCS, fencing inside switch, disks]
An Object Disk • An object disk is • An appliance connected to the network • Talks a standard protocol • Implements a flat file-system • SNIA has a working group on standardization • Participating companies: • Panasas, IBM, HP, Veritas, Seagate, …
A Single Node Database, DBs • Mapping tables to objects • A database table is realized as an object on an OSD • ARIES • DBs is assumed to use ARIES • Each journal entry refers to a single page • Locking • Transactions can get into deadlocks • Deadlock detection is used • After detection the database chooses a victim and aborts it.
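As a rough illustration (not the paper's code), a single-page ARIES-style log record in DBs might look like the following Python sketch; the field names are assumptions:

    # Illustrative sketch only: an ARIES-style log record that refers to a
    # single page, as DBs is assumed to use. Field names are hypothetical.
    from dataclasses import dataclass

    @dataclass
    class LogRecord:
        lsn: int          # log sequence number, monotonically increasing
        txn_id: int       # the transaction that made the change
        page_id: int      # exactly one page per journal entry
        redo_info: bytes  # how to re-apply the change
        undo_info: bytes  # how to roll the change back

    # Example: a record describing an update of page 42 by transaction 7.
    rec = LogRecord(lsn=8, txn_id=7, page_id=42, redo_info=b"new", undo_info=b"old")
    print(rec)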
Distributed Locking I • DBm is based on DBs • DBm requires distributed locking • Lease support in the OSD • Each OSD provides a major-lease • The lease is valid for, say, 30 seconds • The holder of the major-lease can perform operations on the OSD • The major-lease can be delegated
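A minimal Python sketch of the major-lease idea, using invented class and method names rather than a real OSD interface:

    # Hypothetical sketch of an OSD major-lease: time-limited, renewable,
    # and delegatable by its holder. Names are assumptions for illustration.
    import time

    LEASE_SECONDS = 30

    class MajorLease:
        def __init__(self, holder, duration=LEASE_SECONDS):
            self.holder = holder
            self.expires = time.time() + duration

        def valid(self):
            return time.time() < self.expires

        def renew(self, duration=LEASE_SECONDS):
            self.expires = time.time() + duration

        def delegate(self, new_holder):
            # The major-lease holder may hand a sub-lease to another node,
            # bounded by the remaining lifetime of its own lease.
            return MajorLease(new_holder, self.expires - time.time())

    lease = MajorLease("XLKM")
    client_lease = lease.delegate("node-A")
    print(lease.valid(), client_lease.valid())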
Locking for records, pages, tables • A table is composed of records • A database provides locking on a record basis • Here we assume distributed locking is done per page • Within a node, per-record locking is provided
Using the OSD Lease • For each OSD a lock server is run on a compute-node • For OSD X it is XLKM • XLKM takes the major lease for X and provides page-level/object-level locking • Requests with outdated leases are rejected • [Diagram: XLKM holds lease L = Lease(30); clients send {Read(OID), L} to OSD X]
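The lease check at the OSD could look roughly like this sketch; grant_major_lease, read, and LeaseExpired are hypothetical names, not part of any OSD standard:

    # Sketch of an OSD refusing requests that carry an outdated lease.
    import time

    class LeaseExpired(Exception):
        pass

    class OSD:
        def __init__(self):
            self.current_lease_id = 0
            self.lease_expires = 0.0

        def grant_major_lease(self, seconds=30):
            self.current_lease_id += 1
            self.lease_expires = time.time() + seconds
            return self.current_lease_id

        def read(self, oid, lease_id):
            # Requests tagged with an old or expired lease are rejected.
            if lease_id != self.current_lease_id or time.time() > self.lease_expires:
                raise LeaseExpired(f"stale lease {lease_id} for object {oid}")
            return f"data of object {oid}"

    osd_x = OSD()
    lease = osd_x.grant_major_lease()
    print(osd_x.read("OID-1", lease))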
An OSD lock-server • Locks are hardened to an object on X, Xlocks • If the compute-node fails, the locks are recoverable • [Diagram: XLKM holds lease L = Lease(30); {Read(OID), L} requests go to OSD X, which stores the Xlocks object]
Connecting to a lock-manager • Compute-nodes connect to XLKM • Can take and release locks on pages and tables on X • XLKM gives the client the major-lease for X • This allows the client direct access to the OSD
Connecting to a lock-manager II • The client takes a lease on XLKM • The lease protects locks taken • If the lease is not renewed in time, the locks are broken • The client provides the location of its log to XLKM • When the lease is broken, the locks are revoked and the pages are marked to-recover • The next client to take a lock on a to-recover page is provided with the log and needs to perform recovery
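A sketch of this lock-manager behaviour under assumed interfaces: when a client's lease expires its locks are broken, its pages are marked to-recover, and the next node to lock such a page is handed the failed client's log:

    # Illustrative lock manager; lock() and lease_expired() are invented names.
    class LockManager:
        def __init__(self):
            self.locks = {}        # page -> (client, location of client's log)
            self.to_recover = {}   # page -> log that must be replayed first

        def lock(self, page, client, client_log):
            if page in self.to_recover:
                # The caller must first replay this log (ARIES recovery).
                return ("recover-first", self.to_recover.pop(page))
            self.locks[page] = (client, client_log)
            return ("granted", None)

        def lease_expired(self, client):
            # Break the client's locks and mark its pages to-recover.
            for page, (holder, log) in list(self.locks.items()):
                if holder == client:
                    del self.locks[page]
                    self.to_recover[page] = log

    xlkm = LockManager()
    xlkm.lock("P1", "A", "logA")
    xlkm.lease_expired("A")
    print(xlkm.lock("P1", "B", "logB"))   # ('recover-first', 'logA')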
Deadlocks • Deadlocks can happen in DBm • For local deadlocks the DBs algorithm is sufficient • For distributed deadlocks there is known literature • For example: • Once in a while each compute node requests the set of locks from other compute nodes • Search for cycles • For each cycle kill a victim transaction
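For illustration, a toy wait-for-graph cycle search of the kind such a detector would run; the graph contents below are invented:

    # Each compute node periodically gathers lock/wait information from the
    # others, builds a wait-for graph, and kills a victim per cycle found.
    def find_cycle(wait_for):
        # wait_for maps a transaction to the transactions it waits on.
        def dfs(node, path):
            if node in path:                       # revisited a transaction: cycle
                return path[path.index(node):] + [node]
            for nxt in wait_for.get(node, []):
                cycle = dfs(nxt, path + [node])
                if cycle:
                    return cycle
            return None

        for txn in wait_for:
            cycle = dfs(txn, [])
            if cycle:
                return cycle
        return None

    # T1 (on node A) waits on T2 (on node B) and vice versa: a distributed deadlock.
    cycle = find_cycle({"T1": ["T2"], "T2": ["T1"]})
    print("kill victim:", cycle[0] if cycle else None)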
Tables are implemented as B+-trees • The table is physically allocated on an OSD object • The internal nodes contain keys • The leaf nodes contain keys and data • Each node is represented as an 8K page • Each page (and key) can be locked separately • [Diagram: example B+-tree with keys in the internal nodes and key/data pairs in the leaves]
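A toy lookup over such a B+-tree, with an invented node layout and the 8K pages and per-page locks omitted:

    # Sketch: descend internal nodes by key, then search the leaf.
    import bisect

    class Node:
        def __init__(self, keys, children=None, values=None):
            self.keys = keys
            self.children = children   # internal node: child pages
            self.values = values       # leaf node: data for each key

    def lookup(node, key):
        while node.children is not None:            # descend internal nodes
            node = node.children[bisect.bisect_right(node.keys, key)]
        i = bisect.bisect_left(node.keys, key)      # search the leaf
        return node.values[i] if i < len(node.keys) and node.keys[i] == key else None

    leaf1 = Node(keys=[4, 10], values=["D4", "D10"])
    leaf2 = Node(keys=[20, 40], values=["D20", "D40"])
    root = Node(keys=[20], children=[leaf1, leaf2])
    print(lookup(root, 40))   # D40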
Transactions on DBm I • Each client has a log object • If A is a compute-node then logA is the log • logA contains the write-ahead log for A • Normally, each node accesses only its own log • If A fails another node will recover logA
Transactions on DBm II • Take locks on pages • From appropriate lock-manager • Write open-transaction to logA • Add log-records and modify pages in memory • Write close-transaction to logA • Release locks • Modified pages need to be written to disk prior to releasing locks • A node can do write-back caching as long as other nodes do not request a page
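The same sequence as a small Python sketch, with in-memory dictionaries standing in for DBm's lock manager, write-ahead log, buffer pool, and disk:

    # Sketch of the DBm transaction sequence; data structures are stand-ins.
    locks, wal, buffer_pool, disk = set(), [], {}, {}

    def run_transaction(txn_id, updates):
        pages = [page for page, _ in updates]
        for page in pages:                            # 1. take page locks
            locks.add(page)
        wal.append(("open-transaction", txn_id))      # 2. open-transaction record
        for page, value in updates:                   # 3. log records + in-memory updates
            wal.append(("update", txn_id, page, value))
            buffer_pool[page] = value
        wal.append(("close-transaction", txn_id))     # 4. close-transaction record
        for page in pages:                            # 5. flush modified pages to disk...
            disk[page] = buffer_pool[page]
        for page in pages:                            # ...before the locks are released
            locks.discard(page)

    run_transaction("T1", [("P1", "D2"), ("P2", "D1")])
    print(disk)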
Example Transaction • Assume table T contains • Keys K1 and K2 • Data D1 and D2 respectively • Node A wishes to swap the values D1 and D2 • Node A • Takes read-locks for K1 and K2 • Reads values D1 and D2 • Takes write-locks for K1 and K2 • Modifies value of K1 to D2 • Adds a log entry • Modifies value of K2 to D1 • Adds a log entry • Releases locks
DBm: Rollback • Basically, DBm uses the DBs solution • Assume node A is performing transaction T • Initially A holds a lock on logA and a set of pages • To rollback T, A needs to • Perform the set of log-entries in undo mode and add a CLR to logA for each modification • Read the pages from disk, modify, write back to disk • Release all locks • Since A initially holds the set of locks, no deadlocks
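A sketch of this rollback, with an invented log-entry layout of (LSN, page, old value, new value) and a CLR appended for every undone modification:

    # Undo the transaction's log entries in reverse order, writing a
    # compensation log record (CLR) to logA for each undone change.
    def rollback(txn_log, disk, wal, next_lsn):
        for lsn, page, old_value, new_value in reversed(txn_log):
            disk[page] = old_value                         # undo the change on disk
            wal.append(("CLR", next_lsn, page, old_value)) # compensation record
            next_lsn += 1
        return next_lsn
        # locks are released by the caller only after all undos are on disk

    disk = {"P1": "D2", "P2": "D1"}
    txn_log = [(8, "P1", "D1", "D2"), (9, "P2", "D2", "D1")]
    wal = []
    rollback(txn_log, disk, wal, next_lsn=10)
    print(disk, wal)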
Recovery: Lease Expiration • Node A loses the lease to a lock-manager on OSD X • If logA is on X • A breaks all connections to lock-managers • All A's pages are marked to-recover in XLKM • Full recovery is needed • If logA is not on X • A reconnects to XLKM • If B attempts to lock page P, it sees the to-recover mark • B attempts to lock logA, fails (A still holds it), and releases the lock on P
Recovery: Compute Node Failure • Scenario: node A fails and recovers • LogA needs to be replayed • Recovery is done ARIES style • Take exclusive lock on logA • Perform redo scan, then undo scan • A log entry E that applies to record R in page P is replayed by the following sequence: • Take lock on P • Check if the page LSN (PLSN) is lower than the entry LSN (ELSN) • If so, apply the update • Node B can recover logA if A does not recover
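The LSN test in that sequence, sketched with assumed page and entry layouts:

    # Redo a log entry only if the page is older than the entry (ARIES rule).
    def replay(entry, pages):
        # entry: (lsn, page_id, new_value); pages: page_id -> {"lsn", "value"}
        lsn, page_id, new_value = entry
        page = pages[page_id]                 # the lock on the page is assumed held
        if page["lsn"] < lsn:                 # only redo if the page is older
            page["value"] = new_value
            page["lsn"] = lsn

    pages = {"P": {"lsn": 6, "value": "old"}}
    replay((10, "P", "new"), pages)           # applied: 6 < 10
    replay((10, "P", "new"), pages)           # skipped on a second pass: 10 is not < 10
    print(pages)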
Applications • Possible applications: • Databases with low sharing • Search-mostly databases • Storage-Tank meta-data? • [Diagram: clients, an MDS cluster (S1, S2, S3) holding the directory structure, and object disks storing Log(S1), Log(S2), Log(S3)]
Summary • We have shown a method to construct clustered databases without group-services • Pros • Good scalability • Good performance on low-sharing workloads • Cons • Bad performance on high-sharing workloads
Acknowledgments • Avi Teperman • Gary Valentin • Effi Offer • Mark Hayden
Log Sequence Number (LSN) • Each page is stamped with an LSN • LSNs are monotonically increasing and chosen by the log-manager component • Node A performs • Take read-locks for K1 and K2 • Read values D1 and D2 • Take write-locks for K1 and K2 • Modify value of K1 to D2 • Add a log entry to LogA with LSN=8 • The page where K1 is located is marked with LSN=8 • Modify value of K2 to D1 • Add a log entry to LogA with LSN=11 • The page where K2 is located is marked with LSN=11 • Release locks
Synchronizing LSNs • There needs to be a way to synchronize the LSNs across nodes; this is required for ARIES to work • Example • Node A takes the write-lock for page P • Node A modifies P and marks it with LSN 10 • Node A writes P to disk and releases the lock • Node B takes the write-lock for P • Node B modifies P and marks it with LSN 6 • Node B writes P to disk and releases the lock • After this sequence • If A fails • During recovery the log entry with LSN 10 will be redone a second time, because the page LSN on disk (6) is lower than 10, overwriting B's later update
LSN Solution • When reading a page from disk a node will update its maximal LSN to the maximum between its current LSN and the page LSN • This ensures a monotonically increasing LSN per page. • This is sufficient. • It is relatively cheap as no cluster-wide synchronization is needed
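A sketch of this rule, with an invented LogManager: every page read raises the node's LSN counter to at least the page's LSN, so any LSN it issues afterwards is higher:

    # On each page read, lift the local maximal LSN to the page's LSN;
    # per-page LSNs then stay monotonically increasing across nodes.
    class LogManager:
        def __init__(self):
            self.max_lsn = 0

        def on_page_read(self, page_lsn):
            self.max_lsn = max(self.max_lsn, page_lsn)

        def next_lsn(self):
            self.max_lsn += 1
            return self.max_lsn

    lm = LogManager()                # this node is currently at LSN 5
    lm.max_lsn = 5
    lm.on_page_read(page_lsn=10)     # reads a page last stamped elsewhere with LSN 10
    print(lm.next_lsn())             # 11, never below the page's existing LSN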