Dalí is a storage manager designed for persistent data in main memory, optimized for databases with high transaction rates and low latency. It offers fault tolerance, recovery features, and advanced transaction models. This presentation covers Dalí's architecture, storage, transaction management, fault tolerance, and more.
Dalí Main-memory Storage Manager Tomasz Piech
Introduction • Dalí • Implemented at Bell Laboratories • Storage manager for persistent data • Architecture optimized for databases resident in main memory • Application – real-time billing and control of multimedia content delivery • High transaction rates, low latency
Introduction • Dalí Techniques • Direct access to data – direct pointers to information stored in the database – high performance • No interprocess communication – communication with the server only at connect/disconnect time; concurrency control and logging provided via shared memory • Fault-tolerant – advanced, multi-level transaction model; high-concurrency indexing and storage
Introduction • Dalí • Recovery from process failure in addition to system failure • Use of codewords and memory protection – integrity of data (discussed later) • Consistency of response time – key requirement for applications with memory-resident data • Designed for databases that fit into main memory (virtual memory will work, but not as well)
Overview of Presentation • Architecture • Storage • Transaction Management • Fault Tolerance • Concurrency Control • Collections and Indexing • Higher Level Interfaces
Architecture • Database files – user data; one or more per database • System database files – database support data, such as locks and logs • Files opened by a process are directly mapped into its address space • Memory-mapped files (mmap) or shared-memory segments provide the mapping
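Below is a minimal sketch, assuming a POSIX system, of how a database file might be mapped directly into a process's address space; the function name and error handling are illustrative, not Dalí's actual API.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a database file into this process's address space so that data
 * can be accessed through direct pointers, with no server round trips. */
void *map_database_file(const char *path, size_t *len_out)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return NULL;
    }

    /* MAP_SHARED lets every process mapping this file see updates. */
    void *base = mmap(NULL, (size_t)st.st_size, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    close(fd);                     /* the mapping survives the close */
    if (base == MAP_FAILED)
        return NULL;

    *len_out = (size_t)st.st_size;
    return base;
}
```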
Layers of Abstraction Dalí architecture is organized to support the toolkit approach
Layers of Abstraction • Toolkit approach • Logging can be turned off for data which need not be persistent • Locking can be turned off if data is private to a process • Multiple interface levels • Low-level components are exposed to user for optimization
Pointers and Offsets • Each process has a database-offset table • Specifies where in memory a file is mapped • Implemented as an array indexed by file id • Primary Dalí pointer (p) • Database-file local identifier & offset within the file • To dereference, add the offset from p to the virtual memory address from the offset table • Secondary pointer • For an index confined to one file, store just the offset, since the file's location is known
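A small illustrative sketch of the dereferencing scheme described above; all names (dali_ptr, db_offset_table, dali_deref) are assumptions for the example, not Dalí's actual identifiers.

```c
#include <stdint.h>

#define MAX_FILES 256

/* Primary Dalí-style pointer: a database-file id plus an offset. */
typedef struct {
    uint32_t file_id;   /* local identifier of the database file */
    uint32_t offset;    /* byte offset within that file          */
} dali_ptr;

/* Per-process database-offset table: base virtual address at which
 * each file is mapped, indexed by file id. */
static void *db_offset_table[MAX_FILES];

/* Dereference: add the pointer's offset to the file's mapped base. */
static inline void *dali_deref(dali_ptr p)
{
    return (char *)db_offset_table[p.file_id] + p.offset;
}
```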
Storage Allocation • Motivation • Control data should be stored separately from user data • protection of control data from stray pointers • Indirection should not exist at the lowest level • Indirection adds a level of latching for each data access & increases the path length of dereferencing itself • Dalí exposes direct pointers to allocated data, providing time and space efficiency
Storage Allocation • Motivation • Large objects should be stored contiguously • Advantage is speed; recreating a file from smaller files takes away that advantage • Different recovery characteristics should be available for different regions of the database • Not all data needs to be recovered from a crash • Indexes can be rebuilt, etc.
Storage Allocation • Two levels of non-recovered data • Zeroed memory – remains allocated but is zeroed • Transient memory – data no longer allocated upon recovery
Segments and Chunks • Segment • contiguous page-aligned unit of allocation; arbitrarily large; database files are composed of segments • Chunk • A collection of segments
Segments and Chunks • Allocators • Return standard Dalí pointers to allocated space within a chunk; indirection not imposed at storage manager level • No record of allocated space is retained • 3 different allocators • Power-of-two – allocates buckets of size 2^i * m • Inline power-of-two – as above + free-space list uses the first few bytes of each free block
Segments and Chunks • Allocators (cont’d) • Coalescing allocator – merges adjacent free space & uses a free tree • The inline power-of-two allocator is faster, but neither power-of-two allocator coalesces adjacent free space – fragmentation (thus fixed-size records only) • The coalescing allocator uses a free tree – based on the T-tree – to keep track of free space; logarithmic time for allocation and freeing
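The sketch below illustrates bucket selection in a power-of-two allocator with the free list kept inline in the free blocks themselves; the minimum block size m and all names are assumptions, not Dalí's actual values or identifiers.

```c
#include <stddef.h>

#define MIN_BLOCK   16    /* the "m" in 2^i * m; an assumed minimum size */
#define NUM_BUCKETS 16

/* In the inline variant, each free block's first bytes hold the link
 * to the next free block of the same size. */
typedef struct free_block {
    struct free_block *next;
} free_block;

/* One free list per bucket size 2^i * MIN_BLOCK. */
static free_block *buckets[NUM_BUCKETS];

/* Pick the smallest bucket whose block size fits n bytes. */
static int bucket_for(size_t n)
{
    size_t size = MIN_BLOCK;
    int i = 0;
    while (size < n && i < NUM_BUCKETS - 1) {
        size <<= 1;
        i++;
    }
    return i;
}

/* Allocate by popping the bucket's free list (refilling from the chunk
 * when the list is empty is omitted here). */
static void *p2_alloc(size_t n)
{
    int i = bucket_for(n);
    free_block *b = buckets[i];
    if (b != NULL)
        buckets[i] = b->next;
    return b;
}
```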
Page Table & Segment Headers • Segment header – associates info about a segment/chunk with a physical pointer • Allocated when a segment is added to a chunk • Can store additional info about data in the segment • Page table – maps pages to segment headers • Pre-allocated based on the maximum number of pages in the database
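An illustrative sketch of the page-table lookup described above, mapping a page to its segment header; the page size, table size, and struct layout are assumptions.

```c
#include <stdint.h>

#define PAGE_SIZE 8192u           /* assumed page size                     */
#define MAX_PAGES (1u << 20)      /* pre-allocated for the maximum db size */

/* Per-segment metadata; real contents depend on the segment's use. */
typedef struct segment_header {
    uint32_t chunk_id;            /* chunk this segment belongs to  */
    uint32_t n_pages;             /* length of the segment in pages */
} segment_header;

/* One entry per page, filled in when a segment is added to a chunk. */
static segment_header *page_table[MAX_PAGES];

/* Given any offset into the database, find the header of the segment
 * whose page contains it. */
static segment_header *segment_of(uint32_t db_offset)
{
    return page_table[db_offset / PAGE_SIZE];
}
```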
Transaction Management • Recovery • System Overview • Checkpointing
Transaction Management in Dalí • Transaction atomicity, isolation & durability in Dalí • Regions - logically organized data • A tuple, an object or arbitrary data structure (a tree or a list) • Region lock - X or S lock that guards access/updates to a region
Multi-Level Recovery • Permits use of weaker operation locks in place of X/S region locks • Example: index management • For an update to the index structure (e.g., an insert), the physical undo description must remain valid until transaction commit, so the region lock would have to be held until commit • This gives an unacceptably low level of concurrency
Multi-level Recovery • Replace low-level physical undo log records with higher-level logical undo log records (descriptions at the operation level) • Insert – a logical-undo record replaces the physical-undo record by specifying that the inserted key must be deleted • Region locks can be released while less restrictive operation locks persist – a higher level of concurrency
Multi-level Recovery • Example: find and insert • Releasing region locks allows other updates on the same region • Cascading aborts – rolling back the first operation physically would damage the effects of later actions • Only a compensating undo operation can be used to undo the operation
System Overview • Stored on disk: • Two checkpoint images Ckpt_A & Ckpt_B • cur_ckpt – anchor to the most recent valid checkpoint image for database • Single system log containing redo information, its tail in memory • end_of_stable_log – pointer; all records prior to it were flushed to stable system log
System Overview • Stored in the system database & with each checkpoint • Active Transaction Table (ATT) • Stores separate redo & undo logs for each active transaction • dpt – dirty page table; stores pages updated since the last checkpoint • ckpt_dpt – dpt in a checkpoint
Transactions and Operations • Transaction – a list of operations • Each operation has a level Li associated with it • An operation at level Li can consist of operations of level Li-1 • L0 operations are physical updates to regions • Pre-commit – the commit record enters the system log in memory • Commit – the commit record hits stable storage
Logging Model • Updates generate physical undo and redo log records, appended to the Tx’s undo & redo logs (in the ATT) • When a Tx pre-commits, its redo log records are appended to the system log, and logical undo descriptions are included in operation commit log records in the system log • When an operation pre-commits, the undo log records for its sub-operations/updates are deleted from the Tx’s undo log & this operation’s logical undo is appended to the Tx’s undo log
Logging Model • Locks are released once the Tx/operation pre-commits • The system log is flushed to disk when a Tx commits • Dirty pages are marked in the dpt by the flushing procedure – no page latching
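For orientation, a rough sketch of the logging structures these slides describe; the record layout and all names are assumptions rather than Dalí's actual definitions.

```c
#include <stdint.h>

/* A generic log record: undo records hold before-images, redo records
 * hold after-images (the image bytes follow the header in practice). */
typedef struct log_rec {
    struct log_rec *next;
    uint32_t region_off;          /* region being updated */
    uint32_t len;                 /* length of the image  */
} log_rec;

/* One entry per active transaction in the ATT, with its private logs. */
typedef struct {
    int      tx_id;
    log_rec *undo_log;            /* physical undos, replaced by logical
                                     undos as operations pre-commit     */
    log_rec *redo_log;            /* moved to the system log at Tx
                                     pre-commit                         */
} att_entry;

/* The single shared system log; only its tail lives in memory. */
typedef struct {
    log_rec *tail_in_memory;
    uint64_t end_of_stable_log;   /* everything before this offset has
                                     been flushed to disk              */
} system_log;
```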
Ping-pong Checkpointing • Traditionally, systems implement WAL (write-ahead logging) for recovery – but it is impossible to enforce WAL without latches • Latches increase access cost in main memory & interfere with normal processing • Solution: store two copies of the database image on disk; dirty pages are written to alternate checkpoints • Fuzzy checkpointing – no latches used, no interference with normal operations
Ping-pong Checkpointing • Checkpoints are allowed to be temporarily inconsistent – updates written out without undo records • Redo and undo info from ATT is written out to a checkpoint and brings it to a consistent state • If failure occurs, the other checkpoint is still consistent and can be used for recovery
Ping-pong Checkpointing • Log flush necessary at end of checkpointing before toggling cur_ckpt – commit might take place before writing out ATT, leaving no undo information if system crashes
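A compact sketch of the ping-pong discipline, with the I/O steps stubbed out as hypothetical helpers; the key point is the ordering: write to the non-current image, flush the log, then toggle cur_ckpt.

```c
enum ckpt { CKPT_A = 0, CKPT_B = 1 };

static enum ckpt cur_ckpt = CKPT_A;       /* anchor to last valid image */

/* Hypothetical helpers standing in for the real checkpoint I/O. */
static void write_dirty_pages(enum ckpt target) { (void)target; }
static void write_att_and_dpt(enum ckpt target) { (void)target; }
static void flush_system_log(void) {}
static void persist_anchor(enum ckpt c) { (void)c; }

static void take_checkpoint(void)
{
    /* Always write to the image cur_ckpt does NOT point at. */
    enum ckpt target = (cur_ckpt == CKPT_A) ? CKPT_B : CKPT_A;

    write_dirty_pages(target);    /* image may be temporarily inconsistent */
    write_att_and_dpt(target);    /* undo/redo info restores consistency   */
    flush_system_log();           /* must happen before toggling cur_ckpt  */

    cur_ckpt = target;            /* toggle the anchor only at the end     */
    persist_anchor(cur_ckpt);
}
```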
Abort Processing • Upon abort, undo log records are undone by traversing the undo log sequentially from the end • A new physical-redo log record is created for every physical undo encountered • Similarly, for a logical undo a “compensation” operation (the “proxy”) is executed • All undo log records are deleted when the proxy commits
Abort Processing • The commit record for a proxy is similar to compensation log records (CLRs) in ARIES • During recovery, a logical-undo log record is deleted from the Tx’s undo log if a CLR is encountered, preventing the Tx from being undone again
Recovery • end_of_stable_log is where recovery begins • Initializes the ATT and undo logs with copies from the last checkpoint • Loads the database image and sets the dpt to zero • Applies all redo log records following the begin-recovery-point • Then all active transactions are rolled back • First all completed L0 operations are rolled back, then L1, then L2, and so on
Post-commit Operations • Operations guaranteed to be carried out after commit of a transaction/operation even if the system crashes • Some operations cannot be rolled back once performed (deletion then allocation of same space to different operation) • Need to ensure high concurrency on storage allocator – cannot hold locks • Solution – perform these operations after transaction commits (keep post-commit log)
Fault Tolerance • Process Death and Its Detection
Fault Tolerance • Techniques that help cope with process failure scenarios
Process Death • Caused by an attempt to access invalid memory, or by an operator kill • Must return shared data partially updated to consistent state • Abort any uncommitted transactions owned by that process • Cleanup server is primarily responsible for cleaning up dead processes
Process Death • Active Process Table (APT) – keeps track of all processes in the system; scanned periodically to check if any are dead • Low-level cleanup • A process registers in the APT any latch it acquires • If a latch is held by a dead process, the cleanup function for that latch is called • If the latch cannot be cleaned up, a system crash is simulated
Process Death • Cleaning up Transactions • Cleanup agent – scans the Tx table and aborts any Tx running on behalf of the dead process, or executes post-commit actions for committed Tx • Multiple cleanup agents are spawned if multiple processes have died
Protection from Application Errors • Memory protection • munprotect is called right before an update to a page, and mprotect after the Tx commits, so pages stay protected outside of updates • Codewords • associate a logical parity word with each page of data • Erroneous writes update only the physical data, not the codeword – a crash is simulated if an error is found
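The sketch below shows one way per-page codewords can work, using a simple XOR parity word (an assumption; the actual codeword scheme may differ): legitimate writes go through page_write and keep the parity in sync, while stray writes bypass it and are caught by the audit.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_WORDS 1024

typedef struct {
    uint64_t data[PAGE_WORDS];
    uint64_t codeword;            /* XOR parity over data[] */
} page;

/* Legitimate updates go through the storage manager's write path,
 * which keeps the parity word in sync incrementally. */
static void page_write(page *p, size_t i, uint64_t value)
{
    p->codeword ^= p->data[i] ^ value;   /* remove old word, add new */
    p->data[i] = value;
}

/* Audit: recompute parity; a mismatch means some write bypassed
 * page_write (a stray application pointer), and a crash is simulated. */
static int page_ok(const page *p)
{
    uint64_t cw = 0;
    for (size_t i = 0; i < PAGE_WORDS; i++)
        cw ^= p->data[i];
    return cw == p->codeword;
}
```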
Concurrency Control • Implementation of Latches
Concurrency Control • Concurrency control facilities: • Latches (low-level locks for mutual exclusion) • Queuing locks • Latch Implementation • Semaphores too expensive – system call overhead • Implementation must complement cleanup server
Latch Implementation • Processes that wish to acquire a latch keep a pointer to that latch in their wants field • A cleanup-in-progress flag, when set to true, forbids processes from attempting to get the latch • The cleanup server waits for processes to set their wants fields to null or to another latch, or to die • If a dead process is a registered owner of the latch, the cleanup function is called
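An illustrative sketch of the cleanup-aware latch acquisition protocol using C11 atomics; the field names and exact protocol details are assumptions based on the description above.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct latch {
    _Atomic int  owner_pid;              /* 0 means the latch is free */
    atomic_bool  cleanup_in_progress;    /* set while cleanup runs    */
} latch;

/* One slot per process in the Active Process Table. */
typedef struct process_entry {
    int pid;
    latch *_Atomic wants;                /* latch this process is after */
} process_entry;

static bool latch_acquire(latch *l, process_entry *self)
{
    /* Advertise intent first, so if this process dies here the cleanup
     * server can see which latch it was trying to acquire. */
    atomic_store(&self->wants, l);

    /* Back off while the cleanup server is repairing this latch. */
    if (atomic_load(&l->cleanup_in_progress)) {
        atomic_store(&self->wants, (latch *)NULL);
        return false;
    }

    /* Try to become the registered owner. */
    int expected = 0;
    bool got = atomic_compare_exchange_strong(&l->owner_pid, &expected,
                                              self->pid);

    /* Ownership (or failure) is now recorded in the latch itself. */
    atomic_store(&self->wants, (latch *)NULL);
    return got;
}
```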
Locking System • Lock header structure • Stores a pointer to a list of locks that have been requested (but not released) by transactions • A request times out if not granted within a certain amount of time • New lock modes can be added with the use of conflicts and covers • covers – the holder of lock A checks for conflicts when requesting a new lock of type B, unless A covers B
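A small sketch of lock modes defined by conflicts and covers tables; the modes chosen (S, X, IX) and the table contents are illustrative assumptions, not Dalí's actual mode set.

```c
#include <stdbool.h>

enum mode { M_S, M_X, M_IX, N_MODES };   /* shared, exclusive, intention-exclusive */

/* conflicts[a][b]: may modes a and b not be held concurrently? */
static const bool conflicts[N_MODES][N_MODES] = {
    /*           S      X      IX   */
    /* S  */ { false, true,  true  },
    /* X  */ { true,  true,  true  },
    /* IX */ { true,  true,  false },
};

/* covers[a][b]: does holding mode a already subsume mode b? */
static const bool covers[N_MODES][N_MODES] = {
    /*           S      X      IX   */
    /* S  */ { true,  false, false },
    /* X  */ { true,  true,  true  },
    /* IX */ { false, false, true  },
};

/* A transaction holding `held` and requesting `want` only needs to
 * check the conflicts table against other holders if `held` does not
 * already cover `want`. */
static bool needs_conflict_check(enum mode held, enum mode want)
{
    return !covers[held][want];
}
```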