Representing Data Elements

Representing Data Elements Fields, Records, Blocks Variable-length Data Modifying Records Source: our textbook

Overview • Attributes are represented by sequences of bytes, called fields • Tuples are represented by collections of fields, called records • Relations are represented by collections of records, called files • Files are stored in blocks, using specialized data structures to support efficient modification and querying

Representing SQL Data Types • integers and reals: built-in • CHAR(n): array of n bytes • VARCHAR(n): array of n+1 bytes (extra byte is either string length or null char) • dates and times: fixed length strings • etc.

30 286 287 297 0 address VARCHAR(255) 256 bytes birthdate DATE 10 bytes name CHAR(30) 30 bytes gender CHAR(1) 1 byte Representing Tuples • For now, assume all attributes (fields) are fixed length. • Concatenate the fields • Store the offset of each field in schema

32 288 292 304 0 address VARCHAR(255) 256 bytes birthdate DATE 10 bytes + 2 name CHAR(30) 30 bytes + 2 gender CHAR(1) 1 byte + 3 More on Tuples • Due to hardware considerations, certain types of data need to start at addresses that are multiples of 4 or 8 • Previous example becomes:

Record Headers • Often it is convenient to keep some "header" information in each record: • a pointer to schema information (attributes/fields, types, their order in the tuple, constraints) • length of the record/tuple • timestamp of last modification

Packing Records into Blocks • Start with block header: • timestamp of last modification/access • offset of each record in the block, etc. • Follow with sequence of records • May end with some unused space … header record 1 record 2 record n-1 record n

Representing Addresses • Often addresses (pointers) are part of records: • the application data in object-oriented databases • as part of indexes and other data structures supporting the DBMS • Every data item (block, record, etc.) has two addresses: • database address: address on the disk (typically 8-16 bytes) • memory address, if the item is in memory (typically 4 bytes)

Translation Table • Provides mapping from database addresses to memory addresses for all blocks currently in memory • Later we'll discuss how to implement it

Pointer Swizzling • When a block is moved from disk into main memory, change all the disk addresses that point to items in this block into main memory addresses. • Need a bit for each address to indicate if it is a disk address or a memory address. • Why? Faster to follow memory pointers (only uses a single machine instruction).

Example of Swizzling Disk Main Memory read into main memory swizzled Block 1 unswizzled Block 2

Swizzling Policies • Automatic swizzling: as soon as block is brought into memory, swizzle all relevant pointers • Swizzling on demand: only swizzle a pointer if and when it is actually followed • No swizzling: always refer to translation table • Programmer control

Automatic Swizzling • Locating all pointers within a block: • refer to the schema, which will indicate where addresses are in the records • for index structures, pointers are at known locations • Update translation table with memory addresses of items in the block • Update pointers in the block (in memory) with memory addresses, when possible, as obtained from translation table

Unswizzling • When a block is moved from memory back to disk, all pointers must go back to database (disk) addresses • Use translation table again • Important to have an efficient data structure for the translation table

Pinned Records and Blocks • A block in memory is pinned if it cannot be safely written back to disk • Indicate with a bit in the block header • Reasons for pinning: • related to failure recovery (more later) • because of pointer swizzling • If block B1 has swizzled pointer to an item in block B2, then B2 is pinned.

Unpinning a Block • Consider each item in the block to be unpinned • Keep in the translation table the places in memory holding swizzled pointers to that item (e.g., with a linked list) • Unswizzle those pointers: use translation table to replace the memory addresses with database (disk) addresses

Variable Length Data • Data items with varying size (e.g., if maximum size of a field is large but most of the time the values are small) • Variable-format records (e.g., NULLs method for representing a hierarchy of entity sets as relations) • Records that do not fit in a block (e.g., an MPEG of a movie)

Variable-Length Fields • Store the fixed-length fields before the variable-length fields in each record • Keep in the record header • record length • pointers to the beginnings of all the variable-length fields • Book discusses variations on this idea

Variable Length Fields other header info to var len field 2 to var len field 3 fixed len field 1 fixed len field 2 var len field 1 var len field 2 var len field 3 record length

Variable-Format Records • Represent by a sequence of tagged fields • Each tagged field contains • name • type • length, if not deducible from the type • value

Splitting Records Across Blocks • Called spanned records • Useful when • record size exceeds block size • putting an integral number of records in a block wastes a lot of the block (e.g., record size is 51% of block size) • Each record or fragment header contains • bit indicating if it is a fragment • if fragment then pointers to previous and next fragments of the record (i.e., a linked list)

Record Modification • Modifications to records: • insert • delete • update • issues even with fixed-length records and fields • even more involved with variable-length data

Inserting New Records • If records need not be any particular order, then just find a block with enough empty space • Later we'll see how to keep track of all the tuples of a given relation • But what if blocks should be kept in a certain order, such as sorted on primary key?

Insertion in Order header record 4 record 3 record 2 record 1 unused If there is space in the block, then add the record (going right to left), add a pointer to it (going left to right) and rearrange the pointers as needed.

What if Block is Full? • Records are stored in several blocks, in sorted order • One approach: keep a linked list of "overflow" blocks for each block in the main sequence • Another approach is described in the book

Deleting Records • Try to reclaim space made available after a record is deleted • If using an offset table, then rearrange the records to fill in any hole that is left behind and adjust the pointers • Additional mechanisms are based on keeping a linked list of available space and compacting when possible

Tombstones • What about pointers to deleted records? • We place a tombstone in place of each deleted record • Tombstone is permanent • Issue of where to place the tombstone • Keep a tombstone bit in each record header: if this is a tombstone, then no need to store additional data

Updating Records • For fixed-length records, there is no effect on the storage system • For variable-length records: • if length increases, like insertion • if length decreases, like deletion except tombstones are not necessary

Representing Data Elements