Lecture 5: Record Storage and Primary File Organizations • Storage Devices • Storage of Databases • Operations on Files • Primary vs. Secondary File Organizations • Heap Files • Sorted Files • Hashing
Storage Devices • Computer Storage Medium (Hierarchy) • Factors: cost, capacity, speed • Primary Storage – data processed directly by the CPU; main memory, cache memory • Secondary (on-line) Storage - data must first be copied into primary storage for processing; magnetic disks • Secondary (off-line) Storage - optical disks (direct access), magnetic tapes (sequential)
Storage of Databases • Main Memory Databases • entire databases are kept in main memory • main memory is a volatile storage: requires a backup copy (on magnetic disk) • Most Databases • are stored permanently on magnetic disk • are too large to fit entirely in main memory • magnetic disk is less expensive
File Records on Disk • Records • file as a sequence of records (fig5.7) • record type = field names + data types • Fixed-Length Records • records with the same size in a file • Variable-Length Records (with separators) • records of different sizes • caused by multi-valued fields, optional fields, or variable-length fields
File Blocks on Disk • Disk Block (fig5.8) • unit of data transfer between disk & memory • records of a file are allocated to disk blocks • usually 512 to 4K bytes (K = 1024) • Blocking Factor (bfr) • number of (fixed-length) records per block • bfr = ⌊B/R⌋ • B = block size, R = record size (in bytes)
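A small numeric sketch of the blocking-factor formula in Python; the block size, record size, and record count below are assumed values for illustration, not figures from the lecture.

    B, R, r = 512, 100, 120        # assumed: 512-byte blocks, 100-byte records, 120 records
    bfr = B // R                   # blocking factor = floor(B/R) = 5 records per block
    b = -(-r // bfr)               # blocks needed with unspanned allocation = ceil(r/bfr) = 24
    unused = B - bfr * R           # bytes left unused per block under unspanned allocation = 12
    print(bfr, b, unused)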
File Blocks on Disk • Spanned vs. Unspanned File Org. (fig5.8) • Unspanned: records do not cross block boundaries; the remaining space in each block is left unused • Spanned: a record may span two or more blocks, so the leftover space is used (a pointer links to the rest of the record) • Contiguous vs. Linked Allocation • Contiguous: file blocks are allocated to consecutive disk blocks • Linked: each file block contains a pointer to the next file block
Operations on Files • Types of Operations • Retrieval: do not change data in the file (open/close a file, find/read records) • Update: change the files by insertion, deletion or modification of records • Record-at-a-time: operations are applied to a single record • Set-at-a-time: operations are applied to a set of records or to the whole file
Operations on Files • File Open/Close Operations • Open: readies the file for access, allocates buffers to hold file blocks, sets the file pointer to the beginning of the file • Close: terminates access to the file • Record-at-a-time Operations • Find: searches for the first file record that satisfies a certain condition (selection condition), and makes it the current file record
Operations on Files • FindNext: searches for the next file record (from the current record) and makes it the current file record • Read: reads the current file record • Insert: inserts a new record into the file and makes it the current file record • Delete: removes the current file record from the file by marking the record to indicate that it is no longer valid
Operations on Files • Modify: changes the values of some fields of the current file record • Set-at-a-time Operations • FindAll: locates all the records satisfying a search condition • FindOrdered: retrieves all the records in a specified order • Reorganize: reorganizes the file records after update operations
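A minimal Python sketch of how the record-at-a-time operations above could be exposed programmatically; the class and method names are hypothetical, and the records are held in memory rather than in disk blocks.

    class ScanFile:
        def __init__(self, records):
            self.records = list(records)   # in-memory stand-in for the file's records
            self.current = -1              # index of the current file record

        def find(self, condition):
            # Find: locate the first record satisfying the selection condition
            self.current = -1
            return self.find_next(condition)

        def find_next(self, condition):
            # FindNext: continue the search from the current record onward
            for i in range(self.current + 1, len(self.records)):
                rec = self.records[i]
                if rec is not None and condition(rec):
                    self.current = i
                    return True
            return False

        def read(self):
            # Read: return the current file record
            return self.records[self.current] if self.current >= 0 else None

        def delete(self):
            # Delete: mark the current record as no longer valid
            if self.current >= 0:
                self.records[self.current] = None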
Operations on Files • Operation Factors • Access Type: equality on an attribute value (=) or a range of values (>) • Access Time: time to find a particular record or set of records • Insertion Time: time to insert a new record (find the place to insert + update any index structure) • Deletion Time: time to delete a record (find the record(s) to delete + update any index structure) • Space Overhead: additional space occupied by an index structure
Primary vs. Secondary File Organizations • Primary File Organizations • Heap Files • Sorted Files • Hashing • Secondary File Organizations (Index) • Single-level or Multi-level Indexes • B-trees • B+-trees
Heap Files • Files of Unordered Records • simplest and most basic file organization • new records are inserted at the end of the file • Access: linear search goes through the file block by block (N/2 block reads on average if the record exists, N if it does not, where N is the number of file blocks), very inefficient (O(N) time) • Insertion: very efficient (records are stored in insertion order) • Deletion: must first find the record's block, inefficient
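A rough Python sketch of linear search and insertion on a heap file; blocks are modeled as an in-memory list of lists, which is an assumption made purely for illustration.

    def heap_find(blocks, field, value):
        # Linear search: scan block by block (about N/2 block reads on average, N if absent)
        for block_no, block in enumerate(blocks):
            for slot, record in enumerate(block):
                if record is not None and record[field] == value:
                    return block_no, slot
        return None

    def heap_insert(blocks, record, bfr):
        # Insertion: append to the last block of the file (allocate a new block if it is full)
        if not blocks or len(blocks[-1]) >= bfr:
            blocks.append([])
        blocks[-1].append(record)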
Heap Files • Direct File • allows direct access to a record by its position in the file • applies only to fixed-length records, contiguous allocation, and unspanned blocks • file records numbered 0, 1, … , r-1 (e.g., r = 120) • records in each block numbered 0, 1, … , bfr-1 (e.g., bfr = 15) • ith record of the file (e.g., i = 43): block position = ⌊i/bfr⌋, record position within the block = i mod bfr
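Working through the slide's own example values (r = 120 fixed-length records, bfr = 15, record number i = 43):

    bfr, i = 15, 43
    block_position = i // bfr      # floor(43/15) = 2  -> the record is in the third block
    slot_in_block  = i % bfr       # 43 mod 15   = 13  -> 14th record slot within that block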
Sorted Files • Files of Ordered Records • file records are kept sorted by the values of an ordering field (sequential file): fig5.9 • Access: binary search on the ordering field reads and searches about log2(N) file blocks on average (O(log N) time), an improvement over linear search • Insertion: records must be inserted in the correct order, very inefficient
Sorted Files • Files of Ordered Records (cont'd) • Deletion: inefficient; less expensive with deletion markers and periodic reorganization • FindOrdered: reading the records in order of the ordering key values is extremely efficient • Overflow: a temporary unordered file for new records improves insertion efficiency; it is periodically merged with the main ordered file
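A sketch of binary search over the blocks of a sorted file; as in the heap-file sketch, blocks are an in-memory list of lists, and each block is assumed non-empty and ordered by the ordering field.

    def sorted_find(blocks, field, value):
        lo, hi = 0, len(blocks) - 1
        while lo <= hi:                          # roughly log2(N) block reads
            mid = (lo + hi) // 2
            block = blocks[mid]
            if value < block[0][field]:
                hi = mid - 1                     # target lies in an earlier block
            elif value > block[-1][field]:
                lo = mid + 1                     # target lies in a later block
            else:
                for slot, record in enumerate(block):   # search within the block
                    if record[field] == value:
                        return mid, slot
                return None                      # value falls in this block's range but is absent
        return None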
Hashing • Hash Functions • records in the file are unordered • the hash function determines the address (B) of a record from the value of its hash field (K): h(K) -> B • e.g., h(K) = K mod M, yielding addresses 0, 1, … , M-1 • allows direct access to the target disk block • searching for the record within the block is done in main memory
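A tiny Python example of the mod hash function; M = 7 is an assumed table size.

    M = 7                       # assumed number of hash addresses
    def h(K):
        return K % M            # maps a hash field value K to an address in 0 .. M-1

    print(h(3), h(10), h(17))   # 3 3 3 -> 10 and 17 collide with 3 (same hash address)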
Internal Hashing • Internal Hashing • hashing for an internal file • hash table as an array of records (fig5.10) • a noninteger hash field value such as a name can be transformed into an integer (e.g., using ASCII character codes) • Collision (of hash addresses) • occurs when two hash field values are mapped to the same hash address
Collision Resolution • Open Addressing • checks the subsequent positions in order until an empty position is found • Chaining • extends the array with a number of overflow positions • uses a linked list of overflow records for each hash address • the overflow pointer refers to the position of the next record (fig5.10(b))
Collision Resolution • Multiple Hashing • applies a second hash function if the first hash function results in a collision • uses open addressing or applies a third hash function if another collision results • Good Hashing Function • distributes records uniformly and randomly over the address space • keep the hash table 70-90% full as a trade-off between collisions and unused locations
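A short sketch of open addressing (linear probing) for internal hashing; the table size and record format are assumptions for illustration.

    M = 11
    table = [None] * M                      # internal hash table held in memory

    def insert_open_addressing(K, record):
        # If position K mod M is occupied, check subsequent positions until an empty one is found
        for step in range(M):
            pos = (K % M + step) % M
            if table[pos] is None:
                table[pos] = (K, record)
                return pos
        raise RuntimeError("hash table is full")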
External Hashing • Hashing Function • the target address space is made of buckets (one disk block or a cluster of contiguous blocks) • maps a hash field value to a bucket number • the bucket number is then converted to the corresponding disk block address (fig5.11) • collisions are less severe with buckets, because as many records as will fit in a bucket can hash to the same bucket without causing problems
External Hashing • Bucket Overflow • when a bucket is filled to capacity • can be solved by chaining method: fig5.12 • a pointer is maintained in each bucket to a linked list of overflow records for the bucket • record pointers include both a block address and a relative record position within the block
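A simplified sketch of external hashing with overflow chaining; buckets and their overflow chains are modeled as in-memory lists, and the number of buckets and bucket capacity are assumed values.

    M, capacity = 4, 2                      # assumed: 4 buckets, 2 records per bucket
    buckets  = [[] for _ in range(M)]       # main buckets (one disk block each)
    overflow = [[] for _ in range(M)]       # overflow chain per bucket

    def bucket_insert(K, record):
        b = K % M                           # hash field value -> bucket number
        target = buckets[b] if len(buckets[b]) < capacity else overflow[b]
        target.append((K, record))          # overflow records go to the bucket's chain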
External Hashing • Static Hashing • very fast access to records by the hash field • a fixed number of buckets M is allocated • not suitable for dynamic files (files that grow and shrink dynamically) • the number of buckets is difficult to determine in advance • requires a dynamic hashing technique
Dynamic Hashing • Extendible Hashing (fig5.13) • maintains a directory of 2^d bucket addresses • uses the first d bits of a hash value to select a directory entry, which gives the bucket address • d = global depth, d’ = local depth of a bucket • the directory expands and shrinks dynamically • bucket doubling (split) vs. halving (merge) • update the directory and local depths appropriately
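A sketch of the directory lookup in extendible hashing; the global depth, directory contents, and 32-bit hash width are assumed values chosen only to show how the first d bits select a bucket.

    d = 2                                   # global depth: directory has 2**d entries
    directory = ["bucket_A", "bucket_B", "bucket_C", "bucket_C"]   # two entries share bucket_C (local depth 1)

    def bucket_for(hash_value, bits=32):
        prefix = hash_value >> (bits - d)   # first d bits of the hash value
        return directory[prefix]            # the directory entry holds the bucket address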