Lecture 5: Record Storage and Primary File Organizations • Storage Devices • Storage of Databases • Operations on Files • Primary vs. Secondary File Organizations • Heap Files • Sorted Files • Hashing
Storage Devices • Computer Storage Medium (Hierarchy) • Factors: cost, capacity, speed • Primary Storage – data processed directly by the CPU; main memory, cache memory • Secondary (on-line) Storage - data must first be copied into primary storage for processing; magnetic disks • Secondary (off-line) Storage - optical disks (direct access), magnetic tapes (sequential)
Storage of Databases • Main Memory Databases • entire databases are kept in main memory • main memory is a volatile storage: requires a backup copy (on magnetic disk) • Most Databases • are stored permanently on magnetic disk • are too large to fit entirely in main memory • magnetic disk is less expensive
File Records on Disk • Records • file as a sequence of records (fig5.7) • record type = field names + data types • Fixed-Length Records • records with the same size in a file • Variable-Length Records (with separators) • records of different sizes • caused by multi-valued fields, optional fields, or variable-length fields
File Blocks on Disk • Disk Block (fig5.8) • unit of data transfer between disk & memory • records of a file are allocated to disk blocks • usually 512 to 4K bytes (K = 1024) • Blocking Factor (bfr) • number of (fixed-length) records per block • bfr = ⌊B/R⌋ • B = block size, R = record size (in bytes)
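A small numeric sketch of the blocking-factor formula in Python; the block size, record size, and record count below are assumed values for illustration, not figures from the lecture.

    B, R, r = 512, 100, 120        # assumed: 512-byte blocks, 100-byte records, 120 records
    bfr = B // R                   # blocking factor = floor(B/R) = 5 records per block
    b = -(-r // bfr)               # blocks needed with unspanned allocation = ceil(r/bfr) = 24
    unused = B - bfr * R           # bytes left unused per block under unspanned allocation = 12
    print(bfr, b, unused)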
File Blocks on Disk • Spanned vs. Unspanned File Org. (fig5.8) • Unspanned: records do not cross block boundaries; the remaining space in each block is left unused • Spanned: a record may span two or more blocks, so the leftover space is used (a pointer links to the rest of the record) • Contiguous vs. Linked Allocation • Contiguous: file blocks are allocated to consecutive disk blocks • Linked: each file block contains a pointer to the next file block
Operations on Files • Types of Operations • Retrieval: do not change data in the file (open/close a file, find/read records) • Update: change the files by insertion, deletion or modification of records • Record-at-a-time: operations are applied to a single record • Set-at-a-time: operations are applied to a set of records or to the whole file
Operations on Files • File Open/Close Operations • Open: readies the file for access, allocates buffers to hold file blocks, sets the file pointer to the beginning of the file • Close: terminates access to the file • Record-at-a-time Operations • Find: searches for the first file record that satisfies a certain condition (selection condition), and makes it the current file record
Operations on Files • FindNext: searches for the next file record (from the current record) and makes it the current file record • Read: reads the current file record • Insert: inserts a new record into the file and makes it the current file record • Delete: removes the current file record from the file by marking the record to indicate that it is no longer valid
Operations on Files • Modify: changes the values of some fields of the current file record • Set-at-a-time Operations • FindAll: locates all the records satisfying a search condition • FindOrdered: retrieves all the records in a specified order • Reorganize: reorganizes the file records after update operations
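A minimal Python sketch of how the record-at-a-time operations above could be exposed programmatically; the class and method names are hypothetical, and the records are held in memory rather than in disk blocks.

    class ScanFile:
        def __init__(self, records):
            self.records = list(records)   # in-memory stand-in for the file's records
            self.current = -1              # index of the current file record

        def find(self, condition):
            # Find: locate the first record satisfying the selection condition
            self.current = -1
            return self.find_next(condition)

        def find_next(self, condition):
            # FindNext: continue the search from the current record onward
            for i in range(self.current + 1, len(self.records)):
                rec = self.records[i]
                if rec is not None and condition(rec):
                    self.current = i
                    return True
            return False

        def read(self):
            # Read: return the current file record
            return self.records[self.current] if self.current >= 0 else None

        def delete(self):
            # Delete: mark the current record as no longer valid
            if self.current >= 0:
                self.records[self.current] = None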
Operations on Files • Operation Factors • Access Type: equality on an attribute value (=) or a range of values (>) • Access Time: time to find a particular record or set of records • Insertion Time: time to insert a new record (find the place to insert + update any index structure) • Deletion Time: time to delete a record (find the record(s) to delete + update any index structure) • Space Overhead: additional space occupied by an index structure
Primary vs. Secondary File Organizations • Primary File Organizations • Heap Files • Sorted Files • Hashing • Secondary File Organizations (Index) • Single-level or Multi-level Indexes • B-trees • B+-trees
Heap Files • Files of Unordered Records • simplest and most basic file organization • new records are inserted at the end of the file • Access: linear search goes through the file block by block (N/2 block reads on average if the record exists, N if it does not, where N is the number of file blocks), very inefficient (O(N) time) • Insertion: very efficient (records are stored in insertion order) • Deletion: must first find the record's block, inefficient
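A rough Python sketch of linear search and insertion on a heap file; blocks are modeled as an in-memory list of lists, which is an assumption made purely for illustration.

    def heap_find(blocks, field, value):
        # Linear search: scan block by block (about N/2 block reads on average, N if absent)
        for block_no, block in enumerate(blocks):
            for slot, record in enumerate(block):
                if record is not None and record[field] == value:
                    return block_no, slot
        return None

    def heap_insert(blocks, record, bfr):
        # Insertion: append to the last block of the file (allocate a new block if it is full)
        if not blocks or len(blocks[-1]) >= bfr:
            blocks.append([])
        blocks[-1].append(record)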
Heap Files • Direct File • allows direct access to a record by its position in the file • applies only to fixed-length records, contiguous allocation, and unspanned blocks • file records numbered 0, 1, … , r-1 (e.g., r = 120) • records in each block numbered 0, 1, … , bfr-1 (e.g., bfr = 15) • ith record of the file (e.g., i = 43): block position = ⌊i/bfr⌋, record position within the block = i mod bfr
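Working through the slide's own example values (r = 120 fixed-length records, bfr = 15, record number i = 43):

    bfr, i = 15, 43
    block_position = i // bfr      # floor(43/15) = 2  -> the record is in the third block
    slot_in_block  = i % bfr       # 43 mod 15   = 13  -> 14th record slot within that block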
Sorted Files • Files of Ordered Records • file records are kept sorted by the values of an ordering field (sequential file): fig5.9 • Access: binary search on the ordering field reads and searches about log2(N) file blocks on average (O(log N) time), an improvement over linear search • Insertion: records must be inserted in the correct order, very inefficient
Sorted Files • Files of Ordered Records (cont'd) • Deletion: inefficient; less expensive with deletion markers and periodic reorganization • FindOrdered: reading the records in order of the ordering key values is extremely efficient • Overflow: a temporary unordered file for new records improves insertion efficiency; it is periodically merged with the main ordered file
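A sketch of binary search over the blocks of a sorted file; as in the heap-file sketch, blocks are an in-memory list of lists, and each block is assumed non-empty and ordered by the ordering field.

    def sorted_find(blocks, field, value):
        lo, hi = 0, len(blocks) - 1
        while lo <= hi:                          # roughly log2(N) block reads
            mid = (lo + hi) // 2
            block = blocks[mid]
            if value < block[0][field]:
                hi = mid - 1                     # target lies in an earlier block
            elif value > block[-1][field]:
                lo = mid + 1                     # target lies in a later block
            else:
                for slot, record in enumerate(block):   # search within the block
                    if record[field] == value:
                        return mid, slot
                return None                      # value falls in this block's range but is absent
        return None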
Hashing • Hash Functions • records in the file are unordered • the hash function determines the address (B) of a record from the value of its hash field (K): h(K) -> B • e.g., h(K) = K mod M, yielding addresses 0, 1, … , M-1 • allows direct access to the target disk block • searching for the record within the block is done in main memory
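A tiny Python example of the mod hash function; M = 7 is an assumed table size.

    M = 7                       # assumed number of hash addresses
    def h(K):
        return K % M            # maps a hash field value K to an address in 0 .. M-1

    print(h(3), h(10), h(17))   # 3 3 3 -> 10 and 17 collide with 3 (same hash address)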
Internal Hashing • Internal Hashing • hashing for an internal file • hash table as an array of records (fig5.10) • a noninteger hash field value such as a name can be transformed into an integer (e.g., using ASCII character codes) • Collision (of hash addresses) • occurs when two hash field values are mapped to the same hash address
Collision Resolution • Open Addressing • checks the subsequent positions in order until an empty position is found • Chaining • extends the array with a number of overflow positions • uses a linked list of overflow records for each hash address • the overflow pointer refers to the position of the next record (fig5.10(b))
Collision Resolution • Multiple Hashing • applies a second hash function if the first hash function results in a collision • uses open addressing or applies a third hash function if another collision results • Good Hashing Function • distributes records uniformly and randomly over the address space • keep the hash table 70-90% full as a trade-off between collisions and unused locations
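A short sketch of open addressing (linear probing) for internal hashing; the table size and record format are assumptions for illustration.

    M = 11
    table = [None] * M                      # internal hash table held in memory

    def insert_open_addressing(K, record):
        # If position K mod M is occupied, check subsequent positions until an empty one is found
        for step in range(M):
            pos = (K % M + step) % M
            if table[pos] is None:
                table[pos] = (K, record)
                return pos
        raise RuntimeError("hash table is full")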
External Hashing • Hashing Function • the target address space is made of buckets (one disk block or a cluster of contiguous blocks) • maps a hash field value to a bucket number • the bucket number is then converted to the corresponding disk block address (fig5.11) • collisions are less severe with buckets, because as many records as will fit in a bucket can hash to the same bucket without causing problems
External Hashing • Bucket Overflow • when a bucket is filled to capacity • can be solved by chaining method: fig5.12 • a pointer is maintained in each bucket to a linked list of overflow records for the bucket • record pointers include both a block address and a relative record position within the block
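A simplified sketch of external hashing with overflow chaining; buckets and their overflow chains are modeled as in-memory lists, and the number of buckets and bucket capacity are assumed values.

    M, capacity = 4, 2                      # assumed: 4 buckets, 2 records per bucket
    buckets  = [[] for _ in range(M)]       # main buckets (one disk block each)
    overflow = [[] for _ in range(M)]       # overflow chain per bucket

    def bucket_insert(K, record):
        b = K % M                           # hash field value -> bucket number
        target = buckets[b] if len(buckets[b]) < capacity else overflow[b]
        target.append((K, record))          # overflow records go to the bucket's chain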
External Hashing • Static Hashing • very fast access to records by the hash field • a fixed number of buckets M is allocated • not suitable for dynamic files (files that grow and shrink dynamically) • the number of buckets is difficult to determine in advance • requires a dynamic hashing technique
Dynamic Hashing • Extendible Hashing (fig5.13) • maintains a directory of 2^d bucket addresses • uses the first d bits of a hash value to select a directory entry, which gives the bucket address • d = global depth, d’ = local depth of a bucket • the directory expands and shrinks dynamically • bucket doubling (split) vs. halving (merge) • update the directory and local depths appropriately
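A sketch of the directory lookup in extendible hashing; the global depth, directory contents, and 32-bit hash width are assumed values chosen only to show how the first d bits select a bucket.

    d = 2                                   # global depth: directory has 2**d entries
    directory = ["bucket_A", "bucket_B", "bucket_C", "bucket_C"]   # two entries share bucket_C (local depth 1)

    def bucket_for(hash_value, bits=32):
        prefix = hash_value >> (bits - d)   # first d bits of the hash value
        return directory[prefix]            # the directory entry holds the bucket address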