440 likes | 598 Views
CSC 556 – DBMS II, Spring 2013. April 10 & 17, 2013 Storage media hierarchies, external sorts, B-trees. Storage medium abstraction. Storage medium as a subclass of an interface allows you to prototype storage medium semi-independently from DBMS structures atop it. Sequential File I/O.
E N D
CSC 556 – DBMS II, Spring 2013 April 10 & 17, 2013 Storage media hierarchies, external sorts, B-trees
Storage medium abstraction • Storage medium as a subclass of an interface allows you to prototype storage medium semi-independently from DBMS structures atop it.
Sequential File I/O • Queue abstraction supports sequential file I/O or core I/O via a series of enqueue, peek & dequeue calls. Supplies one-record lookahead. • External sorts, i.e., sorts where the data do not fit into memory, rely on this approach. • A record size entry can precede any variable length record in the file, OR • A sentinel value can mark a record’s end.
Direct Access File I/O • Direct access (a.k.a. random access) file I/O uses a seek system call to locate a position with a direct-access (binary) data file. • man lseek, fseek, ftell • Offsets are from 0, or current seek position, or end. • Unix 3C library builds atop level 2 system calls. DBMS may access level 2 system calls directly. • It treats the file as an array where a seek address is an offset into the array. • Applications attempt to keep contiguous records in contiguous blocks.
Merge sort & Sequential File I/O • The split phase partitions alternate runs from a queue into two helper queues, where the initial queue is the file to be sorted, and the helpers are temporary files. • A run is a sorted subsequence. • The merge phase doubles the run length by merging peer runs from the helper queues back into the initial queue. • Natural-length merge sort inspects the data to locate run boundaries. It may stumble onto big runs. • Another variants uses an internal sort such as quicksort to build larger initial runs in memory. • A file-based sort is an external sort.
April 10 example of merge sort • End of split phase with run length = 1. • End-of-run record is underlined. • Destination of this phase is in bold.
April 10 example of merge sort • End of split phase with run length = 1. • End of merge phase grows run length to 2.
April 10 example of merge sort • End of split phase with run length = 2. • End of merge phase grows run length to 4.
April 10 example of merge sort • End of split phase with run length = 4. • End of merge phase grows run length to 8.
April 10 example of merge sort • End of split phase with run length = 8. • End of merge phase grows run length to 16.
Merge sort is O(n log(n)) • Picture a merge sort as a tree growing up from N runs of length 1 to 1 run of length N.
April 10 radix10 sort (bucket sort) • This sort also uses sequential file I.O. • Initial sequence of integers. • Sequence normalized to non-negative values with digits to accommodate the largest.
Number temp queues = radix • Sort is O(N x D) for N items and D digits, but • D is a constant and does not affect growth rate • Radix sort is therefore O(N) on data size • It requires a fixed-bit-width key field. • Fixed-width fields required for a relational DBMS. • It has a lot of copy overhead. • Use a radix with many bits, many (smaller) files.
Base-2numbits code. void radixsort(interface_queueOfInts *queueToSort, int numbits, interface_queueOfInts * temporaryQueues[]) { const int numqueues = (1 << numbits); int mask = ~(~0 << numbits); const int allbits = sizeof(int) * 8 ; // bits per sorted value for (int shifter = 0 ; shifter < allbits ; shifter += numbits) { splitphase(queueToSort, temporaryQueues, shifter, mask); mergephase(queueToSort, temporaryQueues, numqueues); } }
Base-2numbits code. static void splitphase(interface_queueOfInts *merger, interface_queueOfInts * splitter[], int shifter, int bitmask) { bool ignoreme ; // We only peek on queues with data, etc. while (merger->canPeek()) { // while there are runs to split int value = merger->peek(ignoreme); merger->dequeue(); int qid = (value >> shifter) & bitmask ; splitter[qid]->enqueue(value);}}
Base 2numbits code. static void mergephase(interface_queueOfInts *merger, interface_queueOfInts * splitter[], int numqueues) { bool ignoreme ; for (int qtodrain = 0 ; qtodrain < numqueues ; qtodrain++) { while (splitter[qtodrain]->canPeek()) { merger->enqueue(splitter[qtodrain]->peek(ignoreme)); splitter[qtodrain]->dequeue(); } }}
MultiSet is a B+-tree Map over various storage subclasses • DataMine/mset/MultiSet.h
Relational DBMS • Flat, fixed width records (tuples) fit into contiguous memory locations & file blocks. • Take the least common multiple (LCM) of the record size and block size, and allocate using that. • Unix lseek and fcntl are the primary system calls. Windows has counterparts. The low-end cylinder allocation on the disk is managed by the operating system.
B-trees and index files • B-trees are balanced binary trees; typically degree D > 2, as determined by block size. • How many B-tree node records fit into a disk block? • Each node holds between D/2 and D entries. • The root is an exception. It can hold < D/2 entries. • In B+-trees the leaves hold the actual data, typically as seek indices into the contiguous database file. B+-trees also link the leaves into a sorted chain for range-based serial access. • B+-tree interior nodes hold pointers to children. • A B-tree grows from the bottom up whenever an insertion causes a node to split.
MultiSet • MultiSet uses sets of keys or multisets of keys (duplicate keys allowed) to map to application data elements. • Search includes ==, <, <=, > or >= key. • First or last key occurrence for actual multisets. • Greatest lesser value when key is not present. • Also supports least greater value. • Serial linked list at leaves (B+ tree) supports duplicate keys & iterating over results, including following operations. • A result is a MultiSet amenable to union, intersection and set difference with other MultiSet objects.
MultiSet.h typedef unsigned long location ; const unsigned BTREEDEGREE = 16 ; template <class KeyType> struct treenode { // basic internal node location parent ; // type treenode location child[BTREEDEGREE] ; // treenodes or leafnodes KeyType key[BTREEDEGREE] ;} // min keys for those children template <class KeyType> struct leafnode { // leaf connects treenodes to treeelems location parent ; // type treenode location prev, next ; // siblings or cousins location child[BTREEDEGREE] ; // treeelems KeyType key[BTREEDEGREE] ;} // keys for those children
MultiSet.h template <class ElementType> struct treeelem { /* a btree leaf element */ ElementType element ; // main contents location next ; // next avail. if needed for a free list } ; template <class ElementType, class KeyType> class MultiSet { /* ElementType is the type of the set's elements. KeyType is the type, typically part or possibly all of an ElementType object, that constitutes a search key. ... */
Abstract location • Location is an unsigned long that is either a seek offset (file) or a cast of an object pointer. • MultiSet records the depth of the tree. • Interior nodes are type treenode. • Leaf nodes are type leafnode. • Leaf nodes point to treeelem application data. • Those may be app data or may be record indices into another file of flat data records.
Searching through the B-tree • findleaf uses findsubtree on interior nodes. • Uses O(log n) binary search on interior node. • Calls findsubtree in a loop until hitting the leaves. • Multi-key version finds first or last instance. • findslot returns the array index inside a node.
Insertion in the B-tree • randominsert starts at the root • Initial element is a special case. • Root node is always special because it may contain fewer than degree / 2 entries. • Otherwise, if it fits into a node, put it there. • When the node is full, split into two nodes of size N/2 and N/2 + 1. • Add the new node to its parent. • If that parent node was already full, split it recursively.
Deletion from a B-tree • There is no problem when number of entries remains >= N/2. Just slide entries above the deleted entry down to cover the deleted one. • When entries < N/2, try merging with neighbors in a serial chain (siblings or cousins). • If there are too few for that, merge into one node. Other is empty, delete it from its parent. • If parent entries < N/2, perform recursive delete. • See deleteelem and deleteleaf in MultiSet.
http://en.wikipedia.org/wiki/B-tree • keeps keys in sorted order for sequential traversing • uses a hierarchical index to minimize the number of disk reads • uses partially full blocks to speed insertions and deletions • keeps the index balanced with an elegant recursive algorithm • minimizes waste by making sure the interior nodes are at least half full
CoreMultiSet & FileMultiSet • storage read / write via abstract location • readnode / readleaf / readelem • writenode / writeleaf / writeelem • allocnode / allocleaf / allocelem • freenode / freeleaf / freeelem • FileMultiSet maintains free lists of the above. • Flat file structure in main relational file makes storage management of free list “easy.”
Other Indexing – Skip Lists • Skip lists support probabilistic log(N) lookup, insertion & deletion of a key -> value mapping. • We will review Pugh’s paper from the 1990’s. • A skip list links each mapping in a series of key-sorted linked lists, in which higher-order lists contain fewer members, acting as “highways” & “boulevards” in locating keys. • Skip lists provide better support for concurrency than some balanced tree algorithms by avoiding global tree restructuring on rebalancing.
Other Indexing – Hashing • Hashing approaches O(1) (constant-time) lookup for an ideal hash function. • The ideal hash function preserves all of the bits of distinguishing information is a search key, while folding them into fewer bits to use as a lookup index into an array, an index file, or a flat file of fixed-width records. • Hash tables disambiguate key collisions either by storing colliding elements in a linked list (chained hashing) or by rehashing to a new bucket (open address hashing). • Hashing supports only == tests, not <, <=, >, >=.
Other Indexing – Sorting • When a sequence of fixed-size records or indices are sorted in a file of contiguous records, binary search is a viable option for locating a key. • Hashing and sorting are particularly appropriate for indexing on-the-fly, temporary result sets to be combined via intersection, union or set difference. • Approximate O(1) hashing is fast when combining result sets based only on equality. • Sort-based search and merging are appropriate when a query requests a result set sorted on an attribute.
Query Processing (Elmasri & Navathe chapter 19) • Translate SQL statements into abstract syntax tree using basic compilation techniques. • Interpret the abstract syntax tree. • ORDER-BY and elimination of duplicate tuples in a PROJECT are supported by external sort. • Duplicate elimination can use low-level memcmp byte comparison when comparing fixed-size records. • If SORT is based in an index key or composite index key, or if query does not entail PROJECT, then a B+-tree index avoids need for a sort.
SELECT Processing • Use indexed fields as primary search keys. • Use index-based set intersection, union and difference operations to support AND, OR and NOT. • Use slow (O(n)) sequential search, or external sorting (O(n log(n)) where appropriate (ORDER-BY or duplicate elimination) combined with binary search. • Use hashing for == matching. • Use composite search key indexing or hashing. • Utilize selectivity where possible.
JOIN Processing • Nested-loop is brute force approach (O(NK)). • Single-loop when join attributes are indexed. • Sort-merge join requires records to be sorted on the join attributes, and then merged. • Partition-hash join hashes smaller of two contributing relations into chained hash table. • Larger relation is then hashed on join attributes to retrieve tuples from the smaller.
PROJECT Processing • If projection includes a distinguishing key, the projected tuples are already unique. Just select a subset of the query results. • Otherwise, DISTINCT projections require sorting based on entire tuple to eliminate duplicates. • It is also possible to hash on entire tuples to eliminate duplicates.
Approaches to Query Optimization • Pipelined or stream-based processing of query stages across multiple thread / memory units. • Compiler optimization techniques such as common subexpression elimination. • Functional approaches such as lazy evaluation. • Use meta-data and heuristics such as size of contributing relations (smaller is faster and may fit into memory), key distribution data.
Costs to consider • Access cost (disk I/O), e.g., NFS vs. local disk. • Disk storage cost for intermediate files. • Computation costs (O(?)) cost. • Memory usage cost (avoid thrashing). • Communication cost for distributed systems. • Maintenance cost in terms of resources & availability of the database.
Physical Database Design • Chapter 20 in Elmasri & Navathe textbook. • Accumulate query statistics & data mine them. • What attributes to index? • When to use a clustered index on a non-key. • Actual dataset can be organized on only 1 key. • Hashing works well on equality-only joins. • Dynamic hashing for volatile files.
Physical Database Techniques • CREATE [ UNIQUE ] INDEX <index name> ON <table name> (<column name> [ <order> ] { , <column name> [ <order> ] } ) [ CLUSTER ] ; • Denormalization demotes normalized tables to weaker forms for increased speed. • Vertical partitioning splits relations over attributes to speed projection dynamics. • Horizontal partitioning splits relations over indexed tuples to speed selection dynamics.
Collect statistics • Storage statistics include data about table spaces, index spaces, buffer pools. DBMS may require preallocation and sizing / tuning for clustering. • I/O and device performance statistics include read / write (paging) on disk intents, hot spots, and thrashing for core memory. • NFS/local/in-core, number of network interface cards, amount of core memory, cache, memory topology. • Query / transaction statistics help determine attributes to index & query distributions.
Tuning queries • Precompiled queries offer opportunities for profiling and speed improvement. • Avoid generating larger intermediate result sets when smaller ones are available. • Avoid nested queries that generate large cross-products in favor of sequential queries. P. 737 example potential to search all of M for each tuple from E. SELECT Ssn SELECT MAX(Salary) AS High_salary, Dno INTO TEMP FROM EMPLOYEE E FROM EMPLOYEE GROUP BY Dno ; WHERE SELECT EMPLOYEE.Ssn FROM EMPLOYEE, TEMP Salary = SELECT MAX(Salary) WHERE EMPLOYEE.Salary = TEMP.High_salary FROM EMPLOYEE AS M AND EMPLOYEE.Dno = TEMP.Dno ; WHERE M.Dno = E.Dno ;