390 likes | 522 Views
Lecture 8 on Physical Database. DBMS has a view of the database as a collection of stored records, and that view is supported by the file manager which has a view of the database as a collection of pages, and that view is supported by the disk manager. The DBMS, file manager and disk manager.
E N D
Lecture 8 on Physical Database DBMS has a view of the database as a collection of stored records, and that view is supported by the file manager which has a view of the database as a collection of pages, and that view is supported by the disk manager.
The DBMS, file manager and disk manager DBMS Request stored record Store record returned File manager Request stored page Stored page return Disk Manager Disk I/O operation Data read from disk Stored database
Physical database design The physical database design is initiated to a certain extent in the “logical” design. The physical organization is determined largely by the need for operational efficiency, fast response times, and cost minimization. Most data storage devices record data as a stream of bits. The groups of bits which we can read with one machine instruction are called physical records. The physical records are stored at locations which are identified by a means of machine addresses. A program identifies a logical record or sequent by means of a key.
Sequential file With the physical sequential access method, the physical records are stored in logical sequence. If the storage medium to be used is a tape, the programmer has to present the physical records in a logical sequence. If the storage medium is a direct access one, the system will interconnect the physical records so that they are in logical sequence, even if they were not presented in logical sequence. The records must be read in a fixed sequence from begin to end sequential.
Overflow area With ISAM files the records are grouped so as to fit onto physical disk tracks, and one track on each cylinder contains an index to the records stored in that cylinder. When new records are inserted after the original sequential file has been set up these are stored in an overflow area. The index track contains pointers both to the prime data area and to the overflow area.
Indexed File: B+-Tree A common scheme for extremely large files is to induce a hierarchy of indices that follow the hierarchical nature of the secondary storages devices on which the file resides. We can view the hierarchy of indices as a tree. The benefit of index file is efficient space utilization in searching indexes.
B+-tree parameters B+-trees are defined to use a particular insertion/deletion strategy that ensures no node, except the root, is less than half full. In general, we define index block values as: 2d -1 ≥ Order where Order is the maximum pointer in each index block. d = number of search values in index block Similarly, we define leaf block values as: 2e -1 ≥ Order where Order is the number of key values in each leaf block. e = number of minimum key values in leaf block For example, if the order is 3, then the number of key values in an index block is d = 2 such that 2d-1>3. Also, the number of key values in a leaf block is 3 such that 2e-1>3 where e=2.
Insert To insert a record with key value v, apply the lookup procedure to find the block B in which this record belongs. If there are room (< 2e-1 records) in B, insert the new record in B. If there is no room (=2e-1 records) in block B, create a new block B1 and divide the records from B and the inserted record into two groups of e record each. The effects of inserting a record into B can ripple up the tree for several levels up to the root.
Initial B+-tree in unsorted sequence of (1,4,9,16,25,49, 64,81, 36,100,121,144,169,196,225,256)
Deletion If we delete the record with key value v, we use the lookup procedure to find this record. If after deletion, block B has more than half (e) records,we are done. If, after deletion, block B has less than half (e-1) records, we look for a neighbor block B1. If B1 has more than half ( e) records, we distribute records of B and B1; otherwise combine B with B1, which will have exactly 2e-1 records, and in the parent of B, modify the record for B1 and delete record for B.
Deletion If the deleted record was the first in block B, then we go to the parents of B to change the key value in the record for B. If B is the first child of its parent, the parent has no key value for B, so we must go to the parent’s parent and so on, until we find an ancestor A1 of B such that A1 is not the first child of its parent A2. Then the new lowest key value of B goes in the record of A2 that points to A1.
Lookup (Search B+ tree) Let us search for a record with key value v. We find a path from the root of B+-tree to some leaf, where the desired record will be found if it exists. Suppose we have reached node (block) B. If B is a leaf, then examine block B for a record with key value v. If B is not a leaf, it is an index block. Determine which key value in block B covers v. In the record of B that covers v is a pointer to another block. That block follows B in the path being constructed.
B+-tree block access operations Given a B+-tree with n records and e values in the leaf, and d values in each branch. The tree will have no more than n/e leaves, no more than n/(de) parents of parents of leaves and so on. If there are i nodes on paths from the root to leaves, then n>di-1e. It follows that i1+logd(n/e) where i = number of I/O costs to access leaf block For example, if n=1000000, e=5 and d=50, the number read/writes of blocks in an operation is i 1+log50(200000) 4.12 5 (round up) i = 4 (round down) Notice that log50(200000) = log (200000) / log(50)
Hash files Hashing is a form of address calculation technique which can convert an item’s key into a near-random number used to determine where the item is stored. The near-random number refer to the address where a record is stored. The number of logical records stored in this area is referred to as the bucket capacity. The benefit of hashing is fast response time but with poor space ultilzation.
Factors in hashing addressing • The bucket size is a certain number of address spaces made available. • The packing density (number of buckets for a file of a given size) • The hashing key-to-address transaction. If the key is not numeric, convert it into numeric. The keys are converted into a spread of numbers of the order of magnitude of the address numbers required. The resulting numbers are multiplied by a constant which compresses them to the precise range of address. • The method of handling overflows. It is desirable to minimize the bucket-searching operation at the expense of more overflows.
Hashing algorithms • Remainder method: divide key by a number and let the remainder as the physical address of the record. • Midsquare method: the key is multipled by itself and the middle few digits of the square are used as the index. • Folding method: breaks up a key into several segments that are added or exclusive ORed together to form a hash value.
Remainder hashing algorithm example Suppose supplier number values are S100, S200, S300, S400, S500, and each stored supplier record requires an entire page to itself. By using hash function division/remainder. The page numbers for the five suppliers are then 9, 5, 1, 10, 6 and the divider is 13. For example, reminder of 100/13 is 9. 0 1 2 3 4 5 S200 S300 6 7 8 9 10 11 S500 S100 S400 12
Folding method hash algorithm example Suppose that the internal bit string representation of a key is 010111001010110 and that 5 bits are allowed in the index. The three bit strings 01011, 10010 and 10110 are exclusive ORed to produce, i.e., 0 ORed 0 = 0 1 ORed 0 = 1 0 ORed 1 = 1 1 ORed 1 = 0 01111, which is 15 as a binary integer.
MidSquare hashing algorithm example This method treats the key as a single large number, square the number, and extract whatever number of digits is needed from the middle of the result. Suppose you want to generate addresses between 0 and 99. If the key is the number 453, its square is 205,209. Extracting the middle two digits yields a number between 0 and 99, in this case 52.
Inverted list file Logical record order can be maintained using inverted list which is a table that cross references record addresses with some field value. The benefit of inverted file is to implement secondary index, that is, an alternative index besides prime index.
Multi-list file There is one entry in the secondary key’s index for each value that the secondary key presently has in the data file. The entry in the multi-list index for a key value has just one pointer to the first data record with that key value. The data records contains a pointer to the next data record with that key value, and so forth. There is a linked list of data records for each value of the secondary key. Multi-list chains are bi-directional, and occasionally are circular to improve update performance of a database. The benefit of multi-list file is to implement duplicate secondary indexes.
Lecture Summary Four kinds of file structures have been introduced to design physical database in order to implement logical database schema. Firstly, B+ tree is good for both performance with indexing and space utilization with balanced blocks. Secondly, hashing is good for fast response in searching key value. Thirdly, inverted file is good for secondary indexing. Fourthly, multi-linked lists is good for duplicate secondary indexing.
Review question How to compare the efficiency of physical database storage by use of B+-tree file, hashing file, inverted list file and multi-list file?
Tutorial Question What is a B+-tree and what are its components. Given number of search-key values that fit in one index node is 2 and in one leaf node is 3. Construct a B+-tree for the following set of key values (2, 3, 5, 7, 11, 17, 19, 23, 29, 31). Show how to use this B-tree to find record with search-key value 11. Show B-tree after inserting search-value 9 and after deleting search value 17.
Reading Assignment Chapter 14 Indexing Structures for Files of “Fundamentals of Database Systems:, 5th edition, by Elmasri and Navathe, Pearson International Edition, 2007, pp.500-531.