Foundations of Database Systems

Foundations of Database Systems Indexing Instructor: Zhijun Wang DBMS and Applications 1

Announcements Quiz#2 will be given next week DBMS and Applications 2

Physical Database Design DBMS and Applications 3

File structures Selecting among alternative file structures is one of the most important choices in physical database design. In order to choose intelligently, you must understand characteristics of available file structures. DBMS and Applications 4

Sequential Files Simplest kind of file structure Unordered: insertion order Ordered: key order Simple to maintain Provide good performance for processing large numbers of records DBMS and Applications 5

Unordered Sequential File DBMS and Applications 6

Ordered Sequential File DBMS and Applications 7

Indexes Indexes are special data structures used to improve database performance SQL Server automatically creates an index on all primary and foreign keys Additional indexes may be assigned on other columns that are: Frequently used in WHERE clauses Used for sorting data DBMS and Applications 8

Indexes SQL Server supports two kinds of indexes: Clustered index: the data are stored in the bottom level of the index and in the same order as that index Nonclustered index: the bottom level of an index contains pointers to the data Clustered indexes are faster than nonclustered indexes for updating and retrieval DBMS and Applications 9

Creating an Index: By GUI in Enterprise Manager DBMS and Applications 10

Hash Files Support fast access unique key value Converts a key value into a physical record address Mod function: typical hash function Divisor: large prime number close to the file capacity Physical record number: hash function plus the starting physical record number DBMS and Applications 11

Example: Hash Function Calculations for StdSSN Key Assume the file capacity is 100, 97 is the biggest prime number less than to 100. Physical Record DBMS and Applications 12

Hash File after Insertions DBMS and Applications 13

Linear Probe Collision Handling During an Insert Operation DBMS and Applications 14

Multi-Way Tree (Btrees) Files A popular file structure supported by most DBMSs. Btree provides good performance on both sequential search and key search. DBMS and Applications 15

Properties of Btrees Balanced: all leaf nodes reside on the same level Bushy: the number of branches is large, 50-200 Block-oriented: each node in a tree is a block Dynamic: the shape of a Btree is changed as logic records are inserted and deleted Ubiquitous: the Btree is a widely implemented and used file strcture. DBMS and Applications 16

Structure of a Btree of Height 3 DBMS and Applications 17

Btree Node Containing Keys and Pointers DBMS and Applications 18

Btree Insertion Examples DBMS and Applications 19

Btree Deletion Examples DBMS and Applications 20

B+ Tree: The Most Widely-used Index Index Entries (Direct search) Data Entries ("Sequence set") • Supports equality and range-searches efficiently. • Minimum 50% occupancy (except for root). • Each node contains d <= m <= 2d entries. • The parameter d is called the order of the tree. DBMS and Applications 21

Root 33 24 14 19 39* 22* 24* 27* 38* 3* 5* 19* 20* 29* 33* 34* 2* 7* 14* 16* An Example of a B+ Tree • Search begins at root, and key comparisons direct it to a leaf. • Search for 5*, 15*, all data entries >= 24* ... • Based on the search for 15* in the appropriate leaf, we know it is not in the tree! DBMS and Applications 22

Inserting a Data Entry into a B+ Tree • Find correct leaf L. • Put data entry onto L. • If L has enough space, done! • Else, must splitL (into L and a new node L2) • Redistribute entries evenly, copy upmiddle key. • Insert index entry (into parent of L) pointing to L2. • This can happen recursively • To split index node, redistribute entries evenly, but push upmiddle key. (Contrast with leaf splits.) • Splits “grow” tree; root split increases height. • Tree growth: gets wider or one level taller at top. DBMS and Applications 23

Entry to be inserted in parent node. Entry to be inserted in parent node. (Note that 19 is pushed up and Only appears once in the index. Contrast this with a leaf split.) (Note that 5 is s copied up and 5 continues to appear in the leaf.) 19 5* 3* 7* 2* 8* 24 33 5 14 Inserting 8* into Example B+ Tree • Observe how minimum occupancy is guaranteed in both leaf and index pg splits. • Note difference between copy-upand push-up; be sure you understand the reasons for this. DBMS and Applications 24

Root 19 24 33 5 14 39* 2* 3* 19* 20* 22* 24* 27* 38* 5* 7* 8* 29* 33* 34* 14* 16* Example B+ Tree After Inserting 8* • Notice that root was split, leading to increase in height. • In this example, we can avoid split by re-distributing entries; • however, this is usually not done in practice. DBMS and Applications 25

Deleting a Data Entry from a B+ Tree • Start at root, find leaf L where entry belongs. • Remove the entry. • If L is at least half-full, done! • If L has only d-1 entries, • Try to re-distribute, borrowing from sibling (adjacent node with same parent as L). • If re-distribution fails, mergeL and sibling. • If merge occurred, must delete entry (pointing to L or sibling) from parent of L. • Merge could propagate to root, decreasing height. DBMS and Applications 26

Root 19 27 33 5 14 39* 2* 3* 19* 24* 27* 29* 38* 5* 7* 8* 33* 34* 14* 16* Example Tree After (Inserting 8*, Then) Deleting 20* and 22* ... • Deleting 20* is easy. • Deleting 22* is done with re-distribution. • Notice how middle key (27) is copied up. DBMS and Applications 27

... And Then Deleting 24* Root 5 14 19 33 33 39* 22* 27* 38* 29* 33* 34* 39* 3* 19* 38* 2* 5* 7* 8* 27* 33* 34* 29* 14* 16* • Must merge leaf nodes. • Observe ‘toss’of index entry 27 (on right), and ‘pull down’of index entry 19 (below). DBMS and Applications 28

Example of Non-leaf Re-distribution 2* 3* 39* 5* 7* 8* 38* 17* 18* 20* 21* 22* 27* 29* 33* 34* 14* 16* Root 22 33 17 20 5 14 • Tree is shown below during deletion of 24*. • In contrast to previous example, can re-distribute entry from left child of root to right child. DBMS and Applications 29

After Re-distribution Root 17 22 33 5 13 20 2* 3* 39* 5* 7* 8* 38* 17* 18* 20* 21* 22* 27* 29* 33* 34* 14* 16* • Intuitively, entries are re-distributed by ‘pushingthrough’the splitting entry in the parent node. • It suffices to re-distribute index entry with key 20; we’ve re-distributed 17 as well for illustration. DBMS and Applications 30

Cost of Operations The height of Btree dominates the number of physical record accesses operation. Logarithmic search cost Upper bound of height: log function’ Log base: minimum number of keys in a node The cost to insert a key = [the cost to locate the nearest key] + [the cost to change nodes]. DBMS and Applications 31

B+Tree Provides improved performance on sequential and range searches. In a B+tree, all keys are redundantly stored in the leaf nodes. To ensure that physical records are not replaced, the B+tree variation is usually implemented. DBMS and Applications 32

Index Matching Determining usage of an index for a query Complexity of condition determines match. Single column indexes: =, <, >, <=, >=, IN <list of values>, BETWEEN, IS NULL, LIKE ‘Pattern’ (meta character not the first symbol) Composite indexes: more complex and restrictive rules DBMS and Applications 33

Bitmap Index Can be useful for stable columns with few values Bitmap: String of bits: 0 (no match) or 1 (match) One bit for each row Bitmap index record Column value Bitmap DBMS converts bit position into row identifier. DBMS and Applications 34

Bitmap Index Example Bitmap Index on FacRank Faculty Table DBMS and Applications 35

Bitmap Join Index Bitmap identifies rows of a related table. Represents a precomputed join Can define for a join column or a non-join column Typically used in query dominated environments such as data warehouses DBMS and Applications 36

Summary of File Structures DBMS and Applications 37

Index Selection Most important decision Difficult decision Choice of clustered and nonclustered indexes DBMS and Applications 38

Clustering Index Example DBMS and Applications 39

Nonclustering Index Example DBMS and Applications 40

Inputs and Outputs of Index Selection DBMS and Applications 41

Trade-offs in Index Selection Balance retrieval against update performance Nonclustering index usage: Few rows satisfy the condition in the query Join column usage if a small number of rows result in child table Clustering index usage: Larger number of rows satisfy a condition than for nonclustering index Use in sort merge join algorithm to avoid sorting More expensive to maintain DBMS and Applications 42

Difficulties of Index Selection Application weights are difficult to specify. Distribution of parameter values needed Behavior of the query optimization component must be known. The number of choices is large. Index choices can be interrelated. DBMS and Applications 43

Selection Rules Rule 1: A primary key is a good candidate for a clustering index. Rule 2: To support joins, consider indexes on foreign keys. Rule 3: A column with many values may be a good choice for a non-clustering index if it is used in equality conditions. Rule 4: A column used in highly selective range conditions is a good candidate for a non-clustering index. DBMS and Applications 44

Selection Rules Rule 5: A frequently updated column is not a good index candidate. Rule 6: Volatile tables (lots of insertions and deletions) should not have many indexes. Rule 7: Stable columns with few values are good candidates for bitmap indexes if the columns appear in WHERE conditions. Rule 8: Avoid indexes on combinations of columns. Most optimization components can use multiple indexes on the same table. DBMS and Applications 45

Index Creation To create the indexes, the CREATE INDEX statement can be used. The word following the INDEX keyword is the name of the index. CREATE INDEX is not part of SQL:1999. Example: DBMS and Applications 46

Foundations of Database Systems