ICS 421 Spring 2010 Indexing (1)

ICS 421 Spring 2010Indexing (1) Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa Lipyeow Lim -- University of Hawaii at Manoa

How to speed up queries? SELECT* FROMSailors WHEREage>40 Array of Sailor Tuples/Records File of Record for Sailors Lipyeow Lim -- University of Hawaii at Manoa

Binary Search Trees • Given search value • if value < node.value, then follow left pointer • Else follow right pointer • How do generalize each index node to an index page ? • How do we generalize this to search pages of records ? 26 20 34 30 41 28 21 18 20 21 21 26 27 28 30 31 34 37 41 45 Lipyeow Lim -- University of Hawaii at Manoa

Indexes • What do we store in the index nodes ? Let k be the key value for an index entry: • Data record with key value k • <k, rid of data record with key value k> • <k, list of rids of data records with key value k> • What kind of queries does the index support? • Range • Point (or equality) Lipyeow Lim -- University of Hawaii at Manoa

Overflow page index entry Indexed Sequential Access Method (ISAM) P K P K P P K m 0 1 2 1 m 2 • Static (m+1)-way Search Tree Non-leaf Pages Leaf Pages Primary pages Lipyeow Lim -- University of Hawaii at Manoa

ISAM: Example Insert new record with age 98 40 • Store data record at the leaf pages • Do we still need the file of record ? 20 33 51 63 age 40 98 63 10 20 51 33 46 55 97 15 27 37 rating sname sid Lipyeow Lim -- University of Hawaii at Manoa

ISAM Facts • File creation: Leaf (data) pages allocated sequentially, sorted by search key; then index pages allocated, then space for overflow pages. • Index entries: <search key value, page id>; they `direct’ search for data entries, which are in leaf pages. • Search: Start at root; use key comparisons to go to leaf. Cost=O(log F N) ; F = # entries/index pg, N = # leaf pgs • Insert: Find leaf data entry belongs to, and put it there. If full, allocate and put in overflow page • Delete: Find and remove from leaf; if empty overflow page, de-allocate. • Static tree structure: inserts/deletes affect only leaf pages. Lipyeow Lim -- University of Hawaii at Manoa

B+ Tree Index B+ Tree Index • Insert/delete at log F N cost; keep tree height-balanced. (F = fanout, N = # leaf pages) • Minimum 50% occupancy (except for root). Each node contains d <= m <= 2d entries. The parameter d is called the order of the tree. • Supports equality and range-searches efficiently. Index Entries Data Entries/Leaf Pages (“Sequence Set”) Lipyeow Lim -- University of Hawaii at Manoa

B+ Tree: Search Example 13 17 24 30 • Leaf entries store <key,rid> pairs • What is the order ? • Search for: age=5, age=15, age>=24 14 16 19 20 22 24 27 29 2 3 5 7 33 34 38 39 Lipyeow Lim -- University of Hawaii at Manoa

Inserting a new data entry • Find correct leaf L. • Put data entry onto L. • If L has enough space, done! • Else, must splitL (into L and a new node L2) • Redistribute entries evenly, copy upmiddle key. • Insert index entry pointing to L2 into parent of L. • This can happen recursively • To split index node, redistribute entries evenly, but push upmiddle key. (Contrast with leaf splits.) • Splits “grow” tree; root split increases height. • Tree growth: gets wider or one level taller at top. Lipyeow Lim -- University of Hawaii at Manoa

Example: Insert 8* 17 17 pushed up into parent node 13 17 24 30 14 16 19 20 22 24 27 29 5 2 2 5 24 7 3 30 3 13 5 8 7 33 34 38 39 5 copied up to parent node 5 Lipyeow Lim -- University of Hawaii at Manoa

Deleting a data entry • Start at root, find leaf L where entry belongs. • Remove the entry. • If L is at least half-full, done! • If L has only d-1 entries, • Try to re-distribute, borrowing from sibling (adjacent node with same parent as L). • If re-distribution fails, mergeL and sibling. • If merge occurred, must delete entry (pointing to L or sibling) from parent of L. • Merge could propagate to root, decreasing height. Lipyeow Lim -- University of Hawaii at Manoa

Miscellaneous • How do we handle data with duplicates ? • Overflow buckets • Make rid part of the key • Each data entry stores <key, list of rids> • Clustered vsUnclustered indexes Index entries UNCLUSTERED CLUSTERED direct search for data entries Data entries Data entries (Index File) (Data file) Data Records Data Records Lipyeow Lim -- University of Hawaii at Manoa

3* 6* 9* 10* 11* 12* 13* 23* 31* 36* 38* 41* 44* 4* 20* 22* 35* Bulk Loading a B+ Tree • If we have a large collection of records, and we want to create a B+ tree on some field, doing so by repeatedly inserting records is very slow. • Bulk Loadingcan be done much more efficiently. • Initialization: Sort all data entries, insert pointer to first (leaf) page in a new (root) page. Root Sorted pages of data entries; not yet in B+ tree Lipyeow Lim -- University of Hawaii at Manoa

Bulk Loading (cont.) Root 10 20 • Index entries for leaf pages always entered into right-most index page just above leaf level. When this fills up, it splits. (Split may go up right-most path to the root.) • Much faster than repeated inserts, especially when one considers locking! Data entry pages 6 12 23 35 not yet in B+ tree 3* 6* 9* 10* 11* 12* 13* 23* 31* 36* 38* 41* 44* 4* 20* 22* 35* Root 20 10 Data entry pages 35 not yet in B+ tree 6 23 12 38 Lipyeow Lim -- University of Hawaii at Manoa 3* 6* 9* 10* 11* 12* 13* 23* 31* 36* 38* 41* 44* 4* 20* 22* 35*

Creating Indexes • Most DBMS (eg. DB2) supports only B+ tree indexes: CREATE INDEX myIdxONmytable(col1, col3) CREATE UNIQUE INDEX myUniqIdxONmytable(col2, col5) CREATE INDEX myIdxONmytable(col1, col3) CLUSTER • If a primary key is specified in the CREATE TABLE statement, an (unclustered) index is automatically created for the PK. • To create a clustered PK index: • Create table without PK constraint • Create index on PK with cluster option • Alter table to add PK constraint • To get rid of unused indexes: DROP INDEX myIdx; Lipyeow Lim -- University of Hawaii at Manoa

ICS 421 Spring 2010 Indexing (1)