Learn about indexing mechanisms to speed up data access, types and evaluation metrics of indexes, and the benefits of Index Scans over Table Scans. Understand Dense and Sparse indexes and their impact on storage and query efficiency.
CS4432: Database Systems II Indexing-Basics
Locating Records: Table Scans
[Figure: heap file of relation R with a header page, a directory, and data pages 1..N]
Select ID, name, address From R Where ID = 1000;
Naïve way: Table Scan
• Open the heap file of relation R
• Access each data page
• Access each record in each page
• Check the condition
Locating Records: Table Scans
[Figure: same heap file layout as above]
• Open the heap file of relation R
• Access each data page
• Access each record in each page
• Check the condition
What is the least amount of memory needed for a Table Scan? Only 1 memory block (see the sketch below)
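As a rough illustration of why a single buffer block suffices, here is a minimal Python sketch (not DBMS or course code); the heap file, page layout, and record dictionaries are made-up toy structures.

```python
# Toy table scan: the heap file is a list of pages, each page a list of
# record dicts. Only one page is held in memory at a time.

def table_scan(heap_file, predicate):
    """Yield every record, from every page, that satisfies the predicate."""
    for page in heap_file:           # read one data page at a time
        buffer = page                # the single in-memory block
        for record in buffer:        # access each record in the page
            if predicate(record):    # check the WHERE condition
                yield record

# Select ID, name, address From R Where ID = 1000
R = [
    [{"ID": 999,  "name": "Ann", "address": "A St."}],
    [{"ID": 1000, "name": "Bob", "address": "B St."}],
]
print(list(table_scan(R, lambda r: r["ID"] == 1000)))
```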
Locating Records: Index Scan
Select ID, name, address From R Where ID = 1000;
• A Table Scan is always an available option
• But it is not efficient, especially for large relations
• Indexing & Index Scans are a more efficient way to locate records
• Whether an Index Scan can be used depends on whether an index exists on the queried column
Basic Concepts
• Indexing mechanisms are used to speed up access to desired data.
• Search Key: an attribute or set of attributes used to look up records in a file (the search attribute, e.g., ID in the query below).
• An index file consists of records (called index entries) of the form ⟨search-key, pointer⟩
Select ID, name, address From R Where ID = 1000;
Basic Concepts (Cont’d)
• An index file consists of records (called index entries) of the form ⟨search-key, pointer⟩
• Index files are typically much smaller than the original file
• Types of indexes:
  • Dense vs. Sparse
  • Primary vs. Secondary
  • One-Level vs. Multi-Level
Index Evaluation Metrics
• Savings here:
  • Access time
  • Access types supported, e.g.:
    • Equality Search (x = 100): records with a specified value in the attribute
    • Range Search (10 < x < 100): records with an attribute value falling in a specified range of values
• Overheads here:
  • Insertion time
  • Deletion time
  • Space overhead
Sequential Files & Primary Indexes File where records are ordered on the indexed column
Dense Index on Ordered File
[Figure: dense index with keys 10, 20, 30, ..., one entry per record, pointing into the sorted sequential file]
• Ordered (Sequential) File: records are stored sorted on the indexed attribute
• Dense Index: has one entry for each data tuple
Dense Index on Ordered File
[Figure: same dense index and sequential file as above]
#entries in index = #records in file
But the index size is much smaller than the file size
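A minimal sketch of what a dense index contains, assuming a toy sorted file of key-only records and (block, slot) record ids; none of these names come from the slides.

```python
# Dense index sketch: one (search-key, pointer) entry per record of the
# sorted file. Pointers are modeled as (block_no, slot_no) record ids.

sorted_file = [[10, 20], [30, 40], [50, 60], [70, 80], [90, 100]]  # 5 blocks, 2 records each

dense_index = [
    (key, (block_no, slot_no))
    for block_no, block in enumerate(sorted_file)
    for slot_no, key in enumerate(block)
]

print(len(dense_index))   # 10 entries: #entries in index == #records in file
print(dense_index[:3])    # [(10, (0, 0)), (20, (0, 1)), (30, (1, 0))] -- entries are small
```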
Dense Index: Locate Key = 100
• Index Scan:
  • Read each page from the index
  • Search for key = 100
  • Follow the pointer (Record Id)
• Index Binary Search (since all keys are sorted):
  • Read the middle page of the index
  • Either you find the key there, or
  • Move up or down
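A sketch of the binary-search lookup over a dense index, reusing the same toy layout as above; Python's `bisect` stands in for reading index pages.

```python
import bisect

# Index binary search on a dense index: find the key among the sorted
# index entries, then follow its pointer into the data file.

sorted_file = [[10, 20], [30, 40], [50, 60], [70, 80], [90, 100]]
dense_index = [(k, (b, s)) for b, blk in enumerate(sorted_file)
                           for s, k in enumerate(blk)]
index_keys = [k for k, _ in dense_index]            # already sorted

def dense_lookup(search_key):
    pos = bisect.bisect_left(index_keys, search_key)            # binary search on the index
    if pos == len(index_keys) or index_keys[pos] != search_key:
        return None                                             # dense: key absent => record absent
    block_no, slot_no = dense_index[pos][1]                     # follow the pointer (record id)
    return sorted_file[block_no][slot_no]

print(dense_lookup(100))   # 100
print(dense_lookup(35))    # None, without touching the data file
```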
Sparse Index On Ordered File
[Figure: sparse index with keys 10, 30, 50, ..., one entry per data block, pointing into the sorted sequential file]
• Sparse Index: an entry for only the 1st record in each data block
• A Sparse Index is smaller than a Dense Index
Sparse Index On Ordered File
Can we build a sparse index on an unordered file?
• Sparse indexes can be built ONLY on ordered (sequential) files
Sparse Index: Locate Key = 100
• Index Binary Search still works:
  • Either locate the search key in the index, or
  • Locate the largest key smaller than your search key
• Follow the pointer and check the data block
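A sketch of the same lookup against a sparse index (toy data, hypothetical names): find the largest index key not exceeding the search key, then check that one data block.

```python
import bisect

# Sparse index sketch: one entry per data block (the block's first key).

sorted_file  = [[10, 30], [50, 70], [90, 110], [130, 150]]
sparse_index = [(block[0], block_no) for block_no, block in enumerate(sorted_file)]
index_keys   = [k for k, _ in sparse_index]

def sparse_lookup(search_key):
    pos = bisect.bisect_right(index_keys, search_key) - 1   # largest index key <= search key
    if pos < 0:
        return None                                         # smaller than every key in the file
    block_no = sparse_index[pos][1]                          # follow the pointer ...
    return search_key if search_key in sorted_file[block_no] else None   # ... and check the block

print(sparse_lookup(70))    # 70
print(sparse_lookup(100))   # None (the block starting at 90 does not contain it)
```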
Multi-Level Index
[Figure: sparse 2nd-level index (keys 10, 90, 170, 250, ...) built on top of the 1st-level index]
• The 1st-level index file is just a file with sorted keys
• So we can build a 2nd-level index on top of it
Is the index file always sorted? Is the 2nd level sparse or dense? Can it be dense?
Multi-Level Index
[Figure: same sparse 2nd-level index as above]
Is the index file always sorted? Yes
Is the 2nd level sparse or dense? Can it be dense? The 2nd, 3rd, … levels have to be sparse (otherwise no savings)
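A sketch of a two-level lookup under the assumptions above (1st level dense, 2nd level sparse); keys and block sizes are invented, and pointers are left implicit for brevity.

```python
import bisect

# Two-level index sketch: the 1st level holds all keys (dense), split into
# index blocks; the sparse 2nd level stores the first key of each 1st-level block.

keys   = list(range(10, 250, 10))                          # all keys in the 1st-level index
level1 = [keys[i:i + 4] for i in range(0, len(keys), 4)]   # 1st-level index blocks, 4 keys each
level2 = [blk[0] for blk in level1]                        # sparse 2nd level

def exists(search_key):
    blk_no = bisect.bisect_right(level2, search_key) - 1   # which 1st-level block to read
    if blk_no < 0:
        return False
    return search_key in level1[blk_no]                    # read just one 1st-level block

print(exists(170), exists(175))   # True False
```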
Index without Pointers
Note: If the file is contiguous, then we can omit the pointers
• The index starts with a pointer to the first block, then a list of keys (one for each block)
• If we need Key = K3: it is the 3rd key, so check the 3rd block
• Location: first pointer + (3-1)*1024 (1024 being the block size; lecture #8)
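A sketch of the pointer-free layout: only the address of the first block plus one key per block is kept, and a block's address is computed arithmetically. The 1024-byte block size and the starting address are assumptions for illustration.

```python
# Index without pointers, for a contiguous file: the i-th key's block
# starts at first_block_addr + (i - 1) * BLOCK_SIZE.

BLOCK_SIZE       = 1024          # assumed block size (matches the slide's formula)
first_block_addr = 40960         # hypothetical address of the 1st data block
keys             = ["K1", "K2", "K3", "K4"]   # one key per block, in file order

def block_address(search_key):
    i = keys.index(search_key) + 1                  # 1-based position of the key
    return first_block_addr + (i - 1) * BLOCK_SIZE  # no pointer stored, just arithmetic

print(block_address("K3"))   # 40960 + 2 * 1024 = 43008
```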
Sparse vs. Dense Indexes
• Sparse:
  • Less space
  • Better for insertion
  • Only for sorted files (or higher-level indexes)
• Dense:
  • More space
  • Must be used for unsorted files (secondary indexes)
  • Can tell if a record does not exist without checking the data file
Files with Duplicate Keys: Dense Index
[Figure: dense index with one entry per record, so duplicate keys 10, 10, 10, 20, 20, 30, ... are repeated in the index]
• One entry in the index for each value occurrence (i.e., for each record)
• Too much wasted space
Files with Duplicate Keys: Dense Index (Compact Design)
[Figure: dense index with one entry per distinct value (10, 20, 30, 40), each pointing to the first record with that value]
• One entry in the index for each distinct value
• How to locate key 35?
  • It does not exist in the index, and the index is dense
  • So there is no need to search the data file
Files with Duplicate Keys: Sparse Index
[Figure: sparse index entries 10, 10, 20, 30, one for the 1st record of each block]
• One index entry for the 1st record in each block
• Careful if looking for 20 or 30! Their first occurrences may sit at the end of the preceding block
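To see why 20 and 30 need care, here is a toy sketch: when the search key equals a block's first index key, matching records may start at the end of the previous block, so the scan starts one block earlier. The block contents are invented.

```python
import bisect

# Sparse index over a file with duplicates: start scanning one block
# before the first index entry equal to the search key.

blocks     = [[10, 10], [10, 20], [20, 30], [30, 45]]
index_keys = [blk[0] for blk in blocks]               # 10, 10, 20, 30

def find_all(search_key):
    start = max(bisect.bisect_left(index_keys, search_key) - 1, 0)   # careful: one block early
    matches = []
    for blk in blocks[start:]:
        matches += [k for k in blk if k == search_key]
        if blk[-1] > search_key:                      # file is sorted, so we can stop here
            break
    return matches

print(find_all(20))   # [20, 20] -- the first 20 lives in a block whose index key is 10
print(find_all(30))   # [30, 30]
```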
Sequential (Ordered) File Insertion/Deletion
Sparse Index: Deletion
[Figure: sparse index 10, 30, 50, 70, ... over the ordered data file]
• Delete record 40
• The index requires no reorganization
• The data block will have some empty space, which is good to have
Sparse Index: Deletion
[Figure: same sparse index and data file]
• Delete record 30
• The value 30 in the index will change (e.g., to 40)
• Record 40 may or may not move within its block
Sparse Index: Deletion
[Figure: same sparse index and data file]
• Delete records 30 & 40
• In the data file, Block 2 becomes empty and will be deleted
• In the index file, do not create empty slots in the middle
• Empty slots are allowed at the end
Dense Index: Deletion
[Figure: dense index 10, 20, 30, ..., one entry per record]
• Delete record 30
• Same ideas and mechanisms as the sparse case
• Dense indexes may trigger more updates in the index
• Record 40 may or may not move in its data block
• The index cannot have free slots in the middle
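A sketch contrasting how little a sparse index changes on deletion, using the same toy conventions as before (one key per record, one index entry per block); the delete policy shown is just one plausible variant, not the course's.

```python
# Sparse-index deletion sketch: the index is touched only when a block's
# first record changes or the block becomes empty (a dense index would be
# updated on every delete).

blocks       = [[10, 20], [30, 40], [50, 60], [70, 80]]
sparse_index = [blk[0] for blk in blocks]          # 10, 30, 50, 70

def delete(key):
    for i, blk in enumerate(blocks):
        if key in blk:
            blk.remove(key)
            if not blk:                            # block emptied:
                del blocks[i]                      #   drop the block and shift index
                del sparse_index[i]                #   entries left (no holes in the middle)
            elif sparse_index[i] == key:
                sparse_index[i] = blk[0]           # first record changed: update the entry
            return

delete(30); print(sparse_index)   # [10, 40, 50, 70]  (entry 30 became 40)
delete(40); print(sparse_index)   # [10, 50, 70]      (block emptied, entry removed)
```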
Sparse Index: Insertion
[Figure: sparse index 10, 30, 40, 60; record 34 fits into the free space of its data block]
• Insert record 34: our lucky day! We have free space where we need it
• It is good to have free space in each data block, especially if the file is ordered
• DBMSs may keep x%, e.g., 10%, free to make insertions easier
Sparse Index: Insertion
[Figure: inserting 15 shifts records 20 and 30 within/across blocks]
• Insert record 15
• Approach 1 (Immediate Organization): move the data records within a block or across blocks to make space for the new record
• Other cheaper variations?
Use of Overflow Blocks
[Figure: record 25 is placed in an overflow block chained to its data block (reorganize later...)]
• Insert record 25: it goes to an overflow block
• What about inserting 15 instead of 25? Record 20 will move to the overflow block
• Still, the index will not change
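A sketch of insertion with overflow blocks under the same toy conventions; the block capacity, block contents, and the "append to an overflow list" policy are all assumptions for illustration.

```python
# Insertion with overflow blocks: if the target block of the ordered file
# is full, the new record goes to an overflow block chained to it, and the
# sparse index is left unchanged.

BLOCK_CAPACITY = 2
blocks       = [[10, 20], [30], [40, 60]]           # block 1 still has a free slot
overflow     = {}                                   # block_no -> records in its overflow block
sparse_index = [10, 30, 40]                         # first key of each primary block

def insert(key):
    # last block whose first key is <= key (block 0 if key is the new minimum)
    target = max((i for i, k in enumerate(sparse_index) if k <= key), default=0)
    blk = blocks[target]
    if len(blk) < BLOCK_CAPACITY:
        blk.append(key); blk.sort()                 # free space where we need it
    else:
        overflow.setdefault(target, []).append(key) # reorganize later...

insert(35); print(blocks, overflow)   # 35 fits into block 1; no overflow
insert(25); print(blocks, overflow)   # 25 overflows block 0; the index is unchanged
```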
Insertion: Dense Index Case
• Similar
• Often more expensive…
Remember…
• A Primary Index is:
  • An index on the ordering column (the column on which the data file is sorted)
  • Can be dense or sparse
  • Can be one-level or multi-level
• Big advantage: records having the same key (or adjacent keys) are in the same (or adjacent) data blocks, which leads to sequential I/Os
SQL Query
Select ID, name, address From R Where ID = 100;
Assume an index is built on column ID
[Figure: a 2nd-level index on top of a 1st-level index on top of relation R; each level, like R itself, is stored as a heap file]
Un-Ordered Files & Secondary Indexes File where records are not ordered on the indexed column
Secondary Indexes
• The file may be ordered on another column, say Name.
• An index on the Name column is a primary index (can be sparse or dense)
• An index on any other column, say ID, is called a secondary index (has to be dense)
Can we build a sparse index on an un-ordered column? No. We must have an index entry for each data record.
Secondary Indexes
[Figure: attempted sparse index over a file that is unordered on the indexed column]
• A sparse index does not make sense!
Secondary Indexes
[Figure: dense secondary index with sorted keys 10, 20, 30, ... pointing to scattered records in the unordered file]
• An index entry for each data record
• The pointers cause random I/Os (even for the same or adjacent values)
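A sketch of a dense secondary index over a toy unordered file: the index itself is sorted on the search key, but consecutive entries point to scattered blocks, which is where the random I/Os come from.

```python
# Dense secondary index: one (key, record-id) entry per record, sorted on
# the key even though the heap file is not.

heap_blocks = [[90, 30], [20, 100], [80, 70], [40, 10], [60, 50]]   # unordered on the indexed value

secondary_index = sorted(
    (key, (block_no, slot_no))
    for block_no, block in enumerate(heap_blocks)
    for slot_no, key in enumerate(block)
)

# Adjacent index keys point to far-apart blocks -> random I/Os on the data file.
print(secondary_index[:4])   # [(10, (3, 1)), (20, (1, 0)), (30, (0, 1)), (40, (3, 0))]
```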
Multi-Level Secondary Indexes
[Figure: sparse 2nd-level index on top of the dense 1st-level secondary index]
• The 2nd level can be sparse because the 1st-level index is a sorted file
• The lowest level is dense
• The other levels are sparse
Duplicate Values & Secondary Indexes
[Figure: unordered file with duplicate values 10, 20, 30, 40]
Option 1: Follow the Rules
[Figure: dense secondary index that repeats each duplicate key once per record]
• Repeated keys can be many
• Problem: excess overhead!
  • Disk space
  • Search time
Option 2: Variable-Size Index Entries
[Figure: each distinct key appears once in the index, followed by a variable-length list of record pointers]
• Variable-size index records
• Problem:
  • Harder to store
  • Slower to read
  • More metadata information
Option 3: Indirection
[Figure: index with one entry per distinct key (10, 20, 30, 40); each entry points to a bucket with one pointer per data record having that value]
• Each distinct value is stored once in the index (saves space)
• Each value points to a bucket of pointers to the duplicate records
Can we build a 2nd-level index now? How?
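A sketch of the indirection option with invented data: the index keeps one fixed-size entry per distinct value, and each entry points to a bucket holding one record pointer per duplicate.

```python
# Indirection for duplicate keys: value -> bucket of (block_no, slot_no)
# record pointers; the index itself has one entry per distinct value.

heap_blocks = [[20, 20], [10, 10], [30, 10], [40, 40], [40, 40]]

buckets = {}                                       # one bucket per distinct value
for block_no, block in enumerate(heap_blocks):
    for slot_no, value in enumerate(block):
        buckets.setdefault(value, []).append((block_no, slot_no))

index = sorted(buckets)                            # one fixed-size entry per distinct key

print(index)          # [10, 20, 30, 40]
print(buckets[40])    # all duplicates of 40: [(3, 0), (3, 1), (4, 0), (4, 1)]
```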
Example: A secondary index (with record pointers) on a nonkey field, implemented using one level of indirection so that index entries are of fixed length and have unique field values.
A Two-Level Primary Index Example