Learn about indexing mechanisms to speed up data access, types and evaluation metrics of indexes, and the benefits of Index Scans over Table Scans. Understand Dense and Sparse indexes and their impact on storage and query efficiency.
CS4432: Database Systems II Indexing-Basics
Locating Records: Table Scans
[Figure: heap file of relation R with a header page, a directory, and data pages 1..N]
Select ID, name, address From R Where ID = 1000;
Naïve way: Table Scan
• Open the heap file of relation R
• Access each data page
• Access each record in each page
• Check the condition
Locating Records: Table Scans
[Figure: same heap file layout as above]
• Open the heap file of relation R
• Access each data page
• Access each record in each page
• Check the condition
What is the least amount of memory needed for a Table Scan? Only 1 memory block (see the sketch below)
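As a rough illustration of why a single buffer block suffices, here is a minimal Python sketch (not DBMS or course code); the heap file, page layout, and record dictionaries are made-up toy structures.

```python
# Toy table scan: the heap file is a list of pages, each page a list of
# record dicts. Only one page is held in memory at a time.

def table_scan(heap_file, predicate):
    """Yield every record, from every page, that satisfies the predicate."""
    for page in heap_file:           # read one data page at a time
        buffer = page                # the single in-memory block
        for record in buffer:        # access each record in the page
            if predicate(record):    # check the WHERE condition
                yield record

# Select ID, name, address From R Where ID = 1000
R = [
    [{"ID": 999,  "name": "Ann", "address": "A St."}],
    [{"ID": 1000, "name": "Bob", "address": "B St."}],
]
print(list(table_scan(R, lambda r: r["ID"] == 1000)))
```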
Locating Records: Index Scan
Select ID, name, address From R Where ID = 1000;
• A Table Scan is always an available option
• But it is not efficient, especially for large relations
• Indexing & Index Scans are a more efficient way to locate records
• Whether an Index Scan can be used depends on whether an index exists on the queried column
Basic Concepts
• Indexing mechanisms are used to speed up access to desired data.
• Search Key: an attribute or set of attributes used to look up records in a file (the search attribute, e.g., ID in the query below).
• An index file consists of records (called index entries) of the form ⟨search-key, pointer⟩
Select ID, name, address From R Where ID = 1000;
Basic Concepts (Cont’d)
• An index file consists of records (called index entries) of the form ⟨search-key, pointer⟩
• Index files are typically much smaller than the original file
• Types of indexes:
  • Dense vs. Sparse
  • Primary vs. Secondary
  • One-Level vs. Multi-Level
Index Evaluation Metrics
• Savings here:
  • Access time
  • Access types supported, e.g.:
    • Equality Search (x = 100): records with a specified value in the attribute
    • Range Search (10 < x < 100): records with an attribute value falling in a specified range of values
• Overheads here:
  • Insertion time
  • Deletion time
  • Space overhead
Sequential Files & Primary Indexes File where records are ordered on the indexed column
Dense Index on Ordered File
[Figure: dense index with keys 10, 20, 30, ..., one entry per record, pointing into the sorted sequential file]
• Ordered (Sequential) File: records are stored sorted on the indexed attribute
• Dense Index: has one entry for each data tuple
Dense Index on Ordered File
[Figure: same dense index and sequential file as above]
#entries in index = #records in file
But the index size is much smaller than the file size
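A minimal sketch of what a dense index contains, assuming a toy sorted file of key-only records and (block, slot) record ids; none of these names come from the slides.

```python
# Dense index sketch: one (search-key, pointer) entry per record of the
# sorted file. Pointers are modeled as (block_no, slot_no) record ids.

sorted_file = [[10, 20], [30, 40], [50, 60], [70, 80], [90, 100]]  # 5 blocks, 2 records each

dense_index = [
    (key, (block_no, slot_no))
    for block_no, block in enumerate(sorted_file)
    for slot_no, key in enumerate(block)
]

print(len(dense_index))   # 10 entries: #entries in index == #records in file
print(dense_index[:3])    # [(10, (0, 0)), (20, (0, 1)), (30, (1, 0))] -- entries are small
```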
Dense Index: Locate Key = 100
• Index Scan:
  • Read each page from the index
  • Search for key = 100
  • Follow the pointer (Record Id)
• Index Binary Search (since all keys are sorted):
  • Read the middle page of the index
  • Either you find the key there, or
  • Move up or down
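A sketch of the binary-search lookup over a dense index, reusing the same toy layout as above; Python's `bisect` stands in for reading index pages.

```python
import bisect

# Index binary search on a dense index: find the key among the sorted
# index entries, then follow its pointer into the data file.

sorted_file = [[10, 20], [30, 40], [50, 60], [70, 80], [90, 100]]
dense_index = [(k, (b, s)) for b, blk in enumerate(sorted_file)
                           for s, k in enumerate(blk)]
index_keys = [k for k, _ in dense_index]            # already sorted

def dense_lookup(search_key):
    pos = bisect.bisect_left(index_keys, search_key)            # binary search on the index
    if pos == len(index_keys) or index_keys[pos] != search_key:
        return None                                             # dense: key absent => record absent
    block_no, slot_no = dense_index[pos][1]                     # follow the pointer (record id)
    return sorted_file[block_no][slot_no]

print(dense_lookup(100))   # 100
print(dense_lookup(35))    # None, without touching the data file
```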
Sparse Index On Ordered File
[Figure: sparse index with keys 10, 30, 50, ..., one entry per data block, pointing into the sorted sequential file]
• Sparse Index: an entry for only the 1st record in each data block
• A Sparse Index is smaller than a Dense Index
Sparse Index On Ordered File
Can we build a sparse index on an unordered file?
• Sparse indexes can be built ONLY on ordered (sequential) files
Sparse Index: Locate Key = 100
• Index Binary Search still works:
  • Either locate the search key in the index, or
  • Locate the largest key smaller than your search key
• Follow the pointer and check the data block
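A sketch of the same lookup against a sparse index (toy data, hypothetical names): find the largest index key not exceeding the search key, then check that one data block.

```python
import bisect

# Sparse index sketch: one entry per data block (the block's first key).

sorted_file  = [[10, 30], [50, 70], [90, 110], [130, 150]]
sparse_index = [(block[0], block_no) for block_no, block in enumerate(sorted_file)]
index_keys   = [k for k, _ in sparse_index]

def sparse_lookup(search_key):
    pos = bisect.bisect_right(index_keys, search_key) - 1   # largest index key <= search key
    if pos < 0:
        return None                                         # smaller than every key in the file
    block_no = sparse_index[pos][1]                          # follow the pointer ...
    return search_key if search_key in sorted_file[block_no] else None   # ... and check the block

print(sparse_lookup(70))    # 70
print(sparse_lookup(100))   # None (the block starting at 90 does not contain it)
```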
Multi-Level Index
[Figure: sparse 2nd-level index (keys 10, 90, 170, 250, ...) built on top of the 1st-level index]
• The 1st-level index file is just a file with sorted keys
• So we can build a 2nd-level index on top of it
Is the index file always sorted? Is the 2nd level sparse or dense? Can it be dense?
Multi-Level Index
[Figure: same sparse 2nd-level index as above]
Is the index file always sorted? Yes
Is the 2nd level sparse or dense? Can it be dense? The 2nd, 3rd, … levels have to be sparse (otherwise no savings)
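A sketch of a two-level lookup under the assumptions above (1st level dense, 2nd level sparse); keys and block sizes are invented, and pointers are left implicit for brevity.

```python
import bisect

# Two-level index sketch: the 1st level holds all keys (dense), split into
# index blocks; the sparse 2nd level stores the first key of each 1st-level block.

keys   = list(range(10, 250, 10))                          # all keys in the 1st-level index
level1 = [keys[i:i + 4] for i in range(0, len(keys), 4)]   # 1st-level index blocks, 4 keys each
level2 = [blk[0] for blk in level1]                        # sparse 2nd level

def exists(search_key):
    blk_no = bisect.bisect_right(level2, search_key) - 1   # which 1st-level block to read
    if blk_no < 0:
        return False
    return search_key in level1[blk_no]                    # read just one 1st-level block

print(exists(170), exists(175))   # True False
```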
Index without Pointers
Note: If the file is contiguous, then we can omit the pointers
• The index starts with a pointer to the first block, then a list of keys (one for each block)
• If we need Key = K3: it is the 3rd key, so check the 3rd block
• Location: first pointer + (3-1)*1024 (1024 being the block size; lecture #8)
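A sketch of the pointer-free layout: only the address of the first block plus one key per block is kept, and a block's address is computed arithmetically. The 1024-byte block size and the starting address are assumptions for illustration.

```python
# Index without pointers, for a contiguous file: the i-th key's block
# starts at first_block_addr + (i - 1) * BLOCK_SIZE.

BLOCK_SIZE       = 1024          # assumed block size (matches the slide's formula)
first_block_addr = 40960         # hypothetical address of the 1st data block
keys             = ["K1", "K2", "K3", "K4"]   # one key per block, in file order

def block_address(search_key):
    i = keys.index(search_key) + 1                  # 1-based position of the key
    return first_block_addr + (i - 1) * BLOCK_SIZE  # no pointer stored, just arithmetic

print(block_address("K3"))   # 40960 + 2 * 1024 = 43008
```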
Sparse vs. Dense Indexes
• Sparse:
  • Less space
  • Better for insertion
  • Only for sorted files (or higher-level indexes)
• Dense:
  • More space
  • Must be used for unsorted files (secondary indexes)
  • Can tell if a record does not exist without checking the data file
Files with Duplicate Keys: Dense Index
[Figure: dense index with one entry per record, so duplicate keys 10, 10, 10, 20, 20, 30, ... are repeated in the index]
• One entry in the index for each value occurrence (i.e., for each record)
• Too much wasted space
Files with Duplicate Keys: Dense Index (Compact Design)
[Figure: dense index with one entry per distinct value (10, 20, 30, 40), each pointing to the first record with that value]
• One entry in the index for each distinct value
• How to locate key 35?
  • It does not exist in the index, and the index is dense
  • So there is no need to search the data file
Files with Duplicate Keys: Sparse Index
[Figure: sparse index entries 10, 10, 20, 30, one for the 1st record of each block]
• One index entry for the 1st record in each block
• Careful if looking for 20 or 30! Their first occurrences may sit at the end of the preceding block
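To see why 20 and 30 need care, here is a toy sketch: when the search key equals a block's first index key, matching records may start at the end of the previous block, so the scan starts one block earlier. The block contents are invented.

```python
import bisect

# Sparse index over a file with duplicates: start scanning one block
# before the first index entry equal to the search key.

blocks     = [[10, 10], [10, 20], [20, 30], [30, 45]]
index_keys = [blk[0] for blk in blocks]               # 10, 10, 20, 30

def find_all(search_key):
    start = max(bisect.bisect_left(index_keys, search_key) - 1, 0)   # careful: one block early
    matches = []
    for blk in blocks[start:]:
        matches += [k for k in blk if k == search_key]
        if blk[-1] > search_key:                      # file is sorted, so we can stop here
            break
    return matches

print(find_all(20))   # [20, 20] -- the first 20 lives in a block whose index key is 10
print(find_all(30))   # [30, 30]
```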
Sequential (Ordered) File Insertion/Deletion
Sparse Index: Deletion
[Figure: sparse index 10, 30, 50, 70, ... over the ordered data file]
• Delete record 40
• The index requires no reorganization
• The data block will have some empty space, which is good to have
Sparse Index: Deletion
[Figure: same sparse index and data file]
• Delete record 30
• The value 30 in the index will change (e.g., to 40)
• Record 40 may or may not move within its block
Sparse Index: Deletion
[Figure: same sparse index and data file]
• Delete records 30 & 40
• In the data file, Block 2 becomes empty and will be deleted
• In the index file, do not create empty slots in the middle
• Empty slots are allowed at the end
Dense Index: Deletion
[Figure: dense index 10, 20, 30, ..., one entry per record]
• Delete record 30
• Same ideas and mechanisms as the sparse case
• Dense indexes may trigger more updates in the index
• Record 40 may or may not move in its data block
• The index cannot have free slots in the middle
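A sketch contrasting how little a sparse index changes on deletion, using the same toy conventions as before (one key per record, one index entry per block); the delete policy shown is just one plausible variant, not the course's.

```python
# Sparse-index deletion sketch: the index is touched only when a block's
# first record changes or the block becomes empty (a dense index would be
# updated on every delete).

blocks       = [[10, 20], [30, 40], [50, 60], [70, 80]]
sparse_index = [blk[0] for blk in blocks]          # 10, 30, 50, 70

def delete(key):
    for i, blk in enumerate(blocks):
        if key in blk:
            blk.remove(key)
            if not blk:                            # block emptied:
                del blocks[i]                      #   drop the block and shift index
                del sparse_index[i]                #   entries left (no holes in the middle)
            elif sparse_index[i] == key:
                sparse_index[i] = blk[0]           # first record changed: update the entry
            return

delete(30); print(sparse_index)   # [10, 40, 50, 70]  (entry 30 became 40)
delete(40); print(sparse_index)   # [10, 50, 70]      (block emptied, entry removed)
```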
Sparse Index: Insertion
[Figure: sparse index 10, 30, 40, 60; record 34 fits into the free space of its data block]
• Insert record 34: our lucky day! We have free space where we need it
• It is good to have free space in each data block, especially if the file is ordered
• DBMSs may keep x%, e.g., 10%, free to make insertions easier
Sparse Index: Insertion
[Figure: inserting 15 shifts records 20 and 30 within/across blocks]
• Insert record 15
• Approach 1 (Immediate Organization): move the data records within a block or across blocks to make space for the new record
• Other cheaper variations?
Use of Overflow Blocks
[Figure: record 25 is placed in an overflow block chained to its data block (reorganize later...)]
• Insert record 25: it goes to an overflow block
• What about inserting 15 instead of 25? Record 20 will move to the overflow block
• Still, the index will not change
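A sketch of insertion with overflow blocks under the same toy conventions; the block capacity, block contents, and the "append to an overflow list" policy are all assumptions for illustration.

```python
# Insertion with overflow blocks: if the target block of the ordered file
# is full, the new record goes to an overflow block chained to it, and the
# sparse index is left unchanged.

BLOCK_CAPACITY = 2
blocks       = [[10, 20], [30], [40, 60]]           # block 1 still has a free slot
overflow     = {}                                   # block_no -> records in its overflow block
sparse_index = [10, 30, 40]                         # first key of each primary block

def insert(key):
    # last block whose first key is <= key (block 0 if key is the new minimum)
    target = max((i for i, k in enumerate(sparse_index) if k <= key), default=0)
    blk = blocks[target]
    if len(blk) < BLOCK_CAPACITY:
        blk.append(key); blk.sort()                 # free space where we need it
    else:
        overflow.setdefault(target, []).append(key) # reorganize later...

insert(35); print(blocks, overflow)   # 35 fits into block 1; no overflow
insert(25); print(blocks, overflow)   # 25 overflows block 0; the index is unchanged
```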
Insertion: Dense Index Case
• Similar
• Often more expensive…
Remember…
• A Primary Index is:
  • An index on the ordering column (the column on which the data file is sorted)
  • Can be dense or sparse
  • Can be one-level or multi-level
• Big advantage: records having the same key (or adjacent keys) are in the same (or adjacent) data blocks, which leads to sequential I/Os
SQL Query
Select ID, name, address From R Where ID = 100;
Assume an index is built on column ID
[Figure: a 2nd-level index on top of a 1st-level index on top of relation R; each level, like R itself, is stored as a heap file]
Un-Ordered Files & Secondary Indexes File where records are not ordered on the indexed column
Secondary Indexes
• The file may be ordered on another column, say Name.
• An index on the Name column is a primary index (can be sparse or dense)
• An index on any other column, say ID, is called a secondary index (has to be dense)
Can we build a sparse index on an un-ordered column? No. We must have an index entry for each data record.
Secondary Indexes
[Figure: attempted sparse index over a file that is unordered on the indexed column]
• A sparse index does not make sense!
Secondary Indexes
[Figure: dense secondary index with sorted keys 10, 20, 30, ... pointing to scattered records in the unordered file]
• An index entry for each data record
• The pointers cause random I/Os (even for the same or adjacent values)
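A sketch of a dense secondary index over a toy unordered file: the index itself is sorted on the search key, but consecutive entries point to scattered blocks, which is where the random I/Os come from.

```python
# Dense secondary index: one (key, record-id) entry per record, sorted on
# the key even though the heap file is not.

heap_blocks = [[90, 30], [20, 100], [80, 70], [40, 10], [60, 50]]   # unordered on the indexed value

secondary_index = sorted(
    (key, (block_no, slot_no))
    for block_no, block in enumerate(heap_blocks)
    for slot_no, key in enumerate(block)
)

# Adjacent index keys point to far-apart blocks -> random I/Os on the data file.
print(secondary_index[:4])   # [(10, (3, 1)), (20, (1, 0)), (30, (0, 1)), (40, (3, 0))]
```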
Multi-Level Secondary Indexes
[Figure: sparse 2nd-level index on top of the dense 1st-level secondary index]
• The 2nd level can be sparse because the 1st-level index is a sorted file
• The lowest level is dense
• The other levels are sparse
Duplicate Values & Secondary Indexes
[Figure: unordered file with duplicate values 10, 20, 30, 40]
Option 1: Follow the Rules
[Figure: dense secondary index that repeats each duplicate key once per record]
• Repeated keys can be many
• Problem: excess overhead!
  • Disk space
  • Search time
Option 2: Variable-Size Index Entries
[Figure: each distinct key appears once in the index, followed by a variable-length list of record pointers]
• Variable-size index records
• Problem:
  • Harder to store
  • Slower to read
  • More metadata information
Option 3: Indirection
[Figure: index with one entry per distinct key (10, 20, 30, 40); each entry points to a bucket with one pointer per data record having that value]
• Each distinct value is stored once in the index (saves space)
• Each value points to a bucket of pointers to the duplicate records
Can we build a 2nd-level index now? How?
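A sketch of the indirection option with invented data: the index keeps one fixed-size entry per distinct value, and each entry points to a bucket holding one record pointer per duplicate.

```python
# Indirection for duplicate keys: value -> bucket of (block_no, slot_no)
# record pointers; the index itself has one entry per distinct value.

heap_blocks = [[20, 20], [10, 10], [30, 10], [40, 40], [40, 40]]

buckets = {}                                       # one bucket per distinct value
for block_no, block in enumerate(heap_blocks):
    for slot_no, value in enumerate(block):
        buckets.setdefault(value, []).append((block_no, slot_no))

index = sorted(buckets)                            # one fixed-size entry per distinct key

print(index)          # [10, 20, 30, 40]
print(buckets[40])    # all duplicates of 40: [(3, 0), (3, 1), (4, 0), (4, 1)]
```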
Example: A secondary index (with record pointers) on a nonkey field, implemented using one level of indirection so that index entries are of fixed length and have unique field values.
A Two-Level Primary Index Example