1.41k likes | 1.57k Views
CS232A: Database System Principles Indexing. Goal of Indexing. Given condition(s) on attribute(s) find qualified records Attr = value Condition may also be Attr>value Attr>=value. Qualified records. ?. value. value. value. Indexes (or Indices).
E N D
Goal of Indexing Given condition(s) on attribute(s) find qualified records Attr = value Condition may also be • Attr>value • Attr>=value Qualified records ? value value value
Indexes (or Indices) • Data Stuctures used for quickly locating tuples that meet a specific type of condition • Equality condition: find Movie tuples where Director=X • Other conditions possible, eg, range conditions: find Employee tuples where Salary>40 AND Salary<50 • Many types of indexes. Evaluate them on • Access time • Insertion/Deletion time • Condition types • Disk Space needed
Topics • Conventional indexes • B-trees • Hashing schemes
Terms and Distinctions A Dense Primary Index • Primary index • the index on the attribute (a.k.a. search key) that determines the sequencing of the table • Secondary index • index on any other attribute • Dense index • every value of the indexed attribute appears in the index • Sparse index • many values do not appear Sequential File
Dense and Sparse Primary Indexes Dense Primary Index Sparse Primary Index Find the index record with largest value that is less or equal to the value we are looking. + can tell if a value exists without accessing file + better access to overflow records + less index space more + and - in a while
Sparse vs. Dense Tradeoff • Sparse: Less index space per record can keep more of index in memory • Dense: Can tell if any record exists without accessing file (Later: • sparse better for insertions • dense needed for secondary indexes)
Multi-Level Indexes • Treat the index as a file and build an index on it • “Two levels are usually sufficient. More than three levels are rare.” • Q: Can we build a dense second level index for a dense index ?
A Note on Pointers • Record pointers consist of block pointer and position of record in the block • Using the block pointer only saves space at no extra disk accesses cost
Representation of Duplicate Values in Primary Indexes • Index may point to first instance of each value only
Deletion from Dense Index Delete 40, 80 • Deletion from dense primary index file with no duplicate values is handled in the same way with deletion from a sequential file • Q: What about deletion from dense primary index with duplicates Lists of available entries
Deletion from Sparse Index Delete 40 • if the deleted entry does not appear in the index do nothing
Deletion from Sparse Index (cont’d) Delete 30 • if the deleted entry does not appear in the index do nothing • if the deleted entry appears in the index replace it with the next search-key value • comment: we could leave the deleted value in the index assuming that no part of the system may assume it still exists without checking the block
Deletion from Sparse Index (cont’d) Delete 40, then 30 • if the deleted entry does not appear in the index do nothing • if the deleted entry appears in the index replace it with the next search-key value • unless the next search key value has its own index entry. In this case delete the entry
Insertion in Sparse Index Insert 35 • if no new block is created then do nothing
Insertion in Sparse Index Insert 35 • if no new block is created then do nothing • Or change index value 35
Insertion in Sparse Index Insert 15 • if no new block is created then do nothing or change index value • else create overflow record • Reorganize periodically • Could we claim space of next block? • How often do we reorganize and how much expensive it is? • B-trees offer convincing answers
30 20 80 100 90 50 70 40 10 60 Secondary indexes Sequence field File not sorted on secondary search key
100 30 20 80 90 10 40 60 50 70 90 30 ... 20 80 100 does not make sense! Secondary indexes Sequence field • Sparse index
90 30 20 80 100 50 70 40 10 60 50 10 10 50 20 60 90 30 70 40 ... ... sparse high level Secondary indexes Sequence field • Dense index First level has to be dense, next levels are sparse (as usual)
20 20 10 10 30 10 40 40 40 40 Duplicate values & secondary indexes
30 20 20 10 10 10 40 40 40 40 40 10 20 10 30 40 10 40 ... 20 40 Duplicate values & secondary indexes one option... • Problem: • excess overhead! • disk space • search time
10 20 20 10 30 40 40 40 10 40 20 30 40 Duplicate values & secondary indexes another option: lists of pointers 10 Problem: variable size records in index!
10 20 20 10 30 10 40 40 40 40 50 10 20 60 ... 30 40 Duplicate values & secondary indexes Yet another idea :Chain records with same key? • Problems: • Need to add fields to records, messes up maintenance • Need to follow chain to know records
10 30 20 20 10 10 40 40 40 40 50 10 20 60 ... 30 40 Duplicate values & secondary indexes buckets
Why “bucket” idea is useful • Enables the processing of queries working • with pointers only. • Very common technique in Information • Retrieval Indexes Records Name: primary EMP (name,dept,year,...) Dept: secondary Year: secondary
Advantage of Buckets: Process Queries Using Pointers Only Find employees of the Toys dept with 4 years in the company SELECT Name FROM Employee WHERE Dept=“Toys” AND Year=4 Year Index Dept Index Intersect toy bucket and 2nd Floor bucket to get set of matching EMP’s
cat dog Buckets known as Inverted lists This idea used in text information retrieval Documents ...the cat is fat ... ...my cat and my dog like each other... ...Fido the dog ...
Information Retrieval (IR) Queries • Find articles with “cat” and “dog” • Intersect inverted lists • Find articles with “cat” or “dog” • Union inverted lists • Find articles with “cat” and not “dog” • Subtract list of dog pointers from list of cat pointers • Find articles with “cat” in title • Find articles with “cat” and “dog” within 5 words
Common technique: more info in inverted list cat position location type d1 Title 5 Author 10 Abstract 57 d2 d3 dog Title 100 Title 12
Posting: an entry in inverted list. Represents occurrence of term in article Size of a list: 1 Rare words or (in postings) mis-spellings 106 Common words Size of a posting: 10-15 bits (compressed)
PRODUCT = 1 + ……. = score Vector space model w1 w2 w3 w4 w5 w6 w7 … DOC = <1 0 0 1 1 0 0 …> Query= <0 0 1 1 0 0 0 …>
Tricks to weigh scores + normalize e.g.: Match on common word not as useful as match on rare words...
Summary of Indexing So Far • Basic topics in conventional indexes • multiple levels • sparse/dense • duplicate keys and buckets • deletion/insertion similar to sequential files • Advantages • simple algorithms • index is sequential file • Disadvantages • eventually sequentiality is lost because of overflows, reorganizations are needed
32 39 38 31 33 34 35 36 overflow area (not sequential) 10 20 Example Index (sequential) continuous free space 30 40 50 60 70 80 90
Outline: • Conventional indexes • B-Trees NEXT • Hashing schemes
NEXT: Another type of index • Give up on sequentiality of index • Try to get “balance”
B+Tree Example n=3 Root 100 120 150 180 30 3 5 11 120 130 180 200 100 101 110 150 156 179 30 35
Sample non-leaf 57 81 95 to keys to keys to keys to keys < 57 57 k<81 81k<95 95
Sample leaf node: From non-leaf node to next leaf in sequence 57 81 95 To record with key 57 To record with key 81 To record with key 85
In textbook’s notation n=3 Leaf: Non-leaf: 30 35 30 35 30 30
Size of nodes: n+1 pointers n keys (fixed)
Non-root nodes have to be at least half-full • Use at least Non-leaf: (n+1)/2 pointers Leaf: (n+1)/2 pointers to data
n=3 Full node min. node Non-leaf Leaf 120 150 180 30 3 5 11 30 35
B+tree rules tree of order n (1) All leaves at same lowest level (balanced tree) (2) Pointers in leaves point to records except for “sequence pointer”
(3) Number of pointers/keys for B+tree Max Max Min Min ptrs keys ptrsdata keys Non-leaf (non-root) n+1 n (n+1)/2 (n+1)/2- 1 Leaf (non-root) n+1 n (n+1)/2 (n+1)/2 Root n+1 n 1 1 Counting sequence pointer also
Insert into B+tree (a) simple case • space available in leaf (b) leaf overflow (c) non-leaf overflow (d) new root
32 n=3 100 (a) Insert key = 32 30 3 5 11 30 31
7 3 5 7 n=3 100 (a) Insert key = 7 30 3 5 11 30 31
160 180 160 179 n=3 100 (c) Insert key = 160 120 150 180 180 200 150 156 179