1.65k likes | 1.66k Views
Learn about indexing and hashing techniques in data management, including primary and secondary indices, ordered and hash indices, and B-trees.
E N D
Temple University – CIS Dept.CIS616– Principles of Data Management V. Megalooikonomou Indexing and Hashing (based on notes by Silberchatz, Korth, and Sudarshan and notes by C. Faloutsos at CMU)
General Overview - rel. model • Relational model - SQL • Formal & commercial query languages • Functional Dependencies • Normalization • Physical Design • Indexing
Indexing- overview • primary / secondary indices • index-sequential (ISAM) • B - trees, B+ - trees • hashing • static hashing • dynamic hashing
Basic Concepts • Indexing mechanisms speed up access to desired data • E.g., author catalog in library • Search Key - attribute or set of attributes used to look up records • An index fileconsists of records (called index entries) of the form • Index files typically much smaller than the original file • Two basic kinds of indices: • Ordered indices: search keys are stored in sorted order • Hash indices: search keys are distributed uniformly across “buckets” using a “hash function” search-key pointer
Indexing • once the records are stored in a file, how do you search efficiently? (e.g., ssn=123?)
Indexing • once the records are stored in a file, how do you search efficiently? • brute force: retrieve all records, report the qualifying ones • better:use indices (pointers) to locate the records directly
Measuring ‘goodness’ • retrieval time? • insertion / deletion? • space overhead? • reorganization? • range queries?
Main concepts • search keys are sorted in the index file and point to the actual records • primary vs. secondary indices • Clustering (sparse) vs non-clustering (dense) indices
Indexing Primary key index: on primary key (no duplicates)
Indexing secondary key index: duplicates may exist Address-index
Indexing secondary key index: typically, with ‘postings lists’ Postings lists
Main concepts – cont’d • Clustering (= sparse) index: records are physically sorted on that key (and not all key values are needed in the index) • Non-clustering (=dense) index: the opposite • E.g.:
Indexing- Sparse index Clustering/sparse index on ssn >=123 >=456
Sparse Index Files • Sparse Index: contains index records for only some search-key values • Applicable when records are sequentially ordered on search-key • To locate a record with search-key value K we: • Find index record with largest search-key value < K • Search file sequentially starting at the record to which the index record points • Less space and less maintenance overhead for insertions and deletions • Generally slower than dense index for locating records • Good tradeoff: sparse index with an index entryfor every block in file, corresponding to least search-key value in the block
Indexing – Dense Index Non-clustering / dense index
Summary • All combinations are possible… • at most one sparse/clustering index • as many as desired dense indices • usually: one primary-key index (maybe clustering) and a few secondary-key indices (non-clustering)
Indexing- overview • primary / secondary indices • index-sequential (ISAM) • B - trees, B+ - trees • hashing • static hashing • dynamic hashing
ISAM • What if index is too large to search sequentially? use a multilevel index…
>=123 >=456 block ISAM
ISAM - observations • if index is too large, store it on disk and keep index-on-the-index • usually two levels of indices, one first-level entry per disk block (why? )
>=123 >=456 124; peterson; fifth ave. ISAM - observations • What about insertions/deletions?
ISAM - observations • What about insertions/deletions? overflows 124; peterson; fifth ave. Problems?
ISAM - observations • What about insertions/deletions? overflows 124; peterson; fifth ave. • overflow chains may become very long - what to do?
ISAM - observations • What about insertions/deletions? overflows 124; peterson; fifth ave. • overflow chains may become very long - thus: • shut-down & reorganize • start with ~80% utilization
So far • … indices (like ISAM) suffer in the presence of frequent updates • sequential scan using primary index is efficient, but a sequential scan using a secondary index is expensive • each record access may fetch a new block from disk • alternative indexing structure: B - trees
Overview • primary / secondary indices • multilevel (ISAM) • B - trees, B+ - trees • hashing • static hashing • dynamic hashing
B-trees • the most successful family of index schemes (B-trees, B+-trees, B*-trees) • can be used for primary/secondary, clustering/non-clustering index • they are balanced “n-way” search trees
B-trees • Disadvantage of indexed-sequential files: performance degrades as file grows, since many overflow blocks get created. Periodic reorganization of entire file is required • Advantage of B+-treeindex files: • automatic self-reorganization with small, local, changes, in the face of insertions and deletions. Reorganization of entire file is not required • Disadvantage of B+-trees: • extra insertion and deletion overhead, space overhead • Advantages of B+-trees outweigh disadvantages, and they are used extensively
6 9 <6 >9 <9 >6 3 1 7 13 B-trees E.g., B-tree of order 3 (i.e., at most 3 pointers from each node):
pn p1 … vn-1 v1 v2 B-tree properties: • each node, in a B-tree of order n : • key order • at most n pointers • at least n/2 pointers (except root) • all leaves at the same level • if number of pointers is k, then node has exactly k-1 keys
Properties • “block aware” nodes: each node -> disk page • O(log (N)) for everything! (ins/del/search) • typically, if N = 50 - 100, then 2 - 3 levels • utilization >= 50%, guaranteed; on average 69%
6 9 <6 >9 <9 >6 3 1 7 13 Queries • Algorithm for exact match query? (e.g., ssn=8?)
Queries • Algorithm for exact match query? (e.g., ssn=8?) 6 9 <6 >9 <9 >6 3 1 7 13
Queries • Algorithm for exact match query? (e.g., ssn=8?) 6 9 <6 >9 <9 >6 3 1 7 13
Queries • Algorithm for exact match query? (e.g., ssn=8?) 6 9 <6 >9 <9 >6 3 1 7 13
Queries • Algorithm for exact match query? (e.g., ssn=8?) 6 9 <6 H steps (= disk accesses) >9 <9 >6 3 1 7 13
Queries • what about range queries? (e.g., 5<salary<8) • Proximity/ nearest neighbor searches? (e.g., salary ~ 8 )
6 9 <6 >9 <9 >6 3 1 7 13 Queries • what about range queries? (e.g., 5<salary<8) • Proximity/ nearest neighbor searches? (e.g., salary ~ 8 )
6 9 <6 >9 <9 >6 3 1 7 13 Queries • what about range queries? (e.g., 5<salary<8) • Proximity/ nearest neighbor searches? (eg., salary ~ 8 )
B-trees: Insertion • Insert in leaf; on overflow, push middle up (recursively) • split: preserves B - tree properties
6 9 <6 >9 <9 >6 3 1 7 13 B-trees Easy case: Tree T0; insert ‘8’
6 9 <6 >9 <9 >6 3 1 7 13 B-trees Tree T0; insert ‘8’ 8
6 9 <6 >9 <9 >6 3 1 7 13 B-trees Hardest case: Tree T0; insert ‘2’ 2
13 B-trees Hardest case: Tree T0; insert ‘2’ 6 9 2 1 3 7 push middle up
2 13 1 3 B-trees Hardest case: Tree T0; insert ‘2’ Ovf; push middle 2 6 9 7
13 1 3 B-trees Hardest case: Tree T0; insert ‘2’ 6 Final state 9 2 7
B-trees - insertion • Q: What if there are two middles? (e.g., order 4) • A: either one is fine
B-trees: Insertion • Insert in leaf; on overflow, push middle up (recursively – ‘propagate split’) • split: preserves all B - tree properties (!!) • notice how it grows: height increases when root overflows & splits • Automatic, incremental re-organization (contrast with ISAM!)