1 / 165

Temple University – CIS Dept. CIS616– Principles of Data Management

Learn about indexing and hashing techniques in data management, including primary and secondary indices, ordered and hash indices, and B-trees.

gguadalupe
Download Presentation

Temple University – CIS Dept. CIS616– Principles of Data Management

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Temple University – CIS Dept.CIS616– Principles of Data Management V. Megalooikonomou Indexing and Hashing (based on notes by Silberchatz, Korth, and Sudarshan and notes by C. Faloutsos at CMU)

  2. General Overview - rel. model • Relational model - SQL • Formal & commercial query languages • Functional Dependencies • Normalization • Physical Design • Indexing

  3. Indexing- overview • primary / secondary indices • index-sequential (ISAM) • B - trees, B+ - trees • hashing • static hashing • dynamic hashing

  4. Basic Concepts • Indexing mechanisms speed up access to desired data • E.g., author catalog in library • Search Key - attribute or set of attributes used to look up records • An index fileconsists of records (called index entries) of the form • Index files typically much smaller than the original file • Two basic kinds of indices: • Ordered indices: search keys are stored in sorted order • Hash indices: search keys are distributed uniformly across “buckets” using a “hash function” search-key pointer

  5. Indexing • once the records are stored in a file, how do you search efficiently? (e.g., ssn=123?)

  6. Indexing • once the records are stored in a file, how do you search efficiently? • brute force: retrieve all records, report the qualifying ones • better:use indices (pointers) to locate the records directly

  7. Indexing – main idea:

  8. Measuring ‘goodness’ • retrieval time? • insertion / deletion? • space overhead? • reorganization? • range queries?

  9. Main concepts • search keys are sorted in the index file and point to the actual records • primary vs. secondary indices • Clustering (sparse) vs non-clustering (dense) indices

  10. Indexing Primary key index: on primary key (no duplicates)

  11. Indexing secondary key index: duplicates may exist Address-index

  12. Indexing secondary key index: typically, with ‘postings lists’ Postings lists

  13. Main concepts – cont’d • Clustering (= sparse) index: records are physically sorted on that key (and not all key values are needed in the index) • Non-clustering (=dense) index: the opposite • E.g.:

  14. Indexing- Sparse index Clustering/sparse index on ssn >=123 >=456

  15. Sparse Index Files • Sparse Index: contains index records for only some search-key values • Applicable when records are sequentially ordered on search-key • To locate a record with search-key value K we: • Find index record with largest search-key value < K • Search file sequentially starting at the record to which the index record points • Less space and less maintenance overhead for insertions and deletions • Generally slower than dense index for locating records • Good tradeoff: sparse index with an index entryfor every block in file, corresponding to least search-key value in the block

  16. Indexing – Dense Index Non-clustering / dense index

  17. Summary • All combinations are possible… • at most one sparse/clustering index • as many as desired dense indices • usually: one primary-key index (maybe clustering) and a few secondary-key indices (non-clustering)

  18. Indexing- overview • primary / secondary indices • index-sequential (ISAM) • B - trees, B+ - trees • hashing • static hashing • dynamic hashing

  19. ISAM • What if index is too large to search sequentially?  use a multilevel index…

  20. >=123 >=456 block ISAM

  21. ISAM - observations • if index is too large, store it on disk and keep index-on-the-index • usually two levels of indices, one first-level entry per disk block (why? )

  22. ISAM - Multilevel Index

  23. >=123 >=456 124; peterson; fifth ave. ISAM - observations • What about insertions/deletions?

  24. ISAM - observations • What about insertions/deletions? overflows 124; peterson; fifth ave. Problems?

  25. ISAM - observations • What about insertions/deletions? overflows 124; peterson; fifth ave. • overflow chains may become very long - what to do?

  26. ISAM - observations • What about insertions/deletions? overflows 124; peterson; fifth ave. • overflow chains may become very long - thus: • shut-down & reorganize • start with ~80% utilization

  27. So far • … indices (like ISAM) suffer in the presence of frequent updates • sequential scan using primary index is efficient, but a sequential scan using a secondary index is expensive • each record access may fetch a new block from disk • alternative indexing structure: B - trees

  28. Overview • primary / secondary indices • multilevel (ISAM) • B - trees, B+ - trees • hashing • static hashing • dynamic hashing

  29. B-trees • the most successful family of index schemes (B-trees, B+-trees, B*-trees) • can be used for primary/secondary, clustering/non-clustering index • they are balanced “n-way” search trees

  30. B-trees • Disadvantage of indexed-sequential files: performance degrades as file grows, since many overflow blocks get created. Periodic reorganization of entire file is required • Advantage of B+-treeindex files: • automatic self-reorganization with small, local, changes, in the face of insertions and deletions. Reorganization of entire file is not required • Disadvantage of B+-trees: • extra insertion and deletion overhead, space overhead • Advantages of B+-trees outweigh disadvantages, and they are used extensively

  31. 6 9 <6 >9 <9 >6 3 1 7 13 B-trees E.g., B-tree of order 3 (i.e., at most 3 pointers from each node):

  32. pn p1 … vn-1 v1 v2 B-tree properties: • each node, in a B-tree of order n : • key order • at most n pointers • at least n/2 pointers (except root) • all leaves at the same level • if number of pointers is k, then node has exactly k-1 keys

  33. Properties • “block aware” nodes: each node -> disk page • O(log (N)) for everything! (ins/del/search) • typically, if N = 50 - 100, then 2 - 3 levels • utilization >= 50%, guaranteed; on average 69%

  34. 6 9 <6 >9 <9 >6 3 1 7 13 Queries • Algorithm for exact match query? (e.g., ssn=8?)

  35. Queries • Algorithm for exact match query? (e.g., ssn=8?) 6 9 <6 >9 <9 >6 3 1 7 13

  36. Queries • Algorithm for exact match query? (e.g., ssn=8?) 6 9 <6 >9 <9 >6 3 1 7 13

  37. Queries • Algorithm for exact match query? (e.g., ssn=8?) 6 9 <6 >9 <9 >6 3 1 7 13

  38. Queries • Algorithm for exact match query? (e.g., ssn=8?) 6 9 <6 H steps (= disk accesses) >9 <9 >6 3 1 7 13

  39. Queries • what about range queries? (e.g., 5<salary<8) • Proximity/ nearest neighbor searches? (e.g., salary ~ 8 )

  40. 6 9 <6 >9 <9 >6 3 1 7 13 Queries • what about range queries? (e.g., 5<salary<8) • Proximity/ nearest neighbor searches? (e.g., salary ~ 8 )

  41. 6 9 <6 >9 <9 >6 3 1 7 13 Queries • what about range queries? (e.g., 5<salary<8) • Proximity/ nearest neighbor searches? (eg., salary ~ 8 )

  42. B-trees: Insertion • Insert in leaf; on overflow, push middle up (recursively) • split: preserves B - tree properties

  43. 6 9 <6 >9 <9 >6 3 1 7 13 B-trees Easy case: Tree T0; insert ‘8’

  44. 6 9 <6 >9 <9 >6 3 1 7 13 B-trees Tree T0; insert ‘8’ 8

  45. 6 9 <6 >9 <9 >6 3 1 7 13 B-trees Hardest case: Tree T0; insert ‘2’ 2

  46. 13 B-trees Hardest case: Tree T0; insert ‘2’ 6 9 2 1 3 7 push middle up

  47. 2 13 1 3 B-trees Hardest case: Tree T0; insert ‘2’ Ovf; push middle 2 6 9 7

  48. 13 1 3 B-trees Hardest case: Tree T0; insert ‘2’ 6 Final state 9 2 7

  49. B-trees - insertion • Q: What if there are two middles? (e.g., order 4) • A: either one is fine

  50. B-trees: Insertion • Insert in leaf; on overflow, push middle up (recursively – ‘propagate split’) • split: preserves all B - tree properties (!!) • notice how it grows: height increases when root overflows & splits • Automatic, incremental re-organization (contrast with ISAM!)

More Related