1 / 138

CS232A: Database System Principles Indexing

CS232A: Database System Principles Indexing. Goal of Indexing. Given condition(s) on attribute(s) find qualified records Attr = value Condition may also be Attr>value Attr>=value. Qualified records. ?. value. value. value. Indexes (or Indices).

Download Presentation

CS232A: Database System Principles Indexing

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CS232A: Database System PrinciplesIndexing

  2. Goal of Indexing Given condition(s) on attribute(s) find qualified records Attr = value Condition may also be • Attr>value • Attr>=value Qualified records ? value value value

  3. Indexes (or Indices) • Data Stuctures used for quickly locating tuples that meet a specific type of condition • Equality condition: find Movie tuples where Director=X • Other conditions possible, eg, range conditions: find Employee tuples where Salary>40 AND Salary<50 • Many types of indexes. Evaluate them on • Access time • Insertion/Deletion time • Condition types • Disk Space needed

  4. Topics • Conventional indexes • B-trees • Hashing schemes

  5. Terms and Distinctions A Dense Primary Index • Primary index • the index on the attribute (a.k.a. search key) that determines the sequencing of the table • Secondary index • index on any other attribute • Dense index • every value of the indexed attribute appears in the index • Sparse index • many values do not appear Sequential File

  6. Dense and Sparse Primary Indexes Dense Primary Index Sparse Primary Index Find the index record with largest value that is less or equal to the value we are looking. + can tell if a value exists without accessing file + better access to overflow records + less index space more + and - in a while

  7. Sparse vs. Dense Tradeoff • Sparse: Less index space per record can keep more of index in memory • Dense: Can tell if any record exists without accessing file (Later: • sparse better for insertions • dense needed for secondary indexes)

  8. Multi-Level Indexes • Treat the index as a file and build an index on it • “Two levels are usually sufficient. More than three levels are rare.” • Q: Can we build a dense second level index for a dense index ?

  9. A Note on Pointers • Record pointers consist of block pointer and position of record in the block • Using the block pointer only saves space at no extra disk accesses cost

  10. Representation of Duplicate Values in Primary Indexes • Index may point to first instance of each value only

  11. Deletion from Dense Index Delete 40, 80 • Deletion from dense primary index file with no duplicate values is handled in the same way with deletion from a sequential file • Q: What about deletion from dense primary index with duplicates Lists of available entries

  12. Deletion from Sparse Index Delete 40 • if the deleted entry does not appear in the index do nothing

  13. Deletion from Sparse Index (cont’d) Delete 30 • if the deleted entry does not appear in the index do nothing • if the deleted entry appears in the index replace it with the next search-key value • comment: we could leave the deleted value in the index assuming that no part of the system may assume it still exists without checking the block

  14. Deletion from Sparse Index (cont’d) Delete 40, then 30 • if the deleted entry does not appear in the index do nothing • if the deleted entry appears in the index replace it with the next search-key value • unless the next search key value has its own index entry. In this case delete the entry

  15. Insertion in Sparse Index Insert 35 • if no new block is created then do nothing

  16. Insertion in Sparse Index Insert 35 • if no new block is created then do nothing • Or change index value 35

  17. Insertion in Sparse Index Insert 15 • if no new block is created then do nothing or change index value • else create overflow record • Reorganize periodically • Could we claim space of next block? • How often do we reorganize and how much expensive it is? • B-trees offer convincing answers

  18. 30 20 80 100 90 50 70 40 10 60 Secondary indexes Sequence field File not sorted on secondary search key

  19. 100 30 20 80 90 10 40 60 50 70 90 30 ... 20 80 100 does not make sense! Secondary indexes Sequence field • Sparse index

  20. 90 30 20 80 100 50 70 40 10 60 50 10 10 50 20 60 90 30 70 40 ... ... sparse high level Secondary indexes Sequence field • Dense index First level has to be dense, next levels are sparse (as usual)

  21. 20 20 10 10 30 10 40 40 40 40 Duplicate values & secondary indexes

  22. 30 20 20 10 10 10 40 40 40 40 40 10 20 10 30 40 10 40 ... 20 40 Duplicate values & secondary indexes one option... • Problem: • excess overhead! • disk space • search time

  23. 10 20 20 10 30 40 40 40 10 40 20 30 40 Duplicate values & secondary indexes another option: lists of pointers 10 Problem: variable size records in index!

  24. 10 20 20 10 30 10 40 40 40 40 50 10 20 60 ... 30 40 Duplicate values & secondary indexes    Yet another idea :Chain records with same key?  • Problems: • Need to add fields to records, messes up maintenance • Need to follow chain to know records

  25. 10 30 20 20 10 10 40 40 40 40 50 10 20 60 ... 30 40 Duplicate values & secondary indexes buckets

  26. Why “bucket” idea is useful • Enables the processing of queries working • with pointers only. • Very common technique in Information • Retrieval Indexes Records Name: primary EMP (name,dept,year,...) Dept: secondary Year: secondary

  27. Advantage of Buckets: Process Queries Using Pointers Only Find employees of the Toys dept with 4 years in the company SELECT Name FROM Employee WHERE Dept=“Toys” AND Year=4 Year Index Dept Index Intersect toy bucket and 2nd Floor bucket to get set of matching EMP’s

  28. cat dog Buckets known as Inverted lists This idea used in text information retrieval Documents ...the cat is fat ... ...my cat and my dog like each other... ...Fido the dog ...

  29. Information Retrieval (IR) Queries • Find articles with “cat” and “dog” • Intersect inverted lists • Find articles with “cat” or “dog” • Union inverted lists • Find articles with “cat” and not “dog” • Subtract list of dog pointers from list of cat pointers • Find articles with “cat” in title • Find articles with “cat” and “dog” within 5 words

  30. Common technique: more info in inverted list cat position location type d1 Title 5 Author 10 Abstract 57 d2 d3 dog Title 100 Title 12

  31. Posting: an entry in inverted list. Represents occurrence of term in article Size of a list: 1 Rare words or (in postings) mis-spellings 106 Common words Size of a posting: 10-15 bits (compressed)

  32. PRODUCT = 1 + ……. = score Vector space model w1 w2 w3 w4 w5 w6 w7 … DOC = <1 0 0 1 1 0 0 …> Query= <0 0 1 1 0 0 0 …>

  33. Tricks to weigh scores + normalize e.g.: Match on common word not as useful as match on rare words...

  34. Summary of Indexing So Far • Basic topics in conventional indexes • multiple levels • sparse/dense • duplicate keys and buckets • deletion/insertion similar to sequential files • Advantages • simple algorithms • index is sequential file • Disadvantages • eventually sequentiality is lost because of overflows, reorganizations are needed

  35. 32 39 38 31 33 34 35 36 overflow area (not sequential) 10 20 Example Index (sequential) continuous free space 30 40 50 60 70 80 90

  36. Outline: • Conventional indexes • B-Trees  NEXT • Hashing schemes

  37. NEXT: Another type of index • Give up on sequentiality of index • Try to get “balance”

  38. B+Tree Example n=3 Root 100 120 150 180 30 3 5 11 120 130 180 200 100 101 110 150 156 179 30 35

  39. Sample non-leaf 57 81 95 to keys to keys to keys to keys < 57 57 k<81 81k<95 95

  40. Sample leaf node: From non-leaf node to next leaf in sequence 57 81 95 To record with key 57 To record with key 81 To record with key 85

  41. In textbook’s notation n=3 Leaf: Non-leaf: 30 35 30 35 30 30

  42. Size of nodes: n+1 pointers n keys (fixed)

  43. Non-root nodes have to be at least half-full • Use at least Non-leaf: (n+1)/2 pointers Leaf: (n+1)/2 pointers to data

  44. n=3 Full node min. node Non-leaf Leaf 120 150 180 30 3 5 11 30 35

  45. B+tree rules tree of order n (1) All leaves at same lowest level (balanced tree) (2) Pointers in leaves point to records except for “sequence pointer”

  46. (3) Number of pointers/keys for B+tree Max Max Min Min ptrs keys ptrsdata keys Non-leaf (non-root) n+1 n (n+1)/2 (n+1)/2- 1 Leaf (non-root) n+1 n (n+1)/2 (n+1)/2 Root n+1 n 1 1 Counting sequence pointer also

  47. Insert into B+tree (a) simple case • space available in leaf (b) leaf overflow (c) non-leaf overflow (d) new root

  48. 32 n=3 100 (a) Insert key = 32 30 3 5 11 30 31

  49. 7 3 5 7 n=3 100 (a) Insert key = 7 30 3 5 11 30 31

  50. 160 180 160 179 n=3 100 (c) Insert key = 160 120 150 180 180 200 150 156 179

More Related