1 / 54

BTrees & Sorting

BTrees & Sorting. 11/3. Indexing. “If you don’t find it in the index, look very carefully through the entire catalog” -- Sears, Roebuck, and Co., Consumers Guide, 1897. Index Motivation. A file contains some records, say products We want faster access to those records

Download Presentation

BTrees & Sorting

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. BTrees & Sorting 11/3

  2. Indexing “If you don’t find it in the index, look very carefully through the entire catalog” -- Sears, Roebuck, and Co., Consumers Guide, 1897

  3. Index Motivation • A file contains some records, say products • We want faster access to those records • E.g., Give me all products made by Sony • Intuition: Build a second file that organizes the records “by product” to make this faster • NB: we don’t always have to build a second file

  4. Indexes • An indexon a file speeds up selections on the search key fields for the index. • Search key properties • Any subset of fields • isnotthe same as key of a relation On which attributes would you build indexes? Product(name, maker, price)

  5. More precisely • An index contains a collection of data entries, and supports efficient retrieval of all data entries k*with a given key value k. Product(name, maker, price) Sample queries? Indexing is one the most important facilities provided by a database for performance

  6. Operations on an Index • Search: Given a key find all records • More sophisticated variants as well. Why? • Insert /Remove entries • Bulk Load. Why? Real difference between structures: costs of ops determines which index you pick and why

  7. Data File with Several Index Files Equality Query: Age = 12 and Sal = 90? Range Query: Age = 5 and Sal > 5? <Age, Sal> Composite keys in Dictionary Order <Sal, Age> <Age> <Sal>

  8. High-level of Index Structures

  9. Outline • Btrees • Very good for range queries, sorted data • Some old databases only implemented Btrees • Hash Tables • There are variants of this basic structure to deal with IO The data structures we present here are “IO aware”

  10. B+ Trees • Search trees • B does not mean binary! • Idea in B Trees: • make 1 node = 1 physical page • Balanced, height adjusted tree (not the B either) • Idea in B+ Trees: • Make leaves into a linked list (range queries)

  11. B+ Trees Basics Parameter d = the degree Each node has >= d and <= 2d keys (except root) Internal Nodes Keys k < 30 Keys 240<=k Keys 30<=k<120 Keys 120<=k<240 Leaf Nodes Each leaf has >=d and <= 2d keys: Next leaf 40 50 60

  12. d = 2 B+ Tree Example 1. No data in internal nodes. 2. Links between leaf pages. 10 15 18 20 30 40 50 60 65 80 85 90

  13. Searching a B+ Tree • Exact key values: • Start at the root • Proceed down, to the leaf • Range queries: • As above • Then sequential traversal Select name From people Where age = 25 Select name From people Where 20 <= age and age <= 30

  14. B+ Tree Example 30 < 80. K = 30? 30 in [20,60) To the data! 10 15 18 20 30 40 50 60 65 80 85 90

  15. B+ Tree Example 30 < 80. K in [30,85] 30 in [20,60) Use those leaf pointers! To the data! 10 15 18 20 30 40 50 60 65 80 85 90

  16. B+ Tree Design • How large is d? • Example: • Key size = 4 bytes • Pointer size = 8 bytes • Block size = 4096 byes • 2d x 4 + (2d+1) x 8 <= 4096 • d = 170 Observable Universe contains ≈ 1080 atoms. What is height of a B+tree that indexes it? NB: Oracle allows 64k pages TiB is 240 bytes. What is the height to index with 64k Pages?

  17. B+ Trees in Practice • Typical order: 100. Typical fill-factor: 67%. • average fanout = 133 • Typical capacities: • Height 4: 1334 = 312,900,700 records • Height 3: 1333 = 2,352,637 records • Top levels of tree sit in the buffer pool: • Level 1 = 1 page = 8 Kbytes • Level 2 = 133 pages = 1 Mbyte • Level 3 = 17,689 pages = 133 MBytes Typically, pay for one IO!

  18. Insert!

  19. Insertion in a B+ Tree Insert (K, P) • Find leaf where K belongs, insert • If no overflow (2d keys or less), halt • If overflow (2d+1 keys), split node, insert in parent: • If leaf, keep K3 too in right node • When root splits, new root has 1 key only (K3, ) to parent

  20. Insertion in a B+ Tree Insert K=19 10 15 18 20 30 40 50 60 65 80 85 90

  21. Insertion in a B+ Tree After insertion 10 15 18 19 20 30 40 50 60 65 80 85 90

  22. Insertion in a B+ Tree Now insert 25 10 15 18 19 20 30 40 50 60 65 80 85 90

  23. Insertion in a B+ Tree After insertion 10 15 18 19 20 25 30 40 50 60 65 80 85 90

  24. Insertion in a B+ Tree But now have to split ! 10 15 18 19 20 25 30 40 50 60 65 80 85 90

  25. Insertion in a B+ Tree After the split 10 15 18 19 20 25 30 40 50 60 65 80 85 90

  26. B+ Tree Indexes Non-leaf Pages Leaf Pages (Sorted by search key) • Leaf pages containdata entries, and are chained (prev & next) • Non-leaf pages have index entries; only used to direct searches: index entry P K P K P P K m 0 1 2 1 m 2

  27. Example B+ Tree Note how data entries in leaf level are sorted Root • Find 28*? 29*? All > 15* and < 30* • Insert/delete: Find data entry in leaf, then change it. Need to adjust parent sometimes. • And change sometimes bubbles up the tree 17 Entries <= 17 Entries > 17 27 5 13 30 39* 2* 3* 5* 7* 8* 22* 24* 27* 29* 38* 33* 34* 14* 16*

  28. Clustered Indexes

  29. Index Classification An index is clustered if the data is ordered the same as the data is laid out.

  30. Clustered vs. Unclustered Index UNCLUSTERED CLUSTERED Index entries direct search (Index File) Data entries Data Records Data Records (Data file) • Clustered (or not) dramatically impacts cost

  31. A Simple Cost Models

  32. Operations on an Index Search: Given a key find all records • More sophisticated variants as well. Real difference between structures: costs of ops which index you pick and why

  33. Cost Model for Our Analysis We ignore CPU costs, for simplicity: • B: The number of data pages • R: Number of records per page • D: (Average) time to read or write disk page • Measuring number of page I/O’s • “Average-case” analysis; based on several simplistic assumptions. • Goal: Good enough to show the overall trends!

  34. Clustered v. Unclustered Fanout of Tree is F. Range query for M entries (100 per page) IOs to search for a single item? Clustered Traversal of the tree: logF(1.5N) Range search Query : logF(1.5N) + ceil(M/100) Unclustered Traversal of the tree: logF(1.5N) Range search Query : logF(1.5N) + M Which of these IOs are random/sequential?

  35. Plugging in Some numbers Clustered: logF(1.5N) + ceil(M/100) ~ 1 Random IO (10ms) Unclustered: logF(1.5N) + M Random IO (M*10ms) If M is 1 then there is no difference! If M is 100,000 records, ~10 minutes vs. 10ms

  36. Takeaway • B+Tree are a workhorse index. • You can write down a cost model. • Databases actually do this! • Clustered v. unclustered is a big deal.

  37. Sorting.

  38. Why Sort? • Data requested in sorted order • e.g., find students in increasing GPAorder • Sorting is first step in bulk loading B+ tree index. A classic problem in computer science!

  39. More reasons to sort… • Sorting useful for eliminating duplicate copies in a collection of records (Why?) • Sort-merge join algorithm involves sorting. • Problem: sort 1Tb of data with 1Gb of RAM. • why not use virtual memory?

  40. Do people care? http://sortbenchmark.org Sort benchmark bares his name

  41. Simplified External Sorts.

  42. Two Ideas behind external sort • I/O optimized: long sequential disk access • Observation: Merging sorted files is easy Sort small sets in memory, then merge.

  43. Phase I: Buffer with 3 Pages Sort 10,33,44,55 Sort it! (Quicksort) Main Memory Phase 1, Per Page 2 IOs (1 Read,1 Write) 44,10,33,55 End: All pages sorted. 18,8,5,30 5,8,18,30

  44. Phase II: Merge 1. Read Pages 10,33,44,55 Main Memory 2. Merge 5,8,18,30 5 5,8,10,18 5,8 10,33,44,55 5,8,18,30

  45. Phase II: Merge 1. Read Pages 10,33,44,55 Main Memory 2. Merge 5,8,18,30 5 5,8,10,18 5,8 10,33,44,55 5,8,18,30

  46. Phase II: Merge 1. Read Pages 10,33,44,55 Main Memory 2. Merge 5,8,18,30 5,8 5,8,10,18 5,8,10 10,33,44,55 5,8,18,30

  47. Phase II: Merge 1. Read Pages 10,33,44,55 Main Memory 2. Merge 5,8,18,30 5,8,10 5,8,10,18 5,8,10,18 3rd Page is filled 10,33,44,55 5,8,18,30

  48. Phase II: Merge 1. Read Pages 10,33,44,55 Main Memory 2. Merge 5,8,18,30 5,8,10,18 5,8,10,18 3. Write Back Keep on Merging! 10,33,44,55 5,8,18,30

  49. 3 Buffer Pages Sort Main Memory Now, runs of length 2. If our file has 16 pages, what is next? 5,8,10,18 30,33,44,55

  50. Phase II: Merge 1. Read Pages 10,33,44,55 Main Memory 2. Merge 5,8,18,30 5,8,10,18 5,8,10,18 3. Write Back We now have 10,33,44,55 5,8,18,30

More Related