1 / 30

Indexing in DBMSs

Indexing in DBMSs. Erik Selberg 590db 4/29/98. Outline. Motivation Cost Functions & 521 B+-Trees ISAM Unstructured Text & IR Conclusion. Motivation. Data stored on disk pages in one way O(n) space Data can be ordered one way (if at all) O(log n) or O(1) lookup for one attribute

Download Presentation

Indexing in DBMSs

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Indexing in DBMSs Erik Selberg 590db 4/29/98

  2. Outline • Motivation • Cost Functions & 521 • B+-Trees • ISAM • Unstructured Text & IR • Conclusion

  3. Motivation • Data stored on disk pages in one way • O(n) space • Data can be ordered one way (if at all) • O(log n) or O(1) lookup for one attribute • O(n) lookup for the rest • Make lookups faster • Increase space necessary • What about speed of other operations?

  4. Cost Functions • B data pages on disk • R records per page • O(n) = O(BR) • D I/O time (~25ms) • C CPU time (~1-10ms) • H Hash function time (~1-10ms)

  5. DBMS operations • Scan - fetch all records • Search w/ Equality • Lookups and Modifications • Search w/ Range • Insert • Delete Bulk operations may be amortized!

  6. Baseline Storage • Unorganized (heap) • Sorted • Sorted on one key • Hashed • static hashing using chaining

  7. Unorganized Heaps • Scan BD + BRC • Search = 1/2 (BD + BRC) • Search <> BD + BRC • Insert 2D + C • Delete C + D Challenge: make this worse

  8. Sorted • Scan BD + BRC • Search = D lg B + C lg R • Search <> D lg B + C lg R + # • Insert (D lg B + C lg R) + (BD + BRC) • Delete (D lg B + C lg R) + (BD + BRC) Good for range, crappy for rest

  9. Static Hash w/ Chaining • Scan 1.25(BD + BRC) • Search = H + D + 1/2RC • Search <> 1.25(BD + BRC) • Insert (H + D + 1/2RC) + (C + D) • Delete (H + D + 1/2RC) + (C + D) • Need to grow and shrink hash table • Bad hashes hose you

  10. Cost summary

  11. What’s the best structure if: • You’re Amazon.com. Lots of equality lookups, some bulk insertions. • You’re United. Lots of range lookups. • You’re ESPN. Tons of insertions, range lookups. Equal lookups temporal.

  12. What is stored in the index? • k key; k* data entry • r1 = (Malone, Karl, 123, 13, 4) • r2 = (Malone, Moses, 456, 16, 5) • k* = data k* = r1 • k* = <k, rid> k* = <Malone, r1> • k* = <k, rid list> k* = <Malone, (r1, r2)>

  13. Index Index Data entries Data Records Clustered Indices • Order date entries in a similar way to data records on disk • Only one clustered index per table

  14. Baker, 4 4 Ellis, 14 5 Foster, 7 7 Baker 7 Hawkins, 9 Hawkins Keefe, 5 Payton 9 Malone, 12 12 Payton, 7 13 Stockton, 13 14 Sparse and Dense Indices • Dense - one entry per record (1-1) • Sparse - one entry per page • Clustered, therefore only one per table • Inverted on a field • Dense secondary index • Fully Inverted • All fields have index

  15. Primary and Secondary Indicies • Primary Index is over the Primary Key • Primary stores data entry as records • Primary has no duplicates • Should only be one • Secondary stores as <k, rid> or <k, rid list>

  16. B-Trees • B is for Balanced (that’s good enough for me) • B-Tree • Each node has d items, at most d+1 children • Balanced tree • B+-Tree • Data at leaves • Leaves doubly-linked

  17. A B+-Tree 20 40 60 80 ... 6 15 30 98 1* 2* 3* 6* 9* 18* 19* 24* 29* 99* • Keys are at leaves • Not all nodes / leaves are full • Common impls keep 50% minimum occupancy

  18. B+-Tree Costs • Assume: d == R • Scan BD + BRC • Search = D lg B + C lg R • Search <> D lg B + # • Insert RCD lg B • Delete RCD lg B • Some extra work to keep balance

  19. Summary + B+-Tree costs

  20. ISAM Trees • Similar to B+-Tree • Not balanced, uses chaining • Faster Insert / Delete, slower Search • Internal nodes are static Good for static DBs and data warehouses

  21. Sparse and Clustered Indices Remember that bit about only one clustered index per table? • Only one clustered index per table • Therefore, only one index has values that can be read sequentially without lots of page requests

  22. How many locks do need to... Insert a new item into DB • Unsorted? • Sorted? • Hash? • B+-Tree? • ISAM?

  23. Unstructured Text • Database => structured data • Schemas • Tables • OLTP • Information Retrieval => unstructured So they don’t have much to do with one another, right?

  24. Karl AND Malone “Karl Malone” Karl NEAR/2 Malone SELECT Docs(D)WHERE “Karl” in D AND “Malone” in D SELECT Docs(D)WHERE “Karl Malone” in D Does this mean “X Y” is a single term? SELECT Docs(D)WHERE …uh…? IR Queries

  25. Position is structure! Karl: par 1, sen 1, word 4 Malone: par 1, sen 1, word 5 par 2, sen 1, word 2 par 3, sen 1, word 7, zone quote Admiral KO’d by Jazz power-forward; Malone fined and suspended. SALT LAKE CITY -- Karl Malone has assured David Robinson the elbow blow that knocked Robinson unconscious was unintentional. Robinson doubts he blow was intended to hurt him, but is not certain. Nevertheless, Malone on Friday was suspended without pay for one game and fined $5,000 by Rod Thorn, the NBA's senior VP of basketball operations, who normally deals with cases of discipline. "While I do not believe that Malone intentionally elbowed Robinson, players have a responsibility not to recklessly swing their elbows in a manner that could cause injury to another player," Thorn said. Malone missed Utah's game Friday night, but the Jazz didn't miss a beat without its leading scorer and routed the L.A. Clippers 127-99. Meanwhile, Robinson sat out the Spurs' game with Seattle, but San Antonio overcame his absence to beat the SuperSonics 99-84. The suspension forced Malone to miss just the fifth game of his 13-year career. He had played in 543 consecutive games -- the third-longest streak in the NBA and first for consecutive starts -- and had played in 844 of the Jazz's previous 845 games. Structuring Text

  26. IR Queries in SQL • Query: “Karl Malone”, Robinson • Meaning: Docs w/ “Karl Malone” and Robinson • TextIndex(word: string, doc: int, pos: int) • SELECT W1.docFROM TextIndex W1, W2, W3WHERE (W1.doc = W2.doc && W2.doc = W3.doc) && (W1.word = “Karl” && W2.word = “Malone” && W3.word = “Robinson”) && W1.pos = W2.pos + 1

  27. Indexing Issues in IR • Index method: hash table on word • IR folks think about attributes • IR folks munge attributes • elbow* => elbow, elbowing, elbowed, etc. • “to be or not to be” => “” • IR folks create search keys • Malone => Malone, Stockton, Jazz, Sloan, …

  28. IR and DBMSs • IR uses DBMS for low-level storage • e.g. hash table storage • Hash table lookup is only first step • Clustering • Relevance Ranking • Feedback, Expansion, ... • Full SQL not needed • Custom optimized DB performs better

  29. How AltaVista returns so quickly... Hash indexes mean lots of page requests if there are lots of matches... • Trick #1: use memory. • Trick #2: threshold (find 10 pages > 75% rel). • Trick #3: hard time limit. • More users, less CPU time / query • Trick #4: prioritize • Try to find 10 in memory

  30. Summary • Concerned about B, R, not just n • Hash for equality, B+-Tree for range • One index gives good disk performance • IR uses hash indexing • IR stores term information Indexing helps performance, but youstill need to think about what to index!

More Related