Indexing in DBMSs

Indexing in DBMSs Erik Selberg 590db 4/29/98

Outline • Motivation • Cost Functions & 521 • B+-Trees • ISAM • Unstructured Text & IR • Conclusion

Motivation • Data stored on disk pages in one way • O(n) space • Data can be ordered one way (if at all) • O(log n) or O(1) lookup for one attribute • O(n) lookup for the rest • Make lookups faster • Increase space necessary • What about speed of other operations?

Cost Functions • B data pages on disk • R records per page • O(n) = O(BR) • D I/O time (~25ms) • C CPU time (~1-10ms) • H Hash function time (~1-10ms)

DBMS operations • Scan - fetch all records • Search w/ Equality • Lookups and Modifications • Search w/ Range • Insert • Delete Bulk operations may be amortized!

Baseline Storage • Unorganized (heap) • Sorted • Sorted on one key • Hashed • static hashing using chaining

Unorganized Heaps • Scan BD + BRC • Search = 1/2 (BD + BRC) • Search <> BD + BRC • Insert 2D + C • Delete C + D Challenge: make this worse

Sorted • Scan BD + BRC • Search = D lg B + C lg R • Search <> D lg B + C lg R + # • Insert (D lg B + C lg R) + (BD + BRC) • Delete (D lg B + C lg R) + (BD + BRC) Good for range, crappy for rest

Static Hash w/ Chaining • Scan 1.25(BD + BRC) • Search = H + D + 1/2RC • Search <> 1.25(BD + BRC) • Insert (H + D + 1/2RC) + (C + D) • Delete (H + D + 1/2RC) + (C + D) • Need to grow and shrink hash table • Bad hashes hose you

Cost summary
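As a rough illustration, the per-operation formulas from the three baseline slides can be written as small functions and compared for one set of parameters. This is only a sketch: the function names, the `match_cost` argument (standing in for the slides' "#" term) and the sample values of B, R, D, C and H are assumptions chosen for illustration, not figures from the talk.

```python
from math import log2

def heap_costs(B, R, D, C):
    """Unorganized heap file, formulas as given on the slide."""
    scan = B * D + B * R * C
    return {"scan": scan, "search_eq": 0.5 * scan, "search_range": scan,
            "insert": 2 * D + C, "delete": C + D}

def sorted_costs(B, R, D, C, match_cost=0):
    """File sorted on one key; match_cost stands in for the '#' term."""
    scan = B * D + B * R * C
    find = D * log2(B) + C * log2(R)   # binary search over pages, then records
    return {"scan": scan, "search_eq": find, "search_range": find + match_cost,
            "insert": find + scan, "delete": find + scan}

def hash_costs(B, R, D, C, H):
    """Static hashing with chaining; 1.25 reflects partly empty pages."""
    scan = 1.25 * (B * D + B * R * C)
    find = H + D + 0.5 * R * C
    return {"scan": scan, "search_eq": find, "search_range": scan,
            "insert": find + (C + D), "delete": find + (C + D)}

# Hypothetical parameters: 1000 pages, 100 records/page, times in ms.
B, R, D, C, H = 1000, 100, 25, 1, 1
heap = heap_costs(B, R, D, C)
srt = sorted_costs(B, R, D, C)
hsh = hash_costs(B, R, D, C, H)

for name, costs in [("heap", heap), ("sorted", srt), ("hash", hsh)]:
    print(name, {op: round(ms, 1) for op, ms in costs.items()})
```

With these sample values, equality search is cheapest on the hashed file and inserts are cheapest on the heap, which matches the trade-offs the slides describe.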

What’s the best structure if: • You’re Amazon.com. Lots of equality lookups, some bulk insertions. • You’re United. Lots of range lookups. • You’re ESPN. Tons of insertions, range lookups. Equal lookups temporal.

What is stored in the index? • k key; k* data entry • r1 = (Malone, Karl, 123, 13, 4) • r2 = (Malone, Moses, 456, 16, 5) • k* = data k* = r1 • k* = <k, rid> k* = <Malone, r1> • k* = <k, rid list> k* = <Malone, (r1, r2)>
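The three data-entry alternatives can be sketched with plain Python structures. The records r1 and r2 are the ones on the slide; using the strings "r1" and "r2" as record ids and the names `alt1`, `alt2`, `alt3` are assumptions made for illustration.

```python
from collections import defaultdict

# Records from the slide, keyed by a stand-in record id (rid).
records = {
    "r1": ("Malone", "Karl", 123, 13, 4),
    "r2": ("Malone", "Moses", 456, 16, 5),
}

# Alternative 1: k* is the data record itself (index file holds the records).
alt1 = sorted(records.values())

# Alternative 2: k* = <k, rid>, one entry per record.
alt2 = sorted((rec[0], rid) for rid, rec in records.items())

# Alternative 3: k* = <k, rid list>, one entry per distinct key.
alt3 = defaultdict(list)
for rid, rec in sorted(records.items()):
    alt3[rec[0]].append(rid)

print(alt1[0])      # full record stored in the index
print(alt2)         # two entries that share the key "Malone"
print(dict(alt3))   # one entry whose rid list covers both records
```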

Clustered Indices • Order data entries in a similar way to data records on disk • Only one clustered index per table [Figure: an index over data entries, which point to the data records]

Sparse and Dense Indices • Dense - one entry per record (1-1) • Sparse - one entry per page • Clustered, therefore only one per table • Inverted on a field • Dense secondary index • Fully Inverted • All fields have index [Figure: data records (Baker, 4), (Ellis, 14), (Foster, 7), (Hawkins, 9), (Keefe, 5), (Malone, 12), (Payton, 7), (Stockton, 13) stored in name order; a sparse index on name with entries Baker, Hawkins, Payton; a dense index on the number with entries 4, 5, 7, 7, 9, 12, 13, 14]
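A minimal sketch of the two lookups, using the eight (name, number) records that appear on this slide. The layout of three records per page is an assumption (it is consistent with sparse entries Baker, Hawkins, Payton), and the function names are made up for illustration.

```python
from bisect import bisect_left, bisect_right

# Data file sorted by name, assumed to hold up to three records per page.
pages = [
    [("Baker", 4), ("Ellis", 14), ("Foster", 7)],
    [("Hawkins", 9), ("Keefe", 5), ("Malone", 12)],
    [("Payton", 7), ("Stockton", 13)],
]

# Sparse index: one entry per page (the first name on that page).
sparse = [page[0][0] for page in pages]

# Dense index on the number: one <key, (page, slot)> entry per record.
dense = sorted((num, (p, s))
               for p, page in enumerate(pages)
               for s, (_, num) in enumerate(page))
dense_keys = [k for k, _ in dense]

def lookup_by_name(name):
    """Use the sparse index to pick one page, then scan only that page."""
    p = bisect_right(sparse, name) - 1
    if p < 0:
        return None
    return next((rec for rec in pages[p] if rec[0] == name), None)

def lookup_by_number(num):
    """Use the dense index; matching records may sit on different pages."""
    i = bisect_left(dense_keys, num)
    out = []
    while i < len(dense) and dense[i][0] == num:
        p, s = dense[i][1]
        out.append(pages[p][s])
        i += 1
    return out

print(lookup_by_name("Keefe"))
print(lookup_by_number(7))
```

The sparse lookup only works because the file is ordered on the indexed field. The dense lookup for 7 touches two different pages, which is the cost of an unclustered secondary index.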

Primary and Secondary Indices • Primary Index is over the Primary Key • Primary stores data entry as records • Primary has no duplicates • Should only be one • Secondary stores as <k, rid> or <k, rid list>

B-Trees • B is for Balanced (that’s good enough for me) • B-Tree • Each node has d items, at most d+1 children • Balanced tree • B+-Tree • Data at leaves • Leaves doubly-linked

A B+-Tree • Keys are at leaves • Not all nodes / leaves are full • Common impls keep 50% minimum occupancy [Figure: B+-tree with upper-level keys 20, 40, 60, 80, ...; lower-level keys 6, 15, 30, 98; leaf entries 1*, 2*, 3*, 6*, 9*, 18*, 19*, 24*, 29*, 99*]
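The benefit of linked leaves for range queries can be sketched with a deliberately simplified structure: a single level of separator keys above a chain of leaves. This is not a full B+-tree (no inserts, splits or rebalancing), only the leaf keys are taken from the slide's figure, and the names `Leaf`, `build`, `find_leaf` and `range_scan` are hypothetical.

```python
from bisect import bisect_right

class Leaf:
    def __init__(self, keys):
        self.keys = keys    # sorted data entries
        self.next = None    # link to the right sibling

def build(sorted_keys, fanout):
    """Bulk-load leaves from non-empty sorted keys and link them left to right."""
    leaves = [Leaf(sorted_keys[i:i + fanout])
              for i in range(0, len(sorted_keys), fanout)]
    for left, right in zip(leaves, leaves[1:]):
        left.next = right
    # One separator per leaf boundary: the smallest key of the right leaf.
    separators = [leaf.keys[0] for leaf in leaves[1:]]
    return separators, leaves

def find_leaf(separators, leaves, key):
    """Descend one level: pick the leaf whose key range covers `key`."""
    return leaves[bisect_right(separators, key)]

def range_scan(separators, leaves, lo, hi):
    """Find the first leaf once, then follow sibling links."""
    leaf = find_leaf(separators, leaves, lo)
    out = []
    while leaf is not None:
        for k in leaf.keys:
            if k > hi:
                return out
            if k >= lo:
                out.append(k)
        leaf = leaf.next
    return out

separators, leaves = build([1, 2, 3, 6, 9, 18, 19, 24, 29, 99], fanout=3)
print(range_scan(separators, leaves, 5, 24))
```

The range scan pays for one descent and then reads leaves sequentially, which is why the range cost is a search plus the size of the result.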

B+-Tree Costs • Assume: d == R • Scan BD + BRC • Search = D lg B + C lg R • Search <> D lg B + # • Insert RCD lg B • Delete RCD lg B • Some extra work to keep balance

Summary + B+-Tree costs

ISAM Trees • Similar to B+-Tree • Not balanced, uses chaining • Faster Insert / Delete, slower Search • Internal nodes are static Good for static DBs and data warehouses

Sparse and Clustered Indices Remember that bit about only one clustered index per table? • Only one clustered index per table • Therefore, only one index has values that can be read sequentially without lots of page requests

How many locks do you need to... Insert a new item into DB • Unsorted? • Sorted? • Hash? • B+-Tree? • ISAM?

Unstructured Text • Database => structured data • Schemas • Tables • OLTP • Information Retrieval => unstructured So they don’t have much to do with one another, right?

IR Queries • Karl AND Malone: SELECT Docs(D) WHERE “Karl” in D AND “Malone” in D • “Karl Malone”: SELECT Docs(D) WHERE “Karl Malone” in D (Does this mean “X Y” is a single term?) • Karl NEAR/2 Malone: SELECT Docs(D) WHERE …uh…?

Structuring Text • Position is structure! • Karl: par 1, sen 1, word 4 • Malone: par 1, sen 1, word 5; par 2, sen 1, word 2; par 3, sen 1, word 7, zone quote
Admiral KO’d by Jazz power-forward; Malone fined and suspended.
SALT LAKE CITY -- Karl Malone has assured David Robinson the elbow blow that knocked Robinson unconscious was unintentional. Robinson doubts the blow was intended to hurt him, but is not certain.
Nevertheless, Malone on Friday was suspended without pay for one game and fined $5,000 by Rod Thorn, the NBA's senior VP of basketball operations, who normally deals with cases of discipline.
"While I do not believe that Malone intentionally elbowed Robinson, players have a responsibility not to recklessly swing their elbows in a manner that could cause injury to another player," Thorn said.
Malone missed Utah's game Friday night, but the Jazz didn't miss a beat without its leading scorer and routed the L.A. Clippers 127-99. Meanwhile, Robinson sat out the Spurs' game with Seattle, but San Antonio overcame his absence to beat the SuperSonics 99-84.
The suspension forced Malone to miss just the fifth game of his 13-year career. He had played in 543 consecutive games -- the third-longest streak in the NBA and first for consecutive starts -- and had played in 844 of the Jazz's previous 845 games.
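Once word positions are stored, AND, phrase and NEAR queries all reduce to comparisons on positions. The sketch below simplifies the slide's (paragraph, sentence, word) positions to a single flat word offset per document, and it assumes that NEAR/2 means "within two words, in either order". The sample documents and function names are invented for illustration.

```python
from collections import defaultdict

def build_index(docs):
    """Map word -> doc id -> list of word offsets in that doc."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index

def docs_with_all(index, *words):
    """w1 AND w2 AND ...: docs that contain every word."""
    sets = [set(index.get(w, {})) for w in words]
    return set.intersection(*sets) if sets else set()

def near(index, w1, w2, k, ordered=False):
    """Docs where w2 occurs within k words of w1 (after w1 only, if ordered)."""
    p1, p2 = index.get(w1, {}), index.get(w2, {})
    hits = set()
    for doc in p1.keys() & p2.keys():
        for a in p1[doc]:
            for b in p2[doc]:
                gap = b - a
                if (0 < gap <= k) if ordered else (0 < abs(gap) <= k):
                    hits.add(doc)
    return hits

docs = {
    1: "karl malone has assured david robinson",
    2: "malone said karl was fine",
    3: "karl the mailman malone",
}
index = build_index(docs)

both = docs_with_all(index, "karl", "malone")            # Karl AND Malone
phrase = near(index, "karl", "malone", 1, ordered=True)  # "Karl Malone"
near2 = near(index, "karl", "malone", 2)                 # Karl NEAR/2 Malone
print(both, phrase, near2)
```

Under this reading a phrase is just an ordered NEAR/1, so "X Y" does not have to be indexed as a single term.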

IR Queries in SQL • Query: “Karl Malone”, Robinson • Meaning: Docs w/ “Karl Malone” and Robinson • TextIndex(word: string, doc: int, pos: int)
SELECT DISTINCT W1.doc
FROM TextIndex W1, TextIndex W2, TextIndex W3
WHERE W1.doc = W2.doc AND W2.doc = W3.doc
  AND W1.word = 'Karl' AND W2.word = 'Malone' AND W3.word = 'Robinson'
  AND W2.pos = W1.pos + 1  -- 'Malone' immediately follows 'Karl'
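The self-join can be tried out against an in-memory SQLite database. This is a sketch under assumptions: the four sample documents are invented, words are split on whitespace, and the adjacency condition is written so that "Karl" immediately precedes "Malone".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TextIndex (word TEXT, doc INTEGER, pos INTEGER)")

# Hypothetical documents; pos is the word offset within the document.
docs = {
    1: "Karl Malone has assured David Robinson",
    2: "Malone missed the game but Karl did not",
    3: "Karl Malone was suspended",
    4: "Robinson said Malone and Karl talked",
}
for doc_id, text in docs.items():
    conn.executemany(
        "INSERT INTO TextIndex VALUES (?, ?, ?)",
        [(word, doc_id, pos) for pos, word in enumerate(text.split())],
    )

query = """
SELECT DISTINCT W1.doc
FROM TextIndex W1, TextIndex W2, TextIndex W3
WHERE W1.doc = W2.doc AND W2.doc = W3.doc
  AND W1.word = 'Karl' AND W2.word = 'Malone' AND W3.word = 'Robinson'
  AND W2.pos = W1.pos + 1
ORDER BY W1.doc
"""
result = [row[0] for row in conn.execute(query)]
print(result)
```

Document 4 contains all three words but not the phrase, and document 3 has the phrase but no "Robinson", so only document 1 qualifies.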

Indexing Issues in IR • Index method: hash table on word • IR folks think about attributes • IR folks munge attributes • elbow* => elbow, elbowing, elbowed, etc. • “to be or not to be” => “” • IR folks create search keys • Malone => Malone, Stockton, Jazz, Sloan, …
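Two of the "munging" steps above can be sketched in a few lines: dropping stop words and expanding a trailing-wildcard pattern against the index vocabulary. The stop-word list, the vocabulary and the function names are all hypothetical. Real systems typically use stemming rather than raw prefix matching, which is swapped in here only for brevity.

```python
# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"to", "be", "or", "not"}

def normalize(query):
    """Lower-case the query and drop stop words."""
    return [w for w in query.lower().split() if w not in STOP_WORDS]

def expand_prefix(pattern, vocabulary):
    """Expand 'stem*' to every indexed word that starts with the stem."""
    if pattern.endswith("*"):
        stem = pattern[:-1]
        return sorted(w for w in vocabulary if w.startswith(stem))
    return [pattern] if pattern in vocabulary else []

vocabulary = {"elbow", "elbowing", "elbowed", "blow", "malone"}

print(normalize("to be or not to be"))
print(expand_prefix("elbow*", vocabulary))
```

The first call returns an empty list, which is the slide's point: aggressive stop-word removal can erase a query entirely.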

IR and DBMSs • IR uses DBMS for low-level storage • e.g. hash table storage • Hash table lookup is only first step • Clustering • Relevance Ranking • Feedback, Expansion, ... • Full SQL not needed • Custom optimized DB performs better

How AltaVista returns so quickly... Hash indexes mean lots of page requests if there are lots of matches... • Trick #1: use memory. • Trick #2: threshold (find 10 pages > 75% rel). • Trick #3: hard time limit. • More users, less CPU time / query • Trick #4: prioritize • Try to find 10 in memory
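Tricks #2 to #4 amount to an early-exit loop over candidates. The sketch below is only an illustration of that idea, not a description of how AltaVista was implemented: the function name, parameters and sample scores are assumptions, and the caller is expected to supply candidates with in-memory ones first.

```python
import time

def top_hits(candidates, want=10, min_score=0.75, time_limit_s=0.05):
    """Collect hits until enough are found or the time budget runs out.

    candidates: iterable of (doc, score), ideally ordered so that cheap
    in-memory candidates come before ones that need a page request.
    """
    deadline = time.monotonic() + time_limit_s
    hits = []
    for doc, score in candidates:
        if time.monotonic() > deadline:   # Trick #3: hard time limit
            break
        if score > min_score:             # Trick #2: relevance threshold
            hits.append((doc, score))
            if len(hits) >= want:         # stop once enough hits are found
                break
    return hits

# Hypothetical scores: odd-numbered docs are relevant.
candidates = [(i, 0.9 if i % 2 else 0.5) for i in range(100)]
found = top_hits(candidates, want=3, time_limit_s=5)
print([doc for doc, _ in found])
```

Only the first six candidates are examined here, so the cost depends on how many results are wanted rather than on how many documents match.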

Summary • Concerned about B, R, not just n • Hash for equality, B+-Tree for range • One index gives good disk performance • IR uses hash indexing • IR stores term information • Indexing helps performance, but you still need to think about what to index!
