CS511 Design of Database Management Systems

CS511 Design of Database Management Systems Lecture 08: Generalized Search Trees Kevin C. Chang

Search Trees: Previous Approaches • Specialized search trees (yet another tree!): • redundant code: most trees are very similar • concurrency control, logging/recovery: tricky • Trees for extensible data types: • B-tree for any data with linear ordering • e.g.: index titles (alph. ordering) with B-tree • problem: does not support natural queries • e.g.: WHERE book.title > “database”?

GiST: Generalized Search Tree • General: cover B+-tree, R-tree, etc… • Extensible: • domain-specific data types & queries definable • Easy to extend: six “key methods” for a new tree • Efficient: match specialized trees • Reusable: concurrency, recovery for indexes

Example: Indexing Book Titles • Titles for books: • T1 = “database optimization” • T2 = “web database” • T3 = “complexity of optimization algorithms” • T4 = “algorithms and complexity” • Indexable with (extensible) B+-tree? • linear ordering: T4, T3, T1, T2 • Note: Just an example for demonstrating GiST! • What we will do to index “titles” is neither the best nor the typical way to index “textual data”: there is no notion of fuzzy “relevance”. • stay tuned for text and web search

?? Queries on Title? • Indexing is to help “query” processing • ?? What “predicates” to ask about titles?

Queries on Title • Equality predicates: • WHERE book.title = “web databases” • Containment predicates: • WHERE book.title has “web” • Prefix predicates: • WHERE book.title start-with “web” • RegEx predicates: (generalize all the others) • WHERE book.title like “# web # database”

Extensible B+-Tree for Titles • Observations: • indexed values have linear ordering: T4, T3, T1, T2 • keys simply designate separators: T4, c, T3, d, T1, w, T2 • tree: root key d; second-level keys c and w; leaves T4: alg. …, T3: complexity …, T1: database …, T2: web …

?? Using B+-Tree: What’s Wrong? • Range queries not sensible: title > “web”? • ?? What predicates can a B+-tree support: • ? equality, containing, prefix, regex? • tree: root key d; second-level keys c and w; leaves T4: alg. …, T3: complexity …, T1: database …, T2: web …

GiST: Generalizing Balanced Search Trees • GiST is not universal (just a reasonable generalization) • balanced tree of <key, ptr> pairs, keys can overlap • GRE test: R-Tree : B-Tree = ________ : R-Tree • ?? what is the key generalization? • figure: internal nodes (directory) hold entries key1, key2, …; leaf nodes form a linked list

The Key Generalization: The Key • Key evolution: 1-D separator --> 2-D MBR --> predicates • R-Tree : B-Tree • generalizing key from 1-D line to 2-D area • bounding range to (minimal) bounding region • GiST : R-Tree • generalizing key from 2-D MBR to “predicates” • a predicate that all values v in subtree will satisfy • B-tree keys: • [k1:k2) --> contains([k1:k2), v) • R-tree keys: • (x1,y1,x2,y2) --> contains((x1,y1, x2,y2), v)
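A minimal sketch of this key evolution, assuming Python and the hypothetical class names `IntervalKey` and `RectKey` (neither appears in the lecture). Both key forms answer the same question: does a value v satisfy contains(key, v)?

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class IntervalKey:
    """B-tree style key: the half-open range [k1, k2)."""
    k1: float
    k2: float

    def contains(self, v: float) -> bool:
        return self.k1 <= v < self.k2


@dataclass(frozen=True)
class RectKey:
    """R-tree style key: bounding rectangle (x1, y1, x2, y2)."""
    x1: float
    y1: float
    x2: float
    y2: float

    def contains(self, v: Tuple[float, float]) -> bool:
        x, y = v
        return self.x1 <= x <= self.x2 and self.y1 <= y <= self.y2


print(IntervalKey(10, 20).contains(10))      # True: lower bound is included
print(IntervalKey(10, 20).contains(20))      # False: upper bound is excluded
print(RectKey(0, 0, 4, 3).contains((2, 1)))  # True: point lies inside the MBR
```

A GiST key generalizes this one step further: any object exposing such a predicate can serve as a key.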

?? GiST for Title Indexing: Predicates Must first determine predicates: • What query predicates to support? • equality: equal(v, “web db”) • containing: has(v, “web”) • What key predicate to use? • ? any criteria for choosing key predicates? • ?? what do you suggest?

GiST for Title Indexing: Predicates • Key predicates: Contains(S, v) • root keys: SL = {alg, comp, opt}, SR = {db, opt, web} • second-level keys: SLL = {alg, comp}, SLR = {comp, opt}, SRL = {db, opt}, SRR = {db, web} • leaves: T4: alg. …, T3: complexity …, T1: database …, T2: web …

GiST: Built-in Tree Operations • Search(root R, predicate q) • Insert(root R, entry E, level l) • Delete(root R, entry E)

GiST: Application-Specific Methods Search: • Consistent(E, q): search subtree E for predicate q? Labeling: • Union(E1, …, En): how to label the union of E1, …, En? Categorization: • Penalty(E1, E2): penalty for inserting E2 in subtree E1 • PickSplit(E1, …, En): how to split into two groups of entries Compression: (storage/time tradeoff) • Compress(E): E --> Ec • Decompress(Ec): --> E’ such that E.p implies E’.p

Search Operation: Consistent Method • Search(root R, predicate q): • traverse subtrees where Consistent true • return leaf entries that are consistent
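The traversal can be sketched as follows. This is an illustrative sketch, not the lecture's code: the `Node` class, the `search` signature, and passing `consistent` as a callable are assumptions. The example tree copies the title keys from the slides, with word-set keys and a `has` query.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple


@dataclass
class Node:
    is_leaf: bool
    # Each entry is (key, target): target is a child Node, or a record at a leaf.
    entries: List[Tuple[Any, Any]] = field(default_factory=list)


def search(node: Node, q: Any, consistent: Callable[[Any, Any], bool]) -> List[Any]:
    """Follow every entry whose key is consistent with q; collect leaf records."""
    results = []
    for key, target in node.entries:
        if not consistent(key, q):
            continue  # prune: nothing below this entry can satisfy q
        if node.is_leaf:
            results.append(target)
        else:
            results.extend(search(target, q, consistent))
    return results


# Title tree with the keys shown on the slides.
left = Node(True, [(frozenset({"alg", "comp"}), "T4"), (frozenset({"comp", "opt"}), "T3")])
right = Node(True, [(frozenset({"db", "opt"}), "T1"), (frozenset({"db", "web"}), "T2")])
root = Node(False, [(frozenset({"alg", "comp", "opt"}), left),
                    (frozenset({"db", "opt", "web"}), right)])


def has(key, word):
    return word in key


print(search(root, "web", has))  # ['T2']
print(search(root, "opt", has))  # ['T3', 'T1']
```

Because keys may overlap, the query for "opt" has to descend into both subtrees, unlike a B+-tree lookup that follows a single path.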

Consistent Method • Consistent(E, q): • can E.p and q both hold? • Title GiST: • key predicate: p = Contains(S, v) or simply S • e.g., SL = {alg, comp, opt} • e.g., SR = {db, opt, web} • Consistent(SL, has(v, “web”))? • ? how to implement? • Consistent(SR, equals(v, “web database”))? • ? how to implement?
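One possible answer to the two "how to implement?" questions, sketched under assumptions: keys are Python sets of abbreviated words as on the slides, and the function names `consistent_has` and `consistent_equals` are made up for illustration. A key S promises that every title below it uses only words from S, so a query can hold below S only if the words it requires are all in S.

```python
SL = frozenset({"alg", "comp", "opt"})
SR = frozenset({"db", "opt", "web"})


def consistent_has(S: frozenset, word: str) -> bool:
    """has(v, word) can hold under S only if word is one of S's words."""
    return word in S


def consistent_equals(S: frozenset, title_words: frozenset) -> bool:
    """equals(v, title) can hold under S only if every word of title is in S."""
    return title_words <= S


web_database = frozenset({"web", "db"})   # abbreviated words of "web database"

print(consistent_has(SL, "web"))             # False: skip the left subtree
print(consistent_has(SR, "web"))             # True: must search the right subtree
print(consistent_equals(SR, web_database))   # True
print(consistent_equals(SL, web_database))   # False
```

Consistent may return true for a subtree that holds no match (a false positive costs time), but it must never return false for a subtree that does.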

Insert Operation • Insert(root R, entry E, level l) • descend tree along least increase in Penalty • stop at level specified • if there is room at node, insert there • else split according to PickSplit • propagate changes using Union to adjust keys
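The steps above can be sketched as a recursive routine. This is a simplified sketch under stated assumptions, not the GiST paper's algorithm verbatim: the `Node` and `Methods` types are hypothetical, the level parameter l is omitted (entries always go to a leaf), and the `pick_split` used in the example is a naive halving placeholder.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional, Tuple


@dataclass
class Node:
    is_leaf: bool
    entries: List[Tuple[Any, Any]] = field(default_factory=list)  # (key, target)


@dataclass
class Methods:
    penalty: Callable[[Any, Any], float]   # cost of adding a key under an entry
    union: Callable[[List[Any]], Any]      # label covering a list of keys
    pick_split: Callable[[list], Tuple[list, list]]


def _label(node: Node, m: Methods) -> Any:
    return m.union([k for k, _ in node.entries])


def _insert(node: Node, key: Any, record: Any, m: Methods, cap: int) -> Optional[Node]:
    """Insert below node; return a new sibling if node had to split."""
    if node.is_leaf:
        node.entries.append((key, record))
    else:
        # Descend along the entry with the least Penalty.
        i = min(range(len(node.entries)),
                key=lambda j: m.penalty(node.entries[j][0], key))
        child = node.entries[i][1]
        sibling = _insert(child, key, record, m, cap)
        node.entries[i] = (_label(child, m), child)   # adjust key with Union
        if sibling is not None:
            node.entries.append((_label(sibling, m), sibling))
    if len(node.entries) <= cap:
        return None                                   # there was room
    keep, moved = m.pick_split(node.entries)          # overflow: split
    node.entries = keep
    return Node(node.is_leaf, moved)


def insert(root: Node, key: Any, record: Any, m: Methods, cap: int = 2) -> Node:
    sibling = _insert(root, key, record, m, cap)
    if sibling is None:
        return root
    # The root split: grow the tree by one level.
    return Node(False, [(_label(root, m), root), (_label(sibling, m), sibling)])


set_methods = Methods(
    penalty=lambda s, new: len(new - s),
    union=lambda keys: frozenset().union(*keys),
    pick_split=lambda es: (es[: len(es) // 2], es[len(es) // 2:]),
)

root = Node(is_leaf=True)
for name, words in [("T1", {"db", "opt"}), ("T2", {"db", "web"}), ("T4", {"alg", "comp"})]:
    root = insert(root, frozenset(words), name, set_methods)

print(root.is_leaf, len(root.entries))  # False 2
```

With a capacity of two entries per node, the third insert overflows the single leaf, so the tree grows a new root whose two keys are the unions of the split halves.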

Title GiST: Insert • ?? Where to insert T5:“complexity of web algorithms” ? • root keys: SL = {alg, comp, opt}, SR = {db, opt, web} • second-level keys: SLL = {alg, comp}, SLR = {comp, opt}, SRL = {db, opt}, SRR = {db, web} • leaves: T4: alg. …, T3: complexity …, T1: database …, T2: web …

Penalty Method • Penalty(E1, E2): • penalty for inserting E2 in subtree E1 • Title GiST: • E2 with S = {comp, web, alg} • for T5:“complexity of web algorithms” • ? Where to insert? • root: SL = {alg, comp, opt} vs. SR = {db, opt, web}? • Penalty: • ? how to implement?
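One plausible Penalty for set-valued keys, offered as an assumption rather than the lecture's answer: count how many words the subtree key would have to gain, by analogy with the area enlargement used in R-trees. The function name `penalty` and the abbreviated words are illustrative.

```python
def penalty(S1: frozenset, S2: frozenset) -> int:
    """Number of words S1 must gain so that it also covers S2."""
    return len(S2 - S1)


SL = frozenset({"alg", "comp", "opt"})
SR = frozenset({"db", "opt", "web"})
T5 = frozenset({"comp", "web", "alg"})   # "complexity of web algorithms"

print(penalty(SL, T5))   # 1: SL only needs to add "web"
print(penalty(SR, T5))   # 2: SR needs to add "comp" and "alg"
```

Under this measure T5 descends into the SL subtree, whose key then grows to {alg, comp, opt, web}.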

PickSplit Method • PickSplit(E1, …, En): • how to split into two groups of entries • Title GiST: • suppose insert put the three in one node: • S1 = {alg, comp} • S2 = {comp, opt} • S3 = {comp, web, alg} (new) • ? how to split {S1, S2, S3} into two? • something similar to R-tree algorithm will do
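A sketch of a split "similar to the R-tree algorithm", under assumptions: it adapts the idea of R-tree quadratic split by choosing as seeds the two keys whose union is largest, then assigning each remaining key to the group whose label grows least. The function name and tie-breaking are illustrative, and no minimum fill per group is enforced.

```python
from itertools import combinations
from typing import List, Tuple


def pick_split(keys: List[frozenset]) -> Tuple[List[int], List[int]]:
    """Return two lists of indexes into keys, one per group."""
    # Seeds: the pair that would be most wasteful to keep together.
    a, b = max(combinations(range(len(keys)), 2),
               key=lambda p: len(keys[p[0]] | keys[p[1]]))
    group_a, group_b = [a], [b]
    label_a, label_b = set(keys[a]), set(keys[b])
    for i in range(len(keys)):
        if i in (a, b):
            continue
        grow_a = len(keys[i] - label_a)   # words group A's label would gain
        grow_b = len(keys[i] - label_b)   # words group B's label would gain
        if grow_a <= grow_b:
            group_a.append(i)
            label_a |= keys[i]
        else:
            group_b.append(i)
            label_b |= keys[i]
    return group_a, group_b


S1 = frozenset({"alg", "comp"})
S2 = frozenset({"comp", "opt"})
S3 = frozenset({"comp", "web", "alg"})

print(pick_split([S1, S2, S3]))   # ([1], [2, 0]): S2 alone, S3 together with S1
```

Here S2 and S3 become the seeds because their union has four words, and S1 joins S3 because S3 already covers both of its words.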

Union Method • Union(E1, …, En): • label of subtree with E1, …, En • Title GiST: • key predicate: p = Contains(S, v) or simply S • S1 = {alg, comp}, S2 = {comp, opt} • ? combined key = ? • Union(E1=(SL, ptr1), E2=(SR, ptr2)) = ? • ? how to implement?
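For set-valued keys the natural answer is plain set union, since the combined key must hold for every title under either entry. A minimal sketch, assuming keys are Python frozensets and the hypothetical function name `union`:

```python
def union(*keys: frozenset) -> frozenset:
    """Smallest word set that covers every given key."""
    return frozenset().union(*keys)


S1 = frozenset({"alg", "comp"})
S2 = frozenset({"comp", "opt"})
SL = frozenset({"alg", "comp", "opt"})
SR = frozenset({"db", "opt", "web"})

print(sorted(union(S1, S2)))   # ['alg', 'comp', 'opt']
print(sorted(union(SL, SR)))   # ['alg', 'comp', 'db', 'opt', 'web']
```

Note that Union(S1, S2) reproduces the slide's key SL = {alg, comp, opt}.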

?? Compress/Decompress Method? Key storage vs. search time tradeoff • Compress(E): E --> Ec • Decompress(Ec): --> E’, where E’.p can be broader than E.p • Lossy compression: may need more time for search • Title GiST: • ?? any suggestions?

Title GiST: Compress/Decompress • Example 1: no compression • Compress(E) --> Ec = E • Decompress(Ec) --> E’ = Ec • Example 2: compress by taking word initials • Compress: {algorithm, complexity, optimization} --> {al, co, op} • Decompress: {al, co, op} --> {al*, co*, op*}
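Example 2 can be sketched as follows. The function names and the two-letter prefix length are assumptions for illustration. The decompressed key {al*, co*, op*} is represented by matching a word against the stored prefixes, which shows why the compression is lossy.

```python
def compress(words: frozenset, n: int = 2) -> frozenset:
    """Keep only the first n letters of each word."""
    return frozenset(w[:n] for w in words)


def decompressed_has(prefixes: frozenset, word: str, n: int = 2) -> bool:
    """Does word match one of the patterns al*, co*, op*, ...?"""
    return word[:n] in prefixes


key = frozenset({"algorithm", "complexity", "optimization"})
ck = compress(key)

print(sorted(ck))                          # ['al', 'co', 'op']
print(decompressed_has(ck, "complexity"))  # True: a real member still matches
print(decompressed_has(ck, "computer"))    # True: false positive from co*
print(decompressed_has(ck, "web"))         # False: still prunes this subtree
```

The false positive for "computer" is the time side of the tradeoff: the key is smaller, but Consistent sends some queries into subtrees that hold no match.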

GiST: No Magic • It promises (only) what its model is based on • It does not represent all possible index structures: • e.g.: duplicate objects by multiple inserts (R+-tree) • e.g.: support notion of distance and similarity • rather than Boolean-based predicates • any more?

Outlook: Indexability • Observation: • the simplest version of the Consistent method is a routine that always returns MAYBE, which gives you a search tree of no efficiency • Big questions: • what is an index machinery? (analog: Turing machine) • how do we characterize “workload”? (analog: languages) • can an index always help in search? (analog: decidability, complexity) • what are the performance parameters? (analog: size of input) • what are the performance measures? (analog: time, space complexity) • Initial result: Hellerstein, Koutsoupias and Papadimitriou: On the Analysis of Indexing Schemes, PODS 97

Quote: Research is Reasonable At first, I thought GiST was magic. I thought it would generalize and unify all possible search trees, and thus be the only thing you ever need for indexing. It turns out that there is no magic in GiST. It is simply a reasonable generalization of B/R-tree. There is perhaps no magic in any research. Each step of progress is reasonable, which I think is good news because it means reasonable efforts can lead us to substantial progress, just like what our predecessors have already achieved.

What’s Next? • T3: The other silver bullet of relational Q. processing • access path selection: to enable query optimization

End Of Talk

How to Index Sets • S1 = {1, 2, 3, 5, 6, 9} • S2 = {1, 2, 5} • S3 = {0, 5, 6, 9} • S4 = {1, 4, 5, 8} • S5 = {0, 9} • S6 = {3, 5, 6, 7, 8} • S7 = {4, 7, 9}

?? Index Sets • root keys: {0,1,2,3,5,6,9} {1,3,4,5,6,7,8,9} • second-level keys: {0,5,6,9} {1,2,3,5,6,9} {1,3,4,5,6,7,8} {4,7,9} • leaf sets: {0,9} {0,5,6,9} {1,2,3,5,6,9} {1,2,5} {1,4,5,8} {3,5,6,7,8} {4,7,9}

?? Index 1-D Vectors: How • data tuple = (starting point, 1-D vector) • ?? How to index? (0,2) (-2,-3) (3,3) (-1,-2) (0,-5)
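One possible direction, offered as an assumption and not as the lecture's answer: treat each tuple as the 1-D interval it spans and use bounding intervals as key predicates, as an R-tree would do in one dimension. The function names below are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]


def bounding_interval(start: int, vector: int) -> Interval:
    """Interval covered by a vector of the given signed length from start."""
    end = start + vector
    return (min(start, end), max(start, end))


def union(intervals: List[Interval]) -> Interval:
    """Smallest interval covering all given intervals (a subtree key)."""
    return (min(lo for lo, _ in intervals), max(hi for _, hi in intervals))


def consistent(key: Interval, q: Interval) -> bool:
    """Can something under key overlap the query interval q?"""
    return key[0] <= q[1] and q[0] <= key[1]


tuples = [(0, 2), (-2, -3), (3, 3), (-1, -2), (0, -5)]
keys = [bounding_interval(s, v) for s, v in tuples]

print(keys)                        # [(0, 2), (-5, -2), (3, 6), (-3, -1), (-5, 0)]
print(union(keys))                 # (-5, 6)
print(consistent((3, 6), (0, 2)))  # False: the intervals do not overlap
```

Bounding intervals discard the direction of each vector, so a query that depends on direction would need a richer key predicate.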
