Indexing Mixed Types for Approximate Retrieval

Liang Jin* (UC Irvine) • Nick Koudas (University of Toronto) • Chen Li* (UC Irvine) • Anthony K.H. Tung (National University of Singapore) • VLDB 2005 • * Liang Jin and Chen Li: supported by NSF CAREER Award IIS-0238586

Queries with Mixed-Type Predicates SELECT * FROM Movies WHERE star SIMILARTO ’Schwarrzenger’ AND |year – 1980| <= 5; • SIMILARTO: • a domain-specific function • returns a similarity value between two strings • Example: edit distance ed(Tom Hanks, Ton Hank) = 2
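As a concrete reading of the edit-distance example on this slide, here is a minimal sketch of the standard dynamic program for Levenshtein distance. The function name `edit_distance` is an assumption made for illustration; the talk does not prescribe an implementation.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))              # distance from "" to b[:j]
    for i, ca in enumerate(a, start=1):
        cur = [i]                               # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


print(edit_distance("Tom Hanks", "Ton Hank"))  # 2, matching the slide's example
```

The two edits here are the substitution m → n and the deletion of the trailing s.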

Why fuzzy predicates? • Errors in databases: • Data is not clean • Especially true in data integration and cleansing • Example (Star column, Relation S vs. Relation R): Keanu Reeves / Keanu Reeves; Samuel Jackson / Samuel L. Jackson; Schwarzenegger / Schwarzenegger; … • Errors in queries: • User doesn’t remember a string exactly • User types a wrong string

Problem Formulation • Given: a query with fuzzy predicates on strings and range predicates on numeric attributes, over a single relation • Goal: answer the query efficiently • SELECT * FROM Movies WHERE star SIMILARTO ’Schwarrzenger’ AND |year – 1980| <= 5;

Rest of the talk • Motivation: supporting queries with mixed-type predicates • Our approach: MAT tree • Construction and maintenance of MAT tree • Experiments

Assumptions • One fuzzy string predicate (edit distance) • One numeric predicate Query: (Qs, δs, Qn, δn) SELECT * FROM Movies WHERE star SIMILARTO ’Schwarrzenger’ AND |year – 1980| <= 5; (’Schwarrzenger’, 2, 1980, 5)

Intuition of MAT (Mixed-Attribute-Type) Tree • “2 > 1 + 1”: one integrated indexing structure is better than two independent indexing structures on two attributes • Indexing numeric attributes: B-tree or R-tree • Indexing strings as a tree to support fuzzy predicates? MAT tree

Answering a query (Qs, δs, Qn, δn) • Traverse the MAT-tree top-down • At each node, prune the subtree unless both checks pass: • [Qn – δn, Qn + δn] overlaps with the node's numeric range • minEditDistance(Qs, Tn) <= δs, where Tn is the node's compressed trie
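The traversal above can be sketched as follows. This is a simplified illustration, not the MAT-tree itself: as a loudly stated assumption, each node keeps a plain set of the strings below it in place of the compressed trie Tn, and the names `Node`, `leaf`, `inner`, and `search` as well as the toy records are hypothetical.

```python
from dataclasses import dataclass, field


def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


@dataclass
class Node:
    lo: int                      # smallest numeric value below this node
    hi: int                      # largest numeric value below this node
    strings: frozenset           # ASSUMPTION: stand-in for the compressed trie Tn
    children: list = field(default_factory=list)
    records: list = field(default_factory=list)   # (string, number), leaves only


def leaf(records):
    nums = [n for _, n in records]
    return Node(min(nums), max(nums), frozenset(s for s, _ in records),
                records=list(records))


def inner(children):
    return Node(min(c.lo for c in children), max(c.hi for c in children),
                frozenset().union(*(c.strings for c in children)),
                children=list(children))


def search(node, qs, ds, qn, dn):
    """Return all (string, number) records matching the query (qs, ds, qn, dn)."""
    # Numeric pruning: the query interval must overlap the node's range.
    if node.hi < qn - dn or node.lo > qn + dn:
        return []
    # String pruning: a lower bound on the edit distance to anything below.
    if min(edit_distance(qs, s) for s in node.strings) > ds:
        return []
    if not node.children:
        # Verify every record in a surviving leaf.
        return [(s, n) for s, n in node.records
                if abs(n - qn) <= dn and edit_distance(qs, s) <= ds]
    hits = []
    for child in node.children:
        hits.extend(search(child, qs, ds, qn, dn))
    return hits


root = inner([
    leaf([("Tom Hanks", 1956), ("Samuel L. Jackson", 1948)]),
    leaf([("Tom Hardy", 1977), ("Keanu Reeves", 1964)]),
])
print(search(root, "Ton Hank", 2, 1955, 5))   # [('Tom Hanks', 1956)]
```

In this toy run the second leaf is discarded by the numeric check alone, so no edit distance is computed for its records.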

Challenge • How to represent strings so that they fit into limited space (the index is disk based) and still support fuzzy-predicate pruning

Existing Approaches to Indexing Strings as Trees • M-tree: • Treats strings under edit distance as a metric space • Q-tree: • Utilizes the q-gram property of strings • See our paper for details

Representing strings as a trie

Compressing a trie • Select k representative nodes (centers). • Each center is in the format of <alphabet, height>. • A compressed trie accepts more strings than the original trie

Minimum edit distance between a string and a trie: minEditDistance(Qs, Tn)? • Convert the trie to an automaton. • Compute the minimum distance between a string and an automaton [Myers and Miller, 1989] • Early termination is possible
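To make the idea of a string-to-trie distance with early termination concrete, here is a minimal sketch on an uncompressed trie. It is not the Myers and Miller automaton algorithm cited on the slide; it swaps in the simpler, well-known technique of carrying one edit-distance row per trie node and abandoning a branch once every entry of its row exceeds the threshold. The names `TrieNode`, `build_trie`, and `min_edit_distance` are assumptions for illustration.

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # character -> TrieNode
        self.is_end = False    # True if a stored string ends here


def build_trie(words):
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True
    return root


def min_edit_distance(query, root, threshold):
    """Smallest edit distance from query to any stored string,
    or None if that distance exceeds threshold."""
    best = None
    stack = [(root, list(range(len(query) + 1)))]   # row for the empty prefix
    while stack:
        node, row = stack.pop()
        if node.is_end and (best is None or row[-1] < best):
            best = row[-1]
        for ch, child in node.children.items():
            new = [row[0] + 1]
            for j in range(1, len(query) + 1):
                new.append(min(new[j - 1] + 1,                   # insert query[j-1]
                               row[j] + 1,                       # delete ch
                               row[j - 1] + (query[j - 1] != ch)))  # substitute/match
            # Early termination: the row minimum never decreases with depth,
            # so nothing below this branch can come within the threshold.
            if min(new) <= threshold:
                stack.append((child, new))
    return best if best is not None and best <= threshold else None


trie = build_trie(["Tom Hanks", "Keanu Reeves"])
print(min_edit_distance("Ton Hank", trie, 2))   # 2
print(min_edit_distance("Ton Hank", trie, 1))   # None
```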

Compressed trie → Automaton • Each node is a state. • Each edge becomes a transition between two states. • For a compressed node <Σ, L>, expand it to L levels. At each level, all characters in Σ become single states and are connected to a common tail ε. • Example: convert a compressed node <{a,b,c},2> into automaton nodes.

Outline • Motivation: supporting queries with mixed-type predicates • Our approach: MAT tree • Construction and maintenance of MAT tree • Experiments

Constructing MAT-tree • Option 1: insert records one by one. • Option 2: • bulk-load records • construct the MAT-tree bottom-up

Compressing a trie • Important: • Accurately represent strings in a limited space. • Minimize “information loss”. • Maintain the pruning power during a traversal. • Three methods: • (1) Reducing # of accepted strings • (2) Keeping accepted strings “clustered” • (3) Combining (1) and (2)

Method (1): Reducing # of accepted strings • Intuition: • reducing this # makes the compressed trie more accurate • Goodness function: # of accepted strings • Algorithm: “Randomized” • Randomly select k initial centers • Randomly select one of the centers • Randomly select an unselected node • Swap them if the swap improves the goodness function • Repeat for a fixed number of iterations
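The steps of “Randomized” amount to a swap-based local search, which can be sketched as below. The goodness function is left as a parameter because the slide does not spell out how the number of accepted strings is computed; the name `randomized_centers`, its signature, and the toy goodness in the usage lines are all assumptions for illustration (lower goodness is taken to be better).

```python
import random


def randomized_centers(nodes, k, goodness, iterations=1000, rng=None):
    """Pick k centers from nodes by random swaps that lower goodness(centers)."""
    if rng is None:
        rng = random.Random()
    centers = rng.sample(nodes, k)                  # k random initial centers
    others = [n for n in nodes if n not in centers]
    best = goodness(centers)
    for _ in range(iterations):
        if not others:
            break                                   # nothing left to swap in
        i = rng.randrange(k)                        # a random current center
        j = rng.randrange(len(others))              # a random unselected node
        centers[i], others[j] = others[j], centers[i]
        score = goodness(centers)
        if score < best:
            best = score                            # keep the improving swap
        else:
            centers[i], others[j] = others[j], centers[i]   # undo the swap
    return centers


# Toy usage: nodes are integers and the hypothetical goodness is their sum,
# standing in for "# of accepted strings".
nodes = list(range(10))
result = randomized_centers(nodes, 3, sum, iterations=200, rng=random.Random(0))
print(sorted(result), sum(result))
```

Because a swap is only kept when it strictly improves the score, the final goodness is never worse than that of the initial random selection.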

Method (2): Keeping accepted strings clustered • Intuition: • keep the accepted strings similar to the original ones by letting them share common prefixes. • Place k centers as close to the root as possible. • Algorithm: “BreadthFirst”
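One way to read “place k centers as close to the root as possible” is to take the first k nodes in breadth-first order. The sketch below does exactly that and nothing more; whether the root itself counts as a center and how ties within a level are broken are assumptions, as are the names `breadth_first_centers` and the `children` callback.

```python
from collections import deque


def breadth_first_centers(root, k, children):
    """Return the first k nodes reached by a breadth-first walk from root.

    children(node) must return the node's child nodes in a fixed order.
    """
    centers, queue = [], deque([root])
    while queue and len(centers) < k:
        node = queue.popleft()
        centers.append(node)
        queue.extend(children(node))
    return centers


# Toy trie given as an adjacency map (hypothetical node labels).
tree = {"root": ["a", "b"], "a": ["c", "d"], "b": ["e"]}
print(breadth_first_centers("root", 4, lambda n: tree.get(n, [])))
# ['root', 'a', 'b', 'c']
```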

Method (3): Combining (1) and (2) • Intuition: • minimize the number of accepted strings, and at the same time maintain their similarity to the originals. • Algorithm: “BottomUp” • Keep shrinking the trie bottom-up until we have k nodes • At each step, compress the node that adds the fewest additional accepted strings

Dynamic maintenance • Insertion (s, n): • Search the index for (s, n). If it’s not in the index, identify the correct leaf node. • If no overflow: • update the “MBR” of the leaf node and, recursively, of its ancestors if necessary. • If overflow: • Split the leaf node • Construct two compressed tries • Cascade the split to the ancestors if necessary. • Deletion and update are handled similarly

Outline • Motivation: supporting queries with mixed-type predicates • Our approach: MAT tree • Construction and maintenance of MAT tree • Experiments

Setting • Data • IMDB: 100K movie star records (name and year of birth, YOB). • Customers: 50K records (name and YOB) • Test bed • PC: 2.4 GHz Pentium 4, 1.2 GB memory, Windows XP • Visual C++ compiler • Results were similar on both data sets; we report the results for IMDB.

Implemented approaches • B-tree • Q-tree • B-tree & Q-tree • BQ-tree • BM-tree • Sequential scan “BBQ-tree”? 

“2 > 1 + 1” • An integrated indexing structure is better than two separate indexing structures • Setting: δs = 3, δn = 4

Scalability

Effect of numeric threshold δn

Effect of string threshold δs

Dynamic Maintenance: time

Dynamic maintenance: MAT quality

Number of centers • Increasing the number of centers may not reduce the running time: there is a trade-off between pruning power and computational cost • For BottomUp and BreadthFirst (compared to Randomized): centers are close to the root, so early termination is more likely

Conclusion • MAT-tree: an efficient indexing structure for queries with mixed-type predicates • Can be efficiently constructed and maintained • Future work: develop a uniform framework to support different kinds of similarity functions • The Flamingo Project: http://www.ics.uci.edu/~flamingo/ • Q&A?
