Indexing Mixed Types for Approximate Retrieval

Liang Jin* (UC Irvine), Nick Koudas (University of Toronto), Chen Li* (UC Irvine), Anthony K.H. Tung (National University of Singapore) * Liang Jin and Chen Li: supported by NSF CAREER Award IIS-0238586

Queries with Mixed-Type Predicates SELECT * FROM Movies WHERE star SIMILARTO ’Schwarrzenger’ AND |year – 1980| <= 5; • SIMILARTO: • a domain-specific function • returns a similarity value between two strings • Example: edit distance ed(Tom Hanks, Ton Hank) = 2
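The edit distance in the example can be checked with a short dynamic program. This is a minimal sketch of the standard unit-cost Levenshtein distance; the function name `edit_distance` is chosen here for illustration and is not taken from the talk.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))          # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        cur = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,                # delete ca
                cur[j - 1] + 1,             # insert cb
                prev[j - 1] + (ca != cb),   # substitute, or free match
            ))
        prev = cur
    return prev[-1]


print(edit_distance("Tom Hanks", "Ton Hank"))  # prints 2
```

The two edits are the substitution m → n and the deletion of the final s.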

Why fuzzy predicates? • Errors in databases: • Data is not clean • Especially true in data integration and cleansing • Example: the Star column of Relation R and Relation S holds variants of the same names: Keanu Reeves / Keanu Reeves, Samuel Jackson / Samuel L. Jackson, Schwarzenegger / Schwarzenegger, … • Errors in queries: • User doesn’t remember a string exactly • User types a wrong string

Problem Formulation • Given: A query with fuzzy predicates on strings and range predicates on numeric attributes on a single relation • Goal: Answer the query efficiently SELECT * FROM Movies WHERE star SIMILARTO ’Schwarrzenger’ AND |year – 1980| <= 5;

Rest of the talk • Motivation: supporting queries with mixed-type predicates • Our approach: MAT tree • Construction and maintenance of MAT tree • Experiments

Assumptions • One fuzzy string predicate (edit distance) • One numeric predicate Query: (Qs, δs, Qn, δn) SELECT * FROM Movies WHERE star SIMILARTO ’Schwarrzenger’ AND |year – 1980| <= 5; (’Schwarrzenger’, 2, 1980, 5)

Intuition of MAT (Mixed-attribute-type) Tree • “2 > 1 + 1” • One integrated indexing structure is better than • two independent indexing structures on two attributes • Indexing numeric attributes: B-tree or R-tree • Indexing strings as a tree to support fuzzy predicates? MAT tree

Answering a query (Qs, δs, Qn, δn) • Traverse the MAT-tree top-down • At each node, prune the subtree unless both checks pass: • [Qn – δn, Qn + δn] overlaps the node’s numeric range • minEditDistance(Qs, Tn) <= δs, where Tn is the node’s compressed trie
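The traversal can be sketched as follows. This is an illustration under simplifying assumptions, not the talk's implementation: the `Node` class and `search` function are hypothetical, each node stores a plain set of strings instead of a compressed trie, and the sample records are toy data.

```python
from dataclasses import dataclass, field


def edit_distance(a, b):
    """Unit-cost Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


@dataclass
class Node:                       # hypothetical node layout
    lo: int                       # smallest numeric value below this node
    hi: int                       # largest numeric value below this node
    strings: set                  # assumption: stand-in for the compressed trie Tn
    children: list = field(default_factory=list)
    records: list = field(default_factory=list)   # (string, number), leaves only


def search(node, qs, ds, qn, dn):
    """Return all records within ds edits of qs and within dn of qn."""
    # Numeric pruning: skip if [qn - dn, qn + dn] misses the node's range.
    if node.hi < qn - dn or node.lo > qn + dn:
        return []
    # String pruning: skip if no string below the node can be close enough.
    if min((edit_distance(qs, s) for s in node.strings), default=ds + 1) > ds:
        return []
    if not node.children:         # leaf: verify each record exactly
        return [(s, n) for s, n in node.records
                if abs(n - qn) <= dn and edit_distance(qs, s) <= ds]
    hits = []
    for child in node.children:
        hits.extend(search(child, qs, ds, qn, dn))
    return hits


leaf1 = Node(1956, 1964, {"Tom Hanks", "Keanu Reeves"},
             records=[("Tom Hanks", 1956), ("Keanu Reeves", 1964)])
leaf2 = Node(1947, 1948, {"Schwarzenegger", "Samuel L. Jackson"},
             records=[("Schwarzenegger", 1947), ("Samuel L. Jackson", 1948)])
root = Node(1947, 1964, leaf1.strings | leaf2.strings, children=[leaf1, leaf2])

print(search(root, "Ton Hank", 2, 1958, 5))
```

In this toy tree the second leaf is pruned by the numeric check alone, so its strings are never compared against the query.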

Challenge • How to represent strings so that they fit into a limited space (disk-based) • and still support fuzzy-predicate pruning

Existing Approaches to Indexing Strings as Trees • M-tree: • Edit distance: metric space • Q-tree • Utilize the q-gram property of strings. • See our paper for details

Representing strings as a trie

Compressing a trie • Select k representative nodes (centers). • Each center is in the format <alphabet, height>. • A compressed trie represents more strings than the original trie

Minimum edit distance between a string and a trie: minEditDistance(Qs, Tn)? • Convert the trie to an automaton. • Compute the min distance between a string and an automaton [Myers and Miller, 1989] • Early termination possible
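The idea of computing a minimum distance over a whole trie, with early termination, can be illustrated with a simpler technique than the automaton method cited above: carry one row of the edit-distance table down each trie path and abandon a branch once every entry in its row exceeds the threshold. The sketch below works on a plain (uncompressed) trie; `build_trie`, `min_edit_distance`, and the `END` marker are names assumed for this example.

```python
END = ""   # key marking "a string ends here"; no real character is empty


def build_trie(strings):
    """Build a plain trie as nested dicts."""
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def min_edit_distance(query, trie, threshold):
    """Smallest edit distance from query to any string in the trie,
    or None if every string is farther away than threshold."""
    best = None
    # row[j] = distance between the current trie prefix and query[:j]
    stack = [(trie, list(range(len(query) + 1)))]
    while stack:
        node, row = stack.pop()
        if END in node and row[-1] <= threshold:
            best = row[-1] if best is None else min(best, row[-1])
        for ch, child in node.items():
            if ch == END:
                continue
            new_row = [row[0] + 1]
            for j, qc in enumerate(query, start=1):
                new_row.append(min(row[j] + 1,
                                   new_row[j - 1] + 1,
                                   row[j - 1] + (qc != ch)))
            # Early termination: min(new_row) is a lower bound for the subtree.
            if min(new_row) <= threshold:
                stack.append((child, new_row))
    return best


trie = build_trie(["Tom Hanks", "Keanu Reeves"])
print(min_edit_distance("Ton Hank", trie, 2))   # prints 2
```

Strings that share a prefix share the rows computed for that prefix, which is what makes a trie cheaper than comparing against each string separately.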

Compressed trie → Automaton • Each node is a state. • Each edge becomes a transition between two states. • For compressed node <Σ, L>, expand it to L levels. At each level, all characters in Σ become single states and are connected to a common tail ε. • Example: convert a compressed node <{a,b,c},2> into automaton nodes.

Outline • Motivation: supporting queries with mixed-type predicates • Our approach: MAT tree • Construction and maintenance of MAT tree • Experiments

Constructing MAT-tree • Option 1: insert records one by one. • Option 2: • bulk-load records • construct the MAT-tree bottom-up

Compressing a trie • Important: • Accurately represent strings in a limited space. • Minimize “information loss”. • Maintain the pruning power during a traversal. • Three methods: • (1) Reducing # of accepted strings • (2) Keeping accepted strings “clustered” • (3) Combining (1) and (2)

Method (1): Reducing # of accepted strings • Intuition: • reducing this # makes the compressed trie more accurate • Goodness function: # of accepted strings • Algorithm: “Randomized” • Randomly select k initial centers • Randomly select one of the centers • Randomly select an unselected node • Swap them if the swap improves the goodness function • Repeat for a fixed # of iterations
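The swap loop of the “Randomized” algorithm can be sketched as a small local search. This is a sketch under assumptions: `randomized_centers` is a hypothetical name, nodes are assumed distinct, and the goodness function is passed in as a callable where lower is better. In the talk the goodness is the number of strings accepted by the compressed trie; the toy run below substitutes `sum` purely to exercise the loop.

```python
import random


def randomized_centers(nodes, k, goodness, iterations=1000, seed=0):
    """Pick k centers by random swaps that lower goodness(centers)."""
    rng = random.Random(seed)
    centers = rng.sample(nodes, k)                   # k random initial centers
    others = [n for n in nodes if n not in centers]  # unselected nodes
    best = goodness(centers)
    for _ in range(iterations):
        if not others:
            break
        i = rng.randrange(len(centers))              # a random center
        j = rng.randrange(len(others))               # a random unselected node
        centers[i], others[j] = others[j], centers[i]
        score = goodness(centers)
        if score < best:
            best = score                             # keep the improving swap
        else:
            centers[i], others[j] = others[j], centers[i]   # undo the swap
    return centers, best


# Toy stand-in for "# of accepted strings": the sum of the chosen node ids.
centers, score = randomized_centers(list(range(10)), 3, sum, iterations=5000)
print(sorted(centers), score)
```

Because a swap is kept only when it strictly improves the score, the loop never makes the selection worse, but it can stall in a local optimum for goodness functions less well behaved than this toy one.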

Method (2): Keeping accepted strings clustered • Intuition: • keep the accepted strings similar to the original ones by letting them share common prefixes. • Place k centers as close to the root as possible. • Algorithm: “BreadthFirst”

Method (3): Combining (1) and (2) • Intuition: • minimize the number of accepted strings, and at the same time maintain their similarity to the originals. • Algorithm: “BottomUp” • Keep shrinking the trie bottom-up until we have k nodes • At each step, compress the node that adds the fewest additional strings

Dynamic maintenance • Insertion of (s, n): • Search the index for (s, n). If it’s not in the index, identify the correct leaf node. • If no overflow: • update the “MBR” of the leaf node and, recursively, of its ancestors if necessary. • If overflow: • Split the leaf node • Construct two compressed tries • Cascade the split to the ancestors if necessary. • Deletion and update are handled similarly

Outline • Motivation: supporting queries with mixed-type predicates • Our approach: MAT tree • Construction and maintenance of MAT tree • Experiments

Setting • Data • IMDB: 100K movie star records (Name and YOB). • Customers: 50K records (Name and YOB) • Test bed • PC: 2.4 GHz P4, 1.2 GB memory, Windows XP • Visual C++ compiler • Results are similar on both data sets; we report the results for IMDB.

Implemented approaches • B-tree • Q-tree • B-tree & Q-tree • BQ-tree • BM-tree • Sequential scan “BBQ-tree”? 

“2 > 1 + 1” An integrated indexing structure is better than two separate indexing structures δs=3, δn=4

Scalability

Effect of numeric threshold δn

Effect of string threshold δs

Dynamic Maintenance: time

Dynamic maintenance: MAT quality

Number of centers • Increasing the # of centers may not reduce the running time: pruning power versus computational cost • For BottomUp and BreadthFirst (compared to Randomized): • Centers are close to the root, thus early termination is more likely

Conclusion • MAT-tree: an efficient indexing structure for queries with mixed-type predicates • Can be efficiently constructed and maintained • Future work: develop a uniform framework to support different kinds of similarity functions • The Flamingo Project: http://www.ics.uci.edu/~flamingo/ • Q&A?

Backup Slides

Constructing MAT-tree • Option 1: inserting records one by one. • Option 2: bulk-loading data records and constructing the MAT-tree in a bottom-up fashion. • Records are sorted based on one attribute. • Fill pages with records until full. • Calculate the numeric range and the compressed trie for each leaf node. • Merge leaf nodes into internal nodes recursively according to the desired fanout, until a single root is formed.
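The bulk-loading steps can be sketched as below. This is an illustration under assumptions rather than the talk's implementation: `bulk_load` is a hypothetical name, nodes are plain dicts, records are sorted on the numeric attribute, and each node keeps a plain set of strings where the MAT-tree would store a compressed trie.

```python
def bulk_load(records, page_size=2, fanout=2):
    """Build a tree bottom-up from (string, number) records; return the root."""
    if not records:
        raise ValueError("need at least one record")
    if page_size < 1 or fanout < 2:
        raise ValueError("page_size must be >= 1 and fanout >= 2")

    # Sort on one attribute (here: the numeric one), then fill leaf pages.
    records = sorted(records, key=lambda r: r[1])
    level = []
    for i in range(0, len(records), page_size):
        page = records[i:i + page_size]
        level.append({
            "range": (page[0][1], page[-1][1]),   # numeric range of the leaf
            "strings": {s for s, _ in page},      # stand-in for the compressed trie
            "children": [],
            "records": page,
        })

    # Merge groups of `fanout` nodes into parents until one root remains.
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), fanout):
            group = level[i:i + fanout]
            parents.append({
                "range": (min(c["range"][0] for c in group),
                          max(c["range"][1] for c in group)),
                "strings": set().union(*(c["strings"] for c in group)),
                "children": group,
                "records": [],
            })
        level = parents
    return level[0]


root = bulk_load([("a", 5), ("b", 1), ("c", 3), ("d", 9), ("e", 7)])
print(root["range"], len(root["children"]))   # prints (1, 9) 2
```

Sorting before filling pages keeps each leaf's numeric range narrow, which is what lets the numeric check prune whole subtrees at query time.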

Example – Customer Service Call Center • Customer calls in • Issue a fuzzy query: Name LIKE “Tom Hanks” AND YOB CLOSE to 1958 • Return result • Serve the customer • In this example, the underlying system should be able to support fuzzy queries on both the string and the numeric attributes!

Scalability test (IO)
