510 likes | 538 Views
Chen Li Department of Computer Science Joint work with Liang Jin, Nick Koudas, Anthony Tung, and Rares Vernica. Answering Approximate Queries Efficiently. 30,000-Foot View of Info Systems. Data Repository (RDBMS, Search Engines, etc.). Answers matching conditions. Query. Tom.
E N D
Chen Li Department of Computer Science Joint work with Liang Jin, Nick Koudas, Anthony Tung, and Rares Vernica Answering Approximate Queries Efficiently
30,000-Foot View of Info Systems Data Repository (RDBMS, Search Engines, etc.) Answers matching conditions Query
Tom Find movies starred Samuel Jackson Example: a movie database
How about our governor: Schwarrzenger? The user doesn’t know the exact spelling!
Relaxing Conditions Find movies with a star “similar to” Schwarrzenger.
Errors in the database: • Data often is not clean by itself • Especially true in data integration and cleansing Relation S Relation R Star Star Keanu Reeves Keanu Reeves Samuel Jackson Samuel L. Jackson In general: Gap between Queries and Facts Schwarzenegger Schwarzenegger Samuel Jackson Samuel L. Jackson … … • Errors in the query • The user doesn’t remember a string exactly • The user unintentionally types a wrong string
What if we don’t want the user to change the query?Answering Queries Approximately Data Repository (RDBMS, Search Engines, etc.) Answers matching conditions approximately Query
Technical Challenges • How to relax conditions? • Name: “Schwarzenegger” vs “Schwarrzenger” • Salary: “in [50K,60K]” vs “in [49K,63K]” • How to answer queries efficiently? • Index structures • Selectivity estimation See our three recent VLDB papers
Rest of the talk • Selectivity estimation of fuzzy predicates • Our approach: SEPIA • Construction and maintenance of SEPIA • Experiments • Other works
Queries with Fuzzy String Predicates • Stars: name similar to “Schwarrzenger” • Employees: SSN similar to “430-87-7294” • Customers: telephone number similar to “412-0964” Database • Similar to: • a domain-specific function • returns a similarity value between two strings • Examples: • Edit distance: ed(Schwarrzenger, Schwarzenegger)=2 • Cosine similarity • Jaccard coefficient distance • Soundex • …
Example Similarity Function: Edit Distance • A widely used metric to define string similarity • Ed(s1,s2)= minimum # of operations (insertion, deletion, substitution) to change s1 to s2 • Example: s1: Tom Hanks s2: Ton Hank ed(s1,s2) = 2
Selectivity of Fuzzy Predicates star SIMILARTO ’Schwarrzenger’ • Selectivity: # of records satisfying the predicate
Selectivity Estimation: Problem Formulation star SIMILARTO ’Schwarrzenger’ Input: fuzzy string predicate P(q, δ) A bag of strings Output: # of strings s that satisfy dist(s,q) <= δ
Why Selectivity Estimation? SELECT * FROM Movies WHERE star SIMILARTO ’Schwarrzenger’ AND year BETWEEN [1970,1971]; SELECT * FROM Movies WHERE star SIMILARTO ’Schwarrzenger’ AND year BETWEEN [1980,1989]; Movies The optimizer needs to know the selectivity of a predicate to decide a good plan.
Using traditional histograms? • No “nice” order for strings • Lexicographical order? • Similar strings could be far from each other: Kammy/Cammy • Adjacent strings have different selectivities: Cathy/Catherine
Outline • Selectivity estimation of fuzzy predicates • Our approach: SEPIA • Overview • Proximity between strings • Estimation algorithm • Construction and maintenance of SEPIA • Experiments • Other works
Our approach: SEPIA Selectivity Estimation of Approximate Predicates Intuition
Proximity between Strings Edit Distance? Not discriminative enough
Edit Vector from s1 to s2 • A vector <I, D, S> • I: # of insertions • D: # of deletions • S: # of substitutions in a sequence of edit operations with their edit distance • Easily computable • Not symmetric • Not unique, but tend to be (ed <= 3 91% unique)
Why Edit Vector? More discriminative
Global PPD Table Proximity Pair Distribution table
Selectivity Estimation: ed(lukas, 2) • Do it for all v2 vectors in each cluster, for all clusters • Take the sum of these contributions
Selectivity Estimation for ed(q,d) • For each cluster Ci • For each v2 in frequency table of Ci • Use (v1,v2,d) to lookup PPD • Take the sum of these f * N • Pruning possible (triangle inequality)
Outline • Selectivity estimation of fuzzy predicates • Our approach: SEPIA • Overview • Proximity between strings • Estimation algorithm • Construction and maintenance of SEPIA • Experiments • Other works
Clustering Strings Two example algorithms • Lexicographic order based. • K-Medoids • Choose initial pivots • Assign strings to its closest pivot • Swap a pivot with another string • Reassign the strings
Number of Clusters It affects: • Cluster quality • Similarity of strings within each cluster • Costs: • Space • Estimation time
Constructing Frequency Tables • For each cluster, group strings based on their edit vector from the pivot • Count the frequency for each group
Constructing PPD Table • Get enough samples of string triplets (q,p,s) • Propose a few heuristics • ALL_RAND • CLOSE_RAND • CLOSE_LEX • CLOSE_UNIQUE
Dynamic Maintenance: Frequency Table Take insertion as an example
Improving Estimation Accuracy • Reasons of estimate errors • Miss hits in PPD. • Inaccurate percentage entries in PPD. • Improvement: use sample fuzzy predicates to analyze their estimation errors
Relative-Error Model • Use the errors to build a model • Use the model to adjust initial estimation
Outline • Motivation: selectivity estimation of fuzzy predicates • Our approach: SEPIA • Overview • Proximity between strings • Estimation algorithm • Construction and maintenance of SEPIA • Experiments • Other works
Data • Citeseer: • 71K author names • Length: [2,20], avg = 12 • Movie records from UCI KDD repository: • 11K movie titles. • Length: [3,80], avg = 35 • Introduced duplicates: • 10% of records • # of duplicates: [1,20], uniform • Final results: • Citeseer: 142K author names • UCI KDD: 23K movie titles
Setting • Test bed • PC: 2.4G P4, 1.2GB RAM, Windows XP • Visual C++ compiler • Query workload: • Strings from the data • String not in the data • Results similar • Quality measurements • Relative error: (fest – freal) / freal • Absolute relative error : |fest – freal | / freal
Clustering Algorithms K-Metoids is better
Quartile distribution of relative errors Data set 1. CLOSE_RAND; 1000 clusters
Other work 1: Relaxing SQL queries with Selections/Joins SELECT * FROM Jobs J, Candidate CWHERE J.Salary <= 95 AND J.Zipcode = C.Zipcode AND C.WorkYear >= 5
Other work 2: Fuzzy predicates on attributes of mixed types SELECT * FROM Movies WHERE star SIMILARTO ’Schwarrzenger’ AND |year – 1977| <= 3; Movies
Mixed-Typed Predicates • String attributes: edit distance • Numeric attributes: absolute numeric difference SELECT * FROM Movies WHERE star SIMILARTO ’Schwarrzenger’ AND |year – 1977| <= 3;
MAT-tree: Intuition • Indexing on two attributes is more effective than two separate indexing structures • Numeric attribute: B-tree • String attribute: tree-based index structure?
MAT-tree: Overview • Tree-based indexing structure: • Each node has MBR for both numeric attribute and string attribute • Compressing strings as a “compressed trie” that fits into a limited space • An edit distance between a string and compressed trie can be computed • Experiments show that MAT-tree is very efficient