1 / 52

Selectivity Estimation for Fuzzy String Predicates in Large Data Sets

This paper presents SEPIA, a method for accurately estimating the selectivity of fuzzy string predicates in large data sets. The approach uses proximity between strings, histograms, and an estimation algorithm. SEPIA can be used for various similarity measurements and is extendable to different types of data.

Download Presentation

Selectivity Estimation for Fuzzy String Predicates in Large Data Sets

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Liang Jin and Chen Li Selectivity Estimation for Fuzzy String Predicates in Large Data Sets Supported by NSF CAREER Award IIS-0238586

  2. Example: a movie database “Find movies starred Schwarrzenger”? Find movies with a star “similar to” Schwarrzenger.

  3. Queries with Fuzzy String Predicates • Stars: name similar to “Schwarrzenger” • Employees: SSN similar to “430-87-7294” • Customers: telephone number similar to “412-0964” • Similar to: • a domain-specific function • returns a similarity value between two strings • Example: edit distance • Ed(s1,s2): minimum # of operations (insertion, deletion, substitution) to change s1 to s2 • ed(Tom Hanks, Ton Hank ) = 2 Database

  4. Selectivity Estimation: Problem Formulation star SIMILARTO ’Schwarrzenger’ Input: fuzzy string predicate P(q, δ) A bag of strings Output: # of strings s that satisfy dist(s,q) <= δ

  5. Why Selectivity Estimation? SELECT * FROM Movies WHERE star SIMILARTO ’Schwarrzenger’ AND year BETWEEN [1970,1971]; SELECT * FROM Movies WHERE star SIMILARTO ’Schwarrzenger’ AND year BETWEEN [1980,1999]; Movies The optimizer needs to know the selectivity of a predicate to decide a good plan.

  6. Rest of the talk • Motivation: selectivity estimation of fuzzy predicates • Our approach: SEPIA • Proximity between strings • Histograms and estimation algorithm • Construction and maintenance of SEPIA • Experiments

  7. Intuition of SEPIA Selectivity Estimation of Approximate Predicates

  8. Proximity between Strings Edit Distance? Not discriminative enough

  9. Edit Vector from s1 to s2 • A vector <I, D, S> • I: # of insertions • D: # of deletions • S: # of substitutions in a sequence of edit operations with their edit distance

  10. Why Edit Vector? More discriminative

  11. SEPIA histograms: Overview

  12. Frequency table for each cluster

  13. Global PPD Table Proximity Pair Distribution table

  14. SEPIA histograms: summary

  15. Selectivity Estimation: ed(lukas, 2) • Do it for all v2 vectors in each cluster, for all clusters • Take the sum of these contributions

  16. Selectivity Estimation for ed(q,d) • For each cluster Ci • For each v2 in frequency table of Ci • Use (v1,v2,d) to lookup PPD • Take the sum of these f * N • Pruning possible (triangle inequality)

  17. Outline • Motivation: selectivity estimation of fuzzy predicates • Our approach: SEPIA • Proximity between strings • Histograms and estimation algorithm • Construction and maintenance of SEPIA • Experiments

  18. Clustering Strings Two example algorithms • Lexicographic order based. • K-Medoids • Choose initial pivots • Assign strings to its closest pivot • Swap a pivot with another string • Reassign the strings

  19. Number of Clusters It affects: • Cluster quality • Similarity of strings within each cluster • Costs: • Space • Estimation time

  20. Constructing Frequency Tables • For each cluster, group strings based on their edit vector from the pivot • Count the frequency for each group

  21. Constructing PPD Table • Get enough samples of string triplets (q,p,s) • Propose a few heuristics • ALL_RAND • CLOSE_RAND • CLOSE_LEX • CLOSE_UNIQUE

  22. Dynamic Maintenance: Frequency Table Take insertion as an example

  23. Dynamic Maintenance: PPD

  24. Improving Estimation Accuracy • A post-processing step to further improve estimation accuracy • See paper for details.

  25. Outline • Motivation: selectivity estimation of fuzzy predicates • Our approach: SEPIA • Proximity between strings • Histograms and estimation algorithm • Construction and maintenance of SEPIA • Experiments

  26. Data • Citeseer: • 71K author names • Length: [2,20], avg = 12 • Movie records from UCI KDD repository: • 11K movie titles. • Length: [3,80], avg = 35 • Introduced duplicates: • 10% of records • # of duplicates: [1,20], uniform • Final results: • Citeseer: 142K author names • UCI KDD: 23K movie titles

  27. Setting • Test bed • PC: 2.4G P4, 1.2GB RAM, Windows XP • Visual C++ compiler • Query workload: • Strings from the data • String not in the data • Results similar • Quality measurements • Relative error: (fest – freal) / freal • Absolute relative error : |fest – freal | / freal

  28. Quartile distribution of relative errors Data set 1. CLOSE_RAND; 1000 clusters

  29. Number of Clusters

  30. Dynamic Maintenance More results in the paper: • Extension to other similarity functions • More experimental results

  31. Related Work • Traditional histograms • Selectivity estimation for predicates with wildcards: star LIKE “%Hanks%” • Answering fuzzy predicates efficiently (another talk in this conference)

  32. Conclusions • Important to support queries with fuzzy string predicates • SEPIA: provides accurate selectivity estimation • Structures can be efficiently constructed and maintained. • Extendable to various similarity measurements The Flamingo Project :http://www.ics.uci.edu/~flamingo/ Q&A?

  33. BLANK

  34. Backup Slides

  35. Errors in databases: • Data is not clean • Especially true in data integration and cleansing Relation S Relation R Star Star Keanu Reeves Keanu Reeves Samuel Jackson Samuel L. Jackson Why do we care? Schwarzenegger Schwarzenegger Samuel Jackson Samuel L. Jackson … … • Errors in queries • User doesn’t remember a string exactly • User types a wrong string

  36. Size of histograms • Data set 1 • 1000 clusters • PPD table: 5MB • Frequency tables: 200KB

  37. Constructing PPD table • We want to generate enough sample triplets to cover as many (v1, v2) pairs as possible. • We also want to control the cost of generating the samples and calculation. • Heuristics • ALL_RAND • CLOSE_RAND • CLOSE_LEX • CLOSE_UNIQUE

  38. Populating PPD Table CLOSE_RAND is used

  39. Number of Clusters (con’t) Number of clusters grows with the size of the dataset Fixed number of clusters

  40. Extension to other similarity functions • SEPIA: a general framework for selectivity estimation for fuzzy string predicates. • Key issue in extensions: proximity between strings • Too specific? • Too general? • Example: Jaccard coefficient distance • Proximity between two strings s1 and s2. • G(s,n) is the n-gram set for string s. < |G(s1, n) ^ G(s2, n)|, |G(s1, n) v G(s2, n)|, ed(s1, s2) >

  41. Jaccard Coefficient Distance

  42. Research Issues • Deciding similarity functions • Domain specific • Query processing • Answering a query with fuzzy predicates efficiently • Query optimization • Selectivity estimation

  43. Queries with Fuzzy String Predicates • Stars: name similar to “Schwarrzenger” • Employees: SSN similar to “430-87-7294” • Customers: telephone number similar to “412-0964” Database • Similar to: • a domain-specific function • returns a similarity value between two strings • Examples: • Edit distance: ed(Schwarrzenger, Schwarzenegger)=2 • Cosine similarity • Jaccard coefficient distance • Soundex • …

  44. Errors in the database: • Data often is not clean by itself • Especially true in data integration and cleansing Relation S Relation R Star Star Keanu Reeves Keanu Reeves Samuel Jackson Samuel L. Jackson Why do we care? Schwarzenegger Schwarzenegger Samuel Jackson Samuel L. Jackson … … • Errors in the query • The user doesn’t remember a string exactly • The user unintentionally types a wrong string

  45. Selectivity of Fuzzy Predicates star SIMILARTO ’Schwarrzenger’ • Selectivity: # of records satisfying the predicate

  46. Example Similarity Function: Edit Distance • A widely used metric to define string similarity • Ed(s1,s2)= minimum # of operations (insertion, deletion, substitution) to change s1 to s2 • Example: s1: Tom Hanks s2: Ton Hank ed(s1,s2) = 2

  47. Using traditional histograms? • No “nice” order for strings • Lexicographical order? • Similar strings could be far from each other: Kammy/Cammy • Adjacent strings have different selectivities: Cathy/Catherine

  48. Edit Vector from s1 to s2 • A vector <I, D, S> • I: # of insertions • D: # of deletions • S: # of substitutions in a sequence of edit operations with their edit distance • Easily computable • Not symmetric • Not unique, but tend to be (ed <= 3  91% unique)

  49. Improving Estimation Accuracy • Reasons of estimate errors • Miss hits in PPD. • Inaccurate percentage entries in PPD. • Improvement: use sample fuzzy predicates to analyze their estimation errors

  50. Relative-Error Model • Use the errors to build a model • Use the model to adjust initial estimation

More Related