
Improving Similarity Join Algorithms using Vertical Clustering Techniques



  1. Improving Similarity Join Algorithms using Vertical Clustering Techniques Lisa Tan Department of Computer Science Computing & Information Technology Wayne State University Sept. 15, 2009

  2. Reasons for Using Similarity Join • Correlate data from different data sources (e.g., data integration) • Data is often dirty (e.g., typing mistakes) • Abbreviated, incomplete or missing information • Differences in information "formatting" due to the lack of standard conventions (e.g., for addresses)

  3. Example (Table R and Table S) Find records from different datasets that could be the same entity.

  4. Experimental Results – Natural Join

  5. Experimental Results – Similarity Join

  6. Problem Statement for Similarity Join Given a string S called the source and another string T called the target, and allowing a defined number of errors in the join, the similarity join verifies whether or not the two strings represent the same real-world entity based on certain methods.

  7. Sample Applications • Finding matching DNA subsequences even after mutations have occurred. • Signal recovery for transmissions over noisy lines. • Searching for spelling/typing errors and finding possible corrections. • Handwriting recognition, virus and intrusion detection.

  8. General Approaches • Similarity join attracts different research communities: statistics, artificial intelligence and databases. • The statistics community refers to similarity join as probabilistic record linkage, aimed at minimizing the probability of misclassification. • The artificial intelligence community uses supervised learning to learn the parameters of string edit distance metrics. • The database community uses a knowledge-intensive approach, with edit distance as a general record-matching scheme.

  9. General Algorithms in the Database Area All of these algorithms focus on edit distance: • Dynamic Programming Algorithms • Automata Algorithms • Bit-Parallelism Algorithms • Filtering Algorithms

  10. Comments on Existing Methods • All of the above algorithms are based on the generic edit distance function. • Some improve the speed of the dynamic programming method. • Some apply filtering techniques that avoid expensive comparisons over large parts of the queried sequence. • Current similarity join algorithms assume that the join conditions are known and do not consider relevant fields in their join conditions. • Although there have been many efforts toward efficient string similarity joins, there is still room for improvement.

  11. Outline • Motivation • Pre-experimental Results • Proposed Approach • Identify Clustered Join Attributes • Experimental Results • Conclusion

  12. Research Goal Identifying the same real-world entities from multiple heterogeneous databases

  13. Motivation of the Clustering Concept • Current similarity join algorithms do not consider the relevant-field concept. • The clustering concept fits well with the relevant-field concept.

  14. Pre-experimental Results

  15. Proposed Approach Our proposed approach takes clustered, related attributes into consideration. Question: how do we identify clustered join attributes?

  16. Clustering Algorithm The rationale behind the clustering is to produce fragments, groups of attribute columns that are closely related.

  17. Identify Clustered, Related Attributes • Pre-knowledge of applications on the data • Attribute usage information • Calculate attribute affinities • Calculate clustered affinities: use the Bond Energy Algorithm (BEA) to regroup affinity values • Apply the split approach to find clustered, related attributes

  18. Clustered Approach – Diagram: Logical Accesses → Computation of Affinities → Attribute Affinity Matrix → Clustering (BEA) → Clustered Attribute Affinity Matrix → Split Approach → Group of Clustered, Related Attributes

  19. Clustered Approach – Cont'd • Attribute Usage: use(qk, Aj) = 1 if attribute Aj is referenced by application qk, 0 otherwise • Attribute Affinity: aff(Ai, Aj) = sum of the access frequencies of all applications that reference both Ai and Aj • Cluster Affinity: a permutation of the attribute affinity matrix that maximizes the global affinity measure, grouping large affinity values with large-affinity attributes and small affinity values with small-affinity attributes.
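
To make these definitions concrete, here is a small illustrative Python sketch (not from the paper): the queries, attributes and access frequencies below are invented, and the column ordering uses a simple greedy variant of the Bond Energy Algorithm.

```python
# Illustrative sketch only: queries, attributes, and frequencies are invented.
attrs = ["Address", "Birthday", "Name", "Phone"]
use = {  # use[q][A] = 1 if query q references attribute A, else 0
    "q1": {"Address": 1, "Birthday": 0, "Name": 1, "Phone": 0},
    "q2": {"Address": 0, "Birthday": 1, "Name": 1, "Phone": 1},
    "q3": {"Address": 1, "Birthday": 0, "Name": 0, "Phone": 1},
}
acc = {"q1": 15, "q2": 20, "q3": 10}  # hypothetical access frequencies

# Attribute affinity: aff(Ai, Aj) = sum of acc(q) over queries q that reference both Ai and Aj
aff = {a: {b: sum(acc[q] for q in use if use[q][a] and use[q][b]) for b in attrs}
       for a in attrs}

def bond(x, y):
    # bond(Ax, Ay) = sum over all attributes Az of aff(Az, Ax) * aff(Az, Ay)
    return sum(aff[z][x] * aff[z][y] for z in attrs)

def bea_order(attributes):
    # Greedy BEA: insert each attribute where it maximizes
    # cont(left, new, right) = 2*bond(left, new) + 2*bond(new, right) - 2*bond(left, right)
    order = list(attributes[:2])
    for a in attributes[2:]:
        best_pos, best_cont = 0, float("-inf")
        for pos in range(len(order) + 1):
            left = order[pos - 1] if pos > 0 else None
            right = order[pos] if pos < len(order) else None
            b = lambda u, v: 0 if u is None or v is None else bond(u, v)
            cont = 2 * b(left, a) + 2 * b(a, right) - 2 * b(left, right)
            if cont > best_cont:
                best_cont, best_pos = cont, pos
        order.insert(best_pos, a)
    return order

print(aff)               # attribute affinity matrix
print(bea_order(attrs))  # BEA (clustered) column ordering
```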

  20. Clustered Approach - Example

  21. Split Approach • Split based on the access model: SQ = af(VF1) * af(VF2) - af(VF1, VF2)^2, where af(VFi) stands for the access frequency of queries that access only attributes in vertical fragment VFi, and af(VF1, VF2) stands for the access frequency of queries that access at least one attribute in each of the two vertical fragments.
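
A minimal sketch of the split step under the same kind of hypothetical data (the usage matrix and frequencies below are invented, not Table 3 from the slides): every split point along the BEA-ordered attribute list is scored with SQ.

```python
def split_quality(frag1, frag2, use, acc):
    """SQ = af(VF1) * af(VF2) - af(VF1, VF2)^2 for one candidate split."""
    only1 = only2 = both = 0
    for q, freq in acc.items():
        t1 = any(use[q][a] for a in frag1)
        t2 = any(use[q][a] for a in frag2)
        if t1 and t2:
            both += freq      # query needs attributes from both fragments
        elif t1:
            only1 += freq     # query answerable from fragment 1 alone
        elif t2:
            only2 += freq     # query answerable from fragment 2 alone
    return only1 * only2 - both ** 2

# Hypothetical data, assuming an already BEA-ordered attribute list:
use = {"q1": {"Address": 1, "Birthday": 1, "Name": 0, "Phone": 0},
       "q2": {"Address": 0, "Birthday": 0, "Name": 1, "Phone": 1},
       "q3": {"Address": 1, "Birthday": 0, "Name": 1, "Phone": 0}}
acc = {"q1": 25, "q2": 35, "q3": 10}
order = ["Address", "Birthday", "Name", "Phone"]

for i in range(1, len(order)):   # score every possible split point
    print(order[:i], order[i:], split_quality(order[:i], order[i:], use, acc))
```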

  22. Split Approach – Cont'd • Based on Table 3, for the first possible split {Address} and {Birthday, Name, Phone}, SQ = 25*35; • for the second possible split {Address, Birthday} and {Name, Phone}, SQ = -(30+35); • for the third possible split {Address, Birthday, Name} and {Phone}, SQ = -(35+35).

  23. Existing Similarity Join Techniques • Edit Distance • Q-gram

  24. Similarity Join – Edit Distance • A widely used metric to define string similarity • ED(s1, s2)= minimum # of operations (insertion, deletion, substitution) to change s1 to s2 • Example: s1: surgery s2: survey ED(s1, s2) = 2

  25. Dynamic Programming Algorithm • This is the oldest approach. • It answers the question: how do we compute ed(x, y)? • Take a matrix C[0..|x|, 0..|y|] where C[i, j] is the minimum number of operations to match x[1..i] to y[1..j]. • It is calculated as follows: C[i, 0] = i; C[0, j] = j; if x[i] = y[j] then C[i, j] = C[i-1, j-1]; otherwise C[i, j] = 1 + min(C[i-1, j], C[i, j-1], C[i-1, j-1]). • O(mn) time complexity.
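
A minimal Python sketch of this dynamic-programming recurrence (illustrative, not the authors' implementation):

```python
def edit_distance(x: str, y: str) -> int:
    """Classic O(|x|*|y|) dynamic-programming edit distance."""
    m, n = len(x), len(y)
    # C[i][j] = minimum number of operations to turn x[:i] into y[:j]
    C = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        C[i][0] = i          # delete all i characters of the source prefix
    for j in range(n + 1):
        C[0][j] = j          # insert all j characters of the target prefix
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                C[i][j] = C[i - 1][j - 1]                 # match: no cost
            else:
                C[i][j] = 1 + min(C[i - 1][j],            # deletion
                                  C[i][j - 1],            # insertion
                                  C[i - 1][j - 1])        # substitution
    return C[m][n]

print(edit_distance("surgery", "survey"))   # 2, as in the earlier example
```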

  26. Edit Distance – Matrix Example

  27. Similarity Join – Q-gram • Q-gram roadmap: - break strings into substrings of length q - perform an exact join on the q-grams - find candidate string pairs based on the results - check only the candidate pairs with a UDF to obtain the final answer

  28. Similarity Join – Q-gram A positional q-gram of a string s is a (position, substring) pair generated as follows: • Slide a window of length q over the string s • Pad s with new characters # (at the beginning) and % (at the end) • This generates |s| + q - 1 q-grams

  29. Q-gram Technique (cont'd) • Rationale: when two strings s and t are within a small edit distance of each other, they share a large number of q-grams in common. • Advantage: builds on top of relational databases, with an augmented q-gram table created on the fly.

  30. Similarity Join – Q-gram • Issue with Q-gram: it does not scale well to large datasets • Resolution: - remove exact matches first using an exact join - create a table to hold the non-matching data - apply the Q-gram join to the new temporary table (see the sketch below)
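
A hedged sketch of this staging idea (the `qgram_join` callable is a hypothetical stand-in for the q-gram join itself): exact matches are removed cheaply first, and the expensive q-gram join runs only on the leftover values.

```python
def staged_similarity_join(r_values, s_values, qgram_join):
    """Exact-join first, then q-gram join on the remaining (non-matching) values."""
    exact = set(r_values) & set(s_values)              # cheap exact join
    r_rest = [v for v in r_values if v not in exact]   # "temp table" of leftovers
    s_rest = [v for v in s_values if v not in exact]
    # expensive similarity join only on the reduced inputs
    return [(v, v) for v in exact] + qgram_join(r_rest, s_rest)
```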

  31. Similarity Join – Q-gram (continued) • For a string john smith: { (1,##j), (2,#jo), (3,joh), (4,ohn), (5,hn ), (6,n s), (7, sm), (8,smi), (9,mit), (10,ith), (11,th%), (12,h%%)} with q=3 • For a string john a smith: {(1,##j), (2,#jo), (3,joh), (4,ohn), (5,hn ), (6,n a), (7, a ), (8,a s), (9, sm), (10,smi), (11,mit), (12,ith), (13,th%), (14,h%%)} with q=3
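
A small sketch of positional q-gram generation (pad with q-1 '#' characters at the front and q-1 '%' characters at the end, then slide a window of length q); it reproduces the "john smith" listing above:

```python
def positional_qgrams(s: str, q: int = 3):
    padded = "#" * (q - 1) + s + "%" * (q - 1)          # |s| + 2*(q-1) characters
    return [(i + 1, padded[i:i + q]) for i in range(len(padded) - q + 1)]

print(positional_qgrams("john smith"))
# [(1, '##j'), (2, '#jo'), (3, 'joh'), (4, 'ohn'), (5, 'hn '), (6, 'n s'),
#  (7, ' sm'), (8, 'smi'), (9, 'mit'), (10, 'ith'), (11, 'th%'), (12, 'h%%')]
```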

  32. Sample SQL Expression
SELECT R1.A0, R1.Ai, R2.A0, R2.Aj
FROM R1, R1AiQ, R2, R2AjQ
WHERE R1.A0 = R1AiQ.A0
AND R2.A0 = R2AjQ.A0
AND R1AiQ.Qgram = R2AjQ.Qgram                        -- exact join on positional q-grams
AND ABS(R1AiQ.Pos - R2AjQ.Pos) <= k                  -- position filter
AND ABS(strlen(R1.Ai) - strlen(R2.Aj)) <= k          -- length filter
GROUP BY R1.A0, R1.Ai, R2.A0, R2.Aj
HAVING COUNT(*) >= strlen(R1.Ai) - 1 - (k - 1) * q   -- count filter on R1.Ai
AND COUNT(*) >= strlen(R2.Aj) - 1 - (k - 1) * q      -- count filter on R2.Aj
AND Edit_distance(R1.Ai, R2.Aj, k)                   -- final verification with a UDF

  33. Precision, Recall and F-measure • Precision is defined as the number of true positives divided by the sum of true positives and false positives: TP / (TP + FP) • Recall is defined as the number of true positives divided by the sum of true positives and false negatives: TP / (TP + FN) • F-measure is defined as the weighted harmonic mean of precision and recall: F = 2 * (precision * recall) / (precision + recall)
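
These definitions translate directly into a small helper (illustrative; the counts in the usage line are hypothetical):

```python
def precision_recall_f(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure

print(precision_recall_f(tp=80, fp=20, fn=10))  # (0.8, 0.888..., 0.842...)
```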

  34. Experimental Results • Known join attributes vs clustered join attributes on Precision

  35. Experimental Results • Known join attributes vs clustered join attributes on Recall

  36. Experimental Results • ED vs. Qgram

  37. Experimental Results • ED vs Qgram on Recall

  38. Experimental Results • ED vs Qgram on F-measure

  39. Conclusion • Proposed a pre-processing approach to improve existing similarity join techniques • Experimental results showed an improvement of about 5% for ED and about 15% for Q-gram

  40. Future Work Potential future work: • explore alternative clustering methods • increase the size of the datasets • add pre- and post-filtering capabilities …

  41. Publications • Lisa Tan, Farshad Fotouhi and William Grosky, "Improving Similarity Join Algorithms using Vertical Clustering Techniques", ICADIWT 2009, pages 491-496. • "Improving Similarity Join Algorithm Using Fuzzy Clustering Techniques" has been accepted by the ICDM-09 Workshop on Mining Multiple Information Sources (MMIS).

  42. Thank You! Lisa Tan – lisatan@wayne.edu Co-Authors Dr. Farshad Fotouhi – fotouhi@wayne.edu Dr. William Grosky – wgrosky@umich.edu Acknowledgement Dr. Farshad Fotouhi, Dr. William Grosky, and Computing & Information Technology

  43. Questions? Thanks!

  44. Wayne State University - Facts • 30th largest university in nation • Top 50 in NSF public rankings • Over 33,300 students • Over 350 undergraduate/graduate degree programs in 12 Schools and Colleges

