A Privacy Preserving Efficient Protocol for Semantic Similarity Join Using Long String Attributes. Bilal Hawashin, Farshad Fotouhi, Traian Marius Truta. Department of Computer Science
A Privacy Preserving Efficient Protocol for Semantic Similarity Join Using Long String Attributes • Bilal Hawashin, Farshad Fotouhi, Traian Marius Truta • Department of Computer Science • Wayne State University • Northern Kentucky University
Outline • What is Similarity Join • Long String Values • Our Contribution • Privacy Preserving Protocol for Long String Values • Experiments and Results • Conclusions/Future Work • Contact Information
Motivation Is Natural Join always suitable?
Similarity Join • Joining a pair of records if they have SIMILAR values in the join attribute. • Formally, similarity join consists of grouping pairs of records whose similarity is greater than a threshold, T. • Studied widely in the literature, and referred to as record linkage, entity matching, duplicate detection, citation resolution, …
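The threshold definition above can be sketched in a few lines. This is only an illustration: the token-level Jaccard similarity, the sample strings, and the names `jaccard` and `similarity_join` are assumptions chosen for brevity, not the similarity measure used in this work.

```python
def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity (an assumed, illustrative measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def similarity_join(left, right, threshold):
    """Return index pairs (i, j) whose similarity is greater than the threshold T."""
    return [
        (i, j)
        for i, x in enumerate(left)
        for j, y in enumerate(right)
        if jaccard(x, y) > threshold
    ]


# Hypothetical records from two relations.
left = ["a heist thriller set in paris", "a documentary about deep sea fish"]
right = ["thriller about a heist in paris", "a cooking show for beginners"]

pairs = similarity_join(left, right, threshold=0.6)
print(pairs)
```

A natural join would return nothing here, because no two strings are identical; the similarity join still pairs the two heist descriptions.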
Our Previous Contribution: Long String Values (ICDM MMIS10) • The term long string refers to the data type representing any string value with unlimited length. • The term long attribute refers to any attribute of long string data type. • Most tables contain at least one attribute with long string values. • Examples are Paper Abstract, Product Description, Movie Summary, User Comment, … • Most of the previous work studied similarity join on short fields. • In our previous work, we showed that using long attributes as join attributes under supervised learning can enhance the similarity join performance.
Our Paper (Motivation) • Some sources may not allow sharing their whole data in the similarity join process. • Solution: Privacy Preserved Similarity Join. • Using long attributes as join attributes can increase the similarity join accuracy. • To our knowledge, all the current Privacy Preserved SJ algorithms use short attributes. • Most of the current Privacy Preserved SJ algorithms ignore the semantic similarities among the values.
Problem Formulation • Our goal is to find a Privacy Preserved Similarity Join algorithm that works when the join attribute is a long attribute and that considers the semantic similarities among such long values.
Our Work Plan • Phase1: Compare multiple similarity methods for long attributes when similarity thresholds are used. • Phase2: Use the best method as part of the privacy preserved SJ protocol.
Phase1: Finding Best SJ Method for Long Strings with Threshold • Candidate Methods: • Diffusion Maps. • Latent Semantic Indexing. • Locality Preserving Projection.
Performance Measurements F1 Measurement: the harmonic mean of recall R and precision P, $F_1 = 2PR / (P + R)$, where recall is the ratio of the retrieved relevant data to all relevant data, and precision is the ratio of the relevant data among the retrieved data.
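As a small worked example of the F1 measurement, the sketch below scores a set of suggested matches against a set of true matches. Precision is the share of suggested pairs that are correct, and recall is the share of true pairs that were found. The pair sets and the function name `precision_recall_f1` are hypothetical.

```python
def precision_recall_f1(retrieved: set, relevant: set):
    """Compute precision, recall and their harmonic mean (F1)."""
    true_pos = len(retrieved & relevant)
    precision = true_pos / len(retrieved) if retrieved else 0.0
    recall = true_pos / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical join output (retrieved) and ground truth (relevant).
retrieved = {(0, 0), (1, 2), (3, 1), (4, 4)}
relevant = {(0, 0), (1, 2), (2, 2)}

precision, recall, f1 = precision_recall_f1(retrieved, relevant)
print(precision, recall, f1)
```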
Performance Measurements(Cont.) Preprocessing time is the time needed to read the dataset and generate matrices that could be used later as an input to the semantic operation. Operation time is the time needed to apply the semantic method. Matching time is the time required by the third party, C, to find the cosine similarity among the records provided by both A and B in the reduced space and compare the similarities with the predefined similarity threshold.
Datasets • IMDB Internet Movies Dataset: Movie Summary field. • Amazon Dataset: Product Title, Product Description.
Phase1 Results Finding best dimensionality reduction method using Movie Summary from IMDB Dataset (Left) and Product Descriptions from Amazon (Right).
Phase2 Results Preprocessing Time:
Phase2 Results Operation Time for the best performing methods from phase 1. • Matching Time is negligible.
Our Protocol • Both sources A and B share the Threshold value T to decide similar pairs later.
Our Protocol Source A Source B
TF.IDF Weighting TF.IDF weighting of a term w in a long string value x is given as: $\mathrm{TF.IDF}(w, x) = \log(tf_{w,x} + 1) \cdot \log(idf_w)$, where $tf_{w,x}$ is the frequency of the term w in the long string value x, and $idf_w$ is $N / n_w$, where N is the number of long string values in the relation, and $n_w$ is the number of long string values in the relation that contain the term w.
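A minimal sketch of the weighting step follows. It assumes the common log-scaled form log(tf + 1) · log(N / n_w), whitespace tokenization, and lowercase folding; the function name `tfidf_weights` and the sample values are hypothetical.

```python
from collections import Counter
from math import log


def tfidf_weights(values):
    """Return one dict per long string value, mapping term -> TF.IDF weight.

    Assumed form: log(tf + 1) * log(N / n_w).
    """
    docs = [Counter(v.lower().split()) for v in values]
    n = len(docs)  # N: number of long string values
    # n_w: number of values containing term w (each value counts once per term)
    doc_freq = Counter(term for doc in docs for term in doc)
    return [
        {w: log(tf + 1) * log(n / doc_freq[w]) for w, tf in doc.items()}
        for doc in docs
    ]


values = ["space war epic", "space romance", "war drama war"]
weights = tfidf_weights(values)
print(weights[0])
```

Under this form, a term that occurs in every value gets weight 0, since log(N / n_w) = log(1).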
MeanTF.IDF Feature Selection • MeanTF.IDF is an unsupervised feature selection method. • Every feature (term) is assigned a value according to its importance. • The value of a term feature w is given as $\mathrm{Val}(w) = \frac{1}{N} \sum_{x=1}^{N} \mathrm{TF.IDF}(w, x)$, where TF.IDF(w, x) is the weighting of feature w in long string value x, and N is the total number of long string values.
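The selection step can be sketched as averaging each term's TF.IDF weight over all N long string values and keeping the highest-scoring terms. The cutoff `top_k`, the function name `mean_tfidf_select`, and the sample weights are assumptions; the slides do not state how many features are kept.

```python
def mean_tfidf_select(weights, top_k):
    """Rank terms by mean TF.IDF over all values and keep the top_k.

    weights: list of dicts (one per long string value), term -> TF.IDF weight.
    A term absent from a value contributes 0 to its mean.
    """
    n = len(weights)
    totals = {}
    for row in weights:
        for term, value in row.items():
            totals[term] = totals.get(term, 0.0) + value
    scores = {term: total / n for term, total in totals.items()}
    # Highest mean first; ties broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return ranked[:top_k], scores


# Hypothetical TF.IDF weights for three long string values.
weights = [
    {"space": 0.4, "epic": 1.1},
    {"space": 0.4, "romance": 1.1},
    {"war": 0.9, "drama": 1.1},
]
selected, scores = mean_tfidf_select(weights, top_k=3)
print(selected)
```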
Apply MeanTF.IDF on WeightedM_a and get the important features, Imp_Fe_a. • Add random features to Imp_Fe_a to get Rand_Imp_Fe_a. • Rand_Imp_Fe_a and Rand_Imp_Fe_b are sent to C. • C finds the intersection and returns the shared important features, SF, to both A and B.
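The exchange above can be sketched as two small functions: one run by each source to pad its important features with random ones, and one run by the third party C to intersect the padded sets. How the random features are generated is not stated on the slide, so the decoy pools below (disjoint between A and B) and the function names are assumptions.

```python
import random


def pad_with_random_features(important, decoy_pool, n_random, rng):
    """Source side: hide the real important features among random decoys."""
    candidates = sorted(set(decoy_pool) - set(important))
    return set(important) | set(rng.sample(candidates, n_random))


def shared_features(rand_imp_fe_a, rand_imp_fe_b):
    """Third party C: intersect the two padded feature sets to obtain SF."""
    return rand_imp_fe_a & rand_imp_fe_b


rng = random.Random(0)
imp_fe_a = {"space", "war", "epic"}
imp_fe_b = {"space", "war", "romance"}

# Assumed decoy pools; kept disjoint so decoys never collide across sources.
rand_imp_fe_a = pad_with_random_features(imp_fe_a, ["xq1", "xq2", "xq3"], 2, rng)
rand_imp_fe_b = pad_with_random_features(imp_fe_b, ["zr1", "zr2", "zr3"], 2, rng)

sf = shared_features(rand_imp_fe_a, rand_imp_fe_b)
print(sorted(sf))
```

C sees only the padded sets, so it cannot tell which of the non-shared features are real and which are decoys.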
Add Random Vectors to SF • Result: Rand_Weighted_a
Find W_a (the Kernel) • W_a is a D x D matrix, where D is the total number of columns in Rand_Weighted_a.
Use Diffusion Maps to Find Red_Rand_Weighted_a • [Red_Rand_Weighted_a, S_a, V_a, A_a] = Diffusion_Map(W_a, 10, 1, red_dim), where red_dim < D. • Row i of Red_Rand_Weighted_a is the diffusion map representation of row i of W_a (first row, second row, third row, and so on).
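For readers unfamiliar with diffusion maps, here is a minimal NumPy sketch of the general technique: normalize a D x D kernel into a Markov matrix, take its leading non-trivial eigenvectors, and scale them by their eigenvalues. It is not the Diffusion_Map routine called on the slide, whose arguments 10 and 1 are not explained there. The Gaussian kernel, the parameters `sigma` and `t`, and the sample points are assumptions.

```python
import numpy as np


def gaussian_kernel(points, sigma):
    """Assumed kernel: W[i, j] = exp(-||p_i - p_j||^2 / (2 * sigma^2))."""
    points = np.asarray(points, dtype=float)
    sq_dist = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


def diffusion_map(kernel, red_dim, t=1):
    """Embed D items into red_dim dimensions from a symmetric D x D kernel."""
    kernel = np.asarray(kernel, dtype=float)
    degree = kernel.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(degree)
    # Symmetric conjugate of the Markov matrix P = diag(degree)^-1 * kernel.
    sym = kernel * inv_sqrt[:, None] * inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(sym)
    order = np.argsort(eigvals)[::-1]  # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    right = eigvecs * inv_sqrt[:, None]  # right eigenvectors of P
    # Skip the trivial first eigenvector (eigenvalue 1, constant entries).
    coords = right[:, 1:red_dim + 1] * eigvals[1:red_dim + 1] ** t
    return coords, eigvals


# Four hypothetical items forming two well-separated groups.
points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
w = gaussian_kernel(points, sigma=2.0)
reduced, eigvals = diffusion_map(w, red_dim=2)
print(reduced)
```

Row i of `reduced` is the low-dimensional representation of item i; items in the same group end up with the same sign in the first coordinate.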
C Finds Pairwise Similarity Between Red_Rand_Weighted_a and Red_Rand_Weighted_b
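The matching step at C can be sketched as a pairwise cosine similarity over the reduced vectors, keeping the pairs that exceed the shared threshold T. The vectors, the threshold value, and the function names below are hypothetical.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_pairs(reduced_a, reduced_b, threshold):
    """Return (row in A, row in B) pairs whose cosine similarity exceeds T."""
    return [
        (i, j)
        for i, u in enumerate(reduced_a)
        for j, v in enumerate(reduced_b)
        if cosine(u, v) > threshold
    ]


# Hypothetical reduced representations received from A and B.
reduced_a = [[1.0, 0.0], [0.0, 1.0]]
reduced_b = [[0.9, 0.1], [-1.0, 0.0], [0.1, 0.8]]

matched = match_pairs(reduced_a, reduced_b, threshold=0.9)
print(matched)
```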
Matched is returned to both A and B. • A and B remove the random vectors from Matched and share their matrices.
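One way to picture this final step: each source knows only which of its own rows were random, so it filters Matched on its own side, and the pairs that survive both filters are the real matches. Combining the two filtered results by intersection is an assumption here, as are the indices and the function name `remove_own_random`.

```python
def remove_own_random(matched, own_random_rows, side):
    """Drop pairs whose row on this source's side (0 for A, 1 for B) is random."""
    return {pair for pair in matched if pair[side] not in own_random_rows}


# Hypothetical output of C and hypothetical random row indices.
matched = [(0, 0), (1, 2), (3, 1), (2, 4)]
random_rows_a = {3}  # known only to A
random_rows_b = {4}  # known only to B

kept_by_a = remove_own_random(matched, random_rows_a, side=0)
kept_by_b = remove_own_random(matched, random_rows_b, side=1)

# Assumed combination rule: keep pairs that both sources consider real.
final_matches = sorted(kept_by_a & kept_by_b)
print(final_matches)
```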
Phase2 Results Effect of adding random columns on the accuracy.
Phase2 Results Effect of adding random columns on the number of suggested matches.
Conclusions • An efficient, secure, semantic SJ protocol for long string attributes is proposed. • Diffusion maps is the best method (among those compared) for semantically joining long string attributes when threshold values are used. • Mapping into the diffusion map space and adding random records can hide the original data without affecting the accuracy.
Future Work Potential further work: • Compare diffusion maps with more candidate semantic methods for joining long string attributes. • Study the performance of the protocol on very large databases.
Thank You … Dr. Farshad Fotouhi. Dr. Traian Marius Truta.