HmSearch: An Efficient Hamming Distance Query Processing Algorithm. Xiaoyang Zhang 1, Jianbin Qin 1, Wei Wang 1, Yifang Sun 1, Jiaheng Lu 2. 1 University of New South Wales, Australia; 2 Renmin University of China, China.
Motivation
• Identify near-duplicate webpages
• Chemical data
• Each object maps into a binary code (e.g. simhash), and similar objects get similar codes:
  • 0012345679ABCDEF and 1012345679ABCDEF: similar
  • 012345679ABCDEF0 and 012345679ABCDEF1: similar
More Applications • Iris recognition • Image retrieval • C2LSH
Outline • Problem Definition • Framework • HmSearch • Partitioning Scheme • Signature Generation • Enhanced Filtering • Hierarchical Filtering and Verification • Dimension Rearrangement • Conclusion • Experiment
Hamming Distance Query
• Hamming distance: the number of positions at which the corresponding symbols differ for two equal-length vectors
  • Example: q = ABCD, v = ACCD, so hd(q, v) = 1
• Hamming distance query: given a database V of vectors, a query vector Q (all vectors have the same dimensionality N), and a Hamming distance threshold k, find all vi in V such that hd(vi, Q) <= k
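The definition above can be sketched in a few lines of Python. The names `hamming_distance` and `hamming_query` are illustrative and do not come from the slides. The query is a plain linear scan, shown only as a reference point for the filtering techniques on the later slides.

```python
from typing import List, Sequence


def hamming_distance(a: Sequence, b: Sequence) -> int:
    """Number of positions at which two equal-length vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    return sum(x != y for x, y in zip(a, b))


def hamming_query(database: List[Sequence], q: Sequence, k: int) -> List[Sequence]:
    """Return every v in the database with hd(v, q) <= k (linear scan)."""
    return [v for v in database if hamming_distance(v, q) <= k]


print(hamming_distance("ABCD", "ACCD"))  # the slide example: 1
```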
Basic Idea
• General framework:
  • We can answer k = 1 queries efficiently (shown later)
  • So we transform a problem with larger k into several small k = 1 problems by partitioning
  • We filter by looking at each partition
  • We verify at the end
• Example: if k = 1, split q and v into a left part and a right part
  • hd(q, v) <= 1 implies hd(qleft, vleft) = 0 or hd(qright, vright) = 0
  • So v can be filtered by looking at each part: at least one part of v must be the same as in q
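A minimal sketch of the two-part example for k = 1, assuming an in-memory dictionary as the index and a split at the middle of the vector. The helper names `build_half_index` and `query_k1` are hypothetical. A vector becomes a candidate only if one of its halves matches the query exactly, and candidates are then verified with the full distance.

```python
from collections import defaultdict


def hd(a, b):
    """Hamming distance between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def build_half_index(database):
    """Map (side, half-of-vector) -> ids of vectors having that exact half."""
    index = defaultdict(set)
    for vid, v in enumerate(database):
        mid = len(v) // 2
        index[("left", tuple(v[:mid]))].add(vid)
        index[("right", tuple(v[mid:]))].add(vid)
    return index


def query_k1(database, index, q):
    """Answer a k = 1 query: filter on exact half matches, then verify."""
    mid = len(q) // 2
    left_hits = index.get(("left", tuple(q[:mid])), set())
    right_hits = index.get(("right", tuple(q[mid:])), set())
    candidates = left_hits | right_hits
    return sorted(vid for vid in candidates if hd(database[vid], q) <= 1)


database = ["ACCD", "ABCE", "XBCX", "ABCD"]
index = build_half_index(database)
print(query_k1(database, index, "ABCD"))  # ids of vectors within distance 1
```

"XBCX" is never verified here because neither of its halves equals a half of the query, which is the filtering effect the slide describes.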
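Framework
• Data: Dimension Rearrangement → Partitioning → Generating Signatures → Indexing → Index
• Query: Dimension Rearrangement → Partitioning → Generating Signatures → Filtering → Verification → Results
• Technique used at each stage:
  • Partitioning: General Partitioning Scheme
  • Generating Signatures: 1-variants and 1-deletion variants
  • Filtering: Enhanced Filtering
  • Verification: Hierarchical Filtering and Verification
• Intermediate candidate sets on the query side: Candidates0 and Candidates1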
Partitioning
• Lower bound for the partition strategy: given q and v such that hd(q, v) <= k, if the N dimensions are divided into κ parts, there should be at least m partitions such that hd(qpart, vpart) <= k'
• In our algorithm, we choose κ as a function of k so that:
  • When k = 0 or 1: m = 1, hd = 0
  • When k >= 2: hd <= 1
    • When k is even, m = 1
    • When k is odd, m = 2
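The bound can be checked by brute force. The sketch below assumes κ = ⌊(k + 3) / 2⌋ partitions. That value is not legible in the slide text and is used here only as an assumption that reproduces the stated values of m. The function `min_parts_within` is a hypothetical helper. It enumerates every way of spreading at most k errors over κ partitions and reports the smallest number of partitions that stay within the per-partition limit.

```python
from itertools import product


def min_parts_within(k: int, kappa: int, limit: int) -> int:
    """Smallest number of partitions with at most `limit` errors, taken over
    every way of spreading at most k errors across kappa partitions."""
    best = kappa
    for errors in product(range(k + 1), repeat=kappa):
        if sum(errors) <= k:
            best = min(best, sum(e <= limit for e in errors))
    return best


for k in range(2, 9):
    kappa = (k + 3) // 2  # ASSUMED partition count, not taken from the slide
    m = min_parts_within(k, kappa, limit=1)
    print(f"k={k} kappa={kappa} m={m}")
```

Under this assumption the loop prints m = 1 for even k and m = 2 for odd k, matching the slide.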
Signature Generation
• 1-variants: substitute each dimension with each domain value, one at a time (plus the vector itself)
  • v = [1, 2, 3] and Σ (domain) = [1, 2, 3]
  • 1-val(v) = [1, 2, 3], [2, 2, 3], [3, 2, 3], [1, 1, 3], [1, 3, 3], [1, 2, 1], [1, 2, 2]
  • We index all 1-val(v); when q comes in, we search q in the index
• OR
• 1-deletion-variants: substitute each dimension with '#', one at a time
  • v = [1, 2, 3]
  • 1-del-val(v) = [#, 2, 3], [1, #, 3], [1, 2, #]
  • We index all 1-del-val(v); when q comes in, we generate 1-del-val(q) and search all 1-del-val(q) in the index
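Both signature schemes can be sketched as follows, assuming a plain dictionary from signature to vector ids as the index. The function names and the `build_index` helper are illustrative, not the authors' implementation.

```python
from collections import defaultdict

WILDCARD = "#"


def one_variants(v, domain):
    """v itself plus every vector obtained by substituting one dimension."""
    out = [tuple(v)]
    for i, original in enumerate(v):
        for value in domain:
            if value != original:
                out.append(tuple(v[:i]) + (value,) + tuple(v[i + 1:]))
    return out


def one_deletion_variants(v):
    """Every vector obtained by replacing one dimension with '#'."""
    return [tuple(v[:i]) + (WILDCARD,) + tuple(v[i + 1:]) for i in range(len(v))]


def build_index(database, signatures):
    """Map each signature to the ids of the vectors that generate it."""
    index = defaultdict(set)
    for vid, v in enumerate(database):
        for sig in signatures(v):
            index[sig].add(vid)
    return index


database = [[1, 2, 3], [1, 3, 3], [3, 3, 1]]
domain = [1, 2, 3]
q = [1, 2, 3]

# Scheme 1: index 1-variants of the data, look up q itself.
var_index = build_index(database, lambda v: one_variants(v, domain))
hits_var = var_index.get(tuple(q), set())

# Scheme 2: index 1-deletion-variants, look up every 1-deletion-variant of q.
del_index = build_index(database, one_deletion_variants)
hits_del = set()
for sig in one_deletion_variants(q):
    hits_del |= del_index.get(sig, set())

print(sorted(hits_var), sorted(hits_del))
```

The 1-variant index grows with the domain size but needs a single lookup per query, while the 1-deletion-variant index stores one signature per dimension and needs one lookup per dimension of the query.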
Enhanced Filter (Even)
• Based on the formula before: when k (k >= 1) is even, m = 1
• However, we find that if k (k >= 1) is even, v is qualified in only two situations:
  1) m = 1, where hd(vpart, qpart) = 0
  2) m = 2, where hd(vpart, qpart) <= 1
• Example: if k = 2, based on the formula before, m = 1 with hd(vpart, qpart) = 1, so this v becomes a false positive
• Using the enhanced filter, neither situation applies, so v is filtered
Enhanced Filter (Odd)
• Based on the formula before: when k (k >= 1) is odd, m = 2
• However, we find that if k (k >= 1) is odd, v is qualified in only two situations:
  1) m = 2, where hd(vpart, qpart) <= 1 and at least one of them = 0
  2) m = 3, where hd(vpart, qpart) <= 1
• Example: if k = 3, based on the formula before, m = 2 with hd(vpart, qpart) = 1, so this v becomes a false positive
• Using the enhanced filter, neither situation applies, so v is filtered
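The even and odd rules can be written as one predicate. This is a sketch, not the authors' code: `enhanced_filter` is a hypothetical name, and it reads the per-partition distances directly, whereas an index-based implementation would only know which partitions matched with 0 or 1 errors. The rule never rejects a true result only if the number of partitions is chosen suitably. The check in the usage note assumes ⌊(k + 3) / 2⌋ partitions, which is an assumption and not stated in the slide text.

```python
def enhanced_filter(k: int, part_distances) -> bool:
    """Return True if a candidate survives the enhanced filter.

    part_distances holds hd(vpart, qpart) for each partition; any value
    above 1 only matters as "more than one error".
    """
    exact = sum(d == 0 for d in part_distances)
    within_one = sum(d <= 1 for d in part_distances)
    if k % 2 == 0:
        # Even k: one exact partition, or two partitions with <= 1 error.
        return exact >= 1 or within_one >= 2
    # Odd k: two partitions with <= 1 error and one of them exact,
    # or three partitions with <= 1 error.
    return (within_one >= 2 and exact >= 1) or within_one >= 3


print(enhanced_filter(2, [1, 2]))     # k = 2 example from the slide: filtered
print(enhanced_filter(3, [1, 1, 2]))  # k = 3 example from the slide: filtered
```

In both examples the basic rule would keep v as a candidate, although the per-partition distances already add up to more than k.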
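Hierarchical Filtering and Verification
• We can use binary operations to do hierarchical filtering and verification
• Example: |Σ| = 8 (3 bits per value), k = 1, v = [5, 0, 3, 6], q = [5, 2, 2, 5]
• Comparing dimension by dimension takes 4 comparisons to calculate hd(v, q) = 3
• Instead, store each significant bit of all dimensions as one bit string and XOR v with q level by level:
  • 3rd: v = 1010, q = 1001, diff = v XOR q = 0011
  • 2nd: v = 0011, q = 0110, diff = v XOR q = 0101
  • 1st: v = 1001, q = 1001, diff = v XOR q = 0000
• After the first level, diff = 0011 contains two 1s, so hd(v, q) >= 2 and v is filtered
• Moreover, even if k = 4, OR-ing the diffs gives 0011 OR 0101 OR 0000 = 0111, so hd(v, q) = 3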
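Hierarchical Filtering and Verification
• Example: v = [5, 0, 3, 6], q = [5, 2, 3, 5], k = 1; cumdiff starts at 0000
• 3rd: v = 1010, q = 1011, diff = 0001, cumdiff = 0000 OR 0001 = 0001; number of 1s in cumdiff = 1, which is <= 1, so continue
• 2nd: v = 0011, q = 0110, diff = 0101, cumdiff = 0001 OR 0101 = 0101; number of 1s in cumdiff = 2, which is > 1, so v is filtered
• 1st: v = 1001, q = 1001 (not examined, because v is already filtered)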
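The two examples above can be reproduced with a short sketch. It assumes each bit level is packed into one Python integer and that levels are visited from the least significant bit upward, which is how the slide example appears to proceed. `bit_planes` and `hierarchical_hd` are hypothetical names.

```python
def bit_planes(vec, bits):
    """Split a vector into `bits` integers, least significant level first.

    Level j packs bit j of every dimension into one integer, with the first
    dimension in the highest position so that it prints left to right.
    """
    planes = []
    for j in range(bits):
        plane = 0
        for value in vec:
            plane = (plane << 1) | ((value >> j) & 1)
        planes.append(plane)
    return planes


def hierarchical_hd(v_planes, q_planes, k):
    """Return hd(v, q) if it is <= k, otherwise None (filtered early)."""
    cumdiff = 0
    for pv, pq in zip(v_planes, q_planes):
        cumdiff |= pv ^ pq  # dimensions known to differ so far
        if bin(cumdiff).count("1") > k:
            return None  # the lower bound on hd already exceeds k
    return bin(cumdiff).count("1")


v = bit_planes([5, 0, 3, 6], bits=3)
print(hierarchical_hd(v, bit_planes([5, 2, 2, 5], bits=3), k=1))  # filtered
print(hierarchical_hd(v, bit_planes([5, 2, 2, 5], bits=3), k=4))  # exact distance
print(hierarchical_hd(v, bit_planes([5, 2, 3, 5], bits=3), k=1))  # filtered
```

Each level costs one XOR, one OR and one population count on a machine word, regardless of how many dimensions are packed into it. After the last level, cumdiff marks exactly the differing dimensions, so the same loop doubles as verification.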
Impact of Data Skewness
• Given k = 2, then m = 1 and k' = 1
• Original dimension order (all vectors are qualified):
        Partition1 | Partition2
  Dim   1 2 3 | 4 5 6
  q     1 1 1 | 1 0 0
  v1    1 1 1 | 0 0 0
  v2    0 0 0 | 2 0 0
  v3    2 0 2 | 0 0 0
  v4    3 0 0 | 0 0 0
• Rearranged dimension order (only v1 is qualified):
        Partition1 | Partition2
  Dim   1 2 5 | 4 3 6
  q     1 1 0 | 1 1 0
  v1    1 1 0 | 0 1 0
  v2    0 0 0 | 2 0 0
  v3    2 0 0 | 0 2 0
  v4    3 0 0 | 0 0 0
• We propose to reset the dimension order and partition length to improve performance
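The effect of the rearrangement can be reproduced with the data from this slide. The sketch below assumes two partitions of three dimensions each and treats a vector as qualified when at least one partition is within k' = 1 of the query. `candidates` is a hypothetical helper.

```python
def hd(a, b):
    """Hamming distance between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def candidates(data, q, order, part_len=3, k_part=1):
    """Names of vectors with at least one partition within k_part of q,
    after permuting dimensions by `order` (1-based dimension numbers)."""
    def rearrange(vec):
        return [vec[d - 1] for d in order]

    rq = rearrange(q)
    out = []
    for name, v in data.items():
        rv = rearrange(v)
        starts = range(0, len(rv), part_len)
        if any(hd(rv[s:s + part_len], rq[s:s + part_len]) <= k_part for s in starts):
            out.append(name)
    return out


q = [1, 1, 1, 1, 0, 0]
data = {
    "v1": [1, 1, 1, 0, 0, 0],
    "v2": [0, 0, 0, 2, 0, 0],
    "v3": [2, 0, 2, 0, 0, 0],
    "v4": [3, 0, 0, 0, 0, 0],
}
print(candidates(data, q, [1, 2, 3, 4, 5, 6]))  # original order
print(candidates(data, q, [1, 2, 5, 4, 3, 6]))  # rearranged order
```

With the original order, the mostly-zero dimensions 5 and 6 sit in the same partition, so every vector is within one error of the query there. Moving dimension 5 into the first partition spreads the skewed dimensions out and leaves only v1.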
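Greedy Dimension Rearrangement
• MaxFreq is the maximum frequency of any value in each dimension
• Our goal: minimize the global MaxFreq
• Original dimension order:
                      Partition1 | Partition2
  Dim                 1 2 3 | 4 5 6
  v1                  1 1 1 | 0 0 0
  v2                  0 0 0 | 2 0 0
  v3                  2 0 2 | 0 0 0
  v4                  3 0 0 | 0 0 0
  MaxFreq for Dim     1 3 2 | 3 4 4
  MaxFreq for partition   1 | 3
• Rearranged dimension order (achieves the goal):
                      Partition1 | Partition2
  Dim                 5 1 2 | 6 3 4
  v1                  0 1 1 | 0 1 0
  v2                  0 0 0 | 0 0 2
  v3                  0 2 0 | 0 2 0
  v4                  0 3 0 | 0 0 0
  MaxFreq for Dim     4 1 3 | 4 2 3
  MaxFreq for partition   1 | 1
• The MaxFreq values are computed from the vectors shown, counting identical partition contents for the per-partition value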
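The MaxFreq statistics for this example can be computed directly. The slide defines MaxFreq per dimension. Extending it to a partition by counting identical partition contents is an assumption made here, and the greedy search itself is not shown because the slide does not spell it out. Both function names are hypothetical.

```python
from collections import Counter


def max_freq_per_dim(data):
    """Highest frequency of any single value in each dimension."""
    return [max(Counter(column).values()) for column in zip(*data)]


def max_freq_per_partition(data, order, part_len):
    """Highest frequency of any value combination inside each partition
    (ASSUMED definition), after reordering by 1-based dimension numbers."""
    rearranged = [[v[d - 1] for d in order] for v in data]
    result = []
    for start in range(0, len(order), part_len):
        contents = Counter(tuple(v[start:start + part_len]) for v in rearranged)
        result.append(max(contents.values()))
    return result


data = [
    [1, 1, 1, 0, 0, 0],  # v1
    [0, 0, 0, 2, 0, 0],  # v2
    [2, 0, 2, 0, 0, 0],  # v3
    [3, 0, 0, 0, 0, 0],  # v4
]
print(max_freq_per_dim(data))
print(max_freq_per_partition(data, [1, 2, 3, 4, 5, 6], part_len=3))
print(max_freq_per_partition(data, [5, 1, 2, 6, 3, 4], part_len=3))
```

Under this reading, the rearranged order pairs each highly skewed dimension with more selective ones, so no partition content is shared by more than one vector.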
Conclusion • General Partition Scheme • 1-variants and 1-deletion-variants • Techniques that help boost the performance • Enhanced Filtering • Hierarchical Filtering and Verification • Dimension Rearrangement
Experiment Settings • Environment • Intel Xeon X3330 2.664GHz CPU, 4GB RAM • Debian 5.0.6 • AMD Opteron™ 8378 2.4GHz CPU, 96GB RAM (for PubChem) • Ubuntu/Linaro 4.6.4-1ubuntu5 • All compiled with GCC 4.1.2 with -O3 • Dataset
Experiment Settings • Terms • EF: Enhanced Filtering • HB: Hierarchical Binary Filter • RD: Rearranging Dimensions • Our algorithms • HSD, HSV: our proposed algorithms; HSD uses 1-deletion-variants as signatures and HSV uses 1-variants as signatures • HSD-nEB, HSV-nEB: variations that remove EF and HB • HSD-nB, HSV-nB: variations that remove HB • HSD-nR, HSV-nR: variations that remove RD • Baseline algorithm • Scancount (Li et al., ICDE08) • State-of-the-art algorithms • Google (Manku et al., WWW07) • Hengine (Liu et al., ICDE11)
Query Time • HSV has the best performance
Candidate Size • HSV has the smallest candidate size
Effect of EF and HB • EF and HB help improve the performance
Effect of RD • RD boosts the performance for the PubChem data
Index Size • HSV and HSD have a larger index size