
HmSearch: An Efficient Hamming Distance Query Processing Algorithm

HmSearch: An Efficient Hamming Distance Query Processing Algorithm. Xiaoyang Zhang 1, Jianbin Qin 1, Wei Wang 1, Yifang Sun 1, Jiaheng Lu 2. 1 University of New South Wales, Australia 2 Renmin University of China, China. Motivation. Identify Near Duplicate Webpages. Chemical data.


Presentation Transcript


  1. HmSearch: An Efficient Hamming Distance Query Processing Algorithm. Xiaoyang Zhang1, Jianbin Qin1, Wei Wang1, Yifang Sun1, Jiaheng Lu2. 1 University of New South Wales, Australia; 2 Renmin University of China, China

  2. Motivation • Identify near-duplicate web pages • Chemical data • Data maps into binary code (simhash). Example pairs marked as similar: 0012345679ABCDEF and 1012345679ABCDEF; 012345679ABCDEF0 and 012345679ABCDEF1

  3. More Applications • Iris recognition • Image retrieval • C2LSH

  4. Outline • Problem Definition • Framework • HmSearch • Partitioning Scheme • Signature Generation • Enhanced Filtering • Hierarchical Filtering and Verification • Dimension Rearrangement • Conclusion • Experiment

  5. Hamming Distance Query • Hamming distance: the number of positions at which the corresponding symbols differ for two equal-length vectors. Example: q = ABCD, v = ACCD, so hd(q, v) = 1. • Hamming distance query: given a database V of vectors, a query vector Q (all vectors have the same dimensionality N) and a Hamming distance threshold k, find all vi in V such that hd(vi, Q) <= k.
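
A minimal sketch of the query defined on this slide, as a brute-force scan. The function names `hamming_distance` and `hamming_query` are illustrative and do not come from the presentation; this is the naive baseline that the later slides try to avoid, not HmSearch itself.

```python
from typing import List, Sequence


def hamming_distance(a: Sequence, b: Sequence) -> int:
    """Count the positions at which two equal-length vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    return sum(x != y for x, y in zip(a, b))


def hamming_query(database: List[Sequence], query: Sequence, k: int) -> List[Sequence]:
    """Return every vector in the database within Hamming distance k of the query."""
    return [v for v in database if hamming_distance(v, query) <= k]


# The slide's example: q = ABCD and v = ACCD differ in one position.
print(hamming_distance("ABCD", "ACCD"))
print(hamming_query(["ACCD", "ABCD", "DCBA"], "ABCD", 1))
```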

  6. Outline • Problem Definition • Framework • HmSearch • Partitioning Scheme • Signature Generation • Enhanced Filtering • Hierarchical Filtering and Verification • Dimension Rearrangement • Conclusion • Experiment

  7. Basic Idea • General framework: • We can do k=1 efficiently (shown later) • So we transform a larger-k problem into several small k=1 problems by partitioning • We do filtering by looking at each partition • We do verification last • So if k = 1, vectors can be filtered by looking at each part: if hd(q, v) <= 1, then hd(qleft, vleft) = 0 or hd(qright, vright) = 0, i.e., at least one half of q and v is the same.
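
The k = 1 observation on this slide can be checked exhaustively on small inputs. The sketch below assumes a split into two halves at the midpoint; the helper names `half_matches` and `check_all` are hypothetical.

```python
from itertools import product


def hd(a, b):
    """Hamming distance between two equal-length tuples."""
    return sum(x != y for x, y in zip(a, b))


def half_matches(q, v):
    """True if the left halves or the right halves of q and v are identical."""
    mid = len(q) // 2
    return q[:mid] == v[:mid] or q[mid:] == v[mid:]


def check_all(n=4, alphabet=(0, 1)):
    """Verify that hd(q, v) <= 1 always implies one matching half."""
    for q in product(alphabet, repeat=n):
        for v in product(alphabet, repeat=n):
            if hd(q, v) <= 1 and not half_matches(q, v):
                return False
    return True


print(check_all())
```

The converse does not hold: a matching half only makes v a candidate, which is why a verification step follows the filtering.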

  8. Framework (flow diagram) • Dimension Rearrangement and the General Partitioning Scheme apply to both Data and Query • Data: Partitioning, Generating Signatures (1-variants and 1-deletion variants), Indexing, Index • Query: Partitioning, Generating Signatures, Filtering (Candidates0), Enhanced Filtering (Candidates1), Hierarchical Filtering and Verification, Results

  9. Outline • Problem Definition • Framework • HmSearch • Partitioning Scheme • Signature Generation • Enhanced Filtering • Hierarchical Filtering and Verification • Dimension Rearrangement • Conclusion • Experiment

  10. Partitioning • Lower bound for the partition strategy: given q and v such that hd(q, v) <= k, if the N dimensions are divided into κ parts, there are at least m partitions such that hd(qpart, vpart) <= k' (the formulas for κ, m and k' shown on the slide are not preserved in this transcript) • When k = 0 or 1: m = 1, hd = 0 • When k >= 2: hd <= 1, with m = 1 when k is even and m = 2 when k is odd
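
The values of m stated on this slide can be recovered with a short counting argument. The derivation below is a sketch under one loud assumption: that the number of partitions is \kappa = \lfloor (k+3)/2 \rfloor. That formula is not preserved in the transcript; it is used here only because it reproduces the stated m = 1 (even k) and m = 2 (odd k).

```latex
% Assumption (not in the transcript): \kappa = \lfloor (k+3)/2 \rfloor partitions.
% Let b be the number of partitions with hd(q_{part}, v_{part}) \ge 2.
\begin{align*}
  2b \le \mathrm{hd}(q, v) \le k
    &\;\Longrightarrow\; b \le \lfloor k/2 \rfloor, \\
  m = \kappa - b
    &\;\ge\; \lfloor (k+3)/2 \rfloor - \lfloor k/2 \rfloor, \\
  k \text{ even:}\quad m &\ge \left(\tfrac{k}{2} + 1\right) - \tfrac{k}{2} = 1, \\
  k \text{ odd:}\quad  m &\ge \tfrac{k+3}{2} - \tfrac{k-1}{2} = 2.
\end{align*}
```

Here m counts the partitions whose Hamming distance is at most 1, which is why a k = 1 style signature lookup on each partition suffices for filtering.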

  11. Outline • Problem Definition • Framework • HmSearch • Partitioning Scheme • Signature Generation • Enhanced Filtering • Hierarchical Filtering and Verification • Dimension Rearrangement • Conclusion • Experiment

  12. Signature Generation • 1-variants: substitute each dimension with each domain value in turn (plus the vector itself). Example: v=[1, 2, 3] and Σ (domain) =[1, 2, 3]; 1-val(v)=[1, 2, 3], [2, 2, 3], [3, 2, 3], [1, 1, 3], [1, 3, 3], [1, 2, 1], [1, 2, 2]. We index all 1-val(v), and when q comes in, we search q in the index. OR • 1-deletion-variants: substitute each dimension with '#' in turn. Example: v=[1, 2, 3]; 1-del-val(v)=[#, 2, 3], [1, #, 3], [1, 2, #]. We index all 1-del-val(v), and when q comes in, we generate 1-del-val(q) and search all 1-del-val(q) in the index.
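
The two signature schemes can be sketched as follows. All function names are illustrative, the in-memory dictionary stands in for whatever index structure the authors actually use, and the sketch assumes the placeholder '#' never occurs as a domain value.

```python
from collections import defaultdict


def one_variants(v, domain):
    """The vector itself plus every vector that differs from it in one dimension."""
    variants = [tuple(v)]
    for i, original in enumerate(v):
        for value in domain:
            if value != original:
                variants.append(tuple(v[:i]) + (value,) + tuple(v[i + 1:]))
    return variants


def one_deletion_variants(v, placeholder="#"):
    """Replace each dimension in turn with a placeholder symbol."""
    return [tuple(v[:i]) + (placeholder,) + tuple(v[i + 1:]) for i in range(len(v))]


def build_deletion_index(vectors):
    """Map every 1-deletion variant to the ids of the vectors that produce it."""
    index = defaultdict(set)
    for vid, v in enumerate(vectors):
        for sig in one_deletion_variants(v):
            index[sig].add(vid)
    return index


def query_deletion_index(index, q):
    """Ids of indexed vectors sharing at least one 1-deletion variant with q."""
    hits = set()
    for sig in one_deletion_variants(q):
        hits |= index.get(sig, set())
    return hits


print(one_variants([1, 2, 3], [1, 2, 3]))
print(one_deletion_variants([1, 2, 3]))
```

The trade-off visible in the sketch: 1-variants make the index larger (one entry per domain value per dimension) but need a single lookup per query, while 1-deletion variants keep the index at one entry per dimension and move the variant generation to query time.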

  13. Outline • Problem Definition • Framework • HmSearch • Partitioning Scheme • Signature Generation • Enhanced Filtering • Hierarchical Filtering and Verification • Dimension Rearrangement • Conclusion • Experiment

  14. Enhanced Filter (Even) • Based on the formula before: when k (k>=1) is even, m = 1 • However, we find that if k (k>=1) is even, v qualifies only in two situations: 1) m=1, where hd(vpart, qpart)=0; 2) m=2, where hd(vpart, qpart)<=1 • Example: if k = 2, the formula before gives m=1, so a v with a single partition where hd(vpart, qpart)=1 passes and becomes a false positive. With the enhanced filter, neither situation applies, so v is filtered.

  15. Enhanced Filter (Odd) • Based on the formula before: when k (k>=1) is odd, m = 2 • However, we find that if k (k>=1) is odd, v qualifies only in two situations: 1) m=2, where hd(vpart, qpart)<=1 and at least one of them = 0; 2) m=3, where hd(vpart, qpart)<=1 • Example: if k = 3, the formula before gives m=2, so a v with two partitions where hd(vpart, qpart)=1 passes and becomes a false positive. With the enhanced filter, neither situation applies, so v is filtered.
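
The even and odd rules from the two Enhanced Filter slides can be expressed as one predicate. This is a sketch, not the authors' code: `passes_enhanced_filter` is a hypothetical name, and it takes the per-partition Hamming distances directly, whereas a real implementation would presumably learn only "0", "1" or "more than 1" for each partition from the signature lookups.

```python
def passes_enhanced_filter(part_distances, k):
    """Apply the enhanced filtering conditions to per-partition distances.

    part_distances: Hamming distance between q and v in each partition.
    k: the overall Hamming distance threshold.
    """
    exact = sum(1 for d in part_distances if d == 0)  # partitions with hd = 0
    near = sum(1 for d in part_distances if d <= 1)   # partitions with hd <= 1
    if k % 2 == 0:
        # Even k: one exact partition, or two partitions within distance 1.
        return exact >= 1 or near >= 2
    # Odd k: three partitions within distance 1, or two with at least one exact.
    return near >= 3 or (near >= 2 and exact >= 1)


# k = 2: a single partition at distance 1 is not enough.
print(passes_enhanced_filter([1, 2], k=2))
# k = 3: two partitions at distance 1, none exact, is not enough either.
print(passes_enhanced_filter([1, 1, 2], k=3))
```

The filter is safe because any v that fails it must have total distance of at least k + 1: the partitions that are not within distance 1 each contribute at least 2.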

  16. Outline • Problem Definition • Framework • HmSearch • Partitioning Scheme • Signature Generation • Enhanced Filtering • Hierarchical Filtering and Verification • Dimension Rearrangement • Conclusion • Experiment

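  17. Hierarchical Filtering and Verification • We can use binary operations to do hierarchical filtering and verification • Example: |Σ| = 8 (3 bits per value), k = 1, v = [5, 0, 3, 6], q = [5, 2, 2, 5]. A direct scan needs 4 comparisons to calculate hd(v, q) = 3 • Instead, store the vectors one bit level at a time and XOR level by level:
      3rd bit (least significant): v = 1010, q = 1001, XOR diff = 0011
      2nd bit:                     v = 0011, q = 0110, XOR diff = 0101
      1st bit (most significant):  v = 1001, q = 1001, XOR diff = 0000
      • After the 3rd-bit level alone, the diff 0011 already contains two 1s, so hd(v, q) >= 2 > k and v is filtered • Moreover, even if k = 4, OR-ing the three diffs gives 0111, so hd(v, q) = 3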

  18. Hierarchical Filtering and Verification • Example with k = 1, v = [5, 0, 3, 6], q = [5, 2, 3, 5]; cumdiff is the OR of all diffs so far, starting from 0000:
      Level  v     q     diff  cumdiff  Number of 1s in cumdiff
      3rd    1010  1011  0001  0001     1, <= 1, continue
      2nd    0011  0110  0101  0101     2, > 1, filtered
      1st    1001  1001  (not reached)
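
The level-by-level procedure from the two Hierarchical Filtering slides can be sketched with plain integers as bit vectors. The names `bit_slices` and `hierarchical_verify` are illustrative; the sketch assumes values are non-negative integers that fit in `bits` bits, and it recomputes the slices on every call, whereas an index would store them in advance.

```python
def bit_slices(vec, bits):
    """Slice a vector by bit level, least significant level first.

    Each slice is an integer whose binary digits, read left to right,
    hold that bit of dimension 1, 2, ..., N.
    """
    slices = []
    for level in range(bits):
        s = 0
        for x in vec:
            s = (s << 1) | ((x >> level) & 1)
        slices.append(s)
    return slices


def hierarchical_verify(v, q, bits, k):
    """Return hd(v, q) if it is <= k, or None as soon as the bound is exceeded."""
    cumdiff = 0
    for sv, sq in zip(bit_slices(v, bits), bit_slices(q, bits)):
        cumdiff |= sv ^ sq                 # dimensions known to differ so far
        if bin(cumdiff).count("1") > k:
            return None                    # filtered before reading all levels
    return bin(cumdiff).count("1")


v = [5, 0, 3, 6]
print(hierarchical_verify(v, [5, 2, 2, 5], bits=3, k=1))
print(hierarchical_verify(v, [5, 2, 2, 5], bits=3, k=4))
print(hierarchical_verify(v, [5, 2, 3, 5], bits=3, k=1))
```

A set bit in cumdiff can never be cleared by a later level, so the count of 1s is a lower bound on the distance at every step; that is what makes the early exit safe.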

  19. Outline • Problem Definition • Framework • HmSearch • Partitioning Scheme • Signature Generation • Enhanced Filtering • Hierarchical Filtering and Verification • Dimension Rearrangement • Conclusion • Experiment

  20. Impact of Data Skewness • Given k=2, then m = 1 and k'=1 • We propose to reset the dimension order and partition length to improve performance
      Original order: all vectors are qualified
             Partition1 | Partition2
      Dim    1  2  3    | 4  5  6
      q      1  1  1    | 1  0  0
      v1     1  1  1    | 0  0  0
      v2     0  0  0    | 2  0  0
      v3     2  0  2    | 0  0  0
      v4     3  0  0    | 0  0  0
      Rearranged order: only v1 is qualified
             Partition1 | Partition2
      Dim    1  2  5    | 4  3  6
      q      1  1  0    | 1  1  0
      v1     1  1  0    | 0  1  0
      v2     0  0  0    | 2  0  0
      v3     2  0  0    | 0  2  0
      v4     3  0  0    | 0  0  0
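
The effect of the dimension order on the candidate set can be reproduced from the slide's data. In this sketch a vector is a candidate when at least m = 1 partition is within k' = 1 of the query; the function name `candidates` and the 1-based `order` list are illustrative choices.

```python
def hd(a, b):
    """Hamming distance between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def candidates(q, vectors, order, part_len, k_prime=1):
    """Names of vectors with at least one partition within k_prime of q.

    order: 1-based dimension numbers listed in partition order.
    """
    def parts(x):
        arranged = [x[d - 1] for d in order]
        return [arranged[i:i + part_len] for i in range(0, len(arranged), part_len)]

    q_parts = parts(q)
    return [
        name
        for name, v in vectors.items()
        if any(hd(a, b) <= k_prime for a, b in zip(q_parts, parts(v)))
    ]


q = [1, 1, 1, 1, 0, 0]
data = {
    "v1": [1, 1, 1, 0, 0, 0],
    "v2": [0, 0, 0, 2, 0, 0],
    "v3": [2, 0, 2, 0, 0, 0],
    "v4": [3, 0, 0, 0, 0, 0],
}
print(candidates(q, data, order=[1, 2, 3, 4, 5, 6], part_len=3))
print(candidates(q, data, order=[1, 2, 5, 4, 3, 6], part_len=3))
```

In the original order, dimensions 5 and 6 are 0 for every vector, so the second partition is almost always within distance 1 of the query and filters nothing; spreading those skewed dimensions across partitions restores the filter's selectivity.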

  21. Greedy Dimension Rearrangement • MaxFreq is the max frequency of any value in each dimension • Our goal: minimize the global MaxFreq • MaxFreq values shown on the slide, for dimensions and for partitions (their assignment is not recoverable from this transcript): 1 3 3 3 4 4 1 1 4 1 2 4
      Original order
             Partition1 | Partition2
      Dim    1  2  3    | 4  5  6
      v1     1  1  1    | 0  0  0
      v2     0  0  0    | 2  0  0
      v3     2  0  2    | 0  0  0
      v4     3  0  0    | 0  0  0
      Rearranged order: achieves the goal
             Partition1 | Partition2
      Dim    5  1  2    | 6  3  4
      v1     0  1  1    | 0  1  0
      v2     0  0  0    | 0  0  2
      v3     0  2  0    | 0  2  0
      v4     0  3  0    | 0  0  0
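
The MaxFreq statistic can be computed directly from the four example vectors. The per-dimension version follows the slide's definition; the per-partition version is an assumption made here for illustration (the frequency of the most common combination of values inside a partition), and the sketch only evaluates two given orders rather than implementing the greedy search itself. Function names are hypothetical.

```python
from collections import Counter


def max_freq_per_dim(vectors):
    """For each dimension, the count of its most frequent value."""
    return [max(Counter(column).values()) for column in zip(*vectors)]


def max_freq_per_partition(vectors, order, part_len):
    """For each partition, the count of its most frequent value combination.

    order: 1-based dimension numbers listed in partition order.
    This partition-level definition is an assumption, not taken from the slide.
    """
    result = []
    for start in range(0, len(order), part_len):
        dims = order[start:start + part_len]
        combos = Counter(tuple(v[d - 1] for d in dims) for v in vectors)
        result.append(max(combos.values()))
    return result


vectors = [
    [1, 1, 1, 0, 0, 0],  # v1
    [0, 0, 0, 2, 0, 0],  # v2
    [2, 0, 2, 0, 0, 0],  # v3
    [3, 0, 0, 0, 0, 0],  # v4
]
print(max_freq_per_dim(vectors))
print(max_freq_per_partition(vectors, [1, 2, 3, 4, 5, 6], part_len=3))
print(max_freq_per_partition(vectors, [5, 1, 2, 6, 3, 4], part_len=3))
```

Under this assumed definition, the rearranged order lowers the largest per-partition frequency from 3 to 1, which matches the slide's stated goal of minimizing the global MaxFreq.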

  22. Outline • Problem Definition • Framework • HmSearch • Partitioning Scheme • Signature Generation • Enhanced Filtering • Hierarchical Filtering and Verification • Dimension Rearrangement • Conclusion • Experiment

  23. Conclusion • General Partition Scheme • 1-variants and 1-deletion-variants • Techniques that help boost the performance • Enhanced Filtering • Hierarchical Filtering and Verification • Dimension Rearrangement

  24. Outline • Problem Definition • Framework • HmSearch • Partitioning Scheme • Signature Generation • Enhanced Filtering • Hierarchical Filtering and Verification • Dimension Rearrangement • Conclusion • Experiment

  25. Experiment Settings • Environment • Intel Xeon X3330 2.664GHz CPU, 4GB RAM • Debian 5.0.6 • AMD Opteron™ 8378 2.4GHz CPU, 96GB RAM (for PubChem) • Ubuntu/Linaro 4.6.4-1 ubuntu5 • All compiled with GCC 4.1.2 with -O3 • Dataset

  26. Experiment Settings • Terms • EF, Enhanced Filtering • HB, Hierarchical Binary Filter • RD, Rearranging Dimensions • Our algorithms • HSD, HSV, our proposed algorithms, the former using 1-deletion-variants as signatures and the latter using 1-variants as signatures • HSD-nEB, HSV-nEB, variations that remove EF and HB • HSD-nB, HSV-nB, variations that remove HB • HSD-nR, HSV-nR, variations that remove RD • Baseline algorithm • Scancount (Li et al., ICDE08) • State-of-the-art algorithms • Google (Manku et al., WWW07) • Hengine (Liu et al., ICDE11)

  27. Query time HSV has the best performance

  28. Candidate Size HSV has the smallest candidate size

  29. Effect of EF and HB EF and HB help improve the performance

  30. Effect of RD • RD boosts the performance for PubChem data

  31. Index Size • HSV and HSD have a larger index size

  32. Thank you
