
Similarity Search in High Dimensions via Hashing


Presentation Transcript


  1. Similarity Search in High Dimensions via Hashing Aristides Gionis, Piotr Indyk and Rajeev Motwani, Department of Computer Science, Stanford University presented by Jiyun Byun, Vision Research Lab, ECE, UCSB

  2. Outline • Introduction • Locality Sensitive Hashing • Analysis • Experiments • Concluding Remarks

  3. Introduction • Nearest neighbor search (NNS) • The curse of dimensionality • experimental approach : use heuristics • analytical approach • Approximate approach : ε-Nearest Neighbor Search (ε-NNS) • Goal : for any given query q ∈ R^d, return a point p ∈ P with d(q,p) ≤ (1+ε)·d(q,P), where d(q,P) is the distance from q to its closest point in P • right answers are much closer than irrelevant ones • time/quality trade-off

  4. Locality Sensitive Hashing (LSH) • Collision probability depends on the distance between points • higher collision probability for close objects • lower collision probability for those that are far apart • Given a query point, • hash it using a set of hash functions • inspect the entries in each bucket

  5. Locality Sensitive Hashing

  6. Locality Sensitive Hashing (LSH) : Setting • C : the largest coordinate among all points in the given dataset P of dimension d (P ⊂ R^d) • Embed P into the Hamming cube {0,1}^d' • dimension d' = C·d • v(p) = Unary_C(x_1)…Unary_C(x_d) • use the unary code for each point along each dimension : x ones followed by C−x zeros • isometric embedding • d_1(p,q) = d_H(v(p), v(q)) • the embedding preserves the distance between points
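A minimal sketch of this unary embedding in Python (the function name unary_embed is mine, and coordinates are assumed to be integers in {0,…,C}):

    def unary_embed(point, C):
        # Embed a point with integer coordinates in {0,...,C} into the
        # Hamming cube {0,1}^(C*d): Unary_C(x) is x ones then C-x zeros.
        bits = []
        for x in point:
            bits.extend([1] * x + [0] * (C - x))
        return bits

    # The embedding is isometric: the l1 distance between two points equals
    # the Hamming distance between their codes, e.g.
    # d_H = sum(a != b for a, b in zip(unary_embed(p, C), unary_embed(q, C)))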

  7. Locality Sensitive Hashing (LSH) : Hash functions (1/2) • Build hash functions on the Hamming cube in d' dimensions • Choose L subsets of the dimensions: I_1, I_2, …, I_L • each I_j consists of k elements of {1,…,d'} • found by sampling uniformly at random with replacement • Project each point on each I_j • g_j(p) = projection of p on I_j, obtained by concatenating the bit values of p for the dimensions in I_j • Store p in buckets g_j(p), j = 1,…,L
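A sketch of index construction under these definitions (in-memory dicts stand in for the paper's disk-resident tables; the names are mine):

    import random

    def build_indices(embedded, d_prime, k, L):
        # Choose L subsets I_1..I_L of k bit positions, sampled uniformly
        # at random with replacement, and hash every point into L tables.
        subsets = [[random.randrange(d_prime) for _ in range(k)] for _ in range(L)]
        tables = [{} for _ in range(L)]
        for pid, bits in enumerate(embedded):
            for I, table in zip(subsets, tables):
                g = tuple(bits[i] for i in I)   # g_j(p): projection of p on I_j
                table.setdefault(g, []).append(pid)
        return subsets, tables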

  8. Locality Sensitive Hashing (LSH) : Hash functions (2/2) • Two levels of hashing • LSH function • maps a point p to bucket g_j(p) • standard hash function • maps the contents of buckets into a hash table of size M • B : bucket capacity, α : memory utilization parameter
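The second level can be sketched as an ordinary hash of the bucket label into a table of M slots (M and the modular hash here are illustrative, not the paper's exact construction):

    def standard_hash(g, M):
        # Map the k projected bits (a tuple of 0s and 1s) to one of M slots;
        # each slot holds at most B points on disk.
        return hash(g) % M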

  9. Query processing • Search buckets g_j(q), j = 1,…,L • until c·L points are found (for a constant c) or all L indices are searched • Approximate K-NNS • output the K points closest to q • fewer if less than K points are found • (r, ε)-neighbor with parameter r
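A sketch of the query procedure under the assumptions above (max_points plays the role of c·L; dist is a caller-supplied distance function; candidates are ranked by true distance at the end):

    def query(q_bits, subsets, tables, points, q, K, max_points, dist):
        # Probe buckets g_1(q),...,g_L(q); stop early once max_points
        # candidates have been collected, then return the K closest.
        candidates = set()
        for I, table in zip(subsets, tables):
            g = tuple(q_bits[i] for i in I)
            candidates.update(table.get(g, ()))
            if len(candidates) >= max_points:
                break
        return sorted(candidates, key=lambda pid: dist(points[pid], q))[:K]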

  10. Analysis • A family H is (r_1, r_2, P_1, P_2)-sensitive if, for any points p, q • d(p,q) ≤ r_1 implies Pr[h(q) = h(p)] ≥ P_1 • d(p,q) > r_2 implies Pr[h(q) = h(p)] ≤ P_2 • where r_1 < r_2 and P_1 > P_2 • The family of single-bit projections on the Hamming cube H^d' is (r, r(1+ε), 1 − r/d', 1 − r(1+ε)/d')-sensitive • if d_H(q,p) = r (r bits on which p and q differ), then Pr[h(q) ≠ h(p)] = r/d'
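A quick numeric check of these probabilities, with illustrative values (a k-bit projection g collides with probability (1 − r/d')^k):

    d_prime, r, eps, k = 1000, 50, 0.5, 10
    p1 = 1 - r / d_prime              # Pr[h(q) = h(p)] at distance r
    p2 = 1 - r * (1 + eps) / d_prime  # Pr[h(q) = h(p)] at distance r(1+eps)
    print(p1 ** k, p2 ** k)           # ~0.60 vs ~0.46: concatenation widens the gap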

  11. LSH solves the (r, ε)-neighbor problem • Determine whether • there exists a point within distance r of the query point q • or all points are at least distance r(1+ε) away from q • In the former case, • return a point within distance r(1+ε) of q • Repeat the construction to boost the probability of success

  12. ε-NN problem • For a given query point q, • return a point p from the dataset P • reduce to multiple instances of the (r, ε)-neighbor problem • (r_0, ε)-neighbor, (r_0(1+ε), ε)-neighbor, (r_0(1+ε)^2, ε)-neighbor, …, up to r_max
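A sketch of the radius schedule this reduction uses (the function name and loop are mine):

    def radius_schedule(r0, eps, r_max):
        # Radii r0, r0(1+eps), r0(1+eps)^2, ... for the successive
        # (r, eps)-neighbor instances, capped at r_max.
        rs, r = [], r0
        while r < r_max:
            rs.append(r)
            r *= 1 + eps
        rs.append(r_max)
        return rs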

  13. Experiments (1/3) • Datasets • color histograms (Corel Draw) • n = 20,000; d = 8,…,64 • texture features (aerial photos) • n = 270,000; d = 60 • Query sets • Disk • a second-level bucket is directly mapped to a disk block

  14. Experiments (2/3) • Distance profiles of the datasets [figures: normalized frequency vs. interpoint distance, for color histograms and for texture features]

  15. Experiments (3/3) • Performance measures • speed : average number of disk blocks accessed • effective error : E = (1/|Q|) Σ_{q∈Q} d_LSH(q)/d*(q), where d_LSH(q) is the distance from q to the answer returned by LSH and d*(q) is the distance to the true nearest neighbor • miss ratio : the fraction of queries for which no answer was found
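The effective error is a direct average of per-query ratios, as in this transcription (lists hold per-query distances):

    def effective_error(d_lsh, d_star):
        # Average, over the query set Q, of the ratio between the distance
        # to the LSH answer and the distance to the true nearest neighbor.
        return sum(l / s for l, s in zip(d_lsh, d_star)) / len(d_lsh)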

  16. Experiments : color histogram (1/4) • Error vs. number of indices (L)

  17. Experiments : color histogram (2/4) • Dependence on n [figures: disk accesses vs. number of database points, for approximate 1-NNS and approximate 10-NNS]

  18. Experiments : color histogram (3/4) • Miss ratios [figures: miss ratio vs. number of database points, for approximate 1-NNS and approximate 10-NNS]

  19. Experiments : color histogram (4/4) • Dependence on d [figures: disk accesses vs. number of dimensions, for approximate 1-NNS and approximate 10-NNS]

  20. Experiments : texture features (1/2) • Number of indices vs. error

  21. Experiments : texture features (2/2) • Number of indices vs. size

  22. Concluding remarks • Locality Sensitive Hashing • fast approximate nearest neighbor search • dynamic/join versions • Future work • hybrid techniques : tree-based plus hashing-based
