Data-Dependent Hashing: Near-Linear Space and Improved Algorithms

Explore data-dependent hashing, time-space trade-offs, LSH in Euclidean spaces, and new approaches for efficient computation. Discover Hyperplanes and Voronoi algorithms for better space bounds. Learn about density clusters and the trade-off concept in hash functions.

mtschida

Presentation Transcript


  1. Lecture 12: More LSH, Data-dependent hashing

  2. Announcements & Plan • PS3: • Released tonight, due next Fri 7pm • Class projects: think of teams! • I’m away until Wed • Office hours on Thu after class • Kevin will teach on Tue • Evaluation on courseworks next week • LSH: better space • Data-dependent hashing • Scriber?

  3. Time-Space Trade-offs (Euclidean) • Low space, high query time • Medium space, medium query time • High space, low query time (1 mem lookup)

  4. Near-linear Space for [Indyk’01, Panigrahy’06] • Setting: • Close: • Far: • Algorithm: • Use one hash table with • On query : • compute • Repeat times: • flip each with probability • look up bucket and compute distance to all points there • If found an approximate near neighbor, stop Sample a few buckets in the same hash table!
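The query procedure on this slide (one hash table, several perturbed probes) can be sketched as follows. This is a minimal illustration, not the lecture's exact algorithm: it assumes binary points in Hamming space and a bit-sampling hash, and the names `build_table`, `query`, `coords`, `trials`, `flip_prob` and `cr` are hypothetical, since the transcript lost the actual parameter values.

```python
import random
from collections import defaultdict

def hamming(a, b):
    """Number of coordinates in which two bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def build_table(points, coords):
    """A single hash table; the key is the point restricted to `coords`."""
    table = defaultdict(list)
    for p in points:
        table[tuple(p[i] for i in coords)].append(p)
    return table

def query(table, coords, q, cr, trials, flip_prob, rng):
    """Probe `trials` randomly perturbed buckets of the same table.

    `trials` and `flip_prob` are assumed parameters (values not in the source).
    Returns a point within distance `cr` of `q`, or None if none was found.
    """
    key = [q[i] for i in coords]
    for _ in range(trials):
        # Flip each bit of the query's hash independently with prob. flip_prob.
        probe = tuple(b ^ 1 if rng.random() < flip_prob else b for b in key)
        for p in table.get(probe, ()):
            if hamming(p, q) <= cr:
                return p  # approximate near neighbor found, stop
    return None
```

The space stays linear because only one table is stored; the extra work moves to query time, where more probes trade time for success probability.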

  5. Near-linear Space • Theorem: for , we have: • Pr[find an approx near neighbor] • Expected runtime: • Proof: • Let be the near neighbor: • , • Claim 1: • Claim 2: • Claim 3: • If at least for one , we are guaranteed to output either or an approx. near neighbor

  6. Beyond LSH Hamming space LSH Euclidean space LSH

  7. New approach? Data-dependent hashing • A random hash function, chosen after seeing the given dataset • Efficiently computable

  8. Construction of hash function [A.-Indyk-Nguyen-Razenshteyn’14, A.-Razenshteyn’15] • Warning: hot off the press! • Two components: • Nice geometric structure (has better LSH) • Reduction to such structure (data-dependent)

  9. Nice geometric structure • Like a random dataset on a sphere • s.t. random points at distance • Query: • At angle 45° from near-neighbor [Pictures courtesy of Ilya Razenshteyn]

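  10. Alg 1: Hyperplanes [Charikar’02] • Sample unit $r$ uniformly, hash $p$ into $h(p) = \mathrm{sgn}\langle r, p \rangle$ • $\Pr[h(p) = h(q)] = 1 - \alpha/\pi$, • where $\alpha$ is the angle between $p$ and $q$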
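A small simulation can make the hyperplane hash concrete. The sketch below hashes a point to the sign of its inner product with a random unit vector and estimates the collision rate empirically, which should approach one minus the angle between the two points divided by pi. The function names are hypothetical, and the unit vector is drawn by normalizing a Gaussian vector, which is one standard way to sample uniformly from the sphere.

```python
import math
import random

def random_unit_vector(d, rng):
    """Uniform direction on the unit sphere via a normalized Gaussian."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in g))
    return [x / norm for x in g]

def hyperplane_hash(r, p):
    """Sign of the inner product <r, p>: which side of the hyperplane p is on."""
    return 1 if sum(a * b for a, b in zip(r, p)) >= 0 else -1

def collision_rate(p, q, trials, rng):
    """Empirical Pr[h(p) = h(q)] over `trials` independent hyperplanes."""
    hits = 0
    for _ in range(trials):
        r = random_unit_vector(len(p), rng)
        hits += hyperplane_hash(r, p) == hyperplane_hash(r, q)
    return hits / trials

rng = random.Random(0)
p = [1.0, 0.0]
near = [math.cos(math.pi / 4), math.sin(math.pi / 4)]  # 45 degrees from p
far = [0.0, 1.0]                                       # orthogonal to p
print(collision_rate(p, near, 20000, rng))  # close to 3/4
print(collision_rate(p, far, 20000, rng))   # close to 1/2
```

For the 45-degree near neighbor the formula gives 3/4, and for an orthogonal point it gives 1/2. If far points are taken to be essentially orthogonal to the query (an assumption here), the resulting exponent is log(4/3)/log(2), roughly 0.415.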

  11. Alg 2: Voronoi [A.-Indyk-Nguyen-Razenshteyn’14] based on [Karger-Motwani-Sudan’94] • Sample $T$ i.i.d. standard $d$-dimensional Gaussians $g_1, \dots, g_T$ • Hash $p$ into $h(p) = \arg\max_{1 \le i \le T} \langle p, g_i \rangle$ • $T = 2$ is simply Hyperplane LSH
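The Voronoi hash can be sketched in a few lines: draw several Gaussian vectors and map a point to the index of the one with the largest inner product. This is an illustrative sketch under that reading of the slide; the names `sample_gaussians` and `voronoi_hash` and the parameter `T` (number of Gaussians) are assumptions, not names from the lecture.

```python
import random

def sample_gaussians(T, d, rng):
    """T independent standard d-dimensional Gaussian vectors."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(T)]

def voronoi_hash(gaussians, p):
    """Index of the Gaussian with the largest inner product with p."""
    best, best_val = 0, float("-inf")
    for i, g in enumerate(gaussians):
        val = sum(a * b for a, b in zip(g, p))
        if val > best_val:
            best, best_val = i, val
    return best

rng = random.Random(0)
gs = sample_gaussians(8, 5, rng)
point = [rng.gauss(0.0, 1.0) for _ in range(5)]
print(voronoi_hash(gs, point))  # a bucket index in range(8)
```

With two Gaussians the bucket is decided by the sign of the inner product with their difference, which is a random hyperplane through the origin. The hash also depends only on the direction of the point, which fits a dataset living on a sphere.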

  12. Hyperplane vs Voronoi • Hyperplane with hyperplanes • Means we partition space into pieces • Voronoi with vectors • grids vs spheres

  13. NNS: conclusion • 1. Via sketches • 2. Locality Sensitive Hashing • Random space partitions • Better space bound • Even near-linear! • Data-dependent hashing even better • Used in practice a lot these days

  14. The following was not presented in the lecture

  15. Reduction to nice structure (HL) • Why ok? • no dense clusters • like “random dataset” with radius= • even better! radius = radius = • Idea: iteratively decrease the radius of minimum enclosing ball • Algorithm: • find dense clusters • with smaller radius • large fraction of points • recurse on dense clusters • apply VoronoiLSH on the rest • recurse on each “cap” • e.g., dense clusters might reappear *picture not to scale & dimension

  16. Hash function radius = Described by a tree (like a hash table) *picture not to scale & dimension

  17. Dense clusters trade-off trade-off ? • Current dataset: radius • A dense cluster: • Contains points • Smaller radius: • After we remove all clusters: • For any point on the surface, there are at most points within distance • The other points are essentially orthogonal ! • When applying Cap Carving with parameters : • Empirical number of far pts colliding with query: • As long as , the “impurity” doesn’t matter!

  18. Tree recap trade-off • During query: • Recurse in all clusters • Just in one bucket in VoronoiLSH • Will look in >1 leaf! • How much branching? • Claim: at most • Each time we branch • at most clusters () • a cluster reduces radius by • cluster-depth at most • Progress in 2 ways: • Clusters reduce radius • CapCarving nodes reduce the # of far points (empirical ) • A tree succeeds with probability

  19. Fast preprocessing • How to find the dense clusters fast? • Step 1: reduce to time. • Enough to consider centers that are data points • Step 2: reduce to near-linear time. • Down-sample! • Ok because we want clusters of size • After downsampling by a factor of , a cluster is still somewhat heavy.
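The two speed-ups on this slide (restrict candidate centers to data points, then down-sample) can be illustrated with the sketch below. It is a simplified illustration, not the construction from the papers: the thresholds `min_fraction`, `sample_rate` and `slack` are hypothetical parameters standing in for the values lost from the transcript, and the counting loop is still quadratic in the sample size.

```python
import math
import random

def dist(a, b):
    """Euclidean distance between two points given as sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def find_dense_cluster(points, radius, min_fraction, sample_rate, rng, slack=0.5):
    """Look for a ball of the given radius holding >= min_fraction of points.

    Step 1: only data points are tried as ball centers.
    Step 2: counting is done on a random sample; a heavy cluster should
            remain fairly heavy in the sample (`slack` is an assumed margin).
    Returns (center, members) or None.
    """
    sample = [p for p in points if rng.random() < sample_rate]
    need_in_sample = slack * min_fraction * len(sample)
    need_in_full = min_fraction * len(points)
    for c in sample:
        if sum(dist(c, p) <= radius for p in sample) >= need_in_sample:
            # Confirm the candidate on the full dataset before accepting it.
            members = [p for p in points if dist(c, p) <= radius]
            if len(members) >= need_in_full:
                return c, members
    return None
```

Verifying each candidate on the full dataset keeps the sketch from reporting a ball that only looks heavy in the sample; a real implementation would bound how often that check runs.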

  20. Other details • In the analysis, • Instead of working with “probability of collision with far point” , work with “empirical estimate” (the actual number) • A little delicate: interplay with “probability of collision with close point”, • The empirical estimate is important only for the bucket that the query falls into • Need to condition on collision with close point in the above empirical estimate • In dense clusters, points may appear inside the balls • whereas VoronoiLSH works for points on the sphere • need to partition balls into thin shells (introduces more branching)
