
Nearest Neighbor Retrieval Using Distance-Based Hashing

Vassilis Athitsos (University of Texas, Arlington), Michalis Potamias+ (Boston University), Panagiotis Papapetrou (Boston University), George Kollios (Boston University)


Presentation Transcript


  1. Nearest Neighbor Retrieval Using Distance-Based Hashing
     Vassilis Athitsos (University of Texas, Arlington), Michalis Potamias+ (Boston University), Panagiotis Papapetrou (Boston University), George Kollios (Boston University)

  2. nearest neighbor problem
     • Setting:
        • database of objects S
        • distance function D
     • Given:
        • query Q (previously unseen)
     • Find and Return:
        • object P* from S that is closest to Q
     • NNs appear in various applications under many different distance functions
        • classification of handwritten digits
        • hand-pose estimation
     • Can perform linear scan…
     • Cost
        • large S
        • expensive D
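
The linear-scan baseline mentioned on this slide can be sketched as follows. This is a minimal illustration, not code from the presentation: the function names and the Hamming distance used in the example are assumptions chosen only to make the sketch runnable. The point is that the scan calls D once per database object, which is the cost the rest of the talk tries to reduce.

```python
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def linear_scan_nn(
    query: T, database: Sequence[T], distance: Callable[[T, T], float]
) -> Tuple[int, float]:
    """Return (index, distance) of the database object closest to the query.

    Costs exactly len(database) distance computations.
    """
    best_index, best_dist = -1, float("inf")
    for i, obj in enumerate(database):
        d = distance(query, obj)
        if d < best_dist:  # strict '<' keeps the first of several ties
            best_index, best_dist = i, d
    return best_index, best_dist


def hamming(a: str, b: str) -> int:
    """Illustrative distance only: number of differing positions."""
    return sum(ca != cb for ca, cb in zip(a, b))


# Example: the query is compared against every object in the database.
nearest = linear_scan_nn("1011", ["0000", "1111", "1010"], hamming)
```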

  3. cost model
     • Dominating cost: the distance function may be very “expensive”
        • Time series (DTW)
        • String alignment (Edit)
        • Computer vision
     • Cost model: minimize the number of distance computations
     [Figure: dynamic programming for edit distance]
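
To make the cost model concrete, here is a small sketch of the edit-distance dynamic program named on the slide, plus a wrapper that counts distance computations. The `CountingDistance` class is a hypothetical helper introduced here for illustration; the slides only state that the number of distance computations is the quantity to minimize.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming, O(len(a) * len(b)) time."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if ca == cb else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1]


class CountingDistance:
    """Hypothetical wrapper: counts how many times the distance is invoked."""

    def __init__(self, distance):
        self.distance = distance
        self.calls = 0

    def __call__(self, x, y):
        self.calls += 1
        return self.distance(x, y)


counted = CountingDistance(edit_distance)
counted("kitten", "sitting")
# counted.calls now holds the number of distance computations performed.
```

Under this cost model, two retrieval methods are compared by `calls`, not by wall-clock time.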

  4. some existing solutions
     • If objects are low dimensional, exact nearest neighbors are fast
     • If objects are high dimensional, for some distance functions (Hamming) approximate nearest neighbors are fast, using LSH
     • However, in many interesting settings “linear scan” may be the only approach for exact NNs
        • high dimensional, non-metric

  5. dbh setting
     • No assumptions for the distance function
        • probably non-metric
     • Distance function computations dominate the cost
     • Trade perfect accuracy for faster results

  6. dbh method overview
     • Preprocess:
        • Hash database using appropriate functions
     • Query Q arrives:
        • Hash it!
        • Filter: Retrieve colliding objects as “candidate NNs”
        • Refine: Compute the actual distance between query and candidates
        • Return: Candidate that is closest to Q

  7. Background

  8. Background: hash-based indexing
     • Building the index
     • Query Time
     • Use L tables in parallel
     [Figure: the database hashed by h1, h2, …, hL into L hash tables]
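
A generic L-table index of the kind this slide describes might look like the sketch below. The class name `MultiTableIndex` and the toy integer hash functions are assumptions made for illustration; any callables that map an object to a hashable key would do.

```python
from collections import defaultdict


class MultiTableIndex:
    """Hash-based index with one table per hash function (L tables in total)."""

    def __init__(self, hash_functions):
        self.hash_functions = list(hash_functions)
        self.tables = [defaultdict(list) for _ in self.hash_functions]

    def build(self, database):
        """Building the index: store each object's id in every table."""
        for obj_id, obj in enumerate(database):
            for h, table in zip(self.hash_functions, self.tables):
                table[h(obj)].append(obj_id)

    def candidates(self, query):
        """Query time: ids of objects colliding with the query in any table."""
        found = set()
        for h, table in zip(self.hash_functions, self.tables):
            found.update(table.get(h(query), ()))
        return found


# Toy example with L = 2 hash functions over integers (illustrative only).
index = MultiTableIndex([lambda x: x // 10, lambda x: (x + 5) // 10])
index.build([3, 12, 18, 27])
colliding = index.candidates(14)
```

Using several tables raises the chance that a true neighbor collides with the query in at least one of them, at the price of more candidates to check.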

  9. Background: locality sensitive hashing
     • Choice of hash functions is important!
     • LSH family of functions [IM98]
     • An LSHF in a hash-based indexing scheme guarantees sublinear behavior for approximate NNs!
     • Such families have been constructed for Hamming, L2…
     • What if there is no LSH family for the distance function used?
        • Edit, DTW etc.
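
For reference, the standard definition of a locality sensitive family can be written as below. The slide does not spell it out; the notation ($r_1$, $r_2$, $p_1$, $p_2$, $\mathcal{H}$) follows the usual textbook statement and is an assumption here. A family $\mathcal{H}$ is called $(r_1, r_2, p_1, p_2)$-sensitive for a distance $D$ if, for any objects $x, y$:

```latex
% (r_1, r_2, p_1, p_2)-sensitivity, with r_1 < r_2 and p_1 > p_2
\begin{align*}
D(x, y) \le r_1 &\;\Longrightarrow\; \Pr_{h \in \mathcal{H}}\bigl[h(x) = h(y)\bigr] \ge p_1,\\
D(x, y) \ge r_2 &\;\Longrightarrow\; \Pr_{h \in \mathcal{H}}\bigl[h(x) = h(y)\bigr] \le p_2.
\end{align*}
```

In words: close objects collide often and far objects collide rarely, which is the property that distances such as Edit or DTW are not known to support.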

  10. Distance Based Hashing
     • Hash-based indexing scheme
     • Can be applied to any space and any D
     • Its hash functions treat D as a black box
     • Optimization

  11. DBH: family of hash functions
     • Pseudo-line projection [FL95] maps an object onto the real line
        • y, z are pivot points from the database
        • Project x on the y-z pseudo-line
     • Use a threshold to make it discrete valued
     • (−) This family is not an LSHF
     • (+) The definition does not depend on the specific distance function, only on the 3 pairwise distances
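
The pseudo-line projection can be sketched with the FastMap-style formula, which uses only the three pairwise distances among x, y and z. The discretization below (a single threshold set to the median projection of a sample, so that roughly half the sample maps to each bit) is an assumption for illustration; the slide only says that a threshold is used.

```python
from statistics import median


def line_projection(x, y, z, distance):
    """Project x onto the pseudo-line through pivots y and z.

    Uses only D(x, y), D(x, z) and D(y, z); D is treated as a black box.
    """
    d_yz = distance(y, z)
    if d_yz == 0:
        raise ValueError("pivots must be at non-zero distance from each other")
    return (distance(x, y) ** 2 + d_yz ** 2 - distance(x, z) ** 2) / (2 * d_yz)


def make_binary_hash(y, z, distance, sample):
    """Assumed discretization: bit is 1 when the projection reaches the sample median."""
    threshold = median(line_projection(s, y, z, distance) for s in sample)

    def h(x):
        return 1 if line_projection(x, y, z, distance) >= threshold else 0

    return h


# On the real line with D(a, b) = |a - b| the projection recovers the coordinate.
absolute = lambda a, b: abs(a - b)
h = make_binary_hash(0, 10, absolute, sample=[1, 2, 8, 9])
```

Since D(y, z) can be computed once in advance, evaluating one such bit on a new object costs two distance computations.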

  12. DBH: method
     • Preprocessing:
        • Use a random choice of K of these pseudo-line projections to define a hash function
        • Build L such (K-bit) functions
        • Hash all objects of S to the L hash tables
     • At query time:
        • Apply the same L functions to Q
        • Filter: Retrieve colliding objects (candidate set)
        • Refine: Invoke D for candidates
        • Return: Nearest*
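
The preprocessing and query steps above can be put together in a compact sketch. Everything named here (`DBHIndex`, `projection`, the median threshold, the `seed` parameter) is a hypothetical choice for illustration and not the authors' implementation; in particular, thresholds are computed over the whole database for simplicity, where a sample would normally be used.

```python
import random
from collections import defaultdict
from statistics import median


def projection(x, y, z, dist):
    """Pseudo-line projection of x onto the line through pivots y and z."""
    d_yz = dist(y, z)
    return (dist(x, y) ** 2 + d_yz ** 2 - dist(x, z) ** 2) / (2 * d_yz)


class DBHIndex:
    """Sketch: L hash tables, each keyed by K binary pseudo-line projections."""

    def __init__(self, database, dist, K, L, seed=0):
        rng = random.Random(seed)
        self.database, self.dist = list(database), dist
        self.functions = []  # L lists of K (y, z, threshold) triples
        for _ in range(L):
            bits = []
            while len(bits) < K:
                y, z = rng.sample(self.database, 2)  # random pivot pair
                if dist(y, z) == 0:
                    continue  # degenerate pair: projection undefined
                t = median(projection(s, y, z, dist) for s in self.database)
                bits.append((y, z, t))
            self.functions.append(bits)
        # Hash every object of the database into each of the L tables.
        self.tables = [defaultdict(list) for _ in range(L)]
        for obj_id, obj in enumerate(self.database):
            for bits, table in zip(self.functions, self.tables):
                table[self._key(obj, bits)].append(obj_id)

    def _key(self, x, bits):
        return tuple(int(projection(x, y, z, self.dist) >= t) for y, z, t in bits)

    def query(self, q):
        """Filter by collisions, refine with D; returns an index or None."""
        candidates = set()
        for bits, table in zip(self.functions, self.tables):
            candidates.update(table.get(self._key(q, bits), ()))
        if not candidates:
            return None
        return min(candidates, key=lambda i: self.dist(q, self.database[i]))


db = list(range(0, 100, 2))
index = DBHIndex(db, lambda a, b: abs(a - b), K=2, L=4)
result = index.query(10)  # index of the best colliding candidate
```

The returned object is the nearest among the candidates only, which is why the result is approximate: the true nearest neighbor is missed whenever it collides with the query in none of the L tables.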

  13. DBH: accuracy vs cost
     • Accuracy: Percentage of queries for which DBH returns the true NN
     • Cost: Number of distance computations
     • Problem: Given a desired accuracy, minimize the cost
     • The choice of K, L affects both the cost and the accuracy
     • Sampling: approximate distributions
        • Probability of NNs colliding
        • Probability of non-NNs colliding
     • Perform binary search for the best (K, L)

  14. DBH: accuracy
     • C(Q,N(Q)): probability of collision between any query Q and its nearest neighbor N(Q) for a single projection function
     • Employ sampling to estimate C(Q,N(Q))
     • Use K and L to shift the distribution to the desired accuracy
     • Probability of collision in at least one of the L K-bit tables
     • … and compute the expected accuracy
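
As a rough sketch of how K and L shift the collision probability: if c = C(Q,N(Q)) is the single-bit collision probability and the bits are drawn independently, a K-bit key collides with probability c^K, and at least one of L tables collides with probability 1 − (1 − c^K)^L. The independence assumption, the function names, and the search over L for a fixed K are illustrative choices, not details given on the slide.

```python
def collision_probability(c: float, K: int, L: int) -> float:
    """Probability of colliding in at least one of L tables with K-bit keys."""
    return 1.0 - (1.0 - c ** K) ** L


def estimated_accuracy(sample_c, K: int, L: int) -> float:
    """Average collision probability over sampled (query, NN) pairs."""
    return sum(collision_probability(c, K, L) for c in sample_c) / len(sample_c)


def smallest_L(sample_c, K: int, target: float, max_L: int = 1024):
    """Binary search for the smallest L reaching the target accuracy.

    Valid because the estimated accuracy never decreases as L grows.
    Returns None if even max_L tables are not enough.
    """
    if estimated_accuracy(sample_c, K, max_L) < target:
        return None
    lo, hi = 1, max_L
    while lo < hi:
        mid = (lo + hi) // 2
        if estimated_accuracy(sample_c, K, mid) >= target:
            hi = mid
        else:
            lo = mid + 1
    return lo


# sample_c would come from sampling queries and estimating C(Q, N(Q)).
tables_needed = smallest_L([0.5], K=1, target=0.9)
```

Raising K makes each table more selective (fewer collisions overall), while raising L recovers accuracy, so the two are tuned together.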

  15. DBH: cost
     • Hash and Lookup
        • HashCost: number of distance computations needed to evaluate the hash functions
        • LookupCost: number of objects that collide in at least one of the L hash tables
     • Query Cost: HashCost + LookupCost
     • Total Cost (for all queries): sum of the Query Cost over all queries
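
The cost decomposition can be written out as below. This is a sketch under stated assumptions rather than the formulas from the original slide: $C_{K,L}(Q, X) = 1 - (1 - C(Q, X)^K)^L$ denotes the probability that $Q$ and $X$ collide in at least one table (assuming independently drawn bits), the expectation is over the random choice of hash functions, and the bound on HashCost assumes the pivot-to-pivot distances are precomputed, so each of the $KL$ bits needs at most two new distance computations.

```latex
\begin{align*}
\mathrm{Cost}_{K,L}(Q) &= \mathrm{HashCost}_{K,L} + \mathrm{LookupCost}_{K,L}(Q),\\
\mathrm{HashCost}_{K,L} &\le 2KL,\\
\mathbb{E}\bigl[\mathrm{LookupCost}_{K,L}(Q)\bigr] &= \sum_{X \in S} C_{K,L}(Q, X),\\
\mathrm{TotalCost}_{K,L} &= \sum_{Q} \mathrm{Cost}_{K,L}(Q).
\end{align*}
```

HashCost can fall below $2KL$ when the same pivot appears in several projections, since its distance to the query only needs to be computed once.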

  16. DBH: further optimization
     • Hierarchical DBH
        • Build M parallel DBH indices for different subsets of queries
        • Partition according to the distribution of D(Q,N(Q))
        • Queries that are close to their NN are “easier”
     • Reduce HashCost by restricting HDBH to a small subset of database pivot points for the projections

  17. Experiments

  18. experiments: datasets
     • We test DBH on 3 datasets:
        • Unipen (time series ~30 – digits)
           • Dynamic Time Warping
           • 10K (test: 5K)
        • MNIST (images 28x28 – digits)
           • Shape Context Matching
           • 60K (test: 10K)
        • Hands (images 256x256 – hand-pose)
           • Chamfer Distance
           • 80K (test: 1K)

  19. experiments: results
     • Training set
        • used to optimize K, L
     • Test set
        • used to run the experiment
     • Compare to a modified VP-tree
        • handles non-metric data
     • Accuracy vs Cost plot
        • X-axis: Accuracy
        • Y-axis: Distance Computations

  20. experiments: results

  21. conclusion
     • Distance Based Hashing is a hash-based indexing framework for NN retrieval
     • Not sublinear, just a speedup
     • General purpose: no properties assumed for the distance function (black box)
     • May be further optimized for bigger speedups
     • Future: Can we build a scheme for a “black box” distance function and provide a statistical argument for behavior that is sublinear in the size of the database?

  22. thank you!
     Famous NNs: Castor (Κάστωρ) and Polydeuces (Πολυδεύκης)
