Nearest Neighbor Retrieval Using Distance-Based Hashing

Nearest Neighbor Retrieval Using Distance-Based Hashing Vassilis Athitsos Michalis Potamias+ University of Texas, ArlingtonBoston University Panagiotis Papapetrou George Kollios Boston UniversityBoston University

nearest neighbor problem • Setting: • database of objects S • distance function D • Given: • query Q (previously unseen) • Find and Return: • object P* from S, that is closest to Q • NNs appear in various applications under many different distance functions • classification of handwritten digits • hand-pose estimation • Can perform linear scan… • Cost • large S • expensive D Distance Based Hashing

cost model • Dominating cost: Distance function may be very “expensive” • Time series (DTW) • String Alignment (Edit) • Computer vision • Cost Model: minimize number of distance computations Dynamic Programming for Edit Distance Distance Based Hashing

some existing solutions • If objects are low dimensional, exact nearest neighbors are fast • If objects are high dimensional, for some distance functions (Hamming) approximate nearest neighbors are fast, using LSH • However in many interesting settings “linear scan” may be the only approach for exact NNs • high dimensional, non-metric Distance Based Hashing

dbh setting • No assumptions for the distance function • probably non-metric • Distance function computations dominate the cost • Trade perfect accuracy for faster results Distance Based Hashing

dbh method overview • Preprocess: • Hash database using appropriate functions • Query Q arrives: • Hash it! • Filter: Retrieve colliding objects as “candidate NNs” • Refine: Compute the actual distance between query and candidates • Return: Candidate that is closest to Q Distance Based Hashing

Background

Background: hash – based indexing • Building the index • Query Time • Use L tables in parallel h1 database h1 } h1 h2 … L hL Distance Based Hashing

Background: locality sensitive hashing • Choice of Hash Functions is important! • LSH family of functions [IM98] • An LSHF in a Hash-based Indexing scheme guarantees sublinear behavior for approximate NNs! • Such families have been constructed for Hamming, L2… • What if there is no LSH family for the Distance function used? • Edit, DTW etc. Distance Based Hashing

Distance Based Hashing Hash based Indexing scheme Can be applied to any space& any D Its hash functions treat D as a black box Optimization

DBH: family of hash functions • Pseudo-Line projection [FL95] maps an object into the real line • y,z are pivot-points from the database • Project x on the y-z pseudoline • Use a threshold to make it discrete valued • - - This family is not an LSHF • ++ Definition does not depend on the specific distance function, only on the 3 pairwise distances. Distance Based Hashing

DBH: method • Preprocessing: • Use a random choice of K ofthese pseudoline projections to define a hash function • Build L such (K-bit) functions • Hash all objects of S to the L h-tables • At query time: • Apply the same L functions to Q • Filter : Retrieve colliding objects (candidate set) • Refine: Invoke D for candidates • Return: Nearest* Distance Based Hashing

DBH: accuracy vs cost • Accuracy : Percentage of queries for which DBH returns true NN • Cost: Amount of distance computations • Problem: Given desired accuracy minimize the cost • Choice of K,Laffects the cost and the accuracy • Sampling: approximate distributions • Probability of NNs colliding • Probability of non-NNs colliding • Perform binary search for best (K,L) K, L Distance Based Hashing

DBH: accuracy • Probability of collision between any Query Q and its Nearest Neighbor N(Q) for a single projection function • Employ sampling to estimate C(Q,N(Q)) • Use K and L to shift distribution to desired accuracy • Probability of collision in at least one of the L K-bit tables • …and compute  Distance Based Hashing

DBH: cost • Hash and LookUp  • HashCost: Number of distance computations to evaluate hash functions • LookupCost: number of objects that collide in at least one of the Lhash tables • Query Cost: • Total Cost (for all Queries): Distance Based Hashing

DBH: further optimization • Hierarchical DBH • Build M parallel DBH indices for different subsets of queries • Partition according to distribution D(Q,N(Q)) • Queries that are close to their NN are “easier” • Reduce HashCost by restricting HDBH to a small subset of database pivot-points for the projections Distance Based Hashing

Experiments

experiments: datasets • We test DBH on 3 datasets: • Unipen (timeseries ~30 – digits) • Dynamic Time Warping • 10K (test: 5K) • MNIST (images 28x28 – digits) • Shape Context Matching • 60K (test: 10K) • Hands (images 256x256 – hand-pose) • Chamfer Distance • 80K (test: 1K) Distance Based Hashing

experiments: results • Training-set • to opt K, L • Test-set  experiment • Compare to modified VP-tree • handles non-metric data • Accuracy vs Cost plot • X-axis : Accuracy • Y-axis : Distance Computations Distance Based Hashing

experiments: results Distance Based Hashing

conclusion • Distance Based Hashing is a hash-based indexing framework for NN retrieval • Not sublinear, just speedup • General purpose: No properties assumed for distance function - black box • May be further optimized for bigger speedups • Future: Can we build a scheme for “black box” distance function and provide a statistical argument for sublinear behavior to the size of the database? Distance Based Hashing

thank you! Famous NNs : Castor (Κάστωρ) and Polydeuces (Πολυδεύκης) Distance Based Hashing

Nearest Neighbor Retrieval Using Distance-Based Hashing