Asymmetric Distance Estimation with Sketches for Similarity Search in High-Dimensional Spaces

Wei Dong. Joint work with Moses Charikar and Kai Li. Computer Science Department, Princeton University

Motivations • Feature-rich data is growing explosively • Images, audio, video, scientific sensor data • Content-based retrieval is needed • Features are extracted with domain-specific algorithms • K-NN search in feature space • Feature size becomes the bottleneck • Features are high-dimensional, so tree-based indexes fail • New methods like LSH are mostly smart ways of scanning

Feature size is growing • Domain experts are aiming for high precision, and new methods are being developed • Example: image features • ~1995: 166D color histogram • 1999: SIFT 500 × 128D vectors/image = 64 KB/image • That’s almost the size of an image!

K-NN Search with Sketches [Figure: data objects and the query object each pass through feature extraction and sketch construction; the sketch database is used for filtering, and the results are ranked with features] • Sketch: compact approximation of a large object [Lv04] • feature vector → bit vector • L1 distance → Hamming distance
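To make the "Hamming distance" side of this mapping concrete, here is a minimal sketch that counts differing bits between two sketches stored as packed bytes. The function name and the byte packing are assumptions chosen for illustration; the talk does not specify a storage layout.

```python
def hamming_distance(a: bytes, b: bytes) -> int:
    """Count the bit positions in which two equal-length packed sketches differ."""
    if len(a) != len(b):
        raise ValueError("sketches must have the same length")
    # XOR leaves a 1 exactly where the two sketches disagree.
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))
```

With 32-byte sketches this compares 256 bits per data point using only XOR and bit counting, which is what makes scanning a sketch database cheap.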

Our contribution • A new sketch for L2 distance • Asymmetric distance estimation • Using the sketch of a data point + the raw query point • Applies to our proposed sketch and others • Evaluation with real-life image and audio data • Sketches of < 10% of the feature size for > 90% recall • Further 20% ~ 40% size reduction with asymmetric estimators

L2 Sketch: the Idea • Randomly partition the space into stripes • Orange = 1; white = 0 • More random partitions to make a bit vector • Hamming distance reflects point proximity

L2 Sketch: the Proposed Scheme [Figure: stripe partition with a dimension labeled W]
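The stripe idea can be sketched in code as follows. This is an illustration under stated assumptions, not the formula from the slide (which did not survive in the transcript): each bit projects the point onto a random Gaussian direction, adds a random shift, and takes the parity of the stripe index, with W read as the stripe width. The names `make_l2_sketcher`, `width`, and `seed`, and the choice of distributions, are hypothetical.

```python
import math
import random

def make_l2_sketcher(dim, n_bits, width, seed=0):
    """Return a function mapping a feature vector to a list of n_bits 0/1 values."""
    rng = random.Random(seed)
    # One random direction and one random shift per bit (assumed distributions).
    directions = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]
    shifts = [rng.uniform(0.0, 2.0 * width) for _ in range(n_bits)]

    def sketch(x):
        bits = []
        for a, b in zip(directions, shifts):
            projection = sum(ai * xi for ai, xi in zip(a, x)) + b
            # Stripes of width `width` alternate 0, 1, 0, 1, ... along the direction.
            bits.append(math.floor(projection / width) % 2)
        return bits

    return sketch
```

Two nearby points usually fall into the same stripe and agree on a bit, while points farther apart disagree more often, so the Hamming distance between sketches grows with the L2 distance up to a range governed by the stripe width.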

Asymmetric Estimator: the Idea [Figure: the same search pipeline, with "Information loss!" marked at sketch construction on both the data path and the query path] • Query points are available at query time!

Asymmetric Estimator: the Idea [Figure: the search pipeline with sketch construction on the data path only; the query goes through feature extraction alone] • Exploit the query features for high precision • How?

Asymmetric Estimator: L2 Sketch • Partitions are not equally good for a given query point • Weight each partition by its quality • Weight: distance between q and the nearest stripe boundary [Figure: example partitions, one marked "Bad!"]
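A minimal sketch of the weighting step, assuming the stripe construction is "project onto a direction, add a shift, take the parity of the stripe index" (an assumption, since the slide's formula is not in the transcript). For each partition it returns the query's bit and the distance from the query's projection to the nearest stripe boundary. All names are hypothetical, and the distance is measured along the projection, so it matches geometric distance only for unit-length directions.

```python
import math

def query_bits_and_weights(q, directions, shifts, width):
    """Return (bits, weights) for a raw query point q, one entry per partition."""
    bits, weights = [], []
    for a, b in zip(directions, shifts):
        projection = sum(ai * qi for ai, qi in zip(a, q)) + b
        bits.append(math.floor(projection / width) % 2)
        offset = projection % width  # position of q inside its stripe
        # A small value means q sits close to a boundary, so this bit is unreliable.
        weights.append(min(offset, width - offset))
    return bits, weights
```

A partition whose boundary passes right next to q can separate q from a very close neighbor, so a disagreement on that bit says little about distance; giving it a small weight reflects that.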

Asymmetric Estimator: L2 Sketch

Generalize the Asymmetric Idea • 0/1-valued function ↔ bipartition of the space • Asymmetric estimator: weighted Hamming distance
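The weighted Hamming distance can be sketched as below. The function name and the list-based representation are assumptions for illustration; the weights come from the raw query point, while the data point contributes only its stored bits.

```python
def weighted_hamming(query_bits, data_bits, weights):
    """Sum the per-bit weights over the positions where query and data bits differ."""
    if not (len(query_bits) == len(data_bits) == len(weights)):
        raise ValueError("all inputs must have the same length")
    return sum(w for qb, db, w in zip(query_bits, data_bits, weights) if qb != db)
```

Setting every weight to 1 recovers the ordinary (symmetric) Hamming distance, so the asymmetric estimator is a strict generalization that needs no change to the stored sketches.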

Example: Random Hyper-plane Sketch • Cosine Similarity • Partition with random hyper-plane [Charikar’02]
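For reference, a minimal sketch of the random hyperplane scheme: each bit records which side of a random hyperplane through the origin a vector falls on, and two vectors at angle θ disagree on a bit with probability θ/π, so the fraction of differing bits gives an estimate of the angle and hence of the cosine similarity. The function names and the `seed` parameter are hypothetical.

```python
import math
import random

def make_hyperplane_sketcher(dim, n_bits, seed=0):
    """Return a function mapping a vector to n_bits sign bits."""
    rng = random.Random(seed)
    normals = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

    def sketch(x):
        # Bit is 1 when x lies on the non-negative side of the hyperplane.
        return [1 if sum(ai * xi for ai, xi in zip(a, x)) >= 0.0 else 0
                for a in normals]

    return sketch

def estimate_cosine(s1, s2):
    """Estimate cosine similarity from the fraction of differing bits."""
    differing = sum(b1 != b2 for b1, b2 in zip(s1, s2))
    angle = math.pi * differing / len(s1)
    return math.cos(angle)
```

Because only the sign of the projection is kept, the sketch ignores vector length; turning the estimate into an L2 distance therefore needs the vector norms as extra information.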

Evaluation Datasets • Image: Caltech 101 • 101 categories, 9144 images in total • Feature extraction with SIFT • 4.5 million 128D features • Audio: LDC-SWITCHBOARD-1 collection • 2,400 phone conversations among 543 US speakers • Segmentation and feature extraction with Marsyas • 2.5 million 192D features • We use floating point numbers for all feature vectors.

L2 Distance Estimation • We want to see the relationship between • Estimation error and real distance • Estimation error and sketch size • Methods compared • Sketches with symmetric and asymmetric estimators • Our proposed L2 sketch • Random hyper-plane sketch (converted to L2 distance) • PCA and Random projection as baselines • Image data only for this task

Error vs. L2 distance • Sample random point pairs • Estimate the distance and measure the error • All methods use 32 bytes for one sketch • Bin according to real distance
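The measurement procedure above can be sketched as follows, assuming the error is summarized as mean squared error per bin (an assumption here; MSE is only named on a later slide). The function name and the bin width parameter are hypothetical.

```python
from collections import defaultdict

def mse_by_distance_bin(true_dists, est_dists, bin_width):
    """Group point pairs by true distance and return {bin_start: mean squared error}."""
    squared_errors = defaultdict(list)
    for t, e in zip(true_dists, est_dists):
        bin_start = int(t // bin_width) * bin_width
        squared_errors[bin_start].append((e - t) ** 2)
    return {start: sum(v) / len(v) for start, v in sorted(squared_errors.items())}
```

For example, pairs with true distances 10, 20 and 110 and estimates 12, 20 and 100 fall into bins starting at 0 and 100 when `bin_width` is 100.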

Error vs. L2 distance • Our sketch scheme has a tunable sensitive range [Figure annotation: average 100-NN distance + one sigma]

Error vs. Sketch Size • Use points with real distances within [0, 300) • Asymmetric estimator reduces MSE by half

K-NN Search • Search for 100-NN • Filter with sketches to obtain 2,000 candidates • Rank with raw features and return the top 100 • Size the sketch to meet a specific average recall • Recall = % of true K-NNs retrieved • We always return 100 points, so precision = recall
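The filter-then-rank procedure and the recall measure can be sketched as below. This is a brute-force illustration with hypothetical names; it uses plain Hamming distance for filtering, and any sketch distance (including an asymmetric one) could be substituted in the filtering step.

```python
import math

def hamming(a, b):
    """Number of positions where two bit lists differ."""
    return sum(x != y for x, y in zip(a, b))

def l2(u, v):
    """Euclidean distance between two raw feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def knn_search(query_feat, query_sketch, feats, sketches, k=100, n_candidates=2000):
    """Filter by sketch distance, then rank the candidates by raw-feature distance."""
    order = sorted(range(len(sketches)),
                   key=lambda i: hamming(query_sketch, sketches[i]))
    candidates = order[:n_candidates]
    candidates.sort(key=lambda i: l2(query_feat, feats[i]))
    return candidates[:k]

def recall(retrieved, true_neighbors):
    """Fraction of the true nearest neighbors that were retrieved."""
    return len(set(retrieved) & set(true_neighbors)) / len(true_neighbors)
```

Only the candidates that survive filtering need their raw features read, which is why a smaller sketch that still keeps the true neighbors among the candidates translates directly into less data scanned.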

Results with Image Dataset Sketch size (bytes) needed to achieve a given recall. Raw feature size: 496 bytes

Results with Audio Dataset Sketch size (bytes) needed to achieve a given recall. Raw feature size: 768 bytes

Conclusion • A new sketch for L2 distance • Tunable sensitive distance range • New idea of asymmetric distance estimator • Exploit the query/data storage asymmetry • Applies to different sketch schemes • 20% to 40% space reduction with our datasets
