Asymmetric Distance Estimation with Sketches for Similarity Search in High-Dimensional Spaces. Wei Dong. Joint work with Moses Charikar and Kai Li. Computer Science Department, Princeton University.
Motivations • Feature-rich data grow explosively • Images, audio, video, scientific sensor data • Content-based retrieval is needed • Features are extracted with domain-specific algorithms • K-NN search in feature space • Feature size becomes the bottleneck • Features are high-dimensional, so tree indexes fail • New methods like LSH are mostly smart ways of scanning
Feature size is growing • Domain experts are shooting for high precision, and new methods are being developed • Example: image features • ~1995: 166D color histogram • 1999: SIFT, 500 × 128D vectors/image = 64KB/image. That's almost the size of an image!
K-NN Search with Sketches • Sketch: compact approximation of a large object [Lv04] • Feature vector → bit vector; L1 distance → Hamming distance [Figure: pipeline with Input Data Objects / Input Query Object → Feature Extraction → Sketch Construction; Sketch Database → Filtering → Ranking with features → Results]
Our contribution • A new sketch for L2 distance • Asymmetric distance estimation • Using sketch of a data point + the raw query point • Applies to our proposed sketch and others • Evaluation with real life image and audio data • Sketches of < 10% the feature size for > 90% recall • Further 20% ~ 40% size reduction with asymmetric estimators
L2 Sketch: the Idea • Randomly partition the space into stripes • Orange = 1; white = 0 • More random partitions to make a bit vector • Hamming distance reflects point proximity
L2 Sketch: the Proposed Scheme [Figure: label W]
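The stripe idea can be sketched in a few lines of Python. This is a minimal illustration, not the authors' exact scheme: it assumes Gaussian random directions, a uniformly random offset, and a stripe width called `width` (presumably the W in the figure). The names `make_l2_partitions`, `l2_sketch` and `hamming` are hypothetical.

```python
import math
import random

def make_l2_partitions(dim, n_bits, width, seed=0):
    """Draw n_bits random stripe partitions (assumed: Gaussian direction, uniform offset)."""
    rng = random.Random(seed)
    directions = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]
    offsets = [rng.uniform(0.0, width) for _ in range(n_bits)]
    return directions, offsets

def l2_sketch(x, directions, offsets, width):
    """One bit per partition: the parity of the stripe that x falls into."""
    bits = []
    for a, b in zip(directions, offsets):
        proj = sum(ai * xi for ai, xi in zip(a, x))
        stripe = math.floor((proj + b) / width)  # index of the stripe containing x
        bits.append(stripe % 2)                  # alternate stripes are 1 and 0
    return bits

def hamming(s, t):
    """Number of positions where two bit vectors differ."""
    return sum(si != ti for si, ti in zip(s, t))

# Small demonstration with made-up data: a close point and a distant point.
rng = random.Random(1)
DIM, N_BITS, WIDTH = 32, 256, 4.0
directions, offsets = make_l2_partitions(DIM, N_BITS, WIDTH)
x = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
near = [xi + 0.1 * rng.gauss(0.0, 1.0) for xi in x]
far = [xi + 3.0 * rng.gauss(0.0, 1.0) for xi in x]
sx = l2_sketch(x, directions, offsets, WIDTH)
d_near = hamming(sx, l2_sketch(near, directions, offsets, WIDTH))
d_far = hamming(sx, l2_sketch(far, directions, offsets, WIDTH))
```

In this sketch, `d_near` comes out much smaller than `d_far`. Once two points are farther apart than a few stripe widths, the bits become close to random, which is why the width controls the range of distances the sketch can distinguish.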
Asymmetric Estimator: the Idea • Query points are available at query time! [Figure: the K-NN search pipeline, annotated "Information Loss!" in two places]
Asymmetric Estimator: the Idea • Exploit the query features for high precision. How? [Figure: the same pipeline, with Sketch Construction applied only to the data objects; the query goes from Feature Extraction directly to Filtering]
Asymmetric Estimator: L2 Sketch • Partitions are not equally good for a query point • Weight each partition by its quality • Weight: distance between q and the stripe boundary [Figure: labels p3, "Bad!"]
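A weighted Hamming distance of this kind might look like the following. It is a sketch under assumptions: the partition bit is the stripe parity of a random projection (as assumed for the L2 sketch), the weight is the distance from the query's projection to the nearest stripe boundary measured along the projection direction, and no calibration to actual L2 distance is attempted. All function names are hypothetical.

```python
import math
import random

def project(x, a, b):
    """Offset projection of x onto direction a."""
    return sum(ai * xi for ai, xi in zip(a, x)) + b

def sketch_bit(x, a, b, width):
    """Stripe parity of x under one partition."""
    return math.floor(project(x, a, b) / width) % 2

def boundary_distance(q, a, b, width):
    """Distance, along the projection, from q to the nearest stripe boundary."""
    r = project(q, a, b) % width  # position of q inside its stripe, in [0, width)
    return min(r, width - r)

def weighted_hamming(q, data_bits, partitions, width):
    """Asymmetric estimate: raw query q against the stored bits of a data point.

    A disagreeing bit counts for more when q sits far from the boundary,
    because the data point had to move further to land in another stripe.
    """
    total = 0.0
    for (a, b), bit in zip(partitions, data_bits):
        if sketch_bit(q, a, b, width) != bit:
            total += boundary_distance(q, a, b, width)
    return total

# Demonstration with made-up data.
rng = random.Random(2)
DIM, N_BITS, WIDTH = 32, 128, 4.0
partitions = [([rng.gauss(0.0, 1.0) for _ in range(DIM)], rng.uniform(0.0, WIDTH))
              for _ in range(N_BITS)]
q = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
near = [qi + 0.1 * rng.gauss(0.0, 1.0) for qi in q]
far = [qi + 3.0 * rng.gauss(0.0, 1.0) for qi in q]

def bits_of(x):
    return [sketch_bit(x, a, b, WIDTH) for a, b in partitions]

w_self = weighted_hamming(q, bits_of(q), partitions, WIDTH)
w_near = weighted_hamming(q, bits_of(near), partitions, WIDTH)
w_far = weighted_hamming(q, bits_of(far), partitions, WIDTH)
```

Only the data point is reduced to bits here; the query keeps its raw projections, which is the asymmetry the slide refers to.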
Generalize the Asymmetric Idea • A 0/1-valued function is a bipartition of the space • Asymmetric estimator: weighted Hamming distance
Example: Random Hyper-plane Sketch • Cosine Similarity • Partition with random hyper-plane [Charikar’02]
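For reference, a random hyperplane sketch can be written as below. The symmetric part relies on the known property that two vectors disagree on a random hyperplane bit with probability equal to their angle divided by π. The asymmetric variant, which weights each disagreeing bit by |r·q|, is an assumption made for illustration by analogy with the stripe-boundary weight; the slide's own formula is not in the text. The function names are hypothetical.

```python
import math
import random

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def hyperplane_sketch(x, planes):
    """One bit per random hyperplane: which side of the plane x lies on."""
    return [1 if dot(r, x) >= 0 else 0 for r in planes]

def estimate_angle(s, t):
    """Symmetric estimator: the fraction of differing bits estimates angle / pi."""
    differing = sum(si != ti for si, ti in zip(s, t))
    return math.pi * differing / len(s)

def weighted_hamming(q, data_bits, planes):
    """Assumed asymmetric variant: weight a disagreeing bit by |r . q|,
    which is proportional to the distance from q to that hyperplane."""
    total = 0.0
    for r, bit in zip(planes, data_bits):
        proj = dot(r, q)
        if (1 if proj >= 0 else 0) != bit:
            total += abs(proj)
    return total

# Demonstration: orthogonal and opposite unit vectors.
rng = random.Random(3)
DIM, N_BITS = 16, 1024
planes = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(N_BITS)]
e1 = [1.0] + [0.0] * (DIM - 1)
e2 = [0.0, 1.0] + [0.0] * (DIM - 2)
neg_e1 = [-v for v in e1]
s1 = hyperplane_sketch(e1, planes)
angle_same = estimate_angle(s1, s1)
angle_orth = estimate_angle(s1, hyperplane_sketch(e2, planes))
angle_opp = estimate_angle(s1, hyperplane_sketch(neg_e1, planes))
```

With 1024 bits, the estimate for the orthogonal pair should land close to π/2, and the opposite pair close to π.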
Evaluation Datasets • Image: Caltech 101 • 101 categories, 9144 images in total • Feature extraction with SIFT • 4.5 million 128D features • Audio: LDC-SWITCHBOARD-1 collection • 2,400 phone conversations among 543 US speakers • Segmentation and feature extraction with Marsyas • 2.5 million 192D features • We use floating point numbers for all feature vectors.
L2 Distance Estimation • We want to see the relationship between • Estimation error and real distance • Estimation error and sketch size • Methods compared • Sketches with symmetric and asymmetric estimators • Our proposed L2 sketch • Random hyper-plane sketch (converted to L2 distance) • PCA and Random projection as baselines • Image data only for this task
Error vs. L2 distance • Sample random point pairs • Estimate the distance and measure the error • All methods use 32 bytes for one sketch • Bin according to real distance
Error vs. L2 distance • Our sketch scheme has a tunable sensitive range [Figure annotation: average 100-NN distance + one sigma]
Error vs. Sketch Size • Use points with real distances within [0, 300) • Asymmetric estimator reduces MSE by half
K-NN Search • Search for 100-NN • Filter with sketches to obtain 2000 candidates • Rank with raw features and return the top 100 • Size the sketch to meet a specific average recall • Recall = % of true K-NNs retrieved • We always return 100 points, so precision = recall
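The filter-then-rank procedure and the recall measure can be illustrated as follows. This is a toy version with made-up data and much smaller numbers than the 2000 candidates and 100 neighbors used in the evaluation. A random hyperplane sketch stands in for whichever sketch is being tested, and every name is hypothetical.

```python
import math
import random

def l2(u, v):
    return math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))

def sketch(x, planes):
    """Stand-in sketch (random hyperplane bits); any bit-vector sketch fits here."""
    return [1 if sum(r * xi for r, xi in zip(p, x)) >= 0 else 0 for p in planes]

def hamming(s, t):
    return sum(a != b for a, b in zip(s, t))

def knn_with_sketches(query, data, sketches, planes, k, n_candidates):
    """Filter by sketch distance, then rank the survivors with raw features."""
    q_sketch = sketch(query, planes)
    order = sorted(range(len(data)), key=lambda i: hamming(q_sketch, sketches[i]))
    candidates = order[:n_candidates]                 # filtering step
    candidates.sort(key=lambda i: l2(query, data[i])) # ranking step
    return candidates[:k]

def recall(found, truth):
    """Fraction of the true k nearest neighbors that were retrieved."""
    return len(set(found) & set(truth)) / len(truth)

# Toy data set.
rng = random.Random(0)
DIM, N, K = 8, 300, 5
planes = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(64)]
data = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(N)]
sketches = [sketch(x, planes) for x in data]
query = [rng.gauss(0.0, 1.0) for _ in range(DIM)]

truth = sorted(range(N), key=lambda i: l2(query, data[i]))[:K]
approx = knn_with_sketches(query, data, sketches, planes, K, n_candidates=60)
exact = knn_with_sketches(query, data, sketches, planes, K, n_candidates=N)
approx_recall = recall(approx, truth)
```

Because exactly K points are returned, the number of false positives equals the number of misses, which is why precision and recall coincide. When every point is kept as a candidate, the ranking step recovers the true neighbors and recall is 1.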
Results with Image Dataset • Sketch size (bytes) needed to achieve a given recall • Raw feature size: 496 bytes
Results with Audio Dataset • Sketch size (bytes) needed to achieve a given recall • Raw feature size: 768 bytes
Conclusion • A new sketch for L2 distance • Tunable sensitive distance range • New idea of asymmetric distance estimator • Exploit the query/data storage asymmetry • Applies to different sketch schemes • 20% to 40% space reduction with our datasets