Content-Based Similarity Search Moses Charikar Princeton University Joint work with: Qin Lv, William Josephson, Zhe Wang, Perry Cook, Matthew Hoffman, Kai Li
Motivation • Massive amounts of feature-rich digital data • Audio, video, digital photos, scientific sensor data • Noisy, high-dimensional • Traditional file systems/search tools inadequate • Exact match • Keyword-based search • Annotations • Need content-based similarity search
Motivation • Recent progress in theoretical studies of sketches • Compact data representations for estimating pairwise similarity/distance • Compact data structures for high-quality and efficient content-based similarity search?
Compact representation • Complex object → sketch (a short bit vector, e.g. 0 1 0 1 1 0 0 1 1 0 and 0 0 1 0 1 1 0 0 1 0) • Distance measured by (weighted) ℓ1 distance: d(x,y) = Σ_i w_i·|x_i − y_i| • Better still, Hamming distance between bit vectors • Distance between sketches estimates distance between objects • Several theoretical constructions of sketches for sets, vectors, earth mover distance (EMD)
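The two distance measures named on this slide can be written in a few lines. This is a minimal sketch, assuming feature vectors are plain Python sequences and sketches are stored as integers; the function names `weighted_l1` and `hamming` are illustrative, not taken from the talk.

```python
def weighted_l1(x, y, w):
    """Weighted l1 distance: sum_i w_i * |x_i - y_i|."""
    if not (len(x) == len(y) == len(w)):
        raise ValueError("x, y and w must have the same length")
    return sum(wi * abs(xi - yi) for xi, yi, wi in zip(x, y, w))


def hamming(a, b):
    """Hamming distance between two sketches stored as Python ints."""
    # XOR leaves a 1 exactly where the two bit vectors disagree.
    return bin(a ^ b).count("1")


# The two 10-bit vectors shown on the slide, written as binary literals.
s1 = 0b0101100110
s2 = 0b0010110010
print(hamming(s1, s2))
print(weighted_l1([1, 2, 3], [2, 2, 5], [1, 1, 0.5]))
```

Storing a sketch as an integer makes the comparison a single XOR plus a bit count, which is why Hamming distance is attractive compared with a floating-point ℓ1 sum.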
Outline • Motivation • System architecture • Implementation details • Segmentation & feature extraction • Sketch construction • Filtering • Indexing • Performance evaluation • Conclusions & future work
Similarity Search Engine Architecture [Figure: system architecture, with a pre-processing path and a query-time path]
Similarity Search Problem • Similarity search: finding objects similar to a query object i.e. containing similar features • Object representation • Distance function d (X, Y) • Nearest neighbor query • K-nearest neighbor (KNN) • Approximate nearest neighbor (ANN)
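As a baseline for the K-nearest-neighbor query above, a brute-force scan can be sketched as follows. The function `knn` and the toy `l1` distance are illustrative assumptions; any distance function d(X, Y) can be passed in.

```python
import heapq


def knn(query, objects, dist, k):
    """Return the k objects closest to `query` under `dist`, nearest first."""
    # Scans every object once; the heap keeps only the k best seen so far.
    return heapq.nsmallest(k, objects, key=lambda obj: dist(query, obj))


def l1(x, y):
    """Plain l1 distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))


points = [(0, 0), (1, 1), (5, 5), (2, 0)]
print(knn((0, 1), points, l1, k=2))
```

The cost is one distance evaluation per stored object, so this is practical only when the distance is cheap or the dataset is small.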
Object Representation & Distance Function • Earth Mover Distance (EMD) [Figure: EMD illustration; numeric labels not recoverable]
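For reference, EMD is usually defined as the cost of an optimal transportation plan between two weighted sets. In the formulation below the notation is introduced here and does not come from the slides: object X has segments x_i with weights w_i, object Y has segments y_j with weights v_j, d is the ground distance between segments, and f_ij is the amount of weight moved from x_i to y_j.

```latex
\mathrm{EMD}(X, Y) \;=\;
\frac{\min_{f}\ \sum_{i}\sum_{j} f_{ij}\, d(x_i, y_j)}
     {\min\bigl(\sum_{i} w_i,\ \sum_{j} v_j\bigr)}
\quad\text{subject to}\quad
f_{ij} \ge 0,\qquad
\sum_{j} f_{ij} \le w_i,\qquad
\sum_{i} f_{ij} \le v_j,\qquad
\sum_{i}\sum_{j} f_{ij} = \min\Bigl(\sum_{i} w_i,\ \sum_{j} v_j\Bigr)
```

Evaluating this requires solving a small linear program for every pair of objects, which is the reason EMD is expensive compared with ℓ1 or Hamming distance.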
Segmentation & Feature Extraction (1) • Derive a small set of features that characterize the important attributes of a data object • Data-dependent
Segmentation & Feature Extraction (1) • Image data • JSEG image segmentation tool • Each segment is represented by a 14-dimensional feature vector • Color moments • First three moments in HSV color space → 9-D vector • Bounding box • Aspect ratio, bounding box size, area ratio, region centroid → 5-D vector • Segment weight: square root of segment size • ℓ1 distance between segments, EMD between images
Segmentation & Feature Extraction (2) • Audio data • Phonetic segmentation & feature extraction using MARSYAS • Each segment • 50 sliding windows × 6 MFCC parameters = 300 dimensions • Segment weight: segment length • Segment distance: ℓ1 distance • Sentence distance: EMD
Segmentation & Feature Extraction (3) • 3D shape data • 32 decomposing spheres • Spherical harmonic descriptor (SHD) • Spherical harmonic coefficients up to order 16 • 32 x 17 = 544 dimensions • ℓ2 distance
Sketch Construction [Figure: feature vectors x = (x1, x2, x3, x4) and y = (y1, y2, y3, y4) mapped to 0/1 bits] • Sketches: tiny data structures that can be used to estimate properties of the original data • High-dimensional feature vector → N·K-bit vector • Hamming distance estimates original feature vector distance • XOR groups of K bits → N-bit vector • Hamming distance estimates thresholded distance
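The two-step construction above can be illustrated as follows. The slide does not say how the N·K bits are generated; this sketch assumes one common scheme, in which each bit compares a randomly chosen coordinate against a random threshold drawn from that coordinate's range. The names `make_sketcher`, `lo`, `hi`, `n_bits` and `k` are hypothetical.

```python
import random


def make_sketcher(lo, hi, n_bits, k, seed=0):
    """Return a function mapping a feature vector to (raw bits, folded bits).

    lo[i] and hi[i] bound coordinate i. Assumed scheme: every raw bit uses
    one random coordinate and one random threshold within its range.
    """
    rng = random.Random(seed)
    dims = len(lo)
    picks = []
    for _ in range(n_bits * k):
        i = rng.randrange(dims)
        picks.append((i, rng.uniform(lo[i], hi[i])))

    def sketch(x):
        # Step 1: N*K raw bits, one threshold test per bit.
        raw = [1 if x[i] > t else 0 for i, t in picks]
        # Step 2: XOR each group of K raw bits into one output bit.
        folded = []
        for g in range(n_bits):
            bit = 0
            for b in raw[g * k:(g + 1) * k]:
                bit ^= b
            folded.append(bit)
        return raw, folded

    return sketch


def hamming(a, b):
    """Number of positions where two equal-length bit lists differ."""
    return sum(p != q for p, q in zip(a, b))


sketch = make_sketcher(lo=[0.0] * 4, hi=[1.0] * 4, n_bits=64, k=2, seed=1)
x = [0.1, 0.5, 0.9, 0.3]
y = [0.2, 0.5, 0.8, 0.3]  # close to x
z = [0.9, 0.1, 0.1, 0.9]  # far from x
print(hamming(sketch(x)[1], sketch(y)[1]), hamming(sketch(x)[1], sketch(z)[1]))
```

Under this scheme a raw bit differs between two vectors exactly when its threshold falls between their values on the chosen coordinate, so the expected raw Hamming distance grows with the ℓ1 distance. A folded bit differs only when an odd number of its K raw bits differ, so the folded distance saturates for far-apart objects, which matches the "thresholded distance" on the slide.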
Filtering for Similarity Search • EMD computation is expensive • Filtering • Scans through the entire dataset • Uses a much faster distance function to filter out “bad” answers • Computes EMD for a much smaller candidate set • Criteria in picking candidate objects • Has at least one segment that is close enough to one of the top segments of the query object
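The filter-then-rank idea on this slide can be sketched as a two-stage scan. This is an illustration under stated assumptions, not the system's actual code: objects are lists of `(weight, feature)` segments, "top segments" means the `top_m` heaviest query segments, and `seg_dist`, `obj_dist`, `top_m` and `threshold` are hypothetical parameters (`obj_dist` stands in for EMD).

```python
import heapq


def filtered_search(query, dataset, seg_dist, obj_dist, top_m, threshold, k):
    """Two-stage search: cheap segment-level filter, then exact ranking.

    query and each dataset object are lists of (weight, feature) segments.
    """
    # Top segments of the query: here, the top_m heaviest ones.
    top = heapq.nlargest(top_m, query, key=lambda seg: seg[0])

    # Stage 1: keep objects with at least one segment close enough
    # to one of the top query segments under the fast distance.
    candidates = [
        obj for obj in dataset
        if any(seg_dist(qf, f) <= threshold for _, qf in top for _, f in obj)
    ]

    # Stage 2: run the expensive distance only on the candidates.
    return heapq.nsmallest(k, candidates, key=lambda obj: obj_dist(query, obj))
```

The threshold trades quality for speed: a looser threshold lets more objects reach the expensive stage, while a tighter one risks filtering out true neighbors.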
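Indexing for Similarity Search • Cover tree: a leveled tree where each level is a “cover” for the level beneath it (C_i denotes the set of nodes at level i) • Nesting: C_i ⊂ C_{i−1} • Covering tree: for every node p ∈ C_{i−1}, there exists a node q ∈ C_i satisfying d(p, q) ≤ 2^i, and exactly one such q is a parent of p • Separation: for all distinct nodes p, q ∈ C_i, d(p, q) > 2^i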
Performance Evaluation • Can we achieve high-quality similarity search results at high speed? • How small can the sketches be as the metadata of the similarity search engine? • What are the performance tradeoffs of • Brute-force • Filtering • Indexing
Benchmarks • Search quality benchmark suite • VARY image: 10k images, 32 sets • TIMIT audio: 6300 sentences, 450 sets • PSB shape: 1814 models, 92 sets • Search speed benchmark suite • Mixed image dataset: 600k images • Mixed audio dataset: 60k sentences • Mixed shape dataset: 40k shape models
Search Quality Metrics Given a query q with k similar objects: • First-tier • Percentage of similar objects returned within rank k • Second-tier • Percentage of similar objects returned within rank 2k • Average precision
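The three metrics can be computed from a ranked result list and the set of objects known to be similar to the query. This is a minimal sketch: the function names are illustrative, values are returned as fractions in [0, 1] rather than percentages, and average precision follows the usual definition (mean of the precision at each rank where a similar object appears).

```python
def first_tier(ranked, relevant):
    """Fraction of the k similar objects returned within rank k."""
    k = len(relevant)
    return sum(1 for obj in ranked[:k] if obj in relevant) / k


def second_tier(ranked, relevant):
    """Fraction of the k similar objects returned within rank 2k."""
    k = len(relevant)
    return sum(1 for obj in ranked[:2 * k] if obj in relevant) / k


def average_precision(ranked, relevant):
    """Mean of precision@rank over the ranks where a similar object appears."""
    hits = 0
    total = 0.0
    for rank, obj in enumerate(ranked, start=1):
        if obj in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


ranked = ["a", "x", "b", "y", "c", "z"]
relevant = {"a", "b", "c"}
print(first_tier(ranked, relevant))
print(second_tier(ranked, relevant))
print(average_precision(ranked, relevant))
```

In this toy ranking two of the three similar objects fall within rank 3 and all three within rank 6, so the second-tier score is perfect while the first-tier score is not.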
Conclusions & Future Work • A general-purpose content-based similarity search system • High-quality similarity search with reasonably high speed • Using sketches reduces metadata size • Filtering & indexing speed up similarity search • Future work • More efficient distance function than EMD • Further investigation of indexing data structures • More data types: • video, genomic microarray data, other sensor data