A Metric Cache for Similarity Search
Fabrizio Falchi, Claudio Lucchese, Salvatore Orlando, Fausto Rabitti, Raffaele Perego
Content-Based Image Retrieval: Similarity Search in Databases
• Objects are "unknown"; only distances are "well known".
• Metric space assumption: identity, symmetry, triangle inequality.
• Distance functions include: Minkowski distances, edit distance, Jaccard distance, ...
• Applications include: images, 3D shapes, medical data, text, DNA sequences, graphs, etc.
• Metric space indexing works better than multidimensional indexing.
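To make the metric assumption concrete, here is a minimal Python sketch (not from the slides) of one such distance function, with a check of the three axioms on sample points:

```python
import math

def euclidean(x, y):
    """A Minkowski distance with p = 2: one example of a metric."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Check the three metric axioms on sample points.
p, q, r = (0.0, 0.0), (3.0, 4.0), (6.0, 8.0)
assert euclidean(p, p) == 0.0                                # identity
assert euclidean(p, q) == euclidean(q, p)                    # symmetry
assert euclidean(p, r) <= euclidean(p, q) + euclidean(q, r)  # triangle inequality
```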
Distributed Similarity Search System
[Diagram: top-k queries enter the front-end of the CBIR system, which scatters them to a parallel & distributed CBIR system of n units, each holding an index of MM objects.]
• Search cost is close to O(|DB|)!
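The scatter-gather pattern behind such a system can be sketched as follows; the function and parameter names are illustrative assumptions, not the paper's API:

```python
import heapq

def distributed_knn(query, units, k, distance):
    """Scatter-gather sketch: each index unit computes its local top-k,
    and the front-end merges the partial answers into a global top-k.
    `units` is a list of object collections, one per index unit."""
    candidates = []
    for unit in units:
        # Each unit scans its local index; overall work stays close to O(|DB|).
        candidates.extend(heapq.nsmallest(k, unit, key=lambda o: distance(query, o)))
    # The front-end merges the n partial answers.
    return heapq.nsmallest(k, candidates, key=lambda o: distance(query, o))
```

Since every query touches every unit, the front-end is the natural place to add a cache.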
Cached Distributed Similarity Search System
[Diagram: the same architecture, with a Metric Cache added to the front-end of the CBIR system, between the incoming top-k queries and the parallel & distributed CBIR system (units 1..n).]
What’s different in ? Metric Cache • The cache stores result-objects, not only result-pointers • e.g.: documents vs. documents ids • The cache is a peculiar sample of the whole dataset • the set of objects most recently seenby the users (= most interesting !?!) • Claim: An interesting object may be used to answer approximately if it is sufficiently similar to the query.
What’s different in ? Metric Cache …and… • Queries may be approximate ! • [Zobel et al. CIVR 07]At least 8% of the images in the web are near-duplicates. Most of them are due to cropping, contrast adjustment, etc. • Requirement: the system must be robust w.r.t. near-duplicate queries.
What’s different in ? Metric Cache q1 q2 q3 Approx. answer Exact answer Exact answer Metric Cache Front-end of the CBIR System Parallel & DistributedCBIR System
Two algorithms: RCache vs. QCache
RCache(q, k):
• If q ∈ Cache, return R (cache hit)
• R = Cache.knn(q, k); if quality(R) > threshold, return R (approximate hit)
• else R = DB.knn(q, k); Cache.add(q, R); return R (cache miss)
• In case of an approximate hit, the cached query q', being the closest to q, is marked as used.
• The least recently used query and its results are evicted.
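A runnable Python sketch of this control flow. The class layout and the quality proxy are assumptions for illustration; only the hit / approximate-hit / miss structure and the LRU rule come from the slide:

```python
from collections import OrderedDict

class RCache:
    """Sketch of RCache: k-NN is attempted over all cached result objects."""

    def __init__(self, db, distance, capacity, threshold):
        self.db = db                    # fallback exact index (here: a plain list)
        self.distance = distance
        self.capacity = capacity        # maximum number of cached queries
        self.threshold = threshold      # minimum acceptable answer quality
        self.entries = OrderedDict()    # query -> its k results, in LRU order

    def quality(self, q, R):
        # Placeholder proxy (assumption): closer results score higher.
        avg = sum(self.distance(q, o) for o in R) / len(R)
        return 1.0 / (1.0 + avg)

    def knn(self, q, k):
        if q in self.entries:                          # cache hit, O(1)
            self.entries.move_to_end(q)
            return self.entries[q]
        pool = [o for res in self.entries.values() for o in res]
        R = sorted(pool, key=lambda o: self.distance(q, o))[:k]
        if R and self.quality(q, R) > self.threshold:  # approximate hit
            # Mark the cached query closest to q as used (the slide's LRU rule).
            q_star = min(self.entries, key=lambda c: self.distance(q, c))
            self.entries.move_to_end(q_star)
            return R
        R = sorted(self.db, key=lambda o: self.distance(q, o))[:k]  # cache miss
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)           # evict least recently used
        self.entries[q] = R
        return R
```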
Costs of RCache vs. QCache
RCache(q, k):
• Cache hit: hash table access, O(1)
• Approximate hit: search among all the cached result objects, O(|Cache|)
• Cache miss: search among all the objects in the database, O(|DB|)
• |Cache| is the number of cached objects, and |DB| is the size of the database.
Two algorithms: RCache vs. QCache
QCache(q, k):
• If q ∈ Cache, return R (cache hit)
• Q* = Cache.knn(q, ...): the cached queries closest to q
• R* = {results in Q*}
• R = R*.knn(q, k); if quality(R) > threshold, return R (approximate hit)
• else R = DB.knn(q, k); Cache.add(q, R); return R (cache miss)
RCache(q, k):
• If q ∈ Cache, return R (cache hit)
• R = Cache.knn(q, k); if quality(R) > threshold, return R (approximate hit)
• else R = DB.knn(q, k); Cache.add(q, R); return R (cache miss)
• In case of an approximate hit, the cached query q', being the closest to q, is marked as used.
• The least recently used query and its results are evicted.
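A matching sketch of QCache, reusing the RCache class above. The number of cached queries to probe (here `n_probe`) is a hypothetical name; the corresponding symbol was lost from the slide:

```python
class QCache(RCache):
    """Sketch of QCache: search first among cached *queries*, then only
    among the results stored for the closest ones."""

    def __init__(self, db, distance, capacity, threshold, n_probe):
        super().__init__(db, distance, capacity, threshold)
        self.n_probe = n_probe          # how many nearby cached queries to probe

    def knn(self, q, k):
        if q in self.entries:                          # cache hit, O(1)
            self.entries.move_to_end(q)
            return self.entries[q]
        # Q*: the cached queries closest to q (far fewer queries than objects).
        q_star = sorted(self.entries,
                        key=lambda c: self.distance(q, c))[:self.n_probe]
        # R*: the results stored for those queries; search only there.
        pool = [o for c in q_star for o in self.entries[c]]
        R = sorted(pool, key=lambda o: self.distance(q, o))[:k]
        if R and self.quality(q, R) > self.threshold:  # approximate hit
            self.entries.move_to_end(q_star[0])        # closest query marked as used
            return R
        R = sorted(self.db, key=lambda o: self.distance(q, o))[:k]  # cache miss
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)           # LRU eviction
        self.entries[q] = R
        return R
```

For example, `QCache(db, euclidean, capacity=1000, threshold=0.5, n_probe=10).knn(q, 20)` would probe the 10 nearest cached queries before falling back to the database (all parameter values illustrative).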
Costs of RCache vs. QCache
QCache(q, k):
• Approximate hit: search among the cached query objects only, O(|Cache|/k)
RCache(q, k):
• Cache hit: hash table access, O(1)
• Approximate hit: search among all the cached result objects, O(|Cache|)
• Cache miss: search among all the objects in the database, O(|DB|)
• Supposing k results are stored for each query, |Cache| is the number of cached objects and |DB| is the size of the database.
Approximation & Guarantees
• Let the safe range be: s = r* − d(q, q*), where q* is the cached query closest to q and r* is the radius of the ball around q* containing its cached results.
• The cached k* objects within distance s of q are the true top-k* of the new query.
• Every cached query may provide some additional guarantee.
[Diagram: the new query q inside the ball of radius r* around the cached query q*, with the safe range s.]
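Why the guarantee holds: a short derivation from the triangle inequality, with the notation above.

```latex
% Any object o not among q*'s cached results satisfies d(q*, o) >= r*.
% By the triangle inequality, d(q*, o) <= d(q*, q) + d(q, o), hence
\[
  d(q, o) \;\ge\; d(q^*, o) - d(q, q^*) \;\ge\; r^* - d(q, q^*) \;=\; s ,
\]
% so every cached object within distance s of q is provably closer to q
% than any object the cache has never seen: such objects are exact answers.
```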
Experimental setup
• A collection of 1,000,000 images downloaded from Flickr; we extracted 5 MPEG-7 descriptors, which were used to measure similarity.
• A query log of 100,000 images: a random sample with replacement, using image views; 20% training, 80% testing.
• k = 20, … = 10
• Quality function: …
• Safe range: 0
What we also did ...
• Take queries from a different collection.
• Inject duplicates in the query log.
• Use an expectation of RES as quality measure.
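The slides do not spell RES out; one common reading in the approximate similarity search literature, offered here purely as an assumption, is the relative error on the sum of distances of the approximate vs. the exact answer:

```python
def res(distance, q, approx, exact):
    """Assumed RES-style quality measure: 0.0 means the approximate
    k-NN answer is exactly as close to q as the true one."""
    d_approx = sum(distance(q, o) for o in approx)
    d_exact = sum(distance(q, o) for o in exact)
    return (d_approx - d_exact) / d_exact if d_exact > 0 else 0.0
```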
Approximation quality III
• Bug: RES = 0.07
• Portrait: RES = 0.09
• Sunset: RES = 0.12
Acknowledgements
• The European Project SAPIR: http://www.sapir.eu
• P. Zezula and his colleagues for the M-Tree implementation: http://mufin.fi.muni.cz/tiki-index.php
• The dataset used is available at http://cophir.isti.cnr.it