A Metric Cache for Similarity Search
Fabrizio Falchi, Claudio Lucchese, Salvatore Orlando, Fausto Rabitti, Raffaele Perego
Content-Based Image Retrieval: Similarity Search in Databases
• Objects are "unknown"; only distances are "well known".
• Metric space assumption: identity, symmetry, triangle inequality.
• Distance functions include: Minkowski distances, edit distance, Jaccard distance, ...
• Applications include: images, 3D shapes, medical data, text, DNA sequences, graphs, etc.
• Metric space indexing works better than multidimensional indexing.
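To make the metric assumption concrete, here is a minimal Python sketch (not from the slides) of one such distance function, with a check of the three axioms on sample points:

```python
import math

def euclidean(x, y):
    """A Minkowski distance with p = 2: one example of a metric."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Check the three metric axioms on sample points.
p, q, r = (0.0, 0.0), (3.0, 4.0), (6.0, 8.0)
assert euclidean(p, p) == 0.0                                # identity
assert euclidean(p, q) == euclidean(q, p)                    # symmetry
assert euclidean(p, r) <= euclidean(p, q) + euclidean(q, r)  # triangle inequality
```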
Distributed Similarity Search System
[Diagram: top-k queries enter the front-end of the CBIR system, which scatters them to a parallel & distributed CBIR system of n units, each holding an index of MM objects.]
• Search cost is close to O(|DB|)!
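The scatter-gather pattern behind such a system can be sketched as follows; the function and parameter names are illustrative assumptions, not the paper's API:

```python
import heapq

def distributed_knn(query, units, k, distance):
    """Scatter-gather sketch: each index unit computes its local top-k,
    and the front-end merges the partial answers into a global top-k.
    `units` is a list of object collections, one per index unit."""
    candidates = []
    for unit in units:
        # Each unit scans its local index; overall work stays close to O(|DB|).
        candidates.extend(heapq.nsmallest(k, unit, key=lambda o: distance(query, o)))
    # The front-end merges the n partial answers.
    return heapq.nsmallest(k, candidates, key=lambda o: distance(query, o))
```

Since every query touches every unit, the front-end is the natural place to add a cache.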
Cached Distributed Similarity Search System
[Diagram: the same architecture, with a Metric Cache added to the front-end of the CBIR system, between the incoming top-k queries and the parallel & distributed CBIR system (units 1..n).]
What’s different in ? Metric Cache • The cache stores result-objects, not only result-pointers • e.g.: documents vs. documents ids • The cache is a peculiar sample of the whole dataset • the set of objects most recently seenby the users (= most interesting !?!) • Claim: An interesting object may be used to answer approximately if it is sufficiently similar to the query.
What’s different in ? Metric Cache …and… • Queries may be approximate ! • [Zobel et al. CIVR 07]At least 8% of the images in the web are near-duplicates. Most of them are due to cropping, contrast adjustment, etc. • Requirement: the system must be robust w.r.t. near-duplicate queries.
What’s different in ? Metric Cache q1 q2 q3 Approx. answer Exact answer Exact answer Metric Cache Front-end of the CBIR System Parallel & DistributedCBIR System
Two algorithms: RCache vs. QCache
RCache(q, k):
• If q ∈ Cache, return R (cache hit)
• R = Cache.knn(q, k); if quality(R) > threshold, return R (approximate hit)
• else R = DB.knn(q, k); Cache.add(q, R); return R (cache miss)
• In case of an approximate hit, the cached query q', being the closest to q, is marked as used.
• The least recently used query and its results are evicted.
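A runnable Python sketch of this control flow. The class layout and the quality proxy are assumptions for illustration; only the hit / approximate-hit / miss structure and the LRU rule come from the slide:

```python
from collections import OrderedDict

class RCache:
    """Sketch of RCache: k-NN is attempted over all cached result objects."""

    def __init__(self, db, distance, capacity, threshold):
        self.db = db                    # fallback exact index (here: a plain list)
        self.distance = distance
        self.capacity = capacity        # maximum number of cached queries
        self.threshold = threshold      # minimum acceptable answer quality
        self.entries = OrderedDict()    # query -> its k results, in LRU order

    def quality(self, q, R):
        # Placeholder proxy (assumption): closer results score higher.
        avg = sum(self.distance(q, o) for o in R) / len(R)
        return 1.0 / (1.0 + avg)

    def knn(self, q, k):
        if q in self.entries:                          # cache hit, O(1)
            self.entries.move_to_end(q)
            return self.entries[q]
        pool = [o for res in self.entries.values() for o in res]
        R = sorted(pool, key=lambda o: self.distance(q, o))[:k]
        if R and self.quality(q, R) > self.threshold:  # approximate hit
            # Mark the cached query closest to q as used (the slide's LRU rule).
            q_star = min(self.entries, key=lambda c: self.distance(q, c))
            self.entries.move_to_end(q_star)
            return R
        R = sorted(self.db, key=lambda o: self.distance(q, o))[:k]  # cache miss
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)           # evict least recently used
        self.entries[q] = R
        return R
```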
Costs of RCache vs. QCache
RCache(q, k):
• Cache hit: hash table access, O(1)
• Approximate hit: search among all the cached result objects, O(|Cache|)
• Cache miss: search among all the objects in the database, O(|DB|)
• |Cache| is the number of cached objects, and |DB| is the size of the database.
Two algorithms: RCache vs. QCache
QCache(q, k):
• If q ∈ Cache, return R (cache hit)
• Q* = Cache.knn(q, ...): the cached queries closest to q
• R* = {results in Q*}
• R = R*.knn(q, k); if quality(R) > threshold, return R (approximate hit)
• else R = DB.knn(q, k); Cache.add(q, R); return R (cache miss)
RCache(q, k):
• If q ∈ Cache, return R (cache hit)
• R = Cache.knn(q, k); if quality(R) > threshold, return R (approximate hit)
• else R = DB.knn(q, k); Cache.add(q, R); return R (cache miss)
• In case of an approximate hit, the cached query q', being the closest to q, is marked as used.
• The least recently used query and its results are evicted.
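A matching sketch of QCache, reusing the RCache class above. The number of cached queries to probe (here `n_probe`) is a hypothetical name; the corresponding symbol was lost from the slide:

```python
class QCache(RCache):
    """Sketch of QCache: search first among cached *queries*, then only
    among the results stored for the closest ones."""

    def __init__(self, db, distance, capacity, threshold, n_probe):
        super().__init__(db, distance, capacity, threshold)
        self.n_probe = n_probe          # how many nearby cached queries to probe

    def knn(self, q, k):
        if q in self.entries:                          # cache hit, O(1)
            self.entries.move_to_end(q)
            return self.entries[q]
        # Q*: the cached queries closest to q (far fewer queries than objects).
        q_star = sorted(self.entries,
                        key=lambda c: self.distance(q, c))[:self.n_probe]
        # R*: the results stored for those queries; search only there.
        pool = [o for c in q_star for o in self.entries[c]]
        R = sorted(pool, key=lambda o: self.distance(q, o))[:k]
        if R and self.quality(q, R) > self.threshold:  # approximate hit
            self.entries.move_to_end(q_star[0])        # closest query marked as used
            return R
        R = sorted(self.db, key=lambda o: self.distance(q, o))[:k]  # cache miss
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)           # LRU eviction
        self.entries[q] = R
        return R
```

For example, `QCache(db, euclidean, capacity=1000, threshold=0.5, n_probe=10).knn(q, 20)` would probe the 10 nearest cached queries before falling back to the database (all parameter values illustrative).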
Costs of RCache vs. QCache
QCache(q, k):
• Approximate hit: search among the cached query objects only, O(|Cache|/k)
RCache(q, k):
• Cache hit: hash table access, O(1)
• Approximate hit: search among all the cached result objects, O(|Cache|)
• Cache miss: search among all the objects in the database, O(|DB|)
• Supposing k results are stored for each query, |Cache| is the number of cached objects and |DB| is the size of the database.
Approximation & Guarantees
• Let the safe range be: s = r* − d(q, q*), where q* is the cached query closest to q and r* is the radius of the ball around q* containing its cached results.
• The cached k* objects within distance s of q are the true top-k* of the new query.
• Every cached query may provide some additional guarantee.
[Diagram: the new query q inside the ball of radius r* around the cached query q*, with the safe range s.]
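Why the guarantee holds: a short derivation from the triangle inequality, with the notation above.

```latex
% Any object o not among q*'s cached results satisfies d(q*, o) >= r*.
% By the triangle inequality, d(q*, o) <= d(q*, q) + d(q, o), hence
\[
  d(q, o) \;\ge\; d(q^*, o) - d(q, q^*) \;\ge\; r^* - d(q, q^*) \;=\; s ,
\]
% so every cached object within distance s of q is provably closer to q
% than any object the cache has never seen: such objects are exact answers.
```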
Experimental setup
• A collection of 1,000,000 images downloaded from Flickr; we extracted 5 MPEG-7 descriptors, which were used to measure similarity.
• A query log of 100,000 images: a random sample with replacement, using image views; 20% training, 80% testing.
• k = 20, … = 10
• Quality function: …
• Safe range: 0
What we also did ...
• Take queries from a different collection.
• Inject duplicates in the query log.
• Use an expectation of RES as quality measure.
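The slides do not spell RES out; one common reading in the approximate similarity search literature, offered here purely as an assumption, is the relative error on the sum of distances of the approximate vs. the exact answer:

```python
def res(distance, q, approx, exact):
    """Assumed RES-style quality measure: 0.0 means the approximate
    k-NN answer is exactly as close to q as the true one."""
    d_approx = sum(distance(q, o) for o in approx)
    d_exact = sum(distance(q, o) for o in exact)
    return (d_approx - d_exact) / d_exact if d_exact > 0 else 0.0
```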
Approximation quality III
• Bug: RES = 0.07
• Portrait: RES = 0.09
• Sunset: RES = 0.12
Acknowledgements
• The European Project SAPIR: http://www.sapir.eu
• P. Zezula and his colleagues for the M-Tree implementation: http://mufin.fi.muni.cz/tiki-index.php
• The dataset used is available at http://cophir.isti.cnr.it