
A Metric Cache for Similarity Search


Presentation Transcript


  1. A Metric Cache for Similarity Search. Fabrizio Falchi, Claudio Lucchese, Salvatore Orlando, Fausto Rabitti, Raffaele Perego

  2. Content-Based Image Retrieval: Similarity Search in Databases • Objects are “unknown”; only distances are “well known” • Metric space assumption: • Identity • Symmetry • Triangle inequality • Distance functions include: • Minkowski distances, edit and Jaccard distance, ... • Applications include: • images, 3D shapes, medical data, text, DNA sequences, graphs, etc. • Metric-space indexing works better than multidimensional indexing.
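The three metric-space axioms listed above can be illustrated with one of the named distance functions. A minimal Python sketch using the Euclidean (Minkowski p = 2) distance:

```python
import math

def euclidean(x, y):
    """Minkowski distance with p = 2, one of the distance functions above."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

p, q, r = (0.0, 0.0), (3.0, 4.0), (6.0, 0.0)

assert euclidean(p, p) == 0.0                                # identity
assert euclidean(p, q) == euclidean(q, p)                    # symmetry
assert euclidean(p, r) <= euclidean(p, q) + euclidean(q, r)  # triangle inequality
```

The triangle inequality is the property the later slides exploit: it lets a cached query bound the distance between a new query and objects never compared against it.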

  3. Distributed Similarity Search System. Top-K queries reach the front-end of the CBIR system, which forwards them to a parallel & distributed CBIR system (Unit 1 ... Unit n, each holding an index of MM objects). Search cost is close to O( |DB| )!

  4. Cached Distributed Similarity Search System. Top-K queries now pass through a Metric Cache in the front-end of the CBIR system before reaching the parallel & distributed CBIR system (Unit 1 ... Unit n, each holding an index of MM objects).

  5. Cached Distributed Similarity Search System: what’s different in a Metric Cache? (Same architecture as the previous slide: Metric Cache in the front-end, ahead of the parallel & distributed CBIR system.)

  6. What’s different in a Metric Cache? • The cache stores result objects, not only result pointers • e.g.: documents vs. document ids • The cache is a peculiar sample of the whole dataset: • the set of objects most recently seen by the users (= most interesting!?) • Claim: an interesting object may be used to answer approximately if it is sufficiently similar to the query.

  7. What’s different in a Metric Cache? ...and... • Queries may be approximate! • [Zobel et al., CIVR 07]: at least 8% of the images on the web are near-duplicates. Most of them are due to cropping, contrast adjustment, etc. • Requirement: the system must be robust w.r.t. near-duplicate queries.

  8. What’s different in a Metric Cache? (Diagram: queries q1, q2, q3 reach the Metric Cache in the front-end of the CBIR system; q1 receives an approximate answer from the cache, while q2 and q3 receive exact answers from the parallel & distributed CBIR system.)

  9. Two algorithms: RCache vs. QCache. RCache(q, k): • If q ∈ Cache, return R (cache hit) • R = Cache.knn(q, k) • If quality(R) exceeds the threshold, return R (approximate hit) • else R = DB.knn(q, k); Cache.add(q, R); return R (cache miss) • In case of an approximate hit, the cached query q’, being the closest to q, is marked as used. • The least recently used query and its results are evicted.
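The RCache control flow above can be sketched in Python. This is a minimal sketch, not the authors' implementation: `Cache.knn` and `DB.knn` are stood in by a brute-force scan, and the `quality` function and `threshold` are hypothetical placeholders (the slide does not give the threshold symbol).

```python
from collections import OrderedDict

def knn(objects, dist, q, k):
    """Brute-force k-NN scan; stands in for Cache.knn / DB.knn."""
    return sorted(objects, key=lambda o: dist(o, q))[:k]

class RCache:
    """Sketch of RCache: an LRU map from each past query to its k results."""

    def __init__(self, dist, db, quality, threshold, capacity=1000):
        self.store = OrderedDict()  # query -> cached result list, in LRU order
        self.dist, self.db = dist, db
        self.quality, self.threshold, self.capacity = quality, threshold, capacity

    def query(self, q, k):
        if q in self.store:                        # cache hit: exact answer
            self.store.move_to_end(q)              # mark as recently used
            return self.store[q]
        # approximate hit: scan all cached result objects, O(|Cache|)
        pool = [o for res in self.store.values() for o in res]
        R = knn(pool, self.dist, q, k)
        if R and self.quality(q, R) > self.threshold:
            return R
        R = knn(self.db, self.dist, q, k)          # cache miss: O(|DB|)
        if len(self.store) >= self.capacity:
            self.store.popitem(last=False)         # evict least recently used
        self.store[q] = R
        return R
```

For example, with `dist = lambda a, b: abs(a - b)` over a database of integers, the first lookup of a query scans the database, a repeat of the same query is a cache hit, and a nearby query can be served approximately from the cached results alone.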

  10. Costs of RCache vs. QCache. RCache(q, k): • Cache hit: hash table access, O(1) • Approximate hit: search among all the cached result objects, O( |Cache| ) • Cache miss: search among all the objects in the database, O( |DB| ) • |Cache| is the number of cached objects, and |DB| is the size of the database.

  11. Two algorithms: RCache vs. QCache. QCache(q, k): • If q ∈ Cache, return R (cache hit) • Q* = Cache.knn(q, ·), the nearest cached queries • R* = {results in Q*} • R = R*.knn(q, k) • If quality(R) exceeds the threshold, return R (approximate hit) • else R = DB.knn(q, k); Cache.add(q, R); return R (cache miss) • RCache(q, k) is as on the previous slide. • In case of an approximate hit, the cached query q’, being the closest to q, is marked as used. • The least recently used query and its results are evicted.
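QCache differs only in the approximate-hit step: it first searches the cached query objects, then ranks only the results attached to the nearest ones. A minimal sketch under the same assumptions as before; `n_queries` stands for the slide's elided second parameter of `Cache.knn`, and `quality`/`threshold` are hypothetical placeholders.

```python
def qcache_query(cached, db, dist, q, k, n_queries, quality, threshold):
    """Sketch of QCache.  `cached` maps each past query to its result list;
    `n_queries` is how many nearest cached queries to inspect."""
    if q in cached:                                   # cache hit: exact answer
        return cached[q]
    # approximate hit: scan only the query objects, O(|Cache|/k)
    q_star = sorted(cached, key=lambda c: dist(c, q))[:n_queries]
    pool = [o for c in q_star for o in cached[c]]     # results of those queries
    R = sorted(pool, key=lambda o: dist(o, q))[:k]
    if R and quality(q, R) > threshold:
        return R
    R = sorted(db, key=lambda o: dist(o, q))[:k]      # cache miss: O(|DB|)
    cached[q] = R
    return R
```

The cost difference on the next slide falls out directly: the approximate-hit scan touches |Cache|/k query objects instead of all |Cache| result objects.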

  12. Costs of RCache vs. QCache. QCache(q, k): • Approximate hit: search among the cached query objects only, O( |Cache| / k ). RCache(q, k): • Cache hit: hash table access, O(1) • Approximate hit: search among all the cached result objects, O( |Cache| ) • Cache miss: search among all the objects in the database, O( |DB| ) • Supposing k results are stored for each query, |Cache| is the number of cached objects and |DB| is the size of the database.

  13. Approximation & Guarantees • Let the safe range be s = r* - d(q, q*), where q* is the cached query closest to q and r* is the radius of its cached result set. • The cached k* objects within distance s are the true top-k* of the new query. • Every cached query may therefore provide some additional guarantee.
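The safe-range guarantee follows from the triangle inequality: any object u not cached for q* satisfies d(q*, u) > r*, hence d(q, u) ≥ d(q*, u) - d(q, q*) > r* - d(q, q*) = s, so every object within distance s of q must already be among q*'s cached results. A minimal sketch (`guaranteed_results` is a hypothetical helper name, not from the slides):

```python
def guaranteed_results(cached_results, dist, q, q_star, r_star):
    """Return the subset of an approximate answer that is provably exact.

    r_star is the radius of the cached query q_star (distance to its
    farthest cached result).  By the triangle inequality, any object
    outside the cached set lies farther than s = r_star - dist(q, q_star)
    from q, so cached results within s of q are true top results for q.
    """
    s = r_star - dist(q, q_star)
    return [o for o in cached_results if dist(o, q) <= s]
```

With one-dimensional points and absolute-difference distance, a cache entry for q* = 0 with radius r* = 5 guarantees, for a new query q = 1, exactly those cached objects within s = 4 of q.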

  14. Experimental setup • A collection of 1,000,000 images downloaded from Flickr: • we extracted 5 MPEG-7 descriptors, which were used to measure similarity. • A query log of 100,000 images: • a random sample with replacement, using image views • 20% training, 80% testing • k = 20; the number of nearest cached queries inspected is 10 • Quality function: requires the safe range to be at least 0.

  15. Hit ratio

  16. Throughput

  17. Approximation quality I

  18. Approximation quality II

  19. What we also did ... • Take queries from a different collection. • Inject duplicates into the query log. • Use an expectation of RES as the quality measure.

  20. Approximation quality III • Bug: RES=0.07 • Portrait: RES=0.09 • Sunset: RES=0.12

  21. Approximation quality III • Bug: RES=0.07 • Portrait: RES=0.09 • Sunset: RES=0.12

  22. Acknowledgements • The European Project SAPIR • http://www.sapir.eu • P. Zezula and his colleagues for the M-Tree implementation • http://mufin.fi.muni.cz/tiki-index.php • The dataset used is available at • http://cophir.isti.cnr.it

  23. Thank you.

  24. Backup slides
