
Query Sensitive Embeddings



  1. Query Sensitive Embeddings Vassilis Athitsos, Marios Hadjieleftheriou, George Kollios, Stan Sclaroff

  2. Abstract A common problem in many types of databases is retrieving the most similar matches to a query object. Finding those matches in a large database can be too slow to be practical, especially in domains where objects are compared using computationally expensive similarity (or distance) measures. This paper proposes a novel method for approximate nearest neighbor retrieval in such spaces. Our method is embedding-based, meaning that it constructs a function that maps objects into a real vector space. The mapping preserves a large amount of the proximity structure of the original space, and it can be used to rapidly obtain a short list of likely matches to the query. The main novelty of our method is that it constructs, together with the embedding, a query-sensitive distance measure that should be used when measuring distances in the vector space. The term “query sensitive” means that the distance measure changes depending on the current query object. We report experiments with an image database of handwritten digits, and a time-series database. In both cases, the proposed method outperforms existing state-of-the-art embedding methods, meaning that it provides significantly better trade-offs between efficiency and retrieval accuracy. • Comparing many high-dimensional objects can be expensive. • This paper proposes a way to reduce the dimensionality, and therefore the cost, of the comparisons… • …by training an algorithm to give different weights to different measures depending on the query. • Tests on real-world data show better efficiency/accuracy trade-offs than the best known algorithms.

  3. Authors’ Previous Work • BoostMap: A method for efficient approximate similarity rankings. (2004) • Athitsos, Sclaroff, Kollios • Indexing multi-dimensional time-series with support for multiple distance measures. (2003) • Hadjieleftheriou

  4. Terms/Concepts • Embedding • Maps any high-dimensional object into a d-dimensional vector. (Slide figure: example vector values 21 53 72 9 25 101 20 35 70.)
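
For intuition, one standard way to build such an embedding (the building block used by Lipschitz embeddings and by BoostMap) is to measure distances to a few fixed reference objects. A minimal sketch follows; `true_distance` and `references` are hypothetical placeholders, not names from the paper:

```python
# Sketch of a reference-object embedding: each coordinate of F(x) is
# the true distance from x to one fixed reference object.
def embed(x, references, true_distance):
    """Map object x to a d-dimensional vector, where d = len(references)."""
    return [true_distance(x, r) for r in references]
```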

  5. …more Terms/Concepts • Classifier • Given 3 points q, a & b, which is closer to q, a or b?
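
Concretely, the label that such a classifier tries to predict can be written as a tiny function; this is a sketch of the task definition under an assumed `true_distance`, not code from the paper:

```python
def ground_truth(q, a, b, true_distance):
    """Return +1 if a is closer to q than b is, -1 otherwise.
    This is the label a classifier on triples (q, a, b) must predict."""
    return 1 if true_distance(q, a) < true_distance(q, b) else -1
```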

  6. …more Terms/Concepts • Distance Measure • A metric or non-metric measure of the true proximity of any two objects. • Metrics • How close is [ 2 5 72 3 5 ] to [ 5 5 45 1 1 ]? • Euclidean (L2) Distance • [ 1 3 ] to [ 5 7 ]: √((1−5)² + (3−7)²) = √32 ≈ 5.66 • Manhattan (L1) Distance • |1−5| + |3−7| = 8
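
A quick check of the slide's example in Python:

```python
import math

a, b = [1, 3], [5, 7]

l2 = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))  # Euclidean: sqrt(32) ~ 5.66
l1 = sum(abs(x - y) for x, y in zip(a, b))               # Manhattan: 4 + 4 = 8
print(l2, l1)
```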

  7. …more Terms/Concepts • Splitter • Returns 1 if a given object is in a particular group (defined for that splitter), 0 otherwise.
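
One natural way to parameterize such a group is an interval test on the value of a 1D embedding, so that the splitter fires only for queries falling in a region of the embedded space. The sketch below assumes that form; treat the exact parameterization as an assumption rather than the paper's definition:

```python
def make_splitter(f, lo, hi):
    """Build a splitter from a 1D embedding f and an interval [lo, hi]:
    returns 1 if f(q) falls inside the interval, 0 otherwise."""
    def splitter(q):
        return 1 if lo <= f(q) <= hi else 0
    return splitter
```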

  8. Related Work • Hash/Tree Structures for High-Dimensional Data • Problems? • Performance degrades in high dimensions. • Tree-based methods rely on Euclidean/metric properties, which do not hold in non-metric spaces. • AdaBoost • “Adaptive Boosting” generates new classifiers based on the failures of previous classifiers.

  9. Motivations for Query Sensitive Distance Measures • Lack of Contrast • Two high-dimensional objects are unlikely to be similar in all dimensions. (Slide example: [4 5 6 23 2 5] vs. [4 5 6 57 2 20]: identical in four coordinates, very different in the other two.)

  10. Motivations for Query Sensitive Distance Measures • Statistical Sensitivity • Data is rarely uniformly distributed, so for any given object there may be relatively few coordinates that are statistically significant for comparisons involving it.

  11. Simple Embeddings • An embedding F is assumed to map objects into a vector space where distances are significantly cheaper to compute than the ‘true’ distance DX in the original space.

  12. Weak Classifiers • Given a triple (q, a, b), a simple embedding F correctly classifies the triple more than 50% of the time: better than chance, and therefore usable as a weak classifier.
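
A 1D embedding F induces such a classifier in a standard way: vote for whichever of a and b is closer to q in the embedded space, with the magnitude of the output acting as confidence. A minimal sketch (our naming):

```python
def weak_classify(f, q, a, b):
    """Weak classifier induced by a 1D embedding f: a positive output
    votes 'a is closer to q', a negative output votes 'b is closer'.
    The magnitude reflects confidence."""
    return abs(f(q) - f(b)) - abs(f(q) - f(a))
```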

  13. Key to Query Sensitive Embeddings • Multiple embeddings are significantly cheaper to compute than the actual distance between two objects. • Many weak classifiers can be combined to create a strong classifier (proven in the BoostMap paper) • Each classifier can be assigned a different weight depending upon the query via splitters.
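
Putting these pieces together, the strong classifier is a weighted vote of weak classifiers in which each weight is gated by a splitter, and so depends on the query. A minimal sketch, with `classifiers` as a list of (alpha, splitter, weak_fn) triples (our naming, not the paper's):

```python
def strong_classify(classifiers, q, a, b):
    """Query-sensitive strong classifier H(q, a, b): a weighted sum of
    weak classifier outputs, where each weight alpha is applied only
    when the corresponding splitter s accepts the query q."""
    return sum(alpha * s(q) * weak(q, a, b)
               for alpha, s, weak in classifiers)
```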

  14. Constructing an Embedding & Query Sensitive Distance Measure (via BoostMap) • Specify a large family of 1D embeddings. • Use the embeddings to specify binary classifiers on object triples (q, a, b). • Combine the many classifiers into a single classifier H using AdaBoost. • Use H to define query-sensitive embeddings and distance measures.
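
Step 3 is standard AdaBoost. Below is a compressed, toy version of one training loop, assuming pre-labeled triples and a fixed pool of candidate weak classifiers (all names are ours); the real BoostMap training adds details omitted here:

```python
import math

def adaboost(triples, labels, candidates, rounds):
    """Toy AdaBoost over triples (q, a, b) with labels in {-1, +1}:
    each round picks the weak classifier with the lowest weighted error,
    assigns it a weight alpha, and upweights misclassified triples."""
    n = len(triples)
    w = [1.0 / n] * n                              # weights over training triples
    ensemble = []                                  # list of (alpha, classifier)
    for _ in range(rounds):
        def weighted_error(h):
            return sum(wi for wi, t, y in zip(w, triples, labels)
                       if (1 if h(*t) > 0 else -1) != y)
        h = min(candidates, key=weighted_error)
        err = weighted_error(h)
        if err >= 0.5:                             # no better-than-chance learner left
            break
        alpha = 0.5 * math.log((1 - err) / max(err, 1e-10))
        ensemble.append((alpha, h))
        preds = [1 if h(*t) > 0 else -1 for t in triples]
        w = [wi * math.exp(-alpha * y * p)         # mistakes up, correct down
             for wi, y, p in zip(w, labels, preds)]
        total = sum(w)
        w = [wi / total for wi in w]               # renormalize
    return ensemble
```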

  15. Result: Fout & Dout • Fout is a d-dimensional embedding composed of d 1D embeddings from H. • Dout is a distance measure on the vectors produced by Fout. • Dout resembles a weighted L1 measure, but because it is query-sensitive it is neither symmetric nor a metric. • Together, Fout and Dout define a classifier equivalent to H.
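
Concretely, a query-sensitive weighted L1 distance of the kind the slide describes looks like the sketch below (our naming; the paper's exact parameterization may differ). The asymmetry is visible in the code: the weights depend on q, so in general Dout(q, x) ≠ Dout(x, q).

```python
def d_out(q, q_vec, x_vec, components):
    """Query-sensitive distance between embedding vectors
    q_vec = Fout(q) and x_vec = Fout(x): a weighted L1 distance in which
    coordinate j contributes only if its splitter accepts the query q.
    `components` is a list of (alpha, splitter) pairs, one per coordinate."""
    return sum(alpha * s(q) * abs(qi - xi)
               for (alpha, s), qi, xi in zip(components, q_vec, x_vec))
```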

  16. Training • The original BoostMap chose training triples randomly. • Better results come from choosing triples that match the retrieval task. • For k-nearest-neighbor retrieval, choose triples (q, a, b) such that a is a k-nearest neighbor of q and b is not.
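
A sketch of that sampling scheme, assuming the database is small enough to rank exhaustively during training (all names are ours):

```python
import random

def sample_knn_triple(q, database, k, true_distance):
    """Sample a training triple (q, a, b) where a is one of q's k nearest
    neighbors under the true distance and b is not, so training matches
    the k-nearest-neighbor retrieval task. Assumes len(database) > k."""
    ranked = sorted(database, key=lambda x: true_distance(q, x))
    a = random.choice(ranked[:k])       # a true k-nearest neighbor of q
    b = random.choice(ranked[k:])       # any object that is not one
    return (q, a, b)
```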

  17. Complexity • BoostMap requires one-time training. O(mt) • Other methods require no training. • Online retrieval takes O(d) time • Cost is “similar” to other methods. • Other Methods: • FastMap • SparseMap • Metric Map

  18. Filter & Refine Retrieval • Compute the embedding for the query object and any reference/pivot objects. • Find the database objects with the most similar vectors. • Sort the results by the true distance measure.

  19. Experimental Results • Query-Sensitive Embeddings lead to better performance than embeddings using a global L1 distance measure. • Outperforms FastMap and the original BoostMap

  20. Conclusions & Further Work • Embeddings are the only family of methods that are efficient and non-specific. • How can this algorithm be applied to the choosing of a meaningful distance measure for high dimensional vectors?
