Scalable and Distributed Similarity Search in Metric Spaces • Michal Batko, Claudio Gennaro, Pavel Zezula
Presentation contents • Motivation • Metric spaces and similarity searching • GHT* • Concepts • Generalized Hyperplane Tree • Distributed architecture • Experimental results • Conclusions and future work
Motivation • Searching is a fundamental problem • Traditional search • Numbers or strings • Based on a total linear order of keys • New approach • Free text, images, audio, video, etc. • Impossible to structure into keys and records
Alternative • Similarity searching • Metric spaces
Metric space • Set of objects (A) • any class of objects for which a distance can be computed • for example text, audio or video files • Metric function (d) • non-negative: d(x, y) ≥ 0 • reflexive: d(x, x) = 0 • symmetric: d(x, y) = d(y, x) • triangle inequality: d(x, z) ≤ d(x, y) + d(y, z)
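The four properties of the metric function can be checked empirically on a finite sample of objects. The following is a minimal sketch, not part of the presentation: the helper name `check_metric` and the tolerance `tol` are assumptions made for illustration, and passing the check on a sample does not prove that a function is a metric in general.

```python
import itertools
import math

def check_metric(objects, d, tol=1e-9):
    """Return the names of the metric properties that d violates on `objects`.

    Hypothetical helper: it only tests the given finite sample.
    """
    violated = set()
    for x in objects:
        if abs(d(x, x)) > tol:
            violated.add("reflexivity")
    for x, y in itertools.product(objects, repeat=2):
        if d(x, y) < -tol:
            violated.add("non-negativity")
        if abs(d(x, y) - d(y, x)) > tol:
            violated.add("symmetry")
    for x, y, z in itertools.product(objects, repeat=3):
        if d(x, z) > d(x, y) + d(y, z) + tol:
            violated.add("triangle inequality")
    return violated

points = [(0.0,), (1.0,), (2.0,)]
print(check_metric(points, math.dist))                                # Euclidean distance
print(check_metric(points, lambda a, b: math.dist(a, b) ** 2))        # squared distance
```

The squared Euclidean distance is included as a counterexample: it is non-negative, reflexive and symmetric, but it breaks the triangle inequality on these three points.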
Similarity searching • Range search • objects at distance at most r from the query object Q • k-nearest neighbor search • the k objects nearest to the query object Q
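Both query types can be stated as a linear scan over the data set, which is the baseline that any index structure has to beat. This is a minimal sketch under the assumption that objects are tuples compared with the Euclidean distance; the function names are illustrative and do not come from the presentation.

```python
import heapq
import math

def range_search(objects, d, q, r):
    """Return all objects whose distance from the query object q is at most r."""
    return [o for o in objects if d(q, o) <= r]

def knn_search(objects, d, q, k):
    """Return the k objects nearest to the query object q, closest first."""
    return heapq.nsmallest(k, objects, key=lambda o: d(q, o))

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
print(range_search(points, math.dist, (0.0, 0.0), 1.0))
print(knn_search(points, math.dist, (0.0, 0.0), 2))
```

Every query here costs one distance computation per stored object, which is what makes an indexing structure worthwhile when the distance function is expensive.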
GHT* – concepts • Data distributed among servers • Multiple buckets with limited capacity • Clients perform updates and search • Bucket location algorithm • Based on DDH and DST algorithms • Exploits Generalized Hyperplane Tree
Generalized Hyperplane Tree • Single-site metric space indexing structure • Allows similarity searching and is scalable • Binary search tree • Data stored in leaf nodes • Inner nodes for routing • Two “pivots” per node
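The structure described on this slide can be sketched in a few lines: objects live in leaf buckets, and each inner node routes an object to the side of the pivot it is closer to. The sketch below is an illustration under stated assumptions, not the authors' implementation. The bucket capacity, the class names, and the pivot choice on a split (the first object in the bucket and the object farthest from it) are all assumptions. The pruning rule in `range_search` is the standard generalized hyperplane condition derived from the triangle inequality.

```python
import math

class Node:
    def __init__(self):
        self.bucket = []      # objects, used only while the node is a leaf
        self.pivots = None    # (p1, p2) once the node becomes an inner node
        self.left = None      # subtree of objects closer to p1 (ties go left)
        self.right = None     # subtree of objects closer to p2

class GHT:
    def __init__(self, d, capacity=4):
        self.d = d
        self.capacity = capacity   # assumed bucket capacity
        self.root = Node()

    def _child(self, node, obj):
        p1, p2 = node.pivots
        return node.left if self.d(obj, p1) <= self.d(obj, p2) else node.right

    def insert(self, obj):
        node = self.root
        while node.pivots is not None:          # route through inner nodes
            node = self._child(node, obj)
        node.bucket.append(obj)
        if len(node.bucket) > self.capacity:
            self._split(node)

    def _split(self, node):
        # Assumed pivot choice: first object and the object farthest from it.
        objs, node.bucket = node.bucket, []
        p1 = objs[0]
        p2 = max(objs, key=lambda o: self.d(p1, o))
        node.pivots, node.left, node.right = (p1, p2), Node(), Node()
        for o in objs:                          # redistribute once, no recursion
            self._child(node, o).bucket.append(o)

    def range_search(self, q, r):
        result, stack = [], [self.root]
        while stack:
            node = stack.pop()
            if node.pivots is None:             # leaf: check every stored object
                result.extend(o for o in node.bucket if self.d(q, o) <= r)
                continue
            p1, p2 = node.pivots
            d1, d2 = self.d(q, p1), self.d(q, p2)
            if d1 - r <= d2 + r:                # query ball may reach the left side
                stack.append(node.left)
            if d1 + r >= d2 - r:                # query ball may reach the right side
                stack.append(node.right)
        return result

tree = GHT(math.dist, capacity=4)
for x in range(10):
    for y in range(10):
        tree.insert((float(x), float(y)))
print(sorted(tree.range_search((4.5, 4.5), 1.0)))
```

A subtree is skipped only when the triangle inequality guarantees that no object on that side can lie within distance r of the query, so the result always matches a linear scan.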
GHT* – distributed architecture • GHT is used as search structure • Leaf node represents a server • unique server identifier • servers extend the tree with leaf nodes for their local buckets • Inner nodes store routing information • GHT is replicated • GHT can be inaccurate • Update (image adjustment) messages
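The interplay between an inaccurate replica and an image adjustment message can be illustrated with a toy model. Everything in this sketch is an assumption made for illustration: the class and method names, the idea of sending the client's path along with the request, and the simplification that the contacted server already holds an accurate view of the subtree below the client's stale leaf. The slides do not specify the message format or the forwarding protocol.

```python
import copy
import math
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Node:
    server: Optional[str] = None      # leaf: identifier of the responsible server
    pivots: Optional[Tuple] = None    # inner node: the two routing pivots
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def descend(node, obj, d):
    """Follow the pivots from `node` to a leaf; return the leaf and the path taken."""
    path = []
    while node.server is None:
        p1, p2 = node.pivots
        go_left = d(obj, p1) <= d(obj, p2)
        path.append("L" if go_left else "R")
        node = node.left if go_left else node.right
    return node, path

def follow(node, path):
    """Replay a recorded path of "L"/"R" steps starting from `node`."""
    for step in path:
        node = node.left if step == "L" else node.right
    return node

class Server:
    def __init__(self, tree):
        self.tree = tree              # this server's replica of the tree

    def handle(self, obj, client_path, d):
        """Return (responsible server, image adjustment or None)."""
        subtree = follow(self.tree, client_path)   # what the client saw as a leaf
        leaf, _ = descend(subtree, obj, d)
        stale = subtree.server is None             # client's leaf was split meanwhile
        return leaf.server, (copy.deepcopy(subtree) if stale else None)

class Client:
    def __init__(self, tree):
        self.tree = tree              # possibly outdated replica

    def locate(self, obj, servers, d):
        leaf, path = descend(self.tree, obj, d)
        owner, adjustment = servers[leaf.server].handle(obj, path, d)
        if adjustment is not None:    # graft the newer subtree over the stale leaf
            leaf.server, leaf.pivots = None, adjustment.pivots
            leaf.left, leaf.right = adjustment.left, adjustment.right
        return owner

current = Node(pivots=((0.0,), (10.0,)), left=Node(server="S1"), right=Node(server="S2"))
servers = {"S1": Server(current), "S2": Server(current)}
client = Client(Node(server="S1"))    # stale view: everything is on S1
print(client.locate((9.0,), servers, math.dist))
```

In this model the first request is misaddressed to S1, which resolves the correct server and returns the newer subtree. After grafting it, the client's next request for a nearby object goes to the right server directly.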
Experimental results – inserting • Preliminary phase • Tests for vector space with Euclidean distance function
Experimental results – searching • 20 range queries with radius 50 points (match approx. 3 objects)
Conclusions • First structure for scalable distributed similarity search • Satisfies properties of SDDS • Scalability – can expand to new servers through autonomous splits • No hot-spot – all clients use as precise addressing as possible and learn from misaddressing • Updates are local and never require updates to multiple clients • Client performs only a few distance computations to locate servers
Future work • More experiments • Different metric spaces • More complex evaluation • Additional evaluated properties • Nearest neighbor search • Algorithm for parallel processing to better utilize distributed structure • Experimental evaluation