Christian Böhm & Hans-Peter Kriegel, Ludwig Maximilians Universität München A Cost Model and Index Architecture for the Similarity Join. Feature Based Similarity. Simple Similarity Queries. Specify query object and Find similar objects – range query
Christian Böhm & Hans-Peter Kriegel, Ludwig Maximilians Universität München
A Cost Model and Index Architecture for the Similarity Join
Simple Similarity Queries
• Specify a query object and
• Find similar objects – range query
• Find the k most similar objects – nearest neighbor query
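The two query types above can be sketched as brute-force scans over a list of feature vectors. This is only an illustration under assumptions: the Euclidean metric and the function names `range_query` and `knn_query` are not taken from the slides.

```python
import heapq
import math

def dist(p, q):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def range_query(points, query, eps):
    """Return all points within distance eps of the query object."""
    return [p for p in points if dist(p, query) <= eps]

def knn_query(points, query, k):
    """Return the k points closest to the query object, nearest first."""
    return heapq.nsmallest(k, points, key=lambda p: dist(p, query))

points = [(0.0, 0.0), (1.0, 0.0), (3.0, 4.0), (0.5, 0.5)]
print(range_query(points, (0.0, 0.0), 1.0))
print(knn_query(points, (0.0, 0.0), 2))
```

Both functions inspect every point; an index structure exists precisely to avoid this full scan.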
Join Applications: Catalogue Matching
• Catalogue matching of two point sets R and S
• E.g. astronomical catalogues
Join Applications: Clustering
• Clustering (e.g. DBSCAN)
• Similarity self-join
R-tree Spatial Join (RSJ)
procedure r_tree_sim_join (R, S, ε)
  if IsDirpg (R) ∧ IsDirpg (S) then
    foreach r ∈ R.children do
      foreach s ∈ S.children do
        if mindist (r, s) ≤ ε then
          CacheLoad (r); CacheLoad (s);
          r_tree_sim_join (r, s, ε);
  else (* assume R, S both DataPg *)
    foreach p ∈ R.points do
      foreach q ∈ S.points do
        if |p - q| ≤ ε then report (p, q);
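A minimal in-memory sketch of the RSJ procedure follows. The `Node` class and the `leaf` and `directory` helpers are hypothetical stand-ins for R-tree pages, Euclidean distance is assumed, and `CacheLoad` is omitted because all pages already reside in memory. Like the pseudocode, the sketch assumes both trees have the same height.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    """Hypothetical page: a directory page has children, a data page has points."""
    lo: tuple  # lower corner of the bounding box
    hi: tuple  # upper corner of the bounding box
    children: list = field(default_factory=list)
    points: list = field(default_factory=list)

    def is_dir(self):
        return bool(self.children)

def mindist(r, s):
    """Smallest possible distance between a point in box r and a point in box s."""
    total = 0.0
    for rl, rh, sl, sh in zip(r.lo, r.hi, s.lo, s.hi):
        gap = max(sl - rh, rl - sh, 0.0)  # 0 if the boxes overlap in this dimension
        total += gap * gap
    return math.sqrt(total)

def r_tree_sim_join(r, s, eps, out):
    """Append all pairs (p, q) with |p - q| <= eps to out."""
    if r.is_dir() and s.is_dir():
        for rc in r.children:
            for sc in s.children:
                if mindist(rc, sc) <= eps:
                    r_tree_sim_join(rc, sc, eps, out)
    else:  # as in the pseudocode, assume both are data pages
        for p in r.points:
            for q in s.points:
                if math.dist(p, q) <= eps:
                    out.append((p, q))

def leaf(points):
    """Build a data page whose box is the bounding box of its points."""
    dims = range(len(points[0]))
    lo = tuple(min(p[i] for p in points) for i in dims)
    hi = tuple(max(p[i] for p in points) for i in dims)
    return Node(lo, hi, points=points)

def directory(children):
    """Build a directory page whose box encloses all child boxes."""
    dims = range(len(children[0].lo))
    lo = tuple(min(c.lo[i] for c in children) for i in dims)
    hi = tuple(max(c.hi[i] for c in children) for i in dims)
    return Node(lo, hi, children=children)

R = directory([leaf([(0.0, 0.0), (0.1, 0.1)]), leaf([(5.0, 5.0)])])
S = directory([leaf([(0.0, 0.2)]), leaf([(5.1, 5.0), (9.0, 9.0)])])
pairs = []
r_tree_sim_join(R, S, 0.25, pairs)
print(pairs)
```

Only two of the four data-page pairs pass the `mindist` test here, so the point-level loops run on half of the page combinations.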
Cost Modeling
• Single similarity queries: access probability of pages modeled using the concept of the Minkowski sum
Cost Modeling
• Binomial formula:
Cost Modeling
• Mating probability of index pages:
• Probability that the distance between two pages is ≤ ε
• Two-fold application of the Minkowski sum
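The Minkowski-sum idea can be illustrated numerically. The sketch below rests on assumptions not stated in the slides: pages are hypercubes of side length a, points are uniformly distributed in a unit data space, and boundary effects are ignored. The volume uses the standard binomial expansion for the Minkowski sum of a cube and a sphere of radius ε, which may differ in detail from the formula on the slide; all function names are hypothetical.

```python
import math

def sphere_volume(k, radius):
    """Volume of a k-dimensional sphere with the given radius."""
    return math.pi ** (k / 2) / math.gamma(k / 2 + 1) * radius ** k

def minkowski_volume(a, eps, d):
    """Volume of a d-dimensional cube of side a enlarged by a sphere of radius eps:
    sum over k of C(d, k) * a^(d - k) * V_k(eps)."""
    return sum(math.comb(d, k) * a ** (d - k) * sphere_volume(k, eps)
               for k in range(d + 1))

def access_probability(a, eps, d):
    """Probability that a range query with radius eps touches a page of side a
    (assumption: uniform query positions in the unit cube, no boundary effects)."""
    return min(1.0, minkowski_volume(a, eps, d))

def mating_probability(a_r, a_s, eps, d):
    """Probability that two pages with sides a_r and a_s are within distance eps.
    Two-fold Minkowski sum: cube + cube gives a cube of side a_r + a_s,
    which is then enlarged by the sphere."""
    return min(1.0, minkowski_volume(a_r + a_s, eps, d))

print(minkowski_volume(1.0, 0.5, 2))        # 1 + 4 * 0.5 + pi * 0.25 in 2D
print(access_probability(0.1, 0.05, 2))
print(mating_probability(0.1, 0.1, 0.05, 8))
```

In two dimensions the expansion reduces to area plus perimeter times ε plus πε², which is a quick sanity check of the general formula.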
Page Capacity Optimization
• The cost model can determine the index selectivity, which depends on various parameters
• Page capacity (number of stored points) is an important parameter
• Known from similarity search: page capacity optimization yields considerable improvement
Analysis of the Index Overhead
• Assuming 100% selectivity (index doesn't work): how much more expensive is index usage?
• CPU:
  • Distance between boxes is more expensive to compute than distance between points: α ≈ 5
  • Smaller capacity ⇒ more box distance computations
Analysis of the Index Overhead
• Disk I/O:
  • High constant cost per page access (move disk head)
  • A page access is more expensive than continuous reading of a point by a factor of β ≈ 10000 / d
  • Smaller capacity ⇒ more disk head movement
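The effect of page capacity on both overheads can be made concrete with a toy calculation. This is not the authors' cost model: it considers only the data-page level, assumes every page pair is compared (100% selectivity), and assumes every page access pays one disk seek. The function names and the parameter `beta_const` are hypothetical; α ≈ 5 and β ≈ 10000 / d are taken from the slides.

```python
def cpu_overhead(capacity, alpha=5.0):
    """Extra CPU work relative to the point distance computations.
    Each pair of data pages costs one box distance (alpha point-distance units)
    on top of capacity ** 2 point distance computations."""
    return alpha / capacity ** 2

def io_overhead(capacity, d, beta_const=10000.0):
    """Extra I/O cost relative to sequentially reading the points of a page.
    Each page access pays one seek worth beta = beta_const / d point reads
    on top of reading its capacity points."""
    beta = beta_const / d
    return beta / capacity

for capacity in (10, 100, 10000):
    print(capacity, cpu_overhead(capacity), io_overhead(capacity, d=16))
```

Under these assumptions the CPU overhead is already negligible at small capacities, while the I/O overhead only becomes small once pages hold thousands of points, which matches the direction of both "smaller capacity" bullets.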
Analysis of the Index Overhead
• What selectivity is needed for the index to pay off?
Optimization
• I/O cost function: is optimized by:
• CPU cost function: is optimized by:
Optimization
• I/O cost:
  • Large capacity optimum (typically several tens of thousands of points)
• CPU cost:
  • Small capacity optimum (typically < 100 points)
• No compromise achievable
Multipage Index (MuX)
→ CPU performance like a CPU-optimized index
→ I/O performance like an I/O-optimized index
• Separate optimization
Experimental Evaluation: Uniform 4D and Uniform 8D (result plots)
Experimental Evaluation: CAD Data 16D and Color Images 64D (result plots)
Conclusions
• Summary
  • High potential for performance gains of the similarity join by page capacity optimization
  • Necessary to optimize I/O and CPU separately
• Future research potential
  • Similarity join for metric index structures
  • Approximate similarity join
  • Parallel similarity join algorithms
Consequences
• Assume for I/O optimization selectivity ≈ 100%
• Page accesses in a nested-block-loop-like style:
  fill cache with pages of R (1 page free);
  foreach S-page s do
    if s joins some of the cached R-pg then
      load (s);
      foreach joining R-page r in cache do
        if mindist (r, s) < ε then join (r, s);
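The page schedule above can be sketched as follows. Pages are modeled as plain lists of points and loading is simulated by a counter, so the names `block_loop_join`, `bounding_box` and `cache_size` are assumptions for illustration only. The sketch also tests `mindist` with ≤ rather than <, so that pairs at exactly distance ε are kept.

```python
import math

def bounding_box(points):
    """Return (lo, hi) corners of the bounding box of a non-empty page."""
    dims = range(len(points[0]))
    return (tuple(min(p[i] for p in points) for i in dims),
            tuple(max(p[i] for p in points) for i in dims))

def mindist(r, s):
    """Smallest possible distance between two boxes given as (lo, hi) pairs."""
    (r_lo, r_hi), (s_lo, s_hi) = r, s
    return math.sqrt(sum(max(sl - rh, rl - sh, 0.0) ** 2
                         for rl, rh, sl, sh in zip(r_lo, r_hi, s_lo, s_hi)))

def block_loop_join(r_pages, s_pages, eps, cache_size):
    """Join in nested-block-loop style; returns (result pairs, number of S-page loads)."""
    if cache_size < 2:
        raise ValueError("need room for at least one R-page and one S-page")
    r_boxes = [bounding_box(p) for p in r_pages]
    s_boxes = [bounding_box(p) for p in s_pages]
    block = cache_size - 1  # one cache page stays free for the current S-page
    result, s_loads = [], 0
    for start in range(0, len(r_pages), block):
        cached = range(start, min(start + block, len(r_pages)))  # fill cache with R-pages
        for j, s_page in enumerate(s_pages):
            partners = [i for i in cached if mindist(r_boxes[i], s_boxes[j]) <= eps]
            if not partners:
                continue  # s joins none of the cached R-pages, so it is never loaded
            s_loads += 1
            for i in partners:
                for p in r_pages[i]:
                    for q in s_page:
                        if math.dist(p, q) <= eps:
                            result.append((p, q))
    return result, s_loads

r_pages = [[(0.0, 0.0), (0.1, 0.0)], [(1.0, 1.0)], [(5.0, 5.0)]]
s_pages = [[(0.0, 0.1)], [(1.0, 1.1), (5.0, 5.1)], [(9.0, 9.0)]]
pairs, loads = block_loop_join(r_pages, s_pages, eps=0.15, cache_size=3)
print(pairs, loads)
```

With three R-pages and a cache of three pages, R is processed in two blocks; the S-page containing (9, 9) is never loaded because its box is too far from every cached R-page.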