Tree-based indexing methods for similarity search in metric and nonmetric spaces
Mgr. Jakub Lokoč
Supervisor: Doc. RNDr. Tomáš Skopal, Ph.D.
Department of Software Engineering, Faculty of Mathematics and Physics, Charles University in Prague
MFF UK, Prague
Presentation outline
• Introduction
  • Similarity search
  • M-tree
• Contributions & Results
  • Metric search
  • Nonmetric search
• Outlook
Similarity search
• How to search in large collections of unstructured data?
  • We cannot use relational databases or textual annotation
  • Content-based similarity searching
  • Similarity → distance function δ → metric vs. nonmetric search
  • Feature extraction → feature space
• Problems of similarity searching
  • Effectiveness → selection of complex descriptors and an (often expensive) distance function (not a DB problem)
  • Efficiency → indexing → exact vs. approximate search
[Figure: feature extraction and similarity evaluation for a query object]
Similarity search – variants of δ
• δ is metric
  • Allows indexing by metric access methods (e.g., M-tree)
  • Objects can be organized into separate clusters
• δ is nonmetric
  • Robust similarity functions suitable for domain experts
  • Not constrained by the metric axioms, but metric access methods then support only approximate search
• In our work, we have focused on FAST similarity search in metric and nonmetric spaces by M-tree
M-tree
• Structure and properties
  • Dynamic, balanced, and paged tree structure (like, e.g., B+-tree, R-tree)
  • The leaves are clusters of indexed objects O_j (ground objects)
  • Routing entries in the inner nodes represent hyper-spherical metric regions (O_i, r_Oi), recursively bounding the object clusters in the leaves
  • The triangle inequality allows irrelevant M-tree branches (i.e., metric regions) to be discarded during query evaluation
[Figure: range query Q in a Euclidean 2D space]
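The pruning rule from the last bullet can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Node` class, the `range_query` function, and the sample tree are hypothetical, and a real M-tree additionally stores distances to parent routing objects and pages its nodes. A region (center, radius) is skipped for a range query (q, r_q) whenever dist(q, center) > r_q + radius.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """Simplified, hypothetical M-tree node: a ball (center, radius)."""
    center: tuple
    radius: float
    children: list = field(default_factory=list)  # routing entries (inner node)
    objects: list = field(default_factory=list)   # ground objects (leaf node)


def range_query(node, q, r_q, dist=math.dist, pruned=None):
    """Return the ground objects o under `node` with dist(q, o) <= r_q."""
    # Every object o in the region satisfies dist(center, o) <= radius, so by
    # the triangle inequality dist(q, o) >= dist(q, center) - radius.
    # If that lower bound exceeds r_q, the whole branch is irrelevant.
    if dist(q, node.center) > r_q + node.radius:
        if pruned is not None:
            pruned.append(node)
        return []
    if not node.children:  # leaf: check the ground objects directly
        return [o for o in node.objects if dist(q, o) <= r_q]
    hits = []
    for child in node.children:
        hits.extend(range_query(child, q, r_q, dist, pruned))
    return hits


# Hypothetical two-leaf tree in a Euclidean 2D space.
leaf_a = Node(center=(0.0, 0.0), radius=1.5,
              objects=[(0.0, 0.0), (1.0, 0.0), (0.0, 1.5)])
leaf_b = Node(center=(10.0, 10.0), radius=1.0,
              objects=[(10.0, 10.0), (10.0, 11.0)])
root = Node(center=(5.0, 5.0), radius=9.0, children=[leaf_a, leaf_b])

pruned = []
print(range_query(root, (1.0, 1.0), 1.2, pruned=pruned))  # [(1.0, 0.0), (0.0, 1.5)]
print([n.center for n in pruned])                         # [(10.0, 10.0)]
```

In this example the second leaf is discarded after a single distance computation, without touching its ground objects.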
Contributions to M-tree
• New construction techniques
  • Forced reinserting
  • Hybrid-way leaf selection
  • Parallel dynamic batch loading
• Nonmetric search
  • M-tree variant – NM-tree
Forced reinserting
• Insert new object O11
• Remove O8 and O6 and push them onto the stack
• Decrease the region’s radius (to O11)
• Insert O6 from the stack
• Remove O2 and push it onto the stack
• Decrease the region’s radius (to O6)
• Insert O2 from the stack
• Insert O8 from the stack
[Figure: M-tree leaves holding objects O1–O11 and the reinsertion stack]
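The steps above can be approximated by the following sketch. It is a strong simplification under stated assumptions, not the authors' algorithm: leaves are kept in a flat list instead of a tree, the names `Leaf`, `choose_leaf`, and `insert_with_reinsertion` are hypothetical, exactly one object is evicted per overflow, and the fallback split is only signalled, not performed.

```python
import math


class Leaf:
    """Hypothetical leaf: a fixed routing object plus a bounded object list."""

    def __init__(self, center, objects, capacity=3):
        self.center = center
        self.objects = list(objects)
        self.capacity = capacity

    @property
    def radius(self):
        # Covering radius = distance to the furthest stored object.
        return max((math.dist(self.center, o) for o in self.objects), default=0.0)


def choose_leaf(leaves, obj):
    """Prefer a leaf whose region already covers obj, else minimal enlargement."""
    covering = [l for l in leaves if math.dist(l.center, obj) <= l.radius]
    if covering:
        return min(covering, key=lambda l: math.dist(l.center, obj))
    return min(leaves, key=lambda l: math.dist(l.center, obj) - l.radius)


def insert_with_reinsertion(leaves, obj, max_evictions=3):
    """Insert obj; return the leaves that still overflow and would need a split."""
    stack = [obj]
    overflowing = []
    while stack:
        current = stack.pop()
        leaf = choose_leaf(leaves, current)
        leaf.objects.append(current)
        if len(leaf.objects) <= leaf.capacity:
            continue
        if max_evictions == 0:
            overflowing.append(leaf)  # fall back to a regular split (not shown)
            continue
        max_evictions -= 1
        # Evict the furthest object; the leaf's radius shrinks as a result.
        leaf.objects.sort(key=lambda o: math.dist(leaf.center, o))
        stack.append(leaf.objects.pop())
    return overflowing


# Hypothetical 1D example: leaf a is full, leaf b has one free slot.
a = Leaf((0.0,), [(0.0,), (1.0,), (4.0,)])
b = Leaf((6.0,), [(6.0,), (7.0,)])
insert_with_reinsertion([a, b], (2.0,))
print(a.objects, a.radius)  # [(0.0,), (1.0,), (2.0,)] 2.0
print(b.objects, b.radius)  # [(6.0,), (7.0,), (4.0,)] 2.0
```

In the example, inserting (2.0,) overflows leaf a; instead of splitting, the furthest object (4.0,) is evicted, the radius of a shrinks from 4 to 2, and the evicted object lands in leaf b, which needs the smaller enlargement.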
Hybrid-way leaf selection
• First phase of inserting = find a suitable leaf for the new object
• Classic selection strategies
  • Single-way – fast indexing, less compact hierarchy
  • Multi-way – slower indexing, more compact hierarchy
• Our approach
  • User controls how many branches are visited
  • Finds a suboptimal leaf node
  • May return a full leaf node
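One way to picture the trade-off is a descent that follows at most `width` branches per inner node. This is only a sketch under assumptions: the `Node` class, the functions, and the per-node meaning of `width` are hypothetical, and the budget in the authors' method may be defined differently. With `width=1` it behaves like single-way selection; with a large `width` it approaches multi-way selection. The sketch ignores whether the chosen leaf is full.

```python
import math


class Node:
    """Hypothetical M-tree node; a node without children is a leaf."""

    def __init__(self, center, radius, children=()):
        self.center = center
        self.radius = radius
        self.children = list(children)


def candidate_leaves(node, obj, width):
    """Collect the leaves reached when following at most `width` branches per node."""
    if not node.children:
        return [node]
    ranked = sorted(node.children, key=lambda c: math.dist(c.center, obj))
    covering = [c for c in ranked if math.dist(c.center, obj) <= c.radius]
    # Prefer regions that already cover obj; otherwise take the nearest ones.
    followed = (covering or ranked)[:width]
    leaves = []
    for child in followed:
        leaves.extend(candidate_leaves(child, obj, width))
    return leaves


def select_leaf(root, obj, width):
    """Return the reached leaf whose routing object is closest to obj."""
    return min(candidate_leaves(root, obj, width),
               key=lambda leaf: math.dist(leaf.center, obj))


# Hypothetical tree: the best leaf (y1) lies under the second-closest inner node.
x1 = Node((-1.0, 0.0), 2.0)
x2 = Node((0.0, 2.0), 2.5)
y1 = Node((2.0, 0.0), 1.0)
y2 = Node((6.0, 0.0), 1.0)
x = Node((0.0, 0.0), 5.0, [x1, x2])
y = Node((4.0, 0.0), 3.0, [y1, y2])
root = Node((2.0, 0.0), 8.0, [x, y])

obj = (1.8, 0.0)
print(select_leaf(root, obj, width=1) is x2)  # True
print(select_leaf(root, obj, width=2) is y1)  # True
```

With a budget of one branch, the descent commits to the closer inner node and ends in a leaf that must be enlarged; with two branches it also reaches the leaf that already covers the object.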
Experimental results
CoPhIR (color layout and structure), dimension 76, database size 250,000
Parallel dynamic batch loading
1. Aggregation
2. Parallel batch loading
3. Traditional inserting
Not inserted objects
• “Split generating” – will be inserted in the traditional way (exploiting limited parallelism)
• Postponed – will be inserted during the next batch
• To find scalability bottlenecks, we measured
  • Parallel batch loading time – PI
  • Time of traditional inserts causing a split – ICS
  • Time of traditional inserts not causing a split – INCS
Experimental results
CoPhIR, 1,000,000 objects
Dimension 76 (12 + 64)
L5.123456 distance
Inner/leaf node size 24 / 25
Cache size 512 MB
Nonmetric search
• Metric properties (identity, non-negativity, symmetry, triangle inequality) – too restrictive
• Triangle inequality is the most attacked one
• Semimetric distances (e.g., in molecular biology)
• But how to search efficiently?
[Figure: two 2NN query examples]
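The four metric properties can be tested empirically on a sample of objects. The sketch below is illustrative only; the function name `violated_axioms` and the sample data are assumptions. Note that a finite sample can reveal a violation but can never prove that a distance is a metric.

```python
import itertools


def violated_axioms(dist, sample, tol=1e-12):
    """Return the names of the metric properties that `dist` violates on `sample`."""
    violated = set()
    for x, y in itertools.product(sample, repeat=2):
        d = dist(x, y)
        if d < -tol:
            violated.add("non-negativity")
        if (abs(d) <= tol) != (x == y):  # d(x, y) = 0 exactly when x = y
            violated.add("identity")
        if abs(d - dist(y, x)) > tol:
            violated.add("symmetry")
    for x, y, z in itertools.product(sample, repeat=3):
        if dist(x, z) > dist(x, y) + dist(y, z) + tol:
            violated.add("triangle inequality")
    return violated


points = [0.0, 1.0, 2.0]
# Squared difference is a semimetric: d(0, 2) = 4 > d(0, 1) + d(1, 2) = 2.
print(violated_axioms(lambda a, b: (a - b) ** 2, points))  # {'triangle inequality'}
# Absolute difference satisfies all four properties on this sample.
print(violated_axioms(lambda a, b: abs(a - b), points))    # set()
```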
Nonmetric search
• Related work
  • MAMs can employ a semimetric d_S for approximate search
  • Semimetric behavior can be tuned by transformation functions f (e.g., we can turn a semimetric into a metric, d_M = f_M(d_S))
  • More metric behavior – more precise, but slower search
  • Less metric behavior – less precise, but faster search
  • However, an M-tree is bound to the (semi)metric it was built with (black-box distance)
NM-tree
• The trick
  • We use inversely symmetric transformation functions: d_S = f^-1(f(d_S))
  • f_ei and f_M are evaluated in the initial phase
  • We index data using d_M = f_M(d_S) (to allow exact searching)
  • Stored distances d_M can be transformed back to d_S = f_M^-1(d_M)
  • Retrieval precision e_i is chosen at query time
    • d_ei = f_ei(f_M^-1(f_M(d_S))), or just d_ei = f_ei(d_S)
  • Metric search in upper levels (by d_M)
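The round trip between d_S, d_M, and d_ei can be sketched with a simple invertible modifier. The power function d^w used here is an illustrative assumption, not necessarily the family of functions used in the NM-tree, and the names `make_power_modifier` and `squared_euclidean` are hypothetical. The example relies on one well-known fact: the square root of the squared Euclidean distance (a semimetric) is the Euclidean distance (a metric).

```python
import math


def make_power_modifier(w):
    """Return (f, f_inv) with f(d) = d**w, an increasing concave function for 0 < w <= 1."""
    if not 0 < w <= 1:
        raise ValueError("w must lie in (0, 1]")

    def f(d):
        return d ** w

    def f_inv(d):
        return d ** (1.0 / w)

    return f, f_inv


def squared_euclidean(a, b):
    """Semimetric d_S: squared L2 distance (violates the triangle inequality)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


f_M, f_M_inv = make_power_modifier(0.5)  # sqrt: squared L2 becomes metric L2
f_e, _ = make_power_modifier(0.75)       # assumed less concave, approximate setting

a, b = (0.0, 0.0), (3.0, 4.0)
d_S = squared_euclidean(a, b)   # original semimetric distance
d_M = f_M(d_S)                  # metric distance stored in the index
d_S_restored = f_M_inv(d_M)     # stored distance transformed back
d_e = f_e(d_S_restored)         # distance used for a query with precision e

print(math.isclose(d_M, 5.0))              # True
print(math.isclose(d_S_restored, d_S))     # True
print(math.isclose(d_e, f_e(d_S)))         # True
```

Because the stored d_M can be mapped back to d_S, a single index built once with f_M can serve queries at different precision settings without recomputing the original distances.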
Experimental results
Outlook
• Metric search
  • Combination of more sophisticated M-tree construction techniques and parallelism
  • Adapting the techniques to M-tree descendants
  • Employing the M-tree as a dynamic clustering technique
• Nonmetric search
  • Finding better “nonmetric to metric” transformation functions
  • Reusing other MAMs for nonmetric search
References
• Ciaccia, P., Patella, M., and Zezula, P. M-tree: An Efficient Access Method for Similarity Search in Metric Spaces. VLDB 1997
• Zezula, P., Savino, P., Rabitti, F., Amato, G., and Ciaccia, P. Processing M-Tree with Parallel Resources. EDBT 1998
• Skopal, T., Pokorny, J., Kratky, M., and Snasel, V. Revisiting M-tree Building Principles. ADBIS 2003, LNCS 2798, Springer
• Skopal, T. Unified Framework for Fast Exact and Approximate Search in Dissimilarity Spaces. TODS 2007, ACM
Publications
• Lokoč, J. and Skopal, T. On Reinsertions in M-tree. SISAP 2008, IEEE
• Skopal, T. and Lokoč, J. NM-Tree: Flexible Approximate Similarity Search in Metric and Non-metric Spaces. DEXA 2008, LNCS 5181, Springer
• Skopal, T. and Lokoč, J. New Dynamic Construction Techniques for M-tree. Journal of Discrete Algorithms, Elsevier 2009
• Lokoč, J. Parallel Dynamic Batch Loading in the M-tree. SISAP 2009, IEEE
• Novák, J., Skopal, T., Hoksza, D., and Lokoč, J. Improving the Similarity Search of Tandem Mass Spectra using Metric Access Methods. SISAP 2010, ACM
• Lokoč, J. and Skopal, T. On Applications of Parameterized Hyperplane Partitioning. SISAP 2010, ACM
• Skopal, T. and Lokoč, J. Answering Metric Skyline Queries by PM-tree. DATESO 2010, CEUR
• Skopal, T., Lokoč, J., and Bustos, B. D-cache: Universal Distance Cache for Metric Access Methods. Major revision, Transactions on Knowledge and Data Engineering
Citations
• Lokoč, J. and Skopal, T. 2008. On Reinsertions in M-tree. In SISAP ’08: Proceedings of the First International Workshop on Similarity Search and Applications. IEEE Computer Society, Washington, DC, USA, 121–128.
• Roberto Uribe Paredes, Gonzalo Navarro. EGNAT: A Fully Dynamic Metric Access Method for Secondary Memory. In SISAP ’09: Proceedings of the Second International Workshop on Similarity Search and Applications, pp. 57–64, August 29–30, 2009, Prague, Czech Republic
• Marcos R. Vieira, Fabio J. T. Chino, Agma J. M. Traina, Caetano Traina Jr. Revisiting the DBM-Tree. Journal of Information and Data Management, Vol 1, No 1 (2010)
• Qiu C. et al. A Parallel Bulk Loading Algorithm for M-tree on Multi-core CPUs. International Joint Conference on Computational Sciences and Optimization, IEEE, 2010
Citations
• Skopal, T. and Lokoč, J. 2009. New Dynamic Construction Techniques for M-tree. Journal of Discrete Algorithms, Elsevier 7 (1): 62–77.
• Marcos R. Vieira, Fabio J. T. Chino, Agma J. M. Traina, Caetano Traina Jr. Revisiting the DBM-Tree. Journal of Information and Data Management, Vol 1, No 1 (2010)
• Qiu C. et al. A Parallel Bulk Loading Algorithm for M-tree on Multi-core CPUs. International Joint Conference on Computational Sciences and Optimization, IEEE, 2010
• Kaster D., Bueno R., Bugatti P., Traina A., Traina C. Jr. Incorporating Metric Access Methods for Similarity Searching on Oracle Database. SBBD 2009
Citations
• Lokoč, J. 2009. Parallel Dynamic Batch Loading in the M-tree. In SISAP ’09: Proceedings of the Second International Workshop on Similarity Search and Applications, pp. 117–123, August 29–30, 2009, Prague, Czech Republic
• Qiu C. et al. A Parallel Bulk Loading Algorithm for M-tree on Multi-core CPUs. International Joint Conference on Computational Sciences and Optimization, IEEE, 2010
Thank you for your attention
Answers (V. Dohnal)
• σmax is not defined
  • σmax is the maximal distance in the distance space
• Similarity join (SJ) is not a multi-example query type
  • I agree; SJ is rather a complex operator consisting of multiple single-example queries
• What other costs must be taken into account?
  • When the distance function is cheap (e.g., Lp metrics), we have to take into account the internal overhead of the particular MAM (e.g., pivot space filtering in pivot tables)
• Missing database size for figure 1.14
  • DbSize = 100,000
Answers (V. Dohnal)
• How are leaf node overflows during stack processing solved in conservative reinsertions?
  • We perform a regular split
• If hybrid-way (HW) leaf selection is unsuccessful, single-way (SW) leaf selection is used. Does SW leaf selection employ pre-computed distances from HW?
  • We do not reuse distances from HW leaf selection: it is usually successful, so we have kept the algorithm simple (which reduces internal CPU costs)
  • Moreover, it can be solved by the D-cache (see publications)
Answers (V. Dohnal)
• How is the number of dimensions (x axis) changed in figure 3.6?
  • We used a 76-dimensional vector concatenated from two features (12 + 64) and took prefixes of this vector
• What causes the fluctuations in query costs in figure 3.9?
  • Reinserting behavior is chaotic with respect to the increasing number of removed objects
• A radius change can be propagated to the upper levels of the M-tree; how is this process synchronized?
  • The radius is not propagated to upper levels (to improve parallel performance), but it is a topic of our future work
Answers (V. Dohnal)
• What algorithms were used during the first two steps of the parallel batch loading iteration?
  • In the first step, we just used a simple list to aggregate new objects. In the second step, each thread used SW leaf selection with exclusive locks for radius updates.
• What is the motivation for the random heuristic?
  • The random heuristic can be faster when the distance measure is cheaper. Moreover, we wanted to test whether randomly selected objects cause more splits.
• DB size is 1,000,000 and batch size is 200; why is the number of iterations > 5,000?
  • Exactly 1,000,000 / 200 = 5,000 iterations would suffice only if every batch were inserted completely; not all objects from a batch are inserted during one iteration, so more iterations are needed.
  • ICS and INCS stand for the number of real insertions (ICS = number of leaf node splits)
• What is residue time?
  • Residue aggregates real-time overhead and I/O cost.
All other comments will be addressed in the online version; thank you for them.