Tree-based indexing methods for similarity search in metric and nonmetric spaces

Department of Software Engineering, Faculty of Mathematics and Physics, Charles University in Prague
Mgr. Jakub Lokoč
Supervisor: Doc. RNDr. Tomáš Skopal, Ph.D.
MFF UK, Prague

Presentation outline
• Introduction
  • Similarity search
  • M-tree
• Contributions & Results
  • Metric search
  • Nonmetric search
• Outlook

Similarity search
• How to search in large collections of unstructured data?
  • We cannot use relational databases or textual annotation
  • Content-based similarity searching
• Similarity → distance function δ → metric vs. nonmetric search
• Feature extraction → feature space
• Problems of similarity searching
  • Effectiveness → selection of complex descriptors and an (often expensive) distance function (not a DB problem)
  • Efficiency → indexing → exact vs. approximate search
[Figure: query object, feature extraction, similarity evaluation]

Similarity search - variants of δ
• δ is metric
  • Allows indexing by metric access methods (e.g., M-tree)
  • Objects can be organized into separate clusters
• δ is nonmetric
  • Robust similarity functions suitable for domain experts
  • Not constrained by metric axioms, but metric access methods then support only approximate search
• In our work, we have focused on FAST similarity search in metric and nonmetric spaces by M-tree

M-tree
• Structure and properties
  • Dynamic, balanced, and paged tree structure (like, e.g., B+-tree, R-tree)
  • The leaves are clusters of indexed objects Oj (ground objects)
  • Routing entries in the inner nodes represent hyper-spherical metric regions (Oi, rOi), recursively bounding the object clusters in leaves
  • The triangle inequality allows irrelevant M-tree branches (i.e., metric regions) to be discarded during query evaluation
[Figure: range query Q (Euclidean 2D space)]
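
The pruning rule from the last bullet can be sketched as follows. This is a minimal illustration, assuming a simplified in-memory node layout; the names Node and range_query are hypothetical and not taken from the thesis. A region (Oi, rOi) is skipped for a range query (Q, rQ) whenever δ(Q, Oi) > rQ + rOi, because the triangle inequality then guarantees that no object inside the region lies within rQ of Q.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    center: tuple                                   # routing object O_i
    radius: float                                   # covering radius r_Oi
    children: list = field(default_factory=list)    # child nodes (inner node)
    objects: list = field(default_factory=list)     # ground objects (leaf)


def range_query(node, q, r_q, dist=math.dist, out=None):
    """Collect all ground objects within distance r_q of q."""
    if out is None:
        out = []
    # Triangle-inequality pruning: the query ball and the region do not overlap.
    if dist(q, node.center) > r_q + node.radius:
        return out
    for o in node.objects:
        if dist(q, o) <= r_q:
            out.append(o)
    for child in node.children:
        range_query(child, q, r_q, dist, out)
    return out


# Toy two-level tree in Euclidean 2D space (illustrative data only).
leaf_a = Node(center=(0.0, 0.0), radius=1.5, objects=[(0.0, 0.0), (1.0, 1.0)])
leaf_b = Node(center=(10.0, 10.0), radius=1.0, objects=[(10.0, 10.0), (10.5, 10.5)])
root = Node(center=(5.0, 5.0), radius=9.0, children=[leaf_a, leaf_b])

result = range_query(root, (0.5, 0.5), 1.0)   # leaf_b is pruned without touching its objects
```

In this toy tree the second leaf is discarded after a single distance computation to its routing object, which is the saving the M-tree relies on when δ is expensive.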

Contributions to M-tree
• New construction techniques
  • Forced reinserting
  • Hybridway leaf selection
  • Parallel dynamic batch loading
• Nonmetric search
  • M-tree variant: NM-tree

Forced reinserting
• Insert new object O11
• Remove O8, O6 and insert them into the stack
• Decrease region’s radius (to O11)
• Insert O6 from the stack
• Remove O2 and insert it into the stack
• Decrease region’s radius (to O6)
• Insert O2 from the stack
• Insert O8 from the stack
[Figure: M-tree leaves holding objects O1 to O11 and the reinsertion STACK]
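
The eviction step of the walkthrough above can be sketched in isolation. This is only an illustration under stated assumptions: the slide does not say how the removed objects are chosen, so the sketch assumes the k objects farthest from the routing object are evicted, and the function name evict_farthest is hypothetical. Reinserting the stacked objects through the regular insert routine is not shown.

```python
import math


def evict_farthest(leaf_objects, center, k, dist=math.dist):
    """Remove the k objects farthest from the routing object (assumed policy).

    Returns (kept, stack, new_radius): the objects staying in the leaf,
    the evicted objects waiting for reinsertion, and the shrunken covering
    radius of the leaf region.
    """
    ranked = sorted(leaf_objects, key=lambda o: dist(center, o))
    split = max(len(ranked) - k, 0)
    kept, stack = ranked[:split], ranked[split:]
    new_radius = max((dist(center, o) for o in kept), default=0.0)
    return kept, stack, new_radius


# Illustrative data only: a leaf with four 2D objects around its routing object.
center = (0.0, 0.0)
leaf = [(1.0, 0.0), (0.0, 2.0), (3.0, 0.0), (0.0, 4.0)]
kept, stack, radius = evict_farthest(leaf, center, k=2)
```

Under this assumption the covering radius drops from 4.0 to 2.0, which corresponds to the "decrease region’s radius" steps on the slide.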

Hybridway leaf selection
• First phase of inserting = find a suitable leaf for the new object
• Classic selection strategies
  • Singleway: fast indexing, less compact hierarchy
  • Multiway: vice versa
• Our approach
  • User controls how many branches are visited
  • Finds suboptimal leaf node
  • May return full leaf node
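
The trade-off between the strategies can be illustrated with a small sketch. It is one possible reading of "user controls how many branches are visited", not the algorithm from the thesis: it assumes the budget is a per-node limit max_branches on how many of the closest children are followed, and the names Node and select_leaf are hypothetical. With max_branches = 1 it behaves like singleway selection; with a large value it visits every branch.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    center: tuple                                   # routing object
    radius: float                                   # covering radius
    children: list = field(default_factory=list)    # empty for leaves

    @property
    def is_leaf(self):
        return not self.children


def select_leaf(root, obj, max_branches, dist=math.dist):
    """Return (leaf, visited): the visited leaf whose routing object is
    closest to obj, and the number of nodes visited to find it."""
    best_leaf, best_d = None, math.inf
    frontier, visited = [root], 0
    while frontier:
        node = frontier.pop()
        visited += 1
        if node.is_leaf:
            d = dist(obj, node.center)
            if d < best_d:
                best_leaf, best_d = node, d
            continue
        # Follow only the max_branches children closest to the new object.
        ranked = sorted(node.children, key=lambda c: dist(obj, c.center))
        frontier.extend(ranked[:max_branches])
    return best_leaf, visited


# Illustrative three-level tree on a line embedded in 2D.
a = Node((2.0, 0.0), 2.5, [Node((0.0, 0.0), 0.5), Node((2.0, 0.0), 0.5)])
b = Node((7.0, 0.0), 2.5, [Node((5.0, 0.0), 0.5), Node((9.0, 0.0), 0.5)])
root = Node((5.0, 0.0), 5.0, [a, b])

narrow_leaf, narrow_cost = select_leaf(root, (4.0, 0.0), max_branches=1)
wide_leaf, wide_cost = select_leaf(root, (4.0, 0.0), max_branches=2)
```

In this toy tree the narrow search visits 3 nodes and settles on the leaf at (2.0, 0.0), while the wide search visits all 7 nodes and finds the closer leaf at (5.0, 0.0). Neither variant checks whether the chosen leaf is full.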

Experimental results
CoPhIR (color layout and structure), dim 76, dbSize 250,000

Parallel dynamic batch loading
1. Aggregation
2. Parallel batch loading
3. Traditional inserting
Not inserted objects
• “Split generating”: will be inserted in the traditional way (exploiting limited parallelism)
• Postponed: will be inserted during the next batch
• To find scalability bottlenecks we measured
  • Parallel batch loading time (PI)
  • Time of traditional inserts causing a split (ICS)
  • Time of traditional inserts not causing a split (INCS)

Experimental results
• CoPhIR, DB size 1,000,000
• Dimension 76 (12 + 64)
• L5.123456 distance
• 24 / 25 inner/leaf node size
• 512 MB cache size

Nonmetric search
• Metric properties: too restrictive
• Triangle inequality is the most attacked one
• Semimetric distances (e.g., in molecular biology)
• But, how to search efficiently?
Metric properties: identity, non-negativity, symmetry, triangle inequality
[Figure: two 2NN query examples; images not preserved]
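
A small worked example of a semimetric, offered as an illustration rather than as material from the slides: the squared Euclidean distance satisfies identity, non-negativity, and symmetry, but violates the triangle inequality. The helper names squared_l2 and triangle_violations are hypothetical.

```python
import math
from itertools import permutations


def squared_l2(a, b):
    """Squared Euclidean distance: a semimetric, not a metric."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triangle_violations(points, dist):
    """Return all ordered triples (x, y, z) with dist(x, z) > dist(x, y) + dist(y, z)."""
    return [(x, y, z) for x, y, z in permutations(points, 3)
            if dist(x, z) > dist(x, y) + dist(y, z)]


# Three collinear points: going through the middle point is "cheaper" than
# going directly when distances are squared (1 + 1 < 4).
points = [(0.0,), (1.0,), (2.0,)]
violations = triangle_violations(points, squared_l2)
metric_violations = triangle_violations(points, math.dist)
```

The plain Euclidean distance produces no violating triples on the same points, while the squared variant produces two (one per direction through the middle point).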

Nonmetric search
• Related work
  • MAMs can employ a semimetric dS for approximate search
  • Semimetric behavior can be tuned by transformation functions f (e.g., we can turn a semimetric into a metric dM = fM(dS))
  • More metric behavior: more precise, but slower search
  • Less metric behavior: less precise, but faster search
• However, M-tree is fixed to the employed (semi)metric (black-box distance)

NM-tree
• The trick
  • We use inversely symmetric transformation functions: dS = f^-1(f(dS))
  • fei and fM are evaluated in the initial phase
  • We index data using dM = fM(dS) (to allow exact searching)
  • Stored distances dM can be transformed back to dS = fM^-1(dM)
• Retrieval precision ei at query time
  • dei = fei(fM^-1(fM(dS))) or just dei = fei(dS)
  • Metric search in upper levels (by dM)
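
The round trip dS → dM → dS → dei can be traced on concrete numbers. The sketch below is an illustration under assumptions that are not in the slides: the semimetric dS is taken to be the squared Euclidean distance, and the transformation functions are simple power functions f(x) = x^w, with w = 0.5 playing the role of fM and w = 0.75 the role of one fei. The helper power_modifier is hypothetical.

```python
import math


def power_modifier(w):
    """Return (f, f_inverse) for the assumed modifier f(x) = x**w on x >= 0."""
    return (lambda x: x ** w), (lambda x: x ** (1.0 / w))


def squared_l2(a, b):
    """Assumed semimetric d_S: squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


f_M, f_M_inv = power_modifier(0.5)    # turns squared L2 into the metric L2
f_e, _ = power_modifier(0.75)         # weaker modifier chosen at query time

a, b = (0.0, 0.0), (3.0, 4.0)
d_S = squared_l2(a, b)                # original semimetric distance
d_M = f_M(d_S)                        # metric distance stored in the index
d_S_back = f_M_inv(d_M)               # recovered from the stored value
d_e = f_e(d_S_back)                   # distance used for the approximate search
```

Because fM is invertible, the index only needs to store dM; any query-time dei can be derived from it without recomputing the original distance between the objects.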

Experimental results

Outlook
• Metric search
  • Combination of more sophisticated M-tree construction techniques and parallelism
  • Adapting the techniques to M-tree descendants
  • Employ as a dynamic clustering technique
• Nonmetric search
  • Finding better "nonmetric to metric" transformation functions
  • Reuse other MAMs for nonmetric search

References
• Ciaccia, P., Patella, M., and Zezula, P. M-tree: An Efficient Access Method for Similarity Search in Metric Spaces. VLDB 1997
• Zezula, P., Savino, P., Rabitti, F., Amato, G., and Ciaccia, P. Processing M-Tree with Parallel Resources. EDBT 1998
• Skopal, T., Pokorny, J., Kratky, M., and Snasel, V. Revisiting M-tree Building Principles. ADBIS 2003, LNCS 2798, Springer
• Skopal, T. Unified Framework for Fast Exact and Approximate Search in Dissimilarity Spaces. TODS 2007, ACM

Publications
• Lokoč, J. and Skopal, T. On Reinsertions in M-tree. SISAP 2008, IEEE
• Skopal, T. and Lokoč, J. NM-Tree: Flexible Approximate Similarity Search in Metric and Non-metric Spaces. DEXA 2008, LNCS 5181, Springer
• Skopal, T. and Lokoč, J. New Dynamic Construction Techniques for M-tree. Journal of Discrete Algorithms, Elsevier 2009
• Lokoč, J. Parallel Dynamic Batch Loading in the M-tree. SISAP 2009, IEEE
• Novák, J., Skopal, T., Hoksza, D., and Lokoč, J. Improving the Similarity Search of Tandem Mass Spectra using Metric Access Methods. SISAP 2010, ACM
• Lokoč, J. and Skopal, T. On Applications of Parameterized Hyperplane Partitioning. SISAP 2010, ACM
• Skopal, T. and Lokoč, J. Answering Metric Skyline Queries by PM-tree. DATESO 2010, CEUR
• Skopal, T., Lokoč, J., and Bustos, B. D-cache: Universal Distance Cache for Metric Access Methods. Major revision, Transactions on Knowledge and Data Engineering

Citations
• Lokoč, J. and Skopal, T. 2008. On Reinsertions in M-tree. In SISAP ’08: Proceedings of the First International Workshop on Similarity Search and Applications. IEEE Computer Society, Washington, DC, USA, 121–128.
  • Roberto Uribe Paredes, Gonzalo Navarro. EGNAT: A Fully Dynamic Metric Access Method for Secondary Memory. In SISAP ’09: Proceedings of the Second International Workshop on Similarity Search and Applications, pp. 57-64, August 29-30, 2009, Prague, Czech Republic
  • Marcos R. Vieira, Fabio J. T. Chino, Agma J. M. Traina, Caetano Traina Jr. Revisiting the DBM-Tree. Journal of Information and Data Management, Vol 1, No 1 (2010)
  • Qiu C. et al. A Parallel Bulk Loading Algorithm for M-tree on Multi-core CPUs, International Joint Conference on Computational Sciences and Optimization, IEEE, 2010

Citations
• Skopal, T. and Lokoč, J. 2009. New Dynamic Construction Techniques for M-tree. Journal of Discrete Algorithms, Elsevier 7 (1): 62–77.
  • Marcos R. Vieira, Fabio J. T. Chino, Agma J. M. Traina, Caetano Traina Jr. Revisiting the DBM-Tree. Journal of Information and Data Management, Vol 1, No 1 (2010)
  • Qiu C. et al. A Parallel Bulk Loading Algorithm for M-tree on Multi-core CPUs, International Joint Conference on Computational Sciences and Optimization, IEEE, 2010
  • Kaster D., Bueno R., Bugatti P., Traina A., Traina C. Jr., Incorporating Metric Access Methods for Similarity Searching on Oracle Database, SBBD 2009

Citations
• Lokoč, J. 2009. Parallel Dynamic Batch Loading in the M-tree. In SISAP ’09: Proceedings of the Second International Workshop on Similarity Search and Applications, pp. 117-123, August 29-30, 2009, Prague, Czech Republic
  • Qiu C. et al. A Parallel Bulk Loading Algorithm for M-tree on Multi-core CPUs, International Joint Conference on Computational Sciences and Optimization, IEEE, 2010

Thank you for your attention

Answers (V. Dohnal)
• σmax is not defined
  • σmax is the maximal distance in the distance space
• Similarity join (SJ) is not a multi-example query type
  • I agree: SJ is rather a complex operator consisting of multiple single-example queries
• What other costs must be taken into account?
  • When the distance function is cheap (e.g., Lp metrics), we have to take into account the internal overhead of the particular MAM (e.g., pivot space filtering in pivot tables)
• Missing database size for figure 1.14
  • DbSize = 100,000

Answers (V. Dohnal)
• How to solve leaf node overflows during stack processing in conservative reinsertions?
  • We perform a regular split
• If hybridway (HW) leaf selection is unsuccessful, singleway (SW) leaf selection is used. Does SW leaf selection employ pre-computed distances from HW?
  • We do not use distances from HW leaf selection, since HW leaf selection is usually successful; hence we have kept the algorithm simple (which reduces internal CPU costs)
  • Moreover, it can be solved by the D-cache (see publications)

Answers (V. Dohnal)
• How is the number of dimensions (x axis) changed in figure 3.6?
  • We have used a 76-dimensional concatenated vector of two features (12 + 64) and took prefixes of this vector
• What causes the fluctuations of query costs in figure 3.9?
  • Reinserting behavior is chaotic with respect to the increasing number of removed objects
• A radius change can be propagated to the upper levels of the M-tree; how is this process synchronized?
  • The radius is not propagated to upper levels (to improve parallel performance), but it is a topic of our future work

Answers (V. Dohnal)
• What algorithms have been used during the first two steps of the parallel batch loading iteration?
  • In the first step, we have just used a simple list for aggregation of new objects. In the second step, each thread used SW leaf selection with exclusive locks for radius updates.
• What is the motivation for the random heuristic?
  • The random heuristic can be faster when the distance measure is cheaper. Moreover, we wanted to test whether randomly selected objects cause more splits.
• DB size is 1,000,000 and batch size is 200; why is the number of iterations > 5000?
  • It is caused by the fact that not all objects from the batch are inserted during one iteration.
• ICS and INCS stand for the number of real insertions (ICS = number of leaf node splits)
• What is residue time?
  • Residue aggregates real-time overhead and I/O cost.
All other comments will be addressed in the online version, and I thank the reviewer for them
