Pivoting M-tree: A Metric Access Method for Efficient Similarity Search

Pivoting M-tree: A Metric Access Method for Efficient Similarity Search. Tomáš Skopal (tomas.skopal@vsb.cz), Department of Computer Science, VŠB-Technical University of Ostrava

Presentation Outline • Similarity search in Metric Spaces • M-tree • PM-tree (structure, range queries, hyper-ring storage) • Experimental Results DATESO 2004

Similarity search in Metric Spaces • Similarity search – methods for content-based retrieval in multimedia databases (and in Information Retrieval) • Similarity is modelled by a metric d, i.e. a distance function satisfying reflexivity, positivity, symmetry, and the triangular inequality • Restricting similarity to a metric is at odds with several similarity theories – nevertheless, the triangular inequality is the basic tool for constructing metric regions, which leads to an efficient similarity search • Metric queries: range query (specified by query object Q and covering radius rQ); k-NN query (specified by query object Q and number of nearest neighbours k)
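The two query types can be illustrated with a plain sequential scan, which is the baseline that metric access methods try to beat. This is a minimal sketch: the function names `range_query` and `knn_query`, the sample points, and the choice of the Euclidean (L2) distance as the metric d are assumptions made for illustration, not part of the presentation.

```python
import heapq
import math

def l2(a, b):
    """Euclidean (L2) distance; any function satisfying the metric axioms would do."""
    return math.dist(a, b)

def range_query(dataset, q, r_q, d=l2):
    """Return all objects within distance r_q of the query object q."""
    return [o for o in dataset if d(q, o) <= r_q]

def knn_query(dataset, q, k, d=l2):
    """Return the k objects closest to q, nearest first."""
    return heapq.nsmallest(k, dataset, key=lambda o: d(q, o))

# Illustrative data only.
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
in_range = range_query(points, (0.0, 0.0), 1.0)
nearest = knn_query(points, (0.0, 0.0), 3)
```

Both functions compute one distance per stored object; an index is useful exactly when it answers the same queries with fewer distance computations.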

Metric Access Methods • Designed to search metric datasets while keeping the search costs (number of distance computations) minimal. • When searching large multimedia databases, the I/O search costs have to be minimized as well. • Many MAMs have been developed so far: M-tree, GH-tree, GNAT, LAESA, D-index, VP-tree, MVP-tree, SAT, ... • The majority of MAMs are not suitable for similarity search in large datasets (they are either static methods or incur high I/O search costs) • Only the M-tree and (recently) the D-index are suitable candidates

M-tree • dynamic, balanced, and paged metric tree (like e.g. the B+-tree or R-tree) • the leaves are clusters of objects • routing entries in the inner nodes represent metric regions, recursively bounding the object clusters in the leaves • during query evaluation, the triangular inequality allows irrelevant M-tree branches (i.e. their metric regions) to be discarded • [Figure: range query (Euclidean 2D space)]
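The discarding step follows from the triangular inequality: every object X inside a region with routing object O and covering radius r satisfies d(Q, X) ≥ d(Q, O) − r, so the subtree cannot contain an answer when d(Q, O) > r + rQ. A minimal sketch of that test follows; the function name `sphere_may_contain_answers` and the sample values are hypothetical.

```python
import math

def sphere_may_contain_answers(d_q_o, r, r_q):
    """Test whether the ball (O, r) intersects the query ball (Q, r_q).

    d_q_o is the precomputed distance d(Q, O). If the test fails, the
    triangular inequality guarantees no object in the subtree is within
    r_q of Q, so the whole branch can be skipped.
    """
    return d_q_o <= r + r_q

# Illustrative values: region centred at (10, 0) with covering radius 2.
q, o, r = (0.0, 0.0), (10.0, 0.0), 2.0
d_q_o = math.dist(q, o)
visit_small_query = sphere_may_contain_answers(d_q_o, r, r_q=5.0)  # False
visit_large_query = sphere_may_contain_answers(d_q_o, r, r_q=8.0)  # True
```

Only one distance, d(Q, O), is computed per routing entry, regardless of how many objects the subtree holds.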

PM-tree, motivation • metric regions in the M-tree are unnecessarily large: they index large portions of empty space (the “dead” space), which raises the probability of intersection with the query region and makes the search less efficient • reducing the “volume” of a metric region should lead to more effective discarding of irrelevant subtrees • the way to achieve this is to specify a metric region that bounds all the objects more “tightly”

PM-tree, structure • Pivoting M-tree (PM-tree): a combination of the M-tree with pivot-based methods (LAESA-like) • given a fixed set of p pivots Pi (selected from the dataset), a PM-tree region is additionally defined by p hyper-ring regions (Pi, HR[i]) • each routing entry contains an array HR of p intervals <HR[i].min, HR[i].max> • each interval HR[i] bounds the distances of the objects to the respective pivot Pi • the intersection of the hyper-sphere and the hyper-rings forms a smaller region bounding all the objects • the more pivots, the more tightly bounded the region
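As an illustration of how the HR array relates to the objects it bounds, the sketch below computes the interval of distances to each pivot for a small set of objects. The function name `hyper_rings`, the tuple representation of an interval, and the sample data are assumptions; the presentation does not show how the intervals are maintained during tree construction.

```python
import math

def hyper_rings(objects, pivots, d=math.dist):
    """For each pivot P_i return (min, max) of d(o, P_i) over all objects.

    The i-th pair plays the role of <HR[i].min, HR[i].max>.
    """
    rings = []
    for p in pivots:
        dists = [d(o, p) for o in objects]
        rings.append((min(dists), max(dists)))
    return rings

# Illustrative data: two objects and two pivots in the plane.
cluster = [(3.0, 4.0), (6.0, 8.0)]
pivots = [(0.0, 0.0), (6.0, 0.0)]
hr = hyper_rings(cluster, pivots)
```

Every object of the cluster lies inside each ring by construction, so intersecting the rings with the covering hyper-sphere can only shrink the region, never lose an object.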

PM-tree, query processing • prior to processing a query (Q, rQ), the distances d(Q, Pi) for all i ≤ p must be computed • a metric region is relevant to a range query if and only if the hyper-sphere and all the hyper-rings intersect the range query region • the more hyper-rings, the lower the probability of intersection with the query • no additional distance computations are needed for the intersection test • [Figure: the same query against an M-tree region and a PM-tree region]
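The relevance test can be sketched as below. A hyper-ring (Pi, HR[i]) intersects the query ball when d(Q, Pi) − rQ ≤ HR[i].max and d(Q, Pi) + rQ ≥ HR[i].min. The function name `pm_region_relevant`, its argument layout, and the sample values are hypothetical; the distances d(Q, Pi) are passed in precomputed, matching the first bullet of the slide.

```python
def pm_region_relevant(d_q_o, r, rings, d_q_p, r_q):
    """Decide whether a PM-tree region may contain answers to (Q, r_q).

    d_q_o : distance from Q to the routing object O
    r     : covering radius of the hyper-sphere
    rings : list of (min, max) intervals, one per pivot
    d_q_p : list of precomputed distances d(Q, P_i), same order as rings
    """
    # Hyper-sphere test, the same as in the plain M-tree.
    if d_q_o > r + r_q:
        return False
    # Hyper-ring tests: pure arithmetic, no further distance computations.
    for (lo, hi), dqp in zip(rings, d_q_p):
        if dqp - r_q > hi or dqp + r_q < lo:
            return False
    return True

# Illustrative values: the sphere test alone would keep this region,
# but the single hyper-ring excludes it.
kept_by_sphere_only = pm_region_relevant(3.0, 2.0, [], [], 1.5)
kept_with_ring = pm_region_relevant(3.0, 2.0, [(5.0, 6.0)], [2.0], 1.5)
```

With an empty `rings` list the function reduces to the M-tree test, which shows why adding pivots can only discard more regions.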

PM-tree, hyper-ring storage • The routing entries of PM-tree nodes are enlarged by the additional pivot-based information stored in the HR arrays • To keep the space overhead minimal, a compact storage of the HR[i] intervals is necessary • A distance histogram is created for each pivot Pi, and an interval <d_i^min, d_i^max> is chosen such that e.g. 90% of the distances in the histogram fall into that interval • Each value HR[i].min and HR[i].max is scaled to the <d_i^min, d_i^max> interval and stored in a single byte, i.e. each hyper-ring HR[i] takes 2 bytes • [Figure: storage of the HR array in a routing entry: Oi, r, ptr(T), ..., HR[1], HR[2], ..., HR[p]]
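One possible way to pack a hyper-ring into 2 bytes is sketched below. The rounding direction (lower bound rounded down, upper bound rounded up, so that the decoded ring never becomes smaller than the original) and the clamping of out-of-range values are assumptions of this sketch; the slide does not say how values outside the chosen interval are handled. The names `encode_ring`, `decode_ring`, and `LEVELS` are hypothetical.

```python
import math

LEVELS = 255  # number of steps representable in one unsigned byte

def _clamp(code):
    """Keep a code inside the byte range 0..255."""
    return max(0, min(LEVELS, code))

def encode_ring(lo, hi, d_min, d_max):
    """Scale <lo, hi> to the interval <d_min, d_max> and pack it into 2 bytes.

    Assumption: lo and hi lie within <d_min, d_max>. Values outside are
    simply clamped here, which is NOT conservative for outliers.
    """
    scale = LEVELS / (d_max - d_min)
    lo_code = math.floor((lo - d_min) * scale)  # round down: ring stays wide enough
    hi_code = math.ceil((hi - d_min) * scale)   # round up: ring stays wide enough
    return bytes([_clamp(lo_code), _clamp(hi_code)])

def decode_ring(code, d_min, d_max):
    """Recover the approximate <lo, hi> interval from its 2-byte code."""
    step = (d_max - d_min) / LEVELS
    return d_min + code[0] * step, d_min + code[1] * step

# Illustrative values: ring <2.0, 3.0> scaled to the interval <0.0, 10.0>.
packed = encode_ring(2.0, 3.0, 0.0, 10.0)
lo_approx, hi_approx = decode_ring(packed, 0.0, 10.0)
```

The price of the compact form is a slightly wider ring after decoding, which can only cause extra region visits, not missed results, as long as the original bounds lie inside the scaling interval.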

Experimental results (synthetic) • synthetic dataset of 100,000 30-dimensional tuples distributed within 1000 clusters, L2 distance, query selectivity 50 objects

Experimental results (images) • collection of 10,000 images represented by 256-dimensional vectors (gray histograms), L2 distance, query selectivity 50 objects

Recent results (not included in proceedings) • Cost models for range queries in PM-tree (ADBIS'04) • Experiments on image dataset (ADBIS'04) • Optimal k-NN query algorithm for PM-tree + cost models (to be published...)

References • [1] Skopal T., Pokorný J., Snášel V.: PM-tree: Pivoting Metric Tree for Similarity Search in Multimedia Databases, submitted to ADBIS 2004, Budapest, Hungary • [2] Skopal T.: Pivoting M-tree: A Metric Access Method for Efficient Similarity Search, DATESO 2004, Desná • [3] Skopal T., Pokorný J., Krátký M., Snášel V.: Revisiting M-tree Building Principles, ADBIS 2003, LNCS 2798, Springer-Verlag, Dresden, Germany
