Pivoting M-tree: A Metric Access Method for Efficient Similarity Search Tomáš Skopal, tomas.skopal@vsb.cz, Department of Computer Science, VŠB-Technical University of Ostrava
Presentation Outline • Similarity search in Metric Spaces • M-tree • PM-tree • structure • range queries • hyper-ring storage • Experimental Results DATESO 2004
Similarity search in Metric Spaces • Similarity search: methods for content-based retrieval in multimedia databases (and in Information Retrieval, respectively) • Similarity is modelled by a metric d (reflexivity, positivity, symmetry, triangular inequality) • Restriction to a metric yields a paradigmatic discrepancy with several similarity theories; nevertheless, the triangular inequality is the basic tool for constructing metric regions, which leads to an efficient similarity search • Metric queries • range query (specified by query object Q and query radius rQ) • k-NN query (specified by query object Q and number of nearest neighbours k)
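The two metric queries can be sketched as a sequential scan, which is the baseline that any metric access method tries to beat. This is a minimal illustration only: the function names, the sample points, and the choice of the L2 distance as the metric d are assumptions made for the example.

```python
import heapq
import math

def l2(a, b):
    """Euclidean (L2) distance, used here as one example of a metric d."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def range_query(dataset, q, r_q, d=l2):
    """Return every object O with d(Q, O) <= rQ (one distance per object)."""
    return [o for o in dataset if d(q, o) <= r_q]

def knn_query(dataset, q, k, d=l2):
    """Return the k objects nearest to Q, closest first."""
    return heapq.nsmallest(k, dataset, key=lambda o: d(q, o))

# Hypothetical 2D sample data.
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
in_range = range_query(points, (0.0, 0.0), 5.0)
nearest = knn_query(points, (0.0, 0.0), 2)
```

Both functions compute one distance per stored object, so their cost grows linearly with the dataset size.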
Metric Access Methods • Designed to search metric datasets while keeping the search costs (the number of distance computations) minimal. • When searching large multimedia databases, the I/O search costs have to be minimized as well. • Many MAMs have been developed so far: M-tree, GH-tree, GNAT, LAESA, D-index, VP-tree, MVP-tree, SAT, ... • Most MAMs are not suitable for similarity search in large datasets (they are either static methods or have high I/O search costs) • only M-tree and (recently) D-index are suitable candidates
M-tree • dynamic, balanced, and paged metric tree (like e.g. B+-tree, R-tree) • the leaves are clusters of objects • routing entries in the inner nodes represent metric regions, recursively bounding the object clusters in leaves • during query evaluation, the triangular inequality allows discarding of irrelevant M-tree branches (metric regions, respectively) [Figure: range query (Euclidean 2D space)]
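The discarding step can be sketched as follows: a hyper-spherical region (O, r) cannot contain a query result when d(Q, O) > r + rQ, so its whole subtree is skipped. The `Node` class, the hand-built two-leaf tree, and the `stats` counter are hypothetical and serve only the example. The sketch also omits the stored parent distances that the M-tree uses to save further distance computations.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    center: tuple                                   # routing object O
    radius: float                                   # covering radius r
    children: list = field(default_factory=list)    # sub-regions (inner node)
    objects: list = field(default_factory=list)     # data objects (leaf)

def range_search(node, q, r_q, stats):
    """Collect objects within r_q of q, pruning regions via the triangular inequality."""
    stats["dist"] += 1
    if math.dist(node.center, q) > node.radius + r_q:
        return []                                   # region cannot intersect the query
    hits = []
    for o in node.objects:
        stats["dist"] += 1
        if math.dist(o, q) <= r_q:
            hits.append(o)
    for child in node.children:
        hits.extend(range_search(child, q, r_q, stats))
    return hits

# Hypothetical tree: two leaf clusters under one root region.
leaf_a = Node((0.0, 0.0), 1.5, objects=[(0.0, 0.0), (1.0, 1.0)])
leaf_b = Node((10.0, 10.0), 1.5, objects=[(10.0, 10.0), (11.0, 10.0)])
root = Node((5.0, 5.0), 9.0, children=[leaf_a, leaf_b])

stats = {"dist": 0}
hits = range_search(root, (0.5, 0.5), 1.0, stats)
```

In this example the region of `leaf_b` is rejected after a single distance computation, so none of its objects is compared with the query.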
PM-tree, motivation • metric regions in M-tree are unnecessarily large → indexing of large portions of empty space (the “dead” space) → higher probability of intersection with the query region → less efficient search • reducing the “volume” of a metric region should lead to more effective discarding of irrelevant subtrees • the way to achieve this is to specify a metric region that bounds all the objects more “tightly”
PM-tree, structure • Pivoting M-tree (PM-tree): a combination of M-tree with the pivot-based methods (LAESA-like) • given a fixed set of p pivots Pi (selected from the dataset), a PM-tree region is additionally defined by p hyper-ring regions (Pi, HR[i]) • each routing entry contains an array HR of p intervals <HR[i].min, HR[i].max> • each interval HR[i] bounds the distances of objects to the respective pivot Pi • the intersection of the hyper-sphere and the hyper-rings forms a smaller region bounding all the objects • the more pivots, the more tightly bounded the region
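A minimal sketch of how the HR array of one routing entry could be derived, assuming the objects of the subtree are available at once. An actual PM-tree maintains the intervals incrementally during insertions and node splits; the function name and the sample data are hypothetical.

```python
import math

def build_hr(objects, pivots, d=math.dist):
    """Return HR as a list of (min, max) intervals, one per pivot P_i."""
    hr = []
    for p in pivots:
        dists = [d(o, p) for o in objects]
        hr.append((min(dists), max(dists)))   # HR[i] bounds d(O, P_i) for all objects O
    return hr

# Hypothetical subtree content and pivot set (p = 2).
objects = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
pivots = [(0.0, 0.0), (3.0, 4.0)]
hr = build_hr(objects, pivots)
```

Each interval describes a ring centred on its pivot, and the region of the routing entry is the part of the hyper-sphere that lies inside all p rings.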
PM-tree, query processing • prior to processing a query (Q, rQ), the distances d(Q, Pi) for all i ≤ p must be computed • a metric region is relevant to a range query only if all the hyper-rings and the hyper-sphere intersect the range query region • the more hyper-rings, the lower the probability of intersection with the query • no additional distance computations are needed for the intersection test [Figure: the same query against an M-tree region and a PM-tree region]
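The relevance test can be sketched as below. The query ball intersects hyper-ring (Pi, HR[i]) exactly when d(Q, Pi) - rQ <= HR[i].max and d(Q, Pi) + rQ >= HR[i].min. The function name, the argument layout, and the sample values are assumptions for this example.

```python
import math

def region_relevant(d_q_center, radius, q_pivot_dists, hr, r_q):
    """True if the query ball intersects the hyper-sphere and every hyper-ring."""
    if d_q_center > radius + r_q:              # plain M-tree hyper-sphere test
        return False
    return all(
        d_qp - r_q <= hi and d_qp + r_q >= lo  # ring test, reuses d(Q, P_i)
        for d_qp, (lo, hi) in zip(q_pivot_dists, hr)
    )

# Hypothetical region: objects on the segment (1,0)..(3,0), centre (2,0), radius 1.
center, radius = (2.0, 0.0), 1.0
pivots = [(0.0, 0.0), (2.0, 5.0)]
hr = [(1.0, 3.0), (5.0, math.dist((1.0, 0.0), (2.0, 5.0)))]

q, r_q = (2.0, 1.2), 0.3
q_pivot_dists = [math.dist(q, p) for p in pivots]   # computed once per query

sphere_only = math.dist(q, center) <= radius + r_q
with_rings = region_relevant(math.dist(q, center), radius, q_pivot_dists, hr, r_q)
```

Here the hyper-sphere test alone would accept the region, while the ring of the second pivot rejects it. The ring checks only compare numbers that are already available.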
PM-tree, hyper-ring storage • The routing entries of PM-tree nodes are enlarged by the additional pivot-based information stored in HR arrays [Figure: storage of the HR array; routing entry layout: Oi, r, ptr(T), ..., HR[1], HR[2], ..., HR[p]] • To keep the space overhead minimal, a compact storage of the HR[i] intervals is necessary • A distance histogram is created for each pivot Pi, and an interval <d_i^min, d_i^max> is chosen such that e.g. 90% of the distances in the histogram fall into that interval • Each of the values HR[i].min and HR[i].max is scaled to the <d_i^min, d_i^max> interval and stored in a single byte, i.e. each hyper-ring HR[i] takes 2 bytes
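The one-byte scaling can be sketched as follows. Two details are assumptions of this sketch and are not stated on the slide: the lower bound is rounded down and the upper bound is rounded up, so that the decoded ring still covers the original one, and the codes 0 and 255 are decoded as "no lower bound" and "no upper bound" to handle distances outside <d_i^min, d_i^max>.

```python
import math

def encode_hr(lo, hi, d_min, d_max):
    """Pack one hyper-ring interval into 2 bytes (assumed conservative rounding)."""
    scale = 255 / (d_max - d_min)
    lo_code = max(0, min(255, math.floor((lo - d_min) * scale)))
    hi_code = max(0, min(255, math.ceil((hi - d_min) * scale)))
    return bytes([lo_code, hi_code])

def decode_hr(codes, d_min, d_max):
    """Unpack 2 bytes into an interval that contains the encoded one."""
    step = (d_max - d_min) / 255
    lo = 0.0 if codes[0] == 0 else d_min + codes[0] * step        # 0: no lower bound
    hi = math.inf if codes[1] == 255 else d_min + codes[1] * step # 255: no upper bound
    return lo, hi

# Hypothetical histogram interval for one pivot.
d_min, d_max = 1.0, 52.0
packed = encode_hr(10.3, 20.7, d_min, d_max)
lo_dec, hi_dec = decode_hr(packed, d_min, d_max)
outlier = decode_hr(encode_hr(0.5, 60.0, d_min, d_max), d_min, d_max)
```

The decoded interval is slightly wider than the exact one, which costs some pruning power but never causes a relevant region to be discarded.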
Experimental results (synthetic) • synthetic dataset of 100,000 30-dimensional tuples distributed within 1000 clusters, L2 distance, query selectivity 50 objects
Experimental results (images) • collection of 10,000 images represented by 256-dimensional vectors (gray histograms), L2 distance, query selectivity 50 objects
Recent results (not included in proceedings) • Cost models for range queries in PM-tree (ADBIS'04) • Experiments on image dataset (ADBIS'04) • Optimal k-NN query algorithm for PM-tree + cost models (to be published...)
References [1] Skopal T., Pokorný J., Snášel V.: PM-tree: Pivoting Metric Tree for Similarity Search in Multimedia Databases. Submitted to ADBIS 2004, Budapest, Hungary. [2] Skopal T.: Pivoting M-tree: A Metric Access Method for Efficient Similarity Search. DATESO 2004, Desná. [3] Skopal T., Pokorný J., Krátký M., Snášel V.: Revisiting M-tree Building Principles. ADBIS 2003, LNCS 2798, Springer-Verlag, Dresden, Germany.