Nearest Neighbours Search using the PM-tree
Tomáš Skopal (1), Jaroslav Pokorný (1), Václav Snášel (2)
(1) Charles University in Prague, Department of Software Engineering, Czech Republic
(2) VSB - Technical University of Ostrava, Department of Computer Science, Czech Republic
DASFAA 2005, Beijing
Presentation Outline
• Similarity search in Metric Spaces
• M-tree
  • the structure
  • k-NN search
• PM-tree (an extension of M-tree)
  • motivation
  • the structure
  • k-NN search
• Experimental Results
Similarity search in Metric Spaces
Similarity search
• methods for content-based retrieval in multimedia databases
• the similarity measure is often modelled by a metric $d$ (satisfying the triangular inequality, symmetry, reflexivity, and non-negativity)
• similarity queries (query by example) are realized as metric queries
  • range query $(Q, r_Q)$, specified by a query object $Q$ and a query radius $r_Q$
  • k-NN query $(Q, k)$, specified by a query object $Q$ and the number of nearest neighbours $k$
Metric Access Methods (MAMs)
• designed to index metric datasets so that the search costs are kept minimal
• search costs = number of distance computations + I/O costs
• only distances between objects are used for indexing (the structure of the object representation is not used)
• many MAMs are not suitable for similarity search in large datasets
  • they are either static methods or incur high I/O search costs
  • M-tree and (recently) D-index are the only suitable candidates so far
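As a baseline for the two query types above, the following minimal sketch answers a range query and a k-NN query by sequential scan, using only the metric. The function names and the Euclidean example data are assumptions made for illustration; they do not come from the slides. A MAM aims to return the same answers with fewer distance computations and page reads.

```python
import heapq
import math

def range_query(data, dist, q, r_q):
    """Return all objects within distance r_q of q (one distance per object)."""
    return [o for o in data if dist(q, o) <= r_q]

def knn_query(data, dist, q, k):
    """Return the k objects nearest to q, closest first."""
    return heapq.nsmallest(k, data, key=lambda o: dist(q, o))

# Hypothetical 2D dataset under the Euclidean metric.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 4.0)]
in_range = range_query(points, math.dist, (0.0, 0.0), 1.5)
nearest = knn_query(points, math.dist, (0.0, 0.0), 3)
```

Both functions touch every object, so the number of distance computations equals the dataset size regardless of the query.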
M-tree (metric tree)
• dynamic, balanced, and paged tree structure (like the B+-tree or the R-tree)
• the leaves are clusters of indexed objects $O_j$ (ground objects)
• routing entries in the inner nodes represent hyper-spherical metric regions $(O_i, r_{O_i})$, recursively bounding the object clusters in leaves
• the triangular inequality allows irrelevant M-tree branches (i.e. their metric regions) to be discarded during query evaluation
[Figure: range query Q in a Euclidean 2D space]
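The discarding step can be sketched as follows. For any object $O$ inside a region $(O_i, r_{O_i})$, the triangular inequality gives $d(Q, O) \ge d(Q, O_i) - r_{O_i}$, so the branch cannot contribute to a range query $(Q, r_Q)$ when that lower bound exceeds $r_Q$. The function name and the sample values are hypothetical.

```python
import math

def may_contain_answers(q, center, radius, r_q, dist=math.dist):
    """Return False when the subtree under region (center, radius) can be
    discarded for the range query (q, r_q); True when it must be visited."""
    return dist(q, center) - radius <= r_q

# Region centred 5 units from the query object.
visit = may_contain_answers((0.0, 0.0), (5.0, 0.0), 2.0, 3.0)   # bound 3 <= 3
skip = may_contain_answers((0.0, 0.0), (5.0, 0.0), 1.0, 3.0)    # bound 4 > 3
```

Only one distance computation, $d(Q, O_i)$, is needed to decide about the whole subtree.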
k-NN search in the M-tree
• branch-and-bound algorithm (similar to that of the R-tree)
• a modification of the range query algorithm in which the query radius $r_Q$ is dynamic
  • $r_Q$ decreases from infinity to the distance to the k-th neighbour
• uses two structures: a priority queue PR and a sorted array NN
• PR: stores requests for nodes not yet filtered from the search
  • a request has the form [routing entry to a node N, $d_{\min}(N)$], where $d_{\min}(N)$ is a lower bound on the distance from $Q$ to any object in N, i.e.
    $d_{\min}(\text{rout. entry to } N) = \max\{0,\ d(Q, O_i) - r_{O_i}\}$
    where $(O_i, r_{O_i})$ is the region of N's routing entry
  • requests in PR are sorted by $d_{\min}(N)$
• NN: stores k candidate objects (or distance upper bounds)
  • at the end of the algorithm run, NN contains the result, i.e. the k nearest neighbours
  • an entry has the form [candidate object $O_i$, $d(Q, O_i)$] or [-, $d_{\max}(N)$], where $d_{\max}(N)$ is an upper bound on the distance from $Q$ to any object in N, i.e.
    $d_{\max}(\text{rout. entry to } N) = d(Q, O_i) + r_{O_i}$
• PR keeps only requests whose $d_{\min}(\cdot)$ is below the current $r_Q$, i.e. the largest distance or $d_{\max}(\cdot)$ bound held in NN; other requests are removed from PR
  • i.e. requests whose regions do not overlap the dynamic query region $(Q, r_Q)$ are removed
Query processing
• the requests in PR are processed in order of increasing $d_{\min}$ → a node N is retrieved, and the PR and NN structures are updated by the routing/ground entries of N
• PR is initialized to ([root, ∞]); NN is initialized to k entries ([-, ∞], [-, ∞], ...)
• optimal in I/O costs (the same I/O costs as the range query $(Q, d(Q, NN[k]))$)
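The procedure above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Node` and `RoutingEntry` classes, the in-memory tree, and the sample data are hypothetical; the root request is given lower bound 0; requests are filtered lazily when they are popped instead of being purged from PR eagerly; and a node's own $d_{\max}$ placeholder is retired when the node is read, a conservative choice of this sketch so that one object is never counted twice.

```python
import heapq
import itertools
import math
from dataclasses import dataclass
from typing import Any, List

@dataclass
class Node:
    is_leaf: bool
    entries: List[Any]   # leaf: ground objects; inner node: RoutingEntry items

@dataclass
class RoutingEntry:
    center: Any          # routing object O_i
    radius: float        # covering radius r_Oi
    child: Node

EMPTY = (math.inf, None, None)   # NN entry [-, infinity]

def nn_update(nn, bound, obj, owner):
    """Insert (bound, obj, owner) into the sorted array NN, drop the worst."""
    nn.append((bound, obj, owner))
    nn.sort(key=lambda e: e[0])
    nn.pop()

def knn_search(root, q, k, dist=math.dist):
    tie = itertools.count()            # tie-breaker, nodes are never compared
    pr = [(0.0, next(tie), root)]      # sketch assumption: root lower bound 0
    nn = [EMPTY] * k
    while pr:
        d_min, _, node = heapq.heappop(pr)      # smallest d_min first
        for i, e in enumerate(nn):              # retire this node's placeholder
            if e[2] is node:
                nn[i] = EMPTY
        nn.sort(key=lambda e: e[0])
        if d_min > nn[-1][0]:                   # nothing left overlaps (Q, r_Q)
            break
        if node.is_leaf:
            for obj in node.entries:
                d = dist(q, obj)
                if d < nn[-1][0]:
                    nn_update(nn, d, obj, None)
        else:
            for entry in node.entries:
                d = dist(q, entry.center)
                d_lo = max(0.0, d - entry.radius)     # d_min of the child
                d_hi = d + entry.radius               # d_max of the child
                if d_lo <= nn[-1][0]:
                    heapq.heappush(pr, (d_lo, next(tie), entry.child))
                    if d_hi < nn[-1][0]:
                        nn_update(nn, d_hi, None, entry.child)
    return [(d, obj) for d, obj, owner in nn if owner is None and d < math.inf]

# Hypothetical two-level tree over six 2D points.
leaf_a = Node(True, [(0.0, 0.0), (1.0, 0.0)])
leaf_b = Node(True, [(5.0, 5.0), (6.0, 5.0)])
leaf_c = Node(True, [(10.0, 0.0), (11.0, 0.0)])
root = Node(False, [RoutingEntry((0.0, 0.0), 1.0, leaf_a),
                    RoutingEntry((5.0, 5.0), 1.0, leaf_b),
                    RoutingEntry((10.0, 0.0), 1.0, leaf_c)])
result = knn_search(root, (6.0, 4.0), 3)
```

The sketch assumes every node is non-empty and every covering radius is valid, since a $d_{\max}$ placeholder is only a sound upper bound when the subtree holds at least one object.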
k-NN search in M-tree: example (k=2)
[Figure: initial $r_Q = \infty$; steps shown: read root, read node (II.); labelled bounds: $d_{\min}(I.)$, $d_{\max}(I.)$, $d_{\min}(II.) = 0$, $d_{\max}(II.)$]
k-NN search in M-tree: example (k=2), continued
[Figure: step shown: read node (D); labelled bounds: $d_{\min}(C)$, $d_{\max}(C)$, $d_{\min}(D)$, $d_{\max}(D)$, $d_{\max}(O_5)$, $d_{\max}(O_6)$]
k-NN search in M-tree: example (k=2), continued
[Figure: steps shown: read node (I.), read node (B); labelled bounds: $d_{\min}(B)$, $d_{\max}(O_4)$]
• 5 nodes accessed, the same nodes as accessed by the range query $(Q, d(Q, O_5))$
PM-tree motivation
• metric regions in the M-tree are unnecessarily large
  • large portions of empty space (the "dead" space) are indexed
  • this raises the probability of intersection with the query region
  • and so makes the search less efficient
• reducing the "volume" of metric regions should lead to more effective discarding of irrelevant subtrees
• the question is how to specify a compact metric region that bounds all the objects more "tightly"
  • this calls for a generalization of the M-tree to other metric region shapes
PM-tree region
• utilization of global pivots (inspired by LAESA-like methods)
  • given a fixed set of p global pivots $P_i$ (selected from (a part of) the dataset)
  • p hyper-ring regions $(P_i, HR[i])$ are defined for each routing entry
    • an array HR of p intervals <HR[i].min, HR[i].max>
    • each interval HR[i] bounds the distances of objects to the respective pivot $P_i$
• PM-tree region = M-tree region + HR array (the pivots $P_i$ are shared by all PM-tree regions)
  • the intersection of the hyper-sphere and the hyper-rings forms a smaller region bounding all the objects in leaves
  • the more pivots, the more tightly bounded the region
• the PM-tree is built the same way as the M-tree, i.e. the hyper-rings only "cut off" parts of the M-tree sphere
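To make the HR array concrete, the sketch below computes one interval per pivot for a given set of objects: the smallest and largest distance from the pivot to any of them. The function name `build_hr`, the tuple representation of an interval, and the sample pivots are assumptions made for illustration.

```python
import math

def build_hr(objects, pivots, dist=math.dist):
    """Return the HR array: one (min, max) distance interval per pivot."""
    hr = []
    for pivot in pivots:
        distances = [dist(pivot, obj) for obj in objects]
        hr.append((min(distances), max(distances)))
    return hr

# Two hypothetical pivots and a small cluster of 2D objects.
pivots = [(0.0, 0.0), (10.0, 0.0)]
cluster = [(3.0, 0.0), (4.0, 0.0)]
hr = build_hr(cluster, pivots)
```

Here every object of the cluster lies between 3 and 4 units from the first pivot and between 6 and 7 units from the second, so the region is the part of the covering sphere that falls inside both rings.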
PM-tree, query processing
• distances $d(Q, P_i)$ for all $i \le p$ must be computed prior to processing a query
• a metric region $(O_i, r_{O_i}, HR)$ is relevant to (intersected by) a range query $(Q, r_Q)$ only if all the hyper-rings and the hyper-sphere overlap the range query region
  • the more hyper-rings, the lower the probability of intersection with the query
  • no additional distance computations are needed for the intersection test
[Figure: the same query Q shown against an M-tree region and against a PM-tree region]
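A relevance test of this kind might look like the sketch below. It assumes the query-to-pivot distances have been computed once per query and are passed in as a list aligned with the HR array; the function name, argument names, and sample values are hypothetical. A ring overlaps the query ball when the distance interval $[d(P_t, Q) - r_Q,\ d(P_t, Q) + r_Q]$ meets the interval HR[t].

```python
def pm_region_is_relevant(d_q_o, r_o, hr, d_q_p, r_q):
    """Return True if the hyper-sphere and every hyper-ring overlap the
    range query ball of radius r_q.

    d_q_o -- distance from Q to the routing object O_i
    r_o   -- covering radius of the region
    hr    -- list of (min, max) intervals, one per pivot
    d_q_p -- precomputed distances from Q to each pivot, same order as hr
    """
    if d_q_o > r_q + r_o:                       # hyper-sphere test
        return False
    return all(d - r_q <= hi and d + r_q >= lo  # hyper-ring tests
               for d, (lo, hi) in zip(d_q_p, hr))

# The sphere overlaps in both calls; only the ring decides.
pruned = pm_region_is_relevant(5.0, 3.0, [(6.0, 7.0)], [2.0], 2.5)
kept = pm_region_is_relevant(5.0, 3.0, [(6.0, 7.0)], [4.0], 2.5)
```

In the first call the M-tree test alone would have visited the region, while the single hyper-ring rules it out without any further distance computation.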
k-NN search in the PM-tree
3 modifications of the M-tree's k-NN algorithm
• a different intersection test between the query region $(Q, r_Q)$ and the PM-tree region $(O_i, r_{O_i}, HR)$:
  $\bigwedge_{t=1}^{p} \big( d(P_t, Q) - r_Q \le HR[t].\mathrm{max} \ \wedge\ d(P_t, Q) + r_Q \ge HR[t].\mathrm{min} \big)$
• a different $d_{\min}$ construction (+ possible distance increase to the farthest hyper-ring):
  $d_{\min}(\text{rout. entry to } N) = \max\{0,\ d(Q, O_i) - r_{O_i},\ HR_{farthest}\}$
  $HR_{farthest} = \max_{t=1..p} \{ d(P_t, Q) - HR[t].\mathrm{max},\ HR[t].\mathrm{min} - d(P_t, Q) \}$
• a different $d_{\max}$ construction (+ possible distance decrease to the farthest object in the nearest hyper-ring):
  $d_{\max}(\text{rout. entry to } N) = \min\{ d(Q, O_i) + r_{O_i},\ HR_{nearest} \}$
  $HR_{nearest} = \min_{t=1..p} \{ d(P_t, Q) + HR[t].\mathrm{max} \}$
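The two bounds can be sketched as plain functions. Each hyper-ring yields its own lower and upper bound on the distance from $Q$ to any object in the region (both follow from the triangular inequality through the pivot $P_t$), so the sketch takes the largest lower bound for $d_{\min}$ and the smallest upper bound for $d_{\max}$. Function names, argument names, and the sample values are assumptions made for illustration.

```python
def pm_d_min(d_q_o, r_o, hr, d_q_p):
    """Lower bound on the distance from Q to any object in the region."""
    hr_farthest = max(max(d - hi, lo - d) for d, (lo, hi) in zip(d_q_p, hr))
    return max(0.0, d_q_o - r_o, hr_farthest)

def pm_d_max(d_q_o, r_o, hr, d_q_p):
    """Upper bound on the distance from Q to any object in the region."""
    hr_nearest = min(d + hi for d, (_, hi) in zip(d_q_p, hr))
    return min(d_q_o + r_o, hr_nearest)

# Region with d(Q, O_i) = 5, covering radius 3, one ring <6, 7>.
# The sphere alone would give the bounds 2 and 8.
lower = pm_d_min(5.0, 3.0, [(6.0, 7.0)], [0.5])   # ring raises it to 5.5
upper = pm_d_max(5.0, 3.0, [(6.0, 7.0)], [0.5])   # ring lowers it to 7.5
```

With a tighter $d_{\min}$ a request sinks lower in PR or is filtered sooner, and with a tighter $d_{\max}$ the dynamic radius $r_Q$ shrinks earlier.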
k-NN search in PM-tree: example (k=2)
[Figure: steps shown: read root, read node (I.); labelled bounds: $d_{\min}(I.)$, $d_{\max}(I.)$, $d_{\min}(II.)$, $d_{\max}(II.)$]
k-NN search in PM-tree: example (k=2), continued
[Figure: steps shown: read node (II.), read node (B)]
k-NN search in PM-tree: example (k=2), continued
[Figure: step shown: read node (D)]
• 5 nodes accessed, the same nodes as accessed by the range query $(Q, d(Q, O_5))$
Experimental Results (synthetic datasets)
• synthetic vector datasets (4D – 60D); 100,000 tuples; 1,000 clusters
• disk page sizes: 1 KB – 4 KB; index sizes: 4.5 MB – 55 MB
Experimental Results (image database)
• WBIIS image database; approx. 10,000 256D vectors (gray histograms)
• disk page size: 32 KB; index sizes: 16 MB – 20 MB