Nearest Neighbours Search using the PM-tree

Tomáš Skopal (1), Jaroslav Pokorný (1), Václav Snášel (2)
(1) Charles University in Prague, Department of Software Engineering, Czech Republic
(2) VSB - Technical University of Ostrava, Department of Computer Science, Czech Republic
DASFAA 2005, Beijing

Presentation Outline
• Similarity search in metric spaces
• M-tree
  • the structure
  • k-NN search
• PM-tree (an extension of the M-tree)
  • motivation
  • the structure
  • k-NN search
• Experimental results

Similarity search in Metric Spaces
Similarity search
• methods for content-based retrieval in multimedia databases
• the similarity measure is often modelled by a metric d (satisfying the triangle inequality, symmetry, reflexivity, and non-negativity)
• similarity queries (query by example) are realized as metric queries
  • range query (Q, rQ), specified by a query object Q and a query radius rQ
  • k-NN query (Q, k), specified by a query object Q and the number of nearest neighbours k
Metric Access Methods (MAMs)
• designed to search metric datasets while keeping the search costs minimal
• search costs = number of distance computations + I/O costs
• only distances between objects are used for indexing (the structure of the object representation is not used)
• many MAMs are not suitable for similarity search in large datasets
  • they are either static or incur high I/O search costs
  • M-tree and (recently) D-index are the only suitable candidates so far
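The two query types can be illustrated by a sequential scan, which is the baseline that any MAM tries to beat. This is a minimal sketch, assuming a Euclidean metric and a small in-memory list of points; the function names are illustrative and do not come from the presentation.

```python
import heapq
import math

def euclidean(a, b):
    # One possible metric d; any function satisfying the metric axioms would do.
    return math.dist(a, b)

def range_query(data, d, q, r_q):
    # Range query (Q, rQ): all objects within distance rQ of Q.
    return [o for o in data if d(q, o) <= r_q]

def knn_query(data, d, q, k):
    # k-NN query (Q, k): the k objects closest to Q.
    return heapq.nsmallest(k, data, key=lambda o: d(q, o))

points = [(0, 0), (1, 0), (0, 2), (3, 4), (5, 5)]
in_range = range_query(points, euclidean, (0, 0), 2.0)
nearest = knn_query(points, euclidean, (0, 0), 2)
```

Both functions compute one distance per indexed object; a MAM aims to answer the same queries with fewer distance computations and fewer page reads.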

M-tree (metric tree)
• dynamic, balanced, and paged tree structure (like the B+-tree or R-tree)
• the leaves are clusters of indexed objects Oj (ground objects)
• routing entries in the inner nodes represent hyper-spherical metric regions (Oi, rOi), recursively bounding the object clusters in the leaves
• the triangle inequality allows irrelevant M-tree branches (i.e. their metric regions) to be discarded during query evaluation
[Figure: range query Q in a Euclidean 2D space]
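The pruning rule implied by the triangle inequality can be written as a one-line test: a region (Oi, rOi) can contain an object within rQ of Q only if d(Q, Oi) ≤ rQ + rOi. The sketch below assumes Euclidean points and two hypothetical regions named "A" and "B".

```python
import math

def sphere_overlaps_query(d_q_o, r_o, r_q):
    # The region (Oi, rOi) is relevant only if d(Q, Oi) <= rQ + rOi.
    return d_q_o <= r_q + r_o

q = (0.0, 0.0)
r_q = 1.5
# Hypothetical routing entries: name -> (routing object Oi, covering radius rOi)
regions = {"A": ((1.0, 1.0), 1.0), "B": ((6.0, 0.0), 2.0)}

relevant = [
    name
    for name, (center, radius) in regions.items()
    if sphere_overlaps_query(math.dist(q, center), radius, r_q)
]
```

Region "B" is discarded without reading its subtree, because its nearest possible object is still farther from Q than rQ.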

k-NN search in the M-tree
• branch-and-bound algorithm (similar to that of the R-tree)
• a modification of the range query algorithm in which the query radius rQ is dynamic
  • rQ decreases from infinity to the distance to the k-th nearest neighbour
• two structures are used: a priority queue PR and a sorted array NN
• PR: stores requests for nodes not yet filtered from the search
  • a request has the form [routing entry to a node N, dmin(N)], where dmin(N) is the lower-bound distance from Q to all possible objects in N, i.e. dmin(routing entry to N) = max { 0, d(Q, Oi) - rOi }, where (Oi, rOi) is the region of N's routing entry
  • requests in PR are sorted by dmin(N)
• NN: stores k candidate objects (or distance upper bounds)
  • an entry has the form [candidate object Oi, d(Q, Oi)] or [-, dmax(N)], where dmax(N) is the upper-bound distance from Q to all possible objects in N, i.e. dmax(routing entry to N) = d(Q, Oi) + rOi
  • when the algorithm terminates, NN contains the result, i.e. the k nearest neighbours
• PR keeps only requests with dmin(N) ≤ rQ, where rQ is the current k-th distance in NN; requests that do not overlap the dynamic query region (Q, rQ) are removed from PR
Query processing
• PR is initialized to ([root, ∞]); NN is initialized to k entries ([-, ∞], [-, ∞], ...)
• the requests in PR are processed in order of increasing dmin(N): the node N is retrieved, and PR and NN are updated by the routing/ground entries of N
• optimal in I/O costs (the same I/O costs as the range query (Q, d(Q, NN[k])), whose radius is the distance to the k-th nearest neighbour)
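The branch-and-bound loop described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: the `Node` class and the tiny example tree are assumptions, the tree is held in memory, and NN keeps only real candidate objects (the [-, dmax(N)] placeholder entries are omitted, so rQ shrinks only once k objects have been seen).

```python
import heapq
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    is_leaf: bool
    # Leaf: entries are ground objects. Inner node: (Oi, rOi, child Node).
    entries: list = field(default_factory=list)

def knn_search(root, q, k, d=math.dist):
    nn = []                # max-heap of (-distance, object), at most k entries
    pr = [(0.0, 0, root)]  # min-heap of (dmin, tie-breaker, node)
    counter = 1
    accessed = 0

    def r_q():
        # Dynamic query radius: infinite until k candidates are known.
        return -nn[0][0] if len(nn) == k else math.inf

    while pr:
        dmin, _, node = heapq.heappop(pr)
        if dmin > r_q():
            break  # every remaining request has an even larger dmin
        accessed += 1
        if node.is_leaf:
            for obj in node.entries:
                dist = d(q, obj)
                if dist < r_q():
                    heapq.heappush(nn, (-dist, obj))
                    if len(nn) > k:
                        heapq.heappop(nn)  # drop the farthest candidate
        else:
            for center, radius, child in node.entries:
                lower = max(0.0, d(q, center) - radius)
                if lower <= r_q():
                    heapq.heappush(pr, (lower, counter, child))
                    counter += 1
    return sorted((-nd, o) for nd, o in nn), accessed

leaf_a = Node(True, [(0, 0), (1, 1)])
leaf_b = Node(True, [(5, 5), (6, 5)])
leaf_c = Node(True, [(10, 0), (11, 1)])
root = Node(False, [((0.5, 0.5), 0.8, leaf_a),
                    ((5.5, 5.0), 0.6, leaf_b),
                    ((10.5, 0.5), 0.8, leaf_c)])
result, accessed = knn_search(root, (1, 2), 2)
```

In this example only the root and one leaf are read; the other two leaves are discarded because their dmin exceeds the final rQ.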

k-NN search in M-tree: example (k=2)
[Figure: initial state with rQ = ∞; steps "read root" and "read node (II.)"; labelled bounds dmin(I.), dmax(I.), dmin(II.) = 0, dmax(II.)]

k-NN search in M-tree: example (k=2)
[Figure: step "read node (D)"; labelled bounds dmin(C), dmax(C), dmin(D), dmax(D), dmax(O5), dmax(O6)]

k-NN search in M-tree: example (k=2)
[Figure: steps "read node (I.)" and "read node (B)"; labelled bounds dmin(B), dmax(O4)]
5 nodes accessed, the same nodes as accessed by the range query (Q, d(Q, O5))

PM-tree motivation
• metric regions in the M-tree are unnecessarily large
  • large portions of empty ("dead") space are indexed
  • higher probability of intersection with the query region
  • less efficient search
• reducing the "volume" of a metric region should lead to more effective discarding of irrelevant subtrees
• the question is how to specify a compact metric region that bounds all the objects more tightly
  • generalization of the M-tree to other metric region shapes

PM-tree region
• utilization of global pivots (inspired by LAESA-like methods)
  • given a fixed set of p global pivots Pi (selected from (a part of) the dataset)
  • p hyper-ring regions (Pi, HR[i]) are defined for each routing entry
    • an array HR of p intervals <HR[i].min, HR[i].max>
    • each interval HR[i] bounds the distances of the objects to the respective pivot Pi
• PM-tree region = M-tree region + HR array (the pivots Pi are shared by all PM-tree regions)
  • the intersection of the hyper-sphere and the hyper-rings forms a smaller region bounding all the objects in the leaves
  • the more pivots, the more tightly bounded the region
• the PM-tree is built the same way as the M-tree, i.e. the hyper-rings only "cut off" parts of the M-tree sphere
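The HR array of a region can be sketched as a per-pivot pair of minimum and maximum distances. The function below is an illustration under simple assumptions: it scans the objects of one cluster directly, whereas a real index would presumably maintain the intervals incrementally during insertion; the names `hyper_rings`, `pivots`, and `cluster` are hypothetical.

```python
import math

def hyper_rings(objects, pivots, d=math.dist):
    # HR[i] = (min, max) of the distances from the objects to pivot Pi.
    hr = []
    for p in pivots:
        dists = [d(o, p) for o in objects]
        hr.append((min(dists), max(dists)))
    return hr

pivots = [(0, 0), (10, 0)]           # p = 2 global pivots
cluster = [(3, 4), (6, 8), (3, 0)]   # objects bounded by one region
hr = hyper_rings(cluster, pivots)
```

Each interval costs two numbers per pivot and per routing entry, which is the storage price paid for the tighter region.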

PM-tree, query processing
• the distances d(Q, Pi) for all i ≤ p must be computed prior to processing a query
• a metric region (Oi, rOi, HR) is relevant to (intersected by) a range query (Q, rQ) if and only if the hyper-sphere and all the hyper-rings overlap the range query region
  • the more hyper-rings, the lower the probability of intersection with the query
  • no additional distance computations are needed for the intersection test
[Figure: the same query Q against an M-tree region and a PM-tree region]
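The relevance test for a PM-tree region can be sketched as the M-tree sphere test followed by one interval check per pivot. The distances d(Q, Pi) are passed in precomputed, which is why the ring checks need no further distance computations. The function name and the numbers in the example are illustrative assumptions.

```python
def pm_region_overlaps_query(d_q_o, r_o, hr, d_q_pivots, r_q):
    # Hyper-sphere test, the same as in the M-tree.
    if d_q_o > r_q + r_o:
        return False
    # Hyper-ring tests: the query ball must reach every ring <lo, hi>.
    return all(
        dqp - r_q <= hi and dqp + r_q >= lo
        for dqp, (lo, hi) in zip(d_q_pivots, hr)
    )

# The sphere overlaps the query (4 <= 2 + 3), but the single ring <6, 8>
# lies too far from the query ball around Q (2 + 2 < 6).
sphere_only = pm_region_overlaps_query(4.0, 3.0, [], [], 2.0)
with_ring = pm_region_overlaps_query(4.0, 3.0, [(6.0, 8.0)], [2.0], 2.0)
```

With no rings the test degenerates to the M-tree test, which shows how the hyper-rings can only remove regions from consideration, never add them.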

k-NN search in the PM-tree
3 modifications of the M-tree's k-NN algorithm
• different intersection test between the query region (Q, rQ) and the PM-tree region (Oi, rOi, HR); for all t = 1..p:
    d(Pt, Q) - rQ ≤ HR[t].max  ∧  d(Pt, Q) + rQ ≥ HR[t].min
• different dmin construction (possible increase of the lower bound up to the farthest hyper-ring)
    dmin(routing entry to N) = max { 0, d(Q, Oi) - rOi, HRfarthest }
    HRfarthest = max over t = 1..p of { d(Pt, Q) - HR[t].max, HR[t].min - d(Pt, Q) }
• different dmax construction (possible decrease of the upper bound down to the farthest object in the nearest hyper-ring)
    dmax(routing entry to N) = min { d(Q, Oi) + rOi, HRnearest }
    HRnearest = min over t = 1..p of { d(Pt, Q) + HR[t].max }
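The two bounds can be checked on a small worked example. The sketch assumes Euclidean points, takes dmax as the smaller of the sphere bound and the ring bound d(Pt, Q) + HR[t].max (both are upper bounds by the triangle inequality), and uses hypothetical names `pm_dmin` and `pm_dmax`.

```python
import math

def pm_dmin(d_q_o, r_o, hr, d_q_pivots):
    # Lower bound: the sphere bound, raised by the farthest hyper-ring.
    hr_farthest = max(
        max(dqp - hi, lo - dqp) for dqp, (lo, hi) in zip(d_q_pivots, hr)
    )
    return max(0.0, d_q_o - r_o, hr_farthest)

def pm_dmax(d_q_o, r_o, hr, d_q_pivots):
    # Upper bound: the sphere bound, lowered by the nearest hyper-ring.
    hr_nearest = min(dqp + hi for dqp, (_, hi) in zip(d_q_pivots, hr))
    return min(d_q_o + r_o, hr_nearest)

q, center, radius = (4, 3), (4, 0), 1.0
objects = [(3, 0), (4, 0), (5, 0)]
pivots = [(4, 10), (4, 1)]
hr = [
    (min(math.dist(o, p) for o in objects),
     max(math.dist(o, p) for o in objects))
    for p in pivots
]
d_q_pivots = [math.dist(q, p) for p in pivots]
d_q_o = math.dist(q, center)

lower = pm_dmin(d_q_o, radius, hr, d_q_pivots)
upper = pm_dmax(d_q_o, radius, hr, d_q_pivots)
```

Here the plain M-tree bounds would be 2 and 4, while the true distances from Q to the three objects lie between 3 and about 3.16; the ring-based bounds enclose them more tightly.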

k-NN search in PM-tree: example (k=2)
[Figure: steps "read root" and "read node (I.)"; labelled bounds dmin(I.), dmax(I.), dmin(II.), dmax(II.)]

k-NN search in PM-tree: example (k=2)
[Figure: steps "read node (II.)" and "read node (B)"]

k-NN search in PM-tree: example (k=2)
[Figure: step "read node (D)"]
5 nodes accessed, the same nodes as accessed by the range query (Q, d(Q, O5))

Experimental Results (synthetic datasets)
• synthetic vector datasets (4D to 60D); 100,000 tuples; 1,000 clusters
• disk page sizes: 1 KB to 4 KB; index sizes: 4.5 MB to 55 MB

Experimental Results (image database)
• WBIIS image database; approx. 10,000 256D vectors (gray histograms)
• disk page size: 32 KB; index sizes: 16 MB to 20 MB

