Improving the Performance of M-tree Family by Nearest-Neighbor Graphs

Tomáš Skopal, David Hoksza
Charles University in Prague, Department of Software Engineering, Czech Republic

Presentation Outline
• Metric Access Methods (MAMs)
• M-tree, PM-tree
• Query processing and filtering
• Nearest-neighbor graphs → M*-tree, PM*-tree
  • filtering
  • pivot selection strategies
• Experiments
ADBIS 2007

Metric Access Methods
• Indexing methods designed for searching metric datasets
• Similarities among objects are modeled by a distance function that fulfills the metric properties
• MAMs focus on minimizing the number of distance computations by storing distances in the index, thus filtering out non-relevant objects when querying
• Methods
  • GNAT, (m)vp-tree, D-index, (L)AESA, …
  • M-tree, PM-tree

M-tree (Metric tree)
• dynamic, hierarchical index structure
• data space divided into ball-shaped data regions (hyper-spheres)
  • the root node represents a data region covering all data
  • child nodes represent regions covering parts of the space, …
• built in a bottom-up way, like a B-tree
  • when a node is full, a new node is created and the objects are separated between the two nodes
• data regions form a balanced hierarchical structure
  • inner nodes → routing entries
  • leaf nodes → ground items

Query Processing + Filtering
• range and k nearest neighbor (kNN) queries
• traversal starts from the root node
• in the case of kNN, the query radius decreases dynamically
• basic filtering → filter out nodes whose parent data region doesn't intersect the query region
• parent filtering → uses the precomputed distance of an object to its parent and the distance of the parent to the query
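The two filters above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Entry` class, its field names, and the use of the L1 distance are assumptions made for the example. Both tests follow from the triangle inequality; parent filtering needs no new distance computation, while basic filtering costs one.

```python
from dataclasses import dataclass

@dataclass
class Entry:                 # hypothetical node entry, for illustration only
    obj: tuple               # routing object or ground object
    radius: float            # covering radius (0.0 for a ground entry)
    to_parent: float         # precomputed d(obj, parent object)

def l1(a, b):
    """L1 (Manhattan) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def parent_filter(entry, d_query_parent, r_query):
    """True if the entry can be discarded without computing d(Q, obj)."""
    return abs(d_query_parent - entry.to_parent) > r_query + entry.radius

def basic_filter(entry, query, r_query, dist=l1):
    """True if the entry's ball misses the query ball (one distance computation)."""
    return dist(query, entry.obj) > r_query + entry.radius
```

In a node scan, `parent_filter` would typically be tried first, and `basic_filter` only for entries that survive it.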

PM-tree (Pivoting Metric tree)
• PM-tree = M-tree enhanced by p static global pivots, with each hyper-sphere region enhanced by p hyper-ring regions (rings that restrict its volume)
• the ith ring is defined by the nearest and the furthest objects in the node with respect to the ith pivot
• the query region overlaps a node region only if it overlaps the hyper-sphere and all hyper-rings → more effective basic filtering
[Figure: a query Q against an M-tree region and against a PM-tree region; in the PM-tree case Q doesn't overlap the 2nd ring]
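The overlap condition above can be sketched as a pair of interval tests. This is an illustrative sketch only; the function names and the representation of a ring as a `(hr_min, hr_max)` pair are assumptions, not taken from the slides.

```python
def overlaps_rings(query_to_pivots, r_query, rings):
    """rings[i] = (hr_min, hr_max): nearest and furthest distance of the
    node's objects to the i-th pivot. query_to_pivots[i] = d(Q, pivot_i)."""
    return all(
        d - r_query <= hr_max and d + r_query >= hr_min
        for d, (hr_min, hr_max) in zip(query_to_pivots, rings)
    )

def overlaps_pm_region(d_query_center, r_query, r_cover, query_to_pivots, rings):
    """The query must overlap the hyper-sphere AND every hyper-ring."""
    sphere_ok = d_query_center <= r_query + r_cover
    return sphere_ok and overlaps_rings(query_to_pivots, r_query, rings)
```

The distances from the query to the p global pivots are computed once per query, so the ring tests add no distance computations per node.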

Pivot space
• global pivots map regions/data into a pivot space of dimensionality p (ith coordinate → distance to the ith pivot)
• the distances of a data region to the p pivots produce a p-dimensional minimum bounding rectangle
• the overlap with rings can be understood in this sense as L∞ filtering (a region is filtered out if its L∞ distance to Q is greater than the query radius)
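A small sketch of the pivot-space view, under the assumption that a region's rings are stored as one `(lo, hi)` interval per pivot; the helper names are hypothetical. A region can be discarded when the L∞ distance from the query's image to the region's rectangle exceeds the query radius.

```python
def l1(a, b):
    """L1 distance, used here only as an example metric."""
    return sum(abs(x - y) for x, y in zip(a, b))

def to_pivot_space(obj, pivots, dist):
    """Map an object to its vector of distances to the p pivots."""
    return [dist(obj, p) for p in pivots]

def linf_to_mbr(point, mbr):
    """L-infinity distance from a pivot-space point to an axis-aligned
    rectangle, where mbr[i] = (lo, hi) along the i-th pivot coordinate."""
    return max(max(lo - x, 0, x - hi) for x, (lo, hi) in zip(point, mbr))

def linf_filter(query_image, r_query, mbr):
    """True if the region can be filtered out."""
    return linf_to_mbr(query_image, mbr) > r_query
```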

M*-tree, PM*-tree
• M*-tree = M-tree + nearest-neighbor (NN) graphs
  • present in every node
  • each object knows its NN (within its node)
  • example → O6 = NN(O4)
• PM*-tree = PM-tree + nearest-neighbor (NN) graphs
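A per-node NN-graph could be built as in the sketch below. It is a naive quadratic construction written for illustration; the function names and the list-of-tuples representation are assumptions, and the slides do not say how the graph is maintained during insertions.

```python
def l1(a, b):
    """L1 distance, used here only as an example metric."""
    return sum(abs(x - y) for x, y in zip(a, b))

def build_nn_graph(objects, dist):
    """nn[i] = (index of the nearest neighbor of objects[i] within the node,
    distance to it). Assumes the node holds at least two objects."""
    nn = []
    for i, a in enumerate(objects):
        d, j = min((dist(a, b), j) for j, b in enumerate(objects) if j != i)
        nn.append((j, d))
    return nn

def reverse_nn(nn):
    """rnn[j] = indices of the objects whose nearest neighbor is j."""
    rnn = {i: [] for i in range(len(nn))}
    for i, (j, _) in enumerate(nn):
        rnn[j].append(i)
    return rnn
```

Because every object has exactly one NN, the reverse-NN sets produced by `reverse_nn` are disjoint.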

NN-graph Filtering
• objects (NN-graph nodes) play the role of mutual local pivots
• sacrifice
  • a local pivot
  • an object whose distance to the query is actually computed during query evaluation
  • used for possible filtering of its reverse nearest neighbors (rNNs)
• filtering with the NN-graph (one step of node processing)
  • fetch the first record (Si) from the sacrifice queue (SQ)
  • apply parent filtering to Si
  • if Si is not filtered → sacrifice it (compute the Q-Si distance)
    • try to filter out rNNs(Si) (NN-graph filtering)
    • move non-filtered rNNs(Si) to the beginning of SQ (rNN sets are disjoint → non-filtered objects become sacrifices)
    • apply basic filtering to Si
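The node-processing loop above might look roughly like the following sketch. Several details are assumptions rather than facts from the slides: entries are plain dicts with hypothetical keys (`obj`, `radius`, `to_parent`, `nn_dist`), the queue starts in storage order, and an rNN O of a sacrifice S is filtered when |d(Q,S) - d(S,O)| > r_Q + r_O, which is the triangle-inequality test with S acting as a local pivot.

```python
from collections import deque

def l1(a, b):
    """L1 distance, used here only as an example metric."""
    return sum(abs(x - y) for x, y in zip(a, b))

def process_node(entries, rnn, query, r_query, d_query_parent, dist):
    """Return (indices of entries that survive, number of distance computations).
    entries[i]["nn_dist"] is the stored distance from entry i to its NN;
    rnn[i] lists the entries whose NN is entry i."""
    sq = deque(range(len(entries)))      # sacrifice queue (SQ)
    pending = set(sq)                    # entries not yet processed or filtered
    survivors, computed = [], 0
    while sq:
        i = sq.popleft()
        if i not in pending:             # already removed by NN-graph filtering
            continue
        pending.discard(i)
        e = entries[i]
        # parent filtering: no distance computation needed
        if abs(d_query_parent - e["to_parent"]) > r_query + e["radius"]:
            continue
        d_qs = dist(query, e["obj"])     # the entry becomes a sacrifice
        computed += 1
        kept = []
        for j in rnn.get(i, []):
            if j not in pending:
                continue
            o = entries[j]
            if abs(d_qs - o["nn_dist"]) > r_query + o["radius"]:
                pending.discard(j)       # NN-graph filtering
            else:
                kept.append(j)
        for j in reversed(kept):         # non-filtered rNNs go to the front of SQ
            sq.remove(j)
            sq.appendleft(j)
        if d_qs <= r_query + e["radius"]:  # basic filtering
            survivors.append(i)
    return survivors, computed
```

With four entries forming two far-apart NN pairs, a range query near one pair would need three distance computations instead of four, because one object of the distant pair is filtered through its neighbor.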

Sacrifice selection
• the selection of sacrifices is important
  • a good pivot filters many objects out
  • a poor pivot filters out good possible pivot(s) (future sacrifices)
• Heuristics
  • M*-tree
    • hMaxRNNCount: first in SQ is the object with the highest number of rNNs
    • hMinRNNDistance: first in SQ is the object nearest to its NN or rNN
    • hMinToParentDistance: first in SQ is the object closest to the parent object
  • PM*-tree
    • hMinLmaxDistance: first in SQ is the object with the minimum L∞ distance
    • hMaxLmaxDistance: first in SQ is the object with the maximum L∞ distance
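The three M*-tree heuristics can be read as sort keys for the initial sacrifice queue, as in the sketch below. The slide only states which object comes first; sorting the whole queue by the same key is an assumption, as are the dict keys `nn_dist` and `to_parent` and the function name `order_sq`.

```python
def order_sq(entries, rnn, heuristic):
    """Return entry indices in the order they would enter the sacrifice queue.
    entries[i]["nn_dist"]: distance of entry i to its NN;
    entries[i]["to_parent"]: distance of entry i to the parent object;
    rnn[i]: indices of the entries whose NN is entry i."""
    if heuristic == "hMaxRNNCount":
        key = lambda i: -len(rnn.get(i, []))
    elif heuristic == "hMinRNNDistance":
        # smallest distance to the entry's own NN or to any of its rNNs
        key = lambda i: min(
            [entries[i]["nn_dist"]] + [entries[j]["nn_dist"] for j in rnn.get(i, [])]
        )
    elif heuristic == "hMinToParentDistance":
        key = lambda i: entries[i]["to_parent"]
    else:
        raise ValueError(f"unknown heuristic: {heuristic}")
    return sorted(range(len(entries)), key=key)  # stable sort keeps ties in storage order
```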

Experimental Results
• Corel dataset
  • 65,615 feature vectors of images
  • L1 distance function
  • 8 dimensions
• Polygons dataset
  • synthetic
  • 1,000,000 randomly generated 2D polygons (5-10 vertices)
  • Hausdorff set distance function
• GenBank dataset
  • 250,000 protein strings (of lengths 50-100)
  • edit distance function
• Testing of
  • computation costs (number of distance computations)
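Computation cost measured as the number of distance computations can be collected with a simple counting wrapper. The sketch below is an assumption about how such a measurement might be set up, not the authors' test harness; it shows the L1 distance and a standard dynamic-programming edit (Levenshtein) distance with unit costs as examples of the metrics listed above.

```python
def l1(a, b):
    """L1 (Manhattan) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def edit_distance(a, b):
    """Levenshtein distance between two strings (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]

class CountingDistance:
    """Wraps a distance function and counts how often it is called."""
    def __init__(self, dist):
        self.dist = dist
        self.calls = 0

    def __call__(self, a, b):
        self.calls += 1
        return self.dist(a, b)
```

Passing a `CountingDistance` instance to an index in place of the raw metric lets `calls` be read after each query.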

Experiments – Corel Dataset

Experiments – Polygons Dataset

Experiments – GenBank Dataset

Conclusion
• We have proposed
  • enhancing the nodes of M-tree-like structures by nearest-neighbor graphs
  • a filtering technique based on NN-graphs → NN-graph filtering
• We have implemented
  • M*-tree (enhancement of M-tree by NN-graphs)
  • PM*-tree (enhancement of PM-tree by NN-graphs)
• Experimental results
  • we have shown up to 45% speed-up
