NM-Tree : Flexible Approximate Similarity Search in Metric and Non-metric Spaces

NM-Tree: Flexible Approximate Similarity Search in Metric and Non-metric Spaces Tomáš Skopal Jakub Lokoč Charles University in PragueDepartment of Software Engineering Czech Republic DEXA 2008, Turin, Italy, Sep 1-5

Presentation outline • Metric similarity search • Semimetric tuning • M-Tree • NM-Tree • Experimental results • Discussion

query object Metric similarity search • How to search in multimedia databases (MDB)? • textual annotation is expensive and ambiguous • MDB objects are not structured (we cannot use a structured query language, like SQL) • solution: content based similarity searching;similarity between a query and DB object is interpreted as a relevance • similarity is often modelled by a distance function d satisfying metric properties-> metric searching -> metric access methods (MAMs) • the distance functiond is supposed to be computationallyexpensive-> sequential search is unfeasible -> external indexing structures • metric properties are advantageous for indexing, but may be unsuitable for domain experts • mainly thetriangle inequality -> let's d be relaxed to a semimetric (i.e., reflexive, non-negative, symmetric distance) • we use MAMsalso with semimetric distances, but we have to take more or less incorrect behavior into account • false dismissals & false positives for semimetrics

Semimetric tuning • semimetric brings less limitations for domain experts, but… • semimetric doesn’t guarantee triangle inequality for every triplet of objects in the database -> a lot of non-triangle triplets causes a lot of false dismissals in MAMs and vice versa • exact metric searching = all distance triplets must be triangle triplets • non-triangle triplets cause only approximate but faster search • semimetric tuning = changing the proportion of non-triangle triplets (generated by ds) by applying modifierf(real function) on original semimetric, e.g. ds* = f (ds) • ds* should - satisfy triangle inequality to some extent (controlled precision of searching) - generate lower-dimensional distance distributions (faster searching) a a c c b b Triangle triplet: a + b >= c Non-triangle triplet: a + b < c

Properties of the modifier f f is increasing – preservesoriginal query orderings triangle-generating f = concave f = turns more distance triplets into triangle ones = slow but precise searching triangle-violating f = convex f = turns more distance triplets into non-triangle ones = fast but approximate searching parametric T-bases TriGEN – an algorithm for finding an f that satisfies the user-requested retrieval error e and maximizes the search efficiency (lowest intrinsic dimensionality), see [2] Modified distancesmay form the triangle triplet f f guarantees the fish is always more similar to sea-maid than to girl ds* ds* ds ds

range query Q (euclidean 2D space) M-tree • dynamic, balanced, and paged tree structure (like e.g. B+-tree, R-tree) • the leaves are clusters of indexed objects Oj (ground objects) • routing entries in the inner nodes represent hyper-spherical metric regions (Oi , rOi), recursively bounding the object clusters in leaves • the triangle inequality allows to discard irrelevant M-tree branches (metric regions resp.) during query evaluation

M-tree filtering a) basic filtering (expensive)b) parent filtering (cheap) d(R, Q) > rR + rQ |d(P, Q) – d(P, R)| > rR + rQ

NM-tree motivation • separated usage of TriGen and M-tree - limitations • M-treehierarchy depends on the topology of ds (specific to f used) • For another measureds*the database must be re-indexed • To provide user with choice of precision/efficiency tradeoff we need to maintain more M-trees- each for particular dissimilarity measure !!! • T-bases (returned by TriGEN)natively support inverse modification (as proposed in [2])ds = fe(fe(ds , w), -w) notation : fe-1~ fe(ds, -w) ds* We can mimic multiple M-trees with just one M-tree and an appropriate set of modifiers fe -> NM-tree

NM-tree setbacks Native combination of M-tree and TriGEN brings some problems… • determining modifiers for an empty index (no data received) • solution : gather first k objects into a sequential file then find modifiers • guarantee of exact searching • solution : use metric dm = fm(ds) for inserting -> it is necessary to find fm modifier in the initial phase • aggregated distances stored in M-tree as covering radii • cannot be correctly remodified by f • solution : approximate search only at the preleaf & leaf level

NM-tree • structure • the same structure as for the M-tree • maintains modifiers fe for distance modification • construction • Initial phase • inserting into the sequential file until sufficient number of objects is gathered • then finding modifiers for requested retrieval errors including fm for error = 0 • all objects from the sequential file are inserted into the NM-tree • Second phase • inserting into the NM-tree under dm = fm(ds) -> this makes possible exact searching • querying • additional query parameter – an error threshold e (such that fe is available in NM-tree) • Exact search –NM-tree querying under dm (+ result distances remodified by fm-1) • Approximate search – for stored values (in the index) it is necessary to makeconversion from dm to ds using fm-1 and subsequentlyconversion from ds to desired ds* using fe

NM-tree approximate querying • For upper levels the metric search is performed • For the preleaf and leaf level all the distances used for pruning are modified to the desired semimetric depending on the user-defined error threshold e entry modification d2p* = fe(fm-1(d2p)) e_radius* = fe(fm-1(e_radius)) query modification q_radius* = fe(q_radius) e2qd* = fe(e2qd) dm ds*

NM-tree querying example

Experiments • We have compared multiple M-trees (each for semimetric determined by user defined error) with NM-tree • We have performed our tests on two databases • 68,040 32-dimensional Corel features • 250,000 synthetic 2D polygons, each consisting of 5 to 15 vertices • We have tested one semimetric and one metric on each database • COREL – L0.75 and L2 • Polygons – DTW and Hausdorff • For TriGEN we have selected 10 modifiers for each dissimilarity measure and T-errors (values correlated with retrieval error)within [0 – 0,32]

Experimental results

References • [1] Ciaccia P., Patella M., Zezula P.M-tree: An Efficient Access Method for Similarity Search in Metric Spaces. In: VLDB 1997, pp. 426–435 (1997) • [2] Skopal T.Unified framework for fast exact and approximate search in dissimilarity spaces. ACM Transactions on Database Systems 32(4), 1–46 (2007) • [3] Chávez E., Navarro G.A Probabilistic Spell for the Curse of Dimensionality.In: Buchsbaum, A.L., Snoeyink, J. (eds.) ALENEX 2001. LNCS, vol. 2153, pp. 147–160. Springer, Heidelberg (2001)

... Thank you for attention Questions ??

NM-Tree : Flexible Approximate Similarity Search in Metric and Non-metric Spaces