Similarity Search without Tears: the OMNI-Family of All-Purpose Access Methods Michael Kelleher Kiyotaka Iwataki The Department of Computer and Information Science and Engineering, University of Florida
Outline • Problem/Solution • Background • The Omni-concept • Members of the Omni-family • Experimental Results
Problem • Diverse and complex data • How to search • Expensive distance calculations
Solution • Reduce the number of distance calculations • The Omni-Concept/Family • Select a set of foci • Gauge all other objects with their distance from this set • The foci increase the pruning of distance calculations • Scalable
Background: Metric Spaces • Given a set of objects S = {s1,s2,s3,…,sn} from some domain, a distance function d() is a metric if it has the following properties: • Symmetry: d(s1,s2) = d(s2,s1) • Non-negativity: 0 < d(s1,s2) < ∞ for s1 ≠ s2, and d(s1,s1) = 0 • Triangle inequality: d(s1,s3) ≤ d(s1,s2) + d(s2,s3) • A metric space is a pair M = <S,d()> • Spatial datasets with an Lp distance function are special cases of metric spaces.
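The pruning used later in the deck rests on a lower bound that follows from symmetry and the triangle inequality. A short derivation, where f is any fixed reference object (a focus), s_i a stored object, s_q the query object and r_q the query radius:

```latex
\begin{align*}
d(f,s_q) &\le d(f,s_i) + d(s_i,s_q)
  &&\Rightarrow\quad d(f,s_q) - d(f,s_i) \le d(s_i,s_q) \\
d(f,s_i) &\le d(f,s_q) + d(s_q,s_i)
  &&\Rightarrow\quad d(f,s_i) - d(f,s_q) \le d(s_i,s_q) \\
&&&\Rightarrow\quad \lvert d(f,s_i) - d(f,s_q) \rvert \le d(s_i,s_q)
\end{align*}
```

So whenever |d(f,s_i) − d(f,s_q)| > r_q, it follows that d(s_i,s_q) > r_q, and s_i can be discarded without computing d(s_i,s_q).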
Range and NN Queries • Range: Given a query object sq, and a max search distance rq: Rquery(sq,rq)= {si | si ∈ S: d(si,sq) ≤ rq} • NN: Given a query object sq ∈ S: NNquery(sq)= {sn ∈ S | ∀si ∈ S: d(sn,sq) ≤ d(si,sq)}
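As a baseline for the two query types, here is a minimal Python sketch that answers both by brute force, computing one distance per stored object. The function names and the use of Euclidean distance (`math.dist`) are assumptions for illustration; any metric d() can be passed in.

```python
import math


def range_query(S, sq, rq, d=math.dist):
    """Return every object of S within distance rq of the query object sq."""
    return [s for s in S if d(s, sq) <= rq]


def nn_query(S, sq, d=math.dist):
    """Return an object of S whose distance to sq is minimal."""
    return min(S, key=lambda s: d(s, sq))
```

Both functions cost N distance calculations per query, which is the cost the Omni-family tries to reduce.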
Current solutions • Metric tree of Uhlmann • Vantage-point tree • Generalized hyper-plane tree • Multi-vantage point tree • Geometric Near Access tree • The M-tree
Intrinsic Dimensionality • Some approaches assume that the embedding dimensionality of a dataset determines its behavior under a query. • Datasets can inhabit only a small portion of the embedding space. • The intrinsic dimensionality gives better precision when estimating selectivity. • Use the correlation fractal dimension D2 as an approximation of the intrinsic dimension.
Omni-concepts • Omni-foci base (F): Given a metric space M, F = {f1,f2,…,fl | fk ∈ S, fk ≠ fj for k ≠ j, l ≤ N} • Omni-coordinates (Ci): Ci = {<fk, d(fk,si)>, for all fk ∈ F} • mbOr (minimum bounding Omni region): Given F and a collection of objects A = {x1,x2,…,xn} ⊂ S, the mbOr is the intersection of the metric intervals RA = I1 ∩ I2 ∩ … ∩ Il, where Ii = [min(d(xj,fi)), max(d(xj,fi))], 1 ≤ i ≤ l, 1 ≤ j ≤ n.
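The two definitions above can be sketched in a few lines of Python. This is an illustration under assumptions: the helper names are hypothetical, Euclidean distance stands in for d(), and the Omni-coordinates are stored as a list ordered like the foci instead of as <fk, d(fk,si)> pairs.

```python
import math


def omni_coordinates(foci, s, d=math.dist):
    """Distances from object s to each focus, in the order of the foci."""
    return [d(f, s) for f in foci]


def mbor(foci, objects, d=math.dist):
    """One [min, max] distance interval per focus, covering all given objects."""
    coords = [omni_coordinates(foci, x, d) for x in objects]
    return [
        (min(c[i] for c in coords), max(c[i] for c in coords))
        for i in range(len(foci))
    ]
```

An object can belong to the collection only if each of its Omni-coordinates falls inside the corresponding interval.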
[Figure: diagram labeled with the distances df1a, df1b, df2a and df2b]
Cardinality of F • A good number of foci lies between ceil(D2)+1 and 2*ceil(D2)+1, where ceil(D2) is the intrinsic dimension D2 rounded up to the next integer.
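The rule of thumb is plain arithmetic. A small Python sketch (the function name is hypothetical) that returns the suggested lower and upper number of foci for a given D2:

```python
import math


def foci_count_range(d2):
    """Suggested (min, max) number of foci for intrinsic dimension d2."""
    base = math.ceil(d2)
    return base + 1, 2 * base + 1
```

For example, a dataset with D2 = 2.3 gives ceil(D2) = 3, so between 4 and 7 foci.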
How to choose foci: HF-Algorithm • [Figure: example with objects s1 to s6 and the pairwise distances labeled between them]
HF-Algorithm • The HF-Algorithm is practical: O(N) • Requires l*N distance calculations • Searching exhaustively for the best foci would cost O(N!/(N-l)!)
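The slides do not spell out the HF steps, so the following is a minimal Python sketch under an assumed formulation: start from an arbitrary object, take the object farthest from it as the first focus, take the object farthest from that focus as the second, and call their distance the edge. Each further focus is the object whose distances to the foci chosen so far deviate least from the edge. The function name, the `start` parameter and the use of `math.dist` are illustrative assumptions.

```python
import math


def hf_select_foci(S, num_foci, d=math.dist, start=0):
    """Pick num_foci objects of S that lie far apart (assumed HF heuristic)."""
    n = len(S)
    if not 1 <= num_foci <= n:
        raise ValueError("num_foci must be between 1 and len(S)")
    # Pass 1: the object farthest from an arbitrary starting object.
    f1 = max(range(n), key=lambda i: d(S[start], S[i]))
    if num_foci == 1:
        return [S[f1]]
    # Pass 2: the object farthest from the first focus; their distance is the edge.
    dist_f1 = [d(S[f1], s) for s in S]
    f2 = max(range(n), key=dist_f1.__getitem__)
    edge = dist_f1[f2]
    foci = [f1, f2]
    # Running error per object: sum over chosen foci of |edge - d(focus, object)|.
    error = [abs(edge - dist_f1[i]) for i in range(n)]
    newest = f2
    while len(foci) < num_foci:
        # One pass of N distance calculations for the most recently added focus.
        for i in range(n):
            error[i] += abs(edge - d(S[newest], S[i]))
        newest = min((i for i in range(n) if i not in foci), key=error.__getitem__)
        foci.append(newest)
    return [S[i] for i in foci]
```

Keeping a running error per object means every pass costs N distance calculations, which gives l*N in total for l ≥ 2, in line with the figure quoted on the slide.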
Omni-sequential • Calculate and store the Omni-coordinates Ci of every object • Before each distance calculation, test every fk ∈ F: if |dfk(si) – dfk(sq)| > rq, skip the distance calculation for si.
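A minimal Python sketch of this filtered sequential scan follows. It assumes the Omni-coordinates were precomputed as one list of distances per object, in the order of the foci; the function name and the returned counter of distance calculations are additions for illustration.

```python
import math


def omni_sequential_range(S, coords, foci, sq, rq, d=math.dist):
    """Range query over S that prunes objects using their Omni-coordinates.

    coords[i] holds the precomputed distances from S[i] to each focus.
    Returns (matching objects, number of object-to-query distances computed).
    """
    q_coords = [d(f, sq) for f in foci]  # Omni-coordinates of the query object
    result, computed = [], 0
    for s, c in zip(S, coords):
        # If any focus separates s from sq by more than rq, s cannot qualify.
        if any(abs(ci - qi) > rq for ci, qi in zip(c, q_coords)):
            continue
        computed += 1
        if d(s, sq) <= rq:
            result.append(s)
    return result, computed
```

The scan still visits every object, but only the survivors of the filter pay for a full distance calculation, plus one calculation per focus for the query itself.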
OmniB+-tree • Store the Ci in l B+-trees, one for each focus • For each focus fk, a subset Ik ⊂ S is retrieved from the corresponding B+-tree; together the subsets generate the mbOr. • Ik contains the objects whose distance to fk lies between dfk(sq) – rq and dfk(sq) + rq • Calculate the distance from sq to each object in the intersection of the Ik.
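The range-query logic can be sketched in Python without a real B+-tree. In the sketch below, each B+-tree is replaced by a sorted list searched with `bisect`, which is a stand-in assumed for illustration and not the disk-based structure the slide refers to. The class name is hypothetical.

```python
import bisect
import math


class OmniSortedIndex:
    """One sorted (distance, object id) list per focus, standing in for B+-trees."""

    def __init__(self, S, foci, d=math.dist):
        self.S, self.foci, self.d = list(S), list(foci), d
        self.trees = [
            sorted((d(f, s), i) for i, s in enumerate(self.S)) for f in self.foci
        ]

    def range_query(self, sq, rq):
        candidates = set(range(len(self.S)))
        for f, tree in zip(self.foci, self.trees):
            dq = self.d(f, sq)
            # Ids of objects whose distance to this focus lies in [dq - rq, dq + rq].
            lo = bisect.bisect_left(tree, (dq - rq, -1))
            hi = bisect.bisect_right(tree, (dq + rq, len(self.S)))
            candidates &= {i for _, i in tree[lo:hi]}
        # Only objects in the intersection need a real distance calculation.
        return [self.S[i] for i in sorted(candidates) if self.d(self.S[i], sq) <= rq]
```

Each per-focus lookup is a pair of binary searches, and the set intersection plays the role of the mbOr test.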
OmniR-tree • The algorithms for insertion, node partitioning and range queries are the same as in the underlying R-tree. • k-NN queries require the NN algorithm used in metric trees. A depth-first search is performed first to find k candidates. The search then keeps reducing the query radius whenever the furthest neighbor is replaced, until every entry that overlaps the query radius has been tested.
OmniR-tree • Requires an R-tree to store the Ci • Requires a paged direct-access file to store the objects of the dataset. • When a leaf of the R-tree is retrieved and the Ci stored in that node qualify an object, the actual distance to that object is calculated.
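The shrinking-radius idea behind the k-NN search can be illustrated without an R-tree. The Python sketch below swaps the tree traversal for a flat scan ordered by the Omni-coordinate lower bound, so it is a simplified substitute and not the OmniR-tree algorithm itself. The function name is hypothetical, and at least one focus is assumed.

```python
import heapq
import math


def omni_knn(S, coords, foci, sq, k, d=math.dist):
    """k nearest neighbors of sq, pruning with Omni-coordinate lower bounds.

    coords[i] holds the precomputed distances from S[i] to each focus.
    Returns (neighbors sorted by distance, number of distances computed).
    """
    q = [d(f, sq) for f in foci]
    # max_k |d(f_k, s) - d(f_k, sq)| is a lower bound on d(s, sq).
    bounds = [max(abs(ci - qi) for ci, qi in zip(c, q)) for c in coords]
    heap, computed = [], 0  # max-heap of the current k best, as (-distance, id)
    for i in sorted(range(len(S)), key=bounds.__getitem__):
        # The current radius is the distance of the furthest of the k candidates.
        if len(heap) == k and bounds[i] > -heap[0][0]:
            break  # every remaining object has an even larger lower bound
        dist = d(S[i], sq)
        computed += 1
        if len(heap) < k:
            heapq.heappush(heap, (-dist, i))
        elif dist < -heap[0][0]:
            heapq.heapreplace(heap, (-dist, i))  # radius shrinks here
    ordered = sorted((-neg, i) for neg, i in heap)
    return [S[i] for _, i in ordered], computed
```

As in the slide's description, the radius only shrinks when the furthest of the current k candidates is replaced, and the search stops once no untested entry can fall inside that radius.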
The graphs show that the intrinsic dimensionality of the data is a good reference for the number of foci.
Review • Reduce the number of distance calculations • The Omni-Family • Select a set of foci • Gauge all other objects with their distance from this set • The foci increase the pruning of distance calculations • Scalable