Optimal Dimension Order: A Generic Technique for the Similarity Join
Christian Böhm (1), Florian Krebs (2), and Hans-Peter Kriegel (2)
(1) University for Health Informatics and Technology, Innsbruck
(2) University of Munich
Feature Based Similarity
Simple Similarity Queries • Specify a query object and • find similar objects (range query), or • find the k most similar objects (nearest neighbor query)
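As a rough illustration of the two query types, the sketch below answers both with a linear scan over a small in-memory list of feature vectors. The function names `range_query` and `knn_query`, the Euclidean metric, and the sample data are assumptions made for this example; the slides do not prescribe an implementation.

```python
import heapq
import math

def range_query(db, q, eps):
    """Return all objects whose Euclidean distance to q is at most eps."""
    return [o for o in db if math.dist(o, q) <= eps]

def knn_query(db, q, k):
    """Return the k objects closest to q, nearest first."""
    return heapq.nsmallest(k, db, key=lambda o: math.dist(o, q))

# Hypothetical 2-dimensional feature vectors.
db = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (5.0, 0.0)]
q = (0.9, 0.0)

in_range = range_query(db, q, eps=1.0)  # the two vectors within distance 1.0 of q
nearest = knn_query(db, q, k=2)         # the same two vectors, closest first
```

Both functions inspect every object once; index structures exist to avoid exactly this full scan.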
Join Applications: Catalogue Matching • Catalogue matching (join of R and S) • E.g. astronomy catalogues
Join Applications: Clustering • Clustering (e.g. DBSCAN) • Similarity self-join
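To make the link between clustering and the self-join concrete, here is a minimal sketch of how the result of a similarity self-join could be used to find DBSCAN core points, i.e. points with at least `min_pts` objects in their ε-neighborhood (the point itself included). The brute-force join, the function names, and the sample data are assumptions for illustration only.

```python
import math

def similarity_self_join(points, eps):
    """Return index pairs (i, j) with i < j whose distance is at most eps."""
    return [(i, j)
            for i in range(len(points))
            for j in range(i + 1, len(points))
            if math.dist(points[i], points[j]) <= eps]

def core_points(points, eps, min_pts):
    """Return indices of points with at least min_pts neighbors within eps."""
    counts = [1] * len(points)  # every point lies in its own neighborhood
    for i, j in similarity_self_join(points, eps):
        counts[i] += 1
        counts[j] += 1
    return [i for i, c in enumerate(counts) if c >= min_pts]

# Hypothetical data: three points on a line plus one far-away outlier.
points = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (9.0, 9.0)]
cores = core_points(points, eps=0.6, min_pts=3)  # only the middle point qualifies
```

The join is computed once and all neighborhood sizes are derived from it, instead of issuing one range query per point.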
R-Tree Similarity Join • Depth-first traversal of two trees, one for R and one for S [Brinkhoff, Kriegel, Seeger: Efficient Processing of Spatial Joins Using R-trees, SIGMOD Conf. 1993]
The ε-kdB-Tree [Shim, Srikant, Agrawal: High-dimensional Similarity Joins, ICDE 1997] • Assumption: two adjacent ε-stripes fit in main memory • Unrealistic for large data sets that are clustered, skewed, and high-dimensional
Epsilon Grid Order [Böhm, Braunmüller, Krebs, Kriegel: Epsilon Grid Order. SIGMOD Conf. 2001]
Common Properties • Decomposition of data/space into regions • Regions described by hyper-rectangles
    for each pair (P,Q) of partitions having dist(P,Q) ≤ ε
        for each pair of points (p,q) on (P,Q)
            test dist(p,q) ≤ ε ;
• Most CPU effort goes into the distance test between vectors ⇒ Idea: speed up the distance test
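The common scheme above can be sketched as follows. This is a simplified illustration, assuming that each partition is a plain list of points, that its hyper-rectangle is the minimum bounding box of those points, and that distances are Euclidean; the helper names `bounding_box`, `box_mindist`, and `similarity_join` are made up for this example.

```python
import math

def bounding_box(points):
    """Per-dimension (low, high) intervals enclosing all points."""
    return [(min(c), max(c)) for c in zip(*points)]

def box_mindist(box_a, box_b):
    """Smallest possible Euclidean distance between two hyper-rectangles."""
    total = 0.0
    for (a_lo, a_hi), (b_lo, b_hi) in zip(box_a, box_b):
        gap = max(b_lo - a_hi, a_lo - b_hi, 0.0)  # 0 if the intervals overlap
        total += gap * gap
    return math.sqrt(total)

def similarity_join(parts_r, parts_s, eps):
    """Report all point pairs (p, q) with dist(p, q) <= eps."""
    result = []
    for P in parts_r:
        for Q in parts_s:
            if box_mindist(bounding_box(P), bounding_box(Q)) > eps:
                continue  # no pair of this partition pair can qualify
            for p in P:
                for q in Q:
                    if math.dist(p, q) <= eps:  # the CPU-intensive step
                        result.append((p, q))
    return result

# Hypothetical partitions of two small data sets R and S.
parts_r = [[(0.0, 0.0), (1.0, 0.0)], [(10.0, 10.0)]]
parts_s = [[(0.5, 0.0)], [(10.0, 10.2)]]
pairs = similarity_join(parts_r, parts_s, eps=0.6)
```

The partition-level test prunes distant regions cheaply, so most of the remaining time is spent in the innermost point-to-point test.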
Related Work: Plane Sweep for Polygons [Shin, Moon, Lee: Adaptive Multi-Stage Distance Join Processing, SIGMOD Conf. 2000] • Observations: • It is more efficient to use the x-axis as sweep direction • Projection of the polygons onto the y-axis yields high overlap • Decide by the projections of the bounding boxes (integrate a pdf)
Feature Vectors in the Similarity Join • Distance computation between feature vectors p, q:
    dist2 = 0 ;
    for (i = 0 ; i < d ; i++) {
        dist2 = dist2 + (p[i] - q[i]) * (p[i] - q[i]) ;   /* squared difference in dimension i */
        if (dist2 > eps * eps) break ;                    /* early termination: pair cannot mate */
    }
• Order dimensions by increasing mating probability
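The effect of the dimension order on the early-terminating distance test can be seen in a small sketch. The function `within_eps`, its optional `order` argument, and the returned counter of evaluated dimensions are assumptions added for illustration.

```python
def within_eps(p, q, eps, order=None):
    """Early-terminating test of dist(p, q) <= eps.

    Returns (result, number of dimensions evaluated).
    """
    if order is None:
        order = range(len(p))  # natural order d0, d1, ...
    eps2 = eps * eps
    dist2 = 0.0
    evaluated = 0
    for i in order:
        diff = p[i] - q[i]
        dist2 += diff * diff
        evaluated += 1
        if dist2 > eps2:
            return False, evaluated  # remaining dimensions are skipped
    return True, evaluated

# Hypothetical pair that differs only in the last dimension.
p = (0.0, 0.0, 0.0, 5.0)
q = (0.0, 0.0, 0.0, 0.0)

natural = within_eps(p, q, eps=1.0)                        # needs all 4 dimensions
reordered = within_eps(p, q, eps=1.0, order=[3, 0, 1, 2])  # stops after 1 dimension
```

Testing the most selective dimension first excludes the pair after a single term, which is what ordering by increasing mating probability aims for.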
Computation of the Mating Probability • To determine the mating probability for dimension di: • Project the bounding boxes onto the di-axis • Consider the two projections as the axes d0[P] and d0[Q] of a 2-dimensional space • The d0-projection of each point pair is located in this event space • Mating point pairs lie on the ε-stripe: y ≤ x + ε and y ≥ x − ε • [Figure: event space intersected with the ε-stripe, giving the mating probability for d0]
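A possible way to turn this construction into a number is sketched below. It assumes that points are uniformly distributed within each projection, so that the mating probability equals the area of the event rectangle covered by the ε-stripe divided by the total area of the rectangle. Instead of a case distinction over the shapes of the intersection, this sketch uses an inclusion-exclusion over triangle areas; all function names are hypothetical.

```python
def _corner_area(x0, y0, t):
    """Area of {(x, y): x <= x0, y >= y0, y <= x + t}, a right triangle."""
    leg = max(x0 + t - y0, 0.0)
    return 0.5 * leg * leg

def _area_below(lo_p, hi_p, lo_q, hi_q, t):
    """Area of the part of the event rectangle that satisfies y <= x + t."""
    return (_corner_area(hi_p, lo_q, t) - _corner_area(lo_p, lo_q, t)
            - _corner_area(hi_p, hi_q, t) + _corner_area(lo_p, hi_q, t))

def mating_probability(lo_p, hi_p, lo_q, hi_q, eps):
    """Fraction of the event rectangle [lo_p, hi_p] x [lo_q, hi_q] on the eps-stripe."""
    if hi_p <= lo_p or hi_q <= lo_q:
        raise ValueError("projections must have positive length")
    # Stripe = points with y <= x + eps, minus points with y <= x - eps.
    stripe = (_area_below(lo_p, hi_p, lo_q, hi_q, eps)
              - _area_below(lo_p, hi_p, lo_q, hi_q, -eps))
    return stripe / ((hi_p - lo_p) * (hi_q - lo_q))

# Hypothetical projections: identical unit intervals, eps = 0.5.
prob = mating_probability(0.0, 1.0, 0.0, 1.0, eps=0.5)  # 0.75
```

In the example, the two corner triangles outside the stripe each have area 0.125, which leaves 0.75 of the unit square on the stripe.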
Optimal Dimension Order • For a given pair (P,Q) of partitions, the optimal dimension order (ODO) is the sequence of dimensions with increasing mating probability • Algorithm:
    for each pair (P,Q) of partitions having dist(P,Q) ≤ ε
        determine ODO ;
        for each pair of points (p,q) on (P,Q)
            test dist(p,q) ≤ ε using ODO ;
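For a single partition pair, the algorithm can be sketched as follows. The sketch assumes uniformly distributed points inside non-degenerate bounding boxes, so that the mating probability of a dimension is the fraction of the event rectangle covered by the ε-stripe; the function names and the sample boxes are illustrative assumptions.

```python
def mating_probability(lo_p, hi_p, lo_q, hi_q, eps):
    """Stripe area / rectangle area, assuming uniform data in both intervals."""
    def tri(x0, y0, t):  # area of {x <= x0, y >= y0, y <= x + t}
        leg = max(x0 + t - y0, 0.0)
        return 0.5 * leg * leg
    def below(t):        # rectangle area with y <= x + t
        return (tri(hi_p, lo_q, t) - tri(lo_p, lo_q, t)
                - tri(hi_p, hi_q, t) + tri(lo_p, hi_q, t))
    return (below(eps) - below(-eps)) / ((hi_p - lo_p) * (hi_q - lo_q))

def optimal_dimension_order(box_p, box_q, eps):
    """Dimensions sorted by increasing mating probability."""
    probs = [mating_probability(lp, hp, lq, hq, eps)
             for (lp, hp), (lq, hq) in zip(box_p, box_q)]
    return sorted(range(len(probs)), key=lambda i: probs[i])

def within_eps(p, q, eps, order):
    """Early-terminating distance test that visits dimensions in the given order."""
    dist2 = 0.0
    for i in order:
        diff = p[i] - q[i]
        dist2 += diff * diff
        if dist2 > eps * eps:
            return False
    return True

def join_partition_pair(P, Q, box_p, box_q, eps):
    order = optimal_dimension_order(box_p, box_q, eps)  # determined once per (P, Q)
    return [(p, q) for p in P for q in Q if within_eps(p, q, eps, order)]

# Hypothetical partitions: the boxes overlap fully in d0 and barely mate in d1.
box_p = [(0.0, 1.0), (0.0, 1.0)]
box_q = [(0.0, 1.0), (1.4, 2.4)]
P = [(0.2, 0.95), (0.8, 0.1)]
Q = [(0.3, 1.4), (0.9, 2.4)]

order = optimal_dimension_order(box_p, box_q, eps=0.5)  # d1 is tested first
pairs = join_partition_pair(P, Q, box_p, box_q, eps=0.5)
```

The order is computed once per partition pair and reused for every point pair, so its cost is amortized over all distance tests of that pair.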
Shape of the Intersection Area • 20 different shapes are possible, e.g. 1223, 2233, 2223 • Easy proof of completeness and efficient case distinction by assigning codes to the corners • 1: Corner is left of or above the ε-stripe • 2: Corner is on the ε-stripe • 3: Corner is right of or below the ε-stripe • Easy formulas (only 45° and 90° angles)
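The corner coding can be illustrated with a short sketch. It assumes that the projection of P is the horizontal axis and that of Q the vertical axis, and it lists the corners in the order upper-left, upper-right, lower-left, lower-right; the slides do not state the corner order, so this ordering and the function names are assumptions.

```python
def corner_code(x, y, eps):
    """1: left of/above the eps-stripe, 2: on the stripe, 3: right of/below it."""
    if y > x + eps:
        return 1
    if y < x - eps:
        return 3
    return 2

def shape_code(lo_p, hi_p, lo_q, hi_q, eps):
    """Four-digit code of the event rectangle [lo_p, hi_p] x [lo_q, hi_q]."""
    corners = [(lo_p, hi_q),  # upper-left
               (hi_p, hi_q),  # upper-right
               (lo_p, lo_q),  # lower-left
               (hi_p, lo_q)]  # lower-right
    return "".join(str(corner_code(x, y, eps)) for x, y in corners)

# Hypothetical projections: both partitions cover [0, 2], eps = 1.5.
code = shape_code(0.0, 2.0, 0.0, 2.0, eps=1.5)  # "1223"
```

Because the upper-left corner always has the largest and the lower-right corner the smallest value of y − x, the codes can only grow from the first to the last corner, which limits the number of distinct codes to the 20 shapes mentioned above.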
Experimental Evaluation: R-tree Sim. Join • 8-dimensional data, uniformly distributed
Experimental Evaluation: R-tree Sim. Join • 16-dimensional data, from CAD-similarity search
Experimental Evaluation: Scalability • MuX, uniform data • Z-RSJ, uniform data
Experimental Evaluation: Scalability • EGO, CAD data
Conclusion • Similarity join is an important database primitive for knowledge discovery in databases • Many different basic algorithms exist • Most can be accelerated by our optimal dimension order • Future Work: • New applications of the similarity join • Further (multi-parameter) optimization of the similarity join • Parallel and distributed environments