
Feature Based Similarity

Christian Böhm (1), Florian Krebs (2), and Hans-Peter Kriegel (2). (1) University for Health Informatics and Technology, Innsbruck; (2) University of Munich. Optimal Dimension Order: A Generic Technique for the Similarity Join.


Presentation Transcript


  1. Christian Böhm (1), Florian Krebs (2), and Hans-Peter Kriegel (2); (1) University for Health Informatics and Technology, Innsbruck, (2) University of Munich • Optimal Dimension Order: A Generic Technique for the Similarity Join

  2. Feature Based Similarity

  3. Simple Similarity Queries • Specify a query object and • find similar objects (range query), or • find the k most similar objects (nearest neighbor query)
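
The two query types can be sketched in a few lines. This is a minimal illustration only, assuming feature vectors are tuples of floats compared by Euclidean distance and scanned linearly; the names `euclidean`, `range_query`, and `knn_query` are hypothetical and do not come from the slides.

```python
import heapq
import math

def euclidean(p, q):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def range_query(data, query, eps):
    """Return all objects whose distance to the query is at most eps."""
    return [p for p in data if euclidean(p, query) <= eps]

def knn_query(data, query, k):
    """Return the k objects closest to the query, nearest first."""
    return heapq.nsmallest(k, data, key=lambda p: euclidean(p, query))

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
near = range_query(points, (0.0, 0.0), 1.0)
closest = knn_query(points, (0.0, 0.0), 3)
```

A real system would answer both queries through an index structure rather than a linear scan.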

  4. Join Applications: Catalogue Matching • Catalogue matching between two data sets R and S • E.g. astronomy catalogues
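
As a baseline for what the similarity join computes, the following sketch pairs every object of R with every object of S and keeps the pairs within distance ε. It is a naive nested-loop version under the assumption of Euclidean distance on coordinate tuples; the function name `similarity_join` and the sample catalogue positions are made up for illustration.

```python
def similarity_join(R, S, eps):
    """Return all pairs (r, s) from R x S with Euclidean distance <= eps."""
    eps2 = eps * eps  # compare squared distances to avoid the square root
    return [
        (r, s)
        for r in R
        for s in S
        if sum((a - b) ** 2 for a, b in zip(r, s)) <= eps2
    ]

# Two hypothetical catalogues holding 2-d object positions.
catalogue_r = [(10.0, 20.0), (30.0, 40.0)]
catalogue_s = [(10.1, 20.0), (50.0, 50.0)]
matches = similarity_join(catalogue_r, catalogue_s, 0.2)
```

The algorithms on the following slides compute the same result while avoiding most of the |R| · |S| distance tests.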

  5. Join Applications: Clustering • Clustering (e.g. DBSCAN) • Similarity self-join
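
To show why a density-based clustering method can be driven by a similarity self-join, the sketch below derives the DBSCAN core points from the join result. It assumes the usual DBSCAN definition (a core point has at least `min_pts` objects, itself included, within distance ε); the helper names are hypothetical and the join is the naive quadratic version.

```python
def self_join(data, eps):
    """Return index pairs (i, j), i < j, of points within distance eps."""
    eps2 = eps * eps
    pairs = []
    for i in range(len(data)):
        for j in range(i + 1, len(data)):
            if sum((a - b) ** 2 for a, b in zip(data[i], data[j])) <= eps2:
                pairs.append((i, j))
    return pairs

def core_points(data, eps, min_pts):
    """Indices of points having at least min_pts neighbors (self included)."""
    counts = [1] * len(data)  # every point lies in its own neighborhood
    for i, j in self_join(data, eps):
        counts[i] += 1
        counts[j] += 1
    return [i for i, c in enumerate(counts) if c >= min_pts]

data = [(0, 0), (0, 1), (1, 0), (5, 5)]
cores = core_points(data, 1.0, 3)
```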

  6. R-Tree Similarity Join • Depth-first traversal of two trees, one for R and one for S [Brinkhoff, Kriegel, Seeger: Efficient Processing of Spatial Joins Using R-trees, SIGMOD Conf. 1993]

  7. The ε-kdB-Tree [Shim, Srikant, Agrawal: High-dimensional Similarity Joins, ICDE 1997] • Assumption: two adjacent ε-stripes fit in main memory • Unrealistic for large data sets which are • clustered, • skewed and • high-dimensional

  8. Epsilon Grid Order [Böhm, Braunmüller, Krebs, Kriegel: Epsilon Grid Order. SIGMOD Conf. 2001]

  9. Common Properties • Decomposition of data/space into regions • Regions described by hyper-rectangles • for each pair (P,Q) of partitions having dist(P,Q) ≤ ε: for each pair of points (p,q) from (P,Q): test dist(p,q) ≤ ε • Most CPU effort is spent in the distance test between vectors ⇒ Idea: speed up the distance test
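
The common scheme can be sketched as follows. This is an illustration under stated assumptions, not any of the cited algorithms: partitions are given as plain lists of points, each region is the minimum bounding rectangle of its partition, and partition pairs are enumerated naively. The names `mbr`, `box_dist2`, and `partition_join` are hypothetical.

```python
def mbr(points):
    """Minimum bounding hyper-rectangle (lower corner, upper corner)."""
    dims = range(len(points[0]))
    lo = tuple(min(p[i] for p in points) for i in dims)
    hi = tuple(max(p[i] for p in points) for i in dims)
    return lo, hi

def box_dist2(box_a, box_b):
    """Squared minimum distance between two hyper-rectangles."""
    (alo, ahi), (blo, bhi) = box_a, box_b
    total = 0.0
    for i in range(len(alo)):
        gap = max(blo[i] - ahi[i], alo[i] - bhi[i], 0.0)
        total += gap * gap
    return total

def partition_join(parts_r, parts_s, eps):
    """Join two partitioned point sets, skipping partition pairs that are too far apart."""
    eps2 = eps * eps
    result = []
    for P in parts_r:
        box_p = mbr(P)
        for Q in parts_s:
            if box_dist2(box_p, mbr(Q)) > eps2:
                continue  # no point pair of (P, Q) can be within eps
            for p in P:
                for q in Q:
                    if sum((a - b) ** 2 for a, b in zip(p, q)) <= eps2:
                        result.append((p, q))
    return result

parts_r = [[(0.0, 0.0), (1.0, 1.0)], [(10.0, 10.0)]]
parts_s = [[(0.5, 0.0)], [(10.0, 10.5), (20.0, 20.0)]]
joined = partition_join(parts_r, parts_s, 0.6)
```

The innermost distance test is the part that the dimension ordering on the later slides targets.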

  10. Related Work: Plane Sweep for Polygons [Shin, Moon, Lee: Adaptive Multi-Stage Distance Join Processing, SIGMOD Conf. 2000] • Observations: • More efficient to use the x-axis as sweep direction • Projection of the polygons onto the y-axis yields high overlap • Decide by projections of the bounding boxes (integrate a pdf)

  11. Feature Vectors in the Similarity Join • Distance computation between feature vectors p, q (eps denotes ε): for (i = 0; i < d; i++) { dist2 = dist2 + (p[i] - q[i]) * (p[i] - q[i]); if (dist2 > eps * eps) break; } • Order dimensions by mating probability (increasing)
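
The effect of the dimension order on the early-terminating distance test can be made visible with a small sketch. The function below is a hypothetical helper (`within_eps` is not a name from the slides) that additionally reports how many dimensions were inspected before the test stopped.

```python
def within_eps(p, q, eps, order=None):
    """Test dist(p, q) <= eps with early termination.

    Returns (result, number of dimensions inspected). `order` is the
    sequence in which dimensions are visited; default is 0, 1, ..., d-1.
    """
    eps2 = eps * eps
    dist2 = 0.0
    tested = 0
    dims = order if order is not None else range(len(p))
    for i in dims:
        tested += 1
        diff = p[i] - q[i]
        dist2 += diff * diff
        if dist2 > eps2:
            return False, tested  # partial sum already exceeds eps^2
    return True, tested

p = (0.0, 0.0, 5.0)
q = (0.0, 0.1, 0.0)
natural = within_eps(p, q, 1.0)             # distinguishing dimension comes last
reordered = within_eps(p, q, 1.0, [2, 0, 1])  # distinguishing dimension first
```

Visiting the dimension in which the two vectors differ most first excludes the pair after one step instead of three, which is the motivation for ordering dimensions by increasing mating probability.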

  12. Computation of the Mating Probability • To determine the mating probability for dimension d_i: • Project the bounding boxes onto the d_i-axis

  13. Computation of the Mating Probability • To determine the mating probability for dimension d_i: • Project the bounding boxes onto the d_i-axis • Consider the two projections in a 2-dimensional space

  14. Computation of the Mating Probability • To determine the mating probability for dimension d_i: • Project the bounding boxes onto the d_i-axis • Consider the two projections in a 2-dimensional space (figure: event space with axes d0[P] and d0[Q]; the d0-projection of each point pair is located in this event space)

  15. Computation of the Mating Probability • To determine the mating probability for dimension d_i: • Project the bounding boxes onto the d_i-axis • Consider the two projections in a 2-dimensional space (figure: event space with axes d0[P] and d0[Q]; mating point pairs lie on the ε-stripe y ≤ x + ε, y ≥ x − ε)

  16. Computation of the Mating Probability • To determine the mating probability for dimension d_i: • Project the bounding boxes onto the d_i-axis • Consider the two projections in a 2-dimensional space (figure: the intersection of the event space with the ε-stripe is labeled as the mating probability for d0)
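
One way to turn this construction into a number is sketched below. It assumes that the projected points are uniformly distributed inside the two projection intervals [a, b] (for P) and [c, d] (for Q), so that the mating probability is the fraction of the event space rectangle covered by the ε-stripe. That uniformity assumption, the closed-form integration, and all names are choices made for this sketch rather than details given on the slides.

```python
def _clamped_integral(u, w):
    """Antiderivative of clamp(u, 0, w) that is zero for u <= 0."""
    if u <= 0.0:
        return 0.0
    if u <= w:
        return 0.5 * u * u
    return 0.5 * w * w + w * (u - w)

def area_below(a, b, c, d, t):
    """Area of {(x, y) in [a, b] x [c, d] : y <= x + t}."""
    w = d - c
    return _clamped_integral(b + t - c, w) - _clamped_integral(a + t - c, w)

def mating_probability(a, b, c, d, eps):
    """Fraction of the event space [a, b] x [c, d] with |x - y| <= eps.

    Requires intervals of positive length (degenerate boxes are not handled).
    """
    if b <= a or d <= c:
        raise ValueError("projection intervals must have positive length")
    stripe = area_below(a, b, c, d, eps) - area_below(a, b, c, d, -eps)
    return stripe / ((b - a) * (d - c))

prob = mating_probability(0.0, 1.0, 0.0, 1.0, 0.5)
```

For two identical unit intervals and ε = 0.5 the stripe leaves out two corner triangles of area 0.125 each, so the sketch yields 0.75.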

  17. Optimal Dimension Order • For a given pair (P,Q) of partitions, the optimal dimension order (ODO) is the sequence of dimensions with increasing mating probability • Algorithm: for each pair (P,Q) of partitions having dist(P,Q) ≤ ε: determine ODO; for each pair of points (p,q) from (P,Q): test dist(p,q) ≤ ε using ODO
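
The inner part of this algorithm, for a single partition pair, might look as follows. The sketch assumes the per-dimension mating probabilities have already been computed and are passed in as a list; `optimal_dimension_order` and `join_pair` are hypothetical names.

```python
def optimal_dimension_order(probs):
    """Dimension indices sorted by increasing mating probability."""
    return sorted(range(len(probs)), key=lambda i: probs[i])

def join_pair(P, Q, eps, order):
    """Join the points of one partition pair, testing dimensions in `order`."""
    eps2 = eps * eps
    result = []
    for p in P:
        for q in Q:
            dist2 = 0.0
            for i in order:
                diff = p[i] - q[i]
                dist2 += diff * diff
                if dist2 > eps2:
                    break  # pair excluded early
            else:
                result.append((p, q))  # loop finished: dist(p, q) <= eps
    return result

# Hypothetical mating probabilities for a 3-dimensional partition pair.
order = optimal_dimension_order([0.9, 0.1, 0.5])
P = [(0.0, 0.0, 0.0), (0.0, 5.0, 0.0)]
Q = [(0.1, 0.2, 0.0)]
pairs = join_pair(P, Q, 0.5, order)
```

The order is determined once per partition pair and reused for every point pair, so its cost is amortized over all distance tests of that pair.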

  18. Shape of the Intersection Area • 20 different shapes are possible, e.g. 1223, 2233, 2223 • Easy proof of completeness and efficient case distinction by assigning codes to the corners • 1: Corner is left of or above the ε-stripe • 2: Corner is on the ε-stripe • 3: Corner is right of or below the ε-stripe • Easy formulas (only 45° and 90° angles)
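
The corner coding can be illustrated with a short sketch. The corner order used here (upper left, lower left, upper right, lower right) is an assumption, since the transcript does not say in which order the four digits are written. Because y − x is largest at the upper-left corner and smallest at the lower-right corner, the codes must be ordered accordingly, and counting the code tuples that satisfy this ordering gives 20, which agrees with the number on the slide.

```python
from itertools import product

def corner_code(x, y, eps):
    """1: left of/above the eps-stripe, 2: on the stripe, 3: right of/below it."""
    if y > x + eps:
        return 1
    if y < x - eps:
        return 3
    return 2

def shape_code(a, b, c, d, eps):
    """Code of the event space [a, b] x [c, d]; corner order is assumed."""
    corners = [(a, d), (a, c), (b, d), (b, c)]  # UL, LL, UR, LR
    return "".join(str(corner_code(x, y, eps)) for x, y in corners)

def feasible_codes():
    """All codes with code(UL) <= code(LL), code(UR) <= code(LR)."""
    return sorted(
        "".join(map(str, t))
        for t in product((1, 2, 3), repeat=4)
        if t[0] <= min(t[1], t[2]) and max(t[1], t[2]) <= t[3]
    )

example = shape_code(0.0, 1.0, 0.0, 1.0, 0.5)
codes = feasible_codes()
```

A case distinction over these codes could then select the area formula for each shape.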

  19. Experimental Evaluation: R-tree Sim. Join • 8-dimensional data, uniformly distributed

  20. Experimental Evaluation: R-tree Sim. Join • 16-dimensional data, from CAD-similarity search

  21. Experimental Evaluation: Scalability • MuX, uniform data • Z-RSJ, uniform data

  22. Experimental Evaluation: Scalability EGO, CAD data

  23. Conclusion • Conclusion: • The similarity join is an important database primitive for knowledge discovery in databases • Many different basic algorithms exist • Most of them can be accelerated by our optimal dimension order • Future Work: • New applications of the similarity join • Further optimization (multi-parameter) of the similarity join • Parallel and distributed environments
