Feature Based Similarity

Christian Böhm, Bernhard Braunmüller, Florian Krebs, and Hans-Peter Kriegel, University of Munich. Epsilon Grid Order: An Algorithm for the Similarity Join on Massive High-Dimensional Data.

Presentation Transcript


  1. Christian Böhm, Bernhard Braunmüller, Florian Krebs, and Hans-Peter Kriegel, University of Munich. Epsilon Grid Order: An Algorithm for the Similarity Join on Massive High-Dimensional Data

  2. Feature Based Similarity

  3. Simple Similarity Queries • Specify query object and • Find similar objects – range query • Find the k most similar objects – nearest neighbor query
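
The two query types on this slide can be sketched as brute-force scans. This is only an illustration: the Euclidean distance, the function names, and the sample points are assumptions, not taken from the presentation.

```python
import heapq
import math

def range_query(points, q, eps):
    """Return all points whose Euclidean distance to q is at most eps."""
    return [p for p in points if math.dist(p, q) <= eps]

def knn_query(points, q, k):
    """Return the k points closest to q, nearest first."""
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, q))

# Hypothetical 2-d feature vectors.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
in_range = range_query(points, (0.0, 0.0), 1.0)
nearest = knn_query(points, (0.0, 0.0), 2)
```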

  4. Join Applications: Catalogue Matching • Catalogue matching of two sets R and S • E.g. astronomical catalogues

  5. Join Applications: Clustering • Clustering (e.g. DBSCAN) • Similarity self-join
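
As a baseline for what a similarity self-join computes, here is a minimal nested-loop sketch: it reports every pair of distinct points within distance eps of each other. The function name and the choice of Euclidean distance are assumptions. This is the quadratic reference result, not the algorithm of the presentation.

```python
import math
from itertools import combinations

def similarity_self_join(points, eps):
    """Return index pairs (i, j), i < j, of points at distance <= eps."""
    return [
        (i, j)
        for (i, p), (j, q) in combinations(enumerate(points), 2)
        if math.dist(p, q) <= eps
    ]

# Hypothetical data: the first two points are close, the third is far away.
pairs = similarity_self_join([(0.0, 0.0), (0.5, 0.0), (2.0, 2.0)], 1.0)
```

A density-based clustering method such as DBSCAN needs exactly this kind of neighborhood information for every point, which is why the slide lists it as a join application.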

  6. Grid partitioning • General idea: Grid approximation where grid line distance = ε • Similar idea in the ε-kdB-tree [Shim, Srikant, Agrawal: High-dimensional Similarity Joins, ICDE 1997] • Disadvantage of any grid approach: Number of neighboring grid cells: 3^d − 1
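
The neighbor count on this slide can be checked with a small sketch. It assumes a point is mapped to its grid cell by dividing each coordinate by the grid line distance (called eps here) and rounding down. The helper names are hypothetical.

```python
import math
from itertools import product

def grid_cell(p, eps):
    """Map a point to the integer coordinates of its grid cell."""
    return tuple(math.floor(x / eps) for x in p)

def neighbor_cells(cell):
    """Enumerate all cells adjacent to `cell`, diagonals included."""
    return [
        tuple(c + o for c, o in zip(cell, offsets))
        for offsets in product((-1, 0, 1), repeat=len(cell))
        if any(offsets)  # skip the all-zero offset, i.e. the cell itself
    ]

cell = grid_cell((0.25, 1.7), 0.5)
count_3d = len(neighbor_cells((0, 0, 0)))
# Closed form for higher dimensions, where enumeration is impractical.
count_16d = 3 ** 16 - 1
```

Already at 16 dimensions each cell has more than 43 million neighbors, which is the disadvantage the slide points out.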

  7. Scalability of the ε-kdB-tree • Assumption: 2 adjacent ε-stripes fit in main memory • Unrealistic for large data sets that are clustered, skewed, and high-dimensional

  8. Epsilon Grid Order

  9. ε-Grid-Order Is a Total Strict Order • Strict Order: • Irreflexivity • Transitivity • Asymmetry • ε-grid-order can be used in any sorting algorithm
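
The transcript does not preserve the definition of the order itself. The sketch below assumes that points are compared lexicographically by their grid cell coordinates floor(x_i / eps), with the first dimension most significant. The names `ego_less` and `ego_sort` are hypothetical.

```python
import math

def ego_less(p, q, eps):
    """True if p precedes q: the first differing cell coordinate decides."""
    for a, b in zip(p, q):
        ca, cb = math.floor(a / eps), math.floor(b / eps)
        if ca != cb:
            return ca < cb
    return False  # same grid cell: neither point precedes the other

def ego_sort(points, eps):
    """Sort points by their grid cell; Python's sort keeps ties stable."""
    return sorted(points, key=lambda p: tuple(math.floor(x / eps) for x in p))

ordered = ego_sort([(0.6, 0.0), (0.1, 0.9), (0.1, 0.2)], 0.5)
```

Because the comparison is irreflexive, asymmetric, and transitive, it can be plugged into any comparison-based sort, including an external merge sort for files that do not fit in memory. In this sketch two points in the same grid cell compare as equivalent.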

  10. ε-Interval • Coarse approximation of join mates: Used for I/O processing
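
One way to read this slide, stated here as an assumption: every join mate of a point p must lie, in the sort order, between p shifted by −eps in every dimension and p shifted by +eps in every dimension. The sketch below uses lexicographic comparison of grid cell tuples as the order. All names are hypothetical.

```python
import math

def cell_key(p, eps):
    """Grid cell coordinates of a point, used as its sort key."""
    return tuple(math.floor(x / eps) for x in p)

def eps_interval(p, eps):
    """Sort keys bounding the region that can contain join mates of p."""
    lower = cell_key([x - eps for x in p], eps)
    upper = cell_key([x + eps for x in p], eps)
    return lower, upper

def may_be_join_mate(p, q, eps):
    """Coarse filter: False means q cannot be within eps of p."""
    lower, upper = eps_interval(p, eps)
    return lower <= cell_key(q, eps) <= upper  # lexicographic tuple compare

p = (1.25, 1.25)
bounds = eps_interval(p, 0.5)
close = may_be_join_mate(p, (1.5, 1.0), 0.5)
far_but_kept = may_be_join_mate(p, (1.0, 100.0), 0.5)
far_and_pruned = may_be_join_mate(p, (5.0, 5.0), 0.5)
```

The filter is coarse in the sense the slide states: it never discards a true join mate, but it can keep a distant point such as (1.0, 100.0) that falls inside the interval only because of its first coordinate. That is acceptable when the interval is used just to decide which parts of the sorted file have to be read.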

  11. I/O Processing for the Self Join • Decompose the sorted file into I/O units

  12. Epsilon Grid Order

  13. CPU Processing • I/O units are further decomposed before joining • Simple divide-and-conquer: No further sorting • Decomposition: maximize active dimensions

  14. CPU Processing • Point distance computations: Order of dimensions • Neighboring inactive dimensions • Unspecified dimensions • Active dimension • Aligned inactive dimensions
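
The slide lists an order in which dimensions are visited during point distance computations but does not spell out the mechanism. A plausible reading, offered as an assumption, is that dimensions most likely to push the distance beyond eps are examined first so that the computation can stop early. In this sketch the visiting order is simply passed in as the hypothetical parameter `dim_order`.

```python
def within_eps(p, q, eps, dim_order):
    """Check dist(p, q) <= eps, visiting dimensions in dim_order.

    Returns (is_within, dims_examined); the loop stops as soon as the
    partial squared distance exceeds eps squared.
    """
    limit = eps * eps
    total = 0.0
    for examined, d in enumerate(dim_order, start=1):
        diff = p[d] - q[d]
        total += diff * diff
        if total > limit:
            return False, examined
    return True, len(dim_order)

p, q = (0.0, 0.0, 0.0), (0.1, 0.1, 5.0)
good_order = within_eps(p, q, 1.0, [2, 0, 1])  # discriminating dimension first
poor_order = within_eps(p, q, 1.0, [0, 1, 2])  # discriminating dimension last
```

Both calls reject the pair, but the first one does so after a single dimension, which illustrates why the order of dimensions matters in high-dimensional data.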

  15. Experimental Results • 8-dimensional uniformly distributed vectors

  16. Experimental Results (2) • 16-d feature vectors from CAD application

  17. Conclusions • Summary: high potential for performance gains of the similarity join through page capacity optimization; necessary to optimize I/O and CPU separately • Future research potential: similarity join for metric index structures; approximate similarity join; parallel similarity join algorithms
