Zhenjie Zhang, Beng Chin Ooi, Srinivasan Parthasarathy, Anthony K. H. Tung
Similarity Search on Bregman Divergence: Towards Non-Metric Indexing
Metric vs. Non-Metric
• Euclidean distance dominates DB queries
• Similarity in human perception
• Metric distance is not enough!
Outline
• Bregman Divergence
• Solution
  • Basic solution
  • Better pruning bounds
  • Query distribution
• Experiments
• Conclusion
Bregman Divergence
[Figure: a convex function f(x) with the hyperplane h through (q, f(q)); labels mark (p, f(p)), the Bregman divergence Df(p,q), and the Euclidean distance between p and q.]
Bregman Divergence
• Mathematical Interpretation
  • The distance between p and q is defined as the difference between f(p) and the first-order Taylor expansion of f at q, evaluated at p:
    Df(p,q) = f(p) − [ f(q) + ⟨∇f(q), p − q⟩ ]
  • f(p) is the original f(x); the bracketed term is the first-order Taylor expansion of f(x) at q
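The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the names `bregman`, `sq_norm`, and `sq_norm_grad` are hypothetical, and the generator f(x) = Σ x_i² is chosen only because its divergence is the familiar squared Euclidean distance.

```python
def bregman(f, grad_f, p, q):
    """Df(p, q) = f(p) - [f(q) + <grad f(q), p - q>]."""
    taylor_at_q = f(q) + sum(g * (a - b) for g, a, b in zip(grad_f(q), p, q))
    return f(p) - taylor_at_q

# Assumed generator for illustration: f(x) = sum of x_i^2.
def sq_norm(x):
    return sum(v * v for v in x)

def sq_norm_grad(x):
    return [2.0 * v for v in x]

p, q = [1.0, 2.0], [3.0, 0.0]
d = bregman(sq_norm, sq_norm_grad, p, q)
print(d)  # 8.0, the squared Euclidean distance between p and q
```

Passing `f` and its gradient as arguments keeps the function independent of any particular divergence.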
Bregman Divergence
• General Properties
  • Uniqueness
    • A function f(x) uniquely determines Df(p,q)
  • Non-Negativity
    • Df(p,q) ≥ 0 for any p, q
  • Identity
    • Df(p,p) = 0 for any p
  • Symmetry and Triangle Inequality
    • No longer hold in general
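A small numeric check makes these properties concrete. The sketch below uses the one-dimensional Itakura-Saito divergence (generator f(x) = −log x) purely as an assumed example; the function name and the sample values are illustrative and do not come from the slides.

```python
import math

def itakura_saito(p, q):
    """1-D Bregman divergence generated by f(x) = -log(x), for p, q > 0."""
    r = p / q
    return r - math.log(r) - 1.0

samples = (0.5, 1.0, 2.0, 4.0)

# Non-negativity and identity hold.
non_negative = all(itakura_saito(a, b) >= 0.0 for a in samples for b in samples)
identity = all(itakura_saito(a, a) == 0.0 for a in samples)

# Symmetry fails: D(1, 2) differs from D(2, 1).
asymmetric = abs(itakura_saito(1.0, 2.0) - itakura_saito(2.0, 1.0)) > 0.1

# Triangle inequality fails: D(1, 4) > D(1, 2) + D(2, 4).
triangle_violated = itakura_saito(1.0, 4.0) > itakura_saito(1.0, 2.0) + itakura_saito(2.0, 4.0)

print(non_negative, identity, asymmetric, triangle_violated)  # True True True True
```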
Examples
[Table of example divergences not preserved.]
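For orientation, three widely used Bregman divergences and their generators are sketched below. These are standard textbook examples and may or may not match the ones on the original slide; all names in the code are hypothetical. The sketch checks each closed form against the generic definition.

```python
import math

def bregman(f, grad_f, p, q):
    return f(p) - f(q) - sum(g * (a - b) for g, a, b in zip(grad_f(q), p, q))

# Generator f and its gradient for each divergence (inputs must be positive
# for the two logarithmic generators).
GENERATORS = {
    "squared_euclidean": (lambda x: sum(v * v for v in x),
                          lambda x: [2.0 * v for v in x]),
    "generalized_kl": (lambda x: sum(v * math.log(v) for v in x),
                       lambda x: [math.log(v) + 1.0 for v in x]),
    "itakura_saito": (lambda x: -sum(math.log(v) for v in x),
                      lambda x: [-1.0 / v for v in x]),
}

# Closed forms of the same divergences.
CLOSED_FORMS = {
    "squared_euclidean": lambda p, q: sum((a - b) ** 2 for a, b in zip(p, q)),
    "generalized_kl": lambda p, q: sum(a * math.log(a / b) - a + b for a, b in zip(p, q)),
    "itakura_saito": lambda p, q: sum(a / b - math.log(a / b) - 1.0 for a, b in zip(p, q)),
}

p, q = [0.5, 0.5], [0.25, 0.75]
results = {name: bregman(f, g, p, q) for name, (f, g) in GENERATORS.items()}
for name, value in results.items():
    print(name, value, CLOSED_FORMS[name](p, q))
```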
Why in a DB system?
• Database applications
  • Retrieval of similar images, speech signals, or time series
  • Optimization on matrices in machine learning
  • Efficiency is important!
• Query Types
  • Nearest Neighbor Query
  • Range Query
Euclidean Space
• How to answer the queries
  • R-Tree
  • VA File
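The slides' figures for these two structures are not preserved. As a reminder of how rectangle-based pruning usually works in Euclidean space, here is a minimal sketch of the common MINDIST lower bound between a query and a bounding rectangle. The function name `mindist` and the sample boxes are illustrative assumptions, not taken from the deck.

```python
import math

def mindist(q, lo, hi):
    """Smallest Euclidean distance from q to any point of the box [lo, hi]."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2 for x, l, h in zip(q, lo, hi)))

# A node can be skipped during nearest neighbor search when its MINDIST
# already exceeds the distance to the best candidate found so far.
outside = mindist([0.0, 0.0], [3.0, 4.0], [5.0, 6.0])
inside = mindist([4.0, 5.0], [3.0, 4.0], [5.0, 6.0])
print(outside, inside)  # 5.0 0.0
```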
Our goal
• Re-use the infrastructure of existing DB systems to support Bregman divergence
  • Storage management
  • Indexing structures
  • Query processing algorithms
Basic Solution
• Extended Space
  • Convex function f(x) = x²
Basic Solution
• After the extension
  • Index extended points with R-Tree or VA File
  • Re-use existing algorithms with new lower and upper bound computation
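One way to read the extended-space idea is sketched below. It rests on an assumption: each point x is stored with one extra coordinate y = f(x). Under that assumption, the divergence to a query q is the vertical gap between the extended point and the tangent hyperplane of f at q. The helper names are hypothetical, and f(x) = Σ x_i² follows the slide's example.

```python
def f(x):
    return sum(v * v for v in x)

def grad_f(x):
    return [2.0 * v for v in x]

def extend(x):
    """Append f(x) as an extra coordinate (the assumed extension)."""
    return list(x) + [f(x)]

def tangent_plane(q):
    """Hyperplane y = <v, x> + offset that touches f at q."""
    v = grad_f(q)
    offset = f(q) - sum(a * b for a, b in zip(v, q))
    return v, offset

def divergence_from_extended(ext_p, plane):
    """Vertical gap between an extended point and the tangent hyperplane."""
    v, offset = plane
    *x, y = ext_p
    return y - (sum(a * b for a, b in zip(v, x)) + offset)

ext_p = extend([1.0, 2.0])
d = divergence_from_extended(ext_p, tangent_plane([3.0, 0.0]))
print(ext_p, d)  # [1.0, 2.0, 5.0] 8.0
```

Because the gap is linear in the extended coordinates, an index built over extended points can bound it over a whole bounding rectangle.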
How to improve?
• Reformulation of Bregman divergence
• Tighter bounds are derived
• No change to index construction or the query processing algorithm
A New Formulation
[Figure: hyperplanes h and h’, the query vector vq, and the points q and p, with the labels Df(p,q)+Δ and D*f(p,q).]
Math. Interpretation
• Reformulation of similarity search queries
  • k-NN query: query q, data set P, divergence Df
    • Find the point p minimizing [expression not preserved]
  • Range query: query q, threshold θ, data set P
    • Return any point p that satisfies [condition not preserved]
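The formulas on this slide did not survive extraction, so the sketch below is a hedged reconstruction rather than the authors' exact statement. It assumes the query vector is vq = ∇f(q), defines D*f(p,q) = f(p) − ⟨vq, p⟩, and uses the constant Δ = f(q) − ⟨vq, q⟩, so that D*f(p,q) = Df(p,q) + Δ. Since Δ depends only on q, a k-NN query can rank by D*f, and a range query compares D*f against θ + Δ. The generator (generalized KL) and all function names are illustrative choices.

```python
import math

def f(x):
    return sum(v * math.log(v) for v in x)

def grad_f(x):
    return [math.log(v) + 1.0 for v in x]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def d_f(p, q):
    """Original divergence, kept only to cross-check the reformulation."""
    return f(p) - f(q) - dot(grad_f(q), [a - b for a, b in zip(p, q)])

def d_star(p, vq):
    """Reformulated score: depends on q only through the query vector vq."""
    return f(p) - dot(vq, p)

def knn(points, q, k):
    vq = grad_f(q)
    return sorted(points, key=lambda p: d_star(p, vq))[:k]

def range_query(points, q, theta):
    vq = grad_f(q)
    delta = f(q) - dot(vq, q)  # constant shift for this query
    return [p for p in points if d_star(p, vq) <= theta + delta]

points = [[0.2, 0.8], [0.5, 0.5], [0.7, 0.3], [0.9, 0.1]]
q = [0.6, 0.4]
print(knn(points, q, 2))
print(range_query(points, q, 0.1))
```

A linear scan stands in for the index here; the point is only that ranking and filtering by the reformulated score give the same answers as the original divergence.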
Naïve Bounds
• Check the corners of the bounding rectangles
Tighter Bounds
• Take the curve f(x) into consideration
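The difference between the two kinds of bounds can be illustrated with a sketch. It is not the paper's derivation and rests on several assumptions: f is decomposable as f(x) = Σ f1(x_i) with f1(x) = x², the score being bounded is f(p) − ⟨vq, p⟩ with vq = ∇f(q), and the extended rectangle stores the range [y_lo, y_hi] of f over its points. The corner check treats the score as linear in the extended coordinates. The curve-aware version uses the convexity of f1(x) − v·x in each dimension: the minimum sits at q_i clamped to the interval and the maximum at an endpoint. All function names are hypothetical.

```python
def f1(x):
    return x * x

def df1(x):
    return 2.0 * x

def corner_bounds(lo, hi, y_lo, y_hi, q):
    """Bounds on y - <v, x> over the extended box; extremes lie at corners."""
    v = [df1(c) for c in q]
    max_lin = sum(max(vi * l, vi * h) for vi, l, h in zip(v, lo, hi))
    min_lin = sum(min(vi * l, vi * h) for vi, l, h in zip(v, lo, hi))
    return y_lo - max_lin, y_hi - min_lin

def curve_bounds(lo, hi, q):
    """Bounds on f(x) - <v, x> over the box, using the shape of f itself."""
    lower = upper = 0.0
    for l, h, c in zip(lo, hi, q):
        v = df1(c)
        g = lambda x: f1(x) - v * x       # convex in x, minimized at x = c
        lower += g(min(max(c, l), h))     # clamp the minimizer into [l, h]
        upper += max(g(l), g(h))          # a convex maximum sits at an endpoint
    return lower, upper

points = [[1.0, 3.0], [2.0, 1.0]]
lo, hi = [1.0, 1.0], [2.0, 3.0]
ys = [sum(f1(x) for x in p) for p in points]
q = [1.5, 2.0]
naive = corner_bounds(lo, hi, min(ys), max(ys), q)
tight = curve_bounds(lo, hi, q)
print(naive, tight)  # (-13.0, 3.0) (-6.25, -5.0)
```

In this example both points score −5.0, which lies inside both intervals, and the curve-aware interval is much narrower. Neither bound dominates for every query under these assumptions, so taking the larger lower bound and the smaller upper bound of the two is also valid.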
Query distribution
• Distortion of rectangles
  • The difference between the maximum and minimum distances from points inside the rectangle to the query
Can we improve it more?
• When building an R-Tree in Euclidean space
  • Minimize the volume/edge length of MBRs
• Does it remain valid?
Query distribution
• Distortion of bounding rectangles
  • Invariant in Euclidean space (triangle inequality)
  • Query-dependent for Bregman divergence
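A small experiment shows the contrast. The sketch below assumes a decomposable generator f(x) = Σ x_i² and computes distortion as the maximum minus the minimum divergence over a box; the function names and the one-dimensional box are illustrative only. For Euclidean distance the triangle inequality caps the distortion at the box diagonal wherever the query lies, while the Bregman distortion of the same box grows as the query moves away.

```python
import math

def bregman_distortion(lo, hi, q):
    """Max minus min of Df(p, q) over p in the box, for f(x) = sum of x_i^2."""
    total = 0.0
    for l, h, c in zip(lo, hi, q):
        v = 2.0 * c
        g = lambda x: x * x - v * x              # per-dimension score, convex
        total += max(g(l), g(h)) - g(min(max(c, l), h))
    return total

def euclidean_distortion(lo, hi, q):
    """Max minus min Euclidean distance from q to the box."""
    near = math.sqrt(sum(max(l - x, 0.0, x - h) ** 2 for x, l, h in zip(q, lo, hi)))
    far = math.sqrt(sum(max(abs(x - l), abs(x - h)) ** 2 for x, l, h in zip(q, lo, hi)))
    return far - near

lo, hi = [1.0], [2.0]
for q in ([0.0], [10.0]):
    print(q, euclidean_distortion(lo, hi, q), bregman_distortion(lo, hi, q))
# [0.0] 1.0 3.0
# [10.0] 1.0 17.0
```

The constant that separates the reformulated score from Df cancels in the subtraction, so the per-dimension score can be used directly.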
Utilize Query Distribution
• Summarize the query distribution with O(d) real numbers
• Estimate the expected distortion of any bounding rectangle in O(d) time
• Allows a better index to be constructed for both R-Tree and VA File
Experiments
• Data Sets
  • KDD’99 data
    • Network data: the proportion of packets in 72 different TCP/IP connection types
  • DBLP data
    • Uses the co-authorship graph to generate the probabilities of authors being related to 8 different areas
Experiments
• Data Sets
  • Uniform synthetic data
    • Generated with a uniform distribution
  • Clustered synthetic data
    • Generated with a Gaussian mixture model
Experiments
[Tables and plots not preserved; the slide topics were:]
• Methods to compare
• Index construction time
• Varying dimensionality
• Varying k for nearest neighbor query
Conclusion
• A general technique for similarity search on Bregman divergence
• All techniques are built on the existing infrastructure of commercial databases
• Extensive experiments compare the performance of R-Tree and VA File under different optimizations
Acknowledgment • Zhenjie Zhang, Anthony K. H. Tung and Beng Chin Ooi were supported by Singapore NRF grant R-252-000-376-279. • Srinivasan Parthasarathy was supported by NSF IIS-0347662 (CAREER) and NSF CCF-0702587.