300 likes | 415 Views
On Hyper-plane Partition of Distance-Based Indexing. Rui Mao National High Performance Computing Center at Shenzhen College of Computer Science and Software Engineering Shenzhen University, China 02/22/2011. Outline. Similarity query and applications Distance-based indexing
E N D
On Hyper-plane Partition of Distance-Based Indexing Rui Mao National High Performance Computing Center at Shenzhen College of Computer Science and Software Engineering Shenzhen University, China 02/22/2011
Outline • Similarity query and applications • Distance-based indexing • The Complete General Hyper-plane Tree • VPT: possibly the optimal hyper-plane • Conclusion and future work
r q 1. Similarity Query & Applications Given • A database of n data records: S = {x1, x2, …,xn} • A similarity (distance) measure d(x,y) = the distance between data records x and y. • A query q Range query R(q,r) KNN query: (k-nearest neighbor) Google Map top 10 results
Example 1 • Find all students with score in [75, 85]: SELECT name FROM student WHERE ABS(score-80)<=5;
Molecular Biological Information System (MoBIoS) http://www.cs.utexas.edu/~mobios
Conserved primer pair [ISMB04] Given: • Arabidopsis genome (120M) • Rice genome (537M) Goal: • determine a large number of paired, conserved DNA primers that may be used as primer pairs to PCR. Similarity: • Hamming distance of 18-mers
Mass-spectra coarse filter [Bioinformatics06] Given: • A mass-spectra database • A query mass-spectra (high-dim vector) Goal: • A coarse filter, retrieval a small subset of database as candidate for fine filtering. Similarity • Semi-cosine distance
Protein sequence homology [BIBE06] Given • A database of sequences • A query sequence Goal: • Local alignment Similarity: • Global alignment of 6-mers with mPAM matrix (weighted edit distance) Methodology • Break database and query into k-mers • Similarity query of k-mers • Chain the results.
2. Distance-based Indexing Indexing: • Goal: fast data lookup • Minimize number of distance calculations • Ideal case: Log or even constant time • Worst case: Sequential scan of database • Methodology: Partition and pruning
Category: data type & similarity • Data type: One-dimensional, R Similarity measure: Euclidean norm (absolute value of difference) Index: One-dimensional indexing Example: B-tree • Data type: Multi-dimensional, Rn Similarity measure: Euclidean norm Index: Multi-dimensional indexing Example: kd-tree • Data type: Other type Similarity measure: Other measurement Index: ? Example: ?
x d(x,z) d(x,y) y z d(y,z) Metric Space a pair, M=(D,d), where • D is a set of points • d is a [metric] distance function with the following: • d(x,y) = d (y,x) (symmetry) • d(x, y) >= 0 and d(x, y) = 0 iff x = y (non negativity) • d(x,z) <= d(x,y) + d(y,z) (triangle inequality)
How it works? Range query R(snoppy,2) Advantages • Generality • One-dimensional data • Multi-dimensional data with Euclidean norm • Any metric space • A uniform programming model • the distance oracle is given • One index mechanism for most Disadvantages • Not fast enough ? 1 100 d(Michael,Linc)=1 d(Linc, Snoopy) = 100 99<=d(Michael, Snoopy)<= 101
Data partition: three families • Hyper-plane methods • GHT [Uhlmann 1991] • GNAT [Brin 1995] • SA-tree [Navarro 1999] • Vantage point methods • BKT [Burkhard and Keller 1973] • VPT [Uhlmann 1991, Yianilos 1993] • MVPT [Bozkaya et al. 1997] • Bounding sphere methods • BST [Kalantari and McDonald 1983] • M-tree [Ciaccia et al. 1997] • Slim-tree [Traina et al. 2000]
C1,C2 Right of L Left of L C1 C2 Hyper-plane methods [Uhlmann 1991] • Choose centers • Partition the data L
VP1 VP1,R1 d(VP1, x)≤R1 R22 d(VP1, x)>R1 R21 VP21,R21 VP22,R22 d(VP22, x)≤R22 Case 1. If d(VP1,q) > R1 + r then search outside the sphere d(VP22, x)>R22 R1 VP21 … … d(VP1,q) VP22 r q Vantage Point Tree (VPT) [Uhlmann 1991 & Yianilos 1993] • Choose vantage points • Partition the data Case 2. If d(VP1,q) < R1 - r then search inside the sphere Case 3. Bad case: query object close to partition boundary, descend both children
C1 C2 C3 Bounding sphere methods [Ciaccia et al. 1997] • Choose centers • Partition the data C1,R(C1) C3,R(C3) C2,R(C2)
Difficulties and problems • No coordinates • Mathematical tools not directly applicable • Mostly heuristic • Lack of theoretical analysis • 3 families of indices • Not unified • Hard to compare, analyze and predict SISAP2010 Best Paper, with Dr. Miranker Focus of this talk, with Dr. Miranker
General Methodology • metric space Rk • multi-dimensional indexing query cube • direct evaluation of cube
P S The pivot space model Mapping: M Rk :x Pivot space: The image of S in Rk
VP1 R1 p1 p2 Example of pivot space: VPT 22
p1 p2 L 3. The Complete General Hyper-plane Tree (CGHT) GHT: metric space GHT: pivot space Partition by d1-d2 p1 p2 MVPT: pivot space CGHT: metric space CGHT: pivot space
r-neighborhood Nr(L), the r-neighborhood of a partition boundary L, is the neighborhood of L in the pivot space, into which if a query object q falls, R(q,r) could have results in both sides of L. • Assuming q has the same distribution as the database,|Nr(L)|dominates query performance. • Width & Density
Nr(L): |x-μ|≤ r 2r L: x = μ • Special case: L: x = μ • Width = 2r d(p2, x) q Nr(L): |y-x| ≤ 2r r y = x + 2r L: y = x y = x – 2r 2r 2r -2r 0 0 d(p1, x) d(p1, x) (b) Special case: L: y = x Width = Min width of r-neighborhood MVPT partition has the minimal width of r-neighborhood
|Nr(L)|: analytical • 2-d normal dist.: N(0, 1, 0, 1, -ρ), 0≤ρ≤1 |NGHT(r)|∝ PGHT(r) = P(|x| ≤ r | x~N(0,1)) |NVPT(r)|∝ PVPT(r) = P(|x| ≤ r | x~N(0,1))
Dimension rotation might not be helpful! • A counter example
Conclusions and Future work Conclusions • Distance-based indexing is a very general approach • CGHT is an improvement on GHT • All the 3 families are hyper-plane partition in pivot space • VPT partition has the minimal width of r-neighborhood, and possibly minimal size of it. Future work • Multi-dimensional/statistical methods • Non-linear partition • Applications.
Thank you! mao@szu.edu.cn http://nhpcc.szu.edu.cn/mao/eng