Pivot Selection: Dimension Reduction for Distance-based Indexing
Rui Mao
National High Performance Computing Center at Shenzhen
College of Computer Science and Software Engineering, Shenzhen University, China
02/23/2011
Outline
• Similarity query and applications
• Distance-based (metric space) indexing
• Pivot selection
• PCA for distance-based indexing
• Future direction
1. Similarity Query
Given
• A database of n data records: S = {x1, x2, …, xn}
• A similarity (distance) measure d(x,y) = the distance between data records x and y
• A query q
Query types
• Range query R(q,r): all records within distance r of q
• kNN query: the k nearest neighbors of q (e.g., Google Maps' top-10 results)
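As a hedged illustration (not from the slides), a minimal linear-scan baseline for both query types, assuming only a distance oracle d(x, y); the function names are ours:

```python
import heapq

def range_query(S, d, q, r):
    """Return all records within distance r of q (linear-scan baseline)."""
    return [x for x in S if d(x, q) <= r]

def knn_query(S, d, q, k):
    """Return the k records nearest to q (linear-scan baseline)."""
    return heapq.nsmallest(k, S, key=lambda x: d(x, q))

# One-dimensional records with absolute difference as the metric:
S = [70, 75, 80, 85, 90]
print(range_query(S, lambda x, y: abs(x - y), 80, 5))  # [75, 80, 85]
print(knn_query(S, lambda x, y: abs(x - y), 80, 2))    # [80, 75] (ties broken by database order)
```

A distance-based index aims to answer the same queries while computing far fewer distances than this scan.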
Example 1
• Find all students with score in [75, 85]:
SELECT name FROM student WHERE ABS(score - 80) <= 5;
• This is the range query R(80, 5) under the absolute-difference metric
Molecular Biological Information System (MoBIoS) http://www.cs.utexas.edu/~mobios
Conserved primer pair [ISMB04]
Given:
• Arabidopsis genome (120M)
• Rice genome (537M)
Goal:
• Determine a large number of paired, conserved DNA primers that may be used as PCR primer pairs
Similarity:
• Hamming distance of 18-mers
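To make the similarity measure concrete, a hedged sketch of Hamming distance over fixed-length k-mers (illustrative only; the extraction of 18-mers from the genomes is not shown):

```python
def hamming(a: str, b: str) -> int:
    """Hamming distance: the number of mismatching positions between equal-length strings."""
    assert len(a) == len(b), "Hamming distance requires equal-length strings"
    return sum(c1 != c2 for c1, c2 in zip(a, b))

print(hamming("ACGTACGTACGTACGTAC", "ACGTACGAACGTACGTAC"))  # 1
```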
Mass-spectra coarse filter [Bioinformatics06]
Given:
• A database of mass spectra
• A query mass spectrum (a high-dimensional vector)
Goal:
• A coarse filter: retrieve a small subset of the database as candidates for fine filtering
Similarity:
• Semi-cosine distance
Protein sequence homology [BIBE06]
Given:
• A database of sequences
• A query sequence
Goal:
• Local alignment
Similarity:
• Global alignment of 6-mers with the mPAM matrix (a weighted edit distance)
Methodology:
• Break the database and the query into k-mers
• Run similarity queries on the k-mers
• Chain the results
2. Distance-based Indexing
• Goal: fast data lookup, minimizing the number of distance calculations
• Ideal case: logarithmic or even constant time
• Worst case: sequential scan of the database
• Methodology: partition and pruning
Category: data type & similarity

Data type                Similarity measure                      Index                        Example
One-dimensional (R)      Euclidean norm (absolute difference)    One-dimensional indexing     B-tree
Multi-dimensional (Rn)   Euclidean norm                          Multi-dimensional indexing   kd-tree
Other type               Other measurement                       ?                            ?
Metric Space
A pair M = (D, d), where
• D is a set of points
• d is a [metric] distance function with the following properties:
  • d(x,y) = d(y,x) (symmetry)
  • d(x,y) >= 0, and d(x,y) = 0 iff x = y (non-negativity and identity)
  • d(x,z) <= d(x,y) + d(y,z) (triangle inequality)
How does it work?
Example: range query R(Snoopy, 2)
• d(Michael, Linc) = 1, d(Linc, Snoopy) = 100
• By the triangle inequality, 99 <= d(Michael, Snoopy) <= 101, so Michael lies outside R(Snoopy, 2) and can be pruned without ever computing d(Michael, Snoopy)
Advantages
• Generality
  • One-dimensional data
  • Multi-dimensional data with Euclidean norm
  • Any metric space
• A uniform programming model: the distance oracle is given
• One index mechanism for most data types
Disadvantages
• Not fast enough?
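A hedged sketch of the pruning rule this example relies on, assuming a precomputed distance d(p, x) from a reference point p to each database object x (the function name is ours):

```python
def can_prune(d_p_x: float, d_p_q: float, r: float) -> bool:
    """Triangle-inequality pruning for a range query R(q, r).

    Since |d(p, q) - d(p, x)| <= d(x, q), if this lower bound already
    exceeds r, then x cannot be in the result and d(x, q) is never computed.
    """
    return abs(d_p_q - d_p_x) > r

# Slide example: p = Linc, x = Michael, q = Snoopy, query radius r = 2.
print(can_prune(d_p_x=1, d_p_q=100, r=2))  # True: the lower bound 99 > 2, so Michael is pruned
```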
Data partition: three families
• Hyper-plane methods
  • GHT [Uhlmann 1991]
  • GNAT [Brin 1995]
  • SA-tree [Navarro 1999]
• Vantage point methods
  • BKT [Burkhard and Keller 1973]
  • VPT [Uhlmann 1991, Yianilos 1993]
  • MVPT [Bozkaya et al. 1997]
• Bounding sphere methods
  • BST [Kalantari and McDonald 1983]
  • M-tree [Ciaccia et al. 1997]
  • Slim-tree [Traina et al. 2000]
Hyper-plane methods [Uhlmann 1991]
• Choose two centers C1, C2
• Partition the data by the hyper-plane L between them: points closer to C1 go to the left of L, points closer to C2 to the right
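A hedged sketch of one level of this split (center choice is a placeholder; real indices also handle boundary points during search):

```python
def hyperplane_split(S, d, c1, c2):
    """Assign each point to the nearer of the two centers c1, c2.

    Points equidistant from c1 and c2 (on the boundary L) go left here.
    """
    left, right = [], []
    for x in S:
        (left if d(x, c1) <= d(x, c2) else right).append(x)
    return left, right
```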
Vantage Point Tree (VPT) [Uhlmann 1991, Yianilos 1993]
• Choose a vantage point VP1 and a radius R1
• Partition the data: inside the sphere (d(VP1, x) <= R1) vs. outside (d(VP1, x) > R1); recurse with VP21, R21 and VP22, R22
Searching a range query R(q, r):
• Case 1: if d(VP1, q) > R1 + r, search only outside the sphere
• Case 2: if d(VP1, q) < R1 - r, search only inside the sphere
• Case 3 (bad case): the query object is close to the partition boundary, so both children must be descended
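A hedged, minimal VP-tree sketch (median-radius split, range search following the three cases above; vantage-point choice is simplified to the first element, and all names are ours):

```python
import statistics

class VPNode:
    def __init__(self, vp, radius, inside, outside):
        self.vp, self.radius, self.inside, self.outside = vp, radius, inside, outside

def build_vpt(S, d, leaf_size=4):
    """Recursively split on the median distance to a vantage point."""
    if len(S) <= leaf_size:
        return list(S)                       # leaf: small bucket, scanned directly
    vp, rest = S[0], S[1:]
    radius = statistics.median(d(vp, x) for x in rest)
    inside = [x for x in rest if d(vp, x) <= radius]
    outside = [x for x in rest if d(vp, x) > radius]
    return VPNode(vp, radius, build_vpt(inside, d, leaf_size),
                  build_vpt(outside, d, leaf_size))

def search_vpt(node, d, q, r, out):
    """Range query R(q, r); descends both children only near the boundary (Case 3)."""
    if isinstance(node, list):               # leaf bucket
        out.extend(x for x in node if d(x, q) <= r)
        return
    dist = d(node.vp, q)
    if dist <= r:
        out.append(node.vp)
    if dist <= node.radius + r:               # not Case 1: the inside ball may overlap R(q, r)
        search_vpt(node.inside, d, q, r, out)
    if dist >= node.radius - r:               # not Case 2: the outside region may overlap
        search_vpt(node.outside, d, q, r, out)
```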
Bounding sphere methods [Ciaccia et al. 1997]
• Choose centers C1, C2, C3, …
• Partition the data into (possibly overlapping) spheres, each described by a center and a covering radius: (C1, R(C1)), (C2, R(C2)), (C3, R(C3))
Difficulties and problems
• No coordinates: mathematical tools are not directly applicable
• Mostly heuristic: lack of theoretical analysis
• 3 families of indices, not unified: hard to compare, analyze and predict
(SISAP 2010 Best Paper)
General Methodology
• Metric space → Rk (pivot selection)
• Multi-dimensional indexing: query → cube (data partition)
• Direct evaluation of the cube (post processing)
Pivot space
• Choose a set of pivots P = {p1, p2, …, pk} ⊆ S
• Mapping: FP: M → Rk, x ↦ (d(x, p1), d(x, p2), …, d(x, pk))
• Pivot space: the image of S in Rk
Complete pivot space
• Let all the points be pivots: P = S, giving a mapping M → Rn
• Under L∞((a1, a2, …, an), (b1, b2, …, bn)) = maxi |ai − bi|, this mapping is isometric on S: L∞(F(x), F(y)) = d(x, y)
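A hedged sketch of the pivot mapping and the L∞ distance (function names are ours), including a check of the isometry claim for the complete pivot space:

```python
def pivot_map(x, pivots, d):
    """Map x to its vector of distances to the pivots: F_P(x) in R^k."""
    return tuple(d(x, p) for p in pivots)

def l_inf(a, b):
    """L-infinity distance in the pivot space."""
    return max(abs(ai - bi) for ai, bi in zip(a, b))

# With the complete pivot space (P = S) the mapping is isometric,
# since |d(x,p) - d(y,p)| <= d(x,y) with equality at p = x (or p = y):
d = lambda x, y: abs(x - y)          # a simple metric on numbers
S = [1.0, 4.0, 6.0, 7.0]
F = {x: pivot_map(x, S, d) for x in S}
assert all(l_inf(F[x], F[y]) == d(x, y) for x in S for y in S)
```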
Distance-based indexing viewed as high-dimensional indexing:
• General metric space → high-dimensional vector space (isometric mapping)
• High-dimensional vector space → low-dimensional vector space (dimension reduction = pivot selection)
• Low-dimensional vector space → result set in low-dim space (multi-dimensional indexing / data partition)
• Result set in low-dim space → result set in metric space (sequential comparison)
3. Pivot selection
Dimension reduction for distance-based indexing
• Answer queries directly in the complete pivot space?
• Dimension reduction for the complete pivot space?
• Why is pivot selection important?
• How to select pivots?
3.1 Answer queries directly in the complete pivot space?
Theorem: Evaluation of similarity queries in the complete pivot space degrades query performance to a linear scan.
• Dimension reduction is inevitable
3.2 Dimension reduction for the complete pivot space?
Theorem: If a dimension reduction technique creates new dimensions based on all existing dimensions, evaluation of similarity queries degrades to a linear scan.
• Therefore we can only select among existing dimensions: pivot selection
3.3 Why is pivot selection important?
• Building the index tree is a process of information loss
• The information available to data partition is determined by pivot selection
Example: points A, B, C at positions 1, 2, 3 on the real line (d = absolute difference):

Point            A   B   C
Original value   1   2   3
d(x, A)          0   1   2
d(x, B)          1   0   1
d(x, C)          2   1   0

With pivot A or pivot C the three points stay distinct, but with pivot B the points A and C both map to 1, so the distinction between them is lost.
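A hedged check of this table, reusing the pivot_map sketch from above:

```python
d = lambda x, y: abs(x - y)
A, B, C = 1, 2, 3
print([pivot_map(x, [A], d) for x in (A, B, C)])  # [(0,), (1,), (2,)] - all distinct
print([pivot_map(x, [B], d) for x in (A, B, C)])  # [(1,), (0,), (1,)] - A and C collide
```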
Importance of pivot selection
• Uniformly distributed points in the unit square
[Figure: the points plotted by their distances to the two pivots; left: pivots (0,0) and (1,1); right: pivots (0,0) and (1,0)]
Importance of pivot selection
• 14-bit Hamming strings ("0/1" strings)
[Figure: the strings plotted by their distances to the two pivots; left: pivots at opposite corners, 00 0000 0000 0000 and 11 1111 1111 1111; right: pivots at neighboring corners, 00 0000 0000 0000 and 00 0000 0111 1111]
3.4 How to select pivots?
• Heuristic: for each new dimension, select the point with the largest projection on that new dimension in the pivot space
• Use mathematical tools available in Rn
• Yet what is a good objective function for pivot selection?
• Empirical evaluation: select pivots, build the tree, run queries, and compare the average query speed
4. PCA for distance-based indexing
• Pivot selection
• Estimating the intrinsic dimension
PCA for pivot selection
• Run PCA on the complete pivot space
• Apply the heuristic: for each principal component, select the dimension (data point) most similar to it, i.e., with minimal angle (see the sketch below)
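A hedged NumPy sketch of one way to implement this heuristic (our interpretation, not the paper's code): each data point is a dimension of the complete pivot space, and for each principal component we pick the dimension whose basis vector makes the smallest angle with it.

```python
import numpy as np

def pca_pivots(dist_matrix: np.ndarray, k: int) -> list[int]:
    """Select k pivots via PCA over the complete pivot space.

    dist_matrix[i, j] = d(x_i, x_j): row i is the image of point x_i,
    and dimension j corresponds to data point x_j used as a pivot.
    """
    X = dist_matrix - dist_matrix.mean(axis=0)        # center each dimension
    _, _, Vt = np.linalg.svd(X, full_matrices=False)  # rows of Vt: principal directions
    pivots: list[int] = []
    for v in Vt[:k]:
        # The angle between basis vector e_j and the unit vector v is arccos(|v[j]|),
        # so the minimal-angle dimension maximizes |v[j]|.
        for j in np.argsort(-np.abs(v)):
            if j not in pivots:                        # avoid selecting the same point twice
                pivots.append(int(j))
                break
    return pivots
```

In practice one would run this on a sample of the data, since the complete pivot space is n-dimensional.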
Estimate the intrinsic dimension
1. From pairwise distances: ρ = μ² / (2σ²), where μ and σ² are the mean and variance of the pairwise distances
2. From range query sizes: |Range(q,r)| ∝ r^d, fitted by linear regression: log(|Range(q,r)|) = d·log(r) + c
3. From where the PCA eigenvalues change the most: argmaxi (λi / λi+1)
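A hedged sketch of the first estimator (the pairwise-distance statistic; we sample a bounded number of pairs rather than computing all n² distances, and the names are ours):

```python
import itertools
import statistics

def intrinsic_dim(S, d, max_pairs=10000):
    """Estimate intrinsic dimensionality as rho = mu^2 / (2 * sigma^2),
    where mu and sigma^2 are the mean and variance of pairwise distances."""
    pairs = itertools.islice(itertools.combinations(S, 2), max_pairs)
    dists = [d(x, y) for x, y in pairs]
    mu = statistics.mean(dists)
    var = statistics.pvariance(dists)
    return mu * mu / (2 * var)

import random
random.seed(0)
points = [(random.random(), random.random()) for _ in range(200)]
euclid = lambda a, b: ((a[0]-b[0])**2 + (a[1]-b[1])**2) ** 0.5
print(intrinsic_dim(points, euclid))  # ~2.2 for uniform points in the unit square
```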
5. Future work
• Other dimension reduction methods
• Objective function of pivot selection
• Pairwise distances
• Multi-variable regression methods
• Forward selection, backward elimination
• Choice of y: mean? standard deviation?
• Variable selection
• Non-linear regression
Credit
• Daniel P. Miranker, UT Austin
• Willard L. Miranker, Yale University

Rui Mao, Willard L. Miranker and Daniel P. Miranker, "Dimension Reduction for Distance-Based Indexing", in Proceedings of the Third International Conference on Similarity Search and Applications (SISAP 2010), pages 25-32, Istanbul, Turkey, September 18-19, 2010.
Thank you! mao@szu.edu.cn http://nhpcc.szu.edu.cn/mao/eng