Distance-Based Indexing: Applications in Bioinformatics & the Pivot Space Model Rui Mao National High Performance Computing Center at Shenzhen College of Computer Science and Software Engineering Shenzhen University, China 02/16/2011
Outline • Similarity query and biological applications • Indexing for similarity query • Distance-based (metric space) indexing • The pivot space model
1. Similarity Query Given • A database of n data records: S = {x1, x2, …, xn} • A similarity (distance) measure d(x, y) = the distance between data records x and y • A query object q Query types • Range query R(q, r): all records within distance r of q • kNN query (k-nearest neighbor): the k records closest to q, e.g., Google Maps' top-10 results
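As a baseline, both query types can be answered by a sequential scan of S. A minimal sketch (the data, metric, and function names are illustrative, not from any system mentioned here):

```python
import heapq

def range_query(S, d, q, r):
    """R(q, r): all records within distance r of the query q (sequential scan)."""
    return [x for x in S if d(q, x) <= r]

def knn_query(S, d, q, k):
    """kNN: the k records closest to q, nearest first."""
    return heapq.nsmallest(k, S, key=lambda x: d(q, x))

# Toy 1-D example: records are numbers, distance is absolute difference.
S = [3, 10, 25, 41, 7]
d = lambda a, b: abs(a - b)
print(range_query(S, d, 8, 3))  # [10, 7]
print(knn_query(S, d, 8, 2))    # [7, 10]
```

Every record is compared against q, so the cost is n distance computations; the indexing methods below aim to prune most of them.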
Example 1 • Find all students with score in [75, 85]: SELECT name FROM student WHERE ABS(score-80)<=5;
Molecular Biological Information System (MoBIoS) http://www.cs.utexas.edu/~mobios
Conserved primer pair [ISMB04] Given: • Arabidopsis genome (120M) • Rice genome (537M) Goal: • determine a large number of paired, conserved DNA primers that can be used as primer pairs for PCR Similarity: • Hamming distance of 18-mers
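Hamming distance of k-mers simply counts mismatching positions. A sketch (the two 18-mers below are made-up examples, not actual Arabidopsis or rice sequences):

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(c1 != c2 for c1, c2 in zip(a, b))

# Two hypothetical 18-mers differing at one position.
print(hamming("ACGTACGTACGTACGTAC", "ACGTACGAACGTACGTAC"))  # 1
```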
Mass-spectra coarse filter [Bioinformatics06] Given: • A mass-spectra database • A query mass spectrum (a high-dimensional vector) Goal: • A coarse filter that retrieves a small subset of the database as candidates for fine filtering Similarity: • Semi-cosine distance
Protein sequence homology [BIBE06] Given: • A database of sequences • A query sequence Goal: • Local alignment Similarity: • Global alignment of 6-mers with the mPAM matrix (weighted edit distance) Methodology: • Break the database and query into k-mers • Run similarity queries on the k-mers • Chain the results
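The first methodology step, breaking a sequence into overlapping k-mers, can be sketched as below (the mPAM-weighted alignment and the chaining step are omitted; the sequence is a toy example):

```python
def kmers(seq: str, k: int):
    """All overlapping k-mers of seq, paired with their start offsets."""
    return [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]

print(kmers("MKVLA", 3))  # [(0, 'MKV'), (1, 'KVL'), (2, 'VLA')]
```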
2. Indexing for similarity query • Goal: fast data lookup • Minimize the number of distance calculations • Ideal case: logarithmic or even constant time • Worst case: sequential scan of the database • Methodology: partition and pruning
Category: data type & similarity • One-dimensional data (R): similarity measure is the Euclidean norm (absolute value of difference); index: one-dimensional indexing, e.g., B-tree • Multi-dimensional data (Rn): similarity measure is the Euclidean norm; index: multi-dimensional indexing, e.g., kd-tree • Other data types: other similarity measures; index: ?
3. Distance-based indexing Metric space: a pair M = (D, d), where • D is a set of points • d is a metric distance function satisfying: • d(x, y) = d(y, x) (symmetry) • d(x, y) >= 0, and d(x, y) = 0 iff x = y (non-negativity) • d(x, z) <= d(x, y) + d(y, z) (triangle inequality)
How it works? Range query R(Snoopy, 2) • Given d(Michael, Linc) = 1 and d(Linc, Snoopy) = 100, the triangle inequality implies 99 <= d(Michael, Snoopy) <= 101, so Michael can be pruned without computing that distance Advantages • Generality: one-dimensional data, multi-dimensional data with Euclidean norm, any metric space • A uniform programming model: only the distance oracle is given • One index mechanism serves most applications Disadvantages • Not fast enough?
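The pruning bound used above follows from symmetry and the triangle inequality: |d(p, q) − d(p, x)| <= d(q, x) <= d(p, q) + d(p, x) for any pivot p. A minimal sketch (the function name is ours):

```python
def pivot_bounds(d_pq, d_px):
    """Bounds on d(q, x) given the distances of q and x to a common pivot p:
    lower bound |d_pq - d_px| and upper bound d_pq + d_px, both from the
    triangle inequality."""
    return abs(d_pq - d_px), d_pq + d_px

# d(Linc, Snoopy) = 100 and d(Michael, Linc) = 1 bound d(Michael, Snoopy):
print(pivot_bounds(100, 1))  # (99, 101)
```

If the lower bound already exceeds the query radius r, the candidate is pruned with no exact distance computation.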
Data partition: three families • Hyper-plane methods • GHT [Uhlmann 1991] • GNAT [Brin 1995] • SA-tree [Navarro 1999] • Vantage point methods • BKT [Burkhard and Keller 1973] • VPT [Uhlmann 1991, Yianilos 1993] • MVPT [Bozkaya et al. 1997] • Bounding sphere methods • BST [Kalantari and McDonald 1983] • M-tree [Ciaccia et al. 1997] • Slim-tree [Traina et al. 2000]
Hyper-plane methods [Uhlmann 1991] • Choose two centers C1 and C2 • Partition the data by the generalized hyper-plane L between them: points closer to C1 fall on one side of L, points closer to C2 on the other
Vantage Point Tree (VPT) [Uhlmann 1991, Yianilos 1993] • Choose a vantage point VP1 and radius R1 • Partition the data into an inside set (d(VP1, x) <= R1) and an outside set (d(VP1, x) > R1), then recurse (VP21 with R21, VP22 with R22, …) Searching R(q, r): • Case 1: if d(VP1, q) > R1 + r, search only outside the sphere • Case 2: if d(VP1, q) < R1 - r, search only inside the sphere • Case 3 (bad case): the query object is close to the partition boundary; descend both children
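A minimal VPT sketch following these rules, with a median split, a naive vantage-point choice, and 1-D data under absolute difference purely for illustration (not the original implementation):

```python
def build_vpt(points, d, leaf_size=4):
    """Node = (vp, radius, inside, outside); leaves are plain lists."""
    if len(points) <= leaf_size:
        return points
    vp = points[0]  # naive choice; real VPTs pick vantage points heuristically
    rest = sorted(points[1:], key=lambda x: d(vp, x))
    mid = len(rest) // 2
    radius = d(vp, rest[mid])  # median distance to the vantage point
    return (vp, radius,
            build_vpt(rest[:mid + 1], d, leaf_size),
            build_vpt(rest[mid + 1:], d, leaf_size))

def range_search(node, d, q, r, out):
    """Collect all x with d(q, x) <= r, pruning subtrees via the triangle inequality."""
    if isinstance(node, list):  # leaf: check every point
        out.extend(x for x in node if d(q, x) <= r)
        return
    vp, radius, inside, outside = node
    dq = d(vp, q)
    if dq <= r:
        out.append(vp)
    if dq <= radius + r:   # cases 2 and 3: query ball may reach the inside region
        range_search(inside, d, q, r, out)
    if dq >= radius - r:   # cases 1 and 3: query ball may reach the outside region
        range_search(outside, d, q, r, out)

d = lambda a, b: abs(a - b)
tree = build_vpt(list(range(0, 100, 5)), d)
res = []
range_search(tree, d, 23, 7, res)
print(sorted(res))  # [20, 25, 30]
```

In the good cases only one child is descended; the bad case (query near the boundary) descends both, which is exactly what the r-neighborhood analysis later in the talk quantifies.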
Bounding sphere methods [Ciaccia et al. 1997] • Choose centers C1, C2, C3, … • Partition the data into bounding spheres, each with center Ci and covering radius R(Ci)
Difficulties and problems • No coordinates, so mathematical tools are not directly applicable • Mostly heuristic, with little theoretical analysis • The 3 families of indices are not unified, and hard to compare, analyze, and predict
4. The pivot space model General methodology: • Map the metric space into Rk (the pivot space) • The query ball becomes a query cube in Rk • Answer the query by multi-dimensional indexing, or by direct evaluation of the cube
Pivot space • Choose a set of pivots P = {p1, …, pk} from S • Mapping F: M → Rk, x ↦ (d(x, p1), …, d(x, pk)) • Pivot space: the image of S in Rk
Complete pivot space Let all the points be pivots: P = S, so M maps into Rn, with L∞((a1, a2, …, an), (b1, b2, …, bn)) = maxi |ai − bi|
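A sketch of the pivot mapping and its key pruning property: under L∞, pivot-space distances never exceed the original metric distance, so a pivot-space cube always contains the query ball (names and data below are illustrative):

```python
def pivot_map(x, pivots, d):
    """Map x to its vector of distances to the pivots."""
    return tuple(d(x, p) for p in pivots)

def linf(a, b):
    """L-infinity distance between two equal-length vectors."""
    return max(abs(u - v) for u, v in zip(a, b))

d = lambda a, b: abs(a - b)
S = [2, 9, 14, 30]
pivots = [2, 30]
images = {x: pivot_map(x, pivots, d) for x in S}
print(images[9])  # (7, 21)

# Contractiveness: L-inf in pivot space is a lower bound on d(x, y).
for x in S:
    for y in S:
        assert linf(images[x], images[y]) <= d(x, y)
```

With P = S the L∞ distance equals d exactly, which is the isometry behind the complete pivot space.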
Pipelines compared • High-dimensional indexing: high-dimensional vector space → dimension reduction → low-dimensional vector space → multi-dimensional indexing → result set in low-dim space • Distance-based indexing: general metric space → data partition (or sequential comparison) → result set in metric space • The complete pivot space gives an isometric mapping from a general metric space into a high-dimensional vector space, linking the two pipelines Are we done?
Two distinctions • 1. Pivot selection vs. dimension reduction • 2. query ball vs. query cube
4.1 Pivot selection: dimension reduction for distance-based indexing (SISAP 2010 Best Paper) • 1. Can queries be answered directly in the complete pivot space? No: dimension reduction is inevitable • 2. Can standard dimension reduction be applied to the complete pivot space? We can only select existing dimensions, i.e., pivot selection • 3. How to select pivots? Use an Rn method to create a new dimension, then find the closest existing dimension (pivot)
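The slides do not spell out a selection algorithm here; one widely used greedy heuristic, farthest-first traversal, is sketched below purely as an illustration (this is not necessarily the SISAP 2010 method):

```python
def farthest_first_pivots(S, d, k):
    """Greedy heuristic: start from an arbitrary point, then repeatedly add the
    point whose minimum distance to the pivots chosen so far is largest."""
    pivots = [S[0]]
    while len(pivots) < k:
        far = max(S, key=lambda x: min(d(x, p) for p in pivots))
        pivots.append(far)
    return pivots

d = lambda a, b: abs(a - b)
print(farthest_first_pivots([4, 1, 9, 20, 15], d, 2))  # [4, 20]
```

Spreading pivots far apart tends to make the pivot-space coordinates less redundant, which is the intuition behind treating pivot selection as dimension selection.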
4.2 Hyper-plane partition in pivot space • General Hyperplane Tree (GHT): with coordinates x = d(p1, xi) and y = d(p2, xi), the partition boundary is the line L: y = x • Multiple Vantage Point Tree (MVPT): with coordinates x = d(vp1, xi) and y = d(vp2, xi), the partition boundaries are axis-parallel lines (x = d11, y = d21, d22, …), producing child1 … child4
(Figures: the partitions of GHT, Complete GHT (CGHT), and MVPT with pivots p1, p2, each shown in pivot space and in metric space.)
r-neighborhood • Nr(L), the r-neighborhood of a partition boundary L, is the region of the pivot space around L such that, if a query object q falls into it, R(q, r) could have results on both sides of L • Assuming q has the same distribution as the database, |Nr(L)| dominates query performance • Two factors: width and density
Width of the r-neighborhood • Special case (a), L: x = μ (axis-parallel, MVPT-style boundary): Nr(L) = {|x − μ| ≤ r}, width 2r • Special case (b), L: y = x (GHT-style boundary): Nr(L) = {|y − x| ≤ 2r} • The MVPT partition has the minimal width of r-neighborhood
|Nr(L)|: analytical & empirical • 2-d normal distribution: N(0, 1, 0, 1, −ρ), 0 ≤ ρ ≤ 1 • |NLV(r)| ∝ PLV(r) = P(|x| ≤ r | x ~ N(0, 1)) • Empirical results agree with the analysis
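Assuming independent coordinates (ρ = 0, our simplification of the slide's correlated model), the neighborhood probabilities have closed forms: for an axis-parallel boundary, P(|x| ≤ r) = erf(r/√2); for the diagonal boundary, y − x ~ N(0, 2), so P(|y − x| ≤ 2r) = erf(r). A sketch:

```python
import math

def p_axis_band(r):
    """P(|x| <= r) for x ~ N(0, 1): MVPT-style boundary x = 0."""
    return math.erf(r / math.sqrt(2))

def p_diag_band(r):
    """P(|y - x| <= 2r) for independent x, y ~ N(0, 1): GHT-style boundary y = x.
    y - x ~ N(0, 2), so this is erf(2r / (sqrt(2) * sqrt(2))) = erf(r)."""
    return math.erf(r)

for r in (0.1, 0.5, 1.0):
    print(r, p_axis_band(r), p_diag_band(r))
```

For every r > 0, erf(r) > erf(r/√2), so the GHT-style boundary traps more query mass than the axis-parallel one, consistent with the minimal-width claim for MVPT.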
Dimension rotation might not be helpful! • A counter example
Conclusions and Future work Conclusions • Distance-based indexing is a very general approach • The pivot space model establishes an analogy between distance-based indexing and high-dimensional indexing • There are two distinctions between them Future work • Multi-dimensional/statistical methods • Non-linear partition • Moving to cloud environments • Applications
Credit • Daniel P. Miranker, UT Austin • Willard L. Miranker, Yale University
Thank you! mao@szu.edu.cn http://nhpcc.szu.edu.cn/mao/eng