Reference-based Indexing of Sequence Databases

Jayendra Venkateswaran, Deepak Lachwani, Tamer Kahveci, Christopher Jermaine
University of Florida, Gainesville
VLDB 2006
www.cise.ufl.edu/~jgvenkat

Similarity Search
Given a threshold ε, find the sequences in the database S that are similar to the query sequence.
[Figure: a query is run against the sequence database S and returns the similar sequences si, sj, sk.]

Measure: Edit Distance
Edit operations: insert, delete and replace.
Edit distance is the minimum number of edit operations needed to transform one sequence into another.
Example (sequence length 12):
    P: ACGTACGTAC_GT
       | |||| ||| ||
    Q: A_GTACCTACCGT
3 edit operations: 2 insertions/deletions (the two gaps) and 1 replace.
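The definition above can be checked with a short dynamic-programming routine. This is a minimal sketch of the textbook algorithm, not code from the talk; the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(p: str, q: str) -> int:
    """Minimum number of insert, delete and replace operations turning p into q."""
    # prev[j] is the distance between the already processed prefix of p
    # and the first j characters of q; only two rows are kept in memory.
    prev = list(range(len(q) + 1))
    for i, pc in enumerate(p, start=1):
        curr = [i]  # turning i characters of p into the empty prefix of q
        for j, qc in enumerate(q, start=1):
            curr.append(min(
                prev[j] + 1,               # delete pc from p
                curr[j - 1] + 1,           # insert qc into p
                prev[j - 1] + (pc != qc),  # replace, free when the characters match
            ))
        prev = curr
    return prev[-1]


# The two sequences from the slide, with the gap symbols removed.
print(edit_distance("ACGTACGTACGT", "AGTACCTACCGT"))  # prints 3
```

The nested loop makes both the running time and, in the full-table variant, the space quadratic in the sequence length; keeping two rows reduces only the space.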

Edit Distance: Complexity
• Time and space complexity of computing the edit distance between two sequences of length n is O(n^2).
• Sequence database S with |S| = 100,000 sequences.
• One sequence comparison: 0.25 second.
• Time taken for a single query: 100,000 × 0.25 s = 25,000 s, about 7 hours.

Need for Indexing
• Select K sequences from the sequence database S as references.
• Pre-compute the reference-to-sequence distances.
• At query time, compare the query with the K references and then only with the remaining candidate set C.
• Goal: (K + |C|) << |S|.

Existing Methods
• Hierarchical methods:
    • VP-Tree (Yianilos, 1993)
    • MVP-Tree (Bozkaya et al., 1997)
    • M-Tree (Ciaccia et al., 1997)
    • Slim-Tree (Traina et al., 2000)
    • DF-Tree (Traina et al., 2002)
    • DBM-Tree (Vieira et al., 2004)
• Omni (Filho et al., 2001)
• Frequency Vector (Kahveci et al., 2004)

Reference-based Indexing: Reference Circle Including Query
• Sequences outside the reference circle (far from the reference) are pruned.
• Sequences close to the reference can also be pruned.
[Figure: database sequences, a reference sequence and the query, with the reference circle drawn so that it includes the query.]

Reference-based Indexing: Reference Circle Excluding Query
• Sequences inside the reference circle (close to the reference) are pruned.
[Figure: database sequences, a reference sequence and the query, with the reference circle drawn so that it excludes the query.]

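Reference-based Indexing: Bounds
Given a sequence s, a reference r and a query q, let d1 and d2 be the two distances measured from r, one to q and one to s. By the triangle inequality:
• Lower bound: the minimum distance between q and s with r as reference is |d1 - d2|.
• Upper bound: the maximum distance between q and s with r as reference is d1 + d2.
[Figure: database sequence, reference sequence and query, with d1, d2, the lower bound and the upper bound marked.]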
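The two bounds, and the two pruning cases they produce, can be written down directly. This is a small illustrative sketch; the function names `bounds` and `can_prune` and the sample numbers are assumptions made here, not part of the talk.

```python
def bounds(d_qr: int, d_rs: int) -> tuple:
    """Triangle-inequality bounds on the distance between q and s via one reference r.

    d_qr is the query-to-reference distance, d_rs the reference-to-sequence distance.
    """
    lower = abs(d_qr - d_rs)
    upper = d_qr + d_rs
    return lower, upper


def can_prune(d_qr: int, d_rs: int, eps: int) -> bool:
    """True when the lower bound already exceeds the query threshold eps."""
    lower, _ = bounds(d_qr, d_rs)
    return lower > eps


# Hypothetical distances, threshold eps = 5, query at distance 10 from the reference.
print(can_prune(10, 3, 5))   # sequence close to the reference: lower bound 7
print(can_prune(10, 16, 5))  # sequence far from the reference: lower bound 6
print(can_prune(10, 8, 5))   # lower bound 2, the sequence cannot be pruned
```

Both pruning cases come from the same lower bound: a sequence is pruned when it is much closer to the reference than the query is, or much farther away.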

Observations
• Two types of pruning:
    • Sequences close to references.
    • Sequences far from references.
• A good reference set should be able to use both kinds of pruning effectively.
• Each reference should prune some part of the database not pruned by other references.

Outline
• Selection of References
• Assignment of References
• Search Algorithm
• Experimental Results
• Conclusions

Our Contributions
• Selection of references:
    • Maximum Variance: select references whose distances to the other sequences in the database have high variance.
    • Maximum Pruning: a combinatorial approach to selecting the best reference set.
• Assignment of references:
    • Each sequence has a different set of references.

Selection of References: Maximum Variance (MV)
Basic idea: select references that have many sequences close to them and many sequences far from them, and that can therefore prune those sequences.
[Figure: database sequences with a good and a bad choice of reference.]

Selection of References: Maximum Variance (MV)
• Select references that have sequences both close to and far away from them.
• Such references have maximum variance in the distribution of their distances to the other sequences in the database.
• Each new reference should prune some part of the database not pruned by the existing set of references.

Maximum Variance: Algorithm
[Flow diagram with the following labels: Sequence Database; Random Subset of Sequences; Compute Distances; Variance of Distance Distributions; Sort; Candidate Reference Set; Remove Sequences Close to or Far away from New Reference.]
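One way to read the labels on this slide is sketched below. It is an interpretation under stated assumptions, not the authors' implementation: the function name `select_max_variance`, the parameters `sample_size`, `near` and `far`, and the exact rule for removing candidates close to or far from a new reference are all hypothetical. The demo uses Hamming distance on equal-length strings as a stand-in metric to keep the block short; any metric, including edit distance, can be passed as `dist`.

```python
import random
from statistics import pvariance


def select_max_variance(db, m, dist, sample_size=50, near=0, far=float("inf"), seed=0):
    """Pick up to m references, highest variance of distances first."""
    rng = random.Random(seed)
    # Distances are measured against a random subset, not the whole database.
    sample = rng.sample(db, min(sample_size, len(db)))

    # Score every sequence by the variance of its distances to the sample.
    scored = [(pvariance([dist(c, s) for s in sample]), c) for c in db]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    candidates = [c for _, c in scored]

    refs = []
    while candidates and len(refs) < m:
        r = candidates.pop(0)  # highest-variance candidate still available
        refs.append(r)
        # Assumption: candidates closer than `near` or farther than `far`
        # from the new reference are dropped from the candidate set.
        candidates = [c for c in candidates if near <= dist(r, c) <= far]
    return refs


def hamming(a, b):
    """Stand-in metric for the demo; assumes equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


db = ["AAAA", "AAAT", "AATT", "ATTT", "TTTT", "CCCC"]
print(select_max_variance(db, m=3, dist=hamming))
```

With the default `near` and `far` nothing is removed and the result is simply the top-m sequences by variance; tightening the two thresholds pushes later references away from the neighbourhood that earlier ones already cover.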

Maximum Variance: Example
[Figure: database sequences a to g, the maximum variance ordering and the selected reference sequences.]

Selection of References: Maximum Pruning (MP)
• Combinatorial approach to selecting the best reference set for a given query set.
• Select the reference set that prunes the most sequences over all queries.
• A sample query set Q’ that follows the actual query distribution is given.
• Sampling techniques reduce the complexity of this method.

Maximum Pruning: Algorithm
[Diagram shown in three build steps, with the following labels: Sequence Database; Sample Queries, Q’; Candidate References; GAINS; MAX(); Reference Set; "Repeat Until MAX() > 0".]
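Since only the labels of this diagram survive, the following is a hedged sketch of one plausible greedy reading: compute a gain for every candidate reference against the sample queries, add the candidate with the maximum gain, and stop when no candidate has a positive gain. The function names, the definition of gain as newly pruned (query, sequence) pairs, and the use of lower-bound pruning only are assumptions made here, not details confirmed by the talk.

```python
def pruned_pairs(ref, db, queries, eps, dist):
    """(query index, sequence index) pairs that ref alone prunes at threshold eps."""
    d_rs = [dist(ref, s) for s in db]
    pairs = set()
    for qi, q in enumerate(queries):
        d_qr = dist(q, ref)
        for si, d in enumerate(d_rs):
            if abs(d_qr - d) > eps:  # lower bound exceeds the threshold
                pairs.add((qi, si))
    return pairs


def select_max_pruning(candidates, db, queries, eps, m, dist):
    """Greedily build a reference set of at most m references."""
    cover = {i: pruned_pairs(c, db, queries, eps, dist)
             for i, c in enumerate(candidates)}
    chosen, covered = [], set()
    while cover and len(chosen) < m:
        # Gain of a candidate: pairs it prunes that no chosen reference prunes yet.
        best = max(cover, key=lambda i: len(cover[i] - covered))
        if len(cover[best] - covered) <= 0:
            break  # no remaining candidate improves the pruning
        chosen.append(candidates[best])
        covered |= cover.pop(best)
    return chosen


def hamming(a, b):
    """Stand-in metric for the demo; assumes equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


db = ["AAAA", "AAAT", "AATT", "ATTT", "TTTT", "CCCC"]
sample_queries = ["AAAA", "TTTT"]
print(select_max_pruning(db, db, sample_queries, eps=1, m=2, dist=hamming))
```

Evaluating every candidate against every sample query and every sequence is expensive, which is where the sampling mentioned on the earlier slide would come in.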

Maximum Pruning: Example
[Figure shown in two build steps: query q, database sequences a to f, the reference sequences, the reference set being built, and the sequences pruned by a.]

Outline
• Selection of References
• Assignment of References
• Search Algorithm
• Experimental Results
• Conclusions

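Assignment of References
• Same references for all sequences: select K sequences of the sequence database S as references and pre-compute the reference-to-sequence distances. A query is compared with the K references and the candidate set C, with (K + |C|) << |S|.
• Different references per sequence: increase the number of references to m and assign K references to each sequence. A query is compared with the m references and a smaller candidate set C’, with (m + |C’|) < (K + |C|) << |S|.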

Reference Assignment: Example
Number of references = 2.
[Figure: sample queries q1, q2, q3, sequences a to f, and the references assigned to sequence b.]

Search Algorithm
• Offline: pre-compute the sequence-to-reference distances for the reference set V.
• For a query q, compute the query-to-reference distances.
• For each sequence s in the sequence database S, compute the lower bounds (LB) and upper bounds (UB) over its references and take MAX(LB) and MIN(UB):
    • If ε < MAX(LB), add s to the pruned set.
    • If ε > MIN(UB), add s to the result set.
    • If MAX(LB) ≤ ε ≤ MIN(UB), add s to the candidate set.
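The three rules can be put together into a range search as sketched below. This is an illustrative sketch rather than the authors' code: the names `build_index` and `range_search` and the index layout (a list of reference and distance pairs per sequence) are assumptions, and the final step, comparing each candidate with the query directly, is assumed from the "Need for Indexing" slide.

```python
def edit_distance(p, q):
    """Textbook two-row dynamic program for edit distance."""
    prev = list(range(len(q) + 1))
    for i, pc in enumerate(p, 1):
        curr = [i]
        for j, qc in enumerate(q, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (pc != qc)))
        prev = curr
    return prev[-1]


def build_index(db, assigned, dist=edit_distance):
    """Pre-compute the distance from each sequence to each of its assigned references."""
    return [[(r, dist(s, r)) for r in refs] for s, refs in zip(db, assigned)]


def range_search(query, db, index, eps, dist=edit_distance):
    """Return (sequences within eps of query, number of direct comparisons made)."""
    d_q = {}  # query-to-reference distances, computed once per reference
    results, compared = [], 0
    for s, entries in zip(db, index):
        max_lb, min_ub = 0, float("inf")
        for r, d_rs in entries:
            if r not in d_q:
                d_q[r] = dist(query, r)
            max_lb = max(max_lb, abs(d_q[r] - d_rs))
            min_ub = min(min_ub, d_q[r] + d_rs)
        if eps < max_lb:
            continue              # pruned set
        if eps > min_ub:
            results.append(s)     # result set, no comparison needed
            continue
        compared += 1             # candidate set: assumed to be verified directly
        if dist(query, s) <= eps:
            results.append(s)
    return results, compared


db = ["ACGTACGT", "ACGTACGA", "TTTTTTTT", "ACGAACGT", "GGGGCCCC"]
refs = ["ACGTACGT", "TTTTTTTT"]
index = build_index(db, [refs] * len(db))  # demo: same references for every sequence
print(range_search("ACGTACGT", db, index, eps=1))
```

The demo assigns the same two references to every sequence for brevity; passing a different list per sequence in `assigned` gives the per-sequence variant without changing `range_search`.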

Experimental Setup
• Datasets:
    • DNA: alphabet size of 4 and 20,000 sequences.
    • Protein: alphabet size of 20 and 4,000 sequences of up to 500 amino acids.
    • Text: alphabet size of 36 and 8,000 sequences of length 100 each.
• Size of reference set, m = 200.
• Experiments:
    • Comparison of our methods:
        • Maximum Variance with same and different reference sets (MV-S and MV-D).
        • Maximum Pruning with same and different reference sets (MP-S and MP-D).
    • Comparison with other methods:
        • Frequency Vector (Kahveci et al., 2004).
        • Omni (Filho et al., 2001).
        • Others: M-Tree (Ciaccia et al., 1997), Slim-Tree (Traina et al., 2000), DBM-Tree (Vieira et al., 2004) and DF-Tree (Traina et al., 2002).

Comparison of Our Methods: DNA Dataset, k = 4

Comparison of Our Methods: DNA Dataset, Range = 8

Comparison with Other Methods: DNA Dataset, k = 16

Conclusion
• References selected by Maximum Variance and Maximum Pruning eliminate more database sequences than existing selection strategies.
• Assigning a different reference set to each sequence dramatically improves the performance.
• MP-D outperforms existing methods in almost all the experiments.

Thank You
Questions?
jgvenkat@cise.ufl.edu

Comparison with Other Methods: Protein Dataset, Query Range = 300

Assignment of References: Memory Limitations
• Main memory stores the pre-computed reference-to-sequence distances along with the references.
• For each [s, vi] pair (s ∈ S, vi ∈ V), store [i, ED(s, vi)], which takes 8 bytes.
• Given the available main memory B in bytes: B = 8KN + zm, where
    N: number of sequences in the database.
    K: number of references per sequence.
    z: size of each sequence in bytes.
    m: number of references in the reference set.
• Example: given B = 1 GB, N = 10 million, z = 100 and m = 1000, then K = 13.
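Solving B = 8KN + zm for K reproduces the example. The sketch below assumes that 1 GB means 2^30 bytes and that K is rounded down to a whole number of references; the function name `refs_per_sequence` is chosen here for illustration.

```python
def refs_per_sequence(B: int, N: int, z: int, m: int) -> int:
    """Largest whole K that satisfies 8*K*N + z*m <= B."""
    return (B - z * m) // (8 * N)


# Assumption: 1 GB is taken as 2**30 bytes.
print(refs_per_sequence(B=2**30, N=10_000_000, z=100, m=1000))  # prints 13
```

Almost all of the budget goes to the 8KN term: storing the m = 1000 references themselves takes only 100,000 bytes in this example.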
