Reference-based Indexing of Sequence Databases

Reference-based Indexing of Sequence Databases. Jayendra Venkateswaran, Deepak Lachwani, Tamer Kahveci, Christopher Jermaine. University of Florida-Gainesville, www.cise.ufl.edu/~jgvenkat. VLDB 2006.

Presentation Transcript


  1. Reference-based Indexing of Sequence Databases. Jayendra Venkateswaran, Deepak Lachwani, Tamer Kahveci, Christopher Jermaine. University of Florida-Gainesville, www.cise.ufl.edu/~jgvenkat. VLDB 2006.

  2. Similarity Search. Given a threshold ε, find the sequences in the sequence database S that are similar to the query sequence. [Figure: a query is issued against the sequence database S (sequences ..., si, sj, sk, ...), which returns the similar sequences.]

  3. Measure: Edit Distance. Edit operations: Insert, Delete and Replace. Edit distance is the minimum number of edit operations needed to transform one sequence into another. Example (sequence length 12):

       P: ACGTACGTAC_GT
          | |||| ||| ||
       Q: A_GTACCTACCGT

     3 edit operations transform P into Q: 1 deletion, 1 replacement and 1 insertion.
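
The edit distance defined on this slide can be computed with the standard dynamic-programming recurrence. The sketch below is not taken from the talk; the function name `edit_distance` is an illustrative assumption, and it uses unit cost for each insert, delete and replace.

```python
def edit_distance(p: str, q: str) -> int:
    """Edit (Levenshtein) distance with unit costs, keeping two DP rows."""
    # prev[j] holds the distance between the current prefix of p and q[:j].
    prev = list(range(len(q) + 1))
    for i, cp in enumerate(p, start=1):
        curr = [i]  # turning p[:i] into the empty string takes i deletions
        for j, cq in enumerate(q, start=1):
            cost = 0 if cp == cq else 1
            curr.append(min(prev[j] + 1,          # delete cp
                            curr[j - 1] + 1,      # insert cq
                            prev[j - 1] + cost))  # match or replace
        prev = curr
    return prev[-1]


# The two sequences from the slide's example.
print(edit_distance("ACGTACGTACGT", "AGTACCTACCGT"))  # prints 3
```

The nested loops make the quadratic running time visible; only the table's storage is reduced here, by keeping two rows instead of the full matrix.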

  4. Edit Distance: Complexity. Time and space complexity for computing the edit distance between two sequences of length n is O(n²). For a sequence database S with |S| = 100,000 and 0.25 seconds per sequence comparison, a single query takes about 7 hours.
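
The 7-hour figure follows directly from the two numbers on the slide, assuming a query is compared against every database sequence one after another. The variable names below are illustrative.

```python
num_sequences = 100_000        # |S| from the slide
seconds_per_comparison = 0.25  # cost of one edit-distance computation

total_seconds = num_sequences * seconds_per_comparison
total_hours = total_seconds / 3600
print(f"{total_seconds:.0f} s = {total_hours:.2f} h")  # 25000 s = 6.94 h
```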

  5. Need for Indexing. Select K sequences from the sequence database S as references and pre-compute the reference-to-sequence distances. A query is then compared only with the K references and with the sequences in the candidate set C, where (K + |C|) << |S|.

  6. Existing Methods • Hierarchical Methods • VP-Tree (Yianilos, 1993) • MVP-Tree (Bozkaya et al., 1997) • M-Tree (Ciaccia et al., 1997) • Slim-Tree (Traina et al., 2000) • DF-Tree (Traina et al., 2002) • DBM-Tree (Vieira et al., 2004) • Omni (Filho et al., 2001) • Frequency Vector (Kahveci et al., 2004)

  7. Reference-based Indexing: Reference Circle Including Query. [Figure: database sequences, a reference sequence and the query.]

  8. Reference-based Indexing: Reference Circle Including Query. Sequences outside the reference circle (far from the reference) are pruned. Sequences close to the reference can also be pruned. [Figure: database sequences, a reference sequence and the query.]

  9. Reference-based Indexing: Reference Circle Excluding Query. [Figure: database sequences, a reference sequence and the query.]

  10. Reference-based Indexing: Reference Circle Excluding Query. Sequences inside the reference circle (close to the reference) are pruned. [Figure: database sequences, a reference sequence and the query.]

  11. Reference-based Indexing: Bounds. Given a sequence s, reference r and query q, let d1 and d2 be the distances from r to q and from r to s. Lower Bound: minimum distance between q and s with r as reference, |d1 - d2|. Upper Bound: maximum distance between q and s with r as reference, d1 + d2. Both bounds follow from the triangle inequality. [Figure: database sequence, reference sequence and query, with d1, d2 and the two bounds marked.]
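
A minimal sketch of these bounds, assuming the distance is a metric (edit distance is one). The function names `bounds` and `combined_bounds` are illustrative and do not come from the talk. With several references, the tightest bounds are the largest lower bound and the smallest upper bound.

```python
from typing import Iterable, Tuple


def bounds(d1: float, d2: float) -> Tuple[float, float]:
    """Triangle-inequality bounds on d(q, s), given d1 = d(q, r) and d2 = d(r, s)."""
    return abs(d1 - d2), d1 + d2


def combined_bounds(pairs: Iterable[Tuple[float, float]]) -> Tuple[float, float]:
    """Tightest bounds over several references; pairs holds (d1, d2) per reference."""
    per_ref = [bounds(d1, d2) for d1, d2 in pairs]
    lower = max(lb for lb, _ in per_ref)
    upper = min(ub for _, ub in per_ref)
    return lower, upper


print(bounds(3, 10))                      # (7, 13)
print(combined_bounds([(3, 10), (6, 4)])) # (7, 10)
```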

  12. Observations • Two types of pruning: • Sequences close to references. • Sequences far from references. • A good reference set should be able to use both kinds of pruning effectively. • Each reference should prune some part of the database not pruned by other references.

  13. Outline • Selection of References • Reference Assignment • Search Algorithm • Experimental Results • Conclusions

  14. Our Contributions • Selection of References: • Maximum Variance Selection: choose references whose distances to the other sequences in the database have high variance. • Maximum Pruning: a combinatorial approach to selecting the best reference set. • Assignment of References: • Each sequence has a different set of references.

  15. Selection of References: Maximum Variance (MV). Basic idea: select references that have many sequences both close to and far from them, and can therefore prune those sequences. [Figure: a bad and a good reference among the database sequences.]

  16. Selection of References: Maximum Variance (MV) • Select references that have sequences both close to and far away from them. • Such references have the maximum variance in their distribution of distances to the other sequences in the database. • Each new reference prunes some part of the database not pruned by the existing set of references.

  17. Maximum Variance: Algorithm. Sequence Database => Random Subset of Sequences (Candidate Reference Set) => Compute Distances => Variance of Distance Distributions => Sort => Remove Sequences Close to or Far away from New Reference.
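
One possible reading of this flow is sketched below. It is an interpretation of the slide, not the authors' implementation: the function name, the `near` and `far` thresholds, and the choice to drop candidates (rather than database sequences) that lie close to or far from an already chosen reference are all assumptions made for illustration.

```python
from statistics import pvariance
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def select_max_variance(candidates: Sequence[T], sample: Sequence[T],
                        dist: Callable[[T, T], float],
                        k: int, near: float, far: float) -> List[T]:
    """Pick up to k references, preferring high variance of distances to sample."""
    # Rank candidates by the variance of their distances to the sample.
    ranked = sorted(candidates,
                    key=lambda c: pvariance([dist(c, s) for s in sample]),
                    reverse=True)
    refs: List[T] = []
    for c in ranked:
        if len(refs) == k:
            break
        # Assumed rule: a candidate close to or far from a chosen reference
        # is taken to be covered by that reference already, so skip it.
        if any(dist(c, r) <= near or dist(c, r) >= far for r in refs):
            continue
        refs.append(c)
    return refs


# Toy usage with integers and absolute difference standing in for
# sequences and edit distance.
toy = select_max_variance([0, 2, 5], list(range(11)),
                          lambda a, b: abs(a - b), k=2, near=2, far=10)
print(toy)  # [0, 5]
```

In a real setting `dist` would be the edit distance and `sample` a random subset of the database, as the slide indicates.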

  18. Maximum Variance: Example. [Figure: database sequences a to g, with the reference sequences chosen in maximum variance ordering.]

  19. Selection of References: Maximum Pruning (MP) • Combinatorial approach to selecting the best reference set for a given query set. • Select the reference set that prunes the most sequences over all queries. • A sample query set Q' that follows the actual query distribution is given. • Sampling techniques reduce the complexity of this method.

  20. Maximum Pruning: Algorithm. [Animation frames, slides 20 to 28: candidate references drawn from the sequence database are scored (GAINS) against the sample queries Q' to build the reference set.]

  28. Maximum Pruning: Algorithm. Compute the GAINS of the candidate references over the sequence database and the sample queries Q', add the candidate with the largest gain, MAX(), to the reference set, and repeat while MAX() > 0.
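
A brute-force sketch of a greedy selection in the spirit of these slides. It is an assumption-laden illustration, not the published algorithm: the gain of a candidate is taken to be the number of (query, sequence) pairs it newly resolves, where a pair counts as resolved when the lower bound exceeds ε (pruned) or the upper bound falls below ε (accepted without comparison). The function name and parameters are hypothetical, and the sampling optimizations mentioned on slide 19 are omitted.

```python
from typing import Callable, Dict, Hashable, List, Sequence, Set, Tuple


def greedy_max_pruning(database: Sequence, candidates: Sequence[Hashable],
                       queries: Sequence, dist: Callable[[object, object], float],
                       eps: float, m: int) -> List:
    """Greedily pick up to m references, each maximizing newly resolved pairs."""

    def resolved_by(ref) -> Set[Tuple[int, int]]:
        d_rs = [dist(ref, s) for s in database]
        pairs = set()
        for qi, q in enumerate(queries):
            d_qr = dist(q, ref)
            for si, d in enumerate(d_rs):
                # Lower bound above eps: pruned. Upper bound below eps: a result.
                if abs(d_qr - d) > eps or d_qr + d < eps:
                    pairs.add((qi, si))
        return pairs

    coverage: Dict = {c: resolved_by(c) for c in candidates}
    refs: List = []
    covered: Set[Tuple[int, int]] = set()
    while len(refs) < m and coverage:
        best = max(coverage, key=lambda c: len(coverage[c] - covered))
        if not coverage[best] - covered:
            break  # the maximum gain is zero, so stop
        refs.append(best)
        covered |= coverage.pop(best)
    return refs


# Toy usage with integers and absolute difference as the metric.
chosen = greedy_max_pruning([0, 1, 2, 10, 11, 12], [0, 6, 12], [1, 11],
                            lambda a, b: abs(a - b), eps=2, m=3)
print(chosen)  # [0, 12]
```

In the toy run, reference 6 sits midway between the two clusters and resolves nothing, so the loop stops after two references even though m allows three.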

  29. Maximum Pruning Example. [Figure: query q, database sequences, reference sequences a to f, the reference set, and the sequences pruned by a.]

  30. Maximum Pruning Example. [Figure, next frame: query q, database sequences, reference sequences a to f, the reference set, and the sequences pruned by a.]

  31. Outline • Selection of References • Assignment of References • Search Algorithm • Experimental Results • Conclusions

  32. Assignment of References. Baseline: select K sequences from the sequence database S as references and pre-compute the reference-to-sequence distances; a query is compared with the K references and the candidate set C, where (K + |C|) << |S|. Improvement: increase the number of references to m and assign K of them to each sequence; the resulting candidate set C' satisfies (m + |C'|) < (K + |C|) << |S|.

  33. Reference Assignment: Example. Number of references = 2. [Figure: queries q1, q2, q3 and sequences a to f, with the references for b highlighted.]

  34. Outline • Selection of References • Reference Assignment • Search Algorithm • Experimental Results • Conclusions

  35. Search Algorithm. Pre-compute the sequence-reference distances for the sequence database S and the reference set V. For a query q, compute the query-reference distances; then, for each sequence s, compute the lower and upper bounds over its references and take MAX(LB) and MIN(UB). • If ε < MAX(LB), add s to the Pruned set. • If ε > MIN(UB), add s to the Result set. • If MAX(LB) ≤ ε ≤ MIN(UB), add s to the Candidate set.
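
The three cases on this slide can be sketched as a range search. The code below is an illustration under assumptions rather than the authors' code: every sequence uses the same reference set, "similar" is taken to mean distance ≤ ε, candidates are verified with a direct distance computation, and the names `build_index` and `range_search` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Dist = Callable[[object, object], float]


def build_index(database: Sequence, refs: Sequence, dist: Dist) -> List[List[float]]:
    """Pre-compute the sequence-reference distances (done once, offline)."""
    return [[dist(s, r) for r in refs] for s in database]


def range_search(query, database: Sequence, refs: Sequence,
                 index: List[List[float]], dist: Dist,
                 eps: float) -> Tuple[List, int]:
    """Return the sequences within eps of query and the number verified directly."""
    q_dists = [dist(query, r) for r in refs]
    results, candidates = [], []
    for s, s_dists in zip(database, index):
        max_lb = max(abs(dq - ds) for dq, ds in zip(q_dists, s_dists))
        min_ub = min(dq + ds for dq, ds in zip(q_dists, s_dists))
        if eps < max_lb:
            continue               # pruned: s cannot be within eps
        if eps > min_ub:
            results.append(s)      # accepted without computing dist(query, s)
        else:
            candidates.append(s)   # bounds are inconclusive
    # Only the candidate set needs a real distance computation.
    results.extend(s for s in candidates if dist(query, s) <= eps)
    return results, len(candidates)


# Toy usage with integers and absolute difference as the metric.
db = [0, 1, 2, 10, 11, 12]
refs = [0, 12]
metric = lambda a, b: abs(a - b)
found, verified = range_search(1, db, refs, build_index(db, refs, metric), metric, eps=2)
print(sorted(found), verified)  # [0, 1, 2] 2
```

In the toy run only two of the six sequences reach the verification step; the rest are settled by the bounds alone.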

  36. Outline • Selection of References • Reference Assignment • Search Algorithm • Experimental Results • Conclusions

  37. Experimental Setup • Datasets • DNA: alphabet size of 4 and 20000 sequences. • Protein: alphabet size of 20 and 4000 sequences of up to 500 amino acids. • Text: alphabet size of 36 and 8000 sequences of length 100 each. • Size of reference set, m = 200. • Experiments • Comparison of our methods • Maximum Variance with same and different reference sets (MV-S and MV-D). • Maximum Pruning with same and different reference sets (MP-S and MP-D). • Comparison with other methods • Frequency Vector (Kahveci et al., 2004). • Omni (Filho et al., 2001). • Others: M-Tree (Ciaccia et al., 1997), Slim-Tree (Traina et al., 2000), DBM-Tree (Vieira et al., 2004) and DF-Tree (Traina et al., 2002).

  38. Comparison of Our Methods: DNA Dataset, k = 4. [Plot]

  39. Comparison of Our Methods: DNA Dataset, Range = 8. [Plot]

  40. Comparison with Other Methods: DNA Dataset, k = 16. [Plot]

  41. Conclusion • References selected by Maximum Variance and Maximum Pruning eliminate more database sequences than existing selection strategies. • Assigning a different reference set to each sequence dramatically improves performance. • MP-D outperforms existing methods in almost all the experiments.

  42. Thank You. Questions? jgvenkat@cise.ufl.edu

  43. Comparison with Other Methods: Protein Dataset Query Range = 300

  44. Assignment of References: Memory Limitations • Main memory stores the pre-computed reference-to-sequence distances along with the references. • For each [s, vi] pair (s ∈ S, vi ∈ V), store [i, ED(s, vi)] (takes 8 bytes). • Given the available main memory in bytes, B: B = 8KN + zm, where N is the number of sequences in the database, K is the number of references per sequence, z is the size of each sequence in bytes, and m is the number of references in the reference set. • Example: Given B = 1 GB, N = 10 million, z = 100 and m = 1000, then K = 13.
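
Solving B = 8KN + zm for K reproduces the example, assuming 1 GB means 2^30 bytes (with 10^9 bytes the same formula gives K = 12). The function name and parameter names below are illustrative.

```python
def refs_per_sequence(budget_bytes: int, num_sequences: int,
                      seq_bytes: int, num_refs: int, pair_bytes: int = 8) -> int:
    """Largest K satisfying pair_bytes * K * N + z * m <= B."""
    return (budget_bytes - seq_bytes * num_refs) // (pair_bytes * num_sequences)


# B = 1 GB (taken as 2**30 bytes), N = 10 million, z = 100, m = 1000.
print(refs_per_sequence(2**30, 10_000_000, 100, 1000))  # 13
```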
