
Designing Multiple Simultaneous Seeds for DNA Similarity Search

Designing Multiple Simultaneous Seeds for DNA Similarity Search. Yanni Sun, Jeremy Buhler, Washington University in Saint Louis. Outline: problem of multi-seed design; methods (greedy covering algorithm, computing conditional match probabilities); experiments and results.


Presentation Transcript


  1. Designing Multiple Simultaneous Seeds for DNA Similarity Search • Yanni Sun, Jeremy Buhler • Washington University in Saint Louis

  2. Outline • Problem of multi-seed design • Methods • Greedy covering algorithm • Compute conditional match probabilities • Experiments and results • Conclusion and future work WashU. Laboratory for Computational Genomics

  3. Sequence Alignment • Functional regions are conserved despite DNA mutations over time • A conserved region can be aligned with a high score • Exact solution: dynamic programming (DP), with time complexity O(MN) for sequences of lengths M and N • Fast but heuristic solution: seeded alignment algorithm

  4. Seeded Alignment Algorithm • BLAST is the most popular tool • Step 1: find a word match • Step 2: extend the match to find the high-similarity pair • [Figure: the sequence pair TAGGACCTAACC / GACCACCTTTT, shown once for each step]

  5. Seed and Similarity • Example of a similarity and a single seed:
       tgcagaaatgcagaggca
       | || |    | ||||
       tacacaggcaccgaggag
     Similarity: 101101000010111100 • Seed: 11*1, weight = 3, span = 4 • The seed detects (matches) this similarity.
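The example on this slide can be reproduced with a short sketch. The helper names `similarity_string` and `seed_hits` are illustrative and do not come from the slides. The code assumes an ungapped alignment of two equal-length sequences, with `1` in a seed meaning a required match and `*` a don't-care position.

```python
def similarity_string(a: str, b: str) -> str:
    """Return a 0/1 string: 1 where the aligned bases match, 0 otherwise."""
    if len(a) != len(b):
        raise ValueError("an ungapped alignment needs equal-length sequences")
    return "".join("1" if x == y else "0" for x, y in zip(a, b))


def seed_hits(seed: str, similarity: str) -> list:
    """Return every 0-based offset at which the seed matches the similarity."""
    span = len(seed)
    return [
        i
        for i in range(len(similarity) - span + 1)
        if all(s == "*" or c == "1" for s, c in zip(seed, similarity[i:i + span]))
    ]


# The two sequences shown on the slide.
sim = similarity_string("tgcagaaatgcagaggca", "tacacaggcaccgaggag")
hits = seed_hits("11*1", sim)
print(sim, hits)
```

On the slide's sequences this yields the similarity string 101101000010111100, and the seed 11*1 matches at 0-based offsets 2 and 12, so the seed detects the similarity.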

  6. Seed Choice is Important • [Figure: seed patterns drawn as rows of 1s; labels: "Significant alignment", "Seed match"]

  7. Seed Design: Previous Work • Traditional seed: a contiguous word (e.g. 11111111111) • Discontiguous patterns of matching bases: [CR1993]; [MTL’02], e.g. {111010010100110111} • Our work on a single discontiguous seed: [BKS’03]

  8. Multiple Simultaneous Seeds • Multiple simultaneous seeds are defined as a set of seeds: ∏ = {seed_1, seed_2, …, seed_i, …, seed_n} • ∏ detects a similarity if at least one of the component seeds detects the similarity • Example: the simultaneous seeds {11*1, 1*11} detect the similarities 100110100001, 1000010110001, and 1101001011001
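The "at least one component seed" rule can be checked directly on the three example similarities. This is a minimal sketch; the function names `detects` and `seed_set_detects` are made up for illustration.

```python
def detects(seed, similarity):
    """True if the seed matches the 0/1 similarity string at some offset."""
    span = len(seed)
    return any(
        all(s == "*" or c == "1" for s, c in zip(seed, similarity[i:i + span]))
        for i in range(len(similarity) - span + 1)
    )


def seed_set_detects(seeds, similarity):
    """A seed set detects a similarity if any one of its seeds does."""
    return any(detects(seed, similarity) for seed in seeds)


seeds = ["11*1", "1*11"]
examples = ["100110100001", "1000010110001", "1101001011001"]
# For each similarity, record which component seeds are responsible.
responsible = {sim: [s for s in seeds if detects(s, sim)] for sim in examples}
for sim, which in responsible.items():
    print(sim, seed_set_detects(seeds, sim), which)
```

The first similarity is detected only by 11*1, the second only by 1*11, and the third by both, which is why the pair covers more similarities than either seed alone.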

  9. Multi-seed Design – Balance Sensitivity with Specificity • Sensitivity = A / (biologically meaningful alignments) • Specificity = A / (seed matches) • A is the overlap of the two sets in the slide's diagram: biologically meaningful alignments that are also seed matches • Increase sensitivity: • Decrease the weight of a single seed • Use multiple seeds • Both methods hurt specificity • Hypothesis: a set of multiple seeds gives a better tradeoff of sensitivity vs. specificity than a single seed

  10. Our Work – Design Multiple Simultaneous Seeds Efficiently • Use a new local search method to optimize the seed set • Design an efficient algorithm to calculate the conditional match probability • Verify empirically that multiple simultaneous seeds have a better tradeoff of sensitivity vs. specificity

  11. Multi-seed Design Problem • Input: • Ungapped alignments sampled from two genomic DNA sequences • Resource constraints on the seeds: weight, span, number • Goal: find a set of seeds ∏ that maximizes the detection probability Pr[∏ detects S] • Pr[∏ detects S] = Pr[(seed_1 detects S) or (seed_2 detects S) or … or (seed_n detects S)]

  12. Outline • Problem of multi-seed Design • Methods • Greedy covering algorithm • Compute conditional match probabilities • Experiments and results • Conclusion and future work

  13. Computing Match Probability for Specified Seeds [BKS ’03] • Learn a kth-order Markov model M from the similarities • Build a DFA that accepts only strings containing a match to the given seeds • Use DP to compute the probability that the DFA accepts a string drawn at random from M
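A minimal sketch of the DP step, under simplifying assumptions that are not from the slides: the model is 0th-order (each position matches independently with probability `p`) rather than a kth-order Markov model, and the automaton is left unminimized, with one state per recent history of up to span − 1 positions plus an absorbing "hit" state. The names `ends_with_match` and `match_probability` are hypothetical.

```python
def ends_with_match(seeds, window):
    """True if some seed matches the window ending at its last position."""
    for seed in seeds:
        if len(seed) <= len(window):
            tail = window[-len(seed):]
            if all(s == "*" or c == "1" for s, c in zip(seed, tail)):
                return True
    return False


def match_probability(seeds, length, p):
    """Pr[at least one seed matches a random 0/1 string of the given length]."""
    keep = max(len(s) for s in seeds) - 1  # history needed to spot a new match
    states = {"": 1.0}  # recent history -> probability mass with no hit so far
    hit = 0.0           # mass already absorbed in the accepting state
    for _ in range(length):
        nxt = {}
        for hist, mass in states.items():
            for bit, pr in (("1", p), ("0", 1.0 - p)):
                window = hist + bit
                if ends_with_match(seeds, window):
                    hit += mass * pr
                else:
                    new_hist = window[-keep:] if keep else ""
                    nxt[new_hist] = nxt.get(new_hist, 0.0) + mass * pr
        states = nxt
    return hit


print(match_probability(["11*1"], 64, 0.7))
print(match_probability(["11*1", "1*11"], 64, 0.7))
```

The work grows with the number of reachable histories times the string length, not with the number of possible strings, which is the point of running the DP over automaton states.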

  14. Seek the Locally Optimal Set of Seeds • Original local search • Greedy covering algorithm – a faster local search strategy • Efficient computation of conditional match probability

  15. Find Optimal Set of Seeds by Original Local Search • Seed space with span <= 8, weight = 3 • Seed sets shown in the figure, with their detection probabilities:
        {1***1*1, 1*****11}  Pr = 0.75
        {1**1**1, 1*****11}  Pr = 0.67
        {1****11, 1*****11}  Pr = 0.71
        {1*1***1, 1*****11}  Pr = 0.70

  16. Greedy Covering Algorithm • Design 3 simultaneous seeds {s1, s2, s3}:
        s1 = argmax_x Pr(x)
        s2 = argmax_x Pr(x | ~s1)
        s3 = argmax_x Pr(x | ~{s1, s2})
      • [Figure: the similarity space, with regions for the similarities detected by s1, by s2, and by s3]
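The greedy rule can be sketched on a toy similarity space. Everything concrete here is an assumption made for illustration: similarities are all 0/1 strings of a short fixed length, weighted by an i.i.d. match probability `p`, and probabilities are computed by exhaustive enumeration instead of the DFA or Monte Carlo methods on the later slides. Each round picks the seed with the largest Pr(x and not chosen), which has the same argmax as Pr(x | ~chosen) because the denominator does not depend on x.

```python
from itertools import combinations, product


def detects(seed, sim):
    span = len(seed)
    return any(
        all(s == "*" or c == "1" for s, c in zip(seed, sim[i:i + span]))
        for i in range(len(sim) - span + 1)
    )


def candidate_seeds(weight, max_span):
    """All seeds of the given weight (>= 2) that start and end with '1'."""
    seeds = []
    for span in range(weight, max_span + 1):
        for inner in combinations(range(1, span - 1), weight - 2):
            ones = {0, span - 1, *inner}
            seeds.append("".join("1" if i in ones else "*" for i in range(span)))
    return seeds


def greedy_cover(n_seeds, weight, max_span, length, p):
    # Toy similarity space: every 0/1 string with its i.i.d. probability.
    space = {}
    for bits in product("01", repeat=length):
        sim = "".join(bits)
        k = sim.count("1")
        space[sim] = p ** k * (1 - p) ** (length - k)
    candidates = candidate_seeds(weight, max_span)
    chosen, covered = [], 0.0
    for _ in range(n_seeds):
        # Probability mass each unused seed would newly cover.
        gain = {
            x: sum(w for sim, w in space.items() if detects(x, sim))
            for x in candidates
            if x not in chosen
        }
        best = max(gain, key=gain.get)
        chosen.append(best)
        covered += gain[best]
        # Drop what is now covered, so the next round is conditioned on ~chosen.
        space = {sim: w for sim, w in space.items() if not detects(best, sim)}
    return chosen, covered


seeds, prob = greedy_cover(n_seeds=3, weight=3, max_span=6, length=10, p=0.7)
print(seeds, round(prob, 4))
```

Each round searches only the single-seed space, which is what makes this cheaper than a local search over whole seed sets.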

  17. Calculate Conditional Match Probabilities • Challenge: how to calculate the conditional probability efficiently? • Seeds with small span: exact computation via DFAs • Seeds with large span: Monte Carlo estimation
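For the Monte Carlo case, one plausible estimator is rejection sampling: draw similarities from the model, discard those already detected by the chosen seeds, and count how often the candidate seed hits the rest. The sketch below assumes a first-order Markov chain with made-up transition probabilities; the slides do not specify the sampler, and all function and parameter names are hypothetical.

```python
import random


def detects(seed, sim):
    span = len(seed)
    return any(
        all(s == "*" or c == "1" for s, c in zip(seed, sim[i:i + span]))
        for i in range(len(sim) - span + 1)
    )


def sample_similarity(length, p_first, p_next, rng):
    """Draw a 0/1 string from a first-order Markov chain.

    p_first is Pr(first position matches); p_next[b] is
    Pr(next position matches | current position is b).
    """
    bits = ["1" if rng.random() < p_first else "0"]
    while len(bits) < length:
        bits.append("1" if rng.random() < p_next[bits[-1]] else "0")
    return "".join(bits)


def conditional_match_estimate(x, chosen, n_samples, length, rng,
                               p_first=0.7, p_next=None):
    """Estimate Pr(x detects S | no seed in chosen detects S)."""
    p_next = p_next or {"1": 0.8, "0": 0.5}  # illustrative values only
    kept = hits = 0
    for _ in range(n_samples):
        sim = sample_similarity(length, p_first, p_next, rng)
        if any(detects(s, sim) for s in chosen):
            continue  # rejected: already covered by the chosen seeds
        kept += 1
        hits += detects(x, sim)
    if kept == 0:
        raise ValueError("no sample survived the conditioning; draw more samples")
    return hits / kept


rng = random.Random(0)
print(conditional_match_estimate("1*11", ["11*1"], 5000, 12, rng))
```

Rejection becomes wasteful once the chosen seeds already detect most similarities, because few samples survive the conditioning; the estimate's variance then depends on the number of samples kept, not the number drawn.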

  18. Calculate Conditional Match Probability via DFA • Pr(x | ~s) = Pr(x and ~s) / Pr(~s), where s stands for the seeds already chosen • Build the DFA corresponding to (x and ~s) by using cross product and complementation of DFAs • Efficiency: in the process of local search to find the optimal single seed x, Pr(~s) can be precomputed
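As a worked step, the conditional probability can also be written purely in terms of seed-set detection probabilities. This rewrite is an equivalent identity added for illustration, not the slide's construction, which builds the product DFA directly. Here S denotes the set of seeds already chosen, Pr(S) the probability that S detects a random similarity, and \lnot corresponds to the slides' "~".

```latex
\begin{align*}
\Pr(x \mid \lnot S)
  &= \frac{\Pr(x \land \lnot S)}{\Pr(\lnot S)} \\
  &= \frac{\Pr(S \cup \{x\}) - \Pr(S)}{1 - \Pr(S)}
\end{align*}
```

The second line follows because the event "S ∪ {x} detects" splits into the disjoint events "S detects" and "x detects but S does not". The denominator depends only on S, which is why it can be computed once and reused for every candidate x.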

  19. Outline • Problem of multi-seed design • Methods • Greedy covering algorithm • Compute conditional match probabilities • Experiments and results • Conclusion and future work

  20. Greedy Covering vs. Original Local Search • [Figure: detection probability of the two methods]

  21. Greedy Covering is Much Faster • With n = 5 seeds, on the same hardware platform (P4): • Greedy covering needs 20 minutes • The original local search needs 2.4 hours

  22. Experimental Setup • The ungapped alignments are sampled uniformly from human and mouse syntenies • For a specified seed set: • Sensitivity: the number of significant gapped alignments found by our BLAST-like alignment tool • False positive rate: approximated by the number of seed matches

  23. Results: Verify the Hypothesis on Noncoding Sequences

  24. Summary of Contributions • Efficient algorithms to design multiple simultaneous seeds at reasonable cost • Empirical verification: multiple simultaneous seeds have a better tradeoff between sensitivity and specificity

  25. Future Work • Design a better evaluation platform for different seeds • Investigate utility of seeds in multiple sequence alignment

  26. Acknowledgements • Dr. Jeremy Buhler (advisor), Ben Westover, Rachel Nordgren, Joseph Lancaster, and Christopher Swope • Laboratory for Computational Genomics at Washington University in Saint Louis • http://www.cse.wustl.edu/~jbuhler/mandala
