Designing Multiple Simultaneous Seeds for DNA Similarity Search. Yanni Sun , Jeremy Buhler Washington University in Saint Louis. Outline. Problem of multi-seed design Methods Greedy covering algorithm Compute conditional match probabilities Experiments and results
Designing Multiple Simultaneous Seeds for DNA Similarity Search Yanni Sun, Jeremy Buhler Washington University in Saint Louis
Outline
• Problem of multi-seed design
• Methods
• Greedy covering algorithm
• Compute conditional match probabilities
• Experiments and results
• Conclusion and future work
WashU. Laboratory for Computational Genomics
Sequence Alignment
• Functional regions are conserved despite DNA mutations over time
• A conserved region can be aligned with a high score
• Exact solution: dynamic programming (DP); time complexity: O(MN)
• Fast but heuristic solution: seeded alignment algorithm
Seeded Alignment Algorithm
• BLAST is the most popular tool.
• Step 1: find a word match.
• Step 2: extend the match to find a high-similarity pair.
Example sequences:
TAGGACCTAACC
GACCACCTTTT
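The two steps can be sketched on the example sequences from this slide. This is a minimal illustration, not BLAST itself: the function names `word_matches` and `extend_exact` are hypothetical, the word length `k = 4` is an arbitrary choice, and the extension here simply grows the match while bases are identical, whereas a real tool extends by alignment score.

```python
from collections import defaultdict


def word_matches(query, subject, k):
    """Step 1: return sorted (query_pos, subject_pos) pairs sharing an exact k-mer."""
    index = defaultdict(list)
    for j in range(len(subject) - k + 1):
        index[subject[j:j + k]].append(j)
    hits = []
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            hits.append((i, j))
    return sorted(hits)


def extend_exact(query, subject, i, j, k):
    """Step 2 (simplified): grow a word match in both directions while bases agree.

    Returns (query_start, subject_start, length). Score-based extension, as used
    by real tools, is deliberately left out of this sketch.
    """
    start_i, start_j = i, j
    while start_i > 0 and start_j > 0 and query[start_i - 1] == subject[start_j - 1]:
        start_i -= 1
        start_j -= 1
    end_i, end_j = i + k, j + k
    while end_i < len(query) and end_j < len(subject) and query[end_i] == subject[end_j]:
        end_i += 1
        end_j += 1
    return start_i, start_j, end_i - start_i


if __name__ == "__main__":
    q, s = "TAGGACCTAACC", "GACCACCTTTT"
    for i, j in word_matches(q, s, 4):
        print((i, j), extend_exact(q, s, i, j, 4))
```

Positions are 0-based. With `k = 4` the two sequences share the words GACC and ACCT.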
Seed and Similarity
• Example of a similarity and a single seed
tgcagaaatgcagaggca
| || |    | ||||
tacacaggcaccgaggag
Similarity: 101101000010111100 (1 = match, 0 = mismatch)
Seed: 11*1, weight = 3, span = 4 (1 = position that must match, * = don't care)
The seed detects (matches) this similarity.
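The definitions on this slide can be checked with a short sketch. The function names are hypothetical; the encoding (similarity as a 0/1 string, seed as a string of `1` and `*`) follows the slide.

```python
def seed_weight(seed):
    """Weight = number of positions that must match."""
    return seed.count("1")


def seed_span(seed):
    """Span = total length of the seed pattern."""
    return len(seed)


def match_positions(seed, similarity):
    """Return every 0-based offset at which the seed matches the similarity."""
    required = [k for k, c in enumerate(seed) if c == "1"]
    return [
        i
        for i in range(len(similarity) - len(seed) + 1)
        if all(similarity[i + k] == "1" for k in required)
    ]


if __name__ == "__main__":
    seed = "11*1"
    similarity = "101101000010111100"
    print(seed_weight(seed), seed_span(seed), match_positions(seed, similarity))
```

A seed detects a similarity when `match_positions` returns at least one offset.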
Seed Choice is Important
[Figure: seed matches compared with significant alignments]
Seed Design: Previous Work
• Traditional seed: word (e.g. 11111111111)
• Discontiguous patterns of matching bases: [CR1993]; [MTL’02], e.g. {111010010100110111}
• Our work on a single discontiguous seed: [BKS’03]
Multiple Simultaneous Seeds
• Multiple simultaneous seeds are defined as a set of seeds.
• ∏ = {seed_1, seed_2, …, seed_i, …, seed_n}
• ∏ detects a similarity if at least one of the component seeds detects the similarity
• Example: simultaneous seeds {11*1, 1*11} detect similarities 100110100001, 1000010110001, 1101001011001
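The "at least one component seed" rule can be sketched as follows, using the seed set and the three similarities from the example. The function names are hypothetical.

```python
def detects(seed, similarity):
    """True if the seed matches the similarity at some offset."""
    required = [k for k, c in enumerate(seed) if c == "1"]
    return any(
        all(similarity[i + k] == "1" for k in required)
        for i in range(len(similarity) - len(seed) + 1)
    )


def detecting_seeds(seeds, similarity):
    """List the component seeds that detect the similarity."""
    return [seed for seed in seeds if detects(seed, similarity)]


def set_detects(seeds, similarity):
    """A seed set detects a similarity if at least one component seed does."""
    return any(detects(seed, similarity) for seed in seeds)


if __name__ == "__main__":
    seeds = ["11*1", "1*11"]
    for sim in ["100110100001", "1000010110001", "1101001011001"]:
        print(sim, set_detects(seeds, sim), detecting_seeds(seeds, sim))
```

In this example the first similarity is detected only by 11*1, the second only by 1*11, and the third by both, so neither seed alone detects all three.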
Multi-seed Design: Balance Sensitivity with Specificity
• A = biologically meaningful alignments that also contain a seed match (the overlap of the two sets in the figure)
• Sensitivity = A / biologically meaningful alignments
• Specificity = A / seed matches
• Ways to increase sensitivity:
  • Decrease the weight of a single seed
  • Use multiple seeds
• Both methods hurt specificity
• Hypothesis: a set of multiple seeds offers a better tradeoff between sensitivity and specificity than a single seed
Our Work: Design Multiple Simultaneous Seeds Efficiently
• Use a new local search method to optimize the seed set
• Design an efficient algorithm to calculate conditional match probability
• Empirical verification that multiple simultaneous seeds have a better tradeoff of sensitivity vs. specificity
Multi-seed Design Problem
• Input:
  • Ungapped alignments sampled from two genomic DNA sequences
  • Resource constraints on seeds: weight, span, number
• Goal: find a set of seeds ∏ that maximizes the detection probability Pr[∏ detects S].
• Pr[∏ detects S] = Pr[(seed_1 detects S) or (seed_2 detects S) or … or (seed_n detects S)]
Computing Match Probability for Specified Seeds [BKS’03]
• Learn a kth-order Markov model M from similarities.
• Build a DFA that accepts only strings containing a match to the given seeds.
• Use DP to compute the probability that the DFA accepts a string chosen randomly from model M.
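The DP step can be illustrated with a simplified sketch. Two assumptions differ from the slide and are made only to keep the example short: the model is an i.i.d. (0th-order) model in which each position matches with probability `p`, rather than a learned kth-order Markov model, and the automaton state is simply the last (max span − 1) bits instead of a minimized DFA. The function name `hit_probability` is hypothetical.

```python
from collections import defaultdict


def hit_probability(seeds, length, p):
    """Probability that a random 0/1 similarity of the given length is detected
    by at least one seed, assuming each position is '1' independently with
    probability p (a simplifying assumption, not the slide's Markov model)."""
    keep = max(len(s) for s in seeds) - 1  # bits of history the automaton needs
    required = [[k for k, c in enumerate(s) if c == "1"] for s in seeds]

    def ends_with_match(window):
        # True if some seed matches with its last position at the end of window.
        for seed, req in zip(seeds, required):
            if len(window) >= len(seed):
                tail = window[-len(seed):]
                if all(tail[k] == "1" for k in req):
                    return True
        return False

    dist = {"": 1.0}  # probability mass of each non-accepting state
    hit = 0.0         # mass absorbed by the accepting state
    for _ in range(length):
        nxt = defaultdict(float)
        for state, prob in dist.items():
            for bit, pb in (("1", p), ("0", 1.0 - p)):
                window = state + bit
                if ends_with_match(window):
                    hit += prob * pb
                else:
                    nxt[window[-keep:] if keep > 0 else ""] += prob * pb
        dist = nxt
    return hit


if __name__ == "__main__":
    print(hit_probability(["11*1", "1*11"], 64, 0.7))
```

The running time grows with the number of states, which is at most 2^(max span − 1) in this sketch, times the similarity length.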
Seek the Locally Optimal Set of Seeds
• Original local search
• Greedy covering algorithm: a faster local search strategy
• Efficient computation of conditional match probability
Find Optimal Set of Seeds by Original Local Search
Seed space with span <= 8, weight = 3. Seed sets shown in the figure:
• {1***1*1, 1*****11}: Pr = 0.75
• {1**1**1, 1*****11}: Pr = 0.67
• {1****11, 1*****11}: Pr = 0.71
• {1*1***1, 1*****11}: Pr = 0.70
Greedy Covering Algorithm
Design 3 simultaneous seeds {s1, s2, s3}, where Pr(x) is the probability that seed x detects a similarity and ~s means that s does not detect it:
• s1 = argmax_x Pr(x)
• s2 = argmax_x Pr(x | ~s1)
• s3 = argmax_x Pr(x | ~{s1, s2})
[Figure: similarity space, with regions for the similarities detected by s1, s2, and s3]
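A minimal sketch of the greedy covering idea follows, assuming probabilities are replaced by empirical counts over a small sample of similarities (the slides compute them from a model instead). Because the denominator Pr(~chosen) is the same for every candidate x, maximizing Pr(x | ~chosen) is equivalent to maximizing the number of still-undetected similarities that x detects. The names `enumerate_seeds` and `greedy_cover`, and the restriction to seeds that start and end with 1, are assumptions of this sketch.

```python
from itertools import combinations


def detects(seed, similarity):
    """True if the seed matches the similarity at some offset."""
    required = [k for k, c in enumerate(seed) if c == "1"]
    return any(
        all(similarity[i + k] == "1" for k in required)
        for i in range(len(similarity) - len(seed) + 1)
    )


def enumerate_seeds(max_span, weight):
    """All seeds of the given weight (>= 2) with span <= max_span that start
    and end with '1'."""
    if weight < 2:
        raise ValueError("this sketch assumes weight >= 2")
    seeds = []
    for span in range(weight, max_span + 1):
        for ones in combinations(range(1, span - 1), weight - 2):
            chars = ["*"] * span
            chars[0] = chars[-1] = "1"
            for k in ones:
                chars[k] = "1"
            seeds.append("".join(chars))
    return seeds


def greedy_cover(candidates, sample, n):
    """Pick up to n seeds; each pick maximizes coverage of undetected similarities."""
    uncovered = list(sample)
    chosen = []
    for _ in range(n):
        if not uncovered:
            break  # everything in the sample is already detected
        remaining = [c for c in candidates if c not in chosen]
        best = max(remaining, key=lambda x: sum(detects(x, s) for s in uncovered))
        chosen.append(best)
        uncovered = [s for s in uncovered if not detects(best, s)]
    return chosen


if __name__ == "__main__":
    sample = ["100110100001", "1000010110001", "1101001011001"]
    print(greedy_cover(enumerate_seeds(4, 3), sample, 2))
```

Each round evaluates every candidate once, so the cost is linear in the number of seeds requested, whereas a search over whole seed sets has to evaluate combinations of seeds.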
Calculate Conditional Match Probabilities
• Challenge: how to calculate the conditional probability efficiently?
• Seeds with small span: exact computation via DFAs
• Seeds with large span: Monte Carlo
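The Monte Carlo option can be sketched as below. As an assumption for brevity, random similarities are drawn from an i.i.d. model with match probability `p`; the slides use a Markov model learned from real similarities. The function name `estimate_conditional` is hypothetical.

```python
import random


def detects(seed, similarity):
    """True if the seed matches the similarity at some offset."""
    required = [k for k, c in enumerate(seed) if c == "1"]
    return any(
        all(similarity[i + k] == "1" for k in required)
        for i in range(len(similarity) - len(seed) + 1)
    )


def estimate_conditional(x, chosen, length, p, trials, rng):
    """Estimate Pr(x detects | no seed in `chosen` detects) by sampling.

    Returns NaN if no sampled similarity was missed by the chosen seeds.
    """
    missed = 0          # samples that no chosen seed detects
    hit_and_missed = 0  # of those, samples that x detects
    for _ in range(trials):
        sim = "".join("1" if rng.random() < p else "0" for _ in range(length))
        if any(detects(c, sim) for c in chosen):
            continue
        missed += 1
        if detects(x, sim):
            hit_and_missed += 1
    return hit_and_missed / missed if missed else float("nan")


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    print(estimate_conditional("1*11", ["11*1"], 64, 0.7, 20000, rng))
```

The cost per sample does not depend on the size of an automaton, which is why sampling remains usable when the span is large; the price is sampling error that shrinks only as the number of trials grows.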
Calculate Conditional Match Probability via DFA
• Pr(x | ~∏) = Pr(x and ~∏) / Pr(~∏), where ∏ is the set of seeds already chosen and ~∏ means that no seed in ∏ detects the similarity
• Build the DFA corresponding to (x and ~∏) by using cross product and complementation of DFAs
• Efficiency: during the local search for the optimal single seed x, Pr(~∏) can be precomputed
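The exact computation can be sketched with a DP over a product of two trackers: one that records whether x has matched, and one that discards any string on which an already-chosen seed matches (the complement). As in the simplified setting assumed here, positions are i.i.d. with match probability `p` and the state is the recent bit history rather than a minimized DFA; the slides use a Markov model. The function name is hypothetical.

```python
from collections import defaultdict


def ends_with(window, seed):
    """True if the seed matches with its last position at the end of window."""
    if len(window) < len(seed):
        return False
    tail = window[-len(seed):]
    return all(tail[k] == "1" for k, c in enumerate(seed) if c == "1")


def conditional_match_probability(x, chosen, length, p):
    """Exact Pr(x detects | no seed in `chosen` detects) under an i.i.d. model."""
    keep = max(len(s) for s in [x] + list(chosen)) - 1
    # State: (recent bits, has x matched yet?). Mass is kept only for strings
    # on which no chosen seed has matched so far.
    dist = {("", False): 1.0}
    for _ in range(length):
        nxt = defaultdict(float)
        for (state, x_hit), prob in dist.items():
            for bit, pb in (("1", p), ("0", 1.0 - p)):
                window = state + bit
                if any(ends_with(window, c) for c in chosen):
                    continue  # detected by a chosen seed: outside the complement
                new_state = window[-keep:] if keep > 0 else ""
                nxt[(new_state, x_hit or ends_with(window, x))] += prob * pb
        dist = nxt
    pr_not_chosen = sum(dist.values())
    if pr_not_chosen == 0.0:
        raise ValueError("the chosen seeds detect every string; condition is empty")
    pr_x_and_not_chosen = sum(prob for (_, hit), prob in dist.items() if hit)
    return pr_x_and_not_chosen / pr_not_chosen


if __name__ == "__main__":
    print(conditional_match_probability("1*11", ["11*1"], 64, 0.7))
```

In this sketch the denominator is recomputed on every call; since it depends only on the chosen seeds, it could be computed once and reused while many candidates x are scored.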
Greedy Covering vs. Original Local Search
[Figure: detection probability of greedy covering compared with the original local search]
Greedy Covering is Much Faster
• With n = 5 seeds, on the same hardware platform (P4):
  • Greedy covering needs 20 minutes
  • The original local search needs 2.4 hours
Experimental Setup
• The ungapped alignments are sampled uniformly from human and mouse syntenies
• For a specified seed set:
  • Sensitivity: the number of significant gapped alignments found by our BLAST-like alignment tool
  • False positive rate: approximated by the number of seed matches
Results: Verify the Hypothesis on Noncoding Sequences
Summary of Contributions
• Efficient algorithms to design multiple simultaneous seeds at reasonable cost
• Empirical verification: multiple simultaneous seeds have a better tradeoff between sensitivity and specificity
Future Work
• Design a better evaluation platform for different seeds
• Investigate the utility of seeds in multiple sequence alignment
Acknowledgements
• Dr. Jeremy Buhler (advisor), Ben Westover, Rachel Nordgren, Joseph Lancaster, and Christopher Swope
• Laboratory for Computational Genomics at Washington University in Saint Louis
http://www.cse.wustl.edu/~jbuhler/mandala