420 likes | 494 Views
Precomputing Edit-Distance Specificity of Short Oligonucleotides. Nathan Edwards Center for Bioinformatics and Computational Biology University of Maryland, College Park. Polymerase Chain Reaction. Polymerase Chain Reaction. Primer Specificity.
E N D
Precomputing Edit-Distance Specificity of Short Oligonucleotides Nathan Edwards Center for Bioinformatics and Computational Biology University of Maryland, College Park
Primer Specificity • Need to ensure that primers hybridize to a specific (specified) locus only • Require exactly one occurrence of specified sequence • Require no (potential) mis-hybridization loci • Bottleneck computation in primer-design • Design / check iteration is problematic
k-unique 20-mers • Edit-distance as a surrogate for mis-hybridization potential • k-unique loci: • All non-self genomic loci are require more than k edits in (global) alignment • Closest non-self genomic loci requires (k+1) edits in (global) alignment
Find all k-unique 20-mers • Naïve algorithm: O(n2km) • Quadratic in size of genome. • 0-unique (exact match) 20-mers • (Expected) linear time algorithm • Achieve expected linear time using a hybrid approach (blastn): • Use partial exact match to “seed” expensive dynamic programming alignment • Large chunks ) Fast, but miss occurrences • Small chunks ) Slow, but correct
Inexact sequence match Baeza-Yates Perleberg: • Correct and O(n) for small k • At least 1 chunk is observed with no error. • Small k → Large chunks → Fast and correct • Largest correct chunk: floor(m/(k+1)) g ≠ = ≠ q
Example worst case alignments TCCCGC-TAGATTGAGATCT ||||||v||||||*|||||| TCCCGCCTAGATTTAGATCT ACTTGTCCACAGTGCTTAAG ||||||*||||||*|||||| ACTTGTGCACAGTCCTTAAG
AA:18 AC:1,9 AG:11,19 CA:8,10 CC:14 CT:2,15 GC:7 GT:5,12 TA:17 TC:13 TG:4,6 TT:3,16 Brute-force approach ACTTGTGCACAGTCCTTAAG 2-mer position table
Brute-force approach ACTTGTGCACAGTCCTTAAG ACTTGTGCACAGTCCTTAAG
Brute-force approach ACTTGTGCACAGTCCTTAAG ACTTGTGCACAGTCCTTAAG
Brute-force approach ACTTGTGCACAGTCCTTAAG ACTTGTGCACAGTCCTTAAG
Brute-force approach ACTTGTGCACAGTCCTTAAG ACTTGTGCACAGTCCTTAAG
Brute-force approach ACTTGTGCACAGTCCTTAAG ACTTGTGCACAGTCCTTAAG
Brute-force approach ACTTGTGCACAGTCCTTAAG ACTTGTGCACAGTCCTTAAG
Brute-force approach Divide the genome into 10 Mb blocks For all pairs of blocks: For all l-mer matches: Do all pair-wise DPs containing match If ≤ k edits, mark position non-unique 300 x 300 pairs of blocks For 20-mers: k=1 )l=10; k=2 )l=6; k=3 )l=5 ; k=4 )l=4.
Brute-force approach Things are looking really, really, bad: • Seeds are too short • 90,000 pair-wise block comparisons Actually quite good (seed size 12): • Non-uniqueness certificates are dense • Almost all positions eliminated early • Behaves more like linear time than quadratic
Edit distance 2 • After seed size 12 • ~ 27K (0.288%) positions have no match • After seed size 8 • ~ 3K (0.029%) positions have no match • Using seed size 6 is still too slow • Need a more sophisticated hashing strategy • 6-mers match in too many places!
Spaced seed-set design problem • Given: • mer-size: m ( = 20 ) • # errors: k ( = 1,2,3) • # cares: l ( = 10,12,14 ) • Find the smallest set of spaced seeds that will find all alignments.
Solution for (20,2,8) • 11111111, 111101111 TCCCGCGTAGATTGAGATCT ||||||*||||||*|||||| TCCCGCCTAGATTTAGATCT • How can we find these spaced seed set solutions? • One/two table? 2 x false positives
Spaced seed set design set-cover formulation • Set cover instance: • Ground set: all possible placements of the k errors (alignments) • Covering sets: all possible placements of the l care positions • For (m=20,k=2,l=10), • 190 elements, 184,756 sets! • Need to reduce the number of sets!
Dirty secret of spaced seeds • Spaced seeds take O(# cares) to update! • Contiguous seeds are O(1) to update • 101010101010101 vs 11111111 • 8 steps to update vs 1 step to update • Constant time update for spaced seeds? • Yes, if they have a certain structure(s) • Restrict spaced seeds to small update cost
O(1) spaced seed update ACGTACGTACGTACGTACGT 1: A G A G 2: C T C T ... Spaced seed 1010101 can be updated in 1 step!
O(1) spaced seed update ACGTACGTACGTACGTACGT ACGTACG -> ACGACG CGTACGT -> CGTCGT ... Spaced seed 1110111 can be updated in 1 step!
O(1) spaced seed update • “Period” step update • 11001100110011 2 steps • 1000010000100001 1 step • “Runs” step update • 11100111111 1 step • 11101110111 2 steps • Minimize the number of update steps • Weighted set-cover instance…
Edit-distance SS-SDP • Position of matching bases might shift! • Need 11111111 ↓ to get CCGCTAGA • Need 111101111 ↑ to get CCGCTAGA • Set cover formulation no longer works TCCCGC-TAGATTGAGATCT ||||||v||||||*|||||| TCCCGCCTAGATTTAGATCT
Edit-Distance SS-SDP • Use a variation on set cover: • q:111101111,r:11111111 covers: • Pay for query & reference update costs separately • Control size of problem by only enumerating templates with small update cost r:TCCCGC-TAGATTGAGATCT ||||||v||||||*|||||| q:TCCCGCCTAGATTTAGATCT
Correct solutions for 1-unique 20-mers 107 random sequence x 107 random sequence • Seed: 1111111111 (Best single seed solution, weight 10) • ~ 9.5 expensive dynamic programs per locus • Seed set: 11111111111;111110111111 (weight 11) • ~ 8.9 DP/locus • Seed set (weight 11) • ~ 7.8 DP/locus • Seed set: 111111111111;1111101111111 (weight 12) • ~ 2.2 DP/locus 11111111111 ~ 111101111111111101111111 ~ 11111111111 111101111111 ~ 111101111111
Correct solutions for 1-unique 20-mers 107 random sequence x 107 random sequence • Seed set: 111111111111;1111101111111 (weight 12) • ~ 2.5 DP/locus • Seed set (weight 12) • ~ 1.8 DP/locus • Seed set: 1111111111111;11111101111111 (weight 13) • ~ 0.56 DP/locus (same specificity as contiguous seed weight 12) • Seed set (weight 14) • ~ 0.20 1111110111111 x 111111111111 1111110111111 111111001111111 11111111111111 ~ 11111111111111 111110111111111 ~ 11111111111111 111111111110111 ~ 11111111111111 111111111110111 ~ 111111111110111 11111111111111 ~ 111111101111111111110111111111 ~ 111110111111111
Correct solutions for 2-unique 20-mers • Seed: 111111 (Best single seed solution, weight 6) • ~ 2439 DP/locus • Weight 10 – 73 DP/locus (specificity of 8/9 contig seed) • Weight 12 – 10 DP/locus (specificity of 10 contig seed) 1111111111 ~ 1111111111 11111011111 ~ 1111111111 11111011111 ~ 11111011111 1111100000011111 ~ 1111100000011111 11111000000011111 ~ 1111100000011111 111110000000011111 ~ 1111100000011111 111110000011111 ~ 1111100000011111 111110000000011111 ~ 11111000000000011111
k-unique human 20-mers • No 4-unique 20-mers • No 3-unique 20-mers • 0. 038% of (forward) human 20-mers are 2-unique • 1088322 in total • about 1 every 2638 bases • Fast 2-uniquness oracle
Genome Browser Track Edit Distance: • UCSC Track: 1-unique 20-mers • UCSC Track: 2-unique 20-mers
Integration with High Performance Computing • Break sequence into chunks of size 107. • Remember which positions have been eliminated. • Integrated with (UMIACS) Condor • Too unreliable for very large sequences • NFS filesystem is unreliable • Simultaneous jobs capped at ~ 300 • Integrated with hadoop/map-reduce on 80 nodes (Edwards Lab) • Reliability improved, DFS solves (some) filesystem woes • Much better scalability (in theory, yet to be tested) • Explicit synchronization of reduce step is undesirable.
Other improvements • Other groupings are possible: • Species designation on FASTA defline can be any suitable partition • Constraints on the position of edits: • False positive due to mishybridization at 3’ end is unlikely to be observed with some technologies • Constraint on valid Tm range: • Computed as in Primer3 • Can eliminate undesirable mers early
Conclusions • Precompute of human k-unique 20-mers is now feasible! • Faster for large edit-distance! • Need spaced seed-set designs • Constant time update for spaced seeds • Good integer programming formulation of SS-SDP • Limited template enumeration based on update cost • Work with integer programming experts to solve effectively