Precomputing Edit-Distance Specificity of Short Oligonucleotides

Precomputing Edit-Distance Specificity of Short Oligonucleotides Nathan Edwards Center for Bioinformatics and Computational Biology University of Maryland, College Park

Polymerase Chain Reaction

Primer Specificity • Need to ensure that primers hybridize to a specific (specified) locus only • Require exactly one occurrence of specified sequence • Require no (potential) mis-hybridization loci • Bottleneck computation in primer-design • Design / check iteration is problematic

k-unique 20-mers • Edit-distance as a surrogate for mis-hybridization potential • k-unique loci: • All non-self genomic loci are require more than k edits in (global) alignment • Closest non-self genomic loci requires (k+1) edits in (global) alignment

Find all k-unique 20-mers • Naïve algorithm: O(n2km) • Quadratic in size of genome. • 0-unique (exact match) 20-mers • (Expected) linear time algorithm • Achieve expected linear time using a hybrid approach (blastn): • Use partial exact match to “seed” expensive dynamic programming alignment • Large chunks ) Fast, but miss occurrences • Small chunks ) Slow, but correct

Inexact sequence match Baeza-Yates Perleberg: • Correct and O(n) for small k • At least 1 chunk is observed with no error. • Small k → Large chunks → Fast and correct • Largest correct chunk: floor(m/(k+1)) g ≠ = ≠ q

Example worst case alignments TCCCGC-TAGATTGAGATCT ||||||v||||||*|||||| TCCCGCCTAGATTTAGATCT ACTTGTCCACAGTGCTTAAG ||||||*||||||*|||||| ACTTGTGCACAGTCCTTAAG

AA:18 AC:1,9 AG:11,19 CA:8,10 CC:14 CT:2,15 GC:7 GT:5,12 TA:17 TC:13 TG:4,6 TT:3,16 Brute-force approach ACTTGTGCACAGTCCTTAAG 2-mer position table

Brute-force approach ACTTGTGCACAGTCCTTAAG ACTTGTGCACAGTCCTTAAG

Brute-force approach Divide the genome into 10 Mb blocks For all pairs of blocks: For all l-mer matches: Do all pair-wise DPs containing match If ≤ k edits, mark position non-unique 300 x 300 pairs of blocks For 20-mers: k=1 )l=10; k=2 )l=6; k=3 )l=5 ; k=4 )l=4.

Brute-force approach Things are looking really, really, bad: • Seeds are too short • 90,000 pair-wise block comparisons Actually quite good (seed size 12): • Non-uniqueness certificates are dense • Almost all positions eliminated early • Behaves more like linear time than quadratic

In practice (edit-dist 4)

In practice (edit-dist 3)

In practice (edit-dist 4,3,2)

Edit distance 2 • After seed size 12 • ~ 27K (0.288%) positions have no match • After seed size 8 • ~ 3K (0.029%) positions have no match • Using seed size 6 is still too slow • Need a more sophisticated hashing strategy • 6-mers match in too many places!

Spaced seed-set design problem • Given: • mer-size: m ( = 20 ) • # errors: k ( = 1,2,3) • # cares: l ( = 10,12,14 ) • Find the smallest set of spaced seeds that will find all alignments.

Solution for (20,2,8) • 11111111, 111101111 TCCCGCGTAGATTGAGATCT ||||||*||||||*|||||| TCCCGCCTAGATTTAGATCT • How can we find these spaced seed set solutions?

Spaced seed set design set-cover formulation • Set cover instance: • Ground set: all possible placements of the k errors (alignments) • Covering sets: all possible placements of the l care positions • For (m=20,k=2,l=10), • 190 elements, 184,756 sets! • Need to reduce the number of sets!

Dirty secret of spaced seeds • Spaced seeds take O(# cares) to update! • Contiguous seeds are O(1) to update • 101010101010101 vs 11111111 • 8 steps to update vs 1 step to update • Constant time update for spaced seeds? • Yes, if they have a certain structure

O(1) spaced seed update ACGTACGTACGTACGTACGT A G A G C T C T G A G A T C T C ... Spaced seed 1010101 can be updated in 1 step!

O(1) spaced seed update • “Periodic” spaced seeds can be updated in “constant” time • 11011011011 2 steps • 11001100110011 2 steps • 1000010000100001 1 step • Need to minimize the number of update steps, not the number of templates • 11111111,111101111 has update cost 5.

Edit-distance SS-SDP • Position of matching bases might shift! • Need 11111111 ↓ to get CCGCTAGA • Need 111101111 ↑ to get CCGCTAGA • Set cover formulation no longer works TCCCGC-TAGATTGAGATCT ||||||v||||||*|||||| TCCCGCCTAGATTTAGATCT

Edit-Distance SS-SDP • Use a variation on set cover: • q:111101111,r:11111111 covers: • Pay for query & reference update costs separately • Control size of problem by only enumerating templates with small update cost r:TCCCGC-TAGATTGAGATCT ||||||v||||||*|||||| q:TCCCGCCTAGATTTAGATCT

Solution for (20,2,10) Query Templates: 1: 11111111110000000000 Cost: 1 2: 11111011111000000000 Cost: 5 27: 11111000001111100000 Cost: 5 42: 11111000000001111100 Cost: 5 Text Templates: 1: 11111111110000000000 Cost: 1 2: 11111011111000000000 Cost: 5 32: 11111000000111110000 Cost: 5 37: 11111000000011111000 Cost: 5 42: 11111000000001111100 Cost: 5 Pairs of templates: 1: 11111111110000000000 1: 11111111110000000000 Covers: 1274 2: 11111011111000000000 1: 11111111110000000000 Covers: 260 2: 11111011111000000000 2: 11111011111000000000 Covers: 1218 1: 11111111110000000000 2: 11111011111000000000 Covers: 309 42: 11111000000001111100 32: 11111000000111110000 Covers: 42 27: 11111000001111100000 32: 11111000000111110000 Covers: 319 42: 11111000000001111100 37: 11111000000011111000 Covers: 186 27: 11111000001111100000 37: 11111000000011111000 Covers: 51 42: 11111000000001111100 42: 11111000000001111100 Covers: 287

k-unique human 20-mers • No 4-unique 20-mers • No 3-unique 20-mers • 0. 038% of (forward) human 20-mers are 2-unique • 1088322 in total • about 1 every 2638 bases • Fast 2-uniquness oracle

F. tularensis 20-mer signatures • Exact match in all six strains • No match to bacterial background at edit-distance k • No 3-unique 20-mer signatures • 263 2-unique 20-mer signatures • 0.013% • 1.3M 20-mer signatures (no background check) • 1.2M 0-unique 20-mer signatures • 580K 1-unique 20-mer signatures

Conclusions • Precompute of human k-unique 20-mers is now feasible! • Faster for large edit-distance! • Need spaced seed-set designs • Constant time update for spaced seeds • Good integer programming formulation of SS-SDP • Limited template enumeration based on update cost • Work with integer programming experts to solve effectively

Next Steps • Publish! • Adapt for Tm and/or hybridization model • Convert to native BOINC-application • Integrate with primer-design software

Precomputing Edit-Distance Specificity of Short Oligonucleotides

Precomputing Edit-Distance Specificity of Short Oligonucleotides

Presentation Transcript

SHORT DISTANCE WIRELESS COMMUNICATION

Dynamic Programming: Edit Distance

Dynamic Programming: Edit Distance

Short-Distance Structure of Nuclei

Minimum Edit Distance

Efficient Approximation of Edit Distance

Specificity

Levels of Specificity

Minimum Edit Distance

Minimum Edit Distance

Edit Distance

Minimum Edit Distance

Dynamic Programming: Edit Distance

Dynamic Programming: Edit Distance

Dynamic Programming: Edit Distance

Dynamic Programming: Edit Distance

Dynamic Programming: Edit Distance

Minimum Edit Distance

Precomputing Edit-Distance Specificity of Short Oligonucleotides

Dynamic Programming: Edit Distance

Pair HMMs and edit distance

Edit Distance