Seeds for Similarity Search Presentation by: Anastasia Fedynak
Homology Search • Homology search consumes 10% of the world’s supercomputing time • NCBI Blast server processes 10^5 queries/day • GenBank doubles in size every 18 months • Completed genomes: human, mouse, rice, fly, etc. • Software must be scalable for large datasets
Homology Search Tools • Identify short seed matches (k consecutive bases) between DNA sequences, which are then extended • BLAST, FASTA: too slow and miss many alignments • Smith-Waterman DP: too slow • MegaBlast: high speed, works well for highly similar sequences
Discontiguous Seeds • Require matching pairs of bases at a subset of positions • Califano and Rigoutsos (1993): random discontiguous pattern in FLASH • Buhler (2001): sensitivity of random patterns in the LSH-ALL-PAIRS comparison algorithm • Blastz, underlying the PipMaker program (2000) • PatternHunter (Ma, Tromp, and Li, 2002)
Resource-constrained paradigm of seed design • Given a collection of ungapped genomic sequence similarities of fixed length l, modeled by a kth-order Markov model M, find n seeds π1 … πn such that the probability of detecting a similarity is maximized
Problem Definition • Let C be a collection of ungapped genomic sequence similarities of l bases, each written as a bit string: 1 = match, 0 = mismatch • A similarity is the starting point for gapped extension
Problem Definition • Similarity is modeled by a kth-order Markov process M • M gives the probability that the next bit seen will be a 1 (match) • Coding regions exhibit the pattern {1, 1, 0}: protein-coding sequence with silent mutations at the 3rd base position of the codon
Problem Definition • Devise a seed π, an ordered list of w positions {x1 … xw}, with weight w and span s • Ex. π = {1,3,4,6,7}: w = 5, s = 7 • π detects S iff there is an offset j such that S[j + xi] = 1 for all 1 ≤ i ≤ w, i.e. S contains a matching base at every position of π placed at offset j • S = 1011011: match • S = 1001011: mismatch (position 3 is 0)
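The detection rule on this slide can be sketched in a few lines of Python. The function name `detects` and the string encoding of S are illustrative choices, not part of the presentation; positions are 1-based as on the slide, and the sketch assumes x1 = 1 so that the span equals the largest position.

```python
def detects(seed, similarity):
    """Return True if the seed hits the 0/1 similarity string at some offset.

    seed: 1-based positions x_1 < ... < x_w (assumes x_1 == 1).
    similarity: string of '1' (match) and '0' (mismatch).
    """
    span = max(seed)  # s: with x_1 == 1 the span is the last inspected position
    for j in range(len(similarity) - span + 1):
        # The seed hits at offset j if every inspected position is a match.
        if all(similarity[j + x - 1] == "1" for x in seed):
            return True
    return False


seed = [1, 3, 4, 6, 7]            # w = 5, s = 7
print(detects(seed, "1011011"))   # True
print(detects(seed, "1001011"))   # False
```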
Problem Definition • Find a seed π that maximizes sensitivity to S from model M, i.e. maximize the detection probability $\Pr_{S \sim M}[\pi \text{ detects } S]$
Selecting Good Seeds • Seed length is determined by a tradeoff between speed and sensitivity: • Larger k: faster, lower sensitivity • Smaller k: slower, higher sensitivity • Blast uses k consecutive letters as seeds • k = 11 in Blastn and k = 28 in MegaBlast
Selecting Good Seeds • INDEPENDENCE: probabilities of matches at different offsets are not independent • Generally, the fewer bases a seed shares with its shifted copies, the higher its sensitivity • Consecutive models → low sensitivity
PatternHunter • Optimal model via DP: 111010010100110111 • w = 11, s = 18 • A copy of the seed shifted by one position shares only 5 bases with the original
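The overlap count quoted for the PatternHunter seed can be checked directly. This is a small sketch; `shared_positions` is a hypothetical helper name, and the contiguous 11-mer is included only for comparison.

```python
def shared_positions(seed_pattern, shift):
    """Count positions inspected by both the seed and a copy shifted right by `shift`."""
    return sum(
        1
        for i in range(len(seed_pattern) - shift)
        if seed_pattern[i] == "1" and seed_pattern[i + shift] == "1"
    )


spaced = "111010010100110111"   # PatternHunter seed: w = 11, s = 18
consecutive = "1" * 11          # contiguous seed of the same weight

print(shared_positions(spaced, 1))       # 5
print(shared_positions(consecutive, 1))  # 10
```

The contiguous seed shares 10 of its 11 positions with its one-step shift, so hits at neighbouring offsets are strongly correlated; the spaced seed shares only 5.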
Spaced vs. Consecutive Seeds • LEMMA: The expected number of hits of a seed with weight w and span s within a length-l region of similarity p (0 ≤ p ≤ 1) is $(l - s + 1)\,p^w$ • Example: In a region of length 64 and similarity 0.7
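The lemma can be evaluated for the example region. The sketch below plugs in the PatternHunter seed (w = 11, s = 18) and a contiguous seed of the same weight (w = 11, s = 11); pairing these two seeds with the example is an assumption made for illustration.

```python
def expected_hits(length, span, weight, p):
    """Expected number of seed hits: (l - s + 1) * p^w."""
    return (length - span + 1) * p ** weight


spaced = expected_hits(64, 18, 11, 0.7)        # 47 offsets
consecutive = expected_hits(64, 11, 11, 0.7)   # 54 offsets

print(round(spaced, 2))       # 0.93
print(round(consecutive, 2))  # 1.07
```

The contiguous seed has slightly more expected hits, yet its hits cluster at adjacent offsets; the expected count alone therefore does not determine the probability of at least one hit.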
Mandala – Seed Selection • Let π = {x1 … xw} be the current seed • Define the local neighbourhood of π as the set of all seeds π′ that differ from π in exactly one position • Hill climbing with random restarts to find a near-optimal seed • Evaluation based on probability calculation
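A minimal sketch of hill climbing with random restarts over single seeds. It is not Mandala's implementation: it assumes independent matches with probability p (a 0th-order model), keeps positions 1 and s fixed so the span stays constant, and scores seeds with a simple sliding-window DP. All function and parameter names are hypothetical.

```python
import random


def detection_probability(seed, length, p):
    """Pr[seed hits a random similarity], assuming i.i.d. matches with probability p."""
    span, keep = max(seed), max(seed) - 1
    states, hit = {"": 1.0}, 0.0   # states: last span-1 bits of hit-free prefixes
    for _ in range(length):
        nxt = {}
        for window, mass in states.items():
            for bit, prob in (("1", p), ("0", 1.0 - p)):
                w = window + bit
                if len(w) == span and all(w[x - 1] == "1" for x in seed):
                    hit += mass * prob          # first hit: absorb the mass
                else:
                    tail = w[-keep:] if keep else ""
                    nxt[tail] = nxt.get(tail, 0.0) + mass * prob
        states = nxt
    return hit


def hill_climb(weight, span, length, p, restarts=5, rng=None):
    """First-improvement hill climbing with random restarts (weight >= 2)."""
    rng = rng or random.Random(0)
    interior = list(range(2, span))             # positions 1 and span stay fixed
    best_seed, best_score = None, -1.0
    for _ in range(restarts):
        chosen = set(rng.sample(interior, weight - 2))
        seed = sorted({1, span} | chosen)
        score = detection_probability(seed, length, p)
        improved = True
        while improved:
            improved = False
            # Neighbourhood: swap one inspected interior position for a free one.
            for old in sorted(chosen):
                for new in interior:
                    if new in chosen:
                        continue
                    cand_chosen = (chosen - {old}) | {new}
                    cand = sorted({1, span} | cand_chosen)
                    cand_score = detection_probability(cand, length, p)
                    if cand_score > score + 1e-12:
                        chosen, seed, score = cand_chosen, cand, cand_score
                        improved = True
                        break
                if improved:
                    break
        if score > best_score:
            best_seed, best_score = seed, score
    return best_seed, best_score


seed, score = hill_climb(weight=4, span=7, length=20, p=0.7)
print(seed, round(score, 4))
```

The window DP is exponential in the span, so this sketch is only practical for short seeds.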
Detection Probabilities • To compute a detection probability, encode the overlap structure of the seed into a DFA • DP computes the probability that the DFA accepts a random similarity of length l from the kth-order Markov model M • P(q, t, δ·b): probability of reaching state q after reading t bits of an input S, the last k+1 of which are δ·b • For a state q, let Φ_b(q) be the set of all states that transition to q on bit b. Then
$$P(q, t, \delta \cdot b) = \Pr\bigl(S[t] = b \mid S[t-k \ldots t-1] = \delta\bigr) \times \sum_{q' \in \Phi_b(q)} \; \sum_{b_0 \in \{0,1\}} P(q', t-1, b_0 \cdot \delta)$$
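The DP can be illustrated in a simplified setting. The sketch below makes two assumptions that differ from the slide: matches are independent with probability p (k = 0, so the context δ is empty), and the automaton state is simply the last s−1 bits read rather than a state of the seed's DFA. The function name is hypothetical.

```python
def detection_probability(seed, length, p):
    """Pr[seed detects a random similarity of `length` bits], i.i.d. matches.

    seed: 1-based positions with x_1 == 1, so the span is max(seed).
    """
    span = max(seed)
    keep = span - 1                 # bits of history that can still matter
    states = {"": 1.0}              # window -> probability, no hit so far
    hit = 0.0                       # mass absorbed in the accepting state
    for _ in range(length):
        nxt = {}
        for window, mass in states.items():
            for bit, prob in (("1", p), ("0", 1.0 - p)):
                w = window + bit
                if len(w) == span and all(w[x - 1] == "1" for x in seed):
                    hit += mass * prob
                else:
                    tail = w[-keep:] if keep else ""
                    nxt[tail] = nxt.get(tail, 0.0) + mass * prob
        states = nxt
    return hit


print(detection_probability([1, 2], 3, 0.5))   # 0.375  (110, 111, 011)
print(detection_probability([1, 3], 3, 0.5))   # 0.25   (101, 111)
```

The state space here grows as 2^(s−1); a DFA built from the seed's overlap structure avoids enumerating every window.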
Influence of Model Order • M5 model (solid line): exploits nearest-neighbor • Mc5 model (dashed line): exploits correlation arising from codon structure
Multi-Seed Design – Why? • Seed matching heuristics optimize a tradeoff between sensitivity (true +ve rate) and specificity (1 – false +ve rate) • True +ve: alignment contains a seed match • False +ve: seed match occurs by chance (probability ~ 1/4^w per base) • Increasing w reduces π’s false +ve rate but lowers sensitivity
Multi-Seed Design – Why? • Multiple seeds provide a more attractive way to trade sensitivity for specificity • Use a set Π of seeds with weight w′ > w • Expected rate of chance matches: |Π| / 4^w′
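A worked instance of the chance-match rate may help. The weights 11 and 12 and the set size 4 are illustrative choices, not values taken from the slides.

```latex
\[
\text{one seed, } w = 11:\quad \frac{1}{4^{11}} = \frac{1}{4194304} \approx 2.4 \times 10^{-7}
\qquad\qquad
\text{four seeds, } w' = 12:\quad \frac{|\Pi|}{4^{w'}} = \frac{4}{4^{12}} = \frac{1}{4^{11}}
\]
```

Under this estimate, four seeds of weight 12 produce chance matches at the same expected rate as one seed of weight 11, so any sensitivity gained from the larger set comes at no extra cost in specificity.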
Problem Definition • $E_\pi(\alpha)$: seed π matches alignment α • $\bar{E}_\pi(\alpha)$: π does not match α • Match probability of π in M is given by $\Pr_{\alpha \sim M}[E_\pi(\alpha)]$ • A set Π matches α if at least one of its seeds matches: $E_\Pi(\alpha)$
Problem Definition • Find a set Π of n seeds that maximizes sensitivity to S from model M, i.e. maximize the detection probability $\Pr_{S \sim M}[\Pi \text{ detects } S]$
Algorithms for Multi-Seed Design • Local Approach • Used in Mandala • Greedy Covering • Beam Search
Mandala’s Local Search Algorithm • Given w and s • Begin with a set Π0 of n randomly chosen seeds with common w and s • Choose i and j, where 1 ≤ i ≤ n and 2 ≤ j ≤ w; then find the best seed set Π1 in the neighbourhood of Π0 by deleting position xj of the ith seed πi ∈ Π0 and replacing it with a position between 1 and s-1 not currently inspected by πi • Iterate over i and j until no further improvement is possible
Greedy Heuristic for Computing Seed Sets • Given a partial seed set Π0, choose the next seed π that maximizes the conditional match probability for alignment model M, $\Pr(E_\pi \mid \bar{E}_\Pi)$, where $\bar{E}_\Pi$ means no seed in the current set matches • i.e. the seed most likely to match an alignment not already matched by some seed in the current set • Start from a single locally optimal seed
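A minimal sketch of the greedy step, under stated assumptions: matches are independent with probability p, candidates are all seeds of one weight and span (small enough to enumerate), and the first seed is the best single candidate rather than a locally optimal one found by search. The helper names are hypothetical.

```python
from itertools import combinations


def ends_with_hit(window, seed):
    """True if the seed matches the last max(seed) bits of the window."""
    span = max(seed)
    if len(window) < span:
        return False
    start = len(window) - span
    return all(window[start + x - 1] == "1" for x in seed)


def set_detection_probability(seeds, length, p):
    """Pr[at least one seed hits], assuming i.i.d. matches with probability p."""
    keep = max(max(s) for s in seeds) - 1
    states, hit = {"": 1.0}, 0.0
    for _ in range(length):
        nxt = {}
        for window, mass in states.items():
            for bit, prob in (("1", p), ("0", 1.0 - p)):
                w = window + bit
                if any(ends_with_hit(w, s) for s in seeds):
                    hit += mass * prob
                else:
                    tail = w[-keep:] if keep else ""
                    nxt[tail] = nxt.get(tail, 0.0) + mass * prob
        states = nxt
    return hit


def all_seeds(weight, span):
    """Every seed of the given weight and span (positions 1 and span always set)."""
    return [(1,) + mid + (span,) for mid in combinations(range(2, span), weight - 2)]


def greedy_seed_set(candidates, n, length, p):
    """Repeatedly add the seed with the highest Pr(E_pi | no seed chosen so far matches)."""
    chosen, covered = [], 0.0       # assumes n <= len(candidates)
    for _ in range(n):
        best, best_cond, best_total = None, -1.0, covered
        for cand in candidates:
            if cand in chosen:
                continue
            total = set_detection_probability(chosen + [cand], length, p)
            cond = (total - covered) / (1.0 - covered)   # conditional match probability
            if cond > best_cond:
                best, best_cond, best_total = cand, cond, total
        chosen.append(best)
        covered = best_total
    return chosen, covered


seeds, sensitivity = greedy_seed_set(all_seeds(3, 5), n=2, length=16, p=0.7)
print(seeds, round(sensitivity, 4))
```

Because the denominator 1 − Pr(E_Π) is the same for every candidate in a round, maximizing the conditional probability is equivalent to maximizing the sensitivity of the enlarged set.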
Extension to Beam Search • Initially find a number of locally optimal single seeds • The best b are saved and used in the next optimization round • For each saved seed π0, find N seeds π, each of which locally optimizes $\Pr(E_\pi \mid \bar{E}_\Pi)$ • The b seed pairs {π0, π} with highest match probability over all b·N pairs are again saved • The best seed set overall is chosen
Computing Conditional Match Probabilities • Construct a DFA Aπ that accepts alignments containing a seed match to π • By DP, compute the probability that Aπ accepts a random alignment of length l from M • Compute $\Pr(E_\pi \mid \bar{E}_\Pi)$ for seed π and set Π:
$$\Pr(E_\pi \mid \bar{E}_\Pi) = \frac{\Pr(E_{\Pi \cup \pi}) - \Pr(E_\Pi)}{1 - \Pr(E_\Pi)}$$
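The conditional probability follows from the definition of conditioning in two steps. The derivation below is a sketch that writes $\bar{E}_\Pi$ for the event that no seed in Π matches.

```latex
\begin{align*}
\Pr(E_\pi \mid \bar{E}_\Pi)
  &= \frac{\Pr(E_\pi \cap \bar{E}_\Pi)}{\Pr(\bar{E}_\Pi)} \\
  &= \frac{\Pr(E_\pi \cup E_\Pi) - \Pr(E_\Pi)}{1 - \Pr(E_\Pi)}
   = \frac{\Pr(E_{\Pi \cup \pi}) - \Pr(E_\Pi)}{1 - \Pr(E_\Pi)}
\end{align*}
```

The second line uses the fact that the alignments matched by π but by no seed in Π are exactly those matched by the enlarged set Π ∪ π minus those already matched by Π. Only unconditional match probabilities of seed sets are needed, and each of those is computable by the DFA-based DP.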
Detection Probabilities • Let π be a seed of weight w and span s • Qπ: the set of all s-bit strings matching π • Construct a trie Tπ from the strings of Qπ • Convert Tπ to a DFA Aπ (Aho-Corasick algorithm) • Aπ accepts a similarity S if π detects S
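The construction can be sketched as follows: enumerate Qπ, insert its strings into a trie, then add Aho-Corasick failure links and flatten them into a complete transition table over {0, 1}. This is a textbook-style sketch with hypothetical function names, not the presenter's code.

```python
from collections import deque
from itertools import product


def seed_strings(seed):
    """Q_pi: all s-bit strings with a 1 at every position the seed inspects."""
    span = max(seed)
    free = [i for i in range(1, span + 1) if i not in seed]
    strings = []
    for fill in product("01", repeat=len(free)):
        bits = ["1"] * span
        for pos, b in zip(free, fill):
            bits[pos - 1] = b
        strings.append("".join(bits))
    return strings


def build_dfa(patterns):
    """Aho-Corasick automaton over {0,1}, returned as (transition table, accept flags)."""
    goto, accept = [{}], [False]            # trie T_pi
    for pat in patterns:
        q = 0
        for ch in pat:
            if ch not in goto[q]:
                goto.append({})
                accept.append(False)
                goto[q][ch] = len(goto) - 1
            q = goto[q][ch]
        accept[q] = True
    fail = [0] * len(goto)
    delta = [dict() for _ in goto]
    queue = deque()
    for ch in "01":                         # root transitions
        if ch in goto[0]:
            delta[0][ch] = goto[0][ch]
            queue.append(goto[0][ch])
        else:
            delta[0][ch] = 0
    while queue:                            # breadth-first: shallower states first
        q = queue.popleft()
        accept[q] = accept[q] or accept[fail[q]]
        for ch in "01":
            if ch in goto[q]:
                child = goto[q][ch]
                fail[child] = delta[fail[q]][ch]
                delta[q][ch] = child
                queue.append(child)
            else:
                delta[q][ch] = delta[fail[q]][ch]
    return delta, accept


def accepts(delta, accept, similarity):
    """Run the DFA; accept as soon as any seed match has been seen."""
    q = 0
    for ch in similarity:
        q = delta[q][ch]
        if accept[q]:
            return True
    return False


delta, accept = build_dfa(seed_strings([1, 3, 4, 6, 7]))
print(accepts(delta, accept, "1011011"))   # True
print(accepts(delta, accept, "1001011"))   # False
```

Qπ contains 2^(s−w) strings, so the trie stays small when the seed has few unconstrained positions.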