600 likes | 611 Views
Learn about sequence alignment basics, optimal score computation, Blast database search, expected matches, and the efficiency of Blast compared to Smith-Waterman.
E N D
CSE182-L4: Keyword matching CSE 182
Backward scoring • Defin Sb[i,j] : Best scoring alignment of the suffixes s[i+1..n] and t[j+1..m] • Q: What is the score of the best alignment of the two strings s and t? • HW: Write the recurrences for Sb CSE 182
Forward/Backward computations j m 1 • F[j] = Score of the best scoring alignment of s[1..n/2] and t[1..j] • F[j] = S[n/2,j] • B[j] = Score of the best scoring alignment of s[n/2+1..n] and t[j+1..m] • B[j] = Sb[n/2,j] n/2 CSE 182
Forward/Backward computations j m 1 • At the optimal coordinate, j • F[j]+B[j]=S[n,m] • In O(nm) time, and O(m) space, we can compute one of the coordinates on the optimum path. n/2 CSE 182
Forward, Backward computation • There exists a coordinate, j • F[j]+B[j]=S[n,m] • In O(nm) time, and O(m) space, we can compute one of the coordinates on the optimum path. CSE 182
Linear Space Alignment • Align(1..n,1..m) • For all 1<=j <= m • Compute F[j]=S(n/2,j) • For all 1<=j <= m • Compute B[j]=Sb(n/2,j) • j* = maxj {F[j]+B[j] } • X = Align(1..n/2,1..j*) • Y = Align(n/2..n,j*..m) • Return X,j*,Y CSE 182
Linear Space complexity • T(nm) = c.nm + T(nm/2) = O(nm) • Space = O(m) CSE 182
Summary • We considered the basics of sequence alignment • Opt score computation • Reconstructing alignments • Local alignments • Affine gap costs • Space saving measures • Can we recreate Blast? CSE 182
Blast and local alignment • Concatenate all of the database sequences to form one giant sequence. • Do local alignment computation with the query. CSE 182
Database (n) Query (m) Large database search Database size n=100M, Querysize m=1000. O(nm) = 1011 computations CSE 182
Why is Blast Fast? CSE 182
Silly Question! • True or False: No two people in new york city have the same number of hair CSE 182
Observations • Much of the database is random from the query’s perspective • Consider a random DNA string of length n. • Pr[A]=Pr[C] = Pr[G]=Pr[T]=0.25 • Assume for the moment that the query is all A’s (length m). • What is the probability that an exact match to the query can be found? CSE 182
Basic probability • Probability that there is a match starting at a fixed position i = 0.25m • What is the probability that some position i has a match. • Dependencies confound probability estimates. CSE 182
Total money you expect to earn Basic Probability:Expectation • Q: Toss a coin: each time it comes up heads, you get a dollar • What is the money you expect to get after n tosses? • Let Xi be the amount earned in the i-th toss CSE 182
Expected number of matches • Expected number of matches can still be computed. i • Let Xi=1 if there is a match starting at position i, Xi=0 otherwise • Expected number of matches = CSE 182
Expected number of exact Matches is small! • Expected number of matches = n*0.25m • If n=107, m=10, • Then, expected number of matches = 9.537 • If n=107, m=11 • expected number of hits = 2.38 • n=107,m=12, • Expected number of hits = 0.5 < 1 • Bottom Line: An exact match to a substring of the query is unlikely just by chance. CSE 182
Suppose we are looking for a database string with greater than 90% identity to the query (length 100) • Partition the query into size 10 substrings. At least one much match the database string exactly Observation 2 • What is the pigeonhole principle? CSE 182
Why is this important? • Suppose we are looking for sequences that are 80% identical to the query sequence of length 100. • Assume that the mismatches are randomly distributed. • What is the probability that there is no stretch of 10 bp, where the query and the subject match exactly? • Rough calculations show that it is very low. Exact match of a short query substring to a truly similar subject is very high. • The above equation does not take dependencies into account • Reality is better because the matches are not randomly distributed CSE 182
Just the Facts • Consider the set of all substrings of the query string of fixed length W. • Prob. of exact match to a random database string is very low. • Prob. of exact match to a true homolog is very high. • Keyword Search (exact matches) is MUCH faster than sequence alignment CSE 182
BLAST Database (n) • Consider all (m-W) query words of size W (Default = 11) • Scan the database for exact match to all such words • For all regions that hit, extend using a dynamic programming alignment. • Can be many orders of magnitude faster than SW over the entire string CSE 182
Why is BLAST fast? • Assume that keyword searching does not consume any time and that alignment computation the expensive step. • Query m=1000, random Db n=107,no TP • SW = O(nm) = 1000*107 = 1010 computations • BLAST, W=11 • E(#11-mer hits)= 1000* (1/4)11 * 107=2384 • Number of computations = 2384*100*100=2.384*107 • Ratio=1010/(2.384*107)=420 • Further speed improvements are possible CSE 182
Keyword Matching • How fast can we match keywords? • Hash table/Db index? What is the size of the hash table, for m=11 • Suffix trees? What is the size of the suffix trees? • Trie based search. We will do this in class. AATCA 567 CSE 182
Related notes • How to choose the alignment region? • Extend greedily until the score falls below a certain threshold • What about protein sequences? • Default word size = 3, and mismatches are allowed. • Like sequences, BLAST has been evolving continuously • Banded alignment • Seed selection • Scanning for exact matches, keyword search versus database indexing CSE 182
P-value computation • How significant is a score? What happens to significance when you change the score function • A simple empirical method: • Compute a distribution of scores against a random database. • Use an estimate of the area under the curve to get the probability. • OR, fit the distribution to one of the standard distributions. CSE 182
Z-scores for alignment • Initial assumption was that the scores followed a normal distribution. • Z-score computation: • For any alignment, score S, shuffle one of the sequences many times, and recompute alignment. Get mean and standard deviation • Look up a table to get a P-value CSE 182
Blast E-value • Initial (and natural) assumption was that scores followed a Normal distribution • 1990, Karlin and Altschul showed that ungapped local alignment scores follow an exponential distribution • Practical consequence: • Longer tail. • Previously significant hits now not so significant CSE 182
Exponential distribution • Random Database, Pr(1) = p • What is the expected number of hits to a sequence of k 1’s • Instead, consider a random binary Matrix. Expected # of diagonals of k 1s CSE 182
As you increase k, the number decreases exponentially. • The number of diagonals of k runs can be approximated by a Poisson process • In ungapped alignments, we replace the coin tosses by column scores, but the behaviour does not change (Karlin & Altschul). • As the score increases, the number of alignments that achieve the score decreases exponentially CSE 182
Blast E-value • Choose a score such that the expected score between a pair of residues < 0 • Expected number of alignments with a particular score • For small values, E-value and P-value are the same CSE 182
Blast Variants Longer seeds. Seeds with don’t care values Later Database pre-processing Seeds with don’t care values • What is mega-blast? • What is discontiguous mega-blast? • Phi-Blast/Psi-Blast? • BLAT? • PatternHunter? CSE 182
Silly Quiz • Name a famous Bioinformatics Researcher • Name a famous Bioinformatics Researcher who is a woman CSE 182
Scoring DNA • DNA has structure. CSE 182
DNA scoring matrices • So far, we considered a simple match/mismatch criterion. • The nucleotides can be grouped into Purines (A,G) and Pyrimidines. • Nucleotide substitutions within a group (transitions) are more likely than those across a group (transversions) CSE 182
Scoring proteins • Scoring protein sequence alignments is a much more complex task than scoring DNA • Not all substitutions are equal • Problem was first worked on by Pauling and collaborators • In the 1970s, Margaret Dayhoff created the first similarity matrices. • “One size does not fit all” • Homologous proteins which are evolutionarily close should be scored differently than proteins that are evolutionarily distant • Different proteins might evolve at different rates and we need to normalize for that CSE 182
PAM 1 distance • Two sequences are 1 PAM apart if they differ in 1 % of the residues. 1% mismatch • PAM1(a,b) = Pr[residue b substitutes residue a, when the sequences are 1 PAM apart] CSE 182
PAM1 matrix • Align many proteins that are very similar • Is this a problem? • PAM1 distance is the probability of a substitution when 1% of the residues have changed • Estimate the frequency Pb|a of residue a being substituted by residue b. • S(a,b) = log10(Pab/PaPb) = log10(Pb|a/Pb) CSE 182
PAM 1 CSE 182
1 PAM 1 PAM PAM distance • Two sequences are 1 PAM apart when they differ in 1% of the residues. • When are 2 sequences 2 PAMs apart? 2 PAM CSE 182
Higher PAMs • PAM2(a,b) = ∑c PAM1(a,c). PAM1 (c,b) • PAM2 = PAM1 * PAM1 (Matrix multiplication) • PAM250 • = PAM1*PAM249 • = PAM1250 CSE 182
Note: This is not the score matrix: What happens as you keep increasing the power? CSE 182
Scoring using PAM matrices • Suppose we know that two sequences are 250 PAMs apart. • S(a,b) = log10(Pab/PaPb)= log10(Pb|a/Pb) = log10(PAM250(a,b)/Pb) CSE 182
BLOSUM series of Matrices • Henikoff & Henikoff: Sequence substitutions in evolutionarily distant proteins do not seem to follow the PAM distributions • A more direct method based on hand-curated multiple alignments of distantly related proteins from the BLOCKS database. • BLOSUM60 Merge all proteins that have greater than 60%. Then, compute the substitution probability. • In practice BLOSUM62 seems to work very well. CSE 182
PAM vs. BLOSUM • What is the correspondence? • PAM1 Blosum1 • PAM2 Blosum2 • Blosum62 • PAM250 Blosum100 CSE 182
Dictionary Matching, R.E. matching, and position specific scoring CSE 182
Keyword search • Recall: In BLAST, we get a collection of keywords from the query sequence, and identify all db locations with an exact match to the keyword. • Question: Given a collection of strings (keywords), find all occrrences in a database string where they keyword might match. CSE 182
Dictionary Matching 1:POTATO 2:POTASSIUM 3:TASTE P O T A S T P O T A T O • Q: Given k words (si has length li), and a database of size n, find all matches to these words in the database string. • How fast can this be done? database dictionary CSE 182
Dict. Matching & string matching • How fast can you do it, if you only had one word of length m? • Trivial algorithm O(nm) time • Pre-processing O(m), Search O(n) time. • Dictionary matching • Trivial algorithm (l1+l2+l3…)n • Using a keyword tree, lpn (lp is the length of the longest pattern) • Aho-Corasick: O(n) after preprocessing O(l1+l2..) • We will consider the most general case CSE 182
Direct Algorithm P O P O P O T A S T P O T A T O P O T A T O P O T A T O P O T A T O P O T A T O P O T A T O Observations: • When we mismatch, we (should) know something about where the next match will be. • When there is a mismatch, we (should) know something about other patterns in the dictionary as well. CSE 182
O A P M S T T T O T S I U A E The Trie Automaton • Construct an automaton A from the dictionary • A[v,x] describes the transition from node v to a node w upon reading x. • A[u,’T’] = v, and A[u,’S’] = w • Special root node r • Some nodes are terminal, and labeled with the index of the dictionary word. 1:POTATO 2:POTASSIUM 3:TASTE u v 1 r S 2 w 3 CSE 182