200 likes | 322 Views
1. BLAST (Basic Local Alignment Search Tool) Heuristic. Only parts of protein are frequently subject to mutations. For example, active sites (that one which binds with other proteins) are not while regions between active sites (which help to shape the protein) are.
E N D
1. BLAST (Basic Local Alignment Search Tool) Heuristic Only parts of protein are frequently subject to mutations. For example, active sites (that one which binds with other proteins) are not while regions between active sites (which help to shape the protein) are. Smith-Waterman algorithm is appropriate to detect these parts, except that it runs in o(n2). When comparing a sequence with a database of sequences one claims an algorithm which runs in o(n). Thus heuristics are required: those sequences to which heuristics assign high scores are then rescored with Smith-Waterman algorithm.
BLAST(cont.1) • BLAST relies on two assumptions: • high scoring alignments contain one or more high scoring pairs of three letters sub-strings (words, in BLAST language) which leads to seeds (couples of words obtained from comparison with the database sequence). • homologous proteins contain significant gap-free alignments (most recent versions sequences select sequences on a gap-free base, but then applies an optimized version of Smith-Waterman algorithm). • BLAST returns a list of high-scoring segment pairs (gap-free alignments) between the query sequence and sequences in the database. • It proceeds in two steps.
Preprocessing of the query string Construction of a hash table (query index) with 20*20*20 = 8000 possible keys (all possible words) of three letters. Value of a key is the list of all positions in the query string of all words that give a high score if aligned (without gap) with the key word. Example: Query string: CINCINNATI Key: NCT If good scores are over 11 (with the scoring matrix BLOSUM62, it is a good threshold) then the value of the key is [3, 7]. That is because from BLOSUM62, the alignment between NCI (position 3) and NCT scores 6 + 9 - 1 = 14 and the alignment between NAT (position 7) score 6 + 0 + 5 = 11.
Scanning the target string The first goal is to identify hits: location of high scoring word pairs. Example: the BLOSUM62 scoring matrix is consideredand a good score is over 11. Query string: CINCINNATI Target string : PRECINCTS Value of the key NCT: [3, 7] Two hits are found for NCT: (3, 6) and (7, 6) Then the hits are extended (right and left) to whatever positions maximizes the overall alignment score ( a tolerance can be introduced). Example: we get at query position 3 the segment pair (CINCIN, CINCTS) with score 9 + 4 + 6 + 9+(-1) + 1 = 28, but the hit at position 7 (NAT, NCT) cannot be extended
MSP (Maximum Segment Pair) A maximum segment pair between two sequences is a segment pair of maximum score. Its score (MSP score) is a measure of sequence similarity (which has to be appreciated statistically). Alignments between the query sequence and target sequences which are reported come from MSPs.
2.Statistics of BLAST Database Searches The goal is to appreciate statistically how good is a (non gaped) local alignment which is given for example by BLAST (empirically, gaped local alignments seem to behave similarly). First, we study a case where substitution matrices (like PAM) do not intervene (typically DNA sequences) and then consider the general case.
Scores for random DNA The issue: Q: query sequence, D: target sequence, score of a match = 1, score of a mismatch = -1. A gap-free local assignment is discovered by BLAST (for example ) whose score is 13. Does it means that Q and D are related or simply that it occurs « by chance » ? The solution: Express the probability that an alignment with such a score occurs by chance and give a way to estimate and to compute this probability. Lower is this probability and higher we would believe that Q and D are related.
Scores for random DNA(cont.1) a) The first goal is to approximate the probability Ps(i, j ) that a local alignment terminating in positions i and j has a score better than s by a function of the score s: Ps(i, j) = Pr( best(i, j) s ) for s = 0, 1, 2… where best(i, j) is the best score of a local alignment ending at the ith position of D and the jth position of Q. « by chance » is interpreted as Q and D are « random sequence », ie. (1) each position is equally likely A, C, G, T and (2) each position is independent of every other; Because of (1): Pr( Di = Qj ) = 1 - Pr( Di # Qj ) = 1/4 Because of (2): Ps(i, j) = Pr( Di = Qj ) * Pr( best(i, j) s | Di = Qj ) + Pr( Di # Qj ) * Pr( best(i, j) s | Di # Qj )
Scores for random DNA(cont.2) The conditional probabilities are easily expressed in term of the Ps(i, j)’s. Pr( best(i, j) s | Di = Qj ) = Ps-1(i-1, j-1) Pr( best(i, j) s | Di # Qj ) = Ps+1(i-1, j-1) So Ps(i, j) could be defined by: Ps(i, j) = (3/4) * Ps+1(i-1, j-1) + (1/4)*Ps-1(i-1, j-1) with P0(i, j) = 1 Also, in our case, Ps(i, j) = 0 if i < s or j < s.
Scores for random DNA(cont.3) Then, one can compute for example P2(3, 5): P2(3, 5) = (3/4) * P3(2, 4) + (1/4)*P1(2, 4) = 0 + (1/4)*P1(2, 4) = (1/4)*((3/4) * P2(1, 3) + (1/4)*P0(1, 3)) = (1/4)*(1/4)*1 =1/16 Note that P2(3, 5) and P2(5, 3) would give the same answer. It appears that Ps(i, j) = Ps(j, i) = Ps(m, m) = Ps, m where m = min(i, j).
Scores for random DNA(cont.4) We get: Ps,m= (3/4) * Ps+1,m-1+ (1/4)* Ps-1,m-1 A majorant (we are cautious) of Ps,m is Ps = limm-> Ps,m As limm-> Ps,m = (3/4) * limm-> Ps+1,m-1+ (1/4)* limm-> Ps-1,m-1 then one can study Ps by using: Ps = (3/4) * Ps+1 + (1/4)* Ps-1 whose characteristic equation is 3x2 –4x +1 =0 with roots r0 = 1 and r1 = 1/3 So Ps = c0 r0s + c1 r1s = c0 + c1 (1/3)s where c0 and c1 are rational.
Scores for random DNA(cont.5) To determine c0 and c1 , we use P0 and P1 such that: P0 = 1 = c0 + c1 P1 = s1Ps- s2Ps = (1/4)* (P0 + P1 ) = 1/3 = c0 + (1/3)* c1 We getc0 = 0 andc1 = 1 So a majorant of Ps is (1/3)s which is a good estimate of Ps(i, j) unless i or j is quite small.
Scores for random DNA(cont.6) b) The second goal is to approximate the probability that a high score occurs at any of the positions (i, j). If Zsmn is the sum of the positions (i, j) such that best-score(i, j) s. This probability is: Pr( Zsmn 1) We use the following Markov inequality to get a majorant of Pr (Zsmn 1): Pr( Zsmn *E[Zsmn]) 1/ E[Zsmn] is the expected number of position pairs with score exceeding s. The goal is then to express it in terms of m, n and s. Doing so, we obtain 1/ .
Scores for random DNA(cont.7) E[Zsmn] = E [i=1m j=1n I sij ]where I sij (best(i, j ) s = i=1m j=1n E[I sij ] i=1m j=1n Ps,min(i, j) = m* n * Ps We get: Pr( Zsmn 1) Pr( Zsmn (m* n * Ps)-1*E[Zsmn]) And then Pr( Zsmn 1) m* n * Ps E[Zsmn] : E-value E Pr( Zsmn 1) : P-value P
Scores for random DNA(cont.8) • If the scores at the different positions were independent, statistics says that we would have (extreme value distribution) : P = 1 – e-E • Accepting this approximation, we get: P = 1 –(1-E + E2/2) + … • P and E are about the same when small (within 5% when either is less than O.1). • Conclusion from our example: • D and Q have a free-gapped alignment whose score is 13: • |D| = |Q| = 103 then E ~ 1O-3* 1O-3 * (1/3)13 ~ 1 as (1/3)13 ~ 1O-6 . The score 13 is not surprising. • |D| = |Q| = 102then E ~ 1O-2* 1O-2 * 1O-6 = 1O2 . • D and Q seem related.
BLAST scores for random residues This time, amino acid sequences have non-uniform background probabilities and scoring matrices. Example: suppose we have only three amino acids A, B and C with pA = 1/3, pB = 1/6, pC = 1/2 and the following scoring matrix sc:
BLAST scores for random residue (cont.1) As before the issue is to approximate Ps, m = Pr(best(m, m) s) by a function of s. We get: Ps, m = Pr( Qm =A Dm =A )*Pr(best(m, m) s | Qm =A Dm =A ) + Pr( Qm =A Dm =B )*Pr(best(m, m) s | Qm =A Dm =B ) +… = (1/3) * (1/3) * Ps-sc(A,A), m-1 + (1/3) * (1/6)* Ps-sc(A,B), m-1 +… = (1/9) * Ps-2, m-1 + (1/9) * Ps+3, m-1 +…
BLAST scores for random residue (cont.2) Passing to the limit, we get: Ps =(4/9)* Ps+3 + (1/6) *Ps + (5/18) Ps-1 + (1/9) Ps-2 Then the characteristic polynomial is: 8*x5 -15*x2 + 5*x + 2 = 0 And finally: Ps ~(0.9).(0.6287)s
3 BLAST output Roughly speaking, BLAST defines E-values such that: E ~ Kmne-s where K = c1 an = - ln r1 E-value must be lower than 10-4 to be considered for homology. It also introduce a bit score s’ such that E = m*n*2 –s’ Or s’ = (s – ln K)/ln 2 The higher the bit score, the more similar the sequences. Match below 50 are very unreliable.
BLAST output(cont.1) Are also given: -a listing of the database sequences with high-scoring alignments, along with the bit scores and the E-values of the alignments. -the alignments themselves with detailes (for example the number of identical residue pairs and with positive scores) -statistics related to the query and database as a whole. For example, the values of and of K for the substitution matrix used for the alignment (P ~ K *e- s)