Calculating substitution matrices.
Calculating substitution matrices • http://www.techfak.uni-bielefeld.de/bcd/Curric/PrwAli/nodeD.html#wm5 • Two models, one random (R) and one match (M), for sequence alignment • The random model assumes that letter a occurs independently with some frequency $q_a$, so the probability of the two sequences is just the product of the probabilities of each amino acid: • $P(x,y|R) = \prod_i q_{x_i} \prod_j q_{y_j}$
Odds ratio • The match model aligns residues with a joint probability $p_{ab}$ • $P(x,y|M) = \prod_i p_{x_i y_i}$ • The ratio of match to random is known as the odds ratio: • $P(x,y|M)/P(x,y|R) = \prod_i p_{x_i y_i}/(q_{x_i} q_{y_i})$
Log odds ratio • $s(a, b) = \log\left(p_{ab}/(q_a q_b)\right)$ • $S = \sum_i s(x_i, y_i)$ • The second equation is the sum of the individual scores for each aligned pair of residues. The first equation defines the entries of a score (substitution) matrix; for proteins this is a 20 × 20 matrix (e.g. BLOSUM, PAM).
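The two equations on this slide can be sketched in a few lines of Python. This is only an illustration: the two-letter alphabet, the frequencies in `q` and `p`, and the function names are made up for the example and are not taken from any real substitution matrix.

```python
import math

# Hypothetical two-letter alphabet with made-up frequencies (not real data).
# The joint probabilities p_ab sum to 1 and their marginals equal q_a.
q = {"A": 0.55, "B": 0.45}                     # background frequencies q_a
p = {("A", "A"): 0.50, ("A", "B"): 0.05,
     ("B", "A"): 0.05, ("B", "B"): 0.40}       # match-model probabilities p_ab


def log_odds_matrix(p, q, base=2.0):
    """Return s(a, b) = log(p_ab / (q_a * q_b)) for every pair (a, b)."""
    return {(a, b): math.log(pab / (q[a] * q[b]), base)
            for (a, b), pab in p.items()}


def alignment_score(x, y, s):
    """Sum s(x_i, y_i) over an ungapped alignment of equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("ungapped alignment needs equal-length sequences")
    return sum(s[(a, b)] for a, b in zip(x, y))


s = log_odds_matrix(p, q)
S = alignment_score("AAB", "ABB", s)   # aligned pairs: (A,A), (A,B), (B,B)
```

Pairs that occur more often in the match model than by chance get a positive score, and pairs that occur less often get a negative one. Because the log of a product is the sum of the logs, `S` equals the log of the odds ratio for the whole alignment.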
Significance of scores using alignment algorithms • Calculate a raw score • Sum of the scores for each letter-to-letter and letter-to-null (gap) position • Calculate a bit score • Normalizes for the scoring system used • Calculate an E-value • Calculated from the bit score to account for the probability that the hit arose by chance
Raw score • Calculated from substitution matrices (PAM, BLOSUM), and gap costs • There are substitution matrices for nucleotides also: • States, D.J., Gish, W. & Altschul, S.F. (1991) "Improved sensitivity of nucleic acid database searches using application-specific scoring matrices." Methods 3:66-70.
Bit score • $S' = (\lambda S - \ln K)/\ln 2$ • $\lambda$ and $K$ are parameters dependent upon the scoring system (substitution matrix and gap costs) employed • Karlin, S. & Altschul, S.F. (1990) "Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes." Proc. Natl. Acad. Sci. USA 87:2264-2268. • http://www.ncbi.nlm.nih.gov/BLAST/matrix_info.html#lambda • Gap costs – the standard cost associated with a gap of length g
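As a rough sketch, the bit-score formula translates directly into Python. The function name `bit_score` is hypothetical, and the values `lam = 0.267` and `K = 0.041` are illustrative placeholders only; the real parameters have to be looked up for the substitution matrix and gap costs actually used.

```python
import math


def bit_score(raw_score, lam, K):
    """Normalized score S' = (lambda * S - ln K) / ln 2, in bits."""
    if K <= 0:
        raise ValueError("K must be positive")
    return (lam * raw_score - math.log(K)) / math.log(2)


# Placeholder parameters (assumed for illustration, not authoritative).
lam, K = 0.267, 0.041
S_raw = 100
S_bits = bit_score(S_raw, lam, K)
```

The point of the normalization is that $2^{-S'} = K e^{-\lambda S}$, so bit scores from different scoring systems can be compared on the same scale.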
Gap costs • Can be linear, like we did in our matrix: $\gamma(g) = -gd$ • Can be an "affine" score, most prevalent now: $\gamma(g) = -d - (g-1)e$ • where d is called the gap-open penalty and e is called the gap-extension penalty. The gap-extension penalty e is usually less than d, so that long insertions and deletions are penalized less than they would be under a linear cost
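A small Python sketch makes the difference between the two gap models concrete. The function names and the penalty values d = 11 and e = 1 are assumptions chosen for the example.

```python
def linear_gap(g, d):
    """Linear gap score: gamma(g) = -g * d."""
    if g < 0:
        raise ValueError("gap length must be non-negative")
    return -g * d


def affine_gap(g, d, e):
    """Affine gap score: gamma(g) = -d - (g - 1) * e, for a gap of length g >= 1."""
    if g < 1:
        raise ValueError("affine gap score is defined for gaps of length >= 1")
    return -d - (g - 1) * e


# Example penalties (hypothetical): gap-open d = 11, gap-extension e = 1.
d, e = 11, 1
scores = [(g, linear_gap(g, d), affine_gap(g, d, e)) for g in (1, 2, 5)]
# -> [(1, -11, -11), (2, -22, -12), (5, -55, -15)]
```

Both models charge the same for a gap of length 1, but under the affine model each additional gap position costs only e, so one long gap is much cheaper than several short ones.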
E-value • $E = N/2^{S'}$ • This is an approximation for the number (E) of distinct high-scoring segment pairs (HSPs) with normalized score at least S' expected to occur by chance when two random protein sequences of sufficient lengths m and n are compared • N = mn (search space size)
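The E-value formula can be sketched as a one-line Python function. The name `e_value` and the sequence lengths below are hypothetical and chosen so the arithmetic is easy to follow.

```python
def e_value(bit_score, m, n):
    """Expected number of chance HSPs: E = N / 2**S' with N = m * n."""
    if m <= 0 or n <= 0:
        raise ValueError("sequence lengths must be positive")
    return (m * n) / 2.0 ** bit_score


# Hypothetical lengths: N = 4 * 256 = 1024 = 2**10.
E_at_10_bits = e_value(10, 4, 256)   # 1024 / 2**10 = 1.0
E_at_11_bits = e_value(11, 4, 256)   # one extra bit halves E: 0.5
```

Each additional bit of score halves the expected number of chance hits, and doubling the search space doubles it.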
Database searching • If a protein is compared to a whole database, n is the database length in residues • The equation can be rearranged to: • $S' = \log_2(N/E)$ • If a protein of length 250 is compared to a protein database of $5 \times 10^7$ residues, then $N = 1.25 \times 10^{10}$, and to achieve a marginally significant E-value of 0.05 a normalized score of $\log_2(2.5 \times 10^{11}) \approx 38$ bits is necessary
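The rearranged equation can be checked numerically. In this sketch the function name `required_bit_score` is hypothetical. For a query of length 250 and E = 0.05, it gives about 37.9 bits for a database of 5 × 10^7 residues and about 34.5 bits for a database of 5 × 10^6 residues.

```python
import math


def required_bit_score(m, n, e):
    """Bit score needed for E-value e in a search space N = m * n: S' = log2(N / e)."""
    if m <= 0 or n <= 0 or e <= 0:
        raise ValueError("m, n and e must all be positive")
    return math.log2(m * n / e)


query_len = 250
threshold = 0.05
bits_large_db = required_bit_score(query_len, 5e7, threshold)  # about 37.9
bits_small_db = required_bit_score(query_len, 5e6, threshold)  # about 34.5
```

A tenfold larger database raises the required score by log2(10), roughly 3.3 bits.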
Significance of E-value • The E-value is an expected number of chance hits, so it is never negative but can exceed 1 • The lower the E-value, the more significant the match • Note that the E-value depends on the length of the query sequence through N = mn: for the same bit score, a query of 100 amino acids gives half the E-value of a query of 200 amino acids