Calculating substitution matrices

http://www.techfak.uni-bielefeld.de/bcd/Curric/PrwAli/nodeD.html#wm5 • Two models for sequence alignment: one random (R) and one match (M) • The random model assumes that each letter $a$ occurs independently with some frequency $q_a$, so the probability of the two sequences is just the product of the probabilities of each amino acid: $P(x,y \mid R) = \prod_i q_{x_i} \prod_j q_{y_j}$

Odds ratio • The match model aligns residues with a joint probability $p_{ab}$ • $P(x,y \mid M) = \prod_i p_{x_i y_i}$ • The ratio of match to random is known as the odds ratio: • $P(x,y \mid M) / P(x,y \mid R) = \prod_i \frac{p_{x_i y_i}}{q_{x_i} q_{y_i}}$

Log odds ratio • $s(a,b) = \log \frac{p_{ab}}{q_a q_b}$ • $S = \sum_i s(x_i, y_i)$ • The second equation is the sum of the individual scores for each aligned pair of residues. The first equation gives the entries of a score (substitution) matrix; for proteins this is a 20 × 20 matrix (e.g. BLOSUM, PAM).
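The two log-odds equations can be sketched in a few lines of Python. This is a minimal illustration, not a real matrix derivation: the two-letter alphabet and the frequencies `q` and `p` are made-up values chosen only so that the probabilities sum to 1, and the function names are hypothetical.

```python
import math

def log_odds(p_ab, q_a, q_b, base=2.0):
    """s(a, b) = log(p_ab / (q_a * q_b)), in the given log base."""
    return math.log(p_ab / (q_a * q_b), base)

# Hypothetical toy alphabet {A, B}; all numbers are illustrative assumptions.
q = {"A": 0.5, "B": 0.5}                      # background frequencies q_a
p = {("A", "A"): 0.4, ("B", "B"): 0.4,        # joint match probabilities p_ab
     ("A", "B"): 0.1, ("B", "A"): 0.1}

# Substitution matrix: one log-odds score per residue pair.
matrix = {(a, b): log_odds(p_ab, q[a], q[b]) for (a, b), p_ab in p.items()}

def alignment_score(x, y, scores):
    """S = sum_i s(x_i, y_i) for an ungapped alignment of equal-length strings."""
    if len(x) != len(y):
        raise ValueError("ungapped alignment needs sequences of equal length")
    return sum(scores[(a, b)] for a, b in zip(x, y))

if __name__ == "__main__":
    for pair, s in sorted(matrix.items()):
        print(pair, round(s, 3))
    print("S =", round(alignment_score("AAB", "ABB", matrix), 3))
```

Identical pairs get positive scores because they are more likely under the match model than under the random model, and mismatches get negative scores for the opposite reason.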

Significance of scores using alignment algorithms • Calculate a raw score • Sum of the scores for each letter-to-letter and letter-to-null (gap) position • Calculate a bit score • Normalizes for the scoring system used • Calculate an E-value • Calculated from the bit score to account for the probability that the hit arose by chance

Raw score • Calculated from substitution matrices (PAM, BLOSUM), and gap costs • There are substitution matrices for nucleotides also: • States, D.J., Gish, W. & Altschul, S.F. (1991) "Improved sensitivity of nucleic acid database searches using application-specific scoring matrices." Methods 3:66-70.

Bit score • $S' = (\lambda S - \ln K) / \ln 2$ • $\lambda$ and $K$ are parameters dependent upon the scoring system (substitution matrix and gap costs) employed • Karlin, S. & Altschul, S.F. (1990) "Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes." Proc. Natl. Acad. Sci. USA 87:2264-2268. • http://www.ncbi.nlm.nih.gov/BLAST/matrix_info.html#lambda • Gap costs: the standard cost associated with a gap of length g
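As a rough sketch of the normalization, the bit score formula translates directly into code. The values of `lam` and `k` below are assumptions of the order commonly quoted for BLOSUM62 with gapped alignment; they are placeholders for illustration, and the real values must be taken from the scoring system actually used.

```python
import math

def bit_score(raw_score, lam, k):
    """Normalized score S' = (lambda * S - ln K) / ln 2."""
    if lam <= 0 or k <= 0:
        raise ValueError("lambda and K must be positive")
    return (lam * raw_score - math.log(k)) / math.log(2)

if __name__ == "__main__":
    # Illustrative parameters only (assumed, not taken from the slides).
    lam, k = 0.267, 0.041
    for raw in (50, 100, 200):
        print(raw, "->", round(bit_score(raw, lam, k), 1), "bits")
```

Because the scoring-system dependence is absorbed by lambda and K, bit scores from searches run with different matrices or gap costs can be compared directly.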

Gap costs • Can be linear, like we did in our matrix: $\gamma(g) = -gd$ • Can be an "affine" score, the most prevalent now: $\gamma(g) = -d - (g-1)e$, where $d$ is called the gap-open penalty and $e$ is called the gap-extension penalty. The gap-extension penalty $e$ is usually less than $d$, so long insertions and deletions are penalized less than they would be under a linear cost
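The difference between the two gap models is easy to see numerically. The sketch below assumes illustrative penalties (d = 8 for the linear model, d = 12 and e = 2 for the affine model); these numbers are not from the slides.

```python
def linear_gap(g, d):
    """Linear gap cost: gamma(g) = -g * d for a gap of length g >= 1."""
    if g < 1:
        raise ValueError("gap length must be at least 1")
    return -g * d

def affine_gap(g, d, e):
    """Affine gap cost: gamma(g) = -d - (g - 1) * e for a gap of length g >= 1."""
    if g < 1:
        raise ValueError("gap length must be at least 1")
    return -d - (g - 1) * e

if __name__ == "__main__":
    # Assumed penalties, for illustration only.
    for g in (1, 2, 5, 10):
        print(g, linear_gap(g, d=8), affine_gap(g, d=12, e=2))
```

With these values a single-residue gap costs more under the affine model (-12 versus -8), but a gap of length 5 costs far less (-20 versus -40), which is the behavior the affine model is designed to give.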

E-value • $E = N / 2^{S'}$ • This is an approximation for the number ($E$) of distinct HSPs (high-scoring segment pairs) with normalized score at least $S'$ expected to occur by chance when two random protein sequences of sufficient lengths $m$ and $n$ are compared • $N = mn$ (search space size)

Database searching • If a protein is compared to a whole database, $n$ is the database length in residues • The equation can be converted to: • $S' = \log_2(N/E)$ • If a protein of length 250 is compared to a protein database of $5 \times 10^7$ residues, a normalized score of about 38 bits is necessary to achieve a marginally significant E-value of 0.05
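The conversion between bit score and E-value can be checked with a short calculation. This sketch uses a query of length 250, a target E-value of 0.05, and a database length of 5 × 10^7 residues, which is the size for which the 38-bit figure works out; the function names are hypothetical.

```python
import math

def e_value(bits, m, n):
    """E = N / 2**S', with search space N = m * n."""
    return (m * n) / 2.0 ** bits

def bits_needed(e, m, n):
    """Invert the formula: S' = log2(N / E)."""
    if e <= 0:
        raise ValueError("E must be positive")
    return math.log2((m * n) / e)

if __name__ == "__main__":
    m, n = 250, 5e7                      # query length, database length (assumed)
    s = bits_needed(0.05, m, n)
    print("bits needed:", round(s, 1))   # roughly 38 bits
    print("E at 38 bits:", e_value(38, m, n))
```

Each additional bit halves the expected number of chance hits, so doubling the database size raises the required score by exactly one bit.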

Significance of E-value • The E-value is non-negative; it is an expected number of chance hits, so it can exceed 1 • The lower the E-value, the more significant the match • Note that the E-value depends on the length of the query sequence: because $E = mn/2^{S'}$, the same bit score gives a smaller (more significant) E-value for a query of 100 amino acids than for one of 200 amino acids
