Learn how substitution matrices are used to calculate similarity scores for amino acids, based on substitution probabilities, frequencies, and evolutionary time spans. Explore the principles behind PAM and BLOSUM matrices.
Burkhard Morgenstern, Foundations of Bioinformatics (Grundlagen der Bioinformatik): Substitution matrices, summer semester 2007
Substitution matrices All protein alignment programs depend on similarity scores s(a,b). The similarity score s(a,b) for amino acids a and b is based on the probability p_{a,b} of the substitution a → b. Idea: it is more reasonable to align amino acids that are frequently (i.e. with high probability) replaced by each other.
Substitution matrices Compute the similarity score s(a,b) for amino acids a and b from:
• the probability p_{a,b} of the substitution a → b (or b → a),
• the frequencies q_a and q_b of a and b.
Define s(a,b) = log( p_{a,b} / (q_a q_b) )
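As a rough illustration of the definition above, the following Python sketch evaluates the log-odds score for made-up probabilities. The function name `log_odds_score`, the base-2 logarithm, and all numeric values are assumptions for illustration only; the slides do not fix a log base.

```python
import math


def log_odds_score(p_ab, q_a, q_b, base=2.0):
    """Return s(a,b) = log(p_ab / (q_a * q_b)) in the given log base."""
    if p_ab <= 0 or q_a <= 0 or q_b <= 0:
        raise ValueError("probabilities must be positive")
    return math.log(p_ab / (q_a * q_b), base)


# Hypothetical values: both residues have frequency 0.1, so a chance
# pairing would be expected with probability 0.1 * 0.1 = 0.01.
frequent = log_odds_score(0.02, 0.1, 0.1)   # pair seen more often than chance
neutral = log_odds_score(0.01, 0.1, 0.1)    # pair seen about as often as chance
rare = log_odds_score(0.005, 0.1, 0.1)      # pair seen less often than chance
```

Pairs that are substituted more often than expected by chance get a positive score, pairs substituted less often get a negative one, which is what makes the score usable for ranking alignments.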
Substitution matrices
• Estimate the frequency q_a as the relative frequency of a (possibly with pseudocounts)
• Estimate the probability p_{a,b} of the substitution a → b from substitutions observed in real-world sequences
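A minimal sketch of the first estimate, the relative frequency of each amino acid with pseudocounts, could look as follows. The function name `background_frequencies`, the pseudocount of 1 per residue type, and the two input sequences are assumptions chosen for illustration.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def background_frequencies(sequences, pseudocount=1.0):
    """Relative frequency of each amino acid, smoothed with a pseudocount."""
    counts = Counter()
    for seq in sequences:
        # Ignore gap symbols and any other non-standard characters.
        counts.update(ch for ch in seq if ch in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS}


freqs = background_frequencies([
    "ESWTSRQWERYTIALMSDQRREVLYWIALY",
    "ERWTSERQWERYTLALMSQRREALYWIALY",
])
```

The pseudocount keeps residues that never occur in a small sample (cysteine in this toy input) from getting frequency zero, which would make the log-odds score undefined.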
Substitution matrices Simplifying assumptions:
• Evolution is treated as a random process: the substitution a → b occurs with a probability p_{a,b} that depends on a and b
• p_{a,b} = p_{a,b}(t), i.e. the probability depends on the time span t in evolution since the sequences diverged from their common ancestor
• p_{a,b} does not depend on the sequence position
• Sequence positions are independent of each other
• p_{a,b} = p_{b,a} (symmetry!)
Substitution matrices Result: PAM matrix (Dayhoff et al.)
Substitution matrices To calculate p_{a,b}: consider alignments of related proteins and count substitutions a → b (or b → a)
Substitution matrices Example: two related protein sequences, not yet aligned:
ESWTSRQWERYTIALMSDQRREVLYWIALY
ERWTSERQWERYTLALMSQRREALYWIALY
Substitution matrices The two sequences aligned, with gaps (-) inserted so that substitutions can be counted column by column:
ESWTS-RQWERYTIALMSDQRREVLYWIALY
ERWTSERQWERYTLALMS-QRREALYWIALY
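The counting step for the example alignment above can be sketched in Python as follows. The helper name `count_substitutions` is hypothetical, and the sketch simply skips columns containing a gap and treats a → b and b → a as the same event; it works on a single pair of sequences and does not use a phylogenetic tree.

```python
from collections import Counter


def count_substitutions(row1, row2, gap="-"):
    """Count unordered residue pairs per alignment column, skipping gap columns."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    counts = Counter()
    for a, b in zip(row1, row2):
        if a == gap or b == gap:
            continue
        # Sorting makes (a, b) and (b, a) the same key (symmetry assumption).
        counts[tuple(sorted((a, b)))] += 1
    return counts


pairs = count_substitutions(
    "ESWTS-RQWERYTIALMSDQRREVLYWIALY",
    "ERWTSERQWERYTLALMS-QRREALYWIALY",
)
# Keep only the columns where the two residues differ.
mismatches = {pair: n for pair, n in pairs.items() if pair[0] != pair[1]}
```

Here `pairs` also contains the identical columns (a aligned with a), which are needed later to turn the counts into probabilities, while `mismatches` holds only the observed substitutions.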
Substitution matrices Problems involved:
1. The probability p_{a,b} depends on the time t since the sequences separated in evolution: p_{a,b} = p_{a,b}(t). But p_{a,b}(t) is not linear in t for large t
2. The alignment of the protein families must be known!
3. Multiple mutations can occur at one sequence position
4. Protein families contain multiple sequences: the phylogenetic tree must be known!
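To make the non-linearity and the multiple-mutation problem concrete, here is a toy Python sketch. It assumes a deliberately crude model that is not part of the slides: every position mutates with probability `mu` per step, and the new residue is drawn uniformly from the other 19 amino acids. The function name `observed_difference` and all parameter values are hypothetical.

```python
def observed_difference(steps, mu=0.01, alphabet_size=20):
    """Fraction of positions that differ from the ancestor after `steps` rounds.

    Toy model: each position mutates with probability `mu` per round and the
    replacement is chosen uniformly among the other residues.
    """
    d = 0.0
    # Chance that an already changed position mutates back to the original residue.
    back = mu / (alphabet_size - 1)
    for _ in range(steps):
        d = d + (1.0 - d) * mu - d * back
    return d


# Compare the naive linear guess (steps * mu) with the modelled fraction.
for n in (1, 50, 100, 250):
    print(n, round(n * 0.01, 2), round(observed_difference(n), 3))
```

For a single step the two numbers agree, but for many steps the observed fraction of differing positions falls well below the linear guess, because repeated and reverse mutations at the same position hide earlier changes.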
Substitution matrices
• Solution for problems 1–3 (time dependence, alignment, multiple mutations):
  • Look at small evolutionary distances first and normalize to distance = 1 PAM (percent accepted mutations, i.e. one accepted point mutation per 100 residues)
  • Calculate substitution matrices for larger distances from those for small distances
• Solution for problem 4 (tree must be known): use parsimony to find the tree
M. Dayhoff et al. (1978), Atlas of Protein Sequence and Structure: PAM matrices
Substitution matrices Calculation of p_{a,b}(t):
• Consider multiple alignments of closely related protein families
• Count substitutions a → b (or b → a) in the alignments, based on a phylogenetic tree
• Estimate p_{a,b}(t) for small times t
• Normalize to distance t = 1 PAM (percent accepted mutations)
• Calculate the conditional probabilities p(a|b,t) for small t
• Calculate p(a|b,t) for larger evolutionary distances by matrix multiplication
• Calculate p_{a,b}(t) for larger evolutionary distances
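The steps above can be sketched in Python on a toy three-letter alphabet. Everything concrete here is an assumption for illustration: the alphabet `ABC`, the pair counts, the helper names, and the simple linear rescaling of the off-diagonal entries to 1 PAM, which is a simplification and not necessarily the exact procedure of Dayhoff et al.

```python
import math

ALPHABET = "ABC"  # hypothetical toy alphabet instead of 20 amino acids

# Hypothetical pair counts from alignments of closely related sequences.
PAIR_COUNTS = {
    ("A", "A"): 480, ("B", "B"): 290, ("C", "C"): 190,
    ("A", "B"): 20, ("A", "C"): 10, ("B", "C"): 10,
}


def joint_probabilities(pair_counts):
    """Symmetric joint probabilities p[a][b]; a pair a != b is split evenly."""
    total = sum(pair_counts.values())
    p = {a: {b: 0.0 for b in ALPHABET} for a in ALPHABET}
    for (a, b), n in pair_counts.items():
        if a == b:
            p[a][a] += n / total
        else:
            p[a][b] += n / (2 * total)
            p[b][a] += n / (2 * total)
    return p


def one_pam_matrix(p, q, target=0.01):
    """m[b][a] approximates p(a|b,t), rescaled so that 1% of residues change."""
    changed = sum(p[b][a] for b in ALPHABET for a in ALPHABET if a != b)
    scale = target / changed
    m = {}
    for b in ALPHABET:
        row = {a: scale * p[b][a] / q[b] for a in ALPHABET if a != b}
        row[b] = 1.0 - sum(row.values())
        m[b] = row
    return m


def multiply(x, y):
    """Matrix product of two conditional-probability matrices."""
    return {b: {a: sum(x[b][c] * y[c][a] for c in ALPHABET) for a in ALPHABET}
            for b in ALPHABET}


def power(m, n):
    """n-fold product of m with itself (distance n PAM)."""
    result = {b: {a: 1.0 if a == b else 0.0 for a in ALPHABET} for b in ALPHABET}
    for _ in range(n):
        result = multiply(result, m)
    return result


def score_matrix(m_n, q):
    """s(a,b) = log2(p_ab / (q_a q_b)) with p_ab = q_b * p(a|b,t)."""
    return {b: {a: math.log2(m_n[b][a] / q[a]) for a in ALPHABET} for b in ALPHABET}


p = joint_probabilities(PAIR_COUNTS)
q = {a: sum(p[a].values()) for a in ALPHABET}
m1 = one_pam_matrix(p, q)
s1 = score_matrix(m1, q)
s250 = score_matrix(power(m1, 250), q)
```

With these toy numbers the mismatch scores in `s250` are far less negative than in `s1`: at large distances substitutions are expected, so they are penalized less.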
Substitution matrices Alternative: BLOSUM matrices
S. Henikoff and J.G. Henikoff, PNAS, 1992
Basis: BLOCKS database, gap-free regions of multiple alignments.
• Cluster sequences whose percentage of identical positions is at least L
• Estimate p_{a,b}(t) directly
Default values: L = 62 (BLOSUM62), L = 50 (BLOSUM50)
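A minimal Python sketch of the BLOSUM idea is given below. The five-residue block, the helper names, and the unrounded base-2 scores are assumptions for illustration; the published matrices are derived from the full BLOCKS database and use scaled, rounded scores.

```python
import math
from collections import defaultdict
from itertools import combinations


def percent_identity(s, t):
    """Percentage of identical positions in two gap-free segments of equal length."""
    return 100.0 * sum(x == y for x, y in zip(s, t)) / len(s)


def cluster_block(block, threshold):
    """Single-linkage clustering: join segments with identity >= threshold percent."""
    parent = list(range(len(block)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(block)), 2):
        if percent_identity(block[i], block[j]) >= threshold:
            parent[find(i)] = find(j)
    clusters = defaultdict(list)
    for i in range(len(block)):
        clusters[find(i)].append(block[i])
    return list(clusters.values())


def weighted_pair_counts(clusters):
    """Count residue pairs between different clusters; each cluster weighs 1."""
    counts = defaultdict(float)
    for c1, c2 in combinations(clusters, 2):
        weight = 1.0 / (len(c1) * len(c2))
        for s in c1:
            for t in c2:
                for a, b in zip(s, t):
                    counts[tuple(sorted((a, b)))] += weight
    return counts


def log_odds(counts):
    """Log-odds score per unordered pair from observed vs. expected pair frequency."""
    total = sum(counts.values())
    q = defaultdict(float)
    for (a, b), n in counts.items():
        # Each pair contributes half a residue of each of its two types.
        q[a] += n / (2 * total)
        q[b] += n / (2 * total)
    scores = {}
    for (a, b), n in counts.items():
        expected = q[a] * q[a] if a == b else 2 * q[a] * q[b]
        scores[(a, b)] = math.log2((n / total) / expected)
    return scores


block = ["AVLKE", "AVLKD", "GILRE", "GVMRD"]  # hypothetical gap-free block
clusters = cluster_block(block, threshold=62)
counts = weighted_pair_counts(clusters)
scores = log_odds(counts)
```

With L = 62 the first two segments (80% identical) fall into one cluster and share its weight, so closely related sequences do not dominate the substitution counts.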