Burkhard Morgenstern, Grundlagen der Bioinformatik (Foundations of Bioinformatics): Substitution Matrices, Summer Semester 2007
Substitution matrices
All protein alignment programs depend on similarity scores s(a,b).
The similarity score s(a,b) for amino acids a and b is based on the probability p_{a,b} of the substitution a → b.
Idea: it is more reasonable to align amino acids that are frequently (i.e. with high probability) replaced by each other.
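Substitution matrices
Compute the similarity score s(a,b) for amino acids a and b from:
• the probability p_{a,b} of the substitution a → b (or b → a),
• the frequencies q_a of a and q_b of b.
Define s(a,b) = log( p_{a,b} / (q_a q_b) ).
The product q_a q_b is the probability of seeing a and b paired by chance, so s(a,b) is positive when the pair occurs more often than chance and negative when it occurs less often.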
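The definition of s(a,b) can be sketched in a few lines of Python. This is only an illustration: the function name log_odds_score, the choice of base 2 for the logarithm (the slide does not fix a base) and the example probabilities are all assumptions.

```python
import math

def log_odds_score(p_ab, q_a, q_b, base=2.0):
    """Return s(a,b) = log(p_ab / (q_a * q_b)) in the given log base.

    The base is an assumption; the slides only say "log".
    """
    if p_ab <= 0 or q_a <= 0 or q_b <= 0:
        raise ValueError("probabilities must be positive")
    return math.log(p_ab / (q_a * q_b), base)

# Hypothetical numbers, chosen only to show the sign of the score.
frequent = log_odds_score(p_ab=0.02, q_a=0.1, q_b=0.1)   # pair seen more often than chance
chance = log_odds_score(p_ab=0.01, q_a=0.1, q_b=0.1)     # pair seen as often as chance
rare = log_odds_score(p_ab=0.005, q_a=0.1, q_b=0.1)      # pair seen less often than chance
```

With these made-up inputs the three scores are positive, approximately zero and negative, respectively.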
Substitution matrices
• Estimate the frequency q_a as the relative frequency of a in the data (possibly with pseudocounts).
• Estimate the probability p_{a,b} of the substitution a → b from substitutions observed in real-world sequences.
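A minimal sketch of the frequency estimate, assuming additive pseudocounts (one extra count per amino acid by default). The names AMINO_ACIDS and estimate_frequencies are hypothetical, and the two input sequences are simply the example pair used later in these slides.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def estimate_frequencies(sequences, pseudocount=1.0):
    """Estimate the frequency of each amino acid as a relative frequency.

    A pseudocount is added to every amino acid so that residues
    missing from the sample still get a non-zero frequency.
    """
    counts = Counter()
    for seq in sequences:
        # Skip gap characters and any non-standard symbols.
        counts.update(ch for ch in seq if ch in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS}

q = estimate_frequencies([
    "ESWTSRQWERYTIALMSDQRREVLYWIALY",
    "ERWTSERQWERYTLALMSQRREALYWIALY",
])
```

Cysteine (C) never occurs in the two sequences, yet its estimated frequency is non-zero because of the pseudocount. Without it, any score involving C would require the logarithm of zero.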
Substitution matrices
Simplifying assumptions:
• Consider evolution as a random process: the substitution a → b occurs with a probability p_{a,b} that depends only on a and b.
• p_{a,b} = p_{a,b}(t), i.e. the probability depends on the time span t in evolution since the sequences originated from a common ancestor.
• p_{a,b} does not depend on the sequence position.
• Sequence positions are independent of each other.
• p_{a,b} = p_{b,a} (symmetry).
Substitution matrices Result: PAM matrix (Dayhoff et al.)
Substitution matrices
To calculate p_{a,b}: consider alignments of related proteins and count substitutions a → b (or b → a).
Unaligned sequences:
ESWTSRQWERYTIALMSDQRREVLYWIALY
ERWTSERQWERYTLALMSQRREALYWIALY
Aligned sequences (gaps shown as "-"):
ESWTS-RQWERYTIALMSDQRREVLYWIALY
ERWTSERQWERYTLALMS-QRREALYWIALY
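Counting substitutions in the aligned pair above can be sketched as follows. The function name count_substitutions is hypothetical. Pairs are stored in sorted order so that a → b and b → a are counted together, in line with the symmetry assumption; columns containing a gap are skipped.

```python
from collections import Counter

def count_substitutions(seq1, seq2, gap="-"):
    """Count unordered residue pairs (a, b) with a != b in aligned columns."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    counts = Counter()
    for a, b in zip(seq1, seq2):
        if a == gap or b == gap or a == b:
            continue  # skip gap columns and identical residues
        counts[tuple(sorted((a, b)))] += 1  # a -> b and b -> a share one key
    return counts

aligned_1 = "ESWTS-RQWERYTIALMSDQRREVLYWIALY"
aligned_2 = "ERWTSERQWERYTLALMS-QRREALYWIALY"
substitutions = count_substitutions(aligned_1, aligned_2)
```

For this pair the counter holds three substitutions, one each for the pairs (R, S), (I, L) and (A, V). The two gap columns contribute nothing.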
Substitution matrices
Problems involved:
1. The probability p_{a,b} depends on the time t since the sequences separated in evolution: p_{a,b} = p_{a,b}(t). However, p_{a,b}(t) is not linear in t for large t.
2. The alignment of the protein families must be known.
3. Multiple mutations can occur at one sequence position.
4. Protein families contain multiple sequences, so the phylogenetic tree must be known.
Substitution matrices
• Solution for problems 1 to 3 (time dependence, alignment, multiple mutations):
  • Look at small evolutionary distances first, where sequences can be aligned reliably and multiple mutations at one position are rare, and normalize to a distance of 1 PAM (percent accepted mutations, i.e. one accepted point mutation per 100 residues).
  • Calculate substitution matrices for larger distances from those for small distances.
• Solution for problem 4 (the tree must be known): use parsimony to find the tree.
M. Dayhoff et al. (1978), Atlas of Protein Sequence and Structure: PAM matrices
Substitution matrices
Calculation of p_{a,b}(t):
• Consider multiple alignments of families of closely related proteins.
• Count substitutions a → b (or b → a) in the alignments, based on the phylogenetic tree.
• Estimate p_{a,b}(t) for small times t.
• Normalize to the distance t = 1 PAM (percent accepted mutations).
• Calculate the conditional probabilities p(a|b,t) = p_{a,b}(t) / q_b for small t.
• Calculate p(a|b,t) for larger evolutionary distances by matrix multiplication.
• Calculate p_{a,b}(t) = p(a|b,t) q_b for larger evolutionary distances.
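The last three steps can be sketched in plain Python on a toy three-letter alphabet. Everything numeric here is an assumption made for illustration: the 1-PAM matrix pam1 (with entry [a][b] holding p(a|b, 1 PAM), so that each column sums to 1 and 1% of positions change) and the uniform background frequencies q. The helper names mat_mul, mat_pow and joint_probabilities are hypothetical as well.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(m, n):
    """Return the n-th power of m (n >= 1) by repeated squaring."""
    if n < 1:
        raise ValueError("n must be at least 1")
    result, base = None, m
    while n:
        if n & 1:
            result = base if result is None else mat_mul(result, base)
        n >>= 1
        if n:
            base = mat_mul(base, base)
    return result

def joint_probabilities(cond, q):
    """Turn conditional p(a|b,t) into joint p_{a,b}(t) = p(a|b,t) * q_b."""
    n = len(q)
    return [[cond[a][b] * q[b] for b in range(n)] for a in range(n)]

# Toy 1-PAM matrix (assumed values): 99% of positions stay unchanged.
pam1 = [[0.990, 0.005, 0.005],
        [0.005, 0.990, 0.005],
        [0.005, 0.005, 0.990]]
q = [1.0 / 3.0] * 3            # assumed uniform background frequencies

pam2 = mat_pow(pam1, 2)        # p(a|b, 2 PAM)
pam250 = mat_pow(pam1, 250)    # p(a|b, 250 PAM)
joint250 = joint_probabilities(pam250, q)
```

In this toy model the diagonal of pam250 stays above the background frequency of 1/3 instead of dropping linearly with t. Repeated mutations at the same position partly undo each other, which is why p_{a,b}(t) is not linear in t for large t.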
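Substitution matrices
Alternative: BLOSUM matrices
S. Henikoff and J.G. Henikoff, PNAS, 1992
Basis: BLOCKS database, i.e. gap-free regions of multiple alignments.
• Cluster sequences whose percentage of identical positions is at least L.
• Estimate p_{a,b} directly from the pairs observed between clusters, without extrapolating from small distances.
Commonly used values: L = 62 and L = 50 (BLOSUM62 and BLOSUM50).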
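The clustering step can be sketched as follows. This is an illustration, not the published procedure: it assumes single-linkage clustering on percent identity, weights each sequence by 1 / (size of its cluster) and ignores pairs inside a cluster. The block of three short sequences is made up, and all function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def percent_identity(s, t):
    """Percentage of identical positions in two gap-free, equal-length segments."""
    return 100.0 * sum(a == b for a, b in zip(s, t)) / len(s)

def cluster_block(block, threshold):
    """Single-linkage clustering: join sequences with identity >= threshold."""
    parent = list(range(len(block)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(block)), 2):
        if percent_identity(block[i], block[j]) >= threshold:
            parent[find(i)] = find(j)
    return [find(i) for i in range(len(block))]

def weighted_pair_counts(block, threshold):
    """Count residue pairs between clusters, weighting each cluster as one sequence."""
    labels = cluster_block(block, threshold)
    size = defaultdict(int)
    for lab in labels:
        size[lab] += 1
    counts = defaultdict(float)
    for i, j in combinations(range(len(block)), 2):
        if labels[i] == labels[j]:
            continue  # pairs within one cluster are not counted
        weight = 1.0 / (size[labels[i]] * size[labels[j]])
        for a, b in zip(block[i], block[j]):
            counts[tuple(sorted((a, b)))] += weight
    return labels, dict(counts)

block = ["AVLK", "AVLR", "GILK"]  # made-up gap-free block
labels, counts = weighted_pair_counts(block, threshold=62)
```

Here the first two sequences are 75% identical and fall into one cluster, so together they count like a single sequence when compared with the third. Raising the threshold leaves more sequences unclustered and lets closely related pairs contribute to the counts.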