Burkhard Morgenstern: Grundlagen der Bioinformatik (Foundations of Bioinformatics), Substitution Matrices, SS 07

Substitution matrices
• All protein alignment programs depend on similarity scores s(a,b).
• The similarity score s(a,b) for amino acids a and b is based on the probability p_{a,b} of the substitution a → b.
• Idea: it is more reasonable to align amino acids that are frequently (with high probability) replaced by each other.

Substitution matrices
Compute the similarity score s(a,b) for amino acids a and b from:
• the probability p_{a,b} of the substitution a → b (or b → a),
• the frequency q_a of a.
Define s(a,b) = log( p_{a,b} / (q_a q_b) )
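The definition above can be sketched in a few lines of Python. The slide does not fix the base of the logarithm, so the natural logarithm is an assumption here, and the function name `log_odds_score` as well as the numbers are hypothetical and not taken from any real matrix.

```python
import math

def log_odds_score(p_ab: float, q_a: float, q_b: float) -> float:
    """Return s(a,b) = log( p_ab / (q_a * q_b) ).

    The natural logarithm is an assumption; the slide leaves the base open.
    """
    return math.log(p_ab / (q_a * q_b))

# Hypothetical toy values for illustration only.
q_a, q_b = 0.1, 0.05

# Pair observed more often than expected by chance: positive score.
print(log_odds_score(0.01, q_a, q_b))

# Pair observed less often than expected by chance: negative score.
print(log_odds_score(0.0025, q_a, q_b))
```

A pair that occurs exactly as often as expected for independent residues (p_{a,b} = q_a q_b) gets a score of about zero.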

Substitution matrices
• Estimate the frequency q_a as the relative frequency of a (possibly with pseudocounts).
• Estimate the probability p_{a,b} of the substitution a → b from observed substitutions in real-world sequences.
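As a minimal sketch of the first estimate, the following Python function computes relative amino acid frequencies with pseudocounts. The pseudocount value of 1 per residue type and the name `estimate_frequencies` are assumptions for illustration; the two input sequences are the example sequences used later in this lecture.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def estimate_frequencies(sequences, pseudocount=1.0):
    """Relative frequency of each amino acid, smoothed with a pseudocount.

    The pseudocount (assumed value: 1 per residue type) keeps residues
    that never occur in the sample from getting frequency zero.
    """
    counts = Counter()
    for seq in sequences:
        counts.update(ch for ch in seq if ch in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS}

q = estimate_frequencies([
    "ESWTSRQWERYTIALMSDQRREVLYWIALY",
    "ERWTSERQWERYTLALMSQRREALYWIALY",
])
print(q["R"], q["C"])  # C does not occur in the sample but still gets q > 0
```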

Substitution matrices
Simplifying assumptions:
• Consider evolution as a random process: the substitution a → b occurs with probability p_{a,b}, depending on a and b.
• p_{a,b} = p_{a,b}(t), i.e. the probability depends on the time span t in evolution since the sequences originated from a common ancestor.
• p_{a,b} does not depend on the sequence position.
• Sequence positions are independent of each other.
• p_{a,b} = p_{b,a} (symmetry).

Substitution matrices Result: PAM matrix (Dayhoff et al.)

Substitution matrices
To calculate p_{a,b}: consider alignments of related proteins and count substitutions a → b (or b → a).

ESWTSRQWERYTIALMSDQRREVLYWIALY
ERWTSERQWERYTLALMSQRREALYWIALY

Substitution matrices
To calculate p_{a,b}: consider alignments of related proteins and count substitutions a → b (or b → a).

ESWTS-RQWERYTIALMSDQRREVLYWIALY
ERWTSERQWERYTLALMS-QRREALYWIALY
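Counting the substitutions in the gapped alignment above could look roughly like this in Python. The function name `count_substitutions` is hypothetical, and treating a → b and b → a as the same unordered pair follows the symmetry assumption stated earlier; columns containing a gap are skipped.

```python
from collections import Counter

def count_substitutions(row1, row2, gap="-"):
    """Count mismatched residue pairs in a pairwise alignment.

    Pairs are stored unordered (a -> b and b -> a are the same event);
    columns with a gap in either row are ignored.
    """
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have equal length")
    counts = Counter()
    for a, b in zip(row1, row2):
        if a == gap or b == gap:
            continue
        if a != b:
            counts[tuple(sorted((a, b)))] += 1
    return counts

subs = count_substitutions(
    "ESWTS-RQWERYTIALMSDQRREVLYWIALY",
    "ERWTSERQWERYTLALMS-QRREALYWIALY",
)
print(dict(subs))
```

For this alignment the function finds three substitutions: S ↔ R, I ↔ L and V ↔ A.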

Substitution matrices
Problems involved:
1. The probability p_{a,b} depends on the time t since the sequences separated in evolution: p_{a,b} = p_{a,b}(t). But p_{a,b}(t) is not linear in t for large t.
2. The alignment of the protein families must be known.
3. Multiple mutations can occur at one sequence position.
4. Protein families contain multiple sequences, so the phylogenetic tree must be known.

Substitution matrices
• Solution for 1. to 3. (time dependence, alignment, multiple mutations):
  • Look at small evolutionary distances first and normalize to distance = 1 PAM (point accepted mutation: one accepted substitution per 100 residues, hence often read as "percent accepted mutations").
  • Calculate substitution matrices for larger distances from those for small distances.
• Solution for 4. (tree must be known): use parsimony to find the tree.
M. Dayhoff et al. (1978), Atlas of Protein Sequence and Structure: PAM matrices

Substitution matrices
Calculation of p_{a,b}(t):
• Consider multiple alignments of closely related protein families.
• Count substitutions a → b (or b → a) in the alignments, based on a phylogenetic tree.
• Estimate p_{a,b}(t) for small times t.
• Normalize to distance t = 1 PAM (one accepted mutation per 100 residues).
• Calculate the conditional probabilities p(a|b,t) for small t.
• Calculate p(a|b,t) for larger evolutionary distances by matrix multiplication.
• Calculate p_{a,b}(t) for larger evolutionary distances.
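The last three steps can be sketched as follows. This is a toy illustration, not Dayhoff's data: the three-letter alphabet, the uniform frequencies q and the entries of `PAM1` are hypothetical, chosen so that on average 1% of positions change per unit of distance. The convention `P[b][a] = p(a|b,t)` (rows sum to 1) and the natural logarithm are likewise assumptions.

```python
import math

def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(M, n):
    """n-th power of M by repeated multiplication (n >= 0)."""
    size = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, M)
    return result

def log_odds(P, q):
    """Scores log( p(a|b,t) / q_a ), which equals log( p_{a,b}(t) / (q_a q_b) )
    because p_{a,b}(t) = q_b * p(a|b,t). Result is indexed [b][a]."""
    n = len(P)
    return [[math.log(P[b][a] / q[a]) for a in range(n)] for b in range(n)]

# Hypothetical toy model over a three-letter alphabet.
q = [1 / 3, 1 / 3, 1 / 3]
PAM1 = [
    [0.988, 0.008, 0.004],
    [0.008, 0.989, 0.003],
    [0.004, 0.003, 0.993],
]

PAM250 = mat_pow(PAM1, 250)   # p(a|b,t) for t = 250 PAM
S250 = log_odds(PAM250, q)    # similarity scores at that distance

for row in S250:
    print([round(s, 3) for s in row])
```

Because the conditional probabilities are multiplied rather than scaled linearly, repeated substitutions at the same position are accounted for, which is why p_{a,b}(t) is not linear in t for large t.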

Substitution matrices
Alternative: BLOSUM matrices
S. Henikoff and J.G. Henikoff, PNAS, 1992
Basis: BLOCKS database, gap-free regions of multiple alignments.
• Cluster sequences whose percentage identity exceeds a threshold L.
• Estimate p_{a,b}(t) directly from the observed pair frequencies, without extrapolating from small distances.
Commonly used values: L = 62 (BLOSUM62) and L = 50 (BLOSUM50).
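A much simplified sketch of the clustering and counting idea is shown below. It is not the Henikoffs' implementation: the single-linkage clustering, the use of percent identity as the similarity measure, the toy block and all function names are assumptions made for illustration. Sequences in the same cluster are not compared with each other, and each cluster is weighted so that it counts like a single sequence.

```python
from collections import defaultdict
from itertools import combinations

def percent_identity(s, t):
    """Percentage of identical positions in two gap-free, equal-length rows."""
    return 100.0 * sum(a == b for a, b in zip(s, t)) / len(s)

def cluster_labels(block, threshold):
    """Single-linkage clustering (assumed): two rows are linked if their
    identity exceeds the threshold. Returns one cluster label per row."""
    parent = list(range(len(block)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(block)), 2):
        if percent_identity(block[i], block[j]) > threshold:
            parent[find(i)] = find(j)
    return [find(i) for i in range(len(block))]

def weighted_pair_counts(block, threshold):
    """Count residue pairs between rows of different clusters; each row is
    down-weighted by its cluster size so a cluster counts as one sequence."""
    labels = cluster_labels(block, threshold)
    size = defaultdict(int)
    for lab in labels:
        size[lab] += 1
    counts = defaultdict(float)
    for i, j in combinations(range(len(block)), 2):
        if labels[i] == labels[j]:
            continue
        w = 1.0 / (size[labels[i]] * size[labels[j]])
        for a, b in zip(block[i], block[j]):
            counts[tuple(sorted((a, b)))] += w
    return dict(counts)

# Hypothetical toy block: rows 1 and 2 are 75% identical and form one cluster.
block = ["AVLK", "AVLR", "GILK"]
counts = weighted_pair_counts(block, threshold=62)
total = sum(counts.values())
pair_freq = {pair: c / total for pair, c in counts.items()}
print(pair_freq)
```

The resulting relative pair frequencies play the role of the directly estimated p_{a,b} and can be turned into scores with the log-odds definition given earlier.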

Lecture: Grundlagen der Bioinformatik, gobics.de/lectures/ss07/grundlagen

Burkhard Morgenstern, Institut für Mikrobiologie und Genetik

Editors-in-Chief: Burkhard Morgenstern (Germany) Peter Stadler (Germany)
