
Burkhard Morgenstern: Foundations of Bioinformatics (Grundlagen der Bioinformatik), Substitution Matrices, SS 07 (summer semester 2007)

Learn how substitution matrices are used to calculate similarity scores for amino acids, based on substitution probabilities, frequencies, and evolutionary time spans. Explore the principles behind PAM and BLOSUM matrices.


Presentation Transcript


  1. Burkhard Morgenstern: Foundations of Bioinformatics (Grundlagen der Bioinformatik), Substitution Matrices, SS 07

  2. Substitution matrices. All protein alignment programs depend on similarity scores $s(a,b)$. The similarity score $s(a,b)$ for amino acids $a$ and $b$ is based on the probability $p_{a,b}$ of the substitution $a \to b$. Idea: it is more reasonable to align amino acids that are frequently (i.e. with high probability) replaced by each other.

  3. Substitution matrices. Compute the similarity score $s(a,b)$ for amino acids $a$ and $b$ from: • the probability $p_{a,b}$ of the substitution $a \to b$ (or $b \to a$), • the frequencies $q_a$ and $q_b$ of $a$ and $b$. Define $s(a,b) = \log \frac{p_{a,b}}{q_a q_b}$.
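The log-odds definition on this slide can be tried out with a few lines of Python. This is a minimal sketch: the function name `log_odds_score` and the numeric values are made up for illustration and are not taken from the lecture.

```python
import math

def log_odds_score(p_ab, q_a, q_b):
    """Return s(a,b) = log(p_ab / (q_a * q_b)), using the natural logarithm."""
    return math.log(p_ab / (q_a * q_b))

# Hypothetical background frequencies (illustration only).
q_a, q_b = 0.1, 0.05

# Pair observed twice as often as expected by chance: positive score.
s_frequent = log_odds_score(0.01, q_a, q_b)

# Pair observed half as often as expected by chance: negative score.
s_rare = log_odds_score(0.0025, q_a, q_b)

print(s_frequent, s_rare)
```

The score is positive when the pair is aligned more often than two independent draws from the background frequencies would predict, and negative when it is aligned less often.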

  4. Substitution matrices. • Estimate the frequency $q_a$ as the relative frequency of $a$ (possibly with pseudocounts). • Estimate the probability $p_{a,b}$ of the substitution $a \to b$ from observed substitutions in real-world sequences.
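Estimating the frequencies as relative frequencies with pseudocounts might look as follows. This is a sketch under assumptions: the helper name `estimate_frequencies`, the pseudocount of 1 per amino acid, and the choice of input sequences are illustrative, not prescribed by the lecture.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def estimate_frequencies(sequences, pseudocount=1.0):
    """Relative frequency of each amino acid, smoothed with a pseudocount."""
    counts = Counter()
    for seq in sequences:
        # Ignore gap characters and any non-standard symbols.
        counts.update(ch for ch in seq if ch in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS}

# Two short example sequences (assumed input data).
seqs = ["ESWTSRQWERYTIALMSDQRREVLYWIALY", "ERWTSERQWERYTLALMSQRREALYWIALY"]
q = estimate_frequencies(seqs)
print(q["C"], q["E"])
```

The pseudocount keeps amino acids that never occur in the sample (such as C here) from getting frequency zero, which would make the log-odds score undefined.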

  5. Substitution matrices. Simplifying assumptions: • Consider evolution as a random process: the substitution $a \to b$ occurs with a probability $p_{a,b}$ that depends on $a$ and $b$. • $p_{a,b} = p_{a,b}(t)$, i.e. the probability depends on the time span $t$ in evolution since the sequences originated from a common ancestor. • $p_{a,b}$ does not depend on the sequence position. • Sequence positions are independent of each other. • $p_{a,b} = p_{b,a}$ (symmetry).

  6. Substitution matrices Result: PAM matrix (Dayhoff et al.)

  7. Substitution matrices. To calculate $p_{a,b}$: consider alignments of related proteins and count substitutions $a \to b$ (or $b \to a$).

  8. Substitution matrices. To calculate $p_{a,b}$: consider alignments of related proteins and count substitutions $a \to b$ (or $b \to a$). Unaligned sequences:
     ESWTSRQWERYTIALMSDQRREVLYWIALY
     ERWTSERQWERYTLALMSQRREALYWIALY

  9. Substitution matrices. To calculate $p_{a,b}$: consider alignments of related proteins and count substitutions $a \to b$ (or $b \to a$). Aligned sequences:
     ESWTS-RQWERYTIALMSDQRREVLYWIALY
     ERWTSERQWERYTLALMS-QRREALYWIALY

  10. Substitution matrices. To calculate $p_{a,b}$: consider alignments of related proteins and count substitutions $a \to b$ (or $b \to a$). Aligned sequences:
     ESWTS-RQWERYTIALMSDQRREVLYWIALY
     ERWTSERQWERYTLALMS-QRREALYWIALY
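Counting the substitutions in the aligned pair from the slides can be sketched in Python. The function name `count_substitutions` is hypothetical. The sketch treats pairs as unordered (matching the symmetry assumption) and skips gap columns, which is one reasonable reading of "count substitutions", not necessarily the exact procedure used in the lecture.

```python
from collections import Counter

def count_substitutions(row1, row2):
    """Count unordered aligned residue pairs, skipping columns with a gap."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have equal length")
    pairs = Counter()
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            continue  # gap columns carry no substitution information
        pairs[tuple(sorted((a, b)))] += 1
    return pairs

row1 = "ESWTS-RQWERYTIALMSDQRREVLYWIALY"
row2 = "ERWTSERQWERYTLALMS-QRREALYWIALY"

pairs = count_substitutions(row1, row2)
# Keep only the columns where the two residues differ.
substitutions = {p: n for p, n in pairs.items() if p[0] != p[1]}
print(substitutions)
```

For this alignment the mismatching columns are S/R, I/L and V/A. The identical pairs are counted as well, because they are needed to estimate $p_{a,a}$.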

  11. Substitution matrices. Problems involved: (1) The probability $p_{a,b}$ depends on the time $t$ since the sequences separated in evolution: $p_{a,b} = p_{a,b}(t)$. But $p_{a,b}(t)$ is not linear in $t$ for large $t$. (2) The alignment of the protein families must be known. (3) Multiple mutations can occur at one sequence position. (4) Protein families contain multiple sequences, so the phylogenetic tree must be known.

  12. Substitution matrices. • Solution for problems 1 to 3 (time dependence, alignment, multiple mutations): look at small evolutionary distances first and normalize to distance 1 PAM (percent accepted mutations, i.e. one accepted substitution per 100 residues); then calculate substitution matrices for larger distances from those for small distances. • Solution for problem 4 (tree must be known): use parsimony to find the tree. Reference: M. Dayhoff et al. (1978), Atlas of Protein Sequence and Structure: PAM matrices.

  13. Substitution matrices. Calculation of $p_{a,b}(t)$: • Consider multiple alignments of closely related protein families. • Count substitutions $a \to b$ (or $b \to a$) in the alignments, based on the phylogenetic tree. • Estimate $p_{a,b}(t)$ for small times $t$. • Normalize to distance $t = 1$ PAM (percent accepted mutations). • Calculate the conditional probabilities $p(a|b,t)$ for small $t$. • Calculate $p(a|b,t)$ for larger evolutionary distances by matrix multiplication. • Calculate $p_{a,b}(t)$ for larger evolutionary distances.
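The last three steps (conditional probabilities for small $t$, matrix multiplication, joint probabilities for larger $t$) can be illustrated with a toy example. Everything concrete here is an assumption for illustration: a made-up 3-letter alphabet with uniform frequencies, a symmetric 1-PAM matrix `pam1` in which 1% of positions change, and the helper names `mat_mul`, `mat_pow` and `scores`. Real PAM matrices are 20 x 20 and estimated from data.

```python
import math

def mat_mul(x, y):
    """Product of two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(m, n):
    """n-th power of a square matrix by repeated multiplication."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result

def scores(cond, q):
    """Log-odds scores from conditional probabilities cond[b][a] = p(a|b,t).

    Uses p_ab(t) = q_b * p(a|b,t) and s(a,b) = log(p_ab / (q_a * q_b)).
    """
    size = len(q)
    return [[math.log(q[b] * cond[b][a] / (q[a] * q[b])) for a in range(size)]
            for b in range(size)]

# Toy 1-PAM matrix: each row sums to 1, and 1% of positions are substituted.
pam1 = [[0.990, 0.005, 0.005],
        [0.005, 0.990, 0.005],
        [0.005, 0.005, 0.990]]
q = [1.0 / 3.0] * 3

pam250 = mat_pow(pam1, 250)   # conditional probabilities at distance 250 PAM
s250 = scores(pam250, q)      # identities score positive, substitutions negative
print(s250[0][0], s250[0][1])
```

Multiplying the 1-PAM matrix with itself accounts for multiple mutations at the same position, which is why the probabilities for large distances are not simply 250 times the 1-PAM values.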

  14. Substitution matrices

  15. Substitution matrices. Alternative: BLOSUM matrices (S. Henikoff and J.G. Henikoff, PNAS, 1992). Basis: the BLOCKS database, i.e. gap-free regions of multiple alignments. • Cluster sequences whose percentage of identity is at least L. • Estimate $p_{a,b}(t)$ directly, without extrapolating from small distances. Commonly used values: L = 62 (BLOSUM62) and L = 50 (BLOSUM50).
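The clustering step can be sketched as follows. This is only an illustration of the idea: the function names, the single-linkage rule (a sequence joins a cluster if it reaches the threshold against any member) and the tiny example block are assumptions, not the actual BLOCKS data or the original BLOSUM program.

```python
def percent_identity(s1, s2):
    """Percentage of identical positions in two gap-free rows of equal length."""
    matches = sum(a == b for a, b in zip(s1, s2))
    return 100.0 * matches / len(s1)

def cluster_block(block, threshold):
    """Group rows of a block; rows at or above the identity threshold are linked."""
    clusters = []
    for seq in block:
        linked, rest = [], []
        for cluster in clusters:
            if any(percent_identity(seq, member) >= threshold for member in cluster):
                linked.append(cluster)
            else:
                rest.append(cluster)
        # Merge the new sequence with every cluster it is linked to.
        merged = [seq] + [member for cluster in linked for member in cluster]
        clusters = rest + [merged]
    return clusters

# Hypothetical gap-free block of three short rows.
block = ["AVLKE", "AVLKD", "GWPRT"]
clusters = cluster_block(block, threshold=62)
print(clusters)
```

Clustering reduces the influence of many near-identical sequences: pairs within one cluster are not counted as independent observations when the substitution frequencies are estimated.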
