Compositionally Adjusted Substitution Matrices for Protein Database Searches

Compositionally Adjusted Substitution Matrices for Protein Database Searches. Stephen Altschul, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health. Collaborators: Yi-Kuo Yu, Alejandro Schäffer, John Wootton, Richa Agarwala


Presentation Transcript


  1. Compositionally Adjusted Substitution Matrices for Protein Database Searches. Stephen Altschul, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health

  2. Collaborators: Yi-Kuo Yu, Alejandro Schäffer, John Wootton, Richa Agarwala, Mike Gertz, Aleksandr Morgulis (National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health). See: Yu, Wootton & Altschul (2003) PNAS 100:15688-15693; Yu & Altschul (2005) Bioinformatics 21:902-911; Altschul et al. (2005) FEBS J. 272:5101-5109.

  3. Log-odds scores. The scores of any local-alignment substitution matrix can be written in the form $s_{ij} = \frac{1}{\lambda} \ln \frac{q_{ij}}{p_i p_j}$, where the $p_i$ are background amino acid frequencies, the $q_{ij}$ are target frequencies and $\lambda$ is an arbitrary scale factor. (PNAS 87:2264-2268)
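
As an illustration of the log-odds form, here is a minimal Python sketch. The two-letter alphabet and its frequencies are invented for the example, and the function name `log_odds_scores` is hypothetical; real matrices such as BLOSUM-62 use 20 letters and round the scores to integers.

```python
import math

def log_odds_scores(q, p, lam):
    """Return s[i][j] = ln(q[i][j] / (p[i] * p[j])) / lam."""
    n = len(p)
    return [[math.log(q[i][j] / (p[i] * p[j])) / lam for j in range(n)]
            for i in range(n)]

# Toy two-letter alphabet (made-up numbers, not real amino acid data).
p = [0.6, 0.4]                      # background frequencies
q = [[0.45, 0.15],                  # target frequencies; rows sum to p
     [0.15, 0.25]]
lam = math.log(2) / 2               # scale chosen so scores are in half-bits
s = log_odds_scores(q, p, lam)
```

Pairs that occur more often among related sequences than chance predicts (here the diagonal) get positive scores; the others get negative scores.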

  4. The BLOSUM-62 matrix (PNAS 89:10915-10919)

     A   4
     R  -1   5
     N  -2   0   6
     D  -2  -2   1   6
     C   0  -3  -3  -3   9
     Q  -1   1   0   0  -3   5
     E  -1   0   0   2  -4   2   5
     G   0  -2   0  -1  -3  -2  -2   6
     H  -2   0   1  -1  -3   0   0  -2   8
     I  -1  -3  -3  -3  -1  -3  -3  -4  -3   4
     L  -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4
     K  -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5
     M  -1  -1  -2  -3  -1   0  -2  -3  -2   1   2  -1   5
     F  -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6
     P  -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7
     S   1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4
     T   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5
     W  -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11
     Y  -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7
     V   0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4
         A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V

  5. Amino acid compositional bias. Some sources of bias:
     Organismal bias
       AT-rich genomes: tend to have more of the amino acids FLINKYM
       GC-rich genomes: tend to have more of the amino acids PRAWG
     Protein family bias
       Transmembrane proteins: more hydrophobic residues
       Cysteine-rich proteins: more cysteines than usual

  6. Construction of an asymmetric log-odds substitution matrix. Given a (not necessarily symmetric) set of target frequencies $q_{ij}$, define two sets of background frequencies $p_i$ and $p'_j$ as the marginal sums of the $q_{ij}$: $p_i = \sum_j q_{ij}$ and $p'_j = \sum_i q_{ij}$. The substitution scores are then defined as $s_{ij} = \frac{1}{\lambda} \ln \frac{q_{ij}}{p_i p'_j}$. We call this matrix valid in the context of the $p_i$ and $p'_j$.
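
The construction can be sketched in a few lines of Python. The 2x2 target frequencies are invented for illustration, and `asymmetric_log_odds` is a hypothetical name, not a function from BLAST or the cited papers.

```python
import math

def asymmetric_log_odds(q, lam=1.0):
    """Build scores from target frequencies q using its own marginals."""
    p_row = [sum(row) for row in q]            # p_i  = sum over j of q_ij
    p_col = [sum(col) for col in zip(*q)]      # p'_j = sum over i of q_ij
    s = [[math.log(q[i][j] / (p_row[i] * p_col[j])) / lam
          for j in range(len(p_col))]
         for i in range(len(p_row))]
    return s, p_row, p_col

# Made-up asymmetric target frequencies (q[0][1] != q[1][0]).
q = [[0.40, 0.10],
     [0.20, 0.30]]
s, p, p_prime = asymmetric_log_odds(q)
```

Because the row and column marginals differ, the resulting scores are not symmetric: here `s[0][1]` and `s[1][0]` take different values.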

  7. Substitution matrix validity theorem. A substitution matrix can be valid for only a unique set of target and background frequencies, except in certain degenerate cases. (Proof omitted.) One can determine efficiently whether an arbitrary substitution matrix can be valid in some context and, if so, one can extract its unique target and background frequencies, and scale. (Proof and algorithms omitted.)

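  8. Choosing new target frequencies. Given new sets of background frequencies $P_i$ and $P'_j$, how should one choose appropriate target frequencies $Q_{ij}$?
     Consistency constraints: $\sum_j Q_{ij} = P_i$ and $\sum_i Q_{ij} = P'_j$.
     Close to the original $q_{ij}$: minimize the relative entropy $D = \sum_{ij} Q_{ij} \ln \frac{Q_{ij}}{q_{ij}}$.
     Sometimes it is desirable to constrain the relative entropy $H = \sum_{ij} Q_{ij} \ln \frac{Q_{ij}}{P_i P'_j}$.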
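
A rough Python sketch of the unconstrained case follows, assuming that "close" means minimum relative entropy (Kullback-Leibler divergence) from the original target frequencies. It uses iterative proportional fitting, a standard technique for this kind of marginal-constrained problem, and not the Newton-based procedure of the cited papers. The function names and the toy numbers are hypothetical, and the additional constraint on H is not handled.

```python
import math

def fit_target_frequencies(q, P_row, P_col, iters=500):
    """Rescale rows and columns of q until its marginals match P_row, P_col."""
    Q = [row[:] for row in q]
    for _ in range(iters):
        # Scale each row so that it sums to the new row background frequency.
        Q = [[x * P_row[i] / sum(row) for x in row] for i, row in enumerate(Q)]
        # Scale each column so that it sums to the new column frequency.
        col_sums = [sum(col) for col in zip(*Q)]
        Q = [[x * P_col[j] / col_sums[j] for j, x in enumerate(row)]
             for row in Q]
    return Q

def relative_entropy(Q, P_row, P_col):
    """H = sum of Q_ij * ln(Q_ij / (P_i * P'_j)), in nats."""
    return sum(x * math.log(x / (P_row[i] * P_col[j]))
               for i, row in enumerate(Q) for j, x in enumerate(row) if x > 0)

# Toy example: made-up original frequencies and new backgrounds.
q = [[0.45, 0.15],
     [0.15, 0.25]]
P_row, P_col = [0.7, 0.3], [0.5, 0.5]
Q = fit_target_frequencies(q, P_row, P_col)
H = relative_entropy(Q, P_row, P_col)
```

Each pass only multiplies rows and columns by constants, so the fitted Q keeps the form q_ij times a row factor times a column factor, which is the shape of the minimum relative entropy solution under the two marginal constraints.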

  9. Substitution matrices compared:
     Mode A: Standard BLOSUM-62 matrix.
     Mode B: Composition-adjusted matrix; no constraint on relative entropy (H).
     Mode C: Composition-adjusted matrix; H constrained to equal a constant (0.44 nats).
     Mode D: Composition-adjusted matrix; H constrained to equal that of the standard matrix in the new compositional context.

  10. Performance evaluation (mode D vs. mode A)

  11. BLOSUM-62 and sequence-specific background frequencies (percent)

      Amino              P. falciparum   M. tuberculosis
      acid    BLOSUM62   #16805184       #15607948
      -----   --------   -------------   ---------------
      A          7.4          4.8             13.9
      R          5.2          4.1              7.4
      N          4.5          8.9              2.8
      D          5.3          5.6              5.9
      C          2.5          2.1              1.9
      Q          3.4          3.0              3.6
      E          5.4          7.0              6.1
      G          7.4          6.2              9.5
      H          2.6          3.1              1.7
      I          6.8          9.0              4.4
      L          9.9          8.2              9.3
      K          5.8          8.2              1.9
      M          2.5          1.3              1.5
      F          4.7          5.1              2.5
      P          3.9          3.8              5.3
      S          5.7          7.4              4.4
      T          5.1          2.3              5.7
      W          1.3          1.0              0.8
      Y          3.2          4.6              2.8
      V          7.3          4.4              8.7

  12. Difference between a scaled, standard BLOSUM-62 and a compositionally adjusted BLOSUM-62

      P. falciparum
      A  -15  -55 -116  -76  -45  -60  -73  -23  -98  -54  -45  -77  -39  -92  -34  -52  -31  -79 -102  -34
      R    4   -9  -83  -43  -20  -26  -40    2  -66  -27  -16  -40   -9  -61   -5  -25   -3  -49  -71   -8
      N   48   22  -26    6   24   16    3   50  -20   15   25   -2   33  -19   38   21   44   -8  -28   34
      D   19   -8  -62  -12   -5  -12  -20   20  -51  -11   -3  -30    4  -47   12   -8   13  -37  -58    6
      C   21  -14  -74  -34   19  -20  -35   15  -57  -10   -1  -37    5  -47    6  -11   12  -35  -59    9
      Q   23   -2  -65  -24   -3    1  -19   21  -46   -9    1  -24   11  -45   14   -6   15  -30  -54    8
      E   22   -5  -66  -20   -6   -7  -14   19  -48  -12   -1  -27    6  -46   12   -8   14  -34  -56    7
      G  -25  -59 -115  -77  -52  -64  -77  -14 -102  -61  -51  -80  -45  -94  -38  -57  -37  -82 -106  -42
      H   54   27  -31    6   30   23    9   52    2   21   33    4   40   -9   44   25   46    1  -14   39
      I   26   -6  -67  -25    5  -11  -26   21  -50    9   13  -28   17  -35   14   -7   20  -28  -50   23
      L   23   -7  -70  -29    2  -13  -27   18  -51    1   15  -31   17  -35   12  -10   16  -29  -51   16
      K   43   20  -45   -5   17   13   -2   41  -29   11   20    2   28  -25   33   14   36  -13  -34   29
      M   30    1  -62  -23    8   -4  -20   25  -44    5   18  -23   31  -31   18   -2   23  -22  -46   22
      F   62   34  -29   12   41   26   13   61   -7   39   51    9   55   17   52   32   56   19   -1   55
      P  -31  -62 -123  -80  -56  -66  -80  -34 -106  -64  -54  -84  -48  -99  -23  -61  -39  -88 -110  -44
      S   19  -14  -72  -32   -6  -18  -32   15  -57  -17   -8  -35    0  -51    7   -7   12  -41  -62    2
      T   11  -21  -78  -41  -12  -26  -39    6  -65  -19  -11  -42   -3  -57    0  -17   12  -46  -67   -1
      W   60   31  -32    7   39   26   10   59  -12   31   42    6   49    4   48   28   52   37   -5   47
      Y   52   23  -38    0   29   17    2   48  -13   23   34   -1   39   -1   40   20   44    9   -5   41
      V   13  -21  -83  -42  -10  -28  -41    6  -67  -11   -6  -44    0  -52   -1  -22    4  -46  -65    9
           A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V

      Entries shown: score of the standard matrix subtracted from the adjusted one.

  13. Optimal alignments implied by modes A and D:
      Mode A: 29.7 bits (H = 0.51 nats)
      Mode D: 31.8 bits (H = 0.51 nats)
      Mode C: 33.1 bits (H = 0.44 nats)

  14. Substitution matrices compared:
      Mode A: Standard BLOSUM-62 matrix.
      Mode B: Composition-adjusted matrix; no constraint on relative entropy (H).
      Mode C: Composition-adjusted matrix; H constrained to equal a constant (0.44 nats).
      Mode D: Composition-adjusted matrix; H constrained to equal that of the standard matrix in the new compositional context.

  15. Performance of various matrices on 143 pairs of related sequences (FEBS J. 272:5101-5109)

  16. Empirical rules for invoking compositional adjustment when comparing two sequences:
      1: The length ratio of the longer to the shorter sequence is less than 3.

  17. One metric definition of distance between two composition vectors (IEEE Trans. Info. Theo. 49:1858-1860)
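
The slide's formula did not survive in the transcript. One well-known metric on probability vectors is the square root of the Jensen-Shannon divergence; the Python sketch below assumes that this is the distance meant. The normalization (a factor of 1/2 on each term, natural logarithms) is also an assumption, and a different constant factor would rescale any threshold applied to the distance. The function name `composition_distance` is hypothetical.

```python
import math

def composition_distance(p, q):
    """Square root of the Jensen-Shannon divergence between p and q (nats)."""
    total = 0.0
    for pi, qi in zip(p, q):
        m = (pi + qi) / 2          # midpoint composition for this letter
        if pi > 0:
            total += 0.5 * pi * math.log(pi / m)
        if qi > 0:
            total += 0.5 * qi * math.log(qi / m)
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(total, 0.0))

# Toy three-letter compositions (made-up numbers).
x = [0.5, 0.3, 0.2]
y = [0.2, 0.3, 0.5]
z = [0.1, 0.8, 0.1]
d_xy = composition_distance(x, y)
```

Unlike the raw relative entropy, this quantity is symmetric and satisfies the triangle inequality, which is what makes it usable as the side length of a triangle of compositions.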

  18. Empirical rules for invoking compositional adjustment when comparing two sequences:
      1: The length ratio of the longer to the shorter sequence is less than 3.
      2: The distance d between the compositions of the two sequences is less than 0.16.

  19. Law of cosines. In a triangle with sides of length a, b and c, the angle opposite the side of length c is $\theta = \arccos \frac{a^2 + b^2 - c^2}{2ab}$.

  20. Empirical rules for invoking compositional adjustment when comparing two sequences:
      1: The length ratio of the longer to the shorter sequence is less than 3.
      2: The distance d between the compositions of the two sequences is less than 0.16.
      3: The angle θ made by the compositions of the two sequences with the standard composition is less than 70°.
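
The three rules can be checked with a short Python sketch. It assumes that θ is the angle at the standard composition in the triangle formed by the two sequence compositions and the standard one, computed with the law of cosines from the three pairwise distances. The slides do not say how the rules are combined, so the sketch reports each rule separately. The function names and the sample numbers are hypothetical.

```python
import math

def angle_degrees(a, b, c):
    """Angle opposite side c in a triangle with sides a, b, c (a, b > 0)."""
    cos_theta = (a * a + b * b - c * c) / (2 * a * b)
    cos_theta = max(-1.0, min(1.0, cos_theta))   # clamp rounding error
    return math.degrees(math.acos(cos_theta))

def evaluate_rules(len1, len2, d_pair, d1_std, d2_std,
                   max_ratio=3.0, max_dist=0.16, max_angle=70.0):
    """Evaluate each empirical rule; thresholds default to the slide values.

    d_pair is the distance between the two sequence compositions;
    d1_std and d2_std are their distances to the standard composition.
    """
    ratio = max(len1, len2) / min(len1, len2)
    theta = angle_degrees(d1_std, d2_std, d_pair)
    return {
        "length_ratio": ratio < max_ratio,
        "distance": d_pair < max_dist,
        "angle": theta < max_angle,
    }

# Made-up lengths and distances for illustration.
rules = evaluate_rules(100, 250, d_pair=0.10, d1_std=0.20, d2_std=0.20)
```

A caller can then combine the three outcomes with `any(rules.values())` or `all(rules.values())`, depending on the policy wanted.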

  21. ROCn curves for Aravind set (NAR 29:2994-3005)

  22. ROCn curves for SCOP set (Proc IEEE 9:1834-1847)

  23. Future directions
      • Possibly less extensive use of SEG when compositional adjustment is invoked.
      • Application to PSI-BLAST.
