Explore the concept of homology in sequences, assess functional similarity, Amino Acids Sequence Alignment, Scoring System, and more.
Sequence similarity search II Searching for remote homologies
(How) can we decide if two sequences have the same function? Homolog = come from a common origin => have the same function
Homologous proteins = come from a common origin => have the same function Last Universal Common Ancestor
Homology
Rule of thumb:
- Proteins are homologous if 25%-35% identical
- DNA sequences are homologous if 70% identical
Can we always go by the rules?
Alignment between worm and human arrestin: VERY SIGNIFICANT, NOT HIGH IDENTITY
Assessing whether proteins are functionally homologous
High levels of the protein RBP4 (retinol binding protein 4) were found to be correlated with childhood obesity.
RBP4 = carrier of vitamin A in the blood
Assessing whether proteins are functionally homologous
RBP4 = carrier of vitamin A in the blood
RBP4 (retinol binding) and PAEP (pregnancy protein): E value = 0.49; identity = 24%
Are they functionally homologous???
The lipocalin protein family (each dot is a protein)
[Figure labels: PAEP, RBP4, retinol-binding protein, odorant-binding protein, apolipoprotein D]
Are they functionally homologous??? (PAEP, RBP4)
They belong to the same protein family = they have a common ancestor.
Their functions have probably diverged.
Sequence alignment based on amino acid (AA) similarity

TQSPSSLSASVGDTVTITCRASQSISTYLNWYQQKP----GKAPKLLIYAASSSQSGVPS
|| + |||| +|| ||| | +| | | | |
TQGKKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIKILGNQGSFLTKGPSKLNDRADS

RFSGSGSGTDFTLTINSLQPEDFATYYCQ---------------QSYSTPHFSQGTKLEI
| | | +| | | +|+ || || |++ | | || | +
RRSLWDQG-NFPLIIKNLKIEDSDTYICEVEDQKEEVQLLVFGLTANSDTHLLQGQSLTL

---KRTVAAPSVFIFPPSDEQLKSGTASVVCLLN---------NFYPREAKVQWKVD
++||| | +++ | | | + ||++|+|
TLESPPGSSPSVQCRSPRGKNIQGGKTLSVSQLELQDSGTWTCTVLQNQKKVEFKID

| = identity: 45/178 = 25%
+ = similarity: 63/178 = 35%
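The identity and similarity percentages in an alignment like this one can be recomputed by walking the two aligned rows column by column. The sketch below is only an illustration: the function name, the toy sequences and the small set of "similar" pairs are assumptions, not part of the slides, and identical residues are counted as similar too (as BLAST does for its "Positives").

```python
def alignment_stats(seq_a, seq_b, similar_pairs):
    """Return (identities, similarities, alignment_length) for two aligned rows.

    seq_a and seq_b must already be aligned (same length, '-' for gaps).
    similar_pairs is a set of frozensets naming residue pairs treated as similar.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    identities = 0
    similarities = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue  # a gap column is neither an identity nor a similarity
        if a == b:
            identities += 1
            similarities += 1  # assumption: identities also count as similar
        elif frozenset((a, b)) in similar_pairs:
            similarities += 1
    return identities, similarities, len(seq_a)


# Hypothetical similarity classes, chosen only for this example.
SIMILAR = {frozenset("DE"), frozenset("KR"), frozenset("IL")}

ident, simil, length = alignment_stats("MGYDE-K", "MGYEEAR", SIMILAR)
print(f"identity {ident}/{length}, similarity {simil}/{length}")
```

In a real search the "similar" pairs are not a fixed list: they are the pairs with a positive score in the chosen substitution matrix.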
How do we define the scoring system?
Given an alignment of closely related sequences, we can score the relation between amino acids based on how frequently they substitute for each other.

Protein X, E. coli    ...M G Y D E
Protein X, yeast      ...M G Y D E
Protein X, worm       ...M G Y E E
Protein X, chicken    ...M G Y D E
Protein X, mouse      ...M G Y Q E
Protein X, pig        ...M G Y D E
Protein X, monkey     ...M G Y E E
Protein X, human      ...M G Y E E

In the fourth column, E and D account for 7 of the 8 residues.
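The column count on this slide can be reproduced in a few lines. This is a minimal sketch: the dictionary simply restates the eight aligned fragments from the slide, and the helper name `column_counts` is made up for the example.

```python
from collections import Counter

# The eight aligned fragments of "Protein X" shown on the slide.
alignment = {
    "E. coli": "MGYDE",
    "yeast":   "MGYDE",
    "worm":    "MGYEE",
    "chicken": "MGYDE",
    "mouse":   "MGYQE",
    "pig":     "MGYDE",
    "monkey":  "MGYEE",
    "human":   "MGYEE",
}


def column_counts(sequences, column):
    """Count how often each residue occurs in one alignment column (0-based)."""
    return Counter(seq[column] for seq in sequences)


counts = column_counts(list(alignment.values()), 3)  # the fourth column
d_or_e = counts["D"] + counts["E"]
print(dict(counts), f"D or E: {d_or_e}/{len(alignment)}")
```

Columns where two residues keep replacing each other, as D and E do here, are the raw evidence from which a high substitution score for that pair is derived.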
D / E
[Figure: chemical structures of aspartate (Asp, D) and glutamate (Glu, E); the glutamate side chain has one more CH2 group than the aspartate side chain.]
PAM - Point Accepted Mutations
• Developed by Margaret Dayhoff, 1978.
• Analyzed very similar protein sequences
• “Accepted” mutations: do not negatively affect a protein’s fitness
• Used global alignment.
• Counted the number of substitutions (i,j) per amino acid pair: many i<->j substitutions => high score s(i,j)
Margaret Dayhoff, 1925-1983
Basic matrix: normalized probabilities multiplied by 10000

   Ala  Arg  Asn  Asp  Cys  Gln  Glu  Gly  His  Ile  Leu  Lys  Met  Phe  Pro  Ser  Thr  Trp  Tyr  Val
     A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V
A 9867    2    9   10    3    8   17   21    2    6    4    2    6    2   22   35   32    0    2   18
R    1 9913    1    0    1   10    0    0   10    3    1   19    4    1    4    6    1    8    0    1
N    4    1 9822   36    0    4    6    6   21    3    1   13    0    1    2   20    9    1    4    1
D    6    0   42 9859    0    6   53    6    4    1    0    3    0    0    1    5    3    0    0    1
C    1    1    0    0 9973    0    0    0    1    1    0    0    0    0    1    5    1    0    3    2
Q    3    9    4    5    0 9876   27    1   23    1    3    6    4    0    6    2    2    0    0    1
E   10    0    7   56    0   35 9865    4    2    3    1    4    1    0    3    4    2    0    1    2
G   21    1   12   11    1    3    7 9935    1    0    1    2    1    1    3   21    3    0    0    5
H    1    8   18    3    1   20    1    0 9912    0    1    1    0    2    3    1    1    1    4    1
I    2    2    3    1    2    1    2    0    0 9872    9    2   12    7    0    1    7    0    1   33
L    3    1    3    0    0    6    1    1    4   22 9947    2   45   13    3    1    3    4    2   15
K    2   37   25    6    0   12    7    2    2    4    1 9926   20    0    3    8   11    0    1    1
M    1    1    0    0    0    2    0    0    0    5    8    4 9874    1    0    1    2    0    0    4
F    1    1    1    0    0    0    0    1    2    8    6    0    4 9946    0    2    1    3   28    0
P   13    5    2    1    1    8    3    2    5    1    2    2    1    1 9926   12    4    0    0    2
S   28   11   34    7   11    4    6   16    2    2    1    7    4    3   17 9840   38    5    2    2
T   22    2   13    4    1    3    2    2    1   11    2    8    6    1    5   32 9871    0    2    9
W    0    2    0    0    0    0    0    0    0    0    0    0    0    1    0    1    0 9976    1    0
Y    1    0    3    0    3    0    1    0    4    1    1    0    0   21    0    1    1    2 9945    1
V   13    2    1    1    3    2    2    3    3   57   11    1   17    1    3    2   10    0    2 9901
Log Odds Matrices
• PAM matrices are converted to log-odds matrices
• Calculate the odds ratio for each substitution
• Take the scores in the previous matrix
• Divide by the frequency of the amino acid
• Convert the ratio to log10 and multiply by 10
• Take the average of the log-odds ratios for converting A to B and converting B to A
• Result: symmetric matrix
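The conversion steps listed above can be sketched in code as follows. Everything concrete here is an assumption made for illustration: the two-letter alphabet, the made-up probabilities and background frequencies, and the convention that `prob[a][b]` is the probability of `a` being replaced by `b`. It is not the actual Dayhoff data.

```python
import math


def log_odds_matrix(prob, freq):
    """Turn substitution probabilities into a symmetric log-odds score matrix.

    prob[a][b]: probability that residue a is replaced by residue b (assumed convention).
    freq[b]:    background frequency of residue b.
    """
    residues = list(freq)
    # Odds ratio, then log10, then multiply by 10.
    raw = {
        a: {b: 10 * math.log10(prob[a][b] / freq[b]) for b in residues}
        for a in residues
    }
    # Average the A->B and B->A values so the result is symmetric, then round.
    return {
        a: {b: round((raw[a][b] + raw[b][a]) / 2) for b in residues}
        for a in residues
    }


# Toy numbers, invented for this example only.
toy_prob = {"D": {"D": 0.9, "E": 0.1}, "E": {"D": 0.1, "E": 0.9}}
toy_freq = {"D": 0.5, "E": 0.5}

scores = log_odds_matrix(toy_prob, toy_freq)
print(scores)
```

A positive score means the pair is seen more often than expected by chance, and a negative score means it is seen less often.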
PAM250 log-odds matrix
Entry (i,j): the score of aligning amino acid i against amino acid j.
Similar amino acids have high scores.
Entry (i,i) is greater than any entry (i,j), j≠i.
The entries on the diagonal are not all identical.
The different PAM Matrices
• There are different PAM matrices (PAM1 to PAM250). PAMN is derived from PAM1 by raising the PAM1 matrix to the Nth power (multiplying N copies of PAM1 together).
• Low PAM matrices are suitable for strong local similarities (arrestin worm vs arrestin human)
• High PAM matrices are suitable for weak similarities (RBP4 and PAEP)
• PAM120 recommended for general use (40% identity)
• PAM60 for close relations (60% identity)
• PAM250 for distant relations (20% identity)
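The derivation of higher PAM matrices by repeated multiplication can be illustrated with a toy matrix. This is a sketch under stated assumptions: a hypothetical two-residue alphabet with a 1% change per step stands in for the real 20 x 20 PAM1 matrix, and the helper names are invented for the example.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, n):
    """Multiply n copies of m together (n = 0 gives the identity matrix)."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Toy "PAM1": each residue stays the same with probability 0.99 per step.
pam1_toy = [[0.99, 0.01],
            [0.01, 0.99]]

pam60_toy = mat_pow(pam1_toy, 60)
pam250_toy = mat_pow(pam1_toy, 250)
print(pam60_toy[0][0], pam250_toy[0][0])
```

The probability of a residue remaining unchanged shrinks as N grows, which is why high PAM numbers model distant relationships and low PAM numbers model close ones.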
BLOSUM = BLOcks SUbstitution Matrix
Steven and Jorja G. Henikoff (1992)
• Based on the BLOCKS database (families of proteins with identical function)
• Highly conserved protein domains
• Ungapped local alignment to identify motifs
• Each motif is a block of local alignment
• Counts amino acids observed in the same column
• Symmetrical model of substitution
[Figure: example sequences with aligned blocks]
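Counting the amino acids observed in the same column under a symmetrical model can be sketched like this. The function name and the toy column are assumptions for illustration. The sketch covers only the pair-counting step, not the clustering or the log-odds conversion that a full BLOSUM calculation also involves.

```python
from collections import Counter
from itertools import combinations


def column_pair_counts(column):
    """Count unordered residue pairs in one ungapped block column.

    The model is symmetrical, so (A, B) and (B, A) are the same pair.
    """
    counts = Counter()
    for a, b in combinations(column, 2):
        counts[tuple(sorted((a, b)))] += 1
    return counts


# Toy column from a hypothetical block of four sequences.
pairs = column_pair_counts("AABA")
print(dict(pairs))
```

A column with n sequences contributes n(n-1)/2 pairs; summing these counts over all columns of all blocks gives the observed pair frequencies.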
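BLOSUM Matrices
• Different BLOSUMn matrices are calculated independently from BLOCKS
• BLOSUMn is based on blocks in which sequences more than n percent identical are clustered and counted as one, so the counted substitutions come from sequences that are at most n percent identical.
[Figure: BLOSUM62 matrix]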
Selecting a BLOSUM Matrix
• For BLOSUMn, higher n is suitable for sequences that are more similar
• BLOSUM62 recommended for general use
• BLOSUM80 for close relations
• BLOSUM45 for distant relations
QUIZ
The score for ARG-LYS in BLOSUM45 is 3. What will the score for the same pair be in BLOSUM80?
• 2
• 3
• 4
• -1
Remote homologues
• Sometimes BLAST isn’t enough.
• When searching for homologs in large and diverse protein families, and/or when looking for homology between poorly conserved proteins in very distant species (E. coli vs human): PSI-BLAST
PSI-BLAST
General idea:
- Builds specialized scoring matrices that are specific to the family of interest
- Generates a position-specific scoring matrix
Page 138
PSI-BLAST steps:
[1] Select a query and search it against a protein database
[2] PSI-BLAST constructs a specialized multiple sequence alignment
[3] Creates a “profile” of the specialized alignment, scoring each position independently: a position-specific scoring matrix (PSSM)
Page 138
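Step [3] can be illustrated with a deliberately simplified profile builder. This is a sketch, not PSI-BLAST's actual procedure: it assumes a uniform background frequency, a flat pseudocount and a 2 x log2 scaling, and it omits the sequence weighting that the real program applies. The toy alignment is hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def build_pssm(msa, pseudocount=1.0):
    """Build one score row per alignment column from a multiple sequence alignment.

    msa: list of equal-length aligned sequences ('-' marks a gap).
    Returns a list of dicts mapping each amino acid to an integer score.
    """
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for column in zip(*msa):
        counts = Counter(r for r in column if r in AMINO_ACIDS)  # gaps are skipped
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        row = {}
        for aa in AMINO_ACIDS:
            p = (counts[aa] + pseudocount) / total
            row[aa] = round(2 * math.log2(p / background))
        pssm.append(row)
    return pssm


# Hypothetical alignment: the second column is an invariant cysteine.
toy_msa = ["RCDKN", "ICERL", "KCTTY", "RCDKG"]
toy_pssm = build_pssm(toy_msa)
print(toy_pssm[1]["C"], toy_pssm[1]["A"])
```

Because each column gets its own row of scores, a conserved position rewards its characteristic residue far more than a general-purpose matrix such as PAM or BLOSUM would.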
R,I,K C D,E,T K,R,T N,L,Y,G
Pos     A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
  1 M  -1  -2  -2  -3  -2  -1  -2  -3  -2   1   2  -2   6   0  -3  -2  -1  -2  -1   1
  2 K  -1   1   0   1  -4   2   4  -2   0  -3  -3   3  -2  -4  -1   0  -1  -3  -2  -3
  3 W  -3  -3  -4  -5  -3  -2  -3  -3  -3  -3  -2  -3  -2   1  -4  -3  -3  12   2  -3
  4 V   0  -3  -3  -4  -1  -3  -3  -4  -4   3   1  -3   1  -1  -3  -2   0  -3  -1   4
  5 W  -3  -3  -4  -5  -3  -2  -3  -3  -3  -3  -2  -3  -2   1  -4  -3  -3  12   2  -3
  6 A   5  -2  -2  -2  -1  -1  -1   0  -2  -2  -2  -1  -1  -3  -1   1   0  -3  -2   0
  7 L  -2  -2  -4  -4  -1  -2  -3  -4  -3   2   4  -3   2   0  -3  -3  -1  -2  -1   1
  8 L  -1  -3  -3  -4  -1  -3  -3  -4  -3   2   2  -3   1   3  -3  -2  -1  -2   0   3
  9 L  -1  -3  -4  -4  -1  -2  -3  -4  -3   2   4  -3   2   0  -3  -3  -1  -2  -1   2
 10 L  -2  -2  -4  -4  -1  -2  -3  -4  -3   2   4  -3   2   0  -3  -3  -1  -2  -1   1
 11 A   5  -2  -2  -2  -1  -1  -1   0  -2  -2  -2  -1  -1  -3  -1   1   0  -3  -2   0
 12 A   5  -2  -2  -2  -1  -1  -1   0  -2  -2  -2  -1  -1  -3  -1   1   0  -3  -2   0
 13 W  -2  -3  -4  -4  -2  -2  -3  -4  -3   1   4  -3   2   1  -3  -3  -2   7   0   0
 14 A   3  -2  -1  -2  -1  -1  -2   4  -2  -2  -2  -1  -2  -3  -1   1  -1  -3  -3  -1
 15 A   2  -1   0  -1  -2   2   0   2  -1  -3  -3   0  -2  -3  -1   3   0  -3  -2  -2
 16 A   4  -2  -1  -2  -1  -1  -1   3  -2  -2  -2  -1  -1  -3  -1   1   0  -3  -2  -1
...
 37 S   2  -1   0  -1  -1   0   0   0  -1  -2  -3   0  -2  -3  -1   4   1  -3  -2  -2
 38 G   0  -3  -1  -2  -3  -2  -2   6  -2  -4  -4  -2  -3  -4  -2   0  -2  -3  -3  -4
 39 T   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5  -3  -2   0
 40 W  -3  -3  -4  -5  -3  -2  -3  -3  -3  -3  -2  -3  -2   1  -4  -3  -3  12   2  -3
 41 Y  -2  -2  -2  -3  -3  -2  -2  -3   2  -2  -1  -2  -1   3  -3  -2  -2   2   7  -1
 42 A   4  -2  -2  -2  -1  -1  -1   0  -2  -2  -2  -1  -1  -3  -1   1   0  -3  -2   0
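A matrix like the one above is used by looking up each residue of a candidate sequence in the row for its position and adding the values. The sketch below copies the first three rows (positions 1 to 3, query residues M, K, W) from the matrix; the function names are invented for the example, and gap handling is left out.

```python
COLUMNS = "A R N D C Q E G H I L K M F P S T W Y V".split()

# Rows for positions 1-3 of the matrix above (query residues M, K, W).
ROWS = [
    "M -1 -2 -2 -3 -2 -1 -2 -3 -2 1 2 -2 6 0 -3 -2 -1 -2 -1 1",
    "K -1 1 0 1 -4 2 4 -2 0 -3 -3 3 -2 -4 -1 0 -1 -3 -2 -3",
    "W -3 -3 -4 -5 -3 -2 -3 -3 -3 -3 -2 -3 -2 1 -4 -3 -3 12 2 -3",
]


def parse_row(line):
    """Turn one text row into a dict mapping residue letter to score."""
    fields = line.split()
    return dict(zip(COLUMNS, (int(value) for value in fields[1:])))


def score_segment(pssm, segment):
    """Sum the position-specific scores of an ungapped segment."""
    if len(segment) != len(pssm):
        raise ValueError("segment length must match the number of PSSM rows")
    return sum(row[residue] for row, residue in zip(pssm, segment))


pssm = [parse_row(row) for row in ROWS]
for candidate in ("MKW", "MEW", "LRF"):
    print(candidate, score_segment(pssm, candidate))
```

Note that at position 2 a glutamate scores 4 while the query's own lysine scores 3, so the profile reflects what the family tolerates at that position rather than just the query sequence. The same tryptophan scores 12 at positions 3, 5 and 40 but only 7 at position 13, which a single PAM or BLOSUM value for W-W cannot express.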
PSI-BLAST, continued
[4] The PSSM is used as a query against the database
[5] PSI-BLAST estimates statistical significance (E values)
[6] Repeat steps [4] and [5] iteratively, typically 3-5 times. At each new search, a new profile is used as the query.
Page 138
Searching for remote homology using PSI-BLAST
PSI-BLAST alignment of RBP (retinol binding protein) and β-lactoglobulin: iteration 1
Score = 46.2 bits (108), Expect = 2e-04
Identities = 40/150 (26%), Positives = 70/150 (46%), Gaps = 37/150 (24%)

Query: 27 VKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDVC 86
V+ENFD ++ G WY + +K P + I A +S+ E G + K ++
Sbjct: 33 VQENFDVKKYLGRWYEI-EKIPASFEKGNCIQANYSLMENGNIEVLNK---------ELS 82

Query: 87 ADMVGTF---------TDTEDPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAVQYSCR 137
D GT ++ +PAK +++++ + +WI+ TDY+ YA+ YSC
Sbjct: 83 PD--GTMNQVKGEAKQSNVSEPAKLEVQFFPLMP-----PAPYWILATDYENYALVYSCT 135

Query: 138 ----LLNLDGTCADSYSFVFSRDPNGLPPE 163
L ++D + ++ R+P LPPE
Sbjct: 136 TFFWLFHVD------FFWILGRNPY-LPPE 158
PSI-BLAST alignment of RBP and β-lactoglobulin: iteration 2
Score = 140 bits (353), Expect = 1e-32
Identities = 45/176 (25%), Positives = 78/176 (43%), Gaps = 33/176 (18%)

Query: 4 VWALLLLAAWAAAERDCRVSSF--------RVKENFDKARFSGTWYAMAKKDPEGLFLQD 55
V L+ LA A + +F V+ENFD ++ G WY + +K P +
Sbjct: 2 VTMLMFLATLAGLFTTAKGQNFHLGKCPSPPVQENFDVKKYLGRWYEI-EKIPASFEKGN 60

Query: 56 NIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMV---GTFTDTEDPAKFKMKYWGVASF 112
I A +S+ E G + K + D + V ++ +PAK +++++ +
Sbjct: 61 CIQANYSLMENGNIEVLNKEL-----SPDGTMNQVKGEAKQSNVSEPAKLEVQFFPL--- 112

Query: 113 LQKGNDDHWIVDTDYDTYAVQYSCR----LLNLDGTCADSYSFVFSRDPNGLPPEA 164
+WI+ TDY+ YA+ YSC L ++D + ++ R+P LPPE
Sbjct: 113 --MPPAPYWILATDYENYALVYSCTTFFWLFHVD------FFWILGRNPY-LPPET 159
PSI-BLAST alignment of RBP and β-lactoglobulin: iteration 3
Score = 159 bits (404), Expect = 1e-38
Identities = 41/170 (24%), Positives = 69/170 (40%), Gaps = 19/170 (11%)

Query: 3 WVWALLLLAAWAAAERD--------CRVSSFRVKENFDKARFSGTWYAMAKKDPEGLFLQ 54
V L+ LA A + S V+ENFD ++ G WY + K
Sbjct: 1 MVTMLMFLATLAGLFTTAKGQNFHLGKCPSPPVQENFDVKKYLGRWYEIEKIPASFE-KG 59

Query: 55 DNIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMVGTFTDTEDPAKFKMKYWGVASFLQ 114
+ I A +S+ E G + K V + ++ +PAK +++++ +
Sbjct: 60 NCIQANYSLMENGNIEVLNKELSPDGTMNQVKGE--AKQSNVSEPAKLEVQFFPL----- 112

Query: 115 KGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADSYSFVFSRDPNGLPPEA 164
+WI+ TDY+ YA+ YSC + ++ R+P LPPE
Sbjct: 113 MPPAPYWILATDYENYALVYSCTTFFWL--FHVDFFWILGRNPY-LPPET 159
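To compare iterations, the summary lines of each alignment can be parsed into numbers. This is a minimal sketch: the two input strings are the iteration 1 and iteration 3 summaries quoted from the slides, and the regular expressions and function names are assumptions that cover only this summary format.

```python
import re

STAT_PATTERN = re.compile(r"(Identities|Positives|Gaps) = (\d+)/(\d+)")
EXPECT_PATTERN = re.compile(r"Expect = ([0-9.eE+-]+)")


def parse_summary(summary):
    """Extract the E value and the count/length pairs from a BLAST summary."""
    stats = {name: (int(num), int(den))
             for name, num, den in STAT_PATTERN.findall(summary)}
    stats["Expect"] = float(EXPECT_PATTERN.search(summary).group(1))
    return stats


iteration_1 = parse_summary(
    "Score = 46.2 bits (108), Expect = 2e-04 "
    "Identities = 40/150 (26%), Positives = 70/150 (46%), Gaps = 37/150 (24%)"
)
iteration_3 = parse_summary(
    "Score = 159 bits (404), Expect = 1e-38 "
    "Identities = 41/170 (24%), Positives = 69/170 (40%), Gaps = 19/170 (11%)"
)

for label, stats in (("iteration 1", iteration_1), ("iteration 3", iteration_3)):
    matches, length = stats["Identities"]
    print(label, f"E = {stats['Expect']:g}", f"identity = {matches / length:.1%}")
```

The E value improves by more than thirty orders of magnitude between the two iterations while the fraction of identical residues stays near 25%, which is the point of the example: the gain comes from the position-specific scores, not from higher identity.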
Iteration 1: Score = 46.2 bits (108), Expect = 2e-04; Identities = 40/150 (26%), Positives = 70/150 (46%), Gaps = 37/150 (24%)
Iteration 3: Score = 159 bits (404), Expect = 1e-38; Identities = 41/170 (24%), Positives = 69/170 (40%), Gaps = 19/170 (11%)
The lipocalin protein family (each dot is a protein)
[Figure labels: β-lactoglobulin, retinol-binding protein, odorant-binding protein, apolipoprotein D, RBP4]
The universe of lipocalins (each dot is a protein)
[Figure labels: retinol-binding protein, odorant-binding protein, apolipoprotein D]
Scoring matrices let you focus on the big (or small) picture retinol-binding protein
Scoring matrices let you focus on the big (or small) picture
[Figure: two views centered on retinol-binding protein; panel labels: PAM250, PAM30, Blosum80, Blosum45]
PSI-BLAST generates scoring matrices more powerful than PAM or BLOSUM retinol-binding protein retinol-binding protein
PSI-BLAST
- PSI-BLAST is useful for detecting weak but biologically meaningful relationships between proteins.
- The main source of false positives is the spurious amplification of sequences not related to the query.
- Once even a single spurious protein is included in a PSI-BLAST search above threshold, it will not go away.
Page 144
PSI-BLAST
Three approaches to prevent false positive results:
[1] Apply filtering
[2] Adjust the E value threshold to a lower value
[3] Visually inspect the output from each iteration and remove suspicious hits.
Page 144
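Approaches [2] and [3] both amount to deciding which hits are allowed into the next profile. The sketch below shows that selection step in isolation. The hit names, E values, function name and the threshold of 0.005 are all hypothetical values chosen for illustration, and this is not PSI-BLAST's own code.

```python
def select_for_profile(hits, inclusion_threshold=0.005, excluded=()):
    """Pick the hits that would feed the next profile.

    hits:                list of (name, e_value) tuples from one iteration.
    inclusion_threshold: hits with a larger E value are left out (approach [2]).
    excluded:            names removed after manual inspection (approach [3]).
    """
    return [name for name, e_value in hits
            if e_value <= inclusion_threshold and name not in excluded]


# Hypothetical hits from one iteration.
hits = [("hit_a", 1e-30), ("hit_b", 0.001), ("hit_c", 0.2)]

print(select_for_profile(hits))                            # default threshold
print(select_for_profile(hits, inclusion_threshold=1e-5))  # stricter threshold
print(select_for_profile(hits, excluded={"hit_b"}))        # manual removal
```

Because every included sequence shapes the profile used in later iterations, a stricter threshold or an early manual removal costs some sensitivity but limits the amplification of unrelated sequences.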