Explore the concepts of homologs, orthologs, paralogs, and analogs in protein sequences and alignments using clustering algorithms for bioinformatics analysis. Dive into motifs, domains, and hidden Markov models for a deeper understanding of proteins.
Patterns and Profiles
Lisa Mullan, HGMP-RC
Terminology
Homologs
• Two proteins that share a common ancestor
• Usually similar functions
• Orthologs: different species
• Paralogs: same genome
Analogs
• Two sequences that have NO common ancestor, but have similar functions.
• Protein analogs may have the same fold.
7 10
Multiple sequence alignments
CHERRIES
CLEMENTIN-ES
P-EAR--S
GRE-ENAPPLES
Most programs use “clustal” – a clustering algorithm
4 24
Multiple sequence alignments
P-EARS-----
GREENAPPLES
CLEMENTINES
CHERR--I-ES
0 24
Multiple sequence alignments
GREENAPPLES
CHERR---IES
P-EARS-----
CLEMENTINES
Multiple sequence alignments (cont.)
Unaligned:
GREENAPPLES
CLEMENTINES
CHERRIES
PEARS
Aligned:
GREENAPPLES
CLEMENTINES
CHERR---IES
P-EARS-----
Multiple sequence alignments (cont.)
CLUSTAL W (1.7) multiple sequence alignment

Q40236/1-193   GTF-DQLQLVLRWPTSFCNGKNCKRTPKDFTIHGLWPDSEAGELNFCNPRASYTIVRHGTF
Q40241/1-189   -----QLQLVLRWPTSFCNGKNCKRTPKDFTIHGLWPDSEAGELNFCNPRASYTIVRHGTF
Q42513/1-193   GTF-NQLQLVLRWPASFCKGKKCERTPNNFTIHGLWPDIKGTILNNCNPDAKYASVTGGKF
G255586/1-194  GAF-EYMQLVLQWPTAFCHTTPCKNIPSNFTIHGLWPDNVSTTLNFCGKEDDYNIIMDGP-
Q40379/1-194   GAF-EYMQLVLQWPTTFCHTTPCKNIPSNFTIHGLWPDNVSTTLNFCGKEDDYNIIMDGP-
               :****:**::**: . *:. *.:********* . ** *. .* : *

Q40236/1-193   EKRN---KHWPDLMRSKDNSMDNQEFWKHEYIKHGSCCTDLFNETQYFDLALVLKDRFDLLT
Q40241/1-189   EKRN---KHWPDLMRSKDNSMDNQEFWKHEYIKHGSCCTDLFNETQYFDLALVLKDRFDLLT
Q42513/1-193   VKRN---KHWPDLILTEAASLNSQGFWAYQFKKHGTCCSDLFNQEKYFDLALILKDKFDLLT
G255586/1-194  EK-NGLYVRWPDLIREKADCMKTQNFWRREYIKHGTCCSEIYNQVQYFRLAMALKDKFDLLT
Q40379/1-194   EK-NGLYVRWPDLIREKADCMKTQNFWRREYIKHGTCCSEIYNQVQYFRLAMALKDKFDLLT
               :** :****: : .:..* ** :: ***:**::::*: :** **: ***:*****

Q40236/1-193   TFRIHGIVPRSSHTVDKIKKTIRSVTGVLPNLSCTKNMDLLEIGICFNREASKMIDCTRP
Q40241/1-189   TFRIHGIVPRSSHTVDKIKKTIRSVTGVLPNLSCTKNMDLLEIGICFNREASKMIDCTRP
Q42513/1-193   TFRNKGIIPKSTCTINKIQKTIRTVTGVVPNLSCTPTMELLEVGICFNRDASKLIDCDQP
G255586/1-194  SLKNHGIIRGYKYTVQKINNTIKTVTKGYPNLSCTKGQELWEVGICFDSTAKNVIDCPNP
Q40379/1-194   SLKNHGIIRGYKYTVQKINNTIKTVTKGYPNLSCTKGQELWEVGICFDSTAKNVIDCPNP
               ::: :**: . *::**::**::** ****** :* *:****: *.::*** .*

Q40236/1-193   KTCNPGEDNLIGFP
Q40241/1-189   KTCNPGEDNLIGFP
Q42513/1-193   KTCDTSGNTEIFFP
G255586/1-194  KTCKTASNQGIMFP
Q40379/1-194   KTCKTASNQGIMFP
               ***... : * **
Multiple sequence alignments (cont.)
(
(
Q40236/1-193:-0.00066,
Q40241/1-189:0.00066)
:0.18460,
Q42513/1-193:0.17928,
(
G255586/1-194:0.00258,
Q40379/1-194:0.00258)
:0.32591);
Motifs - assigned to the secondary structure of a protein
E. coli trp repressor
Leucine zipper motif L-X(6)-L-X(6)-L-X(6)-L
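A pattern written this way can be searched for mechanically. The sketch below is one possible way to do it, assuming the PROSITE-style conventions used on these slides (X for any residue, a number in parentheses for a repeat count, [..] for allowed residues, {..} for excluded residues). The function name prosite_to_regex and the test sequence are made up for illustration.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.split("-"):
        element = element.strip()
        # Separate the residue part from an optional repeat count, e.g. X(6).
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        if m is None:
            raise ValueError(f"empty or malformed element in {pattern!r}")
        core, count = m.group(1), m.group(2)
        if core in ("X", "x"):
            core = "."                      # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"  # any residue except those listed
        # [ABC] and single letters are already valid regex syntax.
        parts.append(core + ("{" + count + "}" if count else ""))
    return "".join(parts)

zipper = prosite_to_regex("L-X(6)-L-X(6)-L-X(6)-L")

# Hypothetical sequence with a leucine at every seventh position.
toy_sequence = "MK" + "LAAAAAA" * 3 + "LGG"
hit = re.search(zipper, toy_sequence)
print(zipper, hit.start() if hit else None)
```

The same helper also handles patterns with bracketed sets, so it can be reused for family-specific patterns built from an alignment.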
http://bioinf.man.ac.uk/dbbrowser/PRINTS/
“A fingerprint is a group of conserved motifs used to characterise a protein family”
Domains
Many definitions – depends who you speak to!
• Domains are discrete structural units
• Defined by structure
• Domain boundaries can be inferred from careful sequence analysis
• Domains are the common currency of protein function
EFGHIVW
EYAHMIW
DYAHSLW
EFGHPLW
[ED]-[FY]-[GA]-H-X-[VIL]-W
But – there are slightly more glutamates than aspartates in the alignment!
And could X be represented more accurately by {FYW}?
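The step from alignment to pattern can be sketched in a few lines. This is only an illustration, assuming a simple rule that is not stated on the slide: a column with one residue is written as that letter, a column with up to three different residues becomes a bracketed set, and anything more variable becomes X. The name columns_to_pattern and the max_set cutoff are hypothetical.

```python
from collections import Counter

def columns_to_pattern(alignment, max_set=3):
    """Build a PROSITE-style pattern from an ungapped alignment."""
    elements = []
    for column in zip(*alignment):          # walk the alignment column by column
        residues = sorted(set(column))
        if len(residues) == 1:
            elements.append(residues[0])    # fully conserved position
        elif len(residues) <= max_set:
            elements.append("[" + "".join(residues) + "]")
        else:
            elements.append("X")            # too variable to list
    return "-".join(elements)

alignment = ["EFGHIVW", "EYAHMIW", "DYAHSLW", "EFGHPLW"]
pattern = columns_to_pattern(alignment)
print(pattern)

# The pattern records which residues occur, not how often.
first_column = Counter(seq[0] for seq in alignment)
print(first_column)
```

Residues inside each bracket come out in alphabetical order, so [DE] here is the same set as [ED] in the pattern above. The counts for the first column show what the pattern throws away: E and D are treated as equally acceptable even though E occurs three times and D once.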
EFGHIVW
EYAHMIW
DYAHSLW
EFGHPLW
So, let’s add some numbers to the problem!

Position   E   D   F   Y   H   G   A   I   M   S   P   L   V   W
One       15   5   0   0   0   0   0   0   0   0   0   0   0   0
Two        0   0  10  10   0   0   0   0   0   0   0   0   0   0
Three      0   0   0   0   0  10  10   0   0   0   0   0   0   0
Four       0   0   0   0  20   0   0   0   0   0   0   0   0   0
Five       2   2  -2  -2   2   2   2   2   2   2   2   2   2  -2
Six        0   0   0   0   0   0   0   5   0   0   0  10   5   0
Seven      0   0   0   0   0   0   0   0   0   0   0   0   0  20
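To see how such a table is used, the sketch below scores a seven-residue sequence by adding up one table entry per position. The numbers are copied from the table; the function name profile_score is made up, and rejecting residues that have no column in the table is an assumption of this sketch (the slide does not say how they should be treated).

```python
RESIDUES = "EDFYHGAIMSPLVW"   # column order of the table

PROFILE = [
    [15, 5,  0,  0,  0,  0,  0, 0, 0, 0, 0,  0, 0,  0],   # position one
    [ 0, 0, 10, 10,  0,  0,  0, 0, 0, 0, 0,  0, 0,  0],   # position two
    [ 0, 0,  0,  0,  0, 10, 10, 0, 0, 0, 0,  0, 0,  0],   # position three
    [ 0, 0,  0,  0, 20,  0,  0, 0, 0, 0, 0,  0, 0,  0],   # position four
    [ 2, 2, -2, -2,  2,  2,  2, 2, 2, 2, 2,  2, 2, -2],   # position five
    [ 0, 0,  0,  0,  0,  0,  0, 5, 0, 0, 0, 10, 5,  0],   # position six
    [ 0, 0,  0,  0,  0,  0,  0, 0, 0, 0, 0,  0, 0, 20],   # position seven
]

def profile_score(sequence: str) -> int:
    """Sum the table entry for each residue at its position."""
    if len(sequence) != len(PROFILE):
        raise ValueError("sequence must have one residue per profile position")
    total = 0
    for row, residue in zip(PROFILE, sequence):
        column = RESIDUES.find(residue)
        if column < 0:
            # Assumption: residues without a column are rejected, not scored 0.
            raise ValueError(f"no column for residue {residue!r}")
        total += row[column]
    return total

for seq in ["EFGHIVW", "DYAHSLW", "EFGHWLW"]:
    print(seq, profile_score(seq))
```

Unlike the pattern, the profile separates E from D at position one and penalises F, Y and W at position five instead of rejecting them outright.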
But... profiles do not support gaps...
EFH-IIVW
EYH--MIW
DYHSISLW
EFH-IPLW
Hidden Markov Models introduce statistics into profiles
[Figure: hidden Markov model state diagram. Labels recovered from the figure: M 1.0, I .50, E .75, D .25, F .50, Y .50, S 1.0, V .25, I .25, L .50, X 1.0, H 1.0, W 1.0; transition probabilities 1.0, 0.75 and 0.25.]
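The probabilities in such a model come from counting. The sketch below is not a full hidden Markov model; it only shows a first counting step on the gapped alignment above, assuming each column is treated independently: the fraction of each residue among the non-gap characters, and the fraction of sequences with a gap. The name column_stats is hypothetical.

```python
from collections import Counter

def column_stats(alignment):
    """Per column: residue frequencies (gaps excluded) and gap fraction."""
    stats = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]
        counts = Counter(residues)
        # Frequencies are taken over the sequences that have a residue here.
        emissions = {r: n / len(residues) for r, n in counts.items()} if residues else {}
        gap_fraction = column.count("-") / len(column)
        stats.append((emissions, gap_fraction))
    return stats

alignment = ["EFH-IIVW", "EYH--MIW", "DYHSISLW", "EFH-IPLW"]
stats = column_stats(alignment)
for position, (emissions, gap_fraction) in enumerate(stats, start=1):
    print(position, emissions, gap_fraction)
```

The first column gives E 0.75 and D 0.25, and the fourth column is a gap in three of the four sequences, which is the kind of information a profile without gap support cannot hold.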
Pfam-A
• 2,216 curated families with annotation.
Pfam-B
• 40,000 families derived from ProDom.