Protein Sequence Motifs Aalt-Jan van Dijk Plant Research International, Wageningen UR Biometris, Wageningen UR aaltjan.vandijk@wur.nl
Plant Bioinformatics • Integrated analysis of omics datasets • Transcriptomics • Alternative splicing • EST analysis • Proteomics • Data (pre-)processing pipelining • Alternative splicing • Protein interaction networks • Metabolomics • Database development • Data (pre-)processing pipelining • Metabolite and pathway identification • Systems biology • Network modelling (bottom-up) • Protein interaction networks • Genomics • Next Generation Sequencing • Genome assembly & annotation • (Comparative) genome analysis • SNP analysis, marker development • Technology • Computational infrastructure • Database development • Web-based analysis tools • Software development • Workflow management systems • Machine learning
My research • Protein complex structures • Protein-protein docking • Correlated mutations • Interaction site prediction/analysis • Protein-protein interactions • Protein-DNA interactions • Motif search • Enzyme active sites
Overview • Protein Motif Searching • Hydrophobicity & Transmembrane Domains • Protein Interactions • Sequence-motifs to predict interaction sites • Secondary Structure Prediction
What is a motif? • A motif is a description of a particular element of a protein that contains a specific sequence pattern • Motifs are identified by • 3D structural alignment • Multiple sequence alignment • Pattern searching programs
Protein Motif Searching • Strict consensus pattern • use only strictly conserved residues • But what about: • variable residues? • gaps?
  C--QASCDGIPLKMNDC
  C---VTCEGLPMRMDQC
  CERTLGCQPMPVH---C
  C     C   P     C
  CxxxxxCxxxPxxxxxC
Protein Motif Searching • Strict consensus patterns contain • no alternative residues • no flexible regions • no mismatches • no gaps
  C--QASCDGIPLKMNDC
  C---VTCEGLPMRMDQC
  CERTLGCQPMPVH---C
  C     C   P     C
  CxxxxxCxxxPxxxxxC
Protein Motif Searching • Most motifs are defined as regular expressions • Motifs can contain • alternative residues • flexible regions
  C-x(2,5)-C-x-[GP]-x-P-x(2,5)-C
    CXXXCXGXPXXXXXC
    |   | | |     |
  FGCAKLCAGFPLRRLPCFYG
The PROSITE Syntax • A-[BC]-X-D(2,5)-{EFG}-H
  A        A
  [BC]     B or C
  X        anything
  D(2,5)   2-5 D’s
  {EFG}    not E, F, or G
  H        H
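The element-by-element reading of the pattern above can be sketched as a small translator from PROSITE-style patterns to Python regular expressions. This is a minimal sketch, not an official PROSITE tool: the function name `prosite_to_regex` is hypothetical, it treats both `x` and `X` as the wildcard (an assumption made to follow the slide's example), and it ignores the `<` and `>` terminus anchors.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression.

    Handles single residues, [..] alternatives, {..} exclusions, the
    wildcard x and repeat counts such as (2) or (2,5).
    The terminus anchors '<' and '>' are not handled in this sketch.
    """
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        # Separate the residue description from an optional repeat count.
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.group(1), match.group(2), match.group(3)

        if core in ("x", "X"):
            core = "."                         # anything
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"     # anything except these residues
        # [..] groups and single letters are already valid regex syntax.

        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{" + low + "}"
        else:
            repeat = "{" + low + "," + high + "}"
        parts.append(core + repeat)
    return "".join(parts)


# The flexible cysteine pattern and the example sequence from the slides.
regex = prosite_to_regex("C-x(2,5)-C-x-[GP]-x-P-x(2,5)-C")
hit = re.search(regex, "FGCAKLCAGFPLRRLPCFYG")
print(regex, hit.start() + 1, hit.group())
```

Note that `re.search` reports only the first, greedy match; a pattern with flexible regions can match the same stretch of sequence in more than one way.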
PROSITE entries • Mandatory motifs characterise a protein (super-)family
  ID SUBTILASE_ASP; PATTERN.
  DE Serine proteases, subtilase family, aspartic acid active site.
  PA [STAIV]-x-[LIVMF]-[LIVM]-D-[DSTA]-G-[LIVMFC]-x(2,3)-[DNH].

  ID SUBTILASE_HIS; PATTERN.
  DE Serine proteases, subtilase family, histidine active site.
  PA H-G-[STM]-x-[VIC]-[STAGC]-[GS]-x-[LIVMA]-[STAGCLV]-[SAGM].

  ID SUBTILASE_SER; PATTERN.
  DE Serine proteases, subtilase family, serine active site.
  PA G-T-S-x-[SA]-x-P-x(2)-[STAVC]-[AG].
Exercise • Find the three subtilase motifs in PROSITE (prosite.expasy.org) • Compare the lists of proteins in which the motifs occur: what does this tell you? • Similarly, compare the protein structures in which the motifs occur • Have a look at the “sequence logo”
Protein Motif Searching • Some motifs match frequently in proteins, but a match does not mean that the site is actually present (functional) • Example: post-translational modification sites
  ID ASN_GLYCOSYLATION; PATTERN.
  DE N-glycosylation site.
  PA N-{P}-[ST]-{P}.
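To see how often such a short pattern matches, the N-glycosylation pattern can be scanned for directly. The sketch below writes N-{P}-[ST]-{P} by hand as a regular expression; the function name `find_sequons` and the example sequence are made up for illustration. A lookahead is used so that overlapping matches are also reported. A match only marks a candidate site and says nothing about whether it is glycosylated.

```python
import re

# N-{P}-[ST]-{P} as a regular expression. Wrapping it in a lookahead lets
# finditer() report overlapping matches as well.
SEQUON = re.compile(r"(?=(N[^P][ST][^P]))")


def find_sequons(sequence):
    """Return (1-based position, matched residues) for every pattern match."""
    return [(m.start() + 1, m.group(1)) for m in SEQUON.finditer(sequence)]


# Made-up example sequence (not a real protein).
example = "MNGTANPSLNNTSKNATP"
for position, residues in find_sequons(example):
    print(position, residues)
```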
Exercise • Use a glycosylation site predictor such as http://www.cbs.dtu.dk/services/NetNGlyc/ • Input: your favorite set of sequences • Do you observe that some N-{P}-[ST] sites are likely to be glycosylated and others not?
Profiles • Many motifs cannot be easily defined using simple patterns • Such motifs can be defined using profiles • A profile is constructed from a multiple sequence alignment. For each position, each amino acid is given a score depending on how likely it is to occur
Calculating a Profile • For each alignment position: take the (weighted) average of the appropriate rows from the scoring matrix • An (extremely simple) example:
  seq_01 AAAAAAAAAAW
  seq_02 AAAAAAAAAWW
  seq_03 AAAAAAAAWWW
  seq_04 AAAAAAAWWWW
  seq_05 AAAAAAWWWWW
  seq_06 AAAAAWWWWWW
  seq_07 AAAAWWWWWWW
  seq_08 AAAWWWWWWWW
  seq_09 AAWWWWWWWWW
  seq_10 AWWWWWWWWWW
Excerpt from the EBLOSUM62 matrix:
        A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
  A     4  -1  -2  -2   0  -1  -1   0  -2  -1  -1  -1  -1  -2  -1   1   0  -3  -2   0
  W    -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11   2  -3

Profile rows for three alignment columns (10A, 5A+5W, 10W):
                A     C     D     E     F     G     H     I     K     L     M
  10A:        4.0   0.0  -2.0  -1.0  -2.0   0.0  -2.0  -1.0  -1.0  -1.0  -1.0
  5A+5W:      1.0  -2.0  -6.0  -4.0  -1.0  -2.0  -4.0  -4.0  -4.0  -3.0  -2.0
  10W:       -3.0  -2.0  -4.0  -3.0   1.0  -2.0  -2.0  -3.0  -3.0  -2.0  -1.0

                N     P     Q     R     S     T     V     W     Y
  10A:       -2.0  -1.0  -1.0  -1.0   1.0   0.0   0.0  -3.0  -2.0
  5A+5W:     -6.0  -5.0  -3.0  -4.0  -2.0  -2.0  -3.0   8.0   0.0
  10W:       -4.0  -4.0  -2.0  -3.0  -3.0  -2.0  -3.0  11.0   2.0

Source: prophecy (EMBOSS), using the Henikoff profile type and the BLOSUM62 matrix.
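The column-wise averaging can be sketched in a few lines of Python. This is an illustrative sketch only, not the algorithm prophecy implements: the names `profile_column` and `build_profile` are hypothetical, only the A and W rows of BLOSUM62 quoted above are included, and all sequences get equal weight unless weights are passed in. The 5A+5W values listed above equal the sum of the A and W rows, i.e. twice the plain average computed here, so prophecy evidently applies a scaling that this sketch does not reproduce.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

# BLOSUM62 rows for A and W, in alphabetical residue order.
BLOSUM62_ROWS = {
    "A": [4, 0, -2, -1, -2, 0, -2, -1, -1, -1,
          -1, -2, -1, -1, -1, 1, 0, 0, -3, -2],
    "W": [-3, -2, -4, -3, 1, -2, -2, -3, -3, -2,
          -1, -4, -4, -2, -3, -3, -2, -3, 11, 2],
}

# The toy alignment: seq_01 = AAAAAAAAAAW ... seq_10 = AWWWWWWWWWW.
ALIGNMENT = ["A" * (11 - i) + "W" * i for i in range(1, 11)]


def profile_column(residues, weights=None):
    """Weighted average of the scoring-matrix rows of one alignment column."""
    if weights is None:
        weights = [1.0] * len(residues)   # assumption: equal sequence weights
    total = sum(weights)
    scores = [0.0] * len(ALPHABET)
    for residue, weight in zip(residues, weights):
        for k, value in enumerate(BLOSUM62_ROWS[residue]):
            scores[k] += weight * value / total
    return scores


def build_profile(alignment, weights=None):
    """One score vector per alignment column."""
    return [profile_column(column, weights) for column in zip(*alignment)]


profile = build_profile(ALIGNMENT)
middle = dict(zip(ALPHABET, profile[5]))   # the 5A+5W column
print(middle["A"], middle["W"])
```

Passing unequal `weights` (one per sequence) is how redundancy between closely related sequences would be down-weighted.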
Pattern Searching • Short linear motifs: e.g. http://dilimot.russelllab.org/ • Profiles: MEME http://meme.sdsc.edu/meme/cgi-bin/meme.cgi
Exercise • Use a number of sequences which contain the PROSITE subtilase motifs and find motifs in those sequences with MEME
Hydropathy Plot • Prediction of hydrophobic and hydrophilic regions in a protein
Partition Coefficients • Hydrophobic (oil phase) • Hydrophilic (water phase)
Hydrophobicity/Hydrophilicity Values (rows run from hydrophilic to hydrophobic)
     Fauchere & Pliska  Kyte & Doolittle  Hopp & Woods  Eisenberg
  R              -1.37             -4.50          3.00      -2.53
  K              -1.35             -3.90          3.00      -1.50
  D              -1.05             -3.50          3.00      -0.90
  Q              -0.78             -3.50          0.20      -0.85
  N              -0.85             -3.50          0.20      -0.78
  E              -0.87             -3.50          3.00      -0.74
  H              -0.40             -3.20         -0.50      -0.40
  S              -0.18             -0.80          0.30      -0.18
  T              -0.05             -0.70         -0.40      -0.05
  P               0.12             -1.60          0.00       0.12
  Y               0.26             -1.30         -2.30       0.26
  C               0.29              2.50         -1.00       0.29
  G               0.48             -0.40          0.00       0.48
  A               0.62              1.80         -0.50       0.62
  M               0.64              1.90         -1.30       0.64
  W               0.81             -0.90         -3.40       0.81
  L               1.06              3.80         -1.80       1.06
  V               1.08              4.20         -1.50       1.08
  F               1.19              2.80         -2.50       1.19
  I               1.38              4.50         -1.80       1.38
Hydrophobicity Plot • Sum amino acid hydrophobicity values in a given window • Plot the value in the middle of the window • Shift the window one position
Sliding Window Approach • Calculate property for first sub-sequence • Use the result (plot/print/store) • Move to next residue position, and repeat
Hydrophobicity Plot MEZCALTASTESVERYNICE
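The sliding-window procedure can be sketched as follows, using the Kyte & Doolittle column of the hydrophobicity table and the example sequence from the slide. This is a minimal sketch: the function name `hydropathy_profile` is hypothetical, and the value for the ambiguity code Z (Glu or Gln) is an assumption, set to -3.5 because both E and Q have that value. Scores are window sums, as described on the slide; dividing by the window size gives the window mean instead.

```python
# Kyte & Doolittle hydropathy values.
KYTE_DOOLITTLE = {
    "R": -4.5, "K": -3.9, "D": -3.5, "Q": -3.5, "N": -3.5, "E": -3.5,
    "H": -3.2, "S": -0.8, "T": -0.7, "P": -1.6, "Y": -1.3, "C": 2.5,
    "G": -0.4, "A": 1.8, "M": 1.9, "W": -0.9, "L": 3.8, "V": 4.2,
    "F": 2.8, "I": 4.5,
    "Z": -3.5,  # assumption: Z (Glu or Gln) scored like E and Q
}


def hydropathy_profile(sequence, window=5):
    """Return (1-based centre position, summed hydropathy) per window."""
    if window < 1 or window > len(sequence):
        raise ValueError("window must be between 1 and the sequence length")
    values = [KYTE_DOOLITTLE[residue] for residue in sequence]
    half = window // 2
    profile = []
    for start in range(len(values) - window + 1):
        score = sum(values[start:start + window])   # property for this window
        profile.append((start + half + 1, score))   # assign to the middle residue
    return profile


for position, score in hydropathy_profile("MEZCALTASTESVERYNICE", window=5):
    print(position, round(score, 2))
```

An odd window size keeps the centre position unambiguous; with an even window the score is assigned to the residue just right of the middle.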
Transmembrane Regions • In an α-helix, the rotation is 100 degrees per amino acid residue • The rise is 1.5 Angstrom per amino acid residue
Transmembrane Regions • The membrane is approx. 30 Angstrom thick • So we need approx. 30/1.5 = 20 amino acids to span the membrane
Adapting the window size to the size of the membrane spanning segment makes the picture easier to interpret
[Figure: hydropathy plots of the same protein with window = 1, window = 9, window = 19 and window = 121]
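The link between membrane thickness and window size can be made concrete with a short sketch. It computes the number of residues needed to span the membrane from the values on the slides and then scans a sequence with a window of 19 for stretches with a high mean Kyte & Doolittle hydropathy. The function names, the toy sequence and the cutoff of 1.6 are assumptions: the cutoff is a commonly quoted rule of thumb for a 19-residue window and does not come from the slides, so this is a rough filter rather than a transmembrane predictor.

```python
# Kyte & Doolittle hydropathy values.
KYTE_DOOLITTLE = {
    "R": -4.5, "K": -3.9, "D": -3.5, "Q": -3.5, "N": -3.5, "E": -3.5,
    "H": -3.2, "S": -0.8, "T": -0.7, "P": -1.6, "Y": -1.3, "C": 2.5,
    "G": -0.4, "A": 1.8, "M": 1.9, "W": -0.9, "L": 3.8, "V": 4.2,
    "F": 2.8, "I": 4.5,
}

MEMBRANE_THICKNESS = 30.0   # Angstrom, value from the slides
RISE_PER_RESIDUE = 1.5      # Angstrom per residue in a helix, from the slides


def residues_to_span(thickness=MEMBRANE_THICKNESS, rise=RISE_PER_RESIDUE):
    """Approximate number of helical residues needed to cross the membrane."""
    return round(thickness / rise)


def candidate_tm_windows(sequence, window=19, threshold=1.6):
    """Return (1-based start, mean hydropathy) for windows at or above threshold.

    threshold=1.6 is an assumed rule-of-thumb cutoff, not a value from the slides.
    """
    values = [KYTE_DOOLITTLE[residue] for residue in sequence]
    hits = []
    for start in range(len(values) - window + 1):
        mean = sum(values[start:start + window]) / window
        if mean >= threshold:
            hits.append((start + 1, mean))
    return hits


# Made-up toy sequence: charged flanks around a 20-residue hydrophobic core.
toy = "KKDDEE" + "LLVVAAIILLFFVVAALLII" + "EEDDKK"
print(residues_to_span())
for start, mean in candidate_tm_windows(toy):
    print(start, round(mean, 2))
```

A window of 19 is the odd size closest to the 20 residues needed, which is why it gives the clearest picture for membrane-spanning segments.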
Protein Interactions • Obligatory: hemoglobin • Transient: mitochondrial Cu transporters
Experimental approaches (1) Yeast two-hybrid (Y2H)
Experimental approaches (2) Affinity Purification + mass spectrometry (AP-MS)
Interaction Databases • STRING http://string.embl.de/ • HPRD http://www.hprd.org/ • InteroPorc http://biodev.extra.cea.fr/interoporc/Default.aspx • Many others, e.g. see http://nar.oxfordjournals.org./content/39/suppl_1.toc