Protein Sequence Motifs

Protein Sequence Motifs • Aalt-Jan van Dijk • Plant Research International, Wageningen UR • Biometris, Wageningen UR • aaltjan.vandijk@wur.nl

Plant Bioinformatics • Integrated analysis of omics datasets • Transcriptomics • Alternative splicing • EST analysis • Proteomics • Data (pre-)processing pipelining • Alternative splicing • Protein interactions networks • Metabolomics • Database development • Data (pre-)processing pipelining • Metabolite and pathway identification • Systems biology • Network modelling (bottom-up) • Protein interactions networks • Genomics • Next Generation Sequencing • Genome assembly & annotation • (Comparative) genome analysis • SNP analysis, marker development • Technology • Computational infrastructure • Database development • Web-based analysis tools • Software development • Workflow management systems • Machine learning

My research • Protein complex structures • Protein-protein docking • Correlated mutations • Interaction site prediction/analysis • Protein-protein interactions • Protein-DNA interactions • Motif search • Enzyme active sites

Overview • Protein Motif Searching • Hydrophobicity & Transmembrane Domains • Protein Interactions • Sequence-motifs to predict interaction sites • Secondary Structure Prediction

Protein Motif Searching

What is a motif? • A motif is a description of a particular element of a protein that contains a specific sequence pattern • Motifs are identified by • 3D structural alignment • Multiple sequence alignment • Pattern searching programs

Protein Motif Searching • Strict consensus pattern • use only strictly conserved residues
C--QASCDGIPLKMNDC
C---VTCEGLPMRMDQC
CERTLGCQPMPVH---C
C     C   P     C
CxxxxxCxxxPxxxxxC

Protein Motif Searching • Strict consensus pattern • use only strictly conserved residues • But what about: • variable residues? • gaps?
C--QASCDGIPLKMNDC
C---VTCEGLPMRMDQC
CERTLGCQPMPVH---C
C     C   P     C
CxxxxxCxxxPxxxxxC

Protein Motif Searching • Strict consensus patterns contain • no alternative residues • no flexible regions • no mismatches • no gaps
C--QASCDGIPLKMNDC
C---VTCEGLPMRMDQC
CERTLGCQPMPVH---C
C     C   P     C
CxxxxxCxxxPxxxxxC

Protein Motif Searching • Most motifs are defined as regular expressions • Motifs can contain • alternative residues • flexible regions
C-x(2,5)-C-x-[GP]-x-P-x(2,5)-C
  CXXXCXGXPXXXXXC
  |   | | |     |
FGCAKLCAGFPLRRLPCFYG

The PROSITE Syntax • A-[BC]-X-D(2,5)-{EFG}-H • A: A • [BC]: B or C • X: anything • D(2,5): 2-5 D’s • {EFG}: not E, F, or G • H: H
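
The syntax elements listed above map almost one-to-one onto ordinary regular expressions. The following is a minimal sketch of that translation, not the PROSITE implementation: the function name `prosite_to_regex` is made up for this illustration, and only the elements shown on the slide (single residues, `[..]`, `{..}`, `x`, and repeat counts) are handled. Anchors such as `<` and `>` are deliberately left out.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE pattern into a Python regular expression."""
    parts = []
    # Elements are separated by "-"; PROSITE PA lines end with a period.
    for element in pattern.rstrip(".").split("-"):
        # Split off an optional repeat count such as "(2,5)" or "(3)".
        m = re.fullmatch(r"([^()]+)(?:\((\d+)(?:,(\d+))?\))?", element)
        if m is None:
            raise ValueError(f"cannot parse element {element!r}")
        core, low, high = m.groups()
        if core in ("x", "X"):
            core = "."                      # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"  # excluded residues
        # "[BC]" and single letters are already valid regex syntax.
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)

regex = prosite_to_regex("C-x(2,5)-C-x-[GP]-x-P-x(2,5)-C")
match = re.search(regex, "FGCAKLCAGFPLRRLPCFYG")
print(regex, match.group(), match.start() + 1)
```

Applied to the cysteine pattern and the example sequence from the previous slide, the search reports the same stretch that the slide marks by hand.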

PROSITE entries • Mandatory motifs characterise a protein (super-) family
ID   SUBTILASE_ASP; PATTERN.
DE   Serine proteases, subtilase family, aspartic acid active site.
PA   [STAIV]-x-[LIVMF]-[LIVM]-D-[DSTA]-G-[LIVMFC]-x(2,3)-[DNH].
ID   SUBTILASE_HIS; PATTERN.
DE   Serine proteases, subtilase family, histidine active site.
PA   H-G-[STM]-x-[VIC]-[STAGC]-[GS]-x-[LIVMA]-[STAGCLV]-[SAGM].
ID   SUBTILASE_SER; PATTERN.
DE   Serine proteases, subtilase family, serine active site.
PA   G-T-S-x-[SA]-x-P-x(2)-[STAVC]-[AG].

Exercise • Find the three subtilase motifs in PROSITE (prosite.expasy.org) • Compare the lists of proteins in which the motifs occur – what does this tell you? • Similarly, compare protein structures in which the motifs occur • Have a look at the “sequence logo”

Protein Motif Searching • Some motifs match frequently in proteins, although the corresponding site may not actually be present, such as • Post-translational modification sites
ID   ASN_GLYCOSYLATION; PATTERN.
DE   N-glycosylation site.
PA   N-{P}-[ST]-{P}.
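
To see how often such a short pattern matches, it can be scanned over a sequence directly. The sketch below writes N-{P}-[ST]-{P} as a regular expression; the function name `find_sequons` and the test sequence are invented for this illustration. Every hit is only a candidate site, since the pattern by itself says nothing about whether the asparagine is actually glycosylated.

```python
import re

# N-{P}-[ST]-{P} as a regular expression. The lookahead makes the match
# zero-width, so overlapping candidates (e.g. in "NNSS") are all reported.
SEQUON = re.compile(r"(?=(N[^P][ST][^P]))")

def find_sequons(sequence):
    """Return (1-based position, matched tetrapeptide) for every pattern hit."""
    return [(m.start() + 1, m.group(1)) for m in SEQUON.finditer(sequence)]

# Hypothetical sequence, chosen only to show a hit, a proline-blocked
# site (NPT) and two overlapping hits (NNSS / NSSL).
print(find_sequons("MKNGSAANPTQNNSSL"))
# → [(3, 'NGSA'), (12, 'NNSS'), (13, 'NSSL')]
```

Because the last pattern element {P} requires a residue to be present, an N-x-[ST] at the very end of a sequence is not reported by this sketch.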

Exercise • Use a glycosylation site predictor such as http://www.cbs.dtu.dk/services/NetNGlyc/ • Input: your favorite set of sequences • Do you observe that some N-{P}-[ST] sites are likely to be glycosylated and others not?

Profiles • Many motifs cannot be easily defined using simple patterns • Such motifs can be defined using profiles • A profile is constructed from a multiple sequence alignment. For each position, each amino acid is given a score depending on how likely it is to occur

Calculating a Profile • For each alignment position: take the (weighted) average of the appropriate rows from the scoring matrix • An (extremely simple) example:
seq_01 AAAAAAAAAAW
seq_02 AAAAAAAAAWW
seq_03 AAAAAAAAWWW
seq_04 AAAAAAAWWWW
seq_05 AAAAAAWWWWW
seq_06 AAAAAWWWWWW
seq_07 AAAAWWWWWWW
seq_08 AAAWWWWWWWW
seq_09 AAWWWWWWWWW
seq_10 AWWWWWWWWWW

Excerpt from the EBLOSUM62 matrix:
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A   4  -1  -2  -2   0  -1  -1   0  -2  -1  -1  -1  -1  -2  -1   1   0  -3  -2   0
W  -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11   2  -3
Profile columns from prophecy (EMBOSS), using Henikoff profile type and BLOSUM62 matrix:
          A    C    D    E    F    G    H    I    K    L    M    N    P    Q    R    S    T    V    W    Y
10A:    4.0  0.0 -2.0 -1.0 -2.0  0.0 -2.0 -1.0 -1.0 -1.0 -1.0 -2.0 -1.0 -1.0 -1.0  1.0  0.0  0.0 -3.0 -2.0
5A+5W:  1.0 -2.0 -6.0 -4.0 -1.0 -2.0 -4.0 -4.0 -4.0 -3.0 -2.0 -6.0 -5.0 -3.0 -4.0 -2.0 -2.0 -3.0  8.0  0.0
10W:   -3.0 -2.0 -4.0 -3.0  1.0 -2.0 -2.0 -3.0 -3.0 -2.0 -1.0 -4.0 -4.0 -2.0 -3.0 -3.0 -2.0 -3.0 11.0  2.0
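
The "average of the appropriate rows" idea can be sketched in a few lines. The code below is an illustration under simplifying assumptions, not a reimplementation of prophecy: it uses a plain frequency-weighted mean with no sequence weighting, only the A and W rows from the matrix excerpt are included, and the names `profile_column` and `BLOSUM62_ROWS` are made up here. The 10A and 10W columns reproduce the matrix rows, as in the prophecy output. The mixed 5A+5W column does not match (the plain mean gives W a score of 4.0, where prophecy reports 8.0), presumably because prophecy applies its own weighting and scaling.

```python
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # column order of the matrix excerpt

# Rows for A and W, copied from the EBLOSUM62 excerpt.
BLOSUM62_ROWS = {
    "A": [4, -1, -2, -2, 0, -1, -1, 0, -2, -1,
          -1, -1, -1, -2, -1, 1, 0, -3, -2, 0],
    "W": [-3, -3, -4, -4, -2, -2, -3, -2, -2, -3,
          -2, -3, -1, 1, -4, -3, -2, 11, 2, -3],
}

def profile_column(column):
    """Frequency-weighted mean of the matrix rows of the residues in a column."""
    counts = Counter(column)
    total = sum(counts.values())
    scores = [0.0] * len(AMINO_ACIDS)
    for residue, n in counts.items():
        for j, value in enumerate(BLOSUM62_ROWS[residue]):
            scores[j] += n / total * value
    return dict(zip(AMINO_ACIDS, scores))

# The ten example sequences: seq_01 = AAAAAAAAAAW ... seq_10 = AWWWWWWWWWW.
alignment = ["A" * (11 - k) + "W" * k for k in range(1, 11)]
profile = [profile_column(column) for column in zip(*alignment)]

print(profile[0]["A"], profile[5]["A"], profile[5]["W"], profile[10]["W"])
```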

Pattern Searching • Short linear motifs: e.g. http://dilimot.russelllab.org/ • Profiles: MEME http://meme.sdsc.edu/meme/cgi-bin/meme.cgi

Exercise • Use a number of sequences which contain the PROSITE subtilase motif and find motifs in those sequences with MEME

Hydropathy Plot • Prediction of hydrophobic and hydrophilic regions in a protein

Partition Coefficients • Hydrophobic: oil • Hydrophilic: water

Hydrophobicity/Hydrophilicity Values
    Fauchere & Pliska   Kyte & Doolittle   Hopp & Woods   Eisenberg
R               -1.37              -4.50           3.00       -2.53
K               -1.35              -3.90           3.00       -1.50
D               -1.05              -3.50           3.00       -0.90
Q               -0.78              -3.50           0.20       -0.85
N               -0.85              -3.50           0.20       -0.78
E               -0.87              -3.50           3.00       -0.74
H               -0.40              -3.20          -0.50       -0.40
S               -0.18              -0.80           0.30       -0.18
T               -0.05              -0.70          -0.40       -0.05
P                0.12              -1.60           0.00        0.12
Y                0.26              -1.30          -2.30        0.26
C                0.29               2.50          -1.00        0.29
G                0.48              -0.40           0.00        0.48
A                0.62               1.80          -0.50        0.62
M                0.64               1.90          -1.30        0.64
W                0.81              -0.90          -3.40        0.81
L                1.06               3.80          -1.80        1.06
V                1.08               4.20          -1.50        1.08
F                1.19               2.80          -2.50        1.19
I                1.38               4.50          -1.80        1.38
Residues are ordered from hydrophilic (top) to hydrophobic (bottom).

Hydrophobicity Plot • Sum amino acid hydrophobicity values in a given window • Plot the value in the middle of the window • Shift the window one position

Sliding Window Approach • Calculate property for first sub-sequence • Use the result (plot/print/store) • Move to next residue position, and repeat

Hydrophobicity Plot MEZCALTASTESVERYNICE
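
The sliding-window recipe (sum the values in a window, assign the result to the middle position, shift by one) can be sketched as follows. This is an illustrative sketch rather than the tool used for the slides: the function name `hydropathy_profile` is made up, the scale is the Kyte & Doolittle column of the table above, and the value for Z (Glu or Gln) is an assumption added here so that the example sequence can be scored; both E and Q have -3.5 on this scale.

```python
# Kyte & Doolittle values, taken from the table of hydrophobicity scales.
KYTE_DOOLITTLE = {
    "R": -4.5, "K": -3.9, "D": -3.5, "Q": -3.5, "N": -3.5,
    "E": -3.5, "H": -3.2, "S": -0.8, "T": -0.7, "P": -1.6,
    "Y": -1.3, "C": 2.5, "G": -0.4, "A": 1.8, "M": 1.9,
    "W": -0.9, "L": 3.8, "V": 4.2, "F": 2.8, "I": 4.5,
}

def hydropathy_profile(sequence, window=9, scale=KYTE_DOOLITTLE):
    """Return (1-based centre position, summed hydropathy) per full window."""
    if window < 1 or window > len(sequence):
        raise ValueError("window must be between 1 and the sequence length")
    values = [scale[residue] for residue in sequence]
    half = window // 2  # use an odd window so that the centre is unambiguous
    profile = []
    for start in range(len(values) - window + 1):
        total = sum(values[start:start + window])
        profile.append((start + half + 1, total))
    return profile

# Assumption: Z is scored as -3.5, the value shared by E and Q.
scale = dict(KYTE_DOOLITTLE, Z=-3.5)
profile = hydropathy_profile("MEZCALTASTESVERYNICE", window=5, scale=scale)
for position, value in profile:
    print(position, round(value, 2))
```

Positions closer to the ends than half a window get no value, which is why the plotted curve is shorter than the sequence.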

Transmembrane Regions • In an α-helix, rotation is 100 degrees per amino acid residue • Climb (rise) is 1.5 Ångström per amino acid residue

Transmembrane Regions • The membrane is about 30 Ångström thick • So we need approx. 30/1.5 = 20 amino acids to span the membrane

Adapting the window size to the size of the membrane spanning segment makes the picture easier to interpret
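
A rough sketch of how the helix geometry and the window size fit together: the number of residues needed to span the membrane suggests a window of about 19 to 20, and windows whose mean hydropathy is high become candidate membrane-spanning segments. The function names, the test sequence, and the threshold of 1.6 are illustrative assumptions, not values given on the slides. The scale is again Kyte & Doolittle.

```python
# Kyte & Doolittle scale in compact form (same values as in the table).
KD = dict(zip("RKDQNEHSTPYCGAMWLVFI",
              [-4.5, -3.9, -3.5, -3.5, -3.5, -3.5, -3.2, -0.8, -0.7, -1.6,
               -1.3, 2.5, -0.4, 1.8, 1.9, -0.9, 3.8, 4.2, 2.8, 4.5]))

def helix_residues(thickness=30.0, rise=1.5):
    """Residues an alpha-helix needs to span a membrane (lengths in Angstrom)."""
    return round(thickness / rise)

def candidate_tm_centres(sequence, window=19, threshold=1.6):
    """1-based centres of windows whose mean hydropathy exceeds the threshold.

    The threshold is an assumed, illustrative cut-off.
    """
    values = [KD[residue] for residue in sequence]
    half = window // 2
    centres = []
    for start in range(len(values) - window + 1):
        mean = sum(values[start:start + window]) / window
        if mean > threshold:
            centres.append(start + half + 1)
    return centres

# Hypothetical protein: a 20-residue leucine stretch between charged flanks.
toy = "D" * 15 + "L" * 20 + "D" * 15
print(helix_residues())           # → 20
print(candidate_tm_centres(toy))  # centres 20..31, inside the leucine stretch
```

With a much smaller window the profile follows individual residues, and with a much larger one the hydrophobic stretch is averaged away, which is the effect the window-size comparison illustrates.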

Hydrophobicity plots for window = 1, window = 9, window = 19, window = 121

Protein Interactions

Protein Interactions • Obligatory: hemoglobin

Protein Interactions • Obligatory: hemoglobin • Transient: mitochondrial Cu transporters

Experimental approaches (1) Yeast two-hybrid (Y2H)

Experimental approaches (2) Affinity Purification + mass spectrometry (AP-MS)

Interaction Databases • STRING http://string.embl.de/ • HPRD http://www.hprd.org/ • InteroPorc http://biodev.extra.cea.fr/interoporc/Default.aspx • Many others, e.g. see http://nar.oxfordjournals.org./content/39/suppl_1.toc

Yeast protein interaction network
