Protein sequence analysis

Protein sequence analysis Xu Cheng

Knowing what you must about domains, HMMs, profiles, and the Pfam domain collection • Visiting the three most popular sites for finding domains in your protein • Predicting simple physical properties of your sequences • Predicting protease digestion patterns • Predicting coiled-coil domains • Predicting post-translational modifications

Predicting the main physico-chemical properties of a protein • ProtParam: physico-chemical parameters of a protein sequence (amino-acid and atomic compositions, isoelectric point, extinction coefficient, etc.)
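Two of the parameters listed above can be approximated directly from the sequence. The sketch below is only an illustration, not ProtParam's own code: the function names and the example sequence are hypothetical, and the molar extinction contributions at 280 nm (5500 per Trp, 1490 per Tyr, 125 per cystine) are the commonly quoted values, assumed here rather than taken from the slides.

```python
from collections import Counter

# Assumed molar extinction contributions at 280 nm (M^-1 cm^-1).
EXT_TRP = 5500
EXT_TYR = 1490
EXT_CYSTINE = 125  # per disulfide-bonded pair of cysteines


def composition(seq):
    """Return {residue: percentage} for a one-letter protein sequence."""
    seq = seq.upper()
    counts = Counter(seq)
    return {aa: 100.0 * n / len(seq) for aa, n in sorted(counts.items())}


def extinction_coefficient(seq, cystines=False):
    """Estimate the extinction coefficient at 280 nm.

    If cystines is True, every pair of Cys residues is assumed to
    form a disulfide bond (an upper bound on the cystine count).
    """
    seq = seq.upper()
    eps = seq.count("W") * EXT_TRP + seq.count("Y") * EXT_TYR
    if cystines:
        eps += (seq.count("C") // 2) * EXT_CYSTINE
    return eps


example = "MKWVTFISLLYCC"  # hypothetical sequence, for illustration only
print(composition(example))
print(extinction_coefficient(example), extinction_coefficient(example, cystines=True))
```

Reporting the value both with and without cystines sidesteps the question of whether the cysteines are actually oxidized, which the sequence alone cannot answer.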

Digesting a protein in a computer • Separate the domains in your protein • Identify potential post-translational modifications by mass spectrometry • Remove a tag protein when you express a fusion protein • Make sure that the protein you’re cloning isn’t sensitive to some endogenous proteases • PeptideCutter is available from the ExPASy Web site at www.expasy.org/tools/#proteome
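An in-silico digest can be sketched in a few lines. The example below assumes the simplified textbook rule for trypsin (cut after K or R unless the next residue is P); PeptideCutter applies more detailed cleavage models, so treat this as an illustration only. The function name and the test sequence are hypothetical.

```python
import re


def trypsin_digest(seq):
    """Split a protein sequence into peptides using a simplified trypsin rule.

    Assumed rule: cleave on the C-terminal side of K or R,
    except when the following residue is P.
    """
    seq = seq.upper()
    # Zero-width split point: preceded by K or R, not followed by P.
    # Splitting on empty matches requires Python 3.7 or later.
    pieces = re.split(r"(?<=[KR])(?!P)", seq)
    return [p for p in pieces if p]


print(trypsin_digest("AKRPGKLR"))
```

The same pattern-based approach extends to other enzymes by changing the regular expression, as long as their specificity can be written as a short sequence rule.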

Doing Primary Structure Analysis • Hydrophobic regions that could be membrane-spanning segments in proteins that anchor themselves into a membrane • Coiled-coil regions that indicate potential protein-protein interaction • Hydrophilic stretches that could be looping out at the surface of the protein

Sliding windows • The “sliding windows” technique is the oldest way of looking at sequences. The principle is very simple. What you need is a chemical property and a table of values associated with each of the 20 amino acids. This property can be any measurable physico-chemical parameter, such as size, polarity, hydrophobicity, or even the propensity of amino acids to be in a specific structural state. The values in this table are the amino acids’ scale values. Many such tables exist, determined experimentally for almost any characteristic you can think of.
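The sliding-window principle can be sketched as follows: look up the scale value of each residue, average the values inside a window, and assign the average to the window's central position. The scale used here is the Kyte and Doolittle hydropathy scale; the function name, the default window size, and the example sequence are assumptions made for illustration.

```python
# Kyte and Doolittle hydropathy scale (one-letter codes).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def sliding_window_profile(seq, scale, window=9):
    """Return [(center_position, window_average), ...] with 1-based positions.

    For an odd window the reported position is the exact center;
    for an even window it is the residue just right of center.
    """
    seq = seq.upper()
    if window < 1 or window > len(seq):
        raise ValueError("window must be between 1 and the sequence length")
    values = [scale[aa] for aa in seq]
    half = window // 2
    profile = []
    for start in range(len(values) - window + 1):
        avg = sum(values[start:start + window]) / window
        profile.append((start + half + 1, avg))
    return profile


# Hypothetical sequence: a hydrophobic stretch flanked by charged residues.
for position, score in sliding_window_profile("KKDEILLVVAAILLKKDE", KYTE_DOOLITTLE, window=5):
    print(position, round(score, 2))
```

Any other amino-acid table can be passed as `scale`, which is what makes the technique so general: only the table changes, not the procedure.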

Looking for transmembrane segments • ProtScale uses a sliding-window technique and one of many amino-acid scale values. In this example, we use the hydrophobicity to identify the groups of hydrophobic segments that characterize transmembrane proteins. ProtScale doesn’t predict anything for you; it returns a hydrophobicity profile and lets you do the interpretation. • TMHMM is a state-of-the-art program that predicts transmembrane segments in your protein. TMHMM also tells you about the portions of your protein that are probably inside the cell and those that are probably outside.
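Interpreting a hydrophobicity profile by hand usually amounts to looking for stretches that stay above some cutoff. The sketch below automates that reading for a list of window averages. The threshold of 1.6 is the value often quoted for a 19-residue Kyte and Doolittle window; it is an assumption here, as are the function name and the toy profile. This is a simple heuristic and has nothing to do with the hidden Markov model inside TMHMM.

```python
def candidate_segments(scores, threshold=1.6, min_run=1):
    """Find runs of consecutive windows whose average exceeds threshold.

    scores  : list of window averages, one per window, in sequence order
    min_run : minimum number of consecutive windows for a run to be kept
    Returns a list of (first_index, last_index) pairs, 0-based and inclusive.
    """
    segments = []
    run_start = None
    for i, score in enumerate(scores):
        if score > threshold:
            if run_start is None:
                run_start = i  # a new run begins here
        elif run_start is not None:
            if i - run_start >= min_run:
                segments.append((run_start, i - 1))
            run_start = None
    # Close a run that extends to the end of the profile.
    if run_start is not None and len(scores) - run_start >= min_run:
        segments.append((run_start, len(scores) - 1))
    return segments


toy_profile = [0.1, 1.7, 2.0, 0.5, 1.9]  # hypothetical window averages
print(candidate_segments(toy_profile))
print(candidate_segments(toy_profile, min_run=2))
```

Raising `min_run` filters out isolated peaks, which is roughly what the eye does when scanning a ProtScale plot.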

Looking for coiled-coil regions • Coiled-coil regions are portions of a protein formed by the intertwining of two or three alpha-helices. One reason it’s considered interesting to find coiled-coil regions is that they’re often involved in protein-protein interactions. • Another (less glorious) reason is that these coiled-coil regions can give false matches when you do a database search, so it can be a good thing to filter them out. • If you want to predict these regions in your protein of interest, you can use the conveniently named COILS server at EMBnet: www.ch.embnet.org/software/COILS_form.html
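Coiled coils are built on a seven-residue (heptad) repeat, conventionally labelled a to g, in which positions a and d tend to be hydrophobic. The toy scorer below checks how well each 28-residue window fits that pattern in its best register. It is a deliberately crude stand-in, not the COILS algorithm, which scores windows against position-specific residue frequencies. The hydrophobic residue set, the function names, and the test sequence are all assumptions.

```python
HYDROPHOBIC = set("AILMFVWY")  # assumed definition of "hydrophobic"


def heptad_score(window, frame):
    """Fraction of a/d heptad positions holding a hydrophobic residue.

    frame shifts the register: residue i is at heptad position (i + frame) % 7,
    where 0 stands for 'a' and 3 stands for 'd'.
    """
    hits = 0
    total = 0
    for i, aa in enumerate(window):
        if (i + frame) % 7 in (0, 3):
            total += 1
            if aa in HYDROPHOBIC:
                hits += 1
    return hits / total if total else 0.0


def best_heptad_scores(seq, window=28):
    """For each window start, return the best score over the 7 possible registers."""
    seq = seq.upper()
    scores = []
    for start in range(len(seq) - window + 1):
        chunk = seq[start:start + window]
        scores.append(max(heptad_score(chunk, frame) for frame in range(7)))
    return scores


ideal = "LKELEDK" * 4  # hypothetical sequence with L at every a and d position
print(best_heptad_scores(ideal))
```

Trying all seven registers matters because the sequence gives no indication of where the heptad starts.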

Predicting Post-Translational Modifications in Your Protein • These modifications may involve adding sugars, modifying amino acids, or removing pieces of the newly synthesized protein. • This may be very important if you want to clone and express a human protein in bacteria because, in order to be active, your protein may require some post-translational modifications that the bacterium itself cannot make. • http://www.expasy.org/tools/#ptm

Looking for PROSITE patterns • ScanProsite: www.expasy.org/tools/scanprosite/
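A PROSITE pattern is close to a regular expression, so a small scan can be sketched by translating one syntax into the other. The converter below covers only the common elements (x, [..], {..}, repeat counts, and the < and > anchors outside brackets); it is a hypothetical helper, not ScanProsite's implementation. The example pattern N-{P}-[ST]-{P} is the N-glycosylation site pattern, and the test sequence is invented.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into Python regex syntax.

    Not handled: anchors placed inside square brackets, e.g. [G>].
    """
    regex = pattern.strip().rstrip(".")
    regex = regex.replace("-", "")                       # element separators
    regex = regex.replace("{", "[^").replace("}", "]")   # forbidden residues
    regex = regex.replace("(", "{").replace(")", "}")    # repeat counts
    regex = regex.replace("x", ".")                      # any residue
    regex = regex.replace("<", "^").replace(">", "$")    # terminal anchors
    return regex


def scan(seq, pattern):
    """Return [(1-based start, matched text), ...], including overlapping hits."""
    # The lookahead lets matches overlap; group 1 holds the matched text.
    rx = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in rx.finditer(seq.upper())]


print(prosite_to_regex("N-{P}-[ST]-{P}."))
print(scan("MNGTANPSNKSA", "N-{P}-[ST]-{P}."))
```

Short patterns like this one match by chance in many proteins, so a hit is a hint to check, not a prediction.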

Finding Known Domains in Your Protein • A domain is a portion of a protein that can keep its shape when separated from the rest of the protein • InterProScan • CD-Search • Motif Scan

Finding domains with InterProScan • www.ebi.ac.uk/InterProScan/.

Finding domains with the CD server • The main advantage of the CD server is that reported hits come with a score that helps you discriminate the good from the spurious matches. • www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi

Finding domains with Motif Scan • Motif Scan includes some domains that have not yet been released officially via InterPro. • myhits.isb-sib.ch/cgi-bin/motif_scan

Epitope prediction • Antibodies are produced by B lymphocytes (B cells) • Antibodies circulate in the blood • They are referred to as “the first line of defense” against infection • Antibodies play a central role in immunity by attaching to pathogens and recruiting effector systems that kill the invader

What is a B cell epitope? • Antibodies are developed to bind the epitope with high affinity by using the complementarity determining regions (CDRs)

Motivations for prediction of B cell epitopes • Prediction of B cell epitopes can potentially guide experimental epitope mapping • Predictions of antigenicity in proteins can be used for selecting subunits in rational vaccine design • Predictions of B cell epitopes may also be valuable for interpretation of results from experiments based on antibody affinity binding such as ELISA, RIA

Computational Rational Vaccine Design

B cell epitopes, linear or discontinuous? • Classified into linear (~10%) and discontinuous epitopes (~90%) • Databases: AntiJen, IEDB, BciPep, Los Alamos HIV database, Protein Data Bank • Large amount of data available for linear epitopes • Few data available for discontinuous epitopes • In general, B cell epitope prediction methods have relatively low performances

Discontinuous B cell epitopes

The binding interactions

B-cell epitope databases • Databases: AntiJen, IEDB, BciPep, Los Alamos HIV database, Protein Data Bank • Large amount of data available for linear epitopes • Few data available for discontinuous epitopes

Sequence-based methods for prediction of linear epitopes • Protein hydrophobicity/hydrophilicity algorithms: Parker; Fauchere; Janin; Kyte and Doolittle; Manavalan; Sweet and Eisenberg; Goldman, Engelman and Steitz (GES); von Heijne • Protein flexibility prediction algorithm: Karplus and Schulz • Protein secondary structure prediction algorithms: GOR II method (Garnier and Robson); Chou and Fasman; Pellequer • Protein “antigenicity” prediction: Hopp and Woods; Welling

Propensity scales: The principle
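The principle can be sketched with the Hopp and Woods hydrophilicity scale: average the scale over short windows and take the highest-scoring window as the most likely antigenic site. The six-residue window follows the usual description of the method; the function name and the test sequence are hypothetical, and, as noted above for propensity methods in general, the predictive power of such a scan is modest.

```python
# Hopp and Woods hydrophilicity scale (one-letter codes).
HOPP_WOODS = {
    "R": 3.0, "D": 3.0, "E": 3.0, "K": 3.0, "S": 0.3,
    "N": 0.2, "Q": 0.2, "G": 0.0, "P": 0.0, "T": -0.4,
    "A": -0.5, "H": -0.5, "C": -1.0, "M": -1.3, "V": -1.5,
    "I": -1.8, "L": -1.8, "Y": -2.3, "F": -2.5, "W": -3.4,
}


def most_hydrophilic_window(seq, window=6):
    """Return (1-based start, peptide, average score) of the best window.

    Ties are resolved in favor of the first window encountered.
    """
    seq = seq.upper()
    if len(seq) < window:
        raise ValueError("sequence is shorter than the window")
    best_start = 0
    best_avg = float("-inf")
    for start in range(len(seq) - window + 1):
        avg = sum(HOPP_WOODS[aa] for aa in seq[start:start + window]) / window
        if avg > best_avg:
            best_start, best_avg = start, avg
    return best_start + 1, seq[best_start:best_start + window], best_avg


# Hypothetical sequence: a charged hexapeptide buried between leucine runs.
print(most_hydrophilic_window("LLLLLLKDEKRDLLLL"))
```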

Evaluation of performance • A receiver operating characteristic (ROC) curve is useful for finding a good threshold and for ranking methods
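A ROC curve is obtained by sweeping the decision threshold from the highest score to the lowest and recording the true-positive and false-positive rates at each step; the area under the curve (AUC) then gives a single number for comparing methods. The sketch below assumes per-residue prediction scores and binary labels (1 for epitope residue, 0 otherwise). The function names and the toy data are hypothetical.

```python
def roc_points(scores, labels):
    """Return ROC points [(FPR, TPR), ...] from (0, 0) to (1, 1).

    scores : predicted values, higher meaning "more likely epitope"
    labels : 1 for true epitope residues, 0 for non-epitope residues
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        # Residues with tied scores cross the threshold together.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1]:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / negatives, tp / positives))
        i = j
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


toy_scores = [0.9, 0.8, 0.7, 0.6]  # hypothetical prediction scores
toy_labels = [1, 0, 1, 0]          # hypothetical epitope annotation
print(roc_points(toy_scores, toy_labels))
print(auc(roc_points(toy_scores, toy_labels)))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect separation of epitope and non-epitope residues, which makes the value convenient for ranking scales against each other on the same data set.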

Turn prediction and B-cell epitopes • Pellequer found that 50% of the epitopes in a data set of 11 proteins were located in turns • Turn propensity scales for each position in the turn were used for epitope prediction.

BepiPred: CBS in-house tool • Parker hydrophilicity scale • Hidden Markov model • Markov model based on linear epitopes extracted from the AntiJen database • Combination of the Parker prediction scores and Markov model leads to prediction score • Tested on the Pellequer dataset and epitopes in the HIV Los Alamos database • www.cbs.dtu.dk/services/BepiPred

Protean • Several tools integrated • Easy to handle
