1. Protein Analysis
2. Protein Analysis Peptide Mapping
Structural/Functional Motifs
Secondary Structure Prediction
3. Motifs Identification of Functional Domains
4. Identifying Functional Protein Domains Search protein sequences with a database of defined functional motifs
Motifs are derived by aligning peptide regions which have been shown to have common function
A sequence specification is derived from the alignment which can be used to search for similar motifs in other protein sequences
5. Motif Sequence Specifications The sequence specification is the same as for FindPatterns. This is used as a consensus pattern in a search
Motifs
The sequence specification may also be defined as a profile constructed from a set of aligned sequences and used as a part of a library of profiles in a search
ProfileScan
6. Pattern Definitions FindPatterns, Map, MapSort, MapPlot, and Motifs all let you search with ambiguous expressions
Expressions can include any legal GCG sequence character
Expressions can also specify:
OR and NOT matching
Begin and end constraints
Repeat counts
7. TAATA(N){20,30}ATG TAATA, followed by 20 to 30 of any base, followed by ATG
8. Repeats Parentheses () enclose one or more symbols that can be repeated
Braces {} enclose numbers that tell how many times the symbol(s) must be found
(GA){2,10} - GA repeated 2 to 10 times
G{2,} - G repeated 2 to 350,000 times
(GAT){,10} - GAT repeated 0 to 10 times
9. OR Matching Enclose the different choices in parentheses and separate the choices with commas
RGF(Q,A)S
RGF followed by either Q or A followed by S.
GAT(TG,T,G){1,4}A means
GAT followed by any combination of TG, T, or G repeated from 1 to 4 times followed by A
10. NOT Matching Use the ~ symbol
GC~CAT
GC, followed by any symbol except C followed by AT
GC~(A,T)CC
GC followed by any symbol except A or T, followed by CC.
11. BEGIN AND END Constraints The pattern <GACCAT can only be found at the beginning of the sequence
The pattern GACCAT> can only be found at the end of the sequence
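The pattern syntax on the preceding slides maps fairly directly onto regular expressions. The following is a minimal sketch of that mapping, not GCG's own implementation: the function name `gcg_to_regex` and the `wildcard` parameter are hypothetical, nucleotide ambiguity codes other than N are not expanded, and NOT lists are assumed to contain single symbols only.

```python
import re

def gcg_to_regex(pattern, wildcard="N"):
    """Translate a FindPatterns-style expression into a Python regex string.

    Assumption: `wildcard` is the symbol meaning "any symbol" (N for
    nucleotides). Pass wildcard=None for protein patterns, where N is Asn.
    """
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "<":                       # begin constraint
            out.append("^")
        elif ch == ">":                     # end constraint
            out.append("$")
        elif ch == "~":                     # NOT: one symbol or a (A,T) list
            i += 1
            if pattern[i] == "(":
                close = pattern.index(")", i)
                excluded = pattern[i + 1:close].replace(",", "")
                i = close
            else:
                excluded = pattern[i]
            out.append("[^" + excluded + "]")
        elif ch == "{":                     # repeat count, copied unchanged
            close = pattern.index("}", i)
            out.append(pattern[i:close + 1])
            i = close
        elif ch == "(":                     # grouping for repeats and OR
            out.append("(?:")
        elif ch == ",":                     # OR separator inside parentheses
            out.append("|")
        elif ch == wildcard:
            out.append(".")
        else:                               # literal symbol or ")"
            out.append(ch)
        i += 1
    return "".join(out)

print(gcg_to_regex("TAATA(N){20,30}ATG"))   # TAATA(?:.){20,30}ATG
print(gcg_to_regex("GC~(A,T)CC"))           # GC[^AT]CC
```

The resulting string can be passed to `re.search`, for example `re.search(gcg_to_regex("RGF(Q,A)S", wildcard=None), protein)`.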
12. Motifs Uses the Prosite dictionary of peptide motifs to search for occurrences of each motif in a query sequence
13. Prosite Dictionary of protein sites and patterns
http://www.expasy.ch/prosite/
Distributed by EMBL and maintained by Dr. Amos Bairoch at the University of Geneva
Release 16.35; 13-Apr-2001
1,462 motif descriptions
GCG at release 16, 7/1999
14. Prosite Files Site name
Site Description
The sequence motif in FindPatterns format
An abstract file describing the motif along with references
15. Restrictions Patterns are limited to 350 characters
Motifs does not introduce gaps
Mismatches can be tolerated with /Mis=n
22. ProfileScan Uses a database of profiles to scan query sequences for matching structural motifs
New profiles can be created with ProfileMake
23. Validated Profiles Profiles derived from a group of sequences aligned at a common functional domain
All sequences used to create the profile correctly align to the profile
All sequences known to contain the motif score above the high level
Supplied profiles validated by Dr. Michael Gribskov
San Diego Supercomputing Center
24. Profile List In ProfileDIR
629 different profiles
analyze% to profiledir
analyze% more profilename.prf
to see documentation
35. Isoelectric Plots the charge as a function of pH for any protein
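A charge-versus-pH curve of this kind can be sketched with the Henderson-Hasselbalch relation. The pKa values below are approximate textbook figures chosen for illustration; they are assumptions, and the constants used by the Isoelectric program itself may differ. The function names are hypothetical.

```python
# Approximate pKa values (assumed; published tables differ slightly).
POSITIVE = {"K": 10.5, "R": 12.5, "H": 6.0}
NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1}
N_TERM_PKA = 9.0   # assumed value for the free amino terminus
C_TERM_PKA = 2.0   # assumed value for the free carboxy terminus

def net_charge(seq, ph):
    """Estimated net charge of a peptide at the given pH."""
    charge = 1.0 / (1.0 + 10 ** (ph - N_TERM_PKA))       # protonated N-terminus
    charge -= 1.0 / (1.0 + 10 ** (C_TERM_PKA - ph))      # deprotonated C-terminus
    for aa in seq.upper():
        if aa in POSITIVE:
            charge += 1.0 / (1.0 + 10 ** (ph - POSITIVE[aa]))
        elif aa in NEGATIVE:
            charge -= 1.0 / (1.0 + 10 ** (NEGATIVE[aa] - ph))
    return charge

def isoelectric_point(seq, lo=0.0, hi=14.0, tol=1e-4):
    """Bisection on pH; relies on charge decreasing as pH rises."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if net_charge(seq, mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Points for a charge-versus-pH plot, one per pH unit.
curve = [(ph, net_charge("ACDKHEY", ph)) for ph in range(0, 15)]
```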
42. CoilScan Locates coiled-coil motifs
Involved in protein-protein interactions
Uses weight-matrix from known coiled-coil structures to search for matching structures
Locates solvent exposed coiled-coils
parallel and antiparallel two-stranded coiled-coils
parallel three-stranded coiled-coils
43. Coiled-Coil Structures Bundles of two or more alpha helices that are supercoiled together
Each alpha helix in a coiled coil is strongly amphipathic
The pattern of hydrophilic and hydrophobic amino acids repeats every seven residues
Five of the seven residue positions in the coiled-coil heptad repeat are hydrophilic
Positions 1 and 4 (a and d) are hydrophobic
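The heptad repeat can be illustrated by labelling residues a through g and checking how often the a and d positions are hydrophobic. This is only a sketch of the idea, not the weight-matrix method CoilScan uses; the hydrophobic residue set and the function names are assumptions made for the example.

```python
HEPTAD = "abcdefg"
HYDROPHOBIC = set("AVILMFWC")   # assumed set, for illustration only

def heptad_positions(seq, first="a"):
    """Heptad label for each residue, given the label of the first residue."""
    start = HEPTAD.index(first)
    return [HEPTAD[(start + i) % 7] for i in range(len(seq))]

def core_hydrophobic_fraction(seq, first="a"):
    """Fraction of a and d positions occupied by hydrophobic residues."""
    labels = heptad_positions(seq, first)
    core = [aa for aa, pos in zip(seq.upper(), labels) if pos in "ad"]
    if not core:
        return 0.0
    return sum(aa in HYDROPHOBIC for aa in core) / len(core)

def best_register(seq):
    """Register (label of the first residue) that best fills a and d."""
    return max(HEPTAD, key=lambda r: core_hydrophobic_fraction(seq, r))

peptide = "LKELEDK" * 4          # made-up sequence with Leu at a and d
print(best_register(peptide), core_hydrophobic_fraction(peptide, "a"))
```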
44. CoilScan Settings Use the largest window length (28) for predicting new coiled-coil segments
Higher resolution
Use smaller window sizes to identify the ends of the coiled-coil segment with greater precision.
-weight increases the weighting of the hydrophobic residues allowing less chance of detecting highly charged sequences
50. HTHscan Locate helix-turn-helix motifs
Signature of DNA binding structures
Gene regulation
Uses a weight-matrix of known H-T-H structures
AraC (bacterial regulatory helix-turn-helix proteins)
LysR (bacterial regulatory helix-turn-helix proteins)
homeobox domains
55. SPScan Locate secretory signal peptide motifs
Available weight matrices:
Eukaryotes
Gram-positive prokaryotes
Gram-negative prokaryotes
56. SPScan Calculation von Heijne's weight matrix method
McGeoch's criteria
Scan entire protein sequence for potential starting points
Only methionines are considered SP starting points
57. SPScan Calculation Identify n-region/charged region
11 or fewer residues containing at least one charged amino acid residue
R, K, or E
Identify h-region/uncharged region
Hydrophobicity of >=15
Kyte-Doolittle 8-residue Window
Score with Weight Matrix
62. Secondary Structure Prediction
63. Protein Secondary Structure Predictions The primary sequence of a protein contains the information necessary to predict higher order interactions among the constituent groups of that protein
Once the rules are known, we can let the computer do the work
What are the rules?!!!!
64. Considerations Many of the measures used in structure prediction have been derived empirically
The data has been obtained from proteins whose structure has been determined by X-ray crystallography.
The dataset is limited
Very little "hard" data is used to derive the mathematical formulas and constants used to make the predictions.
65. Chou-Fasman Predictions Applies to soluble (globular) proteins.
Derive numerical value for each amino acid reflecting its conformational preference.
Numbers derived empirically.
66. Rules Derive "arbitrary" rules to determine local peptide conformation based on the amino acid sequence and the secondary structure propensity for each amino acid.
67. Helices Cluster of 4 helical residues out of 6 will nucleate a helix.
Pro, Asp, and Glu are favored at the amino-terminal end of a helix
Pro can occur only within the first three residues of a helix
68. Helices His, Lys, and Arg are favored at the carboxy terminus of a helix
The helix extends until a tetrapeptide of alpha breakers with average P-alpha below 1.00 is reached
69. Beta Sheets A cluster of 3 beta formers out of 5 residues nucleates a sheet, which extends until a tetrapeptide of beta breakers with average P-beta < 1.00 is reached
Regions with both alpha and beta formers
Helical if P-alpha > P-beta
Sheet if P-beta > P-alpha
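The nucleation rules (4 helix formers in 6 residues, 3 sheet formers in 5) and the overlap rule can be sketched as a window scan. The former sets below are small illustrative subsets, not the published Chou-Fasman tables, and the function names are hypothetical; extension and breaker handling are left out.

```python
ALPHA_FORMERS = set("EAML")   # illustrative subset only, not the full table
BETA_FORMERS = set("VIYF")    # illustrative subset only, not the full table

def find_nuclei(seq, formers, window, needed):
    """Start indices of windows holding at least `needed` former residues."""
    seq = seq.upper()
    hits = []
    for start in range(len(seq) - window + 1):
        count = sum(aa in formers for aa in seq[start:start + window])
        if count >= needed:
            hits.append(start)
    return hits

def resolve_overlap(mean_p_alpha, mean_p_beta):
    """Pick the state for a region that contains both kinds of formers."""
    if mean_p_alpha > mean_p_beta:
        return "helix"
    if mean_p_beta > mean_p_alpha:
        return "sheet"
    return "undecided"

helix_nuclei = find_nuclei("GGEAMLGG", ALPHA_FORMERS, window=6, needed=4)
sheet_nuclei = find_nuclei("GGVIYGG", BETA_FORMERS, window=5, needed=3)
print(helix_nuclei, sheet_nuclei)   # [0, 1, 2] [0, 1, 2]
```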
70. Turns Turns based on tables giving the frequencies of all 20 amino acids in a 4-residue bend, and the P turn values for each amino acid.
71. Garnier-Osguthorpe-Robson The conformational state of a particular residue is determined not only by the empirical values for the residue itself but also by the values for the 8 residues on each side of it.
Cooperative effects are included at the outset.
Fewer arbitrary rules as to structure formation.
72. Including Biological Data Can include decision constants to weight (and therefore improve) the prediction when information as to the amount of helix and sheet structures is known.
Not available in GCG
Circular dichroism
Raman spectroscopy
73. Accuracy of the Predictions Q-score: percentage of residues placed in the correct structural class.
Three state predictions: Alpha, Beta, Coil
Random = 33%
Four state predictions include turns
Random = 25%
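The Q-score is a per-residue percentage, so it takes only a few lines to compute. A minimal sketch, assuming predictions and observations are given as equal-length strings of state letters (H, E, C here; the letters and the function name are choices made for this example):

```python
def q_score(predicted, observed):
    """Percentage of residues assigned to the correct structural class."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)

# Eight of nine residues agree.
print(q_score("HHHEEECCC", "HHHEECCCC"))
```

With three equally likely states a random assignment scores about 33%, and with four states about 25%, which is the baseline the slide refers to.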
74. Observed Accuracy Using structures known at the time the predictive methods were compiled (late 1970s), a Q score of approximately 68% was obtained (three state).
Analysis of protein structures determined since then gave Q scores of around 55% for any of the predictive schemes (three state).
75. Observed Accuracy Four state predictions gave Q scores of around 45%.
Internal beta sheets more accurately predicted than external sheets.
Q drops from 60 to 20% (three state).
76. Hydropathic Plots Determine antigenic regions
Determine membrane spanning regions
Determine protein folding patterns; Residues on the interior vs. exterior of the protein.
77. Hopp-Woods Measure directed towards the determination of antigenic sites.
Values for each amino acid derived from published values of the partitioning of individual amino acids between solvents.
Ethanol-Water
Ethanol chosen as a solvent which might resemble the solvent phase on the interior of a protein.
Ethanol may not be the best solvent choice.
78. Hopp-Woods The values for some amino acids were altered, based on empirical data, to better reflect the association between hydrophilicity and known antigenic sites.
Proline: -1.4 to 0
Aspartic acid: 2.5 to 3
Glutamic acid: 2.5 to 3
79. Hopp-Woods Window Window of six was chosen as the approximate size of an antigenic determinant.
GCG uses a default of 7
80. Kyte-Doolittle Use hydrophilicity values for individual amino acids based upon water-vapor transfer free energies.
Ethanol is not a neutral, non-interacting solvent.
Also use empirical data based on the partitioning of individual amino acids to the exterior or interior of proteins with known structures.
81. Kyte-Doolittle Included subjective assessments as to the hydropathic character of any individual amino acid.
Best window found to be 7 - 11 residues lowering the noise without smoothing out significant peaks.
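A hydropathy plot is a moving average of per-residue values. The sketch below uses the published Kyte-Doolittle hydropathy scale and a default window of 9, which falls in the 7 to 11 range given above; the function name is hypothetical and this is not the GCG code.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(seq, window=9):
    """Mean hydropathy of every full window along the sequence."""
    values = [KYTE_DOOLITTLE[aa] for aa in seq.upper()]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

profile = hydropathy_profile("IIIIIIIIIRRRRRRRRR")
print(profile[0], profile[-1])   # hydrophobic start, hydrophilic end
```

Each value belongs to the center of its window, so a plot would place `profile[i]` at residue `i + window // 2` (zero-based). A smaller window gives a noisier curve; a larger one flattens short peaks.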
82. Flexibility Flexibility measure determined from B-value of C-alpha atoms of the individual amino acids
Temperature Factor reflecting the flexibility constraint on the alpha-Carbon.
Derive a formula empirically to fit known flexibility data from crystal structures
Flexibility can be severely constrained by tertiary interactions.
S-S bonds
83. Surface Probability Utilizes empirical determination of which amino acids were found to reside at the surface of proteins with known structure.
84. Predicting Antigenic Sites Surface features with high degree of exposure to the solvent.
Hydrophilic
Regions of high numbers of turns.
Determination of the most probable antigenic sites allows some predictive value for the most likely synthetic peptides to use for making antibodies
85. Antigenic Index Measure combining all of the above data.
Note that many of these measures are derived from similar empirical data.
AI = 0.3*H + 0.15*S + 0.15*F + 0.2*C + 0.2*G
H: hydrophilicity; S: surface probability; F: flexibility; C, G: Chou-Fasman and Garnier turn predictions
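The index is a fixed weighted sum whose weights add up to 1. A minimal sketch of the arithmetic, assuming each input is a per-residue score already on a common scale (how the individual measures are scaled is not stated on the slide, and the function names are hypothetical):

```python
def antigenic_index(h, s, f, c, g):
    """Weighted sum from the slide: hydrophilicity, surface probability,
    flexibility, and the Chou-Fasman and Garnier turn scores."""
    return 0.3 * h + 0.15 * s + 0.15 * f + 0.2 * c + 0.2 * g

def antigenic_profile(h_vals, s_vals, f_vals, c_vals, g_vals):
    """Apply the index residue by residue to five equal-length score lists."""
    return [
        antigenic_index(*scores)
        for scores in zip(h_vals, s_vals, f_vals, c_vals, g_vals)
    ]

# Made-up scores for three residues, only to show the call shape.
print(antigenic_profile([1, 0, 1], [1, 0, 0], [1, 0, 0], [1, 0, 1], [1, 0, 0]))
```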
86. N-Glycosylation Sites Asn-X-Ser
Asn-X-Thr
Only minor probability when X=Asp, Trp, or Pro.
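The Asn-X-Ser/Thr rule is easy to express as a pattern search. A minimal sketch, with a hypothetical function name, that reports each candidate site and flags the cases the slide marks as low probability (X = Asp, Trp, or Pro):

```python
import re

def n_glycosylation_sites(seq):
    """Return (1-based Asn position, X residue, low_probability) tuples."""
    sites = []
    # A lookahead is used so that overlapping sites such as N-N-S-T are all found.
    for match in re.finditer(r"(?=N(.)[ST])", seq.upper()):
        x = match.group(1)
        sites.append((match.start() + 1, x, x in "DWP"))
    return sites

for site in n_glycosylation_sites("MNGTANPSNNST"):
    print(site)
```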
87. The Prediction Programs
88. Moment Calculates the hydrophobic moment for a peptide.
May be predictive of amphipathic alpha helical or beta sheet structures.
89. Moment Calculation Calculates the hydrophobicity of one side of the peptide chain as the amino acid residues are rotated through 180°
Plots a contour graph of the hydrophobicity versus the angle of rotation for each residue (or window of residues)
90. Moment Predictions Alpha Helices show a typical rotation per residue of 100°
A significant increase in hydrophobicity in the contour plot at 100° would indicate the possible existence of an amphipathic helix for the corresponding residues.
91. Moment Predictions Beta Sheets show a rotation of approx. 160°
Significant amphipathic hydrophobicity at a rotation of 160° would then indicate the possibility of a beta sheet for the corresponding residues.
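The moment at a given rotation angle can be sketched as the length of the vector sum of per-residue hydrophobicities, each rotated by the angle times its position (the usual Eisenberg-style formulation). The hydrophobicity table is supplied by the caller, since the scale used by Moment is not given here; the function names and the two-residue table in the usage lines are assumptions for the example.

```python
import math

def hydrophobic_moment(seq, angle_deg, scale):
    """Length of the summed hydrophobicity vectors for one rotation angle.

    `scale` maps one-letter residue codes to hydrophobicity values.
    """
    delta = math.radians(angle_deg)
    residues = seq.upper()
    sin_sum = sum(scale[aa] * math.sin(i * delta) for i, aa in enumerate(residues))
    cos_sum = sum(scale[aa] * math.cos(i * delta) for i, aa in enumerate(residues))
    return math.hypot(sin_sum, cos_sum)

def moment_scan(seq, scale, angles=range(0, 181, 10)):
    """Moment at each angle; a peak near 100 suggests an amphipathic helix,
    a peak near 160 a strand-like periodicity."""
    return {angle: hydrophobic_moment(seq, angle, scale) for angle in angles}

# Toy table: Kyte-Doolittle values for Ile and Asn only.
toy_scale = {"I": 4.5, "N": -3.5}
scan = moment_scan("ININININ", toy_scale)
print(max(scan, key=scan.get))   # strictly alternating pattern peaks at 180
```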
93. HelicalWheel Plots the amino acids of a peptide sequence along a helical wheel in order to recognize regions of amphipathic helices
Residues are plotted at 100° offsets from the preceding residue
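The wheel positions follow from the 100-degree step alone. A minimal sketch (the function name is hypothetical) that returns each residue with its angle on the wheel:

```python
def wheel_angles(seq, step=100):
    """Pair each residue with its angle on the wheel, in degrees."""
    return [(aa, (i * step) % 360) for i, aa in enumerate(seq)]

print(wheel_angles("ABCDE"))
# [('A', 0), ('B', 100), ('C', 200), ('D', 300), ('E', 40)]
```

With a 100-degree step the pattern returns to its starting angle after 18 residues, so residue 19 lies directly over residue 1.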
98. PepPlot Plots of Various Computer Predictions of Peptide Structure
99. PepPlot Sequence
Charged-polar-hydrophobic residues
Beta forming-breaking residues
Chou-Fasman alpha-beta prediction
Alpha forming-breaking residues
100. PepPlot Chou-Fasman amino-end predictions
Chou-Fasman carboxy-end predictions
Chou-Fasman turn predictions
Helical Hydrophobic Moment plot for alpha and beta
Hydropathy and hydrophilicity
108. Panel A - The Sequence
110. Panel b - Residue Schematic Hydrophilic, charged (Green)
down = acidic
up = basic
Hydrophilic, uncharged (Red)
short = amides
long = alcohols
Hydrophobic (Blue)
short = aliphatic
long = aromatic
Proline (Black)
Alanine, Glycine, Cysteine (UnMarked)
112. Panel c - Beta Forming and Breaking Residues Chou-Fasman rules indicating amino acids which tend to form or break beta sheet structures.
114. Panel d - Alpha and Beta Prediction Curves Chou-Fasman rules indicating the propensity of the sequence to form an alpha helix or a beta sheet.
116. Panel e - Alpha Forming and Breaking Residues Chou-Fasman rules indicating amino acids which tend to form or break alpha helical structures.
118. Panel f - Amino End Association Chou-Fasman rules indicating amino acids which tend to be present at the amino ends of an alpha or beta structure.
120. Panel g - Carboxy end association Chou-Fasman rules indicating amino acids which tend to be present at the carboxy ends of an alpha or beta structure.
122. Panel h - Chou-Fasman Turn Predictions Chou-Fasman prediction of the likelihood of a turn.
124. Panel i - Helical Hydrophobic Moment Eisenberg's hydrophobic moment prediction of the likelihood of the presence of an amphiphilic structure.
Alpha helix - Plot of HM maximum for 90° - 110° of rotation using a window of eight residues.
Beta sheet - Plot of HM maximum for 140° - 180° of rotation using a window of six residues.
126. Panel j - Hydropathy and Hydrophilicity Plot Kyte and Doolittle (black curve); Average hydrophobicity over a window of nine residues.
127. GES Goldman, Engelman, and Steitz Transbilayer Helices (green curve)
Identification of nonpolar transbilayer helices over a window of 20 residues. Based on possible lipid-protein interactions and the helical arrangement of the hydrophobic residues.
129. PeptideStructure Compiles various pieces of information concerning protein structure for display using PlotStructure
134. PlotStructure Display of PeptideStructure Results
139. VSV G 1-100 CF
140. VSV G 1-100 G
141. VSV G
142. VSV G Garnier
144. VSV G Surface Probability
145. VSV G Flexibility
146. VSV G Antigenic Index
147. Flu HA
148. Towards the Holy Grail… Community Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction
http://predictioncenter.llnl.gov/
Structure prediction of proteins whose coordinates are not yet publicly available
149. CASP4 – Asilomar, 12/2000 Are the models produced similar to the corresponding experimental structure?
Is the mapping of the target sequence onto the proposed structure (i.e. the alignment) correct?
Have similar structures that a model can be based on been identified?
Are the details of the models correct?
Has there been progress from the earlier CASPs?
What methods are most effective?
Where can future effort be most productively focused?
150. Web-based Prediction Tools ExPASy
Swiss Institute of Bioinformatics
http://www.expasy.ch/
BCM Search Launcher
Baylor College of Medicine
http://searchlauncher.bcm.tmc.edu/
The PredictProtein server
Columbia University
http://maple.bioc.columbia.edu/
151. Next Up Sequence Comparison