Some gory details of protein secondary structure prediction Burkhard Rost CUBIC Columbia University rost@columbia.edu http://www.columbia.edu/~rost http://cubic.bioc.columbia.edu/ Burkhard Rost (Columbia New York)
HoMo 1D ….the art of being humble FoRc
Goal of secondary structure prediction
Secondary structure predictions of the 1st and 2nd generation • single residues (1st generation) • Chou-Fasman, GOR 1957-70/80 50-55% accuracy • segments (2nd generation) • GORIII 1986-92 55-60% accuracy • problems • accuracy < 100% (they said: 65% max) • strands < 40% (they said: strand non-local) • short segments
Helix formation is local: thyroid hormone receptor (2nll)
β-sheet formation is NOT local
Problems of secondary structure predictions (before 1994)
SEQ KELVLALYDYQEKSPREVTMKKGDILTLLNSTNKDWWKVEVNDRQGFVPAAYVKKLD
OBS EEEE E E E EEEEEE EEEEEE EEEEEEHHHEEEE
TYP EHHHHEE EEEE EEHHHEEEEEHH
Simple neural network
Training a neural network 1
Training a neural network 2: Errare = $(\mathrm{out}_{\mathrm{net}} - \mathrm{out}_{\mathrm{want}})^2$
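The squared-error criterion on this slide can be sketched in code. The example below is only an illustration, not the network used in PHDsec: it trains a single sigmoid unit by gradient descent on (out_net - out_want)^2. The toy data (logical OR), the learning rate, the number of epochs, and all function names are assumptions made for this sketch.

```python
import math
import random


def sigmoid(x):
    """Logistic activation of the output unit."""
    return 1.0 / (1.0 + math.exp(-x))


def predict(weights, bias, inputs):
    """Output of the unit (out_net) for one input vector."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def total_error(weights, bias, samples):
    """Sum of (out_net - out_want)^2 over all samples."""
    return sum((predict(weights, bias, x) - t) ** 2 for x, t in samples)


def train_unit(samples, epochs=2000, rate=0.5, seed=0):
    """Train one sigmoid unit by gradient descent on the squared error."""
    rng = random.Random(seed)
    weights = [rng.uniform(-0.1, 0.1) for _ in samples[0][0]]
    bias = 0.0
    for _ in range(epochs):
        for inputs, out_want in samples:
            out_net = predict(weights, bias, inputs)
            # Chain rule: d(error)/d(net input) for a sigmoid output.
            delta = 2.0 * (out_net - out_want) * out_net * (1.0 - out_net)
            weights = [w - rate * delta * x for w, x in zip(weights, inputs)]
            bias -= rate * delta
    return weights, bias


# Hypothetical toy task (logical OR): ((input1, input2), wanted output).
OR_SAMPLES = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
```

Calling `train_unit(OR_SAMPLES)` and then `total_error(...)` on the result shows the error falling well below its starting value of about 1.0 (all four outputs near 0.5).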
Training a neural network 3
Training a neural network 4
Neural networks classify points
Simple neural network with hidden layer
Neural Network for secondary structure
Balanced training (figure panels: normal training, balanced training)
PHDsec: structure-to-structure network
Better prediction of segment lengths
Evolution has it!
Spectrin SH3 domain (Src homology 3)
Prediction accuracy varies!
Why so bad?
Stronger predictions more accurate!
Correct prediction of correctly predicted residues
BAD errors are frequent!
False prediction for engineered proteins!
PHDsec: the un-g(l)ory details • average accuracy > 72% (helix, strand, other) • 72% is the average over a distribution of per-protein accuracies with a standard deviation of ≈ 10% • stronger predictions are more accurate • WARNING: the reliability index is almost a factor of 2 too large for single sequences
Details PHDsec: Multiple alignment • single sequences => accuracy clearly lower
id    nali  Q3sec  Q2acc
AA                         KELVLALYDYQEKSPREVTMKKGDILTLLNSTNKDWWKVEVNDRQGFVPAAYVKKLD
OBS                        EEEE E E EEEEEE EEEEEE EEEEEEHHHEEEE
30 N  26    70     77      EEEEEEE EEE EEEEE EEEE EE EEE
self  1     63     72      EEEEEEE EEEE EEEEE EEEEEE HHHHH
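The Q3sec column is not defined on the slide. Assuming it is the usual per-residue three-state accuracy (the percentage of residues whose helix / strand / other label is predicted correctly), it could be computed roughly as follows. The function name and the toy strings are hypothetical; the strings are not the SH3 example from the slide, whose column alignment is not preserved here.

```python
def q3_accuracy(observed, predicted):
    """Percentage of residues whose three-state label (H, E, other) matches."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted strings must have equal length")
    if not observed:
        raise ValueError("empty input")
    correct = sum(1 for o, p in zip(observed, predicted) if o == p)
    return 100.0 * correct / len(observed)


# Hypothetical toy strings; "-" marks the 'other' state.
OBS = "--EEEE---HHHH--"
PRD = "--EEE----HHHHH-"

if __name__ == "__main__":
    print(f"Q3 = {q3_accuracy(OBS, PRD):.1f}%")
```

Both strings must cover the same residues position by position, which is why the sketch rejects inputs of unequal length instead of silently truncating them.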
Secondary structure prediction • Limit of prediction accuracy reached? • How does it complement other methods? • Ultimate rôle in structure prediction (1D-3D)? • Better to use "pure" secondary structure prediction methods, or to use 3D methods and read the secondary structure off the 3D model? • Conversely, are 3D predictors making optimal use of secondary structure predictions? • Will secondary structure and 3D prediction merge completely?
Secondary structure prediction 2000 • history • 1st generation 50-55% • 2nd generation 55-62% • 3rd generation: 1992 70-72%, 2000 > 76% • what improves? • database growth +3 • PSI-BLAST +0.5 • new training +1 • ‘clever method’ +1 • limit? • max 88% -> 12% to go • 1/5 of proteins, those with families of more than 100 proteins -> > 80% • and from there?
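The arithmetic behind these figures can be checked with a short sketch. It assumes, as the slide seems to imply but does not state, that the listed gains are in percentage points and simply add up; the constant and function names are made up for this illustration.

```python
# Figures read off the slide, in percentage points of three-state accuracy.
BASE_1992 = (70.0, 72.0)   # 3rd generation, 1992
GAINS = {
    "database growth": 3.0,
    "PSI-BLAST": 0.5,
    "new training": 1.0,
    "clever method": 1.0,
}
ACCURACY_2000 = 76.0       # level quoted for 2000
LIMIT = 88.0               # quoted maximum


def projected_range(base, gains):
    """Add the summed gains to the lower and upper end of the base range."""
    total = sum(gains.values())
    return (base[0] + total, base[1] + total)


def remaining(limit, current):
    """Percentage points still missing up to the quoted limit."""
    return limit - current


if __name__ == "__main__":
    print(projected_range(BASE_1992, GAINS))
    print(remaining(LIMIT, ACCURACY_2000))
```

Under the additivity assumption, the summed gains of 5.5 points lift the 1992 range to one that brackets the 76% quoted for 2000, and the gap to the 88% limit is the "12% to go" on the slide.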
Prediction of protein secondary structure • 1980: 55% simple • 1990: 60% less simple • 1993: 70% evolution • 2000: 76% more evolution • what is the limit? • 88% for proteins of similar structure • 80% for 1/5th of proteins with families > 100 • missing through: better definition of secondary structure including long-range interactions • structural switches • chameleon / folding
CAFASP statistics • 29 proteins not similar to known PDB • T0086, T0087, T0090, T0091, T0092, T0094, T0095, T0096, T0097, T0098, T0101, T0102, T0104, T0105, T0106, T0107, T0108, T0109, T0110, T0114, T0115, T0116, T0117, T0118, T0120, T0124, T0125, T0126, T0127 • 2 proteins with PSI-BLAST homologue • T0089, T0103 • 9 proteins with trivial homologue to PDB • T0099, T0100, T0111, T0112, T0113, T0121, T0122, T0123, T0128
CAFASP sec unique
CAFASP sec homologous
CAFASP concept • Targets & Non-targets • comparative modelling 85% > all current methods • Never compare methods on different proteins • Never rank when too few proteins • (Never show numbers for one protein between different proteins)
What is significant
Rank only if significant • e.g. M1 = 75, M2 = 73 • say 16 proteins • rule of thumb: a difference is significant only if it exceeds sigma / sqrt(number of proteins) • with sigma ≈ 10: 10/4 = 2.5 > 2 -> M1 and M2 cannot be distinguished
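The rule of thumb can be written as a small helper. This is a sketch, not part of any evaluation server: the function names are hypothetical, and the default sigma of 10 percentage points is inferred from the "10/4" on the slide.

```python
import math


def significance_threshold(sigma, n_proteins):
    """Smallest difference in mean accuracy that counts as significant."""
    if n_proteins <= 0:
        raise ValueError("need at least one protein")
    return sigma / math.sqrt(n_proteins)


def distinguishable(m1, m2, n_proteins, sigma=10.0):
    """True if two methods' mean accuracies differ by more than the threshold."""
    return abs(m1 - m2) > significance_threshold(sigma, n_proteins)


if __name__ == "__main__":
    # The slide's example: M1 = 75, M2 = 73 on 16 proteins.
    print(significance_threshold(10.0, 16))
    print(distinguishable(75, 73, 16))
```

Under the same assumptions, a 2-point difference would need more than (10/2)^2 = 25 proteins before the two methods could be ranked.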
EVA: automatic continuous EVAluation of structure prediction