Predicting Structural Features


Presentation Transcript


  1. Predicting Structural Features Chapter 12

  2. Structural Features • Phosphorylation sites • Transmembrane helices • Protein flexibility

  3. Accuracy Measures Revisited • Level: • Individual residues • Complete helix or strand

  4. Residue-Level Measures • Q3 • Percentage of residues predicted correctly • If one state (e.g., coil) is very common (e.g., 50%), blind guessing can give a large Q3! • Matthews correlation coefficient • C = (TP×TN - FP×FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN)) • Defined for each state • More balanced than Q3; in range ±1 • Random prediction: C = 0
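
Both residue-level measures are easy to compute directly. Below is a minimal Python sketch (not from the chapter) that scores a prediction with Q3 and with the per-state Matthews correlation coefficient; the example strings are made up.

```python
# Minimal sketch: Q3 and per-state Matthews correlation coefficient computed
# from observed and predicted secondary-structure strings over the three
# states H (helix), E (strand), C (coil).
from math import sqrt

def q3(observed, predicted):
    """Fraction of residues whose state is predicted correctly."""
    correct = sum(o == p for o, p in zip(observed, predicted))
    return correct / len(observed)

def mcc(observed, predicted, state):
    """Matthews correlation coefficient for one state (e.g. 'H')."""
    tp = fp = tn = fn = 0
    for o, p in zip(observed, predicted):
        if p == state and o == state:
            tp += 1
        elif p == state and o != state:
            fp += 1
        elif p != state and o == state:
            fn += 1
        else:
            tn += 1
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

obs  = "CCHHHHHCCEEECC"   # toy observed states
pred = "CCHHHHCCCEEHCC"   # toy predicted states
print(q3(obs, pred))        # ~0.86 on this toy example
print(mcc(obs, pred, "H"))  # balanced measure, in [-1, +1]
```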

  5. Structural Element-Level Measures • SOV • based on the overlap of predicted “segments” of helix, strand etc. with the observed segments of the same type • The N-score • specialized for transmembrane protein predictors • Should TMHMM2 be changed? Should your model?

  6. Predicting Helices • Residue propensities: • a score for each residue type a in a given structure class • P(H | a) is proportional to P(a | H) / P(a) • Why? Bayes’ Rule is your friend! • P(H | a) = P(a | H) P(H) / P(a) • P(H) doesn’t depend on a, so • P(H | a) is proportional to P(a | H) / P(a) • Can this be used to see how to group helix states?
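
As a concrete illustration of the propensity idea, the sketch below estimates P(a | H) / P(a) from residue counts; the "training" strings are invented toy data, not real statistics.

```python
# Illustrative sketch of the propensity idea: the toy sequences below are
# made up, not real helix statistics.
from collections import Counter

# Residues observed overall and within helices (hypothetical training data).
all_residues   = "AALKEGVAALLKEAVPGS" * 10
helix_residues = "AALKEAALLKEA" * 10

count_all   = Counter(all_residues)
count_helix = Counter(helix_residues)

def helix_propensity(a):
    """P(a | H) / P(a), which by Bayes' rule is proportional to P(H | a)."""
    p_a   = count_all[a] / len(all_residues)
    p_a_h = count_helix[a] / len(helix_residues)
    return p_a_h / p_a

for a in "ALGP":
    print(a, round(helix_propensity(a), 2))   # >1 favours helix, <1 disfavours
```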

  7. Identical short segments rarely fold differently • Local sequence is highly important to secondary structure. • But this sequence occurs in two proteins and adopts very different conformations: • KGVVPQLVK • Even so, there is significant information about structure in local sequence.

  8. I-sites Sequence Database • About 250 short segments (3-19 residues) that show strong correlation between sequence and structure • Example shows: • phi and psi angles, log-odds matrix • superimposed backbones • representative structure

  9. Nearest Neighbor Prediction Methods • Predict secondary structure based on: • Local alignments of the query sequence to a database of sequences of known structure • Alignment score functions are often special-purpose, and may include helix/sheet/coil “propensity” information • Homologous sequences are often included in the database • Prediction based on weighted votes of nearest neighbors (usually only central residue of alignment is predicted) • 73.5% Accuracy (Q3)
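
A much-simplified sketch of the nearest-neighbor idea follows. Real methods score local alignments, often with propensity-aware score functions; here plain positional identity over fixed-length windows stands in for the alignment score, and the database is a toy.

```python
# Simplified nearest-neighbor sketch: real methods score local alignments;
# here we just count identical positions between fixed-length windows.

# Tiny "database" of (window, secondary-structure state of the central residue).
database = [
    ("AALKEAA", "H"),
    ("LLKEALL", "H"),
    ("GVTVSGS", "E"),
    ("VTVSGTV", "E"),
    ("PGSGNPG", "C"),
    ("SGNPGSG", "C"),
]

def window_score(a, b):
    """Similarity = number of identical positions between two windows."""
    return sum(x == y for x, y in zip(a, b))

def predict_center(query_window, k=3):
    """Weighted vote of the k most similar database windows."""
    neighbors = sorted(database,
                       key=lambda entry: window_score(query_window, entry[0]),
                       reverse=True)[:k]
    votes = {}
    for win, state in neighbors:
        votes[state] = votes.get(state, 0) + window_score(query_window, win)
    return max(votes, key=votes.get)

print(predict_center("ALLKEAL"))   # 'H' with this toy database
```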

  10. A different application: prediction of misfolding • Diseases such as Alzheimer’s involve protein misfolding. • Usually, the misfolded region ends up as Beta-strands. • How could we use secondary structure information to predict which proteins will potentially misfold?

  11. HPHidden Beta Propensity • Key idea: Tertiary contacts (TC) • TC is number of contacts a residue has with others at least 4 residues away • Alpha helices tend to be in regions of HIGH TC • Beta strands tend to be in regions of LOW TC • Look for query residues whose nearest neighbors are “strange” with respect to TC and alpha/beta state: • Low TC regions with lots of Alphas • High TC regions with lots of Betas • Performance results?

  12. Neural Nets • Each node computes a simple function of its inputs. • The weighted sum of the inputs is added to a bias term and “squashed”: • I = Σi wi xi, O = σ(I + b) • The output, O, is then propagated to nodes in the next layer.
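
A single node's computation, as described above, can be written in a few lines; the logistic sigmoid is assumed as the "squashing" function.

```python
# Sketch of one neural-net node: a weighted sum of the inputs plus a bias,
# passed through a "squashing" function (a logistic sigmoid is assumed here).
import numpy as np

def node_output(inputs, weights, bias):
    """O = sigma(sum_i w_i * x_i + b), with sigma the logistic sigmoid."""
    i = np.dot(weights, inputs)              # weighted sum of the inputs
    return 1.0 / (1.0 + np.exp(-(i + bias)))

x = np.array([0.2, 0.9, 0.4])
w = np.array([1.5, -0.7, 0.3])
print(node_output(x, w, bias=0.1))           # output in (0, 1), fed to next layer
```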

  13. Training Neural Nets • Back-propagation • Optimizes the weights and bias terms • Minimize the error function (difference between predicted and observed) • RMS • Relative Entropy • Iterative process • Final weights shown for a secondary structure NN alpha helix output layer. • Over-fitting can be reduced by training for fewer iterations
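
For concreteness, here is a minimal back-propagation loop for a one-hidden-layer network with a squared-error function; the toy data, network size, and learning rate are all made up for illustration.

```python
# Minimal back-propagation sketch: made-up toy data, one hidden layer,
# squared-error loss, iterative gradient-descent updates of weights and biases.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((50, 4))                                  # 50 samples, 4 inputs
y = (X.sum(axis=1) > 2.0).astype(float)[:, None]         # toy target

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1, b1 = rng.normal(size=(4, 6)), np.zeros(6)            # input -> hidden
W2, b2 = rng.normal(size=(6, 1)), np.zeros(1)            # hidden -> output
lr = 0.5

for epoch in range(500):             # fewer iterations -> less over-fitting
    # Forward pass.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    err = out - y                    # gradient of 0.5 * (out - y)^2 w.r.t. out

    # Backward pass (chain rule through the sigmoids).
    d_out = err * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)

    # Gradient-descent updates.
    W2 -= lr * h.T @ d_out / len(X); b2 -= lr * d_out.mean(axis=0)
    W1 -= lr * X.T @ d_h / len(X);   b1 -= lr * d_h.mean(axis=0)

print("final RMS error:", np.sqrt(np.mean((out - y) ** 2)))
```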

  14. Adaptive Encoding and Weight Sharing • Orthogonal encoding • Each residue feeds three hidden nodes • The weights for all red nodes are tied together • Each group of three nodes learns the same “encoding” of the 20 amino acids
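
The sketch below shows orthogonal (one-hot) input encoding with a single shared 20 → 3 weight matrix applied at every window position, which is the weight-tying idea described above; the window length and weights are illustrative.

```python
# Sketch of orthogonal (one-hot) encoding with weight sharing: the same 20 -> 3
# weight matrix is applied at every position of the window, so each group of
# three hidden nodes uses one shared "encoding" of the 20 amino acids.
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(residue):
    v = np.zeros(len(AMINO_ACIDS))
    v[AMINO_ACIDS.index(residue)] = 1.0
    return v

rng = np.random.default_rng(1)
shared_encoder = rng.normal(size=(20, 3))    # tied weights, one copy only

window = "AALKEAALLKEAR"                     # illustrative 13-residue window
encoded = np.concatenate([one_hot(r) @ shared_encoder for r in window])
print(encoded.shape)                         # (13 * 3,) = (39,) hidden inputs
```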

  15. Engineering Intuition Into NNs • Alpha helices have a period of 3.6 residues per turn • A NN can be specially designed to reflect that • Using this, plus adaptive encoding: • Q3 = 66% • Adding homology: Q3 = 73%

  16. HMMs and Transmembrane Proteins (again)

  17. HMMTOP Architecture • TMHs 17-25 residues • Tails 1-15 residues • Blue letters show structural state labels

  18. TMHMM Architecture • Helices are 5-25 residues • Caps follow helices • Cytoplasmic: • Loop: 0-20 residues • Globular: 1 state • Extra-cellular: • Long loop: 0-100 residues • Globular: 3 states
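
One way to make the length constraints above concrete is to see how an HMM module can enforce them by chaining states. The sketch below is not TMHMM's actual code; the state names and helper function are illustrative.

```python
# Sketch (not TMHMM's actual implementation) of how an HMM enforces a length
# range such as "helices are 5-25 residues": chain `min_len` mandatory states,
# then optional states that may each exit to the next module.
def length_constrained_states(prefix, min_len, max_len, exit_state):
    """Return {state: [allowed next states]} for one duration-constrained run."""
    transitions = {}
    for i in range(1, max_len + 1):
        state = f"{prefix}{i}"
        nxt = f"{prefix}{i + 1}" if i < max_len else exit_state
        if i < min_len:
            transitions[state] = [nxt]                 # must keep extending
        else:
            transitions[state] = [nxt, exit_state]     # may stop here
    transitions[f"{prefix}{max_len}"] = [exit_state]   # last state must exit
    return transitions

helix_module = length_constrained_states("helix_", 5, 25, "cap")
print(len(helix_module), "helix states;", helix_module["helix_5"])
```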

  19. Predicting Globular Proteins with “Hidden Neural Networks” • YASPIN • Neural net predicts seven classes (He, H, Hb, C, Ee, E, Eb) using a 15-residue window of PSSM input • HMM “filters” this output • Can you imagine how this is done? (One possibility is sketched below.)
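
One common answer to the question above (a sketch, not YASPIN's actual model): treat the per-residue NN class probabilities as emission scores and run Viterbi with a transition matrix that discourages implausible state changes. The states, probabilities, and transitions below are invented, and a 3-state version is used for brevity.

```python
# Sketch of "filtering" per-residue NN outputs with an HMM: NN posteriors act
# as emission scores, and Viterbi finds the most probable smoothed state path.
import numpy as np

states = ["H", "E", "C"]                       # simplified 3-state version
# Hypothetical NN posteriors for 6 residues (rows = residues, cols = states).
nn_probs = np.array([[0.6, 0.1, 0.3],
                     [0.7, 0.1, 0.2],
                     [0.2, 0.3, 0.5],
                     [0.6, 0.2, 0.2],
                     [0.1, 0.7, 0.2],
                     [0.1, 0.8, 0.1]])
# Made-up transition probabilities favouring staying in the same state.
trans = np.array([[0.80, 0.05, 0.15],
                  [0.05, 0.80, 0.15],
                  [0.15, 0.15, 0.70]])

def viterbi(emissions, transitions):
    """Most probable state path given per-residue emission scores."""
    n, k = emissions.shape
    log_e, log_t = np.log(emissions), np.log(transitions)
    score = np.full((n, k), -np.inf)
    back = np.zeros((n, k), dtype=int)
    score[0] = log_e[0]
    for t in range(1, n):
        for j in range(k):
            cand = score[t - 1] + log_t[:, j] + log_e[t, j]
            back[t, j] = np.argmax(cand)
            score[t, j] = cand[back[t, j]]
    path = [int(np.argmax(score[-1]))]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return [states[s] for s in reversed(path)]

print(viterbi(nn_probs, trans))   # ['H', 'H', 'H', 'H', 'E', 'E'] (residue 3 smoothed)
```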

  20. Coiled-coil HMM: MARCOIL • Design lets you start and end in any phase of the heptad repeat

  21. Support Vector Machines: SVMs • Classifiers • Basic “machine” is a 2-class classifier • Training Data • set of labeled vectors • {<x1, x2, …,xn, C>}, • Class: C=1 or C=-1 • Supervised learning (like neural nets) • Learn from positive and negative examples • Output • Function predicting class of unlabeled vectors
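
A minimal sketch of the training setup above, using scikit-learn's SVC (assumed available); the labeled vectors are toy data.

```python
# Sketch of a 2-class SVM trained from labeled vectors {<x1, ..., xn, C>},
# C = +1 or -1, using scikit-learn. Toy data only.
import numpy as np
from sklearn.svm import SVC

# Training data: feature vectors with class labels +1 / -1.
X = np.array([[0.2, 0.1], [0.3, 0.2], [0.1, 0.4],     # class -1
              [0.9, 0.8], [0.8, 0.9], [0.7, 0.7]])    # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear")    # linear kernel; others possible (kernel trick)
clf.fit(X, y)

# Output: a function that predicts the class of unlabeled vectors.
print(clf.predict([[0.25, 0.2], [0.85, 0.75]]))   # -> [-1  1]
```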

  22. SVM Example • Alpha helix predictor • 15-residue window • 21 numbers per residue • PSI-BLAST PSSM: 20 numbers • “spacer” flag indicating “off end” of protein • 315 numbers total per window • Training samples • Non-helix samples: {<x1, x2, …, x315, -1>} • Helix samples: {<x1, x2, …, x315, 1>} • Training finds a function of x that best separates the non-helix from the helix samples (feature construction sketched below)
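
The feature construction above can be sketched as follows; the PSSM here is random filler rather than a real PSI-BLAST profile, and the window_features helper is hypothetical, only for illustration.

```python
# Sketch of the input encoding described on the slide: a 15-residue window,
# 21 numbers per residue (20 PSSM values + a "spacer" flag for positions that
# fall off either end of the protein). The PSSM is random filler, not a real
# PSI-BLAST profile; window_features is a hypothetical helper.
import numpy as np

WINDOW = 15
HALF = WINDOW // 2

rng = np.random.default_rng(0)
protein_length = 40
pssm = rng.normal(size=(protein_length, 20))     # stand-in for a PSI-BLAST PSSM

def window_features(center):
    """Return the 15 * 21 = 315 numbers for the window centred on `center`."""
    rows = []
    for pos in range(center - HALF, center + HALF + 1):
        if 0 <= pos < protein_length:
            rows.append(np.append(pssm[pos], 0.0))        # spacer flag off
        else:
            rows.append(np.append(np.zeros(20), 1.0))     # off the end: flag on
    return np.concatenate(rows)

print(window_features(2).shape)    # (315,)
```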

  23. SVM vs NN as Classifiers • Similarities • Compute a function on their inputs • Trained to minimize error • Differences • NNs find any hyperplane that separates the two classes • SVMs find the maximum-margin hyperplane • NNs can be engineered by designing their topology • SVMs can be tailored by designing the kernel function

  24. SVM Details • Separating hyperplanes: choose w, b to minimize ||w|| subject to yi(w·xi + b) ≥ 1 for every training sample i • Dual form (support vectors): maximize Σi αi - ½ Σi Σj αi αj yi yj (xi·xj) s.t. αi ≥ 0 and Σi αi yi = 0, where w = Σi αi yi xi • Kernel trick: replace the dot products xi·xj by a non-linear kernel function K(xi, xj)

  25. Dubious Statement • “In marked contrast to NN, SVMs have few explicit parameters to fit…” • In the dual form there is one weight, αi, per training sample, so the weight vector is as long as the training set • But for the maximum-margin hyperplane most of these weights are zero; only the “support vectors” have non-zero weights.
