An overview of predicting structural features such as phosphorylation sites, transmembrane helices, and protein flexibility, evaluated with accuracy measures such as Q3 and SOV, plus a look at how secondary structure information can be used to predict disease-related misfolding in proteins.
Predicting Structural Features (Chapter 12)
Structural Features • Phosphorylation sites • Transmembrane helices • Protein flexibility
Accuracy Measures Revisited • Measures can be applied at two levels: • Individual residues • Complete structural elements (helix or strand)
Residue-Level Measures • Q3 • Percentage of residues predicted correctly • If one state (e.g., coil) is very common (e.g., 50%), blind guessing can give a large Q3! • Matthews correlation coefficient • C = (TP×TN − FP×FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN)) • Defined for each state • More balanced than Q3; in range ±1 • Random prediction: C = 0
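A minimal sketch of both residue-level measures in Python (the helper names and the toy state strings are illustrative, not from the slides):

```python
import math

def q3(observed: str, predicted: str) -> float:
    """Percentage of residues whose state (H/E/C) is predicted correctly."""
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)

def mcc(observed: str, predicted: str, state: str) -> float:
    """Matthews correlation coefficient for one state, e.g. 'H'."""
    tp = sum(o == state and p == state for o, p in zip(observed, predicted))
    tn = sum(o != state and p != state for o, p in zip(observed, predicted))
    fp = sum(o != state and p == state for o, p in zip(observed, predicted))
    fn = sum(o == state and p != state for o, p in zip(observed, predicted))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

obs = "CCHHHHCCEEECC"
pred = "CCHHHCCCEEECC"
print(q3(obs, pred), mcc(obs, pred, "H"))
```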
Structural Element-Level Measures • SOV • Based on the overlap of predicted “segments” of helix, strand, etc. with the observed segments of the same type • The N-score • Specialized for transmembrane protein predictors • Should TMHMM2 be changed? Should your model?
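A hedged sketch of the per-state SOV score, roughly following the published definition (Zemla et al., 1999); segments are half-open (start, end) intervals, and this is illustrative rather than a reference implementation:

```python
def segments(ss: str, state: str):
    """Maximal runs of `state` as half-open (start, end) intervals."""
    segs, start = [], None
    for i, c in enumerate(ss + " "):       # sentinel closes a trailing run
        if c == state and start is None:
            start = i
        elif c != state and start is not None:
            segs.append((start, i))
            start = None
    return segs

def sov(observed: str, predicted: str, state: str) -> float:
    """Segment overlap score for one state."""
    total, norm = 0.0, 0
    pred = segments(predicted, state)
    for a1, b1 in segments(observed, state):
        len1 = b1 - a1
        overlaps = [(a2, b2) for a2, b2 in pred if a2 < b1 and a1 < b2]
        if not overlaps:                   # unmatched observed segment
            norm += len1
            continue
        for a2, b2 in overlaps:
            len2 = b2 - a2
            minov = min(b1, b2) - max(a1, a2)   # actual overlap
            maxov = max(b1, b2) - min(a1, a2)   # total extent
            delta = min(maxov - minov, minov, len1 // 2, len2 // 2)
            total += (minov + delta) / maxov * len1
            norm += len1
    return 100.0 * total / norm if norm else 0.0

print(sov("CCHHHHHHCC", "CCCHHHHCCC", "H"))
```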
Predicting Helices • Residue propensities: • A score for a given structure class for each residue type, a • P(H | a) is proportional to P(a | H) / P(a) • Why? Bayes’ rule is your friend! • P(H | a) = P(a | H) P(H) / P(a) • P(H) doesn’t depend on a, so • P(H | a) is proportional to P(a | H) / P(a) • Can this be used to see how to group helix states?
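A minimal sketch of estimating P(a | H) / P(a) from counts in labeled data; the function name and toy sequences are made up for illustration:

```python
from collections import Counter

def helix_propensities(sequences, states):
    """Estimate P(a|H)/P(a) per amino acid; 'H' marks helix residues."""
    all_counts, helix_counts = Counter(), Counter()
    for seq, ss in zip(sequences, states):
        for a, s in zip(seq, ss):
            all_counts[a] += 1
            if s == "H":
                helix_counts[a] += 1
    n_all = sum(all_counts.values())
    n_helix = sum(helix_counts.values())
    # P(a|H)/P(a); by Bayes' rule this is proportional to P(H|a)
    return {a: (helix_counts[a] / n_helix) / (all_counts[a] / n_all)
            for a in all_counts if helix_counts[a]}

props = helix_propensities(["ALAGELK"], ["CHHHHCC"])
print(props)  # >1 means the residue favors helix, <1 disfavors it
```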
Identical short segments rarely fold differently • Local sequence is highly important to secondary structure. • But this sequence occurs in two proteins and adopts very different structures: • KGVVPQLVK • Even so, there is significant information about structure in local sequence.
I-sites Sequence Database • About 250 short segments (3-19 residues) that show strong correlation between sequence and structure • Example shows: • phi and psi angles, log-odds matrix • superimposed backbones • representative structure
Nearest Neighbor Prediction Methods • Predict secondary structure based on: • Local alignments of the query sequence to a database of sequences of known structure • Alignment score functions are often special-purpose, and may include helix/sheet/coil “propensity” information • Homologous sequences are often included in the database • Prediction based on weighted votes of nearest neighbors (usually only central residue of alignment is predicted) • 73.5% Accuracy (Q3)
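A toy sketch of the voting step, assuming a database of fixed-width windows labeled with the central residue's state and a plain identity score (real methods use special-purpose scores with propensity terms; window size and data here are invented):

```python
def window_score(a: str, b: str) -> int:
    """Toy similarity: number of identical positions."""
    return sum(x == y for x, y in zip(a, b))

def predict_center(query: str, database, k: int = 5) -> str:
    """Weighted vote of the k nearest windows of known structure.

    database: list of (window, center_state) pairs, same width as query.
    """
    nearest = sorted(database, key=lambda ws: window_score(query, ws[0]),
                     reverse=True)[:k]
    votes = {}
    for window, state in nearest:
        votes[state] = votes.get(state, 0) + window_score(query, window)
    return max(votes, key=votes.get)

db = [("ALAAELKAQ", "H"), ("GSTPDGSTA", "C"), ("VTIVVEFTA", "E")]
print(predict_center("ALAKELRAQ", db, k=2))  # -> 'H'
```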
A different application: prediction of misfolding • Diseases such as Alzheimer’s involve protein misfolding. • Usually, the misfolded region ends up as beta strands. • How could we use secondary structure information to predict which proteins will potentially misfold?
Hidden Beta Propensity • Key idea: tertiary contacts (TC) • TC is the number of contacts a residue has with residues at least 4 positions away in sequence • Alpha helices tend to be in regions of HIGH TC • Beta strands tend to be in regions of LOW TC • Look for query residues whose nearest neighbors are “strange” with respect to TC and alpha/beta state: • Low-TC regions with lots of alphas • High-TC regions with lots of betas • Performance results?
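A hedged sketch of counting tertiary contacts from C-alpha coordinates; the 8 Å distance cutoff and the array layout are assumptions for illustration, not from the slides:

```python
import numpy as np

def tertiary_contacts(ca_coords: np.ndarray, cutoff: float = 8.0) -> np.ndarray:
    """Count, per residue, contacts with residues >= 4 apart in sequence.

    ca_coords: (N, 3) array of C-alpha coordinates.
    """
    n = len(ca_coords)
    # All pairwise distances, then mask by distance and sequence separation.
    dists = np.linalg.norm(ca_coords[:, None, :] - ca_coords[None, :, :], axis=-1)
    close = dists < cutoff
    far_in_sequence = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :]) >= 4
    return (close & far_in_sequence).sum(axis=1)
```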
Neural Nets • Each node computes a simple function of its inputs. • The weighted sum of the inputs is added to a bias term and “squashed”: • O = σ(Σi wi·Ii + b), where σ is a squashing function such as σ(x) = 1 / (1 + e^(−x)) • The output, O, is then propagated to nodes in the next layer.
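A minimal sketch of one node's computation, assuming the common logistic squashing function (the choice of σ is an assumption; the slides only say the sum is "squashed"):

```python
import math

def node_output(inputs, weights, bias):
    """Weighted sum plus bias, squashed by a logistic sigmoid."""
    s = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-s))   # output in (0, 1)

print(node_output([0.2, 0.9], [1.5, -0.5], 0.1))
```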
Training Neural Nets • Back-propagation • Optimizes the weights and bias terms • Minimizes an error function (the difference between predicted and observed outputs): • RMS • Relative entropy • Iterative process • Final weights shown for a secondary structure NN alpha-helix output layer (figure) • Over-fitting can be reduced by training for fewer iterations
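A toy sketch of one back-propagation step for a single sigmoid node under squared error; the learning rate and data are illustrative:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def backprop_step(inputs, weights, bias, target, lr=0.5):
    """One gradient-descent update minimizing (output - target)^2 / 2."""
    out = sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
    # chain rule: dE/dw_i = (out - target) * out * (1 - out) * x_i
    delta = (out - target) * out * (1.0 - out)
    new_weights = [w - lr * delta * x for w, x in zip(weights, inputs)]
    new_bias = bias - lr * delta
    return new_weights, new_bias

w, b = backprop_step([0.2, 0.9], [1.5, -0.5], 0.1, target=1.0)
print(w, b)
```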
Adaptive Encoding and Weight Sharing • Orthogonal encoding • Each residue feeds three hidden nodes • The weights for all corresponding nodes (red in the figure) are tied together • Each group of three nodes therefore learns the same “encoding” of the 20 amino acids
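A sketch of orthogonal (one-hot) encoding feeding a shared 20-to-3 encoding matrix; in training the matrix would be learned, so the random values here are purely illustrative:

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(residue: str) -> np.ndarray:
    """Orthogonal encoding: a 20-vector with a single 1."""
    v = np.zeros(len(AMINO_ACIDS))
    v[AMINO_ACIDS.index(residue)] = 1.0
    return v

# One shared 20x3 weight matrix, reused at every window position,
# so every residue maps to the same learned 3-number encoding.
rng = np.random.default_rng(0)
shared_encoder = rng.normal(size=(len(AMINO_ACIDS), 3))

window = "ALAKELRAQ"
encoded = np.stack([one_hot(a) @ shared_encoder for a in window])
print(encoded.shape)  # (9, 3): three numbers per residue
```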
Engineering Intuition Into NNs • Alpha helices have a period of 3.6 residues per turn • An NN can be specially designed to reflect that • Using this, plus adaptive encoding: Q3 = 66% • Adding homology: Q3 = 73%
HMMTOP Architecture • TMHs (transmembrane helices): 17-25 residues • Tails: 1-15 residues • Blue letters show structural state labels
TMHMM Architecture • Helices are 5-25 residues • Caps follow helices • Cytoplasmic: • Loop: 0-20 residues • Globular: 1 state • Extra-cellular: • Long loop: 0-100 residues • Globular: 3 states
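To make the HMM idea concrete, here is a hedged sketch of Viterbi decoding over a toy two-state (membrane helix vs. loop) model; the states, transition, and emission probabilities are invented for illustration and are far simpler than TMHMM's actual architecture:

```python
import math

# Toy model: 'M' (membrane helix) favors hydrophobic residues,
# 'L' (loop) favors polar ones. Probabilities are made up.
HYDROPHOBIC = set("AVILMFWC")
states = ("M", "L")
start = {"M": 0.1, "L": 0.9}
trans = {"M": {"M": 0.95, "L": 0.05}, "L": {"M": 0.05, "L": 0.95}}

def emit(state: str, residue: str) -> float:
    """Emission probability, spread evenly within each residue group."""
    phob = residue in HYDROPHOBIC
    if state == "M":
        return 0.8 / 8 if phob else 0.2 / 12
    return 0.3 / 8 if phob else 0.7 / 12

def viterbi(seq: str) -> str:
    """Most probable state path (log space to avoid underflow)."""
    v = {s: math.log(start[s]) + math.log(emit(s, seq[0])) for s in states}
    back = []
    for residue in seq[1:]:
        ptr, nv = {}, {}
        for s in states:
            prev = max(states, key=lambda p: v[p] + math.log(trans[p][s]))
            ptr[s] = prev
            nv[s] = v[prev] + math.log(trans[prev][s]) + math.log(emit(s, residue))
        back.append(ptr)
        v = nv
    path = [max(states, key=v.get)]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return "".join(reversed(path))

print(viterbi("MKKLLPTAAAVLLLVILFAGQA"))
```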
Predicting Globular Proteins with “Hidden Neural Networks” • YASPIN • A neural net predicts seven classes (He, H, Hb, C, Ee, E, Eb) using a 15-residue window of PSSM input • An HMM “filters” this output • Can you imagine how this is done? (One answer: treat the NN’s per-residue class scores as emissions and decode with Viterbi, as in the sketch above, to enforce a consistent segment grammar.)
Coiled-coil HMM: MARCOIL • The design lets you start and end in any phase of the heptad repeat
Support Vector Machines: SVMs • Classifiers • Basic “machine” is a 2-class classifier • Training data • A set of labeled vectors: {<x1, x2, …, xn, C>} • Class: C = 1 or C = −1 • Supervised learning (like neural nets) • Learn from positive and negative examples • Output • A function predicting the class of unlabeled vectors
SVM Example • Alpha helix predictor • 15-residue window • 21 numbers per residue • PSI-BLAST PSSM: 20 numbers • “spacer” flag indicating “off end” of protein • 315 numbers total per window • Training samples • Non-helix samples: {<x1, x2, …, x315, −1>} • Helix samples: {<x1, x2, …, x315, 1>} • Training finds the function of x that best separates the non-helix from the helix samples
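A hedged sketch of this setup with scikit-learn; the random training data stands in for real PSSM windows, and only the feature dimensions follow the slide:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_features = 15 * 21   # 15-residue window x (20 PSSM values + 1 spacer flag)

# Stand-in training data: real samples would be PSSM windows labeled
# +1 (helix center) or -1 (non-helix center).
X = rng.normal(size=(200, n_features))
y = np.where(X[:, 0] > 0, 1, -1)   # toy labels for demonstration only

clf = SVC(kernel="rbf")            # the kernel choice tailors the classifier
clf.fit(X, y)
print(clf.predict(X[:5]))          # predicted classes for unlabeled windows
print(len(clf.support_))           # number of support vectors
```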
SVM vs. NN as Classifiers • Similarities • Both compute a function of their inputs • Both are trained to minimize error • Differences • NNs find any hyperplane that separates the two classes • SVMs find the maximum-margin hyperplane • NNs can be engineered by designing their topology • SVMs can be tailored by designing the kernel function
SVM Details • Separating hyperplanes: choose w, b to minimize ||w||, subject to yi(w·xi + b) ≥ 1 for every training sample i • Dual form (support vectors): maximize Σi αi − ½ Σi Σj αi αj yi yj (xi·xj), subject to αi ≥ 0 and Σi αi yi = 0, where w = Σi αi yi xi • Kernel trick: replace the dot products xi·xj with a non-linear kernel function K(xi, xj)
Dubious Statement • “In marked contrast to NN, SVMs have few explicit parameters to fit…” • The vector of dual weights, α, is as long as the number of training samples • But the maximum-margin hyperplane will have most of these weights equal to zero; only the “support vectors” have non-zero weights.