An overview of predicting structural features such as phosphorylation sites, transmembrane helices, and protein flexibility, evaluated with accuracy measures such as Q3 and SOV, plus a look at how secondary structure information can be used to predict disease-related misfolding in proteins.
Predicting Structural Features (Chapter 12)
Structural Features • Phosphorylation sites • Transmembrane helices • Protein flexibility
Accuracy Measures Revisited • Measures can be applied at two levels: • Individual residues • Complete structural elements (helix or strand)
Residue-Level Measures • Q3 • Percentage of residues predicted correctly • If one state (e.g., coil) is very common (e.g., 50%), blind guessing can give a large Q3! • Matthews correlation coefficient • C = (TP×TN − FP×FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN)) • Defined for each state • More balanced than Q3; in range ±1 • Random prediction: C = 0
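A minimal sketch of both residue-level measures in Python (the helper names and the toy state strings are illustrative, not from the slides):

```python
import math

def q3(observed: str, predicted: str) -> float:
    """Percentage of residues whose state (H/E/C) is predicted correctly."""
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)

def mcc(observed: str, predicted: str, state: str) -> float:
    """Matthews correlation coefficient for one state, e.g. 'H'."""
    tp = sum(o == state and p == state for o, p in zip(observed, predicted))
    tn = sum(o != state and p != state for o, p in zip(observed, predicted))
    fp = sum(o != state and p == state for o, p in zip(observed, predicted))
    fn = sum(o == state and p != state for o, p in zip(observed, predicted))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

obs = "CCHHHHCCEEECC"
pred = "CCHHHCCCEEECC"
print(q3(obs, pred), mcc(obs, pred, "H"))
```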
Structural Element-Level Measures • SOV • Based on the overlap of predicted “segments” of helix, strand, etc. with the observed segments of the same type • The N-score • Specialized for transmembrane protein predictors • Should TMHMM2 be changed? Should your model?
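A hedged sketch of the per-state SOV score, roughly following the published definition (Zemla et al., 1999); segments are half-open (start, end) intervals, and this is illustrative rather than a reference implementation:

```python
def segments(ss: str, state: str):
    """Maximal runs of `state` as half-open (start, end) intervals."""
    segs, start = [], None
    for i, c in enumerate(ss + " "):       # sentinel closes a trailing run
        if c == state and start is None:
            start = i
        elif c != state and start is not None:
            segs.append((start, i))
            start = None
    return segs

def sov(observed: str, predicted: str, state: str) -> float:
    """Segment overlap score for one state."""
    total, norm = 0.0, 0
    pred = segments(predicted, state)
    for a1, b1 in segments(observed, state):
        len1 = b1 - a1
        overlaps = [(a2, b2) for a2, b2 in pred if a2 < b1 and a1 < b2]
        if not overlaps:                   # unmatched observed segment
            norm += len1
            continue
        for a2, b2 in overlaps:
            len2 = b2 - a2
            minov = min(b1, b2) - max(a1, a2)   # actual overlap
            maxov = max(b1, b2) - min(a1, a2)   # total extent
            delta = min(maxov - minov, minov, len1 // 2, len2 // 2)
            total += (minov + delta) / maxov * len1
            norm += len1
    return 100.0 * total / norm if norm else 0.0

print(sov("CCHHHHHHCC", "CCCHHHHCCC", "H"))
```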
Predicting Helices • Residue propensities: • A score for a given structure class for each residue type, a • P(H | a) is proportional to P(a | H) / P(a) • Why? Bayes’ rule is your friend! • P(H | a) = P(a | H) P(H) / P(a) • P(H) doesn’t depend on a, so • P(H | a) is proportional to P(a | H) / P(a) • Can this be used to see how to group helix states?
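A minimal sketch of estimating P(a | H) / P(a) from counts in labeled data; the function name and toy sequences are made up for illustration:

```python
from collections import Counter

def helix_propensities(sequences, states):
    """Estimate P(a|H)/P(a) per amino acid; 'H' marks helix residues."""
    all_counts, helix_counts = Counter(), Counter()
    for seq, ss in zip(sequences, states):
        for a, s in zip(seq, ss):
            all_counts[a] += 1
            if s == "H":
                helix_counts[a] += 1
    n_all = sum(all_counts.values())
    n_helix = sum(helix_counts.values())
    # P(a|H)/P(a); by Bayes' rule this is proportional to P(H|a)
    return {a: (helix_counts[a] / n_helix) / (all_counts[a] / n_all)
            for a in all_counts if helix_counts[a]}

props = helix_propensities(["ALAGELK"], ["CHHHHCC"])
print(props)  # >1 means the residue favors helix, <1 disfavors it
```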
Identical short segments rarely fold differently • Local sequence is highly important to secondary structure. • But this sequence occurs in two proteins and adopts very different structures: • KGVVPQLVK • Even so, there is significant information about structure in local sequence.
I-sites Sequence Database • About 250 short segments (3-19 residues) that show strong correlation between sequence and structure • Example shows: • phi and psi angles, log-odds matrix • superimposed backbones • representative structure
Nearest Neighbor Prediction Methods • Predict secondary structure based on: • Local alignments of the query sequence to a database of sequences of known structure • Alignment score functions are often special-purpose, and may include helix/sheet/coil “propensity” information • Homologous sequences are often included in the database • Prediction based on weighted votes of nearest neighbors (usually only central residue of alignment is predicted) • 73.5% Accuracy (Q3)
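A toy sketch of the voting step, assuming a database of fixed-width windows labeled with the central residue's state and a plain identity score (real methods use special-purpose scores with propensity terms; window size and data here are invented):

```python
def window_score(a: str, b: str) -> int:
    """Toy similarity: number of identical positions."""
    return sum(x == y for x, y in zip(a, b))

def predict_center(query: str, database, k: int = 5) -> str:
    """Weighted vote of the k nearest windows of known structure.

    database: list of (window, center_state) pairs, same width as query.
    """
    nearest = sorted(database, key=lambda ws: window_score(query, ws[0]),
                     reverse=True)[:k]
    votes = {}
    for window, state in nearest:
        votes[state] = votes.get(state, 0) + window_score(query, window)
    return max(votes, key=votes.get)

db = [("ALAAELKAQ", "H"), ("GSTPDGSTA", "C"), ("VTIVVEFTA", "E")]
print(predict_center("ALAKELRAQ", db, k=2))  # -> 'H'
```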
A different application: prediction of misfolding • Diseases such as Alzheimer’s involve protein misfolding. • Usually, the misfolded region ends up as beta strands. • How could we use secondary structure information to predict which proteins will potentially misfold?
Hidden Beta Propensity • Key idea: tertiary contacts (TC) • TC is the number of contacts a residue has with residues at least 4 positions away in sequence • Alpha helices tend to be in regions of HIGH TC • Beta strands tend to be in regions of LOW TC • Look for query residues whose nearest neighbors are “strange” with respect to TC and alpha/beta state: • Low-TC regions with lots of alphas • High-TC regions with lots of betas • Performance results?
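A hedged sketch of counting tertiary contacts from C-alpha coordinates; the 8 Å distance cutoff and the array layout are assumptions for illustration, not from the slides:

```python
import numpy as np

def tertiary_contacts(ca_coords: np.ndarray, cutoff: float = 8.0) -> np.ndarray:
    """Count, per residue, contacts with residues >= 4 apart in sequence.

    ca_coords: (N, 3) array of C-alpha coordinates.
    """
    n = len(ca_coords)
    # All pairwise distances, then mask by distance and sequence separation.
    dists = np.linalg.norm(ca_coords[:, None, :] - ca_coords[None, :, :], axis=-1)
    close = dists < cutoff
    far_in_sequence = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :]) >= 4
    return (close & far_in_sequence).sum(axis=1)
```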
Neural Nets • Each node computes a simple function of its inputs. • The weighted sum of the inputs is added to a bias term and “squashed”: • O = σ(Σi wi·Ii + b), where σ is a squashing function such as σ(x) = 1 / (1 + e^(−x)) • The output, O, is then propagated to nodes in the next layer.
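A minimal sketch of one node's computation, assuming the common logistic squashing function (the choice of σ is an assumption; the slides only say the sum is "squashed"):

```python
import math

def node_output(inputs, weights, bias):
    """Weighted sum plus bias, squashed by a logistic sigmoid."""
    s = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-s))   # output in (0, 1)

print(node_output([0.2, 0.9], [1.5, -0.5], 0.1))
```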
Training Neural Nets • Back-propagation • Optimizes the weights and bias terms • Minimizes an error function (the difference between predicted and observed outputs): • RMS • Relative entropy • Iterative process • Final weights shown for a secondary structure NN alpha-helix output layer (figure) • Over-fitting can be reduced by training for fewer iterations
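A toy sketch of one back-propagation step for a single sigmoid node under squared error; the learning rate and data are illustrative:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def backprop_step(inputs, weights, bias, target, lr=0.5):
    """One gradient-descent update minimizing (output - target)^2 / 2."""
    out = sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
    # chain rule: dE/dw_i = (out - target) * out * (1 - out) * x_i
    delta = (out - target) * out * (1.0 - out)
    new_weights = [w - lr * delta * x for w, x in zip(weights, inputs)]
    new_bias = bias - lr * delta
    return new_weights, new_bias

w, b = backprop_step([0.2, 0.9], [1.5, -0.5], 0.1, target=1.0)
print(w, b)
```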
Adaptive Encoding and Weight Sharing • Orthogonal encoding • Each residue feeds three hidden nodes • The weights for all corresponding nodes (red in the figure) are tied together • Each group of three nodes therefore learns the same “encoding” of the 20 amino acids
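A sketch of orthogonal (one-hot) encoding feeding a shared 20-to-3 encoding matrix; in training the matrix would be learned, so the random values here are purely illustrative:

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(residue: str) -> np.ndarray:
    """Orthogonal encoding: a 20-vector with a single 1."""
    v = np.zeros(len(AMINO_ACIDS))
    v[AMINO_ACIDS.index(residue)] = 1.0
    return v

# One shared 20x3 weight matrix, reused at every window position,
# so every residue maps to the same learned 3-number encoding.
rng = np.random.default_rng(0)
shared_encoder = rng.normal(size=(len(AMINO_ACIDS), 3))

window = "ALAKELRAQ"
encoded = np.stack([one_hot(a) @ shared_encoder for a in window])
print(encoded.shape)  # (9, 3): three numbers per residue
```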
Engineering Intuition Into NNs • Alpha helices have a period of 3.6 residues per turn • An NN can be specially designed to reflect that • Using this, plus adaptive encoding: Q3 = 66% • Adding homology: Q3 = 73%
HMMTOP Architecture • TMHs (transmembrane helices): 17-25 residues • Tails: 1-15 residues • Blue letters show structural state labels
TMHMM Architecture • Helices are 5-25 residues • Caps follow helices • Cytoplasmic: • Loop: 0-20 residues • Globular: 1 state • Extra-cellular: • Long loop: 0-100 residues • Globular: 3 states
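To make the HMM idea concrete, here is a hedged sketch of Viterbi decoding over a toy two-state (membrane helix vs. loop) model; the states, transition, and emission probabilities are invented for illustration and are far simpler than TMHMM's actual architecture:

```python
import math

# Toy model: 'M' (membrane helix) favors hydrophobic residues,
# 'L' (loop) favors polar ones. Probabilities are made up.
HYDROPHOBIC = set("AVILMFWC")
states = ("M", "L")
start = {"M": 0.1, "L": 0.9}
trans = {"M": {"M": 0.95, "L": 0.05}, "L": {"M": 0.05, "L": 0.95}}

def emit(state: str, residue: str) -> float:
    """Emission probability, spread evenly within each residue group."""
    phob = residue in HYDROPHOBIC
    if state == "M":
        return 0.8 / 8 if phob else 0.2 / 12
    return 0.3 / 8 if phob else 0.7 / 12

def viterbi(seq: str) -> str:
    """Most probable state path (log space to avoid underflow)."""
    v = {s: math.log(start[s]) + math.log(emit(s, seq[0])) for s in states}
    back = []
    for residue in seq[1:]:
        ptr, nv = {}, {}
        for s in states:
            prev = max(states, key=lambda p: v[p] + math.log(trans[p][s]))
            ptr[s] = prev
            nv[s] = v[prev] + math.log(trans[prev][s]) + math.log(emit(s, residue))
        back.append(ptr)
        v = nv
    path = [max(states, key=v.get)]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return "".join(reversed(path))

print(viterbi("MKKLLPTAAAVLLLVILFAGQA"))
```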
Predicting Globular Proteins with “Hidden Neural Networks” • YASPIN • A neural net predicts seven classes (He, H, Hb, C, Ee, E, Eb) using a 15-residue window of PSSM input • An HMM “filters” this output • Can you imagine how this is done? (One answer: treat the NN’s per-residue class scores as emissions and decode with Viterbi, as in the sketch above, to enforce a consistent segment grammar.)
Coiled-coil HMM: MARCOIL • The design lets you start and end in any phase of the heptad repeat
Support Vector Machines: SVMs • Classifiers • Basic “machine” is a 2-class classifier • Training data • A set of labeled vectors: {<x1, x2, …, xn, C>} • Class: C = 1 or C = −1 • Supervised learning (like neural nets) • Learn from positive and negative examples • Output • A function predicting the class of unlabeled vectors
SVM Example • Alpha helix predictor • 15-residue window • 21 numbers per residue • PSI-BLAST PSSM: 20 numbers • “spacer” flag indicating “off end” of protein • 315 numbers total per window • Training samples • Non-helix samples: {<x1, x2, …, x315, −1>} • Helix samples: {<x1, x2, …, x315, 1>} • Training finds the function of x that best separates the non-helix from the helix samples
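A hedged sketch of this setup with scikit-learn; the random training data stands in for real PSSM windows, and only the feature dimensions follow the slide:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_features = 15 * 21   # 15-residue window x (20 PSSM values + 1 spacer flag)

# Stand-in training data: real samples would be PSSM windows labeled
# +1 (helix center) or -1 (non-helix center).
X = rng.normal(size=(200, n_features))
y = np.where(X[:, 0] > 0, 1, -1)   # toy labels for demonstration only

clf = SVC(kernel="rbf")            # the kernel choice tailors the classifier
clf.fit(X, y)
print(clf.predict(X[:5]))          # predicted classes for unlabeled windows
print(len(clf.support_))           # number of support vectors
```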
SVM vs. NN as Classifiers • Similarities • Both compute a function of their inputs • Both are trained to minimize error • Differences • NNs find any hyperplane that separates the two classes • SVMs find the maximum-margin hyperplane • NNs can be engineered by designing their topology • SVMs can be tailored by designing the kernel function
SVM Details • Separating hyperplanes: choose w, b to minimize ||w||, subject to yi(w·xi + b) ≥ 1 for every training sample i • Dual form (support vectors): maximize Σi αi − ½ Σi Σj αi αj yi yj (xi·xj), subject to αi ≥ 0 and Σi αi yi = 0, where w = Σi αi yi xi • Kernel trick: replace the dot products xi·xj with a non-linear kernel function K(xi, xj)
Dubious Statement • “In marked contrast to NN, SVMs have few explicit parameters to fit…” • The vector of dual weights, α, is as long as the number of training samples • But the maximum-margin hyperplane will have most of these weights equal to zero; only the “support vectors” have non-zero weights.