Prediction of Local Structure in Proteins Using a Library of Sequence-Structure Motifs

Christopher Bystroff & David Baker • Paper presented by: Tal Blum, blum+@cs.cmu.edu

The Approach • Learn a set of clusters of structure segments that can be identified from short local sequences • Combine the local structural predictions into one whole structure

Methods - Database • Database of 471 protein sequence families • By Sander & Schneider 1994 • Each family contains one known sequence structure • No more than 25% sequence identity between any 2 alignments • Well determined structures • Non-membrane proteins

Clustering of Sequence Segments • Each position in the database is described by a weighted amino acid frequency (Vingron & Argos 1989) • Similarity between a sequence and a cluster is defined by “Cross-Entropy”: • Segments of given length (3-15) were clustered via the K-means algorithm • Unsupervised

Assessing Structure within a Cluster and Choice of Paradigm • Structural similarity between 2 peptide structure segments is measured by dme (distance matrix error) and mda (maximum deviation in backbone torsion angles) • S1(i→j) is the distance between α-carbon atoms i and j in segment S1 • The paradigm for a cluster was chosen from the top 20 segments as the one with the smallest sum of mda/dme values with the others
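The two structural distances can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the use of all unique α-carbon pairs, the root-mean-square normalization of dme, and the choice to compare every φ and ψ in the segment are assumptions, since the slide's formulas did not survive.

```python
import numpy as np

def dme(ca1, ca2):
    """Distance matrix error between two equal-length segments.

    ca1, ca2: arrays of alpha-carbon coordinates with shape (L, 3).
    Assumed form: RMS difference of intra-segment distances over pairs i < j.
    """
    ca1 = np.asarray(ca1, dtype=float)
    ca2 = np.asarray(ca2, dtype=float)
    d1 = np.linalg.norm(ca1[:, None, :] - ca1[None, :, :], axis=-1)
    d2 = np.linalg.norm(ca2[:, None, :] - ca2[None, :, :], axis=-1)
    iu = np.triu_indices(len(ca1), k=1)  # unique pairs i < j
    return float(np.sqrt(np.mean((d1[iu] - d2[iu]) ** 2)))

def angle_diff(a, b):
    """Smallest absolute difference between angles in degrees (0 to 180)."""
    d = np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)) % 360.0
    return np.minimum(d, 360.0 - d)

def mda(phi1, psi1, phi2, psi2):
    """Maximum deviation in backbone angles between two segments (degrees)."""
    return float(max(angle_diff(phi1, phi2).max(), angle_diff(psi1, psi2).max()))
```

Because dme compares internal distances only, it is unchanged when one segment is rotated or translated; mda needs the wrap-around handling in `angle_diff` so that 170° and -170° count as 20° apart.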

True/False Boundaries in Structure Space • Used for the refinement procedure • Find natural boundaries • Compute histograms of dme & mda vs. the paradigm over all segments in the cluster • The boundary was set to the point where the histogram first dropped to ½ of its maximum • If the boundary reached 130° or 1.3 Å, the cluster was rejected • The average boundaries are 81° and 0.89 Å • 82 clusters were constructed (the I-site library)
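The half-maximum rule can be sketched as below. This is an illustration under stated assumptions: `natural_boundary`, `bin_width`, and `reject_at` are hypothetical names, the bin width is a free choice not given on the slide, and the boundary is taken as the left edge of the first bin past the peak whose count is at most half the peak count.

```python
import numpy as np

def natural_boundary(values, bin_width, reject_at):
    """Return the half-maximum boundary of a histogram, or None if rejected.

    values: dme or mda of every cluster member against the paradigm.
    reject_at: rejection limit (for example 130.0 for mda in degrees).
    """
    values = np.asarray(values, dtype=float)
    edges = np.arange(0.0, values.max() + 2 * bin_width, bin_width)
    counts, edges = np.histogram(values, bins=edges)
    peak = int(np.argmax(counts))
    half = counts[peak] / 2.0
    # Walk right from the peak until the histogram first drops to half maximum.
    for i in range(peak + 1, len(counts)):
        if counts[i] <= half:
            boundary = float(edges[i])
            return None if boundary >= reject_at else boundary
    return None  # never dropped to half maximum: no natural boundary found

# Example: mda values (degrees) of cluster members against the paradigm.
mda_values = [5] * 10 + [15] * 8 + [25] * 4 + [35]
boundary = natural_boundary(mda_values, bin_width=10.0, reject_at=130.0)
```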

DME vs. MDA for a 9-residue serine β-hairpin

Iterative Refinement of Clusters • Applied to each cluster with good boundaries • Clustering increases P(cluster|sequence); refinement aims to increase P(structure|cluster) • 2 residues are also observed on each side of each sequence segment • All segments that are not within the natural boundaries of the paradigm are removed • The frequency profile of the cluster is recalculated • The database is searched using the new profile, and the 400 highest-scoring sequence segments form the new cluster

Cross-Validation and Confidence • A 10-fold cross-validation was performed • If the 10 paradigms were not structurally the same, or if the 10 runs did not converge to the same profile, the cluster was rejected • If the cluster was not rejected, a confidence curve was computed as a function of Dpq, the sequence-to-cluster similarity • This makes it possible to compare different profile lengths and incorporates both P(clust|seq) and P(struct|clust)

Confidence for Similarity

Clustering – What do we want? • Direction: Sequence -> Structure • We want clusters of sequences that are as well separated as possible, so that a test sequence can be assigned to 1 cluster • Each cluster should have 1 or a few possible structures. Those structures will be used to predict the test protein structure • P(struct|seq) = Σ_clust P(struct|clust,seq) * P(clust|seq) = Σ_clust P(struct|clust) * P(clust|seq)
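The decomposition on this slide is a sum over clusters, assuming the structure depends on the sequence only through the cluster. A small numeric sketch follows; the cluster names, structure labels, and probabilities are made up for illustration.

```python
def p_struct_given_seq(p_clust_given_seq, p_struct_given_clust):
    """Marginalize over clusters: P(struct|seq) = sum_c P(struct|c) * P(c|seq)."""
    total = {}
    for clust, p_c in p_clust_given_seq.items():
        for struct, p_s in p_struct_given_clust[clust].items():
            total[struct] = total.get(struct, 0.0) + p_s * p_c
    return total

# Hypothetical numbers: a sequence that matches cluster A better than cluster B.
p_clust = {"A": 0.7, "B": 0.3}
p_struct = {
    "A": {"helix_cap": 0.9, "hairpin": 0.1},
    "B": {"helix_cap": 0.2, "hairpin": 0.8},
}
result = p_struct_given_seq(p_clust, p_struct)
```

Well-separated clusters push P(clust|seq) toward 0 or 1, and structurally tight clusters push P(struct|clust) toward 0 or 1, which is why both properties are wanted.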

Iterative Peak Removal • Similar sequences can map to different structures in some cases • When this happens, the predominant pattern occludes the second one • To find those clusters, the refinement was performed using a subset of the data that excludes the members of the other class • This helped identify two distinct α-C-cap extensions which were very similar in sequence

Cluster Weights • The prediction accuracy is improved by weighting the confidence curves • An iterative update was used • Here F+C are the false-positive errors of cluster C and F-C are the false-negative errors

Prediction Protocol • Given a sequence to predict: • Submit the sequence to PHD (Rost 94) to obtain a set of multiple aligned sequences and hence a profile • Each segment of the profile is scored against each of the 82 clusters to produce weighted confidences • Confidences are sorted • The first segment assigns φ & ψ from its paradigm • For all the subsequent segments in the sorted list, the prediction is used if it doesn't conflict with previously assigned φ & ψ
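The assignment step is a greedy pass over matches sorted by confidence. The sketch below assumes the scoring has already produced a list of `(confidence, start, angles)` tuples; that tuple layout, the function names, and the `tol` threshold used to decide what counts as a conflict are all assumptions, since the slide does not define "conflict".

```python
def angle_diff(a, b):
    """Smallest absolute difference between two angles in degrees (0 to 180)."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def assign_angles(seq_len, matches, tol=30.0):
    """Greedily assign (phi, psi) per residue from the most confident matches.

    matches: list of (confidence, start, angles), where angles holds one
    (phi, psi) pair per residue taken from the matching cluster's paradigm.
    tol: assumed tolerance in degrees; a larger difference at an
    already-assigned position counts as a conflict.
    """
    assigned = [None] * seq_len
    ranked = sorted(matches, key=lambda m: m[0], reverse=True)
    for conf, start, angles in ranked:
        if start < 0 or start + len(angles) > seq_len:
            continue  # segment does not fit inside the sequence
        conflict = False
        for i, (phi, psi) in enumerate(angles, start):
            prev = assigned[i]
            if prev is not None and (
                angle_diff(prev[0], phi) > tol or angle_diff(prev[1], psi) > tol
            ):
                conflict = True
                break
        if conflict:
            continue  # skip predictions that disagree with earlier assignments
        for i, pair in enumerate(angles, start):
            if assigned[i] is None:
                assigned[i] = pair
    return assigned

# Example: a helical match, then two strand-like matches of lower confidence.
helix = [(-60.0, -45.0)] * 3
strand = [(-120.0, 130.0)] * 3
prediction = assign_angles(6, [(0.8, 2, strand), (0.9, 0, helix), (0.5, 3, strand)])
```

Residues that no accepted segment covers stay `None`, which matches the idea that the method reports local predictions only where it has confidence.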

Results • Reported on the training set and on an independent set of 55 protein families • Local evaluation is measured by agreement over an 8-residue window • An 8-residue segment prediction is considered correct if none of the φ & ψ differences is larger than 120° (mda) or if the rmsd between the correct and predicted structure is less than 1.4 Å • An error is counted at a position iff all 8 overlapping segments are incorrect • Mda is stricter than the commonly used Q3 score
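The per-position scoring rule can be sketched as follows, using only the mda criterion. This is an illustration, not the evaluation code used in the paper: the function names are hypothetical, every φ and ψ in the window is compared, and undefined terminal angles are ignored.

```python
import numpy as np

def angle_diff(a, b):
    """Smallest absolute difference between angles in degrees (0 to 180)."""
    d = np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)) % 360.0
    return np.minimum(d, 360.0 - d)

def mda_accuracy(phi_true, psi_true, phi_pred, psi_pred, window=8, max_dev=120.0):
    """Fraction of positions covered by at least one correct window.

    A window is correct when no phi or psi inside it deviates by more than
    max_dev degrees; a position is an error only if every window that
    overlaps it is incorrect.
    """
    ok = (angle_diff(phi_true, phi_pred) <= max_dev) & (
        angle_diff(psi_true, psi_pred) <= max_dev
    )
    n = len(ok)
    correct = np.zeros(n, dtype=bool)
    for s in range(n - window + 1):
        if ok[s:s + window].all():
            correct[s:s + window] = True
    return float(correct.mean())
```

A single badly predicted residue near the middle of a short chain can invalidate every window that contains it, which is one way to see why this measure is stricter than a per-residue score such as Q3.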

Results • Training Set • 471 sequences -> 122,510 residues • 95% of the 471 had 1 match with confidence ≥ 0.8 • 40% of the residues had confidence ≥ 0.6 and were 71% (mda) correct

Results

Combinations of I-sites and Conventional Secondary Structure Predictions • With the PHD program • Requires translation of torsion angles into secondary structure, or of secondary structure into torsion angles • Each program performed better in its own domain • 64% Q3 because of under-predicting loops and over-predicting strands • I-site was much better in loops and at specific turn angles • Can complement PHD

Comparison of I-Site & PHD

I-site library • 82 clusters represent 13 structural motifs

Summary of the I-site library

Conclusions • The method is fast: it requires only profile comparisons • There is a measure of "confidence" in the prediction • The authors do not provide accuracy over the whole protein • They believe that the strong local sequence-structure relationships (those that occur more than 30 times) are present in the I-site library

Discussion • NMR studies of isolated peptides of fewer than 30 residues show that the peptides do not have a well-defined structure. The I-site motifs are the exceptions • It might be that the motifs are the regions that adopt structure independently of the rest of the protein • An extension might be context-specific motifs

2 Approaches for global scoring functions • Derived from the protein Database • Large # of parameters • Complicated • Potentials • Based on Chemical Intuitions • Simpler • Clearer insights into sequence/structure relations • They chose the Database approach • Because of the dangers of crafting a measure for a specific protein family rather than for the whole DB

Scoring Functions • P(Seq|Str) is used when computing sequence profiles for motifs • P(Structure) is hardest to estimate and contains most of the non-local interactions. • For ab-initio, P(Structure) captures the features that distinguish folded structures from random chain (local) configurations.

Radius of Gyration² Scoring Function • Measures how far residues spread from the center of the fold (mean squared distance from the centroid)
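The radius of gyration is conventionally the root-mean-square distance of the atoms from their centroid, so its square is the mean squared distance. A minimal sketch, assuming one equally weighted coordinate per residue (for example the α-carbon); the function name is hypothetical.

```python
import numpy as np

def radius_of_gyration_sq(coords):
    """Squared radius of gyration: mean squared distance from the centroid.

    coords: array of shape (N, 3), one point per residue, equal weights.
    """
    coords = np.asarray(coords, dtype=float)
    centroid = coords.mean(axis=0)
    return float(((coords - centroid) ** 2).sum(axis=1).mean())

# A compact arrangement scores lower than an extended one.
compact = radius_of_gyration_sq([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
extended = radius_of_gyration_sq([[0.0, 0.0, 0.0], [4.0, 0.0, 0.0]])
```

Lower values indicate a more compact chain, which is the property this scoring function rewards.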

Radius of Gyration² Scoring Function • Advantages • Does not depend on the alpha-beta decomposition: since the generated structures are made from segments of real proteins, their alpha-beta decomposition is much like that of real proteins • Disadvantages • Structures with paired beta strands are no more probable than those with unpaired beta strands
