490 likes | 785 Views
Profiles and Fuzzy K-Nearest Neighbor Algorithm for Protein Secondary Structure Prediction. Rajkumar Bondugula, Ognen Duzlevski and Dong Xu. Digital Biology Laboratory, Dept. of Computer Science University of Missouri – Columbia, MO 65211, USA. Outline. Introduction
E N D
Profiles and Fuzzy K-Nearest Neighbor Algorithm for Protein Secondary Structure Prediction Rajkumar Bondugula, Ognen Duzlevski and Dong Xu Digital Biology Laboratory, Dept. of Computer Science University of Missouri – Columbia, MO 65211, USA
Outline • Introduction • Protein secondary structure prediction • Popular methods • K-Nearest Neighbor method • Fuzzy K-Nearest Neighbor method • Methods • Filtering the prediction • Results and discussion • Summary and Future work
Introduction • Goal:Given a sequence of amino acids, predict in which one of the eight possible secondary structures states {H, G, I, B, E, C, S,T} will each residue fold in to. • CASP convention • {H,G,I} → H • {B,E} → E • {C,S,T}→ C • Example: Amino Acid VKDGYIVDXVNCTYFCGRNAYCNEECTKLXGEQWASPYYCYXLPDHVRTKGPGRCH Secondary Structure CEEEEEECCCCCCCCCCCHHHHHHHHHHCCCCEEEECCEEEEECCCCCCCCCCCCC
Importance of Secondary Structure • An intermediate step in 3D structure prediction • structure → function • Classification • Ex: α, β, α/β, α+β • Helps in protein folding pathway determination
Existing Methods • Popular Methods • Neural Network methods • Ex: PSIPRED, PHD • Nearest Neighbor methods • Ex: NNSSP • Hidden Markov Model methods
Why K-Nearest Neighbors method? • Methods based on Neural Networks and Hidden Markov models • perform well if the query protein have many homologs in the sequence database • not easily expandable • The 1-Nearest Neighbor rule is bound above by no more than twice the optimal Baye’s error rate [Keller et. al, 1985] • K-NN will work better and better as more and more structures are being solved
K-Nearest Neighbor Algorithm Instances to be classified Classified instances
K-Nearest Neighbor Algorithm Instances to be classified Classified instances
K-Nearest Neighbor Algorithm class F Instances to be classified class B
K-Nearest Neighbor Algorithm • Advantages of Nearest Neighbor methods • Simple and transparent model • New structures can be added without re-training • Linear complexity • Disadvantage • Slower compared to other models as processing is delayed until prediction is needed
Why Fuzzy K-NN? • Disadvantages of Crisp K-NN • Atypical examples are given as much as weight as those that truly represent a particular class • Once instance is assigned to a class, there is no indication of its “strength” of its membership in that class
- - - N L G A G N S G L N L G H V A L T F - - - N L G A
- - - N L G A G N S G L N L G H V A L T F - - - N L G A - - N L G A G
- - - N L G A G N S G L N L G H V A L T F - - - N L G A - - N L G A G - N L G A G N
- - - N L G A G N S G L N L G H V A L T F - - - N L G A - - N L G A G - - N L G A N N L G A G N S
- - - N L G A G N S G L N L G H V A L T F L G A G N S G - - - N L G A - - N L G A G - N L G A G N S
. . . N L G A G N S G L N L G H V A L T F . . . PSI-BLAST . . . 5 -2 -2 -2 -1 -1 -1 0 -2 -2 -2 -1 -1 -3 -1 1 0 -3 . . . . . . -1 -3 -2 -2 -3 -2 -2 -3 -3 -3 -3 -1 -3 -4 8 -1 -2 -4 . . . . . . 3 -2 -1 -2 -1 -2 -2 2 -2 -2 -2 -1 -2 -3 -2 0 3 -3 . . . . . . 2 -3 -3 -3 -2 -2 -3 -2 -3 1 0 -2 0 4 -3 -1 -1 -2 . . . . . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -2 4 4 3 . . . . . . -1 -3 -4 -4 -1 -3 -3 -4 -4 2 0 -3 0 -1 -3 -2 -1 3 . . . . . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -2 4 4 3 . . . . . . -1 -3 -3 -2 -3 -2 -2 -3 -3 -3 -4 -2 -3 -4 8 -1 -2 4 . . . . . . 4 -2 -1 -2 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -1 3 0 3 . . . . . . 0 -1 0 -1 -1 -1 -1 -1 -2 -2 -3 -1 -2 -3 -1 4 4 3 . . . . . . 0 -3 -1 -2 -3 -2 -3 6 -3 -4 -4 -2 -3 -4 -3 -1 -2 3 . . . . . . 0 -3 -4 -4 -2 -3 -3 -3 -3 1 5 -3 1 0 -3 -2 -2 2 . . . . . . 2 -1 0 -1 -1 -1 -1 -1 -2 -3 -3 -1 -2 -3 -1 5 1 3 . . . . . . -2 -2 3 6 -4 -1 1 -2 -1 -4 -4 -1 -4 -4 -2 -1 -1 5 . . . . . . 2 -3 -3 -3 -2 -2 -3 -2 -3 1 0 -2 0 4 -3 -1 -1 2 . . . . . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -2 4 4 3 . . . . . . -1 -3 -4 -4 -1 -3 -3 -4 -4 2 0 -3 0 -1 -3 -2 -1 3 . . . . . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -2 4 4 3 . . . . . . -1 -3 -3 -2 -3 -2 -2 -3 -3 -3 -4 -2 -3 -4 8 -1 -2 4 . . . . . . 4 -2 -1 -2 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -1 3 0 3 . . . 20 Length of protein(l) Position Specific Scoring Matrix A R N D C Q E G H I L K M F P S T W Y V
Why Profile-FKNN? • Evolutionary information has been shown to increase the accuracy of secondary structure prediction by many popular methods • An attempt to combine the advantages of incorporating the evolutionary information, fuzzy set theory and nearest neighbor methods
Methods • Calculate profiles using PSI-BLAST • The popular Rost and Sander database of 126 representative proteins (<25% sequence Identity) • Find K-Nearest Neighbors • Calculate the membership values of the neighbors • Calculate the membership values of the current residue • Assign classes • Filter the output
Profile Calculation • The profiles of both the query protein and the test protein are calculated using the program PSI-BLAST • Parameters for PSI-BLAST • Expectation Value (e) = 0.1 • Maximum number of passes (j) = 3 • E-value threshold for inclusion in multi-pass model (h) = 5 • Default values for the rest of the parameters
K-Nearest Neighbors • For each profile-window in the query protein, the position-weighted absolute distance ‘d’ is calculated from all profile-windows of all proteins in the database. • The profile-windows corresponding to K smallest distances are retained as the K-Nearest Neighbors
. . . N L G A G N S G L T F . . . . . . N L G A G N S G L N L G H V A L T F . . . . . . 5 -2 -2 -2 -1 -1 -1 0 -2 -2 2 . . . . . . -1 -3 -2 -2 -3 -2 -2 -3 -3 -3 -3 . . . . . . 3 -2 -1 -2 -1 -2 -2 2 -2 -2 -2 . . . . . . 2 -3 -3 -3 -2 -2 -3 -2 -3 1 0 . . . . . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 . . . . . . -1 -3 -4 -4 -1 -3 -3 -4 -4 2 0 . . . . . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 . . . . . . -1 -3 -3 -2 -3 -2 -2 -3 -3 -3 -4. . . . . . 4 -2 -1 -2 -1 -1 -1 -1 -2 -2 -2 . . . . . . 0 -1 0 -1 -1 -1 -1 -1 -2 -2 -3 . . . . . . 0 -3 -1 -2 -3 -2 -3 6 -3 -4 -4 . . . . . . 0 -3 -4 -4 -2 -3 -3 -3 -3 1 5 . . . . . . 2 -1 0 -1 -1 -1 -1 -1 -2 -3 -3 . . . . . . -2 -2 3 6 -4 -1 1 -2 -1 -4 -4 . . . . . . 2 -3 -3 -3 -2 -2 -3 -2 -3 1 0 . . . . . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 . . . . . . -1 -3 -4 -4 -1 -3 -3 -4 -4 2 0 . . . . . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 . . . . . . -1 -3 -3 -2 -3 -2 -2 -3 -3 -3 -4 . . . . . . 4 -2 -1 -2 -1 -1 -1 -1 -2 -2 -2 . . . . . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -2 4 4 3 . . . . . . -1 -3 -4 -4 -1 -3 -3 -4 -4 2 0 -3 0 -1 -3 -2 -1 3 . . . . . . 2 -3 -3 -3 -2 -2 -3 -2 -3 1 0 -2 0 4 -3 -1 -1 -2 . . . . . . 0 -3 -1 -2 -3 -2 -3 6 -3 -4 -4 -2 -3 -4 -3 -1 -2 3 . . . . . . 0 -3 -4 -4 -2 -3 -3 -3 -3 1 5 -3 1 0 -3 -2 -2 2 . . . . . . 2 -1 0 -1 -1 -1 -1 -1 -2 -3 -3 -1 -2 -3 -1 5 1 3 . . . . . . -2 -2 3 6 -4 -1 1 -2 -1 -4 -4 -1 -4 -4 -2 -1 -1 5 . . . . . . 2 -3 -3 -3 -2 -2 -3 -2 -3 1 0 -2 0 4 -3 -1 -1 2 . . . . . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -2 4 4 3 . . . . . . -1 -3 -4 -4 -1 -3 -3 -4 -4 2 0 -3 0 -1 -3 -2 -1 3 . . . . . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -2 4 4 3 . . . . . . -1 -3 -3 -2 -3 -2 -2 -3 -3 -3 -4 -2 -3 -4 8 -1 -2 4 . . . . . . 4 -2 -1 -2 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -1 3 0 3 . . . . . . 0 -1 0 -1 -1 -1 -1 -1 -2 -2 -3 -1 -2 -3 -1 4 4 3 . . . . . . 5 -2 -2 -2 -1 -1 -1 0 -2 -2 -2 -1 -1 -3 -1 1 0 -3 . . . . . . -1 -3 -2 -2 -3 -2 -2 -3 -3 -3 -3 -1 -3 -4 8 -1 -2 -4 . . . . . . 3 -2 -1 -2 -1 -2 -2 2 -2 -2 -2 -1 -2 -3 -2 0 3 -3 . . . . . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -2 4 4 3 . . . . . . -1 -3 -3 -2 -3 -2 -2 -3 -3 -3 -4 -2 -3 -4 8 -1 -2 4 . . . . . . 4 -2 -1 -2 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -1 3 0 3 . . . Distance Calculation
. . . N L G A G N S G L T F . . . . . . N L G A G N S G L N L G H V A L T F . . . . . . 5 -2 -2 -2 -1 -1 -1 0 -2 -2 2 . . . . . . -1 -3 -2 -2 -3 -2 -2 -3 -3 -3 -3 . . . . . . 3 -2 -1 -2 -1 -2 -2 2 -2 -2 -2 . . . . . . 2 -3 -3 -3 -2 -2 -3 -2 -3 1 0 . . . . . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 . . . . . . -1 -3 -4 -4 -1 -3 -3 -4 -4 2 0 . . . . . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 . . . . . . -1 -3 -3 -2 -3 -2 -2 -3 -3 -3 -4. . . . . . 4 -2 -1 -2 -1 -1 -1 -1 -2 -2 -2 . . . . . . 0 -1 0 -1 -1 -1 -1 -1 -2 -2 -3 . . . . . . 0 -3 -1 -2 -3 -2 -3 6 -3 -4 -4 . . . . . . 0 -3 -4 -4 -2 -3 -3 -3 -3 1 5 . . . . . . 2 -1 0 -1 -1 -1 -1 -1 -2 -3 -3 . . . . . . -2 -2 3 6 -4 -1 1 -2 -1 -4 -4 . . . . . . 2 -3 -3 -3 -2 -2 -3 -2 -3 1 0 . . . . . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 . . . . . . -1 -3 -4 -4 -1 -3 -3 -4 -4 2 0 . . . . . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 . . . . . . -1 -3 -3 -2 -3 -2 -2 -3 -3 -3 -4 . . . . . . 4 -2 -1 -2 -1 -1 -1 -1 -2 -2 -2 . . . . . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -2 4 4 3 . . . . . . -1 -3 -4 -4 -1 -3 -3 -4 -4 2 0 -3 0 -1 -3 -2 -1 3 . . . . . . 2 -3 -3 -3 -2 -2 -3 -2 -3 1 0 -2 0 4 -3 -1 -1 -2 . . . . . . 0 -3 -1 -2 -3 -2 -3 6 -3 -4 -4 -2 -3 -4 -3 -1 -2 3 . . . . . . 0 -3 -4 -4 -2 -3 -3 -3 -3 1 5 -3 1 0 -3 -2 -2 2 . . . . . . 2 -1 0 -1 -1 -1 -1 -1 -2 -3 -3 -1 -2 -3 -1 5 1 3 . . . . . . -2 -2 3 6 -4 -1 1 -2 -1 -4 -4 -1 -4 -4 -2 -1 -1 5 . . . . . . 2 -3 -3 -3 -2 -2 -3 -2 -3 1 0 -2 0 4 -3 -1 -1 2 . . . . . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -2 4 4 3 . . . . . . -1 -3 -4 -4 -1 -3 -3 -4 -4 2 0 -3 0 -1 -3 -2 -1 3 . . . . . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -2 4 4 3 . . . . . . -1 -3 -3 -2 -3 -2 -2 -3 -3 -3 -4 -2 -3 -4 8 -1 -2 4 . . . . . . 4 -2 -1 -2 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -1 3 0 3 . . . . . . 0 -1 0 -1 -1 -1 -1 -1 -2 -2 -3 -1 -2 -3 -1 4 4 3 . . . . . . 5 -2 -2 -2 -1 -1 -1 0 -2 -2 -2 -1 -1 -3 -1 1 0 -3 . . . . . . -1 -3 -2 -2 -3 -2 -2 -3 -3 -3 -3 -1 -3 -4 8 -1 -2 -4 . . . . . . 3 -2 -1 -2 -1 -2 -2 2 -2 -2 -2 -1 -2 -3 -2 0 3 -3 . . . . . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -2 4 4 3 . . . . . . -1 -3 -3 -2 -3 -2 -2 -3 -3 -3 -4 -2 -3 -4 8 -1 -2 4 . . . . . . 4 -2 -1 -2 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -1 3 0 3 . . . Distance Calculation
. . . N L G A G N S G L T F . . . . . . N L G A G N S G L N L G H V A L T F . . . . . . 5 -2 -2 -2 -1 -1 -1 0 -2 -2 2 . . . . . . -1 -3 -2 -2 -3 -2 -2 -3 -3 -3 -3 . . . . . . 3 -2 -1 -2 -1 -2 -2 2 -2 -2 -2 . . . . . . 2 -3 -3 -3 -2 -2 -3 -2 -3 1 0 . . . . . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 . . . . . . -1 -3 -4 -4 -1 -3 -3 -4 -4 2 0 . . . . . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 . . . . . . -1 -3 -3 -2 -3 -2 -2 -3 -3 -3 -4. . . . . . 4 -2 -1 -2 -1 -1 -1 -1 -2 -2 -2 . . . . . . 0 -1 0 -1 -1 -1 -1 -1 -2 -2 -3 . . . . . . 0 -3 -1 -2 -3 -2 -3 6 -3 -4 -4 . . . . . . 0 -3 -4 -4 -2 -3 -3 -3 -3 1 5 . . . . . . 2 -1 0 -1 -1 -1 -1 -1 -2 -3 -3 . . . . . . -2 -2 3 6 -4 -1 1 -2 -1 -4 -4 . . . . . . 2 -3 -3 -3 -2 -2 -3 -2 -3 1 0 . . . . . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 . . . . . . -1 -3 -4 -4 -1 -3 -3 -4 -4 2 0 . . . . . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 . . . . . . -1 -3 -3 -2 -3 -2 -2 -3 -3 -3 -4 . . . . . . 4 -2 -1 -2 -1 -1 -1 -1 -2 -2 -2 . . . . . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -2 4 4 3 . . . . . . -1 -3 -4 -4 -1 -3 -3 -4 -4 2 0 -3 0 -1 -3 -2 -1 3 . . . . . . 2 -3 -3 -3 -2 -2 -3 -2 -3 1 0 -2 0 4 -3 -1 -1 -2 . . . . . . 0 -3 -1 -2 -3 -2 -3 6 -3 -4 -4 -2 -3 -4 -3 -1 -2 3 . . . . . . 0 -3 -4 -4 -2 -3 -3 -3 -3 1 5 -3 1 0 -3 -2 -2 2 . . . . . . 2 -1 0 -1 -1 -1 -1 -1 -2 -3 -3 -1 -2 -3 -1 5 1 3 . . . . . . -2 -2 3 6 -4 -1 1 -2 -1 -4 -4 -1 -4 -4 -2 -1 -1 5 . . . . . . 2 -3 -3 -3 -2 -2 -3 -2 -3 1 0 -2 0 4 -3 -1 -1 2 . . . . . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -2 4 4 3 . . . . . . -1 -3 -4 -4 -1 -3 -3 -4 -4 2 0 -3 0 -1 -3 -2 -1 3 . . . . . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -2 4 4 3 . . . . . . -1 -3 -3 -2 -3 -2 -2 -3 -3 -3 -4 -2 -3 -4 8 -1 -2 4 . . . . . . 4 -2 -1 -2 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -1 3 0 3 . . . . . . 0 -1 0 -1 -1 -1 -1 -1 -2 -2 -3 -1 -2 -3 -1 4 4 3 . . . . . . 5 -2 -2 -2 -1 -1 -1 0 -2 -2 -2 -1 -1 -3 -1 1 0 -3 . . . . . . -1 -3 -2 -2 -3 -2 -2 -3 -3 -3 -3 -1 -3 -4 8 -1 -2 -4 . . . . . . 3 -2 -1 -2 -1 -2 -2 2 -2 -2 -2 -1 -2 -3 -2 0 3 -3 . . . . . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -2 4 4 3 . . . . . . -1 -3 -3 -2 -3 -2 -2 -3 -3 -3 -4 -2 -3 -4 8 -1 -2 4 . . . . . . 4 -2 -1 -2 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -1 3 0 3 . . . Distance Calculation
. . . N L G A G N S G L T F . . . . . . N L G A G N S G L N L G H V A L T F . . . . . . 5 -2 -2 -2 -1 -1 -1 0 -2 -2 2 . . . . . . -1 -3 -2 -2 -3 -2 -2 -3 -3 -3 -3 . . . . . . 3 -2 -1 -2 -1 -2 -2 2 -2 -2 -2 . . . . . . 2 -3 -3 -3 -2 -2 -3 -2 -3 1 0 . . . . . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 . . . . . . -1 -3 -4 -4 -1 -3 -3 -4 -4 2 0 . . . . . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 . . . . . . -1 -3 -3 -2 -3 -2 -2 -3 -3 -3 -4. . . . . . 4 -2 -1 -2 -1 -1 -1 -1 -2 -2 -2 . . . . . . 0 -1 0 -1 -1 -1 -1 -1 -2 -2 -3 . . . . . . 0 -3 -1 -2 -3 -2 -3 6 -3 -4 -4 . . . . . . 0 -3 -4 -4 -2 -3 -3 -3 -3 1 5 . . . . . . 2 -1 0 -1 -1 -1 -1 -1 -2 -3 -3 . . . . . . -2 -2 3 6 -4 -1 1 -2 -1 -4 -4 . . . . . . 2 -3 -3 -3 -2 -2 -3 -2 -3 1 0 . . . . . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 . . . . . . -1 -3 -4 -4 -1 -3 -3 -4 -4 2 0 . . . . . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 . . . . . . -1 -3 -3 -2 -3 -2 -2 -3 -3 -3 -4 . . . . . . 4 -2 -1 -2 -1 -1 -1 -1 -2 -2 -2 . . . . . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -2 4 4 3 . . . . . . -1 -3 -4 -4 -1 -3 -3 -4 -4 2 0 -3 0 -1 -3 -2 -1 3 . . . . . . 2 -3 -3 -3 -2 -2 -3 -2 -3 1 0 -2 0 4 -3 -1 -1 -2 . . . . . . 0 -3 -1 -2 -3 -2 -3 6 -3 -4 -4 -2 -3 -4 -3 -1 -2 3 . . . . . . 0 -3 -4 -4 -2 -3 -3 -3 -3 1 5 -3 1 0 -3 -2 -2 2 . . . . . . 2 -1 0 -1 -1 -1 -1 -1 -2 -3 -3 -1 -2 -3 -1 5 1 3 . . . . . . -2 -2 3 6 -4 -1 1 -2 -1 -4 -4 -1 -4 -4 -2 -1 -1 5 . . . . . . 2 -3 -3 -3 -2 -2 -3 -2 -3 1 0 -2 0 4 -3 -1 -1 2 . . . . . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -2 4 4 3 . . . . . . -1 -3 -4 -4 -1 -3 -3 -4 -4 2 0 -3 0 -1 -3 -2 -1 3 . . . . . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -2 4 4 3 . . . . . . -1 -3 -3 -2 -3 -2 -2 -3 -3 -3 -4 -2 -3 -4 8 -1 -2 4 . . . . . . 4 -2 -1 -2 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -1 3 0 3 . . . . . . 0 -1 0 -1 -1 -1 -1 -1 -2 -2 -3 -1 -2 -3 -1 4 4 3 . . . . . . 5 -2 -2 -2 -1 -1 -1 0 -2 -2 -2 -1 -1 -3 -1 1 0 -3 . . . . . . -1 -3 -2 -2 -3 -2 -2 -3 -3 -3 -3 -1 -3 -4 8 -1 -2 -4 . . . . . . 3 -2 -1 -2 -1 -2 -2 2 -2 -2 -2 -1 -2 -3 -2 0 3 -3 . . . . . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -2 4 4 3 . . . . . . -1 -3 -3 -2 -3 -2 -2 -3 -3 -3 -4 -2 -3 -4 8 -1 -2 4 . . . . . . 4 -2 -1 -2 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -1 3 0 3 . . . Distance Calculation
. . . N L G A G N S G L T F . . . . . . N L G A G N S G L N L G H V A L T F . . . . . . 5 -2 -2 -2 -1 -1 -1 0 -2 -2 2 . . . . . . -1 -3 -2 -2 -3 -2 -2 -3 -3 -3 -3 . . . . . . 3 -2 -1 -2 -1 -2 -2 2 -2 -2 -2 . . . . . . 2 -3 -3 -3 -2 -2 -3 -2 -3 1 0 . . . . . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 . . . . . . -1 -3 -4 -4 -1 -3 -3 -4 -4 2 0 . . . . . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 . . . . . . -1 -3 -3 -2 -3 -2 -2 -3 -3 -3 -4. . . . . . 4 -2 -1 -2 -1 -1 -1 -1 -2 -2 -2 . . . . . . 0 -1 0 -1 -1 -1 -1 -1 -2 -2 -3 . . . . . . 0 -3 -1 -2 -3 -2 -3 6 -3 -4 -4 . . . . . . 0 -3 -4 -4 -2 -3 -3 -3 -3 1 5 . . . . . . 2 -1 0 -1 -1 -1 -1 -1 -2 -3 -3 . . . . . . -2 -2 3 6 -4 -1 1 -2 -1 -4 -4 . . . . . . 2 -3 -3 -3 -2 -2 -3 -2 -3 1 0 . . . . . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 . . . . . . -1 -3 -4 -4 -1 -3 -3 -4 -4 2 0 . . . . . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 . . . . . . -1 -3 -3 -2 -3 -2 -2 -3 -3 -3 -4 . . . . . . 4 -2 -1 -2 -1 -1 -1 -1 -2 -2 -2 . . . . . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -2 4 4 3 . . . . . . -1 -3 -4 -4 -1 -3 -3 -4 -4 2 0 -3 0 -1 -3 -2 -1 3 . . . . . . 2 -3 -3 -3 -2 -2 -3 -2 -3 1 0 -2 0 4 -3 -1 -1 -2 . . . . . . 0 -3 -1 -2 -3 -2 -3 6 -3 -4 -4 -2 -3 -4 -3 -1 -2 3 . . . . . . 0 -3 -4 -4 -2 -3 -3 -3 -3 1 5 -3 1 0 -3 -2 -2 2 . . . . . . 2 -1 0 -1 -1 -1 -1 -1 -2 -3 -3 -1 -2 -3 -1 5 1 3 . . . . . . -2 -2 3 6 -4 -1 1 -2 -1 -4 -4 -1 -4 -4 -2 -1 -1 5 . . . . . . 2 -3 -3 -3 -2 -2 -3 -2 -3 1 0 -2 0 4 -3 -1 -1 2 . . . . . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -2 4 4 3 . . . . . . -1 -3 -4 -4 -1 -3 -3 -4 -4 2 0 -3 0 -1 -3 -2 -1 3 . . . . . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -2 4 4 3 . . . . . . -1 -3 -3 -2 -3 -2 -2 -3 -3 -3 -4 -2 -3 -4 8 -1 -2 4 . . . . . . 4 -2 -1 -2 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -1 3 0 3 . . . . . . 0 -1 0 -1 -1 -1 -1 -1 -2 -2 -3 -1 -2 -3 -1 4 4 3 . . . . . . 5 -2 -2 -2 -1 -1 -1 0 -2 -2 -2 -1 -1 -3 -1 1 0 -3 . . . . . . -1 -3 -2 -2 -3 -2 -2 -3 -3 -3 -3 -1 -3 -4 8 -1 -2 -4 . . . . . . 3 -2 -1 -2 -1 -2 -2 2 -2 -2 -2 -1 -2 -3 -2 0 3 -3 . . . . . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -2 4 4 3 . . . . . . -1 -3 -3 -2 -3 -2 -2 -3 -3 -3 -4 -2 -3 -4 8 -1 -2 4 . . . . . . 4 -2 -1 -2 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -1 3 0 3 . . . Distance Calculation
Distance Calculation . . . N L G A G N S G L T F . . . . . . N L G A G N S G L N L G H V A L T F . . . . . . 5 -2 -2 -2 -1 -1 -1 0 -2 -2 2 . . . . . . -1 -3 -2 -2 -3 -2 -2 -3 -3 -3 -3 . . . . . . 3 -2 -1 -2 -1 -2 -2 2 -2 -2 -2 . . . . . . 2 -3 -3 -3 -2 -2 -3 -2 -3 1 0 . . . . . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 . . . . . . -1 -3 -4 -4 -1 -3 -3 -4 -4 2 0 . . . . . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 . . . . . . -1 -3 -3 -2 -3 -2 -2 -3 -3 -3 -4. . . . . . 4 -2 -1 -2 -1 -1 -1 -1 -2 -2 -2 . . . . . . 0 -1 0 -1 -1 -1 -1 -1 -2 -2 -3 . . . . . . 0 -3 -1 -2 -3 -2 -3 6 -3 -4 -4 . . . . . . 0 -3 -4 -4 -2 -3 -3 -3 -3 1 5 . . . . . . 2 -1 0 -1 -1 -1 -1 -1 -2 -3 -3 . . . . . . -2 -2 3 6 -4 -1 1 -2 -1 -4 -4 . . . . . . 2 -3 -3 -3 -2 -2 -3 -2 -3 1 0 . . . . . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 . . . . . . -1 -3 -4 -4 -1 -3 -3 -4 -4 2 0 . . . . . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 . . . . . . -1 -3 -3 -2 -3 -2 -2 -3 -3 -3 -4 . . . . . . 4 -2 -1 -2 -1 -1 -1 -1 -2 -2 -2 . . . . . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -2 4 4 3 . . . . . . -1 -3 -4 -4 -1 -3 -3 -4 -4 2 0 -3 0 -1 -3 -2 -1 3 . . . . . . 2 -3 -3 -3 -2 -2 -3 -2 -3 1 0 -2 0 4 -3 -1 -1 -2 . . . . . . 0 -3 -1 -2 -3 -2 -3 6 -3 -4 -4 -2 -3 -4 -3 -1 -2 3 . . . . . . 0 -3 -4 -4 -2 -3 -3 -3 -3 1 5 -3 1 0 -3 -2 -2 2 . . . . . . 2 -1 0 -1 -1 -1 -1 -1 -2 -3 -3 -1 -2 -3 -1 5 1 3 . . . . . . -2 -2 3 6 -4 -1 1 -2 -1 -4 -4 -1 -4 -4 -2 -1 -1 5 . . . . . . 2 -3 -3 -3 -2 -2 -3 -2 -3 1 0 -2 0 4 -3 -1 -1 2 . . . . . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -2 4 4 3 . . . . . . -1 -3 -4 -4 -1 -3 -3 -4 -4 2 0 -3 0 -1 -3 -2 -1 3 . . . . . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -2 4 4 3 . . . . . . -1 -3 -3 -2 -3 -2 -2 -3 -3 -3 -4 -2 -3 -4 8 -1 -2 4 . . . . . . 4 -2 -1 -2 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -1 3 0 3 . . . . . . 0 -1 0 -1 -1 -1 -1 -1 -2 -2 -3 -1 -2 -3 -1 4 4 3 . . . . . . 5 -2 -2 -2 -1 -1 -1 0 -2 -2 -2 -1 -1 -3 -1 1 0 -3 . . . . . . -1 -3 -2 -2 -3 -2 -2 -3 -3 -3 -3 -1 -3 -4 8 -1 -2 -4 . . . . . . 3 -2 -1 -2 -1 -2 -2 2 -2 -2 -2 -1 -2 -3 -2 0 3 -3 . . . . . . 0 -2 0 -1 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -2 4 4 3 . . . . . . -1 -3 -3 -2 -3 -2 -2 -3 -3 -3 -4 -2 -3 -4 8 -1 -2 4 . . . . . . 4 -2 -1 -2 -1 -1 -1 -1 -2 -2 -2 -1 -2 -3 -1 3 0 3 . . .
Membership Values of the Neighbors • The memberships of the nearest neighbors are assigned based on their corresponding secondary structures in various positions in the window • The residues near to the center are weighed more than the residues that are farther away
N L G A G N S A 0.067 0.133 0.20 0.20 0.20 0.133 0.067 H E 1 1 1 C 1 1 1 1 C C E E E C C H = 0 E = 0.200x1 + 0.200x1 + 0.20x1 = 0.6 C = 0.067x1 + 0.133x1 +0.133x1 + 0.067x1 = 0.4 C C E E E C C E Membership values of the Neighbors
Membership Value • The membership values of each residue in classes Helix, Sheet and Coil is calculated from the corresponding neighbors using the Fuzzy K-NN algorithm • Each residue is assigned to class in which it has the highest membership value Helix = . . . 15 22 61 91 95 9626 21 23 18 29 30 24 17 5 8 . . . Sheet = . . . 22 28 13 1 1 2 8 8 12 11 42 44 46 29 14 10 . . . Coil = . . . 6350 26 8 4 2 65 71 65 71 29 26 31 53 81 82 . . . Final = . . . C CH H H HC C C CE E EC C C . . .
Fuzzy K-Nearest neighbor Algorithm BEGIN Initialize i=1. DO UNTIL(r assigned membership in all classes) Compute ui(r) using Increment i. END DO UNTIL END Where, ui= membership value of residue ‘r’ in class ‘i’, i = Helix, Sheet or Coil d(r,rj)= distance between query window centered in residue ‘r’ its jth neighbor m = 2 (Fuzzifier)
Structure Filtration • In the basic setting, the secondary structure state is class with highest membership value • Unrealistic structures may be present • Popular methods of structure filtration • Neural Network • Heuristic based
Heuristic Filter • Smoothen the memberships values • Filter unrealistic structures • Helix > 3 amino acids, -sheet > 2 amino acids • Calculate the thresholds to filter noise • Mark the possible Helix and Sheet regions • Resolve conflicts based on average membership value in overlap region • Fill the rest of the structure with Coil
Filter: Final Structure Unfiltered CCCCCHCCCCCHHHHHHHHCCCCCCEEEEECCCCCCCCCCCCCEEEEEECCCCCCHHHCCCCC Target CCCHHHCCCCHHHHHHHHHHHCCCCEEEEEECCCCEECCCCCCEEEEEEECCCCEECCCCEEC Filtered CCHHHHCCCHHHHHHHHHHHHHCCCEEEEEECCCCCCCCCCCCEEEEEEECCCCCCCCCCCCC
Metrics • Seven commonly used metrics • Q3 = Number of correctly predicted residues x 100 Total number of residues • Q<H,E,C>= Number of <helix,sheet,coil> residues correctly predicted X100 Total number of residues in <helix,sheet,coil> • Matthew’s Correlation Coefficient MCC<H,E,C>= where, p – true positives n – true negatives u – false negatives o – false positives
Results Performance on database of 1973 proteins (<25% sequence identity) generated by the PISCES1 server 1. G. Wang and R. L. Dunbrack, Jr. PISCES: a protein sequence culling server. Bioinformatics, 19:1589-1591, 2003.
Relative Performance • X. Zhang, J. P. Mesirov and D.L Waltz. Hybrid system for Protein Secondary Structure Prediction. J. Mol. Biol., 225:1049-1063, 1992 • Tau-Mu Yi and E. S. Lander. Protein Secondary Structure Prediction using Nearest-Neighbor Methods. J. Mol. Biol., 232:1117-1129, 1993 • A. A. Salamov and V. V. Solovyev. Prediction of Protein Secondary Structure by Combining Nearest-neighbor Algorithm and Multiple Sequence Alignments. J. Mol. Biol., 247:11-15, 1995
Summary • A novel approach for PSSP • Evolutionary information • K-Nearest Neighbor algorithm • Fuzzy set theory • Most accurate KNN approach to date • Easily expandable • Accuracy increases with new structures • Average computing time < 1 min on a single CPU machine
Future Work • System with faster search capabilities • Efficient search for neighbors • Accurate prediction system
Acknowledgements • Dr. James Keller for insight into the Fuzzy K-Nearest Neighbor Algorithm • Oak Ridge National Laboratory for providing the supercomputing facilities • Members of Digital Biology Laboratory for their support
Software The enhanced version of the software is coded in C and is available upon request. Please e-mail your requests to Raj@mizzou.edu or XuDong@missouri.edu
Thank you for Participation!