
Profiles and Fuzzy K-Nearest Neighbor Algorithm for Protein Secondary Structure Prediction

Rajkumar Bondugula, Ognen Duzlevski and Dong Xu. Digital Biology Laboratory, Dept. of Computer Science, University of Missouri – Columbia, MO 65211, USA.


Presentation Transcript


  1. Profiles and Fuzzy K-Nearest Neighbor Algorithm for Protein Secondary Structure Prediction Rajkumar Bondugula, Ognen Duzlevski and Dong Xu Digital Biology Laboratory, Dept. of Computer Science University of Missouri – Columbia, MO 65211, USA

  2. Outline • Introduction • Protein secondary structure prediction • Popular methods • K-Nearest Neighbor method • Fuzzy K-Nearest Neighbor method • Methods • Filtering the prediction • Results and discussion • Summary and Future work

  3. Introduction • Goal: Given a sequence of amino acids, predict which of the eight possible secondary structure states {H, G, I, B, E, C, S, T} each residue will fold into. • CASP convention • {H, G, I} → H • {B, E} → E • {C, S, T} → C • Example: Amino acid sequence: VKDGYIVDXVNCTYFCGRNAYCNEECTKLXGEQWASPYYCYXLPDHVRTKGPGRCH Secondary structure: CEEEEEECCCCCCCCCCCHHHHHHHHHHCCCCEEEECCEEEEECCCCCCCCCCCCC
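
The CASP reduction on this slide is a fixed lookup from eight states to three. A minimal sketch in Python follows; the function and table names are illustrative and do not come from the presentation.

```python
# CASP convention from the slide: eight states reduced to three.
EIGHT_TO_THREE = {
    "H": "H", "G": "H", "I": "H",   # helix-like states -> H
    "B": "E", "E": "E",             # strand-like states -> E
    "C": "C", "S": "C", "T": "C",   # everything else -> C
}


def reduce_to_three_states(eight_state: str) -> str:
    """Map an eight-state secondary structure string to the three-state alphabet."""
    return "".join(EIGHT_TO_THREE[s] for s in eight_state)
```

For example, `reduce_to_three_states("HGIBECST")` yields `"HHHEECCC"`.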

  4. Protein 3-Dimensional structure

  5. Importance of Secondary Structure • An intermediate step in 3D structure prediction • structure → function • Classification • Ex: α, β, α/β, α+β • Helps in protein folding pathway determination

  6. Existing Methods • Popular Methods • Neural Network methods • Ex: PSIPRED, PHD • Nearest Neighbor methods • Ex: NNSSP • Hidden Markov Model methods

  7. Why K-Nearest Neighbors method? • Methods based on Neural Networks and Hidden Markov Models • perform well if the query protein has many homologs in the sequence database • are not easily expandable • The error rate of the 1-Nearest Neighbor rule is bounded above by twice the optimal Bayes error rate [Keller et al., 1985] • K-NN will work better and better as more structures are solved

  8. K-Nearest Neighbor Algorithm [Figure labels: instances to be classified; classified instances]

  9. K-Nearest Neighbor Algorithm [Figure labels: instances to be classified; classified instances]

  10. K-Nearest Neighbor Algorithm [Figure labels: class F; instances to be classified; class B]

  11. K-Nearest Neighbor Algorithm • Advantages of Nearest Neighbor methods • Simple and transparent model • New structures can be added without re-training • Linear complexity • Disadvantage • Slower compared to other models as processing is delayed until prediction is needed
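
As an illustration of the crisp K-NN rule, the sketch below classifies a query by majority vote among its K nearest labeled examples. The data layout, function names and the choice of absolute distance are assumptions made for the example. The scan over all stored examples is what the "linear complexity" bullet refers to, and adding a new example is just appending to the list, with no re-training.

```python
from collections import Counter


def absolute_distance(a, b):
    """Sum of absolute coordinate differences between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_classify(query, examples, k, distance=absolute_distance):
    """Crisp K-NN: majority vote over the k stored examples closest to `query`.

    `examples` is a list of (features, label) pairs (a hypothetical layout).
    """
    nearest = sorted(examples, key=lambda ex: distance(query, ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```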

  12. Why Fuzzy K-NN? • Disadvantages of Crisp K-NN • Atypical examples are given as much weight as those that truly represent a particular class • Once an instance is assigned to a class, there is no indication of the “strength” of its membership in that class

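  13. Sequence: - - - N L G A G N S G L N L G H V A L T F

  14. Sequence: - - - N L G A G N S G L N L G H V A L T F • Window 1: - - - N L G A

  15. Sequence: - - - N L G A G N S G L N L G H V A L T F • Window 1: - - - N L G A • Window 2: - - N L G A G

  16. Sequence: - - - N L G A G N S G L N L G H V A L T F • Window 1: - - - N L G A • Window 2: - - N L G A G • Window 3: - N L G A G N

  17. Sequence: - - - N L G A G N S G L N L G H V A L T F • Window 1: - - - N L G A • Window 2: - - N L G A G • Window 3: - N L G A G N • Window 4: N L G A G N S

  18. Sequence: - - - N L G A G N S G L N L G H V A L T F • Window 1: - - - N L G A • Window 2: - - N L G A G • Window 3: - N L G A G N • Window 4: N L G A G N S • Window 5: L G A G N S G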
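
The windowing shown on slides 13 to 18 can be sketched as follows: the sequence is padded with "-" at both ends so that every residue sits at the center of a fixed-width window. The width of 7 matches the windows on the slides; the function name and the padding at the C-terminal end are assumptions.

```python
def sliding_windows(sequence: str, width: int = 7, pad: str = "-"):
    """Return one window per residue, each centered on that residue.

    `width` must be odd so that the window has a single central position.
    """
    if width % 2 == 0:
        raise ValueError("width must be odd")
    half = width // 2
    padded = pad * half + sequence + pad * half
    return [padded[i:i + width] for i in range(len(sequence))]
```

With the sequence from the slides, `sliding_windows("NLGAGNSGLNLGHVALTF")` starts with `---NLGA`, `--NLGAG`, `-NLGAGN`, `NLGAGNS`, `LGAGNSG`.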

  19. [Figure: the sequence . . . N L G A G N S G L N L G H V A L T F . . . is passed to PSI-BLAST, which produces a Position Specific Scoring Matrix with 20 columns (A R N D C Q E G H I L K M F P S T W Y V) and one row per residue (length of protein, l). Matrix values omitted.]

  20. Why Profile-FKNN? • Evolutionary information has been shown to increase the accuracy of secondary structure prediction by many popular methods • An attempt to combine the advantages of incorporating the evolutionary information, fuzzy set theory and nearest neighbor methods

  21. Methods • Calculate profiles using PSI-BLAST • The popular Rost and Sander database of 126 representative proteins (<25% sequence Identity) • Find K-Nearest Neighbors • Calculate the membership values of the neighbors • Calculate the membership values of the current residue • Assign classes • Filter the output

  22. Profile Calculation • The profiles of both the query protein and the test protein are calculated using the program PSI-BLAST • Parameters for PSI-BLAST • Expectation Value (e) = 0.1 • Maximum number of passes (j) = 3 • E-value threshold for inclusion in multi-pass model (h) = 5 • Default values for the rest of the parameters
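
The option letters on this slide (e, j, h) match those of the legacy NCBI `blastpgp` executable, so the run can be sketched as the argument list below. This assumes `blastpgp` is the program used; the file and database names are placeholders, and the `-i`, `-d` and `-Q` options (query file, database, ASCII matrix output) are additions for the example that the slide does not mention.

```python
def psiblast_command(query_fasta: str, database: str, pssm_out: str):
    """Build an argument list for a legacy blastpgp run with the slide's parameters."""
    return [
        "blastpgp",
        "-i", query_fasta,   # query sequence file (placeholder path)
        "-d", database,      # sequence database to search (placeholder name)
        "-e", "0.1",         # expectation value
        "-j", "3",           # maximum number of passes
        "-h", "5",           # E-value threshold for inclusion in the multi-pass model
        "-Q", pssm_out,      # write the position-specific scoring matrix as text
    ]
```

The list can be handed to `subprocess.run` on a machine where `blastpgp` and a formatted database are installed.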

  23. K-Nearest Neighbors • For each profile-window in the query protein, the position-weighted absolute distance ‘d’ is calculated from all profile-windows of all proteins in the database. • The profile-windows corresponding to K smallest distances are retained as the K-Nearest Neighbors
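
A minimal sketch of the neighbor search, assuming a profile-window is a list of PSSM rows (one row of scores per window position) and that "position-weighted absolute distance" means a weighted sum of per-position absolute differences. The slide does not give the weights, so they are passed in as a parameter; all names here are illustrative.

```python
import heapq


def window_distance(win_a, win_b, position_weights):
    """Position-weighted absolute distance between two profile-windows."""
    return sum(
        w * sum(abs(x - y) for x, y in zip(row_a, row_b))
        for w, row_a, row_b in zip(position_weights, win_a, win_b)
    )


def k_nearest(query_win, database, k, position_weights):
    """Return the k (distance, label) pairs with the smallest distances.

    `database` is an iterable of (window, label) pairs (a hypothetical layout);
    every stored window is compared against the query.
    """
    scored = (
        (window_distance(query_win, win, position_weights), label)
        for win, label in database
    )
    return heapq.nsmallest(k, scored, key=lambda pair: pair[0])
```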

  24-29. Distance Calculation [Figure, repeated across slides 24 to 29: the profile (Position Specific Scoring Matrix) of the sequence . . . N L G A G N S G L T F . . . shown alongside the profile of the sequence . . . N L G A G N S G L N L G H V A L T F . . . Matrix values omitted.]

  30. Membership Values of the Neighbors • The memberships of the nearest neighbors are assigned based on the secondary structures at the various positions in their windows • Residues nearer the center are weighted more heavily than residues farther away

  31. Membership values of the Neighbors
      Window residues:       N      L      G      A      G      N      S
      Position weights:      0.067  0.133  0.20   0.20   0.20   0.133  0.067
      Secondary structure:   C      C      E      E      E      C      C
      H = 0
      E = 0.20×1 + 0.20×1 + 0.20×1 = 0.6
      C = 0.067×1 + 0.133×1 + 0.133×1 + 0.067×1 = 0.4
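
The arithmetic on this slide amounts to summing, per class, the position weights of the window positions that carry that class. A small sketch, with an illustrative function name:

```python
def neighbor_membership(window_states, position_weights, classes="HEC"):
    """Sum the position weights of a neighbor window per secondary structure class."""
    membership = {c: 0.0 for c in classes}
    for state, weight in zip(window_states, position_weights):
        membership[state] += weight
    return membership
```

Called with the window `"CCEEECC"` and the weights 0.067, 0.133, 0.20, 0.20, 0.20, 0.133, 0.067, it reproduces the slide's values up to floating-point rounding: H = 0, E = 0.6, C = 0.4.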

  32. Membership Value • The membership values of each residue in the classes Helix, Sheet and Coil are calculated from the corresponding neighbors using the Fuzzy K-NN algorithm • Each residue is assigned to the class in which it has the highest membership value
      Helix = . . . 15 22 61 91 95 96 26 21 23 18 29 30 24 17  5  8 . . .
      Sheet = . . . 22 28 13  1  1  2  8  8 12 11 42 44 46 29 14 10 . . .
      Coil  = . . . 63 50 26  8  4  2 65 71 65 71 29 26 31 53 81 82 . . .
      Final = . . .  C  C  H  H  H  H  C  C  C  C  E  E  E  C  C  C . . .
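
The class assignment is a per-residue argmax over the three membership tracks. The sketch below uses the numbers from the slide; the function name is illustrative, and breaking ties in the order H, E, C is an assumption, since the slide does not say how ties are resolved.

```python
def assign_classes(helix, sheet, coil):
    """Assign each residue to the class with the highest membership value."""
    labels = []
    for h, e, c in zip(helix, sheet, coil):
        # max() keeps the first maximum, so ties resolve in the order H, E, C.
        best = max((("H", h), ("E", e), ("C", c)), key=lambda pair: pair[1])
        labels.append(best[0])
    return "".join(labels)


helix = [15, 22, 61, 91, 95, 96, 26, 21, 23, 18, 29, 30, 24, 17, 5, 8]
sheet = [22, 28, 13, 1, 1, 2, 8, 8, 12, 11, 42, 44, 46, 29, 14, 10]
coil = [63, 50, 26, 8, 4, 2, 65, 71, 65, 71, 29, 26, 31, 53, 81, 82]
final = assign_classes(helix, sheet, coil)
```

Here `final` equals `"CCHHHHCCCCEEECCC"`, the "Final" row of the slide.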

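  33. Fuzzy K-Nearest Neighbor Algorithm
      BEGIN
        Initialize i = 1.
        DO UNTIL (r is assigned membership in all classes)
          Compute $u_i(r)$ using $u_i(r) = \frac{\sum_{j=1}^{K} u_{ij}\, d(r, r_j)^{-2/(m-1)}}{\sum_{j=1}^{K} d(r, r_j)^{-2/(m-1)}}$
          Increment i.
        END DO UNTIL
      END
      Where $u_i(r)$ = membership value of residue r in class i, i = Helix, Sheet or Coil; $u_{ij}$ = membership value of the jth neighbor in class i; $d(r, r_j)$ = distance between the query window centered on residue r and its jth neighbor; K = number of neighbors; m = 2 (fuzzifier)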
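
A minimal sketch of the membership computation, assuming the standard fuzzy K-NN formula of Keller et al. (1985), in which each neighbor's class membership is weighted by its inverse distance raised to the power 2/(m-1). The data layout, the function name and the small `eps` guard against zero distances are assumptions added for the example.

```python
def fuzzy_knn_membership(neighbors, m=2.0, eps=1e-9):
    """Compute u_i(r) for every class i from the K nearest neighbors.

    `neighbors` is a list of (distance, memberships) pairs, where `memberships`
    maps each class to the neighbor's membership value u_ij.
    """
    exponent = 2.0 / (m - 1.0)
    # Inverse-distance weights; eps avoids division by zero for exact matches.
    weights = [1.0 / max(d, eps) ** exponent for d, _ in neighbors]
    total = sum(weights)
    classes = neighbors[0][1].keys()
    return {
        c: sum(w * u[c] for w, (_, u) in zip(weights, neighbors)) / total
        for c in classes
    }
```

With m = 2 the weights are inverse squared distances, so a neighbor at distance 1 counts four times as much as one at distance 2.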

  34. Structure Filtration • In the basic setting, the secondary structure state is the class with the highest membership value • Unrealistic structures may be present • Popular methods of structure filtration • Neural Network • Heuristic based

  35. Heuristic Filter • Smooth the membership values • Filter unrealistic structures • Helix > 3 amino acids, β-sheet > 2 amino acids • Calculate the thresholds to filter noise • Mark the possible Helix and Sheet regions • Resolve conflicts based on average membership value in overlap region • Fill the rest of the structure with Coil
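
The slide does not spell out the smoothing, thresholds or conflict resolution, so the sketch below illustrates only the minimum-length rule (helix longer than 3 residues, sheet longer than 2), replacing shorter runs with coil. It is a simplified stand-in under those assumptions, not the filter used in the presentation.

```python
from itertools import groupby

# Minimum run lengths implied by "Helix > 3" and "sheet > 2" on the slide.
MIN_LENGTH = {"H": 4, "E": 3}


def remove_short_segments(states: str) -> str:
    """Replace helix or sheet runs shorter than the minimum length with coil."""
    out = []
    for state, run in groupby(states):
        length = len(list(run))
        if length < MIN_LENGTH.get(state, 1):
            out.append("C" * length)
        else:
            out.append(state * length)
    return "".join(out)
```

For instance, `remove_short_segments("CCHHCCEEECCHHHHEE")` returns `"CCCCCCEEECCHHHHCC"`: the two-residue helix and the two-residue sheet are turned into coil.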

  36. Filter: Final Structure
      Unfiltered: CCCCCHCCCCCHHHHHHHHCCCCCCEEEEECCCCCCCCCCCCCEEEEEECCCCCCHHHCCCCC
      Target:     CCCHHHCCCCHHHHHHHHHHHCCCCEEEEEECCCCEECCCCCCEEEEEEECCCCEECCCCEEC
      Filtered:   CCHHHHCCCHHHHHHHHHHHHHCCCEEEEEECCCCCCCCCCCCEEEEEEECCCCCCCCCCCCC

  37. Metrics • Seven commonly used metrics • $Q_3 = \frac{\text{number of correctly predicted residues}}{\text{total number of residues}} \times 100$ • $Q_s = \frac{\text{number of residues in state } s \text{ correctly predicted}}{\text{total number of residues in state } s} \times 100$, for s = H (helix), E (sheet), C (coil) • Matthews Correlation Coefficient, for each state s: $MCC_s = \frac{p\,n - u\,o}{\sqrt{(p+u)(p+o)(n+u)(n+o)}}$ where p = true positives, n = true negatives, u = false negatives, o = false positives
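
The seven metrics (Q3, one Q value per state, one MCC per state) can be computed from a predicted and an observed three-state string as sketched below. The function names are illustrative, and the returned values for empty denominators (NaN for Q, 0 for MCC) are a choice made for the example.

```python
from math import sqrt


def q3(predicted: str, observed: str) -> float:
    """Percentage of residues whose state is predicted correctly."""
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


def q_state(predicted: str, observed: str, state: str) -> float:
    """Percentage of observed `state` residues that are predicted as `state`."""
    total = sum(o == state for o in observed)
    correct = sum(p == o == state for p, o in zip(predicted, observed))
    return 100.0 * correct / total if total else float("nan")


def mcc(predicted: str, observed: str, state: str) -> float:
    """Matthews correlation coefficient for one state against the other two."""
    pairs = list(zip(predicted, observed))
    tp = sum(pr == state and ob == state for pr, ob in pairs)  # p on the slide
    tn = sum(pr != state and ob != state for pr, ob in pairs)  # n on the slide
    fn = sum(pr != state and ob == state for pr, ob in pairs)  # u on the slide
    fp = sum(pr == state and ob != state for pr, ob in pairs)  # o on the slide
    denom = sqrt((tp + fn) * (tp + fp) * (tn + fn) * (tn + fp))
    return (tp * tn - fn * fp) / denom if denom else 0.0
```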

  38. Results • Performance on a database of 1973 proteins (<25% sequence identity) generated by the PISCES server [1] • [1] G. Wang and R. L. Dunbrack, Jr. PISCES: a protein sequence culling server. Bioinformatics, 19:1589-1591, 2003.

  39. Relative Performance • X. Zhang, J. P. Mesirov and D. L. Waltz. Hybrid System for Protein Secondary Structure Prediction. J. Mol. Biol., 225:1049-1063, 1992 • T.-M. Yi and E. S. Lander. Protein Secondary Structure Prediction using Nearest-Neighbor Methods. J. Mol. Biol., 232:1117-1129, 1993 • A. A. Salamov and V. V. Solovyev. Prediction of Protein Secondary Structure by Combining Nearest-Neighbor Algorithms and Multiple Sequence Alignments. J. Mol. Biol., 247:11-15, 1995

  40. Summary • A novel approach for PSSP • Evolutionary information • K-Nearest Neighbor algorithm • Fuzzy set theory • Most accurate KNN approach to date • Easily expandable • Accuracy increases with new structures • Average computing time < 1 min on a single CPU machine

  41. Future Work • System with faster search capabilities • Efficient search for neighbors • Accurate prediction system

  42. Acknowledgements • Dr. James Keller for insight into the Fuzzy K-Nearest Neighbor Algorithm • Oak Ridge National Laboratory for providing the supercomputing facilities • Members of Digital Biology Laboratory for their support

  43. Software The enhanced version of the software is coded in C and is available upon request. Please e-mail your requests to Raj@mizzou.edu or XuDong@missouri.edu

  44. Thank you for Participation!
