TXTpred: A New Method for Protein Secondary Structure Prediction



Presentation Transcript


  1. TXTpred: A New Method for Protein Secondary Structure Prediction • Yan Liu, Jaime Carbonell, Judith Klein-Seetharaman • School of Computer Science, Carnegie Mellon University • May 14, 2003 • Biological Language Modeling Project

  2. Roadmap • Overview of secondary structure prediction • Description of the TXTpred method • Experimental results and analysis • Discussion and further work

  3. Secondary Structure of a Protein Sequence • Dictionary of Secondary Structure of Proteins (DSSP) annotates each residue with its structure • Based on hydrogen bonding patterns and geometrical constraints • 7 DSSP labels for protein secondary structure (PSS): • Helix types: H (alpha-helix), G (3/10 helix) • Sheet types: B (isolated beta-bridge), E (strand) • Coil types: T, _, S (coil)
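
The grouping of DSSP labels on this slide can be sketched as a lookup table. This is a minimal illustration, not the authors' code: the three output letters H/E/C and the function name `reduce_dssp` are assumptions chosen to match the task definition that follows.

```python
# Reduction of the 7 DSSP labels listed on the slide to 3 states.
# The grouping follows the slide; the output letters H/E/C are an assumed convention.
DSSP_TO_3STATE = {
    "H": "H", "G": "H",            # helix types
    "B": "E", "E": "E",            # sheet types
    "T": "C", "_": "C", "S": "C",  # coil types
}


def reduce_dssp(labels: str) -> str:
    """Map a string of DSSP labels to the 3-state alphabet H/E/C."""
    return "".join(DSSP_TO_3STATE[label] for label in labels)


print(reduce_dssp("__EEEETTHHHGGGS_"))  # prints CCEEEECCHHHHHHCC
```

A label outside the seven listed raises `KeyError`, which makes unexpected annotations visible instead of silently treating them as coil.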

  4. Secondary Structure of a Protein Sequence • Accuracy Limit ~ 88%

  5. Task Definition • Given a protein sequence: • APAFSVSPASGA • Predict its secondary structure sequence: • CCEEEEECCCCC • Focus on soluble proteins, not membrane proteins

  6. Overview of Previous Work - 1 • 1st-generation methods • Calculate propensities for each amino acid • E.g. Chou-Fasman method (Chou & Fasman, 1974) • 2nd-generation methods • “Window” concept • APAFSVSPAS (window size = 7) • Calculate propensities for segments of 3-51 amino acids • E.g. GOR method (Garnier et al., 1978)
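
As a rough illustration of what "calculate propensities for each amino acid" means, the sketch below uses the common definition of a propensity as the frequency of a residue within one structural state divided by its overall frequency. The function name `propensity` is hypothetical, and the single sequence from the task definition slide is toy data only; this is not the published Chou-Fasman parameter set.

```python
from collections import Counter


def propensity(sequences, structures, residue, state):
    """Estimate P(residue | state) / P(residue) from labeled sequences."""
    pair_counts = Counter()     # (residue, state) co-occurrences
    residue_counts = Counter()  # residue totals
    state_counts = Counter()    # state totals
    for seq, ss in zip(sequences, structures):
        for aa, label in zip(seq, ss):
            pair_counts[(aa, label)] += 1
            residue_counts[aa] += 1
            state_counts[label] += 1
    total = sum(residue_counts.values())
    in_state = pair_counts[(residue, state)] / state_counts[state]
    overall = residue_counts[residue] / total
    return in_state / overall


# Toy data: the example pair from the task definition slide.
p = propensity(["APAFSVSPASGA"], ["CCEEEEECCCCC"], "S", "E")
print(round(p, 2))  # prints 1.6
```

A value above 1 means the residue is over-represented in that state. The function assumes both the residue and the state occur at least once in the data; otherwise it divides by zero.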

  7. Overview of Previous Work - 2 • 3rd-generation methods • Use evolutionary information from multiple sequence alignments • p-value cut-off = 10^-2 • PHD: neural network & sequence features only (Rost & Sander, 1993) • DSC: LDA & biological features: GOR, hydrophobicity etc. (King & Sternberg, 1996) • Later refinements • Apply divergent sequence alignment: e.g. PROF (Ouali & King, 2000) • Combine results of different systems: e.g. Jpred (Cuff & Barton, 1999) • Bayesian segmentation (Schmidler et al., 1999)

  8. Summary of Performance

  9. Disadvantages of Previous Work • Most are “black box” predictors • Weak biological meaning • Little focus on long-range interactions • Mostly focused on local information • Performance is asymptotically bounded

  10. Roadmap • Overview of secondary structure prediction • Description of the TXTpred method • Experimental results and analysis • Discussion and further work

  11. TXTpred • Basic idea: • Build a meaningful biological vocabulary • Apply language techniques for prediction • Major challenge: • How to build the vocabulary? • Context-free N-grams of amino acids inside the window • Sequence: APAFSVSPAS (window = 7) • N-grams: P, A, ..., P, PA, AF, ..., SP, PAF, AFS, ..., VSP
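
The context-free N-gram example on this slide can be reproduced with a few lines of code. This is a minimal sketch, assuming the window is centered on the fifth residue (index 4) of the example sequence; the function name `window_ngrams` and its parameters are hypothetical.

```python
def window_ngrams(sequence, center, window=7, max_n=3):
    """Return all 1..max_n-grams found inside the window around `center`."""
    half = window // 2
    # Clip the window at the sequence boundaries.
    segment = sequence[max(0, center - half): center + half + 1]
    return [
        segment[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(segment) - n + 1)
    ]


grams = window_ngrams("APAFSVSPAS", center=4)
print(grams[:7])    # one-grams:   ['P', 'A', 'F', 'S', 'V', 'S', 'P']
print(grams[7:13])  # two-grams:   ['PA', 'AF', 'FS', 'SV', 'VS', 'SP']
print(grams[13:])   # three-grams: ['PAF', 'AFS', 'FSV', 'SVS', 'VSP']
```

These N-grams carry no position information, so the two S residues in the window are indistinguishable.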

  12. Biological Vocabulary • Context-sensitive vocabulary • Analogy • The same word might have different meanings: e.g. “bank” • The same amino acid might have different properties: APAFSVSPAS • Encode context semantics into the N-gram • Record the position information in the N-gram • Example: APAFSVSPAS (window size = 7) • Words: P-3, A-2, F-1, S+0, V+1, S+2, P+3
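
Recording position information can be sketched by tagging each residue in the window with its signed offset from the central residue. This is an illustration under the assumption that the window is centered on index 4 of the example sequence; the function name `position_words` is hypothetical.

```python
def position_words(sequence, center, window=7):
    """Tag each residue in the window with its signed offset from `center`."""
    half = window // 2
    words = []
    for offset in range(-half, half + 1):
        index = center + offset
        if 0 <= index < len(sequence):  # skip positions past either end
            words.append(f"{sequence[index]}{offset:+d}")
    return words


print(position_words("APAFSVSPAS", center=4))
# prints ['P-3', 'A-2', 'F-1', 'S+0', 'V+1', 'S+2', 'P+3']
```

With the offset attached, the central S and the S two positions downstream become distinct vocabulary entries.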

  13. Text Classification • Text classification • Analogy • The topic of a document is expressed by the words of the document • The structure of one residue can be inferred from the biological words nearby • High accuracy • Text classification technique • Doc to vectors: • Classifiers: Support Vector Machines
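
The "doc to vectors" step can be sketched as a bag-of-words count vector, where each residue's window of biological words plays the role of a document. The slide does not state the term weighting, so raw counts are an assumption here, and the helper names `build_vocabulary` and `to_vector` are hypothetical. The resulting vectors would then be passed to a classifier such as an SVM, which is not shown.

```python
from collections import Counter


def build_vocabulary(documents):
    """Assign a column index to every distinct word across all documents."""
    words = sorted({word for doc in documents for word in doc})
    return {word: index for index, word in enumerate(words)}


def to_vector(doc, vocabulary):
    """Convert one document (a list of words) to a raw count vector."""
    vector = [0] * len(vocabulary)
    for word, count in Counter(doc).items():
        if word in vocabulary:  # ignore words unseen when the vocabulary was built
            vector[vocabulary[word]] = count
    return vector


# Two toy "documents": the words around two different central residues.
docs = [
    ["P-3", "A-2", "F-1", "S+0", "V+1", "S+2", "P+3"],
    ["A-3", "F-2", "S-1", "V+0", "S+1", "P+2", "A+3"],
]
vocab = build_vocabulary(docs)
vectors = [to_vector(doc, vocab) for doc in docs]
print(len(vocab), vectors[0])
```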

  14. TXTpred Method • Settings: • Window = 17 • One-gram, two-gram • Feature Num = 3000

  15. Evaluation Measures • Q3 (accuracy) • Precision, Recall • Segment Overlap quantity (SOV) • Matthews correlation coefficient
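
Most of these measures can be computed directly from a pair of label strings. The sketch below covers Q3 and the per-state precision, recall, and Matthews correlation coefficient using their standard definitions; SOV is omitted because its segment-based definition is more involved. The function names and the toy prediction are illustrative assumptions, not results from the talk.

```python
import math


def q3(true, pred):
    """Fraction of residues whose 3-state label is predicted correctly."""
    return sum(t == p for t, p in zip(true, pred)) / len(true)


def class_metrics(true, pred, state):
    """Precision, recall, and Matthews correlation for one state."""
    pairs = list(zip(true, pred))
    tp = sum(t == state and p == state for t, p in pairs)
    fp = sum(t != state and p == state for t, p in pairs)
    fn = sum(t == state and p != state for t, p in pairs)
    tn = sum(t != state and p != state for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return precision, recall, mcc


true = "CCEEEEECCCCC"  # reference labels from the task definition slide
pred = "CCCEEEECCCCC"  # hypothetical prediction with one strand residue missed
print(round(q3(true, pred), 3))
print([round(value, 3) for value in class_metrics(true, pred, "E")])
```

Undefined ratios (for example, precision when a state is never predicted) are reported as 0.0 here, which is one convention among several.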

  16. Experimental Results • RS126 dataset • CB513 dataset

  17. Biological Language Properties: Power Law? • Term Frequency = f(Rank) • One-gram • Two-gram
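
One way to probe the power-law question is to sort term frequencies by rank and fit a straight line in log-log space; a slope near -1 would resemble Zipf's law in natural language. This is a sketch under that assumption, with hypothetical function names and synthetic counts rather than data from the talk.

```python
import math
from collections import Counter


def rank_frequency(words):
    """Return (rank, frequency) pairs, most frequent term first."""
    frequencies = sorted(Counter(words).values(), reverse=True)
    return list(enumerate(frequencies, start=1))


def loglog_slope(rank_freq):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(rank) for rank, _ in rank_freq]
    ys = [math.log(freq) for _, freq in rank_freq]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


# Synthetic corpus whose counts follow 60 / rank exactly.
corpus = ["a"] * 60 + ["b"] * 30 + ["c"] * 20 + ["d"] * 15 + ["e"] * 12
pairs = rank_frequency(corpus)
print(pairs)
print(round(loglog_slope(pairs), 3))  # prints -1.0
```

The fit needs at least two distinct terms; with a single term the denominator is zero.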

  18. Sequence Analysis – 1: Feature Selection • Top ten discriminating features for helix • Verification by Chou-Fasman parameters • Helix favors A, E, M, L, K (top 5 amino acids) • Disfavors P (top 1 amino acid)

  19. Sequence Analysis – 1: Feature Selection • Top ten discriminating features for sheet • Verification by Chou-Fasman parameters • Sheet favors V, I, Y, F, W (top 5 amino acids) • Disfavors D, E (top 2 amino acids)

  20. Sequence Analysis – 1: Feature Selection • Top ten discriminating features for coil • Verification by Chou-Fasman parameters • Coil favors N, P, G, D, S (top 5 amino acids) • Disfavors V, I, L (top 3 amino acids)

  21. Sequence Analysis – 2: Word Correlation • Word correlation • Some words are strongly correlated and co-occur frequently • Technique: Singular Value Decomposition • Examples from texts • Phrases: {president, Bush} • Semantically correlated: {Olympic, sports}

  22. Sequence Analysis – 2: Word Correlation • Top ten correlated word pairs

  23. Sequence Analysis – 2: Word Correlation

  24. Conclusion • TXTpred Summary • Context sensitive biological vocabulary • Novel application of text classification to secondary structure prediction • Comparable performance for secondary structure prediction • Analysis provides reasonable biological meanings and structure indicators

  25. Future Work • Deeper study on extracting a more meaningful biological vocabulary • Further discovery of new features, such as torsion angle and free energy • Advanced learning models to consider long-range interactions • Conditional random fields, maximum entropy Markov models

  26. Acknowledgement • Vanathi Gopalakrishnan, UPitt • Ivet Bahar, UPitt

  27. Motivation for 2-D prediction • Basis for three-dimensional structure prediction • Improving other sequence and structure analysis • Sequence alignment • Threading and homology modeling • Experimental data • Protein design
