TXTpred: A New Method for Protein Secondary Structure Prediction

Yan Liu, Jaime Carbonell, Judith Klein-Seetharaman
School of Computer Science, Carnegie Mellon University
May 14, 2003
Biological Language Modeling Project

Roadmap
• Overview of secondary structure prediction
• Description of TXTpred method
• Experiment results and analysis
• Discussion and further work

Secondary Structure of a Protein Sequence
• Dictionary of Secondary Structure of Proteins (DSSP) annotates each residue with its structure
  • Based on hydrogen bonding patterns and geometrical constraints
• 7 DSSP labels for protein secondary structure (PSS):
  • Helix types: H (alpha-helix), G (3/10 helix)
  • Sheet types: B (isolated beta-bridge), E (strand)
  • Coil types: T, _, S (coil)
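
The grouping of the seven labels into helix, sheet, and coil can be written as a lookup table. The following is a minimal sketch: the output letters H/E/C are an assumption (E and C appear in the Task Definition example), and the name `reduce_dssp` is hypothetical.

```python
# Grouping taken from the slide: helix (H, G), sheet (B, E), coil (T, _, S).
# The three output letters H/E/C are an assumed convention.
DSSP_TO_3STATE = {
    "H": "H", "G": "H",
    "B": "E", "E": "E",
    "T": "C", "_": "C", "S": "C",
}


def reduce_dssp(labels: str) -> str:
    """Collapse a string of 7-state DSSP labels into three states."""
    return "".join(DSSP_TO_3STATE[ch] for ch in labels)


print(reduce_dssp("HGBET_S"))  # prints "HHEECCC"
```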

Secondary Structure of a Protein Sequence
• Accuracy limit ~ 88%

Task Definition
• Given a protein sequence:
  • APAFSVSPASGA
• Predict its secondary structure sequence:
  • CCEEEEECCCCC
• Focus on soluble proteins, not on membrane proteins

Overview of Previous Work – 1
• 1st-generation methods
  • Calculate propensities for each amino acid
  • E.g. Chou-Fasman method (Chou & Fasman, 1974)
• 2nd-generation methods
  • “Window” concept
  • APAFSVSPAS (window size = 7)
  • Calculate propensities for segments of 3-51 amino acids
  • E.g. GOR method (Garnier et al., 1978)

Overview of Previous Work – 2
• 3rd-generation methods
  • Use evolutionary information from multiple sequence alignments
  • p-value cut-off = 10^-2
  • PHD: neural network, sequence features only (Rost & Sander, 1993)
  • DSC: LDA with biological features: GOR, hydrophobicity, etc. (King & Sternberg, 1996)
• Later refinements
  • Apply divergent sequence alignments: e.g. PROF (Ouali & King, 2000)
  • Combine results of different systems: e.g. Jpred (Cuff & Barton, 1999)
• Bayesian segmentation (Schmidler et al., 1999)

Summary of Performance

Disadvantages of Previous Work
• Most are “black box” predictors
  • Weak biological meaning
• Little focus on long-range interactions
  • Mostly focused on local information
• Performance is asymptotically bounded

Roadmap
• Overview of secondary structure prediction
• Description of TXTpred method
• Experiment results and analysis
• Discussion and further work

TXTpred
• Basic idea:
  • Build a meaningful biological vocabulary
  • Apply language techniques for prediction
• Major challenge:
  • How to build the vocabulary?
• Context-free N-grams of amino acids inside the window
  • Sequence: APAFSVSPAS (window = 7)
  • N-grams: P, A, ..., P; PA, AF, ..., SP; PAF, AFS, ..., VSP
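
The context-free N-gram vocabulary for one window can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name `window_ngrams` is hypothetical, the centre is given as a 0-based index, and windows that run past the ends of the sequence are simply truncated.

```python
def window_ngrams(seq, center, window=7, max_n=3):
    """List all 1- to max_n-grams inside the window centred on seq[center]."""
    half = window // 2
    start = max(0, center - half)            # assumption: truncate at the ends
    segment = seq[start:center + half + 1]
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(segment) - n + 1):
            grams.append(segment[i:i + n])
    return grams


# Window of 7 centred on the fifth residue (index 4) of the slide's example.
grams = window_ngrams("APAFSVSPAS", center=4)
print(grams[:7])   # ['P', 'A', 'F', 'S', 'V', 'S', 'P']
print(grams[7:13]) # ['PA', 'AF', 'FS', 'SV', 'VS', 'SP']
print(grams[13:])  # ['PAF', 'AFS', 'FSV', 'SVS', 'VSP']
```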

Biological Vocabulary
• Context-sensitive vocabulary
• Analogy
  • The same word might have different meanings: e.g. “bank”
  • The same amino acid might have different properties: APAFSVSPAS
• Encode context semantics into the N-gram
  • Record the position information in the N-gram
• Example: APAFSVSPAS (window size = 7)
  • Words: P-3, A-2, F-1, S+0, V+1, S+2, P+3
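
A minimal sketch of how position-tagged words could be generated, assuming each word is the residue letter followed by its signed offset from the central residue (so the three residues after the centre receive +1, +2, and +3). The function name `positional_words` and the truncation at sequence ends are assumptions.

```python
def positional_words(seq, center, window=7):
    """Tag each residue in the window with its offset from seq[center]."""
    half = window // 2
    words = []
    for offset in range(-half, half + 1):
        i = center + offset
        if 0 <= i < len(seq):                 # assumption: skip positions off the ends
            words.append(f"{seq[i]}{offset:+d}")
    return words


print(positional_words("APAFSVSPAS", center=4))
# ['P-3', 'A-2', 'F-1', 'S+0', 'V+1', 'S+2', 'P+3']
```

Tagging by offset lets the same amino acid yield different vocabulary entries depending on where it sits relative to the residue being predicted.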

Text Classification
• Text classification
• Analogy
  • The topic of a document is expressed by the words of the document
  • The structure of a residue can be inferred from the biological words nearby
• High accuracy
• Text classification technique
  • Documents to vectors
  • Classifier: Support Vector Machines
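
The document-to-vector step can be illustrated with a plain bag-of-words encoding, where each residue's window of biological words plays the role of a document. This sketch assumes raw term counts as the weighting (the slides do not state the weighting used), and the helper names are hypothetical. The resulting vectors would then be passed to a Support Vector Machine, which is not shown here.

```python
from collections import Counter


def build_vocabulary(documents):
    """Assign a column index to every distinct word, in sorted order."""
    words = sorted({w for doc in documents for w in doc})
    return {w: i for i, w in enumerate(words)}


def to_vector(doc, vocab):
    """Encode one document as a vector of raw term counts (assumed weighting)."""
    vec = [0] * len(vocab)
    for word, count in Counter(doc).items():
        if word in vocab:                     # words outside the vocabulary are ignored
            vec[vocab[word]] = count
    return vec


docs = [["P-3", "A-2", "S+0"], ["A-2", "S+0", "V+1"]]
vocab = build_vocabulary(docs)
print(vocab)                     # {'A-2': 0, 'P-3': 1, 'S+0': 2, 'V+1': 3}
print(to_vector(docs[0], vocab)) # [1, 1, 1, 0]
print(to_vector(docs[1], vocab)) # [1, 0, 1, 1]
```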

TXTpred Method
• Settings:
  • Window = 17
  • One-gram, two-gram
  • Number of features = 3000

Evaluation Measures
• Q3 (accuracy)
• Precision, recall
• Segment overlap quantity (SOV)
• Matthews correlation coefficient
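
Three of these measures follow directly from per-residue counts. The sketch below uses the standard definitions: Q3 is the fraction of residues whose three-state label is predicted correctly, and precision, recall, and the Matthews correlation coefficient are computed per class from true/false positives and negatives. The function names are hypothetical, returning 0.0 for an empty denominator is an assumed convention, and SOV is omitted because it requires segment-level bookkeeping.

```python
import math


def q3(true, pred):
    """Fraction of residues whose predicted label matches the true label."""
    if len(true) != len(pred):
        raise ValueError("label strings must have equal length")
    return sum(t == p for t, p in zip(true, pred)) / len(true)


def per_class_stats(true, pred, label):
    """Return (precision, recall, MCC) for one class, e.g. 'E'."""
    pairs = list(zip(true, pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    tn = sum(t != label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return precision, recall, mcc


true = "CCEEEEECCCCC"   # labels from the Task Definition example
pred = "CCEEEECCCCCC"   # hypothetical prediction with one strand residue missed
print(q3(true, pred))
print(per_class_stats(true, pred, "E"))
```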

Experimental Results
• RS126 dataset
• CB513 dataset

Biological Language Properties: Power Law?
• Term frequency = f(rank)
• One-gram
• Two-gram
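
A rank-frequency table of the kind needed to inspect power-law behaviour can be built by counting N-grams and sorting by frequency. This is a sketch with a hypothetical function name and toy input, not the data behind the slide. Plotting log(frequency) against log(rank) and checking for an approximately straight line is the usual informal test.

```python
from collections import Counter


def rank_frequency(sequences, n=1):
    """Return (rank, n-gram, frequency) tuples, most frequent first."""
    counts = Counter(
        seq[i:i + n]
        for seq in sequences
        for i in range(len(seq) - n + 1)
    )
    return [
        (rank, gram, freq)
        for rank, (gram, freq) in enumerate(counts.most_common(), start=1)
    ]


toy = ["AAB", "AB"]                 # toy sequences for illustration only
print(rank_frequency(toy, n=1))     # [(1, 'A', 3), (2, 'B', 2)]
print(rank_frequency(toy, n=2))     # [(1, 'AB', 2), (2, 'AA', 1)]
```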

Sequence Analysis – 1: Feature Selection
• Top ten discriminating features for helix
• Verification by Chou-Fasman parameters
  • Helix favors A, E, M, L, K (top 5 amino acids)
  • Helix disfavors P (top 1 amino acid)

Sequence Analysis – 1: Feature Selection
• Top ten discriminating features for sheet
• Verification by Chou-Fasman parameters
  • Sheet favors V, I, Y, F, W (top 5 amino acids)
  • Sheet disfavors D, E (top 2 amino acids)

Sequence Analysis – 1: Feature Selection
• Top ten discriminating features for coil
• Verification by Chou-Fasman parameters
  • Coil favors N, P, G, D, S (top 5 amino acids)
  • Coil disfavors V, I, L (top 3 amino acids)

Sequence Analysis – 2: Word Correlation
• Word correlation
  • Some words are strongly correlated and co-occur frequently
  • Technique: Singular Value Decomposition
• Examples from texts
  • Phrases: {president, Bush}
  • Semantically correlated: {Olympic, sports}

Sequence Analysis – 2: Word Correlation
• Top ten correlated word pairs

Sequence Analysis – 2: Word Correlation

Conclusion
• TXTpred summary
  • Context-sensitive biological vocabulary
  • Novel application of text classification to secondary structure prediction
  • Comparable performance for secondary structure prediction
  • Analysis provides reasonable biological meanings and structure indicators

Future Work
• Deeper study on extracting a more meaningful biological vocabulary
• Further discovery of new features, such as torsion angle and free energy
• Advanced learning models to consider long-range interactions
  • Conditional random fields, maximum entropy Markov models

Acknowledgements
• Vanathi Gopalakrishnan, UPitt
• Ivet Bahar, UPitt

Motivation for 2-D prediction
• Basis for three-dimensional structure prediction
• Improving other sequence and structure analysis
  • Sequence alignment
  • Threading and homology modeling
• Experimental data
• Protein design
