Protein Secondary Structure Prediction: A New Improved Knowledge-Based Method

Wen-Lian Hsu • Institute of Information Science, Academia Sinica, Taiwan

Outline • Introduction • PSSP • Motivation • Knowledge-Based Method • PROSP • An Improved Hybrid Method • PROSP II • HYPROSP II+ • Conclusion

Protein Structures • Primary sequence: MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE • Secondary structures: loops, helices, strands • Tertiary structures: three-dimensional packing of secondary structures

Introduction to PSSP • Protein Secondary Structure Prediction (PSSP) aims to predict a protein's secondary structure from its sequence alone. • Each amino acid is assigned a secondary structure element (SSE): Helix (H), Strand (E) or Coil (C or L).

Motivation • PSSP plays an important role in tertiary structure prediction. • Fischer (1996) improved tertiary structure prediction accuracy from 59.0 to 71.0 by using PHD to predict SSEs. • In Yang's 2003 work, tertiary structure prediction accuracy was improved from 71.9 to 79.0 by using PSIPRED to predict SSEs. • Predicted SSEs can also be used as features in other prediction algorithms to improve performance.

Treat PSSP as a Translation Problem • Secondary structure prediction translates between two languages • Source: a language with a 20-letter alphabet (amino acids) • Target: a language with a 3-letter alphabet (H, E, C)

Treating Genomic/Proteomic Sequences as a Language • For proteomic data: amino acid → motif → protein corresponds to alphabet → word → sentence → paragraph • Protein structure or function corresponds to sentence meaning • Finding the interrelationships of data • Data Mining, Knowledge Discovery

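Speech Recognition: Example (Sense Disambiguation in English) • Selection of homonyms (or senses) in speech recognition • Mandarin example: the spoken sentence 台北市一位小孩走失了 ("A child went missing in Taipei City") • Competing homophone candidates: 台北市 (Taipei City), 小孩 (child), 台北 (Taipei), 適宜 (suitable), 走失 (go missing), 事宜 (matters), 一位 (one person), 一味 (blindly), 移位 (shift position)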

How do we represent the context in a protein sequence (or sentence)? • Using motifs as words? • Motifs could be too specific and may not provide enough coverage • What about using k-mers? • Can build (k-mer, structure) pairs • How many k-mers can we get? • How do we define similar k-mers (in context)? • How do we combine the structural information from the k-mers?
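A minimal sketch of the (k-mer, structure) pairing asked about above. The structure labels in the example are made up for illustration, and `hamming_similar` is a hypothetical stand-in for a similarity test; the slides rely on PSI-BLAST for similarity, not on mismatch counting.

```python
def kmer_structure_pairs(sequence, structure, k):
    """Yield (k-mer, structure segment) pairs from two aligned strings."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have equal length")
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k], structure[i:i + k]


def hamming_similar(a, b, max_mismatches=1):
    """Hypothetical similarity test: equal length, few differing positions."""
    if len(a) != len(b):
        return False
    return sum(x != y for x, y in zip(a, b)) <= max_mismatches


def possible_kmers(k, alphabet_size=20):
    """Upper bound on distinct k-mers over the amino acid alphabet."""
    return alphabet_size ** k


seq = "MTYKLILNGK"
sse = "CEEEEEECCC"  # illustrative labels only, not real DSSP output
pairs = list(kmer_structure_pairs(seq, sse, k=5))
print(pairs[0])
print(possible_kmers(5), possible_kmers(7))
```

The count grows as 20^k, which is why exact k-mer lookup gives poor coverage for longer windows and some notion of "similar" k-mers is needed.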

PROSP: Our knowledge-based method for PSSP • Construct a peptide Sequence-Structure Knowledge Base (SSKB) • Use PSI-BLAST to find all peptides similar to those of the target protein • Use the similar peptides found in the SSKB to vote for the dominant structure of each amino acid in the target protein.
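The voting step can be sketched as follows. This is an illustration under assumptions, not the PROSP implementation: each match is taken to be a (start position, structure string) pair already aligned to the target, every match counts as one vote, ties are broken in the order H, E, C, and positions with no votes default to coil.

```python
from collections import Counter


def vote_structure(target_length, matches):
    """Predict one SSE label per target position by majority vote.

    matches: iterable of (start, structure_string) pairs, where start is the
    0-based target position aligned to the first label of structure_string.
    """
    votes = [Counter() for _ in range(target_length)]
    for start, structure in matches:
        for offset, label in enumerate(structure):
            pos = start + offset
            if 0 <= pos < target_length:
                votes[pos][label] += 1

    prediction = []
    for counter in votes:
        if counter:
            # max() returns the first maximal item, so ties resolve as H, E, C.
            prediction.append(max("HEC", key=lambda s: counter[s]))
        else:
            prediction.append("C")  # assumption: uncovered positions are coil
    return "".join(prediction), votes


matches = [(0, "HHHHH"), (1, "HHHCC"), (3, "CCCCC")]
predicted, votes = vote_structure(8, matches)
print(predicted)
```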

Using PSI-BLAST to Amplify the Effect of the DSSP Database (create more synonyms) • The number of peptide words is still small (~5 million) • Identify similar peptides • For each protein p in the NR database, apply PSI-BLAST to find its HSPs (high-scoring segment pairs). • HSP: an alignment of a subsequence of protein p with a subsequence of another protein q of unknown structure • Assign the structure of “selected” peptides of p to those of q • These peptides comprise our dictionary (~100 million)
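A sketch of how structure might be transferred along one HSP. The slides do not say how peptides are "selected", so this example makes its own assumptions: labels are copied only at columns where both rows hold a residue, any gap ends the current stretch, and fixed-length windows are then cut from each ungapped stretch. The function names and the alignment are hypothetical.

```python
def transfer_structure(aligned_p, aligned_q, structure_p):
    """Copy SSE labels from p to q along one alignment.

    aligned_p, aligned_q: equal-length alignment rows, '-' marks a gap.
    structure_p: SSE labels for the ungapped residues of aligned_p.
    Returns a list of ungapped (q_peptide, structure) stretches.
    """
    if len(aligned_p) != len(aligned_q):
        raise ValueError("alignment rows must have equal length")
    stretches, pep, sse = [], [], []
    p_index = 0  # index into structure_p (ungapped coordinates of p)
    for a, b in zip(aligned_p, aligned_q):
        if a != "-" and b != "-":
            pep.append(b)
            sse.append(structure_p[p_index])
        elif pep:
            stretches.append(("".join(pep), "".join(sse)))
            pep, sse = [], []
        if a != "-":
            p_index += 1
    if pep:
        stretches.append(("".join(pep), "".join(sse)))
    return stretches


def cut_windows(stretches, w):
    """Cut every length-w (peptide, structure) window from each stretch."""
    return [(pep[i:i + w], sse[i:i + w])
            for pep, sse in stretches
            for i in range(len(pep) - w + 1)]


stretches = transfer_structure("KTYQ-CQYA", "KPYQACQ-A", "HHHHHHCC")
records = cut_windows(stretches, 3)
print(stretches)
print(records)
```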

SSKB construction (synonyms) • [Figure: an example of a high-scoring segment pair (HSP) from a PSI-BLAST search result, aligning a peptide of known structure with a peptide of unknown structure]

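Prediction at a position x • [Figure: PSI-BLAST matches found in the SSKB contribute structure votes at position x, e.g. H H H C E C] • Voting scores: H(x), E(x), C(x) • In the example, x is assigned as helix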

Two problems in searching for homologous peptides in protein sequence databases • Redundant information generated by duplicate peptides: the voting bias problem in PROSP • Poor prediction accuracy due to insufficient knowledge-base matching: need to boost coverage

The voting bias problem • Query: KTYQCQY… • PSI-BLAST results (subjects) found in the SSKB: KPYQCQY HHHHHH (5 duplicate copies), KVYQCQY CCHHHC, QPYRCKY CCHHHC • The duplicated peptide dominates the result

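Clustering HSPs • Query protein: …MYKKILYPTDFSETAEIALK… • Similar HSPs returned for the region include duplicates: MYSKILL (2 copies), MYKKIYL (3 copies), MYSSILY (2 copies) • Duplicates are clustered into single representatives (e.g. MYSKILL, MYKKIYL)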
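The effect of duplicates on voting can be illustrated with the peptides from the voting bias slide. This sketch collapses only exact duplicate (peptide, structure) records, which is an assumption; the clustering used in the actual system may also merge similar, non-identical HSPs.

```python
from collections import Counter


def column_votes(records, position):
    """Count SSE labels that the records contribute at one position."""
    return Counter(structure[position] for _, structure in records)


def collapse_duplicates(records):
    """Keep one copy of each identical (peptide, structure) record."""
    return list(dict.fromkeys(records))  # preserves first-seen order


hits = [("KPYQCQY", "HHHHHH")] * 5 + [
    ("KVYQCQY", "CCHHHC"),
    ("QPYRCKY", "CCHHHC"),
]

before = column_votes(hits, 0)
after = column_votes(collapse_duplicates(hits), 0)
print(before)  # the repeated peptide outvotes the two distinct ones
print(after)   # each distinct peptide now counts once
```

With the raw hits the first position is voted helix 5 to 2; after collapsing duplicates it is voted coil 2 to 1.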

Measuring the amount of structural information • Low local match rate (LMR) of HSPs: there is no information from SSKB7 for this region • [Figure legend: Found / Unfound]
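The slides do not define the local match rate, so the following is only one plausible reading, stated as an assumption: for each position, the fraction of positions in a surrounding window that have at least one match in the knowledge base. The `half_window` parameter is hypothetical.

```python
def local_match_rate(covered, half_window=3):
    """Per-position fraction of covered residues in a centered window.

    covered: sequence of booleans, True where at least one SSKB peptide
    matched that target position (the "Found" positions).
    """
    n = len(covered)
    rates = []
    for x in range(n):
        lo = max(0, x - half_window)
        hi = min(n, x + half_window + 1)  # window is truncated at the ends
        window = covered[lo:hi]
        rates.append(sum(window) / len(window))
    return rates


covered = [True, True, False, False, False, True]
rates = local_match_rate(covered, half_window=1)
print([round(r, 2) for r in rates])
```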

Construct SSKB with different lengths (to boost coverage) • Training protein → PSI-BLAST search → HSPs → SSKB construction (window length = 5) → SSKB5 • Training protein → PSI-BLAST search → HSPs → SSKB construction (window length = 7) → SSKB7

Boost match rate using peptide records of different lengths • Protein: MYKKILYPTDFSETAEIALK… • Per-position H/E/C vote tables, in slide order:
HSPs from SSKB7 (SSKB window length = 7)
H 1 2 1 3 6 7 8…
E 1 2 2 0 0 0 1…
C 2 3 8 8 5 4 2…
HSPs from SSKB5 (SSKB window length = 5)
H 1 3 2 5 5 5 2…
E 1 3 2 0 0 0 1…
C 2 4 7 7 6 6 7…

NEW PROSP system • Protein: MYKKILYPTDFSETAEIALK… • Per-position H/E/C vote tables, in slide order:
SSKB window length = 7
H 1 2 1 3 6 7 8…
E 1 2 2 0 0 0 1…
C 2 3 8 8 5 4 2…
SSKB window length = 5
H 1 3 2 5 5 5 2…
E 1 3 2 0 0 0 1…
C 2 4 7 7 6 6 7…
NEW PROSP system
H 1 3 2 5 7 6 7…
E 1 3 2 0 0 0 1…
C 2 4 8 8 4 5 6…
Combination rule:
H_PROSPII(x) ← LMR7mer(x) × H7(x) + (1 − LMR7mer(x)) × H5(x)
E_PROSPII(x) ← LMR7mer(x) × E7(x) + (1 − LMR7mer(x)) × E5(x)
C_PROSPII(x) ← LMR7mer(x) × C7(x) + (1 − LMR7mer(x)) × C5(x)
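The combination rule above is a per-position weighted average of the SSKB7 and SSKB5 scores. A minimal sketch, assuming LMR7mer(x) is expressed as a fraction in [0, 1] (the figure axis elsewhere shows it as a percentage) and using made-up scores:

```python
def combine_scores(lmr7, scores7, scores5):
    """Blend SSKB7 and SSKB5 scores at one position.

    lmr7: local match rate of 7-mers at this position, as a fraction in [0, 1].
    scores7, scores5: dicts mapping 'H', 'E', 'C' to the voting scores.
    """
    if not 0.0 <= lmr7 <= 1.0:
        raise ValueError("lmr7 must be a fraction between 0 and 1")
    return {s: lmr7 * scores7[s] + (1.0 - lmr7) * scores5[s] for s in "HEC"}


# Illustrative numbers only, not taken from the slide tables.
scores7 = {"H": 8, "E": 0, "C": 4}
scores5 = {"H": 4, "E": 4, "C": 8}
combined = combine_scores(0.25, scores7, scores5)
print(combined)
```

When the 7-mer match rate is high the SSKB7 scores dominate; when it is low the prediction falls back on the better-covered SSKB5 scores.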

Hybrid by Neural Network • Input: query protein • PSIPRED: 3 features (H score, E score, C score) • PROSP: 3 features (H score, E score, C score) • PSI-BLAST PSSM: 20 features • Neural network → final result
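A sketch of how the 3 + 3 + 20 inputs could be assembled into one feature vector per residue. The ordering of the features and the absence of any sliding window over neighboring residues are assumptions; the slide gives only the feature counts, and the network itself is not shown here.

```python
def hybrid_features(psipred_scores, prosp_scores, pssm_row):
    """Concatenate per-residue inputs into a 26-value feature vector.

    psipred_scores, prosp_scores: (H, E, C) scores for one residue.
    pssm_row: the 20 PSSM values for the same residue.
    """
    if len(psipred_scores) != 3 or len(prosp_scores) != 3:
        raise ValueError("expected 3 scores (H, E, C) from each predictor")
    if len(pssm_row) != 20:
        raise ValueError("expected 20 PSSM values")
    return list(psipred_scores) + list(prosp_scores) + list(pssm_row)


# Placeholder values for a single residue.
features = hybrid_features((0.7, 0.1, 0.2), (5.0, 3.0, 7.0), [0] * 20)
print(len(features))
```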

Data Sets • Two widely used test sets • CB513 • EVAc4 • Derivation of the training sets • Get 4,572 unique protein chains (with less than 25% mutual sequence identity) from the DSSP database • Further remove protein chains with over 25% sequence identity to the respective test datasets to obtain the respective training datasets. • The final training datasets consist of 4,395 and 4,055 protein chains for EVAc4 and CB513, respectively.
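The second filtering step can be sketched as below. The `identity` function is passed in as a parameter because the slides do not say how sequence identity was computed; `toy_identity` is a hypothetical position-by-position comparison used only to make the example runnable, not a real alignment-based measure.

```python
def filter_training_chains(candidates, test_chains, identity, threshold=0.25):
    """Drop candidates whose identity to any test chain exceeds threshold."""
    return [c for c in candidates
            if all(identity(c, t) <= threshold for t in test_chains)]


def toy_identity(a, b):
    """Hypothetical stand-in: fraction of matching positions, no alignment."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / longest


test_chains = ["AAAT"]
candidates = ["AAAA", "CCCC", "ACCC"]
training = filter_training_chains(candidates, test_chains, toy_identity)
print(training)
```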

The respective performance improvement using SSKB5 and SSKB7 • [Figure: Q3 (%) plotted against LMR7mer (%)] • Performance of prediction on CB513 by SSKB5, SSKB7 and PROSP II for LMR7mer lower than 50%.
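Q3 is the standard three-state accuracy: the percentage of residues whose predicted state (H, E or C) matches the observed one. A small sketch with made-up strings:

```python
def q3(predicted, observed):
    """Percentage of positions where predicted and observed SSE agree."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have equal length")
    if not observed:
        raise ValueError("sequences must not be empty")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Illustrative strings only.
score = q3("HHHHCCCC", "HHHECCCE")
print(score)
```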

Performance of HYPROSP II+

Conclusion: HYPROSP II+ • Uses a more robust knowledge-based algorithm, PROSP II • More structural information yields better prediction • Incremental learning • The general strategy developed in this paper could be used to enhance the performance of similar approaches in other prediction problems.

People • Ting-Yi Sung • Wen-Lian Hsu • Jia-Ming Chang • Ei-Wen Yang • Hsin-Nan Lin
