Conditional Random Fields for the Prediction of Signal Peptide Cleavage Sites
M.W. Mak, The Hong Kong Polytechnic University
S.Y. Kung, Princeton University
Contents
• Introduction: Proteins and Their Subcellular Locations; Importance of Protein Cleavage-Site Prediction; Information in Amino Acid Sequences; Existing Approaches to Cleavage Site Prediction
• Conditional Random Field (CRF): CRF for Cleavage Site Prediction
• Experiments and Results: Effectiveness of Different Feature Functions; Effect of Varying Window Size; Fusion with SignalP
Proteins and Their Destination
• A protein consists of a sequence of amino acids.
• Newly synthesized proteins must cross intracellular membranes to reach their destinations.
Source: http://redpoll.pharmacy.ualberta.ca
Signal Peptide
• A short segment of 20 to 100 amino acids (known as a signal peptide) contains information about the destination (address) of the protein.
• The signal peptide is cleaved off when the protein passes across the membrane, leaving the mature protein.
[Figure labels: signal peptide, cleavage site, mature protein. Sources: http://nobelprize.org; S. R. Goodman, Medical Cell Biology, Elsevier, 2008.]
Importance of Cleavage Site Prediction
• Defects in the protein sorting process can cause serious diseases, e.g., kidney stones.
Source: http://nobelprize.org/nobel_prizes/medicine/laureates/1999/illpres/diseases.html
Importance of Cleavage Site Prediction
• Many proteins (e.g., insulin) are produced in living cells. For the cells to secrete these proteins, each protein is provided with a signal peptide.
[Figure: bioreactor. Source: http://nobelprize.org/nobel_prizes/medicine/laureates/1999/illpres/diseases.html]
Information in Sequences
• Signal peptides contain some regular patterns.
• Although the patterns exhibit substantial variation, they can be detected by machine learning tools.
[Figure labels: region rich in hydrophobic amino acids (AA); cleavage site]
Existing Methods
• Weight matrices (PrediSi)
• Neural networks (SignalP 1.1)
• HMMs (SignalP 3.0)
Weight Matrices
• The weight matrix covers 15 positions (..., t-1, t, t+1, ...) and 20 amino acids (AA).
• Example sequence: M A R S S L F T F L C L A V F I N G C L S Q I E Q Q
• Score at position t = 16+0+8+6+78+7+7+13+10+6+8+6+0+6+7 = 178
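The scoring rule above can be sketched in a few lines: the score of a window is the sum of the matrix weights of the residues observed at each of the 15 positions. This is a minimal sketch, not PrediSi's implementation. The helper names (`make_toy_matrix`, `score_window`, `best_offset`) and the toy weights are hypothetical; only the example sequence and the 15 summed terms come from the slide.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def make_toy_matrix(motif, hit=10, miss=0):
    """Build a hypothetical matrix: one dict of 20 weights per position."""
    return [{aa: (hit if aa == m else miss) for aa in AMINO_ACIDS}
            for m in motif]


def score_window(matrix, window):
    """Sum the weight of the residue observed at each matrix position."""
    if len(window) != len(matrix):
        raise ValueError("window length must equal the number of positions")
    return sum(column[aa] for column, aa in zip(matrix, window))


def best_offset(matrix, sequence):
    """Slide the matrix along the sequence; return (offset, score) of the best window."""
    width = len(matrix)
    scores = [score_window(matrix, sequence[i:i + width])
              for i in range(len(sequence) - width + 1)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# The per-position weights summed on the slide (15 terms).
slide_terms = [16, 0, 8, 6, 78, 7, 7, 13, 10, 6, 8, 6, 0, 6, 7]
print(sum(slide_terms))

# Example sequence from the slide, scored with toy weights (an assumption).
sequence = "MARSSLFTFLCLAVFINGCLSQIEQQ"
matrix = make_toy_matrix(sequence[5:20])
print(best_offset(matrix, sequence))
```

A real predictor would estimate the weights from aligned training sequences and report the highest-scoring position as the predicted cleavage site.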
SignalP-HMM
[Figure: SignalP-HMM diagram; labels: signal peptide, mature protein. Source: Nielsen and Krogh]
Conditional Random Fields
• Conditional Random Fields (CRFs) were originally designed for sequence labeling tasks such as Part-of-Speech (POS) tagging.
• Given a sequence of observations (e.g., words), a CRF attempts to find the most likely label sequence, i.e., it gives a label for each of the observations.
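Finding the most likely label sequence in a linear-chain model is typically done with Viterbi decoding. The sketch below is a generic illustration, not the authors' code. The two labels ("S" for signal peptide, "M" for mature protein) and all scores are hypothetical.

```python
def viterbi(state_scores, transition_scores, labels):
    """Return the highest-scoring label sequence.

    state_scores: list (one entry per position) of dicts label -> score.
    transition_scores: dict (previous_label, label) -> score.
    """
    best = [{y: state_scores[0][y] for y in labels}]
    back = []
    for t in range(1, len(state_scores)):
        best_t, back_t = {}, {}
        for y in labels:
            # Best previous label given that the current label is y.
            prev = max(labels,
                       key=lambda p: best[-1][p] + transition_scores[(p, y)])
            best_t[y] = (best[-1][prev] + transition_scores[(prev, y)]
                         + state_scores[t][y])
            back_t[y] = prev
        best.append(best_t)
        back.append(back_t)
    # Trace back from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for back_t in reversed(back):
        path.append(back_t[path[-1]])
    path.reverse()
    return path


labels = ["S", "M"]
# Hypothetical scores: returning from mature protein to signal peptide is penalized.
transitions = {("S", "S"): 0, ("S", "M"): 0, ("M", "M"): 0, ("M", "S"): -100}
states = [{"S": 2, "M": 0}, {"S": 0, "M": 0.5},
          {"S": 1, "M": 0}, {"S": 0, "M": 2}]
print(viterbi(states, transitions, labels))
```

Position 2 on its own favors "M", but the decoder keeps "S" there because the whole sequence scores higher that way. This is the sense in which a CRF labels the sequence jointly rather than one observation at a time.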
Advantages of CRF
• Avoids computing the likelihood p(observation|label); the posterior p(label|observation) is computed directly.
• Able to model long-range dependencies without making the inference problem intractable.
• Guaranteed to converge to the global optimum.
[Figure: a label depends on residues across the sequence M A R S S L F T F L C L A V F I N G C L S Q I E Q Q]
CRF for Cleavage Site Prediction
[Equation figure; annotated parts: cleavage site, weights, length of sequence, transition features, n-grams of amino acids, state features]
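As a reference point for the labels on this slide, a linear-chain CRF is commonly written in the form below. The symbols are notational assumptions, not necessarily the slide's own: T is the length of the sequence, f_i are transition features, g_j are state features, and λ_i, μ_j are their weights.

```latex
p(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})}
    \exp\!\left(
      \sum_{t=1}^{T} \left[
        \sum_{i} \lambda_i \, f_i(y_{t-1}, y_t, \mathbf{x}, t)
        + \sum_{j} \mu_j \, g_j(y_t, \mathbf{x}, t)
      \right]
    \right),
\qquad
Z(\mathbf{x})
  = \sum_{\mathbf{y}'}
    \exp\!\left(
      \sum_{t=1}^{T} \left[
        \sum_{i} \lambda_i \, f_i(y'_{t-1}, y'_t, \mathbf{x}, t)
        + \sum_{j} \mu_j \, g_j(y'_t, \mathbf{x}, t)
      \right]
    \right)
```

Here x is the amino acid sequence and y is the label sequence. The state features would be where n-grams of amino acids enter the model.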
CRF for Cleavage Site Prediction
e.g., bi-grams with query sequence = T Q T W A G S H S . . .
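To make the n-gram idea concrete, the sketch below lists the n-grams that fall inside a window around a position t of the query sequence. The function name `ngram_features`, the offset convention, and the window definition are assumptions for illustration. The slides do not specify how the features are indexed.

```python
def ngram_features(sequence, t, n, window):
    """Return (offset, n-gram) pairs lying fully inside [t - window, t + window].

    The offset is the start of the n-gram relative to position t.
    """
    lo = max(0, t - window)
    hi = min(len(sequence), t + window + 1)
    return [(start - t, sequence[start:start + n])
            for start in range(lo, hi - n + 1)]


query = "TQTWAGSHS"  # first residues of the query sequence on the slide

# Bi-grams around position 4 (residue 'A') with one residue on each side.
print(ngram_features(query, t=4, n=2, window=1))

# Uni-grams with a wider window of two residues on each side.
print(ngram_features(query, t=4, n=1, window=2))
```

Widening `window` or raising `n` increases the number of distinct features, which is the trade-off explored in the window-size experiments.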
Experiments
• Data: 1937 protein sequences extracted from Swiss-Prot 56.5. The cleavage-site locations of these sequences were biologically determined.
• Ten-fold cross-validation.
• For 1st-order state features, up to 5-grams of amino acids.
• For 2nd-order state features, up to bi-grams of amino acids.
• CRF++ software was used.
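The setup above can be sketched as two small helpers: one that writes a labeled sequence in the column format CRF++ reads (one token per line, label in the last column, blank line between sequences), and one that produces ten train/test splits. The label names ("S", "C", "M"), the meaning of `cleavage_index`, and the round-robin fold assignment are assumptions. The slides do not state the labeling scheme or how the folds were drawn.

```python
def to_crfpp_lines(sequence, cleavage_index):
    """One residue per line with its label last; a blank line ends the sequence.

    Hypothetical labels: S = signal peptide, C = residue at cleavage_index,
    M = mature protein.
    """
    lines = []
    for i, aa in enumerate(sequence):
        if i < cleavage_index:
            label = "S"
        elif i == cleavage_index:
            label = "C"
        else:
            label = "M"
        lines.append(f"{aa} {label}")
    lines.append("")
    return lines


def ten_fold_splits(n_items, n_folds=10):
    """Yield (train_indices, test_indices) for each fold, assigned round-robin."""
    folds = [list(range(k, n_items, n_folds)) for k in range(n_folds)]
    for k in range(n_folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, folds[k]


print("\n".join(to_crfpp_lines("MARSS", cleavage_index=2)))

splits = list(ten_fold_splits(1937))  # 1937 sequences, as in the experiments
print([len(test) for _, test in splits])
```

Each fold's training lines would be written to a file for CRF++ training and the held-out lines to a file for testing, so every sequence is tested exactly once.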
Results: Effectiveness of Different Feature Functions
[Results table rows: (Transition only), (Transition + State)]
Observations:
• Transition features alone perform poorly.
• Once they are combined with state features, performance improves.
Results: Effect of Varying the Window Size
e.g., query sequence = T Q T W A G S H S . . .
Results: Comparison with Other Predictors
Observations:
(1) CRF is slightly better than SignalP.
(2) CRF is complementary to SignalP.
Web Server
http://158.132.148.85:8080/CSitePred/faces/Page1.jsp
Available in May 2009