System approaches to the prediction of protein function. Søren Brunak, Center for Biological Sequence Analysis, Technical University of Denmark, brunak@cbs.dtu.dk, www.cbs.dtu.dk. 40-60% of the proteins in the human genome are of unknown function.
Diverse functional categories of cell cycle regulated yeast proteins
Level 1 GO categories for 349 cell cycle regulated yeast genes. Only 95 of these belong to the "Cell Cycle" category (biological process).
Diverse functional categories for human nucleolus proteins
Level 1 GO categories for 148 human genes located in the nucleolus. Only 5 of these belong to the "Nucleolus" category (cellular component).
Pairwise alignment
>carp Cyprinus carpio growth hormone, 210 aa
vs.
>chicken Gallus gallus growth hormone, 216 aa
Scoring matrix: BLOSUM50; gap penalties: -12/-2
40.6% identity; global alignment score: 487

carp     MA--RVLVLLSVVLVSLLVNQGRASDN-----QRLFNNAVIRVQHLHQLAAKMINDFEDSLLPEERRQLSKIFPLSFCNSD
chicken  MAPGSWFSPLLIAVVTLGLPQEAAATFPAMPLSNLFANAVLRAQHLHLLAAETYKEFERTYIPEDQRYTNKNSQAAFCYSE

carp     YIEAPAGKDETQKSSMLKLLRISFHLIESWEFPSQSLSGTVSNSLTVGNPNQLTEKLADLKMGISVLIQACLDGQPNMDDN
chicken  TIPAPTGKDDAQQKSDMELLRFSLVLIQSWLTPVQYLSKVFTNNLVFGTSDRVFEKLKDLEEGIQALMRELEDRSPR---G

carp     DSLPLP-FEDFYLTM-GENNLRESFRLLACFKKDMHKVETYLRVANCRRSLDSNCTL
chicken  PQLLRPTYDKFDIHLRNEDALLKNYGLLSCFKKDLHKVETYLKVMKCRRFGESNCTI
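The identity figure quoted for an alignment can be checked column by column. The following is a minimal sketch, assuming identity is counted over all alignment columns with gap columns treated as mismatches. Alignment tools differ in the denominator they use, so this may not reproduce a reported percentage exactly. The function name `percent_identity` is an illustrative choice.

```python
def percent_identity(row_a: str, row_b: str) -> float:
    """Percent identity over all alignment columns (gap columns count as mismatches)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    if not row_a:
        raise ValueError("alignment is empty")
    # A column is an identity only if both residues match and neither is a gap.
    matches = sum(1 for a, b in zip(row_a, row_b) if a == b and a != "-")
    return 100.0 * matches / len(row_a)


# First ten columns of the carp/chicken growth hormone alignment.
carp = "MA--RVLVLL"
chicken = "MAPGSWFSPL"
print(percent_identity(carp, chicken))  # 3 identical columns out of 10
```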
An enzyme (1AOZ) and a non-enzyme (1PLC) from the Cupredoxin superfamily
1AOZ (129 aa) vs. 1PLC (99 aa)
Scoring matrix: BLOSUM50; gap penalties: -12/-2
15.5% identity; global alignment score: -23

1AOZ  SQIRHYKWEVEYMFWAPNCNENIVMGINGQFPGPTIRANAGDSVVVELTNKLHTEGVVIH
1PLC  ---------IDVLLGA---DDGSLAFVPSEFS-----ISPGEKIVFK-NNAGFPHNIVFD

1AOZ  WHGILQRGTPWADGTASISQCAINPGETFFYNFTVDNPGTFFYHGHLGMQRSAGLYGSLI
1PLC  EDSI-PSGVDASKISMSEEDLLNAKGETFEVALSNKGEYSFYCSPHQG----AGMVGKVT

1AOZ  VDPPQGKKE
1PLC  VN-------
Transfer of functional information – in what space?
Recognize function in:
• Sequence space – sequence alignment
• Structure space – structural comparison
• Gene expression spaces – array data
• Interaction spaces – network/pathway extraction
• Paper space – text mining
…
• Protein feature space
Predict orphan protein function in feature space
• Orphan sequences have to use the standard cellular machinery for sorting, post-translational modification, etc.
• A similar pattern of modification may imply similar function
• Predict sequence attributes independently, e.g. local and global properties such as
  - post-translational modifications
  - localization signals
  - degradation signals
  - structure
  - composition, length, isoelectric point, …
• Then integrate and correlate using neural networks
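The simplest global properties named here, length and amino acid composition, can be computed directly from the sequence. The sketch below is illustrative only. The function name `global_features`, the feature keys, and the toy sequence are assumptions, not the feature encoding used by the authors.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def global_features(sequence: str) -> dict:
    """Return sequence length and the fraction of each standard amino acid."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence is empty")
    counts = Counter(seq)
    features = {"length": len(seq)}
    for aa in AMINO_ACIDS:
        features[f"frac_{aa}"] = counts.get(aa, 0) / len(seq)
    return features


# Toy sequence, chosen only for illustration.
feats = global_features("MKKLSSA")
print(feats["length"], round(feats["frac_K"], 3))
```

A fixed-length vector of this kind is the sort of input a neural network could integrate with predicted modification and localization features.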
Serine phosphorylation sites
Acceptor site   Pos.    Target
AKKG S EQES     S-10    PKA (1CMK)
GFGD S IEAQ     S-87    Ovalbumin (1OVA)
EVVG S AEAG     S-350   Ovalbumin (1OVA)
GDLG S CEFH     S-80    Cystatin (1CEW)
Propeptide cleavage sites
Post-translational processing by limited proteolysis of inactive secretory precursors produces active proteins and peptides.
Figure: furin-specific cleavage sites (a) and other proprotein convertase (PC) cleavage sites (b)
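As a rough illustration of what a furin-type site looks like in sequence, the sketch below scans for the commonly cited consensus Arg-X-(Lys/Arg)-Arg, with cleavage after the final arginine. The slide does not state this motif, so treating it as the pattern of interest is an assumption. A plain motif scan is a stand-in for illustration, not the prediction method referred to in this talk. The function name and the example peptide are hypothetical.

```python
import re

# Assumed consensus: R-X-[K/R]-R, cleaved after the last R (the P1 residue).
# The lookahead lets overlapping motif occurrences all be reported.
FURIN_MOTIF = re.compile(r"(?=(R.[KR]R))")


def candidate_furin_sites(sequence: str) -> list[int]:
    """Return the 1-based position of the P1 arginine for every motif match."""
    return [m.start() + 4 for m in FURIN_MOTIF.finditer(sequence.upper())]


# Hypothetical peptide containing one R-S-K-R motif.
print(candidate_furin_sites("AAARSKRSVV"))
```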
PCs activate a large variety of proteins
Substrates include peptide hormones, neuropeptides, growth and differentiation factors, adhesion factors, receptors, blood coagulation factors, plasma proteins, extracellular matrix proteins, proteases, and exogenous proteins such as coat glycoproteins from infectious viruses (e.g. HIV-1 and Influenza) and bacterial toxins (e.g. diphtheria and anthrax toxin).
PCs play an essential role in many vital biological processes, such as embryonic development and neural function, and in viral and bacterial pathogenesis.
PCs are implicated in pathologies such as cancer and neurodegenerative diseases.
Mucin-type O-glycosylation
• N-acetylgalactosamine (GalNAc) α-1 linked to the hydroxyl group of a serine or threonine
• Responsible for the high carbohydrate content of mucin proteins (>50% of the dry weight)
• Mucins, the principal component of mucus, protect epithelial surfaces from dehydration, mechanical injury, proteases and pathogens
• Mucin-type glycosylation contributes to this by changing the structure to a stiff, extended one and by charging the protein so that it binds more water
Positional preference of N-Glyc sites across cellular role categories
Functional classes predicted
• Functional role (Monica Riley categories)
  - The original scheme had 14 categories
  - Reduced to 12 categories by skipping the category "other" and combining replication and transcription
• Enzyme prediction
  - Enzyme vs non-enzyme
  - Major enzyme class in the EC system
• Gene Ontology
  - A subset of classes can be predicted
• Systems biology related categories
  - For example "cell cycle regulated", secreted, nucleolar
Predicting Gene Ontology categories
• The GO system is designed for proteins to belong to multiple classes rather than one
• Different kinds of function can be annotated:
  - Molecular function
  - Biological process
  - Cellular component
• GO assigns the "function" at several levels of detail rather than only one
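Because GO describes function at several levels of detail, an annotation at a specific term also implies the more general terms above it. The sketch below shows that propagation on a toy hierarchy. The `PARENTS` table is a small hand-written excerpt for illustration, not the full ontology, and the function name `with_ancestors` is an assumption.

```python
# Toy is_a hierarchy: each term maps to its direct parents (illustrative excerpt only).
PARENTS = {
    "mitotic cell cycle": ["cell cycle"],
    "cell cycle": ["cellular process"],
    "cellular process": ["biological process"],
    "biological process": [],
}


def with_ancestors(term: str, parents: dict = PARENTS) -> set:
    """Return the term together with every more general term reachable from it."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


print(sorted(with_ancestors("mitotic cell cycle")))
```

A protein would carry one such set per annotated term, which is why a single protein can belong to many classes at once.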
The concept of ProtFun
• Predict as many biologically relevant features as we can from the sequence
• Train artificial neural networks (NNs) for each category
• Assign a probability for each category from the NN outputs
An enzyme (1AOZ) and a non-enzyme (1PLC) from the Cupredoxin superfamily
1AOZ and 1PLC predictions
# Functional category                1AOZ    1PLC
Amino_acid_biosynthesis              0.126   0.070
Biosynthesis_of_cofactors            0.100   0.075
Cell_envelope                        0.429   0.032
Cellular_processes                   0.057   0.059
Central_intermediary_metabolism      0.063   0.041
Energy_metabolism                    0.126   0.268
Fatty_acid_metabolism                0.027   0.072
Purines_and_pyrimidines              0.439   0.088
Regulatory_functions                 0.102   0.019
Replication_and_transcription        0.052   0.089
Translation                          0.079   0.150
Transport_and_binding                0.032   0.052
# Enzyme/nonenzyme
Enzyme                               0.773   0.310
Nonenzyme                            0.227   0.690
# Enzyme class
Oxidoreductase (EC 1.-.-.-)          0.077   0.077
Transferase    (EC 2.-.-.-)          0.260   0.099
Hydrolase      (EC 3.-.-.-)          0.114   0.071
Lyase          (EC 4.-.-.-)          0.025   0.020
Isomerase      (EC 5.-.-.-)          0.010   0.068
Ligase         (EC 6.-.-.-)          0.017   0.017
Similar structure, different functions
• Many examples exist of structurally similar proteins which have different functions
• Two PDB structures from the Cupredoxin superfamily:
  - 1AOZ is an ascorbate oxidase (enzyme)
  - 1PLC performs electron transport (non-enzyme)
• Despite their structural similarity, our method correctly predicts 1AOZ as an enzyme and 1PLC as a non-enzyme
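The enzyme/non-enzyme call for the two structures can be read off the prediction table for 1AOZ and 1PLC by taking the higher of the two scores. The sketch below does that with the values from the table. Picking the maximum is an assumed way of reading the output, not necessarily the decision rule used by the method, and the names `ENZYME_SCORES` and `top_category` are illustrative.

```python
# Enzyme/nonenzyme scores for the two structures, copied from the prediction table.
ENZYME_SCORES = {
    "1AOZ": {"Enzyme": 0.773, "Nonenzyme": 0.227},
    "1PLC": {"Enzyme": 0.310, "Nonenzyme": 0.690},
}


def top_category(scores: dict) -> str:
    """Return the category with the highest score."""
    return max(scores, key=scores.get)


for pdb_id, scores in ENZYME_SCORES.items():
    print(pdb_id, top_category(scores))
# prints "1AOZ Enzyme" then "1PLC Nonenzyme"
```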
Systems Biology – Whole system description
• Focus on whole systems, rather than individual units
• Requires identification of all units in the system
• High diversity in biological systems
• Inference of system features/functions from experimental data
• Ultimate goal is in-silico modeling of the temporal aspects of the cell cycle in different organisms
Example: Eukaryotic Cell Cycle
Microarray identification of periodic genes
Synchronous yeast cells → DNA chips → gene expression → temporal expression
Look for genes with periodic expression (figure: profiles labeled Periodic, Non-Periodic, and "?")
Identification of periodically expressed genes
1) Visual inspection of expression profiles (Cho et al., 1998)
2) Fourier analysis and correlation with profiles of known genes (Spellman et al., 1998)
3) Statistical modeling (single pulse model) (Zhao et al., 2001)
Figure labels: 104 known genes; 70%, 91%, 47%
• Problems
  - Cho uses non-objective criteria
  - Spellman identifies too many genes
  - Zhao identifies fewer than half of the previously identified cell cycle regulated genes
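The Fourier analysis listed above can be illustrated with a single-frequency periodicity score: project a mean-centered expression profile onto a sine and a cosine at the expected cell cycle period and take the magnitude. This is a simplified sketch of the general idea, not the published procedure. The sampling times, the 60-minute period, and the function name `fourier_score` are hypothetical.

```python
import math


def fourier_score(times, values, period):
    """Magnitude of the profile's Fourier component at the given period."""
    omega = 2.0 * math.pi / period
    mean = sum(values) / len(values)
    # Project the mean-centered profile onto sine and cosine at frequency omega.
    a = sum(math.sin(omega * t) * (v - mean) for t, v in zip(times, values))
    b = sum(math.cos(omega * t) * (v - mean) for t, v in zip(times, values))
    return math.hypot(a, b)


# Hypothetical time course: 12 samples, 10 minutes apart, assumed 60-minute period.
times = list(range(0, 120, 10))
periodic = [math.sin(2.0 * math.pi * t / 60.0) for t in times]
flat = [1.0] * len(times)

print(fourier_score(times, periodic, 60.0) > fourier_score(times, flat, 60.0))  # True
```

A score like this still needs a cutoff, and the choice of cutoff is one reason a method of this kind can report too many or too few periodic genes.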
Our novel strategy
Sequence-based "machine learning approach"
• 6200 genes, consistency filter
• Periodic genes: positive set (97 sequences)
• Grey zone area (~5600 genes): ?
• Non-periodic genes: negative set (556 sequences)
• Learn from the positive and negative sets
Prediction of cell cycle regulated genes from protein sequence
Features of cell cycle regulated genes used by neural net ensemble
Non-linear function prediction! Responds to single AA change
Top 250 genes predicted from the entire genome
• Among the "top 250 predicted" genes not used for training are
  - 75 previously identified as cell cycle regulated genes
  - 175 new, potentially cell cycle regulated genes
Figure panels: Functional grouping; Subcellular localization
Experimental validation results
• More than 100 new periodic genes identified/validated
• For many of them, a role in the cell cycle is supported by other sources of evidence
• About 30% of them have no known functional role
The eukaryotic cell cycle
The cell division process is divided into four phases:
• G1: growth/synthesis
• S: replication of DNA
• G2: growth/synthesis
• M: mitosis/cell division
S phase?
• 40% into the cell cycle, the plot shows:
  - High isoelectric point
  - Many nuclear proteins
  - Short proteins
  - Low potential for N-glycosylation
  - Low potential for Ser/Thr-phosphorylation
  - Few PEST regions
  - Low aliphatic index
Figure: S phase feature snapshot
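Of the features in this snapshot, the aliphatic index is the easiest to compute directly from sequence. The sketch below uses the commonly used definition from Ikai (1980): the mole percent of Ala, plus 2.9 times that of Val, plus 3.9 times that of Ile and Leu combined. Whether the feature on the slide was computed with exactly this formula is an assumption, and the function name is illustrative.

```python
def aliphatic_index(sequence: str) -> float:
    """Aliphatic index: X(Ala) + 2.9 * X(Val) + 3.9 * (X(Ile) + X(Leu)), X in mole percent."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence is empty")

    def mole_percent(aa: str) -> float:
        return 100.0 * seq.count(aa) / len(seq)

    return (mole_percent("A")
            + 2.9 * mole_percent("V")
            + 3.9 * (mole_percent("I") + mole_percent("L")))


# Toy peptide with one each of Ala, Val, Ile, Leu: 25 + 2.9*25 + 3.9*50
print(aliphatic_index("AVIL"))
```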
Identify areas where prediction approaches can clean up noisy experimental data
• High-throughput proteomics data
• DNA array data
• The strength of prediction approaches can indeed be complementary to the experimental data, due to experimental constraints
• Generate hypotheses on the dynamics of protein feature space, e.g. the periodicity of the phospho-proteome
Acknowledgements
People at CBS
Lars Juhl Jensen
Ramneek Gupta
+ 20 others
Karin Julenius (O-glyc conservation)
Thomas Skøt Jensen (cell cycle)
Ulrik de Lichtenberg (cell cycle)
Rasmus Wernersson (Febit experiments)
Jannick Bendtsen (SecretomeP)
Lars Kiemer (NucleolusP)
Anders Fausbøll (NucleolusP)
Thomas Sicheritz-Pontén (new ProtFun method)
Febit AG
Peer Smith
CNB/CSIC, Madrid
Alfonso Valencia
Javier Tamames
Damien Devos
Gunnar von Heijne, Stockholm (SecretomeP)
References
www.cbs.dtu.dk/services/Protfun
www.cbs.dtu.dk/cellcycle
L.J. Jensen, R. Gupta, N. Blom, D. Devos, J. Tamames, C. Kesmir, H. Nielsen, H.H. Stærfeldt, K. Rapacki, C. Workman, C.A.F. Andersen, S. Knudsen, A. Krogh, A. Valencia, and S. Brunak, "Prediction of human protein function from post-translational modifications and localization features", J. Mol. Biol., 319, 1257-1265, 2002.
L.J. Jensen, M. Skovgaard, and S. Brunak, "Prediction of novel archaeal enzymes from sequence derived features", Protein Sci., 11, 2894-2898, 2002.
L.J. Jensen, R. Gupta, H.-H. Stærfeldt, and S. Brunak, "Prediction of human protein function according to Gene Ontology categories", Bioinformatics, 19, 635-642, 2003.
L.J. Jensen, D.W. Ussery, and S. Brunak, "Functionality of system components: Conservation of protein function in protein feature space", Genome Res., Oct 14, 2003.
U. de Lichtenberg, T.S. Jensen, L.J. Jensen, and S. Brunak, "Protein feature based identification of cell cycle regulated proteins in yeast", J. Mol. Biol., 13, 663-674, 2003.