
Prediction of protein function

Prediction of protein function. Lars Juhl Jensen EMBL Heidelberg. Overview. Part 1 Homology-based transfer of annotation Function prediction from protein domains Part 2 Prediction of functional motifs from sequence Feature-based prediction of protein function Part 3


Presentation Transcript


  1. Prediction of protein function Lars Juhl Jensen, EMBL Heidelberg

  2. Overview • Part 1 • Homology-based transfer of annotation • Function prediction from protein domains • Part 2 • Prediction of functional motifs from sequence • Feature-based prediction of protein function • Part 3 • Prediction of functional interaction networks

  3. Why do we need to predict function?

  4. What do we mean by function? • The concept “function” is not clearly defined • A structural biologist, a cell biologist, and a medical doctor will have very different views • Many levels of granularity • For the overall definition of “function”, the knowledge and description can be more or less specific • Functional categories are somewhat artificial • People like to put things in boxes …

  5. Descriptions of protein function • Controlled vocabularies • Gene Ontology • SwissProt keywords • KEGG pathways • EcoCyc pathways • Interaction networks • More accurate data models • Reactome • Systems Biology Markup Language (SBML)

  6. Molecular function • Molecular function describes activities, such as catalytic or binding activities, at the molecular level • GO molecular function terms represent activities rather than the entities that perform the actions, and do not specify where or when, or in what context, the action takes place • Examples of broad functional terms are catalytic activity or transporter activity; an example of a narrower term is adenylate cyclase activity

  7. Biological process • A biological process is a series of events accomplished by one or more ordered assemblies of molecular functions • An example of a broad GO biological process term is signal transduction; examples of more specific terms are pyrimidine metabolism or alpha-glucoside transport • It can be difficult to distinguish between a biological process and a molecular function

  8. Cellular component • A cellular component is just that, a component of a cell that is part of some larger object • It may be an anatomical structure (for example, the rough endoplasmic reticulum or the nucleus) or a gene product group (for example, the ribosome, the proteasome or a protein dimer) • The cellular component categories are probably the best defined categories since they correspond to actual entities

  9. Homology-based transfer of annotation Lars Juhl Jensen, EMBL Heidelberg

  10. Detection of homologs • Pairwise sequence similarity searches • BLAST (fastest) • FASTA • Full Smith-Waterman (most sensitive) • Profile-based similarity searches • PSI-BLAST • Hidden Markov Models (HMMs) • Sequence similarity should always be evaluated at the protein level
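
The Smith-Waterman search named on this slide can be sketched as a small dynamic program. This is a minimal illustration, not a production aligner: the function name `smith_waterman` and the flat match/mismatch/gap scores are assumptions made here for brevity, whereas real protein searches use a substitution matrix such as BLOSUM62 and affine gap penalties.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Toy scoring (assumption): flat match/mismatch scores and a linear gap
    penalty instead of a substitution matrix with affine gaps.
    """
    prev = [0] * (len(b) + 1)  # previous row of the score matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# The shared stretch "MKTAY" is found even though the flanks differ.
score = smith_waterman("GGGMKTAYGGG", "WWMKTAYWW")
```

Because every cell of the matrix is filled, this is the most sensitive of the pairwise methods listed and also the slowest; BLAST and FASTA gain speed by exploring only part of that matrix.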

  11. Sequence similarity, sequence homology, and functional homology • Sequence similarity means that the sequences are similar – no more, no less • Sequence homology implies that the proteins are encoded by genes that share a common ancestry • Functional homology means that two proteins from two organisms have the same function • Sequence similarity or sequence homology does not guarantee functional homology

  12. Orthologs vs. paralogs

  13. Functional consequences of gene duplication • Neofunctionalization • One copy has retained the ancestral function and can be treated as a 1–to–1 ortholog (functional homolog) • The other copy has changed its function and behaves much like a paralog • Subfunctionalization • Each copy has taken on a part of the ancestral function • A functional homolog cannot be defined • Each ortholog typically has the same molecular function in a different sub-process or location

  14. 1–to–1 orthology • A single gene in one organism corresponds to a single gene in another organism • These can generally be assumed to encode functionally equivalent proteins • Same molecular function • Same biological process • Same localization • 1–to–1 orthology is fairly common in prokaryotes and among very closely related organisms

  15. 1–to–many orthology • A single gene in one organism corresponds to multiple genes in another organism • Any mixture of neo- and sub-functionalizations can have occurred • Typically same molecular function • Often different biological process or sub-process • Often different sub-cellular localization or tissue • 1–to–many orthology is very common between simple model organisms and higher eukaryotes

  16. Many–to–many orthology • Many genes in each organism have arisen from a single gene in their last common ancestor • Different neo- and sub-functionalizations have likely taken place in each lineage • Typically same molecular function • Often different biological process or sub-process • Often different sub-cellular localization or tissue • Many–to–many orthology is common between higher eukaryotes that are distantly related
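
The three orthology classes on slides 14 to 16 can be told apart by counting the genes on each side of a group of orthologous pairs. The sketch below assumes that pairwise ortholog calls between two species, here called A and B, are already available; the function name `classify_orthology` and the gene identifiers are hypothetical.

```python
from collections import defaultdict


def classify_orthology(pairs):
    """Group orthologous (gene_a, gene_b) pairs and label each group.

    Returns a list of (genes_a, genes_b, label) tuples, where label is
    "1-to-1", "1-to-many" or "many-to-many".
    """
    # Build an undirected graph; nodes are tagged with their species.
    adj = defaultdict(set)
    for a, b in pairs:
        adj[("A", a)].add(("B", b))
        adj[("B", b)].add(("A", a))

    seen, groups = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        # Depth-first search collects one connected component.
        stack, comp = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            comp.append(node)
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        genes_a = sorted(g for s, g in comp if s == "A")
        genes_b = sorted(g for s, g in comp if s == "B")
        if len(genes_a) == 1 and len(genes_b) == 1:
            label = "1-to-1"
        elif len(genes_a) == 1 or len(genes_b) == 1:
            label = "1-to-many"
        else:
            label = "many-to-many"
        groups.append((genes_a, genes_b, label))
    return groups


# Hypothetical ortholog calls between species A and species B.
example = classify_orthology([
    ("a1", "b1"),
    ("a2", "b2"), ("a2", "b3"),
    ("a3", "b4"), ("a3", "b5"), ("a4", "b4"), ("a4", "b5"),
])
```

Only the 1-to-1 groups would be candidates for transferring molecular function, biological process, and localization together; for the other two classes the slides suggest transferring the molecular function only.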

  17. Detection of orthologs • Reconstruction of phylogenetic trees • The theoretically most correct way • Works for analyzing particular genes of interest • Methods based on reciprocal matches • What currently works at the genomic scale • Manual curation • Detection of very remote orthologs may require that knowledge of gene synteny and/or protein function is taken into account

  18. Construction of gene trees • Identify the relevant proteins • Sequence similarity and possibly additional information • Construct a blocked multiple sequence alignment • Use, for example, Muscle and Gblocks • Reconstruct the most likely phylogenetic tree • Use, for example, PhyML • Orthologs and paralogs can be trivially extracted based on a gene tree
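
The last point, extracting orthologs and paralogs from a gene tree, can be sketched as follows. Two genes are orthologs if their last common ancestor in the tree is a speciation node and paralogs if it is a duplication node. The sketch assumes that internal nodes have already been labelled "S" (speciation) or "D" (duplication), for example by comparison with a species tree; the nested-tuple tree encoding and the gene names are hypothetical.

```python
from itertools import product


def leaves(node):
    """Return the leaf names below a node (a leaf is a plain string)."""
    if isinstance(node, str):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)


def classify_pairs(node, orthologs=None, paralogs=None):
    """Split all gene pairs into orthologs and paralogs.

    A pair is assigned at the node where its two genes first diverge:
    speciation ("S") gives orthologs, duplication ("D") gives paralogs.
    """
    if orthologs is None:
        orthologs, paralogs = set(), set()
    if isinstance(node, str):
        return orthologs, paralogs
    event, left, right = node
    target = orthologs if event == "S" else paralogs
    for x, y in product(leaves(left), leaves(right)):
        target.add(frozenset((x, y)))
    classify_pairs(left, orthologs, paralogs)
    classify_pairs(right, orthologs, paralogs)
    return orthologs, paralogs


# An ancient duplication followed by a human/mouse speciation in each copy.
tree = ("D", ("S", "human_A", "mouse_A"), ("S", "human_B", "mouse_B"))
orthologs, paralogs = classify_pairs(tree)
```

In this example human_A and mouse_A are orthologs, while human_A and mouse_B are paralogs because the duplication predates the speciation.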

  19. Reciprocal matches • Simple “best reciprocal match” is a bad choice • Can only deal with one-to-one orthology • Detection of in-paralogs • Similarity higher within species than between species • Orthologs can now be detected based on best reciprocal matches between in-paralogous groups • One or more out-group organisms can optionally be used to improve the definition of orthologs
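
A minimal sketch of this idea: find best reciprocal matches between two species, then attach in-paralogs, meaning genes that are more similar to the seed gene within its own species than the seed pair is across species. The function names, the symmetric score dictionary keyed by `frozenset`, and the example scores are all assumptions for illustration and do not reproduce any particular published tool.

```python
def best_hit(gene, scores, candidates):
    """Return the candidate with the highest score to gene, or None."""
    scored = [(scores.get(frozenset((gene, c)), 0), c)
              for c in candidates if c != gene]
    scored = [(s, c) for s, c in scored if s > 0]
    return max(scored)[1] if scored else None


def reciprocal_groups(genes_a, genes_b, scores):
    """Return ortholog groups as (genes from A, genes from B) tuples."""
    groups = []
    for a in genes_a:
        b = best_hit(a, scores, genes_b)
        # Keep the pair only if the best hit is reciprocal.
        if b is None or best_hit(b, scores, genes_a) != a:
            continue
        cutoff = scores[frozenset((a, b))]
        # In-paralogs: more similar within the species than the
        # between-species seed pair.
        in_a = [g for g in genes_a
                if g != a and scores.get(frozenset((a, g)), 0) > cutoff]
        in_b = [g for g in genes_b
                if g != b and scores.get(frozenset((b, g)), 0) > cutoff]
        groups.append(([a] + in_a, [b] + in_b))
    return groups


# Hypothetical similarity scores; b2 is a recent duplicate of b1.
scores = {
    frozenset(("a1", "b1")): 100, frozenset(("a1", "b2")): 90,
    frozenset(("b1", "b2")): 150, frozenset(("a2", "b3")): 80,
    frozenset(("a1", "b3")): 20, frozenset(("a2", "b1")): 15,
}
groups = reciprocal_groups(["a1", "a2"], ["b1", "b2", "b3"], scores)
```

A plain best reciprocal match would report only a1 and b1 and silently drop b2; grouping in-paralogs first is what lets the method recover 1-to-many relationships.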

  20. Orthologous groups • Orthologs and paralogs are in principle always defined with respect to two organisms • Orthologous groups instead try to encompass an entire set of organisms • The “inclusiveness” of the orthologous groups depends on how broad a set of organisms the groups cover

  21. Definition of orthologous groups

  22. COGs, KOGs, and NOGs • The COGs and KOGs were manually curated • These were automatically expanded to more species • Tri-clustering • Detection of in-paralogs • Identification of triangles of best reciprocal matches • Merging of triangles that share an edge • Broad phylogenetic coverage • COGs and NOGs cover all three domains of life • KOGs cover all eukaryotes
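
The triangle steps of tri-clustering can be sketched as below, assuming the best reciprocal matches have already been computed and in-paralogs already collapsed. The function names and gene identifiers are hypothetical, and the brute-force triangle search is only suitable for toy inputs.

```python
from itertools import combinations


def triangles(edges):
    """Return all triples of genes connected by three reciprocal matches."""
    edge_set = {frozenset(e) for e in edges}
    nodes = sorted({n for e in edge_set for n in e})
    return [frozenset(t) for t in combinations(nodes, 3)
            if all(frozenset(p) in edge_set for p in combinations(t, 2))]


def merge_triangles(tris):
    """Merge triangles that share an edge (two genes) into groups."""
    parent = list(range(len(tris)))

    def find(i):
        # Union-find lookup with path compression.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(tris)), 2):
        if len(tris[i] & tris[j]) == 2:  # two shared genes = shared edge
            parent[find(i)] = find(j)

    groups = {}
    for i, t in enumerate(tris):
        groups.setdefault(find(i), set()).update(t)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical best reciprocal matches between genes of four species.
edges = [
    ("ecoli_x", "bsub_x"), ("ecoli_x", "hsap_x"), ("bsub_x", "hsap_x"),
    ("bsub_x", "scer_x"), ("hsap_x", "scer_x"),
    ("ecoli_y", "bsub_y"),  # an isolated pair, part of no triangle
]
clusters = merge_triangles(triangles(edges))
```

Requiring a full triangle means a single spurious reciprocal match, such as the isolated pair above, cannot start or join a group on its own.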

  23. Clustering based on similarity • All-against-all sequence similarity is calculated • A standard clustering method is applied to define groups of homologous genes • TribeMCL • Hierarchical clustering • These methods generally detect groups of homologous genes, but are not good for distinguishing between orthologs and paralogs
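
As an illustration of why such clustering mixes orthologs and paralogs, here is a minimal single-linkage sketch, the simplest form of hierarchical clustering cut at a fixed threshold. It is not the Markov clustering used by TribeMCL; the function name, threshold, and scores are assumptions for the example.

```python
def single_linkage(scores, threshold):
    """Cluster genes: any pair scoring >= threshold ends up in one cluster.

    scores maps (gene1, gene2) tuples to a similarity score.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path compression.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (g1, g2), score in scores.items():
        r1, r2 = find(g1), find(g2)
        if score >= threshold and r1 != r2:
            parent[r1] = r2

    clusters = {}
    for gene in list(parent):
        clusters.setdefault(find(gene), []).append(gene)
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical all-against-all scores for three gene families.
scores = {
    ("human_A", "mouse_A"): 200, ("human_B", "mouse_B"): 190,
    ("human_A", "human_B"): 120,  # paralogs from an old duplication
    ("human_C", "mouse_C"): 180, ("human_A", "human_C"): 10,
}
families = single_linkage(scores, threshold=100)
```

The A and B genes land in one cluster because the paralogous pair also passes the threshold, so the result is a family of homologs rather than an orthologous group.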

  24. Meta-servers • Since numerous methods exist for identifying groups of orthologous proteins, meta-servers have begun to emerge • These can be very useful for “fishing expeditions” where one is looking for a remote ortholog of a particular protein of interest • However, such meta-servers do not attempt to unify the different orthologous groups and are thus not useful for genome-wide studies

  25. Function prediction from protein domains Lars Juhl Jensen, EMBL Heidelberg

  26. When homology searches fail • Sometimes no orthologs or even paralogs can be identified by sequence similarity searches, or they are all of unknown function • No functional information can thus be transferred based on simple sequence homology • By instead analyzing the various parts that make up the complete protein, it is nonetheless often possible to predict the protein function

  27. Protein domains • Many eukaryotic proteins consist of multiple globular domains that can fold independently • These domains have been mixed and matched through evolution • Each type of domain contributes towards the molecular function of the complete protein • Numerous resources are able to identify such domains from sequence alone using HMMs

  28. Which domain resource should I use? • SMART is focused on signal transduction domains • Pfam is very actively developed and thus tends to have the most up-to-date domain collection • InterPro is useful for genome annotation since the domains are annotated with GO terms • CDD is conveniently integrated with the NCBI BLAST web interface

  29. Predicting globular domains and intrinsically disordered regions • Not all globular domains have been discovered and the databases are thus not comprehensive • Methods exist for predicting from sequence which regions are globular and which are disordered • GlobPlot uses a simple propensity scale • DisEMBL, DISOPRED, and PONDR all use ensembles of artificial neural networks • Many disordered regions are important for protein function and they should thus not be ignored
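
To show what a propensity-scale predictor looks like in principle, the sketch below averages a per-residue disorder propensity over a sliding window and reports stretches where the average is positive. The values in `TOY_PROPENSITY` are invented for illustration and are not the published GlobPlot scale, and the windowed average is a simplification assumed here rather than GlobPlot's actual procedure.

```python
# Invented values (assumption): positive favours disorder, negative order.
TOY_PROPENSITY = {
    "P": 1.0, "S": 0.8, "G": 0.6, "E": 0.5, "K": 0.5, "Q": 0.4,
    "L": -1.0, "I": -1.0, "V": -0.9, "F": -1.0, "W": -1.0, "Y": -0.8,
}


def disorder_profile(seq, window=5):
    """Return the windowed mean propensity for every residue."""
    values = [TOY_PROPENSITY.get(res, 0.0) for res in seq]
    half = window // 2
    profile = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        profile.append(sum(chunk) / len(chunk))
    return profile


def disordered_regions(seq, window=5):
    """Return half-open (start, end) index ranges with positive propensity."""
    regions, start = [], None
    for i, value in enumerate(disorder_profile(seq, window)):
        if value > 0 and start is None:
            start = i
        elif value <= 0 and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(seq)))
    return regions


# A flexible linker between two hydrophobic, presumably globular, stretches.
sequence = "LIVFLIVF" + "PSGEPSGE" + "LIVFLIVF"
regions = disordered_regions(sequence)
```

Methods such as DisEMBL, DISOPRED, and PONDR replace the fixed scale with trained neural networks, but the output has the same shape: a per-residue score that is thresholded into regions.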
