Prediction of protein function Lars Juhl Jensen, EMBL Heidelberg
Overview • Part 1 • Homology-based transfer of annotation • Function prediction from protein domains • Part 2 • Prediction of functional motifs from sequence • Feature-based prediction of protein function • Part 3 • Prediction of functional interaction networks
What do we mean by function? • The concept “function” is not clearly defined • A structural biologist, a cell biologist, and a medical doctor will have very different views • Many levels of granularity • For the overall definition of “function”, the knowledge and description can be more or less specific • Functional categories are somewhat artificial • People like to put things in boxes …
Descriptions of protein function • Controlled vocabularies • Gene Ontology • SwissProt keywords • KEGG pathways • EcoCyc pathways • Interaction networks • More accurate data models • Reactome • Systems Biology Markup Language (SBML)
Molecular function • Molecular function describes activities, such as catalytic or binding activities, at the molecular level • GO molecular function terms represent activities rather than the entities that perform the actions, and do not specify where or when, or in what context, the action takes place • Examples of broad functional terms are catalytic activity or transporter activity; an example of a narrower term is adenylate cyclase activity
Biological process • A biological process is a series of events accomplished by one or more ordered assemblies of molecular functions • An example of a broad GO biological process term is signal transduction; examples of more specific terms are pyrimidine metabolism or alpha-glucoside transport • It can be difficult to distinguish between a biological process and a molecular function
Cellular component • A cellular component is just that, a component of a cell that is part of some larger object • It may be an anatomical structure (for example, the rough endoplasmic reticulum or the nucleus) or a gene product group (for example, the ribosome, the proteasome or a protein dimer) • The cellular component categories are probably the best defined categories since they correspond to actual entities
Homology-based transfer of annotation Lars Juhl Jensen, EMBL Heidelberg
Detection of homologs • Pairwise sequence similarity searches • BLAST (fastest) • FASTA • Full Smith-Waterman (most sensitive) • Profile-based similarity searches • PSI-BLAST • Hidden Markov Models (HMMs) • Sequence similarity should always be evaluated at the protein level
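The full Smith-Waterman search mentioned above can be illustrated with a minimal sketch of the local alignment recurrence. The scoring scheme here (match +2, mismatch -1, linear gap -2) is a toy assumption chosen for readability; real protein searches typically use a substitution matrix and affine gap penalties, and the function name is hypothetical.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score of sequences a and b.

    Toy scoring (assumption): fixed match/mismatch scores and a linear
    gap penalty instead of a substitution matrix with affine gaps.
    """
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    # Two short, made-up peptide fragments.
    print(smith_waterman_score("MKVLAAGIV", "TKVLAAGQ"))
```

The exhaustive fill of the matrix is what makes this approach more sensitive but slower than heuristic searches such as BLAST.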
Sequence similarity, sequence homology, and functional homology • Sequence similarity means that the sequences are similar – no more, no less • Sequence homology implies that the proteins are encoded by genes that share a common ancestry • Functional homology means that two proteins from two organisms have the same function • Sequence similarity or sequence homology does not guarantee functional homology
Functional consequences of gene duplication • Neofunctionalization • One copy has retained the ancestral function and can be treated as a 1–to–1 ortholog (functional homolog) • The other copy has changed its function and behaves much like a paralog • Subfunctionalization • Each copy has taken on a part of the ancestral function • A functional homolog cannot be defined • Each ortholog typically has the same molecular function in a different sub-process or location
1–to–1 orthology • A single gene in one organism corresponds to a single gene in another organism • These can generally be assumed to encode functionally equivalent proteins • Same molecular function • Same biological process • Same localization • 1–to–1 orthology is fairly common in prokaryotes and among very closely related organisms
1–to–many orthology • A single gene in one organism corresponds to multiple genes in another organism • Any mixture of neo- and sub-functionalizations may have occurred • Typically same molecular function • Often different biological process or sub-process • Often different sub-cellular localization or tissue • 1–to–many orthology is very common between simple model organisms and higher eukaryotes
Many–to–many orthology • Many genes in each organism have arisen from a single gene in their last common ancestor • Different neo- and sub-functionalizations have likely taken place in each lineage • Typically same molecular function • Often different biological process or sub-process • Often different sub-cellular localization or tissue • Many–to–many orthology is common between higher eukaryotes that are distantly related
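The three orthology classes described above can be told apart mechanically once ortholog pairs between two organisms are known. The following is a minimal sketch, assuming the pairs are already given as (gene in organism A, gene in organism B) tuples; the function name and the gene identifiers in the example are hypothetical.

```python
from collections import defaultdict


def classify_orthology(pairs):
    """Group ortholog pairs into connected components and label each one.

    pairs: iterable of (gene_in_A, gene_in_B) tuples (assumed input format).
    Returns a list of (genes_in_A, genes_in_B, label) tuples.
    """
    neighbours = defaultdict(set)
    for a, b in pairs:
        neighbours[("A", a)].add(("B", b))
        neighbours[("B", b)].add(("A", a))

    seen, result = set(), []
    for start in sorted(neighbours):
        if start in seen:
            continue
        # Depth-first search collects all genes linked by orthology.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component

        genes_a = sorted(g for side, g in component if side == "A")
        genes_b = sorted(g for side, g in component if side == "B")
        if len(genes_a) == 1 and len(genes_b) == 1:
            label = "1-to-1"
        elif len(genes_a) == 1 or len(genes_b) == 1:
            label = "1-to-many"
        else:
            label = "many-to-many"
        result.append((genes_a, genes_b, label))
    return result


if __name__ == "__main__":
    example = [("a1", "b1"), ("a2", "b2"), ("a2", "b3"),
               ("a3", "b4"), ("a4", "b4"), ("a3", "b5"), ("a4", "b5")]
    for group in classify_orthology(example):
        print(group)
```

Only the 1-to-1 groups can generally be assumed to be functionally equivalent; for the other two classes the label is a warning that biological process or localization may differ.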
Detection of orthologs • Reconstruction of phylogenetic trees • The theoretically most correct way • Works for analyzing particular genes of interest • Methods based on reciprocal matches • What currently works at the genomic scale • Manual curation • Detection of very remote orthologs may require that knowledge of gene synteny and/or protein function is taken into account
Construction of gene trees • Identify the relevant proteins • Sequence similarity and possibly additional information • Construct a blocked multiple sequence alignment • Use, for example, Muscle and Gblocks • Reconstruct the most likely phylogenetic tree • Use, for example, PhyML • Orthologs and paralogs can be trivially extracted based on a gene tree
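To show how orthologs and paralogs can be read off a gene tree, here is a minimal sketch. It assumes a rooted binary tree written as nested tuples with leaves named "species|gene", and it labels internal nodes with a species-overlap heuristic (a node is treated as a duplication if its two subtrees share a species, otherwise as a speciation). That heuristic, the tree encoding, and all names are assumptions made for illustration, not part of the slides.

```python
from itertools import product


def leaves(tree):
    """Return all leaf names below a node of a nested-tuple tree."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)


def species(leaf):
    """Leaves are assumed to be named 'species|gene'."""
    return leaf.split("|")[0]


def classify_pairs(tree, orthologs=None, paralogs=None):
    """Split all leaf pairs into orthologs and paralogs.

    A pair is orthologous if its last common ancestor is a speciation
    node and paralogous if it is a duplication node (species-overlap
    heuristic, an assumption of this sketch).
    """
    if orthologs is None:
        orthologs, paralogs = set(), set()
    if isinstance(tree, str):
        return orthologs, paralogs
    left, right = tree
    left_leaves, right_leaves = leaves(left), leaves(right)
    overlap = {species(x) for x in left_leaves} & {species(x) for x in right_leaves}
    target = paralogs if overlap else orthologs
    for x, y in product(left_leaves, right_leaves):
        target.add(frozenset((x, y)))
    classify_pairs(left, orthologs, paralogs)
    classify_pairs(right, orthologs, paralogs)
    return orthologs, paralogs


if __name__ == "__main__":
    # Hypothetical tree: a duplication in the human/mouse ancestor after
    # the split from yeast.
    tree = ((("human|A1", "mouse|A1"), ("human|A2", "mouse|A2")), "yeast|A")
    orthologs, paralogs = classify_pairs(tree)
    print(sorted(sorted(p) for p in orthologs))
    print(sorted(sorted(p) for p in paralogs))
```

In this example the yeast gene is orthologous to all four human and mouse genes, which is the 1-to-many situation that a plain best reciprocal match cannot represent.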
Reciprocal matches • Simple “best reciprocal match” is a bad choice • Can only deal with one-to-one orthology • Detection of in-paralogs • Similarity higher within species than between species • Orthologs can now be detected based on best reciprocal matches between in-paralogous groups • One or more out-group organisms can optionally be used to improve the definition of orthologs
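The two-step procedure above (in-paralog detection followed by best reciprocal matches between groups) can be sketched as follows. This is a simplified illustration under stated assumptions: exactly two species, similarity scores supplied as a dictionary, in-paralogs defined as same-species genes whose mutual score exceeds the best between-species score of both genes, and group-to-group similarity taken as the maximum member-to-member score. The function name and the example data are hypothetical, and no out-group is used.

```python
from collections import defaultdict


def orthologs_from_matches(scores, species):
    """Best reciprocal matches between in-paralogous groups (two species).

    scores:  {(gene, gene): similarity}, treated as symmetric (assumed format)
    species: {gene: species name}; must contain every gene in scores
    Returns a set of frozenset({group, partner_group}) pairs.
    """
    sim = defaultdict(dict)
    for (g, h), s in scores.items():
        sim[g][h] = s
        sim[h][g] = s
    genes = sorted(species)

    # Best score of each gene against any gene of the other species.
    best_out = {
        g: max((s for h, s in sim[g].items() if species[h] != species[g]), default=0)
        for g in genes
    }

    # Union-find: merge same-species genes that are more similar to each
    # other than either is to anything in the other species.
    parent = {g: g for g in genes}

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for g in genes:
        for h, s in sim[g].items():
            if species[h] == species[g] and s > best_out[g] and s > best_out[h]:
                parent[find(g)] = find(h)

    members = defaultdict(set)
    for g in genes:
        members[find(g)].add(g)
    groups = [frozenset(m) for m in members.values()]

    def best_partner(group):
        own = species[next(iter(group))]
        best, best_score = None, 0
        for other in groups:
            if species[next(iter(other))] == own:
                continue
            score = max(sim[g].get(h, 0) for g in group for h in other)
            if score > best_score:
                best, best_score = other, score
        return best

    pairs = set()
    for group in groups:
        partner = best_partner(group)
        if partner is not None and best_partner(partner) == group:
            pairs.add(frozenset((group, partner)))
    return pairs


if __name__ == "__main__":
    species = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "b2": "B"}
    scores = {("a1", "a2"): 90, ("a1", "b1"): 60, ("a2", "b1"): 50, ("a3", "b2"): 70}
    for pair in orthologs_from_matches(scores, species):
        print([sorted(group) for group in pair])
```

With these made-up scores, a plain best reciprocal match would pair only a1 with b1; grouping a1 and a2 as in-paralogs first keeps both as co-orthologs of b1.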
Orthologous groups • Orthologs and paralogs are in principle always defined with respect to two organisms • Orthologous groups instead try to encompass an entire set of organisms • The “inclusiveness” of the orthologous groups depends on how broad a set of organisms the groups cover
COGs, KOGs, and NOGs • The COGs and KOGs were manually curated • These were automatically expanded to more species • Tri-clustering • Detection of in-paralogs • Identification of triangles of best reciprocal matches • Merging of triangles that share an edge • Broad phylogenetic coverage • COGs and NOGs cover all three domains of life • KOGs cover all eukaryotes
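The triangle steps of tri-clustering can be sketched in a few lines. This is a toy illustration, assuming the input is a list of best reciprocal match edges that already connect genes (or in-paralogous groups) from different species; it enumerates triangles by brute force, which would not scale to real genomes, and the function name is hypothetical.

```python
from itertools import combinations


def merge_triangles(edges):
    """Find triangles of reciprocal matches and merge those sharing an edge.

    edges: iterable of (gene, gene) best reciprocal matches (assumed to be
    between-species matches only). Returns a sorted list of sorted clusters.
    """
    edge_set = {frozenset(e) for e in edges}
    nodes = sorted({n for e in edge_set for n in e})

    # Brute-force triangle enumeration; fine for a small example only.
    triangles = [
        frozenset(t)
        for t in combinations(nodes, 3)
        if all(frozenset(p) in edge_set for p in combinations(t, 2))
    ]

    # Union-find over triangle indices.
    parent = list(range(len(triangles)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(triangles)), 2):
        # Two distinct triangles share an edge exactly when they share two nodes.
        if len(triangles[i] & triangles[j]) == 2:
            parent[find(i)] = find(j)

    clusters = {}
    for i, triangle in enumerate(triangles):
        clusters.setdefault(find(i), set()).update(triangle)
    return sorted(sorted(c) for c in clusters.values())


if __name__ == "__main__":
    edges = [("a", "b"), ("b", "c"), ("a", "c"), ("b", "d"), ("c", "d"),
             ("x", "y"), ("y", "z"), ("x", "z"), ("d", "e")]
    print(merge_triangles(edges))
```

Genes that take part in no triangle, such as "e" in the example, are left out of every group, which reflects the requirement of consistent matches across at least three species.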
Clustering based on similarity • All-against-all sequence similarity is calculated • A standard clustering method is applied to define groups of homologous genes • TribeMCL • Hierarchical clustering • These methods generally detect groups of homologous genes, but are not good for distinguishing between orthologs and paralogs
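As an illustration of the hierarchical option, the sketch below runs single-linkage agglomerative clustering on all-against-all similarities and stops merging below a threshold. The choice of single linkage, the threshold value, the input format, and the function name are all assumptions made for this example; TribeMCL uses a different (Markov clustering) algorithm that is not shown here.

```python
def single_linkage(genes, sim, threshold):
    """Single-linkage agglomerative clustering cut at a similarity threshold.

    genes:     list of gene identifiers
    sim:       {frozenset({gene, gene}): similarity}; missing pairs count as 0
    threshold: merging stops once the best linkage drops below this value
    """
    clusters = [{g} for g in genes]

    def linkage(c1, c2):
        # Single linkage: similarity of the closest pair of members.
        return max(sim.get(frozenset((g, h)), 0.0) for g in c1 for h in c2)

    while len(clusters) > 1:
        best_score, i, j = max(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if best_score < threshold:
            break
        clusters[i] |= clusters[j]
        del clusters[j]
    return sorted(sorted(c) for c in clusters)


if __name__ == "__main__":
    genes = ["g1", "g2", "g3", "g4", "g5"]
    sim = {
        frozenset(("g1", "g2")): 0.9,
        frozenset(("g2", "g3")): 0.6,
        frozenset(("g4", "g5")): 0.8,
        frozenset(("g3", "g4")): 0.1,
    }
    print(single_linkage(genes, sim, threshold=0.5))
```

Nothing in the procedure looks at which species a gene comes from, which is why the resulting groups are families of homologs rather than sets of orthologs.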
Meta-servers • Since numerous methods exist for identifying groups of orthologous proteins, meta-servers have begun to emerge • These can be very useful for “fishing expeditions” where one is looking for a remote ortholog of a particular protein of interest • However, such meta-servers do not attempt to unify the different orthologous groups and are thus not useful for genome-wide studies
Function prediction from protein domains Lars Juhl Jensen, EMBL Heidelberg
When homology searches fail • Sometimes no orthologs or even paralogs can be identified by sequence similarity searches, or they are all of unknown function • No functional information can thus be transferred based on simple sequence homology • By instead analyzing the various parts that make up the complete protein, it is nonetheless often possible to predict the protein function
Protein domains • Many eukaryotic proteins consist of multiple globular domains that can fold independently • These domains have been mixed and matched through evolution • Each type of domain contributes towards the molecular function of the complete protein • Numerous resources are able to identify such domains from sequence alone using HMMs
Which domain resource should I use? • SMART is focused on signal transduction domains • Pfam is very actively developed and thus tends to have the most up-to-date domain collection • InterPro is useful for genome annotation since the domains are annotated with GO terms • CDD is conveniently integrated with the NCBI BLAST web interface
Predicting globular domains and intrinsically disordered regions • Not all globular domains have been discovered and the databases are thus not comprehensive • Methods exist for predicting from sequence which regions are globular and which are disordered • GlobPlot uses a simple propensity scale • DisEMBL, DISOPRED, and PONDR all use ensembles of artificial neural networks • Many disordered regions are important for protein function and they should thus not be ignored