Prediction of protein function

Prediction of protein function • Lars Juhl Jensen, EMBL Heidelberg

Overview • Part 1 • Homology-based transfer of annotation • Function prediction from protein domains • Part 2 • Prediction of functional motifs from sequence • Feature-based prediction of protein function • Part 3 • Prediction of functional interaction networks

Why do we need to predict function?

What do we mean by function? • The concept “function” is not clearly defined • A structural biologist, a cell biologist, and a medical doctor will have very different views • Many levels of granularity • For the overall definition of “function”, the knowledge and description can be more or less specific • Functional categories are somewhat artificial • People like to put things in boxes …

Descriptions of protein function • Controlled vocabularies • Gene Ontology • SwissProt keywords • KEGG pathways • EcoCyc pathways • Interaction networks • More accurate data models • Reactome • Systems Biology Markup Language (SBML)

Molecular function • Molecular function describes activities, such as catalytic or binding activities, at the molecular level • GO molecular function terms represent activities rather than the entities that perform the actions, and do not specify where or when, or in what context, the action takes place • Examples of broad functional terms are catalytic activity or transporter activity; an example of a narrower term is adenylate cyclase activity

Biological process • A biological process is a series of events accomplished by one or more ordered assemblies of molecular functions • An example of a broad GO biological process term is signal transduction; examples of more specific terms are pyrimidine metabolism or alpha-glucoside transport • It can be difficult to distinguish between a biological process and a molecular function

Cellular component • A cellular component is just that, a component of a cell that is part of some larger object • It may be an anatomical structure (for example, the rough endoplasmic reticulum or the nucleus) or a gene product group (for example, the ribosome, the proteasome or a protein dimer) • The cellular component categories are probably the best defined categories since they correspond to actual entities

Homology-based transfer of annotation • Lars Juhl Jensen, EMBL Heidelberg

Detection of homologs • Pairwise sequence similarity searches • BLAST (fastest) • FASTA • Full Smith-Waterman (most sensitive) • Profile-based similarity searches • PSI-BLAST • Hidden Markov Models (HMMs) • Sequence similarity should always be evaluated at the protein level
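
As an illustration of the pairwise search at the sensitive end of this list, the following is a minimal sketch of Smith-Waterman local alignment scoring. It assumes a simple match/mismatch score and a linear gap penalty; these parameter values are illustrative assumptions, whereas real protein searches typically use a substitution matrix and affine gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Scoring parameters are illustrative assumptions, not values from the slides.
    """
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0,
                          prev[j - 1] + s,   # align a[i-1] with b[j-1]
                          prev[j] + gap,     # gap in b
                          curr[j - 1] + gap) # gap in a
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("XXACDEXX", "YYACDEYY"))
```

Because every cell of the matrix is filled, the method is exhaustive, which is why it is more sensitive but slower than heuristic searches such as BLAST.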

Sequence similarity, sequence homology, and functional homology • Sequence similarity means that the sequences are similar – no more, no less • Sequence homology implies that the proteins are encoded by genes that share a common ancestry • Functional homology means that two proteins from two organisms have the same function • Sequence similarity or sequence homology does not guarantee functional homology

Orthologs vs. paralogs

Functional consequences of gene duplication • Neofunctionalization • One copy has retained the ancestral function and can be treated as a 1–to–1 ortholog (functional homolog) • The other copy has changed its function and behaves much like a paralog • Subfunctionalization • Each copy has taken on a part of the ancestral function • A functional homolog cannot be defined • Each ortholog typically has the same molecular function in a different sub-process or location

1–to–1 orthology • A single gene in one organism corresponds to a single gene in another organism • These can generally be assumed to encode functionally equivalent proteins • Same molecular function • Same biological process • Same localization • 1–to–1 orthology is fairly common in prokaryotes and among very closely related organisms

1–to–many orthology • A single gene in one organism corresponds to multiple genes in another organism • Any mixture of neo- and sub-functionalizations can have occurred • Typically same molecular function • Often different biological process or sub-process • Often different sub-cellular localization or tissue • 1–to–many orthology is very common between simple model organisms and higher eukaryotes

Many–to–many orthology • Many genes in each organism have arisen from a single gene in their last common ancestor • Different neo- and sub-functionalizations have likely taken place in each lineage • Typically same molecular function • Often different biological process or sub-process • Often different sub-cellular localization or tissue • Many–to–many orthology is common between higher eukaryotes that are distantly related
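
The three orthology relationships above differ only in how many genes each organism contributes to the group. A minimal sketch of that classification, assuming the orthologous genes of the two organisms have already been collected (the function name and gene identifiers are hypothetical):

```python
def orthology_type(genes_a, genes_b):
    """Classify an orthology relationship by the number of genes per organism."""
    n_a, n_b = len(set(genes_a)), len(set(genes_b))
    if n_a == 0 or n_b == 0:
        raise ValueError("each organism must contribute at least one gene")
    if n_a == 1 and n_b == 1:
        return "1-to-1"
    if n_a == 1 or n_b == 1:
        return "1-to-many"
    return "many-to-many"


if __name__ == "__main__":
    # Hypothetical gene identifiers, for illustration only.
    print(orthology_type(["yeastX"], ["humanX1", "humanX2"]))
```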

Detection of orthologs • Reconstruction of phylogenetic trees • The theoretically most correct way • Works for analyzing particular genes of interest • Methods based on reciprocal matches • What currently works at the genomic scale • Manual curation • Detection of very remote orthologs may require that knowledge of gene synteny and/or protein function is taken into account

Construction of gene trees • Identify the relevant proteins • Sequence similarity and possibly additional information • Construct a blocked multiple sequence alignment • Use, for example, Muscle and Gblocks • Reconstruct the most likely phylogenetic tree • Use, for example, PhyML • Orthologs and paralogs can be trivially extracted based on a gene tree
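
The last point can be sketched as follows: two genes are orthologs if their last common ancestor in the gene tree is a speciation event, and paralogs if it is a duplication event. The sketch assumes that internal nodes are already labelled as "speciation" or "duplication" (in practice this labelling has to be inferred, for example by comparison with a species tree). The nested-tuple tree representation and the gene names are hypothetical.

```python
from itertools import product


def leaves(node):
    """Return all gene names below a node; a leaf is a plain string."""
    if isinstance(node, str):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)


def classify_pairs(node, orthologs=None, paralogs=None):
    """Split all gene pairs into orthologs and paralogs.

    An internal node is a tuple (event, left, right), where event is
    "speciation" or "duplication". A pair is classified by the event at
    the node where the two genes diverge.
    """
    if orthologs is None:
        orthologs, paralogs = set(), set()
    if isinstance(node, str):
        return orthologs, paralogs
    event, left, right = node
    target = orthologs if event == "speciation" else paralogs
    for x, y in product(leaves(left), leaves(right)):
        target.add(frozenset((x, y)))
    classify_pairs(left, orthologs, paralogs)
    classify_pairs(right, orthologs, paralogs)
    return orthologs, paralogs


if __name__ == "__main__":
    # A duplication that predates the human/mouse speciation.
    tree = ("duplication",
            ("speciation", "human_A", "mouse_A"),
            ("speciation", "human_B", "mouse_B"))
    orthologs, paralogs = classify_pairs(tree)
    print(sorted(sorted(p) for p in orthologs))
    print(sorted(sorted(p) for p in paralogs))
```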

Reciprocal matches • Simple “best reciprocal match” is a bad choice • Can only deal with one-to-one orthology • Detection of in-paralogs • Similarity higher within species than between species • Orthologs can now be detected based on best reciprocal matches between in-paralogous groups • One or more out-group organisms can optionally be used to improve the definition of orthologs
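
The two steps above can be sketched roughly as follows. This is a simplified illustration, not the procedure of any particular tool: it assumes that similarity scores are given as a dictionary keyed by unordered gene pairs, that two genes of one species are in-paralogs when they are more similar to each other than either is to any gene of the other species, and that the score between two groups is the best score between any of their members. All names and scores are hypothetical.

```python
def pair_score(score, x, y):
    """Look up a symmetric similarity score; missing pairs score 0."""
    return score.get(frozenset((x, y)), 0)


def in_paralog_groups(genes, other_genes, score):
    """Group genes of one species that are more similar to each other
    than to any gene of the other species."""
    parent = {g: g for g in genes}

    def find(g):
        while parent[g] != g:
            g = parent[g]
        return g

    def best_between(g):
        return max((pair_score(score, g, o) for o in other_genes), default=0)

    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            if pair_score(score, g, h) > max(best_between(g), best_between(h)):
                parent[find(g)] = find(h)
    groups = {}
    for g in genes:
        groups.setdefault(find(g), set()).add(g)
    return list(groups.values())


def group_score(group_a, group_b, score):
    return max(pair_score(score, a, b) for a in group_a for b in group_b)


def reciprocal_best_groups(groups_a, groups_b, score):
    """Return pairs of in-paralogous groups that are each other's best match."""
    def best(group, candidates):
        return max(candidates, key=lambda c: group_score(group, c, score))

    pairs = []
    for ga in groups_a:
        gb = best(ga, groups_b)
        if best(gb, groups_a) == ga and group_score(ga, gb, score) > 0:
            pairs.append((ga, gb))
    return pairs


if __name__ == "__main__":
    score = {frozenset(("a1", "b1")): 80,
             frozenset(("a1", "b2")): 70,
             frozenset(("b1", "b2")): 90}
    groups_a = in_paralog_groups(["a1"], ["b1", "b2"], score)
    groups_b = in_paralog_groups(["b1", "b2"], ["a1"], score)
    print(reciprocal_best_groups(groups_a, groups_b, score))
```

In this toy example a plain best reciprocal match would only pair a1 with b1, whereas matching between in-paralogous groups recovers the 1-to-many relationship between a1 and both b1 and b2.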

Orthologous groups • Orthologs and paralogs are in principle always defined with respect to two organisms • Orthologous groups instead try to encompass an entire set of organisms • The “inclusiveness” of the orthologous groups depends on how broad a set of organisms the groups cover

Definition of orthologous groups

COGs, KOGs, and NOGs • The COGs and KOGs were manually curated • These were automatically expanded to more species • Tri-clustering • Detection of in-paralogs • Identification of triangles of best reciprocal matches • Merging of triangles that share an edge • Broad phylogenetic coverage • COGs and NOGs cover all three domains of life • KOGs cover all eukaryotes
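
The triangle steps of tri-clustering can be sketched as below. This is a simplified illustration and not the actual COG pipeline: it assumes the best reciprocal matches are already available as a list of gene pairs, each linking genes from two different species, and it skips the in-paralog step. The gene names are hypothetical.

```python
from itertools import combinations


def find_triangles(edges):
    """Return all sets of three genes that are mutually connected."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    triangles = set()
    for a, b in edges:
        for c in neighbours[a] & neighbours[b]:
            triangles.add(frozenset((a, b, c)))
    return triangles


def merge_triangles(triangles):
    """Merge triangles that share an edge into orthologous groups."""
    groups = []  # list of (genes, edges) with pairwise disjoint edge sets
    for tri in triangles:
        genes = set(tri)
        edge_set = {frozenset(e) for e in combinations(sorted(tri), 2)}
        remaining = []
        for g_genes, g_edges in groups:
            if g_edges & edge_set:      # shares at least one edge: merge
                genes |= g_genes
                edge_set |= g_edges
            else:
                remaining.append((g_genes, g_edges))
        remaining.append((genes, edge_set))
        groups = remaining
    return [genes for genes, _ in groups]


if __name__ == "__main__":
    # Best reciprocal matches between genes of four hypothetical species.
    edges = [("h1", "m1"), ("m1", "f1"), ("h1", "f1"),
             ("h1", "y1"), ("m1", "y1"),
             ("h2", "m2"), ("m2", "f2"), ("h2", "f2")]
    for group in merge_triangles(find_triangles(edges)):
        print(sorted(group))
```

Here the triangles (h1, m1, f1) and (h1, m1, y1) share the edge h1–m1 and are merged into one group, while (h2, m2, f2) remains separate.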

Clustering based on similarity • All-against-all sequence similarity is calculated • A standard clustering method is applied to define groups of homologous genes • TribeMCL • Hierarchical clustering • These methods generally detect groups of homologous genes, but are not good for distinguishing between orthologs and paralogs
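
As a minimal sketch of the hierarchical option, the code below performs single-linkage clustering cut at a fixed similarity threshold, which is equivalent to finding connected components among gene pairs scoring at or above the threshold. The threshold, gene names, and scores are hypothetical, and TribeMCL (Markov clustering) works differently and is not shown.

```python
def single_linkage_clusters(genes, similarity, threshold):
    """Cluster genes so that any pair with similarity >= threshold ends up
    in the same cluster (single linkage cut at a fixed threshold)."""
    parent = {g: g for g in genes}

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path halving
            g = parent[g]
        return g

    for (a, b), s in similarity.items():
        if s >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for g in genes:
        clusters.setdefault(find(g), set()).add(g)
    return sorted(sorted(c) for c in clusters.values())


if __name__ == "__main__":
    genes = ["g1", "g2", "g3", "g4", "g5"]
    similarity = {("g1", "g2"): 0.9, ("g2", "g3"): 0.8,
                  ("g3", "g4"): 0.2, ("g4", "g5"): 0.7}
    print(single_linkage_clusters(genes, similarity, threshold=0.5))
```

Nothing in this procedure looks at which species a gene comes from, which is why such clusters capture homologous families but cannot by themselves separate orthologs from paralogs.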

Meta-servers • Since numerous methods exist for identifying groups of orthologous proteins, meta-servers have begun to emerge • These can be very useful for “fishing expeditions” where one is looking for a remote ortholog of a particular protein of interest • However, such meta-servers do not attempt to unify the different orthologous groups and are thus not useful for genome-wide studies

Function prediction from protein domains • Lars Juhl Jensen, EMBL Heidelberg

When homology searches fail • Sometimes no orthologs or even paralogs can be identified by sequence similarity searches, or they are all of unknown function • No functional information can thus be transferred based on simple sequence homology • By instead analyzing the various parts that make up the complete protein, it is nonetheless often possible to predict the protein function

Protein domains • Many eukaryotic proteins consist of multiple globular domains that can fold independently • These domains have been mixed and matched through evolution • Each type of domain contributes towards the molecular function of the complete protein • Numerous resources are able to identify such domains from sequence alone using HMMs

Which domain resource should I use? • SMART is focused on signal transduction domains • Pfam is very actively developed and thus tends to have the most up-to-date domain collection • InterPro is useful for genome annotation since the domains are annotated with GO terms • CDD is conveniently integrated with the NCBI BLAST web interface

Predicting globular domains and intrinsically disordered regions • Not all globular domains have been discovered and the databases are thus not comprehensive • Methods exist for predicting from sequence which regions are globular and which are disordered • GlobPlot uses a simple propensity scale • DisEMBL, DISOPRED, and PONDR all use ensembles of artificial neural networks • Many disordered regions are important for protein function and they should thus not be ignored
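
To give a feel for what a propensity-scale approach looks like, here is a generic sketch that averages per-residue propensities over a sliding window and flags positions above a cutoff. It is not the GlobPlot algorithm, and the propensity values, window size, and cutoff are made-up placeholders chosen purely for illustration.

```python
# Placeholder propensities, NOT a published scale:
# positive values are treated as disorder-promoting, negative as order-promoting.
TOY_PROPENSITY = {
    "P": 1.0, "S": 0.5, "G": 0.5, "E": 0.5, "K": 0.5,
    "W": -1.0, "F": -1.0, "I": -1.0, "L": -0.5, "V": -0.5,
}


def disordered_positions(sequence, window=5, cutoff=0.25):
    """Return 0-based positions whose windowed mean propensity exceeds cutoff.

    Residues missing from the toy scale are treated as neutral (0.0).
    """
    values = [TOY_PROPENSITY.get(residue, 0.0) for residue in sequence]
    half = window // 2
    flagged = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        mean = sum(values[lo:hi]) / (hi - lo)
        if mean > cutoff:
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    print(disordered_positions("WWWWWPPPPP"))
```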
