From Protein Sequence to Function:

Slide 1:From Protein Sequence to Function: Functional Analysis of Protein Sequences and Protein Classification Anastasia Nikolskaya Assistant Professor (Research) Protein Information Resource Department of Biochemistry and Molecular Biology Georgetown University Medical Center

Slide 2:Overview Role of Bioinformatics/Computational Biology in Proteomics Research Genomics Functional Annotation of Proteins Classification of Proteins Bioinformatics Databases and Analytical Tools: Dr. Mazumder and Dr. Hu Sequence function

Slide 3:Functional Genomics and Proteomics Proteomics studies biological systems based on global knowledge of protein sets (proteomes). Functional genomics studies biological functions of proteins, complexes, pathways based on the analysis of genome sequences. Includes functional assignments for protein sequences.

Slide 4:Proteomics Data: Gene Expression Profiling - Genome-Wide Analyses of Gene Expression Data: Structural Genomics - Determine 3D Structures of All Protein Families Data: Genome Projects (Sequencing) - Functional genomics - Knowing complete genome sequences of a number of organisms is the basis of the proteomics research

Slide 5:Bioinformatics and Genomics/Proteomics

Slide 7:Most new proteins come from genome sequencing projects Mycoplasma genitalium - 484 proteins Escherichia coli - 4,288 proteins S. cerevisiae (yeast) - 5,932 proteins C. elegans (worm) ~ 19,000 proteins Homo sapiens ~ 40,000 proteins

Slide 8:Advantages of knowing the complete genome sequence All encoded proteins can be predicted and identified The missing functions can be identified and analyzed Peculiarities and novelties in each organism can be studied Predictions can be made and verified

Slide 9:The changing face of protein science 20th century Few well-studied proteins Mostly globular with enzymatic activity Biased protein set 21st century Many �hypothetical� proteins Various, often with no enzymatic activity Natural protein set

Slide 10:Properties of the natural protein set Unexpected diversity of even common enzymes (analogous, paralogous, xenologous, enzymes) Conservation of the reaction chemistry, but not the substrate specificity Functional diversity in closely related proteins Abundance of new structures

Slide 13:Experimentally characterized Up-to-date information, manually annotated (curated database!) �Knowns� = Characterized by similarity (closely related to experimentally characterized) Make sure the assignment is plausible Function can be predicted Extract maximum possible information Avoid errors and overpredictions Fill the gaps in metabolic pathways �Unknowns� (conserved or unique) Rank by importance

Slide 14:Problems in functional assignments for �knowns� Previous low quality annotations

Slide 16:Problems in functional assignments for �knowns�

Slide 18:Experimentally characterized �Knowns� = Characterized by similarity (closely related to experimentally characterized) Make sure the assignment is plausible Function can be predicted Extract maximum possible information Avoid errors and overpredictions Fill the gaps in metabolic pathways �Unknowns� (conserved or unique) Rank by importance

Slide 19:Dealing with �hypothetical� proteins Computational analysis Sequence analysis of the new ORFs Structural analysis Determination of the 3D structure Mutational analysis Functional analysis Expression profiling Tracking of cellular localization

Slide 20:Cluster analysis of protein families (family databases) Use of sophisticated database searches (PSI-BLAST, HMM) Detailed manual analysis of sequence similarities Functional prediction: comutational analysis

Slide 21:Those amino acids that are conserved in divergent proteins (archaeal and bacterial, hyperthermophilic and mesophilic) are likely to be important for catalytic activity. Comparative analysis allows us to find subtle sequence similarities in proteins that would not have been noticed otherwise Prediction of the 3D fold and general biochemical function is much easier than prediction of exact biological (or biochemical) function.

Slide 22:Reaction chemistry often remains conserved even when sequence diverges almost beyond recognition Sequence database searches that use exotic or highly divergent query sequences often reveal more subtle relationships than those using queries from humans or standard model organisms (E. coli, yeast, worm, fly). Sequence analysis complements structural comparisons and can greatly benefit from them

Slide 23:Poorly characterized protein families Enzyme activity can be predicted, the substrate remains unknown (ATPases, GTPases, oxidoreductases, methyltransferases, acetyltransferases) Helix-turn-helix motif proteins (predicted transcriptional regulators) Membrane transporters

Slide 24:Phylogenetic distribution Wide - most likely essential Narrow - probably clade-specific Patchy - most intriguing, niche-specific Domain association � Rosetta Stone (for multidomain proteins) Gene neighborhood (operon organization) Functional prediction: computational analysis

Slide 26:Functional Prediction:Role of Structural Genomics Protein Structure Initiative: Determine 3D Structures of All Proteins Family Classification: Organize Protein Sequences into Families, collect families without known structures Target Selection: Select Family Representatives as Targets Structure Determination: X-Ray Crystallography or NMR Spectroscopy Homology Modeling: Build Models for Other Proteins by Homology Functional prediction based on structure Structural genomics is the systematic determination of 3-dimensional structures of proteins representative of the range of protein structure and function found in nature. The aim, ultimately, is to build a body of structural information that will facilitate prediction of a reasonable structure and potential function for almost any protein from knowledge of its coding sequence. Such information will be essential for understanding the functioning of the human proteome, the ensemble of tens of thousands of proteins specified by the human genome. Structural genomics is the systematic determination of 3-dimensional structures of proteins representative of the range of protein structure and function found in nature. The aim, ultimately, is to build a body of structural information that will facilitate prediction of a reasonable structure and potential function for almost any protein from knowledge of its coding sequence. Such information will be essential for understanding the functioning of the human proteome, the ensemble of tens of thousands of proteins specified by the human genome.

Slide 27:Structural Genomics: Structure-Based Functional Assignments Structural genomics is the systematic determination of 3-dimensional structures of proteins representative of the range of protein structure and function found in nature. The aim, ultimately, is to build a body of structural information that will facilitate prediction of a reasonable structure and potential function for almost any protein from knowledge of its coding sequence. Such information will be essential for understanding the functioning of the human proteome, the ensemble of tens of thousands of proteins specified by the human genome. Structural genomics is the systematic determination of 3-dimensional structures of proteins representative of the range of protein structure and function found in nature. The aim, ultimately, is to build a body of structural information that will facilitate prediction of a reasonable structure and potential function for almost any protein from knowledge of its coding sequence. Such information will be essential for understanding the functioning of the human proteome, the ensemble of tens of thousands of proteins specified by the human genome.

Slide 29:Functional prediction: problem areas Identification of protein-coding regions Delineation of potential function(s) for distant paralogs Identification of domains in the absense of close homologs Analysis of proteins with low sequence complexity

Slide 30:Experimentally characterized Up-to-date information, manually annotated �Knowns� = Characterized by similarity (closely related to experimentally characterized) Make sure the assignment is plausible Function can be predicted Extract maximum possible information Avoid errors and overpredictions Fill the gaps in metabolic pathways �Unknowns� (conserved or unique) Rank by importance

Slide 31:�Unknown unknowns� Phylogenetic distribution Wide - most likely essential Narrow - probably clade-specific Patchy - most intriguing, niche-specific

Slide 32:To deal with the ocean of new sequences, need �natural� protein classification Protein families are real and reflect evolutionary relationships Function often follows along the family lines Protein classification systems can be used to: Improve sensitivity of protein identification, simplify detection of non-obvious relationships Provide accurate automatic annotation for new sequences Detect and correct genome annotation errors systematically Drive other annotations (actve site etc) Provide basis for evolutionary and comparative research

Slide 33:The ideal system would be: Comprehensive, with each sequence classified either as a member of a family or as an �orphan� sequence Hierarchical and based on evolution, with families united into superfamilies on the basis of distant homology Allow for simultaneous use of the whole protein and domain information (domains mapped onto proteins) Allow for automatic classification/annotation of new sequences Expertly curated (family name, function, evidence attribution (experimental vs predicted), background etc). This is the only way to avoid annotation errors and prevent error propagation

Slide 34:Protein Evolution Tree of Life & Evolution of Protein Families (Dayhoff, 1978) Can build a tree representing evolution of a protein family, based on sequences Othologus Gene Family: Organismal and Sequence Trees Match Well

Slide 35:Protein Evolution Homolog Common Ancestors Common 3D Structure Usually at least some sequence similarity (sequence motifs or more close similarity) Ortholog Derived from Speciation Paralog Derived from Duplication Homology: Similarity in DNA or protein sequences between individuals of the same species or among different species. Evolutionary approaches, including cross species sequence comparisons and studies on features of genome organization, evolution and conserved synteny are critical. Homology: Similarity in DNA or protein sequences between individuals of the same species or among different species. Evolutionary approaches, including cross species sequence comparisons and studies on features of genome organization, evolution and conserved synteny are critical.

Slide 36:Orthologs and Paralogs

Slide 37:Orthologs and Paralogs

Slide 38:Levels of Protein Classification

Slide 39:Protein Family-Domain-Motif Domain: Evolutionary/Functional/Structural Unit Domain = structurally compact, independently folding unit that forms a stable three-dimentional structure and shows a certain level of evolutionary conservation. Usually, corresponds to an evolutionary unit. A protein can consist of a single domain or multiple domains. Proteins have modular structure. Motif: Conserved Functional/Structural Site Here, for example, are diagrammatic representations of 5 superfamilies containing the calcineurin-like phosphoesterase domain, in several cases in association with other known types of domains.Here, for example, are diagrammatic representations of 5 superfamilies containing the calcineurin-like phosphoesterase domain, in several cases in association with other known types of domains.

Slide 40:Protein Evolution:Sequence Change vs. Domain Shuffling

Slide 41:Recent Domain Shuffling

Slide 42:Practical classification of proteins:setting realistic goals

Slide 43:Complementary approaches Classify domains Allows to build a hierarchy and trace evolution all the way to the deepest possible level, the last point of traceable homology and common origin Can usually annotate only generic biochemical function Classify whole proteins Does not allow to build a hierarchy deep along the evolutionary tree because of domain shuffling Can usually annotate specific biological function (value for the user and for the automatic individual protein annotation)

Slide 44:Protein Family Databases to be discussed PIRSF: Proteins in PIRPSD: 283,289 Proteins �classified: � 187,871 2/3 of the PIR proteins InterPro COGs/KOGs ~ 70% of each microbial genome ~ 50% of each Eukaryotic genome in 3-clade KOG ~ 20% ? of each Eukaryotic genome in LSEs

Slide 45:PIR Web Site (http://pir.georgetown.edu)

Slide 46:PIRSF protein classification system

Slide 47:Whole protein functional annotation [NiFe]-hydrogenase maturation factor, carbamoyl phosphate-converting enzyme SF006256 Acylphosphatase � Znf x2 � YrdC - On the basis of domain composition alone: can not tell that this is a hydrogenase maturation factor various speculations for exact role: RNA-binding translation factor, maturation protease �. Recent data: carbamoyl phosphate-converting enzyme

Slide 48:PIRSF classification is based on evolution Sequence similarity, domain architecture and phyletic pattern guide PIRSF creation/curation PIRSF Families range from ancient conserved proteins to unique lineage specific expansions PIRSF Families are monophyletic (with some exceptions) Members are homologs (may be orthologs or paralogs)

Slide 49:Levels of protein classification

Slide 50:PIRSF curation and annotation Preliminary clusters Computationally generated, not curated Curated Families (preliminary curation, first-tier annotation) Membership, with no reclustering Regular members: seed, representative Associate members Signature domains & HMM thresholds Fully curated Families (second-tier annotation) Membership Family name Description, bibliography (optional) Integrated into InterPro Recently, the curation interface has been enhanced.Recently, the curation interface has been enhanced.

Slide 51:iProClass PIRSF report

Slide 52:Systematic correction of annotation errors:Chorismate mutase Chorismate Mutase (CM), AroQ class PIRSF001501 � CM (Prokaryotic type) [PF01817] PIRSF001499 � TyrA bifunctional enzyme (Prok) [PF01817-PF02153] PIRSF001500 � PheA bifunctional enzyme (Prok) [PF01817-PF00800] PIRSF017318 � CM (Eukaryotic type) [Regulatory Dom-PF01817] Chorismate Mutase, AroH class PIRSF005965 � CM [PF01817] CSM: Class: All alpha proteins Fold: Chorismate mutase II multihelical; core: 6 helices, bundle 1DBF: Class: Alpha and beta proteins (a+b) Mainly antiparallel beta sheets (segregated alpha and beta regions) Fold: Bacillus chorismate mutase-like core: beta-alpha-beta-alpha-beta(2); mixed beta-sheet: order: 1423, strand 4 is antiparallel to the rest CSM: Class: All alpha proteins Fold: Chorismate mutase II multihelical; core: 6 helices, bundle 1DBF: Class: Alpha and beta proteins (a+b) Mainly antiparallel beta sheets (segregated alpha and beta regions) Fold: Bacillus chorismate mutase-like core: beta-alpha-beta-alpha-beta(2); mixed beta-sheet: order: 1423, strand 4 is antiparallel to the rest

Slide 53:Chorismate Mutase Convergent Evolution � EC 5.4.99.5 (Non-Orthologous Gene Displacement) Two Distinct Sequence/Structure Types AroQ Class: SCOP (all a), core: 6 helices, bundle AroH Class: SCOP (a+b), core: beta-alpha-beta-alpha-beta(2) Pfam Domain: PF01817 CSM: Class: All alpha proteins Fold: Chorismate mutase II multihelical; core: 6 helices, bundle 1DBF: Class: Alpha and beta proteins (a+b) Mainly antiparallel beta sheets (segregated alpha and beta regions) Fold: Bacillus chorismate mutase-like core: beta-alpha-beta-alpha-beta(2); mixed beta-sheet: order: 1423, strand 4 is antiparallel to the rest CSM: Class: All alpha proteins Fold: Chorismate mutase II multihelical; core: 6 helices, bundle 1DBF: Class: Alpha and beta proteins (a+b) Mainly antiparallel beta sheets (segregated alpha and beta regions) Fold: Bacillus chorismate mutase-like core: beta-alpha-beta-alpha-beta(2); mixed beta-sheet: order: 1423, strand 4 is antiparallel to the rest

Slide 54:Systematic correction of annotation errors:IMPDH

Slide 55:Propagation of protein annotationwithin UniProt (under development) Homeomorphic family level is the primary PIRSF curation and annotation level, most invested with biological meaning Reliable automatic assignment of new family members Systematic detection of annotation errors (curator looks at every protein annotation in the family) PIRSF Family/Subfamily name - can be applied to every member - make possible the automatic transfer of name from PIRSF to every unnamed/unannotated member in UniProt Preventing error propagation: evidence attribution experimental (validated and tentative) and predicted

Slide 56:InterPro(at EBI)

Slide 57:InterPro Entry

Slide 58:PIR Superfamilies are being integrated into InterPro

Slide 59:COGs (Clusters of Orthologous Groups)(at NCBI) Complete genomes Reciprocal best hits No score cutoffs Comparative genomics � a branch of computational biology that uses complete genome sequences

Slide 61:Construction of COGs:

Slide 63:Construction of COGs:Add all homologs

Slide 69:What to do with a new porotein sequence I. Basic: - Domain analysis (SMART = recommended; PFAM, CDD is included in BLAST output) BLAST Curated protein family databases (PIRSF, InterPro, COGs) II. If not sufficient: Literature (PubMed) from links from individual entries on the BLAST output (look for SwissProt entries first) Refined PubMed search using gene/protein names, synomims, function and other terms you found III. Advanced: - Multiple sequence alignments - Phylogenetic tree reconstruction

Slide 71:Examples for analysis: 1. Retrieve one of the following protein sequences: PIR: C69086 D64376 GenBank GI:15679635. Using analysis tools available on the web, check if the functional annotation is correct, and provide correct annotation without looking at internal PIR or COG annotations (Run BLAST with CDsearch and SMART to start with). When you are done, look at the PIR curated SF annotation, and at COG annotations. What caused the wrong annotations? In BLAST outputs for these sequences, do you see other wrongly annotated proteins (same and different type errors)? Next, analyze the C-terminal domain of these proteins by PSI-BLAST (and alignment analysis) and suggest any speculations as to its function (homework). �

Slide 72:Examples for analysis: 2. Retrieve the following sequence: GI:7019521 Take a look at the associated publication (reference). Analyze the sequence to see if any additional information can be obtained (run PSI-BLAST, and (as a homework) construct multiple alignment). Take a look at taxonomy report: what does it tell you? Find experimental paper associated with one of the sequences found by PSI-BLAST. What annotation is appropriate for this sequence and for the entire family?

Slide 73:Examples for analysis: 3. Predict the function of the following proteins: GenBank: GI: 27716853 E. coli YjeE protein Verify and/or correct the following functional annotations. Can you explain why the erroneous annotations were made? PIR: H87387 GenBank: GI:15606003 GI:15807219 PIR: F70338

Slide 74:Examples for analysis: 4. Homework: an exercise in transitive relationships:Start with>gi|20093648|ref|NP_613495.1| Uncharacterized membrane protein, conserved in Archaea [Methanopyrus kandleri AV19](this is a short membrane protein); run PSI-BLAST, make sure you have filtering, complexity and CD-search off. There are no good hits but a bunch of sub-threshold ones. Collect "suspect" relations, use them as queries and expand the net. You will be able to come up with two proteins:>gi|21227474|ref|NP_633396.1| hypothetical protein [Methanosarcina mazei Goe1] and>gi|14324537|dbj|BAB59464.1| hypothetical protein [Thermoplasma volcanium]When used as a PSI-BLAST query, the first will tie the Methanopyrus protein into a group, while the second will tie this group to the Sec61 subunit of preprotein translocase.Then, of course, you can obtain the same result with CD-search in a single step ?.

From Protein Sequence to Function:

From Protein Sequence to Function:

Presentation Transcript

Protein Structure and Enzyme Function

Tools to analyze protein characteristics

Protein Folding Programs

Protein structure Visualization Molecular Story

Building an Invisible Puzzle: Predicting Protein Structure and Function from Sequence

Tools to analyze protein characteristics

Coursework Sequence Operation

2d-3D Structure Modelling

Molecular Graphics Perspective of Protein Structure and Function

Prediction of protein structure

Sequence-Structure-Function

Protein Structure and Function

Prediction of protein disorder

Chap.3 Protein Structure & Function

Protein Folding Programs

Protein Structure and Function

Centre for Integrative Bioinformatics VU (IBIVU)

Classification of protein and domain families

Proteins Determine Function

Some Specific “Informatics” tools of Bioinformatics

Proteomics technologies and protein-protein interaction

Protein-protein interactions

From Protein Sequence to Function: