Anastasia Nikolskaya PIR (Protein Information Resource), Georgetown University Medical Center

FUNCTIONAL ANALYSIS OF PROTEIN SEQUENCES: ANNOTATION AND FAMILY CLASSIFICATION Anastasia Nikolskaya PIR (Protein Information Resource), Georgetown University Medical Center

Overview
Problem:
• Most new protein sequences come from genome sequencing projects
• Many have unknown functions
• Large-scale functional annotation of these sequences based simply on BLAST best hit has pitfalls; results are far from perfect
Functional Analysis of Protein Sequences:
• Homology-based (sequence analysis, structure analysis)
• Non-homology (genome context, phylogenetic distribution)
Solution for Large-scale Annotation:
• Highly curated and annotated protein classification system
• Automatic annotation of sequences based on protein families
PIRSF Protein Classification System
• Whole-protein family classification based on evolution
• Highly annotated, optimized for annotation propagation
• Functional predictions for uncharacterized proteins
• Used to facilitate and standardize annotations in UniProt

Proteomics and Bioinformatics • Data: Gene expression profiling: genome-wide analysis of gene expression • Data: Protein-protein interaction • Data: Structural genomics: 3D structures of all protein families • Data: Genome projects (sequencing) • … Bioinformatics: computational analysis and integration of these data; making predictions (function, etc.), reconstructing pathways

What’s In It For Me? • When an experiment yields a sequence (or a set of sequences), we need to find out as much as we can about this protein and its possible function from available data • Especially important for poorly characterized or uncharacterized (“hypothetical”) proteins • More challenging for large sets of sequences generated by large-scale proteomics experiments • The quality of this assessment is often critical for interpreting experimental results and making hypotheses for future experiments Sequence -> function

Work with Protein, not DNA Sequence
[Figure: genomic DNA sequence, gene recognition (promoter, 5' UTR, exons 1 to 3, introns, 3' UTR), protein sequence, followed by structure determination, function analysis and family classification; labels: protein family, molecular evolution, gene network, metabolic pathway, protein structure]
DNA Sequence -> Gene -> Protein Sequence -> Function

The Changing Face of Protein Science
20th century:
• Few well-studied proteins
• Mostly globular with enzymatic activity
• Biased protein set
21st century:
• Many “hypothetical” proteins (most new proteins come from genome sequencing projects, many have unknown functions)
• Various, often with no enzymatic activity
• Natural protein set
Credit: Dr. M. Galperin, NCBI

Knowing the Complete Genome Sequence Advantages: • All encoded proteins can be predicted and identified • The missing functions can be identified and analyzed • Peculiarities and novelties in each organism can be studied • Predictions can be made and verified Challenge: • Accurate assignment of known or predicted functions (functional annotation)

Protein group                   E. coli  M. jannaschii  S. cerevisiae  H. sapiens
Characterized experimentally       2046             97           3307       10189
Characterized by similarity        1083           1025           1055       10901
Unknown, conserved                  285            211           1007        2723
Unknown, no similarity              874            411            966        7965
(from Koonin and Galperin, 2003, with modifications)
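
The counts above are easier to compare as proportions of each genome. The following is a minimal Python sketch that does this arithmetic; the counts are copied from the slide, while the dictionary layout, the short category keys, and the function name are illustrative assumptions.

```python
# Counts copied from the slide (Koonin and Galperin, 2003, with modifications).
# The short category keys are labels chosen for this sketch.
COUNTS = {
    "E. coli": {"experimental": 2046, "similarity": 1083,
                "unknown_conserved": 285, "unknown_no_similarity": 874},
    "M. jannaschii": {"experimental": 97, "similarity": 1025,
                      "unknown_conserved": 211, "unknown_no_similarity": 411},
    "S. cerevisiae": {"experimental": 3307, "similarity": 1055,
                      "unknown_conserved": 1007, "unknown_no_similarity": 966},
    "H. sapiens": {"experimental": 10189, "similarity": 10901,
                   "unknown_conserved": 2723, "unknown_no_similarity": 7965},
}


def category_percentages(counts):
    """Return, for each genome, the percentage of proteins in each category."""
    result = {}
    for genome, by_category in counts.items():
        total = sum(by_category.values())
        result[genome] = {cat: 100.0 * n / total for cat, n in by_category.items()}
    return result


if __name__ == "__main__":
    for genome, pct in category_percentages(COUNTS).items():
        summary = ", ".join(f"{cat}: {value:.1f}%" for cat, value in pct.items())
        print(f"{genome}: {summary}")
```

Running the script prints one line per genome, which makes it easy to see how small the experimentally characterized fraction is outside the model organisms.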

Functional Annotation for Different Groups of Proteins
• Experimentally characterized
  • Find up-to-date information, accurate interpretation
• Characterized by similarity (“knowns”) = closely related to experimentally characterized
  • Avoid propagation of errors
• Function can be predicted (no close sequence similarity, may be distant similarity to characterized proteins)
  • Extract maximum possible information, avoid errors and overpredictions
  • Most value-added (fill the gaps in metabolic pathways, etc.)
• “Unknowns” (conserved or unique)
  • Rank by importance

How are Protein Sequences Annotated? (“regular approach”)
Protein Sequence -> Function
• Automatic assignment based on sequence similarity (best BLAST hit): gene name, protein name, function
• Large-scale functional annotation of sequences based simply on BLAST best hit has pitfalls; results are far from perfect
• To avoid mistakes, need human intervention (manual annotation)
Quality vs Quantity

Problems in Functional Assignments for “Knowns”
• Misinterpreted experimental results (e.g. suppressors, cofactors)
• Biologically senseless annotations
  • Arabidopsis: separation anxiety protein-like
  • Helicobacter: brute force protein
  • Methanococcus: centromere-binding protein
  • Plasmodium: frameshift
• “Goofy” mistakes of sequence comparison (e.g. abc1/ABC)
• Multi-domain organization of proteins
• Low sequence complexity (coiled-coil, transmembrane, non-globular regions)
• Enzyme evolution:
  • Divergence in sequence and function (minor mutation in active site)
  • Non-orthologous gene displacement: convergent evolution

Problems in Functional Assignments for “Knowns”: multi-domain organization of proteins
[Figure: a new sequence containing an ACT domain is searched with BLAST; the top hit, a chorismate mutase, consists of a chorismate mutase domain and an ACT domain]
In BLAST output, top hits are to chorismate mutases -> the name “chorismate mutase” is automatically assigned to the new sequence. ERROR! (The protein gets an erroneous name and EC number, is assigned to an erroneous pathway, etc.)
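
The multi-domain pitfall described above can be illustrated with a minimal Python sketch. It assumes the search hits have already been parsed into small records; the `Hit` fields, the `best_hit_name` function, and the 0.8 coverage cutoff are hypothetical choices made for this illustration, not part of BLAST or of the annotation procedure described in these slides. The idea is simply to refuse name transfer when the best hit aligns to only part of the query or only part of the database protein.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One pairwise hit, reduced to the fields this sketch needs (hypothetical layout)."""
    subject_name: str
    bit_score: float
    query_start: int      # 1-based, inclusive
    query_end: int        # 1-based, inclusive
    subject_start: int    # 1-based, inclusive
    subject_end: int      # 1-based, inclusive
    subject_length: int


def coverage(start, end, length):
    """Fraction of a sequence of the given length spanned by start..end."""
    return (end - start + 1) / length


def best_hit_name(hits, query_length, min_coverage=0.8):
    """Return (name, warning); min_coverage is an arbitrary illustrative cutoff."""
    if not hits:
        return None, "no hits"
    best = max(hits, key=lambda h: h.bit_score)
    q_cov = coverage(best.query_start, best.query_end, query_length)
    s_cov = coverage(best.subject_start, best.subject_end, best.subject_length)
    if q_cov < min_coverage or s_cov < min_coverage:
        warning = (f"best hit '{best.subject_name}' is only a partial match "
                   f"(query coverage {q_cov:.2f}, subject coverage {s_cov:.2f})")
        return None, warning
    return best.subject_name, None


# Invented numbers: a 90-residue query matches only the C-terminal region
# (residues 280-360) of a 365-residue chorismate mutase.
partial = Hit("chorismate mutase", 95.0, 5, 85, 280, 360, 365)
print(best_hit_name([partial], query_length=90))
```

A coverage filter of this kind only flags suspicious transfers; it does not say what the protein is, which is the job of the family-based classification discussed later.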

Problems in Functional Assignments for “Knowns” Previous low quality annotations lead to propagation of mistakes

Functional Prediction: I. Sequence and Structure Analysis (homology-based methods)
In non-obvious cases:
• Sophisticated database searches (PSI-BLAST, HMM)
• Detailed manual analysis of sequence similarities
• Structure-guided alignments and structure analysis
Often, only general function can be predicted:
• Enzyme activity can be predicted, the substrate remains unknown (ATPases, GTPases, oxidoreductases, methyltransferases, acetyltransferases)
• Helix-turn-helix motif proteins (predicted transcriptional regulators)
• Membrane transporters

Using Sequence Analysis: Hints • Proteins (domains) with different 3D folds are not homologous (unrelated by origin). Proteins with similar 3D folds are usually (but not always) homologous • Those amino acids that are conserved in divergent proteins within a (super)family are likely to be functionally important (catalytic or binding sites, etc.) • Reaction chemistry often remains conserved even when sequence diverges almost beyond recognition
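
The hint about conserved residues can be sketched in a few lines of Python. This is a deliberately simple illustration: it reports alignment columns where every sequence has the same residue, whereas real analyses usually use substitution-aware conservation scores. The function name and the toy alignment are invented for this example.

```python
def conserved_columns(aligned_sequences, gap_chars="-."):
    """Return 0-based alignment columns where all sequences share one residue.

    Strict identity is a simplification chosen for this sketch.
    """
    if not aligned_sequences:
        return []
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("all aligned sequences must have the same length")
    columns = []
    for i in range(length):
        residues = {seq[i].upper() for seq in aligned_sequences}
        # A column counts as conserved only if it holds one residue and no gap.
        if len(residues) == 1 and residues.isdisjoint(gap_chars):
            columns.append(i)
    return columns


# Toy alignment (invented sequences, already aligned to equal length).
toy_alignment = ["MKG-DEFH", "MRGADEYH", "MKGSDEFH"]
print(conserved_columns(toy_alignment))
```

Columns that stay invariant across a divergent family are candidates for catalytic or binding residues and are a natural place to start a manual inspection.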

Using Sequence Analysis: Hints • Prediction of 3D fold (if distant homologs have known structures!) and of general biochemical function is much easier than prediction of exact biological function • Sequence analysis complements structural comparisons and can greatly benefit from them • Comparative analysis allows us to find subtle sequence similarities in proteins that would not have been noticed otherwise Credit: Dr. M. Galperin, NCBI

Functional Prediction: Role of Structural Genomics Protein Structure Initiative: Determine 3D Structures of All Proteins • Family Classification: Organize protein sequences into families, collect families without known structures • Target Selection: Select family representatives as targets • Structure Determination: X-Ray crystallography or NMR spectroscopy • Homology Modeling: Build models for other proteins by homology • Attempt functional prediction based on structure

Structural Genomics: Structure-Based Functional Predictions Methanococcus jannaschii MJ0577 (Hypothetical Protein) Contains bound ATP => ATPase or ATP-Mediated Molecular Switch Confirmed by biochemical experiments

Crystal Structure is Not a Function! Credit: Dr. M. Galperin, NCBI

Functional Prediction: II. Computational Analysis Beyond Homology
• Phylogenetic distribution (comparative genomics)
  • Wide - most likely essential
  • Narrow - probably clade-specific
  • Patchy - most intriguing
• Domain association: “Rosetta Stone”
• Genome context (gene neighborhood, operon organization)
Clues: specific to niche, pathway type
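
The three phylogenetic distribution categories above can be sketched as a small classifier over a presence/absence profile. The slides do not define the categories quantitatively, so the 0.9 cutoff for "wide", the rule that "narrow" means all occurrences fall in one clade, and every name in the code are assumptions made for illustration only.

```python
def classify_phyletic_pattern(presence, clade_of, wide_fraction=0.9):
    """Classify a presence/absence profile as wide, narrow, patchy or absent.

    presence: genome name -> True if the protein family is found there
    clade_of: genome name -> clade label
    wide_fraction: arbitrary illustrative cutoff for calling a pattern "wide"
    """
    present = [genome for genome, has_it in presence.items() if has_it]
    if not present:
        return "absent"
    if len(present) / len(presence) >= wide_fraction:
        return "wide"
    clades_with_family = {clade_of[genome] for genome in present}
    if len(clades_with_family) == 1:
        return "narrow"
    return "patchy"


# Hypothetical genomes g1..g6 grouped into three hypothetical clades.
CLADES = {"g1": "A", "g2": "A", "g3": "B", "g4": "B", "g5": "C", "g6": "C"}
everywhere = {genome: True for genome in CLADES}
one_clade = {genome: genome in ("g1", "g2") for genome in CLADES}
scattered = {genome: genome in ("g1", "g3") for genome in CLADES}

for profile in (everywhere, one_clade, scattered):
    print(classify_phyletic_pattern(profile, CLADES))
```

In practice the interesting cases are the patchy ones, where the profile has to be read together with genome context before drawing any conclusion.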

Using Genome Context for Functional Prediction SEED analysis tool (by FIG) Embden-Meyerhof and Gluconeogenesis pathway: 6-phosphofructokinase (EC 2.7.1.11)

Functional Prediction: Problem Areas • Identification of protein-coding regions • Delineation of potential function(s) for distant paralogs • Identification of domains in the absence of close homologs • Analysis of proteins with low sequence complexity

Case Study: Prediction Verified: GGDEF domain • Proteins containing this domain: Caulobacter crescentus PleD controls swarmer cell - stalk cell transition (Hecht and Newton, 1995). In Rhizobium leguminosarum, Acetobacter xylinum, required for cellulose biosynthesis (regulation) • Predicted to be involved in signal transduction because it is found in fusions with other signaling domains (receiver, etc) • In Acetobacter xylinum, cyclic di-GMP is a specific nucleotide regulator of cellulose synthase (signalling molecule). Multidomain protein with GGDEF domain was shown to have diguanylate cyclase activity (Tal et al., 1998) • Detailed sequence analysis tentatively predicts GGDEF to be a diguanylate cyclase domain (Pei and Grishin, 2001) • Complementation experiments prove diguanylate cyclase activity of GGDEF (Ausmees et al., 2001)

The Need for Classification
Problem:
• Most new protein sequences come from genome sequencing projects
• Many have unknown functions
• Large-scale functional annotation of these sequences based simply on BLAST best hit has pitfalls; results are far from perfect
• Manual annotation of individual proteins is not efficient
Solution:
• Highly curated and annotated protein classification system
• Automatic annotation of sequences based on protein families
Facilitates:
• Automatic annotation of sequences based on protein families
• Systematic correction of annotation errors
• Protein name standardization
• Functional predictions for uncharacterized proteins
This all works only if the system is optimized for annotation

Levels of Protein Classification

Protein Evolution Domain: Evolutionary/Functional/Structural Unit Domain shuffling Sequence changes With enough similarity, one can trace back to a common origin What about these?

Consequences of Domain Shuffling
CM? PDH? PDT? CM/PDH? CM/PDT?
CM = chorismate mutase; PDH = prephenate dehydrogenase; PDT = prephenate dehydratase; ACT = regulatory domain
[Figure: domain architectures of PIRSF families PIRSF006786, PIRSF001501, PIRSF001499, PIRSF005547, PIRSF001424 and PIRSF001500, built from CM (AroQ type), PDH, PDT and ACT domains]

Whole Protein = Sum of its Parts?
PIRSF006256 domain architecture: Acylphosphatase - ZnF - ZnF - YrdC - Peptidase M22
On the basis of domain composition alone, biological function was predicted to be:
• RNA-binding translation factor
• maturation protease
Actual function:
• [NiFe]-hydrogenase maturation factor, carbamoyltransferase
Whole protein functional annotation is best done using annotated whole-protein families

Practical classification of proteins: setting realistic goals
We strive to reconstruct the natural classification of proteins to the fullest possible extent
BUT domain shuffling rapidly degrades the continuity in the protein structure (faster than sequence divergence degrades similarity)
THUS the further we extend the classification, the finer is the domain structure we need to consider
SO we need to compromise between the depth of analysis and protein integrity
OR …
Credit: Dr. Y. Wolf, NCBI

Domain Classification
• Allows a hierarchy that can trace evolution to the deepest possible level, the last point of traceable homology and common origin
• Can usually annotate only general biochemical function
Whole-protein Classification
• Cannot build a hierarchy deep along the evolutionary tree because of domain shuffling
• Can usually annotate specific biological function (preferred to annotate individual proteins)
Complementary Approaches
• Can map domains onto proteins
• Can classify proteins even when domains are not defined

Levels of Protein Classification

Protein Classification Databases
• Whole protein classification: PIRSF
• Domain classification: Pfam, SMART, CDD
• Mixed: TIGRFAMs, COGs
• Based on structural fold: SCOP
InterPro: integrates various types of classification databases

InterPro
Integrated resource for protein families, domains and sites. Combines a number of databases: PROSITE, PRINTS, Pfam, SMART, ProDom, TIGRFAMs, PIRSF
[Figure: example entry SF001500, bifunctional chorismate mutase/prephenate dehydratase, with CM, ACT and PDT domains]

The Ideal System… • Comprehensive: each sequence is classified either as a member of a family or as an “orphan” sequence • Hierarchical: families are united into superfamilies on the basis of distant homology, and divided into subfamilies on the basis of close homology • Allows for simultaneous use of the whole protein and domain information (domains mapped onto proteins) • Allows for automatic classification/annotation of new sequences when these sequences are classifiable into the existing families • Expertly curated membership, family name, function, background, etc. • Evidence attribution (experimental vs predicted)

PIRSF Classification System
http://pir.georgetown.edu/
• PIRSF:
  • Reflects evolutionary relationships of full-length proteins
  • A network structure from superfamilies to subfamilies
• Definitions:
  • Homeomorphic Family: Basic Unit
  • Homologous: Common ancestry, inferred by sequence similarity
  • Homeomorphic: Full-length similarity & common domain architecture
  • Hierarchy: Flexible number of levels with varying degrees of sequence conservation
  • Network Structure: allows multiple parents
• Advantages:
  • Annotate both general biochemical and specific biological functions
  • Accurate propagation of annotation and development of standardized protein nomenclature and ontology

PIRSF Classification System A protein may be assigned to only one homeomorphic family, which may have zero or more child nodes and zero or more parent nodes. Each homeomorphic family may have as many domain superfamily parents as its members have domains.
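
The constraints stated above (one homeomorphic family per protein, any number of parent and child nodes per family) can be expressed as a small data structure. This is a minimal sketch, not the PIRSF database schema: the class names, the level labels, and the node identifiers are all hypothetical, and the sketch does not check for cycles.

```python
from dataclasses import dataclass, field


@dataclass
class PirsfNode:
    """One node in the classification network (labels chosen for this sketch)."""
    node_id: str
    level: str  # "superfamily", "homeomorphic_family" or "subfamily"
    parents: set = field(default_factory=set)
    children: set = field(default_factory=set)


class Classification:
    """Network of nodes plus the protein-to-family assignment."""

    def __init__(self):
        self.nodes = {}
        self.family_of = {}  # protein accession -> homeomorphic family id

    def add_node(self, node_id, level):
        self.nodes[node_id] = PirsfNode(node_id, level)

    def link(self, parent_id, child_id):
        """Record a parent/child edge; a node may have several parents."""
        self.nodes[parent_id].children.add(child_id)
        self.nodes[child_id].parents.add(parent_id)

    def assign(self, protein, family_id):
        """Assign a protein to exactly one homeomorphic family."""
        if self.nodes[family_id].level != "homeomorphic_family":
            raise ValueError(f"{family_id} is not a homeomorphic family")
        current = self.family_of.get(protein)
        if current is not None and current != family_id:
            raise ValueError(f"{protein} already belongs to {current}")
        self.family_of[protein] = family_id


def build_example():
    """Build a toy network with made-up identifiers."""
    c = Classification()
    for node_id, level in [("SF_A", "superfamily"), ("SF_B", "superfamily"),
                           ("FAM_1", "homeomorphic_family"),
                           ("FAM_2", "homeomorphic_family"),
                           ("SUB_1", "subfamily")]:
        c.add_node(node_id, level)
    c.link("SF_A", "FAM_1")   # a two-domain family can have
    c.link("SF_B", "FAM_1")   # two domain superfamily parents
    c.link("FAM_1", "SUB_1")
    c.assign("P1", "FAM_1")
    return c
```

Keeping parents and children as sets is what makes the structure a network rather than a strict tree: `FAM_1` in the toy example hangs under two superfamilies at once.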

Creation and Curation of PIRSFs
• Computer-Generated (Uncurated) Clusters (35,000 PIRSFs)
• Preliminary Curation (4,400 PIRSFs)
  • Membership
  • Signature Domains
• Full Curation (3,200 PIRSFs)
  • Family Name, Description, Bibliography
  • PIRSF Name Rules
[Workflow figure]
Input: UniProtKB proteins, new proteins, unassigned proteins
Automatic Procedure: automatic clustering, automatic placement -> Preliminary Homeomorphic Families, Orphans
Computer-assisted Manual Curation: map domains on families, merge/split clusters, add/remove members -> Curated Homeomorphic Families
Name, refs, description; protein name rule/site rule; create hierarchies (superfamilies/subfamilies); build and test HMMs -> Final Homeomorphic Families

PIRSF Family Report: Curated Protein Family Information Taxonomic distribution of PIRSF can be used to infer evolutionary history of the proteins in the PIRSF Phylogenetic tree and alignment view allows further sequence analysis

PIRSF Hierarchy and Network: DAG Viewer

PIRSF Family Report (II) Integrated value added information from other databases Mapping to other protein classification databases

PIRSF Protein Classification: Platform for Protein Analysis and Annotation • Improves automatic annotation quality • Serves as a protein analysis platform for a broad range of users • Matching a protein sequence to a curated protein family rather than searching against a protein database • Provides value-added information by expert curators, e.g., annotation of uncharacterized hypothetical proteins (functional predictions)

Family-Driven Protein Annotation
Objective: Optimize for protein annotation
• PIRSF Classification Name
  • Reflects the function when possible
  • Indicates the maximum specificity that still describes the entire group
  • Standardized format
  • Name tags: validated, tentative, predicted, functionally heterogeneous
• Hierarchy
  • Subfamilies increase specificity (kinase -> sugar kinase -> hexokinase)
• Name Rules
  • Define conditions under which names propagate to individual proteins
  • Enable further specificity based on taxonomy or motifs
  • Names adhere to Swiss-Prot conventions (though we may make suggestions for improvement)
• Site Rules
  • Define conditions under which features propagate to individual proteins

PIR Name Rules
• Account for functional variations within one PIRSF, including:
  • Lack of active site residues necessary for enzymatic activity
  • Certain activities relevant only to one part of the taxonomic tree
  • Evolutionarily-related proteins whose biochemical activities are known to differ
Monitor such variables to ensure accurate propagation
• Propagate other properties that describe function: EC, GO terms, misnomer info, pathway
• Name Rule types:
  • “Zero” Rule
    • Default rule (only condition is membership in the appropriate family)
    • Information is suitable for every member
  • “Higher-Order” Rule
    • Has requirements in addition to membership
    • Can have multiple rules that may or may not have mutually exclusive conditions

Example Name Rules Note the lack of a zero rule for PIRSF000881

Name Rule Propagation Pipeline
• Affiliation of sequence: homeomorphic family or subfamily (whichever PIRSF is the lowest possible node)
• Name rule exists?
  • No: nothing to propagate
  • Yes: protein fits criteria for any higher-order rule?
    • Yes: assign name from Name Rule 1 (or 2 etc.)
    • No: PIRSF has zero rule?
      • Yes: assign name from Name Rule 0
      • No: nothing to propagate
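
The decision flow above can be written as a short function. This is a minimal sketch under stated assumptions: the `NameRule` and `Pirsf` classes, the convention that a rule with no condition is the zero rule, the choice to take the first matching higher-order rule, and the example names are all hypothetical and are not taken from the PIR implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class NameRule:
    """A name plus an optional extra condition; no condition marks the zero rule."""
    name: str
    condition: Optional[Callable[[dict], bool]] = None


@dataclass
class Pirsf:
    """The lowest node (family or subfamily) a sequence is affiliated with."""
    pirsf_id: str
    rules: list = field(default_factory=list)


def propagate_name(protein, pirsf):
    """Return the name to propagate, or None when there is nothing to propagate."""
    if not pirsf.rules:                      # no name rule exists
        return None
    for rule in pirsf.rules:                 # higher-order rules are tried first
        if rule.condition is not None and rule.condition(protein):
            return rule.name
    for rule in pirsf.rules:                 # otherwise fall back to the zero rule
        if rule.condition is None:
            return rule.name
    return None                              # no zero rule: nothing to propagate


# Invented example family with one higher-order rule and one zero rule.
example_family = Pirsf(
    "PIRSF_EXAMPLE",
    rules=[
        NameRule("sugar kinase, bacterial type",
                 condition=lambda p: p.get("kingdom") == "Bacteria"),
        NameRule("sugar kinase"),
    ],
)
print(propagate_name({"kingdom": "Bacteria"}, example_family))
print(propagate_name({"kingdom": "Archaea"}, example_family))
```

A family that has only higher-order rules returns `None` for any member that matches none of them, which corresponds to the second "nothing to propagate" exit in the flow.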

Name Rule in Action at UniProt • Current: • Automatic annotations (AA) are in a separate field • AA only visible from www.ebi.uniprot.org • Future: • Automatic name annotations will become the DE line if the DE line will improve as a result • AA will be visible from all consortium-hosted web sites

PIR Site Rules
• Position-Specific Site Features:
  • active sites
  • binding sites
  • modified amino acids
• Current requirements:
  • at least one PDB structure
  • experimental data on functional sites: CATRES database (Thornton)
• Rule Definition:
  • Select template structure
  • Align PIRSF seed members with structural template
  • Edit alignment to retain conserved regions covering all site residues
  • Build Site HMM from concatenated conserved regions

Match Rule Conditions
• Only propagate site annotation if all rule conditions are met:
  • Membership Check (PIRSF HMM threshold)
    • Ensures that the annotation is appropriate
  • Conserved Region Check (site HMM threshold)
  • Residue Check (all position-specific residues in HMMAlign)
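
The three match conditions above combine with a logical AND, which the following minimal Python sketch makes explicit. It assumes the HMM scores and the residues aligned to the template positions have already been computed by other tools; the function name, the argument layout, and the example scores and positions are invented for illustration.

```python
def check_site_rule(family_score, family_threshold,
                    site_score, site_threshold,
                    observed, required):
    """Return (ok, reason) for propagating a site annotation.

    observed: template position -> residue found in the target at that position
    required: template position -> residue the rule expects there
    All inputs are assumed to come from earlier HMM search/alignment steps.
    """
    if family_score < family_threshold:
        return False, "membership check failed"
    if site_score < site_threshold:
        return False, "conserved region check failed"
    mismatches = [pos for pos, aa in sorted(required.items())
                  if observed.get(pos) != aa]
    if mismatches:
        return False, f"residue check failed at template positions {mismatches}"
    return True, "all conditions met"


# Invented scores, thresholds and positions.
required_sites = {57: "H", 102: "D", 195: "S"}
print(check_site_rule(120.0, 100.0, 35.0, 30.0,
                      {57: "H", 102: "D", 195: "S"}, required_sites))
print(check_site_rule(120.0, 100.0, 35.0, 30.0,
                      {57: "H", 102: "D", 195: "A"}, required_sites))
```

Returning a reason alongside the boolean makes it easy to see which of the three checks blocked propagation for a given sequence.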
