Functional annotation Datasources Konstantinos Mavrommatis Kmavrommatis@lbl.gov

Let’s get started… Information from databases is used to predict the function of a protein (functional annotation). • Product name • Enzyme Commission (EC) number • Domain architecture • …

But what is function? cobalamin biosynthetic enzyme, cobalt-precorrin-4 methyltransferase (CbiF) • molecular/enzymatic (methyltransferase) • Reaction (methylation) • Substrate (cobalt-precorrin-4) • Ligand (S-adenosyl-L-methionine) • metabolic (cobalamin biosynthesis) • physiological (maintenance of healthy nerve and red blood cells, through B12).

Functional annotation • Predict the biochemistry and physiology of an organism based on its genome sequence. • Explain known biochemical and physiological properties.

Homologs/Orthologs/Paralogs

Function prediction • Function transfer by homology • Homology • implies a common evolutionary origin. • does not imply retention of similarity in any property. • Homology ≠ similarity of function. Punta & Ofran. PLOS Comp Biol. 2008

Trust transfer of annotation ? Punta & Ofran. PLOS Comp Biol. 2008

Dos and Don’ts

What if nothing is similar? • Subcellular localization • Gene context • Special features • Prediction of binding residues (DISIS, bindN) [Figure: periplasm vs. cytoplasm]

Annotation should make sense [Figure: a model pathway (Substrate A, B, C, D; Enzymes 1, 2, 3) compared with the genome annotation; steps are marked with ✓ or ?.]
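The pathway check on this slide can be sketched in a few lines. This is only an illustration: the assignment of each enzyme to a substrate pair, and the assumption that Enzyme 2 is the one missing from the annotation, are hypothetical readings of the figure, and `pathway_gaps` is a name invented for this sketch.

```python
# Model pathway as (substrate, product, enzyme) steps.
# The step layout is an assumption made for illustration.
model_pathway = [
    ("Substrate A", "Substrate B", "Enzyme 1"),
    ("Substrate B", "Substrate C", "Enzyme 2"),
    ("Substrate C", "Substrate D", "Enzyme 3"),
]

# Enzymes found by the genome annotation (hypothetical example).
annotated_enzymes = {"Enzyme 1", "Enzyme 3"}


def pathway_gaps(pathway, annotated):
    """Return the pathway steps whose enzyme is absent from the annotation."""
    return [step for step in pathway if step[2] not in annotated]


gaps = pathway_gaps(model_pathway, annotated_enzymes)
for substrate, product, enzyme in gaps:
    # A gap is a hint to look again: a missed gene, a wrong product name,
    # or a non-homologous enzyme doing the same job.
    print(f"{substrate} -> {product}: no gene annotated as {enzyme}")
```

A pathway with a single missing step, flanked by annotated steps, is usually worth a second look before concluding the organism lacks the pathway.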

Databases • Databases used for the analysis of biological molecules. • Databases contain information organized in a way that allows users/researchers to retrieve and exploit it. • Why bother? • Store information. • Organize data. • Predict features (genes, functions ...). • Understand relationships (metabolic reconstruction).

Primary nucleotide databases EMBL/GenBank/DDBJ (http://www.ncbi.nlm.nih.gov/, http://www.ebi.ac.uk/embl) • Archive containing all sequences from: • genome projects • sequencing centers • individual scientists • patent offices • The sequences are exchanged between the three centers on a daily basis. • Database is doubling every 10 months. • Sequences from >140,000 different species. • 1400 new species added every month.
Year   Base pairs       Sequences
2004   44,575,745,176   40,604,319
2005   56,037,734,462   52,016,762
2006   69,019,290,705   64,893,747
2007   83,874,179,730   80,388,382
2008   99,116,431,942   98,868,465
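As a quick worked example, a doubling time can be estimated from the yearly base-pair totals listed above, assuming steady exponential growth between the first and last year. The function name is invented for this sketch. Note that these yearly totals alone imply a doubling time of roughly three and a half years, so the 10-month figure quoted on the slide presumably refers to a different period or data set.

```python
import math

# Base-pair totals for the first and last year listed on the slide.
base_pairs = {2004: 44_575_745_176, 2008: 99_116_431_942}


def doubling_time_months(n_start, n_end, months_elapsed):
    """Doubling time under the assumption of steady exponential growth."""
    return months_elapsed * math.log(2) / math.log(n_end / n_start)


months_elapsed = (2008 - 2004) * 12
t_double = doubling_time_months(base_pairs[2004], base_pairs[2008], months_elapsed)
print(f"Estimated doubling time: {t_double:.1f} months")
```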

Primary protein sequence databases • Contain coding sequences derived from the translation of nucleotide sequences • GenBank • Valid translations (CDS) from nt GenBank entries. • UniProtKB/TrEMBL (1996) • Automatic CDS translations from EMBL. • TrEMBL Release 40.3 (26-May-2009) contains 7,916,844 entries.

Errors in databases There are many errors in the primary sequence databases: • In the sequences themselves: • Sequencing errors. • Cloning vectors sequences. • In the annotations: • Inaccuracies, omissions, and even mistakes. • Inconsistencies between some fields.

Redundancy • Redundancy is a major problem. • Entries are partially or entirely duplicated (partial and complete sequence duplications): • e.g. 20% of vertebrate sequences in GenBank.
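A minimal sketch of what removing complete and partial duplicates can look like, assuming the simplest possible definition: a sequence is redundant if it is identical to, or fully contained in, a longer sequence that is already kept. The function name and the example records are hypothetical. Real non-redundant sets also handle reverse complements and near-identical sequences, which this sketch ignores.

```python
def non_redundant(records):
    """Keep only sequences not contained in an already kept, longer sequence.

    records: iterable of (identifier, sequence) pairs.
    Naive O(n^2) scan, fine for illustration but not for a real database.
    """
    kept = []
    # Longest first, so fragments are compared against full-length entries.
    for ident, seq in sorted(records, key=lambda r: len(r[1]), reverse=True):
        seq = seq.upper()
        if any(seq in kept_seq for _, kept_seq in kept):
            continue  # exact duplicate or fragment of a kept entry
        kept.append((ident, seq))
    return kept


records = [
    ("x1", "ACGTGCTATAGCCG"),
    ("x2", "acgtgctatagccg"),  # same sequence, different case
    ("x3", "TATAGCC"),         # fragment of x1
    ("x4", "TTGACA"),          # unrelated
]
kept_ids = [ident for ident, _ in non_redundant(records)]
print(kept_ids)
```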

NCBI Derivative Sequence Data [Figure labels: Labs, Genome Assembly, GenBank, Curators, Algorithms, RefSeq, UniGene]

RefSeq • Curated transcripts and proteins. • reviewed by NCBI staff. • Model transcripts and proteins. • generated by computer algorithms. • Assembled Genomic Regions (contigs). • Chromosome records.

Secondary protein databases • UniProtKB/Swiss-Prot (1986) (http://ca.expasy.org/spro) • a curated protein sequence database • high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.) • a minimal level of redundancy • high level of integration with other databases

Classification databases • Groups (families/clusters) of proteins based on… • Overall sequence similarity. • Local sequence similarity. • Presence / absence of specific features (active site, signal peptides… ). • Structural similarity. • ... • These groups contain proteins with similar properties. • Specific function, enzymatic activity. • General function. • Evolutionary relationship. • …

Overall sequence similarity

Clusters of Orthologous Groups (COGs) http://www.ncbi.nlm.nih.gov/COG/ • COGs were delineated by comparing protein sequences encoded in 43 complete genomes representing 30 major phylogenetic lineages. • Each cluster has representatives of at least 3 lineages. • A function (specific or broad) has been assigned to each COG.

How it works [Figure: BLAST best hits between proteins; reciprocal (bidirectional) best hits are distinguished from unidirectional best hits, and two clusters are labeled COG1 and COG2.]
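The difference between a bidirectional and a unidirectional best hit can be sketched as follows. This covers only the pairwise step between two genomes; it is not the full COG construction, which also requires consistent hits across at least three lineages. The function names and the bit scores are invented for the example.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    scores: dict of (query, subject) -> similarity score (e.g. BLAST bit score).
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical scores for genome A searched against genome B, and back.
a_vs_b = {("a1", "b1"): 250.0, ("a1", "b2"): 90.0,
          ("a2", "b2"): 180.0, ("a3", "b2"): 120.0}
b_vs_a = {("b1", "a1"): 245.0, ("b2", "a2"): 175.0, ("b2", "a3"): 110.0}

pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

Here a3 has b2 as its best hit, but b2 prefers a2, so the a3 to b2 hit is unidirectional and is not reported as a pair.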

Profiles & Pfam • One method for classifying proteins into groups exploits regional similarities, which contain valuable information (domains/profiles). • These domains/profiles can be used to detect distant relationships, where only a few residues are conserved.

Region similarity

Pfam http://pfam.sanger.ac.uk • HMMs built from protein alignments, either local (covering a domain) or global (covering the whole protein).

TIGRfam • Full length alignments. • Domain alignments. • Equivalogs: families of proteins with specific function. • Superfamilies: families of homologous genes. • HMMs http://www.tigr.org/TIGRFAMs/

KEGG orthology

Composite pattern databases • To simplify sequence analysis, the family databases are being integrated to create a unified annotation resource: InterPro • Release 30.0 (Dec10) contains 21178 entries • Central annotation resource, with pointers to its satellite dbs http://www.ebi.ac.uk/interpro/

* It is up to the user to decide if the annotation is correct *

ENZYME http://ca.expasy.org/enzyme/

KEGG http://www.kegg.com • Contains information about biochemical pathways and protein interactions.

Functional annotation (IMG) http://imgweb.jgi-psf.org/img_er_v260/doc/img_er_ann.pdf
[Flowchart: product name assignment as a chain of YES/NO decisions. Decision labels, in order of appearance: IMG term, TIGRfam, COG, COG + Pfam, Pfam, BLAST. A YES yields a product name (based on translation tables); the final NO yields "hypothetical".]
Search settings shown in the flowchart: • COGs: PSI BLAST, 1e-2 • KO terms: BLASTp, <1e-10, >45% id, >70% length • IMG: BLASTp, evalue<10, 20 best hits • Pfam, TIGRfam: hmmsearch (BLAST preprocessing)
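The BLASTp cutoffs quoted in the flowchart (<1e-10, >45% id, >70% length) can be applied as a simple filter. This is a sketch under stated assumptions: ">70% length" is read here as the alignment covering more than 70% of the query, which the slide does not specify, and the `Hit` record and its field names are hypothetical rather than taken from IMG.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One BLASTp hit (hypothetical record layout)."""
    subject: str
    evalue: float
    percent_identity: float
    aligned_length: int
    query_length: int


def passes_cutoffs(hit, max_evalue=1e-10, min_identity=45.0, min_coverage=0.70):
    """Apply the three cutoffs; coverage is measured on the query (assumption)."""
    coverage = hit.aligned_length / hit.query_length
    return (hit.evalue < max_evalue
            and hit.percent_identity > min_identity
            and coverage > min_coverage)


hits = [
    Hit("s1", 1e-50, 62.0, 280, 300),  # passes all three cutoffs
    Hit("s2", 1e-12, 40.0, 290, 300),  # identity too low
    Hit("s3", 1e-30, 55.0, 150, 300),  # covers only half of the query
    Hit("s4", 1e-5, 80.0, 295, 300),   # e-value too high
]
accepted = [hit.subject for hit in hits if passes_cutoffs(hit)]
print(accepted)
```

Requiring coverage as well as identity guards against transferring a product name on the strength of a single shared domain.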

Sequencing projects & Metadata http://www.genomesonline.org

Literature search • PubMed http://www.ncbi.nlm.nih.gov/Pubmed

Specialized databases • There is a large number of databases devoted to specific organisms. • For some model organisms there are often concurrent systems. • These databases are typically associated with sequencing or mapping projects.

Signal transduction, regulation, protein-protein interactions • TRANSFAC (Transcription Factor database) • BRITE (Biomolecular Relations in Information Transmission and Expression database) • DIP (Database of Interacting Proteins) • BIND (Biomolecular Interaction Network database) • BioCarta
Biochemical pathways • KLOTHO (Biochemical Compounds Declarative database) • BRENDA (enzyme information system) • LIGAND (similar to Enzyme but with more information for substrates)
Gene order and co-occurrence • STRING
Other specialized databases
• Gene expression • GXD (Mouse Gene Expression Database) • The Stanford Microarray Database
• Mapping • GDB (Genome Data Base) • EMG (Encyclopedia of Mouse Genome) • MGD (Mouse Genome Database) • INE (Integrated Rice Genome Explorer)
• Protein quantification • SWISS-2DPAGE • PDD (Protein Disease Database) • Sub2D (B. subtilis 2D Protein Index)
• 3D structures • PDB (Protein Data Bank) • MMDB (Molecular Modelling Data Base) • NRL_3D (Non-Redundant Library of 3D Structures) • SCOP (Structural Classification of Proteins)
• Polymorphism • ALFRED (Allele Frequency Database)
• Molecular interactions • DIP (Database of Interacting Proteins) • BIND (Biomolecular Interaction Network Database)

List of databases http://www.oxfordjournals.org/nar/database/c

Databanks interconnection [Figure: network of cross-references among databanks: Blocks, MIMMAP, PROSITEDOC, REBASE, PDBFINDER, ALI, OMIM, ProDom, PROSITE, SWISSNEW, ENZYME, DSSP, SWISSDOM, HSSP, FSSP, GenBank, MOLPROBE, PDB, SWISS-PROT, NRL_3D, EPD, ECDC, YPDREF, PMD, EMBL, EMNEW, YPD, TFSITE, TrEMBLNEW, ProtFam, FlyGene, TrEMBL, PIR, TFACTOR] • Not all databases are updated regularly. • Changes of annotation in one database are not reflected in others.

Summary • Gene annotation should make sense in the context of the organism. • We have main archives (GenBank), curated databases (RefSeq, Swiss-Prot), protein classification databases (COG, Pfam), and many, many more… • They help predict the function, or the network of functions. • Systems are required that integrate the information from several databases, visualize it, and allow the data to be handled in an intuitive way. QUESTIONS?