The Gene Ontology Annotation (GOA) Database and enhancement of GO annotations through InterPro2GO Nicky Mulder mulder@ebi.ac.uk
Contents • Introduction to GOA • Manual GOA annotation • Electronic annotation: • InterPro2GO • GOA data flow • Uses of GOA • Future plans
What is GO annotation? • An annotation is a statement that a gene product • has a particular molecular function • is involved in a particular biological process • is located within a certain cellular component • …as determined by a particular method • …as described in a particular reference. [Diagram labels: GO Term ID, Evidence Code, Reference]
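The definition above can be sketched as a small record type. This is only an illustration: the class name, field names, accession and reference are hypothetical placeholders and do not come from the GOA database schema.

```python
from dataclasses import dataclass

# The three kinds of GO statement named on the slide.
ASPECTS = {"molecular_function", "biological_process", "cellular_component"}


@dataclass(frozen=True)
class Annotation:
    gene_product: str   # e.g. a protein accession (placeholder below)
    go_id: str          # GO term ID
    aspect: str         # one of ASPECTS
    evidence_code: str  # the method by which the association was determined
    reference: str      # where the association was described

    def __post_init__(self):
        # Every annotation needs a known aspect, an evidence code and a reference.
        if self.aspect not in ASPECTS:
            raise ValueError(f"unknown aspect: {self.aspect}")
        if not (self.evidence_code and self.reference):
            raise ValueError("evidence code and reference are required")


example = Annotation(
    gene_product="P00000",        # placeholder accession, not a real entry
    go_id="GO:0004022",           # alcohol dehydrogenase activity
    aspect="molecular_function",
    evidence_code="IDA",          # inferred from direct assay
    reference="PMID:0000000",     # placeholder reference
)
```

Rejecting records without an evidence code or reference mirrors the requirement that every annotation be attributed and supported by evidence.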
Gene Ontology Annotation (GOA) Database • GOA’s priority is to annotate the human, mouse and rat proteomes • Largest open-source contributor of annotations to GO • Provides 10 million annotations for more than 111,000 species • Shares and integrates GO annotations
How do we annotate GO terms? • Manual annotation • Electronic annotation • All annotations must: • be attributed to a source • indicate what evidence was found to support the GO term-gene/protein association
Manual annotation • High quality • Specific gene or gene product associations made using: • Peer reviewed papers • Evidence codes • BUT: • Time-consuming • Requires trained biologists
Manual GO annotation [workflow diagram] Read papers > Find GO term > Annotate to protein (PubMed ID, evidence code) > Oracle RDBMS > GOA association file > GO and EBI ftp sites
How successful is manual-GOA? 111740 taxa July 2006
Electronic Annotation • Large-scale assignment of GO terms to UniProtKB entries using existing information within database entries and manual mappings • Information used from UniProt entries: InterPro, Keyword, HAMAP, EC • Curated or electronic rule-based mappings produce high-quality electronic protein-to-GO associations • Curated mapping, e.g. EC:1.1.1.1 > GO:alcohol dehydrogenase activity ; GO:0004022 • Annotations get the IEA evidence code
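A minimal sketch of how a curated mapping could be turned into electronic annotations, assuming the mapping is held as a plain dictionary. The function name, the protein accessions and the second EC number are hypothetical; only the EC:1.1.1.1 to GO:0004022 pair comes from the slide.

```python
# Illustrative mapping table; the single entry is the example from the slide.
EC2GO = {"EC:1.1.1.1": [("GO:0004022", "alcohol dehydrogenase activity")]}


def electronic_annotations(entries, mapping):
    """Yield (accession, go_id, evidence_code) for every mapped cross-reference."""
    for accession, ec_numbers in entries.items():
        for ec in ec_numbers:
            # Unmapped EC numbers simply produce no annotation.
            for go_id, _name in mapping.get(ec, []):
                yield (accession, go_id, "IEA")


# Placeholder entries: accession -> EC numbers found in the database entry.
entries = {"P00001": ["EC:1.1.1.1"], "P00002": ["EC:9.9.9.9"]}
result = list(electronic_annotations(entries, EC2GO))
```

The same shape would apply to the other sources listed (InterPro, keywords, HAMAP), each with its own mapping table.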
Mappings of external concepts to GO http://www.geneontology.org/GO.indices.shtml
InterPro2GO mapping • InterPro is a resource that integrates protein signatures databases, e.g. Pfam, Prints, Prosite, ProDom, SMART, TIGRFAMs etc. • It provides a means of classifying proteins into families and identifying domains. • Each InterPro entry groups proteins belonging to the same family and potentially having the same function
InterPro2GO mapping • Done manually, but using tools • Look at InterPro and protein annotation • For all Swiss-Prot proteins that are true matches to the entry: • Get stats on DE lines, keywords, comments • Check how conserved the common annotation is • Find the appropriate GO term at the most specific level that applies to all proteins (not necessarily domains)
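The "get stats, check conservation" step can be sketched as follows. This is an assumption about what such a check might look like, not the curators' actual tool: the function names, the threshold parameter and the sample data are all hypothetical.

```python
from collections import Counter


def keyword_conservation(proteins):
    """Return {keyword: fraction of matching proteins that carry it}."""
    # set() ensures a keyword repeated within one protein is counted once.
    counts = Counter(kw for kws in proteins.values() for kw in set(kws))
    total = len(proteins)
    return {kw: n / total for kw, n in counts.items()}


def conserved_keywords(proteins, threshold=1.0):
    """Keywords present in at least `threshold` of the proteins, sorted by name."""
    fractions = keyword_conservation(proteins)
    return sorted(kw for kw, frac in fractions.items() if frac >= threshold)


# Placeholder data: proteins matching one InterPro entry and their keywords.
matches = {
    "P1": ["Oxidoreductase", "Zinc"],
    "P2": ["Oxidoreductase", "NAD"],
    "P3": ["Oxidoreductase", "Zinc"],
}
shared_by_all = conserved_keywords(matches)
shared_by_most = conserved_keywords(matches, threshold=0.6)
```

Only annotation shared by all matching proteins (threshold 1.0) would support a GO term for the whole entry; a curator still chooses the term by hand.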
Tools used – "SQUID" Statistics options: keyword, description, gene name, organism, comments, etc.
InterPro2GO sanity checks • Run weekly • Reports: • Obsolete GO terms • Obsolete (deleted) IPRs • Secondary IPRs
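The three report headings above could be produced by a check along these lines. It is a sketch only: the function name, the data layout and all identifiers are hypothetical placeholders, and the real weekly checks may work quite differently.

```python
def sanity_report(mapping, obsolete_go, deleted_ipr, secondary_ipr):
    """Group problem mappings under the three report headings."""
    report = {"obsolete_go": [], "deleted_ipr": [], "secondary_ipr": []}
    for ipr, go_ids in sorted(mapping.items()):
        if ipr in deleted_ipr:
            report["deleted_ipr"].append(ipr)
        if ipr in secondary_ipr:
            report["secondary_ipr"].append(ipr)
        for go_id in go_ids:
            if go_id in obsolete_go:
                report["obsolete_go"].append((ipr, go_id))
    return report


# Placeholder identifiers, not real InterPro entries or GO terms.
mapping = {
    "IPR_A": ["GO:9999991"],
    "IPR_B": ["GO:9999992"],
    "IPR_C": ["GO:9999993"],
}
report = sanity_report(
    mapping,
    obsolete_go={"GO:9999992"},
    deleted_ipr={"IPR_C"},
    secondary_ipr={"IPR_A"},
)
```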
Quality of GO mapping
• BioCreAtIvE test set: 635 GO annotations through InterPro2GO
  - Exact term: 151 (24%)
  - Same lineage < granularity: 273 (43%)
  - Same lineage > granularity: 24 (4%)
  - New lineage: 187 (29%)
  - Minimal correct: 424 (67%)
  - Potentially incorrect: 211 (33%)
  - Precision: 67-100%
• Manually checked 44 proteins, 107 predictions:
  - 97 correct (90%): 40 exact, 57 same lineage
  - 10 new lineage (unknown)
  - 0 incorrect
Camon et al., 2005, BMC Bioinformatics
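The percentages on this slide follow from the four category counts. The short calculation below reproduces them; the dictionary keys are informal labels, and reading the 100% upper bound as "every potentially incorrect annotation turns out to be correct" is an assumption, not something the slide states.

```python
# Category counts from the BioCreAtIvE test set on the slide.
counts = {
    "exact": 151,
    "same_lineage_lt": 273,  # "Same lineage < granularity"
    "same_lineage_gt": 24,   # "Same lineage > granularity"
    "new_lineage": 187,
}
total = sum(counts.values())

minimal_correct = counts["exact"] + counts["same_lineage_lt"]
potentially_incorrect = counts["same_lineage_gt"] + counts["new_lineage"]


def percent(n):
    """Share of the whole test set, rounded to a whole percent."""
    return round(100 * n / total)


lower_bound = percent(minimal_correct)
# Assumed reading of the upper bound: all doubtful cases are in fact correct.
upper_bound = percent(minimal_correct + potentially_incorrect)
```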
How successful is IEA-GOA in general? • Provides large coverage • High Quality • However these annotations often use high-level GO terms and provide little detail. Manual ones: 336237 70728 Jun 2006
GOA data flow Gene association files
Gene Association file format http://www.geneontology.org/GO.annotation.shtml
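A gene association file can be read with a few lines of code. The sketch below assumes the 15-column, tab-delimited layout of the GO gene association format, with comment lines starting with "!"; the column names are informal labels, and the sample row uses placeholder values rather than a real GOA record.

```python
# Informal names for the 15 tab-separated columns (assumed GAF 1.0 layout).
GAF_COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "db_reference", "evidence_code", "with_or_from", "aspect",
    "db_object_name", "db_object_synonym", "db_object_type",
    "taxon", "date", "assigned_by",
]


def parse_gaf(lines):
    """Yield one dict per annotation row, skipping comments and blank lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("!"):
            continue
        fields = line.split("\t")
        if len(fields) < len(GAF_COLUMNS):
            raise ValueError(f"expected {len(GAF_COLUMNS)} columns, got {len(fields)}")
        # zip stops at 15 names, so any extra trailing columns are ignored.
        yield dict(zip(GAF_COLUMNS, fields))


# Placeholder row: the accession, symbol and reference are not real records.
sample_row = "\t".join([
    "UniProt", "P00000", "EXMPL", "", "GO:0004022", "PMID:0000000", "IDA",
    "", "F", "Example protein", "", "protein", "taxon:9606", "20060701", "UniProt",
])
records = list(parse_gaf(["!comment line", sample_row + "\n"]))
```

Empty columns (qualifier, with/from, synonym) are kept as empty strings so the column positions stay aligned.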
Output from the GOA database New Non-Redundant: based on IPI GOA Cow Redundant GA slim for UniProt + GO slims Data also available in SRS, UniProt, QuickGO, MODs, Ensembl etc.
GA Files for Non-redundant species • Non-redundant complete protein set for each proteome is identified (>25% GO coverage) • Includes UniProt, IPI and MOD-specific IDs, e.g. mouse (MGI), rat (RGD), zebrafish (ZFIN) etc. • Xref files available with identifiers from: UniProt, IPI, RefSeq, Ensembl, UniGene etc. ftp://ftp.ebi.ac.uk/pub/databases/GO/goa ftp://ftp.ebi.ac.uk/pub/databases/integr8
Uses of GOA data • Access protein functional information • Look at relationships between proteins, e.g. IntAct • Connect biological information to gene expression data • Determine functional composition of a proteome –using GO slim
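As a rough illustration of the last point, the functional composition of a proteome can be summarised by counting proteins per GO slim term. The sketch assumes a precomputed lookup from specific GO terms to slim terms; the lookup table, function name and protein identifiers are hypothetical, and a real analysis would derive the lookup from the ontology structure.

```python
from collections import Counter

# Illustrative lookup from a specific GO term to a slim term.
SLIM_MAP = {
    "GO:0004022": "GO:0003824",  # alcohol dehydrogenase activity -> catalytic activity
    "GO:0005634": "GO:0005634",  # nucleus, kept as its own slim term here
}


def slim_composition(annotations, slim_map):
    """Count how many distinct proteins fall under each slim term."""
    slim_terms_by_protein = {}
    for protein, go_id in annotations:
        slim = slim_map.get(go_id)
        if slim is not None:
            # A set keeps each protein from being counted twice per slim term.
            slim_terms_by_protein.setdefault(protein, set()).add(slim)
    return Counter(s for terms in slim_terms_by_protein.values() for s in terms)


# Placeholder (protein, GO term) pairs.
annotations = [
    ("P1", "GO:0004022"), ("P1", "GO:0005634"),
    ("P2", "GO:0004022"), ("P2", "GO:0004022"),
    ("P3", "GO:0005634"),
]
composition = slim_composition(annotations, SLIM_MAP)
```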
Uses of GOA Find functional information on proteins http://www.ebi.ac.uk/ego
Uses of GOA Find functional information on interacting proteins (IntAct) http://www.ebi.ac.uk/intact
Uses of GOA Overview proteome with GO Slim http://www.ebi.ac.uk/integr8
Uses of GOA Analysis of high-throughput data according to GO • Microarray data analysis: GO classification • Proteomics data analysis: GO classification • Larkin JE et al, Physiol Genomics, 2004 • Kislinger T et al, Mol Cell Proteomics, 2003 • Cunliffe HE et al, Cancer Res, 2003
Future plans • Continue deep level annotation of human, mouse and rat • Manually annotate splice variants • Outreach and inclusion of new datasets e.g. grape • New electronic mappings, e.g. unipathway2go • Ortholog prediction for electronic GO annotation • Develop tools for annotation training
Acknowledgements • Rolf Apweiler: Head of sequence database group • Evelyn Camon: GOA Coordinator • Daniel Barrell: GOA Programmer • Emily Dimmer: GOA Curator • Rachael Huntley: GOA Curator • David Binns & John Maslen: QuickGO, GOA tools • All EBI UniProtKB Curators, HAMAP (SIB), IntAct, GO Editorial Office @ EBI • All GO Consortium & associate members