Ontologies, data standards and controlled vocabularies

fayre

Presentation Transcript


  1. Ontologies, data standards and controlled vocabularies

  2. Why use standards and CVs? • Very important in high-throughput biology, to sort through the vast amounts of data • To use the same data labels universally • To enable quick retrieval of data • To enable easy comparison of data • To remove ambiguities

  3. What’s in a name? • What is a cell?

  4. What’s in a name? • What is a cell? OR

  5. What’s in a name? • What is a cell? OR

  6. What’s in a name? • What is a cell?

  7. Ambiguities in naming • Different names can be used to describe the same concept, e.g.: • Glucose synthesis • Glucose biosynthesis • Glucose formation • Glucose anabolism • Gluconeogenesis • All refer to the process of making glucose • This makes it difficult to compare the information • Solution: use ontologies and data standards
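A minimal sketch of how a controlled vocabulary removes this kind of ambiguity: every free-text label is resolved to a single term identifier before data sets are compared. The identifier `EX:0000001` and the function names are placeholders invented for illustration, not real accessions or an existing API.

```python
# Hypothetical vocabulary entry: "EX:0000001" is a placeholder ID,
# not a real ontology accession. The labels come from the slide.
SYNONYMS = {
    "EX:0000001": {
        "glucose synthesis",
        "glucose biosynthesis",
        "glucose formation",
        "glucose anabolism",
        "gluconeogenesis",
    },
}


def build_lookup(synonyms):
    """Invert {term_id: labels} into {lower-cased label: term_id}."""
    lookup = {}
    for term_id, names in synonyms.items():
        for name in names:
            lookup[name.lower()] = term_id
    return lookup


LOOKUP = build_lookup(SYNONYMS)


def normalize(label):
    """Return the term ID for a free-text label, or None if it is unknown."""
    return LOOKUP.get(label.strip().lower())
```

Two records labelled "Glucose anabolism" and "gluconeogenesis" now resolve to the same identifier, so they can be compared directly.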

  8. Ontologies • An ontology is a formal specification of terms and the relationships between them – widely used in biology and bioinformatics (e.g. taxonomy) • The relationships are important and are represented as graphs • Ontology terms should have definitions • Ontologies are machine-readable • They are needed for ordering and comparing large data sets

  9. Gene Ontology (GO) • http://www.geneontology.org • Many annotation systems are organism-specific or use different levels of granularity • GO introduced a standard vocabulary, first used for mouse, fly and yeast, but now generic • Three ontologies: molecular function, biological process and cellular component

  10. GO Ontologies • Molecular function: tasks performed by a gene product – e.g. G-protein coupled receptor • Biological process: broad biological goals accomplished by one or more gene products – e.g. G-protein signaling pathway • Cellular component: part(s) of a cell of which a gene product is a component; includes the extracellular environment of cells – e.g. nucleus, membrane, etc.

  11. GO hierarchy Relationships: “is-a” “part of”
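The hierarchy can be thought of as a directed graph in which a term may have several parents, each reached through a typed relationship. The sketch below assumes a small toy graph whose term names and edges are simplified for illustration and do not reproduce the actual GO structure. It collects every ancestor of a term by following both "is-a" and "part of" edges.

```python
from collections import deque

# Toy graph: child -> list of (relationship, parent).
# Names and edges are illustrative only, not the real GO hierarchy.
EDGES = {
    "nuclear membrane": [("is_a", "membrane"), ("part_of", "nucleus")],
    "nucleus": [("is_a", "organelle")],
    "organelle": [("is_a", "cellular component")],
    "membrane": [("is_a", "cellular component")],
}


def ancestors(term, edges):
    """Return all terms reachable from `term` via is_a or part_of edges."""
    seen = set()
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for _relationship, parent in edges.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen
```

Because a term can have more than one parent, the traversal tracks visited terms so that shared ancestors are reported once.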

  12. How do gene products get GO terms? • Electronic annotation: • Through mappings to other biological entities and then automatic inference to proteins • Manual annotation: • Model organism databases • Gene Ontology Annotation (GOA) project • Evidence codes –attached to all GO annotations to show the source

  13. Evidence Codes

  14. Electronic annotation: GO mappings

  15. Electronic annotation: GO mappings • Fatty acid biosynthesis (SwissProt keyword) → GO:fatty acid biosynthesis (GO:0006633) • EC:6.4.1.2 (EC number) → GO:acetyl-CoA carboxylase activity (GO:0003989) • IPR000438: Acetyl-CoA carboxylase carboxyl transferase beta subunit (InterPro entry) → GO:acetyl-CoA carboxylase activity (GO:0003989) • MF_00527: Putative 3-methyladenine DNA glycosylase (HAMAP) → GO:DNA repair (GO:0006281) • Camon et al. BMC Bioinformatics. 2005; 6 Suppl 1:S17
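As a rough sketch of how such mappings drive electronic annotation, the table on this slide can be held as a lookup from an external cross-reference to a GO identifier. The key format, the function name and the unmapped `PF00000` entry are assumptions made for illustration. Inferred terms are tagged with the IEA (Inferred from Electronic Annotation) evidence code.

```python
# Mappings taken from the slide; the (database, identifier) key format
# is an assumption made for this sketch.
EXTERNAL_TO_GO = {
    ("SwissProt keyword", "Fatty acid biosynthesis"): "GO:0006633",
    ("EC", "6.4.1.2"): "GO:0003989",
    ("InterPro", "IPR000438"): "GO:0003989",
    ("HAMAP", "MF_00527"): "GO:0006281",
}


def infer_go_terms(cross_refs):
    """Map a protein's cross-references to (GO ID, evidence code) pairs.

    Cross-references without a mapping are skipped, and duplicate GO IDs
    are collapsed. Every inferred term carries the IEA evidence code.
    """
    terms = set()
    for ref in cross_refs:
        go_id = EXTERNAL_TO_GO.get(ref)
        if go_id is not None:
            terms.add(go_id)
    return [(go_id, "IEA") for go_id in sorted(terms)]
```

A protein carrying both the EC number and the InterPro entry receives GO:0003989 only once, since both sources map to the same term.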

  16. UniProt entry

  17. Automatic transfer of annotations to orthologs • Ensembl GO term projection via gene homology (COMPARA) • Homologies between different species are calculated • GO terms are projected from MANUAL annotation only (IDA, IEP, IGI, IMP, IPI) • Only one-to-one and apparent one-to-one orthologies are used • Figure (species shown): Chicken, Rat, Cow, Dog, Mouse, Anopheles, Drosophila • http://www.ensembl.org/info/data/compara
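The two filters on this slide (manual evidence codes only, one-to-one orthologies only) can be sketched as follows. This is not Ensembl's implementation: the tuple layouts, the orthology-type labels and the gene names in the usage note are hypothetical.

```python
# Evidence codes listed on the slide as acceptable sources for projection.
MANUAL_CODES = {"IDA", "IEP", "IGI", "IMP", "IPI"}
# Hypothetical labels for the orthology types the slide allows.
ALLOWED_ORTHOLOGY = {"one-to-one", "apparent one-to-one"}


def project_annotations(annotations, orthologs):
    """Project GO terms from annotated genes onto their orthologs.

    annotations: iterable of (gene, go_id, evidence_code)
    orthologs:   iterable of (source_gene, target_gene, orthology_type)
    Returns a set of (target_gene, go_id) pairs.
    """
    manual_terms = {}
    for gene, go_id, evidence in annotations:
        if evidence in MANUAL_CODES:
            manual_terms.setdefault(gene, set()).add(go_id)

    projected = set()
    for source, target, kind in orthologs:
        if kind in ALLOWED_ORTHOLOGY:
            for go_id in manual_terms.get(source, ()):
                projected.add((target, go_id))
    return projected
```

For example, an IDA annotation on a hypothetical `mouseGeneA` with a one-to-one ortholog `ratGeneA` is projected, while an IEA annotation on the same gene, or any annotation across a one-to-many orthology, is not.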

  18. Manual annotation: GOA Project • Largest open-source contributor of annotations to GO • Member of the GO Consortium since 2001 • Provides annotation for more than 130,000 species • GOA’s priority is to annotate the human proteome • GOA is responsible for human, chicken, bovine and many other annotations for the GO Consortium • Annotation is done through reading of the literature

  19. Reference Genomes • Comprehensive annotation of a set of disease-related proteins in human • Generate a reliable set of GO annotations for the 12 selected genomes • Empowers comparative methods used in first-pass annotation of other proteomes • The 12 genomes: Arabidopsis thaliana, Caenorhabditis elegans, Danio rerio (zebrafish), Dictyostelium discoideum, Drosophila melanogaster, Escherichia coli, Homo sapiens, Saccharomyces cerevisiae, Mus musculus, Schizosaccharomyces pombe, Gallus gallus, Rattus norvegicus

  20. Accessing GO data (1) http://amigo.geneontology.org/cgi-bin/amigo/go.cgi

  21. Accessing GO data (2) • QuickGO browser: http://www.ebi.ac.uk/quickgo • Example: Human Insulin Receptor (P06213)

  22. Accessing GO data (3) Gene Association Files http://www.geneontology.org/GO.current.annotations.shtm

  23. Accessing GO data (3) Gene Association File example
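Gene Association Files are tab-delimited text in which lines beginning with "!" are comments and each remaining line is one annotation. The sketch below reads the first 15 columns into dictionaries. The snake_case column names are this sketch's own labels for the standard GAF fields, and the sample line in the usage note is made up rather than a real annotation.

```python
# Labels for the first 15 GAF columns, in file order (names are this sketch's own).
GAF_COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "db_reference", "evidence_code", "with_or_from", "aspect",
    "db_object_name", "db_object_synonym", "db_object_type",
    "taxon", "date", "assigned_by",
]


def parse_gaf(lines):
    """Parse GAF lines into a list of dicts, skipping comments and blanks."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("!"):
            continue  # "!" marks header and comment lines
        fields = line.split("\t")
        # zip stops at the shorter sequence, so extra trailing columns are ignored.
        records.append(dict(zip(GAF_COLUMNS, fields)))
    return records


# Made-up example line: the accession, symbol and reference are placeholders.
SAMPLE = [
    "!gaf-version: 2.0\n",
    "\t".join([
        "UniProtKB", "P00000", "EXMPL", "", "GO:0003989", "PMID:0000000",
        "IDA", "", "F", "Example protein", "", "protein",
        "taxon:9606", "20080101", "UniProt",
    ]) + "\n",
]
```

Calling `parse_gaf(SAMPLE)` yields one record whose `go_id` is `GO:0003989` and whose `evidence_code` is `IDA`. Filtering on `evidence_code` is then a one-line list comprehension.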

  24. Downloading GOA data ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/ http://www.ebi.ac.uk/GOA/downloads.html

  25. Uses of GO 1 Functional annotation of proteins

  26. Uses of GO 2 Find functional information on interacting proteins (IntAct)

  27. Uses of GO 3 • Analysis of high-throughput data • Microarray data analysis (GO classification) • Proteomics data analysis (GO classification) • Larkin JE et al, Physiol Genomics, 2004 • Cunliffe HE et al, Cancer Res, 2003

  28. Other Ontologies: Open Biomedical Ontologies http://obo.sourceforge.net • Central location for accessing well-structured controlled vocabularies and ontologies for use in the biological and medical sciences. • Provides a simple format for ontologies that can encode terms, relationships between terms and definitions of terms, including those taken from external ontologies.

  29. Scope of Open Biomedical Ontologies • Anatomy • Animal natural history and life history • Chemical • Development • Ethology • Evidence codes • Experimental conditions • Genomic and proteomic • Metabolomics • OBO relationship types • Phenotype • Taxonomic classification

  30. Ontology Lookup Service (OLS) • Single point of query for currently 47 ontologies. • Ontologies are updated daily from CVS repositories, including the OBO CVS repository and the PRIDE CVS repository. • A tool that offers interactive and programmatic interfaces for queries on term names, synonyms, relationships, annotations and database cross-references. • Originally developed for using ontologies in PRIDE.

  31. The issue faced • These relationships have consequences when querying a database annotated using the ontology. • What happens when I ask for PRIDE experiments describing the proteome of brain tissue?

  32. Using Ontologies in PRIDE For an experiment you want to define: • Species: Newt / NCBI Taxonomy ID • Tissue / organ / cell type: BRENDA Tissue ontology, Cell Type ontology; • Sub-cellular component: Gene Ontology: GO; • Disease: Human Disease: DOID; • Genotype: GO; • Sample Processing: PSI Ontology; • Mass Spectrometry: PSI-MS Ontology; • Protein Modifications: PSI-MOD Ontology

  33. OLS usage examples • http://www.ebi.ac.uk/ontology-lookup/ • What is the accession for “mitochondrion” in GO? In MeSH? • search by term name in a specific ontology or across all • I’m looking for a term to annotate my protocol step but I’m not sure what term to use. • browse an ontology • I’m looking for all the experiments done on liver tissue. • get all child terms of liver and query on those as well • My data set was annotated with GO version 123 but that was a long time ago. • get updated term names for the identifiers you have and see if any have been made obsolete
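The liver example amounts to query expansion: collect the term and all of its descendants, then match experiments against the whole set. The sketch below works on an in-memory toy ontology and does not call the OLS service. The child terms, experiment IDs and function names are hypothetical.

```python
# Toy parent -> children map; the child terms are illustrative only.
CHILDREN = {
    "liver": ["hepatocyte", "liver lobe"],
    "liver lobe": ["left liver lobe"],
}

# Hypothetical experiments, each annotated with a single tissue term.
EXPERIMENTS = {
    "exp1": "liver",
    "exp2": "left liver lobe",
    "exp3": "brain",
    "exp4": "hepatocyte",
}


def descendants(term, children):
    """Return every term below `term`, at any depth."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for child in children.get(current, []):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def find_experiments(term, children, experiments):
    """Return sorted IDs of experiments annotated with `term` or a descendant."""
    wanted = {term} | descendants(term, children)
    return sorted(exp for exp, tissue in experiments.items() if tissue in wanted)
```

An exact-match query for "liver" would return only `exp1`. With expansion, the experiments annotated with the more specific child terms are found as well.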

  34. Standards for data exchange • Systems Biology Markup Language (SBML) –computer-readable format for representing models of networks • Biological Pathways Exchange (BioPAX) –format for representing pathways • Proteomics Standards Initiative (PSI, MIAPE) • Microarray standards –MIAME and MAGE

  35. MIAPE/MIAME principles • Enough information to: • Remove ambiguity in experiment • Allow easy interpretation of results • Allow experiment to be repeated • Enable comparison across similar experiments • Use controlled vocabularies

  36. Using ontologies and standards • So much data in different places –need to organize and share it • Used for data retrieval and comparison –easier to query • Used for data integration and exchange –standard representation • Used for evaluation –need “gold standard”
