Learn about the Gene Ontology (GO), a set of three structured vocabularies that provide functional annotation of gene products. Discover how GO is dynamically cross-referenced to external databases. Find out how GO terms can be inserted into the Unified Medical Language System (UMLS) to expand its biomedical meaning and improve information retrieval.
The Gene Ontology • Set of three structured vocabularies • Provide functional annotation of gene products • Dynamic • Cross-references to external databases
The vocabularies • Molecular function — elemental activity or task • nuclease, DNA binding, microtubule motor • Biological process — broad objective or goal • mitosis, signal transduction, metabolism • Cellular component — location or complex • nucleus, ribosome
GO structure • Directed acyclic graph (DAG) • Allows multiple parentage
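A minimal sketch of what "directed acyclic graph with multiple parentage" means in practice. The term names below form a hypothetical toy fragment for illustration and are not copied from GO; the helper names `is_acyclic` and `multi_parent_terms` are likewise assumptions.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical toy fragment: each term maps to the set of its parents.
PARENTS = {
    "molecular function": set(),
    "binding": {"molecular function"},
    "catalytic activity": {"molecular function"},
    "DNA binding": {"binding"},
    "helicase activity": {"catalytic activity"},
    # Two parents: this is what a strict tree could not represent.
    "DNA helicase activity": {"DNA binding", "helicase activity"},
}


def is_acyclic(parents):
    """Return True if the child -> parents mapping contains no cycle."""
    try:
        TopologicalSorter(parents).prepare()
        return True
    except CycleError:
        return False


def multi_parent_terms(parents):
    """Return the terms that have more than one parent, sorted by name."""
    return sorted(term for term, ps in parents.items() if len(ps) > 1)


if __name__ == "__main__":
    print("acyclic:", is_acyclic(PARENTS))
    print("multiple parents:", multi_parent_terms(PARENTS))
```

A child-to-parents mapping is used here because it makes multiple parentage visible at a glance; other representations (edge lists, adjacency matrices) would work equally well.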
True-path rule • Every path from a node back to the root must be biologically accurate
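One way to make the true-path rule concrete is to enumerate every path from a term back to the root, since each of those paths has to hold biologically. The sketch below assumes a hypothetical toy graph (not copied from GO) and illustrative function names.

```python
# Hypothetical toy fragment: each term maps to a list of its parents.
PARENTS = {
    "molecular function": [],
    "binding": ["molecular function"],
    "catalytic activity": ["molecular function"],
    "DNA binding": ["binding"],
    "helicase activity": ["catalytic activity"],
    "DNA helicase activity": ["DNA binding", "helicase activity"],
}


def paths_to_root(term, parents):
    """Return every path from `term` up to a root, as lists of term names."""
    term_parents = parents.get(term, [])
    if not term_parents:
        return [[term]]
    return [
        [term] + rest
        for parent in sorted(term_parents)
        for rest in paths_to_root(parent, parents)
    ]


def implied_terms(term, parents):
    """Return the term plus every ancestor reached on any path to the root."""
    return {t for path in paths_to_root(term, parents) for t in path}


if __name__ == "__main__":
    for path in paths_to_root("DNA helicase activity", PARENTS):
        print(" -> ".join(path))
```

If a gene product is annotated to a term, the rule implies the annotation should also make sense for every term returned by `implied_terms`; a path where it does not signals a problem in the graph.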
Relationship types • is_a • subclass: a is a type of b • part_of • physical part of (component) • sub-process of (process)
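The two relationship types can be sketched as labelled edges, so that ancestors can be collected for is_a only, part_of only, or both. The edges below are a hypothetical illustration rather than a copy of GO, and `build_index` and `ancestors` are assumed helper names.

```python
from collections import defaultdict

# Hypothetical toy edges: (child, relationship type, parent).
EDGES = [
    ("nucleolus", "part_of", "nucleus"),
    ("nucleus", "is_a", "membrane-bounded organelle"),
    ("membrane-bounded organelle", "is_a", "organelle"),
    ("organelle", "part_of", "cell"),
]


def build_index(edges):
    """Group edges by child term: child -> list of (relation, parent)."""
    index = defaultdict(list)
    for child, relation, parent in edges:
        index[child].append((relation, parent))
    return index


def ancestors(term, index, relations=("is_a", "part_of")):
    """Collect all ancestors reachable through the given relationship types."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for relation, parent in index.get(current, []):
            if relation in relations and parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


if __name__ == "__main__":
    index = build_index(EDGES)
    print(sorted(ancestors("nucleolus", index)))
    print(sorted(ancestors("nucleus", index, relations=("is_a",))))
```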
What makes up a GO term? • term name • go_id • definition and definition dbxref • GO synonym • general dbxref • comment
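The fields listed above map naturally onto a small record type. The sketch below is an assumption about how one might hold a term in memory, not the GO file format itself; the class name, field names, and the placeholder values are all hypothetical. The only check it makes is that a go_id has the form "GO:" followed by seven digits.

```python
import re
from dataclasses import dataclass, field

GO_ID_PATTERN = re.compile(r"^GO:\d{7}$")


@dataclass
class GOTerm:
    """In-memory record for one GO term (illustrative field names)."""

    name: str
    go_id: str
    definition: str = ""
    definition_dbxrefs: list[str] = field(default_factory=list)
    synonyms: list[str] = field(default_factory=list)
    dbxrefs: list[str] = field(default_factory=list)
    comment: str = ""

    def __post_init__(self):
        # Reject identifiers that are not "GO:" plus seven digits.
        if not GO_ID_PATTERN.match(self.go_id):
            raise ValueError(f"not a valid GO id: {self.go_id!r}")


if __name__ == "__main__":
    # Placeholder values only; not a real GO entry.
    term = GOTerm(
        name="example activity",
        go_id="GO:0000000",
        definition="Placeholder definition text.",
        synonyms=["example synonym"],
    )
    print(term)
```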
GO cross-links • Cross-references within GO • EC • RESID • MetaCyc • Mappings • SWISS-PROT keywords • Links in other databases • InterPro • UMLS/MeSH – in progress
Why insert GO into UMLS? • A rich, widely used source for expanding UMLS • Can be used to improve areas of MeSH • Potential for ‘non-fuzzy’ text mining using GO terms • MeSH terms manually assigned to papers
Unified Medical Language System (UMLS) • Research project maintained by the National Library of Medicine (NLM) • Aims to • allow computers to ‘understand’ biomedical meaning • improve retrieval and integration of computer-readable info • Has three ‘Knowledge sources’: • UMLS Metathesaurus • SPECIALIST lexicon • semantic network
Knowledge sources • UMLS Metathesaurus • links multiple source vocabularies into unified concepts, includes MeSH (Medical Subject Headings) • GO to become source vocabulary • SPECIALIST lexicon • provides biomedical/English lexical info • semantic network • for categorizing concepts
Inserting GO into UMLS • inversion • converting GO to correct format for UMLS • insertion • inserting GO using matching algorithms • editing • all concepts containing GO term reviewed by hand
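The slides do not describe the matching algorithms used during insertion. As a rough illustration only, the sketch below matches GO term names against names from another vocabulary after a naive normalization (lowercasing and replacing punctuation with spaces). The function names and the second term list are hypothetical, and real matching is likely more involved.

```python
import re


def normalize(name):
    """Lowercase a name and collapse anything non-alphanumeric to one space."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


def match_terms(go_names, other_names):
    """Map each GO name to the other-vocabulary names with the same normal form."""
    index = {}
    for other in other_names:
        index.setdefault(normalize(other), []).append(other)
    return {
        go: index[normalize(go)]
        for go in go_names
        if normalize(go) in index
    }


if __name__ == "__main__":
    go_names = ["DNA binding", "signal transduction", "microtubule motor"]
    # Hypothetical names standing in for another source vocabulary.
    other_names = ["Signal Transduction", "DNA-Binding", "Ribosomes"]
    for go, hits in match_terms(go_names, other_names).items():
        print(go, "<->", hits)
```

Exact matching after normalization will produce both false matches and misses, which is one reason a manual editing step, as listed on the slide, remains necessary.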
Statistics • Approximately 23% of GO terms ‘match’ something in another source vocabulary • 23.03% of GO terms are in concepts with other sources • 76.97% of GO terms are in concepts where they are the only source
Statistics • % of GO terms in concepts with other sources, by GO vocabulary • Chart labels: biological process, cellular component, molecular function • Chart values: 4.6%, 27.8%, 45.2%
Statistics • % of GO terms in concepts with other sources, by source • 19.74% MSH2003_2002_08_14 (Medical Subject Headings) • 7.34% CSP2002 (Computer Retrieval of Information on Scientific Projects Thesaurus) • 11.05% SNMI98 (Systematized Nomenclature of Human and Veterinary Medicine) • Diagram labels: SNOMED, CRISP, GO, MeSH
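The percentages on the statistics slides can be read as counts over concepts: a GO term counts as "shared" when its concept also contains an atom from another source. The sketch below shows that arithmetic on a hypothetical four-concept example; the concept ids, atoms, and function names are invented for illustration, and the real figures come from the full Metathesaurus.

```python
# Hypothetical concepts: concept id -> list of (source, term) atoms.
CONCEPTS = {
    "C0000001": [("GO", "mitosis"), ("MSH", "Mitosis"), ("CSP", "mitosis")],
    "C0000002": [("GO", "microtubule motor")],
    "C0000003": [("GO", "nucleus"), ("SNMI", "Nucleus")],
    "C0000004": [("GO", "nuclease")],
}


def percent_shared(concepts):
    """Percentage of GO terms whose concept also holds a non-GO source."""
    shared = total = 0
    for atoms in concepts.values():
        n_go = sum(1 for source, _ in atoms if source == "GO")
        total += n_go
        if {source for source, _ in atoms} - {"GO"}:
            shared += n_go
    return 100.0 * shared / total if total else 0.0


def percent_by_source(concepts):
    """Percentage of GO terms sharing a concept with each non-GO source."""
    total = 0
    counts = {}
    for atoms in concepts.values():
        n_go = sum(1 for source, _ in atoms if source == "GO")
        total += n_go
        for source in {s for s, _ in atoms} - {"GO"}:
            counts[source] = counts.get(source, 0) + n_go
    if not total:
        return {}
    return {source: 100.0 * n / total for source, n in counts.items()}


if __name__ == "__main__":
    print(percent_shared(CONCEPTS))
    print(percent_by_source(CONCEPTS))
```

Because one concept can contain atoms from several sources, the per-source percentages can add up to more than the overall shared percentage.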
UMLS concept record (figure labels) • concept name • concept id • definition • MeSH atoms • GO atoms • contexts • EC number • relationships to other concepts
Challenges with insertion • GO synonyms • As GO has evolved, not all of its synonyms are still strictly synonymous • GO enzymes • GO separates enzyme function from enzyme ‘complexes’; most vocabularies don’t • Semantic types • Which semantic types now apply to concepts with GO atoms?
Future of insertion • It is hoped that GO can be released with UMLS early next year • dependent on ironing out problems • Maintenance of insertion • GO changes continually, so there will be large differences between UMLS releases
www.geneontology.org • FlyBase & Berkeley Drosophila Genome Project • Saccharomyces Genome Database • PomBase (Sanger Institute) • Rat Genome Database • Genome Knowledge Base (CSHL) • The Institute for Genomic Research • Compugen, Inc • The Arabidopsis Information Resource • WormBase • DictyBase • Mouse Genome Informatics • Swiss-Prot/TrEMBL/InterPro • Pathogen Sequencing Unit (Sanger Institute) • National Library of Medicine • Alexa McCray • Stuart Nelson • Bill Hole • Oak Ridge Institute for Science and Education • National Library of Medicine • U.S. Department of Energy
The Gene Ontology Consortium is supported by an R01 grant from the National Human Genome Research Institute (NHGRI) [grant HG02273]. SGD is supported by a P41, National Resources, grant from the NHGRI [grant HG01315]; MGD by a P41 from the NHGRI [grant HG00330]; GXD by the National Institute of Child Health and Human Development [grant HD33745]; FlyBase by a P41 from the NHGRI [grant HG00739] and by the Medical Research Council, London. TAIR is supported by the National Science Foundation [grant DBI-9978564]. WormBase is supported by a P41, National Resources, grant from the NHGRI [grant HG02223]; RGD is supported by an R01 grant from the NHLBI [grant HL64541]; DictyBase is supported by an R01 grant from the NIGMS [grant GM064426].