Practical Ontologies: Lessons from the GO (February 2011)
The time was 1998-99 • None of the model organism databases used standard terminology to describe biological function • Drosophila sequence was imminent • Largest genome sequenced at that time • Two weeks, 3 dozen scientists, all new software • How could we organize the annotation? • Microarray technology was the latest research tool, and results needed to be described • AI folk and ontologists organized the first “bio-ontologies” workshop at ISMB
The Gene Ontology—the beginning • A handful of biologists (4) met in a bar in Montreal after the bio-ontologies workshop to share their frustrations and decided to just do it*… • Would demonstrate possibilities for data integration across the MODs (FlyBase, SGD, MGD) • Provided an organizing principle for the Drosophila genome annotation jamboree * i.e. Describe gene products in a biologically meaningful way.
Late summer 1999 [Slide graphic: a block of raw DNA sequence]
[Diagram labels: reads, sequence, assemble, analysis; mountains of data; tentative function filtering; love-at-first-sight ‘GO’ directories; piles of data converging; functional knowns; first-pass predictions]
The Gene Ontology project • Annotate now • The importance of stress-testing • Don’t delay, use your ontology today • Do no harm (KISS) • i.e. Target the low hanging fruit, work on the obvious, high-confidence steps • Collaborate on concrete projects • Focusing the mind
Annotations • Have 3 primary components • The ontology term(s) • The entity instance (e.g. gene product) • The evidence for that assertion • An annotation is an evidence-based assertion which indicates that this entity is best classified/described by this term(s)
Read paper(s): PMID:17449867 • Identify genes: SPCC622.16c • Identify GO terms associated with each gene: GO:0005720 • What type of evidence? IDA • Resulting annotation: SPCC622.16c GO:0005720 IDA PMID:17449867
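The three components of an annotation, plus the reference that supplies the evidence, can be held in one small record. This is a minimal sketch and not the GO project's own file format: the class name `Annotation`, its field names, and the tab-separated layout are assumptions made for illustration. Only the four values come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One evidence-based assertion (field names are hypothetical)."""
    entity_id: str      # the entity instance, e.g. a gene product
    term_id: str        # the ontology term
    evidence_code: str  # the type of evidence, e.g. IDA (Inferred from Direct Assay)
    reference: str      # where the evidence was reported, e.g. a PubMed ID

    def as_row(self) -> str:
        # Tab-separated layout chosen for illustration only.
        return "\t".join(
            [self.entity_id, self.term_id, self.evidence_code, self.reference]
        )


# Values taken from the worked example on the slide.
example = Annotation("SPCC622.16c", "GO:0005720", "IDA", "PMID:17449867")
print(example.as_row())
```

Making the record immutable (`frozen=True`) reflects the idea that an annotation is an assertion made once from a given source. A correction would then be a new record, not an in-place edit.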
Classification rule: Disambiguation [Three different images, each labeled “bud initiation”] The same name can be used to describe different things.
Classification rule: Disambiguation [The same three images, relabeled] tooth bud initiation • cellular bud initiation • flower bud initiation • Include plain “bud initiation” as a synonym for each of these terms
Disambiguation • Glucose synthesis • Glucose biosynthesis • Glucose formation • Glucose anabolism • Gluconeogenesis Exactly the same thing can be described with different terms • Comparison is difficult, especially across species or across databases that each use one of these different variants • Use a single term, and plenty of synonyms
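Both disambiguation rules can be illustrated with a synonym index that maps every label to the term or terms it may refer to. The sketch below is illustrative only. The `EX:` identifiers are made up, the choice of "glucose biosynthesis" as the primary name is arbitrary, and `build_index` and `lookup` are hypothetical helpers, not part of any GO tool.

```python
from collections import defaultdict

# Hypothetical term records; the EX: identifiers are invented for this sketch.
TERMS = {
    "EX:0000001": {
        "name": "glucose biosynthesis",
        "synonyms": ["glucose synthesis", "glucose formation",
                     "glucose anabolism", "gluconeogenesis"],
    },
    "EX:0000002": {"name": "tooth bud initiation", "synonyms": ["bud initiation"]},
    "EX:0000003": {"name": "cellular bud initiation", "synonyms": ["bud initiation"]},
    "EX:0000004": {"name": "flower bud initiation", "synonyms": ["bud initiation"]},
}


def build_index(terms):
    """Map each lower-cased name or synonym to the set of term IDs using it."""
    index = defaultdict(set)
    for term_id, record in terms.items():
        for label in [record["name"], *record["synonyms"]]:
            index[label.lower()].add(term_id)
    return index


def lookup(index, label):
    """Return the sorted candidate term IDs for a label (empty if unknown)."""
    return sorted(index.get(label.strip().lower(), set()))


INDEX = build_index(TERMS)
print(lookup(INDEX, "Gluconeogenesis"))  # many labels, one term
print(lookup(INDEX, "bud initiation"))   # one label, several candidate terms
```

A label that resolves to a single term can be mapped automatically. A label that resolves to several terms, like plain "bud initiation", needs a curator to pick the right one.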
Annotation for a healthy ontology • Easier to find the most accurate term(s) to use • Avoids annotation errors • Easier for new curators to learn and understand • Develop annotation guidelines and training material • Enables automatic reasoning for searching & inference • Bottom line: • Following basic construction rules makes more useful ontologies
Improvement needed: Closing the loop [Cartoon captions: “Typical ontology developer”; “Typical wet lab PI annotating data”; “Doh! I get it now, says the computer.”]
Filling in annotation gaps (July 2008) • F = gene products annotated to GO:0016301 kinase activity • P = gene products annotated to GO:0016310 phosphorylation • |F| = 6053 • |P| = 3640 • |F ∩ P| = 2230 • |F ∩ not P| = 3823 • |P ∩ not F| = 1410
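The counts on this slide can be checked with plain set arithmetic. The sketch below assumes F is the set of gene products annotated to kinase activity and P the set annotated to phosphorylation. The integer IDs are synthetic stand-ins sized to match the slide's numbers, not real gene products, and `overlap_counts` is a hypothetical helper.

```python
def overlap_counts(function_set, process_set):
    """Summarize how two annotation sets overlap."""
    return {
        "F": len(function_set),
        "P": len(process_set),
        "F and P": len(function_set & process_set),
        "F not P": len(function_set - process_set),  # the annotation gap
        "P not F": len(process_set - function_set),
    }


# Synthetic IDs chosen so the set sizes match the July 2008 slide.
kinase = set(range(0, 6053))                     # |F| = 6053
phosphorylation = set(range(3823, 3823 + 3640))  # |P| = 3640

counts = overlap_counts(kinase, phosphorylation)
for label, size in counts.items():
    print(f"{label}: {size}")
```

The "F not P" figure is the gap: gene products said to have kinase activity but not yet said to take part in phosphorylation.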
part_of annotations propagate over part_of KIC1 IDA
part_of annotations propagate over part_of NDK1 IDA
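Propagation over part_of can be sketched as a walk up the term graph. In the toy example below, the single part_of link from kinase activity (GO:0016301) to phosphorylation (GO:0016310) is an assumption drawn from the slides. The direct annotations of KIC1 and NDK1 to kinase activity are also assumed for illustration, and `ancestors` and `propagate` are hypothetical helpers, not GO software.

```python
# Toy graph: term -> list of (relation, parent term). Assumed for illustration.
PARENTS = {
    "GO:0016301": [("part_of", "GO:0016310")],  # kinase activity part_of phosphorylation
    "GO:0016310": [],
}

# Relations over which an annotation is carried up to the parent term.
PROPAGATING = {"is_a", "part_of"}


def ancestors(term, parents):
    """Collect every term reachable from `term` via propagating relations."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for relation, parent in parents.get(current, []):
            if relation in PROPAGATING and parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate(direct, parents):
    """Return direct plus inferred terms for each gene product."""
    inferred = {}
    for gene, terms in direct.items():
        closure = set(terms)
        for term in terms:
            closure |= ancestors(term, parents)
        inferred[gene] = closure
    return inferred


# Assumed direct annotations (the slides show only the gene names and IDA).
direct = {"KIC1": {"GO:0016301"}, "NDK1": {"GO:0016301"}}
inferred = propagate(direct, PARENTS)
print(sorted(inferred["KIC1"]))
```

With the link in place, every gene product annotated to kinase activity is also inferred to be involved in phosphorylation. This is how a single ontology edit can close a large annotation gap.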
Filling in annotation gaps GO:0016310 phosphorylation GO:0016301 kinase activity 2009
The H word (2011) • Characters in common are due to inheritance • Allows inferences about common ancestor [Diagram axes: time, divergence]
Evolution of MSH2 subfamily biological process • Somatic hypermutation of immunoglobulin genes • Apoptosis • Maintenance of DNA repeats • Homologous recombination • DNA repair
Ancestral inference • Integration at points of common ancestry • Infer “hidden” character of living organisms • Explicitly leverage evolutionary relationships [Tree diagram; leaves: E.c., A.t. MTHFR1, A.t. MTHFR2, D.d., S.p., S.c. MET13, S.p., S.c. MET12, C.e., D.m., A.g., D.r., G.g., H.s. MTHFR, R.n., M.m.; evidence labels: “Biochemistry: purification and assay”, “Genetics: mutant phenotypes”; axis: divergence]
Integrating different GO annotations PAINT Phylogenetic Annotation and Inference Tool
Scoping 2009 • The ontology has a clearly specified and clearly delineated content. SGD MGD FlyBase GO
Decisions to make the work easier • Provide definitions for everything • Intelligible ontologies are more useful • To humans (for annotation) and • To machines (for searching, reasoning and error-checking) • Use content-free unique identifiers • Drive all semantics away from tracking • Don’t confuse the representational technology with the conceptual modeling
Implicit ontologies within the GO: • cysteine biosynthesis (ChEBI) • myoblast fusion (Cell Type Ontology) • hydrogen ion transporter activity (ChEBI) • snoRNA catabolism (Sequence Ontology) • wing disc pattern formation (Drosophila anatomy) • epidermal cell differentiation (Cell Type Ontology) • regulation of flower development (Plant anatomy) • B-cell differentiation (Cell Type Ontology)
Implicit anatomy ontology within the GO: brain development > hindbrain development > metencephalon development > pons development > trigeminal motor nucleus development
[Diagram labels: entities Alpha-Synuclein, Lewy body, Substantia nigra, Mouse, Ischemic Mouse, Condensed Mitochondrion, Orthodox Mitochondrion, Nucleus, Golgi Apparatus, Lysosome, Dark Material; relations “is bearer of”, “has part”, “number of”]
Common Interest • Sociology—to enlist the community, the ontology must meet each individual group’s immediate needs. • Too many people => Too many requirements • Outstanding problems • Closing the loop between ontology construction and ontology application • QC improvements • Prioritizing tasks • Visualization • …