Functional Genomics – February 2012. Biomedical Ontologies and their role in functional genomics. Judith A. Blake, Ph.D. The Jackson Laboratory.
Functional Genomics – February 2012 Biomedical Ontologies and their role in functional genomics Judith A. Blake, Ph.D. The Jackson Laboratory Func Genomics2012
Bioinformatics: What is that?
Bioinformatics is:
• the use of computers (and persistent data structures) in pursuit of biological research
• an emerging new discipline, with its own goals, research program, and practitioners
• the fundamental tool for 21st century biology
• all of the above.
Robert J. Robbins
Topics:
• We need to coordinate the representation of information
  • from genetic and genomic studies,
  • as might be reported in the biomedical literature, and
  • from the output of high-throughput experiments
• This is done by designing databases (e.g., MGI) and bio-ontologies (e.g., GO) to support comprehensive data integration
• Such resources enable comparative analysis between different organisms and biological systems
• The objective is to help us gain new knowledge about biological systems, particularly about the genetic components of human diseases
Managing Biological Information is Nothing New
[Photo: Bird Collections at the Smithsonian Natural History Museum. Roxy Laybourne and others, photo by Chip Clark]
The trouble with facts is that there are so many of them. Samuel McChord Crothers, The Gentle Reader (1903)
The data integration problem
• Vast wealth of data residing in different databases
• Meaning of those records must be reconciled for data to be automatically integrated
[Diagram: science database, medical database]
Accession File
TCTCTCCCCCGCCCCCCAGGCTCCCCCGGTCGCTCTCCTCCGGCGGTCGCCCGCGCTCGGTGGATGTGGCTCTCTCCCCCGCCCCCCAGGCTCCCCCGGTCGCTCTCCTCCGGCGGTCGCCCGCGCTCGGTGGATGTGGC TGGCAGCTGCCGCCCCCTCCCTCGCTCGCCGCCTGCTCTTCCTCGGCCCTCCGCCTCCTCCCCTCCTCCT TCTCGTCTTCAGCCGCTCCTCTCGCCGCCGCCTCCACAGCCTGGGCCTCGCCGCGATGCCGGAGAAGAGG CCCTTCGAGCGGCTGCCTGCCGATGTCTCCCCCATCAACTACAGCCTTTGCCTCAAGCCCGACTTGCTGG ACTTCACCTTCGAGGGCAAGCTGGAGGCCGCCGCCCAGGTGAGGCAGGCGACTAATCAGATTGTGATGAA TTGTGCTGATATTGATATTATTACAGCTTCATATGCACCAGAAGGAGATGAAGAAATACATGCTACAGGA TTTAACTATCAGAATGAAGATGAAAAAGTCACCTTGTCTTTCCCTAGTACTCTGCAAACAGGTACGGGAA CCTTAAAGATAGATTTTGTTGGAGAGCTGAATGACAAAATGAAAGGTTTCTATAGAAGTAAATATACTAC CCCTTCTGGAGAGGTGCGCTATGCTGCTGTAACACAGTTTGAGGCTACTGATGCCCGAAGGGCTTTTCCT TGCTGGGATGAGCCTGCTATCAAAGCAACTTTTGATATCTCATTGGTTGTTCCTAAAGACAGAGTAGCTT TATCAAACATGAATGTAATTGACCGGAAACCATACCCTGATGATGAAAATTTAGTGGAAGTGAAGTTTGC CCGCACACCTGTTATGTCTACATATCTGGTGGCATTTGTTGTGGGTGAATATGACTTTGTAGAAACAAGG TCAAAAGATGGTGTGTGTGTCCGTGTTTACACTCCTGTTGGCAAAGCAGAGCAAGGAAAATTTGCGTTAG AGGTTGCTGCTAAAACCTTGCCTTTTTATAAGGACTACTTCAATGTTCCTTATCCTCTACCTAAAATTGA TCTCATTGCTATTGCAGACTTTGCAGCTGGTGCCATGGAGAACTGGGGCCTTGTTACTTATAGGGAGACT GCATTGCTTATTGATCCAAAAAATTCCTGTTCTTCATCCCGCCAGTGGGTTGCTCTGGTTGTGGGACATG AACTCGCCCATCAATGGTTTGGAAATCTTGTTACTATGGAATGGTGGACTCATCTTTGGTTAAATGAAGG TTTTGCATCCTGGATTGAATATCTGTGTGTAGACCACTGCTTCCCAGAGTATGATATTTGGACTCAGTTT GTTTCTGCTGATTACACCCGTGCCCAGGAGCTTGACGCCTTAGATAACAGCCATCCTATTGAAGTCAGTG TGGGCCATCCATCTGAGGTTGATGAGATATTTGATGCTATATCATATAGCAAAGGTGCATCTGTCATCCG AATGCTGCATGACTACATTGGGGATAAGGACTTTAAGAAAGGAATGAACATGTATTTAACCAAGTTCCAA CAAAAGAATGCTGCCACAGAGGATCTCTGGGAAAGTTTAGAAAATGCTAGTGGTAAACCTATAGCAGCTG GTTTCTGCTGATTACACCCGTGCCCAGGAGCTTGACGCCTTAGATAACAGCCATCCTATTGAAGTCAGTG TGGGCCATCCATCTGAGGTTGATGAGATATTTGATGCTATATCATATAGCAAAGGTGCATCTGTCATCCG AATGCTGCATGACTACATTGGGGATAAGGACTTTAAGAAAGGAATGAACATGTATTTAACCAAGTTCCAA CAAAAGAATGCTGCCACAGAGGATCTCTGGGAAAGTTTAGAAAATGCTAGTGGTAAACCTATAGCAGCTG From the birth of the field of genetics until a decade ago, it was generally 
assumed that the parental origin of a gene could have no effect on its function. In the vast majority of studies carried out during the last 90 years, this paradigm has appeared to hold true. However, with increasingly sophisticated genetic and embryological investigations in the mouse, important exceptions to this rule have been uncovered over the last decade. First, the results of nuclear transplantation experiments carried out with single-cell fertilized embryos have demonstrated an absolute requirement for both a maternally-derived and a paternally-derived pronucleus to allow full-term development (McGrath and Solter, 1983). Second, in animals that receive both homologs of certain chromosomes or subchromosomal regions from one parent and not the other (through the mating of translocation heterozygotes as described in Section 5.2.3), dramatic effects on development can be observed including enhanced or retarded growth and outright lethality (Cattanach and Kirk, 1985). Third, either of two deletions that cover a small region of mouse chromosome 17 can be transmitted normally from a father to his offspring, but these same deletions cause prenatal lethality when they are maternally transmitted (Johnson, 1974; Winking and Silver, 1984). Fourth, similar parent-of-origin effects have been observed on the phenotypes expressed by animals that carry a targeted knock-out allele at the Igf2 locus (DeChiara et al., 1991). Finally, molecular techniques have been used to directly demonstrate the expression of transcripts from one parental allele and not the other at the Igf2r locus (Barlow et al., 1991) and the H19 locus (Bartolomei et al., 1991). The accumulated data indicate that a subset of mouse genes (on the order of 0.2%) will function differently in normal embryos depending on whether they have been inherited through the male or the female gamete, such that one allele will be expressed and the other will be silent.
Genomic imprinting is the term that has been coined to describe this situation in which the phenotype expressed by a gene varies depending on its parental origin (Sapienza, 1989). Further experiments have demonstrated that, in general, the "imprint" is erased and regenerated during gametogenesis so that the function of an imprintable gene is fully determined by the sex of its progenitor alone, and not by earlier ancestors.
Crash Blossoms and other semantic ambiguities
Translating what we say into what we mean: data, words and knowledge
“Violinist Linked to JAL Crash Blossoms”
“Squad Helps Dog Bite Victim”
“MacArthur Flies Back to Front”
“Red Tape Holds Up New Bridge.”
The English language is hard to learn, even for computers.
“Jessica Hahn Pooped After Long Day Testifying”
Focus: creating the data structures and mining the biomedical literature to provide knowledge representations, with the objective of using logical reasoning applications and predictive approaches to ‘interrogate’ very large data sets, generating new hypotheses for further experimental investigation
What is an ontology?
A biological ontology is:
• A formal representation of some portion of biological reality
  • what kinds of things exist?
  • what are the relationships between these things?
[Diagram: terms sense organ, eye, eye disc, ommatidium; relations is_a, develops from, part_of]
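The two questions on this slide can be sketched in a few lines of Python. This is only an illustration: the edge directions are an assumption reconstructed from the diagram labels (eye is_a sense organ, eye develops from eye disc, ommatidium part_of eye), and the function names are hypothetical.

```python
# Hypothetical mini-ontology stored as (subject, relation, object) triples.
# ASSUMPTION: edge directions are inferred from the slide's diagram labels.
TRIPLES = [
    ("eye", "is_a", "sense organ"),
    ("eye", "develops_from", "eye disc"),
    ("ommatidium", "part_of", "eye"),
]


def terms(triples):
    """What kinds of things exist? Every subject or object of a triple."""
    return sorted({t for subj, _, obj in triples for t in (subj, obj)})


def relations_of(term, triples):
    """What are the relationships? Outgoing edges of one term."""
    return [(rel, obj) for subj, rel, obj in triples if subj == term]


if __name__ == "__main__":
    print(terms(TRIPLES))
    print(relations_of("eye", TRIPLES))
```

Storing the relation name on every edge is the point of the sketch: the same pair of terms can be linked by different relations, and a query has to say which one it means.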
Why do we need ontologies?
Connections are not made explicit by default
• Computers are not intelligent
• We need to spell out interconnectedness of entities
  • Specificity: bone mineralization vs ossification
  • Granularity: osteocyte vs bone
  • Spatial: gill membrane and branchiostegal ray
  • Perspective: anatomy vs physiology
  • Causally related entities: pathways, development
  • Evolutionary: homology and descent
Ontologies: the key to data integration
• Ontologies provide:
  • rigorous, shared computable definitions for terms
  • classifications and connections that can be used for database search and inference
Biomedical Ontologies
• Ontologies are human- and machine-readable classifications of biological knowledge.
• Ontologies have:
  • Terms
  • Term definitions
  • Relationships among terms
Annotation of genes and proteins using ontologies is key to data integration
Good ontology design is required for data integration
• Not any old ontology will do
  • Data integration is served poorly by poor ontologies
• How do we know good ontologies?
  • Types and classifications should be constructed according to science and should reflect nature
  • Ontology constructed along lines of ontology best practices
    • http://www.obofoundry.org
  • Formal definitions and relations
  • Based on distinction between types and instances
  • Distinction between types and their labels
The Gene Ontology
• Mid-size
  • ~33,700 terms in all 3 ontologies
  • ~2n,nnn links (is_a, part_of, regulates)
• Each term represents a type
• Terms also have alternate labels (synonyms)
  • These do not represent distinct types
  • Humans use different labels to refer to the same biological pattern
  • E.g., endoplasmic reticulum vs ER
Ontology is not nomenclature
• A type can have many labels
  • Preferred label (term)
  • Synonyms, aliases
• Types are not labels
  • Types are the underlying pattern
  • Identified by a formal definition
• Labels are important for doing science
  • But life existed for billions of years quite happily prior to the invention of names and labels
• Good ontology separates the underlying patterns in nature from the labels used to describe them
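One way to picture the separation of types from labels is an index that maps every label, preferred or synonym, to a single stable identifier. The sketch below is illustrative only: the `Term` class and `resolve` function are hypothetical names, and the identifiers and synonyms are the ones quoted elsewhere in these slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    """A type: one stable ID, one preferred label, any number of synonyms."""
    term_id: str
    label: str
    synonyms: tuple = ()


# IDs and synonyms as given on the "GO reflects biological knowledge" slide.
TERMS = [
    Term("GO:0006099", "tricarboxylic acid cycle", ("Krebs cycle", "citric acid cycle")),
    Term("GO:0005739", "mitochondrion"),
]


def build_label_index(terms):
    """Map every label (case-insensitive) to the ID of the type it names."""
    index = {}
    for term in terms:
        for name in (term.label, *term.synonyms):
            index[name.lower()] = term.term_id
    return index


LABEL_INDEX = build_label_index(TERMS)


def resolve(name):
    """Return the type ID for a label, or None if the label is unknown."""
    return LABEL_INDEX.get(name.strip().lower())


if __name__ == "__main__":
    print(resolve("Krebs cycle"), resolve("citric acid cycle"))
```

Both synonyms resolve to the same identifier, so data annotated under either name ends up attached to the same type.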
Ontologies and annotation
• Ontologies are of little practical use without annotation
• GO has ~6 million annotations linking genes and gene products to GO terms
  • Mostly (but not all) from model organism databases (MODs) and human
  • Same terms are shared across species
• All annotation statements have provenance
  • Source/publication
  • Evidence & evidence codes
Use of GO annotations
• Database search
• Database integration
• Automating further annotation
• Data mining and data analysis
• Microarray analysis:
  • 1. Extract cluster of co-expressed genes
  • 2. Analyze annotations for enrichment of certain terms
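The enrichment step in the microarray workflow is commonly done with a one-sided hypergeometric test; the slide does not name a specific statistic, so that choice is an assumption here. The sketch below uses only the standard library, made-up gene names, and no multiple-testing correction, which a real analysis would need.

```python
from math import comb


def hypergeom_sf(k, population, successes, draws):
    """P(X >= k) when drawing `draws` genes without replacement from
    `population` genes, of which `successes` carry the term."""
    upper = min(successes, draws)
    total = comb(population, draws)
    hits = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return hits / total


def term_enrichment(cluster, annotations):
    """Return {term: p-value} for every term seen in the cluster.

    `annotations` maps each background gene to its set of GO term IDs.
    """
    population = len(annotations)
    cluster = [gene for gene in cluster if gene in annotations]
    results = {}
    for term in {t for gene in cluster for t in annotations[gene]}:
        annotated = sum(1 for terms in annotations.values() if term in terms)
        in_cluster = sum(1 for gene in cluster if term in annotations[gene])
        results[term] = hypergeom_sf(in_cluster, population, annotated, len(cluster))
    return results


if __name__ == "__main__":
    # Hypothetical background of 10 genes; g1 and g2 form the cluster.
    background = {f"g{i}": set() for i in range(1, 11)}
    for gene in ("g1", "g2"):
        background[gene].add("GO:0006099")
    for gene in ("g1", "g2", "g3", "g4", "g5", "g6"):
        background[gene].add("GO:0005739")
    for term, p in sorted(term_enrichment(["g1", "g2"], background).items()):
        print(term, round(p, 4))
```

A term carried by only the two clustered genes gets a much smaller p-value than a term carried by most of the background, which is the signal the analysis looks for.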
What is a Database?
• An organized body of related information
• In computing, a database can be defined as a structured collection of records or data that is stored in a computer so that a program can consult it to answer queries. The records retrieved in answer to queries become information that can be used to make decisions.
Mouse Genome Informatics (MGI) Database
• Comprehensive information resource about the laboratory mouse
• Provides consensus representation of the mouse genome
• International scientific community resource
• Integrated data acquisition and query capabilities
MGI Database is a Relational Database: Information is stored in tables that have relationships to each other. This facilitates query and retrieval of subsets of data.
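As a toy illustration of "tables that have relationships to each other", the sketch below builds two related tables in an in-memory SQLite database and retrieves a subset with a join. The table names, column names, and gene symbols are hypothetical and do not reflect the actual MGI schema.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with two related tables (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE gene (
            gene_id INTEGER PRIMARY KEY,
            symbol  TEXT NOT NULL
        );
        CREATE TABLE go_annotation (
            gene_id  INTEGER NOT NULL REFERENCES gene(gene_id),
            go_id    TEXT NOT NULL,
            evidence TEXT NOT NULL
        );
        """
    )
    conn.executemany("INSERT INTO gene VALUES (?, ?)", [(1, "GeneA"), (2, "GeneB")])
    conn.executemany(
        "INSERT INTO go_annotation VALUES (?, ?, ?)",
        [(1, "GO:0005739", "IDA"), (2, "GO:0005739", "IEA"), (2, "GO:0006099", "IMP")],
    )
    return conn


def genes_for_term(conn, go_id):
    """Join the two tables to find the symbols of genes annotated to one term."""
    rows = conn.execute(
        """
        SELECT g.symbol
        FROM gene AS g
        JOIN go_annotation AS a ON a.gene_id = g.gene_id
        WHERE a.go_id = ?
        ORDER BY g.symbol
        """,
        (go_id,),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    db = build_demo_db()
    print(genes_for_term(db, "GO:0005739"))
```

The shared `gene_id` column is what relates the tables; the query never needs to know how either table is stored.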
Database Resource: Mouse Genome Informatics (MGI)
MGI’s primary mission is to facilitate the use of mouse as a model for human biology by providing integrated access to data on the genetics, genomics, and biology of the laboratory mouse.
[Diagram labels: variants & polymorphisms; expression; strain genealogy; Hermansky-Pudlak syndrome, mouse model & human phenotype; sequence; genome location; tumors; mouse/human orthologs & maps; gene function]
Information content spans from sequence to phenotype/disease
MGI integrates genetic, genomic and phenotypic data
Integrate: gather data from multiple sources, factor out common objects, assemble integrated objects
• Within MGI
  • Genes
  • Sequence
  • Expression
  • Literature
  • Alleles
  • Phenotypes
• Between MGI and others
  • Via shared sequence annotations: UniProt, EntrezGene, Ensembl
  • Via shared semantic representations: Drosophila, Arabidopsis, etc.
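The gather, factor out, and assemble steps can be pictured as merging records from several sources on a shared identifier. The sketch below is a simplified illustration, not a description of the MGI load software; the source names, field names, and the `shared_id` key are all hypothetical.

```python
from collections import defaultdict


def integrate(sources):
    """Merge records from several sources into one object per shared ID.

    `sources` maps a source name to a list of dict records, each carrying a
    'shared_id' key (hypothetical) that identifies the common object.
    """
    merged = defaultdict(dict)
    for source_name, records in sources.items():  # gather
        for record in records:
            key = record["shared_id"]  # factor out the common object
            for field_name, value in record.items():
                if field_name != "shared_id":
                    # assemble, keeping track of where each value came from
                    merged[key][f"{source_name}.{field_name}"] = value
    return dict(merged)


if __name__ == "__main__":
    demo = {
        "sequence_db": [{"shared_id": "ACC1", "length": 1200}],
        "protein_db": [{"shared_id": "ACC1", "name": "example protein"}],
    }
    print(integrate(demo))
```

Prefixing each field with its source is one simple way to keep provenance visible after the merge.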
Annotation Pipeline
Literature & Loads: New Gene, Strain or Sequence?
• Data Acquisition
• Object Identity
• Standardizations
• Data Associations
• Integration with other bioinformatics resources
Controlled Vocabularies
Evidence & Citation
Co-curation of shared objects and concepts
Automated (mostly) Data Integration (Loads)
[Diagram of loads into the MGI db. Labels shown: Clones, EG mouse, GO, RPCI, UniProt, MGC, MP, Associations, Vocabularies, DFCI, Anatomy, DoTS, InterPro, NIA, OMIM, UniGene, PIRSF, TreeFam, Annotation, Gene traps, GenBank, EG chimp, RefSeq, EG dog, Sequences, EG rat, EG human, DFCIseq, HCOP, DoTSseq, HomoloGene, NIAseq, Non-mouse, dbSNP, NCBI, VEGA, SNP db, UniSTS, Gene models and coordinates, Ensembl, microRNAs]
Manual (mostly) annotation of the biomedical literature > 12,000 / year
Data acquisition is constant
Load Program: Summary of Data Loaded
Mouse EntrezGene: EntrezGene IDs for mouse markers, plus marker-to-sequence associations from EntrezGene not already in MGD
Human/Rat EntrezGene: Nomenclature, map position and other data regarding human and rat genes; OMIM associations for human
GenBank Seq: Mouse sequence records from GenBank
RefSeq Seq: Mouse sequence records from RefSeq
UniProt/TrEMBL Seq: Mouse sequence records from UniProt and TrEMBL
TIGR/DoTS/NIA Seq: Mouse consensus sequence records from TIGR/DoTS/NIA clusters
TIGR/DoTS/NIA Association: Associations between TIGR/DoTS/NIA cluster sequences and markers
Ensembl Gene Model: Ensembl gene model sequences, coordinates, & associations between these & markers
NCBI Gene Model: NCBI gene model sequences, coordinates, & associations between these & markers
UniProt Association: UniProt/TrEMBL IDs and additional GenBank IDs for mouse markers, plus GO and InterPro annotations
UniGene Association: UniGene cluster IDs for mouse markers
EST cDNA Clone: Mouse IMAGE, NIA, MGC, Riken, cDNAs and EST sequence associations
MGC Association: MGC IDs and associations between MGC full length sequences and MGC cDNAs
RPCI Clone: RPCI 23/24 BAC clones and sequence associations
GO Vocabulary: Updated Gene Ontology (GO) vocabularies from the central GO site
OMIM Vocabulary: Updated OMIM disease terms
MP Vocabulary: Updated MP vocabulary (from OBO-Edit)
Anatomy: Updated adult mouse anatomy ontology (from OBO-Edit)
Mapping panel: JAX, EUCIB, Copeland-Jenkins and many others
PIRSF: Mouse PIR superfamily terms and associations to markers
SNPs: Mouse SNPs from dbSNP and associations between SNPs & markers
Who is the authority? Mouse data for which MGI serves as the authoritative source.
Data type: Working relationship
Gene Symbol/Name: MGD makes primary assignment; coordination with HGNC, RGNC
Allele Symbol/Name: MGD makes primary assignment
Strain Designations: MGD makes primary assignment
Gene-to-nucleotide sequence association: Co-curation with NCBI
Gene-to-protein sequence association: Co-curation with UniProt
Gene Ontology (GO) annotations: MGD provides primary data set
Mammalian Phenotype Ontology: MGD develops and applies vocabulary
Gene homology data between mouse & other species: MGD curated orthology relationships
Genotype-to-phenotype data: MGD provides primary curation
Mouse model-to-human disease (OMIM): MGD provides primary curation
Snapshot of MGI data content
Having the data, we want to ask complex questions
The knowledge is in the details
Curators use controlled terms from structured vocabularies (ontologies) to annotate complex biological systems described in the literature
Keyword lists support data integration
Keyword lists standardize descriptions and enable comprehensive data retrieval
[Keyword lists shown: Gene Nomenclature, Gene/Marker Type, Allele Type, Assay Type, Expression, Mapping, Molecular Mutation, Inheritance Mode, Tissue Types, Cell Types, Cell Lines, Units, Cytogenetic, Molecular, ES Cell Line, Strain Nomenclature]
But, keyword lists are not enough
[Example process terms: Organogenesis, Blood vessel development, Angiogenesis, Vasculogenesis]
• Sheer number of terms too much to remember and sort
  • Need standardized, stable, carefully defined terms
  • Need to describe different levels of detail
  • So, defined terms need to be related in a hierarchy
• With structured vocabularies/hierarchies
  • Parent/child relationships exist between terms
  • Increased depth -> increased resolution
  • Can annotate data at appropriate level
  • May query at appropriate level
• All model organism databases and genome annotation systems have the same issues
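Annotating at a specific level and querying at a more general one can be sketched with a small parent table. The hierarchy below is an assumption based on the four process terms shown on the slide (angiogenesis and vasculogenesis under blood vessel development, which sits under organogenesis); the gene names are made up.

```python
# ASSUMPTION: parent/child structure inferred from the slide's example terms.
PARENTS = {
    "angiogenesis": ["blood vessel development"],
    "vasculogenesis": ["blood vessel development"],
    "blood vessel development": ["organogenesis"],
    "organogenesis": [],
}


def ancestors(term, parents):
    """All terms reachable by following parent links upward from `term`."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def genes_at_level(query_term, annotations, parents):
    """Genes annotated to `query_term` itself or to any of its descendants."""
    return sorted(
        gene
        for gene, term in annotations
        if term == query_term or query_term in ancestors(term, parents)
    )


if __name__ == "__main__":
    # Hypothetical (gene, term) annotations made at different levels of detail.
    demo = [
        ("GeneA", "angiogenesis"),
        ("GeneB", "vasculogenesis"),
        ("GeneC", "organogenesis"),
    ]
    print(genes_at_level("blood vessel development", demo, PARENTS))
```

A flat keyword list cannot answer the general query, because nothing in it says that angiogenesis is a kind of blood vessel development.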
And so, we started the Gene Ontology (GO)
• Formed to develop a shared language adequate for the annotation of molecular characteristics across organisms; a common language to share knowledge.
• Seeks to achieve a mutual understanding of the definition and meaning of any word used; thus we are able to support cross-database queries.
• Members agree to contribute gene product annotations and associated sequences to the GO database, thus facilitating data analysis and semantic interoperability.
What is Ontology?
• Dictionary: A branch of metaphysics concerned with the nature and relations of being.
• Barry Smith: The science of what is, of the kinds and structures of objects, properties, events, processes and relations in every area of reality.
1700s 1606
A biological ontology is:
A (machine and human) interpretable representation of some aspect of biological reality
• what kinds of things exist?
• what are the relationships between these things?
[Diagram: terms optic placode, sense organ, eye, sclera; relations develops from, is_a, part_of. Image source: http://www.macula.org/anatomy/eyeframe.html]
Gene Ontology: widely adopted AgBase
GO represents selected molecular domains
• Molecular Function = elemental activity/task
  • the tasks performed by individual gene products; examples are carbohydrate binding and ATPase activity
• Biological Process = biological goal or objective
  • broad biological goals, such as mitosis or purine metabolism, that are accomplished by ordered assemblies of molecular functions
• Cellular Component = location or complex
  • subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and RNA polymerase II holoenzyme
Related ontologies, separate from the three GO ontologies, cover other domains:
• Sequence Ontology = genome features
  • regions, attributes, variants; examples include exon, CpG island, and transgenic insertion
• Cell Ontology = cell types
  • examples include photoreceptor cell and pillar cell
GO reflects biological knowledge for computers
Cellular Component
  GO term: mitochondrion
  GO id: GO:0005739
  Definition: A semiautonomous, self replicating organelle that occurs in varying numbers, shapes, and sizes in the cytoplasm of virtually all eukaryotic cells. It is notably the site of tissue respiration.
Biological Process
  GO term: tricarboxylic acid cycle
  Synonym: Krebs cycle
  Synonym: citric acid cycle
  GO id: GO:0006099
Molecular Function
  GO term: malate dehydrogenase
  GO id: GO:0030060
  Definition: (S)-malate + NAD(+) = oxaloacetate + NADH.
Terms are defined graphically relative to other terms
Ontology Structure
[Diagram: nodes connected by edges]
• Ontologies can be represented as graphs, where the nodes are connected by edges
  • Nodes = terms in the ontology
  • Edges = relationships between the terms
Ontological relations
• Types are related
  • Network of terms forms a graph
  • Terms (nodes)
• The edge type (relation) is important
• Two common relations:
  • is_a
  • part_of
Types (represented in the ontology): eyeball is_a cavitated organ is_a organ
Instances (NOT represented in the ontology): linked to their types by instance_of
Formal definition of is_a
• is_a holds between types
• X is_a Y holds if and only if:
  • Given any thing that instantiates X at some time, that thing also instantiates Y at the same time
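The definition above can be written as a single logical statement. The predicate inst(x, X, t), read as "x instantiates type X at time t", is notation introduced here for illustration and does not appear on the slide.

```latex
X \;\texttt{is\_a}\; Y
\iff
\forall x \,\forall t \;\bigl(\mathrm{inst}(x, X, t) \rightarrow \mathrm{inst}(x, Y, t)\bigr)
```

For example, eyeball is_a cavitated organ holds because anything that is an eyeball at some time is also a cavitated organ at that same time.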
GO terms are used for functional annotations
Brain development [GO:0007420] (141 genes, 207 annotations)
[Browser legend: I denotes an ‘is-a’ relationship; P denotes a ‘part-of’ relationship]
Annotations are assertions
• There is evidence that this gene product can be best classified using this term
• The source of the evidence and other information is included
• There is agreement on the meaning of the term
Annotating Gene Products using GO
Gene Product: P05147
GO Term: GO:0047519
Evidence: IDA
Reference: PMID:2976880
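The four parts of the annotation on this slide map naturally onto a small record type. The sketch below parses a whitespace-separated line into such a record; the four-field layout and the `parse_annotation` name are simplifications assumed for illustration, and real GO annotation files carry more columns than this.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    """One annotation statement: who, what, on what basis, and from where."""
    gene_product: str
    go_term: str
    evidence: str
    reference: str


def parse_annotation(line):
    """Parse 'gene_product go_term evidence reference' (hypothetical layout)."""
    fields = line.split()
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}")
    gene_product, go_term, evidence, reference = fields
    if not go_term.startswith("GO:"):
        raise ValueError(f"not a GO identifier: {go_term}")
    return Annotation(gene_product, go_term, evidence, reference)


if __name__ == "__main__":
    # Values taken from the slide.
    print(parse_annotation("P05147 GO:0047519 IDA PMID:2976880"))
```

Keeping evidence and reference on every record is what gives each annotation statement its provenance.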
Evidence codes describe the basis of the annotation
• IDA: Inferred from direct assay
• IPI: Inferred from physical interaction
• IMP: Inferred from mutant phenotype
• IGI: Inferred from genetic interaction
• IEP: Inferred from expression pattern
• IEA: Inferred from electronic annotation
• ISS: Inferred from sequence or structural similarity
• TAS: Traceable author statement
• NAS: Non-traceable author statement
• IC: Inferred by curator
• RCA: Reviewed Computational Analysis
• ND: no data available
[Slide labels grouping the codes: Direct Experiment in organism; NO Direct Experiment; Inferred from evidence]
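A common use of evidence codes is to keep only annotations backed by direct experiments. The sketch below assumes that the first five codes in the list (IDA, IPI, IMP, IGI, IEP) make up the "Direct Experiment" group, which matches the GO Consortium's experimental evidence category; the slide itself does not spell out the assignment, and the function names and sample records are hypothetical.

```python
# Code descriptions as listed on the slide.
EVIDENCE_CODES = {
    "IDA": "Inferred from direct assay",
    "IPI": "Inferred from physical interaction",
    "IMP": "Inferred from mutant phenotype",
    "IGI": "Inferred from genetic interaction",
    "IEP": "Inferred from expression pattern",
    "IEA": "Inferred from electronic annotation",
    "ISS": "Inferred from sequence or structural similarity",
    "TAS": "Traceable author statement",
    "NAS": "Non-traceable author statement",
    "IC": "Inferred by curator",
    "RCA": "Reviewed Computational Analysis",
    "ND": "no data available",
}

# ASSUMPTION: these five codes form the "Direct Experiment" group.
EXPERIMENTAL = frozenset({"IDA", "IPI", "IMP", "IGI", "IEP"})


def is_experimental(code):
    """True if the code is in the assumed experimental group."""
    if code not in EVIDENCE_CODES:
        raise KeyError(f"unknown evidence code: {code}")
    return code in EXPERIMENTAL


def experimental_only(annotations):
    """Keep (gene, term, code) records whose code is experimental."""
    return [record for record in annotations if is_experimental(record[2])]


if __name__ == "__main__":
    demo = [
        ("GeneA", "GO:0005739", "IDA"),
        ("GeneB", "GO:0005739", "IEA"),
        ("GeneB", "GO:0006099", "IMP"),
    ]
    print(experimental_only(demo))
```

Filtering this way trades coverage for confidence, since electronically inferred annotations are typically far more numerous than experimental ones.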