Week 9 – Using Ontologies in Biomedical Research MED267 Modeling Clinical Data and Knowledge for Computation Amarnath Gupta
SOME PRELIMINARIES AND SOME NOT-SO-PRELIMINARIES A brief Recap
Ontology • A formal representation of knowledge as a set of concepts within a domain, and the relationships between those concepts. • Classes: sets, collections, concepts, classes in programming, types of objects, or kinds of things • Attributes: aspects, properties, features, characteristics, or parameters that objects (and classes) can have • Relations: ways in which classes and individuals can be related to one another • Individuals: instances or objects (the basic or "ground level" objects) • Restrictions: formally stated descriptions of what must be true in order for some assertion to be accepted as input • Rules: statements in the form of an antecedent-consequent sentence that describe the logical inferences that can be drawn from an assertion in a particular form • Axioms: assertions (including rules) in a logical form that together comprise the overall theory in its domain of application • Often expressed in a language called OWL-DL
An ontology can be viewed as a graph with an acyclic backbone and a logical interpretation
Querying ontologies • An ontology is a 2-graph system • Class graph • Instance graph • Data Queries • Binding retrieval • Subgraph retrieval • Reasoner Queries • Inferencing • Classification • Consistency • SPARQL 1.0 • An edge query language • SPARQL 1.1 • An edge query language with regular expressions on edges • OWL-QL • DL query language • Rule Language • SWRL • Emerging trends • Keyword query languages • Subgraph query languages We will revisit the query language issue as we go forward
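To make the two kinds of data query concrete, here is a minimal Python sketch, not a real SPARQL engine. It treats the ontology as an edge list, does binding retrieval for a single edge pattern, and follows `subClassOf` edges transitively, which is roughly what a SPARQL 1.1 path expression over an edge label does. The triples, the relation names and the helper names `match` and `superclasses` are illustrative assumptions, not part of the lecture.

```python
from typing import Dict, Iterator, List, Set, Tuple

Triple = Tuple[str, str, str]

# Toy edge list mixing class-graph and instance-graph edges (illustrative names only).
TRIPLES: List[Triple] = [
    ("HIBM", "subClassOf", "myopathy"),
    ("myopathy", "subClassOf", "muscle_disease"),
    ("muscle_disease", "subClassOf", "disease"),
    ("patient_17", "type", "HIBM"),
    ("GNE", "associated_with", "HIBM"),
]


def match(pattern: Triple, triples: List[Triple]) -> Iterator[Dict[str, str]]:
    """Binding retrieval for one edge pattern; terms starting with '?' are variables."""
    for triple in triples:
        binding: Dict[str, str] = {}
        for p, t in zip(pattern, triple):
            if p.startswith("?"):
                # A repeated variable must bind to the same value each time.
                if binding.setdefault(p, t) != t:
                    break
            elif p != t:
                break
        else:
            yield binding


def superclasses(cls: str, triples: List[Triple]) -> Set[str]:
    """Follow subClassOf edges one or more times (a transitive path query)."""
    seen: Set[str] = set()
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for b in match((current, "subClassOf", "?sup"), triples):
            if b["?sup"] not in seen:
                seen.add(b["?sup"])
                frontier.append(b["?sup"])
    return seen
```

For example, `list(match(("?gene", "associated_with", "?disease"), TRIPLES))` returns the single binding for GNE and HIBM, while `superclasses("HIBM", TRIPLES)` walks up the class graph. Reasoner queries (classification, consistency) need a description-logic reasoner and are outside this sketch.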
Upper Ontologies • An upper ontology (or foundation ontology) is a model of the common objects that are generally applicable across a wide range of domain ontologies. It employs a core glossary that contains the terms, associated object properties and relationships as they are used in various relevant domain sets. • We have used the Basic Formal Ontology (BFO, http://www.ifomis.org/bfo/publications) and the Relation Ontology (RO) for our work • Continuants and Occurrents • Classification and Differentiation • Standardizing Relationships • Temporal parameter • Example definition: plasma membrane is a cell component that has as its parts a maximal phospholipid bilayer in which instances of two or more types of protein are embedded. • Smith B, Ceusters W, Klagges B, Kohler J, Kumar A, Lomax J, Mungall CJ, Neuhaus F, Rector A, Rosse C. Relations in Biomedical Ontologies. Genome Biology, 2005.
BFO (figure) • Continuant • Independent Continuant: thing • Dependent Continuant: quality (e.g. temperature), depends on bearer • Occurrent: process, event
Merging Ontologies • Any real application needs to make use of multiple ontologies • Often the strategy is to construct a specific ontology by assembling elements of multiple ontologies • What happens if • One ontology uses an upper ontology (say BFO) and another doesn’t • One ontology uses a fixed set of relationships and another doesn’t
OBI – An Ontology for Biomedical Investigations • OBI models experiments
Some material entities in OBI (figure labels) • material entity • processed material • device • organization • cell culture • chemical entities in solution • PCR product • molecular entity (ChEBI) • protein complex (Gene Ontology) • cell (Cell Ontology) • anatomical entity (FMA, CARO) • organism (NCBI taxonomy)
MIREOT: Minimal Information to Reference External Ontology Terms • The idea • the minimal set of information that allows a term to be identified unambiguously • URI of the class • URI of the source ontology • Superclass of the term in the source ontology • Position in the target ontology • Additional useful information • Label • Definition • Other annotations: adding “human-readable” information • Superclasses: for example, NCBI taxonomy • Problem • Lose complete inference • But because the imported ontology might not be commensurate with the base ontology, the inferences are questionable
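The minimal set listed above can be pictured as a small record. The following Python sketch is only an illustration of that record, not the MIREOT tooling itself; the class name `MireotReference`, its field names and the `example.org` URIs are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MireotReference:
    """One externally referenced term, holding the minimal fields named on the slide."""
    term_uri: str                # URI of the class
    source_ontology_uri: str     # URI of the source ontology
    source_superclass_uri: str   # superclass of the term in the source ontology
    target_parent_uri: str       # position in the target ontology
    # Optional human-readable extras: label, definition, other annotations.
    annotations: Dict[str, str] = field(default_factory=dict)

    def missing_fields(self) -> List[str]:
        """Names of required fields that are empty, in declaration order."""
        required = [
            "term_uri",
            "source_ontology_uri",
            "source_superclass_uri",
            "target_parent_uri",
        ]
        return [name for name in required if not getattr(self, name)]


# Hypothetical example: every URI below is a placeholder, not a real identifier.
cell_ref = MireotReference(
    term_uri="http://example.org/source#Cell",
    source_ontology_uri="http://example.org/source",
    source_superclass_uri="http://example.org/source#AnatomicalStructure",
    target_parent_uri="http://example.org/target#MaterialEntity",
    annotations={"label": "cell"},
)
```

Note what the record leaves out: none of the source ontology's axioms about the term travel with it, which is why complete inference is lost.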
Modularization of Ontologies • A set of principles for • Decomposing a larger ontology into smaller meaningful components • Assimilating a set of component ontologies into a larger ontology • Modules must • Have semantic locality • Preserve loose coupling and autonomy • Enable partial reuse of knowledge • Preserve directionality of knowledge import • Ensure scalability
Uberon – an integrated multi-species anatomy ontology • Over 6,500 classes representing anatomical entities • Represents structures in a species-neutral way and includes extensive associations to existing species-centric anatomical ontologies • Allows integration of model organism and human data • Uses novel methods for representing taxonomic variation • Used for translational phenotype analyses • Figure: using taxonomic constraints
Finding Drugs for Rare/Orphan Diseases A biomedical research problem
Orphan Diseases • Diseases affecting fewer than 200,000 people in the U.S. • Approx. 7000 rare diseases affecting 25 million globally (1). • Orphan Drug Act – 1983 • Incentives for orphan drug development. • Around 355 with approved orphan drug therapies. • Recent interest of pharma giants in orphan drug R&D. • Source (1): Rados, FDA Consumer 2003 • Figure: child with Tay-Sachs disease. Image source: http://www.ntsad.org/index.php/the-diseases
Orphan disease information space: a need for systematic analysis
Approaches to drug repositioning • Drug-centric approach • Hypothesis: ‘similar drugs’ have the same therapeutic effects and are equally effective for a disease. • Disease-centric approach • Hypothesis: ‘similar diseases’ need the same therapies and can be treated with the same drugs. • Figure: drug-centric and disease-centric routes to repositioned drugs. Source (5): Adapted from Liu et al. Sept 2012
Finding Drugs as an Exploratory Problem • Figure labels: Genetic Variant (SNP); Gene product; Drug; Drug Receptor/Enzyme; Disease; Biomarker discovery; edges marked ↑ or ↓ expression, ↑ or ↓ production, ↑ or ↓ progression • If a Genetic Variant (GV) is associated with disease progression, then a drug/chemical that suppresses the GV or its gene product is a possible treatment option for the associated disease. • If a GV is associated with disease remission, then a drug/chemical that increases the activity of the GV is a possible treatment option for the associated disease. • If a disease-associated GV causes increased expression of certain receptors, then a drug that suppresses those receptors is a possible treatment option for the disease.
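The three if-then statements above read naturally as rules over a set of facts. The sketch below encodes them in Python under stated assumptions: the relation names (`promotes_progression_of`, `promotes_remission_of`, `has_gene_product`, `increases_expression_of`, `associated_with`, `suppresses`, `activates`) and every entity name in the usage example are made up for illustration, and a real system would more likely express such rules in a rule language such as SWRL.

```python
from typing import List, NamedTuple, Set, Tuple


class Fact(NamedTuple):
    subject: str
    relation: str
    obj: str


def pairs(facts: List[Fact], relation: str) -> List[Tuple[str, str]]:
    """All (subject, object) pairs for one relation."""
    return [(f.subject, f.obj) for f in facts if f.relation == relation]


def candidate_drugs(facts: List[Fact]) -> Set[Tuple[str, str]]:
    """Apply the three rules; return (drug, disease) treatment hypotheses."""
    candidates: Set[Tuple[str, str]] = set()
    suppresses = pairs(facts, "suppresses")
    activates = pairs(facts, "activates")

    # Rule 1: GV drives progression -> a drug suppressing the GV or its product.
    for gv, disease in pairs(facts, "promotes_progression_of"):
        targets = {gv} | {p for g, p in pairs(facts, "has_gene_product") if g == gv}
        candidates |= {(drug, disease) for drug, t in suppresses if t in targets}

    # Rule 2: GV is associated with remission -> a drug increasing its activity.
    for gv, disease in pairs(facts, "promotes_remission_of"):
        candidates |= {(drug, disease) for drug, t in activates if t == gv}

    # Rule 3: disease-associated GV raises receptor expression -> suppress the receptor.
    for gv, receptor in pairs(facts, "increases_expression_of"):
        diseases = {d for g, d in pairs(facts, "associated_with") if g == gv}
        for drug, t in suppresses:
            if t == receptor:
                candidates |= {(drug, d) for d in diseases}

    return candidates
```

A usage example with placeholder entities: given facts that `gv1` promotes progression of `diseaseX`, that `gv1` has gene product `proteinP`, and that `drugA` suppresses `proteinP`, the function proposes `("drugA", "diseaseX")`. The output is a set of hypotheses to explore, not evidence of efficacy.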
Resources for drug repositioning (figure; labels as extracted) • Resources: Drug Bank, OMIM, GEO, EMRs, cMAP, PubMed, CTD, Orphanet • Resource categories: gene expression based resources, text-based resources, network & computational modeling • Approaches: disease-centric, drug-centric • Counts shown: 834,730 samples; 1307 compounds; 20,000 genes & phenotypes; 6711 drugs; 4227 drug targets; 18,414,321 toxicogenomic relationships; 6500 rare diseases; 22 million citations • Source (5): Adapted from Liu et al. Sept 2012
HIBM – a rare disease • Autosomal recessive disease • Clinical/diagnostic features • Proximal and distal muscle weakness (starts with distal) • Onset during late teens • Mild elevation of serum CK • Progression of muscular weakness continues for 10-20 years • Spares the quadriceps • Detection of “inclusion bodies” in muscle biopsies • Rimmed vacuoles (clusters of autophagic vacuoles (AVs) and myeloid bodies) in muscle tissue • Accumulation of beta-amyloid, accumulation of NCAM1 in muscle (hyposialylation) • Intracellular deposition of Congo red-positive materials (such as β-amyloid and α-synuclein) • No loss of cognitive function
HIBM – a rare disease • Genetic characteristics • Caused by mutations in GNE at locus 9p13-p12 • homozygous or compound heterozygous • bi-functional enzyme, UDP-N-acetylglucosamine 2-epimerase/N-acetylmannosamine kinase • catalyzes two adjacent steps in the sialic acid biosynthetic pathway • feedback regulated • phosphorylated (PKC) and ubiquitinated • Associated with • abnormal phosphorylation of tau • activation of the ubiquitin proteasome system • activation of the lysosomal system
HIBM (is-a myopathy) (myopathy abnormality-of muscle-tissue) • Autosomal recessive (is-a genetic inheritance) disease • Clinical/diagnostic features • Proximal muscle weakness (OMIM), distal muscle weakness (OMIM) (starts with distal) • Onset during late teens • Mild elevation of (PATO) serum CK (elevated creatine phosphokinase is-a elevated-enzyme-activity) • Progression of (PATO) muscular weakness continues for 10-20 years • Spares (does not affect) the quadriceps (Uberon) • Detection of “inclusion bodies” in muscle biopsies • Rimmed vacuoles (clusters of autophagic vacuoles (AVs) and myeloid bodies) in muscle tissue • Accumulation of beta-amyloid, accumulation of NCAM1 in muscle (hyposialylation) (decreased occurrence of sialic acid in) • Intracellular deposition of Congo red-positive materials (such as β-amyloid and α-synuclein) • No loss of cognitive function (CogPO) • Why is it hard to create ontologies for cognitive functions?
HIBM – a rare disease • Genetic characteristics • Caused by mutations in GNE at locus 9p13-p12 • homozygous or compound heterozygous • bi-functional enzyme, UDP-N-acetylglucosamine 2-epimerase/N-acetylmannosamine kinase • catalyzes two adjacent steps in the sialic acid biosynthetic pathway (BioPAX) • feedback regulated • phosphorylated (PKC) and ubiquitinated (ubiquitination – GO) • Associated with • abnormal phosphorylation (GO) of tau protein (PRO) • activation of the ubiquitin proteasome system • activation of the lysosomal system
[Figure: map of the GNE protein along residues 100-700, showing the UDP-GlcNAc 2-epimerase domain and the ManNAc 6-kinase domain with annotated sites (ATP binding site, allosteric site, substrate binding site, nuclear export signal, enzymatic active site, Zn binding site, phosphorylation sites marked -p, ubiquitination sites marked -u) and reported point mutations, including V572L (“Japanese”, homozygous) and M712T, rs28937594 (“middle eastern”, homozygous). Black: mutations in UniProt. Grey: mutations in papers.]
[Figure: GNE mutation maps for human and rat along residues 100-700, with the same domain and site annotations, and a table of the observed effects (+ or −) of experimental mutations in rat (H49A, H110A, H132A, H155A, H157A, D413K, D413N, R420M), mouse (M712T, V572L) and CHO cells (G135E) on kinase activity, epimerase activity, oligomerization and the feedback inhibition process. Knockout alleles: tm1Rhk, tm1Sngi. Insert: human GNE D176V.]
Ontological Mapping of Findings to Sequences • Sequence Types and Features Ontology
Ontological Model of Pathways using BioPAX • Pathway: a set or series of interactions, often forming a network
Exploring for related information • What genes are related to inclusion body myopathies?
What are the relevant phenotypes? • The Human Phenotype Ontology (HPO) • Arranged as a directed acyclic graph (DAG) • A given phenotypic feature can be considered a more specific aspect of more than one parent term. • Terms located close to the root of the graph are less specific than terms farther away from it. • Specificity is quantified by the information content (IC) of a term: IC = −log p_i, where p_i is the frequency of phenotypic manifestation i among all diseases in the database. • For example, mental retardation, a common phenotypic manifestation of many hereditary diseases, is less clinically specific (has less information content) than a feature such as calcific stippling.
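A small worked example may help with the IC definition. The Python sketch below computes −log of the annotation frequency on a toy DAG; the four-term hierarchy and the four disease annotations are invented for illustration and are far smaller than the real HPO. A disease annotated with a term is counted for every ancestor of that term as well, which is why terms near the root get a low IC.

```python
import math
from typing import Dict, Set

# Toy is-a hierarchy (child -> parents); not the real HPO structure.
PARENTS: Dict[str, Set[str]] = {
    "phenotypic_abnormality": set(),
    "abnormality_of_the_eye": {"phenotypic_abnormality"},
    "hypertelorism": {"abnormality_of_the_eye"},
    "telecanthus": {"abnormality_of_the_eye"},
}


def ancestors_and_self(term: str) -> Set[str]:
    """The term plus everything reachable by following parent links."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def information_content(annotations: Dict[str, Set[str]]) -> Dict[str, float]:
    """IC(term) = -log(fraction of diseases annotated with the term or a descendant)."""
    counts = {term: 0 for term in PARENTS}
    for terms in annotations.values():
        implied: Set[str] = set()
        for term in terms:
            implied |= ancestors_and_self(term)
        for term in implied:
            counts[term] += 1
    total = len(annotations)
    return {t: -math.log(c / total) for t, c in counts.items() if c > 0}


# Hypothetical disease annotations.
ANNOTATIONS = {
    "disease_1": {"hypertelorism"},
    "disease_2": {"hypertelorism", "telecanthus"},
    "disease_3": {"abnormality_of_the_eye"},
    "disease_4": {"phenotypic_abnormality"},
}
ic = information_content(ANNOTATIONS)
```

With these toy annotations the root term applies to every disease and so has IC 0, while `telecanthus`, used for one disease in four, has IC log 4. The natural logarithm is an arbitrary choice here; the base only rescales the scores.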
Comparing phenotypes • Figure 3. Analysis of the phenotypic similarity of the Human Phenotype Ontology (HPO) terms downward slanting palpebral fissures and hypertelorism to annotations of (a) Greig cephalopolysyndactyly syndrome [GCPS (MIM 175700)] and (b) type II orofaciodigital syndrome [OFD2 (MIM 252100)]. The most specific common ancestor of hypertelorism and telecanthus is the term abnormality of the eye, and the similarity between hypertelorism and telecanthus is calculated as the information content of the term abnormality of the eye. Therefore, a search with the query terms downward slanting palpebral fissures and hypertelorism yields a higher score for GCPS than for OFD2.
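The term-to-term similarity described in the caption (IC of the most specific common ancestor) can be sketched in a few lines of Python. Everything numeric here is assumed: the IC values are invented, the hierarchy is a toy, and `disease_A` and `disease_B` are hypothetical annotation sets rather than the real GCPS and OFD2 annotations. Scoring a query as the average best match per query term is one common choice, assumed here rather than taken from the slide.

```python
from typing import Dict, List, Set

# Toy is-a hierarchy (child -> parents) and made-up IC values.
PARENTS: Dict[str, Set[str]] = {
    "phenotypic_abnormality": set(),
    "abnormality_of_the_eye": {"phenotypic_abnormality"},
    "hypertelorism": {"abnormality_of_the_eye"},
    "telecanthus": {"abnormality_of_the_eye"},
    "downward_slanting_palpebral_fissures": {"abnormality_of_the_eye"},
}
IC: Dict[str, float] = {
    "phenotypic_abnormality": 0.0,
    "abnormality_of_the_eye": 1.0,
    "hypertelorism": 2.5,
    "telecanthus": 3.0,
    "downward_slanting_palpebral_fissures": 3.5,
}


def ancestors_and_self(term: str) -> Set[str]:
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def term_similarity(a: str, b: str) -> float:
    """IC of the most informative ancestor shared by both terms."""
    common = ancestors_and_self(a) & ancestors_and_self(b)
    return max(IC[t] for t in common)


def query_score(query: List[str], disease_terms: Set[str]) -> float:
    """Average, over query terms, of the best match among the disease's terms."""
    best = [max(term_similarity(q, d) for d in disease_terms) for q in query]
    return sum(best) / len(best)


query = ["downward_slanting_palpebral_fissures", "hypertelorism"]
disease_A = {"hypertelorism", "downward_slanting_palpebral_fissures"}
disease_B = {"telecanthus"}
```

Here `term_similarity("hypertelorism", "telecanthus")` falls back to the IC of `abnormality_of_the_eye`, so a disease annotated only with the related term scores lower than one annotated with the exact query terms, which mirrors the ranking described in the caption.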
Phenotypic similarity using EQ • Recall phenotype descriptions written in the EQ (entity-quality) form
Phenotypic similarity • IC of a node: the negative log of the probability of that description being used to annotate a gene, allele, or genotype (collectively called a feature) • Phenotypic Profile: multiple EQ descriptions annotated to a genotype • Phenotypes annotated to genotypes are propagated to their allele(s), and in turn to the gene, indicated with upward arrows. • Similarity is analyzed between any two nodes of the same type: • gene A-vs-B, allele A3-vs-B1, genotypes A1/A1-vs-A3/A3, or A3/A3-vs-B1/B1. • The common subsuming phenotypes between A1/A1-vs-A3/A3 and gene A-vs-B are itemized in white boxes. Some individual phenotypic descriptions can have two common subsumers. • For each phenotypic description (EQ), the calculated IC is shown. • When comparing two items, four scores are determined: • maxIC, the maximum IC score for the common subsuming EQ, which may be a direct (in the case of A1/A1-vs-A3/A3) or inferred (in the case of gene A-vs-gene B) phenotype • avgICCS, the average of all common subsuming IC scores • simIC, the similarity score which computes the ratio of the sum of IC values for EQ descriptions (including subsuming descriptions) held in common (intersection) to that of the total set (union) • simJ, a non-IC-based similarity score calculated with the Jaccard index, which is the ratio of the count of nodes held in common (intersection) to the count of all nodes in either profile (union).
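The four scores can be written down compactly once each profile is closed under subsumption. The Python sketch below is a simplified reading of the definitions above, not the published implementation: the EQ labels, the hierarchy and the IC values are invented, and avgICCS is taken as the plain mean of IC over all shared descriptions. simJ is computed as the standard Jaccard index (intersection over union).

```python
from typing import Dict, Set

# Toy subsumption hierarchy over EQ descriptions (child -> parents); labels are made up.
PARENTS: Dict[str, Set[str]] = {
    "eye:morphology": set(),
    "eye:size": {"eye:morphology"},
    "eye:decreased_size": {"eye:size"},
    "eye:increased_size": {"eye:size"},
}
# Assumed IC values for illustration.
IC: Dict[str, float] = {
    "eye:morphology": 1.0,
    "eye:size": 2.0,
    "eye:decreased_size": 4.0,
    "eye:increased_size": 4.0,
}


def with_subsumers(profile: Set[str]) -> Set[str]:
    """Close a phenotypic profile under subsumption (add all ancestors)."""
    closed: Set[str] = set()
    stack = list(profile)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            stack.extend(PARENTS[term])
    return closed


def compare(profile_a: Set[str], profile_b: Set[str]) -> Dict[str, float]:
    """Return maxIC, avgICCS, simIC and simJ for two phenotypic profiles."""
    a, b = with_subsumers(profile_a), with_subsumers(profile_b)
    common, union = a & b, a | b
    if not common:
        return {"maxIC": 0.0, "avgICCS": 0.0, "simIC": 0.0, "simJ": 0.0}
    common_ic = [IC[t] for t in common]
    union_ic = sum(IC[t] for t in union)
    return {
        "maxIC": max(common_ic),
        "avgICCS": sum(common_ic) / len(common_ic),
        "simIC": sum(common_ic) / union_ic if union_ic else 0.0,
        "simJ": len(common) / len(union),
    }


scores = compare({"eye:decreased_size"}, {"eye:increased_size"})
```

In this toy case the two profiles share no direct annotation, only the inferred subsumers `eye:size` and `eye:morphology`, so maxIC comes from an inferred phenotype, as in the gene A-vs-gene B case on the slide.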
Phenoclustering • Phenotype and genotype information can be viewed as a network • Graph clustering techniques with suitable similarity metrics can be used to define node proximity • Phenoclustering: online mining of cross-species phenotypes. Groth et al., Bioinformatics 2010 26(15): 1924.
Investigating the hypothesis • Exploratory Search • A specialization of information exploration which represents the activities carried out by searchers who are • Unfamiliar with the domain of their goals • Unsure about the ways to achieve their goals • Possibly even unsure about their exact goals • Hypothesis investigation can be viewed as an exploratory search over a semantically connected graph • Find entities of type drug that relate to one or more of these genes, possibly through these pathways, and possibly through these phenotypes • Distinct from finding statistically correlated information and thresholding on p-values
Role of ontologies in exploratory graph search • Ontologies serve as indices to data • Semantic labels as indices • Relationships as join indices • Ontological neighborhoods as multi-join indices • Help to construct “semantic neighborhoods” between data nodes that are far apart • Ontologies as (implicit) query filters • Find connections in the data graph only when the corresponding ontology entities satisfy a connectivity pattern • Node/Node Type distances can denote node similarities • Can be a function of graph distances in the ontology • Can be extended to define relatedness measures between data neighborhoods
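As one possible reading of "similarity as a function of graph distance in the ontology", the Python sketch below measures the shortest path between two classes over undirected is-a edges and turns it into a score with 1 / (1 + distance). Both the toy hierarchy and that particular formula are assumptions for illustration; the lecture does not prescribe a specific function.

```python
from collections import deque
from typing import Dict, List, Optional, Set, Tuple

# Toy is-a edges (child, parent); illustrative only.
EDGES: List[Tuple[str, str]] = [
    ("HIBM", "myopathy"),
    ("muscular_dystrophy", "myopathy"),
    ("myopathy", "muscle_disease"),
    ("muscle_disease", "disease"),
    ("cardiomyopathy", "heart_disease"),
    ("heart_disease", "disease"),
]


def neighbors(edges: List[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Undirected adjacency map built from the edge list."""
    adjacency: Dict[str, Set[str]] = {}
    for child, parent in edges:
        adjacency.setdefault(child, set()).add(parent)
        adjacency.setdefault(parent, set()).add(child)
    return adjacency


def ontology_distance(a: str, b: str, edges: List[Tuple[str, str]]) -> Optional[int]:
    """Shortest number of edges between two classes, or None if unconnected."""
    adjacency = neighbors(edges)
    if a not in adjacency or b not in adjacency:
        return None
    queue = deque([(a, 0)])
    seen = {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in adjacency[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def type_similarity(a: str, b: str, edges: List[Tuple[str, str]]) -> float:
    """Map distance to a score in (0, 1]; unconnected classes score 0."""
    dist = ontology_distance(a, b, edges)
    return 0.0 if dist is None else 1.0 / (1.0 + dist)
```

Used as an implicit query filter, a search could keep a data-graph connection only when the types of its endpoints score above some threshold, for example preferring data annotated with `muscular_dystrophy` over data annotated with `cardiomyopathy` when starting from `HIBM`.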
Example • Exploratory Query • Find drug:* related-to gene:GNE, through some pathways, and optionally through some muscular dystrophy • A potential exploration path • GNE → missense mutations of GNE → reduced GNE-epimerase activities → GNE/MNK pathway → ManNAc kinase → clinical trials → drug DEX-M4 • Exercise: how can ontologies contribute to finding this path?
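One way to picture the exploratory query is as a constrained path search over a typed graph. The Python sketch below stores the exploration path from the slide as directed edges, uses a `type:name` prefix on each node as a stand-in for its ontology class, and looks for paths from `gene:GNE` to any `drug` node that pass through at least one `pathway` node. The type prefixes other than `gene` and `drug`, the distractor node `drug:no_pathway_link` and the function name `explore` are assumptions added for illustration.

```python
from typing import Dict, List, Set

# Data graph: the path from the slide plus one made-up distractor edge.
GRAPH: Dict[str, List[str]] = {
    "gene:GNE": ["variant:GNE_missense_mutations", "drug:no_pathway_link"],
    "variant:GNE_missense_mutations": ["finding:reduced_GNE_epimerase_activity"],
    "finding:reduced_GNE_epimerase_activity": ["pathway:GNE/MNK"],
    "pathway:GNE/MNK": ["enzyme:ManNAc_kinase"],
    "enzyme:ManNAc_kinase": ["trial:clinical_trials"],
    "trial:clinical_trials": ["drug:DEX-M4"],
}


def node_type(node: str) -> str:
    """The part before the first colon stands in for the node's ontology class."""
    return node.split(":", 1)[0]


def explore(start: str, target_type: str, required_types: Set[str],
            max_hops: int = 8) -> List[List[str]]:
    """Depth-first search for simple paths that end at a node of target_type
    and visit at least one node of every required type."""
    results: List[List[str]] = []
    stack: List[List[str]] = [[start]]
    while stack:
        path = stack.pop()
        node = path[-1]
        types_on_path = {node_type(n) for n in path}
        if node_type(node) == target_type and required_types <= types_on_path:
            results.append(path)
        if len(path) - 1 < max_hops:
            for nxt in GRAPH.get(node, []):
                if nxt not in path:  # keep paths simple (no revisits)
                    stack.append(path + [nxt])
    return results


paths = explore("gene:GNE", "drug", {"pathway"})
```

With the pathway requirement, only the route ending in `drug:DEX-M4` survives; dropping the requirement (`required_types=set()`) also returns the direct distractor edge. In a real setting the type test would be an ontology subsumption check rather than a string prefix, which is one answer to the exercise.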
Conclusions • Upper ontologies are needed to organize concepts and relationships for a domain and application • Principled methods of modularizing component ontologies help avoid large monolithic ontologies and potential inconsistencies • Ontologies are not only used for conceptualizing a domain but also for tasks like data integration, enrichment analysis and (exploratory) search