Ontology-Based Knowledge Discovery and Sharing in Biological and Medical Research
Jingshan Huang, Assistant Professor
School of Computer and Information Sciences, University of South Alabama
http://cis.usouthal.edu/~huang/
Dept. of Chemical Pathology @ CUHK, Hong Kong
August 17, 2010
Presentation Outline
• Research Motivation
• Ontologies and Ontological Techniques
• Applying Ontological Techniques to Biological and Medical Research
• Ongoing Research – OMIT Project
Research Motivation – Overview
• Information from heterogeneous sources carries different semantics. For example, the string "Long" is an ordinary English word, but read as Chinese Pinyin it denotes 龙 (龍), "dragon"
• Knowledge discovery and sharing in biological/medical research is both important and challenging
• Integrating information from heterogeneous sources must make use of all available clues, including syntax, semantics, context, and pragmatics
• Ontologies are a formal model for encoding semantics
• Ontological techniques are critical in knowledge acquisition
Research Motivation – More Details
Why do we need this?
• In medical informatics, the abundance of digital data promises a profound impact on knowledge discovery and innovation
• Health scientists worldwide produce, access, analyze, integrate, and store massive amounts of digital medical data daily
• Such data is obtained through observation, experimentation, and simulation
• If we could effectively transfer and integrate data from all possible resources, it would be possible to obtain:
  • a deeper understanding of all these data sets,
  • better exposed knowledge, and
  • appropriate insights and actions
• Unfortunately, in many cases, the data users are not the data producers
• They thus face challenges in harnessing data in unforeseen and unplanned ways
Research Motivation – An Example Scenario
• Identifying and characterizing the important roles that microRNAs (miRNAs) play in human cancer is an increasingly active research area
• In particular, it is very challenging to effectively identify miRNAs' target genes
• Cancer patients' prognosis depends largely on their chemosensitivity (sensitivity to chemotherapy)
• Research has discovered that some specific genes increase the permeability of the mitochondrial membrane (mitochondria are a cellular component), which in turn leads to apoptosis (cell death)
• As a result, the patient's chemosensitivity increases and chemotherapy becomes more effective
• Certain miRNAs can regulate the aforementioned genes and thus affect cancer patients' prognosis
• If biologists were able to identify such miRNAs, a breakthrough in cancer treatment could be made
Unfortunately, such identification is very difficult…
Research Motivation – An Example Scenario (cont.)
• Biologists need to extract a large number of candidate target genes from existing miRNA databases
• For every one of hundreds of candidate target genes, they also have to search manually, in resources other than miRNA databases, for related information such as:
  • cellular component
  • biological process
  • and so on…
• In short, the whole process is time-consuming, error-prone, and limited by biologists' prior knowledge
• The situation is further aggravated by the great complexity and imprecise terminology that characterize typical biological and biomedical research fields
• There is a great deal of variety in how different biological terms are used, and in the relationships defined among these terms
• Such variety has inhibited effective information acquisition by humans
Research Motivation – Summary
• The biological and medical research area is facing a challenging problem: knowledge discovery and sharing among distributed parties
• New methodologies are greatly needed to integrate heterogeneous data and thereby transform traditional medical and biological research
• As a formal knowledge representation model, ontologies play a key role in defining formal semantics in traditional knowledge engineering
• Conclusion: it is necessary to apply ontological techniques to biological and medical research
Definition of Ontologies
• The simplest definition: an ontology is a computational model (a.k.a. knowledge representation model) of some domain of the world
• It describes the semantics of the terms (a.k.a. concepts) used in the domain
• It is often captured in the form of a DAG (directed acyclic graph)
• What is a DAG? Nodes represent ontology concepts, while arcs represent their relationships
• It may be augmented by rules, constraints, or functions
• In brief, ontologies aim to make explicit the knowledge contained within software applications for a particular domain: an ontology = a finite set of concepts + properties + relationships
• Such graphical structures are also known as ontology schemas
• Actual data sets contained in these schemas are referred to as instances
• Most real-world ontologies have very few or no instances at all
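The definition above (concepts as nodes, relationships as arcs, no cycles) can be sketched in a few lines of Python. This is only an illustrative toy, not how any real ontology tool stores its data; the class name `Ontology` and its methods are hypothetical, and the single arc comes from the "mitochondrion is an organelle" example used later in the talk.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    """Toy ontology schema: concepts are nodes, typed relationships are arcs."""
    concepts: set = field(default_factory=set)
    # Each arc is a (source concept, relationship name, target concept) triple.
    arcs: set = field(default_factory=set)

    def add_relationship(self, source, relation, target):
        self.concepts.update({source, target})
        self.arcs.add((source, relation, target))

    def targets(self, concept):
        """Concepts reachable from `concept` by following one arc."""
        return {t for (s, _, t) in self.arcs if s == concept}

    def is_acyclic(self):
        """Depth-first search: a DAG has no path from a node back to itself."""
        visiting, done = set(), set()

        def visit(node):
            if node in done:
                return True
            if node in visiting:      # back edge, so there is a cycle
                return False
            visiting.add(node)
            ok = all(visit(t) for t in self.targets(node))
            visiting.discard(node)
            if ok:
                done.add(node)
            return ok

        return all(visit(c) for c in self.concepts)


go = Ontology()
go.add_relationship("mitochondrion", "is_a", "organelle")
print(go.is_acyclic())
```

Adding the reverse arc ("organelle is_a mitochondrion") would make `is_acyclic()` return `False`, which is exactly the kind of statement a DAG-shaped ontology rules out.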
Ontology Engineering
• The creation and maintenance of ontologies in the domain of interest
• In other words, it focuses on the methodologies used to build ontologies
• To create an ontology, three different approaches can be applied:
  • Top-down approach (knowledge driven)
  • Bottom-up approach (data/inference driven)
  • Combination of top-down and bottom-up
• Languages to represent ontologies in computer systems:
  • OWL (Web Ontology Language) – the most popular one
  • Open Biological and Biomedical Ontologies (OBO)
  • Knowledge Interchange Format (KIF)
  • Open Knowledge Base Connectivity (OKBC)
• GUI tools for ontology engineering:
  • Protégé (by Stanford) – the most popular one
  • CmapTools (by IHMC)
  • OntoEdit (by Ontoprise)
Ontology Heterogeneity
• Heterogeneity is an important, inherent characteristic of ontologies developed by different parties for the same (or similar) domains
• This is because ontologies reflect their designers' different conceptual models of a domain
• Heterogeneous semantics may occur in different ways:
  • different terms could be used for the same concept;
  • an identical term could be adopted for different concepts;
  • properties and relationships could be different
As a result, Ontology Matching has become an increasingly active topic
Ontology Matching
• "Ontology Matching" is short for "Ontology Schema Matching"
• Also known as "Ontology Alignment" or "Ontology Mapping"
• It refers to the process of determining correspondences between concepts from heterogeneous ontologies
• It aims to handle the aforementioned challenge of ontology heterogeneity
• Many different relationships are involved:
  • equivalentWith
  • subClassOf
  • superClassOf
  • siblings
  • and so on…
Current Ontology-Matching Algorithms
• Rule-Based Matching
  • Considers schema information alone
  • Specifies a set of rules
  • Applies them to schema information
• Learning-Based Matching
  • Considers both schema and instances
  • Applies different machine learning techniques
• Brief Introduction to Machine Learning
  • A scientific discipline concerned with the design and development of algorithms that allow computers to change behavior based on "training data"
  • The major focus is to recognize complex patterns and make intelligent decisions
Pros and Cons of Current Approaches
• Rule-Based Matching
  • Is relatively fast (pro)
  • Ignores instance information (con)
  • Uses ad hoc predefined weights (con) to combine concept semantics: name + properties + relationships
• Learning-Based Matching
  • Obtains extra clues from instances (pro)
  • Runs longer (con)
  • Has difficulty in getting sufficient instances (con), since most real-world ontologies do not have instances
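To make the "ad hoc predefined weights" point concrete, here is a minimal sketch of a rule-based matcher that scores two concepts from their name, properties, and relationships. It is not any published matching algorithm: the weights, the dictionary layout, and the property and relationship names are all assumptions chosen for illustration. Only the two concept names come from the talk's "translation" versus "protein synthesis" example.

```python
from difflib import SequenceMatcher

# Ad hoc, hand-picked weights (an assumption; this is the weakness noted above).
WEIGHTS = {"name": 0.5, "properties": 0.3, "relationships": 0.2}


def jaccard(a, b):
    """Overlap of two collections, as |intersection| / |union|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def concept_similarity(c1, c2, weights=WEIGHTS):
    """Weighted sum of name, property, and relationship similarity (schema only)."""
    name_sim = SequenceMatcher(None, c1["name"].lower(), c2["name"].lower()).ratio()
    prop_sim = jaccard(c1["properties"], c2["properties"])
    rel_sim = jaccard(c1["relationships"], c2["relationships"])
    return (weights["name"] * name_sim
            + weights["properties"] * prop_sim
            + weights["relationships"] * rel_sim)


# Two concepts from different (hypothetical) databases describing the same process.
translation = {
    "name": "translation",
    "properties": {"hasGeneProduct", "occursIn"},          # hypothetical names
    "relationships": {("is_a", "biosynthetic process")},   # hypothetical arc
}
protein_synthesis = {
    "name": "protein synthesis",
    "properties": {"hasGeneProduct", "occursIn"},
    "relationships": {("is_a", "biosynthetic process")},
}

score = concept_similarity(translation, protein_synthesis)
print(round(score, 2))
```

Because the two names share few characters, the name term contributes little; the score is carried by the shared properties and relationships. Changing the weights changes the verdict, which is why predefined weights are listed as a drawback.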
Ontological Techniques in Bio Research
• Ontological techniques have been widely applied to medical and biological research
• The most successful example is the Gene Ontology (GO) project
• The Unified Medical Language System (UMLS) and the National Center for Biomedical Ontology (NCBO) are two other successful examples
• In addition, there have been efforts toward ontology-based data integration in bioinformatics and medical informatics
Why the Gene Ontology (GO) Project?
• Biologists spend a lot of time and effort searching for all of the available information about each small area of research
• The search is further hampered by wide variations in the terminology in common usage at any given time
• A simple example: if you were searching for new targets for antibiotics, you might want to find all the gene products that are involved in bacterial protein synthesis
• Suppose that one database describes these molecules as being involved in "translation", whereas another uses the phrase "protein synthesis"
• It is then difficult for a human to find functionally equivalent terms, and even harder for computer software
• To address the need for consistent descriptions of gene products in different databases, the GO began in 1998 as a collaboration between three model organism databases: FlyBase (flies), the Saccharomyces Genome Database, and the Mouse Genome Database
• The GO Consortium has since grown to include many databases, including several of the world's major repositories for plant, animal, and microbial genomes
Three Sub-Ontologies in the GO
• Cellular Component, Biological Process, and Molecular Function
• A gene product might:
  • be associated with or located in one or more cellular components;
  • be active in one or more biological processes;
  • during which it performs one or more molecular functions
Example: the gene product cytochrome c can be described by:
• the molecular function term "oxidoreductase activity"
• the biological process terms "oxidative phosphorylation" and "induction of cell death"
• the cellular component terms "mitochondrial matrix" and "mitochondrial inner membrane"
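The cytochrome c example can be written down as a small record with one field per sub-ontology. This is only a sketch of the idea: the class `GeneProductAnnotation` and its method are hypothetical and do not mirror any actual GO annotation file format.

```python
from dataclasses import dataclass, field


@dataclass
class GeneProductAnnotation:
    """One gene product, annotated with terms from the three GO sub-ontologies."""
    name: str
    molecular_function: list = field(default_factory=list)
    biological_process: list = field(default_factory=list)
    cellular_component: list = field(default_factory=list)

    def all_terms(self):
        """Every GO term attached to this gene product, across sub-ontologies."""
        return (self.molecular_function
                + self.biological_process
                + self.cellular_component)


cytochrome_c = GeneProductAnnotation(
    name="cytochrome c",
    molecular_function=["oxidoreductase activity"],
    biological_process=["oxidative phosphorylation", "induction of cell death"],
    cellular_component=["mitochondrial matrix", "mitochondrial inner membrane"],
)

for term in cytochrome_c.all_terms():
    print(term)
```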
GO Structure
• The GO ontology is essentially a hierarchy-like DAG
• In other words, each node is a GO term, and each arc represents a relationship between two GO terms
• Directed feature: for example, a mitochondrion is an organelle, but not vice versa
• Acyclic feature (cycles are not allowed): for example, it is inappropriate to specify that "A1 is an A2", "A2 is an A3", …, "Ai is an A1"
• Hierarchy-like feature (generalized-specialized relationship plus possibly multiple parents): for example, the biological process term hexose biosynthetic process has two parents, hexose metabolic process and monosaccharide biosynthetic process (biosynthetic process is a type of metabolic process, and a hexose is a type of monosaccharide)
Three Relationships in the GO
• The GO ontology defines three different relationships among terms:
  • is a, a.k.a. is a subtype of;
  • part of; and
  • regulates
• Each relationship is drawn with its own arrow symbol in GO diagrams
• Note that regulates includes two sub-relationships, negatively regulates and positively regulates, each with its own symbol as well
is a Relationship in the GO
• If A is a B, it means that A is a subtype of B
• For example, mitotic cell cycle is a cell cycle
• Another example: lyase activity is a catalytic activity
• The is a relationship differs from "is an instance of" (which means a specific example of something). For example:
  • A cat is a mammal
  • George is an instance of a cat; in the GO sense, the claim "George is a cat" is therefore incorrect, because is a relates classes, not individuals
  • However, it is safe to claim that every instance of a cat is also an instance of a mammal
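A loose analogy from programming may help here: in Python, subclassing plays the role of is a, while creating an object plays the role of "is an instance of". The class names below are made up for the cat/mammal example and are only an analogy, not a model of how GO is implemented.

```python
class Mammal:
    """A class of things (analogous to a GO term)."""


class Cat(Mammal):
    """Cat is_a Mammal: a relationship between two classes."""


george = Cat()  # George is an instance of Cat, not a subtype of it

cat_is_a_mammal = issubclass(Cat, Mammal)            # class-to-class: is_a
george_instance_of_cat = isinstance(george, Cat)     # individual-to-class
george_instance_of_mammal = isinstance(george, Mammal)

print(cat_is_a_mammal, george_instance_of_cat, george_instance_of_mammal)
```

The last check reflects the final bullet above: every instance of a cat is also an instance of a mammal.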
Reasoning over is a Relationship
• The is a relationship is transitive: if A is a B and B is a C, then A is a C
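Transitivity means that all ancestors of a term can be collected by repeatedly following is a arcs. The sketch below does this for the hexose biosynthetic process example from the GO Structure slide. The two parents of that term come from the talk; the remaining arcs are assumptions added so that there is a chain to follow, and the function name is hypothetical.

```python
def is_a_ancestors(term, is_a):
    """Return every term reachable from `term` by following is_a arcs."""
    ancestors = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in is_a.get(current, ()):
            if parent not in ancestors:
                ancestors.add(parent)
                stack.append(parent)
    return ancestors


# Child term -> list of direct is_a parents.
IS_A = {
    # From the talk: this term has two parents.
    "hexose biosynthetic process": [
        "hexose metabolic process",
        "monosaccharide biosynthetic process",
    ],
    # Assumed arcs, for illustration only.
    "monosaccharide biosynthetic process": ["biosynthetic process"],
    "biosynthetic process": ["metabolic process"],
    "hexose metabolic process": ["metabolic process"],
}

for ancestor in sorted(is_a_ancestors("hexose biosynthetic process", IS_A)):
    print(ancestor)
```

Under these assumed arcs, "metabolic process" is reported as an ancestor even though no arc links it directly to hexose biosynthetic process.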
part of Relationship in the GO
• B is part of A, meaning that the presence of B implies the presence of A
• But not vice versa: given the presence of A, we cannot conclude the presence of B
• In other words:
  • all B are part of A
  • but only some A have part B
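Reasoning over part of Relationship (1)
• The part of relationship is also transitive: if A is part of B and B is part of C, then A is part of C
Reasoning over part of Relationship (2)
• part of followed by is a: if A is part of B and B is a C, then A is part of C
Reasoning over part of Relationship (3)
• part of following is a: if A is a B and B is part of C, then A is part of C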
Reasoning over part of Relationship (4)
• The aforementioned logical rules for the part of and is a relationships hold no matter how many intervening is a and part of relationships there are
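The four reasoning slides amount to a small composition table: any chain of is a and part of arcs collapses to a single relationship. A minimal sketch follows, assuming relationship names are written as plain strings; the table and function names are hypothetical.

```python
# (first relation, second relation) -> relation implied between the end points.
COMPOSE = {
    ("is_a", "is_a"): "is_a",          # is_a is transitive
    ("part_of", "part_of"): "part_of", # part_of is transitive
    ("part_of", "is_a"): "part_of",    # part_of followed by is_a
    ("is_a", "part_of"): "part_of",    # part_of following is_a
}


def infer_path(relations):
    """Fold a chain of relations into the single relation it implies.

    Returns None for an empty chain or when no rule covers a step.
    """
    if not relations:
        return None
    result = relations[0]
    for nxt in relations[1:]:
        result = COMPOSE.get((result, nxt))
        if result is None:
            return None
    return result


print(infer_path(["is_a", "is_a", "is_a"]))
print(infer_path(["part_of", "is_a", "is_a", "part_of"]))
```

A chain made only of is a arcs stays is a; as soon as one part of arc appears anywhere in the chain, the inferred relationship is part of, regardless of how many arcs intervene.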
regulates Relationship in the GO
• B regulates A, meaning that whenever B is present, it regulates A
• But not vice versa: A is not necessarily regulated by B
• In other words:
  • all B regulate A
  • but only some A are regulated by B
Reasoning over regulates Relationship (1)
• Both negatively regulates and positively regulates imply regulates
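This rule can be sketched as a sub-relationship lookup: every asserted negatively regulates or positively regulates arc also yields a plain regulates arc. The triple layout and helper names below are assumptions for illustration; the two GO term names are the ones mentioned later in the talk.

```python
# Sub-relationship -> the more general relationship it implies.
SUB_RELATION_OF = {
    "negatively_regulates": "regulates",
    "positively_regulates": "regulates",
}


def with_implied(triples):
    """Return the asserted triples plus those implied by sub-relationships."""
    implied = set(triples)
    for subject, relation, obj in triples:
        parent = SUB_RELATION_OF.get(relation)
        if parent is not None:
            implied.add((subject, parent, obj))
    return implied


asserted = {
    ("negative regulation of gene expression",
     "negatively_regulates",
     "gene expression"),
}

for triple in sorted(with_implied(asserted)):
    print(triple)
```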
Ongoing Research: OMIT Project
http://omit.cis.usouthal.edu/
Besides the Sun Lab at CUHK, there are five other collaborating labs from around the world
Project Overview
• An innovative computing framework, based on the Ontology for MicroRNA Target Prediction (OMIT), to handle the aforementioned challenge of predicting miRNAs' target genes
• The OMIT is a domain-specific ontology that facilitates knowledge discovery and sharing from existing sources
• The long-term research objective of the OMIT framework is to assist biologists in unraveling the important roles of miRNAs in human cancer, and thus to help clinicians make sound decisions when treating cancer patients
• We aim to synthesize data from existing source miRNA databases into a comprehensive conceptual model that permits an emphasis on data semantics
• Consequently, a more accurate, complete view of miRNAs' biological functions can be acquired
• We thus provide users with a single query engine that accepts their needs as a nonprocedural specification
Five Tasks in the OMIT Project
• To develop a miRNA-domain-specific ontology that contains a set of OMIT concepts, along with the relationships among these concepts
• To align the OMIT with the GO so that gene-related information can be automatically acquired and integrated
• To annotate source miRNA databases with OMIT concepts so that existing databases are enriched with formal semantics
• To integrate OMIT-annotated miRNA databases into a centralized RDF data warehouse
• To perform complicated search/query in a unified style so that deep knowledge can be obtained from a wealth of miRNA data
An Example Research Scenario
• Suppose a cancer biologist is interested in investigating the chemosensitivity of breast cancer cells
• A comparison of chemosensitive and chemoresistant cancer cells demonstrates that miR-125b, a specific miRNA, may confer increased chemosensitivity on cancer cells
• After the OMIT system obtains candidate targets for miR-125b, it further acquires the gene information of these targets, including cellular localization (e.g., in mitochondria) and biological process (e.g., apoptosis)
• The availability of such integrated knowledge will make it much easier for the cancer biologist to deduce the actual targets of miR-125b
• As a result, a breakthrough in breast cancer treatment may be achieved
A Typical Knowledge Acquisition Cycle
• Steps 1-3: the user initiates a search/query; the recognized miRNA concept is used to query the RDF data warehouse
• Steps 4-5: miRNA targets are retrieved and used to acquire more gene information
• Steps 6-8: miRNA targets and their related gene information are returned to the user
Corresponding RDF-based query:
    SELECT DISTINCT OMIT:targetGene
    FROM OMIT:miRNA, GO-CC:cellComponent, GO-BP:bioProcess
    WHERE OMIT:miRNA ID = “miR-125b”
      AND OMIT:miRNA targetID = GO-CC:cellComponent geneID
      AND OMIT:miRNA targetID = GO-BP:bioProcess geneID
      AND GO-CC:cellComponent localization = “mitochondria”
      AND GO-CC:cellComponent permeabilityIncrease = “yes”
      AND GO-BP:bioProcess apoptosisIncrease = “yes”
    USING NAMESPACE
      OMIT = <http://omit.cis.usouthal.edu/ontology/OMIT.owl>,
      GO-CC = <http://www.geneontology.org/formats/oboInOwl#>,
      GO-BP = <http://www.geneontology.org/formats/oboInOwl#>.
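The join that the query above expresses can be sketched in plain Python over in-memory records. This is not the OMIT query engine and does not use a real RDF store: the function name, the record layout, and the gene identifiers (GENE_A, GENE_B, GENE_C) are made-up placeholders. Only the field names and filter values mirror the query.

```python
def candidate_targets(mirna_id, mirna_targets, cell_component, bio_process):
    """Targets of `mirna_id` that sit in mitochondria, increase membrane
    permeability, and increase apoptosis (the three filters in the query)."""
    targets = set()
    for row in mirna_targets:
        if row["miRNA"] != mirna_id:
            continue
        gene = row["targetID"]
        cc = cell_component.get(gene)   # join on geneID (cellular component)
        bp = bio_process.get(gene)      # join on geneID (biological process)
        if cc is None or bp is None:
            continue
        if (cc["localization"] == "mitochondria"
                and cc["permeabilityIncrease"] == "yes"
                and bp["apoptosisIncrease"] == "yes"):
            targets.add(gene)
    return targets


# Placeholder records, invented purely to exercise the function.
mirna_targets = [
    {"miRNA": "miR-125b", "targetID": "GENE_A"},
    {"miRNA": "miR-125b", "targetID": "GENE_B"},
    {"miRNA": "miR-OTHER", "targetID": "GENE_C"},
]
cell_component = {
    "GENE_A": {"localization": "mitochondria", "permeabilityIncrease": "yes"},
    "GENE_B": {"localization": "nucleus", "permeabilityIncrease": "no"},
    "GENE_C": {"localization": "mitochondria", "permeabilityIncrease": "yes"},
}
bio_process = {
    "GENE_A": {"apoptosisIncrease": "yes"},
    "GENE_B": {"apoptosisIncrease": "yes"},
    "GENE_C": {"apoptosisIncrease": "yes"},
}

print(candidate_targets("miR-125b", mirna_targets, cell_component, bio_process))
```

With these placeholder records, only GENE_A passes all three filters: GENE_B is not mitochondrial, and GENE_C is not a target of miR-125b.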
Linkage between the OMIT and the GO
• Some OMIT concepts are directly inherited and extended from GO concepts
  • For example, the OMIT concept GeneExpression is designed to describe miRNAs' regulation of gene expression. This concept is inherited from the concept gene expression in the BiologicalProcess ontology. This way, subclasses of gene expression, such as negative regulation of gene expression, become accessible in the OMIT for describing the negative gene regulation of the miRNAs in question
• Some OMIT concepts are equivalent to (or similar to) GO concepts
  • For example, the OMIT concept PathologicalEvent and its subclasses are designed to describe biological processes that are disturbed when a cell becomes cancerous. Although not directly inherited from any specific GO concepts, these OMIT concepts do match up with certain concepts in the BiologicalProcess ontology
  • The OMIT concepts TargetGene and Protein are two other examples, which correspond to individual genes and individual gene products, respectively, in the GO
OMIT Summary
• It is an innovative computing framework based on a miRNA-domain-specific ontology
• It aims to handle the challenge of predicting miRNAs' target genes
• The OMIT is the very first ontology in the miRNA domain
• It will assist biologists in unraveling the important roles of miRNAs in human cancer, and thus help clinicians make sound decisions when treating cancer patients
• This long-term research goal will be achieved by facilitating knowledge discovery and sharing from existing sources
• The first version of the OMIT ontology has been added to the NCBO BioPortal (http://bioportal.bioontology.org/ontologies/42873)
• Updates are available at the project website: http://omit.cis.usouthal.edu/