Bio-Ontologies: Their Creation and Design • Dr. Peter Karp, SRI, http://www.ai.sri.com/~pkarp/ • Dr. Robert Stevens & Professor Carole Goble, University of Manchester, UK, http://img.cs.man.ac.uk/tambis
Advertisement • The Fourth Annual Bio-Ontologies Meeting: “Sharing Experiences and Spreading Best Practice” • Sponsored by GlaxoSmithKline Pharmaceuticals • Tivoli Gardens, Copenhagen, Denmark, 26th July 2001 • Organised by: Richard Chen, Carole Goble, Robert Stevens, Peter Karp, Pat Hayes, Robin McEntire and Eric Neumann • http://img.cs.man.ac.uk/stevens/workshop01
Outline • What is an ontology? • Motivation for ontologies in bioinformatics • Definition of an ontology • Naming the parts & comparing the types • Knowledge representation • Building an ontology • Methodologies, principles and pitfalls • Running example: a macromolecule fragment • Ontology Tools • Development tools
Outline • Motivations for ontologies in bioinformatics • Definition of ontology • Principles and pitfalls of ontology design • GKB Editor ontology development tool
Definition of an Ontology • Conceptualization of a domain of interest • Concepts, relations, attributes, constraints, objects, values • An ontology is a specification of a conceptualization • Formal notation • Documentation • A variety of forms, but includes: • A vocabulary of terms • Some specification of the meaning of the terms • Ontologies are defined for reuse
Roles of Ontologies in Bioinformatics • Success of many biological DBs depends on • High fidelity ontologies • Clearly communicating their ontologies • Prevent errors on data entry and interpretation • Common framework for multidatabase queries • Controlled vocabularies for genome annotation • Riley ontology, GO • EC numbers
Roles of Ontologies in Bioinformatics • Information-extraction applications • Reuse is a core aspect of ontologies • Reuse of existing ontologies faster than designing new ones • Reuse decreases semantic heterogeneity of DBs • Schema-driven Software • Knowledge-acquisition tools • Query tools
Definitions • Data Model: • Primitive data structuring mechanism in which an ontology is expressed • Relational data model, object-oriented data model, frame data model • Ontology: • Domain specific conceptualization expressed within some data model
Components of an Ontology • Concepts • AKA: Class, Set, Type, Predicate • Gene, Reaction, Macromolecule • Taxonomy of concepts • Generalization ordering among concepts • Concept A is a parent of concept B iff every instance of B is also an instance of A • Superset / subset • “A kind of” vs “a part of”
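The parent/child definition above can be read extensionally. What follows is a minimal Python sketch, assuming each concept is modelled as a plain set of instance names. The instance names are illustrative only and do not come from any database.

```python
# Concepts modelled extensionally: each concept is the set of its instances.
# Instance names are illustrative assumptions, not real database entries.
concepts = {
    "Macromolecule": {"trpA-protein", "tRNA-Ala", "plasmid-X"},
    "Protein": {"trpA-protein"},
    "Nucleic-acid": {"tRNA-Ala", "plasmid-X"},
}

def is_parent(a: str, b: str) -> bool:
    """A is a parent of B iff every instance of B is also an instance of A."""
    return concepts[b] <= concepts[a]

print(is_parent("Macromolecule", "Protein"))
print(is_parent("Protein", "Nucleic-acid"))
```

This check only looks at the instances that happen to be recorded. In an ontology the generalization ordering is meant to hold for every possible instance, which is why it is stated in the taxonomy rather than computed from data. A "part of" link gives no such subset relationship.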
Components of an Ontology • Objects • AKA: Instances, members of the set • trpA Gene, Reaction 1.1.2.4 • Strictly speaking, an ontology with instances is a knowledge base • Relations and Attributes • AKA: Slots, properties • Product of Gene, Map-Position of Gene • Reactants of Reaction, Keq of Reaction • Values • The Product of the trpA Gene is tryptophan-synthetase • trpA.Product = tryptophan-synthetase
Components of an Ontology • Constraints and other meta information about relations • Slot Product: • Value type: Polypeptide or RNA • Domain: Genes • Slot Map-Position: • Value type: Number • Domain: Genes • Cardinality: At-Most 1 • Range: 0 <= X <= 100 • General Axioms • Nucleic acids < 20 residues are oligonucleotides
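The slot constraints above (domain, value type, cardinality, range) can be sketched as a small checker. This is an illustration in Python under stated assumptions: the `Slot` class and `check` function are hypothetical names, and Python types stand in for value-type concepts. It is not the mechanism of any particular frame system.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Slot:
    name: str
    domain: str                                   # concept the slot applies to
    value_type: tuple                             # Python types standing in for the value type
    at_most: Optional[int] = None                 # cardinality constraint
    value_range: Optional[Tuple[float, float]] = None

def check(slot: Slot, frame_class: str, values: list) -> list:
    """Return a list of constraint violations for the given slot values."""
    errors = []
    if frame_class != slot.domain:
        errors.append(f"{slot.name}: domain is {slot.domain}, not {frame_class}")
    if slot.at_most is not None and len(values) > slot.at_most:
        errors.append(f"{slot.name}: at most {slot.at_most} value(s) allowed")
    for v in values:
        if not isinstance(v, slot.value_type):
            errors.append(f"{slot.name}: {v!r} has the wrong value type")
        elif slot.value_range is not None:
            low, high = slot.value_range
            if not low <= v <= high:
                errors.append(f"{slot.name}: {v!r} outside {low}..{high}")
    return errors

# Slot Map-Position as described on the slide.
map_position = Slot("Map-Position", "Gene", (int, float),
                    at_most=1, value_range=(0, 100))

print(check(map_position, "Gene", [28.3]))
print(check(map_position, "Gene", [28.3, 140]))
```

Keeping the constraints as data attached to the slot, rather than as code scattered through an application, is what lets data-entry tools reject bad values before they reach the database.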
More on Concepts • Primitive: properties are necessary • Globular protein must have hydrophobic core, but a protein with a hydrophobic core need not be a globular protein • Defined: properties are necessary + sufficient • Eukaryotic cells must have a nucleus. Every cell that contains a nucleus must be Eukaryotic.
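The difference between primitive and defined concepts shows up in what a program is allowed to infer. A minimal Python sketch follows, assuming individuals are described by sets of property names. The property names and function names are hypothetical.

```python
def is_eukaryotic_cell(props: set) -> bool:
    # Defined concept: the conditions are necessary AND sufficient,
    # so membership can be inferred from the properties alone.
    return {"cell", "has-nucleus"} <= props

def violates_globular_protein(props: set, asserted: bool) -> bool:
    # Primitive concept: the condition is only necessary. It can contradict
    # an asserted membership, but it can never establish membership.
    return asserted and "has-hydrophobic-core" not in props

yeast_cell = {"cell", "has-nucleus"}
membrane_protein = {"protein", "has-hydrophobic-core"}

print(is_eukaryotic_cell(yeast_cell))
print(violates_globular_protein(membrane_protein, asserted=False))
print(violates_globular_protein({"protein"}, asserted=True))
```

Note that `membrane_protein` has a hydrophobic core, yet nothing here concludes that it is a globular protein. For a primitive concept, membership has to be asserted by a person.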
Ontology Subtypes (in order of increasing expressiveness) • Controlled vocabulary • List of terms • Taxonomy • Terms in a generalization hierarchy • DB schemas (relational and object-oriented) • More implementation specific • No instance information • Limited constraints • Frame knowledge bases • Description Logics
Ontology Subtypes • Database schema • Concepts, relations, constraints • Perhaps no taxonomy • At most hundreds of concepts • Taxonomy • Concepts, taxonomy, perhaps a few relations • Thousands of concepts • Knowledge base • Concepts, relations, constraints, objects, values • Hundreds to hundreds of thousands of concepts and objects
Ontology Subtypes • Generic (a.k.a. upper, core or reference) • common high level concepts • “Physical”, “Abstract”, “Structure”, “Substance” • useful for ontology re-use • important when generating or analysing natural language expressions • Domain-oriented • domain specific (e.g. E.coli) • domain generalisations (e.g. gene function) • Task-oriented • task specific (e.g. annotation analysis) • task generalisations (e.g. problem solving)
Knowledge Representation • Ontologies are best delivered in some computable representation • Variety of choices with different: • Expressiveness • The range of constructs that can be used to formally, flexibly, explicitly and accurately describe the ontology • Ease of use • Computational complexity • Is the language computable in real time? • Rigour • Satisfiability and consistency of the representation • Systematic enforcement mechanisms • Unambiguous, clear and well defined semantics • “A subclassOf B”: don’t be fooled by syntax!
Languages • Vocabularies using natural language • Hand crafted, flexible but difficult to evolve, maintain and keep consistent, with poor semantics • Gene Ontology • Object-based KR: frames • Extensively used, good structuring, intuitive. Semantics defined by OKBC standard • EcoCyc (uses Ocelot) and RiboWeb (uses Ontolingua) • Logic-based: Description Logics • Very expressive, model is a set of theories, well defined semantics • Automatically derived classification taxonomies • Concepts can be defined or primitive • Expressivity vs. computational complexity balance • TAMBIS Ontology (uses FaCT)
Vocabularies: Gene Ontology • Hand crafted with simple tree-like structures • Position of each concept and its relationships wholly determined by a person • Flexible but… • Maintenance and consistency preservation difficult and arduous • Poor semantics • Single hierarchies are limiting
Description Logics • Describe knowledge in terms of concepts and relations • Concept defined in terms of other roles and concepts • Enzyme = protein which catalyses reaction • Reason that enzyme is a kind of protein • Model built up incrementally and descriptively • Uses logical reasoning to figure out: • Automatically derived (and evolved) classifications • Consistency -- concept satisfaction
Frames and Logics • Frames • Rich set of language constructs • Impose restrictive constraints on how they are combined or used to define a class • Only support primitive concepts • Taxonomy hand-crafted • Description logics • Limited set of language constructs • Primitives combined to create defined concepts • Taxonomy for defined concepts established through logical reasoning • Expressivity vs. computational complexity • Less intuitive • Ideal: both! Current OIL activity uses a mixture. Logics provide reasoning services for frame schemes.
Ontology Exchange • To reuse an ontology we need to share it with others in the community • Exchanging ontologies requires a language with: • common syntax • clear and explicit shared meaning • Tools for parsing, delivery, visualising etc • Exchanging the structure, semantics or conceptualisation?
Ontology Exchange Languages • XOL eXtensible Ontology Language • XML markup • Frame based • Rooted in OKBC • http://www.ai.sri.com/pkarp/xol/ • OIL Ontology Interface Layer / Ontology Inference Layer • Gives a semantics to RDF-Schema • http://www.ontoknowledge.org/oil • [Diagram: OIL draws on Frames (modelling primitives, OKBC), Description Logics (formal semantics & reasoning support) and Web languages (XML & RDF based syntax)]
OIL: Ontology Metadata (Dublin Core)
    Ontology-container
        title "macromolecule fragment"
        creator "robert stevens"
        subject "macromolecule generic ontology"
        description "example for a tutorial"
        description.release "2.0"
        publisher "R Stevens"
        type "ontology"
        format "pseudo-xml"
        identifier "http://www.ontoknowledge.org/oil/oil.pdf"
        source "http://img.cs.man.ac.uk/stevens/tambis-oil.html"
        language "OIL"
        language "en-uk"
        relation.haspart "http://www.ontoRus.com/bio/mmole.onto"
The Three Roots of OIL Description Logics: Formal Semantics & Reasoning Support Frame-based Systems: Epistemological Modelling Primitives OIL Web Languages: XML- and RDF-based syntax
OIL primitive ontology definitions
    slot-def has-backbone
        inverse is-backbone-of
    slot-def has-component
        inverse is-component-of
        properties transitive
    class-def nucleic-acid
    class-def rna
        subclass-of nucleic-acid
        slot-constraint has-backbone
            value-type ribophosphate
    class-def ribophosphate
    class-def deoxyribophosphate
        subclass-of NOT ribophosphate
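Declaring has-component as transitive, with is-component-of as its inverse, means that a reasoner can derive links nobody asserted. A small Python sketch of those two slot properties follows. The asserted component pairs are illustrative assumptions rather than part of the tutorial ontology, and the function names are hypothetical.

```python
def transitive_closure(pairs: set) -> set:
    """Add (a, d) whenever (a, b) and (b, d) are present, until nothing changes."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

def inverse(pairs: set) -> set:
    """Swap the two ends of every pair."""
    return {(b, a) for a, b in pairs}

# Asserted has-component facts (illustrative only).
asserted = {("rna", "nucleotide"), ("nucleotide", "ribose")}

has_component = transitive_closure(asserted)
is_component_of = inverse(has_component)

print(sorted(has_component))
print(sorted(is_component_of))
```

The pair ("rna", "ribose") is never asserted. It follows from transitivity, and its mirror image follows from the inverse declaration.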
OIL defined ontology definitions
    class-def defined dna
        subclass-of nucleic-acid AND NOT rna
        slot-constraint has-backbone
            value-type deoxyribophosphate
    class-def defined enzyme
        subclass-of protein
        slot-constraint catalyse
            has-value reaction
    class-def defined kinase
        subclass-of protein
        slot-constraint catalyse
            has-value phosphorylation-reaction
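Because enzyme and kinase are defined classes, a reasoner can work out that kinase is a kind of enzyme without anyone stating it. The Python sketch below uses a simple structural comparison to show the idea. It assumes that phosphorylation-reaction is asserted to be a subclass of reaction, which the definitions above do not state. It is far simpler than the reasoning a Description Logic system such as FaCT performs and is only adequate for this tiny fragment.

```python
# Told (asserted) subclass links between primitive classes.
# The phosphorylation-reaction -> reaction link is an assumption of this sketch.
told_parents = {
    "protein": set(),
    "reaction": set(),
    "phosphorylation-reaction": {"reaction"},
}

# Defined classes: (named superclass, {slot: has-value filler}).
defined = {
    "enzyme": ("protein", {"catalyse": "reaction"}),
    "kinase": ("protein", {"catalyse": "phosphorylation-reaction"}),
}

def primitive_subsumes(general: str, specific: str) -> bool:
    """True if `specific` equals `general` or reaches it through told links."""
    if general == specific:
        return True
    return any(primitive_subsumes(general, p) for p in told_parents[specific])

def subsumes(general: str, specific: str) -> bool:
    """Structural subsumption test between two defined classes."""
    g_super, g_slots = defined[general]
    s_super, s_slots = defined[specific]
    if not primitive_subsumes(g_super, s_super):
        return False
    # Every has-value restriction of the general class must be matched by
    # an equal or more specific restriction on the same slot.
    return all(
        slot in s_slots and primitive_subsumes(filler, s_slots[slot])
        for slot, filler in g_slots.items()
    )

print(subsumes("enzyme", "kinase"))
print(subsumes("kinase", "enzyme"))
```

The first call reports that enzyme subsumes kinase and the second that the reverse does not hold, so kinase would be placed below enzyme in the derived taxonomy.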
OIL in XML • OIL has a DTD, an XML Schema and a mapping to RDF-Schema. See web site for details.
    <slot-def>
        <slot name="has-component"/>
        <inverse> <slot name="is-component-of"/> </inverse>
        <properties> <transitive/> </properties>
    </slot-def>
    <class-def>
        <class name="nucleic-acid"/>
    </class-def>
    <class-def>
        <class name="rna"/>
        <subclass-of> <class name="nucleic-acid"/> </subclass-of>
        <slot-constraint>
            <slot name="has-backbone"/>
            <value-type> <class name="ribophosphate"/> </value-type>
        </slot-constraint>
    </class-def>
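One benefit of an XML syntax is that standard parsers can read the ontology. The Python sketch below extracts subclass links with the standard library. It assumes a well-formed rendering in which classes are written as `<class name="..."/>`. The `<ontology>` wrapper element and the `subclass_links` function are assumptions made for this illustration, so consult the OIL DTD for the actual element names.

```python
import xml.etree.ElementTree as ET

# Well-formed fragment; the <ontology> wrapper is an assumption of this sketch.
OIL_XML = """
<ontology>
  <class-def><class name="nucleic-acid"/></class-def>
  <class-def>
    <class name="rna"/>
    <subclass-of><class name="nucleic-acid"/></subclass-of>
  </class-def>
</ontology>
"""

def subclass_links(xml_text: str) -> list:
    """Return (child, parent) pairs for every subclass-of element."""
    root = ET.fromstring(xml_text)
    links = []
    for class_def in root.iter("class-def"):
        child = class_def.find("class").get("name")
        for parent in class_def.findall("subclass-of/class"):
            links.append((child, parent.get("name")))
    return links

print(subclass_links(OIL_XML))
```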
OIL Remarks • Tools: • Protégé II editor • FaCT reasoner • Other projects: • Semantic Web projects (http://www.semanticweb.org) • Agents for the web projects (e.g. DAML) A knowledge representation language and inference mechanism for the web
OIL Features • Based on standard frame languages • Extends expressive power with DL style logical constructs • Still has frame look and feel • Can still function as a basic frame language • OILcore language restricted in some respects so as to allow for reasoning support • No constructs with ill defined semantics • No constructs that compromise decidability • Has both XML and RDF(S) based syntax
OIL Features • Semantics clearly defined by mapping to very expressive Description Logic, e.g.: • slot-constraint reverse-transcribe-from has-value mRNA or (part-of has-value mRNA) • ∃eats.meat ⊓ ∃eats.fish • Note the importance of clear semantics: • ∃eats.(meat ⊓ fish) • is inconsistent (assuming meat and fish are disjoint) • Mapping can also be used to provide reasoning support from a Description Logic system (e.g., FaCT)
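The eats example can be checked by brute force, assuming the two expressions are read as existential restrictions: "eats some meat and eats some fish" versus "eats something that is both meat and fish". The Python sketch below searches small interpretations in which meat and fish are disjoint. It is an illustration only. A Description Logic reasoner such as FaCT does not enumerate models this way, and a bounded search can only confirm that something is satisfiable, not prove in general that it is not.

```python
from itertools import combinations

def subsets(items):
    """Yield every subset of `items` as a set."""
    items = list(items)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield set(combo)

def satisfiable(concept, domain_size: int = 3) -> bool:
    """Search small interpretations for one where `concept` has an instance."""
    domain = range(domain_size)
    for meat in subsets(domain):
        for fish in subsets(domain):
            if meat & fish:
                continue  # disjointness axiom: meat and fish share no members
            for eaten in subsets(domain):  # things eaten by one candidate individual
                if concept(eaten, meat, fish):
                    return True
    return False

def eats_meat_and_eats_fish(eaten, meat, fish):
    # Some eaten thing is meat, and some (possibly different) eaten thing is fish.
    return bool(eaten & meat) and bool(eaten & fish)

def eats_meat_fish(eaten, meat, fish):
    # Some single eaten thing is both meat and fish.
    return bool(eaten & meat & fish)

print(satisfiable(eats_meat_and_eats_fish))
print(satisfiable(eats_meat_fish))
```

The second concept can never have an instance here, because the disjointness axiom leaves nothing that is both meat and fish, whatever the domain size.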
Why Reasoning Support? • Key feature of OIL core language is availability of reasoning support • Reasoning intended as design support tool • Check logical consistency of classes • Compute implicit class hierarchy • May be less important in small local ontologies • Can still be useful tool for design and maintenance • More important with larger ontologies/multiple authors • Valuable tool for integrating and sharing ontologies • Use definitions/axioms to establish inter-ontology relationships • Check for consistency and (unexpected) implied relationships • Already shown to be useful technique for DB schema integration
DAML+OIL • OIL merged with DAML • Originally retained frame syntax • DAML more concerned with deployment than with building and managing • OIL mapped to DAML+OIL, but not reliably reversed • FRAME look and feel may return • Web ontology language
Building Ontologies • No field of Ontological Engineering equivalent to Knowledge or Software Engineering; • No standard methodologies for building ontologies; • Such a methodology would include: • a set of stages that occur when building ontologies; • guidelines and principles to assist in the different stages; • an ontology life-cycle which indicates the relationships among stages. • Gruber's guidelines for constructing ontologies are well known.
The Development Lifecycle • Two kinds of complementary methodologies emerged: • Stage-based, e.g. TOVE [Uschold96] • Iterative evolving prototypes, e.g. MethOntology [Gomez Perez94]. • Most have TWO stages: • Informal stage • ontology is sketched out using either natural language descriptions or some diagram technique • Formal stage • ontology is encoded in a formal knowledge representation language, that is machine computable • An ontology should ideally be communicated to people and unambiguously interpreted by software • the informal representation helps the former • the formal representation helps the latter.
A Provisional Methodology • A skeletal methodology and life-cycle for building ontologies; • Inspired by the software engineering V-process model; • The overall process moves through a life-cycle; • The left side charts the processes in building an ontology; • The right side charts the guidelines, principles and evaluation used to ‘quality assure’ the ontology.
The V-model Methodology • Left side (processes): • User Model: Identify purpose and scope; Knowledge acquisition • Conceptualisation Model: Conceptualisation; Integrating existing ontologies • Implementation Model: Encoding; Representation • Right side (quality assurance): • Evaluation: coverage, verification, granularity • Conceptualisation principles: commitment, conciseness, clarity, extensibility, coherency • Encoding/Representation principles: encoding bias, consistency, house styles and standards, reasoning system exploitation • Ontology in Use
The ontology building life-cycle • Identify purpose and scope • Knowledge acquisition • Building: Conceptualisation, Integrating existing ontologies, Encoding • Language and representation • Available development tools • Evaluation
User Model: Identify purpose and scope • Decide what applications the ontology will support • EcoCyc: Pathway engineering, qualitative simulation of metabolism, computer-aided instruction, reference source • TAMBIS: retrieval across a broad range of bioinformatics resources • The use to which an ontology is put affects its content and style • Impacts re-usability of the ontology
User Model: Knowledge Acquisition • Specialist biologists; standard text books; research papers and other ontologies and database schema. • Motivating scenarios and informal competency questions – informal questions the ontology must be able to answer • Evaluation: • Fitness for purpose • Coverage and competency
Ontology Scenario • A molecule ontology; • Describes the molecules stored in bioinformatics databases and annotated therein; • It should cover the molecules and other chemicals described in the resources; • The ontology will be used for querying and annotating information in bioinformatics resources.
Competency Questions • Should cover the macromolecules found in molecular biology resources and courses; • Should accommodate various views on the macromolecules; • Should cover the queries people want to ask of macromolecules; • In reality, more detail is needed on these questions, e.g. “give me tRNA genes with anticodon x, from aardvark”.
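A detailed competency question can be turned into an executable check on whether the ontology has the concepts and relations needed to answer it. The Python sketch below does this for the tRNA question. The `Gene` record, its field names and the sample data are all hypothetical and are only there to show which distinctions the ontology must support.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Gene:
    name: str
    product_type: str   # kind of gene product, e.g. "tRNA" or "mRNA"
    anticodon: str      # only meaningful for tRNA genes
    organism: str

# Made-up records for illustration only.
genes = [
    Gene("gene-1", "tRNA", "UGC", "aardvark"),
    Gene("gene-2", "tRNA", "GAA", "aardvark"),
    Gene("gene-3", "tRNA", "UGC", "mouse"),
    Gene("gene-4", "mRNA", "", "aardvark"),
]

def trna_genes(records: list, anticodon: str, organism: str) -> list:
    """Answer: give me tRNA genes with anticodon x, from a given organism."""
    return [
        g.name
        for g in records
        if g.product_type == "tRNA"
        and g.anticodon == anticodon
        and g.organism == organism
    ]

print(trna_genes(genes, "UGC", "aardvark"))
```

Writing the query out shows that the ontology needs at least a way to say what kind of product a gene has, an anticodon attribute, and a link from gene to organism. If any of these is missing, the competency question cannot be answered.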
Acquiring Knowledge • Find your knowledge! • An important source is your head, but… • Use text books, glossaries (many of which lie on the web) and domain experts; • Use other ontologies – what did they include and how did they do it? • Record your sources of knowledge. • Use your competency questions;
Starting Concept List • Chemicals – atom, ion, molecule, compound, element; • Molecular-compound, ionic-compound, ionic-molecular-compound, …; • Ionic-macromolecular-compound and ionic-small-macromolecular-compound; • Protein, peptide, polyprotein, enzyme, holo-protein, apo-protein, … • Nucleic acid – DNA, RNA, tRNA, mRNA, snRNA, …
Conceptualisation Model: Conceptualisation • Identify the key concepts, their properties and the relationships that hold between them; • Which ones are essential? • What information will be required by the applications? • Structure domain knowledge into explicit conceptual models. • Identify natural language terms to refer to such concepts, relations and attributes;