Introduction to Ontology. Barry Smith August 11, 2012. The problem of (big) data. Some questions. How to find data? How to understand data when you find it? How to use data when you find it? How to integrate with other data? How to label the data you are collecting?
Introduction to Ontology Barry Smith August 11, 2012
Some questions • How to find data? • How to understand data when you find it? • How to use data when you find it? • How to integrate with other data? • How to label the data you are collecting? • How to build a set of labels for a new domain that will integrate well with labels used in neighboring domains? Big problem: nearly all of this data is siloed
Sources • Examples of databases containing person data and data pertaining to skills
The problem: many, many silos • DoD spends more than $6B annually developing a portfolio of more than 2,000 business systems and Web services • these systems are poorly integrated • deliver redundant capabilities • make data hard to access, foster error and waste • prevent secondary uses of data https://ditpr.dod.mil/ Based on FY11 Defense Information Technology Repository (DITPR) data
One road to a solution: Exploit the network effects of the Web • You build a site. • Others discover the site and they link to it • The more they link to it, the more important and well known the page becomes (this is what Google exploits) • Your page becomes important, and others begin to rely on it • Many people link to the data, use it • New ‘secondary uses’ of the data are discovered With thanks to Ivan Herman
Unfortunately the Web is ruled by anarchy. However much we try to link web content together à la Google, we will still be left with many, many silos. Photo credit “nepatterson”, Flickr
The idea of the Semantic Web To avoid silos, data must be available on the Web in a standard way. Use “ontologies” to capture common meanings with logical definitions that are understandable to both humans and computers, using a common language such as OWL (Web Ontology Language).
Annotate data using ontologies Inconsistent and idiosyncratic terms used in source data are associated with single preferred labels from ontologies
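A minimal sketch of this annotation step, assuming a tiny made-up ontology fragment: the IDs (`EX:…`), the labels and the source terms are all hypothetical and stand in for whatever ontology and source databases are actually in play. It shows several idiosyncratic source terms resolving to one ontology ID with one preferred label.

```python
# Hypothetical ontology fragment; the IDs and labels are invented for illustration.
PREFERRED_LABEL = {
    "EX:0000001": "person",
    "EX:0000002": "skill",
}

# Idiosyncratic source-data terms, each mapped to the ID of a single ontology term.
SOURCE_TERM_TO_ID = {
    "person": "EX:0000001",
    "individual": "EX:0000001",
    "staff member": "EX:0000001",
    "skill": "EX:0000002",
    "competency": "EX:0000002",
    "capability": "EX:0000002",
}


def annotate(source_term):
    """Return (ontology ID, preferred label) for a source term, or None if unmapped."""
    key = " ".join(source_term.lower().split())  # normalize case and whitespace
    term_id = SOURCE_TERM_TO_ID.get(key)
    if term_id is None:
        return None
    return term_id, PREFERRED_LABEL[term_id]
```

Returning `None` for unmapped terms, rather than guessing, keeps gaps in the mapping visible so that they can be reviewed by a subject-matter expert.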
Where we stand today • HTML demonstrated the power of the Web to allow sharing of information • increasing availability of semantically enhanced data • increasing power of semantic software to allow automatic reasoning over online information • increasing use of OWL in attempts to break down silos, and create useful integration of on-line data and information
Ontology success stories, and some reasons for failure unfortunately this data is not really linked
The result: the more Semantic Technology is successful, the more it fails to achieve its goals. The very success of the approach leads to the creation of ever new controlled vocabularies, which become semantic silos, because multiple ontologies are being created in ad hoc ways. The Semantic Web framework as currently conceived yields minimal standardization and creates semantic silos.
Basic Formal Ontology (BFO) top-level architecture used in over 120 ontology projects world wide Next tutorial in this series: August 18-19, 2012 http://ncorwiki.buffalo.edu/index.php/Basic_Formal_Ontology_2.0
People will tell you, all you need is … XML gives you: processable tagging + syntactic interoperability RDF gives you: net-centricity (URIs for unique and consistent naming), linked data OWL (Web Ontology Language) gives you: RDF + semantic interoperability, richer logic
Levels of coordination but these are just tools: • they do not rule out stovepipes • they do not prevent redundant efforts • they do not imply high quality ontologies of the sort that will support reasoning Even if we all speak Irish, this does not mean that we all understand each other
Warning 1. • OWL implementation is not enough • the issues we face are not only logical, but also sociological • they are the same issues already endemic in the database world • database architecture is inflexible • database systems, once distributed, degrade very quickly; create stovepipes, forking, siloes … • How to ensure coordinated ontology development over time?
Suggested principles for an ontologist’s code of ethics • I hereby swear that I will reuse existing ontology content wherever possible • I hereby swear that whenever I reuse terms from an existing ontology, I will keep their original source IDs • I hereby swear that before releasing an ontology I will aggressively test it in multiple independent real-world applications • I hereby swear that before committing a new term and definition to an ontology I will always think first
Some governance principles • Information sharing: to avoid ontology redundancy and inconsistency, there must be sharing of information at every stage • Collaborative development: where ontology development needs overlap, the communities involved must either develop shared resources or agree to a division of labor • Leverage of existing resources: ontology development should wherever possible involve reuse of existing ontologies. • Guiding role of subject-matter experts, who should be involved in the construction and maintenance of all domain ontology content
Warning 2. Ontology is a multi-disciplinary enterprise, in which the same terms are used in conflicting ways by different communities of ontologists • universal, type, kind, class • instance • concept, model • representation • datum
The ontology spectrum (data focus) glossary: A simple list of terms and their definitions. data dictionary: Terms, definitions, naming conventions and representations of the data elements in a computer system. data model (e.g. JC3IEDM): Terms, definitions, naming conventions, representations and the beginning of specification of the relationships between data elements. taxonomy: A complete data model in an inheritance hierarchy where all data elements inherit their behaviors from a single "super data element". ontology: A complete, machine-readable specification of a conceptualization = conceptual data model
The ontology spectrum (reality focus) glossary: A simple list of terms and their definitions. controlled vocabulary: A simple list of terms, definitions and naming conventions to ensure consistency. taxonomy: A controlled vocabulary in which the terms form a hierarchical representation of the types and subtypes of entities in a given domain. The hierarchy is organized by the is_a (subtype) relation. ontology: A controlled vocabulary organized by is_a and by further formally defined relations, for example part_of.
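To make the taxonomy step concrete, here is a small sketch of an is_a hierarchy and a subtype check. The term labels are borrowed from the FMA slide in this deck, but the particular is_a links are assumed for illustration, and the sketch assumes each term has at most one parent; real ontologies may allow several.

```python
# Assumed is_a links, for illustration only (child -> parent, single inheritance).
IS_A = {
    "pleural sac": "serous sac",
    "serous sac": "organ",
    "organ": "anatomical structure",
}


def ancestors(term, is_a=IS_A):
    """List every supertype of `term`, nearest first, by following is_a links."""
    chain = []
    seen = {term}
    while term in is_a:
        term = is_a[term]
        if term in seen:  # guard against an accidental cycle in the data
            break
        seen.add(term)
        chain.append(term)
    return chain


def is_subtype(term, candidate, is_a=IS_A):
    """True if `candidate` is reachable from `term` through one or more is_a links."""
    return candidate in ancestors(term, is_a)
```

Because is_a is followed transitively, `is_subtype("pleural sac", "organ")` holds even though no direct link between the two terms is stored.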
[Figure: fragment of the FMA (Foundational Model of Anatomy) showing is_a and part_of links among anatomical terms, including Anatomical Structure, Anatomical Space, Organ, Organ Part, Organ Subdivision, Organ Component, Organ Cavity, Organ Cavity Subdivision, Serous Sac, Serous Sac Cavity, Serous Sac Cavity Subdivision, Tissue, Pleural Sac, Pleura (Wall of Sac), Pleural Cavity, Parietal Pleura, Visceral Pleura, Mediastinal Pleura, Interlobar recess, Mesothelium of Pleura]
In graph-theoretical terms: Ontology Components: • alphanumeric IDs form nodes of the graph • each node is associated with some single term (preferred label) • relationships between nodes, such as is_a form the edges of the graph • definitions and synonyms are associated with each node
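The four components listed above can be sketched as a small data structure. This is one possible layout, not a standard API: the class names `Node` and `OntologyGraph`, the `EX:` IDs, the placeholder definition and the synonym are all hypothetical. The two edges reuse the is_a and part_of examples given elsewhere in this deck.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    term_id: str                 # alphanumeric ID (hypothetical values below)
    label: str                   # the single preferred label
    definition: str = ""
    synonyms: list = field(default_factory=list)


@dataclass
class OntologyGraph:
    nodes: dict = field(default_factory=dict)   # term_id -> Node
    edges: set = field(default_factory=set)     # (subject_id, relation, object_id)

    def add_node(self, node):
        if node.term_id in self.nodes:
            raise ValueError(f"duplicate ID: {node.term_id}")
        self.nodes[node.term_id] = node

    def add_edge(self, subject_id, relation, object_id):
        # Edges may only connect IDs that are already nodes of the graph.
        for term_id in (subject_id, object_id):
            if term_id not in self.nodes:
                raise KeyError(f"unknown ID: {term_id}")
        self.edges.add((subject_id, relation, object_id))

    def related(self, subject_id, relation):
        """Sorted preferred labels of nodes reached from subject_id via relation."""
        return sorted(
            self.nodes[o].label
            for s, r, o in self.edges
            if s == subject_id and r == relation
        )


graph = OntologyGraph()
graph.add_node(Node("EX:1", "anatomical structure"))
graph.add_node(Node("EX:2", "lung", definition="(placeholder)", synonyms=["pulmo"]))
graph.add_node(Node("EX:3", "lobe of lung"))
graph.add_edge("EX:2", "is_a", "EX:1")
graph.add_edge("EX:3", "part_of", "EX:2")
```

Keying everything on the ID rather than on the label means a preferred label can be corrected later without breaking any edge or any external annotation that points at the term.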
Entity =def anything which exists, including things and processes, functions and qualities, beliefs and actions, documents and software
instances universals
Catalog vs. inventory Ontology vs. list of items in your warehouse
Warning 3. Do not confuse things with words and ideas • Level 1: the entities in reality, both instances and universals • Level 2: cognitive representations of this reality on the part of scientists ... • Level 3: publicly accessible concretizations of these cognitive representations in textual and graphical artifacts
Ontology development starts with: Level 2 = the cognitive representations of practitioners or researchers in the relevant domain results in: Level 3 representational artifacts (comparable to maps, science texts, dictionaries)
Domain =def. a portion of reality that forms the subject-matter of a single science or technology or mode of study; proteomics HIV demographics ...
Representation =def. an image, idea, map, picture, name or description ... of some entity or entities two kinds of representation: analogue (photographs) digital/composite/syntactically structured
Class =def. a maximal collection of particulars referred to by a general term the class A =def. the collection of all particular A’s where ‘A’ is a general term (e.g. ‘brother of Elvis fan’, ‘cell’) Classes are on the same level as the instances which they contain
(Scientific) Ontology =def. a representational artifact whose representational units (which may be drawn from a natural or from some formalized language) are intended to represent 1. universals in reality 2. those relations between these universals which obtain universally (= for all instances) lung is_a anatomical structure lobe of lung part_of lung
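One way to read "obtain universally (= for all instances)" is the all-some pattern: 'lobe of lung part_of lung' asserts that every instance of lobe of lung is part of some instance of lung. The sketch below checks that reading against a toy set of instance data. The reading itself is an assumption about what the slide intends, and the function name and the instance identifiers are hypothetical.

```python
def holds_universally(part_type, whole_type, instance_of, part_of_instances):
    """Check the all-some reading of `part_type part_of whole_type`.

    instance_of:        instance ID -> universal it instantiates
    part_of_instances:  instance ID -> set of instance IDs it is part of
    Returns True when every instance of part_type is part of at least one
    instance of whole_type (vacuously True if part_type has no instances).
    """
    parts = [i for i, universal in instance_of.items() if universal == part_type]
    return all(
        any(instance_of.get(whole) == whole_type
            for whole in part_of_instances.get(part, ()))
        for part in parts
    )


# Toy instance-level data (hypothetical identifiers).
instance_of = {
    "lobe#1": "lobe of lung",
    "lobe#2": "lobe of lung",
    "lung#1": "lung",
}
part_of_instances = {
    "lobe#1": {"lung#1"},
    "lobe#2": {"lung#1"},
}
```

The check only runs one way: it does not require that every lung has a lobe recorded, which matches the direction of the part_of assertion.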
Ontology (science) the science of the kinds and structures of objects, properties, events, processes and relations in every domain of reality