Driving the Terminology Hub
RDF Triplets as a means to express lexical and referential data
Therese Vachon, NIBR, Unit Head UltraLink Technologies
W3C Workshop on RDF Access to Relational Databases
25-26 October 2007, Boston, MA, USA
Requirements
• Cross-linking of database information (e.g. on genes, proteins, metabolic pathways, compounds and ligands) to the original sources is a key issue.
• The productivity of accessing, sharing, searching, navigating, cross-linking and analyzing internal and external data relevant to the pharmaceutical industry should be increased.
2 | Driving the Terminology Hub | Therese Vachon | 25.10.2007
Strategy
• At NIBR, we have been developing a semantic integration layer on top of knowledge resources. It has been implemented within various services and applications.
• It uses:
  • A rich domain-specific terminology (biology, chemistry and medicine) containing 1.6 million terms
  • A Terminology Hub containing 8 GB of referential data (cross-references between data repositories)
• Using that knowledge, the scientist can access all data at hand with just a single mouse-click.
Application Areas for Terminologies
• Categorization of documents (via associated taxonomies)
• Search for concepts
• Semantic expansion of queries using synonyms and related terms
• Identification and extraction of relevant concepts (e.g. targets, genes, diseases, products) from texts
• Annotation of textual data with controlled terms as referential anchors
• Construction of a semantic layer on top of information sources, allowing context-sensitive navigation (UltraLink)
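As an illustration of semantic query expansion, the following minimal sketch assumes a small in-memory thesaurus. The `THESAURUS` contents and the function names `expand_query` and `to_boolean_query` are hypothetical and are not part of the NIBR services.

```python
# Hypothetical thesaurus: normalized term -> synonyms and related terms.
# The entries are illustrative only and are not taken from the NIBR terminology.
THESAURUS = {
    "chronic obstructive pulmonary disease": {
        "synonyms": ["COPD", "chronic obstructive lung disease"],
        "related": ["emphysema"],
    },
}


def expand_query(term, thesaurus=THESAURUS, include_related=True):
    """Return the normalized term, its synonyms and (optionally) related terms."""
    wanted = term.lower()
    for normalized, entry in thesaurus.items():
        names = [normalized] + entry["synonyms"]
        if wanted in (name.lower() for name in names):
            expanded = list(names)
            if include_related:
                expanded += entry["related"]
            return expanded
    # Unknown terms are passed through unchanged.
    return [term]


def to_boolean_query(terms):
    """Join the expanded terms into a simple OR query with quoted phrases."""
    return " OR ".join(f'"{t}"' for t in terms)


print(to_boolean_query(expand_query("COPD")))
```

A query entered with any synonym is first mapped to its entry and then widened, so a search for the abbreviation also retrieves documents that only use the full disease name.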
Application Areas for the Terminology Hub
• Coherent mapping between terminologies and coding systems (e.g. UniProt Accession Number for a protein)
• Coherent mapping between internal knowledge repositories (e.g. Biological Assays and Chemical Compounds)
• Coherent mapping between external knowledge repositories (e.g. HUGO and OMIM)
• Coherent mapping between internal and external knowledge repositories (e.g. Internal Project Code and Product Name)
UltraLink makes use of both the terminologies (entity recognition) and the Terminology Hub (cross-referencing).
UltraLink Activation
[Figure: screenshot with the labels "UltraLink Plug-in icon", "Activation" and "Concept Types Frame"]
The Landscape of Knowledge - Rooting the UltraLink in Data Sources/Terminologies
• The UltraLink makes use of a broad range of knowledge sources, both internal to Novartis and external. The linkage of these terminologies provides the routes along which you can navigate when using the UltraLink.
• The linkage between the resources is created automatically via a rule-based mapping procedure and manually by annotation. The latter is extremely important for connecting internal knowledge sources to each other and to external ones.
• The annotations built on the fly by the UltraLink could be stored as RDF annotations associated with a document and be accessed by other computer programs, in the spirit of the Semantic Web.
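To make the idea of on-the-fly annotation concrete, here is a minimal dictionary-based sketch. It is not the UltraLink implementation: the `LEXICON` contents, the `annotate` function and the plain-dictionary annotation format are assumptions for illustration, and the result is not yet serialized as RDF.

```python
import re

# Hypothetical lexicon: surface term (lower case) -> (normalized term, concept type).
LEXICON = {
    "glivec": ("Glivec", "Product"),
    "ovarian cancer": ("ovarian cancer", "Disease"),
    "novartis": ("Novartis", "Company"),
}


def annotate(text, lexicon=LEXICON):
    """Find lexicon terms in the text and return one annotation per match."""
    # Longest terms first, so multi-word terms win over shorter alternatives.
    terms = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b", re.IGNORECASE
    )
    annotations = []
    for match in pattern.finditer(text):
        normalized, concept_type = lexicon[match.group(1).lower()]
        annotations.append(
            {
                "start": match.start(),
                "end": match.end(),
                "text": match.group(1),
                "normalized": normalized,
                "type": concept_type,
            }
        )
    return annotations


for annotation in annotate("Glivec and ovarian cancer appear in a Novartis document."):
    print(annotation)
```

Each annotation keeps character offsets, so it can be attached to the document as a stand-off record and later be expressed in whatever annotation vocabulary is chosen.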
The Landscape of Knowledge - Rooting the UltraLink in Data Sources/Terminologies
[Figure: two panels, "Concepts and Data" and "Concepts and Terminology"]
Underlying terminologies used at NIBR
• > 15’000 Companies with > 35’000 terms
• > 2’000 Diseases with > 19’000 terms
• > 150’000 Genes with about 400’000 terms
• > 5’000 Modes of Action with > 12’000 terms
• > 95’000 Products with > 380’000 terms
• > 170’000 Targets with > 250’000 terms
• > 310’000 Species with > 435’000 terms
• + complete MeSH and EMTREE
• More than 1’600’000 terms
• The terminology consists of terms and relations between terms (main entry: normalized terms, synonyms, broader terms, narrower terms)
Principles used for the construction of the terminology and organization of terms
• To create the terminology of reference, terms are extracted from available terminologies (e.g. UniProt, EntrezGene, HGNC) and the references to the source systems are preserved.
• Terms specific to a database are referred to as local terms. These local terms are stored in a dedicated data structure, the Metastore. Besides the flat set of terms, thesaurus relations such as synonymy, broader term and narrower term are extracted as well, which allows a thesaurus to be built.
• For each entry in the terminology (e.g. a gene name or a product), one term is chosen from the list of synonyms and declared the "normalized term".
• Normalized / global terms, synonyms / local terms, and broader and narrower terms, together with their sources of reference, constitute the terminology content behind the UltraLink and are used by the Terminology Hub.
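The construction principles above can be sketched as a small data model. This is an assumption-laden illustration, not the Metastore schema: the class names `LocalTerm` and `Entry`, the `build_entry` helper, the source names and the identifiers are all hypothetical placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class LocalTerm:
    """A term as it appears in one source system."""
    text: str
    source: str      # name of the source terminology (placeholder values below)
    source_id: str   # identifier in that source (placeholder values below)


@dataclass
class Entry:
    """One terminology entry: a normalized term plus its local terms."""
    normalized: str
    local_terms: list = field(default_factory=list)
    broader: list = field(default_factory=list)
    narrower: list = field(default_factory=list)

    def synonyms(self):
        """All distinct term texts other than the normalized term."""
        seen = []
        for local_term in self.local_terms:
            if local_term.text != self.normalized and local_term.text not in seen:
                seen.append(local_term.text)
        return seen


def build_entry(local_terms, choose=lambda terms: terms[0].text):
    """Pick the normalized term from the local terms with a caller-supplied rule."""
    return Entry(normalized=choose(local_terms), local_terms=list(local_terms))


example_terms = [
    LocalTerm("VEGF", "SourceA", "A-0001"),
    LocalTerm("vascular endothelial growth factor", "SourceB", "B-0001"),
    LocalTerm("VEGF", "SourceC", "C-0001"),
]
# Illustrative rule only: prefer the longest term as the normalized term.
entry = build_entry(
    example_terms, choose=lambda terms: max(terms, key=lambda t: len(t.text)).text
)
print(entry.normalized, entry.synonyms())
```

Keeping the source name and identifier on every local term preserves the reference to the source system, while the normalized term is only a designated member of the synonym list.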
Creating Reference - the Terminology Hub
Different knowledge repositories have different ways to encode a concept:
• Registry Number
• Unique Internal ID
• Concept Identifier
• Enumerating terms
• Just using different terms without any constraints
More than 8 GB of cross-referencing information
Searching for a term T in both source A and source B may lead to different results because of different naming/referencing conventions (false negatives in information retrieval).
The Terminology Hub ensures coherent mapping:
• Between coding systems
• Between different representation levels (e.g. ID vs. Concept)
• Between local terms and global terms
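A minimal sketch of such a mapping is shown below, assuming that every code is linked to one global term and that each global term has at most one code per coding system. The class `TerminologyHubSketch`, its methods, and the system names and codes are hypothetical; the real Terminology Hub is not described at this level of detail.

```python
from collections import defaultdict


class TerminologyHubSketch:
    """Toy cross-reference store: (system, code) pairs linked to a global term."""

    def __init__(self):
        self._term_of = {}                   # (system, code) -> global term
        self._codes_of = defaultdict(dict)   # global term -> {system: code}

    def link(self, global_term, system, code):
        """Register that `code` in `system` denotes `global_term`."""
        self._term_of[(system, code)] = global_term
        self._codes_of[global_term][system] = code

    def to_term(self, system, code):
        """Map an identifier to its global term (ID level to concept level)."""
        return self._term_of.get((system, code))

    def translate(self, system, code, target_system):
        """Map a code from one coding system to another via the global term."""
        term = self.to_term(system, code)
        if term is None:
            return None
        return self._codes_of[term].get(target_system)


hub = TerminologyHubSketch()
# Placeholder systems and codes, for illustration only.
hub.link("example compound", "RegistrySystemA", "A-001")
hub.link("example compound", "InternalSystemB", "B-xyz")
print(hub.translate("RegistrySystemA", "A-001", "InternalSystemB"))
```

Routing every lookup through the global term is what lets a search issued with source A's identifier also find the records that source B files under a different code.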
Classes of objects covered by the Terminology Hub
• Coding systems
  • A coding system provides a predefined set of (sometimes hierarchical) codes to represent a classification, a nomenclature, a controlled vocabulary, a thesaurus or chemical structures. For example, you can use the MeSH® Tree number C06.405.205.697 to refer to Gastritis in a specific sub-tree of MeSH®.
• References
  • Unique and unequivocal identifiers based on a coding system create references in their corresponding data repository. By nature, they are technical artifacts (e.g. FTY720) and not part of our scientific natural language. Nevertheless, most of them deserve to be identified because they are used in the scientific literature.
• Pointers and cross-referencing information
  • The Metastore contains pointers that allow knowledge sources and applications to be cross-referenced.
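For hierarchical codes such as the MeSH tree number mentioned above, the position in the hierarchy can be read from the code itself, since each dot-separated segment adds one level. The helper names `tree_number_ancestors` and `is_descendant` below are hypothetical; this is a sketch of the idea, not part of any MeSH tooling.

```python
def tree_number_ancestors(tree_number):
    """Return the broader tree numbers of a dotted code, nearest first."""
    parts = tree_number.split(".")
    return [".".join(parts[:i]) for i in range(len(parts) - 1, 0, -1)]


def is_descendant(tree_number, ancestor):
    """True if `tree_number` lies strictly below `ancestor` in the hierarchy."""
    return tree_number.startswith(ancestor + ".")


print(tree_number_ancestors("C06.405.205.697"))
print(is_descendant("C06.405.205.697", "C06.405"))
```

Deriving broader codes this way gives the broader/narrower relations of a hierarchical coding system without a separate lookup table.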
Classes of objects covered by the Terminology Hub
• Terms
  • A term is the smallest meaningful linguistic unit on which our domains of discourse (biology, chemistry, medicine) are based. A term differs from a word because a term can consist of multiple meaningful words, such as "chronic obstructive pulmonary disease".
• Concepts
  • A concept is an abstraction based on properties of individuals that we observe in the world. Individuals that belong to the same concept share a set of common properties. For example, "targets" share the property that they should be druggable.
• Data Repositories, also named Knowledge Sources
  • For all kinds of data, we use the general notion of a data repository. The term "data repository" emphasizes that there is a source where some data resides, without making any commitment about physical representation (e.g. database or text file) or format of representation (e.g. structured or free text).
Classes of objects covered by the Terminology Hub
[Diagram: classes of objects and the relations between them]
• Terms: spinal cord, vascular endothelial growth factor, CCR5, Glivec, ovarian cancer, Novartis, Cytomegalovirus, ...
• Encoding: IUPAC, Structures, IDs, GIF, Symbols, Formulas, Registry Numbers, ...
• Data Repositories: Internal Chemistry DB, CI sources, Literature, Patents, ...
• Reference: Compound nos, Project codes, Competitor codes, PMID 9683255, EntrezGene 450128, CAS 439-14-5, Patent numbers
• Concepts: Species, Products, Companies, Diseases, Genes, Targets, Mammalian Genes, ...
• Relation labels: synonym-of, broader, narrower, encodes, has-type, points-to, is-a
Achievements and Improvements
• All information about terminologies and cross-references is stored in a relational database (Oracle 10.2.0.2).
• The data in the database can be accessed through web services that allow users to find, for example, normalized terms or pointers for a specific concept type.
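The kind of lookup such a service performs can be sketched against a relational store. The sketch below uses Python's built-in SQLite module as a stand-in for Oracle, and the table layout (`entry`, `synonym`) and function names are assumptions for illustration, not the actual Metastore schema or web service API.

```python
import sqlite3


def create_store():
    """Create an in-memory database with a hypothetical two-table layout."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE entry (
            id INTEGER PRIMARY KEY,
            normalized TEXT UNIQUE NOT NULL,
            concept_type TEXT NOT NULL
        );
        CREATE TABLE synonym (
            entry_id INTEGER NOT NULL REFERENCES entry(id),
            term TEXT NOT NULL
        );
        """
    )
    return conn


def add_entry(conn, normalized, concept_type, synonyms):
    """Insert one entry together with its synonyms."""
    cursor = conn.execute(
        "INSERT INTO entry (normalized, concept_type) VALUES (?, ?)",
        (normalized, concept_type),
    )
    conn.executemany(
        "INSERT INTO synonym (entry_id, term) VALUES (?, ?)",
        [(cursor.lastrowid, s) for s in synonyms],
    )


def get_synonyms(conn, normalized):
    """Return all synonyms stored for a normalized form, sorted alphabetically."""
    rows = conn.execute(
        "SELECT s.term FROM synonym s JOIN entry e ON e.id = s.entry_id "
        "WHERE e.normalized = ? ORDER BY s.term",
        (normalized,),
    )
    return [row[0] for row in rows]


store = create_store()
# Illustrative data; the concept type assigned here is an assumption.
add_entry(store, "vascular endothelial growth factor", "Target", ["VEGF", "VEGF-A"])
print(get_synonyms(store, "vascular endothelial growth factor"))
```

A web service operation such as "get all synonyms for a normalized form" would then be a thin wrapper around a parameterized query of this shape.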
Metastore Web Service
Get all synonyms for a normalized form
… UltraLink Web Services
Get all accessible pointer types for a normalized form
Achievements and Improvements
• We intend to improve the semantic representation of the data in order to facilitate reuse, interoperability and exchange.
• RDF notation and RDF coding standards provide an adequate means for a richer semantic representation.
• We use SKOS, Dublin Core and other RDF-based coding standards and supplement them with our own RDF vocabulary.
Simple Knowledge Organisation System (example)
Terminology for Diseases (SKOS fragment)
Converting Terminologies to RDF
• Clear separation of terminologies from ontologies: we assign a type (rdf:type) to the URI of a term as a reference to a concept in an ontology.
• Conversion to RDF increased the amount of data roughly by a factor of 3.
• We obtained more than 5 million RDF triples as a preliminary representation of our terminologies.
• We are currently setting up the entire workflow for generating, storing and querying RDF.
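The conversion step can be sketched as follows: one terminology entry becomes a handful of triples, with rdf:type pointing to an ontology concept and SKOS properties carrying the labels and the broader relation. The `example.org` namespaces, the entry layout and the function names are assumptions for illustration; only the RDF and SKOS namespace URIs and the property names `prefLabel`, `altLabel` and `broader` come from the published standards.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SKOS = "http://www.w3.org/2004/02/skos/core#"
BASE = "http://example.org/terminology/"  # hypothetical namespace for terms
ONTO = "http://example.org/ontology/"     # hypothetical namespace for concepts


def uri(namespace, local_name):
    """Format a URI reference in N-Triples syntax."""
    return f"<{namespace}{local_name}>"


def literal(text, lang="en"):
    """Format a language-tagged literal, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang}'


def entry_to_triples(entry):
    """Turn one terminology entry (a plain dict) into a list of triples."""
    subject = uri(BASE, entry["id"])
    triples = [
        (subject, f"<{RDF_TYPE}>", uri(ONTO, entry["concept_type"])),
        (subject, uri(SKOS, "prefLabel"), literal(entry["normalized"])),
    ]
    for synonym in entry["synonyms"]:
        triples.append((subject, uri(SKOS, "altLabel"), literal(synonym)))
    for broader_id in entry["broader"]:
        triples.append((subject, uri(SKOS, "broader"), uri(BASE, broader_id)))
    return triples


def to_ntriples(triples):
    """Serialize triples as N-Triples, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


example = {
    "id": "disease_0001",       # placeholder identifier
    "concept_type": "Disease",
    "normalized": "chronic obstructive pulmonary disease",
    "synonyms": ["COPD"],
    "broader": ["disease_0000"],  # placeholder identifier
}
print(to_ntriples(entry_to_triples(example)))
```

Because every label and relation becomes its own statement with full URIs, a single entry expands into several triples, which is consistent with the data growth noted above.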
Conclusion
• The first phase of transforming the terminology to RDF/XML is completed.
• We are currently developing a model for representing the Terminology Hub in RDF. We expect that an RDF notation of the Terminology Hub will comprise approximately 50 million RDF triples.
• We intend to test the framework thoroughly (performance, effective semantic gain compared to the current technology).
• Closer collaboration with the W3C Healthcare group
Acknowledgements
Thanks to the ULT team
• Semantic & Text Analytics Layer: Martin Romacker, Pierre Parisot, Nicolas Grandjean
• Data Integration & Services Layer: Alexander Fromm, Laurent Mentek
• Application Layer: Daniel Cronenberger, Olivier Kreim
Thanks to Manuel Peitsch