Biomedical Informatics Group (UPM). WP4: Data Interoperability and Management. Presentation: UPM. List of Contents. Deliverable D25: “First report on Data Interoperability and Management” Recent results Plans for future until the end of the NoE. Contents of Deliverable 25. Introduction
Biomedical Informatics Group (UPM) • WP4: Data Interoperability and Management • Presentation: UPM
List of Contents • Deliverable D25: “First report on Data Interoperability and Management” • Recent results • Plans for future until the end of the NoE
Contents of Deliverable 25 • Introduction • Integrating Approaches • OntoFusion Plus • OntoDataClean • HIV database integration (Informa) • PML • Ethics and Confidentiality Issues • References
Original OntoFusion • Ontology-based data integration system • Schema-level integration • Each information source is associated with a “Virtual Schema” (an ontology that represents its conceptual structure) • “Virtual Schemas” are obtained through a mapping process between physical structures and a domain ontology • An automatic unification process merges several “Virtual Schemas” into one • Ontologies are represented in DAML+OIL
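The mapping and unification steps above can be illustrated with a minimal sketch. It is not the OntoFusion implementation: the class `VirtualSchema`, the functions `build_virtual_schema` and `unify`, and the two example sources are all hypothetical, and plain dictionaries stand in for DAML+OIL ontologies. The sketch only shows the idea that physical columns are mapped onto shared domain terms, after which schemas that use the same terms can be merged.

```python
from dataclasses import dataclass, field


@dataclass
class VirtualSchema:
    """Simplified stand-in for a virtual schema: concept -> set of attributes."""
    source: str
    concepts: dict = field(default_factory=dict)


def build_virtual_schema(source, physical_columns, mapping):
    """Map physical (table, column) pairs onto (concept, attribute) domain terms."""
    schema = VirtualSchema(source)
    for table_column in physical_columns:
        if table_column in mapping:  # unmapped columns are simply left out
            concept, attribute = mapping[table_column]
            schema.concepts.setdefault(concept, set()).add(attribute)
    return schema


def unify(schemas):
    """Merge virtual schemas; each (concept, attribute) lists the sources providing it."""
    unified = {}
    for schema in schemas:
        for concept, attributes in schema.concepts.items():
            for attribute in sorted(attributes):
                unified.setdefault((concept, attribute), []).append(schema.source)
    return unified


# Two hypothetical sources whose physical names differ but map to shared terms.
genes_db = build_virtual_schema(
    "genes_db",
    [("GENE_TBL", "SYM"), ("GENE_TBL", "CHR")],
    {("GENE_TBL", "SYM"): ("Gene", "symbol"),
     ("GENE_TBL", "CHR"): ("Gene", "chromosome")},
)
disease_db = build_virtual_schema(
    "disease_db",
    [("DIS", "NAME"), ("DIS", "GENE")],
    {("DIS", "NAME"): ("Disease", "name"),
     ("DIS", "GENE"): ("Gene", "symbol")},
)
unified = unify([genes_db, disease_db])
```

In this sketch a query for `("Gene", "symbol")` could be routed to both sources, because the unified view records that both provide that attribute.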
OntoFusion Plus • New version developed for INFOBIOMED • Agent-based architecture has been migrated to a Web Service oriented architecture • The knowledge representation language has been updated to OWL (Web Ontology Language)
Original OntoFusion approach: agent-based architecture [Figure: Web Client, Application Server, JADE Agent Platform, Search and Registry agents, BDV and BD nodes connected via JDBC; request and results flows]
OntoDataClean • Instance-level integration • Support for KDD processes, focused on automatic preprocessing of data before data mining algorithms are applied • Use of an ontology to remove or resolve data inconsistencies • Terminology • Scale • Range • Format • Missing values • …
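As a rough illustration of three of the inconsistency types listed above (scale, range, missing values), the following sketch applies a small declarative cleaning model to one column. It is an assumption-laden toy, not OntoDataClean itself: `CLEANING_MODEL`, `clean_column`, the temperature column, and the choice to treat out-of-range values as missing are all hypothetical, and a dictionary stands in for the preprocessing ontology.

```python
from statistics import mean

# Hypothetical cleaning model; in OntoDataClean this role is played by an ontology.
CLEANING_MODEL = {
    "temperature": {
        "scale": lambda f: (f - 32) * 5 / 9,  # Fahrenheit -> Celsius
        "valid_range": (30.0, 45.0),          # plausible values after scaling
        "missing": "average",                 # fill gaps with the column average
    },
}


def clean_column(name, values, model=CLEANING_MODEL):
    """Rescale values, treat out-of-range values as missing, then fill the gaps."""
    rules = model[name]
    low, high = rules["valid_range"]
    converted = []
    for value in values:
        if value is None:
            converted.append(None)
            continue
        value = rules["scale"](value)
        # Assumption of this sketch: an out-of-range value is handled as missing.
        converted.append(value if low <= value <= high else None)
    known = [v for v in converted if v is not None]
    fill = mean(known) if rules["missing"] == "average" and known else None
    return [fill if v is None else v for v in converted]


cleaned = clean_column("temperature", [98.6, None, 212.0, 100.4])
```

Keeping the rules in data rather than in code is the point of the design: the same `clean_column` function can serve any column once its rules are declared.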
KDD and Ontologies [Figure: KDD process from Data Warehouse through Selection (target data), Preprocessing (preprocessed data), Transformation (transformed data), Data Mining (patterns) and Interpretation (knowledge), supported by biomedical ontologies and methodological ontologies]
OntoDataClean • OntoFusion & OntoDataClean Web Services Platform [Figure: Web Client, Web Server (HTTP), User Service, several VS Services, results flow]
OntoDataClean • Fig. 2. OntoDataClean Preprocessing Ontology [Figure: ontology diagram with concepts including Data Source, Source DB, Fields, Cleaning Model, Order, Rule, Condition, Detection, Transformation, Pattern, Regular Expression, Replacement, Missing Values, Duplicate, Format, Scale, Synonym, Preferred Name, Data Type, Value Ranges, Representative Values, Average Column Replacement, Most Frequently Replacement, String Replacement, Row Removal]
OntoDataClean • Experiments carried out with three different public databases, selected because their contents can be downloaded to a local machine: • Reactome • Gepas – Fibroblast • BioMérieux
OntoDataClean Experiments (I) • BIOMERIEUX: Biochemical characterization of bacteriological agents • Pattern transformation http://www.biomerieux.com
Preprocessing ontology for BioMérieux (biochemical profiles) experiments, implemented using Protégé
An example of BioMérieux data transformation using OntoDataClean • Id – Test identifier • Results – Biochemical profiles using binary codification • Id’ – Test identifier • Results’ – Biochemical profiles using decimal codification
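A minimal sketch of this kind of pattern transformation follows. The slide does not specify the decimal codification actually used, so the sketch assumes the simplest reading: each binary profile string is interpreted as a base-2 number. The function name `binary_profile_to_decimal` and the sample rows are hypothetical.

```python
def binary_profile_to_decimal(profile: str) -> int:
    """Interpret a string of 0/1 test results as a base-2 number (an assumption)."""
    if not profile or set(profile) - {"0", "1"}:
        raise ValueError(f"not a binary profile: {profile!r}")
    return int(profile, 2)


# Hypothetical (Id, Results) rows; the output corresponds to (Id', Results').
rows = [("T1", "0101"), ("T2", "1110")]
transformed = [(test_id, binary_profile_to_decimal(result)) for test_id, result in rows]
```

Here `transformed` holds `("T1", 5)` and `("T2", 14)`; the identifiers are carried over unchanged and only the profile encoding is rewritten.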
OntoDataClean Experiments (II) • REACTOME: A knowledge base of biological pathways • Terminological inconsistencies (resolved using the UMLS), complex pattern modifications on string data concerning URLs, removal of duplicate values, synonym substitutions and missing-value transformations http://www.reactome.org
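The string-level operations named on this slide (synonym substitution, URL pattern modification, duplicate removal) might look roughly as follows. Everything here is illustrative: the `SYNONYMS` table is a hand-written placeholder for what the experiment resolved through the UMLS, and the records, URL pattern and function names are invented for the sketch.

```python
import re

# Placeholder synonym table; the experiment resolved terminology using the UMLS.
SYNONYMS = {"tumour": "neoplasm", "cancer": "neoplasm"}

# Hypothetical pattern: strip the scheme and an optional "www." prefix.
URL_PREFIX = re.compile(r"^https?://(www\.)?", re.IGNORECASE)


def normalise_record(term, url):
    """Replace a term by its preferred name and bring the URL to one format."""
    term = term.strip().lower()
    term = SYNONYMS.get(term, term)
    url = URL_PREFIX.sub("", url.strip()).rstrip("/")
    return term, url


def clean(records):
    """Normalise every record, then drop duplicates while keeping first-seen order."""
    seen, result = set(), []
    for term, url in records:
        record = normalise_record(term, url)
        if record not in seen:
            seen.add(record)
            result.append(record)
    return result


records = [
    ("Tumour", "http://www.example.org/p/1/"),
    ("neoplasm", "https://example.org/p/1"),
    ("Apoptosis", "http://example.org/p/2"),
]
cleaned_records = clean(records)
```

The first two records only become recognisable as duplicates after normalisation, which is why the sketch removes duplicates last.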
OntoDataClean Experiments (III) • GEPAS: The Gene Expression Pattern Analysis Suite • Integrated web-based pipeline for the analysis of gene expression patterns • Scale transformations http://www.gepas.org
OntoDataClean • Paper accepted and presented at the VII International Symposium on Biological and Medical Data Analysis (ISBMDA 2006), held in Thessaloniki (Greece) on December 7th-8th, 2006
1 paper presented • A plenary session dedicated to INFOBIOMED, with emphasis on technological tools (WP4 and WP5)
Just Published: Alonso-Calvo R, Maojo V, Billhardt H, Martin-Sanchez F, Garcia-Remesal M, Perez-Rey D. An agent- and ontology-based system for integrating public gene, protein, and disease databases. J Biomed Inform. 2007 Feb;40(1):17-29.
Current efforts at UPM • To automate the mapping process • To integrate OntoFusion Plus and OntoDataClean • A new interface for OntoFusion. 2 approaches: • Search based on free text • To integrate the Mobility Brokerage Service with OntoFusion • Additional work, based on Grid Services, carried out for the ACGT project (Grid services orchestration, implementation of a new Cancer ontology, ontology-based information retrieval, a semantic mediator to design workflows using different bioinformatics tools)
Currently pending: HIV distributed databases • Proposal for a demo oriented to the integration of viral genomics with clinical data in HIV infections • Ontology-based integration of databases
HIV proposal • Use OntoFusion to integrate the databases • Mappings are being established among the databases involved, using the ARCA database as reference • Pending feedback about: • Final structures to be integrated, if changes are made according to the suggestions proposed • Type of access to the databases • Kind of results expected from the integrated databases
Original Planning for HIV Mini-Pilot • June 26th: Meeting in Madrid • 07/07/2006 – 01/10/2006: Analysis of DBs • 01/10/2006 – 01/11/2006: Discussion • 01/11/2006 – 01/03/2007: Implementation • 01/03/2007 – 30/06/2007: Testing
PML (Anthony Brookes) • Polymorphism Markup Language • Definition of a data model for phenotype data, based on ‘Entity-Attribute-Value’ triplet concept
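The ‘Entity-Attribute-Value’ idea can be sketched in a few lines. This is a generic illustration of the triplet concept, not the PML data model: the `Triplet` type, the `pivot` helper, and the subject identifiers and attribute names are all hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Triplet(NamedTuple):
    """One observation: which entity, which attribute, and the recorded value."""
    entity: str
    attribute: str
    value: object


def pivot(triplets):
    """Group triplets into one attribute -> value record per entity."""
    records = defaultdict(dict)
    for triplet in triplets:
        records[triplet.entity][triplet.attribute] = triplet.value
    return dict(records)


# Hypothetical phenotype observations; entities need not share the same attributes.
observations = [
    Triplet("subject-1", "height_cm", 172),
    Triplet("subject-1", "smoker", False),
    Triplet("subject-2", "height_cm", 165),
]
by_subject = pivot(observations)
```

The appeal of the triplet form for phenotype data is that a new attribute needs no schema change: it is just another row, and entities with sparse measurements leave no empty columns behind.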
PML (Anthony Brookes) • Work on two other database projects: • GenoScore. A self-contained database application for storing genotype data and clinical material details relevant to the activities of typical laboratories involved in genetic association studies. • Human Genome Variation Database – Genotype-to-Phenotype (HGVbase-G2P). A public ‘central database’ for genetic association data (summary-level information) generated by the community.
Ethics and Confidentiality Issues (Custodix) • Overview and Analysis of Existing Techniques • Encrypted Storage on Untrusted Servers • Related research topics • Commercial encrypted storage solutions • Privacy enhanced searching • The Privacy Enhanced Storage (PES) Framework • Scope • PES design considerations • PES framework • Implementation • Known limitations and future work
Planning for Deliverable 25 • 24/01/07: 1st Schema Draft sent • Today • End of February to 1st week of March: D25 Draft for Internal Review • M39: Deadline for D25