Biomedical Informatics Group (UPM)
WP4: Data Interoperability and Management
David Pérez-Rey – UPM
Miguel García-Remesal – UPM
Agenda
• SoA (state of the art)
• Ontology-based Data Integration
• INFORMA pilot
SoA
• Contents
• Data characteristics and ontologies
• Integration approaches
• Ethics & Confidentiality issues
• Support for pilots
Ontology-based Data Integration
• Schema-level integration
  • ONTOFUSION update for INFOBIOMED (Web Services and OWL)
• Instance-level integration
  • ONTODATACLEAN for INFOBIOMED
• Combination into a single system
Schema-level Integration - ONTOFUSION
• Each information source is represented by a “virtual schema”, that is, an ontology describing the conceptual structure of the information
• “Virtual schemas” are obtained by mapping the physical structure of the source to the ontology:
  • Top-down methodology: the domain ontology already exists
  • Bottom-up methodology: a new domain ontology is created
  • Hybrid methodology
Schema-level Integration
[Figure: layered diagram. From top to bottom: USER (SEARCH); Unified Virtual Schemas; Unification; Virtual Schemas as Ontologies; Mapping; Physical DBs (Local Data). Example labels: SNP, SNP_Code, Id_SNP, SNPs.]
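The mapping and unification steps can be sketched in a few lines of Python. This is only an illustration, not ONTOFUSION's implementation: the source names (`db_a`, `db_b`, `db_c`), the sample values, and the functions `unify` and `search` are hypothetical. Only the labels SNP, SNP_Code, Id_SNP and SNPs come from the diagram.

```python
from typing import Dict, List, Tuple

# Virtual schemas: each source maps its physical column names to an
# ontology concept. Source names are hypothetical placeholders.
VIRTUAL_SCHEMAS: Dict[str, Dict[str, str]] = {
    "db_a": {"SNP_Code": "SNP"},
    "db_b": {"Id_SNP": "SNP"},
    "db_c": {"SNPs": "SNP"},
}

# Made-up local data, one row per source, for illustration only.
LOCAL_DATA: Dict[str, List[Dict[str, str]]] = {
    "db_a": [{"SNP_Code": "snp-1"}],
    "db_b": [{"Id_SNP": "snp-2"}],
    "db_c": [{"SNPs": "snp-3"}],
}


def unify(schemas: Dict[str, Dict[str, str]]) -> Dict[str, List[Tuple[str, str]]]:
    """Build a unified schema: concept -> list of (source, physical column)."""
    unified: Dict[str, List[Tuple[str, str]]] = {}
    for source, mapping in schemas.items():
        for column, concept in mapping.items():
            unified.setdefault(concept, []).append((source, column))
    return unified


def search(concept: str,
           unified: Dict[str, List[Tuple[str, str]]],
           data: Dict[str, List[Dict[str, str]]]) -> List[str]:
    """Answer a concept-level query by reading each mapped physical column."""
    results: List[str] = []
    for source, column in unified.get(concept, []):
        for row in data.get(source, []):
            if column in row:
                results.append(row[column])
    return results


unified_schema = unify(VIRTUAL_SCHEMAS)
snp_values = search("SNP", unified_schema, LOCAL_DATA)
```

The user queries the concept `SNP` only; the differing physical column names stay hidden behind the mappings.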
Instance-level Integration - ONTODATACLEAN
• An ontology serves as a framework to identify inconsistencies in:
  • Terminology
  • Scale
  • Format
  • Patterns
  • Missing values
  • …
• Automatic preprocessing follows
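A minimal Python sketch of inconsistency detection, assuming the ontology's constraints can be reduced to a plain dictionary. The field names, allowed terms, date pattern and numeric range below are hypothetical and are not taken from ONTODATACLEAN.

```python
import re
from typing import Dict, List

# Hypothetical constraints standing in for the preprocessing ontology.
CONSTRAINTS: Dict[str, Dict[str, object]] = {
    "sex": {"allowed": {"Male", "Female"}},          # terminology
    "visit_date": {"pattern": r"\d{2}-\d{2}-\d{4}"},  # format / pattern
    "temperature": {"min": 30.0, "max": 45.0},        # scale
}


def find_inconsistencies(record: Dict[str, object]) -> List[str]:
    """Return one 'field: kind' message per inconsistency found."""
    issues: List[str] = []
    for field, rule in CONSTRAINTS.items():
        value = record.get(field)
        if value is None or str(value).strip() == "":
            issues.append(f"{field}: missing value")
            continue
        text = str(value).strip()
        if "allowed" in rule and text not in rule["allowed"]:
            issues.append(f"{field}: terminology")
        if "pattern" in rule and not re.fullmatch(rule["pattern"], text):
            issues.append(f"{field}: format")
        if "min" in rule:
            try:
                number = float(text)
            except ValueError:
                issues.append(f"{field}: format")
                continue
            if not rule["min"] <= number <= rule["max"]:
                issues.append(f"{field}: scale")
    return issues


report = find_inconsistencies(
    {"sex": "1", "visit_date": "16/11/05", "temperature": ""}
)
```

Each reported inconsistency would then be handed to a matching transformation in the automatic preprocessing step.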
Instance-level Integration
DB1        DB2                Transformation
C>T        12                 'C' -> '1', '>' -> '', 'T' -> '2'
1.0        100                DB1 x 100
Fever      High temperature   High temperature -> Fever
Male       1                  Male -> 1
…          …                  …
16/11/05   16-11-2005         …
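The explicit transformations on this slide can be written as small Python functions. This is a sketch under stated assumptions: the field names (`snp`, `ratio`, `symptom`, `sex`) and the `homogenize` helper are hypothetical, and the date rule is left out because the slide does not specify it.

```python
from typing import Callable, Dict


def encode_snp(value: str) -> str:
    """Character substitutions from the slide: 'C'->'1', '>'->'', 'T'->'2'."""
    return value.translate(str.maketrans({"C": "1", ">": None, "T": "2"}))


def rescale(value: str) -> str:
    """Scale transformation from the slide: DB1 value x 100."""
    return str(int(round(float(value) * 100)))


def unify_term(value: str) -> str:
    """Terminology transformation: 'High temperature' -> 'Fever'."""
    synonyms = {"high temperature": "Fever"}
    return synonyms.get(value.strip().lower(), value)


def encode_sex(value: str) -> str:
    """Coding transformation from the slide: 'Male' -> '1'."""
    return {"Male": "1"}.get(value, value)


# Hypothetical field names; each field is paired with one rule.
RULES: Dict[str, Callable[[str], str]] = {
    "snp": encode_snp,
    "ratio": rescale,
    "symptom": unify_term,
    "sex": encode_sex,
}


def homogenize(record: Dict[str, str]) -> Dict[str, str]:
    """Apply the rule registered for each field; leave other fields as-is."""
    return {k: RULES[k](v) if k in RULES else v for k, v in record.items()}


clean = homogenize(
    {"snp": "C>T", "ratio": "1.0", "symptom": "High temperature", "sex": "Male"}
)
```

Keeping the rules in a table keyed by field means a new transformation can be added without touching `homogenize`.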
System Architecture
[Figure: architecture diagram. Labels shown: Web Client, Web Server, HTTP, Web Services Platform, VS Service (three instances), User Service, Results, Instance Homogenization.]
Experiments
• Testing with public databases
  • Reactome
  • GEPAS – Fibroblast
  • BioMérieux
• Contents of the databases can be downloaded
Reactome
• A knowledge base of biological pathways
• Terminological inconsistencies (UMLS) and missing values
http://www.reactome.org
GEPAS - Fibroblast
• The Gene Expression Pattern Analysis Suite
• Integrated web-based pipeline for the analysis of gene expression patterns
• Scale transformations
http://www.gepas.org
BioMérieux
• Biochemical characterization of bacteriological agents
• Pattern transformation
http://www.biomerieux.com
Data Mining Experiments
• Public data sets for data mining
• Preprocessing ontology development
• Result comparison after preprocessing
Data Mining Experiments
• Breast cancer
  • Clinical data (Ljubljana Oncology Institute, Wisconsin, others)
  • Gene expression (Duke University and Kent Ridge Biomedical Data Set)
• Thyroid – hyper- and hypothyroidism
  • 6 databases from the Garavan Institute (Sydney)
• Others
INFORMA Pilot
• Document sent to the consortium by INFORMA
  • Subtopic 1: HIV subtyping and URF repository
  • Subtopic 2: HIV in vitro drug susceptibility predictor
  • Subtopic 3: HIV treatment response repository
  • Subtopic 4: HIV treatment database integration
Integration Approaches: Centralized vs Distributed
[Figure: two diagrams, each showing Hospital 1, Hospital 2, Hospital 3, Hospital 4, ..., Hospital n. In one, the hospitals connect to a Centralized Repository; in the other, databases (DB) appear alongside the hospitals.]
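The difference between the two approaches can be sketched in Python. The hospital names, record fields and function names are hypothetical placeholders. In the centralized variant the data is copied into one repository before querying; in the distributed variant the query is sent to each source and only the matching rows are merged.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]

# Made-up local databases, for illustration only.
HOSPITALS: Dict[str, List[Record]] = {
    "hospital_1": [{"patient": "p1", "drug": "A"}],
    "hospital_2": [{"patient": "p2", "drug": "B"},
                   {"patient": "p3", "drug": "A"}],
}


def build_central_repository(sources: Dict[str, List[Record]]) -> List[Record]:
    """Centralized: copy every row from every source into one repository."""
    return [{**row, "source": name}
            for name, rows in sources.items() for row in rows]


def query_centralized(repository: List[Record],
                      predicate: Callable[[Record], bool]) -> List[Record]:
    """Run the query once, against the central copy."""
    return [row for row in repository if predicate(row)]


def query_distributed(sources: Dict[str, List[Record]],
                      predicate: Callable[[Record], bool]) -> List[Record]:
    """Distributed: evaluate the query at each source, then merge the hits."""
    results: List[Record] = []
    for name, rows in sources.items():
        local_hits = [row for row in rows if predicate(row)]
        results.extend({**row, "source": name} for row in local_hits)
    return results


def takes_drug_a(row: Record) -> bool:
    return row["drug"] == "A"


central = query_centralized(build_central_repository(HOSPITALS), takes_drug_a)
distributed = query_distributed(HOSPITALS, takes_drug_a)
```

Both calls return the same rows. The trade-off lies in where the data lives: the distributed variant leaves each database under local control, at the cost of contacting every source for each query.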
INFORMA Pilot
• Objectives
  • Develop a web-based tool to facilitate export or access of data between a user’s database and the internal database used to implement the functions available on the HIV pol analysis portal
  • Define a standard (possibly based on ARCA)
• Type of data to be handled
  • HIV pol sequences likely to be recombinants
  • HIV pol sequences matched with in vitro drug susceptibility
  • HIV pol sequences coupled with treatment used and follow-up data
INFORMA Pilot
• Pilot challenges
  • Heterogeneity of data sources
  • Different schemas, technology…
  • Heterogeneous data
  • Need to maintain local autonomy and preferences
  • Ethical and security issues - Custodix
  • Privacy, security
  • Anonymization of sensitive data
• Features - semi-automatically handled:
  • Heterogeneity conflicts
  • Semantic conflicts
  • Descriptive conflicts
  • Structural conflicts
• Implementation status
  • To be discussed and defined in the Madrid meeting (end of May)
Future actions
• Deliverable D25 – “First report on Data Interoperability and Management” – Month 39
• INFORMA Mini-Pilot meeting (end of May in Madrid) – other partners are welcome
• Other collaborations