530 likes | 699 Views
Data Acquisition from Semantically Heterogeneous Biomedical Data Sources on the Internet. Dr Sharifullah Khan School of Electrical Engineering and Computer Science (SEECS) National University of Sciences and Technology (NUST) H-12 Islamabad, Pakistan sharifullah.khan@seecs.edu.pk.
E N D
Data Acquisition from Semantically Heterogeneous Biomedical Data Sources on the Internet Dr Sharifullah Khan School of Electrical Engineering and Computer Science (SEECS) National University of Sciences and Technology (NUST) H-12 Islamabad, Pakistan sharifullah.khan@seecs.edu.pk
Presentation Outline • Introduction • Integration issues • Integration Approaches • Domain Ontology • Conclusion
PublicBiomedical Sources • Technological developments made scientists capable to produce data expeditiously. • Sources available online in a large number publicly • For Example: 3
GenBank http://www.ncbi.nih.gov/Genbank/index.html
Swiss-Prot http://www.expasy.org/sprot/
MedLinePlus http://medlineplus.gov/
OMIM http://www.nslij-genetics.org/search_omim.html
DDBJ http://www.ddbj.nig.ac.jp/
EMBL-EBI http://www.ebi.ac.uk/embl/
Gallus gallus http://www.agbase.msstate.edu/
Rat Genome Database http://rgd.mcw.edu/
GeneDB http://old.genedb.org/genedb/pombe/
WormBase http://www.wormbase.org/
Mouse Genome Informatics http://www.geneontology.org/GO.refgenome.shtml#refsppdb
Why to Integrate • Sources are intermediaries between experimental observations and additional synthesis. • Integrating sources for processing queries to extract new knowledge, e.g.: • What are the functions of genes? • What variations do exist in the genome? • How variations give rise to illnesses? • What drugs are discovered?
Biomedical Queries • What genes cause disease: ‘Achondroplasia’? • What mutation have been found in the genes cause the disease: ‘achondroplasia’? • What is the 3D structure of all alcohol dehydro-genases that belong to the enzyme family EC:1.1.1.1 and is located within the human chromosome region 4q21-4q23 ?
A Neuroscientist’s Information Integration Problem ? Data Integration protein localization (NCMIR) sequence info (CaPROT) neurotransmission (SENSELAB) What is the cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity? How about other rodents? “Complex Multi-Worlds” Mediation morphometry (SYNAPSE)
Main Barriers • Large & dynamic data Volume. • Diverse Source Foci. • Different querying capabilities. • Wide Variety of Data Representation.
Representational Heterogeneity • Structural Differences • Synonymy • Semantic Differences • Polysemy • Content Differences • Data Complexity
Structural Differences • Structured Databases: RDBMS, ODBMS. • Semi-structured Databases: XML. • Un-structured Databases: flat files, Text. • Tools: BLAST, Entrez, PubMed. • Interoperability is a challenge. • Various querying capabilities.
Synonymy • Distinct lexical terms denoting the same semantic objects. • Doctor versus Physician. • MRN versus Patient_ID.
Semantic Differences • The same identifier but data values have different meaning. • Blood-Culture-Growth: • No, moderate, and significant. • 0, 1+, 2+, 3+, and 4+. • List of synonyms: • GDB: Unigene Number. • GO: International Protein Index (IPI). • Swiss-Prot: Enzyme Commission Number.
Polysemy • Data values with multiple meaning. • Examples: • FBN1 – Human fibrillin 1 gene. • Fbn1 – Mouse fibrillin 1 gene. • MFS1 – Human fibrillin 1 gene. • MFS1 – Marfan Syndrome.
Content Difference • Data inconsistency. • Implicit Values: • Currency – Euro, dollar and Pound. • Derivable Values: • Date-of-birth and age. • Missing values.
Data Complexity • Gene names, Gene product names and Gene functions names. • The number of synonyms. • The non-intuitive nature of synonyms. • Name: dystrophin (muscular dystrophy, Duchenne and Becker type). • Abbrivaition: dystrophin. • Gene Product name in: • Rate: Dystrophin. • Fruit fly: dystrophin.
Integration Approaches • Information Linkage Approach. • Data Warehousing Approach. • Mediation Approach.
Information Linkage • Data and records are statically linked in sources. • Links are maintained in a comprehensive index. • Query is navigated through sources via existing links.
Existing Systems • SRS – [Etzold & Argos, 1993] • Entrez – [www.ncbi.nlm.nih.gov/Entrez] • LinkDB – [Fujibuchi, et.al., 1997] • GeneCards – [Rebhan, et.al., 1997]
Data Warehousing • Sources are duplicated on a local server. • A uniform interface is built. • Queries are issued against the server.
Existing Systems • Genomic Unified Schema (GUS) System – [Davidson, et.al., 2001] • Atlas – [Shah, et.al., 2005]
Mediator Wrapper Wrapper Wrapper Data Source Data Source Data Source Mediation Mediator-wrapper Architecture 6
Mediation • Global schema for the data access. • No data duplication. • Query translation instead of data. • Mediator-wrappers architecture.
Existing Systems • TAMBIS – [Goble, et.al., 2001] • SIMS – [Arens, et.al, 2003] • DiscoveryLink – [Haas, et.al., 2001] • OPM – [Chen, et.al., 1995] • BACIIS – [Ben-Miled, et.al.,2004]
Notable Issues • Existing systems are limited in: • Scalability. • Sources transparency. • Hiding querying complexities.
Ontology • Conceptual framework for a structured representation of meaning, through a common vocabulary. • A system that describes concepts and their relationships. • More than a simple list of vocabulary.
Fruit Database Goods relation
ISA ISAspecific HasColor Object Phys. Object Abst. Object Color Fruit Red Apple Green Yellow Kiwi Red Apple Yellow Apple Fruit Ontology
Example Queries • Q1: Find the quantity of Kiwi Fruit. • (Kiwi which <isa (Fruit)>). • Q2: Find the quantity of Red apples. • (Red-apple) or • (apple which <hascolor (red)>)
Proposed Architecture Domain Ontology Graphical Query Generator Users Query Reformulator Result Merging Source Models Wrapper Wrapper Wrapper Wrapper Data Sources Email Textual Database XML
Objectives • Provide source transparency. • Preserve local autonomy. • Make integration scalable. • Hide the query processing Complexities.
LessonLearned • Exploring and establishing research area. • Not truly fullfil current challenges. • Organization specific Systems. • Web-based integration is younger. • Extension either as a whole or in parts.
References • Biomedical Integration - Survey • Thomes et.al. Integration of Biological Sources: Current Systems and Challenges Ahead, SIGMOD Record, Sept, 2004. • Köhler. Integration of Life Science Databases, Drug Discovery Today: BIOSILICO, 2(2), 2004. • Sujansky. Heterogeneous Database Integration in Biomedicine, J. Biomedical Informatics, 34, 2001. • Domain Ontology - General • Perez-Rey. Biomedical Ontologies in Post-Genomic Information Systems, BIBE – 2004, Taiwan. • Gardner. Ontologies and Semantic Data Integration, Drug Discovery Today: BIOSILICO, 10(14), 2004.