
Data Acquisition from Semantically Heterogeneous Biomedical Data Sources on the Internet

Dr Sharifullah Khan, School of Electrical Engineering and Computer Science (SEECS), National University of Sciences and Technology (NUST), H-12 Islamabad, Pakistan. sharifullah.khan@seecs.edu.pk


Presentation Transcript


  1. Data Acquisition from Semantically Heterogeneous Biomedical Data Sources on the Internet. Dr Sharifullah Khan, School of Electrical Engineering and Computer Science (SEECS), National University of Sciences and Technology (NUST), H-12 Islamabad, Pakistan. sharifullah.khan@seecs.edu.pk

  2. Presentation Outline • Introduction • Integration issues • Integration Approaches • Domain Ontology • Conclusion

  3. Public Biomedical Sources • Technological developments enable scientists to produce data rapidly. • A large number of sources are publicly available online. • For example:

  4. GenBank http://www.ncbi.nih.gov/Genbank/index.html

  5. Swiss-Prot http://www.expasy.org/sprot/

  6. MedLinePlus http://medlineplus.gov/

  7. OMIM http://www.nslij-genetics.org/search_omim.html

  8. DDBJ http://www.ddbj.nig.ac.jp/

  9. EMBL-EBI http://www.ebi.ac.uk/embl/

  10. Gallus gallus http://www.agbase.msstate.edu/

  11. Rat Genome Database http://rgd.mcw.edu/

  12. GeneDB http://old.genedb.org/genedb/pombe/

  13. WormBase http://www.wormbase.org/

  14. Mouse Genome Informatics http://www.geneontology.org/GO.refgenome.shtml#refsppdb

  15. Why Integrate • Sources are intermediaries between experimental observations and additional synthesis. • Integrating sources makes it possible to process queries that extract new knowledge, e.g.: • What are the functions of genes? • What variations exist in the genome? • How do variations give rise to illnesses? • What drugs have been discovered?

  16. Biomedical Queries • What genes cause the disease ‘Achondroplasia’? • What mutations have been found in the genes that cause the disease ‘Achondroplasia’? • What are the 3D structures of all alcohol dehydrogenases that belong to the enzyme family EC:1.1.1.1 and are located within the human chromosome region 4q21-4q23?

  17. A Neuroscientist’s Information Integration Problem • Query: What is the cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity? How about other rodents? • Sources to integrate (“Complex Multi-Worlds” mediation): protein localization (NCMIR), sequence info (CaPROT), neurotransmission (SENSELAB), morphometry (SYNAPSE).

  18. Data Integration

  19. Main Barriers • Large and dynamic data volume. • Diverse source foci. • Different querying capabilities. • Wide variety of data representations.

  20. Representational Heterogeneity • Structural Differences • Synonymy • Semantic Differences • Polysemy • Content Differences • Data Complexity

  21. Structural Differences • Structured databases: RDBMS, ODBMS. • Semi-structured databases: XML. • Unstructured databases: flat files, text. • Tools: BLAST, Entrez, PubMed. • Interoperability is a challenge. • Various querying capabilities.

  22. Synonymy • Distinct lexical terms denoting the same semantic objects. • Doctor versus Physician. • MRN versus Patient_ID.
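One common way to handle synonymy is a lookup table that maps every source-specific term to a single canonical term. The sketch below is only an illustration built from the two examples on this slide; the table contents, the choice of canonical terms, and the function names are assumptions, not part of the presented system.

```python
# Hypothetical synonym table: each source-specific term maps to one canonical term.
CANONICAL_TERMS = {
    "doctor": "physician",
    "physician": "physician",
    "mrn": "patient_id",
    "patient_id": "patient_id",
}


def canonical(term: str) -> str:
    """Return the canonical form of a term, or the normalized term if it is unknown."""
    key = term.strip().lower()
    return CANONICAL_TERMS.get(key, key)


def normalize_record(record: dict) -> dict:
    """Rename the fields of a source record to their canonical names."""
    return {canonical(field): value for field, value in record.items()}


# Two made-up records that describe the same fact with different field names.
source_a = {"MRN": "12345", "Doctor": "Dr. A"}
source_b = {"Patient_ID": "12345", "Physician": "Dr. A"}

# After normalization the two records compare equal.
same = normalize_record(source_a) == normalize_record(source_b)
```

A flat table like this only covers exact lexical synonyms; it does not address polysemy, where the same term needs context to be resolved.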

  23. Semantic Differences • The same identifier is used, but the data values have different meanings. • Blood-Culture-Growth: • No, moderate, and significant. • 0, 1+, 2+, 3+, and 4+. • List of synonyms: • GDB: Unigene Number. • GO: International Protein Index (IPI). • Swiss-Prot: Enzyme Commission Number.

  24. Polysemy • Data values with multiple meanings. • Examples: • FBN1 – Human fibrillin 1 gene. • Fbn1 – Mouse fibrillin 1 gene. • MFS1 – Human fibrillin 1 gene. • MFS1 – Marfan Syndrome.

  25. Content Difference • Data inconsistency. • Implicit Values: • Currency – Euro, dollar and Pound. • Derivable Values: • Date-of-birth and age. • Missing values.
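The derivable-value and implicit-value cases can be made concrete with a short sketch. It assumes one source stores a date of birth while another stores an age, and that each source records amounts in a single, undeclared currency; the source names and the currency assignments are hypothetical.

```python
from datetime import date


def age_on(date_of_birth: date, today: date) -> int:
    """Derive an age in whole years from a date of birth."""
    years = today.year - date_of_birth.year
    # Subtract one year if the birthday has not yet occurred this year.
    if (today.month, today.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years


# Hypothetical metadata: the currency each source uses implicitly.
SOURCE_CURRENCY = {"source_a": "EUR", "source_b": "USD"}


def with_currency(source: str, amount: float) -> tuple:
    """Make an implicit currency explicit by attaching it to the amount."""
    return (amount, SOURCE_CURRENCY[source])
```

For example, `age_on(date(2000, 6, 15), date(2024, 6, 14))` gives 23, because the birthday falls one day later.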

  26. Data Complexity • Gene names, gene product names, and gene function names. • The number of synonyms. • The non-intuitive nature of synonyms. • Name: dystrophin (muscular dystrophy, Duchenne and Becker type). • Abbreviation: dystrophin. • Gene product name in: • Rat: Dystrophin. • Fruit fly: dystrophin.

  27. Integration Approaches • Information Linkage Approach. • Data Warehousing Approach. • Mediation Approach.

  28. Information Linkage

  29. Information Linkage • Data and records are statically linked in sources. • Links are maintained in a comprehensive index. • Query is navigated through sources via existing links.
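Navigation over a link index can be sketched as a graph traversal. The index below is a toy, in-memory stand-in with placeholder record identifiers; real linkage systems maintain far larger indexes, and the breadth-first strategy here is an assumption chosen for simplicity.

```python
from collections import deque

# Hypothetical link index: record id -> ids of records it links to in other sources.
LINK_INDEX = {
    "disease:D1": ["gene:G1"],
    "gene:G1": ["protein:P1", "literature:L1"],
    "protein:P1": [],
    "literature:L1": [],
}


def follow_links(start: str, index: dict) -> list:
    """Navigate breadth-first from a starting record through the existing links."""
    visited = {start}
    order = [start]
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for target in index.get(current, []):
            if target not in visited:  # avoid revisiting records on cyclic links
                visited.add(target)
                order.append(target)
                queue.append(target)
    return order


reachable = follow_links("disease:D1", LINK_INDEX)
```

The query can only reach records for which a link already exists, which is the main limitation of a statically linked approach.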

  30. Existing Systems • SRS – [Etzold & Argos, 1993] • Entrez – [www.ncbi.nlm.nih.gov/Entrez] • LinkDB – [Fujibuchi et al., 1997] • GeneCards – [Rebhan et al., 1997]

  31. Data Warehousing

  32. Data Warehousing • Sources are duplicated on a local server. • A uniform interface is built. • Queries are issued against the server.
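The three steps on this slide can be sketched with an in-memory SQLite database standing in for the local server. The source extracts, field names, and table layout are all hypothetical; the point is only that the data is copied once into a uniform schema and every query then runs locally.

```python
import sqlite3

# Hypothetical extracts from two sources, each with its own field names.
SOURCE_A = [{"gene_symbol": "GENE1", "organism": "human"}]
SOURCE_B = [{"symbol": "GENE2", "species": "mouse"}]


def build_warehouse() -> sqlite3.Connection:
    """Duplicate both sources into one local table with a uniform schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE gene (symbol TEXT, organism TEXT, source TEXT)")
    rows = [(r["gene_symbol"], r["organism"], "A") for r in SOURCE_A]
    rows += [(r["symbol"], r["species"], "B") for r in SOURCE_B]
    conn.executemany("INSERT INTO gene VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def genes_for(conn: sqlite3.Connection, organism: str) -> list:
    """Uniform interface: query the local copy, never the original sources."""
    cur = conn.execute(
        "SELECT symbol FROM gene WHERE organism = ? ORDER BY symbol", (organism,)
    )
    return [row[0] for row in cur.fetchall()]


warehouse = build_warehouse()
human_genes = genes_for(warehouse, "human")
```

Because the copy is taken at load time, the warehouse must be refreshed whenever a source changes, which is costly for large and dynamic sources.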

  33. Existing Systems • Genomic Unified Schema (GUS) System – [Davidson et al., 2001] • Atlas – [Shah et al., 2005]

  34. Mediation: Mediator-wrapper Architecture • [Figure: a mediator sits above three wrappers, each wrapper fronting one data source.]

  35. Mediation • A global schema is used for data access. • No data duplication. • Queries are translated; data is not copied. • Mediator-wrapper architecture.
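A minimal sketch of the mediator-wrapper idea follows. It assumes a two-field global schema (`symbol`, `organism`) and two sources whose records are held in plain lists in place of remote databases; the class names, field mappings, and records are hypothetical and do not describe any of the systems cited in this talk.

```python
class Wrapper:
    """Translates a global-schema query into one source's local field names."""

    def __init__(self, name, records, field_map):
        self.name = name
        self._records = records      # stands in for a remote, autonomous source
        self._field_map = field_map  # global field name -> local field name

    def query(self, global_field, value):
        local_field = self._field_map[global_field]
        hits = [r for r in self._records if r.get(local_field) == value]
        # Translate the matching records back into the global schema.
        reverse = {local: glob for glob, local in self._field_map.items()}
        return [{reverse[k]: v for k, v in r.items() if k in reverse} for r in hits]


class Mediator:
    """Sends one global query to every wrapper and merges the answers."""

    def __init__(self, wrappers):
        self._wrappers = wrappers

    def query(self, global_field, value):
        results = []
        for wrapper in self._wrappers:
            for record in wrapper.query(global_field, value):
                results.append({**record, "source": wrapper.name})
        return results


source_a = Wrapper(
    "A",
    [{"gene_symbol": "GENE1", "organism": "human"}],
    {"symbol": "gene_symbol", "organism": "organism"},
)
source_b = Wrapper(
    "B",
    [{"sym": "GENE2", "species": "human"}, {"sym": "GENE3", "species": "mouse"}],
    {"symbol": "sym", "organism": "species"},
)
mediator = Mediator([source_a, source_b])
human = mediator.query("organism", "human")
```

The user sees only the global field names; each wrapper hides its source's local vocabulary, and no record is stored outside its source.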

  36. Existing Systems • TAMBIS – [Goble et al., 2001] • SIMS – [Arens et al., 2003] • DiscoveryLink – [Haas et al., 2001] • OPM – [Chen et al., 1995] • BACIIS – [Ben-Miled et al., 2004]

  37. Notable Issues • Existing systems are limited in: • Scalability. • Source transparency. • Hiding querying complexities.

  38. Ontology • Conceptual framework for a structured representation of meaning, through a common vocabulary. • A system that describes concepts and their relationships. • More than a simple list of vocabulary.

  39. Fruit Database Goods relation

  40. Fruit Ontology • [Figure: ontology graph with the relations ISA, ISAspecific, and HasColor over the concepts Object, Phys. Object, Abst. Object, Color, Fruit, Apple, Kiwi, Red Apple, Yellow Apple, Red, Green, and Yellow.]

  41. Example Queries • Q1: Find the quantity of Kiwi Fruit. • (Kiwi which <isa (Fruit)>). • Q2: Find the quantity of Red apples. • (Red-apple) or • (apple which <hascolor (red)>)
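The two queries can be sketched over a small in-memory version of the fruit ontology. The ISA edges below are one plausible reading of the ontology figure, and the color of Kiwi, the quantities in the Goods relation, and all function names are assumptions made for illustration.

```python
# Assumed ISA edges (child -> parent), read from the fruit ontology figure.
ISA = {
    "Phys. Object": "Object",
    "Abst. Object": "Object",
    "Fruit": "Phys. Object",
    "Color": "Abst. Object",
    "Apple": "Fruit",
    "Kiwi": "Fruit",
    "Red Apple": "Apple",
    "Yellow Apple": "Apple",
    "Red": "Color",
    "Green": "Color",
    "Yellow": "Color",
}

# Assumed HasColor edges; the color of Kiwi is a guess.
HAS_COLOR = {"Red Apple": "Red", "Yellow Apple": "Yellow", "Kiwi": "Green"}

# Hypothetical Goods relation: (item, quantity).
GOODS = [("Kiwi", 10), ("Red Apple", 5), ("Yellow Apple", 7)]


def is_a(concept, ancestor):
    """Follow ISA edges upward to test whether concept is a kind of ancestor."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = ISA.get(concept)
    return False


def total_quantity(matches):
    """Sum the quantities of all goods whose item satisfies the predicate."""
    return sum(qty for item, qty in GOODS if matches(item))


# Q1: (Kiwi which <isa (Fruit)>)
q1 = total_quantity(lambda item: item == "Kiwi" and is_a(item, "Fruit"))

# Q2: (apple which <hascolor (red)>)
q2 = total_quantity(lambda item: is_a(item, "Apple") and HAS_COLOR.get(item) == "Red")
```

With these made-up quantities, Q1 yields 10 and Q2 yields 5. The second query never names "Red Apple" directly; the ontology resolves it from the ISA and HasColor relations.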

  42. http://www.geneontology.org/

  43. http://www.nlm.nih.gov/research/umls/


  45. Proposed Architecture • [Figure: components are Users, Graphical Query Generator, Domain Ontology, Query Reformulator, Result Merging, Source Models, and four Wrappers over the Data Sources (Email, Textual, Database, XML).]

  46. Objectives • Provide source transparency. • Preserve local autonomy. • Make integration scalable. • Hide query-processing complexities.

  47. Lessons Learned • The research area is still being explored and established. • Existing systems do not truly fulfil current challenges. • Systems are organization-specific. • Web-based integration is younger. • Extension either as a whole or in parts.

  48. References • Biomedical Integration - Survey • Thomes et al. Integration of Biological Sources: Current Systems and Challenges Ahead, SIGMOD Record, Sept. 2004. • Köhler. Integration of Life Science Databases, Drug Discovery Today: BIOSILICO, 2(2), 2004. • Sujansky. Heterogeneous Database Integration in Biomedicine, J. Biomedical Informatics, 34, 2001. • Domain Ontology - General • Perez-Rey. Biomedical Ontologies in Post-Genomic Information Systems, BIBE 2004, Taiwan. • Gardner. Ontologies and Semantic Data Integration, Drug Discovery Today: BIOSILICO, 10(14), 2004.
