
Data integration via XML


Presentation Transcript


  1. Data integration via XML
     Ela Hunt, John Wilson, Vangelis Pafilis, Inga Tulloch
     http://xtect.cis.strath.ac.uk/

  2. Overview
     • Four biological scenarios of data integration
     • Data integration: problem definition
     • XTECT indexing approach
     • Literature review
     • Current status and further work
     Hunt, Wilson, Pafilis and Tulloch, Glasgow

  3. Scenario 1: Cardiovascular Functional Genomics
     • AIM: discover genes causing hypertension
     • Rat animal models of hypertension (rat strains which suffer from stroke)
     • Microarrays are used to compare gene expression in sick and healthy rats; typically 100-400 genes are differentially expressed
     • Microarray results are visualised on maps, and data are interpreted using public web databases (browsing and querying)

  4. SyntenyVista

  5. Scenario 2: Mouse mammary gland development as a model of cancer proliferation
     • AIM: find genes active in cancer growth
     • Take mouse samples and apply to a microarray slide
     • Measure trends in gene expression, identify 400 genes of interest
     • Use public web databases to interpret information on 400 genes (interpreting 100 genes took 6 months, and the information is now out of date)

  6. Scenario 3: Rat model of schizophrenia
     • AIM: understand which genes are expressed during schizophrenia
     • Rats have symptoms of schizophrenia after a chemical treatment (2 models are used)
     • Measure gene expression in two models
     • Interpret data on 250 genes: check whether microarray probes correspond to genes by using BLAST (DNA sequence comparison) and PubMed (bibliographic database)
     • Gather DNA sequences for real genes from Ensembl (BLAST hits), design probes

  7. Scenario 4: Proteomics
     • AIM: understand and record protein functions
     • Case 1: study the proteome of Trypanosoma brucei. For all proteins identified, find information on the web which might shed light on their function
     • Case 2: interpret data on human proteins differentially expressed in human cells invaded by Toxoplasma gondii
     • Compare protein and gene expression
     • Use SwissProt, PubMed, GeneOntology and any other web resources

  8. Problem definition
     • Given a large microarray or proteomics experiment (a list of gene names or peptide masses)
     • Find all known information about those genes or proteins on the web
     • Make this information accessible

  9. What we expect to achieve
     • Query: table of names
     • Result 1: table of integrated information
     • Result 2: map of probes and synteny
     • Result 3: clusters based on the number of relevant query terms found

  10. Use item matching (XML leaves) to start
     • Match starting from leaves and extend towards the schemas expressed as paths
     • Use database techniques: indexing
     • Use data mining techniques: get statistics on data

  11. More detail
     • Index all paths and leaves in XML trees for a representative set of biological databases
     • Relational technology
     • Warehouse
     • Match leaves (data values)
     • Find path overlaps => remove redundancies in data
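The path-and-leaf indexing step on slide 11 can be sketched roughly as follows. This is a minimal illustration, not the XTECT implementation: the function names, the in-memory dictionary standing in for relational tables, and the sample record are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def index_leaves(source_name, xml_text):
    """Return (path, value) pairs for every leaf element that carries text.

    Paths are prefixed with a source name (hypothetical convention) so that
    leaves from different databases can later be compared.
    """
    root = ET.fromstring(xml_text)
    rows = []

    def walk(elem, path):
        current = f"{path}/{elem.tag}"
        children = list(elem)
        if not children:
            text = (elem.text or "").strip()
            if text:
                rows.append((current, text))
        for child in children:
            walk(child, current)

    walk(root, source_name)
    return rows


def build_value_index(rows):
    """Map each lower-cased leaf value to the set of paths where it occurs."""
    index = defaultdict(set)
    for path, value in rows:
        index[value.lower()].add(path)
    return index


# Made-up record for illustration only; not a real Swiss-Prot entry.
SAMPLE = (
    "<entry><accession>P00001</accession>"
    "<gene><name>agene1</name></gene></entry>"
)

rows = index_leaves("SwissProt", SAMPLE)
value_index = build_value_index(rows)
print(rows)
print(dict(value_index))
```

In a warehouse setting the `(path, value)` rows would presumably be loaded into an indexed relational table rather than kept in a dictionary; the dictionary is used here only to keep the sketch self-contained.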

  12. First problem solved: query expansion
     • 30K human, 30K rat, and 30K mouse genes, some of which have synonyms
     • Query expansion to include the synonyms
     • Prototype in Java, 300 ms for synonym lookup
     • Same idea as in GeneCards, which focuses on human data
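Query expansion with synonyms, as described on slide 12, might look like the sketch below. The slides mention a Java prototype; this Python version is only an illustration, and the synonym groups, function names, and the assumption that groups do not overlap are all hypothetical.

```python
def build_synonym_table(groups):
    """Map every name (lower-cased) to the full set of names in its group.

    Assumes the groups are disjoint; a name appearing in two groups would
    simply keep the last group seen.
    """
    table = {}
    for group in groups:
        names = frozenset(name.lower() for name in group)
        for name in names:
            table[name] = names
    return table


def expand_query(names, table):
    """Return the sorted union of each query name and its known synonyms."""
    expanded = set()
    for name in names:
        key = name.lower()
        # Names with no known synonyms are passed through unchanged.
        expanded |= table.get(key, {key})
    return sorted(expanded)


# Made-up synonym groups for illustration only.
GROUPS = [
    ["agene1", "ag1", "AG-1"],
    ["agene2", "ag2"],
]

table = build_synonym_table(GROUPS)
result = expand_query(["AGENE1", "unknown1"], table)
print(result)
```

A hash-table lookup like this is one simple way to keep expansion time independent of the number of genes; whether the prototype works this way is not stated in the slides.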

  13. Second: indexing XML
     • Medline (40 GB) in XML (bibliographic)
     • SwissProt + Trembl, 1 GB in XML (proteins)
     • OMIM and HUGO databases of genes, small (human diseases and human genes)
     • Affymetrix microarray files for the mouse, small, XML
     • Ensembl: no XML files, access via MySQL (human, mouse, rat genomes and predicted genes)
     • Mouse Genome MGD: direct access to Sybase, no XML
     • Rat database RGD: stores little data!
     • Gene Ontology: around 1 GB in XML

  14. Paths and tags are indexed using integer encoding, preserving XML order
     • Indexing of Medline and OMIM needs to be resolved (text + XML)
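Slide 14 does not say which integer encoding is used. One possible reading, shown here purely as an assumption, is to give each distinct tag a small integer, represent a path as a tuple of tag integers, and number nodes in document (pre-order) so that the original XML order can be recovered by sorting.

```python
import xml.etree.ElementTree as ET


def encode_document(xml_text):
    """Encode tags as integers and record nodes in document order.

    Returns (tag_ids, rows) where each row is
    (document_order, encoded_path, text). The scheme is a hypothetical
    illustration, not the encoding used by the authors.
    """
    root = ET.fromstring(xml_text)
    tag_ids = {}
    rows = []

    def walk(elem, prefix):
        # Assign the next free integer the first time a tag is seen.
        tag_id = tag_ids.setdefault(elem.tag, len(tag_ids) + 1)
        path = prefix + (tag_id,)
        rows.append((len(rows) + 1, path, (elem.text or "").strip()))
        for child in elem:
            walk(child, path)

    walk(root, ())
    return tag_ids, rows


# Made-up document for illustration only.
SAMPLE = (
    "<entry><name>agene1</name>"
    "<ref><id>12345</id></ref>"
    "<name>agene2</name></entry>"
)

tag_ids, rows = encode_document(SAMPLE)
for row in rows:
    print(row)
```

Because repeated tags share one integer, the two `name` elements receive the same encoded path and are distinguished only by their document-order number.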

  15. How the index will work
     • PubMed record: accession = 12345; abstract = ".. interactions of agene1 with agene2 ..."
     • Swiss-Prot record: PubMedID = 12345; GeneName = agene1
     • Resulting correspondences: Swiss-Prot/PubMedID ~ PubMed/accession; Swiss-Prot/GeneName ~ PubMed/abstract

  16. Matching
     • Db1/path1/socs3 and Db2/path2/socs3 => synonymous paths
     • Get statistics for full and partial path matches and postulate schema matches
     • Manually inspect the matched paths, and examine support for each path match
     • Automate the procedure
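The leaf-based matching on slide 16 can be illustrated with a small sketch that counts, for each pair of paths from different databases, how many leaf values they share. The count serves as a crude "support" figure for postulating that two paths are synonymous. The function name, the convention that the first path segment names the database, and the sample rows are assumptions; partial path matches and text containment (as in the abstract example on slide 15) are not handled.

```python
from collections import defaultdict
from itertools import combinations


def path_match_support(leaf_rows):
    """Count shared leaf values for each pair of paths in different databases.

    leaf_rows is an iterable of (path, value) pairs, where the first
    segment of each path is taken to be the database name.
    """
    paths_by_value = defaultdict(set)
    for path, value in leaf_rows:
        paths_by_value[value.lower()].add(path)

    support = defaultdict(int)
    for paths in paths_by_value.values():
        for a, b in combinations(sorted(paths), 2):
            # Only cross-database pairs suggest a schema correspondence.
            if a.split("/")[0] != b.split("/")[0]:
                support[(a, b)] += 1
    return dict(support)


# Made-up leaf rows for illustration only.
ROWS = [
    ("Db1/path1", "socs3"),
    ("Db2/path2", "socs3"),
    ("Db1/path1", "agene1"),
    ("Db2/path2", "agene1"),
    ("Db2/other", "agene1"),
]

support = path_match_support(ROWS)
for pair, count in sorted(support.items(), key=lambda item: -item[1]):
    print(pair, count)
```

Pairs with high support would be candidates for the manual inspection step listed on the slide; a threshold for accepting a match is left open here.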

  17. Architecture (diagram; labels only)
     • Data replicas: PubMed, Sprot, Affy, OMIM, Hugo
     • INTERACTION: microarray experiment, proteomics experiment, visualisation, list of names
     • PROCESSING LAYER: synonym expander, XML tree merger, XML tree finder
     • INDEX and WAREHOUSE: gene trees, XML, mapping generation and lookup

  18. Status
     • Mirroring external XML data
     • Query expansion is implemented
     • Software to XMLise OMIM and some of the MGD
     • Testing indexing software for loading into Oracle
     • Designing an algorithm for data mining
     • Developing ideas on adding sequence comparison and text retrieval, and connecting to visualisation tools (collaboration with e-Science project BRIDGES)

  19. THE VISION (diagram; labels only): to tabular summaries, to multiple alignment, to sequence

  20. Other work
     • Schema-based approaches: look at the schemas to find mappings between them
        • use constraints, tree shape, some data
        • involve the user/programmer: YATL, Clio, REVERE
     • Data-based approaches: look at data values in order to find mappings between attributes
        • ML approaches are inefficient, all-against-all
     • Problems:
        • Expensive in terms of labour (programmer or user)
        • Only very similar schemas can be matched
        • Not scalable

  21. Recent papers
     • Kurgan et al., 2002: machine learning for schema matching (2 very similar schemas)
     • Doan et al., VLDBJ03: machine learning, 2 semi-structured schemas (ontologies), schemas + some data
     • Chua et al., VLDBJ03: (RDBMS) given entity matches (table names), match attributes (values), based on a variety of statistical tests
     • Halevy et al., CIDR-2003: user-driven schema matching by example, and mapping by transitivity (no algorithm has been given)

  22. Summary
     • Aim: to overcome the problems of manual or schema-based mapping approaches, which are expensive
     • Scale up, take into account data values
     • Provide a digest of information for a list of gene/protein names of interest
     • Using XML and relational indexes

  23. Collaborators at Glasgow
     • Barry Gusterson, Andy Jones, Torsten Stein, Inga Tulloch, Catherine Winchester, Anna F. Dominiczak, Neil Hanlon, BRIDGES project (uses DB2), Vangelis Pafilis, John Wilson
     • FUNDING: Carnegie Trust for the Universities of Scotland, Medical Research Council (UK), Royal Society, Synergy
