
TcruziKB: Enabling Complex Queries for Genomic Data Exploration

TcruziKB: Enabling Complex Queries for Genomic Data Exploration. Pablo N. Mendes Bobby McKnight Amit P. Sheth Jessica C. Kissinger ICSC - August, 2008. Summary. “An analytical environment for data exploration of interconnected data in genome projects” Ontology-aided data integration



Presentation Transcript


  1. TcruziKB: Enabling Complex Queries for Genomic Data Exploration Pablo N. Mendes Bobby McKnight Amit P. Sheth Jessica C. Kissinger ICSC - August, 2008

  2. Summary “An analytical environment for data exploration of interconnected data in genome projects” • Ontology-aided data integration • Knowledge-driven query formulation • Complex query execution • Multi-perspective visualization

  3. Data Integration • Data Sources: • Genes, Proteins (+Enzymes), Protein Families/Clans • Protein Expression, Life Cycle Stages • Analyses results (OrthoMCL) • Multiple sources cite the same object (e.g. Gene). How to integrate? • Data formats: • OWL-based: ComGO, EnzyO, GO, SO • RDBMS-based GUS Schema • XML-based Interpro • Flat-file-based Pfam and OrthoMCL • Different formats provided by each data source. How to process?

  4. 1. Data Integration • Distribution: • “Warehousing” approach: download and process • Web services approach: ask for data at query time • Heterogeneity: • Off-the-shelf packages such as BioJava help • Most of the work is mapping fields to ontology classes • Identity resolution: • Multiple sources cite the same object (e.g. gene) • Some help from OWL framework, e.g. Inverse Functional Properties (IFPs) • Two instances are the same if their IFP values are the same • More elaborate analysis with bioinformatics algorithms (e.g. BLAST)
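
The IFP rule on this slide ("two instances are the same if their IFP values are the same") can be sketched in a few lines. This is a minimal illustration, not the TcruziKB implementation: the property name `ex:locus_tag`, the instance identifiers and the locus values are all hypothetical, and the grouping uses a plain union-find rather than an OWL reasoner.

```python
from collections import defaultdict


def resolve_identities(triples, ifps):
    """Group subjects that share a value for any inverse functional property."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_seen = {}  # (property, value) -> first subject that carried it
    for s, p, o in triples:
        find(s)  # register every subject, even ones without an IFP
        if p in ifps:
            key = (p, o)
            if key in first_seen:
                union(first_seen[key], s)
            else:
                first_seen[key] = s

    groups = defaultdict(set)
    for s in parent:
        groups[find(s)].add(s)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical data: two sources describe the same gene under different ids.
TRIPLES = [
    ("tcruzidb:gene1", "ex:locus_tag", "locus-1"),
    ("pfam:entryA", "ex:locus_tag", "locus-1"),
    ("tcruzidb:gene2", "ex:locus_tag", "locus-2"),
    ("tcruzidb:gene2", "ex:label", "hypothetical protein"),
    ("interpro:entryB", "ex:label", "hypothetical protein"),
]
IFPS = {"ex:locus_tag"}  # ex:label is NOT inverse functional, so it never merges

merged = resolve_identities(TRIPLES, IFPS)
```

Only `ex:locus_tag` triggers a merge here; sharing an ordinary label is not treated as evidence of identity, which is the point of restricting the rule to IFPs.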

  5. 2. Knowledge-driven query formulation • Step 1: search for initial term • Step 2: choose relationship • Step 3: search functions by typing their name • Step 4: select function X • Step 5: choose new line • [Figure labels: so:gene, part_of_genome, function X, ?any_gene, so:genome, ?genome] • Visual query building, in a “semantic trail” style • Inspired in part by Yahoo! Search Assist and Google Suggest • e.g. “Find all genes that are of function X in genome Y1”
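
One way to picture the "semantic trail" is as a list of triple patterns that is serialized into SPARQL once the user finishes building it. The sketch below assumes that reading; the slides do not show how TcruziKB generates its queries. The `ex:` prefix, the `so:` IRI, and the names `ex:has_function`, `ex:functionX` and `ex:genomeY1` are placeholders invented for the example.

```python
def trail_to_sparql(prefixes, select_vars, patterns):
    """Serialize a list of (subject, predicate, object) patterns as a SELECT query."""
    header = [f"PREFIX {name}: <{iri}>" for name, iri in prefixes.items()]
    body = [f"  {s} {p} {o} ." for s, p, o in patterns]
    select = "SELECT " + " ".join(select_vars) + " WHERE {"
    return "\n".join(header + [select] + body + ["}"])


PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "so": "http://example.org/so#",      # placeholder IRI, not the real SO namespace
    "ex": "http://example.org/tcruzi#",  # placeholder IRI
}

# "Find all genes that are of function X in genome Y1", one pattern per trail step.
TRAIL = [
    ("?any_gene", "rdf:type", "so:gene"),
    ("?any_gene", "ex:has_function", "ex:functionX"),
    ("?any_gene", "ex:part_of_genome", "ex:genomeY1"),
]

query = trail_to_sparql(PREFIXES, ["?any_gene"], TRAIL)
```

Keeping the trail as data until the last moment means each "choose new line" step only appends a pattern, and the query text is regenerated from scratch.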

  6. Knowledge Driven Query Formulation • User can start a query from any standpoint • Support for inverse relationships • Gene codes Protein • Protein coded by Gene • Able to navigate RDF, RDFS and (some) OWL • The “suggest” functions can be different for each language • RDFS (rdfs:domain, rdfs:range): SELECT ?range { ex:prop rdfs:range ?range } • RDF: SELECT ?cls { ?a ex:prop ?i . ?i rdf:type ?cls }
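
The two "suggest" queries on this slide differ in where they look: the RDFS one reads the declared range of a property, the RDF one infers candidate classes from the instances that actually appear as objects. A minimal in-memory sketch of both follows, using plain tuples instead of a triple store; the `ex:` names are hypothetical.

```python
def suggest_by_schema(triples, prop):
    """RDFS strategy: classes declared as rdfs:range of the property."""
    return sorted({o for s, p, o in triples if s == prop and p == "rdfs:range"})


def suggest_by_instances(triples, prop):
    """RDF strategy: classes of the instances used as objects of the property."""
    objects = {o for s, p, o in triples if p == prop}
    return sorted({o for s, p, o in triples if p == "rdf:type" and s in objects})


# Hypothetical data: ex:codes has a declared range, ex:part_of does not.
TRIPLES = [
    ("ex:codes", "rdfs:range", "ex:Protein"),
    ("ex:gene1", "ex:codes", "ex:protein1"),
    ("ex:protein1", "rdf:type", "ex:Protein"),
    ("ex:gene1", "ex:part_of", "ex:genome1"),
    ("ex:genome1", "rdf:type", "ex:Genome"),
]

schema_hits = suggest_by_schema(TRIPLES, "ex:codes")
instance_hits = suggest_by_instances(TRIPLES, "ex:part_of")
```

The instance-based strategy still yields a suggestion for `ex:part_of`, which has no schema declaration, so it can serve as a fallback for datasets that ship without RDFS.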

  7. Advanced Input Collections: (Amino acid) Sequence specified by user:

  8. 3. Query execution • Queries to multiple SPARQL endpoints supported. • SPARQL endpoints can be specified at query time • Supports execution of Web services as part of a query solution
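
Sending one query to several endpoints can be sketched as below, assuming each endpoint accepts the query as a `query` parameter on an HTTP GET (as in the SPARQL Protocol) and returns the SPARQL JSON results format. The endpoint URLs and the sample responses are made up, no network call is performed, and the duplicate-dropping merge is one possible policy rather than the one TcruziKB uses.

```python
import json
from urllib.parse import urlencode


def build_request_url(endpoint, query):
    """URL for a SPARQL Protocol GET request against one endpoint."""
    return endpoint + "?" + urlencode({"query": query})


def merge_bindings(result_documents):
    """Flatten SPARQL JSON result documents into rows, dropping exact duplicates."""
    merged, seen = [], set()
    for document in result_documents:
        for binding in json.loads(document)["results"]["bindings"]:
            row = {var: cell["value"] for var, cell in binding.items()}
            key = tuple(sorted(row.items()))
            if key not in seen:
                seen.add(key)
                merged.append(row)
    return merged


QUERY = "SELECT ?function WHERE { :geneK gs:has_function ?function }"
# Hypothetical endpoints, chosen by the user at query time.
ENDPOINTS = ["http://endpoint-a.example/sparql", "http://endpoint-b.example/sparql"]
urls = [build_request_url(endpoint, QUERY) for endpoint in ENDPOINTS]

# Hand-written stand-ins for the two HTTP responses.
RESPONSE_A = json.dumps({"head": {"vars": ["function"]}, "results": {"bindings": [
    {"function": {"type": "uri", "value": "http://example.org/f1"}}]}})
RESPONSE_B = json.dumps({"head": {"vars": ["function"]}, "results": {"bindings": [
    {"function": {"type": "uri", "value": "http://example.org/f1"}},
    {"function": {"type": "uri", "value": "http://example.org/f2"}}]}})

rows = merge_bindings([RESPONSE_A, RESPONSE_B])
```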

  9. On Demand Service Execution Using Jena Property Functions • Query: SELECT ?function WHERE { :geneK gs:has_function ?function } • Lookup “has_function” among the registered property functions: (list_member, com.hp.hpl.contains) (has_function, edu.uga.bmb.functionpredictor) ... (has_similarity, org.knoesis.blastinvoker) • Yes! Execute “functionpredictor” • Inputs: query context (geneK, ?function) • Results found by the executed class are bound back into the query context via the variable ?function
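
The dispatch idea on this slide can be illustrated without Jena: keep a table from predicates to callables, and when a triple pattern's predicate is in the table, run the callable instead of looking up stored triples. The sketch below is a Python analogue of that behaviour, not the Jena property function API; `function_predictor` is a stand-in for the web service, and it only handles patterns with a bound subject and a variable object.

```python
def function_predictor(subject):
    """Stand-in for a remote prediction service (hypothetical output)."""
    return [f"predicted_function_of_{subject}"]


# Predicate -> callable, mirroring the (predicate, class) pairs on the slide.
REGISTRY = {"gs:has_function": function_predictor}


def match(pattern, stored_triples, registry):
    """Return variable bindings for a (bound subject, predicate, ?variable) pattern."""
    subject, predicate, variable = pattern
    if predicate in registry:
        # On-demand execution: results are bound to the pattern's variable.
        return [{variable: value} for value in registry[predicate](subject)]
    # Ordinary evaluation against stored data.
    return [{variable: o} for s, p, o in stored_triples if s == subject and p == predicate]


STORED = [(":geneK", "gs:part_of_genome", ":genome1")]

computed = match((":geneK", "gs:has_function", "?function"), STORED, REGISTRY)
looked_up = match((":geneK", "gs:part_of_genome", "?genome"), STORED, REGISTRY)
```

From the query author's side the two patterns look identical; only the registry decides whether the answer is stored or computed.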

  10. 4. Multi-perspective visualization • Multi-perspective exploration of results • Graph style brings relationships to the foreground • Chart style summarizes at a glance • Tabular explorer with filtering and re-sorting • Does not only show results, but lets the user interact with them: • Expand/Contract nodes in graph • Filtering and re-sorting in table

  11. Displaying results

  12. Evaluation • Panel (30 people): • CS students (20), Biology students (10), Bioinfo (5) • Questions: “Which genes have protein expression during a parasite life-cycle stage that is in the human host?” “What are the relationships between gene Tc00.1047053409117.20 and any gene that has protein family PF03645?” • Subjective evaluation (SUS) • Objective evaluation • Number of clicks used to obtain answer • Time spent
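
For readers unfamiliar with the SUS questionnaire named above, its standard scoring rule is short enough to show: ten items answered on a 1 to 5 scale, odd-numbered items contribute (response - 1), even-numbered items contribute (5 - response), and the sum is multiplied by 2.5 to give a 0 to 100 score. The responses below are illustrative only; the slides do not report the panel's answers or scores.

```python
def sus_score(responses):
    """Standard SUS score (0-100) for one participant's ten responses."""
    if len(responses) != 10 or any(r not in range(1, 6) for r in responses):
        raise ValueError("SUS needs ten responses, each between 1 and 5")
    total = 0
    for index, response in enumerate(responses, start=1):
        # Odd items are positively worded, even items negatively worded.
        total += (response - 1) if index % 2 == 1 else (5 - response)
    return total * 2.5


# Illustrative answers, not data from the evaluation.
neutral = sus_score([3] * 10)
best_case = sus_score([5, 1] * 5)
```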

  13. Conclusions • Flexibility • Queries from any standpoint • Plug-and-play of datasets • Easy addition of other visualization widgets • Easy addition of other query interfaces • Mashup-ready!!! • Powerful queries • Allows queries for relationships • Ready to allow path queries in near future

  14. Conclusions • Limitations • “Mashups” of large amounts of data at client side are demanding • Simple server-side solution is also provided, but more sophisticated handling is desirable • Computationally intensive services (long execution time) require more sophisticated logging and monitoring

  15. Acknowledgements • People • Sena Arpinar (CSBL/UGA) • Mark Heiges, Fernan Aguero, Chih-Horng Kuo and the TcruziDB team • Maciej Janik (LSDIS) • Matt Eavenson (LSDIS) • Grants • American Heart Association award 0330338N to Jessica C. Kissinger • NIH-NHLBI funded "Semantics and Services enabled Problem Solving Environment for T.cruzi" (1R01HL087795-01A1)

  16. Thank you! Questions? Pablo Mendes (mendes2@wright.edu) Bobby McKnight (mcknight@cs.uga.edu) Amit Sheth (amit.sheth@wright.edu) Jessica Kissinger (jkissing@uga.edu)
