TcruziKB: Enabling Complex Queries for Genomic Data Exploration • Pablo N. Mendes, Bobby McKnight, Amit P. Sheth, Jessica C. Kissinger • ICSC, August 2008
Summary “An analytical environment for data exploration of interconnected data in genome projects” • Ontology-aided data integration • Knowledge-driven query formulation • Complex query execution • Multi-perspective visualization
Data Integration • Data sources: • Genes, Proteins (+Enzymes), Protein Families/Clans • Protein Expression, Life Cycle Stages • Analysis results (OrthoMCL) • Multiple sources cite the same object (e.g. a gene). How to integrate? • Data formats: • OWL-based: ComGO, EnzyO, GO, SO • RDBMS-based: GUS Schema • XML-based: InterPro • Flat-file-based: Pfam and OrthoMCL • Each data source provides a different format. How to process?
1. Data Integration • Distribution: • “Warehousing” approach: download and process • Web services approach: ask for data at query time • Heterogeneity: • Off-the-shelf packages such as BioJava help • Most of the work is mapping fields to ontology classes • Identity resolution: • Multiple sources cite the same object (e.g. gene) • Some help from the OWL framework, e.g. Inverse Functional Properties (IFPs) • Two instances are the same if their IFP values are the same • More elaborate analysis with bioinformatics algorithms (e.g. BLAST)
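The IFP rule above ("two instances are the same if their IFP values are the same") can be sketched in a few lines. This is a minimal illustration, not the TcruziKB implementation: the property name `ex:locusTag`, the instance identifiers and the record layout are all hypothetical.

```python
from collections import defaultdict

def merge_by_ifp(instances, ifp):
    """Group instance IDs that share a value for an inverse functional property."""
    by_value = defaultdict(list)
    for inst_id, props in instances.items():
        value = props.get(ifp)
        if value is not None:
            by_value[value].append(inst_id)
    # Only values shared by two or more instances indicate the same object.
    return [sorted(ids) for ids in by_value.values() if len(ids) > 1]

# Hypothetical records from two sources; "ex:locusTag" is an assumed IFP.
instances = {
    "sourceA:gene1": {"ex:locusTag": "LOCUS_001", "label": "gene one"},
    "sourceB:g-17": {"ex:locusTag": "LOCUS_001"},
    "sourceB:g-18": {"ex:locusTag": "LOCUS_002"},
}
same = merge_by_ifp(instances, "ex:locusTag")
print(same)  # [['sourceA:gene1', 'sourceB:g-17']]
```

Records without the IFP are left unmerged here; those are the cases where a sequence-based comparison (e.g. BLAST) would have to take over.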
2. Knowledge-driven query formulation • Step 1: search for initial term • Step 2: choose relationship • Step 3: search functions by typing their name • Step 4: select function X • Step 5: choose new line • [Figure: query builder with two lines: ?any_gene (so:gene) function X; ?any_gene (so:gene) part_of_genome ?genome (so:genome)] • Visual query building, in a “semantic trail” style • Inspired in part by Yahoo! Search Assist and Google Suggest • e.g. “Find all genes that are of function X in genome Y1”
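One way to picture what the visual builder produces: each line of the trail is a triple pattern, and the lines together form a SPARQL query. The sketch below assumes this mapping; the `ex:` and `so:` prefix IRIs, the `ex:has_function` property and `ex:functionX` are placeholders, not names taken from TcruziKB.

```python
def trail_to_sparql(select_vars, patterns, prefixes):
    """Serialize a list of (subject, predicate, object) patterns as a SPARQL query."""
    lines = [f"PREFIX {name}: <{iri}>" for name, iri in prefixes.items()]
    lines.append("SELECT " + " ".join(select_vars))
    lines.append("WHERE {")
    for s, p, o in patterns:
        lines.append(f"  {s} {p} {o} .")
    lines.append("}")
    return "\n".join(lines)

# Placeholder namespaces (assumptions for illustration only).
prefixes = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "so": "http://example.org/so#",
    "ex": "http://example.org/tcruzi#",
}
# The two trail lines from the slide, plus the type constraints they imply.
patterns = [
    ("?any_gene", "rdf:type", "so:gene"),
    ("?any_gene", "ex:has_function", "ex:functionX"),
    ("?any_gene", "ex:part_of_genome", "?genome"),
    ("?genome", "rdf:type", "so:genome"),
]
query = trail_to_sparql(["?any_gene", "?genome"], patterns, prefixes)
print(query)
```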
Knowledge-driven query formulation • User can start a query from any standpoint • Support for inverse relationships • Gene codes Protein • Protein coded by Gene • Able to navigate RDF, RDFS and (some) OWL • The “suggest” functions can be different for each language • RDFS (rdfs:domain, rdfs:range): SELECT ?range { ex:prop rdfs:range ?range } • RDF: SELECT ?cls { ?a ex:prop ?i . ?i rdf:type ?cls }
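The two "suggest" queries differ in where they look: the RDFS one reads the declared range from the schema, while the RDF one infers candidate classes from the types of objects actually used with the property. The following sketch mimics both over an in-memory list of triples; the triples and `ex:` names are invented for illustration, and a real system would run the SPARQL queries against a store instead.

```python
RDF_TYPE = "rdf:type"
RDFS_RANGE = "rdfs:range"

def suggest_from_schema(triples, prop):
    """RDFS style: classes declared as the rdfs:range of the property."""
    return sorted({o for s, p, o in triples if s == prop and p == RDFS_RANGE})

def suggest_from_instances(triples, prop):
    """RDF style: classes of the objects that appear with the property."""
    objects = {o for s, p, o in triples if p == prop}
    return sorted({o for s, p, o in triples if p == RDF_TYPE and s in objects})

# Hypothetical data: ex:codes has a declared range, ex:has_function does not.
triples = [
    ("ex:codes", RDFS_RANGE, "ex:Protein"),
    ("ex:gene1", "ex:codes", "ex:prot1"),
    ("ex:prot1", RDF_TYPE, "ex:Protein"),
    ("ex:gene2", "ex:has_function", "ex:fn1"),
    ("ex:fn1", RDF_TYPE, "ex:Function"),
]
print(suggest_from_schema(triples, "ex:codes"))            # ['ex:Protein']
print(suggest_from_schema(triples, "ex:has_function"))     # []
print(suggest_from_instances(triples, "ex:has_function"))  # ['ex:Function']
```

The instance-based variant still yields suggestions when a dataset ships without schema declarations, at the cost of scanning instance data.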
Advanced Input Collections: (Amino acid) Sequence specified by user:
3. Query execution • Queries to multiple SPARQL endpoints supported. • SPARQL endpoints can be specified at query time • Supports execution of Web services as part of a query solution
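Because endpoints can be chosen at query time, the same query text can be sent to several of them. Under the SPARQL Protocol, a query may be passed in the `query` parameter of an HTTP GET request; the sketch below only builds the request URLs and does not contact any server. Both endpoint URLs and the query are hypothetical.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_request_url(endpoint, query):
    """Build a SPARQL Protocol GET URL carrying the query in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})

# Hypothetical endpoints selected by the user at query time.
endpoints = [
    "http://example.org/tcruzi/sparql",
    "http://example.net/pfam/sparql",
]
query = "SELECT ?g WHERE { ?g a <http://example.org/Gene> } LIMIT 5"
urls = [build_request_url(endpoint, query) for endpoint in endpoints]

# Round-trip check: decoding the URL recovers the original query text.
decoded = parse_qs(urlsplit(urls[0]).query)["query"][0]
print(decoded == query)  # True
```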
On-Demand Service Execution Using Jena Property Functions • Query: SELECT ?function WHERE { :geneK gs:has_function ?function } • Registry of property functions: (list_member, com.hp.hpl.contains) (has_function, edu.uga.bmb.functionpredictor) ... (has_similarity, org.knoesis.blastinvoker) • Look up “has_function” in the registry. Found? Yes! Execute “functionpredictor” • Inputs: query context, geneK, ?function • Results found by the executed class are bound back into the query context via the variable ?function
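The dispatch logic on this slide can be modelled independently of Jena: a registry maps predicates to executable handlers, and a triple pattern whose predicate is registered is answered by running the handler rather than by looking up stored triples. The sketch below is a plain-Python model of that idea and does not use or reproduce Jena's property function API; `predict_function` is a made-up stand-in for a real prediction service.

```python
def predict_function(gene):
    """Stand-in for a remote function-prediction service (hypothetical)."""
    return [f"predicted-function-of-{gene}"]

# Predicate -> handler, analogous to the (has_function, functionpredictor) entry.
REGISTRY = {"gs:has_function": predict_function}

def solve_pattern(subject, predicate, variable, stored_triples):
    """Return variable bindings for the pattern (subject, predicate, ?variable)."""
    handler = REGISTRY.get(predicate)
    if handler is not None:
        # Registered predicate: execute the service on demand.
        values = handler(subject)
    else:
        # Ordinary predicate: match against stored data.
        values = [o for s, p, o in stored_triples if s == subject and p == predicate]
    # Bind each result back into the query context via the variable.
    return [{variable: value} for value in values]

stored = [(":geneK", "gs:part_of_genome", ":genome1")]
print(solve_pattern(":geneK", "gs:has_function", "?function", stored))
print(solve_pattern(":geneK", "gs:part_of_genome", "?genome", stored))
```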
4. Multi-perspective visualization • Multi-perspective exploration of results • Graph style brings relationships to the foreground • Chart style summarizes at a glance • Tabular explorer with filtering and re-sorting • Not only shows results, but lets the user interact with them: • Expand/contract nodes in graph • Filtering and re-sorting in table
Evaluation • Panel (30 people): • CS students (20), Biology students (10), Bioinfo (5) • Questions: “Which genes have protein expression during a parasite life-cycle stage that is in the human host?” “What are the relationships between gene Tc00.1047053409117.20 and any gene that has protein family PF03645?” • Subjective evaluation (SUS) • Objective evaluation • Number of clicks used to obtain answer • Time spent
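For the subjective part, the sketch below assumes that "SUS" refers to the standard ten-item System Usability Scale; the slides do not spell this out. Under the standard scoring, odd-numbered items contribute (response - 1), even-numbered items contribute (5 - response), and the sum is multiplied by 2.5 to give a score from 0 to 100. The responses shown are made up.

```python
def sus_score(responses):
    """Score one participant's ten SUS answers (each an integer from 1 to 5)."""
    if len(responses) != 10 or any(r not in range(1, 6) for r in responses):
        raise ValueError("expected ten responses, each between 1 and 5")
    total = 0
    for item, response in enumerate(responses, start=1):
        # Odd items are positively worded, even items negatively worded.
        total += (response - 1) if item % 2 == 1 else (5 - response)
    return total * 2.5

# Hypothetical participant: agrees with positive items, disagrees with negative ones.
print(sus_score([4, 2, 4, 2, 4, 2, 4, 2, 4, 2]))  # 75.0
```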
Conclusions • Flexibility • Queries from any standpoint • Plug-and-play of datasets • Easy addition of other visualization widgets • Easy addition of other query interfaces • Mashup-ready!!! • Powerful queries • Allows queries for relationships • Ready to allow path queries in near future
Conclusions • Limitations • “Mashups” of large amounts of data at client side are demanding • Simple server-side solution is also provided, but more sophisticated handling is desirable • Computationally intensive services (long execution time) require more sophisticated logging and monitoring
Acknowledgements • People • Sena Arpinar (CSBL/UGA) • Mark Heiges, Fernan Aguero, Chih-Horng Kuo and the TcruziDB team • Maciej Janik (LSDIS) • Matt Eavenson (LSDIS) • Grants • American Heart Association award 0330338N to Jessica C. Kissinger • NIH-NHLBI funded "Semantics and Services enabled Problem Solving Environment for T.cruzi" (1R01HL087795-01A1)
Thank you! Questions? • Pablo Mendes (mendes2@wright.edu) • Bobby McKnight (mcknight@cs.uga.edu) • Amit Sheth (amit.sheth@wright.edu) • Jessica Kissinger (jkissing@uga.edu)