TcruziKB: Enabling Complex Queries for Genomic Data Exploration

Pablo N. Mendes, Bobby McKnight, Amit P. Sheth, Jessica C. Kissinger
ICSC, August 2008

Summary
“An analytical environment for data exploration of interconnected data in genome projects”
• Ontology-aided data integration
• Knowledge-driven query formulation
• Complex query execution
• Multi-perspective visualization

Data Integration
• Data sources:
  • Genes, proteins (+ enzymes), protein families/clans
  • Protein expression, life cycle stages
  • Analysis results (OrthoMCL)
  • Multiple sources cite the same object (e.g. a gene). How to integrate?
• Data formats:
  • OWL-based: ComGO, EnzyO, GO, SO
  • RDBMS-based: GUS schema
  • XML-based: InterPro
  • Flat-file-based: Pfam and OrthoMCL
  • Each data source provides a different format. How to process?

1. Data Integration
• Distribution:
  • “Warehousing” approach: download and process
  • Web services approach: ask for data at query time
• Heterogeneity:
  • Off-the-shelf packages such as BioJava help
  • Most of the work is mapping fields to ontology classes
• Identity resolution:
  • Multiple sources cite the same object (e.g. a gene)
  • Some help from the OWL framework, e.g. Inverse Functional Properties (IFPs)
  • Two instances are the same if their IFP values are the same
  • More elaborate analysis with bioinformatics algorithms (e.g. BLAST)
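The IFP rule above ("two instances are the same if their IFP values are the same") can be sketched in a few lines of Python. This is only an illustration, not the TcruziKB implementation: the property name ex:locus_tag, the instance identifiers and the second tag value are hypothetical, and triples are plain tuples rather than an OWL store.

```python
from collections import defaultdict


def merge_by_ifp(triples, ifps):
    """Cluster subjects that share a value for any inverse functional property."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    first_seen = {}  # (property, value) -> first subject seen with that value
    for s, p, o in triples:
        find(s)  # register every subject, even those without IFP values
        if p in ifps:
            key = (p, o)
            if key in first_seen:
                union(first_seen[key], s)
            else:
                first_seen[key] = s

    groups = defaultdict(set)
    for s in parent:
        groups[find(s)].add(s)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical data: two sources describe the same gene under different IDs.
triples = [
    ("tcruzidb:gene1", "ex:locus_tag", "Tc00.1047053409117.20"),
    ("genbank:g42", "ex:locus_tag", "Tc00.1047053409117.20"),
    ("genbank:g43", "ex:locus_tag", "EXAMPLE_OTHER_TAG"),
    ("tcruzidb:gene1", "ex:label", "hypothetical protein"),
]
clusters = merge_by_ifp(triples, ifps={"ex:locus_tag"})
print(clusters)
```

Union-find is used so that matches chain: if A matches B on one IFP and B matches C on another, all three end up in one cluster.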

2. Knowledge-driven query formulation
• Steps shown on the slide:
  • Step 1: search for initial term
  • Step 2: choose relationship
  • Step 3: search functions by typing their name
  • Step 4: select function X
  • Step 5: choose new line
• Trail shown on the slide (figure labels):
  • ?any_gene (so:gene), function X
  • ?any_gene (so:gene), part_of_genome, ?genome (so:genome)
• Visual query building, in a “semantic trail” style
• Inspired in part by Yahoo! Search Assist and Google Suggest
• e.g. “Find all genes that are of function X in genome Y1”
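One way to picture how such a trail could become a query is to serialize each line of the trail as a SPARQL triple pattern. The sketch below is an assumption about the general idea, not the tool's actual code: the predicate ex:has_function and the resource ex:functionX are hypothetical placeholders (the slide does not name the relationship), and prefix declarations are left out.

```python
def trail_to_sparql(select_vars, patterns):
    """Serialize a list of (subject, predicate, object) patterns as a SPARQL SELECT."""
    body = " .\n  ".join(" ".join(pattern) for pattern in patterns)
    return "SELECT {} WHERE {{\n  {}\n}}".format(" ".join(select_vars), body)


# Each tuple corresponds to one step of the visual trail.
trail = [
    ("?any_gene", "rdf:type", "so:gene"),
    ("?any_gene", "ex:has_function", "ex:functionX"),  # hypothetical predicate
    ("?any_gene", "ex:part_of_genome", "?genome"),     # hypothetical prefix
    ("?genome", "rdf:type", "so:genome"),
]
query = trail_to_sparql(["?any_gene"], trail)
print(query)
```

Choosing "new line" in the interface would then correspond to appending another tuple that reuses an existing variable such as ?any_gene.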

Knowledge-driven query formulation
• User can start a query from any standpoint
  • Support for inverse relationships
    • Gene codes Protein
    • Protein coded by Gene
• Able to navigate RDF, RDFS and (some) OWL
  • The “suggest” functions can be different for each language
    • RDFS (rdfs:domain, rdfs:range): SELECT ?range { ex:prop rdfs:range ?range }
    • RDF: SELECT ?cls { ?a ex:prop ?i . ?i rdf:type ?cls }
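The difference between the two “suggest” queries above can be illustrated over a small in-memory triple list. This is a plain-Python imitation of what the two SPARQL queries return, with hypothetical ex: names; it is not how TcruziKB evaluates them.

```python
RDF_TYPE = "rdf:type"
RDFS_RANGE = "rdfs:range"


def suggest_from_schema(triples, prop):
    """RDFS-style suggestion: classes declared as the range of prop."""
    return sorted({o for s, p, o in triples if s == prop and p == RDFS_RANGE})


def suggest_from_instances(triples, prop):
    """RDF-style suggestion: classes of the instances that prop actually points to."""
    targets = {o for s, p, o in triples if p == prop}
    return sorted({o for s, p, o in triples if p == RDF_TYPE and s in targets})


# Hypothetical data: the schema says ex:codes ranges over ex:Protein,
# while the only observed target is typed as ex:Enzyme.
triples = [
    ("ex:codes", "rdfs:range", "ex:Protein"),
    ("ex:gene1", "ex:codes", "ex:prot1"),
    ("ex:prot1", "rdf:type", "ex:Enzyme"),
]
print(suggest_from_schema(triples, "ex:codes"))
print(suggest_from_instances(triples, "ex:codes"))
```

The schema-based variant needs rdfs:range declarations to exist; the instance-based variant works on schema-less RDF but only suggests classes that already occur in the data.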

Advanced input collections: (amino acid) sequence specified by user

3. Query execution
• Queries to multiple SPARQL endpoints are supported
• SPARQL endpoints can be specified at query time
• Supports execution of Web services as part of a query solution
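A minimal sketch of sending one query to several endpoints chosen at query time follows. It assumes each endpoint accepts the query in the `query` URL parameter (as the SPARQL Protocol specifies), returns the SPARQL JSON results format, and has no query string of its own. The function names and the `_endpoint` bookkeeping key are hypothetical, and the merge is a simple concatenation rather than a federated join.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_request(endpoint, query):
    """Build a SPARQL Protocol GET request asking for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def http_fetch(endpoint, query):
    """Send the query over HTTP and parse the JSON response."""
    with urlopen(build_request(endpoint, query)) as response:
        return json.load(response)


def query_endpoints(endpoints, query, fetch=http_fetch):
    """Run the same query on every endpoint and concatenate the result rows."""
    rows = []
    for endpoint in endpoints:
        result = fetch(endpoint, query)
        for binding in result["results"]["bindings"]:
            row = {var: cell["value"] for var, cell in binding.items()}
            row["_endpoint"] = endpoint  # remember where each row came from
            rows.append(row)
    return rows
```

The `fetch` parameter is injectable so the merging logic can be exercised without network access, for example by passing a function that returns canned results.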

On-Demand Service Execution Using Jena Property Functions
• Query: SELECT ?function WHERE { :geneK gs:has_function ?function }
• Look up “has_function” in the table of (property, class) pairs:
  • (list_member, com.hp.hpl.contains)
  • (has_function, edu.uga.bmb.functionpredictor)
  • ...
  • (has_similarity, org.knoesis.blastinvoker)
• Found? Yes! Execute “functionpredictor”
• Inputs: query context (geneK, ?function)
• Results found by the executed class are bound back into the query context via the variable ?function
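The lookup-then-execute flow above can be mimicked in plain Python to show the control flow. This is not the Jena property function API: the registry dictionary, `predict_function` and `evaluate_pattern` are hypothetical stand-ins, and the "service" just returns a placeholder string.

```python
def predict_function(gene):
    """Hypothetical stand-in for a function-prediction Web service."""
    return ["predicted_function_of_" + gene]


# Maps a predicate to the code that computes its values on demand.
REGISTRY = {"gs:has_function": predict_function}


def evaluate_pattern(subject, predicate, variable, stored_triples, registry=REGISTRY):
    """Evaluate one triple pattern, calling a service if the predicate is registered."""
    service = registry.get(predicate)
    if service is not None:
        # Registered predicate: execute the service with the bound subject.
        values = service(subject)
    else:
        # Ordinary predicate: match against stored triples.
        values = [o for s, p, o in stored_triples if s == subject and p == predicate]
    # Bind each value back to the query variable.
    return [{variable: value} for value in values]


solutions = evaluate_pattern(":geneK", "gs:has_function", "?function", stored_triples=[])
print(solutions)
```

From the query author's point of view, a service-backed predicate and a stored predicate look identical; only the registry decides which path is taken.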

4. Multi-perspective visualization
• Multi-perspective exploration of results
  • Graph style brings relationships to the foreground
  • Chart style summarizes at a glance
  • Tabular explorer with filtering and re-sorting
• Does not only show results, but lets the user interact with them:
  • Expand/contract nodes in graph
  • Filtering and re-sorting in table

Displaying results

Evaluation
• Panel (30 people):
  • CS students (20), Biology students (10), Bioinfo (5)
• Questions:
  • “Which genes have protein expression during a parasite life-cycle stage that is in the human host?”
  • “What are the relationships between gene Tc00.1047053409117.20 and any gene that has protein family PF03645?”
• Subjective evaluation (SUS)
• Objective evaluation
  • Number of clicks used to obtain answer
  • Time spent
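The slides do not say how the SUS questionnaires were scored. Assuming the standard System Usability Scale procedure (ten items answered on a 1 to 5 scale, odd items contributing answer minus 1, even items contributing 5 minus answer, sum multiplied by 2.5), the score for one participant could be computed as in this sketch:

```python
def sus_score(responses):
    """Compute a standard SUS score (0-100) from ten answers on a 1-5 scale."""
    if len(responses) != 10 or any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("SUS needs exactly ten answers, each between 1 and 5")
    total = 0
    for item_number, answer in enumerate(responses, start=1):
        if item_number % 2 == 1:
            total += answer - 1   # odd items are positively worded
        else:
            total += 5 - answer   # even items are negatively worded
    return total * 2.5


print(sus_score([3] * 10))
```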

Conclusions
• Flexibility
  • Queries from any standpoint
  • Plug-and-play of datasets
  • Easy addition of other visualization widgets
  • Easy addition of other query interfaces
  • Mashup-ready!!!
• Powerful queries
  • Allows queries for relationships
  • Ready to allow path queries in the near future

Conclusions
• Limitations
  • “Mashups” of large amounts of data at the client side are demanding
  • A simple server-side solution is also provided, but more sophisticated handling is desirable
  • Computationally intensive services (long execution time) require more sophisticated logging and monitoring

Acknowledgements
• People
  • Sena Arpinar (CSBL/UGA)
  • Mark Heiges, Fernan Aguero, Chih-Horng Kuo and the TcruziDB team
  • Maciej Janik (LSDIS)
  • Matt Eavenson (LSDIS)
• Grants
  • American Heart Association award 0330338N to Jessica C. Kissinger
  • NIH-NHLBI funded "Semantics and Services enabled Problem Solving Environment for T.cruzi" (1R01HL087795-01A1)

Thank you! Questions?
Pablo Mendes (mendes2@wright.edu)
Bobby McKnight (mcknight@cs.uga.edu)
Amit Sheth (amit.sheth@wright.edu)
Jessica Kissinger (jkissing@uga.edu)
