
Dialogue DataGrid Motivating Applications

Dialogue DataGrid Motivating Applications. Joel Saltz, MD, PhD Chair Biomedical Informatics College of Medicine The Ohio State University. Dialogue DataGrid. Relational databases, files, XML databases, object stores Strongly typed Multi-tiered metadata management system


Presentation Transcript


  1. Dialogue DataGrid: Motivating Applications • Joel Saltz, MD, PhD • Chair, Biomedical Informatics • College of Medicine, The Ohio State University

  2. Dialogue DataGrid • Relational databases, files, XML databases, object stores • Strongly typed • Multi-tiered metadata management system • Incorporates elements from OGSA-DAI, Mobius, caGrid, STORM, DataCutter, GT4 … • Scales to very large data, high end platforms

  3. Requirements • Support or interoperate with caGrid, eScience infrastructure • Interoperate with or replace SRB • Well defined relationship to Globus Alliance • Services to support high end large scale data applications • Design should include semantic metadata management • Well thought out relationship to commercial products (e.g. Information Integrator, Oracle)

  4. Motivating Application Class I: Phenotype characterization • Information heterogeneity, data coordination and data size • Synthesize information from many high throughput information sources • Sources can include multiple types of high throughput molecular data and multiple imaging modalities • Coordinated efforts at multiple sites • Detailed understanding of biomedical phenomena will increasingly involve the need to analyze very large high resolution spatio-temporal datasets

  5. Structural Complexity

  6. Example Questions (Phenotypes associated with Rb knockouts) • What are the mechanisms of fetal death in mutant mice? • What structural changes occur in the placenta? • How different are the structural changes between the wild and mutant types? • … [Figure labels: Rb+, Rb-]

  7. Dataset Size: Systems Biology • Future big science animal experiments on cancer, heart disease, pathogen host response • Basic small mouse is 3 cm³ • 1 μm resolution: very roughly 10¹³ bytes/mouse • Molecular data (spatial location): multiply by 10² • Vary genetic composition, environmental manipulation, systematic mechanisms for varying genetic expression: multiply by 10³ • Total: 10¹⁸ bytes per big science animal experiment
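
The estimate on this slide can be checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope reconstruction: the slide gives the volume, the resolution and the multipliers, while the figure of 3 bytes per voxel is an assumption introduced here to show how "very roughly 10¹³ bytes" could arise.

```python
# Back-of-the-envelope check of the slide's storage estimate.
MICRONS_PER_CM = 10**4       # 1 cm = 10,000 micrometres
mouse_volume_cm3 = 3         # from the slide
bytes_per_voxel = 3          # ASSUMPTION (not in the slide), e.g. one byte per colour channel

# Number of 1-micron cubes in the mouse volume.
voxels = mouse_volume_cm3 * MICRONS_PER_CM**3
bytes_per_mouse = voxels * bytes_per_voxel   # 9e12, i.e. "very roughly" 1e13

# The slide rounds to 1e13 and then applies two multipliers.
rounded_bytes_per_mouse = 10**13
molecular_factor = 10**2     # molecular data with spatial location
variation_factor = 10**3     # genetic and environmental variation
total_bytes = rounded_bytes_per_mouse * molecular_factor * variation_factor

print(f"voxels per mouse:     {voxels:.1e}")
print(f"bytes per mouse:      {bytes_per_mouse:.1e}")
print(f"bytes per experiment: {total_bytes:.1e}")
```

With these inputs the per-experiment total is 10¹⁸ bytes, i.e. on the order of an exabyte.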

  8. Now: Virtual Slides (roughly 25 TB/cm² tissue)

  9. Compare phenotypes of normal vs Rb deficient mice [Figure labels: Alignment, Slides/Slices, Placenta, Visualization, Segmentation]

  10. Computational Phenotyping Challenges • Very large datasets • Automated image analysis • Three dimensional reconstruction • Motion • Integration of multiple data sources • Data indexing and retrieval

  11. Large Scale Data Middleware Requirements • Spatio-temporal datasets • Very large datasets • Tens of gigabytes to 100+ TB data • Lots of datasets • Up to thousands of runs for a study are possible • Data can be stored in distributed collection of files • Distributed datasets • Data may be captured at multiple locations by multiple groups • Simulations are carried out at multiple sites • Common operations: subsetting, filtering, interpolations, projections, comparisons, frequency counts
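
Two of the common operations listed on this slide, subsetting and frequency counts, can be illustrated over a dataset split across several files. This is a minimal sketch under stated assumptions, not the STORM or DataCutter API: the `Chunk` class, its fields and the in-memory `records` list are hypothetical stand-ins for per-file metadata and file contents.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Chunk:
    """Hypothetical descriptor for one file of a distributed dataset."""
    path: str        # illustrative file name only
    t_range: tuple   # (t_min, t_max), inclusive
    x_range: tuple   # (x_min, x_max), inclusive
    records: list    # (t, x, value) tuples; a real system would read these from the file

def overlaps(a, b):
    """True if two inclusive intervals intersect."""
    return a[0] <= b[1] and b[0] <= a[1]

def subset(chunks, t_query, x_query):
    """Yield records inside the query box, skipping chunks whose bounds miss it."""
    for chunk in chunks:
        if not (overlaps(chunk.t_range, t_query) and overlaps(chunk.x_range, x_query)):
            continue  # metadata-only pruning: this file is never opened
        for t, x, value in chunk.records:
            if t_query[0] <= t <= t_query[1] and x_query[0] <= x <= x_query[1]:
                yield t, x, value

def frequency_counts(records, predicate):
    """Count occurrences of each value that passes the filter predicate."""
    return Counter(value for _, _, value in records if predicate(value))

chunks = [
    Chunk("run0.dat", (0, 1), (0, 9), [(0, 1, 5), (1, 8, 7)]),
    Chunk("run1.dat", (2, 3), (0, 9), [(2, 2, 5), (3, 3, 5)]),
]
selected = list(subset(chunks, t_query=(1, 2), x_query=(0, 5)))
counts = frequency_counts(selected, lambda v: v > 4)
print(selected, counts)
```

Keeping the bounding ranges in metadata lets a query discard whole files before any data is read, which matters when a study spans thousands of runs.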

  12. Very Large Dataset Hardware is Proliferating • Our Example: Ohio Supercomputer Center Mass Storage Testbed • 50 TB of performance storage • home directories, project storage space, and long-term frequently accessed files • 420 TB of performance/capacity storage • Active Disk Cache - compute jobs that require directly connected storage • parallel file systems, and scratch space • Large temporary holding area • 128 TB tape library • Backups and long-term "offline" storage

  13. STORM Services • Query • Meta-data • Indexing • Data Source • Filtering • Partition Generation • Data Mover

  14. STORM Results • Seismic datasets • 10-25 GB per file • About 30-35 TB of data

  15. Motivating Application II: caBIG In vivo Imaging Workspace Testbed • Study the effects of image acquisition and reconstruction parameters (i.e. slice thickness, reconstruction kernel and dose) on CAD and on human ROC. • Use multiple datasets and several CAD algorithms to investigate the relationship between radiation dose and nodule detection ROC. • Cooperative Group Imaging Study Support • Children’s Oncology Group: quantify whether perfusion study results add any additional predictive value to the current battery of histopathological and molecular studies • CALGB: Grid based analysis of PET/CT data to support phase I, II studies • NTROI: Grid based OMIRAD -- registration, fusion and analysis of MR and Diffuse Optical Tomography (DOT).

  16. CAD Testbed Project RSNA 2005 (joint with Eliot Siegel et al at Univ. Maryland) • Expose algorithms and data management as Grid Services • Remote execution of multiple CAD algorithms using multiple image databases • CAD algorithm validation with larger set of images • Better research support — recruit from multiple institutions, demographic relationships, outcome management etc. • Remote CAD execution - reduced data transfer & avoid need to transmit PHI • CAD compute farms that reduce the turnover time • Scalable and open source — caBIG standards

  17. Architecture

  18. Image Data Service • Expose data in DICOM PACS with grid service wrappers • An open source DICOM server: Pixelmed • XML based data transfer • 5 participating data services: 3x Chicago, 1x Columbus, 1x Los Angeles

  19. CAD Algorithm Service • Grid services for algorithm invocation and image retrieval service • caGrid middleware to wrap CAD applications with grid services • Report results to a result storage service • caGrid Introduce facilitates service creation • GUMS/CAMS is used to provide secure data communication and service invocation • CAD algorithms provided by iCAD and Siemens Medical Solutions; prototypes for investigational use only, not commercially available

  20. Framework Support Services • Result storage server — A distributed XML database for caching CAD results • GME — Manage communication and data protocols

  21. User Interface • Available data services • DICOM image viewer • Queried results • Click to browse images, submit CAD analysis, and view results [Screenshot: Slice = 127, W/L = 2200/-500]

  22. Motivating Application III: Integrative Cancer Biology Center on Epigenetics (PI Tim Huang, OSU) • TGFβ/Smad4 targets are regulated via epigenetic mechanisms. Promoter hypermethylation is a hallmark of this epigenetically mediated silencing. In this study, we have combined both chromatin immunoprecipitation microarray (ChIP-chip) and differential methylation hybridization (DMH) to investigate this epigenetic mechanism in ovarian cancer

  23. Translating a goal into workflow ArrayAnnotator

  24. Application of caGrid to the workflow • Application needs to support access to a diverse set of databases and execution of different analysis methods • Data services • KbSMAD • Chip information from chip company • Enzyme data • Clinical data • Experimental results • Experimental design • Analytical services • Designing a custom array • Normalization • Data mining (ex: clustering)

  25. Example: Prototype of Clone Annotation Analytical Service • Analytical Service: ArrayAnnotator • Goal: Provide an annotation for each clone to select a subset of clones among 400,000 candidate clones to design a custom array for DMH experiment • Clone selection criteria • Clones within a promoter region • Clones with proper internal and external cuts • Clones within CpG island region and/or high CG contents • Clones with Transcription Factor binding sites • Input: CloneType information • extended sequence, enzyme info, genomic location, etc • Functions • Determine external cut locations around a clone region (e.g., cut-site by BfaI) • Examine the internal cut around a clone region (e.g., cut-site by HapII, HinpII, and MCrBc) • Identify the location of clone in genome • Show whether it is within promoter region or not • Calculate CG content and overlapping with CpG islands • Identify which Transcription Factor binding sites are overlapped with clones
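
Three of the functions listed on this slide (external cut locations, internal cuts, CG content) can be sketched in plain Python. This is an illustration under stated assumptions, not the ArrayAnnotator implementation: the function names, the returned dictionary and the toy sequence are hypothetical, and the recognition motifs are passed in as parameters (the values `CTAG` and `CCGG` in the usage lines are assumptions for illustration and should be checked against an enzyme reference).

```python
def find_sites(sequence, motif):
    """Return 0-based start positions of every (possibly overlapping) motif match."""
    seq, motif = sequence.upper(), motif.upper()
    positions = []
    start = seq.find(motif)
    while start != -1:
        positions.append(start)
        start = seq.find(motif, start + 1)
    return positions

def cg_content(sequence):
    """Fraction of bases that are C or G (0.0 for an empty sequence)."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return (seq.count("C") + seq.count("G")) / len(seq)

def annotate_clone(extended_seq, clone_start, clone_end, external_motif, internal_motifs):
    """Hypothetical annotation record for the clone at extended_seq[clone_start:clone_end]."""
    clone = extended_seq[clone_start:clone_end]
    external = find_sites(extended_seq, external_motif)
    upstream = [p for p in external if p < clone_start]
    downstream = [p for p in external if p >= clone_end]
    return {
        # nearest external site on each side of the clone, in extended_seq coordinates
        "upstream_external_cut": max(upstream) if upstream else None,
        "downstream_external_cut": min(downstream) if downstream else None,
        # internal sites, in coordinates relative to the clone start
        "internal_cuts": {m: find_sites(clone, m) for m in internal_motifs},
        "cg_content": cg_content(clone),
    }

# Toy input: an 8-base clone flanked by two external sites (motifs are assumptions).
extended = "CTAG" + "AACCGGTT" + "CTAG"
annotation = annotate_clone(extended, 4, 12, external_motif="CTAG", internal_motifs=["CCGG"])
print(annotation)
```

Promoter-region membership, CpG island overlap and transcription factor binding sites would additionally need genome coordinates and annotation tracks, which the slide lists as inputs but does not specify, so they are left out of the sketch.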

  26. Example caGrid Usage in P50 chip design application • Diagram components: Chip design application, Clone Info Data Services, Genome Sequence Data Source, Annotation Analytical Service • Query (geneId) to Clone Info Data Services; Result: list of clones • Query to Genome Sequence Data Source; Result: extended genome sequence of clone • Request (cloneInfo) to Annotation Analytical Service; Result: annotation (CpG, cutsite, promoter region, etc) • The six arrows are numbered 1 to 6 in the diagram

  27. ArrayAnnotator output (Hao Sun, Ramana Davuluri)

  28. Multiscale Laboratory Research Group • Ohio State University • Joel Saltz • Gagan Agrawal • Umit Catalyurek • Dan Cowden • Mike Gray • Tahsin Kurc • Shannon Hastings • Steve Langella • Scott Oster • Tony Pan • DK Panda • Srini Parthasarathy • P. Sadayappan • Sivaramakrishnan (K2) • Michael Zhang • The Ohio Supercomputer Center • Stan Ahalt • Jason Bryan • Dennis Sessanna • Don Stredney • Pete Wycoff

  29. Microscopy Image Analysis • Biomedical Informatics • Tony Pan • Alexandra Gulacy • Dr. Metin Gurcan • Dr. Ashish Sharma • Dr. Kun Huang • Dr. Joel Saltz • Computer Science and Engineering • Kishore Mosaliganti • Randall Ridgway • Richard Sharp • Pathology • Dr. Dan Cowden • Human Cancer Genetics • Pamela Wenzel • Dr. Gustavo Leone • Dr. Alain deBruin

  30. caGrid Team • Booz Allen Hamilton • Manisundaram Arumani • National Cancer Institute • Peter Covitz • Krishnakant Shanbhag • SAIC • Tara Akhavan • Manav Kher • William Sanchez • Ruowei Wu • Jijin Yan • Ohio State University • Shannon Hastings • Tahsin Kurc • Stephen Langella • Scott Oster • Joel Saltz • Panther Informatics Inc. • Nick Encina • Brian Gilman

  31. RSNA 2005 Team • Tony Pan, Stephen Langella, Shannon Hastings, Scott Oster, Ashish Sharma, Metin Gurcan, Tahsin Kurc, Joel Saltz: Department of Biomedical Informatics, The Ohio State University Medical Center, Columbus OH • Eliot Siegel, Khan M. Siddiqui: University of Maryland School of Medicine, Baltimore, MD • Thanks to Siemens and iCAD for supplying CAD algorithms
