Monthly Program Update
April 12, 2012
Andrew J. Buckler, MS, Principal Investigator
With funding support provided by the National Institute of Standards and Technology
Agenda
• Working discussion on data curation, using facilities of Iterate for storage and the provenance documentation model
• Updates on:
  • Metrology Workshop results
  • QIBA 3A test bed progress
Part of our discussion on data curation and processing workflow from last month…

// Business Requirements
• FNIH, QIBA, and C-Path participants don't have a way to provide a precise specification of context for use and applicable assay methods (to allow semantic labeling):
  BiomarkerDB = Specify(biomarker domain expertise, ontology for labeling);
• Researchers and consortia don't have the ability to exploit existing data resources with high precision and recall:
  ReferenceDataSet+ = Formulate(BiomarkerDB, {DataService});
• Technology developers and contract research organizations don't have a way to do large-scale quantitative runs:
  ReferenceDataSet.CollectedValue+ = Execute(ReferenceDataSet.RawData);
• The community lacks a way to apply definitive statistical analyses of annotation and image markup over a specified context for use:
  BiomarkerDB.SummaryStatistic+ = Analyze({ReferenceDataSet.CollectedValue});
• Industry lacks standardized ways to report and submit data electronically:
  eFilingTransactions+ = Package(BiomarkerDB, {ReferenceDataSet});
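A signature-level sketch of this pipeline in Python may make the data flow easier to follow. The class and function names mirror the pseudocode above; everything else (field names, type details) is an illustrative assumption, and the bodies are deliberately elided.

```python
from dataclasses import dataclass, field

@dataclass
class BiomarkerDB:
    """Precise specification of context for use, with semantic labels."""
    context_for_use: str
    summary_statistics: list = field(default_factory=list)

@dataclass
class ReferenceDataSet:
    """A curated set of cases matching the specification."""
    raw_data: list
    collected_values: list = field(default_factory=list)

def specify(domain_expertise: str, ontology: dict) -> BiomarkerDB:
    """Capture context for use and applicable assay methods as semantic labels."""
    ...

def formulate(db: BiomarkerDB, data_services: list) -> list[ReferenceDataSet]:
    """Query existing data resources with high precision and recall."""
    ...

def execute(rds: ReferenceDataSet) -> list:
    """Run a large-scale quantitative pass over the raw data."""
    ...

def analyze(collected_values: list) -> list:
    """Apply definitive statistical analyses over the specified context for use."""
    ...

def package(db: BiomarkerDB, datasets: list) -> bytes:
    """Assemble a standardized electronic report/submission."""
    ...
```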
…and the associated storage model…

• Reference Data Set Manager (RDSM): heavyweight storage with URIs
• Knowledgebase: lightweight storage linking to URIs
  • using the "Share" and "Duplicate" functions of the RDSM to leverage cases across investigations
  • self-generating knowledgebase from the RDSM hierarchy and ISA-TAB description files
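A minimal sketch of that split, assuming only the behavior described above: heavyweight bytes live in the RDSM and are addressed by URI, while the knowledgebase keeps lightweight records that merely point at those URIs. Class and field names are illustrative, not the actual QI-Bench schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RdsmItem:
    """Heavyweight storage: the bytes live in the RDSM, addressed by URI."""
    uri: str
    checksum: str  # identity of the bytes, so sharing never requires a copy

@dataclass
class KnowledgebaseEntry:
    """Lightweight storage: a link into the RDSM, scoped to an investigation."""
    investigation: str  # e.g., an ISA-TAB investigation identifier
    item: RdsmItem      # link only; the knowledgebase never copies the data

def duplicate(entry: KnowledgebaseEntry, new_investigation: str) -> KnowledgebaseEntry:
    """Mimic the RDSM "Duplicate" function: reuse a case in another
    investigation without copying the underlying bytes."""
    return KnowledgebaseEntry(investigation=new_investigation, item=entry.item)
```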
…leading us to: Principles of Provenance

Central to the scientific method is the idea of replicating prior experiments so that they are transparent and verifiable.
• We need to keep track of:
  • the origin of the data
  • the transformation methods applied to the data
    • not just which programs: version information is critical
    • copies of the actual programs used (pinned in git)
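The bullets above imply a concrete record per transformation. Below is a minimal sketch of such a record, assuming the program is pinned to a git commit so the run can be replicated; all field names and example values are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class ProvenanceRecord:
    source_uri: str      # origin of the data
    transform: str       # which program was applied
    version: str         # version of that program (critical, per the slide)
    git_commit: str      # exact commit of the program actually used
    timestamp: datetime  # when the transformation ran

record = ProvenanceRecord(
    source_uri="rdsm://collection/case-001",  # hypothetical URI
    transform="lesion_segmentation",
    version="1.4.2",
    git_commit="9fceb02d0ae598e95dc970b74767f19372d61af8",  # example hash
    timestamp=datetime(2012, 4, 12),
)
```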
Taverna keeps provenance data in a database on the machine from which the workflow is initiated.
• We need to expose provenance for external users of QI-Bench
  • Example: the provenance of the data in an exported ISA-TAB
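One plausible way to surface provenance in an ISA-TAB export is through ISA-TAB's Comment[] fields, which are a standard extension point; the specific field names below are assumptions, since the export design is still open.

```python
def isatab_provenance_rows(prov: dict) -> str:
    """Render provenance key/value pairs as tab-separated ISA-TAB Comment[] lines."""
    return "\n".join(f"Comment[{key}]\t{value}" for key, value in prov.items())

print(isatab_provenance_rows({
    "Provenance Source": "rdsm://collection/case-001",  # hypothetical URI
    "Provenance Transform": "lesion_segmentation 1.4.2",
    "Provenance Commit": "9fceb02",
}))
```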
Taverna allows access to the provenance data via a Java API.
• We have not explored this area of Taverna yet.
• Taverna's documentation indicates this is an area under active development.
Iterate Demonstration
• Obtaining a list of communities to which a user belongs
• Nesting a workflow
• Listing the items in a folder
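The first and third demonstration steps can be expressed as plain web-service calls. The sketch below assumes an Iterate deployment exposing a Midas-style JSON API; the server URL, token, and method names (midas.community.list, midas.folder.children) are assumptions and have not been verified against Iterate itself.

```python
import requests

BASE = "https://iterate.example.org/api/json"  # hypothetical server

def call(method: str, **params) -> dict:
    """Invoke one Midas-style JSON API method and return its data payload."""
    resp = requests.get(BASE, params={"method": method, **params})
    resp.raise_for_status()
    return resp.json()["data"]

# Communities to which the authenticated user belongs
communities = call("midas.community.list", token="USER_TOKEN")

# Items in a folder (the step the nested workflow wraps)
children = call("midas.folder.children", token="USER_TOKEN", id="42")
```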
Workflow to list community memberships in Iterate using a nested workflow
Provenance application in QI-Bench Demonstrators: Investigation and Study levels (ISA-TAB compliant)
Provenance application in QI-Bench Demonstrators: Assay and Data levels (not yet ISA-TAB compliant)
Application
• Provenance of:
  • Demonstrator40 data [input for analysis]
  • Demonstrator40 Output [obviously the output]
  • …
Application
• So we can answer:
  • What is Demonstrator40_download.zip?
  • How did we get the Demonstrator40 data?
  • What was the original dataset, and where did it come from?
  • What transformation on the original dataset created the Demonstrator40 data folder?
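These questions reduce to walking provenance links from a dataset back to its origin. A minimal sketch, modeling the records as a dict from each derived dataset's URI to its (source URI, transform); the URIs and transform names are hypothetical.

```python
def original_dataset(records: dict[str, tuple[str, str]], uri: str) -> str:
    """Follow source links back until no record exists, i.e. the original data."""
    while uri in records:
        uri = records[uri][0]
    return uri

def transformations(records: dict[str, tuple[str, str]], uri: str) -> list[str]:
    """List the transforms, newest first, that produced the given dataset."""
    chain = []
    while uri in records:
        source, transform = records[uri]
        chain.append(transform)
        uri = source
    return chain

# Hypothetical chain: original DICOM -> de-identified set -> Demonstrator40 data
records = {
    "rdsm://demonstrator40/data": ("rdsm://original/deidentified", "format conversion"),
    "rdsm://original/deidentified": ("rdsm://original/dicom", "de-identification"),
}
print(original_dataset(records, "rdsm://demonstrator40/data"))  # rdsm://original/dicom
print(transformations(records, "rdsm://demonstrator40/data"))
```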
Value proposition of QI-Bench
• Efficiently collect and exploit evidence establishing standards for optimized quantitative imaging:
  • Users want confidence in the read-outs
  • Pharma wants to use them as endpoints
  • Device/software companies want to market products that produce them without huge costs
  • The public wants to trust the decisions they contribute to
• By providing a verification framework to develop precompetitive specifications and support test harnesses to curate and utilize reference data
• Doing so as an accessible and open resource facilitates collaboration among diverse stakeholders
Summary: QI-Bench Contributions
• We make it practical to increase the magnitude of data for increased statistical significance.
• We provide practical means to grapple with massive data sets.
• We address the problem of efficient use of resources to assess limits of generalizability.
• We make formal specification accessible to diverse groups of experts who are not skilled in, or interested in, knowledge engineering.
• We map both medical and technical domain expertise into representations well suited to emerging capabilities of the semantic web.
• We enable a mechanism to assess compliance with standards or requirements within specific contexts for use.
• We take a "toolbox" approach to statistical analysis.
• We provide the capability in a manner accessible to varying levels of collaborative models, from individual companies or institutions, to larger consortia or public-private partnerships, to fully open public access.
QI-Bench Structure / Acknowledgements
• Prime: BBMSC (Andrew Buckler, Gary Wernsing, Mike Sperling, Matt Ouellette)
• Co-Investigators:
  • Kitware (Rick Avila, Patrick Reynolds, Julien Jomier, Mike Grauer)
  • Stanford (David Paik)
• Financial support as well as technical content: NIST (Mary Brady, Alden Dima, John Lu)
• Collaborators / Colleagues / Idea Contributors:
  • Georgetown (Baris Suzek)
  • FDA (Nick Petrick, Marios Gavrielides)
  • UMD (Eliot Siegel, Joe Chen, Ganesh Saiprasad, Yelena Yesha)
  • Northwestern (Pat Mongkolwat)
  • UCLA (Grace Kim)
  • VUmc (Otto Hoekstra)
• Industry:
  • Pharma: Novartis (Stefan Baumann), Merck (Richard Baumgartner)
  • Device/Software: Definiens, Median, Intio, GE, Siemens, Mevis, Claron Technologies, …
• Coordinating Programs:
  • RSNA QIBA (e.g., Dan Sullivan, Binsheng Zhao)
  • Under consideration: CTMM TraIT (Andre Dekker, Jeroen Belien)