
Yolanda Gil, PhD Information Sciences Institute and Department of Computer Science

Scientific Reproducibility through Semantic Workflows and Shared Provenance Representations. Yolanda Gil, PhD Information Sciences Institute and Department of Computer Science University of Southern California gil@isi.edu http://www.isi.edu/~gil.


Presentation Transcript


  1. Scientific Reproducibility through Semantic Workflows and Shared Provenance Representations Yolanda Gil, PhD Information Sciences Institute and Department of Computer Science University of Southern California gil@isi.edu http://www.isi.edu/~gil

  2. NSF Workshop on Challenges of Scientific Workflows [Gil et al IEEE Computer 2007] • Despite investments in CyberInfrastructure as an enabler of a significant paradigm change in science: • Reproducibility, key to the scientific method, is threatened • Exponential growth in compute, sensors, data storage, and network, BUT the growth of science is not similarly exponential • What is missing: • Perceived importance of capturing and sharing process in accelerating the pace of scientific advances • Process (method/protocol) is increasingly complex and highly distributed • Workflows are emerging as a paradigm for process-model driven science that captures the analysis itself • Workflows need to be first-class citizens in science CyberInfrastructure • Enable reproducibility • Accelerate scientific progress by automating processes • Interdisciplinary and intradisciplinary research challenges • Report available at http://www.isi.edu/nsf-workflows06

  3. Benefits of Workflow Systems [Taylor et al 07] • Managing execution • Remote job submission • Dependencies among steps • Failure recovery • Managing distributed computation • Move data when needed • Managing large data sets • Efficiency, reliability • Security and access control • Access to shared resources • Provenance recording • Low-cost high-fidelity reproducibility
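As a rough illustration of "dependencies among steps" and "failure recovery" from the slide above, here is a minimal Python sketch. It is not how Pegasus or DAGMan work internally; the function name `run_workflow`, the step names, and the retry policy are all assumptions made for this example.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_workflow(steps, deps, max_retries=2):
    """Run steps in dependency order, retrying each failed step.

    steps: dict mapping step name -> zero-argument callable (hypothetical jobs)
    deps:  dict mapping step name -> set of step names it depends on
    Returns a log of (step, attempt, status) tuples, a crude execution record.
    """
    # Make sure every step appears in the graph, even with no dependencies.
    graph = {name: deps.get(name, set()) for name in steps}
    order = list(TopologicalSorter(graph).static_order())
    log = []
    for name in order:
        for attempt in range(1, max_retries + 2):
            try:
                steps[name]()
                log.append((name, attempt, "ok"))
                break
            except Exception as exc:  # a failed attempt is logged, then retried
                log.append((name, attempt, f"failed: {exc}"))
        else:
            # No attempt succeeded: stop before running dependent steps.
            raise RuntimeError(f"step {name} failed after {max_retries + 1} attempts")
    return log
```

The returned log doubles as a very small provenance trace: it records which steps ran, in what order, and how many attempts each needed.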

  4. Capabilities Available Today: Wings/Pegasus Workflows for Seismic Hazard Analysis [Gil et al 07] (see also [Maechlin et al 05] [Deelman et al 06]) • Input data: a site and an earthquake forecast model • Thousands of possible fault ruptures and rupture variations, each a file, unevenly distributed • ~110,000 rupture variations to be simulated for that site • High-level template combines 11 application codes • 8048 application nodes in the workflow instance generated by Wings • Provenance records kept for 100,000 workflow data products • Generated more than 2M triples of metadata • 24,135 nodes in the executable workflow generated by Pegasus, including: • data stage-in jobs, data stage-out jobs, data registration jobs • Executed on the USC HPCC cluster (1820 nodes w/ dual processors, but only < 144 available) • Including MPI jobs, each running on hundreds of processors for 25-33 hours • Runtime was 1.9 CPU years

  5. The Wings/Pegasus Workflow System [Gil et al 07; Deelman et al 03; Deelman et al 05; Kim et al 08; Gil et al forthcoming] WINGS: Semantic workflow environment (wings.isi.edu) • Knowledge-based reasoning on workflows and data (W3C’s OWL) • Semantic workflow catalogs • Automation and assistance • Execution-independent workflows Pegasus: Automated workflow refinement and execution (pegasus.isi.edu) • Optimize for performance, cost, reliability • Assign execution resources • Manage execution through DAGMan • Daily operational use in many domains Grid services (condor.uwisc.edu, www.globus.org) • Secure and controlled sharing of distributed services, computing, data • Scalable service-oriented architecture • Commercial quality, open source

  6. Semantic Workflows in WINGS [Gil et al IEEE IS 2010; Gil et al JETAI 2010; Gil et al eScience 2009; Kim et al JCCPE 2008; Gil et al 2007] • Semantic workflows: • More than a dataflow graph • Workflow variables: each constituent (node, link, component, dataset) has a corresponding variable • Semantic constraints on workflow variables, both within and across variables • Semantic descriptions of collections of data and components are concisely represented • Example constraints: (TestData dcdom:isDiscrete false) (TrainingData dcdom:isDiscrete false) • Example rule: [modelerInput_not_equal_to_classifierInput: (:modelerInput wflow:hasDataBinding ?ds1) (:classifierInput wflow:hasDataBinding ?ds2) equal(?ds1, ?ds2) (?t rdf:type wflow:WorkflowTemplate) -> (?t wflow:isInvalid "true"^^xsd:boolean)]
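The rule on the slide above marks a workflow template as invalid when the modeler input and the classifier input are bound to the same dataset. The following Python sketch mirrors only that logic in plain code; it does not use the Wings rule engine, and the function name `validate_template` and the dictionary layout are assumptions for illustration.

```python
def validate_template(bindings):
    """Check a workflow's data bindings against one semantic constraint.

    bindings: dict mapping a workflow variable name to a dataset identifier
              (a hypothetical stand-in for wflow:hasDataBinding triples).
    Returns the names of violated constraints; an empty list means valid.
    """
    violations = []
    ds1 = bindings.get("modelerInput")
    ds2 = bindings.get("classifierInput")
    # Training and test data must differ, as in the rule
    # modelerInput_not_equal_to_classifierInput.
    if ds1 is not None and ds1 == ds2:
        violations.append("modelerInput_not_equal_to_classifierInput")
    return violations


# Usage: the first binding reuses one dataset for both inputs.
print(validate_template({"modelerInput": "ds-A", "classifierInput": "ds-A"}))
print(validate_template({"modelerInput": "ds-A", "classifierInput": "ds-B"}))
```

Checking the constraint before execution is the point: an invalid combination is rejected when the workflow is set up, not discovered after the analysis has run.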

  7. Workflow Portal for Genetic Studies of Mental Disorders (with E. Deelman and C. Mason) • Existing repository of genotypic and phenotypic information • Goal: develop workflows useful for data in the repository

  8. Designing a Workflow Collection for Population Genomics • Designed workflows for common analysis types • Association tests • CNV detection • Variant discovery • Family-based association analysis (TDT) • Developed workflow components by encapsulating widely used, heterogeneous open software • Plink (Purcell, Harvard) • R (Chambers et al) • PennCNV (Penn) -- Hidden Markov Models • Gnosis (State, Yale) -- sliding windows • Allegro (Decode, Iceland) -- Multiterminal Binary Decision Diagrams • Structure (Pritchard, Chicago) -- structured association • FastLink (Schaffer, NCBI) • Burrows-Wheeler Aligner (BWA) (Li & Durbin) • SAMTools

  9. Wings Workflows for Genetic Studies of Mental Disorders [Gil et al, forthcoming] Transmission Disequilibrium Test (TDT) Association Tests CNV Detection Variant Discovery from Resequencing

  10. Major Features • Workflow system manages set up and execution • Wings – set up • Pegasus - execution • Initial collection of workflows captures common genomic analyses • Users can upload their own datasets • Including collections of datasets • User data is secure • Not accessible by others

  11. Wings Replication of Crohn’s Disease Association Study from [Duerr et al, Science 06]

  12. Wings Replication of Early-Onset Parkinson’s Disease Study from [Bayrakli et al, Human Mutation 07]

  13. Observations about Reproducibility with Workflows [Gil et al, forthcoming] • Effort involved in reproducing results is minor • 30 seconds to set up a workflow • A catalog of carefully crafted workflows of select state-of-the-art methods will cover a wide range of genomic analyses • Our workflows were independently developed and used “as is” • Semantic representations abstract the analysis method from the software that implements it • Our workflows used different analytic tools than the original studies • Many implementations of the same algorithm, some proprietary • Semantic constraints can be added to workflows to avoid analysis errors • E.g., in the association analysis workflow, added a constraint to remove duplicate individuals up front to avoid problems downstream
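The last point above, removing duplicate individuals before an association analysis, can be sketched as a simple pre-processing step. This is an illustrative Python sketch only, not the constraint as implemented in Wings; the function name and the `individual_id` field are hypothetical.

```python
def drop_duplicate_individuals(records, key="individual_id"):
    """Keep the first record for each individual, preserving input order.

    records: list of dicts, one per sample (hypothetical layout)
    key:     field that identifies an individual (assumed name)
    """
    seen = set()
    unique = []
    for rec in records:
        if rec[key] in seen:
            continue  # later copies of the same individual are dropped
        seen.add(rec[key])
        unique.append(rec)
    return unique


samples = [
    {"individual_id": "P1", "status": "case"},
    {"individual_id": "P2", "status": "control"},
    {"individual_id": "P1", "status": "case"},  # duplicate entry
]
print(len(drop_duplicate_individuals(samples)))
```

Running this as the first workflow step means every downstream component can assume one record per individual.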

  14. Benefits of Semantic Workflows [Gil JSP-09] Execution management: • Automation of workflow execution • Managing distributed computation • Managing large data sets • Security and access control • Provenance recording • Low-cost high-fidelity reproducibility Semantics and reasoning: • User assistance to correctly explore analysis “design space” • Validation of analyses • Automated generation of metadata • Workflow retrieval and discovery • “Conceptual” reproducibility

  15. W3C Provenance Group (Y. Gil, chair): Goals • Provide state-of-the-art understanding and develop a roadmap for development and possible standardization • Articulate requirements for accessing and reasoning about provenance information • Develop use cases • Identify issues in provenance that are of direct concern to the Semantic Web • Articulate relationships with other aspects of Web architecture • Report on state-of-the-art work on provenance • Report on a roadmap for provenance in the Semantic Web • Identify starting points for provenance representations • Identify elements of a provenance architecture that would benefit from standardization

  16. W3C Provenance Group: Products of the Group to Date • Group formed in September 2009, open to new members • All information is public: http://www.w3.org/2005/Incubator/prov/wiki/ • Developed a set of key dimensions for provenance (11/09) • Grouped into three major categories: content, management, use • Developed use cases for provenance (12/09) • More than 30 use cases, including ~10 in science, but others are relevant • Developed requirements for provenance from use cases (1/10) • User requirements: what is the purpose of the provenance information • Technical requirements: derived from the user requirements • Report on “Requirements for Provenance on the Web” • Currently developing state-of-the-art report (expected 6/10) • Started to develop recommendations (expected 9/10) • Mappings across provenance vocabularies (e.g., DC, OPM, SWAN,…)
