
Metadata Meets Semantic Workflows

Metadata Meets Semantic Workflows. Yolanda Gil, PhD Information Sciences Institute and Department of Computer Science University of Southern California http://www.isi.edu/~gil With Ewa Deelman, Jihie Kim, Varun Ratanakar, Christian Fritz, Paul Groth, Gonzalo Florez,



  1. Metadata Meets Semantic Workflows Yolanda Gil, PhD Information Sciences Institute and Department of Computer Science University of Southern California http://www.isi.edu/~gil With Ewa Deelman, Jihie Kim, Varun Ratanakar, Christian Fritz, Paul Groth, Gonzalo Florez, Pedro Gonzalez, Joshua Moody

  2. Outline • Brief introduction to computational workflows • Brief overview of semantic workflows • The Wings/Pegasus workflow system • Five benefits of semantic workflows • Reproducibility • Validation • Metadata generation • Data discovery • Workflow discovery

  3. Scientific Data Analysis Complex processes involving a variety of algorithms/software

  4. NSF Workshop on Challenges of Scientific Workflows [Gil et al, IEEE Computer 2007] • Despite investments in CyberInfrastructure as an enabler of a significant paradigm change in science: • Reproducibility, key to the scientific method, is threatened • Compute, sensors, data storage, and networks are growing exponentially, BUT science is not growing at the same exponential rate • What is missing: • Perceived importance of capturing and sharing process in accelerating the pace of scientific advances • Process (method/protocol) is increasingly complex and highly distributed • Workflows are emerging as a paradigm for process-model driven science that captures the analysis itself • Workflows need to be first class citizens in science CyberInfrastructure • Enable reproducibility • Accelerate scientific progress by automating processes • Interdisciplinary and intradisciplinary research challenges • Report available at http://www.isi.edu/nsf-workflows06

  5. Benefits of Workflow Systems [Taylor et al 07] • Managing execution • Dependencies among steps • Failure recovery • Managing distributed computation • Move data when needed • Managing large data sets • Efficiency, reliability • Security and access control • Remote job submission • Provenance recording • Low-cost high-fidelity reproducibility

  6. Wings/Pegasus Workflows for Seismic Hazard Analysis [Gil et al 07] (see also [Maechlin et al 05] [Deelman et al 06]) • Input data: a site and an earthquake forecast model • Thousands of possible fault ruptures and rupture variations, each a file, unevenly distributed • ~110,000 rupture variations to be simulated for that site • High-level template combines 11 application codes • 8048 application nodes in the workflow instance generated by Wings • Provenance records kept for 100,000 workflow data products • Generated more than 2M triples of metadata • 24,135 nodes in the executable workflow generated by Pegasus, including: • data stage-in jobs, data stage-out jobs, data registration jobs • Executed on the USC HPCC cluster (1820 nodes w/ dual processors), but only < 144 available • Including MPI jobs, each running on hundreds of processors for 25-33 hours • Runtime was 1.9 CPU years

  7. Semantic Workflows in WINGS • Workflow templates • Dataflow diagram • Each constituent (node, link, component, dataset) has a corresponding variable • Semantic properties • Constraints on workflow variables • (TestData dcdom:isDiscrete false) • (TrainingData dcdom:isDiscrete false)

  8. Semantic Constraints as Metadata Properties • Constraints on reusable template (shown below) • Constraints on current user request (shown above) • [modelerInput_not_equal_to_classifierInput: (:modelerInput wflow:hasDataBinding ?ds1) (:classifierInput wflow:hasDataBinding ?ds2) equal(?ds1, ?ds2) (?t rdf:type wflow:WorkflowTemplate) -> (?t wflow:isInvalid "true"^^xsd:boolean)]
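The template rule on this slide reads: if the modeler input and the classifier input are bound to the same dataset, mark the template invalid. A minimal Python sketch of that check follows. The function name and the dictionary layout for bindings are illustrative assumptions, not part of Wings, which evaluates such rules with a semantic reasoner.

```python
def template_is_invalid(bindings):
    """Return True when modelerInput and classifierInput share a dataset.

    `bindings` maps workflow variable names to dataset identifiers. This
    layout is a hypothetical stand-in for wflow:hasDataBinding triples.
    """
    ds1 = bindings.get("modelerInput")
    ds2 = bindings.get("classifierInput")
    # Mirrors equal(?ds1, ?ds2): both variables must be bound and identical.
    return ds1 is not None and ds1 == ds2


# Hypothetical dataset names, used only to exercise the check.
same = {"modelerInput": "weather.arff", "classifierInput": "weather.arff"}
different = {"modelerInput": "weather-train.arff", "classifierInput": "weather-test.arff"}
print(template_is_invalid(same))
print(template_is_invalid(different))
```

Training and testing on the same dataset is the analysis error this constraint guards against, so the check runs before the workflow is executed.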

  9. Why Semantic Workflows: 1) Easily Replicate Previously Published Results • A catalog of carefully crafted workflows of select state-of-the-art methods to cover a wide range of common analyses • Many implementations of the same algorithm, some proprietary • Same implementation but new versions and bug fixes • Semantic workflows abstract from software implementation • Representing abstract classes of software components • Instances are the implemented codes • Workflow steps refer to component classes • Representing abstract kinds of data (e.g., exclude format) • Semantic reasoning needed to specialize workflow • To map the abstract workflow into an execution-ready workflow • To insert lower level steps (e.g., data transformations)

  10. The Importance of Reproducibility

  11. Difficulties in Replication • Some software is proprietary • Effort must be invested in data conversions • Software installation • Managing new versions

  12. Wings Workflows for Genetic Studies of Mental Disorders [Gil et al, forthcoming] • Work with Christopher Mason from Cornell University • Transmission Disequilibrium Test (TDT) • Association Tests • CNV Detection • Variant Discovery from Resequencing

  13. Wings Replication of Crohn’s Disease Association Study from [Duerr et al, Science 06]

  14. Wings Replication of Early-Onset Parkinson’s Disease Study from [Bayrakli et al, Human Mutation 07]

  15. Observations [Gil et al, forthcoming] • Effort involved in reproducing results is minor • 30 seconds to set up a workflow • A catalog of carefully crafted workflows of select state-of-the-art methods will cover a wide range of genomic analyses • Our workflows were independently developed and used “as is” • Semantic representations abstract the analysis method from the software that implements it • Our workflows used different analytic tools than the original studies • Semantic constraints can be added to workflows to avoid analysis errors • Our workflow removes duplicate individuals that would cause problems in the association analysis

  16. Why Semantic Workflows: 2) Ensure Correct Use of State-of-the-Art Methods • Analytic software and methods are well documented, but all of it is text (papers, manuals, etc.) • Time consuming, hard to spot interdependencies, no validation • Semantic workflows can check constraints and guide users • Representing requirements of software components • Constraints on input data • Constraints on parameter settings given properties of input data • Representing metadata properties of datasets • Semantic reasoning needed: • To check constraints of each workflow step • To propagate constraints across the workflow

  17. User’s Difficulties: Choosing Parameters • How do I set up the workflow parameters? • Association Test • Max Population • Max individuals per cluster (“mc”) and merge distance p-value constraint (“ppc”) • If Affymetrix data, set cutoff (“miss”) to 94%; if Illumina, 98%

  18. Wings Workflow System Assists Users to Set Up Parameters Based on Characteristics of Datasets Component Catalog [MissingnessPerIndividual1: (?c rdf:type pcdom:Create_Binary_PEDFile_Class) (?c pc:hasInput ?idv1) (?idv1 pc:hasArgumentID "PEDFile") (?c pc:hasInput ?idv2) (?idv2 pc:hasArgumentID "MissingnessPerIndividual") (?idv1 dcdom:hasGenotypingRate ?v1) equal(?v1, "0.95"^^xsd:float) -> (?idv2 pc:hasValue "0.06"^^xsd:float)]
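The component-catalog rule on this slide sets a parameter from a metadata property of the input: when the PED file has a genotyping rate of 0.95, MissingnessPerIndividual is set to 0.06. A minimal Python sketch of the same lookup is shown below. The table and function names are hypothetical, and only the 0.95 to 0.06 pair is taken from the slide.

```python
# Hypothetical rule table keyed by the genotyping rate recorded in the
# PED file's metadata. Only the 0.95 -> 0.06 pair appears on the slide.
MISSINGNESS_BY_GENOTYPING_RATE = {0.95: 0.06}


def suggest_missingness(ped_metadata):
    """Return a MissingnessPerIndividual value, or None if no rule fires."""
    rate = ped_metadata.get("hasGenotypingRate")
    return MISSINGNESS_BY_GENOTYPING_RATE.get(rate)


print(suggest_missingness({"hasGenotypingRate": 0.95}))
print(suggest_missingness({"hasGenotypingRate": 0.99}))
```

Returning None when no rule matches leaves the parameter for the user to set, rather than guessing a default that the catalog does not justify.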

  19. Why Semantic Workflows: 3) Automatic Generation of Metadata • Metadata annotations are tedious and involved • Often not done, an obstacle to sharing and to reuse • Semantic workflows can automate the generation of metadata for analysis data products • Representing expected characteristics of output dataset for each software component given the input metadata • Representing metadata properties of input datasets • Semantic reasoning needed: • To propagate metadata for each workflow step • To propagate metadata across the workflow

  20. Wings Metadata Generation: An Example in a Seismic Hazard Workflow [Kim et al 06; Gil et al 07] • [Figure: workflow fragment with nodes RVM, Rupture_variation, SGT, SeismogramGeneration, and Seismogram, annotated with the metadata of each dataset] • 127_6.rvm (RVM) - source_id: 127 - rupture_id: 6 • 127_6.txt.variation-s0000-h0000 and 127_6.txt.variation-s0000-h0001 (Rupture_variation), each shown with - source_id: 127 - rupture_id: 6 - slip_realization_#: 0 - hypo_center_#: 1 • FD_SGT/PAS_1/A/SGT161 (SGT) - site_name: PAS - tensor_direction: 1 - time_period: A - xyz_volumn_id: 161 • Seismogram_PAS_127_6.grm (Seismogram) - site_name: PAS - source_id: 127 - rupture_id: 6
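In the example on this slide, the seismogram's metadata (site_name, source_id, rupture_id) matches values carried by its inputs: the site comes from the SGT dataset and the source and rupture identifiers from the rupture variation. A minimal Python sketch of that propagation follows. The function, the dictionary layout, and the assumption that exactly these three properties are copied are inferred from the slide's example values, not taken from the Wings implementation.

```python
def seismogram_metadata(rupture_variation, sgt):
    """Derive output metadata and a file name from input metadata.

    Which properties propagate is an assumption read off the slide example.
    """
    meta = {
        "site_name": sgt["site_name"],
        "source_id": rupture_variation["source_id"],
        "rupture_id": rupture_variation["rupture_id"],
    }
    name = "Seismogram_{site_name}_{source_id}_{rupture_id}.grm".format(**meta)
    return name, meta


# Input metadata as listed on the slide.
variation = {"source_id": 127, "rupture_id": 6,
             "slip_realization": 0, "hypo_center": 1}
sgt = {"site_name": "PAS", "tensor_direction": 1,
       "time_period": "A", "xyz_volumn_id": 161}

name, meta = seismogram_metadata(variation, sgt)
print(name)
print(meta)
```

Applied to every generated data product, this kind of per-step rule is what lets a workflow system describe its outputs without manual annotation.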

  21. Wings Workflows for Accuracy/Quality Tradeoffs in Biomedical Image Analysis [Kumar et al 09] • PIQ: Pixel Intensity Quantification (from National Center for Microscopy and Imaging Research [Chow et al 06]) • Terabyte-sized out-of-core image data • Need to minimize execution time while preserving highest output quality • Some operations are parallelizable, others must operate on entire images • For efficiency, image is decomposed (layers, tiles, and chunks), but quality is affected • From a workflow template, Wings can automatically generate descriptions of each individual piece of the image to manage the computations over each one

  22. Why Semantic Workflows: 4) Discovery of Relevant Data • Need a dataset of updated common (known) loci to annotate findings; where can I find one?

  23. Why Semantic Workflows: 5) Retrieval of Workflows • Hard to find workflows for the type of analysis a user wants • Semantic information is not provided when creating the workflow • e.g., when a user adds a NaiveBayesModeler, they wouldn’t be expected to state that its output would be a NaiveBayesModel, or a Bayes Model (superclass), or not human readable • However, retrieval queries are often based on metadata properties of data • e.g., “Find workflows that can normalize data which is continuous and has missing values [<- constraints on inputs] to create a decision tree model [<- constraint on intermediate data products]” • Semantic representations are needed • For workflow constituents • Metadata properties of input, intermediate and final data products • Metadata properties of workflow and component function • For user queries • Express workflow sketches containing partial data descriptions (constraints) • Reasoning capabilities • Automatic creation of metadata for expected workflow data products • Workflow matching to queries (exact and partial)
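The matching of workflows to queries (exact and partial) can be illustrated with a small Python sketch. It assumes each workflow is described by a set of (data product role, property, value) triples and scores a workflow by the fraction of query constraints it satisfies. The workflow names, property names, and scoring scheme are hypothetical and stand in for the semantic reasoning the slide calls for.

```python
def match_score(workflow_props, query):
    """Fraction of query constraints found in the workflow's metadata."""
    if not query:
        return 1.0
    return sum(1 for c in query if c in workflow_props) / len(query)


def retrieve(catalog, query):
    """Return (score, name) pairs for matching workflows, best first."""
    scored = [(match_score(props, query), name) for name, props in catalog.items()]
    return sorted(((s, n) for s, n in scored if s > 0), reverse=True)


# Hypothetical catalog entries; constraints are (role, property, value).
catalog = {
    "NormalizeThenDecisionTree": {
        ("input", "isContinuous", True),
        ("input", "hasMissingValues", True),
        ("intermediate", "type", "DecisionTreeModel"),
    },
    "DiscretizeThenNaiveBayes": {
        ("input", "isContinuous", True),
        ("intermediate", "type", "NaiveBayesModel"),
    },
}

# The slide's example query, expressed as constraints.
query = [
    ("input", "isContinuous", True),
    ("input", "hasMissingValues", True),
    ("intermediate", "type", "DecisionTreeModel"),
]

for score, name in retrieve(catalog, query):
    print(name, round(score, 2))
```

A score of 1.0 corresponds to an exact match and anything between 0 and 1 to a partial match. Plain set membership ignores class hierarchies, so a query for a Bayes model would not find a NaiveBayesModel here, which is one reason the slide asks for semantic representations.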

  24. User’s Difficulties: Choosing an Analysis • What type of analysis is appropriate for my data? • Association Test • Association tests are best for large datasets that are not within a family • Variant Discovery from Resequencing • Variant discovery is used for genomic data from the same individual • CNV Detection • Transmission Disequilibrium Test (TDT) • TDT analysis requires no less than 100 families

  25. User’s Difficulties: Choosing a Workflow • What workflow is appropriate for my goals? • Association Test • Applies population stratification to remove outliers • Assumes outliers have been removed • Uses CMH association • Uses structured association • Transmission Disequilibrium Test (TDT) • Uses a standard test • Incorporates parental phenotype information

  26. An Algorithm for Semantic Enrichment of Workflow Templates [Gil et al K-CAP 09] • Problem addressed: Semantic information is not provided when creating the workflow, but retrieval queries use it • Key idea: Constraints can be available in a component catalog and propagated through the workflow • Phase 1: Goal Regression • Starting from final products, traverse workflow backwards • For each node, query component catalog for metadata constraints on inputs • Phase 2: Forward Projection • Starting from input datasets, traverse workflow forwards • For each node, query component catalog for metadata constraints on outputs • Example constraints on the workflow (nodes Model5, Model6, Model7): • ?TrainingData dcdom:isDiscrete true • ?Dataset3 dcdom:isDiscrete true • ?Dataset4 dcdom:isDiscrete true • ?Model5 dcdom:isDiscrete true • ?Model6 dcdom:isDiscrete true • ?Model7 dcdom:isDiscrete true • ?TestData dcdom:isDiscrete true
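The two phases can be sketched in Python as a backward pass followed by a forward pass over a topologically ordered list of steps. This is a simplified illustration, not the published algorithm. The component names, the catalog layout (one "backward" and one "forward" function per component), and the choice to apply the resulting constraints to every input or output of a step are all assumptions.

```python
from collections import defaultdict


def enrich(steps, catalog, seed):
    """Propagate metadata constraints backward, then forward.

    steps:   topologically ordered dicts with "component", "inputs", "outputs"
    catalog: component name -> {"backward": fn, "forward": fn}, where each
             fn maps a set of constraints to a set of constraints
    seed:    variable name -> initial set of (property, value) constraints
    """
    constraints = defaultdict(set)
    for var, props in seed.items():
        constraints[var] |= props

    # Phase 1: goal regression, from final products back toward inputs.
    for step in reversed(steps):
        on_outputs = set().union(*(constraints[v] for v in step["outputs"]))
        required = catalog[step["component"]]["backward"](on_outputs)
        for var in step["inputs"]:
            constraints[var] |= required

    # Phase 2: forward projection, from input datasets toward products.
    for step in steps:
        on_inputs = set().union(*(constraints[v] for v in step["inputs"]))
        produced = catalog[step["component"]]["forward"](on_inputs)
        for var in step["outputs"]:
            constraints[var] |= produced

    return dict(constraints)


# Hypothetical components that pass constraints through unchanged.
def passthrough(props):
    return set(props)


catalog = {
    "Sampler": {"backward": passthrough, "forward": passthrough},
    "Modeler": {"backward": passthrough, "forward": passthrough},
}
steps = [
    {"component": "Sampler", "inputs": ["TrainingData"], "outputs": ["Dataset3"]},
    {"component": "Modeler", "inputs": ["Dataset3"], "outputs": ["Model5"]},
]
discrete = ("dcdom:isDiscrete", "true")

result = enrich(steps, catalog, {"TrainingData": {discrete}})
for var in ("TrainingData", "Dataset3", "Model5"):
    print(var, sorted(result[var]))
```

Seeding the constraint on Model5 instead of TrainingData exercises the backward pass: the constraint reaches Dataset3 and TrainingData during goal regression.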

  27. Conclusions: Benefits of Semantic Workflows [Gil JSP-09] • Execution management: • Automation of workflow execution • Managing distributed computation • Managing large data sets • Security and access control • Provenance recording • Low-cost high-fidelity reproducibility • Semantics and reasoning: • “Conceptual” reproducibility • User assistance to explore analysis “design space” • Validation of analyses • Automated generation of metadata • Workflow retrieval and discovery
